[syslog-ng] syslog-ng on solaris locks up after a while
Igor Manassypov
imanassypov at rogers.com
Thu Nov 12 17:28:56 CET 2009
Hi Balazs,
Thanks for your prompt reply. Can you please direct me to the link where I can obtain the patch?
Thanks!
-igor
Igor M., M.Eng, P.Eng Network Architect
--- On Thu, 11/12/09, Balazs Scheidler <bazsi at balabit.hu> wrote:
From: Balazs Scheidler <bazsi at balabit.hu>
Subject: Re: [syslog-ng] syslog-ng on solaris locks up after a while
To: imanassypov at rogers.com, "Syslog-ng users' and developers' mailing list" <syslog-ng at lists.balabit.hu>
Cc: "Pallagi Zoltán" <pzolee at balabit.hu>
Date: Thursday, November 12, 2009, 11:11 AM
Hi,
This seems to be the same issue as the one fixed by this patch:
Author: Balazs Scheidler <bazsi at balabit.hu> 2009-08-30 11:41:24
Committer: Balazs Scheidler <bazsi at balabit.hu> 2009-08-30 11:41:24
Parent: 1ad4da07d5305ba0140ac385d661ab6de25fc5f3 ([patterndb] estring parser length calculation must include ending quote)
Child: c2e8aa58763a89cab58d05fb7a2b2a18021413b4 ([logmsg] added support for ASA timestamps)
Branches: master, remotes/balabit/master, remotes/origin/master
Follows: v3.0.4
Precedes:
[afinter] don't block on the internal_msg_queue even in the threaded case (fixes: pub#48)
A hang was reported in bugzilla ticket #48 which seems to have
been caused by MARK messages interfering with local messages:
* if the MARK is due in the same poll iteration as a local message
* the MARK timeout is checked and the internal source is marked as having
  input available
* then the local message comes in pushing the mark timeout further ahead
  in time
* then the internal() dispatch callback checks the mark timeout again,
  but at this time it is already in the future ->
* the dispatch callback falls back to fetching the internal message from
  internal_msg_queue, assuming it was that which caused the dispatch
  callback to be scheduled
* this blocks indefinitely.
The solution is very simple: use g_async_queue_try_pop() instead of
g_async_queue_pop(); the dispatch code already handles a
NULL message value.
On Tue, 2009-11-10 at 05:09 -0800, Igor Manassypov wrote:
> (gdb) bt full
> #0 0xfed46df0 in __lwp_park () from /lib/libc.so.1
> No symbol table info available.
> #1 0xfed40c44 in cond_sleep_queue () from /lib/libc.so.1
> No symbol table info available.
> #2 0xfed40e08 in cond_wait_queue () from /lib/libc.so.1
> No symbol table info available.
> #3 0xfed41350 in cond_wait () from /lib/libc.so.1
> No symbol table info available.
> #4 0xfed4138c in pthread_cond_wait () from /lib/libc.so.1
> No symbol table info available.
> #5 0xff119d80 in g_async_queue_pop_intern_unlocked (queue=0x757e0, try=0, end_time=0x75618) at gasyncqueue.c:359
> retval = (gpointer) 0xa15b8
> __PRETTY_FUNCTION__ = "g_async_queue_pop_intern_unlocked"
> #6 0xff119e80 in g_async_queue_pop (queue=0x757e0) at gasyncqueue.c:398
> retval = (gpointer) 0x757e0
> __PRETTY_FUNCTION__ = "g_async_queue_pop"
> #7 0x0003e984 in afinter_source_dispatch (source=0x8d260, callback=0x3e9dc <afinter_source_dispatch_msg>, user_data=0x8d1e0) at afinter.c:112
> msg = (LogMessage *) 0xa0dc0
> path_options = {flow_control = -1, matched = 0x0}
> tv = {tv_sec = 1257363112, tv_usec = 441817}
> #8 0xff143564 in g_main_context_dispatch (context=0x8d158) at gmain.c:2144
> No locals.
> #9 0xff1459a4 in g_main_context_iterate (context=0x8d158, block=1, dispatch=1, self=0x76030) at gmain.c:2778
> max_priority = 2147483647
> timeout = 4000
> some_ready = 1
> nfds = 4
> allocated_nfds = 1
> fds = (GPollFD *) 0x788c8
> __PRETTY_FUNCTION__ = "g_main_context_iterate"
> #10 0xff146050 in g_main_context_iteration (context=0x8d158, may_block=1) at gmain.c:2841
> retval = 1
> #11 0x0001bc20 in main_loop_run (cfg=0xffbffbc8) at main.c:149
> iters = 0
> stats_timer_id = 0
> #12 0x0001c260 in main (argc=1, argv=0xffbffd44) at main.c:394
> cfg = (GlobalConfig *) 0x794d0
> rc = 0
> ctx = (GOptionContext *) 0x76030
> error = (GError *) 0x0
>
>
>
> Igor M., M.Eng, P.Eng Network Architect
>
> --- On Mon, 11/9/09, Pallagi Zoltán <pzolee at balabit.hu> wrote:
>
> From: Pallagi Zoltán <pzolee at balabit.hu>
> Subject: Re: [syslog-ng] syslog-ng on solaris locks up after a while
> To: imanassypov at rogers.com, "Syslog-ng users' and developers' mailing list" <syslog-ng at lists.balabit.hu>
> Date: Monday, November 9, 2009, 11:35 AM
>
> Igor Manassypov wrote:
> > Would this one make more sense?
> >
> >
> >
> > bash-3.00# ps -eaf | grep syslog
> > root 22562 22561 0 Nov 04 ? 0:30 /usr/local/sbin/syslog-ng
> > root 22561 1 0 Nov 04 ? 0:00 /usr/local/sbin/syslog-ng
> >
> > bash-3.00# truss -f -p 22562
> > 22562/2: door_return(0x00000000, 0, 0x00000000, 0) (sleeping...)
> > 22562/1: lwp_park(0x00000000, 0) (sleeping...)
> > 22562/1: Received signal #11, SIGSEGV, in lwp_park() [default]
> > 22562/1: siginfo: SIGSEGV pid=12717 uid=0
> > 22562/1: lwp_park(0x00000000, 0) Err#4 EINTR
> >
> > Core was generated by `/usr/local/sbin/syslog-ng'.
> > Program terminated with signal 11, Segmentation fault.
> > [New process 88098 ]
> > [New process 153634 ]
> > #0 0xfed46df0 in __lwp_park () from /lib/libc.so.1
> > #0 0xfed46df0 in __lwp_park () from /lib/libc.so.1
> >
> > bash-3.00# gdb syslog-ng core
> >
> > Core was generated by `/usr/local/sbin/syslog-ng'.
> > Program terminated with signal 11, Segmentation fault.
> > [New process 88098 ]
> > [New process 153634 ]
> > #0 0xfed46df0 in __lwp_park () from /lib/libc.so.1
> > (gdb)
> Please show us the output of "bt full" too.
> >
> >
> > --- On Tue, 11/3/09, Balazs Scheidler <bazsi at balabit.hu> wrote:
> >
> > From: Balazs Scheidler <bazsi at balabit.hu>
> > Subject: Re: [syslog-ng] syslog-ng on solaris locks up after a while
> > To: imanassypov at rogers.com, "Syslog-ng users' and developers' mailing list" <syslog-ng at lists.balabit.hu>
> > Cc: "Pallagi Zoltán" <pzolee at balabit.hu>, network at ci.com
> > Date: Tuesday, November 3, 2009, 2:11 PM
> >
> > Hi,
> >
> > The problem is that you killed the supervisor process, which restarts
> > syslog-ng in case it crashes. However, the hang is not in this part, but
> > in its child.
> >
> > So by looking at the ps output, I'd say that in this situation you
> > should have trussed 13621 and not its parent.
> >
> > On Tue, 2009-11-03 at 08:54 -0800, Igor Manassypov wrote:
> > > Hi Zoltan,
> > >
> > >
> > > Here are the traces:
> > >
> > > bash-3.00# ps -eaf | grep syslog
> > > root 12694 12616 0 11:37:07 pts/1 0:00 grep syslog
> > > root 13012 1 0 Oct 21 ? 0:00 syslog-ng -v
> > > root 13013 13012 0 Oct 21 ? 0:41 syslog-ng -v
> > > root 13620 1 0 Oct 08 ? 0:00 /usr/local/sbin/syslog-ng
> > > root 13621 13620 0 Oct 08 ? 6:16 /usr/local/sbin/syslog-ng
> > > bash-3.00# truss -f -p "13620"
> > > 13620: waitid(P_PID, 13621, 0xFFBFF468, WEXITED|WTRAPPED) (sleeping...)
> > >
> > > 13620: Received signal #11, SIGSEGV, in waitid() [default]
> > > 13620: siginfo: SIGSEGV pid=12717 uid=0
> > > 13620: waitid(P_PID, 13621, 0xFFBFF468, WEXITED|WTRAPPED) Err#4 EINTR
> > >
> > > Core was generated by `/usr/local/sbin/syslog-ng'.
> > > Program terminated with signal 11, Segmentation fault.
> > > [New process 79156 ]
> > > #0 0xfed4ad80 in _waitid () from /lib/libc.so.1
> > > (gdb) bt full
> > > #0 0xfed4ad80 in _waitid () from /lib/libc.so.1
> > > No symbol table info available.
> > > #1 0xfecee038 in _waitpid () from /lib/libc.so.1
> > > No symbol table info available.
> > > #2 0xfed3a70c in waitpid () from /lib/libc.so.1
> > > No symbol table info available.
> > > #3 0x0003017c in g_process_start () at gprocess.c:1042
> > > rc = 0
> > > deadlock = 0
> > > pid = 13621
> > > __PRETTY_FUNCTION__ = "g_process_start"
> > > #4 0x0001c214 in main (argc=1, argv=0xffbffd14) at main.c:371
> > > cfg = (GlobalConfig *) 0x10034
> > > rc = 310272
> > > ctx = (GOptionContext *) 0x76030
> > > error = (GError *) 0x0
> > >
> > > Please let me know if I can provide you with more information,
> > >
> > > Thanks!
> > >
> > > --- On Tue, 11/3/09, Pallagi Zoltán <pzolee at balabit.hu> wrote:
> > >
> > > From: Pallagi Zoltán <pzolee at balabit.hu>
> > > Subject: Re: [syslog-ng] syslog-ng on solaris locks up after a while
> > > To: imanassypov at rogers.com, "Syslog-ng users' and developers' mailing list" <syslog-ng at lists.balabit.hu>
> > > Received: Tuesday, November 3, 2009, 11:10 AM
> > >
> > > Hi Igor,
> > >
> > > Can you show me the truss output or a backtrace of the stuck
> > > syslog-ng?
> > >
> > > truss:
> > >
> > > truss -f -p "syslog-ng pid"
> > >
> > > backtrace:
> > >
> > > kill -11 "syslog-ng pid" (syslog-ng will drop a core file)
> > > gdb syslog-ng core
> > > bt full
> > >
> > > Igor Manassypov wrote:
> > > > Hello,
> > > >
> > > >
> > > > I am having an issue with a Solaris installation of
> > > > syslog-ng. It is configured so that all the logs are
> > > > stored in different per-IP folders. This is my centralized
> > > > logging device, so it is fairly heavily loaded with
> > > > receiving logs from a few dozen hosts. The syslog-ng process
> > > > locks up every two to three weeks, with no messages being logged
> > > > to any of the files. The only way of getting it back is to kill
> > > > -9 the process and restart it.
> > > >
> > > > Is there any known issue of the same sort, and is there any
> > > > way around it other than recycling the daemon every
> > > > night?
> > > >
> > > >
> > > > here is the version info:
> > > >
> > > > bash-3.00# syslog-ng --version
> > > > syslog-ng 3.0.4
> > > > Revision: ssh+git://bazsi@git.balabit//var/scm/git/syslog-ng/syslog-ng-ose--mainline--3.0#master#1b5d618e301ad94aa20e692ffba16469dece8d10
> > > > Compile-Date: Aug 11 2009 10:44:17
> > > > Enable-Threads: on
> > > > Enable-Debug: off
> > > > Enable-GProf: off
> > > > Enable-Memtrace: off
> > > > Enable-Sun-STREAMS: on
> > > > Enable-Sun-Door: on
> > > > Enable-IPv6: off
> > > > Enable-Spoof-Source: on
> > > > Enable-TCP-Wrapper: off
> > > > Enable-SSL: on
> > > > Enable-SQL: on
> > > > Enable-Linux-Caps: off
> > > > Enable-Pcre: on
> > > >
> > > > bash-3.00# uname -a
> > > > SunOS prelude 5.10 Generic_137137-09 sun4v sparc SUNW,T5240
> > > > Thanks!
> > > >
> > > > -igor
> > > >
> > > > Igor Manassypov, M.Eng, P.Eng, CCIE 23032, CCVP Network
> > > > Architect
> > > >
> > > >
> > > > ______________________________________________________________________________
> > > > Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> > > > Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> > > > FAQ: http://www.campin.net/syslog-ng/faq.html
> > > >
> > > >
> > >
> > >
> > >
> > >
> > --
> > Bazsi
> >
> >
> >
> >
> >
> >
>
>
>
--
Bazsi