[syslog-ng] pipe causing lockup?

Mon Dec 7 13:56:09 CET 2009

On Sun, 2009-12-06 at 21:17 -0600, Tim Rupp wrote:
> Balazs Scheidler wrote:
> > On Sat, 2009-12-05 at 18:09 -0600, Tim Rupp wrote:
> >> Hi list,
> >>
> >> I'm hoping someone can explain this to me because it's taken several
> >> days of strace'ing and debugging for me to draw this conclusion.
> >>
> >> I have a source that right now is receiving a relay of syslog from
> >> another syslog-ng server, version 1.6.11. The new syslog-ng server is
> >> 3.0.4 using the RHEL 5 amd64 rpm.
> >>
> >> The new server has the following source to receive from the old server
> >>
> >> source s_net {
> >>         udp(
> >>                 ip(0.0.0.0) port(514)
> >>         );
> >> };
> >>
> >>
> >> I was seeing similar problems as described by Mike in this discussion
> >>
> >> https://lists.balabit.hu/pipermail/syslog-ng/2009-September/013371.html
> >>
> >>
> >> I strace'd the syslog-ng process, would wait for it to die, and then
> >> attempted to see if there were any patterns. In each strace syslog-ng
> >> appeared to hand right after processing a write to a fifo; the relevant
> >> strace output is below.
> >>
> >> 25461 write(340, "Dec  3 22:32:32 cmsstor52 sshd[28272]: Authorized to
> >> root, krb5 principal host/cmssrv28.fnal.gov at FNAL.GOV (krb5_kuserok)\n",
> >> 121) = 121
> >> 25461 poll([{fd=39, events=POLLOUT}, {fd=6, events=POLLIN}, {fd=3,
> >> events=POLLIN}, {fd=10, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
> >> events=POLLIN}, {fd=
> >> 9, events=POLLIN}], 7, 29155) = 1 ([{fd=9, revents=POLLIN}])
> >> 25461 recvfrom(9, "<14>Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID
> >> 702911 user.info] -- DHCP Watchdog reports:   All DHCP registration
> >> processes are runni
> >> ng --\n", 8192, 0, {sa_family=AF_INET, sin_port=htons(46911),
> >> sin_addr=inet_addr("131.225.110.121")}, [16]) = 144
> >> 25461 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
> >> 25461 poll([{fd=39, events=POLLOUT}], 1, 0) = 0 (Timeout)
> >> 25461 lseek(14, 0, SEEK_END)            = 1212423178
> >> 25461 write(14, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
> >> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
> >> are running --\
> >> n", 140) = 140
> >> 25461 lseek(38, 0, SEEK_END)            = 641655
> >> 25461 write(38, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
> >> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
> >> are running --\
> >> n", 140) = 140
> >> 25461 lseek(40, 0, SEEK_END)            = 641655
> >> 25461 write(40, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
> >> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
> >> are running --\
> >> n", 140) = 140
> >> 25461 lseek(41, 0, SEEK_END)            = 641655
> >> 25461 write(41, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
> >> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
> >> are running --\
> >> n", 140) = 140
> >>
> >>
> >>
> >> fd 9 is the UDP socket.
> >>
> >> Right after those messages, it would do a bunch of close() calls
> >>
> >> ...
> >> 25461 close(203)                        = 0
> >> 25461 close(150)                        = 0
> >> 25461 close(565)                        = 0
> >> 25461 close(637)                        = 0
> >> 25461 close(517)                        = 0
> >> ...
> >>
> >>
> >>
> >> followed by a poll that, this time, didn't include the UDP socket
> >>
> >>
> >> 25461 poll([{fd=39, events=POLLOUT}, {fd=6, events=POLLIN}, {fd=3,
> >> events=POLLIN}, {fd=10, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
> >> events=POLLIN}], 6,
> >> 0) = 0 (Timeout)
> >> 25461 poll([{fd=39, events=POLLOUT}], 1, 0) = 0 (Timeout)
> >> 25461 lseek(11, 0, SEEK_END)            = 107767627
> >> 25461 write(11, "Dec  3 22:33:01 clogged syslog-ng[25461]: Reaping
> >> unused destination files;
> >> template='/logging/syslog-ng/$HOST/messages'\n", 121) = 121
> >> 25461 lseek(12, 0, SEEK_END)            = 105497311
> >>
> >>
> >> I don't have flow-control enabled in my config.
> >>
> >>
> >>
> >> After several days of head scratching and reading, I narrowed it down to
> >> having something to do with my DHCP destinations/filters/logs
> >>
> >> I have a pipe() destination that points to a fifo
> >>
> >> 	destination d_cstdb    { pipe("/opt/dhcp_syslog_cst_db"); };
> >>
> >> a filter
> >>
> >> 	filter f_dhcp   { facility(local4,user) and host("fndhcp1*"); };
> >>
> >> and the corresponding log statement
> >>
> >> 	log { source(s_net); filter(f_dhcp); destination(d_cstdb); };
> >>
> >>
> >>
> >> I've concluded that the problem is caused by the pipe/fifo.
> >>
> >> On the old server I have something reading from the fifo. On the new
> >> server I hadn't stood that process up yet.
> >>
> >> syslog-ng would start just fine, would even write several entries to
> >> that fifo, and then would reach a point where things would hang. It
> >> would stop reading from the UDP socket, and just log things like
> >> destinations expiring.
> >>
> >> I looked online at the documentation for FIFOs and linux and noted that
> >> 3.0.4 source appears to specify that fifo's be opened non-blocking.
> >>
> >> So forgive the silly question, but if the fifo is opened non-blocking,
> >> why is syslog-ng appearing to block when writing to a fifo that isn't
> >> being read from?
> >>
> >> Also, why would the UDP socket file descriptor be removed from the poll
> >> list? An lsof of the syslog-ng process clearly showed that it was still
> >> bound to *:syslog
> >>
> >> Is this is bug or normal operation? Maybe I don't understand linux's
> >> behavior in regards to fifos.
> > 
> > On the top of my head, do you have any other flags on any of the
> > affected log statements? We've found a bug which triggers enabled
> > flow-control if only the "final" flag is enabled on one of the affected
> > log statements.
> > 
> > This is the patch which fixes this problem:
> > 
> > diff --git a/src/logmpx.c b/src/logmpx.c
> > index e7d5bbb..3d34f5f 100644
> > --- a/src/logmpx.c
> > +++ b/src/logmpx.c
> > @@ -29,7 +29,7 @@ log_multiplexer_add_next_hop(LogMultiplexer *self, LogPipe *next_hop)
> >  {
> >    /* NOTE: this flag is currently unused. it'll be used to tell whether we can process the multiplexed pipes in parallel threads. */
> >    if (next_hop->flags & (PIF_FINAL + PIF_FALLBACK))
> > -    self->super.flags &= PIF_MPX_INDEP_PATHS;
> > +    self->super.flags &= ~PIF_MPX_INDEP_PATHS;
> >    g_ptr_array_add(self->next_hops, next_hop);
> >  }
> >  
> > 
> > Can you confirm whether you have "final" flag, or whether this patch 
> > solves the issue for you?
> > 
> > Thanks in advance.
> > 
> > 
> 
> I checked our config file and we are not using the "final" flag at all.
> We're using a flags option for only one log statement
> 
> log { source(s_net); destination(d_leftovers); flags(fallback); };
> 
> Since changing the pipe destination to a file, syslog-ng has run happily
> since my last email (it used to hang about 30 minutes into a restart. So
> it must be due to nothing reading at the other end of the fifo.
> 
> I recompiled with the above patch and syslog-ng has been humming along
> without fail for several hours now. I'll let it run through the night
> and check it in the morning.

Great to hear. "final" and "fallback" fall into the same category, they
both produced the same ill behaviour. If you look at the context of the
patch I've posted in my previous email:

...
if (next_hop->flags & (PIF_FINAL + PIF_FALLBACK))
...

So the behaviour was wrong either "final" or "fallback" was set on a log
statement. Too bad this wasn't discovered earlier, this has been part of
the whole history of syslog-ng 3.0.

Probably who did use syslog-ng 3.0 either didn't use flags, or they
didn't notice, because their destinations worked properly.

This is the official patch, now pushed to syslog-ng public repository:

commit 495bdc3690fe1c01ed95b29f16e97829444973ee
Author: Balazs Scheidler <bazsi at balabit.hu>
Date:   Mon Dec 7 13:36:30 2009 +0100

    The flow-control flag was sometimes enabled even if not requested by the user

    In case a final or fallback flag was enabled on a log statement, it could enable
    the flow-control on the same level.

-- 
Bazsi