[syslog-ng] pipe causing lockup?

Tim Rupp caphrim007 at gmail.com
Mon Dec 7 14:21:02 CET 2009



Balazs Scheidler wrote:
> On Sun, 2009-12-06 at 21:17 -0600, Tim Rupp wrote:
>> Balazs Scheidler wrote:
>>> On Sat, 2009-12-05 at 18:09 -0600, Tim Rupp wrote:
>>>> Hi list,
>>>>
>>>> I'm hoping someone can explain this to me because it's taken several
>>>> days of strace'ing and debugging for me to draw this conclusion.
>>>>
>>>> I have a source that right now is receiving a relay of syslog from
>>>> another syslog-ng server, version 1.6.11. The new syslog-ng server is
>>>> 3.0.4 using the RHEL 5 amd64 rpm.
>>>>
>>>> The new server has the following source to receive from the old server
>>>>
>>>> source s_net {
>>>>         udp(
>>>>                 ip(0.0.0.0) port(514)
>>>>         );
>>>> };
>>>>
>>>>
>>>> I was seeing similar problems as described by Mike in this discussion
>>>>
>>>> https://lists.balabit.hu/pipermail/syslog-ng/2009-September/013371.html
>>>>
>>>>
>>>> I strace'd the syslog-ng process, would wait for it to die, and then
>>>> attempted to see if there were any patterns. In each strace syslog-ng
>>>> appeared to hand right after processing a write to a fifo; the relevant
>>>> strace output is below.
>>>>
>>>> 25461 write(340, "Dec  3 22:32:32 cmsstor52 sshd[28272]: Authorized to
>>>> root, krb5 principal host/cmssrv28.fnal.gov at FNAL.GOV (krb5_kuserok)\n",
>>>> 121) = 121
>>>> 25461 poll([{fd=39, events=POLLOUT}, {fd=6, events=POLLIN}, {fd=3,
>>>> events=POLLIN}, {fd=10, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
>>>> events=POLLIN}, {fd=
>>>> 9, events=POLLIN}], 7, 29155) = 1 ([{fd=9, revents=POLLIN}])
>>>> 25461 recvfrom(9, "<14>Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID
>>>> 702911 user.info] -- DHCP Watchdog reports:   All DHCP registration
>>>> processes are runni
>>>> ng --\n", 8192, 0, {sa_family=AF_INET, sin_port=htons(46911),
>>>> sin_addr=inet_addr("131.225.110.121")}, [16]) = 144
>>>> 25461 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
>>>> 25461 poll([{fd=39, events=POLLOUT}], 1, 0) = 0 (Timeout)
>>>> 25461 lseek(14, 0, SEEK_END)            = 1212423178
>>>> 25461 write(14, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
>>>> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
>>>> are running --\
>>>> n", 140) = 140
>>>> 25461 lseek(38, 0, SEEK_END)            = 641655
>>>> 25461 write(38, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
>>>> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
>>>> are running --\
>>>> n", 140) = 140
>>>> 25461 lseek(40, 0, SEEK_END)            = 641655
>>>> 25461 write(40, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
>>>> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
>>>> are running --\
>>>> n", 140) = 140
>>>> 25461 lseek(41, 0, SEEK_END)            = 641655
>>>> 25461 write(41, "Dec  3 22:32:30 fndhcp1 dhcp_checkd[1679]: [ID 702911
>>>> user.info] -- DHCP Watchdog reports:   All DHCP registration processes
>>>> are running --\
>>>> n", 140) = 140
>>>>
>>>>
>>>>
>>>> fd 9 is the UDP socket.
>>>>
>>>> Right after those messages, it would do a bunch of close() calls
>>>>
>>>> ...
>>>> 25461 close(203)                        = 0
>>>> 25461 close(150)                        = 0
>>>> 25461 close(565)                        = 0
>>>> 25461 close(637)                        = 0
>>>> 25461 close(517)                        = 0
>>>> ...
>>>>
>>>>
>>>>
>>>> followed by a poll that, this time, didn't include the UDP socket
>>>>
>>>>
>>>> 25461 poll([{fd=39, events=POLLOUT}, {fd=6, events=POLLIN}, {fd=3,
>>>> events=POLLIN}, {fd=10, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
>>>> events=POLLIN}], 6,
>>>> 0) = 0 (Timeout)
>>>> 25461 poll([{fd=39, events=POLLOUT}], 1, 0) = 0 (Timeout)
>>>> 25461 lseek(11, 0, SEEK_END)            = 107767627
>>>> 25461 write(11, "Dec  3 22:33:01 clogged syslog-ng[25461]: Reaping
>>>> unused destination files;
>>>> template='/logging/syslog-ng/$HOST/messages'\n", 121) = 121
>>>> 25461 lseek(12, 0, SEEK_END)            = 105497311
>>>>
>>>>
>>>> I don't have flow-control enabled in my config.
>>>>
>>>>
>>>>
>>>> After several days of head scratching and reading, I narrowed it down to
>>>> having something to do with my DHCP destinations/filters/logs
>>>>
>>>> I have a pipe() destination that points to a fifo
>>>>
>>>> 	destination d_cstdb    { pipe("/opt/dhcp_syslog_cst_db"); };
>>>>
>>>> a filter
>>>>
>>>> 	filter f_dhcp   { facility(local4,user) and host("fndhcp1*"); };
>>>>
>>>> and the corresponding log statement
>>>>
>>>> 	log { source(s_net); filter(f_dhcp); destination(d_cstdb); };
>>>>
>>>>
>>>>
>>>> I've concluded that the problem is caused by the pipe/fifo.
>>>>
>>>> On the old server I have something reading from the fifo. On the new
>>>> server I hadn't stood that process up yet.
>>>>
>>>> syslog-ng would start just fine, would even write several entries to
>>>> that fifo, and then would reach a point where things would hang. It
>>>> would stop reading from the UDP socket, and just log things like
>>>> destinations expiring.
>>>>
>>>> I looked online at the documentation for FIFOs and linux and noted that
>>>> 3.0.4 source appears to specify that fifo's be opened non-blocking.
>>>>
>>>> So forgive the silly question, but if the fifo is opened non-blocking,
>>>> why is syslog-ng appearing to block when writing to a fifo that isn't
>>>> being read from?
>>>>
>>>> Also, why would the UDP socket file descriptor be removed from the poll
>>>> list? An lsof of the syslog-ng process clearly showed that it was still
>>>> bound to *:syslog
>>>>
>>>> Is this is bug or normal operation? Maybe I don't understand linux's
>>>> behavior in regards to fifos.
>>> On the top of my head, do you have any other flags on any of the
>>> affected log statements? We've found a bug which triggers enabled
>>> flow-control if only the "final" flag is enabled on one of the affected
>>> log statements.
>>>
>>> This is the patch which fixes this problem:
>>>
>>> diff --git a/src/logmpx.c b/src/logmpx.c
>>> index e7d5bbb..3d34f5f 100644
>>> --- a/src/logmpx.c
>>> +++ b/src/logmpx.c
>>> @@ -29,7 +29,7 @@ log_multiplexer_add_next_hop(LogMultiplexer *self, LogPipe *next_hop)
>>>  {
>>>    /* NOTE: this flag is currently unused. it'll be used to tell whether we can process the multiplexed pipes in parallel threads. */
>>>    if (next_hop->flags & (PIF_FINAL + PIF_FALLBACK))
>>> -    self->super.flags &= PIF_MPX_INDEP_PATHS;
>>> +    self->super.flags &= ~PIF_MPX_INDEP_PATHS;
>>>    g_ptr_array_add(self->next_hops, next_hop);
>>>  }
>>>  
>>>
>>> Can you confirm whether you have "final" flag, or whether this patch 
>>> solves the issue for you?
>>>
>>> Thanks in advance.
>>>
>>>
>> I checked our config file and we are not using the "final" flag at all.
>> We're using a flags option for only one log statement
>>
>> log { source(s_net); destination(d_leftovers); flags(fallback); };
>>
>> Since changing the pipe destination to a file, syslog-ng has run happily
>> since my last email (it used to hang about 30 minutes into a restart. So
>> it must be due to nothing reading at the other end of the fifo.
>>
>> I recompiled with the above patch and syslog-ng has been humming along
>> without fail for several hours now. I'll let it run through the night
>> and check it in the morning.
> 
> Great to hear. "final" and "fallback" fall into the same category, they
> both produced the same ill behaviour. If you look at the context of the
> patch I've posted in my previous email:
> 
> ...
> if (next_hop->flags & (PIF_FINAL + PIF_FALLBACK))
> ...
> 
> So the behaviour was wrong either "final" or "fallback" was set on a log
> statement. Too bad this wasn't discovered earlier, this has been part of
> the whole history of syslog-ng 3.0.
> 
> Probably who did use syslog-ng 3.0 either didn't use flags, or they
> didn't notice, because their destinations worked properly.
> 
> This is the official patch, now pushed to syslog-ng public repository:
> 
> commit 495bdc3690fe1c01ed95b29f16e97829444973ee
> Author: Balazs Scheidler <bazsi at balabit.hu>
> Date:   Mon Dec 7 13:36:30 2009 +0100
> 
>     The flow-control flag was sometimes enabled even if not requested by the user
>     
>     In case a final or fallback flag was enabled on a log statement, it could enable
>     the flow-control on the same level.
> 
> 
> 

Just following up that it ran through the night without fail, so that
patch appears to indeed fix the issue.

Thanks!
Tim


More information about the syslog-ng mailing list