syslog-ng-bounces@lists.balabit.hu wrote on 24/02/2008 20:00:05:
On Thu, 2008-02-21 at 16:56 +0000, AndyH@nominet.org.uk wrote:
syslog-ng-bounces@lists.balabit.hu wrote on 21/02/2008 16:16:52:
At one point we saw the whole system lock up for 7 minutes. No-one could ssh to the server but those on it could carry on. It seemed that everything was waiting on syslog-ng. I ran truss on it and nothing appeared in the output for ages, then all of a sudden it started working again. Nothing was logged for these 7 minutes
Truss output:
-bash-3.00$ sudo truss -failed -p 1418 Base time stamp: 1203503996.5724 [ Wed Feb 20 10:39:56 GMT 2008 ] 1418/1: psargs: /opt/syslog-ng/sbin/syslog-ng 1418/1: 98890.2339 getmsg(3, 0xFFFFFFFF7FFFF630, 0xFFFFFFFF7FFFF620, 0xFFFFFFFF7FFFF5FC) = 0
I could successfully reproduce the message loss on my solaris 9
system,
but I need some more time to investigate the issue.
Good news - you can reproduce the problem. I'll let you investigate.
During my investigations I found the followings:
1) Solaris /dev/log is not flow controlled in any way, if the buffer in the STREAMS device-head overflows, the message is gone for good and a warning is printed on the console, the same that you mentioned in one of your earlier mails.
2) The door support of syslog-ng was broken, because the door device was opened before going into the background, and doors are not inherited through forks. You can work this around by running syslog-ng with the --foreground command line option, and sending it to the background explicitly with your shell. Alternatively please apply the patch below.
3) If the door support works, clients generating messages slow somewhat down, because they invoke a door-call. This slow down then causes less messages to be dropped, but still some of it are dropped.
4) As syslog-ng is not threaded, it polls its inputs and writes its outputs from the same thread. E.g. while the output is being written, inputs are starved somewhat. This causes latency on the input side and as the STREAMS buffer is small, this latency causes more messages to be dropped, than syslogd.
Can it be made threaded? Although it might still lose some messages it should lose less of them.
5) syslogd does not lose messages if there's a single client that generates messages. If I start two instances of my simple test program, it also starts losing messages. The reason that it does not drop messages that it has a dedicated thread for polling inputs, thus the latency is smaller. On the other hand it puts all incoming messages to a queue which is not limited in size. Thus if the file write thread is slow, syslogd may consume unlimited amount of memory.
6) You can decrease the latency between getmsg() calls in syslog-ng, by * using a higher log-fetch-limit, * using throttle on the output (requires syslog-ng 2.1.x), * ensuring that your FIFO (log-fifo-size) is appropriately sized,
The effects are as follows: * log-fetch-limit increases the number of getmsg() calls per poll iterations * throttle limits the number of messages written out per second, effectively limiting the time spent writing the output,
So that's about it. The message loss is inherently caused by the fact that /dev/log is not flow controlled, and even worse the buffer size cannot be changed.
All options I have left is to force clients to wait somewhat when they invoke the syslog door, this'd be poor man's flow-control though.
This patch fixes the door initialization problem:
http://git.balabit.hu/?p=bazsi/syslog-ng-2. 1.git;a=commit;h=5b65710e9781203d005e6f4bcb3f33bad4f827b6
patch fails to compile: afstreams.c:195: warning: implicit declaration of function `register_application_hook' afstreams.c:195: error: `AH_POST_DAEMONIZED' undeclared (first use in this function) afstreams.c:195: error: (Each undeclared identifier is reported only once afstreams.c:195: error: for each function it appears in.) Andy