[syslog-ng] Syslog-ng losing messages on solaris 10

Balazs Scheidler bazsi at balabit.hu
Thu Feb 21 17:16:52 CET 2008


On Thu, 2008-02-21 at 15:32 +0000, AndyH at nominet.org.uk wrote:
> 
> syslog-ng-bounces at lists.balabit.hu wrote on 21/02/2008 15:19:15:
> 
> >
> > On Wed, 2008-02-20 at 18:30 +0100, Balazs Scheidler wrote:
> > > On Wed, 2008-02-20 at 14:32 +0000, AndyH at nominet.org.uk wrote:
> > > > When I run the syslogd as supplied with Solaris 10 then all messages
> get
> > > > logged, but when I use syslog-ng then it loses messages.  On a Sun
> V210 I
> > > > see these messages
> > > >
> > > > message overflow on /dev/log minor #6 -- is syslogd(1M) running?
> > > > message overflow on /dev/log minor #6 -- is syslogd(1M) running?
> > > > message overflow on /dev/log minor #6 -- is syslogd(1M) running?
> > ...
> >
> > > diff --git a/src/afstreams.c b/src/afstreams.c
> > > index 009b074..d0a76f3 100644
> > > --- a/src/afstreams.c
> > > +++ b/src/afstreams.c
> > > @@ -134,7 +134,7 @@ afstreams_sd_init(LogPipe *s, GlobalConfig
> > *cfg, PersistentConfig *persist)
> > >            close(fd);
> > >            return FALSE;
> > >          }
> > > -      self->reader = log_reader_new(streams_read_new(fd),
> > LR_LOCAL | LR_NOMREAD | LR_PKTTERM, s, &self->reader_options);
> > > +      self->reader = log_reader_new(streams_read_new(fd),
> > LR_LOCAL | LR_PKTTERM, s, &self->reader_options);
> > >        log_pipe_append(self->reader, s);
> > >
> > >        if (self->door_filename)
> > >
> > > This will cause the log-fetch-limit() option to become effective,
> > thus several messages
> > > are going to be fetched for every iteration, this can easily
> > multiply performance.
> > >
> > > Please also check if the local messages get mangled in any way, I
> > seriously doubt
> > > that would happen, but messing with message transports always
> > carries some risk.
> > >
> >
> > Can you please send feedback on this patch? Thanks.
> 
> Sorry for the delay - I've been doing some more testing.  With the patch we
> are still losing messages and getting the overflow messages on the console.
> 
> Solaris syslogd logged 250k messages without missing any, but syslog-ng
> loses lots of messages - 40-50% when we are hitting it with 7000
> messages/sec.

I'm missing something obvious on Solaris then. syslog-ng currently polls
the /dev/log device and only reads messages from it whenever poll
indicates readability.

With the patch it performs several getmsg() calls for per poll
iteration.

I've just checked what the native syslogd does. It polls in LWP#5 then
calls getmsg() and then another thread delivers the message to the
system log, and another to an UDP destination.

I don't think syslog-ng would be that bad, delivering the message, just
because of the usage of threads.

An interesting question where the lossage occurs? Is syslog-ng itself
reporting message loss, or it happens in the log device?

If there's a kernel warning, then probably it's the log device.

> 
> At one point we saw the whole system lock up for 7 minutes.  No-one could
> ssh to the server but those on it could carry on.  It seemed that
> everything was waiting on syslog-ng.  I ran truss on it and nothing
> appeared in the output for ages, then all of a sudden it started working
> again.  Nothing was logged for these 7 minutes

This is probably fixed by this patch (I forgot to set the streams file
descriptor to nonblocking mode), and syslog-ng blocked while there was
no log messages.

diff --git a/src/afstreams.c b/src/afstreams.c
index d0a76f3..fdf18c4 100644
--- a/src/afstreams.c
+++ b/src/afstreams.c
@@ -134,6 +134,7 @@ afstreams_sd_init(LogPipe *s, GlobalConfig *cfg, PersistentConfig *persist)
           close(fd);
           return FALSE;
         }
+      g_fd_set_nonblock(fd, TRUE);
       self->reader = log_reader_new(streams_read_new(fd), LR_LOCAL | LR_PKTTERM, s, &self->reader_options);
       log_pipe_append(self->reader, s);
       


> 
> Truss output:
> 
> -bash-3.00$ sudo truss -failed -p 1418
> Base time stamp:  1203503996.5724  [ Wed Feb 20 10:39:56 GMT 2008 ]
> 1418/1:         psargs: /opt/syslog-ng/sbin/syslog-ng
> 1418/1:         98890.2339      getmsg(3, 0xFFFFFFFF7FFFF630,
> 0xFFFFFFFF7FFFF620, 0xFFFFFFFF7FFFF5FC) = 0

I could successfully reproduce the message loss on my solaris 9 system,
but I need some more time to investigate the issue.

-- 
Bazsi



More information about the syslog-ng mailing list