Hi all! My co-workers and I have had quite a trip around the "STATS: dropped N" log messages in the last few days, and I feel compelled to share the experience with you.

I administer a central log collecting, archiving, retrieving and analysing service. Its heart is, of course, syslog-ng. Syslog-ng stores the incoming data in separate files for each recognized subsystem, in separate directories for every host. Think of subsystems like cron, postfix, sshd, sendmail, exim, apache, su and sudo, netfilter, ipchains, kernel messages, grsec or execlog for special kernels, or the System, Security, Application, IIS and ActiveDirectory logs from a Windows source.

The analyser part of the system runs a few scripts on the files containing the previous day's log messages, eximstats for example. I have also written a syslog-ng log analyser script just for fun. One step of the analyser is to collect those ill-reputed "STATS: dropped N" messages and draw a graph showing the rate of dropped messages based on the last day's crop.

It was rather frustrating to see that graph for a few intranet log forwarder machines: more than 30 million lost lines per day! At first I couldn't believe it. "Is that value measured in bytes? Maybe in bits?" was the first idea of one of my co-workers, but the answer was crushing: it is measured in log lines.

There were plenty of agonized cracks at the problem: increase the buffer size (certainly a silly idea), verify the network interfaces, drivers and cables (a good idea that shed light on other problems but did not solve the original one), plans to upgrade to gigabit ethernet, eliminate every network device and use only crossover UTP cables, spread the analysis over more computers on the intranet, and so on.

Eventually there was an answer on this list with a little mistake in it. It suggested that the number in those log lines can only come from tcp destinations. That's wrong! It can come from other destinations too, for example program() and pipe(), and maybe more. And there came the revelation: the f*ckin' debug pipes!

We have a local practice of using pipe destinations. Chroot-ed environments often have a /var/log/messages pipe inside, and the syslog-ng running in the base system sends the data harvested from the jail back to that pipe for debugging purposes. If you are working inside the jail, you can "cat" the pipe and see the related events without exiting to the base system (where the data is stored in regular files). On the log forwarder machines there are pipes for debugging the incoming tcp and udp log traffic (handy for exploring and locating unknown log sources, checking that remote logging is on, or identifying dumb systems that use the udp protocol). Pipes seemed practical because they don't eat up disk space, yet the log stream can still be tapped when needed.

Our solution is an ugly workaround. I have written a little C program that opens the pipe with O_NONBLOCK and sends data from standard input to it. Because of the non-blocking I/O, syslog-ng cannot detect the data loss. So the pipe destinations are replaced with program(".../sbin/pipecat .../var/log/pipe") destinations to eliminate the falsely inflated syslog-ng statistics. (It is a pity that program("/bin/cat > /var/log/pipe"); generates the same wrong statistics, so my pipecat seems to be the one and only solution. Do you have any better idea?)

I'm sorry if I was boring! I think problems, their solutions and other people's experiences can be of service to everybody.
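For the curious, the essence of pipecat is roughly the following. This is only a stripped-down sketch of the idea, not the exact code running here: the file name, the buffer size and the error handling are just illustrative.

/* pipecat.c -- sketch of the workaround described above.
 * Copies standard input to the named pipe given as argv[1],
 * opening it with O_NONBLOCK so a missing or slow reader never
 * blocks us; whatever cannot be written is dropped silently,
 * which is exactly what we want for a debug pipe.
 */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[8192];
    ssize_t n;
    int fd = -1;

    if (argc != 2) {
        fprintf(stderr, "usage: %s /path/to/pipe\n", argv[0]);
        return 1;
    }

    /* a vanished reader must not kill us with SIGPIPE */
    signal(SIGPIPE, SIG_IGN);

    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        /* open lazily: O_WRONLY|O_NONBLOCK fails with ENXIO while
         * nobody has the pipe open for reading -- in that case the
         * chunk is simply thrown away */
        if (fd < 0)
            fd = open(argv[1], O_WRONLY | O_NONBLOCK);
        if (fd < 0)
            continue;

        /* short writes and a full pipe (EAGAIN) are ignored on
         * purpose; on EPIPE the reader went away, so reopen later */
        if (write(fd, buf, n) < 0 && errno == EPIPE) {
            close(fd);
            fd = -1;
        }
    }

    if (fd >= 0)
        close(fd);
    return 0;
}

Since both the open() and the write() are non-blocking, syslog-ng only ever sees pipecat's standard input happily accepting data, so there is nothing left for it to count as dropped; the loss on the debug pipe still happens, it just moves out of syslog-ng's sight, which is the whole point.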
Let this be a lesson to all of you out there! :) -- Have a nice day! bSanyI