Hi all! My co-workers and I have had quite a trip around the "STATS: dropped N" log messages in the last few days, and I feel compelled to share the experience with you.

I administer a central log collecting, archiving, retrieving and analysing service. Its heart is, of course, syslog-ng. Syslog-ng stores the incoming data in separate files for each recognized subsystem, in separate directories for every host. Think of subsystems like cron, postfix, sshd, sendmail, exim, apache, su and sudo, netfilter, ipchains, kernel messages, grsec or execlog for special kernels, or the System, Security, Application, IIS and ActiveDirectory logs from a Windows source.

The analyser part of the system runs a few scripts on the files containing the previous day's log messages, eximstats for example. I have also written a syslog-ng log analyser script just for fun. One step of the analyser is to collect those ill-reputed "STATS: dropped N" messages and draw a graph showing the rate of dropped messages based on the last day's crop.

It was rather frustrating to see that graph for a few intranet log forwarder machines: more than 30 million lost lines per day! At first I couldn't believe it. "Is that value measured in bytes? Maybe in bits?" was the first idea of one of my co-workers, but the answer was crushing: it is measured in log lines.

There were plenty of agonized cracks at the problem: increase the buffer size (certainly a silly idea), verify the network interfaces, drivers and cables (a good idea that shed light on other problems but did not solve the original one), plans to upgrade to gigabit ethernet, eliminate every network device and use only crossover UTP cables, spread the analysis over more computers on the intranet, and so on.

Eventually there was an answer on this list with a little mistake in it. It suggested that the number in those log lines can only come from tcp destinations. That's wrong! It can come from other destinations too, for example program() and pipe(), and maybe more. And there came the revelation: the f*ckin' debug pipes!

We have a local practice of using pipe destinations. Chroot-ed environments often have a /var/log/messages pipe inside, and the syslog-ng running in the base system sends the data harvested from the jail back to that pipe for debugging purposes. If you are working inside the jail, you can "cat" the pipe and see the related events without exiting to the base system (where the data is stored in regular files). On the log forwarder machines there are pipes for debugging the incoming tcp and udp log traffic (handy for exploring and locating unknown log sources, checking that remote logging is on, or identifying dumb systems that use the udp protocol). Pipes seemed practical because they don't eat up disk space, yet the log stream can still be tapped when needed.

Our solution is an ugly workaround. I have written a little C program that opens the pipe with O_NONBLOCK and sends data from standard input to it. Because of the non-blocking I/O, syslog-ng cannot detect the data loss. So the pipe destinations are replaced with program(".../sbin/pipecat .../var/log/pipe") destinations to eliminate the falsely inflated syslog-ng statistics. (It is a pity that program("/bin/cat > /var/log/pipe"); generates the same wrong statistics, so my pipecat seems to be the one and only solution. Do you have any better idea?)

I'm sorry if I was boring! I think problems, their solutions and other people's experiences can be of service to everybody.
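For the curious, the essence of pipecat is roughly the following. This is only a stripped-down sketch of the idea, not the exact code running here: the file name, the buffer size and the error handling are just illustrative.

/* pipecat.c -- sketch of the workaround described above.
 * Copies standard input to the named pipe given as argv[1],
 * opening it with O_NONBLOCK so a missing or slow reader never
 * blocks us; whatever cannot be written is dropped silently,
 * which is exactly what we want for a debug pipe.
 */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[8192];
    ssize_t n;
    int fd = -1;

    if (argc != 2) {
        fprintf(stderr, "usage: %s /path/to/pipe\n", argv[0]);
        return 1;
    }

    /* a vanished reader must not kill us with SIGPIPE */
    signal(SIGPIPE, SIG_IGN);

    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        /* open lazily: O_WRONLY|O_NONBLOCK fails with ENXIO while
         * nobody has the pipe open for reading -- in that case the
         * chunk is simply thrown away */
        if (fd < 0)
            fd = open(argv[1], O_WRONLY | O_NONBLOCK);
        if (fd < 0)
            continue;

        /* short writes and a full pipe (EAGAIN) are ignored on
         * purpose; on EPIPE the reader went away, so reopen later */
        if (write(fd, buf, n) < 0 && errno == EPIPE) {
            close(fd);
            fd = -1;
        }
    }

    if (fd >= 0)
        close(fd);
    return 0;
}

Since both the open() and the write() are non-blocking, syslog-ng only ever sees pipecat's standard input happily accepting data, so there is nothing left for it to count as dropped; the loss on the debug pipe still happens, it just moves out of syslog-ng's sight, which is the whole point.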
Let this be a lesson to all of you out there! :) -- Have a nice day! bSanyI