[syslog-ng] strange thing

Stefan Seufert seuf@ccsw.de
Sat, 15 Apr 2000 12:29:11 +0200


Hi,

yesterday I experienced a very strange problem with syslog-ng which I'd like
to report. Maybe someone here has a clue what might have caused this. I am
not able to reproduce the problem on a different machine, and I don't want
to try it again on the machine where I experienced it, since this is the main
server of my company.

Well, what happened? I was playing around with syslog-ng. I had added a
program() destination, but for some reason it did not work as I wanted
(due to a name resolution problem, the logfiles always contained the IP
instead of the hostname, so my script did not parse the input correctly, but
that is not relevant). In order to see what was going on I added a "destination
home { udp("1.2.3.4"); };" and a "log { source(net); destination(home); };". At
home, from where I was working, I had a simple VB program running on my Windows
workstation which was listening on UDP port 514 and putting everything it
received in a log window. However, no messages appeared. So I decided to
write the messages to a log file, which should be more reliable than sending
them over a dialup connection, and modified the log statement to "log
{ source(net); destination(home); destination(all); };" where all was
"destination all { file("/var/log/allmessages"); };". From that moment on
(i.e. after the HUP) the whole system hung. Every process trying to
use syslog blocked. Within a few seconds I had some hundred pop3d and
sendmail processes running, and my own ssh session was blocked since I had
tried to issue a logger command. I was not able to telnet or ssh to this host
since both daemons tried to log when I connected. Luckily, someone else at my
company still had an open telnet session. I called him and asked him to remove
the offending lines from the config and send syslog-ng a SIGHUP. No effect. Only
a SIGKILL was able to get us out of this strange situation. Within seconds
all the daemons went back to work again.
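
For reference, the configuration described above can be sketched roughly as
follows. This is a reconstruction, not the exact file: the mail does not say
how the "net" source was declared, so the udp() source shown here is an
assumption, and 1.2.3.4 is only a placeholder address.

```
# Assumed definition; the mail does not say how "net" was declared.
source net { udp(); };

# Forward messages to the workstation at home (placeholder address;
# udp() sends to port 514 unless another port is given).
destination home { udp("1.2.3.4"); };

# Local file copy of the same messages.
destination all { file("/var/log/allmessages"); };

# The log statement that was active when everything blocked.
log { source(net); destination(home); destination(all); };
```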

I really have no idea what might have caused this and I am not able to give
you more details. As I said, I was not able to reproduce this situation on a
test server. I am a developer myself and I know perfectly well that this
description is nearly useless because it lacks facts, but I am not able
to deliver any, and I thought it might be better to state that there
might be a problem, so that if someone else reports something similar it
becomes easier to puzzle the whole thing together.

Please note that the system is very old (Kernel 2.0.35, still libc) and has
an uptime of more than 300 days now without having ECC RAM. This system
became "too important to be upgraded". It will be replaced in the near future
with a new machine, so that we do not have to take it down but can switch over
smoothly. Maybe I can reproduce the situation then and send you some straces
etc.

Stefan

---
Programming today is a race between software engineers
striving to build bigger and better idiot-proof programs,
and the Universe trying to produce bigger and better idiots.
So far, the Universe is winning.