Interesting issue with syslog-ng 3.3
Hi!

While this mail might sound a bit vague, it will, if nothing else, serve as a reminder for me to investigate the issue further.

On one of my servers (PowerPC, running Debian Squeeze), I have syslog-ng 3.3 running, a reasonably recent (2-3 day old) git snapshot. It works quite well, except that I was able to trace my server's recent hangs back to syslog-ng:

The server had an uptime of ~120 days when I upgraded from 3.1 to 3.3, and since then it has had to be rebooted twice in just two weeks. Last time, I didn't have any open connections to it, so I couldn't investigate, but tonight I had an ssh session open with a screen session inside.

So I tried to look around: first, I wanted to check the logs, but knew I wouldn't find anything, as it had stopped sending logs to my other server about two hours before I noticed the problem. Even worse, when I tried to sudo, that hung indefinitely. Weird.

There was nothing in dmesg, and nothing interesting in the logs it did send before becoming unresponsive. HTTP still worked, as did a few other services. I could do nearly anything as a user.

So I tried stracing crontab, and it hung when it tried to send logs to /dev/log. Interesting! I tried logger, and the same happened.

I suspect that for one reason or another, /dev/log got overwhelmed, and even worse, syslog-ng ended up trying to log something as well, which made it hang too. Thus the queue remained full, and everything that tried to log got stuck.

HTTP continued to work, since my httpd isn't using syslog for its logs. I could poke around in my shell, since that wasn't logging either.

This never happened with 3.1, and pretty much the only thing I changed in the config is the @version. Thus, I suspect there's some very nasty bug in 3.3beta2 that I haven't found yet.

I'm leaving a root shell open this time, so that I can poke around further next time (along with a syslog-ng compiled with debug symbols).

In the meantime, I thought I'd drop a note, hoping that perhaps Bazsi or someone from the syslog-ng devel team would have an idea where to look, and what to check next time this happens.

-- 
|8]
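The symptom above (crontab, logger and sudo all hanging on /dev/log) can be checked from an unprivileged shell without getting stuck in the same way. The following is only a minimal sketch, not something from the thread: the function name `probe_log_socket`, the two-second timeout and the test message are illustrative assumptions. It sends one real message to the log socket and reports whether the write went through or timed out.

```python
import errno
import socket


def probe_log_socket(path="/dev/log", timeout=2.0):
    """Send one syslog-style message to `path`, giving up after `timeout` seconds.

    Returns "ok", "blocked", or "error: <reason>".
    """
    message = b"<14>probe: is anyone reading the log socket?"  # 14 = user.info
    # glibc's syslog() tries a datagram socket first and falls back to a
    # stream socket if the socket type does not match; do the same here.
    for kind in (socket.SOCK_DGRAM, socket.SOCK_STREAM):
        sock = socket.socket(socket.AF_UNIX, kind)
        sock.settimeout(timeout)
        # Stream sockets need a terminator so the reader can split messages.
        payload = message if kind == socket.SOCK_DGRAM else message + b"\n"
        try:
            sock.connect(path)
            sock.sendall(payload)
            return "ok"
        except socket.timeout:
            # The daemon is not draining the socket: the hang described above.
            return "blocked"
        except OSError as exc:
            if exc.errno == errno.EPROTOTYPE and kind == socket.SOCK_DGRAM:
                continue  # the socket exists but is a stream socket; retry
            return "error: %s" % exc
        finally:
            sock.close()
    return "error: no usable socket type"


if __name__ == "__main__":
    print(probe_log_socket())
```

A result of "blocked" would point at the log daemon no longer reading its socket, which matches what strace showed for crontab.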
I wonder if it's related at all to the memory leak? On Tue, Aug 30, 2011 at 3:40 PM, Gergely Nagy <algernon@balabit.hu> wrote:
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
Martin Holste <mcholste@gmail.com> writes:
I wonder if it's related at all to the memory leak?
Nope, it isn't. The OOM killer would've booted it in that case, and/or the rest of the system would've died as well. (And I actually checked the memory usage, and it was well within limits; syslog-ng didn't use more than it usually does.)

-- 
|8]
We had this problem at AMD. The problem turned out to be that /dev/console was attached to a device (an iLO in our case) that went offline occasionally and would block on writes. We fixed it by updating our syslog-ng.conf to write to the console using a pipe() directive instead of file().

You may have something similar, especially if there are occasional messages that get routed to /dev/console (or any other pipe/device that may block).

Paul Krizak                         7171 Southwest Pkwy MS B200.3A
MTS Systems Engineer                Austin, TX 78735
Advanced Micro Devices              Desk: (512) 602-8775
Linux/Unix Systems Engineering      Cell: (512) 791-0686
Global IT Infrastructure            Fax: (512) 602-0468

On 08/30/2011 01:40 PM, Gergely Nagy wrote:
Paul Krizak <paul.krizak@amd.com> writes:
We had this problem at AMD. The problem turned out to be that /dev/console was attached to a device (an iLO in our case) that went offline occasionally and would block on writes. We fixed it by updating our syslog-ng.conf to write to the console using a pipe() directive instead of file().
You may have something similar, especially if there are occasional messages that get routed to /dev/console (or any other pipe/device that may block).
Hmm, that sounds like a good idea. I do have stuff going to /dev/console, and I believe messages do get routed there from time to time. I'll tweak my config and see if it helps. Thanks!

-- 
|8]
I managed to reliably reproduce the problem, thanks to the above suggestion:

    source s_src { system(); tcp(port(12345)); };
    destination d_xconsole { pipe("/dev/xconsole"); };
    log { source(s_src); destination(d_xconsole); };

Throwing a few thousand logs on this while there's nothing listening on the other end of /dev/xconsole will eventually hang syslog-ng 3.3, even if I use pipe() instead of file(). Emptying /dev/xconsole will, as expected, restore normal operation.

This used to work with previous releases, and judging by Paul's mail, it works with whatever version they have (which, I assume, is not 3.3).

Since I never ever use xconsole, I just removed that destination from my config, but the underlying bug should be fixed nevertheless (unless this is the expected behaviour, which I doubt).

-- 
|8]
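For anyone wanting to repeat the test, "throwing a few thousand logs" at the tcp(port(12345)) source could look roughly like the sketch below. It is an assumption about how the load might be generated, not the method actually used in this thread: the function name `send_test_logs`, the message text and the default count are made up for illustration. Each message is a newline-terminated line with a BSD-style priority prefix.

```python
import socket


def send_test_logs(host="127.0.0.1", port=12345, count=5000, timeout=5.0):
    """Send `count` newline-terminated syslog-style lines over TCP.

    Returns the number of lines sent before finishing or stalling.
    """
    sent = 0
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for i in range(count):
            line = "<13>flood-test: message %d\n" % i  # 13 = user.notice
            try:
                conn.sendall(line.encode("ascii"))
            except socket.timeout:
                # The receiver stopped reading; with the config above this
                # may be the stall being reproduced.
                break
            sent += 1
    return sent


if __name__ == "__main__":
    print("lines sent:", send_test_logs())
```

If the returned count is lower than `count`, the receiving side stopped accepting data within the timeout, which is worth correlating with whether anything is reading /dev/xconsole at that moment.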
On Fri, 2011-09-02 at 13:21 +0200, Gergely Nagy wrote:
This patch fixes this issue:

commit bb0057e78b81c574f2ce677d13d23ac6df7ac057
Author: Balazs Scheidler <bazsi@balabit.hu>
Date:   Sat Sep 3 10:37:28 2011 +0200

    pipe/file destination: fixed flipped slow-flow-control state

    The condition to enable soft flow control was flipped between file and pipe destinations. Files should have had it enabled, pipe() disabled. This patch fixes that.

    Reported-By: Gergely Nagy <algernon@balabit.hu>
    Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>

-- 
Bazsi
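As background to why a pipe() destination with no reader matters for flow control, the sketch below shows the kernel-level behaviour only; it is not syslog-ng code, and the function name `fill_unread_fifo` and the temporary FIFO standing in for /dev/xconsole are assumptions for illustration. On Linux, a FIFO that nobody reads accepts writes until its buffer is full, then refuses more until it is emptied, which matches the observation that emptying /dev/xconsole restores normal operation.

```python
import os
import tempfile


def fill_unread_fifo(chunk=b"x" * 1024):
    """Write to a FIFO that nobody reads until the kernel refuses more data.

    Returns (bytes accepted before the pipe was full,
             bytes accepted by one more write after draining it).
    """
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "xconsole")  # stand-in for /dev/xconsole
    os.mkfifo(path)
    # O_RDWR lets open() succeed without a separate reader (Linux behaviour);
    # O_NONBLOCK turns "would block" into an error instead of a hang.
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        accepted = 0
        while True:
            try:
                accepted += os.write(fd, chunk)
            except BlockingIOError:
                break  # pipe buffer is full; a blocking writer would hang here
        # "Emptying" the pipe: read until nothing is left.
        while True:
            try:
                if not os.read(fd, 65536):
                    break
            except BlockingIOError:
                break
        after_drain = os.write(fd, chunk)  # accepted again once there is room
        return accepted, after_drain
    finally:
        os.close(fd)
        os.unlink(path)
        os.rmdir(directory)


if __name__ == "__main__":
    print(fill_unread_fifo())
```

A writer that treats the full pipe as a reason to stop reading its sources will stall everything upstream, which is presumably why the commit disables soft flow control for pipe() while keeping it for file().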