syslog-ng deadlock if /dev/console locks?
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it. We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change. What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console. Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console. What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks. Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write. -- Paul Krizak 7171 Southwest Pkwy MS B200.3A MTS Systems Engineer Austin, TX 78735 Advanced Micro Devices Desk: (512) 602-8775 Linux/Unix Systems Engineering Cell: (512) 791-0686 Global IT Infrastructure Fax: (512) 602-0468
Hello, On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug. Regards, Sandor
Interesting. We are definitely not seeing the 10 sec timeout protection -- it flat-out hangs. But we are using a file destination for the console device. We used to use a pipe() destination back in the syslog-ng-2.x days, but when I tried that in 3.x it complained loudly about it, and suggested using file() instead. We never had this problem in syslog-ng-2.x. Could it be that pipe() is actually the better/more correct way to write to the console? Here's the relevant snippets from our configuration: source s_local { # standard Linux log source (this is the default place for the syslog() # function to send logs to) unix-stream("/dev/log"); } source s_kernel { # messages from the kernel file("/proc/kmsg" program_override("kernel")); }; log { # /var/log/syslog-ng.log source(s_self); destination(d_self); }; filter f_console { # Stuff that goes to the console ( facility(local1) and priority(info) ) or ( priority(emerg) ); }; destination d_console { file("/dev/console"); }; log { # Console source(s_local); source(s_kernel); source(s_self); filter(f_console); destination(d_console); }; Paul Krizak 7171 Southwest Pkwy MS B200.3A MTS Systems Engineer Austin, TX 78735 Advanced Micro Devices Desk: (512) 602-8775 Linux/Unix Systems Engineering Cell: (512) 791-0686 Global IT Infrastructure Fax: (512) 602-0468 On 01/26/11 10:03, Sandor Geller wrote:
Hello,
On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak<paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug.
Regards,
Sandor ______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
Hi, On Wed, Jan 26, 2011 at 5:10 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Interesting. We are definitely not seeing the 10 sec timeout protection -- it flat-out hangs. But we are using a file destination for the console device. We used to use a pipe() destination back in the syslog-ng-2.x days, but when I tried that in 3.x it complained loudly about it, and suggested using file() instead. We never had this problem in syslog-ng-2.x. Could it be that pipe() is actually the better/more correct way to write to the console?
Under the hood the file() and pipe() drivers were more or less the same. The issue is that only the source side sets the timeout, the destination doesn't so when a log arrives on a source other than file() or program() then timeout never gets set. One notable difference between file() and pipe() is that for files syslog-ng assumes that the file is always writeable (which is true at least on linux) while for pipes syslog-ng checks with poll() first so when the 4k page used for the pipe is full then syslog-ng won't try to write to the pipe effectively avoiding being blocked.
Here's the relevant snippets from our configuration:
source s_local { # standard Linux log source (this is the default place for the syslog() # function to send logs to) unix-stream("/dev/log"); }
it's somewhat better to use unix-dgram() for /dev/log
source s_kernel { # messages from the kernel file("/proc/kmsg" program_override("kernel")); };
log { # /var/log/syslog-ng.log source(s_self); destination(d_self); };
filter f_console { # Stuff that goes to the console ( facility(local1) and priority(info) ) or ( priority(emerg) ); };
destination d_console { file("/dev/console"); };
log { # Console source(s_local); source(s_kernel); source(s_self); filter(f_console); destination(d_console); };
and here you connect a stream source (/dev/log) with /dev/console Regards, Sandor
Sandor Geller wrote:
Hi,
On Wed, Jan 26, 2011 at 5:10 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Interesting. We are definitely not seeing the 10 sec timeout protection -- it flat-out hangs. But we are using a file destination for the console device. We used to use a pipe() destination back in the syslog-ng-2.x days, but when I tried that in 3.x it complained loudly about it, and suggested using file() instead. We never had this problem in syslog-ng-2.x. Could it be that pipe() is actually the better/more correct way to write to the console?
Under the hood the file() and pipe() drivers were more or less the same. The issue is that only the source side sets the timeout, the destination doesn't so when a log arrives on a source other than file() or program() then timeout never gets set. One notable difference between file() and pipe() is that for files syslog-ng assumes that the file is always writeable (which is true at least on linux) while for pipes syslog-ng checks with poll() first so when the 4k page used for the pipe is full then syslog-ng won't try to write to the pipe effectively avoiding being blocked.
I loudly disagree. Files are not "always writable". We continue to bump into the case where something generates tones of logs and fills the filesystem. The files are not writeable when this occurs, and syslog-ng can never recover from this even when space is made available again. The lame logic needs to be applied to files as is done for all other destinations.
Here's the relevant snippets from our configuration:
source s_local { # standard Linux log source (this is the default place for the syslog() # function to send logs to) unix-stream("/dev/log"); }
it's somewhat better to use unix-dgram() for /dev/log
source s_kernel { # messages from the kernel file("/proc/kmsg" program_override("kernel")); };
log { # /var/log/syslog-ng.log source(s_self); destination(d_self); };
filter f_console { # Stuff that goes to the console ( facility(local1) and priority(info) ) or ( priority(emerg) ); };
destination d_console { file("/dev/console"); };
log { # Console source(s_local); source(s_kernel); source(s_self); filter(f_console); destination(d_console); };
and here you connect a stream source (/dev/log) with /dev/console
Regards,
Sandor ______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
-- Evan Rempel erempel@uvic.ca Senior Systems Administrator 250.721.7691 Unix Services, University Systems, University of Victoria
I loudly disagree. Files are not "always writable". We continue to bump into the case where something generates tones of logs and fills the filesystem. The files are not writeable when this occurs, and syslog-ng can never recover from this even when space is made available again. The lame logic needs to be applied to files as is done for all other destinations.
I'm sorry but our opinion doesn't matter too much... If I read the linux kernel code correctly then for regular files poll() isn't even implemented at the VFS layer so files are always writeable. syslog-ng's behaviour of not wasting time for calling poll() for regular files is correct. Of course I could be wrong so BalaBit folks are more than welcome to chime in :) Regards, Sandor
On Thu, Jan 27, 2011 at 11:04:01AM +0100, Sandor Geller wrote:
I loudly disagree. Files are not "always writable". We continue to bump into the case where something generates tones of logs and fills the filesystem. The files are not writeable when this occurs, and syslog-ng can never recover from this even when space is made available again. The lame logic needs to be applied to files as is done for all other destinations.
I'm sorry but our opinion doesn't matter too much... If I read the linux kernel code correctly then for regular files poll() isn't even implemented at the VFS layer so files are always writeable. syslog-ng's behaviour of not wasting time for calling poll() for regular files is correct.
Of course I could be wrong so BalaBit folks are more than welcome to chime in :)
Regards, Sandor
Could you not just call something which checks for space in the block device? Imperfect but certainly better than doing nothing. Matthew.
FYI, we have found experimentally that syslog-ng 3.1.2 will use poll() on /dev/console without complaining, and that doing this actually seems to solve the problem. We have not been able to reproduce the console hang after switching the console destination to use poll() instead of file(). Paul Krizak 7171 Southwest Pkwy MS B200.3A MTS Systems Engineer Austin, TX 78735 Advanced Micro Devices Desk: (512) 602-8775 Linux/Unix Systems Engineering Cell: (512) 791-0686 Global IT Infrastructure Fax: (512) 602-0468 On 01/27/11 10:17, Matthew Hall wrote:
On Thu, Jan 27, 2011 at 11:04:01AM +0100, Sandor Geller wrote:
I loudly disagree. Files are not "always writable". We continue to bump into the case where something generates tones of logs and fills the filesystem. The files are not writeable when this occurs, and syslog-ng can never recover from this even when space is made available again. The lame logic needs to be applied to files as is done for all other destinations.
I'm sorry but our opinion doesn't matter too much... If I read the linux kernel code correctly then for regular files poll() isn't even implemented at the VFS layer so files are always writeable. syslog-ng's behaviour of not wasting time for calling poll() for regular files is correct.
Of course I could be wrong so BalaBit folks are more than welcome to chime in :)
Regards, Sandor
Could you not just call something which checks for space in the block device?
Imperfect but certainly better than doing nothing.
Matthew. ______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
On Tue, 2011-02-01 at 17:31 -0600, Paul Krizak wrote:
FYI, we have found experimentally that syslog-ng 3.1.2 will use poll() on /dev/console without complaining, and that doing this actually seems to solve the problem. We have not been able to reproduce the console hang after switching the console destination to use poll() instead of file().
I guess you are referring to pipe(), not poll(), as the latter is a system call. But yes, in the knowledge of the code, syslog-ng should behave properly in this case. But that's what my patch also implements. -- Bazsi
On Thu, 2011-01-27 at 08:17 -0800, Matthew Hall wrote:
On Thu, Jan 27, 2011 at 11:04:01AM +0100, Sandor Geller wrote:
I loudly disagree. Files are not "always writable". We continue to bump into the case where something generates tones of logs and fills the filesystem. The files are not writeable when this occurs, and syslog-ng can never recover from this even when space is made available again. The lame logic needs to be applied to files as is done for all other destinations.
I'm sorry but our opinion doesn't matter too much... If I read the linux kernel code correctly then for regular files poll() isn't even implemented at the VFS layer so files are always writeable. syslog-ng's behaviour of not wasting time for calling poll() for regular files is correct.
Of course I could be wrong so BalaBit folks are more than welcome to chime in :)
Regards, Sandor
Could you not just call something which checks for space in the block device?
Imperfect but certainly better than doing nothing.
I can't remember when I implemented it, but syslog-ng now properly reacts to ENOSPC errors, just like any other kind of I/O error returned by write(). It suspends writing to the given destination for time_reopen() time, and then tries again. -- Bazsi
On Wed, 2011-01-26 at 09:28 -0800, Evan Rempel wrote:
Sandor Geller wrote:
Hi,
On Wed, Jan 26, 2011 at 5:10 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Interesting. We are definitely not seeing the 10 sec timeout protection -- it flat-out hangs. But we are using a file destination for the console device. We used to use a pipe() destination back in the syslog-ng-2.x days, but when I tried that in 3.x it complained loudly about it, and suggested using file() instead. We never had this problem in syslog-ng-2.x. Could it be that pipe() is actually the better/more correct way to write to the console?
Under the hood the file() and pipe() drivers were more or less the same. The issue is that only the source side sets the timeout, the destination doesn't so when a log arrives on a source other than file() or program() then timeout never gets set. One notable difference between file() and pipe() is that for files syslog-ng assumes that the file is always writeable (which is true at least on linux) while for pipes syslog-ng checks with poll() first so when the 4k page used for the pipe is full then syslog-ng won't try to write to the pipe effectively avoiding being blocked.
I loudly disagree. Files are not "always writable". We continue to bump into the case where something generates tones of logs and fills the filesystem. The files are not writeable when this occurs, and syslog-ng can never recover from this even when space is made available again. The lame logic needs to be applied to files as is done for all other destinations.
This should have been fixed in 3.2 already. (maybe earlier, I can't remember exactly). And this is a different case than the one we discussed. -- Bazsi
We ran into this issue when upgrading iLO on all our boxes. When the iLO was upgraded, /dev/console went completely unresponsive, and things started to hang. The solution turned out to be 'echo h > /proc/sysrq-trigger'. Apparently when the kernel went to write out to the serial port, it ran into problems and would reinitialize it. After that everything started working fine. -Patrick Sent: Wed Jan 26 2011 11:03:37 GMT-0500 (Eastern Standard Time) From: Sandor Geller <Sandor.Geller@morganstanley.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu> "Sowell, Brett" <Brett.Sowell@amd.com>, "Petrini, Bryce" <Bryce.Petrini@amd.com>, "Hart, Corey" <Corey.Hart@amd.com> Subject: Re: [syslog-ng] syslog-ng deadlock if /dev/console locks?
Hello,
On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug.
Regards,
Sandor ______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
Fascinating. So just triggering the kernel to print something to the console (h is "help") caused /dev/console to properly realign and syslog-ng woke back up? You didn't have to restart syslog-ng or reboot the box or anything? Paul Krizak 7171 Southwest Pkwy MS B200.3A MTS Systems Engineer Austin, TX 78735 Advanced Micro Devices Desk: (512) 602-8775 Linux/Unix Systems Engineering Cell: (512) 791-0686 Global IT Infrastructure Fax: (512) 602-0468 On 01/26/11 10:11, Patrick H. wrote:
We ran into this issue when upgrading iLO on all our boxes. When the iLO was upgraded, /dev/console went completely unresponsive, and things started to hang. The solution turned out to be 'echo h > /proc/sysrq-trigger'. Apparently when the kernel went to write out to the serial port, it ran into problems and would reinitialize it. After that everything started working fine.
-Patrick
Sent: Wed Jan 26 2011 11:03:37 GMT-0500 (Eastern Standard Time) From: Sandor Geller <Sandor.Geller@morganstanley.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu> "Sowell, Brett" <Brett.Sowell@amd.com>, "Petrini, Bryce" <Bryce.Petrini@amd.com>, "Hart, Corey" <Corey.Hart@amd.com> Subject: Re: [syslog-ng] syslog-ng deadlock if /dev/console locks?
Hello,
On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak<paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug.
Regards,
Sandor ______________________________________________________________________________ Member info:https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation:http://www.balabit.com/support/documentation/?product=syslog-ng FAQ:http://www.campin.net/syslog-ng/faq.html
Nope, just 'echo h', that was it. -Patrick Sent: Wed Jan 26 2011 11:16:31 GMT-0500 (Eastern Standard Time) From: Paul Krizak <paul.krizak@amd.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu> "Patrick H." <syslogng@feystorm.net>, "Sowell, Brett" <Brett.Sowell@amd.com>, "Petrini, Bryce" <Bryce.Petrini@amd.com>, "Hart, Corey" <Corey.Hart@amd.com> Subject: Re: [syslog-ng] syslog-ng deadlock if /dev/console locks?
Fascinating. So just triggering the kernel to print something to the console (h is "help") caused /dev/console to properly realign and syslog-ng woke back up? You didn't have to restart syslog-ng or reboot the box or anything?
Paul Krizak 7171 Southwest Pkwy MS B200.3A MTS Systems Engineer Austin, TX 78735 Advanced Micro Devices Desk: (512) 602-8775 Linux/Unix Systems Engineering Cell: (512) 791-0686 Global IT Infrastructure Fax: (512) 602-0468
On 01/26/11 10:11, Patrick H. wrote:
We ran into this issue when upgrading iLO on all our boxes. When the iLO was upgraded, /dev/console went completely unresponsive, and things started to hang. The solution turned out to be 'echo h > /proc/sysrq-trigger'. Apparently when the kernel went to write out to the serial port, it ran into problems and would reinitialize it. After that everything started working fine.
-Patrick
Sent: Wed Jan 26 2011 11:03:37 GMT-0500 (Eastern Standard Time) From: Sandor Geller <Sandor.Geller@morganstanley.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu> "Sowell, Brett" <Brett.Sowell@amd.com>, "Petrini, Bryce" <Bryce.Petrini@amd.com>, "Hart, Corey" <Corey.Hart@amd.com> Subject: Re: [syslog-ng] syslog-ng deadlock if /dev/console locks?
Hello,
On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak<paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug.
Regards,
Sandor ______________________________________________________________________________
Member info:https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation:http://www.balabit.com/support/documentation/?product=syslog-ng
On Wed, 2011-01-26 at 17:03 +0100, Sandor Geller wrote:
Hello,
On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug.
that alarm stuff got implemented because of /proc/kmsg, which - because of a kernel bug - doesn't support non-blocking I/O properly. The file source driver (usually used for /proc/kmsg) sets that, even though the kernel should never block in that case. So I wouldn't call this a bug, the alarm is a workaround for a specific case and /dev/console is different. The culprit seems to be that indeed file() destinations always assumes that files are always writable, which is only true for regular files, but not for devices. So what needs to be done is to apply regular polling if the file is non-regular. What about this patch (untested): diff --git a/src/affile.c b/src/affile.c index b5e1bef..24e5986 100644 --- a/src/affile.c +++ b/src/affile.c @@ -42,7 +42,7 @@ static gboolean affile_open_file(gchar *name, gint flags, gint uid, gint gid, gint mode, gint dir_uid, gint dir_gid, gint dir_mode, - gboolean create_dirs, gboolean privileged, gboolean is_pipe, gint *fd) + gboolean create_dirs, gboolean privileged, gboolean is_pipe, gboolean *regular, gint *fd) { cap_t saved_caps; struct stat st; @@ -79,7 +79,11 @@ affile_open_file(gchar *name, gint flags, evt_tag_str("filename", name), NULL); } + if (regular) + *regular = !!S_ISREG(st.st_mode); } + else if (regular) + *regular = TRUE; *fd = open(name, flags, mode < 0 ? 0600 : mode); if (is_pipe && *fd < 0 && errno == ENOENT) { @@ -119,7 +123,7 @@ affile_sd_open_file(AFFileSourceDriver *self, gchar *name, gint *fd) else flags = O_RDONLY | O_NOCTTY | O_NONBLOCK | O_LARGEFILE; - if (affile_open_file(name, flags, -1, -1, -1, 0, 0, 0, 0, !!(self->flags & AFFILE_PRIVILEGED), !!(self->flags & AFFILE_PIPE), fd)) + if (affile_open_file(name, flags, -1, -1, -1, 0, 0, 0, 0, !!(self->flags & AFFILE_PRIVILEGED), !!(self->flags & AFFILE_PIPE), NULL, fd)) return TRUE; return FALSE; } @@ -442,6 +446,7 @@ affile_dw_init(LogPipe *s) int fd, flags; struct stat st; GlobalConfig *cfg = log_pipe_get_config(s); + gboolean regular; if (cfg) self->time_reopen = cfg->time_reopen; @@ -452,7 +457,7 @@ affile_dw_init(LogPipe *s) NULL); if (self->owner->overwrite_if_older > 0 && - stat(self->filename->str, &st) == 0 && + stat(self->filename->str, &st) == 0 && st.st_mtime < time(NULL) - self->owner->overwrite_if_older) { msg_info("Destination file is older than overwrite_if_older(), overwriting", @@ -471,13 +476,13 @@ affile_dw_init(LogPipe *s) if (affile_open_file(self->filename->str, flags, self->owner->file_uid, self->owner->file_gid, self->owner->file_perm, self->owner->dir_uid, self->owner->dir_gid, self->owner->dir_perm, - !!(self->owner->flags & AFFILE_CREATE_DIRS), FALSE, !!(self->owner->flags & AFFILE_PIPE), &fd)) + !!(self->owner->flags & AFFILE_CREATE_DIRS), FALSE, !!(self->owner->flags & AFFILE_PIPE), ®ular, &fd)) { guint write_flags; if (!self->writer) { - self->writer = log_writer_new(LW_FORMAT_FILE | ((self->owner->flags & AFFILE_PIPE) ? 0 : LW_ALWAYS_WRITABLE)); + self->writer = log_writer_new(LW_FORMAT_FILE | ((self->owner->flags & AFFILE_PIPE || !regular) ? 0 : LW_ALWAYS_WRITABLE)); log_writer_set_options((LogWriter *) self->writer, s, &self->owner->writer_options, 1, self->owner->flags & AFFILE_PIPE ? SCS_PIPE : SCS_FILE, self->owner->super.id, self->filename->str); log_pipe_append(&self->super, self->writer); } -- Bazsi
On Sat, 2011-02-05 at 16:19 +0100, Balazs Scheidler wrote:
On Wed, 2011-01-26 at 17:03 +0100, Sandor Geller wrote:
Hello,
On Wed, Jan 26, 2011 at 4:12 PM, Paul Krizak <paul.krizak@amd.com> wrote:
Hi, we're using syslog-ng 3.1.2 and have run into what appears to be a bug, but I'd like to get the community's opinion before we dig further into it.
We have a bunch of HP servers with iLO2 and iLO3 devices, configured with their virtual serial ports on COM1 (ttyS0). We subsequently have the OS (RHEL4, RHEL5) configured to use COM1 as its console (e.g. /dev/console). This is a very standard configuration that allows us to get remote access to the machines without having to purchase the iLO Advanced KVM feature. It also lets us use the Magic SysRq keys to probe dead systems and stuff, so in general it's not something we're keen to change.
What we have found, however, is that there are some cases where the iLO will freeze and requires a reboot. When the iLO reboots, however, the kernel's connection to /dev/console (through the virtual serial port) hangs and blocks. Any traffic to /dev/console just sits in the kernel's buffer and is never delivered. Once the buffer is full, the kernel simply blocks on any write to /dev/console.
Now this is a Bad Thing in general, and we're working with HP to try and remedy this bug. However, what concerns me is that syslog-ng, when faced with this behavior, also blocks, even for log messages not bound for /dev/console.
syslog-ng uses a single thread (with the exception of database destinations) running the event loop so when a read() or a write() blocks then it affects the whole log processing
What we have observed is that a system with syslog-ng will keep delivering the occasional console message to /dev/console (ex. *.emerg messages) and meanwhile the file-based log paths keep working. But once /dev/console blocks, the next time a console message is delivered, *all* of syslog-ng blocks waiting for that message to be delivered, and all of the file-based paths block as well. The result is that pretty much everything on the system stops working. For example, you can't log in, even as root, because the login process blocks on the syslog command that writes to /var/log/secure. Anything that uses syslog suddenly blocks.
Is this expected behavior? I would think that syslog-ng would be able to continue accepting and delivering messages, even if one of the log paths is stalled on a blocked write.
syslog-ng uses non-blocking I/O for all sources / destinations but despite of this the kernel could still block it therefore syslog-ng protects reads/writes in logtransport.c with alarm() so it should recover when timeout is set and a read/write blocked. For me it looks like the timeout is not set in all cases, only file and program sources initialise transport->timeout to 10 secs so I'd say this isn't expected behaviour - it is a bug.
that alarm stuff got implemented because of /proc/kmsg, which - because of a kernel bug - doesn't support non-blocking I/O properly.
The file source driver (usually used for /proc/kmsg) sets that, even though the kernel should never block in that case.
So I wouldn't call this a bug, the alarm is a workaround for a specific case and /dev/console is different.
The culprit seems to be that indeed file() destinations always assumes that files are always writable, which is only true for regular files, but not for devices. So what needs to be done is to apply regular polling if the file is non-regular.
What about this patch (untested):
diff --git a/src/affile.c b/src/affile.c index b5e1bef..24e5986 100644 --- a/src/affile.c +++ b/src/affile.c
I've tested and commited this patch in syslog-ng 3.2 with this comment, but the original patch posted against 3.1 should also fix, as my testing didn't reveal problems. I'll eventually backport this fix into the 3.1 branch. Thanks for the report and the detailed analysis. commit 61940d18c205d36cb7dd0b30dba741cc8459e2ac Author: Balazs Scheidler <bazsi@balabit.hu> Date: Sat Feb 5 19:22:39 2011 +0100 affile: only regular files are assumed to be always writable The file destination avoids polling regular files as those are always reported to be writable by the underlying kernel anyway. This is however not true if a device file is being opened, like /dev/console. A workaround was to use the pipe() driver, but that has other consequences (like opening the pipes in read/write mode, rather than write only). Reported-By: Paul Krizak <paul.krizak@amd.com> Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> -- Bazsi
participants (6)
-
Balazs Scheidler
-
Evan Rempel
-
Matthew Hall
-
Patrick H.
-
Paul Krizak
-
Sandor Geller