syslog-ng takes 100% CPU when network fails
Hi,

We are running syslog-ng 2.0.9 on an HP-UX 11.31 server, configured as a client that forwards logs to a remote server. When the network fails (simulated with ifconfig down), syslog-ng starts to consume CPU. Even after the network comes back, it does not forward any log messages and continues to hog the CPU.

We traced the system calls with tusc and found that poll() gets a POLLERR event on the TCP socket descriptor, but syslog-ng makes no socket calls on that descriptor afterwards; it only calls gettimeofday().

In the attached logs, the TCP connection to the server is disconnected at 13:10:30. From that time on, poll() receives POLLERR on the TCP socket (fd=6) and syslog-ng loops on gettimeofday(). Attached are the sar, netstat and tusc logs.

Thanks,
Manu
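An error-only poll() result like the one in the trace can be examined from a small test program. The following is a minimal sketch, not syslog-ng code: `poll_revents_now` and `pending_socket_error` are hypothetical helper names, and it assumes a POSIX system where `getsockopt()` with `SO_ERROR` reports (and clears) the error pending on a socket. It does not reproduce the HP-UX behaviour; it only shows how the condition could be inspected.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of syslog-ng): poll one descriptor for
 * writability without blocking. Returns the revents bits reported by
 * poll(), or -1 if poll() itself failed. */
int
poll_revents_now(int fd)
{
  struct pollfd pfd;

  pfd.fd = fd;
  pfd.events = POLLOUT;
  pfd.revents = 0;
  if (poll(&pfd, 1, 0) < 0)
    return -1;
  return pfd.revents;
}

/* Hypothetical helper: fetch the error pending on a socket.
 * Returns 0 if no error is pending, a positive errno value if one is,
 * or -1 if getsockopt() itself failed (for example on a bad descriptor). */
int
pending_socket_error(int fd)
{
  int err = 0;
  socklen_t len = sizeof(err);

  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
    return -1;
  return err;
}
```

If `poll_revents_now(fd) & POLLERR` is non-zero, `pending_socket_error(fd)` should say why. Because poll() is level-triggered, a caller that neither reads the error nor closes the descriptor will see the same event again on every iteration, which matches the spinning seen in the tusc output.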
On Fri, 2008-10-24 at 05:50 +0000, D S, Manu (STSD) wrote:
> Hi,
> We are running syslog-ng 2.0.9 on a HP-UX 11.31 server. We have configured this system as a client to forward logs to a remote server. When there is a network failure ( simulated by ifconfig down ) syslog-ng starts to consume CPU and even after the network comes back, it does not forward any log messages and continues to hog CPU.
> We did system call tracing using tusc and found that "poll()" gets "POLLERR" event from TCP socket descriptor, but syslog-ng does not call any socket calls for the TCP, only calls "gettimeofday()".
> In the logs given, TCP connection to server is disconnected at 13:10:30. From that time, poll() receives POLLERR on the TCP socket (fd=6) and starts loop on gettimeofday(). Attached are the sar, netstat and tusc logs.
First of all, thanks for the detailed error report.

As far as I can see, the problem is caused by HP-UX returning only POLLERR, without any of the other bits (e.g. POLLHUP). syslog-ng would handle this gracefully if either the other bits were set, or there were some pending messages to send, in which case a normal write() error would occur.

This patch should fix the problem, although I have only compile-tested it. I'd appreciate it if you could test this patch in your environment.

diff --git a/src/logwriter.c b/src/logwriter.c
index bb82b43..7a5fcf7 100644
--- a/src/logwriter.c
+++ b/src/logwriter.c
@@ -139,6 +139,13 @@ log_writer_fd_dispatch(GSource *source,
       log_writer_broken(self->writer, NC_CLOSE);
       return FALSE;
     }
+  else if (self->pollfd.revents & (G_IO_ERR))
+    {
+      msg_error("POLLERR occurred while idle",
+                evt_tag_int("fd", self->fd->fd),
+                NULL);
+      log_writer_broken(self->writer, NC_WRITE_ERROR);
+    }
   else if (self->writer->queue->length || self->writer->partial)
     {
       if (!log_writer_flush_log(self->writer, self->fd))

--
Bazsi
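The branch ordering that the patch introduces can be modelled as a small pure function. This is a simplified sketch, not the syslog-ng source: `decide_writer_action` and the `writer_action` enum are hypothetical names, POSIX `POLL*` flags stand in for GLib's `G_IO_*` conditions, and the first test assumes the branch preceding the patch checks for a hangup, since the diff context does not show that condition.

```c
#include <poll.h>
#include <stddef.h>

/* Hypothetical outcome of one dispatch pass (names are illustrative). */
enum writer_action
{
  WRITER_IDLE,    /* nothing to do, go back to poll() */
  WRITER_FLUSH,   /* try to write queued messages */
  WRITER_CLOSE,   /* peer hung up, treat the connection as closed */
  WRITER_BROKEN   /* error reported while idle, treat as a write error */
};

/* Decide what to do for a given set of poll() result bits.
 * queued  - number of messages waiting to be sent
 * partial - non-zero if a partially written message is pending */
enum writer_action
decide_writer_action(short revents, size_t queued, int partial)
{
  if (revents & POLLHUP)          /* assumed pre-existing branch */
    return WRITER_CLOSE;
  if (revents & POLLERR)          /* the case the patch adds */
    return WRITER_BROKEN;
  if (queued > 0 || partial)
    return WRITER_FLUSH;
  return WRITER_IDLE;
}
```

Without the second test, an error-only event on an empty queue would fall through to `WRITER_IDLE`. poll() would then report the same error again immediately, which is consistent with the CPU usage described in the report.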
participants (2)
- Balazs Scheidler
- D S, Manu (STSD)