[syslog-ng]preparing syslog-ng 1.6.6

Balazs Scheidler syslog-ng@lists.balabit.hu
Thu, 03 Feb 2005 09:53:09 +0100


On Wed, 2005-02-02 at 17:38 +0100, Roberto Nibali wrote:
> Hello,
> 
> > As there are some changes since syslog-ng 1.6.5, I'd like to release a
> > new version. Before doing so I'd like to ask you to test the latest
> > snapshot a little bit.
> 
> Ok, while browsing through our internal bugzilla DB I've found another bug which 
> is still reproduceable with the latest drop of your sources. The problem is that 
> with a TCP reset from the receiving end of syslog-ng messages we lose one 
> message line while syslog-ng tries to reconnect. Here is a brief instruction on 
> how to reproduce this use case:
[detailed error description removed]

I know about this problem but I'm afraid it is very difficult to solve
in the 1.6.x branch, and should be solved by the new development branch.
The problem is that syslog-ng polls the destination TCP sockets for
writing only, and whenever the remote endpoint closes the TCP
connection, it is not indicated in any way (as closing a socket triggers
readability and not writability). Whenever a message is to be written to
this socket, the first write() syscall succeeds, and only the next
write() will return EPIPE, so syslog-ng is able to detect the broken
connection.

Here's the strace that shows this behaviour:

The message aa4 is sent still on the old connection:
read(5, "<5>Feb  3 09:42:01 bazsi: aa4\0", 2048) = 30
time(NULL)                              = 1107420121
time(NULL)                              = 1107420121
time(NULL)                              = 1107420121
poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=POLLOUT, revents=POLLOUT}, {fd=3, events=POLLIN}], 3, 100) = 2
read(5, "", 2048)                       = 0
write(2, "Read EOF on fd 5.\n", 18)     = 18
write(2, "Marking fd 5 for closing.\n", 26) = 26
write(4, "<5>Feb  3 09:42:01 src@bzorp bazsi: aa4\n", 40) = 40
time(NULL)                              = 1107420121
poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=0}, {fd=3, events=POLLIN}], 3, 100) = 1
time(NULL)                              = 1107420121

Then I terminated netcat, and sent another message:
read(5, "<5>Feb  3 09:42:05 bazsi: aa5\0", 2048) = 30
time(NULL)                              = 1107420125
time(NULL)                              = 1107420125
time(NULL)                              = 1107420125
poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=POLLOUT, revents=POLLOUT}, {fd=3, events=POLLIN}], 3, 100) = 2
read(5, "", 2048)                       = 0
write(2, "Read EOF on fd 5.\n", 18)     = 18
write(2, "Marking fd 5 for closing.\n", 26) = 26
write(4, "<5>Feb  3 09:42:05 src@bzorp bazsi: aa5\n", 40) = 40
time(NULL)                              = 1107420125
poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=0, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 3, 100) = 2

Even though the connection was broken the write() syscall returned
success, and only the next poll() call indicates some problems.

time(NULL)                              = 1107420125
poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=0, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 3, 100) =
2
write(2, "io_iter(): POLLHUP on inactive fd!\n", 35) = 35
write(2, "Marking fd 4 for closing.\n", 26) = 26
time(NULL)                              = 1107420125
write(2, "Closing fd 5.\n", 14)         = 14
close(5)                                = 0
write(2, "Connection broken to AF_INET(127.0.0.1:2000), reopening in 1 seconds\n", 69) = 69
time(NULL)                              = 1107420125
write(2, "Closing fd 4.\n", 14)         = 14
close(4)                                = 0

The current solution in version 1.9 is to poll the fd for reading as 
well as writing, so we get immediate notification when the connection is
broken. The same problem could still occur, though only within a very
small timeframe.


-- 
Bazsi