On Wed, 2005-02-02 at 17:38 +0100, Roberto Nibali wrote:
Hello,
As there are some changes since syslog-ng 1.6.5, I'd like to release a new version. Before doing so I'd like to ask you to test the latest snapshot a little bit.
Ok, while browsing through our internal bugzilla DB I've found another bug which is still reproduceable with the latest drop of your sources. The problem is that with a TCP reset from the receiving end of syslog-ng messages we lose one message line while syslog-ng tries to reconnect. Here is a brief instruction on how to reproduce this use case: [detailed error description removed]
I know about this problem but I'm afraid it is very difficult to solve in the 1.6.x branch, and should be solved by the new development branch. The problem is that syslog-ng polls the destination TCP sockets for writing only, and whenever the remote endpoint closes the TCP connection, it is not indicated in any way (as closing a socket triggers readability and not writability). Whenever a message is to be written to this socket, the first write() syscall succeeds, and only the next write() will return EPIPE, so syslog-ng is able to detect the broken connection. Here's the strace that shows this behaviour: The message aa4 is sent still on the old connection: read(5, "<5>Feb 3 09:42:01 bazsi: aa4\0", 2048) = 30 time(NULL) = 1107420121 time(NULL) = 1107420121 time(NULL) = 1107420121 poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=POLLOUT, revents=POLLOUT}, {fd=3, events=POLLIN}], 3, 100) = 2 read(5, "", 2048) = 0 write(2, "Read EOF on fd 5.\n", 18) = 18 write(2, "Marking fd 5 for closing.\n", 26) = 26 write(4, "<5>Feb 3 09:42:01 src@bzorp bazsi: aa4\n", 40) = 40 time(NULL) = 1107420121 poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=0}, {fd=3, events=POLLIN}], 3, 100) = 1 time(NULL) = 1107420121 Then I terminated netcat, and sent another message: read(5, "<5>Feb 3 09:42:05 bazsi: aa5\0", 2048) = 30 time(NULL) = 1107420125 time(NULL) = 1107420125 time(NULL) = 1107420125 poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=POLLOUT, revents=POLLOUT}, {fd=3, events=POLLIN}], 3, 100) = 2 read(5, "", 2048) = 0 write(2, "Read EOF on fd 5.\n", 18) = 18 write(2, "Marking fd 5 for closing.\n", 26) = 26 write(4, "<5>Feb 3 09:42:05 src@bzorp bazsi: aa5\n", 40) = 40 time(NULL) = 1107420125 poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=0, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 3, 100) = 2 Even though the connection was broken the write() syscall returned success, and only the next poll() call indicates some problems. time(NULL) = 1107420125 poll([{fd=5, events=POLLIN, revents=POLLIN|POLLHUP}, {fd=4, events=0, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 3, 100) = 2 write(2, "io_iter(): POLLHUP on inactive fd!\n", 35) = 35 write(2, "Marking fd 4 for closing.\n", 26) = 26 time(NULL) = 1107420125 write(2, "Closing fd 5.\n", 14) = 14 close(5) = 0 write(2, "Connection broken to AF_INET(127.0.0.1:2000), reopening in 1 seconds\n", 69) = 69 time(NULL) = 1107420125 write(2, "Closing fd 4.\n", 14) = 14 close(4) = 0 The current solution in version 1.9 is to poll the fd for reading as well as writing, so we get immediate notification when the connection is broken. The same problem could still occur, though only within a very small timeframe. -- Bazsi