On Thu, 2005-02-03 at 11:31 +0100, Roberto Nibali wrote:
The problem is that syslog-ng polls the destination TCP sockets for writing only,
That's the issue with poll(). Dumb question: Why not use select()?
It is the same: inside the kernel, poll() and select() use the same interfaces, i.e. the same rules determine how they indicate readability.
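The point that poll() and select() agree on readability can be checked with a small experiment. The following is only a sketch: it uses an AF_UNIX socketpair() as a stand-in for the TCP destination (an assumption made to keep it self-contained), and the helper names are hypothetical, not taken from syslog-ng or libol.

```c
#include <assert.h>
#include <poll.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if poll() reports fd as readable right now, 0 otherwise. */
static int readable_via_poll(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

    assert(fd >= 0);
    /* Timeout 0: report the current state without blocking. */
    return poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN) != 0;
}

/* Returns 1 if select() reports fd as readable right now, 0 otherwise. */
static int readable_via_select(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };   /* zero timeout: do not block */

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &tv) > 0 && FD_ISSET(fd, &rfds);
}

/*
 * Hypothetical demo: returns 1 if both calls say "not readable" while the
 * peer is open and idle, and both say "readable" once the peer has closed.
 * Returns -1 if the socket pair cannot be created.
 */
static int demo_poll_select_agree(void)
{
    int sv[2];
    int before, after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    before = !readable_via_poll(sv[0]) && !readable_via_select(sv[0]);
    close(sv[1]);                   /* the "remote endpoint" goes away */
    after = readable_via_poll(sv[0]) && readable_via_select(sv[0]);

    close(sv[0]);
    return before && after;
}
```

Neither call reports the close if the socket is only watched for writing, so switching from poll() to select() would not change the behaviour described here.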
and whenever the remote endpoint closes the TCP connection, it is not indicated in any way (as closing a socket triggers readability and not writability). Whenever a message is to be written to this socket, the first write() syscall succeeds, and only the next write() will return EPIPE, so syslog-ng is able to detect the broken connection.
Also, a SIGPIPE signal is raised. Again, stupid question: Couldn't you use SIGPIPE to inject the write request from the remaining buffer?
I think SIGPIPE is issued at the same time write() returns EPIPE, so I think it also happens at the second write, and by then the kernel has already acknowledged the previous message. I was curious whether this was true, but even when I stopped setting SIGPIPE to SIG_IGN, the signal did not occur for some reason. tcp(7) states that SIGPIPEs are only triggered for sockets with SO_KEEPALIVE set; I enabled SO_KEEPALIVE and still got no SIGPIPEs. Anyway, I don't think SIGPIPE would help us here. The real problem is that the kernel returns success for the write() system call even though the connection is already broken. A hackish solution would be to buffer the last line written and, in case of failure, push it back to the FIFO queue. This is ugly but could work.
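A rough sketch of that "remember the last line and push it back" idea, for illustration only. Everything here is hypothetical (struct msg_queue, q_flush_one(), the write_fn callback and the fake_write() test double); it is not the libol code. It also assumes one message per write and no partial writes, which is exactly what write coalescing would break.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define QCAP 8

/* Hypothetical FIFO of borrowed message pointers, kept as a ring buffer. */
struct msg_queue {
    const char *slot[QCAP];
    size_t head;            /* index of the oldest queued message */
    size_t count;           /* number of queued messages */
    const char *last_sent;  /* last message the kernel accepted, or NULL */
};

/* Writer callback with write()-like semantics: -1 and errno on failure. */
typedef ssize_t (*write_fn)(void *ctx, const char *buf, size_t len);

static int q_push_back(struct msg_queue *q, const char *m)
{
    if (q->count == QCAP)
        return -1;
    q->slot[(q->head + q->count) % QCAP] = m;
    q->count++;
    return 0;
}

static int q_push_front(struct msg_queue *q, const char *m)
{
    if (q->count == QCAP)
        return -1;
    q->head = (q->head + QCAP - 1) % QCAP;
    q->slot[q->head] = m;
    q->count++;
    return 0;
}

/*
 * Try to send the oldest message. Returns 1 if it was written, 0 if the
 * queue is empty, -1 on failure with errno preserved. On EPIPE the
 * previously "successful" message is assumed lost and is put back in
 * front of the failed one, so both are retried after a reconnect.
 */
static int q_flush_one(struct msg_queue *q, write_fn wr, void *ctx)
{
    const char *m;

    if (q->count == 0)
        return 0;
    m = q->slot[q->head];
    if (wr(ctx, m, strlen(m)) < 0) {
        int saved = errno;
        if (saved == EPIPE && q->last_sent != NULL &&
            q_push_front(q, q->last_sent) == 0)
            q->last_sent = NULL;    /* pushed back once, do not repeat */
        errno = saved;
        return -1;
    }
    q->head = (q->head + 1) % QCAP; /* message accepted: dequeue it */
    q->count--;
    q->last_sent = m;
    return 1;
}

/* Test double for write(): succeeds *ctx more times, then fails with EPIPE. */
static ssize_t fake_write(void *ctx, const char *buf, size_t len)
{
    int *ok_calls = ctx;

    (void)buf;
    if (*ok_calls <= 0) {
        errno = EPIPE;
        return -1;
    }
    (*ok_calls)--;
    return (ssize_t)len;
}
```

The ugliness is visible in the sketch: if the pushed-back message did in fact reach the server before the connection broke, it is delivered twice.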
The way I see syslog-ng functionally working (extremely simplified, please correct), is:
o syslog-ng polls on read_fds for incoming syslog messages
o syslog-ng maintains a queue or linked list of messages where the newly arrived messages get queued up for delivery. This queue is also used in case a destination is down and needs to be reprobed (reopened) for a connection.
o syslog-ng polls on write_fds for outgoing possibilities and if success, sends out in FIFO the queued messages.
yes, it is more or less correct, though the same loop is used for read/write polling.
o TCP close -> eof reaches the socket and gets passed up to syslog-ng which has already sent (write()) one line _but_ not yet lost the buffer it has written.
o A new write will return EPIPE and a SIGPIPE signal.
The idea is to either pass the EPIPE back to the caller function sending the syslog message chunk or to invoke a signal handler that signals the caller to resend that message again. Or use select() to poll for readability? Or create a thread within the calling stack (same process with access to the write buffer) of the poll function and have it wait on a condition variable which is set upon EPIPE. The thread waits in pthread_cond_wait() and will write the last successfully written buffer again.
Again, the solution I outlined above might be OK, though it is not very simple, because syslog-ng might coalesce outgoing TCP writes. [snip]
Could you point me to the code in question in 1.6.x so I could check it out for myself, please?
it is implemented in libol/src/pkt_buffer.c, packet buffers can operate in two modes: packet and stream mode. packet mode makes the output routines write a single message at a time, stream mode enables write coalescing. These two modes have two independent flush functions:

    static int do_flush_stream(struct abstract_buffer *c, struct abstract_write *w)
    static int do_flush_pkt(struct abstract_buffer *c, struct abstract_write *w)

Solving the first would mean to save the last coalesced buffer and push it back to the buffer in case of EPIPE, but in fact this can also introduce platform dependence, I'm not sure all IP stacks behave identically.

The other option is to add reading the socket to the poll loop like it is done in 1.9.x (can be done using io_read_write instead of io_write, and drop the connection from the read callback when the socket is readable and read() returns 0 bytes.)

-- Bazsi
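For illustration, the second option (watching the destination for readability as well as writability) might look roughly like the sketch below. It is written against plain poll() and recv() rather than libol's io_read_write, whose interface is not shown in this thread; check_destination() and enum conn_state are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

enum conn_state { CONN_IDLE, CONN_WRITABLE, CONN_CLOSED, CONN_ERROR };

/*
 * One non-blocking poll step on a destination socket. The socket is
 * watched for reading as well as writing, so a close by the remote end
 * shows up as "readable, recv() returns 0" before any message is written.
 */
static enum conn_state check_destination(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT, .revents = 0 };
    char scratch[256];

    assert(fd >= 0);
    if (poll(&pfd, 1, 0) < 0)
        return CONN_ERROR;
    if (pfd.revents & (POLLERR | POLLNVAL))
        return CONN_ERROR;

    if (pfd.revents & (POLLIN | POLLHUP)) {
        ssize_t n = recv(fd, scratch, sizeof scratch, MSG_DONTWAIT);

        if (n == 0)
            return CONN_CLOSED;     /* EOF: the peer closed the connection */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return CONN_ERROR;
        /* n > 0: unexpected data from the log server, discarded here */
    }
    return (pfd.revents & POLLOUT) ? CONN_WRITABLE : CONN_IDLE;
}
```

A caller would reopen the destination on CONN_CLOSED or CONN_ERROR and flush queued messages only on CONN_WRITABLE. This narrows the window in which a message can be lost but does not close it completely, since the peer can still go away between the poll and the write.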