On Wed, 2008-06-25 at 08:49 -0400, Richard Vigeant wrote:
On 24-Jun-08, at 3:37 AM, Balazs Scheidler wrote:
On Thu, 2008-06-19 at 15:54 -0400, Richard Vigeant wrote:
Hi,
I have a configuration where several nodes send all log messages to a central server. The applications on remote nodes send their logs locally either via UDP or a unix socket. The syslog-ng running on each remote node simply picks up all log messages from all sources, i.e. TCP, UDP, /proc/kmsg, /dev/log and internal, and transmits them to the central server using TCP. The remote node's config file follows.
We've been having intermittent problems where the central server would suddenly stop logging messages from certain nodes. We noticed that very often restarting syslog-ng on the central server would fix the condition and logging would carry on.
However, I discovered a new, rare case where restarting the central syslog-ng didn't work. A tcpdump showed that the remote syslog-ng was not sending the log messages. I ran strace on the remote syslog-ng, and it shows that nothing happens after a message has been received via "recvfrom()" or "read()". Then I restarted syslog-ng and things went back to normal. In the 2nd strace we can see that there is a "write()" after the "read()".
I might be guessing here as I don't really know which fd is which, but I think you've run into an issue that some others have experienced previously.
In the case where the traffic does not flow, syslog-ng is correctly polling fd 8 for output. I assumed that fd 8 is the fd of the connection to the server (it is in the 2nd strace dump).
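The assumption about which fd is which can be checked directly. Below is a minimal sketch, assuming a Linux host where /proc is mounted; the helper name list_fds is made up for illustration and is not part of syslog-ng. It resolves each fd number of a process to its target, which is the same information lsof reports.

```python
import os


def list_fds(pid):
    """Map each open fd number of a process to what it points at.

    Assumes Linux: reads the symlinks under /proc/<pid>/fd.
    Sockets show up as 'socket:[inode]', pipes as 'pipe:[inode]'.
    """
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__":
    # Inspect this process as a stand-in; for syslog-ng you would pass its pid
    # and run with enough privileges to read its /proc entry.
    for fd, target in sorted(list_fds(os.getpid()).items()):
        print(fd, target)
```

An entry of the form socket:[inode] for fd 8 would be consistent with it being the TCP connection; the inode can then be matched against /proc/net/tcp to find the remote address.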
So syslog-ng is polling for writability on fd 8, but the poll system call does not indicate that the fd is writable. This usually means that the TCP window is full, i.e. the server is not accepting new data.
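To illustrate the symptom, here is a minimal sketch of how poll() stops reporting writability once the peer no longer reads. It is not syslog-ng code: it assumes a Unix-like system with select.poll available and uses a local socketpair() as a stand-in for the TCP connection to the server. All function names are made up for the example.

```python
import select
import socket


def is_writable(sock, timeout_ms=100):
    """Return True if poll() reports POLLOUT for sock within the timeout."""
    poller = select.poll()
    poller.register(sock, select.POLLOUT)
    return any(event & select.POLLOUT for _fd, event in poller.poll(timeout_ms))


def fill_send_buffer(sock, chunk=b"x" * 4096):
    """Write until the kernel refuses more data; return the bytes queued."""
    sock.setblocking(False)
    total = 0
    while True:
        try:
            total += sock.send(chunk)
        except BlockingIOError:
            return total


def drain(sock, nbytes):
    """Read nbytes from sock, simulating a peer that resumes reading."""
    remaining = nbytes
    while remaining > 0:
        data = sock.recv(min(65536, remaining))
        if not data:
            break
        remaining -= len(data)


def demo():
    sender, receiver = socket.socketpair()
    try:
        before = is_writable(sender)      # empty buffer: writable
        queued = fill_send_buffer(sender)  # peer never reads, buffer fills up
        stalled = is_writable(sender)     # poll() now times out
        drain(receiver, queued)           # peer starts reading again
        after = is_writable(sender)       # writable once more
        return before, stalled, after
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(demo())
```

The middle state is the one seen in the first strace: the sender keeps polling for output and keeps reading new input, but never gets to call write().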
Stateful firewalls often drop inactive connections after a period of time, and when packets then arrive for a connection for which no state exists, those packets are dropped.
Do you have a firewall between the client and the server?
No firewall. Clients and server are all on the same LAN. This is one of our local QA environments.
Note that I have seen similar cases where the problem occurred on the server and the output was a file. However, I can't currently reproduce it.
Hmmm, and neither the clients nor the server are running connection tracking, right? If my initial analysis is correct (lsof output should confirm that), then the problem is that syslog-ng is unable to send on the TCP connection, and it is the TCP stack of the OS that tells syslog-ng so. If this is a QA network, can you run tcpdump to sniff the packets and see what the on-wire traffic looks like?
-- Bazsi