[syslog-ng]More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

matthew.copeland@honeywell.com matthew.copeland@honeywell.com
Wed, 5 Sep 2001 20:09:06 +0000 (UTC)


-------------------------------------------------------------
Background reminder:
	Red Hat 6.2 box acting as a remote TCP logging client
	doesn't try to reconnect more than once.  Red Hat 7.1 
	box acting as a remote TCP logging client attempts to
	reconnect every time_reopen seconds just like it is 
	supposed to do.
-------------------------------------------------------------
Well, I have spent some more time on this problem, and I have it narrowed
down quite a bit.

Using an strace of the syslog-ng client, you see the following under Red
Hat Linux 7.1 and Red Hat Linux 6.2.  (More details after straces)

Red Hat Linux 6.2
...
9373  [400e5dc2] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
9373  [400d9ce4] fcntl(5, F_GETFL)      = 0x2 (flags O_RDWR)
9373  [400d9ce4] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
9373  [400d9ce4] fcntl(5, F_SETFD, FD_CLOEXEC) = 0
9373  [400d9b14] write(2, "io.c: connecting using fd 5\n", 28) = 28
9373  [400e5a82] connect(5, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.135")}}, 16) = -1 EINPROGRESS (Operation now in progress)
9373  [400bbf7d] time(NULL)             = 999641180
9373  [400def50] poll([{fd=5, events=POLLOUT}, {fd=4, events=POLLIN}], 2, 100) = 0
9373  [400def50] poll([{fd=5, events=POLLOUT, revents=POLLERR}, {fd=4, events=POLLIN}], 2, 60000) = 1
9373  [400d9b14] write(2, "Marking fd 5 for closing.\n", 26) = 26
...

Red Hat Linux 7.1
...
6325  [40131462] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
6325  [40124187] fcntl64(4, F_GETFL)    = 0x2 (flags O_RDWR)
6325  [40124187] fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
6325  [40124187] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
6325  [40123f84] write(2, "io.c: connecting using fd 4\n", 28) = -1 EIO (Input/output error)
6325  [40131122] connect(4, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.141")}}, 16) = -1 EINPROGRESS (Operation now in progress)
6325  [400f876d] time(NULL)             = 999637318
6325  [40129227] poll([{fd=4, events=POLLOUT, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 2, 100) = 1
6325  [401311e2] getsockopt(4, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
6325  [40123f84] write(2, "Error connecting to remote host "..., 77) = -1 EIO (Input/output error)
6325  [400f876d] time(NULL)             = 999637318
6325  [400f876d] time(NULL)             = 999637318
6325  [40123f84] write(2, "Closing fd 4.\n", 14) = -1 EIO (Input/output error)
...

The first poll that you are seeing in both of these traces is the poll on
line 197 of io.c for libol 0.2.23 (syslog-ng 1.4.12).  Notice that under
the Red Hat 7.1 version that we get a return value of 1 where the first
poll has revents of POLLERR and POLLHUP.  Under Red Hat 6.2, our poll
returns a 0 and says that everything is fine and dandy, until we do our
next poll at line 202.  At this point, we get POLLERR for our socket file
descripter.  I am still tracing through the code again to write down how
it effects things, but I am fairly sure this is it.  At a high level, the
Red Hat 7.1 version, when it closes the socket, sets up a callback to
retry the connection at the time_reopen interval, but the Red Hat 6.2
version kills the fd and doesn't setup a callback for it.  (The io_iter
function in io.c is kind of long and it is in the second pass that this
stuff happens, so it is taking a little while to figure out what is going
on.)   I will send out more information as I receive, but if anyone comes
up with an easy way to patch this, please let me know.  I have people at
work breathing down my neck to figure this one out.

Thanks,
Matthew M. Copeland




On Tue, 28 Aug 2001, Balazs Scheidler wrote:

> On Mon, Aug 27, 2001 at 10:38:10PM +0000, matthew.copeland@honeywell.com wrote:
> > 
> > Well, I have spent some more time trying to narrow down the problem with
> > the client not reconnecting to the server more than once.  When I tried it
> > at home though, low and behold it worked.  After much investigation, I
> > have found that if I run the client off of Red Hat 6.2, it doesn't work,
> > but if I run the client off Red Hat 7.1, it does work.  So, the question
> > obviously becomes, why?  We have different kernels, libraries, and
> > compilers.  Anyone care to hazard a guess?  I am using the latest and
> > greatest versions of syslog-ng and libol now for all of my testing.
> 
> To be honest I have no clues. I'm working on a Debian potato (kernel 2.2.19,
> glibc 2.1.3) But this bug showed up sometimes previously as well.
> 
>