[syslog-ng]More information on time_reopen, continuous reconnects,
and syslog-ng1.4.10 - 1.4.12.
matthew.copeland@honeywell.com
matthew.copeland@honeywell.com
Wed, 5 Sep 2001 20:09:06 +0000 (UTC)
-------------------------------------------------------------
Background reminder:
Red Hat 6.2 box acting as a remote TCP logging client
doesn't try to reconnect more than once. Red Hat 7.1
box acting as a remote TCP logging client attempts to
reconnect every time_reopen seconds just like it is
supposed to do.
-------------------------------------------------------------
Well, I have spent some more time on this problem, and I have it narrowed
down quite a bit.
Using an strace of the syslog-ng client, you see the following under Red
Hat Linux 7.1 and Red Hat Linux 6.2. (More details after straces)
Red Hat Linux 6.2
...
9373 [400e5dc2] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
9373 [400d9ce4] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
9373 [400d9ce4] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
9373 [400d9ce4] fcntl(5, F_SETFD, FD_CLOEXEC) = 0
9373 [400d9b14] write(2, "io.c: connecting using fd 5\n", 28) = 28
9373 [400e5a82] connect(5, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.135")}}, 16) = -1 EINPROGRESS (Operation now in progress)
9373 [400bbf7d] time(NULL) = 999641180
9373 [400def50] poll([{fd=5, events=POLLOUT}, {fd=4, events=POLLIN}], 2, 100) = 0
9373 [400def50] poll([{fd=5, events=POLLOUT, revents=POLLERR}, {fd=4, events=POLLIN}], 2, 60000) = 1
9373 [400d9b14] write(2, "Marking fd 5 for closing.\n", 26) = 26
...
Red Hat Linux 7.1
...
6325 [40131462] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
6325 [40124187] fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR)
6325 [40124187] fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
6325 [40124187] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
6325 [40123f84] write(2, "io.c: connecting using fd 4\n", 28) = -1 EIO (Input/output error)
6325 [40131122] connect(4, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.141")}}, 16) = -1 EINPROGRESS (Operation now in progress)
6325 [400f876d] time(NULL) = 999637318
6325 [40129227] poll([{fd=4, events=POLLOUT, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 2, 100) = 1
6325 [401311e2] getsockopt(4, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
6325 [40123f84] write(2, "Error connecting to remote host "..., 77) = -1 EIO (Input/output error)
6325 [400f876d] time(NULL) = 999637318
6325 [400f876d] time(NULL) = 999637318
6325 [40123f84] write(2, "Closing fd 4.\n", 14) = -1 EIO (Input/output error)
...
The first poll that you are seeing in both of these traces is the poll on
line 197 of io.c for libol 0.2.23 (syslog-ng 1.4.12). Notice that under
the Red Hat 7.1 version that we get a return value of 1 where the first
poll has revents of POLLERR and POLLHUP. Under Red Hat 6.2, our poll
returns a 0 and says that everything is fine and dandy, until we do our
next poll at line 202. At this point, we get POLLERR for our socket file
descripter. I am still tracing through the code again to write down how
it effects things, but I am fairly sure this is it. At a high level, the
Red Hat 7.1 version, when it closes the socket, sets up a callback to
retry the connection at the time_reopen interval, but the Red Hat 6.2
version kills the fd and doesn't setup a callback for it. (The io_iter
function in io.c is kind of long and it is in the second pass that this
stuff happens, so it is taking a little while to figure out what is going
on.) I will send out more information as I receive, but if anyone comes
up with an easy way to patch this, please let me know. I have people at
work breathing down my neck to figure this one out.
Thanks,
Matthew M. Copeland
On Tue, 28 Aug 2001, Balazs Scheidler wrote:
> On Mon, Aug 27, 2001 at 10:38:10PM +0000, matthew.copeland@honeywell.com wrote:
> >
> > Well, I have spent some more time trying to narrow down the problem with
> > the client not reconnecting to the server more than once. When I tried it
> > at home though, low and behold it worked. After much investigation, I
> > have found that if I run the client off of Red Hat 6.2, it doesn't work,
> > but if I run the client off Red Hat 7.1, it does work. So, the question
> > obviously becomes, why? We have different kernels, libraries, and
> > compilers. Anyone care to hazard a guess? I am using the latest and
> > greatest versions of syslog-ng and libol now for all of my testing.
>
> To be honest I have no clues. I'm working on a Debian potato (kernel 2.2.19,
> glibc 2.1.3) But this bug showed up sometimes previously as well.
>
>