syslog-ng: broken pipe


Ravi Malghan

14 Aug 2001

4:24 p.m.

Hello,

I have installed syslog-ng 1.4.11. Every now and then I have to restart both the server and the client syslog-ng because the client stops sending messages. The errors I see on the server side look as follows:

Aug 13 22:45:00 local@iadrse01/iadrse01 syslog-ng[2410]: io.c: do_write: write() failed (errno 32), Broken pipe
Aug 13 22:45:00 local@iadrse01/iadrse01 syslog-ng[2410]: Connection broken, reopening in 60 seconds

After a while I also see this error in /var/adm/messages:

Aug 14 14:09:51 local@iadcor01 syslog-ng[3614]: Error accepting AF_INET connection from: 216.182.213.206:36010, opened connections: 10, max: 10

Has anybody else seen this happening?

TIA
Ravi


Balazs Scheidler

14 Aug

5:25 p.m.

On Tue, Aug 14, 2001 at 10:24:17AM -0400, Ravi Malghan wrote:

...

Hello: I have installed syslog-ng 1.4.11. Every now and then I have to restart both the server and the client syslog-ng because the client stops sending messages. The errors I see on the server side look as follows

Aug 13 22:45:00 local@iadrse01/iadrse01 syslog-ng[2410]: io.c: do_write: write() failed (errno 32), Broken pipe
Aug 13 22:45:00 local@iadrse01/iadrse01 syslog-ng[2410]: Connection broken, reopening in 60 seconds

I also see this error in /var/adm/messages after a while

Aug 14 14:09:51 local@iadcor01 syslog-ng[3614]: Error accepting AF_INET connection from: 216.182.213.206:36010, opened connections: 10, max: 10

You might want to increase the max_connections property of the given TCP source.

--
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

Ravi Malghan

10:04 p.m.

I have about 5-10 clients connecting to the server. Every time a client disconnects for some reason, the server does not release the port. So if I increase max_connections to 20, I will still start seeing the error after some time (later than with 10), since the server does not release the old connections.

When I run netstat, the old connections still show up as established. Shouldn't they time out and be released?

Thanks
Ravi

_______________________________________________
syslog-ng maillist - syslog-ng@lists.balabit.hu
https://lists.balabit.hu/mailman/listinfo/syslog-ng

Balazs Scheidler

15 Aug

10:46 a.m.

On Tue, Aug 14, 2001 at 04:04:02PM -0400, Ravi Malghan wrote:

...

I have about 5-10 clients connecting to the server. Every time a client disconnects for some reason, the server does not release the port. So if I increase max_connections to 20, I will still start seeing the error after some time (later than with 10), since the server does not release the old connections.

When I run netstat, the old connections still show up as established. Shouldn't they time out and be released?

If the client drops the connection, it shouldn't show up as established in netstat.

--
Bazsi

matthew.copeland＠honeywell.com

5:44 p.m.

New subject: time_reopen, continuous reconnects, and syslog-ng 1.4.10 - 1.4.12.

I have been using syslog-ng 1.4.10 for some stuff at work for a while. Now that we actually have our remote system up and running over frame relay, our remote logging has become really important. I have set the time_reopen option in my options field to 30 seconds, and I think it tries to connect after 30 seconds when there is a network failure, but it doesn't keep trying to reconnect; or at least, if it does connect, it won't send any more data to the server until I HUP the client syslog-ng process.

Is there another option for making syslog-ng continue trying to reconnect? If not, do I just need to modify the io_callout stuff to make it do so, or do I have to change something else? Is this just a bug, or is the current behavior the expected behavior?

I just don't like the idea of having to log in to 250 systems just to HUP the syslog-ng process, even if I can script it.

Matthew M. Copeland

Balazs Scheidler

16 Aug

11:26 a.m.

New subject: time_reopen, continuous reconnects, and syslog-ng 1.4.10 - 1.4.12.

On Wed, Aug 15, 2001 at 03:44:56PM +0000, matthew.copeland@honeywell.com wrote:

...

I have been using syslog-ng 1.4.10 for some stuff at work for a while. Now that we actually have our remote system up and running over frame relay, our remote logging has become really important. I have set the time_reopen option in my options field to 30 seconds, and I think it tries to connect after 30 seconds when there is a network failure, but it doesn't keep trying to reconnect; or at least, if it does connect, it won't send any more data to the server until I HUP the client syslog-ng process.

Is there another option for making syslog-ng continue trying to reconnect? If not, do I just need to modify the io_callout stuff to make it do so, or do I have to change something else? Is this just a bug, or is the current behavior the expected behavior?

I just don't like the idea of having to login to 250 systems just to HUP the syslog-ng process. Even if I can script it.

This is a bug I haven't had time to track down. It seems that syslog-ng tries to reconnect after a connection failure, and sometimes, after the connection is successfully established, it just stops sending data and also stops trying to reconnect. Reading the source didn't reveal anything, and I couldn't reproduce the problem myself.

--
Bazsi

matthew.copeland＠honeywell.com

4:07 p.m.

New subject: time_reopen, continuous reconnects, and syslog-ng 1.4.10 - 1.4.12.

...

...
I just don't like the idea of having to login to 250 systems just to HUP the syslog-ng process. Even if I can script it.

This is a bug I didn't have time to track down. As it seems syslog-ng tries to reconnect after connection failure, and sometimes after a successful connection establishment just stops sending data, and also stops trying to reconnect. Reading the source didn't reveal any information, and I couldn't reproduce the problem myself.

Here is how I reproduce the problem. It happens every time I do this. Unfortunately, it takes about 20 minutes or so. :)

On the client, fire up syslog-ng with --debug with the config pointing to the server. On the server, fire up syslog-ng with --debug.

Use a little test program to generate some data. The one I have been using is

#!/usr/bin/perl

local($i) = 0;
while(1) {
    print "The number is ", $i, "\n";
    $i++;
}

Now, try it out to make sure that you get a connection and that data goes through. Kill the test program. Reach around the back of your computer (laptop in my case) and disconnect the computer from the network. Fire up the test program. Wait until you get the message saying "Connection broken, reopening in %i seconds", where %i is the reopen time. Wait until after that reopen attempt should have gone through. (In my case, I have reopen set to 30, so I wait more than 30 seconds.) Now, reconnect the client to the network. Bingo, no data until you HUP the process to get it to reconnect.

Matthew M. Copeland

Ramji Chandramouli

17 Aug

3:22 a.m.

New subject: time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

This sounds a lot like a problem I was running into when the server syslog-ng died and I could not get the client syslog-ng to try to reconnect.

I made the following change to afinet.c.

In the function do_init_afinet_dest(),

      if (self->conn_fd) {
              return ST_OK | ST_GOON;
      }
      else {
              werror("Error creating AF_INET socket (%z)\n", strerror(errno));
+             io_callout(self->cfg->backend,
+                        self->cfg->time_reopen,
+                        make_driver_reinit(&self->super.super.super, self->cfg));
      }

Once I added this io_callout(), I was able to get it working in my set-up. I hope this helps.

Balazs Scheidler

10:35 a.m.

New subject: time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

On Fri, Aug 17, 2001 at 01:22:30AM +0000, Ramji Chandramouli wrote:

...

This sounds a lot like a problem I was running into when the server syslog-ng died and I could not get the client syslog-ng to try to reconnect.

I made the following change to afinet.c.

In the function do_init_afinet_dest(),

      if (self->conn_fd) {
              return ST_OK | ST_GOON;
      }
      else {
              werror("Error creating AF_INET socket (%z)\n", strerror(errno));
+             io_callout(self->cfg->backend,
+                        self->cfg->time_reopen,
+                        make_driver_reinit(&self->super.super.super, self->cfg));
      }

Once I added this io_callout(), I was able to get it working in my set-up. I hope this helps.

Could someone please test this patch? If it fixes the problem, I'm willing to include it.

--
Bazsi

matthew.copeland＠honeywell.com

18 Aug

12:40 a.m.

New subject: time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

On Fri, 17 Aug 2001, Balazs Scheidler wrote:

...


Could someone please test this patch? If it fixes the problem I'm willing to include it.

This patch did not solve my problem, at least. It might solve someone else's problem, though. Does anyone have any other ideas on how to solve the client reconnect problem?

Thanks for all the help,
Matthew M. Copeland

Brad Arlt

3:05 a.m.

New subject: time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

On Fri, Aug 17, 2001 at 10:40:59PM +0000, matthew.copeland@honeywell.com wrote:

...

This patch did not solve my problem, at least. It might solve someone else's problem, though. Does anyone have any other ideas on how to solve the client reconnect problem?

Thanks for all the help,

I can almost remember the problem you were having (hopefully I am right). You said that the server still thought that it was connected to the client, yes?

Would not KEEPALIVE solve your problems? That way the socket could detect loss of connection. If that isn't good enough you will need a heartbeat and a really short timeout on the recv.

----------------------------------------------------------------------------
Bradley Arlt
Email: arlt@cpsc.ucalgary.ca
WWW: www.acs.ucalgary.ca/~bdarlt
-Eat well, sleep peacefully, drink lots, and ride like hell.
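A minimal sketch of what the KEEPALIVE suggestion amounts to at the socket level, assuming a POSIX system. The helper names enable_keepalive and keepalive_enabled are hypothetical and are not syslog-ng or libol functions; the thread does not say whether syslog-ng 1.4.x offers a way to turn this on.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of syslog-ng or libol): ask the kernel to
 * send keepalive probes on an otherwise idle TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
static int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Hypothetical helper: report whether keepalive is switched on for fd.
 * Returns 1 if enabled, 0 if disabled, -1 if getsockopt() fails. */
static int keepalive_enabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

With only SO_KEEPALIVE set, the kernel's default probe timing applies. On Linux the default idle time before the first probe is typically two hours, so on its own this may be too slow for the situation described here unless the system-wide timers are lowered as well.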

matthew.copeland＠honeywell.com

20 Aug

4:33 p.m.

New subject: time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

I don't think that it is a problem with the server. I think that it is a problem with the client. If I HUP the client and it reconnects, I start seeing data go across to the server again. Other clients connected to the same server, while we are having problems with the downed client, are still able to send data through. This is what leads me to think that it is the client and not the server.

Matthew M. Copeland

--
You may be sure that when a man begins to call himself a "realist," he is preparing to do something he is secretly ashamed of doing. -- Sydney Harris

matthew.copeland＠honeywell.com

28 Aug

12:38 a.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

Well, I have spent some more time trying to narrow down the problem with the client not reconnecting to the server more than once. When I tried it at home, though, lo and behold, it worked. After much investigation, I have found that if I run the client on Red Hat 6.2, it doesn't work, but if I run the client on Red Hat 7.1, it does work. So the question obviously becomes: why? The two systems have different kernels, libraries, and compilers. Anyone care to hazard a guess? I am using the latest and greatest versions of syslog-ng and libol now for all of my testing.

Matthew M. Copeland

Balazs Scheidler

10:14 a.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

On Mon, Aug 27, 2001 at 10:38:10PM +0000, matthew.copeland@honeywell.com wrote:

...

Well, I have spent some more time trying to narrow down the problem with the client not reconnecting to the server more than once. When I tried it at home, though, lo and behold, it worked. After much investigation, I have found that if I run the client on Red Hat 6.2, it doesn't work, but if I run the client on Red Hat 7.1, it does work. So the question obviously becomes: why? The two systems have different kernels, libraries, and compilers. Anyone care to hazard a guess? I am using the latest and greatest versions of syslog-ng and libol now for all of my testing.

To be honest, I have no clue. I'm working on Debian potato (kernel 2.2.19, glibc 2.1.3). But this bug has shown up occasionally before as well.

--
Bazsi

matthew.copeland＠honeywell.com

5 Sep

10:09 p.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

-------------------------------------------------------------
Background reminder:
A Red Hat 6.2 box acting as a remote TCP logging client doesn't try to reconnect more than once. A Red Hat 7.1 box acting as a remote TCP logging client attempts to reconnect every time_reopen seconds, just like it is supposed to do.
-------------------------------------------------------------

Well, I have spent some more time on this problem, and I have it narrowed down quite a bit.

Using an strace of the syslog-ng client, you see the following under Red Hat Linux 7.1 and Red Hat Linux 6.2. (More details after the straces.)

Red Hat Linux 6.2
...
9373 [400e5dc2] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
9373 [400d9ce4] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
9373 [400d9ce4] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
9373 [400d9ce4] fcntl(5, F_SETFD, FD_CLOEXEC) = 0
9373 [400d9b14] write(2, "io.c: connecting using fd 5\n", 28) = 28
9373 [400e5a82] connect(5, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.135")}}, 16) = -1 EINPROGRESS (Operation now in progress)
9373 [400bbf7d] time(NULL) = 999641180
9373 [400def50] poll([{fd=5, events=POLLOUT}, {fd=4, events=POLLIN}], 2, 100) = 0
9373 [400def50] poll([{fd=5, events=POLLOUT, revents=POLLERR}, {fd=4, events=POLLIN}], 2, 60000) = 1
9373 [400d9b14] write(2, "Marking fd 5 for closing.\n", 26) = 26
...

Red Hat Linux 7.1
...
6325 [40131462] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
6325 [40124187] fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR)
6325 [40124187] fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
6325 [40124187] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
6325 [40123f84] write(2, "io.c: connecting using fd 4\n", 28) = -1 EIO (Input/output error)
6325 [40131122] connect(4, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.141")}}, 16) = -1 EINPROGRESS (Operation now in progress)
6325 [400f876d] time(NULL) = 999637318
6325 [40129227] poll([{fd=4, events=POLLOUT, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 2, 100) = 1
6325 [401311e2] getsockopt(4, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
6325 [40123f84] write(2, "Error connecting to remote host "..., 77) = -1 EIO (Input/output error)
6325 [400f876d] time(NULL) = 999637318
6325 [400f876d] time(NULL) = 999637318
6325 [40123f84] write(2, "Closing fd 4.\n", 14) = -1 EIO (Input/output error)
...

The first poll that you are seeing in both of these traces is the poll on line 197 of io.c for libol 0.2.23 (syslog-ng 1.4.12). Notice that under Red Hat 7.1 we get a return value of 1, and the first poll has revents of POLLERR and POLLHUP. Under Red Hat 6.2, the first poll returns 0 and says that everything is fine and dandy, until we do our next poll at line 202. At this point, we get POLLERR for our socket file descriptor.

I am still tracing through the code again to write down how it affects things, but I am fairly sure this is it. At a high level, the Red Hat 7.1 version, when it closes the socket, sets up a callback to retry the connection at the time_reopen interval, but the Red Hat 6.2 version kills the fd and doesn't set up a callback for it. (The io_iter function in io.c is kind of long, and it is in the second pass that this stuff happens, so it is taking a little while to figure out what is going on.)

I will send out more information as I get it, but if anyone comes up with an easy way to patch this, please let me know. I have people at work breathing down my neck to figure this one out.

Thanks,
Matthew M. Copeland
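One way to avoid depending on which revents bits a given kernel sets is to ask the socket for its pending error once poll() reports anything at all, as the Red Hat 7.1 trace does with getsockopt(SO_ERROR). The following is a rough sketch under that assumption, using plain POSIX calls. probe_connect and loopback_addr are hypothetical names chosen for illustration; this is not how libol's io.c is structured.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: build an IPv4 address for 127.0.0.1:port. */
static struct sockaddr_in loopback_addr(unsigned short port)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    return addr;
}

/* Hypothetical helper: try a non-blocking connect and wait up to timeout_ms
 * for the outcome. Returns 0 if the connection was established, otherwise
 * an errno value such as ECONNREFUSED or ETIMEDOUT. The socket is closed
 * before returning, so this only probes reachability. */
static int probe_connect(const struct sockaddr_in *addr, int timeout_ms)
{
    int err = 0;
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    if (fd < 0)
        return errno;

    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        err = errno;
        close(fd);
        return err;
    }

    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        if (errno != EINPROGRESS) {
            err = errno;                /* failed immediately */
        } else {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int n = poll(&pfd, 1, timeout_ms);

            if (n < 0) {
                err = errno;
            } else if (n == 0) {
                err = ETIMEDOUT;        /* nothing happened in time */
            } else {
                /* Whatever revents says (POLLOUT, POLLERR, POLLHUP),
                 * the pending socket error gives the real result. */
                socklen_t len = sizeof(err);
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
                    err = errno;
            }
        }
    }

    close(fd);
    return err;
}
```

A caller that gets a non-zero result would schedule the next attempt after time_reopen seconds regardless of whether the kernel reported POLLERR alone or POLLERR together with POLLHUP.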

Balazs Scheidler

10:42 p.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

On Wed, Sep 05, 2001 at 08:09:06PM +0000, matthew.copeland@honeywell.com wrote:

...

------------------------------------------------------------- Background reminder: Red Hat 6.2 box acting as a remote TCP logging client doesn't try to reconnect more than once. Red Hat 7.1 box acting as a remote TCP logging client attempts to reconnect every time_reopen seconds just like it is supposed to do. ------------------------------------------------------------- Well, I have spent some more time on this problem, and I have it narrowed down quite a bit.

Using an strace of the syslog-ng client, you see the following under Red Hat Linux 7.1 and Red Hat Linux 6.2. (More details after straces)

Thanks for tracking down this issue. The problem might be the difference between libc/kernel versions. Earlier libcs (glibc 2.0) used to emulate poll using select; that is not the case here, as strace reports it as poll. But RH 6.2 and 7.1 may contain different kernel versions which behave differently.

The problem is that RH 6.2 returns only POLLERR without POLLHUP, and syslog-ng expects POLLHUP for closed sessions. This patch may fix this problem and create new ones; however, at 22:43, this is the best I can make:

Index: io.c
===================================================================
RCS file: /var/cvs/libol/src/io.c,v
retrieving revision 1.25
diff -u -r1.25 io.c
--- io.c 2001/08/26 21:28:18 1.25
+++ io.c 2001/09/05 20:39:02
@@ -231,7 +231,7 @@
     if (!fd->super.alive)
       continue;

-    if (fds[i].revents & POLLHUP) {
+    if (fds[i].revents & (POLLHUP|POLLERR|POLLNVAL)) {
       if (fd->want_read && fd->read)
         READ_FD(fd);
       else if (fd->want_write && fd->write)
@@ -246,10 +246,12 @@
       close_fd(fd, CLOSE_PROTOCOL_FAILURE);
       continue;
     }
+    /*
     if (fds[i].revents & (POLLNVAL | POLLERR)) {
       close_fd(fd, CLOSE_POLL_FAILED);
       continue;
     }
+    */
     if (fds[i].revents & POLLOUT)
       if (fd->want_write && fd->write)
         WRITE_FD(fd);

--
Bazsi
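The core of the patch above is the widened mask test: POLLERR and POLLNVAL are now treated the same way as POLLHUP, so the normal "connection lost" path runs on kernels that report only POLLERR. As an isolated illustration (fd_connection_failed is a hypothetical helper, not code from libol):

```c
#include <poll.h>
#include <stdbool.h>

/* Hypothetical helper: true when poll() reported any condition that means
 * the descriptor can no longer be used as a live connection. Checking all
 * three bits avoids relying on POLLHUP accompanying POLLERR. */
static bool fd_connection_failed(short revents)
{
    return (revents & (POLLHUP | POLLERR | POLLNVAL)) != 0;
}
```

With this test, the Red Hat 6.2 result (revents=POLLERR) and the Red Hat 7.1 result (revents=POLLERR|POLLHUP) from the traces earlier in the thread are classified identically.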

matthew.copeland＠honeywell.com

6 Sep

2:04 a.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

...

Thanks for tracking down this issue. The problem might be the difference between libc/kernel versions. Earlier libcs used to emulate poll using select (glibc 2.0), this is not the case as strace reports it as poll. But Rh 6.2 and 7.1 may contain different kernel versions which behave differently.

The problem is that rh 6.2 returns only POLLERR without POLLHUP, and syslog-ng expects POLLHUP for closed sessions. This patch may fix this problem and create new ones, however at 22:43pm, this is the best I can make:

Well, I gave this patch a try, but it doesn't seem to fix the problem. I haven't walked through it with gdb with the patch in place yet, but the message indicating a reconnect attempt in 10 seconds only flashed by once, which is how it was behaving before. I will take another look at it tomorrow morning and see if I can figure out more of what is happening.

Matthew M. Copeland

matthew.copeland＠honeywell.com

7:25 p.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

I think that part of the problem with this patch is that fd->super.alive doesn't get set to zero. I could be wrong, but I had traced through the original code using gdb before I received your patch. So after trying the patch and finding that it didn't work, I went back and looked at the output of the script file. The first time the socket times out, fd->super.alive gets set to 0 and the second poll doesn't happen. Then, when you go into the for loop, it continues because fd->super.alive is zero. If this is what gets it to set up the callback for reconnecting later, maybe we should do the POLLERR check when we do the fd->super.alive check, and set fd->super.alive = 0 if we get into that if statement. What do you think?

(I have included the script output of my walk through the code using gdb.)

Matthew M. Copeland

matthew.copeland＠honeywell.com

7 Sep

1:13 a.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

Well, the issue of it not reconnecting at startup if the server was down has been fixed by that patch also. Thanks for the help. Matthew M. Copeland

Balazs Scheidler

10:41 a.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

On Thu, Sep 06, 2001 at 11:13:59PM +0000, matthew.copeland@honeywell.com wrote:

...

Well, the issue of it not reconnecting at startup if the server was down has been fixed by that patch also. Thanks for the help.

Great. Thanks for your cooperation in finding and fixing this bug.

--
Bazsi

matthew.copeland＠honeywell.com

12:59 a.m.

New subject: More information on time_reopen, continuous reconnects, and syslog-ng1.4.10 - 1.4.12.

Okay, I must have screwed something up. I walked through the code with a debugger for this patch and it worked just fine this time. Maybe when I recompiled libol and syslog-ng after applying this patch, it didn't compile in the new libol. Anyhow, good work and thanks for the help.

My next thing to check is whether it will attempt to connect again if it can't reach the server when it is first started up. I was noticing some problems with that, but it might be related to this.

Thanks for all the help again.

Matthew M. Copeland
