Hello:

I have installed syslog-ng 1.4.11. Every now and then I have to restart both the server and the client syslog-ng because the client does not send any messages. The errors I see on the server side look as follows:

Aug 13 22:45:00 local@iadrse01/iadrse01 syslog-ng[2410]: io.c: do_write: write() failed (errno 32), Broken pipe
Aug 13 22:45:00 local@iadrse01/iadrse01 syslog-ng[2410]: Connection broken, reopening in 60 seconds

I also see this error in /var/adm/messages after a while:

Aug 14 14:09:51 local@iadcor01 syslog-ng[3614]: Error accepting AF_INET connection from: 216.182.213.206:36010, opened connections: 10, max: 10

Has anybody else seen this happening?

TIA
Ravi
On Tue, Aug 14, 2001 at 10:24:17AM -0400, Ravi Malghan wrote:
> I also see this error in the /var/adm/messages after a while
> Aug 14 14:09:51 local@iadcor01 syslog-ng[3614]: Error accepting AF_INET connection from: 216.182.213.206:36010, opened connections: 10, max: 10

You might want to increase the max_connections property of the given TCP source.

--
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
I have about 5-10 clients connecting to the server. Every time a client disconnects for some reason, the server does not release the port. Hence if I increase max_connections to 20, I will still start seeing the error after some time (later than with 10), since the server does not release the old connections.

When I do a netstat, the old connections still show up as established. Shouldn't they time out and be released?

Thanks
Ravi

----- Original Message -----
From: "Balazs Scheidler" <bazsi@balabit.hu>
To: <syslog-ng@lists.balabit.hu>
Sent: Tuesday, August 14, 2001 11:25 AM
Subject: Re: [syslog-ng]syslog-ng: broken pipe

> you might want to increase the max_connections property of the given TCP source.
_______________________________________________ syslog-ng maillist - syslog-ng@lists.balabit.hu https://lists.balabit.hu/mailman/listinfo/syslog-ng
On Tue, Aug 14, 2001 at 04:04:02PM -0400, Ravi Malghan wrote:
> When I do a netstat the old connections still show up as established? Shouldn't they time out and be released.

If the client drops the connection, it shouldn't show up as established in netstat.

--
Bazsi
I have been using syslog-ng 1.4.10 for some stuff at work for a while. Now that we actually have our remote system up and running over frame relay, our remote logging has become really important. I have set the time_reopen option in my options field to 30 seconds, and I think that it tries to connect after 30 seconds when there is a network failure, but it doesn't continue to try reconnecting, or at least if it does connect, it won't spit any more data to the server until I HUP the client syslog-ng process.

Is there another option for making syslog-ng continue trying to reconnect? If not, do I just need to modify the io_callout stuff to make it do so, or do I have to change something else? Is this just a bug, or is the current behavior the expected behavior?

I just don't like the idea of having to log in to 250 systems just to HUP the syslog-ng process, even if I can script it.

Matthew M. Copeland
On Wed, Aug 15, 2001 at 03:44:56PM +0000, matthew.copeland@honeywell.com wrote:
> Is this just a bug or is the current behavior the expected behavior?
> I just don't like the idea of having to login to 250 systems just to HUP the syslog-ng process. Even if I can script it.

This is a bug I haven't had time to track down. It seems that syslog-ng tries to reconnect after a connection failure, but sometimes, after a successful connection establishment, it just stops sending data and also stops trying to reconnect. Reading the source didn't reveal anything, and I couldn't reproduce the problem myself.

--
Bazsi
> Reading the source didn't reveal any information, and I couldn't reproduce the problem myself.

Here is how I reproduce the problem. It happens every time I do this. Unfortunately, it requires about 20 minutes or so. :)

On the client, fire up syslog-ng with --debug, with the config pointing to the server. On the server, fire up syslog-ng with --debug.

Use a little test program to generate some data. The one I have been using is:

#!/usr/bin/perl
local($i) = 0;
while(1) {
    print "The number is ", $i, "\n";
    $i++;
}

Then:

1. Try it out to make sure that you get a connection and that data goes through.
2. Kill the test program.
3. Reach around the back of your computer (laptop in my case) and disconnect the computer from the network.
4. Fire up the test program.
5. Wait until you get the message saying "Connection broken, reopening in %i seconds", where %i is the reopen time.
6. Wait until after that reopen attempt should have gone through. (In my case, I have reopen set to 30, so I wait more than 30 seconds.)
7. Reconnect the client to the network.

Bingo, no data until you HUP the process to get it to reconnect.

Matthew M. Copeland
This sounds a lot like a problem I was running into when the server syslog-ng died and I could not get the client syslog-ng to try to reconnect.

I made the following change to afinet.c, in the function do_init_afinet_dest():

     if (self->conn_fd) {
         return ST_OK | ST_GOON;
     }
     else {
         werror("Error creating AF_INET socket (%z)\n", strerror(errno));
+        io_callout(self->cfg->backend,
+                   self->cfg->time_reopen,
+                   make_driver_reinit(&self->super.super.super, self->cfg));
     }

Once I added this io_callout(), I was able to get it working in my set-up. I hope this helps.

matthew.copeland@honeywell.com wrote:
> Bingo, no data until you HUP the process to get it to reconnect.
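The idea behind this change, re-arming a retry timer whenever a connection attempt fails, can be sketched independently of syslog-ng. The following is only a minimal illustration in C: `reopen_timer`, `reopen_arm`, `reopen_due`, and `reopen_after_attempt` are hypothetical names and are not part of syslog-ng or libol, and the real code schedules the retry through io_callout() rather than through a structure like this.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical retry state for one outgoing connection. */
struct reopen_timer {
    int    armed;  /* nonzero while a retry is pending */
    time_t due;    /* earliest time of the next attempt */
};

/* Schedule the next attempt time_reopen seconds after now. */
void reopen_arm(struct reopen_timer *t, time_t now, int time_reopen)
{
    assert(t != NULL && time_reopen >= 0);
    t->armed = 1;
    t->due = now + time_reopen;
}

/* Nonzero when a retry is pending and its time has come. */
int reopen_due(const struct reopen_timer *t, time_t now)
{
    assert(t != NULL);
    return t->armed && now >= t->due;
}

/* Call after every connection attempt.  A failure always re-arms the
 * timer, so one failed retry cannot silently end the retry cycle. */
void reopen_after_attempt(struct reopen_timer *t, int connected,
                          time_t now, int time_reopen)
{
    assert(t != NULL);
    if (connected)
        t->armed = 0;
    else
        reopen_arm(t, now, time_reopen);
}
```

The point of the sketch is the else branch of reopen_after_attempt(): the symptom described in this thread (one reconnect attempt, then silence) is what you would expect if some failure path returns without scheduling the next attempt.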
On Fri, Aug 17, 2001 at 01:22:30AM +0000, Ramji Chandramouli wrote:
> Once I added this io_callout(), I was able to get it working in my set-up. I hope this helps.

Could someone please test this patch? If it fixes the problem I'm willing to include it.

--
Bazsi
On Fri, 17 Aug 2001, Balazs Scheidler wrote:
> Could someone please test this patch? If it fixes the problem I'm willing to include it.

This patch did not solve my problem, at least. It might solve someone else's problem, though. Does anyone have any other ideas on how to solve the client reconnect problem?

Thanks for all the help,
Matthew M. Copeland
On Fri, Aug 17, 2001 at 10:40:59PM +0000, matthew.copeland@honeywell.com wrote:
> This patch did not solve my problem at least. It might solve someone elses problem though. Does anyone have any other ideas on how to solve the client reconnect problem?

I can almost remember the problem you were having (hopefully I am right). You said that the server still thought that it was connected to the client, yes?

Would not KEEPALIVE solve your problems? That way the socket could detect the loss of connection. If that isn't good enough, you will need a heartbeat and a really short timeout on the recv.

----------------------------------------------------------------------------
   __o      Bradley Arlt   Email: arlt@cpsc.ucalgary.ca      o__
 _ \<_      WWW: www.acs.ucalgary.ca/~bdarlt                 _>/ _
(_)/(_)  -Eat well, sleep peacefully, drink lots, and ride like hell.  (_)\(_)
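For readers unfamiliar with the KEEPALIVE suggestion, here is a minimal sketch in C of turning on the standard SO_KEEPALIVE socket option. It is not taken from syslog-ng; `enable_keepalive` and `keepalive_enabled` are hypothetical helper names, and only setsockopt(), getsockopt(), and SO_KEEPALIVE themselves are standard.

```c
#include <assert.h>
#include <sys/socket.h>

/* Ask the kernel to probe an idle TCP connection so that a dead peer
 * is eventually reported as an error on the socket.
 * Returns 0 on success, -1 on failure (errno set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Read the flag back: 1 if set, 0 if clear, -1 on error. */
int keepalive_enabled(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

Note that keepalive only helps a side that is otherwise idle and waiting, such as a server holding a stale "established" connection. With default Linux settings the first probe is sent only after two hours of idle time, so the kernel's keepalive timers usually need tuning as well.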
I don't think that it is a problem with the server. I think that it is a problem with the client. If I HUP the client and it reconnects, I start seeing data go across to the server again. Other clients connected to the same server are still able to send data through while we are having problems with the downed client. This is what leads me to think that it is the client and not the server.

Matthew M. Copeland

On Fri, 17 Aug 2001, Brad Arlt wrote:
> You said that the server still thought that it was connected to the client yes?
> Would not KEEPALIVE solve your problems? That way the socket could detect loss of connection.
-- You may be sure that when a man begins to call himself a "realist," he is preparing to do something he is secretly ashamed of doing. -- Sydney Harris
Well, I have spent some more time trying to narrow down the problem with the client not reconnecting to the server more than once. When I tried it at home, though, lo and behold, it worked. After much investigation, I have found that if I run the client on Red Hat 6.2, it doesn't work, but if I run the client on Red Hat 7.1, it does work. So the question obviously becomes: why? The two have different kernels, libraries, and compilers. Anyone care to hazard a guess? I am using the latest and greatest versions of syslog-ng and libol now for all of my testing.

Matthew M. Copeland

On Mon, 20 Aug 2001 matthew.copeland@honeywell.com wrote:
> I don't think that it is a problem with the server. I think that it is a problem with the client.
On Mon, Aug 27, 2001 at 10:38:10PM +0000, matthew.copeland@honeywell.com wrote:
> After much investigation, I have found that if I run the client off of Red Hat 6.2, it doesn't work, but if I run the client off Red Hat 7.1, it does work. So, the question obviously becomes, why? We have different kernels, libraries, and compilers. Anyone care to hazard a guess?

To be honest, I have no clue. I'm working on a Debian potato (kernel 2.2.19, glibc 2.1.3), but this bug has shown up there occasionally as well.

--
Bazsi
-------------------------------------------------------------
Background reminder:
Red Hat 6.2 box acting as a remote TCP logging client doesn't try to reconnect more than once.
Red Hat 7.1 box acting as a remote TCP logging client attempts to reconnect every time_reopen seconds just like it is supposed to do.
-------------------------------------------------------------

Well, I have spent some more time on this problem, and I have it narrowed down quite a bit. Using an strace of the syslog-ng client, you see the following under Red Hat Linux 7.1 and Red Hat Linux 6.2. (More details after the straces.)

Red Hat Linux 6.2
...
9373 [400e5dc2] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
9373 [400d9ce4] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
9373 [400d9ce4] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
9373 [400d9ce4] fcntl(5, F_SETFD, FD_CLOEXEC) = 0
9373 [400d9b14] write(2, "io.c: connecting using fd 5\n", 28) = 28
9373 [400e5a82] connect(5, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.135")}}, 16) = -1 EINPROGRESS (Operation now in progress)
9373 [400bbf7d] time(NULL) = 999641180
9373 [400def50] poll([{fd=5, events=POLLOUT}, {fd=4, events=POLLIN}], 2, 100) = 0
9373 [400def50] poll([{fd=5, events=POLLOUT, revents=POLLERR}, {fd=4, events=POLLIN}], 2, 60000) = 1
9373 [400d9b14] write(2, "Marking fd 5 for closing.\n", 26) = 26
...

Red Hat Linux 7.1
...
6325 [40131462] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
6325 [40124187] fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR)
6325 [40124187] fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
6325 [40124187] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
6325 [40123f84] write(2, "io.c: connecting using fd 4\n", 28) = -1 EIO (Input/output error)
6325 [40131122] connect(4, {sin_family=AF_INET, sin_port=htons(999), sin_addr=inet_addr("151.150.32.141")}}, 16) = -1 EINPROGRESS (Operation now in progress)
6325 [400f876d] time(NULL) = 999637318
6325 [40129227] poll([{fd=4, events=POLLOUT, revents=POLLERR|POLLHUP}, {fd=3, events=POLLIN}], 2, 100) = 1
6325 [401311e2] getsockopt(4, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
6325 [40123f84] write(2, "Error connecting to remote host "..., 77) = -1 EIO (Input/output error)
6325 [400f876d] time(NULL) = 999637318
6325 [400f876d] time(NULL) = 999637318
6325 [40123f84] write(2, "Closing fd 4.\n", 14) = -1 EIO (Input/output error)
...

The first poll that you are seeing in both of these traces is the poll on line 197 of io.c for libol 0.2.23 (syslog-ng 1.4.12). Notice that under Red Hat 7.1 the first poll returns 1, with revents of POLLERR and POLLHUP. Under Red Hat 6.2, the first poll returns 0 and says that everything is fine and dandy, until we do our next poll at line 202. At that point, we get POLLERR for our socket file descriptor.

I am still tracing through the code to write down how it affects things, but I am fairly sure this is it. At a high level, the Red Hat 7.1 version, when it closes the socket, sets up a callback to retry the connection at the time_reopen interval, but the Red Hat 6.2 version kills the fd and doesn't set up a callback for it. (The io_iter function in io.c is kind of long, and it is in the second pass that this stuff happens, so it is taking a little while to figure out what is going on.)

I will send out more information as I receive it, but if anyone comes up with an easy way to patch this, please let me know. I have people at work breathing down my neck to figure this one out.

Thanks,
Matthew M. Copeland

On Tue, 28 Aug 2001, Balazs Scheidler wrote:
> To be honest I have no clues. I'm working on a Debian potato (kernel 2.2.19, glibc 2.1.3) But this bug showed up sometimes previously as well.
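The sequence visible in the Red Hat 7.1 trace (non-blocking connect(), then poll(), then getsockopt() with SO_ERROR) is the usual way to learn the outcome of a pending connect. Below is a minimal sketch of that check in C, written so that it does not depend on which of POLLERR or POLLHUP the kernel reports. The function name `nonblocking_connect_status` and its return convention are assumptions made for this illustration; it is not the libol code.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Wait up to timeout_ms for the outcome of a connect() that returned
 * EINPROGRESS on the non-blocking socket fd.
 * Returns 0 if the socket is connected, a positive errno value if the
 * attempt failed, or -1 if nothing was decided before the timeout. */
int nonblocking_connect_status(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int err = 0;
    socklen_t len = sizeof(err);
    int n;

    assert(fd >= 0);

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return errno;          /* poll itself failed */
    if (n == 0)
        return -1;             /* still in progress */

    /* Whatever revents says (POLLOUT, POLLERR, POLLHUP), SO_ERROR
     * holds the actual result of the connection attempt. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}
```

In the 7.1 trace the value read back this way is 111, which is ECONNREFUSED on Linux. A caller would treat any positive return as "close the fd and schedule a retry", and a return of -1 as "poll again later".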
On Wed, Sep 05, 2001 at 08:09:06PM +0000, matthew.copeland@honeywell.com wrote:
> Well, I have spent some more time on this problem, and I have it narrowed down quite a bit.
> Using an strace of the syslog-ng client, you see the following under Red Hat Linux 7.1 and Red Hat Linux 6.2. (More details after straces)

Thanks for tracking down this issue. The problem might be the difference between libc/kernel versions. Earlier libcs (glibc 2.0) used to emulate poll using select, but that is not the case here, since strace reports the call as poll. Still, RH 6.2 and 7.1 may contain different kernel versions which behave differently.

The problem is that RH 6.2 returns only POLLERR without POLLHUP, and syslog-ng expects POLLHUP for closed sessions. This patch may fix this problem and create new ones; however, at 22:43 this is the best I can make:

Index: io.c
===================================================================
RCS file: /var/cvs/libol/src/io.c,v
retrieving revision 1.25
diff -u -r1.25 io.c
--- io.c	2001/08/26 21:28:18	1.25
+++ io.c	2001/09/05 20:39:02
@@ -231,7 +231,7 @@
 			if (!fd->super.alive)
 				continue;
 
-			if (fds[i].revents & POLLHUP) {
+			if (fds[i].revents & (POLLHUP|POLLERR|POLLNVAL)) {
 				if (fd->want_read && fd->read)
 					READ_FD(fd);
 				else if (fd->want_write && fd->write)
@@ -246,10 +246,12 @@
 				close_fd(fd, CLOSE_PROTOCOL_FAILURE);
 				continue;
 			}
+			/*
 			if (fds[i].revents & (POLLNVAL | POLLERR)) {
 				close_fd(fd, CLOSE_POLL_FAILED);
 				continue;
 			}
+			*/
 			if (fds[i].revents & POLLOUT)
 				if (fd->want_write && fd->write)
 					WRITE_FD(fd);

--
Bazsi
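The core of the change is that a descriptor is now treated as closed when poll() reports any of POLLHUP, POLLERR, or POLLNVAL, rather than POLLHUP alone. Isolated from libol, that test might look like the sketch below; `pollfd_is_dead` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Returns nonzero when poll() reported a condition that means the
 * descriptor can no longer be used: hang-up, error, or invalid fd.
 * Checking all three avoids relying on the kernel setting POLLHUP
 * together with POLLERR. */
int pollfd_is_dead(const struct pollfd *pfd)
{
    assert(pfd != NULL);
    return (pfd->revents & (POLLHUP | POLLERR | POLLNVAL)) != 0;
}
```

With this test, the Red Hat 6.2 case from the strace (revents=POLLERR only) and the Red Hat 7.1 case (revents=POLLERR|POLLHUP) take the same code path.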
> The problem is that rh 6.2 returns only POLLERR without POLLHUP, and syslog-ng expects POLLHUP for closed sessions. This patch may fix this problem and create new ones, however at 22:43pm, this is the best I can make:

Well, I gave this patch a try, but it doesn't seem to fix the problem. I haven't walked through it with gdb with the patch in place yet, but the message indicating a reconnect attempt in 10 seconds only flashed by once, which is how it was behaving before. I will take another look at it tomorrow morning and see if I can figure out some more of what is happening.

Matthew M. Copeland
I think that part of the problem with this patch is that fd->super.alive doesn't get set to zero. I could be wrong, but I had traced through the original code using gdb before I received your patch. So after trying the patch and finding that it didn't work, I went back and looked at the output of the script file. The first time the socket times out, fd->super.alive gets set to 0 and the second poll doesn't happen. Then, when you go into the for loop, it continues because fd->super.alive is zero.

If this is what gets it to set up the callback for reconnecting later, maybe we should do the POLLERR check when we do the fd->super.alive check, and set fd->super.alive = 0 if we get into that if statement. What do you think? (I have included the script output session of my walk through the code using gdb.)

Matthew M. Copeland
Well, the issue of it not reconnecting at startup if the server was down has been fixed by that patch also. Thanks for the help. Matthew M. Copeland
On Thu, Sep 06, 2001 at 11:13:59PM +0000, matthew.copeland@honeywell.com wrote:
> Well, the issue of it not reconnecting at startup if the server was down has been fixed by that patch also. Thanks for the help.

Great. Thanks for your cooperation in finding and fixing this bug.

--
Bazsi
Okay, I must have screwed something up. I walked through the code with a debugger for this patch and it worked just fine this time. Maybe when I recompiled libol and syslog-ng after applying this patch, the new libol didn't get compiled in. Anyhow, good work and thanks for the help.

My next thing to check is whether it will attempt to connect again if it can't reach the server when it is first started up. I was noticing some problems with that, but it might be related to this.

Thanks for all the help again.

Matthew M. Copeland

On Wed, 5 Sep 2001, Balazs Scheidler wrote:
> The problem is that rh 6.2 returns only POLLERR without POLLHUP, and syslog-ng expects POLLHUP for closed sessions. This patch may fix this problem and create new ones, however at 22:43pm, this is the best I can make:
Participants (5): Balazs Scheidler, Brad Arlt, matthew.copeland@honeywell.com, Ramji Chandramouli, Ravi Malghan