[warning: really long post follows]

I'm running syslog-ng 1.4.14 on a number of Red Hat 7.1/7.2 systems. It's configured fairly simply, writing logs to /var/log/programname.$MONTH$DAY$YEAR and also forwarding logs via UDP to a central syslog-ng hub.

client config:

options { use_fqdn(on); chain_hostnames(off); sync(0); };

source s_local { internal(); unix-stream("/dev/log"); };
source s_kmsg { file("/proc/kmsg"); };

destination d_sort_program {
    file("/var/log/$PROGRAM.$MONTH$DAY$YEAR"
        group("logs") perm(0640)
        create_dirs(yes) dir_group("logs") dir_perm(0750));
};
destination d_syslog_hub { udp(172.16.1.4 port(514)); };

log {
    source(s_local); source(s_kmsg);
    destination(d_sort_program); destination(d_syslog_hub);
};

The problem is that syslog-ng seems to die on a fairly regular basis. I can't say for sure whether it's because our syslog hub was unavailable for an extended period of time or whether it's a problem in the code itself. I'm doing some testing to see if the availability of the hub makes any difference and will follow up in a few days.

Anyway, I was curious whether anyone has an idea of what might be going on. An strace of syslog-ng on one of the machines before it died is very interesting:

poll([{fd=6, events=0}, {fd=5, events=0, revents=POLLERR}, {fd=2, events=POLLIN}, {fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 5, 100) = 1
time(NULL) = 1021313163
getpid() = 17061
time(NULL) = 1021313163
open("/var/log/syslog-ng.05132002", O_WRONLY|O_NONBLOCK|O_APPEND|O_CREAT|O_NOCTTY, 0640) = 0
chown32("/var/log/syslog-ng.05132002", 0, 6000) = 0
fcntl64(0, F_GETFL) = 0xc01 (flags O_WRONLY|O_NONBLOCK|O_APPEND)
fcntl64(0, F_SETFL, O_WRONLY|O_NONBLOCK|O_APPEND) = 0
fcntl64(0, F_SETFD, FD_CLOEXEC) = 0
time(NULL) = 1021313163
time(NULL) = 1021313163
time(NULL) = 1021313163
close(5) = 0
poll([{fd=0, events=0}, {fd=6, events=0}, {fd=2, events=POLLIN}, {fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 5, 100) = 0
getpid() = 17061
time(NULL) = 1021313163
getpid() = 17061
time(NULL) = 1021313163
getpid() = 17061
time(NULL) = 1021313163
getpid() = 17061
time(NULL) = 1021313163

And so on, alternating between getpid() and time() until it finally segfaults and dies. A backtrace is also interesting; it looks like syslog-ng might be getting caught in an infinite loop:

#0  0x400b49c9 in chunk_alloc (ar_ptr=0x0, nb=0) at malloc.c:2843
#1  0x400b4818 in __libc_malloc (bytes=88) at malloc.c:2811
#2  0x08053b22 in debug_malloc ()
#3  0x08053bdd in xalloc ()
#4  0x08053da0 in ol_string_alloc ()
#5  0x08051203 in c_format_cstring ()
#6  0x0804b98b in make_internal_message ()
#7  0x080503e6 in write_internal_message ()
#8  0x08053291 in msg_vformat ()
#9  0x0805339c in fatal ()
#10 0x08053e3d in ol_object_check ()
#11 0x0804a7c6 in do_add_source_name ()
#12 0x080503fc in write_internal_message ()
#13 0x08053291 in msg_vformat ()
#14 0x0805339c in fatal ()
#15 0x08053e3d in ol_object_check ()
#16 0x0804a7c6 in do_add_source_name ()
#17 0x080503fc in write_internal_message ()
#18 0x08053291 in msg_vformat ()
#19 0x0805339c in fatal ()
#20 0x08053e3d in ol_object_check ()
#21 0x0804a7c6 in do_add_source_name ()
#22 0x080503fc in write_internal_message ()
#23 0x08053291 in msg_vformat ()
#24 0x0805339c in fatal ()

Frames #20-#24 repeat over and over. The entire backtrace is huge: almost 35,000 frames. Here's the very bottom of it, where the loop started:

#34475 0x08053e3d in ol_object_check ()
#34476 0x0804a7c6 in do_add_source_name ()
#34477 0x080503fc in write_internal_message ()
#34478 0x08053291 in msg_vformat ()
#34479 0x0805339c in fatal ()
#34480 0x08053e3d in ol_object_check ()
#34481 0x0804a7c6 in do_add_source_name ()
#34482 0x080503fc in write_internal_message ()
#34483 0x08053291 in msg_vformat ()
#34484 0x0805339c in fatal ()
#34485 0x080544d7 in dont_free_live_resource ()
#34486 0x080543f3 in do_resource_free ()
#34487 0x08053961 in gc_sweep ()
#34488 0x08053a5b in gc ()
#34489 0x08053aea in gc_maybe ()
#34490 0x080515df in io_iter ()
#34491 0x0804a095 in main_loop ()
#34492 0x0804a3d1 in main ()
#34493 0x40050647 in __libc_start_main (main=0x804a1c8 <main>, argc=2, ubp_av=0xbffff754, init=0x8049664 <_init>, fini=0x8054620 <_fini>, rtld_fini=0x4000dcd4 <_dl_fini>, stack_end=0xbffff74c) at ../sysdeps/generic/libc-start.c:129

It looks like it's getting caught in an infinite loop that eventually exhausts its stack space, which presumably causes the segfault.

If you need any more information, let me know. I also still have the core if anybody wants it. Any ideas would be much appreciated!

thanks,
john
--
John Morrissey <jwm@horde.net>
www.horde.net/
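[Editor's aside: the repeating frames suggest a re-entrant error path, where the fatal-error handler logs through the same internal-message machinery whose sanity check just failed. The sketch below is a minimal C illustration of that failure mode. The function names mirror symbols in the backtrace, but the bodies and the object_is_corrupt flag are invented for illustration; the real syslog-ng/libol code differs.]

/* Sketch of the failure mode suggested by the backtrace: the fatal error
 * path tries to report the error through the same machinery whose sanity
 * check just failed, so every report triggers the error again and the
 * stack grows until the process segfaults.
 *
 * Function names mirror the symbols in the trace; the bodies are
 * hypothetical stand-ins, not the real syslog-ng/libol code. */
#include <stdlib.h>

static int object_is_corrupt = 1;   /* e.g. a string whose refcount was over-unreffed */

static void fatal(const char *msg);

static void ol_object_check(void)
{
    if (object_is_corrupt)
        fatal("object check failed");   /* error detected... */
}

static void do_add_source_name(void)
{
    ol_object_check();                  /* ...while building an internal message */
}

static void write_internal_message(const char *msg)
{
    (void) msg;
    do_add_source_name();               /* the logging path touches the bad object again */
}

static void fatal(const char *msg)
{
    write_internal_message(msg);        /* report the error through the broken path: recursion */
    abort();                            /* never reached; the stack overflows first */
}

int main(void)
{
    fatal("boom");                      /* unbounded recursion -> stack exhaustion -> SIGSEGV */
    return 0;
}

Compiled without optimization, this recurses until the stack is exhausted and the process dies with SIGSEGV, much like the ~35,000-frame backtrace above.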
I have noticed this also with version 1.5.17. I just thought it was my setup.

----- Original Message -----
From: "John Morrissey" <jwm@horde.net>
To: <syslog-ng@lists.balabit.hu>
Sent: Tuesday, May 14, 2002 4:36 PM
Subject: [syslog-ng] syslog-ng dying randomly?
On Tue, May 14, 2002 at 08:40:23PM -0700, Jim Gifford wrote:
> I have noticed this also with version 1.5.17. I just thought it was my setup.
hmm, then there must be another bad unref somewhere. Please apply the patch below; it hopefully catches the issue (it is NOT a fix, just an aid to catch the bug):

diff -u -r1.8 xalloc.c
--- xalloc.c	4 Feb 2000 19:02:04 -0000	1.8
+++ xalloc.c	15 May 2002 08:16:47 -0000
@@ -225,6 +225,7 @@
 #if 0
   debug("ol_string_free: freeing %xi.\n", (void *) s);
 #endif
+  assert(((unsigned int) s->use_cnt) < 10);
   if (--s->use_cnt == 0)
     {
 #ifdef DEBUG_ALLOC
       if (s->header.magic != -1717)

--
Bazsi
On Wed, May 15, 2002 at 10:17:32AM +0200, Balazs Scheidler wrote:
> On Tue, May 14, 2002 at 08:40:23PM -0700, Jim Gifford wrote:
> > I have noticed this also with version 1.5.17. I just thought it was my setup.
>
> hmm, then there must be another bad unref somewhere. Please apply the patch below; it hopefully catches the issue (it is NOT a fix, just an aid to catch the bug):
sorry to repeat myself, but this patch adds even more sanity checking:

Index: xalloc.c
===================================================================
RCS file: /var/cvs/libol/src/xalloc.c,v
retrieving revision 1.8
diff -u -r1.8 xalloc.c
--- xalloc.c	4 Feb 2000 19:02:04 -0000	1.8
+++ xalloc.c	15 May 2002 08:20:02 -0000
@@ -225,6 +225,8 @@
 #if 0
   debug("ol_string_free: freeing %xi.\n", (void *) s);
 #endif
+  assert(((unsigned int) s->use_cnt) < 10);
+  assert(s->use_cnt);
   if (--s->use_cnt == 0)
     {
 #ifdef DEBUG_ALLOC
       if (s->header.magic != -1717)

--
Bazsi
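[Editor's aside: reading the patch, assert(s->use_cnt) fires when a string is unreffed while its count is already zero, and the `< 10` check catches a count that has already gone negative, since the cast to unsigned turns a negative count into a huge value. Below is a toy illustration; toy_string and toy_string_free are hypothetical stand-ins, not libol's real ol_string code shown in the diff context.]

/* Toy illustration of what the two asserts are meant to catch.  The
 * struct and function are hypothetical, not libol's real ol_string. */
#include <assert.h>
#include <stdlib.h>

struct toy_string {
    int use_cnt;        /* reference count */
    char *data;
};

static void toy_string_free(struct toy_string *s)
{
    /* The patched checks: a count that is zero (about to go below zero)
     * or wildly large (already wrapped through the unsigned cast) means
     * some caller dropped a reference it never held. */
    assert(((unsigned int) s->use_cnt) < 10);
    assert(s->use_cnt);

    if (--s->use_cnt == 0) {
        free(s->data);
        free(s);
    }
}

int main(void)
{
    struct toy_string *s = calloc(1, sizeof(*s));
    s->use_cnt = 1;
    s->data = malloc(16);

    toy_string_free(s);     /* legitimate unref: count hits 0, memory freed */
    /* A second, erroneous unref of the same object is the "bad unref"
     * being hunted; with the asserts in place it would abort (SIGABRT)
     * at the offending call site instead of silently corrupting memory
     * and crashing later. */
    return 0;
}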
On Wed, May 15, 2002 at 10:20:45AM +0200, Balazs Scheidler wrote:
% sorry to repeat myself, but this patch adds even more sanity checking:

I patched libol on a local machine and started syslog-ng. It ran for much longer than normal, but then I realized that it could still reach the syslog hub. When I removed its access to the hub, syslog-ng died (segfaulted) within a couple of minutes with the following internal messages logged:

May 15 12:44:09 serviceman01 syslog-ng[9566]: syslog-ng version 1.4.14 starting
May 15 12:44:09 serviceman01 syslog-ng: syslog-ng startup succeeded
May 15 12:44:50 serviceman01 syslog-ng[9566]: Connection broken, reopening in 60 seconds
May 15 12:49:22 serviceman01 syslog-ng[9566]: Connection broken, reopening in 60 seconds

The backtrace was the same. No assertion failures were logged; would they be syslogged? It looks like syslog-ng gets wedged somehow before it dies and might not be able to log anything.

thanks for your help,
john
--
John Morrissey <jwm@horde.net>
www.horde.net/
On Wed, May 15, 2002 at 01:15:41PM -0400, John Morrissey wrote:
> On Wed, May 15, 2002 at 10:20:45AM +0200, Balazs Scheidler wrote:
> % sorry to repeat myself, but this patch adds even more sanity checking:
>
> I patched libol on a local machine and started syslog-ng. It ran for much longer than normal, but then I realized that it could still reach the syslog hub. When I removed its access to the hub, syslog-ng died (segfaulted) within a couple of minutes with the following internal messages logged:
>
> May 15 12:44:09 serviceman01 syslog-ng[9566]: syslog-ng version 1.4.14 starting
> May 15 12:44:09 serviceman01 syslog-ng: syslog-ng startup succeeded
> May 15 12:44:50 serviceman01 syslog-ng[9566]: Connection broken, reopening in 60 seconds
> May 15 12:49:22 serviceman01 syslog-ng[9566]: Connection broken, reopening in 60 seconds
>
> The backtrace was the same. No assertion failures were logged; would they be syslogged? It looks like syslog-ng gets wedged somehow before it dies and might not be able to log anything.
the asserts wouldn't be syslogged, but the terminating signal would become SIGABRT instead of SIGSEGV.

I'm trying to reproduce the problem here. I changed the reopen time from 60 to 3 seconds and am sending log messages with logger from a shell script.

How did you remove syslog-ng's ability to send messages to the hub? A packet filter rule, directing it to an unused port, or something else?

--
Bazsi
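[Editor's aside: a failed assert is easy to tell apart in the core: assert() prints its diagnostic to stderr (which a daemon may well have closed or redirected) and then calls abort(), so the process dies with SIGABRT rather than SIGSEGV. A trivial C illustration:]

/* assert() prints "Assertion `...' failed." to stderr and calls abort(),
 * so the process terminates with SIGABRT (signal 6), not SIGSEGV (11). */
#include <assert.h>

int main(void)
{
    int use_cnt = 0;
    assert(use_cnt);   /* fails: diagnostic on stderr, then SIGABRT */
    return 0;
}

gdb on such a core would report "Program terminated with signal 6, Aborted" instead of signal 11.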
On Thu, May 16, 2002 at 11:57:28AM +0200, Balazs Scheidler wrote:
% the asserts wouldn't be syslogged, but the terminating signal would become
% SIGABRT instead of SIGSEGV.

It still looks like SIGSEGV:

[root@villa:pts/1 /staff/jwm> gdb `which syslog-ng` core
[snip]
Core was generated by `syslog-ng'.
Program terminated with signal 11, Segmentation fault.

% I'm trying to reproduce the problem here. I changed the reopen time from 60
% to 3 seconds and am sending log messages with logger from a shell script.

If you want direct access to a machine, I might be able to get it for you.

% How did you remove syslog-ng's ability to send messages to the hub? A packet
% filter rule, directing it to an unused port, or something else?

A filter rule on the hub that causes the client to receive ECONNREFUSED.

I don't know if it helps, but I think it's interesting that if I start syslog-ng with the hub "unavailable," it seems to run for a much longer time (properly?), but if I start syslog-ng and *then* remove access to the hub, syslog-ng dies within a few minutes.

john
--
John Morrissey <jwm@horde.net>
www.horde.net/
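[Editor's aside: the ECONNREFUSED behaviour described above is what Linux does for a connected UDP socket: when the peer (or a filter rule that REJECTs the datagram) answers with ICMP port-unreachable, the kernel queues the error on the socket and reports it on a later send() or recv(). Presumably syslog-ng connect()s its UDP destination socket, which would be why it notices the hub disappearing at all; that is an assumption, not something confirmed in the thread. A small self-contained demonstration, assuming nothing is listening on the arbitrarily chosen port 50514:]

/* Demonstrates ECONNREFUSED on a connected UDP socket: send a datagram to
 * a local port with no listener, give the kernel time to receive the ICMP
 * port-unreachable, then observe the error on the next send().
 * Port 50514 is an arbitrary choice assumed to be unused. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer;

    if (fd < 0)
        return 1;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(50514);
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* connect() on a UDP socket just fixes the peer address and lets the
     * kernel deliver asynchronous errors (from ICMP) to this socket. */
    connect(fd, (struct sockaddr *) &peer, sizeof(peer));

    send(fd, "test 1", 6, 0);            /* usually appears to succeed */
    sleep(1);                            /* let the ICMP error arrive */

    if (send(fd, "test 2", 6, 0) < 0)
        printf("second send failed: %s\n", strerror(errno));  /* Connection refused */

    close(fd);
    return 0;
}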
participants (3)
- Balazs Scheidler
- Jim Gifford
- John Morrissey