[syslog-ng]syslog-ng dying randomly?

Tue, 14 May 2002 19:36:17 -0400

[warning: really long post follows]

I'm running syslog-ng 1.4.14 on a number of Red Hat 7.1/7.2 systems. It's
configured fairly simply, writing logs to
/var/log/programname.$YEAR$MONTH$DATE and also forwards logs via UDP to a
central syslog-ng hub.

client config:
options {
        use_fqdn(on);
        chain_hostnames(off);
        sync(0);
};

source s_local {
        internal();
        unix-stream("/dev/log");
};
source s_kmsg {
        file("/proc/kmsg");
};

destination d_sort_program {
        file("/var/log/$PROGRAM.$MONTH$DAY$YEAR"
                group("logs")
                perm(0640)
                create_dirs(yes)
                dir_group("logs")
                dir_perm(0750)
        );
};
destination d_syslog_hub {
        udp(172.16.1.4 port(514));
};

log {
        source(s_local);
        source(s_kmsg);
        destination(d_sort_program);
        destination(d_syslog_hub);
};

The problem is that syslog-ng seems to die on a fairly regular basis. I
can't say for sure if it's because our syslog hub wasn't available for an
extended period of time or if it's due to some sort of problem with the
code. I'm doing some testing to see if the availability of the hub makes any
difference and I'll make a follow-up in a few days.

Anyway, I was curious if anyone had an idea on what might be going on. An
strace of syslog-ng on one of the machines before it died is very
interesting.

poll([{fd=6, events=0}, {fd=5, events=0, revents=POLLERR}, {fd=2, events=POLLIN}, {fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 5, 100) = 1
time(NULL)                              = 1021313163
getpid()                                = 17061
time(NULL)                              = 1021313163
open("/var/log/syslog-ng.05132002", O_WRONLY|O_NONBLOCK|O_APPEND|O_CREAT|O_NOCTTY, 0640) = 0
chown32("/var/log/syslog-ng.05132002", 0, 6000) = 0
fcntl64(0, F_GETFL)                     = 0xc01 (flags O_WRONLY|O_NONBLOCK|O_APPEND)
fcntl64(0, F_SETFL, O_WRONLY|O_NONBLOCK|O_APPEND) = 0
fcntl64(0, F_SETFD, FD_CLOEXEC)         = 0
time(NULL)                              = 1021313163
time(NULL)                              = 1021313163
time(NULL)                              = 1021313163
close(5)                                = 0
poll([{fd=0, events=0}, {fd=6, events=0}, {fd=2, events=POLLIN}, {fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 5, 100) = 0
getpid()                                = 17061
time(NULL)                              = 1021313163
getpid()                                = 17061
time(NULL)                              = 1021313163
getpid()                                = 17061
time(NULL)                              = 1021313163
getpid()                                = 17061
time(NULL)                              = 1021313163

And so on, alternating between getpid() and time() until it finally
segfaults and dies. A backtrace is also interesting; it seems like syslog-ng
might be getting caught in an infinite loop:

#0  0x400b49c9 in chunk_alloc (ar_ptr=0x0, nb=0) at malloc.c:2843
#1  0x400b4818 in __libc_malloc (bytes=88) at malloc.c:2811
#2  0x08053b22 in debug_malloc ()
#3  0x08053bdd in xalloc ()
#4  0x08053da0 in ol_string_alloc ()
#5  0x08051203 in c_format_cstring ()
#6  0x0804b98b in make_internal_message ()
#7  0x080503e6 in write_internal_message ()
#8  0x08053291 in msg_vformat ()
#9  0x0805339c in fatal ()
#10 0x08053e3d in ol_object_check ()
#11 0x0804a7c6 in do_add_source_name ()
#12 0x080503fc in write_internal_message ()
#13 0x08053291 in msg_vformat ()
#14 0x0805339c in fatal ()
#15 0x08053e3d in ol_object_check ()
#16 0x0804a7c6 in do_add_source_name ()
#17 0x080503fc in write_internal_message ()
#18 0x08053291 in msg_vformat ()
#19 0x0805339c in fatal ()
#20 0x08053e3d in ol_object_check ()
#21 0x0804a7c6 in do_add_source_name ()
#22 0x080503fc in write_internal_message ()
#23 0x08053291 in msg_vformat ()
#24 0x0805339c in fatal ()

Calls #20 - #24 repeat over and over. The entire backtrace is huge, almost
35,000 calls. Here's the beginning of it:

#34475 0x08053e3d in ol_object_check ()
#34476 0x0804a7c6 in do_add_source_name ()
#34477 0x080503fc in write_internal_message ()
#34478 0x08053291 in msg_vformat ()
#34479 0x0805339c in fatal ()
#34480 0x08053e3d in ol_object_check ()
#34481 0x0804a7c6 in do_add_source_name ()
#34482 0x080503fc in write_internal_message ()
#34483 0x08053291 in msg_vformat ()
#34484 0x0805339c in fatal ()
#34485 0x080544d7 in dont_free_live_resource ()
#34486 0x080543f3 in do_resource_free ()
#34487 0x08053961 in gc_sweep ()
#34488 0x08053a5b in gc ()
#34489 0x08053aea in gc_maybe ()
#34490 0x080515df in io_iter ()
#34491 0x0804a095 in main_loop ()
#34492 0x0804a3d1 in main ()
#34493 0x40050647 in __libc_start_main (main=0x804a1c8 <main>, argc=2,
ubp_av=0xbffff754, 
    init=0x8049664 <_init>, fini=0x8054620 <_fini>, rtld_fini=0x4000dcd4
<_dl_fini>, 
    stack_end=0xbffff74c) at ../sysdeps/generic/libc-start.c:129

It looks like it's getting caught in an infinite loop which finally exhausts
its stack space and causes it to segfault(?).

If you need any more information, let me know. I also still have the core if
anybody wants it. Any ideas would be much appreciated!

thanks,
john
-- 
John Morrissey          _o            /\         ----  __o
jwm@horde.net        _-< \_          /  \       ----  <  \,
www.horde.net/    __(_)/_(_)________/    \_______(_) /_(_)__