[syslog-ng]syslog-ng-1.4.17 crashes

Balazs Scheidler bazsi@balabit.hu
Thu, 14 Nov 2002 18:07:02 +0100


On Thu, Nov 14, 2002 at 05:03:47PM +0100, Heinz Ekker wrote:
> Hi!
> 
> I am using syslog-ng 1.4.17 with libol 0.2.24 on a central log host
> running RedHat 7.3. 
> 
> It all worked fine so far, until the load on the logging servers got
> higher and higher, resulting in about 900MB Logs daily. Then, syslog-ng
> started to die randomly, apparently not connected to any particular load
> peaks (at least as far as I was able to check), just the normal inferno.
> 
> After finding and eliminating that d*mn RedHat's 'ulimit -c 0' in the
> rc-script, I got several core dumps, which, when examined with gdb, all
> show the following backtrace:
> 
> (gdb) bt
> #0  0x400530a1 in kill () from /lib/libc.so.6
> #1  0x40052e99 in raise () from /lib/libc.so.6
> #2  0x40054364 in abort () from /lib/libc.so.6
> #3  0x080529e5 in fatal ()
> #4  0x080530a7 in xalloc ()
> #5  0x080531f7 in ol_string_alloc ()
> #6  0x0805068f in c_format ()
> #7  0x08053501 in do_flush ()
> #8  0x0805162d in write_callback ()
> #9  0x080511d7 in io_iter ()
> #10 0x08049c45 in main_loop ()
> #11 0x08049f81 in main ()
> #12 0x400421c4 in __libc_start_main () from /lib/libc.so.6
> 
> As far as I know, malloc only returns NULL if it was unable to allocate
> the requested memory. The machine has 1 GB physical RAM and another Gig
> of Swap space. I'm running the sar data collector, and at all times 
> there were loads of free memory. Swap stays untouched, the machine is
> not doing much besides syslogging.
> 
> At a loss for any solution, I did a panic upgrade to 1.5.23 with libol
> 0.3.5 today, when syslog-ng died 3 times within 30 minutes. So far it
> runs stable, but I'll know more tomorrow.
> 
> My questions: Is this a bug in the 1.4 series? Can I sleep well while
> running 1.5 (marked as 'development')?
> 
> It is imperative for us that no messages, or at least as few as
> possible, are lost, for dealing with abuse requests and customer
> inquiries.

I am not aware of this bug. The backtrace seems to indicate that this
c_format() call is failing:

item->packet = c_format("%s", s->length - res, s->data + res);

res is the number of bytes written, as returned by write(); s->length is the
length of the data block to write, and s->data is the data itself.

The code checks that res is >= 0, and since res is signed, the error
indication (-1) never reaches this call.
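
To make the code path concrete, here is a minimal sketch in C of keeping the
unwritten tail of a buffer after a short write. It is not the libol code:
save_unwritten() is a hypothetical helper, plain malloc() stands in for
xalloc(), and the types (size_t, ssize_t) are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper, not part of libol: copies the bytes that write()
 * did not accept, so they can be queued and retried later.
 * Returns NULL if res is an error value, exceeds length, or malloc fails. */
char *save_unwritten(const char *data, size_t length, ssize_t res,
                     size_t *out_len)
{
    if (res < 0 || (size_t)res > length)
        return NULL;                    /* nothing sane to save */

    size_t rest = length - (size_t)res; /* bytes still to be written */
    char *copy = malloc(rest ? rest : 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, data + res, rest);     /* tail starts at data + res */
    *out_len = rest;
    return copy;
}
```

The explicit upper-bound check on res is an addition of this sketch; the
code discussed here only checks res >= 0.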

s->length - res might be a big value if:

1) s->length < res

   This should not be possible, as res must be less than or equal to s->length.

2) s->length itself is negative

   This doesn't seem to be possible either, and IMHO write() would return
   -1 in that case, so this code path would not be reached.
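
As an illustration of why a "negative" difference would end in a failed
allocation, here is a small C sketch. It assumes the length is an unsigned
32-bit value and res is a signed int; the real types in libol may differ,
and remaining_length() is a hypothetical helper, not a libol function.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libol: computes the number of bytes
 * still to be written, the way "s->length - res" would if length were an
 * unsigned 32-bit value (an assumption) and res a signed int. */
uint32_t remaining_length(uint32_t length, int res)
{
    /* The arithmetic is done modulo 2^32, so a res larger than length
     * wraps around to a huge value instead of going negative. */
    return length - (uint32_t)res;
}
```

Under these assumptions, remaining_length(100, 40) is 60, while
remaining_length(40, 100) is 4294967236, i.e. a request for almost 4 GB,
which an allocator would refuse no matter how much memory is free.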

Can you analyze the core a bit more? (There is no use sending it to me, as
it was probably produced with a libc different from the one on my system.)

gdb syslog-ng -c core
(gdb) frame 4

This selects the frame of xalloc().

Now display part of the stack:

p $ebp
x/40 $ebp-20

I'll try to find out how many bytes c_format() wants to allocate. This might
help to track down the problem.

This code is different in libol 0.3 (thus in syslog-ng 1.5) so it might be
more stable.

1.5.x itself seems to be solid. (I don't know of any pending problems right
now, other than minor cosmetic changes such as the configure script.)

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1