Re: [syslog-ng]syslog-ng-1.4.17 crashes

14 Nov 2002

      On Thu, Nov 14, 2002 at 05:03:47PM +0100, Heinz Ekker wrote:
...
Hi!
I am using syslog-ng 1.4.17 with libol 0.2.24 on a central log host
running RedHat 7.3.
It all worked fine so far, until the load on the logging servers got
higher and higher, resulting in about 900MB Logs daily. Then, syslog-ng
started to die randomly, apparently not connected to any particular load
peaks (at least as far as I was able to check), just the normal inferno.
After finding and eliminating that d*mn RedHat's 'ulimit -c 0' in the
rc-script, I got several core dumps, which, when examined with gdb, all
show the following backtrace:
(gdb) bt
#0  0x400530a1 in kill () from /lib/libc.so.6
#1  0x40052e99 in raise () from /lib/libc.so.6
#2  0x40054364 in abort () from /lib/libc.so.6
#3  0x080529e5 in fatal ()
#4  0x080530a7 in xalloc ()
#5  0x080531f7 in ol_string_alloc ()
#6  0x0805068f in c_format ()
#7  0x08053501 in do_flush ()
#8  0x0805162d in write_callback ()
#9  0x080511d7 in io_iter ()
#10 0x08049c45 in main_loop ()
#11 0x08049f81 in main ()
#12 0x400421c4 in __libc_start_main () from /lib/libc.so.6
As far as I know, malloc only returns NULL if it was unable to allocate
the requested memory. The machine has 1 GB physical RAM and another Gig
of Swap space. I'm running the sar data collector, and at all times 
there were loads of free memory. Swap stays untouched, the machine is
not doing much besides syslogging.
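
The top of the backtrace (abort() reached through fatal() from xalloc()) is what an "allocate or die" wrapper looks like when malloc() fails. A minimal sketch of that pattern follows; the name xalloc_sketch and its error message are made up for illustration and are not libol's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "allocate or die" wrapper, NOT libol's real xalloc().
 * If malloc() cannot satisfy the request, the process aborts, which
 * leaves a core dump with abort() at the top of the stack. */
static void *xalloc_sketch(size_t size)
{
    void *p;

    assert(size > 0);   /* malloc(0) may legitimately return NULL */
    p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "out of memory (%zu bytes requested)\n", size);
        abort();
    }
    return p;
}
```

With a wrapper like this, a single request of an unreasonable size is enough to kill the process, regardless of how much memory is free at the time.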
At a loss for any solution, I did a panic upgrade to 1.5.23 with libol
0.3.5 today, when syslog-ng died 3 times within 30 minutes. So far it
runs stable, but I'll know more tomorrow.
My questions: Is this a bug in the 1.4 series? Can I sleep well while
running 1.5 (marked as 'development')?
It is imperative for us that no messages, or at least as few as
possible, are lost, for dealing with abuse requests and customer
inquiries.
I don't know about this bug. The backtrace seems to indicate that this
c_format() call is failing:

item->packet = c_format("%s", s->length - res, s->data + res);

res is the number of bytes returned by write(), s->length is the length of
the data block to write, and s->data is the data to write.

The code checks that res is >= 0, and since res is signed, the error
indication (-1) never reaches this call.
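
A minimal sketch of the bookkeeping this call performs after a short write, for illustration only. The struct tail type and the unsent_tail() helper are hypothetical names and are not part of libol.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical view of the unsent tail of a buffer; not a libol type. */
struct tail {
    const char *data;
    size_t length;
};

/* Given a buffer of `length` bytes and the return value `res` of write(),
 * describe the bytes that still have to be written.
 * The caller must handle res == -1 (error) before calling this. */
static struct tail unsent_tail(const char *data, size_t length, ssize_t res)
{
    struct tail t;

    assert(res >= 0);               /* the error case is handled elsewhere */
    assert((size_t)res <= length);  /* write() never reports more than asked */
    t.data = data + res;
    t.length = length - (size_t)res;
    return t;
}
```

The two assertions spell out the invariants the analysis relies on: res is never negative on this path, and it never exceeds the length that was passed to write().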

s->length - res might be a big value if:

1) s->length < res

   this is not possible as res must be less than or equal to s->length

2) s->length itself is negative

   this doesn't seem to be possible either, and IMHO write() would return
   -1 in which case this code path is not touched.
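
To make the concern concrete, here is a small sketch of the arithmetic, assuming the length ends up in a 32-bit unsigned type (an assumption; the type the real code uses has not been checked here). The helper remaining_u32() is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the length argument is handled as a 32-bit unsigned value.
 * Computes length - res the way unsigned arithmetic does, so a result
 * that "should" be negative wraps around to a value close to 4 GB. */
static uint32_t remaining_u32(int32_t length, int32_t res)
{
    assert(res >= 0);   /* mirrors the res >= 0 check described above */
    return (uint32_t)length - (uint32_t)res;
}
```

Under that assumption, remaining_u32(100, 40) is 60 as expected, while remaining_u32(40, 50) is 4294967286 and remaining_u32(-1, 0) is 4294967295. An allocation request of that size cannot succeed in a 32-bit process no matter how much RAM is free.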

Can you analyze the core a bit more? (There is no use sending it to me, as
it was produced against a libc that may differ from the one on my system.)

gdb syslog-ng -c core
(gdb) frame 4

this selects the frame of xalloc()

now display part of the stack:

p $ebp
x/40 $ebp-20

I'll try to find out how many bytes c_format() wants to allocate. This might
help to track down the problem.

This code is different in libol 0.3 (thus in syslog-ng 1.5) so it might be
more stable.

1.5.x itself seems to be solid (I don't know of any pending problems now,
other than minor cosmetic issues like the configure script).

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

Balazs Scheidler