On Thu, Nov 14, 2002 at 05:03:47PM +0100, Heinz Ekker wrote:
Hi!
I am using syslog-ng 1.4.17 with libol 0.2.24 on a central log host running RedHat 7.3.
It all worked fine until the load on the logging servers got higher and higher, resulting in about 900 MB of logs daily. Then syslog-ng started to die randomly, apparently not connected to any particular load peaks (at least as far as I was able to check), just the normal inferno.
After finding and eliminating that d*mn RedHat's 'ulimit -c 0' in the rc-script, I got several core dumps, which, when examined with gdb, all show the following backtrace:
(gdb) bt
#0  0x400530a1 in kill () from /lib/libc.so.6
#1  0x40052e99 in raise () from /lib/libc.so.6
#2  0x40054364 in abort () from /lib/libc.so.6
#3  0x080529e5 in fatal ()
#4  0x080530a7 in xalloc ()
#5  0x080531f7 in ol_string_alloc ()
#6  0x0805068f in c_format ()
#7  0x08053501 in do_flush ()
#8  0x0805162d in write_callback ()
#9  0x080511d7 in io_iter ()
#10 0x08049c45 in main_loop ()
#11 0x08049f81 in main ()
#12 0x400421c4 in __libc_start_main () from /lib/libc.so.6
As far as I know, malloc only returns NULL if it is unable to allocate the requested memory. The machine has 1 GB of physical RAM and another gig of swap space. I'm running the sar data collector, and at all times there were loads of free memory. Swap stays untouched, and the machine is not doing much besides syslogging.
At a loss for any solution, I did a panic upgrade to 1.5.23 with libol 0.3.5 today, when syslog-ng died 3 times within 30 minutes. So far it runs stable, but I'll know more tomorrow.
My questions: Is this a bug in the 1.4 series? Can I sleep well while running 1.5 (marked as 'development')?
It is imperative for us that no messages, or at least as few as possible, are lost, for dealing with abuse requests and customer inquiries.
I don't know about this bug. The backtrace seems to indicate that this c_format() call is failing:

    item->packet = c_format("%s", s->length - res, s->data + res);

res is the number of bytes returned by write(), s->length is the length of the data block to write, and s->data is the data to write. res is checked to be >= 0, and since it is signed, the error indication (-1) is caught by that check.

s->length - res might be a big value if:

1) s->length < res
   This is not possible, as res must be less than or equal to s->length.

2) s->length itself is negative
   This doesn't seem to be possible either, and IMHO write() would return -1, in which case this code path is not touched.

Can you analyze the core a bit more? (It is no use sending it to me, as it might contain a libc different from the one on my system.)

    gdb syslog-ng -c core
    (gdb) frame 4

This selects the frame of xalloc(). Now display part of the stack:

    p $ebp
    x/40 $ebp-20

I'll try to find how many bytes c_format_() wants to allocate. This might help to track down the problem.

This code is different in libol 0.3 (and thus in syslog-ng 1.5), so it might be more stable. 1.5.x itself seems to be solid (I don't know of any pending problems now, other than minor cosmetic changes like the configure script).

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
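
To illustrate why an allocation can fail on a machine with plenty of free memory, here is a minimal C sketch of the suspected failure mode. It is not the libol code: remaining_bytes(), as_alloc_size() and can_allocate() are hypothetical helpers made up for this example, and the assumption is only that the remaining length is computed as a signed value and then handed to an allocator that takes a size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of libol: bytes still to be written
 * after a partial write(). Mirrors the "s->length - res" expression. */
static long remaining_bytes(long length, long written)
{
    /* The reply states that res is checked to be >= 0 before this point. */
    assert(written >= 0);
    return length - written;
}

/* Hypothetical helper: the value an allocator taking size_t sees when
 * it is handed a signed length. A negative length wraps around to a
 * value close to the maximum of size_t. */
static size_t as_alloc_size(long n)
{
    return (size_t) n;
}

/* Returns 1 if malloc() can satisfy a request of n bytes, 0 otherwise.
 * The volatile pointer keeps the compiler from optimizing the
 * malloc()/free() pair away. */
static int can_allocate(size_t n)
{
    void *volatile p = malloc(n);

    if (p == NULL)
        return 0;
    free(p);
    return 1;
}
```

With these helpers, can_allocate(as_alloc_size(remaining_bytes(100, 40))) asks for 60 bytes and should succeed, while remaining_bytes(40, 100) yields -60, which as_alloc_size() turns into a request of nearly the whole address space, so malloc() returns NULL no matter how much RAM and swap are free. If the value found on the xalloc() stack frame is of that magnitude, it would point to a bad length rather than a real memory shortage.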