syslog-ng-1.4.17 crashes
Hi! I am using syslog-ng 1.4.17 with libol 0.2.24 on a central log host running RedHat 7.3. It all worked fine so far, until the load on the logging servers got higher and higher, resulting in about 900MB Logs daily. Then, syslog-ng started to die randomly, apparently not connected to any particular load peaks (at least as far as I was able to check), just the normal inferno. After finding and eliminating that d*mn RedHat's 'ulimit -c 0' in the rc-script, I got several core dumps, which, when examined with gdb, all show the following backtrace: (gdb) bt #0 0x400530a1 in kill () from /lib/libc.so.6 #1 0x40052e99 in raise () from /lib/libc.so.6 #2 0x40054364 in abort () from /lib/libc.so.6 #3 0x080529e5 in fatal () #4 0x080530a7 in xalloc () #5 0x080531f7 in ol_string_alloc () #6 0x0805068f in c_format () #7 0x08053501 in do_flush () #8 0x0805162d in write_callback () #9 0x080511d7 in io_iter () #10 0x08049c45 in main_loop () #11 0x08049f81 in main () #12 0x400421c4 in __libc_start_main () from /lib/libc.so.6 As far as I know, malloc only returns NULL, if it was unable to allocate the requested memory. The machine has 1 GB physical RAM and another Gig of Swap space. I'm running the sar data collector, and at all times there were loads of free memory. Swap stays untouched, the machine is not doing much besides syslogging. At loss for any solution, I did a panic upgrade to 1.5.23 with libol 0.3.5 today, when syslog-ng died 3 times within 30 minutes. So far it runs stable, but I'll know more tomorrow. My questions: Is this a bug in the 1.4 series? Can I sleep well while running 1.5 (marked as 'development')? It is imperative for us that no messages, or at least as few as possible, are lost, for dealing with abuse requests and customer inquiries. TIA, -- Heinz Ekker, mailto:heinz.ekker@uta.at Do not meddle in the affairs | UTA Systems Engineering Unix of wizards, for they are | 1220 Wien, Ares Tower Zelle 619 subtle and quick to anger. | Phone: +43 1 9009 4625/Fax -3697
On Thu, Nov 14, 2002 at 05:03:47PM +0100, Heinz Ekker wrote:
Hi!
I am using syslog-ng 1.4.17 with libol 0.2.24 on a central log host running RedHat 7.3.
It all worked fine so far, until the load on the logging servers got higher and higher, resulting in about 900MB Logs daily. Then, syslog-ng started to die randomly, apparently not connected to any particular load peaks (at least as far as I was able to check), just the normal inferno.
After finding and eliminating that d*mn RedHat's 'ulimit -c 0' in the rc-script, I got several core dumps, which, when examined with gdb, all show the following backtrace:
(gdb) bt #0 0x400530a1 in kill () from /lib/libc.so.6 #1 0x40052e99 in raise () from /lib/libc.so.6 #2 0x40054364 in abort () from /lib/libc.so.6 #3 0x080529e5 in fatal () #4 0x080530a7 in xalloc () #5 0x080531f7 in ol_string_alloc () #6 0x0805068f in c_format () #7 0x08053501 in do_flush () #8 0x0805162d in write_callback () #9 0x080511d7 in io_iter () #10 0x08049c45 in main_loop () #11 0x08049f81 in main () #12 0x400421c4 in __libc_start_main () from /lib/libc.so.6
As far as I know, malloc only returns NULL, if it was unable to allocate the requested memory. The machine has 1 GB physical RAM and another Gig of Swap space. I'm running the sar data collector, and at all times there were loads of free memory. Swap stays untouched, the machine is not doing much besides syslogging.
At loss for any solution, I did a panic upgrade to 1.5.23 with libol 0.3.5 today, when syslog-ng died 3 times within 30 minutes. So far it runs stable, but I'll know more tomorrow.
My questions: Is this a bug in the 1.4 series? Can I sleep well while running 1.5 (marked as 'development')?
It is imperative for us that no messages, or at least as few as possible, are lost, for dealing with abuse requests and customer inquiries.
I don't know about this bug. the backtrace seems to indicate that this c_format() call is failing: item->packet = c_format("%s", s->length - res, s->data + res); res is the number of bytes returned by write(), s->length is the data block to write, s->data is the data to write it is checked that res is >= 0, and as it is signed the error indication (-1) doesn't count. s->length - res might be a big value if: 1) s->length < res this is not possible as res must be less than or equal to s->length 2) s->length itself is negative this doesn't seem to be possible either, and IMHO write() would return -1 in which case this code path is not touched. can you analyze the core a bit more? (it is no use to send it to me, as it might contain libc different from my system) gdb syslog-ng -c core (gdb) frame 4 this selects the frame of xalloc() now display part of the stack: p $ebp x/40 $ebp-20 I'll try to find how many bytes c_format_() wants to allocate. This might help to track down the problem. This code is different in libol 0.3 (thus in syslog-ng 1.5) so it might be more stable. 1.5.x itself seems to be solid (I don't know any pending problems now, other than minor cosmetic changes like the configure script) -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
On Thu, Nov 14, 2002 at 06:07:02PM +0100, Balazs Scheidler wrote:
can you analyze the core a bit more? (it is no use to send it to me, as it might contain libc different from my system)
gdb syslog-ng -c core (gdb) frame 4
this selects the frame of xalloc()
now display part of the stack:
p $ebp x/40 $ebp-20
I'll try to find how many bytes c_format_() wants to allocate. This might help to track down the problem.
Is that of any help for you? (gdb) frame 4 #4 0x080530a7 in xalloc () (gdb) p $ebp $1 = (void *) 0xbfffe868 (gdb) x/40 $ebp-20 0xbfffe854: 0x00000098 0x4014afa0 0x00000000 0xcedfcc12 0xbfffe864: 0xcedfcc12 0xbfffe888 0x080531f7 0xcedfcc21 0xbfffe874: 0x4014afa0 0xcedfcc12 0x0000003b 0x08075ce0 0xbfffe884: 0x0000003b 0xbfffe8b8 0x0805068f 0xcedfcc12 0xbfffe894: 0xbfffe8c4 0xbfffe8b8 0x4009f548 0x08075b70 0xbfffe8a4: 0x0806c224 0xbffff8e8 0x0000003b 0x08075ce0 0xbfffe8b4: 0x080817ac 0xbffff8f8 0x08053501 0x080540bd 0xbfffe8c4: 0xcedfcc12 0x3927911d 0x00000000 0x0806f5ac 0xbfffe8d4: 0x31203431 0x00001000 0x080691a8 0x3e32323c 0xbfffe8e4: 0x20766f4e 0x31203431 0x30303a33 0x2034323a
This code is different in libol 0.3 (thus in syslog-ng 1.5) so it might be more stable.
1.5.x itself seems to be solid (I don't know any pending problems now, other than minor cosmetic changes like the configure script)
So far I didn't experience any crashes, and judging by the MRTG graphs based on the syslog entries, it seems to cope with the load. Thanks for your help so far, he
On Thu, Nov 14, 2002 at 07:49:00PM +0100, Heinz Ekker wrote:
On Thu, Nov 14, 2002 at 06:07:02PM +0100, Balazs Scheidler wrote:
can you analyze the core a bit more? (it is no use to send it to me, as it might contain libc different from my system)
gdb syslog-ng -c core (gdb) frame 4
this selects the frame of xalloc()
now display part of the stack:
p $ebp x/40 $ebp-20
I'll try to find how many bytes c_format_() wants to allocate. This might help to track down the problem.
Is that of any help for you?
(gdb) frame 4 #4 0x080530a7 in xalloc () (gdb) p $ebp $1 = (void *) 0xbfffe868 (gdb) x/40 $ebp-20 0xbfffe854: 0x00000098 0x4014afa0 0x00000000 0xcedfcc12 0xbfffe864: 0xcedfcc12 0xbfffe888 0x080531f7 0xcedfcc21 0xbfffe874: 0x4014afa0 0xcedfcc12 0x0000003b 0x08075ce0 0xbfffe884: 0x0000003b 0xbfffe8b8 0x0805068f 0xcedfcc12 0xbfffe894: 0xbfffe8c4 0xbfffe8b8 0x4009f548 0x08075b70 0xbfffe8a4: 0x0806c224 0xbffff8e8 0x0000003b 0x08075ce0 0xbfffe8b4: 0x080817ac 0xbffff8f8 0x08053501 0x080540bd 0xbfffe8c4: 0xcedfcc12 0x3927911d 0x00000000 0x0806f5ac 0xbfffe8d4: 0x31203431 0x00001000 0x080691a8 0x3e32323c 0xbfffe8e4: 0x20766f4e 0x31203431 0x30303a33 0x2034323a
hmm.. as it seems the length c_format() wants to allocate is 0xcedfcc21 bytes, cca 3.5GB, somehow the log entry to be written is completely bogus. do_flush() calls c_format with a format "%s", string length 0xcedfcc12 and data pointer 0x3927911d, if the allocation would succeed, syslog-ng would crash anyway, when trying to dereference 0x3927911d. The funny thing is that A_WRITE() is called with exactly these parameters, and it didn't crash, so something changed the data between A_WRITE() and c_format(). Can you send me your syslog-ng binary and the core file in private? Maybe I can do some more analyisis. (I hope core files don't depend on libc version, or if they do, it would still show some valid data) -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
On Fri, Nov 15, 2002 at 10:24:24AM +0100, Balazs Scheidler wrote:
On Thu, Nov 14, 2002 at 07:49:00PM +0100, Heinz Ekker wrote:
On Thu, Nov 14, 2002 at 06:07:02PM +0100, Balazs Scheidler wrote:
hmm.. as it seems the length c_format() wants to allocate is 0xcedfcc21 bytes, cca 3.5GB, somehow the log entry to be written is completely bogus.
do_flush() calls c_format with a format "%s", string length 0xcedfcc12 and data pointer 0x3927911d, if the allocation would succeed, syslog-ng would crash anyway, when trying to dereference 0x3927911d.
The funny thing is that A_WRITE() is called with exactly these parameters, and it didn't crash, so something changed the data between A_WRITE() and c_format().
Can you send me your syslog-ng binary and the core file in private? Maybe I can do some more analyisis. (I hope core files don't depend on libc version, or if they do, it would still show some valid data)
I'm finished with my analysis. First of all you have linked syslog-ng 1.4 against libol 0.3. This is _not_ recommended (and can only be done by patching syslog-ng configure script) I've checked out why the whole thing might have crashed anyway, but there were no obvious reasons. One of the local variables (move) in the do_flush() function contains a bogus value. I've tried to reproduce it with syslog-ng 1.5 without success. The disassembly of your do_flush function shows (at least to me and I was watching it quite a lot) that you might have compiled libol with a buggy compiler. I might be wrong, but the memory slot referenced when calling c_format() is not initialized by the assembly function in any way (it is but in a later part of the function), but it is initialized by the C code. Don't you happen to be using the buggy gcc 2.96 release from RedHat? The relevant part is: ; if (item->packet->length != move) 0x80534e6 <do_flush+94>: je 0x8053510 <do_flush+136> ;; not equal 0x80534e8 <do_flush+96>: push %ecx ;; ecx := move, this slot is used for the local variable res as well! 0x80534e9 <do_flush+97>: mov 0xffffefdc(%ebp),%ecx ;; offset into the string (s->data + move) 0x80534ef <do_flush+103>: lea 0xc(%ecx,%esi,1),%eax 0x80534f3 <do_flush+107>: push %eax ;; calculate (s->length - move), $ecx is move, $edx is s->length 0x80534f4 <do_flush+108>: sub %ecx,%edx 0x80534f6 <do_flush+110>: push %edx ;; format string ("%s") 0x80534f7 <do_flush+111>: push $0x80540bd 0x80534fc <do_flush+116>: call 0x8050670 <c_format> as you can see the local variable $ebp+0xffffefdc is stored in $ecx as the value of the local variable 'move', and $ebp+0xffffefdc gets its first value outside the FOR_QUEUE loop. The funny thing is that this same slot is used for the local variable res as well, as the return value of A_WRITE is stored there. Here's a full disassembly of the do_flush function, annotated with my comments: (gdb) disassemble do_flush Dump of assembler code for function do_flush: 0x8053488 <do_flush>: push %ebp 0x8053489 <do_flush+1>: mov %esp,%ebp 0x805348b <do_flush+3>: push %edi 0x805348c <do_flush+4>: push %esi 0x805348d <do_flush+5>: push %ebx 0x805348e <do_flush+6>: sub $0x1024,%esp 0x8053494 <do_flush+12>: pushl 0x8(%ebp) 0x8053497 <do_flush+15>: push $0x8058e00 0x805349c <do_flush+20>: call 0x8053250 <ol_object_check> 0x80534a1 <do_flush+25>: mov %eax,0xffffefe4(%ebp) 0x80534a7 <do_flush+31>: movl $0x0,0xffffefe0(%ebp) 0x80534b1 <do_flush+41>: mov 0x28(%eax),%edi 0x80534b4 <do_flush+44>: add $0x10,%esp ; jump to check the condition 0x80534b7 <do_flush+47>: jmp 0x805353c <do_flush+180> ; body of for loop starts here (FOR_QUEUE) 0x80534bc <do_flush+52>: push %esi 0x80534bd <do_flush+53>: push %ebx 0x80534be <do_flush+54>: add $0xc,%eax 0x80534c1 <do_flush+57>: mov 0xffffefe0(%ebp),%ecx 0x80534c7 <do_flush+63>: lea 0xffffefe8(%ecx,%ebp,1),%edx 0x80534ce <do_flush+70>: push %eax 0x80534cf <do_flush+71>: push %edx 0x80534d0 <do_flush+72>: call 0x8049688 <memcpy> 0x80534d5 <do_flush+77>: add %ebx,0xffffefe0(%ebp) 0x80534db <do_flush+83>: mov 0x8(%edi),%esi 0x80534de <do_flush+86>: mov 0x8(%esi),%edx 0x80534e1 <do_flush+89>: add $0x10,%esp 0x80534e4 <do_flush+92>: cmp %ebx,%edx ; if (item->packet->length != move) 0x80534e6 <do_flush+94>: je 0x8053510 <do_flush+136> ;; not equal 0x80534e8 <do_flush+96>: push %ecx ;; ecx := move, this slot is used for the local variable res as well! 0x80534e9 <do_flush+97>: mov 0xffffefdc(%ebp),%ecx ;; offset into the string (s->data + move) 0x80534ef <do_flush+103>: lea 0xc(%ecx,%esi,1),%eax 0x80534f3 <do_flush+107>: push %eax ;; calculate (s->length - move), $ecx is move, $edx is s->length 0x80534f4 <do_flush+108>: sub %ecx,%edx 0x80534f6 <do_flush+110>: push %edx ;; format string ("%s") 0x80534f7 <do_flush+111>: push $0x80540bd 0x80534fc <do_flush+116>: call 0x8050670 <c_format> 0x8053501 <do_flush+121>: mov %eax,0x8(%edi) 0x8053504 <do_flush+124>: mov %esi,(%esp,1) 0x8053507 <do_flush+127>: call 0x805320c <ol_string_free> 0x805350c <do_flush+132>: jmp 0x8053533 <do_flush+171> 0x805350e <do_flush+134>: mov %esi,%esi ;; equal (item->packet->length == move) 0x8053510 <do_flush+136>: mov 0xffffefe4(%ebp),%eax 0x8053516 <do_flush+142>: decl 0x38(%eax) 0x8053519 <do_flush+145>: sub $0xc,%esp 0x805351c <do_flush+148>: push %edi 0x805351d <do_flush+149>: call 0x8052a74 <ol_queue_remove> 0x8053522 <do_flush+154>: pop %eax 0x8053523 <do_flush+155>: pushl 0x8(%edi) 0x8053526 <do_flush+158>: call 0x805320c <ol_string_free> 0x805352b <do_flush+163>: mov %edi,(%esp,1) 0x805352e <do_flush+166>: call 0x805334c <ol_space_free> 0x8053533 <do_flush+171>: add $0x10,%esp 0x8053536 <do_flush+174>: mov 0xffffefd8(%ebp),%edi ; check for condition (FOR_QUEUE) 0x805353c <do_flush+180>: mov (%edi),%eax 0x805353e <do_flush+182>: test %eax,%eax 0x8053540 <do_flush+184>: mov %eax,0xffffefd8(%ebp) 0x8053546 <do_flush+190>: je 0x8053567 <do_flush+223> ; edx := 4096 (sizeof(buf)) 0x8053548 <do_flush+192>: mov $0x1000,%edx ; eax := item->packet->length 0x805354d <do_flush+197>: mov 0x8(%edi),%eax ; 0x8053550 <do_flush+200>: sub 0xffffefe0(%ebp),%edx 0x8053556 <do_flush+206>: mov 0x8(%eax),%ebx 0x8053559 <do_flush+209>: cmp %edx,%ebx 0x805355b <do_flush+211>: jbe 0x805355f <do_flush+215> 0x805355d <do_flush+213>: mov %edx,%ebx 0x805355f <do_flush+215>: test %edx,%edx 0x8053561 <do_flush+217>: jne 0x80534bc <do_flush+52> ; end of FOR_QUEUE loop 0x8053567 <do_flush+223>: push %esi 0x8053568 <do_flush+224>: lea 0xffffefe8(%ebp),%esi 0x805356e <do_flush+230>: push %esi 0x805356f <do_flush+231>: pushl 0xffffefe0(%ebp) 0x8053575 <do_flush+237>: pushl 0xc(%ebp) 0x8053578 <do_flush+240>: mov 0xc(%ebp),%edx ; call to A_WRITE 0x805357b <do_flush+243>: call *0xc(%edx) 0x805357e <do_flush+246>: add $0x10,%esp 0x8053581 <do_flush+249>: test %eax,%eax ; move := eax 0x8053583 <do_flush+251>: mov %eax,0xffffefdc(%ebp) 0x8053589 <do_flush+257>: js 0x8053628 <do_flush+416> 0x805358f <do_flush+263>: cmp %eax,0xffffefe0(%ebp) 0x8053595 <do_flush+269>: jne 0x80535b4 <do_flush+300> 0x8053597 <do_flush+271>: mov 0xffffefe4(%ebp),%ecx 0x805359d <do_flush+277>: mov 0x14(%ecx),%eax 0x80535a0 <do_flush+280>: test %eax,%eax 0x80535a2 <do_flush+282>: je 0x805363c <do_flush+436> 0x80535a8 <do_flush+288>: movl $0x1,(%eax) 0x80535ae <do_flush+294>: jmp 0x805363c <do_flush+436> 0x80535b3 <do_flush+299>: nop 0x80535b4 <do_flush+300>: mov 0xffffefdc(%ebp),%ebx 0x80535ba <do_flush+306>: test %ebx,%ebx 0x80535bc <do_flush+308>: je 0x805363c <do_flush+436> 0x80535be <do_flush+310>: sub $0xc,%esp 0x80535c1 <do_flush+313>: push $0xc 0x80535c3 <do_flush+315>: call 0x805332c <ol_space_alloc> 0x80535c8 <do_flush+320>: mov 0xffffefdc(%ebp),%edx 0x80535ce <do_flush+326>: add $0xc,%esp 0x80535d1 <do_flush+329>: mov %eax,%ebx 0x80535d3 <do_flush+331>: lea (%edx,%esi,1),%eax 0x80535d6 <do_flush+334>: push %eax 0x80535d7 <do_flush+335>: sub %edx,0xffffefe0(%ebp) 0x80535dd <do_flush+341>: pushl 0xffffefe0(%ebp) 0x80535e3 <do_flush+347>: push $0x80540bd 0x80535e8 <do_flush+352>: call 0x8050670 <c_format> 0x80535ed <do_flush+357>: mov %eax,0x8(%ebx) 0x80535f0 <do_flush+360>: pop %edx 0x80535f1 <do_flush+361>: pop %ecx 0x80535f2 <do_flush+362>: mov 0xffffefe4(%ebp),%eax 0x80535f8 <do_flush+368>: add $0x28,%eax 0x80535fb <do_flush+371>: push %ebx 0x80535fc <do_flush+372>: push %eax 0x80535fd <do_flush+373>: call 0x8052a38 <ol_queue_add_head> 0x8053602 <do_flush+378>: mov 0xffffefe4(%ebp),%ecx 0x8053608 <do_flush+384>: mov 0x38(%ecx),%eax 0x805360b <do_flush+387>: inc %eax 0x805360c <do_flush+388>: add $0x10,%esp 0x805360f <do_flush+391>: cmp 0x3c(%ecx),%eax 0x8053612 <do_flush+394>: mov %eax,0x38(%ecx) 0x8053615 <do_flush+397>: jne 0x805363c <do_flush+436> 0x8053617 <do_flush+399>: mov 0x14(%ecx),%eax 0x805361a <do_flush+402>: test %eax,%eax 0x805361c <do_flush+404>: je 0x805363c <do_flush+436> 0x805361e <do_flush+406>: movl $0x0,(%eax) 0x8053624 <do_flush+412>: jmp 0x805363c <do_flush+436> 0x8053626 <do_flush+414>: mov %esi,%esi 0x8053628 <do_flush+416>: sub $0xc,%esp 0x805362b <do_flush+419>: push $0x8057c20 0x8053630 <do_flush+424>: call 0x805293c <werror> 0x8053635 <do_flush+429>: mov $0x4,%eax 0x805363a <do_flush+434>: jmp 0x805363e <do_flush+438> 0x805363c <do_flush+436>: xor %eax,%eax 0x805363e <do_flush+438>: lea 0xfffffff4(%ebp),%esp 0x8053641 <do_flush+441>: pop %ebx 0x8053642 <do_flush+442>: pop %esi 0x8053643 <do_flush+443>: pop %edi 0x8053644 <do_flush+444>: pop %ebp 0x8053645 <do_flush+445>: ret 0x8053646 <do_flush+446>: mov %esi,%esi End of assembler dump. -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
On Tue, Nov 19, 2002 at 06:14:58PM +0100, Balazs Scheidler wrote:
First of all you have linked syslog-ng 1.4 against libol 0.3. This is _not_ recommended (and can only be done by patching syslog-ng configure script)
Oh! If I did, it must have happened by accident - but I certainly did not change anything in the configure-script.
The disassembly of your do_flush function shows (at least to me and I was watching it quite a lot) that you might have compiled libol with a buggy compiler. I might be wrong, but the memory slot referenced when calling c_format() is not initialized by the assembly function in any way (it is but in a later part of the function), but it is initialized by the C code.
Don't you happen to be using the buggy gcc 2.96 release from RedHat?
Now that you mention it: [root@goldfinger redhat]# gcc -v Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-110)
The relevant part is: [...] as you can see
No, I don't ;-) I'm just a regular sysadmin, and can't make head nor tail of assembly language, but thanks anyway - sounds like a brilliant analysis. And sorry to have wasted your time... DeadRat can be a pain in the a**, sometimes. Heinz Ekker
participants (2)
-
Balazs Scheidler
-
Heinz Ekker