Problems with recovering from 'disk full' I/O error
Hi all, I've come across a situation where syslog-ng (3.3.3 and 3.3.9) aborts after trying to write to a disk that has no space and then is cleaned up. I've been able to reproduce the 'error' condition with the following Set up: In the config file: ########################################################### # test log destination # filter f_test{program("LOGID_99-*" type("glob"));}; destination d_test { file("/var/testpartition/test.log" perm(0644) flags(no-multi-line)); }; log { source(s_local); filter(f_test); destination(d_test); flags(final); }; 1) Fill up the '/var/testpartition' disk: root@device:/var/testpartition# dd if=/dev/zero of=/var/testpartition/foo bs=1M dd: writing '/var/testpartition/foo': No space left on device 3531+0 records in 3529+1 records out 2) Send messages ('TEST: pre I/O error')to 'test.log' until syslog-ng complains: 2013-06-14 22:41:09.623 [device] [syslog-ng-ERROR:] "I/O error occurred while writing; fd='22', error='No space left on device (28)'" 2013-06-14 22:41:09.623 [device] [syslog-ng-NOTICE:] "Suspending write operation because of an I/O error; fd='22', time_reopen='60'" 3) Restart syslog-ng with a SIGHUP root@device:/var/testpartition# kill -s HUP $(cat /var/run/syslog-ng.pid) Program received signal SIGHUP, Hangup. 0x00007f38ee8dad63 in __epoll_wait_nocancel () from /xxx/xxx/xxx/xxx/xxx/xxx/lib/libc.so.6 (gdb) continue Continuing. 4) Send another message ('TEST: post I/O error; post SIGHUP')to syslog-ng (it complains): 2013-06-14 22:41:37.147 [device] [syslog-ng-NOTICE:] "Configuration reload request received, reloading configuration;" 2013-06-14 22:41:59.610 [device] [syslog-ng-ERROR:] "I/O error occurred while writing; fd='23', error='No space left on device (28)'" 2013-06-14 22:41:59.610 [device] [syslog-ng-NOTICE:] "Suspending write operation because of an I/O error; fd='23', time_reopen='60'" 5) Clean up '/var/testpartition' disk: root@device:/var/testpartition# rm foo 6) Send another message ('TEST: post I/O error; post SIGHUP 2'); syslog-ng aborts: Program received signal SIGABRT, Aborted. 0x00007f38ee83eda5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. in ../nptl/sysdeps/unix/sysv/linux/raise.c (gdb) bt #0 0x00007f38ee83eda5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007f38ee8402c3 in abort () at abort.c:88 #2 0x00007f38ee87af18 in __libc_message (do_abort=2, fmt=0x7f38ee927cc8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170 #3 0x00007f38ee8804b8 in malloc_printerr (action=2, str=0x7f38ee927dd0 "double free or corruption (!prev)", ptr=<value optimized out>) at malloc.c:5891 #4 0x00007f38ee8825c6 in __libc_free (mem=0x7f38ee91e820) at malloc.c:3626 #5 0x00007f38ef7d989a in ?? () #6 0x0000000000401800 in ?? () at elf-init.c:99 #7 0x00007f38efcf0257 in log_proto_file_writer_flush (s=0x674de0) at logproto.c:306 #8 0x00007f38efcf04a8 in log_proto_file_writer_post (s=0x674de0, msg=0x6558a0 "2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2\n", msg_len=89, consumed=0x7fffa53e570c) at logproto.c:380 #9 0x00007f38efcfc274 in log_proto_post (s=0x674de0, msg=0x6558a0 "2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2\n", msg_len=89, consumed=0x7fffa53e570c) at logproto.h:95 #10 0x00007f38efcfc095 in log_writer_flush (self=0x655a30, flush_mode=LW_FLUSH_NORMAL) at logwriter.c:983 #11 0x00007f38efcf9be3 in log_writer_work_perform (s=0x655a30) at logwriter.c:129 #12 0x00007f38efcf9efe in log_writer_io_flush_output (s=0x655a30) at logwriter.c:198 #13 0x00007f38efd238b5 in iv_run_tasks (st=0x6036a0) at iv_task.c:46 #14 0x00007f38efd22a86 in iv_main () at iv_main.c:266 #15 0x00007f38efcfedf9 in main_loop_run () at mainloop.c:736 #16 0x00000000004017c9 in main (argc=1, argv=0x7fffa53e5968) at main.c:263 (gdb) The last two test messages are being written (run together with the last partial message that generated the 'disk full' I/O error): root@device:/var/testpartition# tail test.log 2013-06-14T22:40:58.140+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:40:59.150+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:00.180+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:01.176+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:02.170+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:03.010+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:03.930+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:05.040+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:06.180+00:00 device LOGID_99-INFO: TEST: pre I/O error 2013-06-14T22:41:07.140+00:00 device LOGID_99-INFO: T2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2 root@device:/var/testpartition# Thanks, Chris ---------------------------------------- Christopher Johnson chris.johnson3@hp.com<mailto:chris.johnson3@hp.com> HP Software - Security Product Group (916) 785-2817 ----------------------------------------
Thanks for the detailed report. Seems to be a use after free or double free in the error path. I'll have a closer look. On Jun 15, 2013 2:15 AM, "Johnson, Chris (HP TippingPoint Roseville)" < chris.johnson3@hp.com> wrote:
Hi all,****
I've come across a situation where syslog-ng (3.3.3 and 3.3.9) aborts after trying to write to a disk that has no space and then is cleaned up.* ***
** **
I've been able to reproduce the 'error' condition with the following Set up:****
In the config file:****
###########################################################****
# test log destination****
#****
filter f_test{program("LOGID_99-*" type("glob"));};****
destination d_test { file("/var/testpartition/test.log" perm(0644) flags(no-multi-line)); };****
log {****
source(s_local);****
filter(f_test);****
destination(d_test);****
flags(final);****
};****
** **
**1) **Fill up the '/var/testpartition' disk:****
root@device:/var/testpartition# dd if=/dev/zero of=/var/testpartition/foo bs=1M****
dd: writing '/var/testpartition/foo': No space left on device****
3531+0 records in****
3529+1 records out****
**2) **Send messages (*'TEST: pre I/O error'*)to 'test.log' until syslog-ng complains:****
2013-06-14 22:41:09.623 [device] [syslog-ng-ERROR:] "I/O error occurred while writing; fd='22', error='No space left on device (28)'"****
2013-06-14 22:41:09.623 [device] [syslog-ng-NOTICE:] "Suspending write operation because of an I/O error; fd='22', time_reopen='60'"****
**3) **Restart syslog-ng with a SIGHUP****
root@device:/var/testpartition# kill -s HUP $(cat /var/run/syslog-ng.pid)* ***
** **
Program received signal SIGHUP, Hangup.****
0x00007f38ee8dad63 in __epoll_wait_nocancel () from /xxx/xxx/xxx/xxx/xxx/xxx/lib/libc.so.6****
(gdb) continue****
Continuing.****
**4) **Send another message (*'TEST: post I/O error; post SIGHUP'*)to syslog-ng (it complains):****
2013-06-14 22:41:37.147 [device] [syslog-ng-NOTICE:] "Configuration reload request received, reloading configuration;"****
2013-06-14 22:41:59.610 [device] [syslog-ng-ERROR:] "I/O error occurred while writing; fd='23', error='No space left on device (28)'"****
2013-06-14 22:41:59.610 [device] [syslog-ng-NOTICE:] "Suspending write operation because of an I/O error; fd='23', time_reopen='60'"****
**5) **Clean up '/var/testpartition' disk:****
root@device:/var/testpartition# rm foo****
**6) **Send another message (*'TEST: post I/O error; post SIGHUP 2'*); syslog-ng aborts:****
Program received signal SIGABRT, Aborted.****
0x00007f38ee83eda5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64****
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. ****
in ../nptl/sysdeps/unix/sysv/linux/raise.c****
(gdb) bt****
#0 0x00007f38ee83eda5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64****
#1 0x00007f38ee8402c3 in abort () at abort.c:88****
#2 0x00007f38ee87af18 in __libc_message (do_abort=2, fmt=0x7f38ee927cc8 "*** glibc detected *** %s: %s: 0x%s ***\n")****
at ../sysdeps/unix/sysv/linux/libc_fatal.c:170****
#3 0x00007f38ee8804b8 in malloc_printerr (action=2, str=0x7f38ee927dd0 "double free or corruption (!prev)",****
ptr=<value optimized out>) at malloc.c:5891****
#4 0x00007f38ee8825c6 in __libc_free (mem=0x7f38ee91e820) at malloc.c:3626 ****
#5 0x00007f38ef7d989a in ?? ()****
#6 0x0000000000401800 in ?? () at elf-init.c:99****
#7 0x00007f38efcf0257 in log_proto_file_writer_flush (s=0x674de0) at logproto.c:306****
#8 0x00007f38efcf04a8 in log_proto_file_writer_post (s=0x674de0,****
msg=0x6558a0 "2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2\n", msg_len=89,****
consumed=0x7fffa53e570c) at logproto.c:380****
#9 0x00007f38efcfc274 in log_proto_post (s=0x674de0,****
msg=0x6558a0 "2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2\n", msg_len=89,****
consumed=0x7fffa53e570c) at logproto.h:95****
#10 0x00007f38efcfc095 in log_writer_flush (self=0x655a30, flush_mode=LW_FLUSH_NORMAL) at logwriter.c:983****
#11 0x00007f38efcf9be3 in log_writer_work_perform (s=0x655a30) at logwriter.c:129****
#12 0x00007f38efcf9efe in log_writer_io_flush_output (s=0x655a30) at logwriter.c:198****
#13 0x00007f38efd238b5 in iv_run_tasks (st=0x6036a0) at iv_task.c:46****
#14 0x00007f38efd22a86 in iv_main () at iv_main.c:266****
#15 0x00007f38efcfedf9 in main_loop_run () at mainloop.c:736****
#16 0x00000000004017c9 in main (argc=1, argv=0x7fffa53e5968) at main.c:263 ****
(gdb)****
** **
The last two test messages *are* being written (run together with the last partial message that generated the 'disk full' I/O error):****
root@device:/var/testpartition# tail test.log****
2013-06-14T22:40:58.140+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:40:59.150+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:00.180+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:01.176+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:02.170+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:03.010+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:03.930+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:05.040+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:06.180+00:00 device LOGID_99-INFO: TEST: pre I/O error*** *
2013-06-14T22:41:07.140+00:00 device LOGID_99-INFO: T2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2****
root@device:/var/testpartition#****
** **
Thanks,****
Chris****
----------------------------------------****
Christopher Johnson****
chris.johnson3@hp.com****
HP Software - Security Product Group****
(916) 785-2817****
----------------------------------------****
** **
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
Hi, I have pushed the fix on the hotfix/3.4/full-disk-crash-fix branch on the github repository. Hopefully Algernon picks it up soon enough to master, but I'm not sure when he plans a release in 3.4 On Sat, Jun 15, 2013 at 7:06 PM, Balazs Scheidler <bazsi77@gmail.com> wrote:
Thanks for the detailed report. Seems to be a use after free or double free in the error path. I'll have a closer look. On Jun 15, 2013 2:15 AM, "Johnson, Chris (HP TippingPoint Roseville)" < chris.johnson3@hp.com> wrote:
Hi all,****
I've come across a situation where syslog-ng (3.3.3 and 3.3.9) aborts after trying to write to a disk that has no space and then is cleaned up. ****
** **
I've been able to reproduce the 'error' condition with the following Set up:****
In the config file:****
###########################################################****
# test log destination****
#****
filter f_test{program("LOGID_99-*" type("glob"));};****
destination d_test { file("/var/testpartition/test.log" perm(0644) flags(no-multi-line)); };****
log {****
source(s_local);****
filter(f_test);****
destination(d_test);****
flags(final);****
};****
** **
**1) **Fill up the '/var/testpartition' disk:****
root@device:/var/testpartition# dd if=/dev/zero of=/var/testpartition/foo bs=1M****
dd: writing '/var/testpartition/foo': No space left on device****
3531+0 records in****
3529+1 records out****
**2) **Send messages (*'TEST: pre I/O error'*)to 'test.log' until syslog-ng complains:****
2013-06-14 22:41:09.623 [device] [syslog-ng-ERROR:] "I/O error occurred while writing; fd='22', error='No space left on device (28)'"****
2013-06-14 22:41:09.623 [device] [syslog-ng-NOTICE:] "Suspending write operation because of an I/O error; fd='22', time_reopen='60'"****
**3) **Restart syslog-ng with a SIGHUP****
root@device:/var/testpartition# kill -s HUP $(cat /var/run/syslog-ng.pid) ****
** **
Program received signal SIGHUP, Hangup.****
0x00007f38ee8dad63 in __epoll_wait_nocancel () from /xxx/xxx/xxx/xxx/xxx/xxx/lib/libc.so.6****
(gdb) continue****
Continuing.****
**4) **Send another message (*'TEST: post I/O error; post SIGHUP'*)to syslog-ng (it complains):****
2013-06-14 22:41:37.147 [device] [syslog-ng-NOTICE:] "Configuration reload request received, reloading configuration;"****
2013-06-14 22:41:59.610 [device] [syslog-ng-ERROR:] "I/O error occurred while writing; fd='23', error='No space left on device (28)'"****
2013-06-14 22:41:59.610 [device] [syslog-ng-NOTICE:] "Suspending write operation because of an I/O error; fd='23', time_reopen='60'"****
**5) **Clean up '/var/testpartition' disk:****
root@device:/var/testpartition# rm foo****
**6) **Send another message (*'TEST: post I/O error; post SIGHUP 2'* ); syslog-ng aborts:****
Program received signal SIGABRT, Aborted.****
0x00007f38ee83eda5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64****
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.****
in ../nptl/sysdeps/unix/sysv/linux/raise.c****
(gdb) bt****
#0 0x00007f38ee83eda5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64****
#1 0x00007f38ee8402c3 in abort () at abort.c:88****
#2 0x00007f38ee87af18 in __libc_message (do_abort=2, fmt=0x7f38ee927cc8 "*** glibc detected *** %s: %s: 0x%s ***\n")****
at ../sysdeps/unix/sysv/linux/libc_fatal.c:170****
#3 0x00007f38ee8804b8 in malloc_printerr (action=2, str=0x7f38ee927dd0 "double free or corruption (!prev)",****
ptr=<value optimized out>) at malloc.c:5891****
#4 0x00007f38ee8825c6 in __libc_free (mem=0x7f38ee91e820) at malloc.c:3626****
#5 0x00007f38ef7d989a in ?? ()****
#6 0x0000000000401800 in ?? () at elf-init.c:99****
#7 0x00007f38efcf0257 in log_proto_file_writer_flush (s=0x674de0) at logproto.c:306****
#8 0x00007f38efcf04a8 in log_proto_file_writer_post (s=0x674de0,****
msg=0x6558a0 "2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2\n", msg_len=89,****
consumed=0x7fffa53e570c) at logproto.c:380****
#9 0x00007f38efcfc274 in log_proto_post (s=0x674de0,****
msg=0x6558a0 "2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2\n", msg_len=89,****
consumed=0x7fffa53e570c) at logproto.h:95****
#10 0x00007f38efcfc095 in log_writer_flush (self=0x655a30, flush_mode=LW_FLUSH_NORMAL) at logwriter.c:983****
#11 0x00007f38efcf9be3 in log_writer_work_perform (s=0x655a30) at logwriter.c:129****
#12 0x00007f38efcf9efe in log_writer_io_flush_output (s=0x655a30) at logwriter.c:198****
#13 0x00007f38efd238b5 in iv_run_tasks (st=0x6036a0) at iv_task.c:46****
#14 0x00007f38efd22a86 in iv_main () at iv_main.c:266****
#15 0x00007f38efcfedf9 in main_loop_run () at mainloop.c:736****
#16 0x00000000004017c9 in main (argc=1, argv=0x7fffa53e5968) at main.c:263 ****
(gdb)****
** **
The last two test messages *are* being written (run together with the last partial message that generated the 'disk full' I/O error):****
root@device:/var/testpartition# tail test.log****
2013-06-14T22:40:58.140+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:40:59.150+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:00.180+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:01.176+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:02.170+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:03.010+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:03.930+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:05.040+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:06.180+00:00 device LOGID_99-INFO: TEST: pre I/O error** **
2013-06-14T22:41:07.140+00:00 device LOGID_99-INFO: T2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2013-06-14T22:42:30.476+00:00 device LOGID_99-INFO: TEST: post I/O error; post SIGHUP 2****
root@device:/var/testpartition#****
** **
Thanks,****
Chris****
----------------------------------------****
Christopher Johnson****
chris.johnson3@hp.com****
HP Software - Security Product Group****
(916) 785-2817****
----------------------------------------****
** **
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
-- Bazsi
Balazs Scheidler <bazsi77@gmail.com> writes:
Hi,
I have pushed the fix on the hotfix/3.4/full-disk-crash-fix branch on the github repository. Hopefully Algernon picks it up soon enough to master, but I'm not sure when he plans a release in 3.4
I'll pick it to 3.4 master in a bit, and I plan to do a 3.4 release early next week. Thanks! -- |8]
participants (3)
-
Balazs Scheidler
-
Gergely Nagy
-
Johnson, Chris (HP TippingPoint Roseville)