[Bug 161] New: Syslog-ng stops processing after reload
https://bugzilla.balabit.com/show_bug.cgi?id=161

           Summary: Syslog-ng stops processing after reload
           Product: syslog-ng
           Version: 3.3.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: unspecified
         Component: syslog-ng
        AssignedTo: bazsi@balabit.hu
        ReportedBy: bugzilla.syslogng@feystorm.net
Type of the Report: ---
   Estimated Hours: 0.0

syslog-ng version: 3.3.4
OS: RHEL 6.2

Whenever I issue a reload to our syslog-ng instance, syslog-ng stops
processing messages and just sits there doing nothing. The issue only seems
to occur when the server is processing a high log volume.

This server is part of a 2-node active/passive HA group, so we have two
servers with exactly the same configuration. Whenever I issue the reload on
the server that is not currently active, it works fine. If I then fail over,
the server that was previously able to reload fine now hangs. The active
server in this HA group processes around 35,000 messages per second.

I haven't posted the config because it is fairly large, but I can if
desired. The most notable thing in it is that threading is enabled.

Below is the backtrace of syslog-ng stuck in this state. I took a gcore, so
I can gather any additional information that is needed. Also, if I try to do
a full backtrace ('where full') in gdb on any thread other than thread 5,
gdb segfaults. Not sure if that has any relevance or not.

(gdb) info threads
  5 Thread 0x7fb9c654ab00 (LWP 11040)  0x000000353cedd967 in writev () from /lib64/libc.so.6
  4 Thread 0x7fb9c25fb700 (LWP 11049)  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
  3 Thread 0x7fb9c2ffc700 (LWP 11048)  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
  2 Thread 0x7fb9c39fd700 (LWP 11046)  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
* 1 Thread 0x7fb9c43fe700 (LWP 11045)  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
(gdb) where
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb9c65cfa3e in iv_epoll_poll (numfds=1, active=0x7fb9c43fb9a0, msec=10000) at iv_method_epoll.c:73
#2  0x00007fb9c65cf1bb in iv_main () at iv_main.c:265
#3  0x00007fb9c65d3c36 in iv_work_thread (_thr=0x14fc610) at iv_work.c:196
#4  0x00007fb9c65d42a8 in iv_thread_handler (_thr=0x14fc6e0) at iv_thread.c:100
#5  0x000000353d2077f1 in start_thread () from /lib64/libpthread.so.0
#6  0x000000353cee570d in clone () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fb9c39fd700 (LWP 11046))]
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
(gdb) where
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb9c65cfa3e in iv_epoll_poll (numfds=1, active=0x7fb9c39fa9a0, msec=10000) at iv_method_epoll.c:73
#2  0x00007fb9c65cf1bb in iv_main () at iv_main.c:265
#3  0x00007fb9c65d3c36 in iv_work_thread (_thr=0x14fc8e0) at iv_work.c:196
#4  0x00007fb9c65d42a8 in iv_thread_handler (_thr=0x14fc9b0) at iv_thread.c:100
#5  0x000000353d2077f1 in start_thread () from /lib64/libpthread.so.0
#6  0x000000353cee570d in clone () from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fb9c2ffc700 (LWP 11048))]
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
(gdb) where
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb9c65cfa3e in iv_epoll_poll (numfds=1, active=0x7fb9c2ff99a0, msec=10000) at iv_method_epoll.c:73
#2  0x00007fb9c65cf1bb in iv_main () at iv_main.c:265
#3  0x00007fb9c65d3c36 in iv_work_thread (_thr=0x15635c0) at iv_work.c:196
#4  0x00007fb9c65d42a8 in iv_thread_handler (_thr=0x14fc380) at iv_thread.c:100
#5  0x000000353d2077f1 in start_thread () from /lib64/libpthread.so.0
#6  0x000000353cee570d in clone () from /lib64/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fb9c25fb700 (LWP 11049))]
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
(gdb) where
#0  0x000000353cee5d03 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb9c65cfa3e in iv_epoll_poll (numfds=1, active=0x7fb9c25f89a0, msec=10000) at iv_method_epoll.c:73
#2  0x00007fb9c65cf1bb in iv_main () at iv_main.c:265
#3  0x00007fb9c65d3c36 in iv_work_thread (_thr=0x1566d20) at iv_work.c:196
#4  0x00007fb9c65d42a8 in iv_thread_handler (_thr=0x1566a70) at iv_thread.c:100
#5  0x000000353d2077f1 in start_thread () from /lib64/libpthread.so.0
#6  0x000000353cee570d in clone () from /lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fb9c654ab00 (LWP 11040))]
#0  0x000000353cedd967 in writev () from /lib64/libc.so.6
(gdb) where
#0  0x000000353cedd967 in writev () from /lib64/libc.so.6
#1  0x00007fb9c65820f8 in log_proto_file_writer_flush (s=0x171b5a0) at logproto.c:270
#2  0x00007fb9c658c403 in log_proto_flush (s=0x171b5a0) at logproto.h:86
#3  0x00007fb9c658ec3e in log_writer_flush (self=0x171adb0, flush_mode=LW_FLUSH_QUEUE) at logwriter.c:1032
#4  0x00007fb9c658f006 in log_writer_deinit (s=0x171adb0) at logwriter.c:1106
#5  0x00007fb9c63399d6 in log_pipe_deinit (s=0x171adb0) at ../../lib/logpipe.h:254
#6  0x00007fb9c633b3cd in affile_dw_deinit (s=0x171ac20) at affile.c:627
#7  0x00007fb9c63399d6 in log_pipe_deinit (s=0x171ac20) at ../../lib/logpipe.h:254
#8  0x00007fb9c633c211 in affile_dd_deinit_writer (key=0x171ad60, value=0x171ac20, user_data=0x0) at affile.c:974
#9  0x000000353e22a7a3 in g_hash_table_foreach () from /lib64/libglib-2.0.so.0
#10 0x00007fb9c633c349 in affile_dd_deinit (s=0x1539990) at affile.c:996
#11 0x00007fb9c6575d86 in log_pipe_deinit (s=0x1539990) at logpipe.h:254
#12 0x00007fb9c65761b4 in log_dest_group_deinit (s=0x1539ef0) at dgroup.c:76
#13 0x00007fb9c656f54c in log_pipe_deinit (s=0x1539ef0) at logpipe.h:254
#14 0x00007fb9c657078c in log_center_deinit (self=0x14f35c0) at center.c:576
#15 0x00007fb9c6571296 in cfg_deinit (cfg=0x14f1ca0) at cfg.c:261
#16 0x00007fb9c6590cac in main_loop_reload_config_apply () at mainloop.c:494
#17 0x00007fb9c6590aab in main_loop_io_worker_job_complete (self=0x156d160) at mainloop.c:417
#18 0x00007fb9c65d3ccc in iv_work_event (_pool=0x14f1bf0) at iv_work.c:221
#19 0x00007fb9c65d2862 in iv_event_run_pending_events (_dummy=0x0) at iv_event.c:67
#20 0x00007fb9c65d2cb5 in iv_event_raw_got_event (_this=0x7fb9c654aa18) at iv_event_raw.c:82
#21 0x00007fb9c65cefec in iv_run_active_list (active=0x7fff33ed17b0) at iv_main.c:219
#22 0x00007fb9c65cf1cc in iv_main () at iv_main.c:269
#23 0x00007fb9c65913e9 in main_loop_run () at mainloop.c:731
#24 0x00000000004016d0 in main ()
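For readers following along: thread 5 is the main thread, blocked in
writev() while the reload path (main_loop_reload_config_apply -> cfg_deinit
-> ... -> log_proto_file_writer_flush) flushes each destination's buffered
output before tearing down the old configuration. A minimal sketch of that
flush pattern follows, assuming a plain blocking descriptor; the helper
name, file path, and buffer contents here are hypothetical, not syslog-ng
source. The point is that writev() on a regular file has no timeout, so a
stalled backing store stalls the caller.

/* Sketch of a gathered flush over an iovec array, as in frame #1.
 * Hypothetical code, not syslog-ng source. On a regular (blocking)
 * file descriptor, writev() blocks until the kernel accepts the
 * data; if the backing filesystem (e.g. a congested NFS mount)
 * stalls, the calling thread stalls with it. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int flush_buffers(int fd, struct iovec *iov, int iovcnt)
{
  while (iovcnt > 0)
    {
      /* May block for an arbitrarily long time on a slow/hung fs. */
      ssize_t rc = writev(fd, iov, iovcnt);
      if (rc < 0)
        {
          if (errno == EINTR)
            continue;
          return -1;
        }
      /* Skip fully written iovecs, then adjust the partial one. */
      while (iovcnt > 0 && (size_t) rc >= iov->iov_len)
        {
          rc -= iov->iov_len;
          iov++;
          iovcnt--;
        }
      if (iovcnt > 0)
        {
          iov->iov_base = (char *) iov->iov_base + rc;
          iov->iov_len -= rc;
        }
    }
  return 0;
}

int main(void)
{
  struct iovec iov[2] = {
    { .iov_base = "line one\n", .iov_len = 9 },
    { .iov_base = "line two\n", .iov_len = 9 },
  };
  int fd = open("/tmp/flush-demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

  if (fd < 0 || flush_buffers(fd, iov, 2) < 0)
    perror("flush");
  if (fd >= 0)
    close(fd);
  return 0;
}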
https://bugzilla.balabit.com/show_bug.cgi?id=161

--- Comment #1 from Balazs Scheidler <bazsi@balabit.hu> 2012-02-24 11:00:34 ---
Thanks for the detailed report.

Hmm. The main thread is stuck during reload, while all the other threads are
sitting idle. The main thread is either blocked in a single call to
writev(), or it is busy-looping around writev() and the core just happened
to catch it inside the kernel.

Can you please check the arguments passed to the writev() call? E.g. please
issue the following commands:

(gdb) frame 1
(gdb) p *self

That should display the argument values passed to writev(). Thanks.
https://bugzilla.balabit.com/show_bug.cgi?id=161

--- Comment #2 from Patrick <bugzilla.syslogng@feystorm.net> 2012-02-24 14:31:02 ---
(gdb) frame 1
#1  0x00007fb9c65820f8 in log_proto_file_writer_flush (s=0x171b5a0) at logproto.c:270
270       rc = writev(self->fd, self->buffer, self->buf_count);
(gdb) p *self
$1 = {super = {transport = 0x171b570, convert = 0xffffffffffffffff,
    encoding = 0x0, flags = 0,
    prepare = 0x7fb9c658264e <log_proto_file_writer_prepare>,
    is_preemptable = 0, restart_with_state = 0, fetch = 0, queued = 0,
    post = 0x7fb9c65823ef <log_proto_file_writer_post>,
    flush = 0x7fb9c658208d <log_proto_file_writer_flush>, free_fn = 0},
  partial = 0x0, partial_len = 0, partial_pos = 0, buf_size = 10,
  buf_count = 7, fd = 723, sum_len = 2485, buffer = 0x171b5a0}
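To make the dump above easier to read, the fields map onto a writer state
roughly like the following. This is a reconstruction from the printed
values for readability only, not the actual syslog-ng 3.3 declaration;
types and field order are guesses.

/* Rough shape of the file-writer state implied by the $1 dump.
 * Reconstruction only -- not syslog-ng source. */
#include <sys/uio.h>

struct file_writer_state_sketch
{
  /* ... LogProto 'super' members elided ... */
  void *partial;          /* 0x0: no half-written record pending */
  int partial_len;        /* 0 */
  int partial_pos;        /* 0 */
  int buf_size;           /* 10: capacity of the iovec array */
  int buf_count;          /* 7: iovecs queued for this writev() */
  int fd;                 /* 723: destination file descriptor */
  int sum_len;            /* 2485: total bytes across the iovecs */
  struct iovec buffer[];  /* the queued iovecs themselves */
};

In other words, the pending call is a single writev() of 7 iovecs totalling
2485 bytes to fd 723.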
https://bugzilla.balabit.com/show_bug.cgi?id=161

--- Comment #3 from Balazs Scheidler <bazsi@balabit.hu> 2012-03-11 14:59:03 ---
Is it possible that this syslog-ng instance is writing to some kind of
destination that can get stuck (and perhaps doesn't support non-blocking
I/O)?

The writev() call is only issued once (so busy-looping over writev() is not
likely), and the arguments seem to be sane: a single write of 7 iovecs
totalling 2485 bytes to fd 723. It should definitely return within a couple
of minutes at most.

Are all your core files sitting in writev(), or are there differences?
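A side note on the non-blocking question: on Linux, O_NONBLOCK is only
meaningful for pipes, sockets, terminals and the like; data writes to
regular files, including files on NFS, block regardless of the flag. The
small sketch below shows how one could inspect a descriptor's type and
flags; it is illustrative only, and the path in it is made up.

/* Illustrative only: report a descriptor's file type and whether
 * O_NONBLOCK is set. For regular files (local or NFS) the kernel
 * ignores O_NONBLOCK for data writes, so a writev() to a hung NFS
 * mount blocks either way -- one way a "stuck" destination can pin
 * a thread. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void describe_fd(int fd)
{
  struct stat st;
  int flags = fcntl(fd, F_GETFL);

  if (flags < 0 || fstat(fd, &st) < 0)
    {
      perror("describe_fd");
      return;
    }
  printf("fd %d: %s, O_NONBLOCK %s\n", fd,
         S_ISREG(st.st_mode) ? "regular file" :
         S_ISSOCK(st.st_mode) ? "socket" :
         S_ISFIFO(st.st_mode) ? "pipe/fifo" : "other",
         (flags & O_NONBLOCK) ? "set" : "clear");
}

int main(void)
{
  int fd = open("/tmp/demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

  if (fd >= 0)
    {
      describe_fd(fd);  /* regular file: O_NONBLOCK clear, ignored anyway */
      close(fd);
    }
  return 0;
}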
https://bugzilla.balabit.com/show_bug.cgi?id=161

--- Comment #4 from Patrick <bugzilla.syslogng@feystorm.net> 2012-03-19 20:56:30 ---
I haven't forgotten about this bug. Since this only occurs in our production
environment, I have to pick an off-peak time when losing a few log messages
won't hurt. I will hopefully have more info soon.
https://bugzilla.balabit.com/show_bug.cgi?id=161

--- Comment #5 from Patrick <bugzilla.syslogng@feystorm.net> 2012-04-27 19:38:30 ---
OK, so I think you are correct on this: it wasn't hanging, it was just
taking forever to write out to NFS. I didn't expect syslog-ng to have that
much unwritten data, as I have the flush_timeout() and flush_lines()
settings set pretty low on all destinations. However, syslog-ng maintains
about 2000 open files at any given time, so 2000 files each holding just a
little data adds up (at the ~2.5 KB seen in the dump above, that is roughly
5 MB per flush pass).

I think this might also be loosely related to bug 175, which I just opened.
Writing that much data at once may push the Linux dirty-page thresholds
over the limit and trigger a pdflush writeback, which in turn causes
syslog-ng to hang until the flush is complete (see the sketch below).

It's probably safe to consider this bug invalid. I'll see what the behavior
is after bug 175 is fixed.
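For context on the writeback theory: a generic way to keep an append-heavy
writer from accumulating dirty pages until the kernel's vm.dirty_*
thresholds force a synchronous stall is to start writeback incrementally as
data is written. The sketch below illustrates that mechanism only; it is
not the bug 175 change, and the file path and threshold are made up.

/* Generic illustration of bounding dirty pages for an append-only
 * log: after every chunk written, ask the kernel to begin writeback
 * for the completed range instead of letting dirty pages pile up
 * until vm.dirty_ratio forces a blocking flush. Not the bug 175
 * patch; path and threshold are invented for the example. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define WRITEBACK_CHUNK (1 << 20)   /* start writeback every 1 MiB */

int main(void)
{
  int fd = open("/tmp/demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
  off_t written = 0, synced = 0;
  const char line[] = "a log line of modest length\n";

  if (fd < 0)
    return 1;
  for (int i = 0; i < 100000; i++)
    {
      if (write(fd, line, sizeof(line) - 1) < 0)
        break;
      written += sizeof(line) - 1;
      if (written - synced >= WRITEBACK_CHUNK)
        {
          /* Kick off asynchronous writeback for the completed range;
           * returns quickly, unlike fsync()/fdatasync(). */
          sync_file_range(fd, synced, written - synced,
                          SYNC_FILE_RANGE_WRITE);
          synced = written;
        }
    }
  close(fd);
  return 0;
}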
https://bugzilla.balabit.com/show_bug.cgi?id=161

Gergely Nagy <algernon@balabit.hu> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |algernon@balabit.hu
             Status|NEW         |NEEDINFO

--- Comment #6 from Gergely Nagy <algernon@balabit.hu> 2012-09-07 15:57:57 ---
(In reply to comment #5)
> It's probably safe to consider this bug invalid. I'll see what the
> behavior is after bug 175 is fixed.

Is this problem still present?
https://bugzilla.balabit.com/show_bug.cgi?id=161

--- Comment #7 from Patrick <bugzilla.syslogng@feystorm.net> 2012-09-10 20:21:15 ---
(In reply to comment #6)
> (In reply to comment #5)
> > It's probably safe to consider this bug invalid. I'll see what the
> > behavior is after bug 175 is fixed.
>
> Is this problem still present?

No. There is still a significant delay before processing resumes after a
reload, but with file syncing enabled it is not as bad as it was.