[Bug 260] New: exit with code 134
https://bugzilla.balabit.com/show_bug.cgi?id=260 Summary: exit with code 134 Product: syslog-ng Version: 3.4.x Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: unspecified Component: syslog-ng AssignedTo: bazsi@balabit.hu ReportedBy: erempel@uvic.ca Type of the Report: --- Estimated Hours: 0.0 I just did a reload on version 3.4.4 (with latest patches) and triggered 2013-11-05T21:24:33-08:00 host daemon.crit supervise/syslog-ng[25895]: Daemon exited due to a deadlock/signal/failure, restarting; exitcode='134' Is it possible that the return code of syslog-ng is somehow linked to the return code of a child program destination? core file sent to Balazs in private e-mail -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #1 from Balazs Scheidler <bazsi@balabit.hu> 2013-11-09 11:31:41 --- After analysing the core file, it turns out that I can't continue without reproducing the issue with less optimization as explained to Evan in a private email. Hopefully he can reproduce it with a debug build of syslog-ng, until then, this bug is on hold. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #2 from Evan Rempel <erempel@uvic.ca> 2013-11-16 07:32:49 --- Sent Balazs a private em will on how to get the core dump and latest debug binaries. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #3 from Balazs Scheidler <bazsi@balabit.hu> 2013-11-19 16:22:58 --- I have now tried to diagnose the root cause for this issue. Like in the previous core, epoll() returns failure while modifying the I/O interest mask of an fd using EPOLL_CTL_MOD The fd that fails belongs to a syslog(transport(udp)) source. Memory is seemingly not corrupted. The interesting part is that ivykis operates with the assumption that the fd had been registered for polling both EPOLLIN and EPOLLOUT but not for EPOLLERR. The "normal" setting for a UDP source is EPOLLIN and EPOLLERR and I couldn't find a way it'd set the mask to EPOLLIN and EPOLLOUT. My best guess is that the struct iv_fd "registered_bands" member is corrupted, this theory is supported by the fact that epoll() returns failure, which generally means that it was not registered at all. But how that can be corrupted is a mystery to me, it is always updated in the same thread. Can you by chance describe what happens when this happens? Is it always related to reloading syslog-ng? Is it perhaps possible to run syslog-ng under strace until this happens? It produces a _lot_ of output and increases CPU demand a lot, but might give a clue on how syslog-ng is working with the fd in question. Also, if you happen to have further core files, those could also be helpful. The fact if this always happens with the same fd or not could help. Thanks in advance. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #4 from Evan Rempel <erempel@uvic.ca> 2013-11-19 18:01:27 --- 2013-11-15T09:15:00-08:00 X daemon.info syslog-ng-stats: server stopping on socket "/var/local/syslog-ng.server.ctl" 2013-11-15T09:15:00-08:00 X daemon.info syslog-ng-stats: server starting on socket "/var/local/syslog-ng.server.ctl" 2013-11-15T09:15:01-08:00 X daemon.info syslog-ng-stats: server stopping on socket "/var/local/syslog-ng.server.ctl" 2013-11-15T09:15:01-08:00 X daemon.crit supervise/syslog-ng[16298]: Daemon exited due to a deadlock/signal/failure, restarting; exitcode='134' It only seems to exit during a reload. We have a program destination that logs when it stops and when it starts. It looks like the reload resulted in the stop/start, and then the crash occurs. Just thinking about how our 12 sysadmins work and possibly reload syslog-ng. What happens if a HUP is received prior to completing the reload from a previous HUP? I have placed two more core files into the same folder as the first. You should be able to download them. I'll look at getting an strace, but on our production system I think it will be too cpu intensive and/or the output too large. Any particular strace flags you want or need? -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #5 from Balazs Scheidler <bazsi@balabit.hu> 2013-11-21 14:43:23 --- recursive HUPs should be handled just fine. strace options: strace -o strace.log -ttT -f -s 256 -p <pid> -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #6 from Evan Rempel <erempel@uvic.ca> 2013-11-21 18:33:45 --- Twice I have caused a core dump during a reload, but both times the exist code as 133. supervise/syslog-ng[16298]: Daemon exited due to a deadlock/signal/failure, restarting; exitcode='133' I have made one of the strace file and the related core dump in the same place as last time. Let me know if you want me to trace again. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #7 from Balazs Scheidler <bazsi@balabit.hu> 2013-11-24 20:55:01 --- Hi, Now I had some time to review to stuff you sent me. Thanks for the efforts. As it seems, strace is unusable, the 133 exit code you have seen indicates a SIGTRAP, which is the signal raised when a breakpoint is hit. Since the only remotely debugger-like is strace, it seems we have hit some kind of strace issue, one that injects a SIGTRAP to the child. Could be a race or something. But fortunately, I have found a clue in one of the core dumps. One of the dumps was a segfault and not an abort: Core was generated by `/usr/local/sbin/syslog-ng --cfgfile=/usr/local/etc/syslog-ng/syslog-ng.server.c'. Program terminated with signal 11, Segmentation fault. #0 0x00007fa668c88381 in __strlen_sse2 () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install UVic-syslog-ng-3.4.4-99.el6.x86_64 (gdb) bt #0 0x00007fa668c88381 in __strlen_sse2 () from /lib64/libc.so.6 #1 0x00007fa668c88076 in strdup () from /lib64/libc.so.6 #2 0x00007fa6697ed94b in evt_tag_str (tag=0x7fa66a574d00 "encoding", value=0x6479786f72 <Address 0x6479786f72 out of bounds>) at evttags.c:76 #3 0x00007fa66a5341bd in log_proto_server_options_validate (options=0x14cd298) at logproto-server.c:138 #4 0x00007fa66a539c10 in log_proto_server_validate_options (self=0x97d1b0) at logproto-server.h:82 #5 0x00007fa66a53aecf in log_reader_init (s=0xc4f040) at logreader.c:638 #6 0x00007fa6683dab9f in log_pipe_init (s=0xc4f040, cfg=0x0) at ../../lib/logpipe.h:253 #7 0x00007fa6683db068 in afsocket_sc_init (s=0xe45280) at afsocket-source.c:149 #8 0x00007fa6683dab9f in log_pipe_init (s=0xe45280, cfg=0x0) at ../../lib/logpipe.h:253 #9 0x00007fa6683dbf41 in afsocket_sd_init (s=0xaa8230) at afsocket-source.c:548 #10 0x00007fa66a5243b0 in log_pipe_init (s=0xaa8230, cfg=0x1d750e0) at logpipe.h:253 #11 0x00007fa66a526111 in cfg_tree_start (self=0x1d75248) at cfg-tree.c:1064 #12 0x00007fa66a51fec1 in cfg_init (cfg=0x1d750e0) at cfg.c:220 #13 0x00007fa66a5435ae in main_loop_reload_config_apply () at mainloop.c:501 #14 0x00007fa66a5434b8 in main_loop_io_worker_sync_call (func=0x7fa66a54355b <main_loop_reload_config_apply>) at mainloop.c:456 #15 0x00007fa66a543850 in main_loop_reload_config_initiate () at mainloop.c:574 #16 0x00007fa66a5438ae in sig_hup_handler (s=0x0) at mainloop.c:608 #17 0x00007fa66a56f8e8 in iv_signal_event (_this=0x7fa66a7a8ce0) at iv_signal.c:170 #18 0x00007fa66a56dc95 in iv_event_raw_got_event (_this=0x7fa66a7a8d20) at iv_event_raw_posix.c:89 #19 0x00007fa66a56e354 in iv_fd_poll_and_run (st=0x7b91a0, to=0x7fffa91ed630) at iv_fd.c:163 #20 0x00007fa66a56f2bb in iv_main () at iv_main_posix.c:117 #21 0x00007fa66a543c72 in main_loop_run () at mainloop.c:736 #22 0x000000000040187b in main () And this is one of the possible symptoms of a bug I have fixed in 3.4.5, more specifically this one: https://bugzilla.balabit.com/show_bug.cgi?id=253 I did find the root cause of that one because of one of the valgrind runs that you made, but since it wasn't reported by you, the fix is not in your local syslog-ng build. That bug is a potential use-after-free and as such it may actually cause the SIGABRTs by corrupting memory. So, it'd make sense to apply this patch too as it's definitely affecting you, but the segfault is not triggered: $ git show 861489c4b3a895ca88f9cdbb98b351f1c5a330cf commit 861489c4b3a895ca88f9cdbb98b351f1c5a330cf Author: Balazs Scheidler <bazsi@balabit.hu> Date: Sat Nov 2 20:19:12 2013 +0100 LogProtoServer: fixed a use-after-free after reload self->options is a borrowed memory area, managed by the owning driver that constructed the given LogProtoServer instance. During reload however, the driver object itself is freed, while the LogProtoServer instance is kept alive, thereby causing the options pointer to point into the void. The solution is to update the options pointer whenever the owner is changed, chaining into a similar solution in the LogReader code which does exactly that. Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> diff --git a/lib/logproto-server.h b/lib/logproto-server.h index 6188f65..969a807 100644 --- a/lib/logproto-server.h +++ b/lib/logproto-server.h @@ -82,6 +82,12 @@ log_proto_server_validate_options(LogProtoServer *self) return log_proto_server_options_validate(self->options); } +static inline void +log_proto_server_set_options(LogProtoServer *self, const LogProtoServerOptions *options) +{ + self->options = options; +} + static inline gboolean log_proto_server_prepare(LogProtoServer *s, gint *fd, GIOCondition *cond) { diff --git a/lib/logreader.c b/lib/logreader.c index 9a60345..c498ae1 100644 --- a/lib/logreader.c +++ b/lib/logreader.c @@ -697,6 +697,8 @@ log_reader_set_options(LogPipe *s, LogPipe *control, LogReaderOptions *options, self->control = control; self->options = options; + if (self->proto) + log_proto_server_set_options(self->proto, &self->options->proto_options.super); } /* run in the main thread in reaction to a log_reader_reopen to change I've also reviewed the rest of the core files, but they seem to be very similar to the first crash. A different fd, but still the ivykis POLL mask is different from reality, which causes the abort within ivykis. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #8 from Evan Rempel <erempel@uvic.ca> 2013-12-31 21:02:26 --- Given that syslog-ng versions 3.4.5, 3.4.6 and 3.4.7 have been released since I reported this, I am upgrading to 3.4.7 to see if this issue has "gone away" with the many related bug fixes in these releases. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #9 from Balazs Scheidler <bazsi@balabit.hu> 2013-12-31 21:12:11 --- (In reply to comment #8)
Given that syslog-ng versions 3.4.5, 3.4.6 and 3.4.7 have been released since I reported this, I am upgrading to 3.4.7 to see if this issue has "gone away" with the many related bug fixes in these releases.
yeah, that makes sense. ps: happy new year :) -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 Evan Rempel <erempel@uvic.ca> changed: What |Removed |Added ---------------------------------------------------------------------------- Resolution| |FIXED Status|NEW |RESOLVED --- Comment #10 from Evan Rempel <erempel@uvic.ca> 2014-03-06 18:50:56 --- looks like I failed to report on the 3.4.7 stability. All of the "Daemon exited due to a deadlock/signal/failure" on reload have gone away. one of the other patches addressed this issue. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=260 --- Comment #11 from Balazs Scheidler <bazsi@balabit.hu> 2014-03-06 20:38:29 --- Thanks for the feedback. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
participants (1)
-
bugzilla@bugzilla.balabit.com