[Bug 255] New: segmentation faults often
https://bugzilla.balabit.com/show_bug.cgi?id=255 Summary: segmentation faults often Product: syslog-ng Version: 3.4.x Platform: PC OS/Version: Linux Status: NEW Severity: blocker Priority: unspecified Component: syslog-ng AssignedTo: bazsi@balabit.hu ReportedBy: erempel@uvic.ca Type of the Report: bug Estimated Hours: 0.0 In my setup, which is big, using syslog-ng 3.4.4 I get segmentation faults 2-3 times per minute. 3.4.1 did this once every couple of weeks, and 3.4.2 was worse, but 3.4.4 is completely unusable for me. What would you like from me to aid tracking this down? -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #1 from Balazs Scheidler <bazsi@balabit.hu> 2013-10-17 22:47:54 --- that really is odd. can you post a backtrace? If it seems complicated, I might be available to log into your system and troubleshoot it there if you are open to that sort of thing. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #2 from Evan Rempel <erempel@uvic.ca> 2013-10-17 23:00:03 --- (gdb) backtrace #0 0x00007f2e54000148 in ?? () #1 0x00007f2e69dfc181 in log_msg_clone_ack (msg=<value optimized out>, user_data=<value optimized out>) at logmsg.c:1097 #2 0x00007f2e69dfe205 in log_msg_refcache_stop () at logmsg.c:1495 #3 0x00007f2e69e0a4d2 in log_writer_flush (self=0x28401a0, flush_mode=LW_FLUSH_NORMAL) at logwriter.c:1046 #4 0x00007f2e69e0a6dd in log_writer_work_perform (s=0x28401a0) at logwriter.c:129 #5 0x00007f2e69e0addb in main_loop_io_worker_job_start (self=0x2840390) at mainloop.c:371 #6 0x00007f2e69e271da in iv_work_thread_do_work (_thr=0x2857270) at iv_work.c:118 #7 0x00007f2e69e2653a in iv_run_tasks (st=0x7f2e400008c0) at iv_task.c:48 #8 0x00007f2e69e2875c in iv_main () at iv_main_posix.c:106 #9 0x00007f2e69e26ff1 in iv_work_thread (_thr=0x2857270) at iv_work.c:200 #10 0x00007f2e69e291c8 in iv_thread_handler (_thr=0x2857340) at iv_thread_posix.c:142 #11 0x00007f2e68870851 in start_thread () from /lib64/libpthread.so.0 #12 0x00007f2e685be94d in clone () from /lib64/libc.so.6 -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #3 from Balazs Scheidler <bazsi@balabit.hu> 2013-10-18 10:58:45 --- Definitely not a simple crash. Can you send me core plus syslognh binaries in private? I would want to nail this as fast as possible Thanks -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 Gergely Nagy <algernon@balabit.hu> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |algernon@balabit.hu --- Comment #4 from Gergely Nagy <algernon@balabit.hu> 2013-10-18 15:39:35 --- This is something I've seen before, and we have an open bug (#224) about it, too. But I've never managed to reproduce the problem. Core & binaries to Bazsi and myself in private would be very useful. Even better, if you can compile with CFLAGS="-ggdb3 -O0", and send a core & binaries from that run, that would be the most useful thing, I believe. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #5 from Balazs Scheidler <bazsi@balabit.hu> 2013-10-20 12:38:29 --- based on information Evan sent in private, I came up with the following potential fix: https://github.com/balabit/syslog-ng-3.4/tree/f/fix-refcache-stop-race -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #6 from Evan Rempel <erempel@uvic.ca> 2013-10-20 18:33:40 --- Although I have only been running this for 5 minutes, I have not seen a crash. Load on our syslog servers is lighter on the weekends, so this might just be coincidental, but so far this looks like it has addressed the issue. I will report again on Monday afternoon after one day of regular business load. Thanks for the fast reply Bazsi -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #7 from Balazs Scheidler <bazsi@balabit.hu> 2013-10-20 18:41:00 --- so far so good. hope it stays that way. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #8 from Evan Rempel <erempel@uvic.ca> 2013-10-21 01:21:44 --- more than 30 million log messages and there is no sign of a memory leak or a segfault. Time USER PID PPID NI PRI %CPU VSZ ELAPSED TIME COMMAND 103101 root 12307 19898 0 19 13.5 694896 00:06:11 00:00:50 /usr/local/sbin/syslog-ng 104601 root 12307 19898 0 19 12.8 695132 00:21:11 00:02:43 /usr/local/sbin/syslog-ng 110101 root 12307 19898 0 19 12.7 705376 00:36:11 00:04:36 /usr/local/sbin/syslog-ng 111601 root 12307 19898 0 19 12.6 695104 00:51:11 00:06:28 /usr/local/sbin/syslog-ng 113101 root 12307 19898 0 19 12.8 705348 01:06:11 00:08:28 /usr/local/sbin/syslog-ng 114601 root 12307 19898 0 19 12.9 695104 01:21:11 00:10:30 /usr/local/sbin/syslog-ng 120101 root 12307 19898 0 19 13.0 695120 01:36:11 00:12:30 /usr/local/sbin/syslog-ng 121601 root 12307 19898 0 19 13.1 695168 01:51:11 00:14:35 /usr/local/sbin/syslog-ng 123101 root 12307 19898 0 19 13.1 695168 02:06:11 00:16:38 /usr/local/sbin/syslog-ng So I think we are safe on the memory leak worry, and I'm fairly confident that this has addressed the segfault. As I said earlier, I'll wait for the load of a weekday of business before I completely sign off on this, but it is looking really good. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
I just upgraded syslog-ng from 3.4.1 to 3.4.4 and can confirm that with the same configuration, it is using twice as much cpu. See the attached graph. We also take 15 minute snapshots of our processes and for the syslog-ng process the CPU usage is shown below. Note that when the process ID changes from 12307 to 5156 the CPU doubles. Time USER PPID PPID %CPU VSZ ELAPSED TIME COMMAND 074602 root 12307 19898 11.1 787228 21:21:12 02:23:20 /usr/local/sbin/syslog-ng 080101 root 12307 19898 11.1 787228 21:36:11 02:25:04 /usr/local/sbin/syslog-ng 081601 root 12307 19898 11.2 787228 21:51:11 02:26:58 /usr/local/sbin/syslog-ng 083101 root 12307 19898 11.2 787228 22:06:11 02:28:44 /usr/local/sbin/syslog-ng 084601 root 12307 19898 11.2 787228 22:21:11 02:30:32 /usr/local/sbin/syslog-ng 090101 root 12307 19898 11.2 776984 22:36:11 02:32:25 /usr/local/sbin/syslog-ng 091601 root 12307 19898 11.2 787228 22:51:11 02:34:35 /usr/local/sbin/syslog-ng 093102 root 5156 5155 23.3 705312 00:06:20 00:01:28 /usr/local/sbin/syslog-ng 094601 root 5156 5155 24.1 705772 00:21:19 00:05:08 /usr/local/sbin/syslog-ng 100101 root 5156 5155 24.0 705772 00:36:19 00:08:44 /usr/local/sbin/syslog-ng 101601 root 5156 5155 24.3 705772 00:51:19 00:12:29 /usr/local/sbin/syslog-ng 103101 root 5156 5155 23.8 705772 01:06:19 00:15:50 /usr/local/sbin/syslog-ng 104601 root 5156 5155 23.4 705772 01:21:19 00:19:05 /usr/local/sbin/syslog-ng 110101 root 5156 5155 23.2 705772 01:36:19 00:22:23 /usr/local/sbin/syslog-ng 111601 root 5156 5155 23.2 705772 01:51:19 00:25:56 /usr/local/sbin/syslog-ng 113101 root 5156 5155 23.3 705772 02:06:19 00:29:30 /usr/local/sbin/syslog-ng 114601 root 5156 5155 23.4 705772 02:21:19 00:33:09 /usr/local/sbin/syslog-ng 120101 root 5156 5155 23.4 705772 02:36:19 00:36:38 /usr/local/sbin/syslog-ng 121601 root 5156 5155 23.5 705772 02:51:19 00:40:17 /usr/local/sbin/syslog-ng 123101 root 5156 5155 23.5 705772 03:06:19 00:43:49 /usr/local/sbin/syslog-ng 124601 root 5156 5155 23.4 705772 03:21:19 00:47:14 /usr/local/sbin/syslog-ng 130101 root 5156 5155 23.5 705772 03:36:19 00:51:00 /usr/local/sbin/syslog-ng 131601 root 5156 5155 23.7 705772 03:51:19 00:54:50 /usr/local/sbin/syslog-ng 133101 root 5156 5155 23.9 705772 04:06:19 00:59:04 /usr/local/sbin/syslog-ng 134601 root 5156 5155 23.8 705772 04:21:19 01:02:25 /usr/local/sbin/syslog-ng 140102 root 5156 5155 23.8 705772 04:36:20 01:06:00 /usr/local/sbin/syslog-ng 141601 root 5156 5155 23.8 705772 04:51:19 01:09:30 /usr/local/sbin/syslog-ng 143101 root 5156 5155 23.9 705772 05:06:19 01:13:13 /usr/local/sbin/syslog-ng
I may have spoke too soon. This is with binaries compiled with CFLAGS="-ggdb3 -O0" which is different than I used when compiling syslog-ng 3.4.1 After some testing is completed in a day or two, I will recompile without those flags and I'll have an apples to apples comparison. Evan On 10/20/2013 04:37 PM, Evan Rempel wrote:
I just upgraded syslog-ng from 3.4.1 to 3.4.4 and can confirm that with the same configuration, it is using twice as much cpu. See the attached graph.
We also take 15 minute snapshots of our processes and for the syslog-ng process the CPU usage is shown below. Note that when the process ID changes from 12307 to 5156 the CPU doubles.
Time USER PPID PPID %CPU VSZ ELAPSED TIME COMMAND 074602 root 12307 19898 11.1 787228 21:21:12 02:23:20 /usr/local/sbin/syslog-ng 080101 root 12307 19898 11.1 787228 21:36:11 02:25:04 /usr/local/sbin/syslog-ng 081601 root 12307 19898 11.2 787228 21:51:11 02:26:58 /usr/local/sbin/syslog-ng 083101 root 12307 19898 11.2 787228 22:06:11 02:28:44 /usr/local/sbin/syslog-ng 084601 root 12307 19898 11.2 787228 22:21:11 02:30:32 /usr/local/sbin/syslog-ng 090101 root 12307 19898 11.2 776984 22:36:11 02:32:25 /usr/local/sbin/syslog-ng 091601 root 12307 19898 11.2 787228 22:51:11 02:34:35 /usr/local/sbin/syslog-ng 093102 root 5156 5155 23.3 705312 00:06:20 00:01:28 /usr/local/sbin/syslog-ng 094601 root 5156 5155 24.1 705772 00:21:19 00:05:08 /usr/local/sbin/syslog-ng 100101 root 5156 5155 24.0 705772 00:36:19 00:08:44 /usr/local/sbin/syslog-ng 101601 root 5156 5155 24.3 705772 00:51:19 00:12:29 /usr/local/sbin/syslog-ng 103101 root 5156 5155 23.8 705772 01:06:19 00:15:50 /usr/local/sbin/syslog-ng 104601 root 5156 5155 23.4 705772 01:21:19 00:19:05 /usr/local/sbin/syslog-ng 110101 root 5156 5155 23.2 705772 01:36:19 00:22:23 /usr/local/sbin/syslog-ng 111601 root 5156 5155 23.2 705772 01:51:19 00:25:56 /usr/local/sbin/syslog-ng 113101 root 5156 5155 23.3 705772 02:06:19 00:29:30 /usr/local/sbin/syslog-ng 114601 root 5156 5155 23.4 705772 02:21:19 00:33:09 /usr/local/sbin/syslog-ng 120101 root 5156 5155 23.4 705772 02:36:19 00:36:38 /usr/local/sbin/syslog-ng 121601 root 5156 5155 23.5 705772 02:51:19 00:40:17 /usr/local/sbin/syslog-ng 123101 root 5156 5155 23.5 705772 03:06:19 00:43:49 /usr/local/sbin/syslog-ng 124601 root 5156 5155 23.4 705772 03:21:19 00:47:14 /usr/local/sbin/syslog-ng 130101 root 5156 5155 23.5 705772 03:36:19 00:51:00 /usr/local/sbin/syslog-ng 131601 root 5156 5155 23.7 705772 03:51:19 00:54:50 /usr/local/sbin/syslog-ng 133101 root 5156 5155 23.9 705772 04:06:19 00:59:04 /usr/local/sbin/syslog-ng 134601 root 5156 5155 23.8 705772 04:21:19 01:02:25 /usr/local/sbin/syslog-ng 140102 root 5156 5155 23.8 705772 04:36:20 01:06:00 /usr/local/sbin/syslog-ng 141601 root 5156 5155 23.8 705772 04:51:19 01:09:30 /usr/local/sbin/syslog-ng 143101 root 5156 5155 23.9 705772 05:06:19 01:13:13 /usr/local/sbin/syslog-ng
The fix I made could have a performance impact, but not so severe. -O0 explains this though. On Oct 21, 2013 1:40 AM, "Evan Rempel" <erempel@uvic.ca> wrote:
I may have spoke too soon. This is with binaries compiled with CFLAGS="-ggdb3 -O0" which is different than I used when compiling syslog-ng 3.4.1
After some testing is completed in a day or two, I will recompile without those flags and I'll have an apples to apples comparison.
Evan
On 10/20/2013 04:37 PM, Evan Rempel wrote:
I just upgraded syslog-ng from 3.4.1 to 3.4.4 and can confirm that with the same configuration, it is using twice as much cpu. See the attached graph.
We also take 15 minute snapshots of our processes and for the syslog-ng process the CPU usage is shown below. Note that when the process ID changes from 12307 to 5156 the CPU doubles.
Time USER PPID PPID %CPU VSZ ELAPSED TIME COMMAND 074602 root 12307 19898 11.1 787228 21:21:12 02:23:20 /usr/local/sbin/syslog-ng 080101 root 12307 19898 11.1 787228 21:36:11 02:25:04 /usr/local/sbin/syslog-ng 081601 root 12307 19898 11.2 787228 21:51:11 02:26:58 /usr/local/sbin/syslog-ng 083101 root 12307 19898 11.2 787228 22:06:11 02:28:44 /usr/local/sbin/syslog-ng 084601 root 12307 19898 11.2 787228 22:21:11 02:30:32 /usr/local/sbin/syslog-ng 090101 root 12307 19898 11.2 776984 22:36:11 02:32:25 /usr/local/sbin/syslog-ng 091601 root 12307 19898 11.2 787228 22:51:11 02:34:35 /usr/local/sbin/syslog-ng 093102 root 5156 5155 23.3 705312 00:06:20 00:01:28 /usr/local/sbin/syslog-ng 094601 root 5156 5155 24.1 705772 00:21:19 00:05:08 /usr/local/sbin/syslog-ng 100101 root 5156 5155 24.0 705772 00:36:19 00:08:44 /usr/local/sbin/syslog-ng 101601 root 5156 5155 24.3 705772 00:51:19 00:12:29 /usr/local/sbin/syslog-ng 103101 root 5156 5155 23.8 705772 01:06:19 00:15:50 /usr/local/sbin/syslog-ng 104601 root 5156 5155 23.4 705772 01:21:19 00:19:05 /usr/local/sbin/syslog-ng 110101 root 5156 5155 23.2 705772 01:36:19 00:22:23 /usr/local/sbin/syslog-ng 111601 root 5156 5155 23.2 705772 01:51:19 00:25:56 /usr/local/sbin/syslog-ng 113101 root 5156 5155 23.3 705772 02:06:19 00:29:30 /usr/local/sbin/syslog-ng 114601 root 5156 5155 23.4 705772 02:21:19 00:33:09 /usr/local/sbin/syslog-ng 120101 root 5156 5155 23.4 705772 02:36:19 00:36:38 /usr/local/sbin/syslog-ng 121601 root 5156 5155 23.5 705772 02:51:19 00:40:17 /usr/local/sbin/syslog-ng 123101 root 5156 5155 23.5 705772 03:06:19 00:43:49 /usr/local/sbin/syslog-ng 124601 root 5156 5155 23.4 705772 03:21:19 00:47:14 /usr/local/sbin/syslog-ng 130101 root 5156 5155 23.5 705772 03:36:19 00:51:00 /usr/local/sbin/syslog-ng 131601 root 5156 5155 23.7 705772 03:51:19 00:54:50 /usr/local/sbin/syslog-ng 133101 root 5156 5155 23.9 705772 04:06:19 00:59:04 /usr/local/sbin/syslog-ng 134601 root 5156 5155 23.8 705772 04:21:19 01:02:25 /usr/local/sbin/syslog-ng 140102 root 5156 5155 23.8 705772 04:36:20 01:06:00 /usr/local/sbin/syslog-ng 141601 root 5156 5155 23.8 705772 04:51:19 01:09:30 /usr/local/sbin/syslog-ng 143101 root 5156 5155 23.9 705772 05:06:19 01:13:13 /usr/local/sbin/syslog-ng
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #9 from Evan Rempel <erempel@uvic.ca> 2013-10-21 22:40:39 --- Gone through the peak load period of a business day with peaks of 11,000 mps. No crashes, so that is great. The memory usage pattern is continually going up, but not outside of some other infrequent periods with version 3.4.1 I am going to recompile without the CFLAGS="-ggdb3 -O0" to get back to an apples to apples comparison and run the rest of the week when I will report back with memory usage patterns. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #10 from Evan Rempel <erempel@uvic.ca> 2013-10-25 19:00:27 ---
I am going to recompile without the CFLAGS="-ggdb3 -O0" to get back to an apples to apples comparison and run the rest of the week when I will report back with memory usage patterns.
After 4 business days of usage, there has not been a single crash, so this looks great. The memory issue does not look so good. In 3.4.1 the memory would sometimes grow slowly in blocks (10M sometimes) but would also at times reduce by a block of 10M as well. In 3.4.4 the memory continually grows in similar blocks, but any reductions are very small 1-200KB and in the last 4 days has never reduced by a large 10M block, or even 1M block. So I think that there is a memory leak. Anything I can do to help track this down? -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 --- Comment #11 from Balazs Scheidler <bazsi@balabit.hu> 2013-10-25 20:40:36 --- the leak is probably unrelated. if this was causing it, it would be a massive leak. do you should open a new ticket for that. we will probably need to compile yemalloc or tcmalloc for leak hunting for your so we can do that in production. I'll have to refresh my memories about them too -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
https://bugzilla.balabit.com/show_bug.cgi?id=255 Balazs Scheidler <bazsi@balabit.hu> changed: What |Removed |Added ---------------------------------------------------------------------------- Resolution| |FIXED Status|NEW |RESOLVED --- Comment #12 from Balazs Scheidler <bazsi@balabit.hu> 2013-11-02 20:21:00 --- this one goes to fixed, as it seems we have successfully fixed the issue, and the fix is already on master. -- Configure bugmail: https://bugzilla.balabit.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching all bug changes.
participants (3)
-
Balazs Scheidler
-
bugzilla@bugzilla.balabit.com
-
Evan Rempel