syslog-ng hangs with high message volume to sqlite database
Sent: Sat Jan 28 2012 23:11:08 GMT-0500 (EST)
From: Patrick Hemmer <syslogng@stormcloud9.net>
To: syslog-ng@lists.balabit.hu
Subject: [syslog-ng] syslog-ng hangs with high message volume to sqlite database
I seem to have run into a situation where syslog-ng (3.3.4) will hang when sending a high message volume to an sqlite3 database. When I run the `loggen` utility with a high message rate, syslog-ng writes out the following error:

Jan 28 22:58:12 storm.feystorm.net syslog-ng[22839]: Error running SQL query; type='sqlite3', host='', port='', user='syslog-ng', database='/syslogtest/logs.sql3', error='5: database is locked', query='COMMIT'
Jan 28 22:58:12 storm.feystorm.net syslog-ng[22839]: SQL transaction commit failed, rewinding backlog and starting again;

Syslog-ng will sit stuck like this for a minute and then log the error again. Sometimes it'll write a few more entries to the database before it errors, but it's never more than a few. I'm not sure whether this issue is a problem with libdbi or syslog-ng, which is why I'm here and not on the bug tracker.

I came across this issue while trying to track down another issue where syslog-ng seems to hang completely and never recovers (it won't respond to anything other than a `kill -9`). Whether this is the same issue or not, I do not know.

I was able to trigger this with the following loggen command:

loggen -r 60 -I 1 --active-connections=2 -x ./socket

This server is a fairly low-power server (it's the central log collector for my home network), so you might have to significantly bump up the message rate to duplicate the issue.
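For context: error 5 is SQLITE_BUSY, which SQLite returns when one connection needs a lock that another connection still holds; in the default rollback-journal mode a COMMIT needs an exclusive lock, so any other open reader or writer on the same file can trigger it. A minimal standalone sketch of that situation (nothing to do with syslog-ng or libdbi; the file name is made up) that produces the same "database is locked" error on COMMIT:

/* Minimal sketch, independent of syslog-ng/libdbi: connection A holds a read
 * transaction (a SHARED lock in rollback-journal mode), so connection B's
 * COMMIT cannot take the EXCLUSIVE lock and fails with SQLITE_BUSY (5,
 * "database is locked").  Build with: gcc sketch.c -lsqlite3 */
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
  sqlite3 *a, *b;
  char *err = NULL;

  sqlite3_open("demo.sql3", &a);
  sqlite3_open("demo.sql3", &b);
  sqlite3_exec(a, "CREATE TABLE IF NOT EXISTS logs (message TEXT)", NULL, NULL, NULL);

  /* A opens a read transaction and keeps it open (holds a SHARED lock). */
  sqlite3_exec(a, "BEGIN; SELECT count(*) FROM logs", NULL, NULL, NULL);

  /* B writes and tries to commit; the COMMIT needs an EXCLUSIVE lock. */
  int rc = sqlite3_exec(b, "BEGIN IMMEDIATE; INSERT INTO logs VALUES ('x'); COMMIT",
                        NULL, NULL, &err);
  printf("writer: rc=%d, error=%s\n", rc, err ? err : "(none)");
  /* prints: writer: rc=5, error=database is locked */

  /* One knob on the SQLite side is sqlite3_busy_timeout(b, ms), which makes
   * SQLite retry for a while instead of failing immediately; whether libdbi
   * exposes that to syslog-ng is a separate question. */
  sqlite3_free(err);
  sqlite3_close(b);
  sqlite3_close(a);
  return 0;
}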
Config:

@version: 3.3

options {
    keep_hostname(yes);
    chain_hostnames(no);
    normalize_hostnames(yes);
    use_dns(no);
    use_fqdn(yes);
    stats_freq(0);
    mark_freq(0);
};

source s_socket { unix-stream("/syslogtest/socket"); };
source s_internal { internal(); };
destination d_internal { file("/syslogtest/internal.log"); };
log { source(s_internal); destination(d_internal); };

destination d_sqlite {
    sql(
        type('sqlite3')
        database("/syslogtest/logs.sql3")
        table("logs")
        columns("time", "time_r", "host", "facility", "priority", "program", "pid", "tag", "message")
        values("$S_UNIXTIME", "$R_UNIXTIME", "$FULLHOST", "$FACILITY_NUM", "$LEVEL_NUM", "$PROGRAM", "$PID", "$DBTAG", "$MSG")
        null("")
        flags(explicit-commits)
        flush_lines(10)
        flush_timeout(200)
    );
};

filter f_test1 { message('^prg00000') or program('syslog-ng'); };
filter f_test2 { message('^prg00001') or program('syslog-ng'); };

log { source(s_internal); source(s_socket); destination(d_sqlite); };
#log { source(s_internal); source(s_socket); filter(f_test1); destination(d_sqlite); };
#log { source(s_internal); source(s_socket); filter(f_test2); destination(d_sqlite); };
Database schema:

CREATE TABLE logs (
    id integer primary key autoincrement,
    time unsigned integer not null,
    time_r unsigned integer not null,
    host text not null,
    facility unsigned integer not null,
    priority unsigned integer not null,
    program text not null,
    pid unsigned integer,
    tag text,
    message text not null
);
CREATE INDEX host on logs (host, time);
CREATE INDEX program on logs (program, time);
From: Patrick Hemmer <syslogng@stormcloud9.net>

I finally got the original problem I was trying to track down to happen (I had to wait for it after enabling debugging symbols). I'm not sure if they're related, but they look like they are, so I'm going to attach it on to this one.

The issue here is that syslog-ng hangs completely. It doesn't respond to anything other than `kill -9` (it looks like 2 threads are in a deadlock). Strace shows:

Process 6979 attached - interrupt to quit
futex(0x349eb0d1b0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>

It just sits there forever. Here's some gdb info:

(gdb) info threads
  Id   Target Id                                   Frame
  2    Thread 0x34711563700 (LWP 6986) "syslog-ng"  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
* 1    Thread 0x34711566b00 (LWP 6979) "syslog-ng"  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0

(gdb) where
#0  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000034710b66459 in _L_lock_508 () from /lib64/libpthread.so.0
#2  0x0000034710b6627b in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000003470dc97039 in afsql_dd_queue (s=0x349eb408f0, msg=0x349eb4b430, path_options=0x3add5269ef0, user_data=0x0) at afsql.c:1159
#4  0x00000347110d597f in log_pipe_queue (s=0x349eb408f0, msg=0x349eb4b430, path_options=0x3add526a040) at logpipe.h:288
#5  0x00000347110d5e9d in log_dest_group_queue (s=0x349eb40f10, msg=0x349eb4b430, path_options=0x3add526a040, user_data=0x0) at dgroup.c:97
#6  0x00000347110df3bc in log_pipe_queue (s=0x349eb40f10, msg=0x349eb4b430, path_options=0x3add526a040) at logpipe.h:288
#7  0x00000347110df6d5 in log_multiplexer_queue (s=0x349eb084e0, msg=0x349eb4b430, path_options=0x3add526a3e0, user_data=0x0) at logmpx.c:125
#8  0x00000347110d7b20 in log_pipe_queue (s=0x349eb084e0, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:288
#9  0x00000347110d7a9f in log_pipe_forward_msg (self=0x349eb44390, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:275
#10 0x00000347110d979c in log_filter_pipe_queue (s=0x349eb44390, msg=0x349eb4b430, path_options=0x3add526a3e0, user_data=0x0) at filter.c:698
#11 0x00000347110ed3ec in log_pipe_queue (s=0x349eb44390, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:288
#12 0x00000347110ed36b in log_pipe_forward_msg (self=0x349eb458b0, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:275
#13 0x00000347110ed724 in log_rewrite_queue (s=0x349eb458b0, msg=0x349eb4b430, path_options=0x3add526a3e0, user_data=0x0) at logrewrite.c:65
#14 0x00000347110ed3ec in log_pipe_queue (s=0x349eb458b0, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:288
#15 0x00000347110ed36b in log_pipe_forward_msg (self=0x349eb45620, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:275
#16 0x00000347110ed724 in log_rewrite_queue (s=0x349eb45620, msg=0x349eb4b430, path_options=0x3add526a3e0, user_data=0x0) at logrewrite.c:65
#17 0x00000347110df3bc in log_pipe_queue (s=0x349eb45620, msg=0x349eb4b430, path_options=0x3add526a3e0) at logpipe.h:288
#18 0x00000347110df6d5 in log_multiplexer_queue (s=0x349eb07840, msg=0x34708013350, path_options=0x3add526a680, user_data=0x0) at logmpx.c:125
#19 0x00000347110ffb93 in log_pipe_queue (s=0x349eb07840, msg=0x34708013350, path_options=0x3add526a680) at logpipe.h:288
#20 0x00000347110ffb12 in log_pipe_forward_msg (self=0x349eb39df0, msg=0x34708013350, path_options=0x3add526a680) at logpipe.h:275
#21 0x000003471110012d in log_source_group_queue (s=0x349eb39df0, msg=0x34708013350, path_options=0x3add526a680, user_data=0x0) at sgroup.c:102
#22 0x00000347110ee3fe in log_pipe_queue (s=0x349eb39df0, msg=0x34708013350, path_options=0x3add526a680) at logpipe.h:288
#23 0x00000347110ee37d in log_pipe_forward_msg (self=0x349eb39d10, msg=0x34708013350, path_options=0x3add526a680) at logpipe.h:275
#24 0x00000347110ee417 in log_pipe_queue (s=0x349eb39d10, msg=0x34708013350, path_options=0x3add526a680) at logpipe.h:292
#25 0x00000347110ee37d in log_pipe_forward_msg (self=0x349eb3b220, msg=0x34708013350, path_options=0x3add526a680) at logpipe.h:275
#26 0x00000347110ef1d4 in log_source_queue (s=0x349eb3b220, msg=0x34708013350, path_options=0x3add526a750, user_data=0x0) at logsource.c:290
#27 0x00000347110cc454 in log_pipe_queue (s=0x349eb3b220, msg=0x34708013350, path_options=0x3add526a750) at logpipe.h:288
#28 0x00000347110cc718 in afinter_source_post (s=0x349eb3b220) at afinter.c:115
#29 0x000003471111e184 in iv_event_run_pending_events (_dummy=0x0) at iv_event.c:67
#30 0x000003471111e6a5 in iv_event_raw_got_event (_this=0x34711566978) at iv_event_raw.c:82
#31 0x0000034711120ebe in iv_run_active_list (active=0x3add526ac90) at iv_main.c:219
#32 0x00000347111210e8 in iv_main () at iv_main.c:269
#33 0x00000347110f67de in main_loop_run () at mainloop.c:731
#34 0x000000349e8f5d08 in main (argc=1, argv=0x3add526ae18) at main.c:260

(gdb) thread 2
[Switching to thread 2 (Thread 0x34711563700 (LWP 6986))]
#0  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000034710b66459 in _L_lock_508 () from /lib64/libpthread.so.0
#2  0x0000034710b6627b in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000003470dc9603e in afsql_dd_database_thread (arg=0x349eb408f0) at afsql.c:863
#4  0x00000347110f87f5 in worker_thread_func (st=0x349eb0d1e0) at misc.c:623
#5  0x0000034710df7dc6 in ?? () from /usr/lib64/libglib-2.0.so.0
#6  0x0000034710b63b2a in start_thread () from /lib64/libpthread.so.0
#7  0x00000347108af71d in clone () from /lib64/libc.so.6

I have a gcore from this as well, so I can get any additional info you need.
Sent: Tue Jan 31 2012 05:11:05 GMT-0500 (EST)
From: Gergely Nagy <algernon@balabit.hu>
To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu>
Subject: Re: [syslog-ng] syslog-ng hangs with high message volume to sqlite database
Patrick Hemmer <syslogng@stormcloud9.net> writes:

> Here's some gdb info:
>
> (gdb) info threads
>   Id   Target Id                                   Frame
>   2    Thread 0x34711563700 (LWP 6986) "syslog-ng"  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
> * 1    Thread 0x34711566b00 (LWP 6979) "syslog-ng"  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
> [...]
> (gdb) where
> #0  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x0000034710b66459 in _L_lock_508 () from /lib64/libpthread.so.0
> #2  0x0000034710b6627b in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000003470dc97039 in afsql_dd_queue (s=0x349eb408f0, msg=0x349eb4b430, path_options=0x3add5269ef0, user_data=0x0) at afsql.c:1159
> [...]
>
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x34711563700 (LWP 6986))]
> #0  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) where
> #0  0x0000034710b6afc4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x0000034710b66459 in _L_lock_508 () from /lib64/libpthread.so.0
> #2  0x0000034710b6627b in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000003470dc9603e in afsql_dd_database_thread (arg=0x349eb408f0) at afsql.c:863
> #4  0x00000347110f87f5 in worker_thread_func (st=0x349eb0d1e0) at misc.c:623
> #5  0x0000034710df7dc6 in ?? () from /usr/lib64/libglib-2.0.so.0
> #6  0x0000034710b63b2a in start_thread () from /lib64/libpthread.so.0
> #7  0x00000347108af71d in clone () from /lib64/libc.so.6

This looks interesting, and suspiciously similar to something I saw before. I'll see if I can track it down.

By the way: "thread apply all backtrace full" is a very handy sequence to remember: it gets a full backtrace of all threads, so you don't have to switch between them and do a where each time.
From: Patrick Hemmer <syslogng@stormcloud9.net>

Have you been able to make any progress on this?

I've got a cron job which checks for this every few minutes and `kill -9`s syslog-ng if it sees it. It seems like the entire system will hang if syslog-ng gets into this state. I'm guessing the maximum number of pending socket connections (on /dev/log) is reached, or the already existing sockets fill up, or something. If I don't kill -9 syslog-ng before this happens I can't even log in and am forced to do a power reset on the box.

Thanks
-Patrick
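If that guess is right, the mechanism is simply blocking writes on a full unix socket once the reader stops draining it. A minimal sketch of what every syslog() caller would then run into (using socketpair() instead of a real /dev/log; running it hangs by design):

/* Sketch only: nobody reads the other end of the socket pair, so once the
 * kernel buffers fill, write() blocks forever -- just like a process trying
 * to log to /dev/log while the daemon on the other side is wedged. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
  int fds[2];
  char msg[256];
  long total = 0;

  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
    return 1;

  memset(msg, 'x', sizeof(msg));
  for (;;)
    {
      ssize_t n = write(fds[0], msg, sizeof(msg));  /* blocks once buffers are full */
      if (n < 0)
        break;
      total += n;
      fprintf(stderr, "\rbuffered %ld bytes", total);
    }
  return 0;
}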
Sent: Mon Feb 06 2012 19:28:36 GMT-0500 (EST)
From: Gergely Nagy <algernon@balabit.hu>
To: syslog-ng@lists.balabit.hu
Subject: Re: [syslog-ng] syslog-ng hangs with high message volume to sqlite database
Patrick Hemmer <syslogng@stormcloud9.net> writes:

>> This looks interesting, and suspiciously similar to something I saw before. I'll see if I can track it down.
>>
>> By the way: "thread apply all backtrace full" is a very handy sequence to remember: it gets a full backtrace of all threads, so you don't have to switch between them and do a where each time.
>
> Have you been able to make any progress on this?

Nope, unfortunately I wasn't able to allocate time for this yet. But I'll look into it the next time I'm doing a batch of syslog-ng work (most probably this coming Friday, perhaps earlier, if time permits).

> I've got a cron job which checks for this every few minutes and `kill -9`s syslog-ng if it sees it. It seems like the entire system will hang if syslog-ng gets into this state. I'm guessing the maximum number of pending socket connections (on /dev/log) is reached, or the already existing sockets fill up, or something. If I don't kill -9 syslog-ng before this happens I can't even log in and am forced to do a power reset on the box.

Aye, that's pretty much the same symptoms I saw (except I saw it with mongodb, and the bug that was in there doesn't exist in afsql, so it's probably something similar, yet different).

-- |8]

From: Patrick Hemmer <syslogng@stormcloud9.net>

I think I may have solved this. It was driving me insane, as the problem had gotten even worse: I couldn't go 5 minutes without syslog-ng hanging.

When using flush_timeout, if there were messages pending commit and the flush_timeout was reached, the code wasn't releasing the lock before restarting the loop. Then, immediately after the loop restarted, it tried to take the lock again, but since it hadn't released the last one, it just hung there.

Now, I'm not sure if this is really it or not. It seems to have solved it, but it just seems a little bit too obvious, making me feel like I don't understand what the code is doing there. But as mentioned, it does seem to have solved the issue, as I've now gone a few hours where before I couldn't go a few minutes.

--- syslog-ng-3.3.4.orig/modules/afsql/afsql.c	2011-11-12 07:48:47.000000000 -0500
+++ syslog-ng-3.3.4/modules/afsql/afsql.c	2012-02-09 23:25:02.544892824 -0500
@@ -890,6 +890,7 @@
         {
           afsql_dd_disconnect(self);
           afsql_dd_suspend(self);
+          g_mutex_unlock(self->db_thread_mutex);
           continue;
         }
     }

Note, this does not solve the issue I was getting where it would complain with "error='5: database is locked', query='COMMIT'". This still happens every now and then, but it does seem to recover eventually.
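The shape of that bug, reduced to a minimal standalone sketch (plain pthreads rather than the GMutex the real code uses, and none of the actual afsql logic; running it deadlocks by design, matching the two __lll_lock_wait backtraces above):

/* Sketch only, not afsql.c: the worker locks the mutex at the top of each
 * iteration, but the error path `continue`s without unlocking.  With a
 * default (non-recursive) mutex the next lock attempt blocks forever, and
 * so does any other thread waiting on the same mutex (the role played by
 * afsql_dd_queue() in the real backtrace). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *db_thread(void *arg)
{
  (void)arg;
  for (int i = 0; ; i++)
    {
      pthread_mutex_lock(&db_mutex);        /* top of the worker loop        */

      int commit_failed = (i == 0);         /* pretend the flush/commit failed */
      if (commit_failed)
        {
          /* disconnect + suspend would happen here ...                      */
          /* FIX: pthread_mutex_unlock(&db_mutex);   <- the missing unlock   */
          continue;                         /* next iteration re-locks the
                                               mutex we still hold -> hang   */
        }

      /* normal path: insert/commit, then release the lock */
      pthread_mutex_unlock(&db_mutex);
    }
  return NULL;
}

int main(void)
{
  pthread_t t;
  pthread_create(&t, NULL, db_thread, NULL);
  sleep(1);

  /* Roughly what the main thread does when queuing a message: it also needs
   * db_mutex, so it blocks forever once the worker wedges. */
  pthread_mutex_lock(&db_mutex);
  puts("never reached");
  pthread_mutex_unlock(&db_mutex);

  pthread_join(t, NULL);
  return 0;
}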
Patrick Hemmer <syslogng@stormcloud9.net> writes:
> I think I may have solved this. It was driving me insane, as the problem had gotten even worse: I couldn't go 5 minutes without syslog-ng hanging.
>
> When using flush_timeout, if there were messages pending commit and the flush_timeout was reached, the code wasn't releasing the lock before restarting the loop. Then, immediately after the loop restarted, it tried to take the lock again, but since it hadn't released the last one, it just hung there.
>
> Now, I'm not sure if this is really it or not. It seems to have solved it, but it just seems a little bit too obvious, making me feel like I don't understand what the code is doing there. But as mentioned, it does seem to have solved the issue, as I've now gone a few hours where before I couldn't go a few minutes.
>
> --- syslog-ng-3.3.4.orig/modules/afsql/afsql.c	2011-11-12 07:48:47.000000000 -0500
> +++ syslog-ng-3.3.4/modules/afsql/afsql.c	2012-02-09 23:25:02.544892824 -0500
> @@ -890,6 +890,7 @@
>          {
>            afsql_dd_disconnect(self);
>            afsql_dd_suspend(self);
> +          g_mutex_unlock(self->db_thread_mutex);
>            continue;
>          }
>      }
Nice catch. I was just about to investigate this issue when I saw your fix, and I can tell you that this fix is, indeed, correct (apparently, it was also found internally by one of our developers, but for some reason I overlooked it, and didn't forward it to the list - apologies!). I'll turn it into an appliable patch in a bit. Thanks for the fix!
> Note, this does not solve the issue I was getting where it would complain with "error='5: database is locked', query='COMMIT'". This still happens every now and then, but it does seem to recover eventually.
I may have an idea for this, too - will test in a bit, and get back to you as soon as I find something.

-- |8]
participants (2)
- Gergely Nagy
- Patrick Hemmer