Deadlock when logging to SQL destination (PostgreSQL)
Hello everyone,

I'd like to ask whether anyone else has run into this before. I have a setup where I store the maillog and mailfilter log in a PostgreSQL database via syslog-ng. At the same time, PostgreSQL itself logs via syslog-ng.

This setup creates some sort of deadlock: after some time of processing (it varies), syslog-ng freezes and completely blocks the server. It is impossible to do anything on the machine, because every call into syslog just blocks. I cannot open a new SSH session, since the login is logged via syslog. Email stops working, since sendmail logs via syslog. These and all other applications simply hang whenever they try to write something to syslog.

The deadlock is caused by the rather "circular" setup in which syslog-ng writes into PostgreSQL and PostgreSQL logs back into syslog. I can reproduce the problem in my testing environment. When PostgreSQL does not log into syslog (only CSV logs enabled), everything works fine. The same holds when syslog-ng stores messages in a PostgreSQL instance on a different host. But in the scenario where PostgreSQL runs on localhost and logs into syslog, I hit these deadlocks.

Does anyone have a similar experience?

We normally run syslog-ng 3.4.6, but I have just tested the latest version, 3.4.8, and the same problem appears. PostgreSQL is version 9.2. We use our own CentOS-based distro (currently CentOS 6).

I can provide configs or reproduce the problem if needed. Thanks for any help or hints.

--
Tomáš Novosad
LinuxBox.cz, s.r.o.
28. října 168, 709 00 Ostrava
tel.: +420 591 166 221, mobile: +420 737 238 655
email: tomas.novosad@linuxbox.cz, jabber: novosad@linuxbox.cz
www.linuxbox.cz
service mobile: +420 737 238 656, service email: servis@linuxbox.cz
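A minimal sketch of what the setup described above could look like (the actual configs were not posted in the thread; all names, credentials, table layout and the PostgreSQL settings in the comments are invented for illustration only):

  # syslog-ng: store local mail logs in PostgreSQL on localhost
  source s_local {
      system();        # local log sockets; PostgreSQL's own syslog output arrives here as well
      internal();      # syslog-ng's internal messages
  };

  filter f_mail { facility(mail); };

  destination d_pgsql {
      sql(
          type(pgsql)
          host("127.0.0.1") username("syslog") password("secret")
          database("logs")
          table("maillog_${R_YEAR}${R_MONTH}${R_DAY}")
          columns("datetime", "host", "program", "message")
          values("${R_ISODATE}", "${HOST}", "${PROGRAM}", "${MESSAGE}")
          indexes("datetime", "host")
      );
  };

  log { source(s_local); filter(f_mail); destination(d_pgsql); };

  # postgresql.conf: PostgreSQL logs back into the same syslog, closing the loop
  #   log_destination = 'syslog'
  #   syslog_facility = 'LOCAL0'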
Hi,

Not directly similar, but we observed somewhat analogous behaviour using syslog-ng to log to a remote Elasticsearch destination. The system showed the same symptoms: no ability to connect or log in.

We have no logs or data to support this directly, but my suspicion is that it had to do with TCP connections, possibly "maxing out" the system and leaving it unable to establish any new connections. We could not (and cannot) run any real diagnostics on that system, since it is a production server (which is rather painful when it locks up). But I would suggest looking at system-level measurements and settings (memory, TCP connections).

Also, you don't mention load. Do you have the ability to control the load and monitor the system as it increases? You might try holding a small load for some time (minutes? hours?) while measuring system behaviour, then ramping up in a "step function" to see if you can identify what is going on.

Good luck,
Jim
Hello, and thanks for the reply.

However, that is not my case. The problem is not caused by a TCP connection limit: the SSH daemon tries to log a message via syslog-ng and blocks on that attempt, because syslog-ng itself is blocked. syslog-ng is simply stuck and does nothing, and that completely blocks every service on the server that tries to log via syslog.

I can reproduce the problem on a virtual machine, so I do have access to the blocked system. Even the command

  # logger "bla bli"

just hangs and blocks.

A colleague attached a debugger to the blocked syslog-ng and found it stuck in this position (which does not tell me much):

  #0  0x00007f7948ff6654 in __lll_lock_wait () from /lib64/libpthread.so.0
  #1  0x00007f7948ff1f4a in _L_lock_1034 () from /lib64/libpthread.so.0
  #2  0x00007f7948ff1e0c in pthread_mutex_lock () from /lib64/libpthread.so.0
  #3  0x00007f79471ef3da in afsql_dd_message_became_available_in_the_queue (user_data=0x2778340) at afsql.c:889
  #4  0x00007f794a52dfea in log_queue_push_notify (self=0x259f0f0) at logqueue.c:58
  #5  0x00007f794a52f029 in log_queue_fifo_push_tail (s=0x259f0f0, msg=0x2725cf0, path_options=0x7fffc5c92660) at logqueue-fifo.c:263
  #6  0x00007f79471efbc4 in log_queue_push_tail (s=0x2778340, msg=0x2725cf0, path_options=0x7fffc5c92660, user_data=0x0) at ../../lib/logqueue.h:84
  #7  afsql_dd_queue (s=0x2778340, msg=0x2725cf0, path_options=0x7fffc5c92660, user_data=0x0) at afsql.c:1198
  #8  0x00007f794a527a55 in log_pipe_queue (s=0x25a3c60, msg=0x2725cf0, path_options=0x7fffc5c927b0, user_data=<value optimized out>) at logpipe.h:320
  #9  log_multiplexer_queue (s=0x25a3c60, msg=0x2725cf0, path_options=0x7fffc5c927b0, user_data=<value optimized out>) at logmpx.c:106
  #10 0x00007f794a527a55 in log_pipe_queue (s=0x25ff210, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #11 log_multiplexer_queue (s=0x25ff210, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logmpx.c:106
  #12 0x00007f794a5311d5 in log_pipe_forward_msg (s=0x277b1c0, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #13 log_rewrite_queue (s=0x277b1c0, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logrewrite.c:68
  #14 0x00007f794a5311d5 in log_pipe_forward_msg (s=0x277a980, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #15 log_rewrite_queue (s=0x277a980, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logrewrite.c:68
  #16 0x00007f794a5311d5 in log_pipe_forward_msg (s=0x277a1a0, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #17 log_rewrite_queue (s=0x277a1a0, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logrewrite.c:68
  #18 0x00007f794a5311d5 in log_pipe_forward_msg (s=0x2779bc0, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #19 log_rewrite_queue (s=0x2779bc0, msg=0x2725cf0, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logrewrite.c:68
  #20 0x00007f794a52ac7c in log_pipe_forward_msg (s=0x2779460, msg=0x2689d00, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #21 log_parser_queue (s=0x2779460, msg=0x2689d00, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logparser.c:81
  #22 0x00007f794a522fcb in log_pipe_forward_msg (s=0x2788480, msg=0x26b4230, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at logpipe.h:320
  #23 log_filter_pipe_queue (s=0x2788480, msg=0x26b4230, path_options=0x7fffc5c92fe0, user_data=<value optimized out>) at filter.c:731
  #24 0x00007f794a527744 in log_pipe_queue (s=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:320
  #25 log_pipe_forward_msg (s=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:289
  #26 log_pipe_queue (s=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:324
  #27 log_pipe_forward_msg (s=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:289
  #28 log_pipe_queue (s=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:324
  #29 0x00007f794a527af1 in log_pipe_queue (s=0x275bbe0, msg=0x26b4230, path_options=0x7fffc5c93160, user_data=<value optimized out>) at logpipe.h:289
  #30 log_multiplexer_queue (s=0x275bbe0, msg=0x26b4230, path_options=0x7fffc5c93160, user_data=<value optimized out>) at logmpx.c:106
  #31 0x00007f794a531dbe in log_pipe_queue (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:320
  #32 log_pipe_forward_msg (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:289
  #33 log_pipe_queue (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:324
  #34 log_pipe_forward_msg (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:289
  #35 log_pipe_queue (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:324
  #36 log_pipe_forward_msg (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:289
  #37 0x00007f794a531e2d in log_pipe_queue (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:324
  #38 log_pipe_forward_msg (self=<value optimized out>, msg=0x26b4230, path_options=0x0) at logpipe.h:289

I think I'll have to report this to Balabit.

Thanks for the help.

-- Tomáš Novosad, LinuxBox.cz
Hi,

You probably have flow-control enabled, in which case syslog-ng stops reading its inputs as soon as the destination stalls. Workarounds:

1) Increase the window size (log-iw-size on the source) and the destination buffer size (the log-fifo-size option on the destination). This gives syslog-ng more leeway before it blocks when its destination blocks.
2) Have PostgreSQL log to a file, and read that file back with a file source.
3) Last but not least, disable flow-control.

Hope this helps,

-- Bazsi
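A rough sketch of what those workarounds could look like in the configuration (option names as in syslog-ng 3.4; the values, paths and credentials are arbitrary examples, not tuned recommendations):

  # 1) more leeway: larger source window and destination queue
  source s_local {
      unix-dgram("/dev/log" log_iw_size(10000));   # log-iw-size: source window size
      internal();
  };

  destination d_pgsql {
      sql(
          type(pgsql)
          host("127.0.0.1") username("syslog") password("secret")
          database("logs") table("messages")
          columns("datetime", "host", "program", "message")
          values("${R_ISODATE}", "${HOST}", "${PROGRAM}", "${MESSAGE}")
          log_fifo_size(100000)                    # log-fifo-size: destination buffer
      );
  };

  # 2) break the loop: let PostgreSQL write its own log file
  #    (postgresql.conf: logging_collector = on, log_destination = 'stderr';
  #     adjust the path below to the actual log_filename setting)
  source s_pgsql_log {
      file("/var/lib/pgsql/9.2/data/pg_log/postgresql.log" follow_freq(1));
  };

  # 3) or make sure the log path feeding d_pgsql carries no flags(flow-control)
  log { source(s_local); destination(d_pgsql); };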
Hi,

Checking the backtrace again, it might actually be a genuine deadlock. Can you give a backtrace of all the threads while syslog-ng is stalled like that?

-- Bazsi
Hello,

In fact, I ran into that flow-control problem earlier. I have already disabled flow-control, but after some time this "deadlock" problem appeared anyway. I will reproduce the problem and provide as much information as you need. Do you want me to post the data here on the mailing list, or should I open a bug report somewhere (on the Balabit website)?

Thanks for the help.

-- Tomáš Novosad, LinuxBox.cz
Hello,

My colleague prepared the backtraces; please see the attached file. Let me know if there is anything else I should provide (config files, logs, ...).

Thanks for any suggestions or help.

-- Tomáš Novosad, LinuxBox.cz
Hi,

Thanks for the information. The culprit seems to be that we keep a mutex locked for the duration of the COMMIT statement, which apparently takes a long time. That in turn stops the processing of the inputs.

Is there any reason why PostgreSQL stalls in COMMIT? Can you check the pgsql side? Maybe syslog-ng is actually progressing, just very, very slowly: if, for instance, each COMMIT takes a minute, and syslog-ng then wakes up, processes one message and issues another COMMIT, it may look as if it had stopped completely, whereas it is just waiting on long operations.

Hmm, it would probably make sense to release the lock around the commit operation, though.

-- Bazsi
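If the COMMITs themselves turn out to be the slow part, one knob that might be worth trying on the syslog-ng side, assuming the sql() driver in this 3.4 build supports the explicit-commits flag and flush_lines() (not something verified in this thread), is batching several INSERTs into one transaction so that COMMIT runs far less often:

  destination d_pgsql {
      sql(
          type(pgsql)
          host("127.0.0.1") username("syslog") password("secret")
          database("logs") table("messages")
          columns("datetime", "host", "program", "message")
          values("${R_ISODATE}", "${HOST}", "${PROGRAM}", "${MESSAGE}")
          flags(explicit-commits)   # commit once per batch instead of per message
          flush_lines(100)          # roughly one COMMIT per 100 messages
      );
  };

Whether this helps depends on why COMMIT is slow in the first place (for example synchronous_commit settings, checkpoints, or I/O contention with the rest of the workload on the same host).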