Hi, Just a quick update, after changing "dns-cache-size” to 10000, syslog-ng has been running for 3 days, without any restarts due to the SIGABRT. I guess the issues can be easily reproduced by setting “dns-cache-size” to 1. This issues is by the way related to the “Listen backlog issue” reported in: https://lists.balabit.hu/pipermail/syslog-ng/2016-September/023228.html (and https://github.com/balabit/syslog-ng/pull/1211). The sudden surge in SYN packets was caused because syslog-ng restarted due to the SIGABRT, causing all connected clients to reconnect. With a couple of thousand clients, it was easy to trigger a situation, where more than 255 clients reconnected at once, causing the hardcoded listen backlog to run full. /Claus A
On 26. sep. 2016, at 13.53, Claus Albøge <ca@tdchosting.dk> wrote:
Hi,
I tried to minimize my config as much as possible, but was unable to reproduce - had it run for a couple of hours.
Looking at the source code pointed my towards “effective_dns_cache_options" and dns-cache-size (https://www.balabit.com/documents/syslog-ng-ose-latest-guides/en/syslog-ng-o...)
Setting dns-cache-size to 1 on my minimized config triggered the SIGABRT within a few seconds. (Default value seems to be 1007!, which is higher that the number of hosts sending logs to my syslog-ng instance with the minimized config)
Setting dns-cache-size to 10000 on a syslog-instance with the “full” number of log sources, seems to work around the issue, as syslog-ng has been running for almost a hour now without any restarts - normally it restarts every few minutes.
So it does indeed seems like an issue with the dnscache logic. You might be able to reproduce by setting dns-cache-size to a very low number.
If you still need a core dump to debug this further, I might need a bit of help creating the coredump correctly. After the SIGABRT, running gcore in gdb creates a core, but if I try to read this core again using gdb, I’m unable to do a backtrace. Do I need to do anything special?
/Claus A
On 23. sep. 2016, at 08.37, Claus Albøge <ca@tdchosting.dk> wrote:
Hi,
I’ll try to reproduce with a minimal configuration, without ssl-certs, etc.
Syslog-ng binaries are from: https://copr-be.cloud.fedoraproject.org/results/czanik/syslog-ng38/epel-7-x8...
/Claus A
On 23. sep. 2016, at 08.25, Balazs Scheidler <bazsi77@gmail.com> wrote:
Is there a possibility for me to receive the core file and the binaries? (Syslog-ng plus dependent libs)
This seems to be some kind of dns cache corruption, which is difficult to diagnose without those.
Bazsi
On Sep 22, 2016 3:29 PM, "Claus Albøge" <ca@tdchosting.dk> wrote: Hi,
# gdb --args /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/syslog-ng...Reading symbols from /usr/lib/debug/usr/sbin/syslog-ng.debug...done. done. (gdb) set pagination off (gdb) run Starting program: /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". [New Thread 0x7ffff1e8a700 (LWP 12861)] [New Thread 0x7ffff1689700 (LWP 12862)] [New Thread 0x7ffff0e88700 (LWP 12863)] [New Thread 0x7fffe3fff700 (LWP 13103)] [New Thread 0x7fffe37fe700 (LWP 13108)] [New Thread 0x7fffe2ffd700 (LWP 13109)] [New Thread 0x7fffe27fc700 (LWP 13110)] [New Thread 0x7fffe1ffb700 (LWP 13111)] [Thread 0x7ffff1689700 (LWP 12862) exited] [Thread 0x7ffff1e8a700 (LWP 12861) exited] [New Thread 0x7ffff1e8a700 (LWP 13395)] [New Thread 0x7ffff1689700 (LWP 13396)] [Thread 0x7fffe37fe700 (LWP 13108) exited] [Thread 0x7fffe2ffd700 (LWP 13109) exited] [New Thread 0x7fffe37fe700 (LWP 13646)] [New Thread 0x7fffe2ffd700 (LWP 13647)] [Thread 0x7ffff1689700 (LWP 13396) exited] [Thread 0x7fffe3fff700 (LWP 13103) exited] [New Thread 0x7fffe3fff700 (LWP 13959)] [New Thread 0x7ffff1689700 (LWP 13960)] [Thread 0x7ffff1e8a700 (LWP 13395) exited] [Thread 0x7fffe1ffb700 (LWP 13111) exited] [New Thread 0x7fffe1ffb700 (LWP 14040)] [New Thread 0x7ffff1e8a700 (LWP 14041)] [Thread 0x7fffe3fff700 (LWP 13959) exited] [Thread 0x7ffff0e88700 (LWP 12863) exited] [New Thread 0x7ffff0e88700 (LWP 14182)] [New Thread 0x7fffe3fff700 (LWP 14183)] ** ERROR:lib/dnscache.c:112:dns_cache_key_hash: code should not be reached
Program received signal SIGABRT, Aborted. [Switching to Thread 0x7ffff0e88700 (LWP 14182)] 0x00007ffff5fb05f7 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.163-3.el7.x86_64 elfutils-libs-0.163-3.el7.x86_64 eventlog-0.2.13-4.el7.x86_64 glib2-2.42.2-5.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 ivykis-0.36.2-2.el7.x86_64 json-c-0.11-4.el7_0.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-12.el7_2.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libgcrypt-1.5.3-12.el7_1.1.x86_64 libgpg-error-1.12-3.el7.x86_64 libnet-1.1.6-7.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libuuid-2.23.2-26.el7_2.3.x86_64 openssl-libs-1.0.1e-51.el7_2.5.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-19.el7_2.13.x86_64 tcp_wrappers-libs-7.6-77.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64 (gdb) bt full #0 0x00007ffff5fb05f7 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007ffff5fb1ce8 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00007ffff70426d5 in g_assertion_message () from /lib64/libglib-2.0.so.0 No symbol table info available. #3 0x00007ffff704276a in g_assertion_message_expr () from /lib64/libglib-2.0.so.0 No symbol table info available. #4 0x00007ffff7b62cc4 in dns_cache_key_hash (e=<optimized out>) at lib/dnscache.c:112 e = <optimized out> #5 0x00007ffff700c4fc in g_hash_table_remove_internal () from /lib64/libglib-2.0.so.0 No symbol table info available. #6 0x00007ffff7b67584 in resolve_sockaddr_to_inet_or_inet6_hostname (host_resolve_options=0x651754, saddr=0x183d350, result_len=0x7ffff0e85370) at lib/host-resolve.c:332 hname = 0x7ffff0e864b8 "213.174.76.60" hname_len = 5 positive = 0 dnscache_key = 0x183d368 #7 resolve_sockaddr_to_hostname (result_len=result_len@entry=0x7ffff0e85370, saddr=0x183d350, host_resolve_options=0x651754) at lib/host-resolve.c:343 No locals. #8 0x00007ffff7b6c03a in log_source_mangle_hostname (self=self@entry=0x1868c10, msg=msg@entry=0x7fffe87cdd40) at lib/logsource.c:156 resolved_name = <optimized out> resolved_name_len = 140737093885344 #9 0x00007ffff7b6c280 in log_source_queue (s=0x1868c10, msg=0x7fffe87cdd40, path_options=0x7ffff0e85550, user_data=<optimized out>) at lib/logsource.c:266 self = 0x1868c10 i = <optimized out> __func__ = "log_source_queue" #10 0x00007ffff7b6ca0a in log_pipe_queue (path_options=<optimized out>, msg=<optimized out>, s=<optimized out>) at lib/logpipe.h:339 No locals. #11 log_pipe_forward_msg (self=<optimized out>, path_options=<optimized out>, msg=<optimized out>) at lib/logpipe.h:302 No locals. #12 log_pipe_queue (path_options=<optimized out>, msg=<optimized out>, s=<optimized out>) at lib/logpipe.h:343 No locals. #13 log_pipe_forward_msg (self=<optimized out>, path_options=<optimized out>, msg=<optimized out>) at lib/logpipe.h:302 No locals. #14 log_pipe_queue (path_options=<optimized out>, msg=0x7fffe87cdd40, s=0x7fffe87cdd40) at lib/logpipe.h:343 No locals. #15 log_source_post (self=self@entry=0x1868c10, msg=msg@entry=0x7fffe87cdd40) at lib/logsource.c:249 path_options = {ack_needed = 1, flow_control_requested = 0, matched = 0x0} old_window_size = <optimized out> __func__ = "log_source_post" #16 0x00007ffff7b6b179 in log_reader_handle_line (aux=0x7ffff0e85600, length=<optimized out>, line=<optimized out>, self=0x1868c10) at lib/logreader.c:320 m = 0x7fffe87cdd40 #17 log_reader_fetch_log (self=self@entry=0x1868c10) at lib/logreader.c:381 bookmark = <optimized out> msg = 0x7fffe445cef5 "<14>1 2016-09-22T10:51:07.178337+02:00 pafodev02.int.patientforsikringen.dk Microsoft-Windows-Security-Auditing 608 - [NXLOG@47121 AssetID=\"105939\" CustomerID=\"0002592\" Location=\"Hyper-V - Farm 01 (ps"... msg_len = 4245 status = <optimized out> msg_count = 1 may_read = 1 aux = {peer_addr = 0x0, data = "\000\207\350\360\377\177\000\000)\000\000\000\000\000\000\000\304\335U\366\377\177\000\000\000Y\350\360\377\177\000\000\230\235\332\001\000\000\000\000@ \245\350\377\177\000\000h\313U\366\377\177\000\000@V\350\360\377\177\000\000@V\350\360\377\177\000\000\000Y\350\360\377\177\000\000\340\037\245\350\377\177\000\000\b\000\000\000\000\000\000\000\250\037\245\350\377\177\000\000\200V\350\360\377\177\000\000\223\341U\366\377\177\000\000\001\000\000\000\000\000\000\000\006Y\350\360\377\177\000\000\aY\350\360\377\177\000\000\000Y\350\360\377\177\000\000\aY\350\360\377\177\000\000gk\266\367\377\177\000\000\220", '\000' <repeats 15 times>, "\240 E\001\000\000\000\000\301\326\003\367\377\177\000\000"..., end_ptr = 0} __func__ = "log_reader_fetch_log" #18 0x00007ffff7b6b299 in log_reader_work_perform (s=0x1868c10) at lib/logreader.c:88 self = 0x1868c10 #19 0x00007ffff7b7307d in _work (self=<optimized out>) at lib/mainloop-io-worker.c:52 No locals. #20 0x00007ffff655dc67 in iv_work_thread_do_work () from /lib64/libivykis.so.0 No symbol table info available. #21 0x00007ffff655cfc3 in iv_run_tasks () from /lib64/libivykis.so.0 No symbol table info available. #22 0x00007ffff655f7dc in iv_main () from /lib64/libivykis.so.0 No symbol table info available. #23 0x00007ffff655dab4 in iv_work_thread () from /lib64/libivykis.so.0 No symbol table info available. #24 0x00007ffff6560270 in iv_thread_handler () from /lib64/libivykis.so.0 No symbol table info available. #25 0x00007ffff6344dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #26 0x00007ffff6071ced in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) frame 6 #6 0x00007ffff7b67584 in resolve_sockaddr_to_inet_or_inet6_hostname (host_resolve_options=0x651754, saddr=0x183d350, result_len=0x7ffff0e85370) at lib/host-resolve.c:332 332 dns_caching_store(saddr->sa.sa_family, dnscache_key, hname, positive); (gdb) print *saddr $1 = {refcnt = {counter = 3}, flags = 0, sa_funcs = 0x7ffff7dd7060 <inet_sockaddr_funcs>, salen = 16, sa = {sa_family = 2, sa_data = "\225\327ծL<\000\000\000\000\000\000\000"}} (gdb) print *saddr->sa Structure has no component named operator*. (gdb) print saddr $2 = (GSockAddr *) 0x183d350
/Claus A
On 21. sep. 2016, at 15.37, Szalai, Attila <Attila.Szalai@morganstanley.com> wrote:
Hi,
If it is convenient to you, just for the sake of my curiosity, can you run some more command in the gdb session, after you reproduced the issue?
* bt full * frame 6 (you should search for the line with resolve_sockaddr_to_inet_or_inet6_hostname and write the number written in that line. In the case of your previous email, it was #6) * print *saddr * print *saddr->sa
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq