after some investigation and reproduction efforts, it looks like the problem is a combination of:
- remote logging to a server specified by hostname (rather than by IP)
- loss of the management interface (the one used for both syslog traffic and DNS resolution)
- logrotate triggering a syslog-ng reload
when I reproduce the problem (this seems to take a considerable logging load and a minute or so), I can see that writes to /dev/log hang; in the trace below, the sendto() takes over 11 seconds:
root@escort-ct0:/etc# strace -ttT logger test
^C
19:38:46.064830 execve("/usr/bin/logger", ["logger", "test"], [/* 26 vars */]) = 0 <0.000135>
...
19:38:46.069757 socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 1 <0.000015>
19:38:46.069806 connect(1, {sa_family=AF_LOCAL, sun_path="/dev/log"}, 110) = 0 <0.000010>
19:38:46.069850 sendto(1, "<13>Mar 20 19:38:46 root: test", 30, MSG_NOSIGNAL, NULL, 0) = 30 <11.627190>
19:38:57.697110 close(1) = 0 <0.000490>
meanwhile, I see the main thread timing out on name resolution:
19:48:18.214182 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=148, ...}) = 0 <0.000008>
19:48:18.214234 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 31 <0.000012>
19:48:18.214279 connect(31, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.15.83.11")}, 16) = 0 <0.000024>
19:48:18.214332 poll([{fd=31, events=POLLOUT}], 1, 0) = 1 ([{fd=31, revents=POLLOUT}]) <0.000006>
19:48:18.214370 sendto(31, "\343\235\1\0\0\1\0\0\0\0\0\0\rescdev-syslog\3dev\vpurestorage\3com\0\0\1\0\1", 51, MSG_NOSIGNAL, NULL, 0) = 51 <0.000047>
19:48:18.214444 poll([{fd=31, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005040>
this repeats for each of the other configured DNS servers.
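for anyone reading the trace, here is a small sketch that decodes the payload of that sendto() call. it is only an illustration: the function name decode_dns_question is made up for this post, and the bytes are copied from the strace line with its octal escapes rewritten as hex. it shows that the query is an A-record lookup for the syslog server's hostname.

```python
import struct


def decode_dns_question(packet: bytes):
    """Return (query_id, name, qtype, qclass) for a single-question DNS query.

    Hypothetical helper for illustration; it does not handle name compression,
    which a plain outgoing query like this one does not use.
    """
    query_id = struct.unpack("!H", packet[:2])[0]
    offset = 12  # the DNS header is always 12 bytes
    labels = []
    while True:
        length = packet[offset]
        offset += 1
        if length == 0:  # a zero-length label terminates the name
            break
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length
    qtype, qclass = struct.unpack("!HH", packet[offset:offset + 4])
    return query_id, ".".join(labels), qtype, qclass


# Payload from the trace: \343\235 -> \xe3\x9d, \r -> \x0d, \v -> \x0b.
QUERY = (
    b"\xe3\x9d\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"  # header, 1 question
    + b"\x0d" + b"escdev-syslog"
    + b"\x03" + b"dev"
    + b"\x0b" + b"purestorage"
    + b"\x03" + b"com"
    + b"\x00"              # end of name
    + b"\x00\x01\x00\x01"  # qtype 1 (A), qclass 1 (IN)
)

if __name__ == "__main__":
    print(len(QUERY), decode_dns_question(QUERY))
```

the payload length works out to the 51 bytes reported by sendto() in the trace.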
I cannot reproduce the issue if I configure an IP instead of hostname for the syslog server.
we are using UDP with no flow-control configuration, with the expectation that syslog() will never block. indeed, until the reload, it works as expected. after the reload, however, I guess we re-establish the socket for the remote connection, which requires resolving the hostname again. I don’t pretend to understand how this ultimately backs up processing of /dev/log (note that internal() and kernel messages come through just fine during this time).
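to watch the blocking without strace, something like the following rough sketch could be used. it does what logger does in the trace above (connect a datagram socket to /dev/log and send one message) and reports how long the send took. the function name timed_syslog_send and the probe message are made up for illustration.

```python
import socket
import time


def timed_syslog_send(message: bytes, path: str = "/dev/log") -> float:
    """Send one datagram to a Unix syslog socket; return seconds spent in send().

    Hypothetical helper: mirrors the socket()/connect()/sendto() sequence seen
    in the logger trace, using a blocking socket.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect(path)
        start = time.monotonic()
        # On a blocking socket this call waits while the receiver's queue is
        # full, which is the stall seen as <11.627190> in the trace.
        sock.send(message)
        return time.monotonic() - start
    finally:
        sock.close()
```

calling timed_syslog_send(b"<13>probe: test") once a second and logging the result to a file (not to syslog) should show the send time jump from microseconds to seconds when the problem kicks in.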
I guess my question is whether this is a known/expected issue, and whether there’s a resolution other than specifying remote syslog servers by IP, or hardcoding the name resolution in /etc/hosts and pointing to that file with dns-cache-hosts(). basically, I’d like syslog-ng to simply give up if it can’t resolve a remote syslog server’s hostname, rather than let this interfere with servicing /dev/log, with ramifications for callers.
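to make "simply give up" concrete, here is a sketch of the behaviour I have in mind, written in Python purely as an illustration. this is not how syslog-ng is implemented, and resolve_with_deadline, the pool size, and the 2-second deadline are all assumptions of mine. the lookup runs on a helper thread, and the caller stops waiting after a fixed deadline instead of sitting through every resolver timeout.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Lookups run on helper threads so the caller never waits past its deadline.
# Caveat: a lookup that is already stuck keeps its worker thread busy until
# the system resolver itself times out.
_resolver_pool = ThreadPoolExecutor(max_workers=2)


def resolve_with_deadline(host: str, port: int, deadline_s: float):
    """Return a UDP sockaddr for host, or None if resolution fails or is slow.

    Hypothetical helper for illustration only.
    """
    future = _resolver_pool.submit(
        socket.getaddrinfo, host, port, type=socket.SOCK_DGRAM
    )
    try:
        infos = future.result(timeout=deadline_s)
    except (FutureTimeout, socket.gaierror):
        return None  # give up; the caller carries on servicing local input
    return infos[0][4] if infos else None
```

a caller would do something like resolve_with_deadline("escdev-syslog.dev.purestorage.com", 514, 2.0) (514 being the usual syslog UDP port, an assumption here) and, on None, skip the remote destination for now and retry later.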
thanks in advance,
nathan