New subject: supervisor not restarting a failed daemon process

29 Apr 2013

Hi,

I don't have the kernel source handy, but my guess is that the kernel is logging the tid (thread ID) value, which is the same as the pid only as long as the process is single-threaded.
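To illustrate the pid/tid distinction, here is a minimal Python sketch (not taken from syslog-ng or the kernel; the function name pid_and_tids is made up for this example). It assumes Python 3.8 or later for threading.get_native_id(). On Linux the main thread's ID equals the pid, while a worker thread gets a different number, which would be consistent with a segfault in a worker thread being logged under an ID that never shows up in ps.

```python
import os
import threading


def pid_and_tids():
    """Return (pid, calling_thread_tid, worker_thread_tid) for this process."""
    result = {}

    def worker():
        # get_native_id() returns the kernel-assigned thread ID (Python 3.8+).
        result["tid"] = threading.get_native_id()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return os.getpid(), threading.get_native_id(), result["tid"]


if __name__ == "__main__":
    pid, main_tid, worker_tid = pid_and_tids()
    print("pid:", pid)
    print("main thread tid:", main_tid)      # on Linux, equal to the pid
    print("worker thread tid:", worker_tid)  # a different number
```

On Linux the thread IDs of a running process can also be listed under /proc/<pid>/task, which is one way to check whether a number such as 1561 belongs to a thread of the server instance.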

I've checked the supervisor code, and the only way this could happen is a fork/pipe error which was not logged. It should be logged, but who knows, that message may have been lost. The supervisor attempts to restart 3 times, then gives up.
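The restart behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual syslog-ng supervisor code: the function name supervise, the max_restarts parameter, and the log messages are all assumptions made for this example. The point is that a failure to start the child (the fork/pipe case) is logged explicitly before the supervisor gives up.

```python
import subprocess
import sys

MAX_RESTARTS = 3  # taken from the description above, not read from syslog-ng source


def supervise(cmd, max_restarts=MAX_RESTARTS, log=print):
    """Run cmd, restarting it after abnormal exits; return the list of exit codes."""
    codes = []
    restarts = 0
    while True:
        try:
            proc = subprocess.Popen(cmd)
        except OSError as exc:
            # Start-up failure (the fork/pipe case): log it so it cannot go unnoticed.
            log(f"failed to start daemon: {exc}")
            return codes
        code = proc.wait()  # a negative value means "killed by that signal number"
        codes.append(code)
        if code == 0:
            log("daemon exited cleanly, not restarting")
            return codes
        if restarts >= max_restarts:
            log(f"daemon failed with {code}, giving up after {restarts} restarts")
            return codes
        restarts += 1
        log(f"daemon failed with {code}, restart {restarts}/{max_restarts}")


if __name__ == "__main__":
    # Hypothetical child that always fails, to exercise the give-up path.
    supervise([sys.executable, "-c", "import sys; sys.exit(1)"])
```

With this structure, a supervisor that disappears without restarting the daemon would always leave either a "failed to start" or a "giving up" message behind, provided the log destination itself is still reachable.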

Hmmm, the supervisor messages may be redirected to syslog after the first startup, which might explain why they don't get logged.

But why does fork/pipe fail?

Hope this helps.

Evan Rempel <erempel@uvic.ca> wrote:
...
The logs below show a "standard" syslog-ng instance (process ID 15017) that reads /dev/log and /proc/kmsg.
The second instance of syslog-ng is what we call our "server", which just listens on the network ports
and does all of the complex patterndb, filtering, and routing to destination processes.
kern.info kernel: syslog-ng[1561]: segfault at 7f65c0000078 ip 00007f65c0000078 sp 00007f65e1385a48 error 15
--- this is the server instance segfaulting (I assume, see WAIT below)
syslog.notice syslog-ng[15017]: Syslog connection closed; fd='20', client='AF_INET(142.104.47.145:49803)', local='AF_INET(127.0.0.1:1514)'
syslog.notice syslog-ng[15017]: Syslog connection broken; fd='14', server='AF_INET(142.104.47.146:514)', time_reopen='5'
--- this was the standard syslog-ng losing its connection to the server and detecting that its server instance destination had dropped.
daemon.info syslog-ng-stats: server stopping on socket "/var/local/syslog-ng.server.ctl"
daemon.info msgid_profiler[832]: committing residual data
local0.info flare-timer[834]: stopping
local0.info action-handler[833]: stopping
--- these are all program destinations of the server instance shutting down gracefully after the close of their stdin.
daemon.crit supervise/syslog-ng[27221]: Daemon exited due to a deadlock/signal/failure, restarting; exitcode='11'
syslog.err syslog-ng[15017]: Syslog connection failed; fd='14', server='AF_INET(142.104.47.146:514)', error='Connection refused (111)', time_reopen='5'
This "connection failed" message repeats every 5 seconds until I restart the server instance.
syslog.notice syslog-ng[1911]: syslog-ng starting up; version='3.4.1'
So it does not look like there is anything in the logs about attempted restarts.
WAIT...
This is really odd. The line
kern.info kernel: syslog-ng[1561]: segfault at 7f65c0000078 ip 00007f65c0000078 sp 00007f65e1385a48 error 15
implies that there was a process ID 1561 that segfaulted, but that line is the ONLY logged line with that process ID.
We take ps snapshots every 15 minutes, and those snapshots don't show anything for that process ID.
Also, the supervisor processID is shown
USER       PID  PPID  NI PRI CPU    VSZ    ELAPSED       TIME COMMAND
root     27221     1   0  19   -  26556 9-04:00:59   00:00:00 supervising syslog-ng
root     27222 27221   0  19   - 977064 9-04:00:59 1-02:52:35 /usr/local/sbin/syslog-ng --cfgfile= ...
which matches the line
daemon.crit supervise/syslog-ng[27221]: Daemon exited due to a deadlock/signal/failure, restarting; exitcode='11'
so the child that died should have been process ID 27222. So why does the log line read
kern.info kernel: syslog-ng[1561]: segfault at 7f65c0000078 ip 00007f65c0000078 sp 00007f65e1385a48 error 15
I conclude that 1561 is not the process ID.
Can you shed any light on this?
Evan.
On 04/26/2013 11:00 PM, Balazs Scheidler wrote:
...
Strange, indeed. The supervisor gives up if the restarted daemon exits for some reason. E.g., if there's an initialization error, it gives up. Any indication in the logs?
Evan Rempel <erempel@uvic.ca> wrote:
...
We are seeing the log line
supervise/syslog-ng[27221]: Daemon exited due to a deadlock/signal/failure, restarting; exitcode='11'
and it looks like it should restart, but instead of restarting,
the supervisor terminates and then no syslog-ng process is running.
Is this a bug in the supervisor?
______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq
-- 
Evan Rempel                                      erempel@uvic.ca
Senior Systems Administrator                        250.721.7691
Data Centre Services, University Systems, University of Victoria
