[syslog-ng] runaway programs started by syslog-ng

Sat Oct 28 09:26:41 CEST 2006

On Fri, 2006-10-27 at 17:40 -0700, Dev Man wrote:
> A followup...
> On a test server, I have noticed that when the system reboots there is
> one instance of syslog-ng but two instances of the program being executed.
> There should only be one.
> 
> If I do a /etc/init.d/syslog-ng restart, then I end up with just one
> instance of the program as expected, and the system appears to be
> fine.

syslog-ng should restart the program when it receives a SIGCHLD signal
which indicates that the child exited.

Hmm.. I have reread that part of the code and in reality it restarts
child processes in two cases:

- when SIGCHLD is received
- when the pipe to the process returns EPIPE

The two excludes each other by the means of a variable, which could be
inherently racy (a signal handler racing with conventional code) however
this is not the case as the SIGCHLD handler is not called from inside
the signal handler, its execution is deferred until the next iteration
of the main loop. (see main_loop_run() function, and the processing of
the sig_child_received variable)

Therefore the running of the EPIPE handler and SIGCHLD is sequential in
nature, so if an EPIPE is received first, the program will be restarted
and the SIGCHLD handler will see this... Hmmm wait a minute, the
sigchild handler does not check whether the terminated process is the
same as the currently running one, so here's a possible bug scenario:

- EPIPE is received, program is restarted, its pid is recorded to
self->pid
- SIGCHLD is received, the signal handler meets with pid != -1, thus it
restarts the process without killing the previous one (as it should be
dead already)

This could explain why you have two instances running on startup, and
the sequence of EPIPE/SIGCHLD determines whether a process is leaked.

I may also explain the large number of processes you have, provided the
program frequently exits, syslog-ng might start a leaked process for
each exit, which at the end results in a lot of instances running.

I have committed the patch below (with some other minor fixes), which
should fix the problem:

--- orig/src/afprog.c
+++ mod/src/afprog.c
@@ -50,8 +50,9 @@ afprogram_dd_exit(pid_t pid, int status,
   AFProgramDestDriver *self = (AFProgramDestDriver *) s;

   /* Note: self->pid being -1 means that deinit was called, thus we don't
-   * need to restart the command */
-  if (self->pid != -1)
+   * need to restart the command. self->pid might change due to EPIPE
+   * handling restarting the command before this handler is run. */
+  if (self->pid != -1 && self->pid == pid)
     {
       msg_verbose("Child program exited, restarting",
                   evt_tag_str("cmdline", self->cmdline->str),
@@ -105,6 +106,7 @@ afprogram_dd_init(LogPipe *s, GlobalConf
       dup2(devnull, 1);
       dup2(devnull, 2);
       close(devnull);
+      close(msg_pipe[0]);
       close(msg_pipe[1]);
       execl("/bin/sh", "/bin/sh", "-c", self->cmdline->str, NULL);
       _exit(127);
@@ -115,7 +117,6 @@ afprogram_dd_init(LogPipe *s, GlobalConf

       child_manager_register(self->pid, afprogram_dd_exit, log_pipe_ref(&self->super.super), (GDestroyNotify) log_pipe_unref);

-      g_fd_set_cloexec(msg_pipe[1], TRUE);
       close(msg_pipe[0]);
       if (!self->writer)
         self->writer = log_writer_new(LW_FORMAT_FILE, s, &self->writer_options);


--- orig/src/gsockaddr.c
+++ mod/src/gsockaddr.c
@@ -698,7 +698,7 @@ g_sockaddr_unix_format(GSockAddr *addr,
   GSockAddrUnix *unix_addr = (GSockAddrUnix *) addr;

   g_snprintf(text, n, "AF_UNIX(%s)",
-             unix_addr->saun.sun_path[0] ? unix_addr->saun.sun_path
+             unix_addr->salen > sizeof(unix_addr->saun.sun_family) && unix_addr->saun.sun_path[0] ? unix_addr->saun.sun_path
                                           : "anonymous");
   return text;
 }


--- orig/src/logwriter.c
+++ mod/src/logwriter.c
@@ -266,9 +266,12 @@ log_writer_format_log(LogWriter *self, L
 static void
 log_writer_broken(LogWriter *self, gint notify_code)
 {
-  /* the connection seems to be broken */
-  log_pipe_notify(self->control, &self->super, notify_code, self);
+  /* the order of these calls is important, as log_pipe_notify() will handle
+   * reinitialization, and if deinit is called last, the writer might be
+   * left in an unpolled state */
+
   log_pipe_deinit(&self->super, NULL, NULL);
+  log_pipe_notify(self->control, &self->super, notify_code, self);
 }

 static gboolean





-- 
Bazsi