[syslog-ng] Problem with syslog-ng 1.6.8 and HUP

Wed Sep 10 14:47:09 CEST 2008

Hi,

	I have been working on an issue for the last few days with one
of our main centralized syslog servers which is a SPARC Solaris 9 box.
It is running syslog-ng 1.6.8.  Our log rotation script moves,
compresses and then touches a new file depending on size.  Then script
then sends an HUP to syslog-ng in the following way :

/usr/bin/pkill -HUP syslog-ng > /dev/null 2>&1

	I have added some debugging to the script to make sure it grabs
the proper PID, and logs it independantly : 

Sep 10 07:58:01 machine01 syslogmgr[8304]: info: signaling
Syslog/Syslog-ng (HUP) at PID 4362.

	And it is  : 

root at spdmze21# ps -ef | grep syslog-ng
    root  4362     1  2   Sep 07 ?       160:38 /usr/sbin/syslog-ng

	Now in the log files, it seems syslog-ng isn't always reacting
to the kill.  This is what I get as far as syslog-ng noticing the signal
in the last 2 days : 

root at machine01# grep syslog-ng complete.log | grep restarting
Sep  9 09:58:06 machine01 syslog-ng[4362]: SIGHUP received, restarting
syslog-ng
Sep  9 15:58:03 machine01 syslog-ng[4362]: SIGHUP received, restarting
syslog-ng
Sep  9 19:58:07 machine01 syslog-ng[4362]: SIGHUP received, restarting
syslog-ng
Sep 10 01:58:04 machine01 syslog-ng[4362]: SIGHUP received, restarting
syslog-ng
Sep 10 03:58:05 machine01 syslog-ng[4362]: SIGHUP received, restarting
syslog-ng

	However, it was signaled everyhour : 

Sep 10 00:58:02 machine01 syslogmgr[20532]: [ID 702911 daemon.info]
info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 01:58:04 machine01 syslogmgr[26686]: [ID 702911 daemon.info]
info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 02:58:02 machine01 syslogmgr[3702]: [ID 702911 daemon.info] info:
signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 03:58:05 machine01 syslogmgr[9935]: [ID 702911 daemon.info] info:
signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 04:58:04 machine01 syslogmgr[17262]: [ID 702911 daemon.info]
info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 05:58:02 machine01 syslogmgr[24178]: [ID 702911 daemon.info]
info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 06:58:02 machine01 syslogmgr[1748]: [ID 702911 daemon.info] info:
signaling Syslog/Syslog-ng (HUP) at PID 4362.
Sep 10 07:58:01 machine01 syslogmgr[8304]: [ID 702911 daemon.info] info:
signaling Syslog/Syslog-ng (HUP) at PID 4362.

	Now the problem this creates for me is that it leaves a bunch of
open file descriptors which don't point to a file anymore : 

root at machine01# lsof -p 4362 -s | grep d80
syslog-ng 4362 root    7w  VREG         85,80   9699746     40
/var/adm/log (/dev/md/dsk/d80)
syslog-ng 4362 root   17w  VREG         85,80   3223056 500412
/var/adm/log (/dev/md/dsk/d80)
syslog-ng 4362 root   18w  VREG         85,80  11604669 612873
/var/adm/log (/dev/md/dsk/d80)
syslog-ng 4362 root   20w  VREG         85,80   8617243     90
/var/adm/log (/dev/md/dsk/d80)
syslog-ng 4362 root   23w  VREG         85,80  11698446 612889
/var/adm/log (/dev/md/dsk/d80)
...

	This causes our rotation to mess up and thus the filesystem to
eventually become filled.  If I fire pkill manually, it works : 

root at machine01# /usr/bin/pkill -HUP syslog-ng > /dev/null 2>&1
root at machine01# lsof -p 4362 -s | grep d80
root at machine01#

	The script itself is called every hour from root's crontab.  At
this point, I am completely stumped.  Even looking at the changelog from
1.6.8 to 1.6.12 (latest 1.6 tree version) I don't see anything that
addresses this issue.  Now since I have not been able to reproduce a
non-working HUP, I don't have a truss output of when it happens.

Thank you for any help!

Marc Tessier
Analyste Support Technique
Support Technique des Infrastructures 
courriel : tessier.marc.2 at hydro.qc.ca
<mailto:tessier.marc.2 at hydro.qc.ca> 
Tel:  (514) 840-3000 Poste #6467