[syslog-ng] Problem with syslog-ng 1.6.8 and HUP

Wed Sep 10 16:18:43 CEST 2008

I don't see anything that would leave the stale file handles, however,
you really should do things in a different order.

rename/move the files
touch the new files (not necessary with syslog-ng)
HUP/reload syslog-ng
compress the files

If you do the compression before the HUP, then syslog-ng will be
writing to the files as it is being compressed. That will loose
records in the files as the compressions will potentially miss
some lines at the end due to a race condition.

Evan.

Tessier.Marc.2 at hydro.qc.ca wrote:
> Hi,
> 
> 	I have been working on an issue for the last few days with one
> of our main centralized syslog servers which is a SPARC Solaris 9 box.
> It is running syslog-ng 1.6.8.  Our log rotation script moves,
> compresses and then touches a new file depending on size.  Then script
> then sends an HUP to syslog-ng in the following way :
> 
> /usr/bin/pkill -HUP syslog-ng > /dev/null 2>&1
> 
> 	I have added some debugging to the script to make sure it grabs
> the proper PID, and logs it independantly : 
> 
> Sep 10 07:58:01 machine01 syslogmgr[8304]: info: signaling
> Syslog/Syslog-ng (HUP) at PID 4362.
> 
> 	And it is  : 
> 
> root at spdmze21# ps -ef | grep syslog-ng
>     root  4362     1  2   Sep 07 ?       160:38 /usr/sbin/syslog-ng
> 
> 	Now in the log files, it seems syslog-ng isn't always reacting
> to the kill.  This is what I get as far as syslog-ng noticing the signal
> in the last 2 days : 
> 
> root at machine01# grep syslog-ng complete.log | grep restarting
> Sep  9 09:58:06 machine01 syslog-ng[4362]: SIGHUP received, restarting
> syslog-ng
> Sep  9 15:58:03 machine01 syslog-ng[4362]: SIGHUP received, restarting
> syslog-ng
> Sep  9 19:58:07 machine01 syslog-ng[4362]: SIGHUP received, restarting
> syslog-ng
> Sep 10 01:58:04 machine01 syslog-ng[4362]: SIGHUP received, restarting
> syslog-ng
> Sep 10 03:58:05 machine01 syslog-ng[4362]: SIGHUP received, restarting
> syslog-ng
> 
> 	However, it was signaled everyhour : 
> 
> Sep 10 00:58:02 machine01 syslogmgr[20532]: [ID 702911 daemon.info]
> info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 01:58:04 machine01 syslogmgr[26686]: [ID 702911 daemon.info]
> info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 02:58:02 machine01 syslogmgr[3702]: [ID 702911 daemon.info] info:
> signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 03:58:05 machine01 syslogmgr[9935]: [ID 702911 daemon.info] info:
> signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 04:58:04 machine01 syslogmgr[17262]: [ID 702911 daemon.info]
> info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 05:58:02 machine01 syslogmgr[24178]: [ID 702911 daemon.info]
> info: signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 06:58:02 machine01 syslogmgr[1748]: [ID 702911 daemon.info] info:
> signaling Syslog/Syslog-ng (HUP) at PID 4362.
> Sep 10 07:58:01 machine01 syslogmgr[8304]: [ID 702911 daemon.info] info:
> signaling Syslog/Syslog-ng (HUP) at PID 4362.
> 
> 	Now the problem this creates for me is that it leaves a bunch of
> open file descriptors which don't point to a file anymore : 
> 
> root at machine01# lsof -p 4362 -s | grep d80
> syslog-ng 4362 root    7w  VREG         85,80   9699746     40
> /var/adm/log (/dev/md/dsk/d80)
> syslog-ng 4362 root   17w  VREG         85,80   3223056 500412
> /var/adm/log (/dev/md/dsk/d80)
> syslog-ng 4362 root   18w  VREG         85,80  11604669 612873
> /var/adm/log (/dev/md/dsk/d80)
> syslog-ng 4362 root   20w  VREG         85,80   8617243     90
> /var/adm/log (/dev/md/dsk/d80)
> syslog-ng 4362 root   23w  VREG         85,80  11698446 612889
> /var/adm/log (/dev/md/dsk/d80)
> ...
> 
> 	This causes our rotation to mess up and thus the filesystem to
> eventually become filled.  If I fire pkill manually, it works : 
>  
> root at machine01# /usr/bin/pkill -HUP syslog-ng > /dev/null 2>&1
> root at machine01# lsof -p 4362 -s | grep d80
> root at machine01#
> 
> 	The script itself is called every hour from root's crontab.  At
> this point, I am completely stumped.  Even looking at the changelog from
> 1.6.8 to 1.6.12 (latest 1.6 tree version) I don't see anything that
> addresses this issue.  Now since I have not been able to reproduce a
> non-working HUP, I don't have a truss output of when it happens.
> 
> Thank you for any help!
> 
> Marc Tessier
> Analyste Support Technique
> Support Technique des Infrastructures 
> courriel : tessier.marc.2 at hydro.qc.ca
> <mailto:tessier.marc.2 at hydro.qc.ca> 
> Tel:  (514) 840-3000 Poste #6467
> 
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>