OK, I finally got back to this. The lsof of the forground syslog-ng process shows localhost% sudo /usr/sbin/lsof -p 18279 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME syslog-ng 18279 root cwd DIR 253,6 4096 32793 /usr/local/etc/syslog-ng syslog-ng 18279 root rtd DIR 253,0 1024 2 / syslog-ng 18279 root txt REG 253,6 321731 233449 /usr/local/sbin/syslog-ng syslog-ng 18279 root mem REG 253,0 95464 49455 /lib64/libselinux.so.1 syslog-ng 18279 root mem REG 253,0 247496 49454 /lib64/libsepol.so.1 syslog-ng 18279 root mem REG 253,5 85928 394112 /usr/lib64/libz.so.1.2.3 syslog-ng 18279 root mem REG 253,0 647608 49451 /lib64/libglib-2.0.so.0.1200.3 syslog-ng 18279 root mem REG 253,0 1366176 49458 /lib64/libcrypto.so.0.9.8e syslog-ng 18279 root mem REG 253,5 190976 397026 /usr/lib64/libgssapi_krb5.so.2.2 syslog-ng 18279 root mem REG 253,0 10000 49465 /lib64/libcom_err.so.2.1 syslog-ng 18279 root mem REG 253,5 153464 397024 /usr/lib64/libk5crypto.so.3.1 syslog-ng 18279 root mem REG 253,5 613896 397025 /usr/lib64/libkrb5.so.3.3 syslog-ng 18279 root mem REG 253,5 35728 397023 /usr/lib64/libkrb5support.so.0.1 syslog-ng 18279 root mem REG 253,0 9728 49474 /lib64/libkeyutils-1.2.so syslog-ng 18279 root mem REG 253,0 137256 53855 /lib64/ld-2.5.so syslog-ng 18279 root mem REG 253,5 25464 393467 /usr/lib64/gconv/gconv-modules.cache syslog-ng 18279 root mem REG 253,4 493726 327686 /var/db/nscd/hosts syslog-ng 18279 root mem REG 253,0 50288 49190 /lib64/librt-2.5.so syslog-ng 18279 root mem REG 253,0 111480 49172 /lib64/libnsl-2.5.so syslog-ng 18279 root mem REG 253,6 21478 136301 /usr/local/lib/libevtlog.so.0.0.0 syslog-ng 18279 root mem REG 253,0 306632 49331 /lib64/libssl.so.0.9.8e syslog-ng 18279 root mem REG 253,0 20424 49168 /lib64/libdl-2.5.so syslog-ng 18279 root mem REG 253,0 1712216 49160 /lib64/libc-2.5.so syslog-ng 18279 root mem REG 253,0 142696 49186 /lib64/libpthread-2.5.so syslog-ng 18279 root mem REG 253,0 89800 49188 /lib64/libresolv-2.5.so syslog-ng 18279 root 0u CHR 136,1 3 /dev/pts/1 syslog-ng 18279 root 1u CHR 136,1 3 /dev/pts/1 syslog-ng 18279 root 2u CHR 136,1 3 /dev/pts/1 syslog-ng 18279 root 3u FIFO 253,4 491523 /var/log/syslog.pipes/erempel syslog-ng 18279 root 4u IPv4 121995943 TCP localhost:34513->localhost.localdomain:1514 (ESTABLISHED) syslog-ng 18279 root 5w FIFO 0,6 121995942 pipe syslog-ng 18279 root 6u unix 0xffff81043fd3f100 121995946 /var/syslog-ng.ctl And the strace and configuration file are attached. I'm not an strace expert, but it looks like the source is polled, but once the output can not be written to, the source is not read, even though the poll shows data is available. Thanks for looking at this. Balazs Scheidler wrote:
On Wed, 2010-04-14 at 14:04 -0700, Evan Rempel wrote:
Balazs Scheidler wrote:
On Wed, 2010-04-14 at 13:18 -0700, Evan Rempel wrote:
A little background.
There is a "server" syslog-ng process that accepts messages from the network and sends the messages to a variety of destinations. For this report, I am only interested in one destination that happens to be a pipe.
There is a "slave" syslog-ng process that reads from the pipe that the "server" writes to, and writes to a program destination.
The program reads the standard in, and does "something". All works well.
At some point our application (we know why and don't want to discuss it) application stops reading standard in for a while (1,000,000 lines over an hour). We expect that the memory footprint of syslog-ng "slave" to grow during this time but it does not. Instead, the memory footprint of the syslog-ng "server" grows. When our application starts reading its standard in again, the memory footprint of the syslog "slave" grows very quickly, and all messages reach the destination.
I think that the syslog-ng "slave" get blocked on the program destination in a way that prevents it from reading its source, resulting in the upstream syslog-ng "server" having to buffer all of the messages.
There is no flow control anywhere, and both syslog-ng instances have log_fifo_size(8000000) for all of the destinations.
Anyone have any suggestions? Hmm, either the slave syslog-ng really blocks (but I don't know any similar bugs right now), or flow control is enabled.
There was a bug that caused flow-control to be enabled, if any of the flags was used. Do you have fallback enabled? No, the only flags in the "slave" is no-parse.
Can you post the exact versions of syslog-ng you are using? 3.0.5 OSE
Also, you could confirm if the slave instance really blocks, or it has just stalled one of its sources. You could do that by attaching to the slave process using strace for a little while.
1) first check what fd is being used between master/slave (lsof -p <pid>) 2) then check via strace if that fd is being polled for POLLIN or not
If it is not polled, then flow-control is somehow enabled, if syslog-ng is not polling but waiting somewhere, then it might be blocked as you suggest.
Anyway, the list of open file descriptors and the strace dump could help in tracking down both cases.
Not to be lazy here, but I am not going to get to this for weeks, really busy at my site right now.
Here is a program that can turn on/off the reading of standard in by using kill -USR1 <PID> kill -USR2 <PID>
--------------- #!/usr/bin/perl
my $read = 0;
$SIG{USR1} = sub { $read = 1 }; $SIG{USR2} = sub { $read = 0 };
while ( not eof(STDIN)) { if ( $read == 1 ) { $line = <STDIN>; print $line; } else { sleep 1; } } ---------------
So hopefully you can reproduce this this really easy at your end.
If I don't hear back, in a few weeks I'll get to this.
I've tried it with 3.1.1, 3.0.6 and finally 3.0.5 and all behaved properly. Memory size grew until it couldn't write to the test program, of course only until the FIFO size filled up.
Once I let the program go with a SIGUSR1 it started sending messages, the drop count stayed the same. Flow-control was not enabled.
-- Evan Rempel erempel@uvic.ca Senior Systems Administrator 250.721.7691 Unix Services, University Systems, University of Victoria @version: 3.0 # # UVic syslog-ng configuration options { log_fifo_size(1000000); use_fqdn(yes); keep_hostname(yes); chain_hostnames(yes); time_reap(60); time_reopen(5); flush_lines(1000); flush_timeout(1000); }; template t_application { template("$MESSAGE\n"); template_escape(no); }; template t_localsyslog { template("<$PRI>$S_ISODATE $HOST erempel: $MSGONLY\n"); template_escape(no); }; source local { pipe("/var/log/syslog.pipes/erempel" flags(no-parse) ); }; source int { internal(); }; destination localsyslog { tcp("localhost" port(1514) localip(wolverine.comp.uvic.ca) template(t_localsyslog) log_fifo_size(5000) ); }; destination erempel { program("/home1l/erempel/read-noread.pl" template(t_application) log_fifo_size(8000000) ); }; log { source(local); destination(erempel); }; log { source(int); destination(localsyslog); };