[syslog-ng] TCP packet collapse errors

Fri May 31 10:34:39 CEST 2013

The FIFO seems to be fine if you are seeing no drops now. With flush_lines set to 0 you should get a constant stream. But I don't have a clue why the TCP stats are so horrible. I would debug it like this:

1. I always keep my logs-per-second counter open to see if something odd is going on while a problem happens. It has led me to a lot of errors in the past. This is my script (stats_level(2); needed):
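#!/bin/bash
# Prints messages processed per second, summed over all src.host counters.
# Requires stats_level(2) in the syslog-ng options.

prev=
while true
do
        total=0
        # Field 6 of each "syslog-ng-ctl stats" line is the counter value.
        for i in $(syslog-ng-ctl stats | grep src.host | grep proc | cut -d ";" -f6)
        do
          total=$((total + i))
        done
        # Skip the first pass, when there is no previous sample to diff against.
        test -z "$prev" || echo $((total - prev))
        prev=$total
        sleep 1
done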

2. Check dmesg and try reverting the TCP tweaks you made earlier.
3. Are you using bonding? Try switching it off, or check the mode (round robin is really bad).
4. Set syslog-ng to verbose mode, then to debug mode, and check the logs.
5. Compile a newer version of syslog-ng :P
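
Steps 2 to 4 can be sketched roughly as below. This is only an illustration: the interface name bond0 is an assumption, and bonding_mode is a hypothetical helper that pulls the "Bonding Mode:" line out of the bonding driver's status file.

```shell
#!/bin/bash
# Hypothetical helper: print the bonding mode from the contents of
# /proc/net/bonding/<iface>, read on stdin.
bonding_mode() {
    sed -n 's/^Bonding Mode: //p'
}

# Step 2: look at recent kernel messages.
#   dmesg | tail -n 50
# Step 3: check the bonding mode (bond0 is an assumed interface name).
#   bonding_mode < /proc/net/bonding/bond0
# Step 4: run syslog-ng in the foreground, logging to stderr,
# first verbose, then verbose plus debug.
#   syslog-ng -Fev
#   syslog-ng -Fevd
```

A mode reported as "load balancing (round-robin)" is the one step 3 warns about.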

With older syslog-ng versions I had really horrible problems with "log_msg_size". Maybe you should increase it too, just to be sure.
________________________________
From: syslog-ng-bounces at lists.balabit.hu [syslog-ng-bounces at lists.balabit.hu] on behalf of Xuri Nagarin [secsubs at gmail.com]
Sent: Friday, 31 May 2013 10:12
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] TCP packet collapse errors

Thanks for the quick response, Daniel.

I looked at statistics for an hour before setting flush_lines to zero and log_fifo_size to 10000. In that period, syslog-ng reported processing 7,898,310,589 messages across all destinations and dropping 4,200,260.
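
To put those two counters in proportion, here is a small sketch that computes the drop rate as a percentage. The function name drop_rate is made up for illustration; the figures are the ones quoted above.

```shell
#!/bin/bash
# Hypothetical helper: drop_rate DROPPED PROCESSED
# Prints dropped messages as a percentage of processed messages.
drop_rate() {
    awk -v d="$1" -v p="$2" 'BEGIN { printf "%.4f\n", (d / p) * 100 }'
}

drop_rate 4200260 7898310589
```

With the numbers above this comes out at roughly 0.05 percent of processed messages.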

After making the change (flush_lines set to 0 and log_fifo_size to 10000), I looked at three sets (half hour) of stats (default, every 10 minutes). The dropped messages are now zero across all destinations.

But the collapsed TCP packets count keeps incrementing. I ran 'iostat -xm 5' and "watch -d 'netstat -s | grep collapsed'" in two windows side by side. Each time disk IO spikes, the TCP collapsed counter starts incrementing. Disk IO remains almost zero for about half a minute and then spikes to ~4-25 Mbytes/sec for half a minute.
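
As an alternative to eyeballing watch output, the counter can be sampled and its increase per interval printed, which is easier to line up against iostat. This is a minimal sketch: collapsed_count and watch_collapsed are hypothetical names, and it assumes netstat -s prints the line in the form "N packets collapsed in receive queue due to low socket buffer".

```shell
#!/bin/bash
# Hypothetical helper: extract the collapsed-packet counter from
# `netstat -s` output on stdin; prints 0 if the line is absent.
collapsed_count() {
    awk '/packets collapsed in receive queue/ { print $1; found = 1; exit }
         END { if (!found) print 0 }'
}

# Hypothetical helper: print the increase of the counter every
# INTERVAL seconds (default 5, matching the iostat interval).
watch_collapsed() {
    local interval=${1:-5} prev cur
    prev=$(netstat -s | collapsed_count)
    while sleep "$interval"; do
        cur=$(netstat -s | collapsed_count)
        echo "$(date +%T) collapsed +$((cur - prev))"
        prev=$cur
    done
}

# Usage: watch_collapsed 5
```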

Does this mean I need to bump up log_fifo_size even higher? I think ideally we want the disk to be consistently written to instead of bursts of write activity. Right?

On Thu, May 30, 2013 at 10:56 PM, Daniel Neubacher <daniel.neubacher at xing.com> wrote:
I don't know how many logs you are getting, but you should raise "log_fifo_size (1000);" to a higher number. Your flush_lines is really high too. I experimented with flush_lines and ended up setting it to 0 at 50k logs per second. The greatest tweak of all would be a newer syslog-ng version, because of the threading.
________________________________
From: syslog-ng-bounces at lists.balabit.hu [syslog-ng-bounces at lists.balabit.hu] on behalf of Xuri Nagarin [secsubs at gmail.com]
Sent: Friday, 31 May 2013 07:46
To: Syslog-ng users' and developers' mailing list
Subject: [syslog-ng] TCP packet collapse errors

I have a pair of Syslog-NG servers running 3.2.5-3. The hardware specs are - Quad Xeon E5-2680 (32 cores), 32GB RAM, and two 1TB SAS 7200 RPM disks in RAID-1.

OS is RHEL6.2 - Kernel 2.6.32-279.5.2. Filesystem is ext3.

Global options are set as:
options {
flush_lines (1000);
time_reopen (10);
log_fifo_size (1000);
long_hostnames (off);
use_dns (no);
use_fqdn (no);
create_dirs (yes);
keep_hostname (yes);
keep_timestamp(yes);
dir_group("syslog");
perm(0640);
dir_perm(0750);
group("syslog");
};

I have already set TCP kernel buffers to 128MB max and set disk scheduler to "deadline".
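
Both settings can be read back to confirm they are in effect. The following is only a sketch: the device name sda is an assumption, and active_scheduler is a hypothetical helper that picks the bracketed entry out of the sysfs scheduler file.

```shell
#!/bin/bash
# Hypothetical helper: print the active I/O scheduler from the contents
# of /sys/block/<dev>/queue/scheduler (read on stdin). The active one is
# shown in square brackets, e.g. "noop anticipatory [deadline] cfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage (sda is an assumed device name):
#   active_scheduler < /sys/block/sda/queue/scheduler
# Current socket receive buffer limits, in bytes:
#   sysctl net.core.rmem_max net.ipv4.tcp_rmem
```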

But even under light disk IO load, from ~8-25MB, I see "1320811067 packets collapsed in receive queue due to low socket buffer". I had some other processes on the host writing to disk. Stopping them reduced the packet errors but this number still keeps incrementing.

To rule out other issues, I temporarily pointed my disk-based destinations to /dev/null and then packet losses/errors stopped. So either Syslog-NG isn't able to write to disk fast enough or there is an underlying OS/hardware issue.

Both hosts have the same issue. Any pointers in troubleshooting it will be appreciated.

TIA.

______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq
