[syslog-ng] TCP packet collapse errors

Tue Jun 4 09:21:55 CEST 2013

If I remember this correctly you have to adjust the linux networking options too. Like here: http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
Maybe you missed an option.
And to activate multithreading you need threaded(yes); But it's odd that the syslog problem is gone now.

Von: syslog-ng-bounces at lists.balabit.hu [mailto:syslog-ng-bounces at lists.balabit.hu] Im Auftrag von Xuri Nagarin
Gesendet: Montag, 3. Juni 2013 23:53
An: Syslog-ng users' and developers' mailing list
Betreff: Re: [syslog-ng] TCP packet collapse errors

I switched to 3.4.1 (multi-threaded) but still no respite from the TCP collapse issue.

I tried to set so_rcvbuf to '51200000' and got this error:

"Starting syslog-ng: The kernel refused to set the receive buffer (SO_RCVBUF) to the requested size, you probably need to adjust buffer related kernel parameters; so_rcvbuf='51200000', so_rcvbuf_set='20971520'"

In terms of lps processed, may be it's just a case of Monday but I am seeing numbers as high as 40k consistently now with no drops to zero. Disk IO is low but more consistent between 4-8 Mbytes/sec.

In terms of multi-threading, not sure how well the code does it because I am seeing individual cores spike to 100% even when several others are at idle.

On Fri, May 31, 2013 at 12:41 PM, Xuri Nagarin <secsubs at gmail.com<mailto:secsubs at gmail.com>> wrote:
Thanks Daniel. I think I owe you a six-pack of your favourite beer just for the script :)

Very interesting stats. For both hosts, lps ranges from anywhere between few hundred to 30K per second but falls to zero for several seconds. All the packet loss happens when the lps is peaking. What has me concerned is that lps falls to zero for 15-20 seconds at a time. During this period disk IO falls to near zero too and packet hemorrhaging pauses.

All my sources are tcp and there are only four of them but those are proprietary products so they may be sending data in batches/bursts. I will tweak the tcp source options and see what more I can squeeze out of this build of syslog-ng and the disks. But you are right - I do need to upgrade to the multi-threaded version asap. At some point, a single core process is going to stall on writing to disk and not be able to reap tcp buffers during that time.

I think the disk scheduler probably did not have any impact but one system has the disk scheduler tweak and other doesn't but I don't see any significant differences with packet loss on either.

Right now, I think the issue simply is that when a burst of data comes in, the single core syslog-ng blocks of IO and cannot switch back to reading TCP buffers fast enough to clear up everything in the buffer. High network IO, high Disk IO, high lps  and packet loss - all match up.

A larger receive buffer for the tcp source in syslog config will be a band-aid, I think until I can build a multi-threaded syslog-ng. Eventually, given that these are 7200RPM disks in RAID-1, the disk sub-system will start being a bottleneck but by then, hopefully, I will find a host with better disk througput :-)

On Fri, May 31, 2013 at 2:41 AM, Daniel Neubacher <daniel.neubacher at xing.com<mailto:daniel.neubacher at xing.com>> wrote:
I've forgot to ask for your syslog source settings.
Here is my cfg. Maybe it helps.

tcp(
        log_fetch_limit(1000)
        max-connections(5000)
        so_rcvbuf(51200000)
        keep_timestamp(yes)
        port(514)
        log-iw-size(500000)
);

I've got it from a great artice http://codeascraft.com/2012/08/13/performance-tuning-syslog-ng/

________________________________
Von: syslog-ng-bounces at lists.balabit.hu<mailto:syslog-ng-bounces at lists.balabit.hu> [syslog-ng-bounces at lists.balabit.hu<mailto:syslog-ng-bounces at lists.balabit.hu>]" im Auftrag von "Xuri Nagarin [secsubs at gmail.com<mailto:secsubs at gmail.com>]
Gesendet: Freitag, 31. Mai 2013 10:12

An: Syslog-ng users' and developers' mailing list
Betreff: Re: [syslog-ng] TCP packet collapse errors
Thanks for the quick response, Daniel.

I look at statistics for an hour before tweaking flush_lines to zero and setting log_fifo_size to 10000. In that period, syslog-ng reported processing 7,898,310,589 messages across all destinations and dropped 4,200,260.

After making the change (flush_lines set to 0 and log_fifo_size to 10000), I looked at three sets (half hour) of stats (default, every 10 minutes). The dropped messages are now zero across all destinations.

But the collapsed TCP packets count keeps incrementing. I ran 'iostat -xm 5' and "watch -d 'netstat -s | grep collpased' " in two windows side-by-side. Each time that disk IO spikes up, the TCP collapsed counter starts incrementing. Disk IO remains almost zero for about half a minute and then spikes up to ~4-25 Mbytes/sec for half a minute.

Does this mean I need to bump up log_fifo_size even higher? I think ideally we want the disk to be consistently written to instead of bursts of write activity. Right?

On Thu, May 30, 2013 at 10:56 PM, Daniel Neubacher <daniel.neubacher at xing.com<mailto:daniel.neubacher at xing.com>> wrote:
I don't know how much logs you are getting but should tweak "log_fifo_size (1000);" to a higher number. Your flush_lines is really high too.. I tested around with flush lines but I ended setting it to 0 with 50k log per second. And they greatest of all tweaks would be a newer syslog version because of the threading.
________________________________
Von: syslog-ng-bounces at lists.balabit.hu<mailto:syslog-ng-bounces at lists.balabit.hu> [syslog-ng-bounces at lists.balabit.hu<mailto:syslog-ng-bounces at lists.balabit.hu>]" im Auftrag von "Xuri Nagarin [secsubs at gmail.com<mailto:secsubs at gmail.com>]
Gesendet: Freitag, 31. Mai 2013 07:46
An: Syslog-ng users' and developers' mailing list
Betreff: [syslog-ng] TCP packet collapse errors
I have a pair of Syslog-NG servers running 3.2.5-3. The hardware specs are - Quad Xeon E5-2680 (32 cores), 32GB RAM, and two 1TB SAS 7200 RPM disks in RAID-1.

OS is RHEL6.2 - Kernel 2.6.32-279.5.2. Filesystem is ext3.

Global options are set as:
options {
flush_lines (1000);
time_reopen (10);
log_fifo_size (1000);
long_hostnames (off);
use_dns (no);
use_fqdn (no);
create_dirs (yes);
keep_hostname (yes);
keep_timestamp(yes);
dir_group("syslog");
perm(0640);
dir_perm(0750);
group("syslog");
};

I have already set TCP kernel buffers to 128MB max and set disk scheduler to "deadline".

But even under light disk IO load, from ~8-25MB, I see "1320811067 packets collapsed in receive queue due to low socket buffer". I had some other processes on the host writing to disk. Stopping them reduced the packet errors but this number still keeps incrementing.

To rule out other issues, I temporarily pointed my disk-based destinations to /dev/null and then packet losses/errors stopped. So either Syslog-NG isn't able to write to disk fast enough or there is an underlying OS/hardware issue.

Both hosts have the same issue. Any pointers in troubleshooting it will be appreciated.

TIA.

______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq

______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.balabit.hu/pipermail/syslog-ng/attachments/20130604/363b5c67/attachment-0001.htm