[syslog-ng] TCP packet collapse errors

Xuri Nagarin secsubs at gmail.com
Mon Jun 3 23:53:19 CEST 2013


I switched to 3.4.1 (multi-threaded) but still no respite from the TCP
collapse issue.

I tried to set so_rcvbuf to '51200000' and got this error:

"Starting syslog-ng: The kernel refused to set the receive buffer
(SO_RCVBUF) to the requested size, you probably need to adjust buffer
related kernel parameters; so_rcvbuf='51200000', so_rcvbuf_set='20971520'"

In terms of lps processed, may be it's just a case of Monday but I am
seeing numbers as high as 40k consistently now with no drops to zero. Disk
IO is low but more consistent between 4-8 Mbytes/sec.

In terms of multi-threading, not sure how well the code does it because I
am seeing individual cores spike to 100% even when several others are at
idle.




On Fri, May 31, 2013 at 12:41 PM, Xuri Nagarin <secsubs at gmail.com> wrote:

> Thanks Daniel. I think I owe you a six-pack of your favourite beer just
> for the script :)
>
> Very interesting stats. For both hosts, lps ranges from anywhere between
> few hundred to 30K per second but falls to zero for several seconds. All
> the packet loss happens when the lps is peaking. What has me concerned is
> that lps falls to zero for 15-20 seconds at a time. During this period disk
> IO falls to near zero too and packet hemorrhaging pauses.
>
> All my sources are tcp and there are only four of them but those are
> proprietary products so they may be sending data in batches/bursts. I will
> tweak the tcp source options and see what more I can squeeze out of this
> build of syslog-ng and the disks. But you are right - I do need to upgrade
> to the multi-threaded version asap. At some point, a single core process is
> going to stall on writing to disk and not be able to reap tcp buffers
> during that time.
>
> I think the disk scheduler probably did not have any impact but one system
> has the disk scheduler tweak and other doesn't but I don't see any
> significant differences with packet loss on either.
>
> Right now, I think the issue simply is that when a burst of data comes in,
> the single core syslog-ng blocks of IO and cannot switch back to reading
> TCP buffers fast enough to clear up everything in the buffer. High network
> IO, high Disk IO, high lps  and packet loss - all match up.
>
> A larger receive buffer for the tcp source in syslog config will be a
> band-aid, I think until I can build a multi-threaded syslog-ng. Eventually,
> given that these are 7200RPM disks in RAID-1, the disk sub-system will
> start being a bottleneck but by then, hopefully, I will find a host with
> better disk througput :-)
>
>
>
>
>
> On Fri, May 31, 2013 at 2:41 AM, Daniel Neubacher <
> daniel.neubacher at xing.com> wrote:
>
>>  I've forgot to ask for your syslog source settings.
>> Here is my cfg. Maybe it helps.
>>
>>  tcp(
>>         log_fetch_limit(1000)
>>         max-connections(5000)
>>         so_rcvbuf(51200000)
>>         keep_timestamp(yes)
>>         port(514)
>>         log-iw-size(500000)
>> );
>>
>>  I've got it from a great artice
>> http://codeascraft.com/2012/08/13/performance-tuning-syslog-ng/
>>
>>
>>
>> ------------------------------
>> *Von:* syslog-ng-bounces at lists.balabit.hu [
>> syslog-ng-bounces at lists.balabit.hu]" im Auftrag von "Xuri Nagarin [
>> secsubs at gmail.com]
>> *Gesendet:* Freitag, 31. Mai 2013 10:12
>>
>> *An:* Syslog-ng users' and developers' mailing list
>> *Betreff:* Re: [syslog-ng] TCP packet collapse errors
>>
>>   Thanks for the quick response, Daniel.
>>
>>  I look at statistics for an hour before tweaking flush_lines to zero and
>> setting log_fifo_size to 10000. In that period, syslog-ng reported
>> processing 7,898,310,589 messages across all destinations and
>> dropped 4,200,260.
>>
>>  After making the change (flush_lines set to 0 and log_fifo_size to
>> 10000), I looked at three sets (half hour) of stats (default, every 10
>> minutes). The dropped messages are now zero across all destinations.
>>
>>  But the collapsed TCP packets count keeps incrementing. I ran 'iostat
>> -xm 5' and "watch -d 'netstat -s | grep collpased' " in two windows
>> side-by-side. Each time that disk IO spikes up, the TCP collapsed counter
>> starts incrementing. Disk IO remains almost zero for about half a minute
>> and then spikes up to ~4-25 Mbytes/sec for half a minute.
>>
>>  Does this mean I need to bump up log_fifo_size even higher? I think
>> ideally we want the disk to be consistently written to instead of bursts of
>> write activity. Right?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, May 30, 2013 at 10:56 PM, Daniel Neubacher <
>> daniel.neubacher at xing.com> wrote:
>>
>>>  I don't know how much logs you are getting but should tweak "log_fifo_size
>>> (1000);" to a higher number. Your flush_lines is really high too.. I
>>> tested around with flush lines but I ended setting it to 0 with 50k log per
>>> second. And they greatest of all tweaks would be a newer syslog version
>>> because of the threading.
>>>  ------------------------------
>>> *Von:* syslog-ng-bounces at lists.balabit.hu [
>>> syslog-ng-bounces at lists.balabit.hu]" im Auftrag von "Xuri Nagarin [
>>> secsubs at gmail.com]
>>> *Gesendet:* Freitag, 31. Mai 2013 07:46
>>> *An:* Syslog-ng users' and developers' mailing list
>>> *Betreff:* [syslog-ng] TCP packet collapse errors
>>>
>>>    I have a pair of Syslog-NG servers running 3.2.5-3. The hardware
>>> specs are - Quad Xeon E5-2680 (32 cores), 32GB RAM, and two 1TB SAS 7200
>>> RPM disks in RAID-1.
>>>
>>> OS is RHEL6.2 - Kernel 2.6.32-279.5.2. Filesystem is ext3.
>>>
>>>  Global options are set as:
>>>  options {
>>> flush_lines (1000);
>>> time_reopen (10);
>>> log_fifo_size (1000);
>>> long_hostnames (off);
>>> use_dns (no);
>>> use_fqdn (no);
>>> create_dirs (yes);
>>> keep_hostname (yes);
>>> keep_timestamp(yes);
>>> dir_group("syslog");
>>> perm(0640);
>>> dir_perm(0750);
>>> group("syslog");
>>> };
>>>
>>>  I have already set TCP kernel buffers to 128MB max and set disk
>>> scheduler to "deadline".
>>>
>>>  But even under light disk IO load, from ~8-25MB, I see "1320811067
>>> packets collapsed in receive queue due to low socket buffer". I had some
>>> other processes on the host writing to disk. Stopping them reduced the
>>> packet errors but this number still keeps incrementing.
>>>
>>>  To rule out other issues, I temporarily pointed my disk-based
>>> destinations to /dev/null and then packet losses/errors stopped. So either
>>> Syslog-NG isn't able to write to disk fast enough or there is an underlying
>>> OS/hardware issue.
>>>
>>>  Both hosts have the same issue. Any pointers in troubleshooting it
>>> will be appreciated.
>>>
>>>  TIA.
>>>
>>>
>>>
>>>
>>> ______________________________________________________________________________
>>> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
>>> Documentation:
>>> http://www.balabit.com/support/documentation/?product=syslog-ng
>>> FAQ: http://www.balabit.com/wiki/syslog-ng-faq
>>>
>>>
>>>
>>
>>
>> ______________________________________________________________________________
>> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
>> Documentation:
>> http://www.balabit.com/support/documentation/?product=syslog-ng
>> FAQ: http://www.balabit.com/wiki/syslog-ng-faq
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.balabit.hu/pipermail/syslog-ng/attachments/20130603/716f1799/attachment.htm 


More information about the syslog-ng mailing list