dropping

Ferenc Wagner

7 Oct 2005

11:05 a.m.

Hi,

Using Debian Sarge I set up a configuration where some 160 machines log by TCP to a single central server. When the machines boot (all at the same time) they obviously put quite some load on the server, which results in lines like

    Oct 6 20:55:18 bigyo syslog-ng[24969]: STATS: dropped 1303

after the client connected messages. Also there is a constant periodic loss (the clients run synchronised, so cron jobs fire simultaneously) amounting to

    Oct 7 06:35:27 bigyo syslog-ng[24969]: STATS: dropped 9

Is there a way to overcome this? On average the log traffic is fairly low, but huge bursts do happen as described above. Setting log_fifo_size on the server didn't help much; it logs straight onto disk:

    [stock Debian Sarge part distributing local logs elided]
    options { keep_hostname (yes); };
    source s_cl { tcp (max_connections (255)); };
    destination d_cl {
        file ("/var/log/cluster/$HOST"
              template ("$DATE $MSG\n")
              group ("adm") perm (0640)
              create_dirs (yes) dir_perm (750));
    };
    log { source (s_cl); destination (d_cl); };

The clients are configured like this (full file):

    options { use_dns (no); };
    source s_all {
        internal ();
        unix-stream ("/dev/log");
        file ("/proc/kmsg" log_prefix ("kernel: "));
    };
    destination bigyo { tcp ("bigyo"); };
    log { source (s_all); destination (bigyo); };

Stock Sarge syslog-ng 1.6.5 with Debian patches on all machines.

-- 
Thanks,
Feri.


Mike

7 Oct

2:51 p.m.

I think the best thing to do is to stagger the sending times of the data, but failing that, adjust your system-level buffer sizes. This site talks a bit about doing that, if you have never adjusted the sizes before:

http://www-didc.lbl.gov/TCP-tuning/linux.html

We ran into the same problem, and simply adjusting these values worked wonders (I think we set ours to be 256MByte max).

Mike

Nate Campi

5:34 p.m.

On Fri, Oct 07, 2005 at 08:51:27AM -0400, Mike wrote:

...

On Fri, 7 Oct 2005, Ferenc Wagner wrote:

...
Using Debian Sarge I set up a configuration where some 160 machines log by TCP to a single central server. When the machines boot (all at the same time) they obviously put quite some load on the server, which results in lines like

Oct 6 20:55:18 bigyo syslog-ng[24969]: STATS: dropped 1303

after the client connected messages. Also there is a constant periodic loss (the clients run synchronised, so cron jobs fire simultaneously) amounting to

I think the best thing to do is to stagger the sending times of the data...but failing that, adjust your system level buffer sizes.

Good idea, you should work hard to do this, it's probably the best thing you can do to help if your problem is the bursts.

...

this site talks a bit about doing that http://www-didc.lbl.gov/TCP-tuning/linux.html, if you have never adjusted the sizes before.

No TCP tuning is needed when it is the application that drops the messages. Those STATS messages from syslog-ng mean that the data has traversed the network stack and is now in the hands of syslog-ng. syslog-ng sends the messages on to its destinations, and if/when the data from the source(s) exceeds the ability of the destination(s) to accept it, syslog-ng drops the message(s) and reports this in a STATS message.

To increase the performance of writing to disk, increase the number of lines buffered before writing using the sync() option and/or get faster disks (maybe set up as RAID5 or even better RAID 0+1). Buffering too much makes data loss in a crash a real risk, but that risk is acceptable if it allows you to not constantly lose data during normal operation!

We also have a performance tips section of the FAQ:

http://www.campin.net/syslog-ng/faq.html#perf

If you have a lot of regexps in your filters then the limit of messages you can process under heavy load is decreased - see that URL.

-- 
Nate

"It is better to deserve honours and not have them than to have them and not deserve them." - Samuel Clemens
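For illustration, the sync() advice above might look roughly like this in the server's options block. This is only a sketch assuming syslog-ng 1.6 global option syntax. The values 100 and 10000 are placeholders chosen for the example, not tuned recommendations, and log_fifo_size appears only because the original poster already experimented with it.

```
options {
    keep_hostname (yes);
    sync (100);              # assumed value: buffer up to 100 lines before writing to file
    log_fifo_size (10000);   # assumed value: output queue length, in lines
};
```

A larger sync() value means fewer writes per burst, at the cost of losing up to that many buffered lines per file if the server crashes.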

Roberto Nibali

11:38 p.m.

...

Using Debian Sarge I set up a configuration where some 160 machines log by TCP to a single central server. When the machines boot (all at the same time) they obviously put quite some load on the server, which results in lines like

Don't boot all the machines and have them log to one server at the same time unless you are really well equipped network-wise. It's the same congestion problem you have when running a data center and trying to power up the nodes after a power failure: you risk another power failure.

...

Oct 6 20:55:18 bigyo syslog-ng[24969]: STATS: dropped 1303

What's the peak load message-wise and network-wise? How's your network topology? Are the clients in one collision domain or geographically distributed?

...

after the client connected messages. Also there is a constant periodic loss (the clients run synchronised, so cron jobs fire simultaneously) amounting to

Add a random delay in your cronjobs before starting the action. Since you have perfectly identified the source of the problem, fix it there. There is no requirement to synchronise cronjobs across a group of machines, and the logfiles can be synchronised by using the timestamps.
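One way to add such a random delay is a small wrapper that sleeps for a random time and then runs the real job. The following is a minimal sketch, not something from this thread: the function names `pick_delay` and `run_with_jitter`, the example job path, and the 300-second window are all hypothetical.

```python
import random
import subprocess
import time


def pick_delay(window_seconds, rng=random):
    """Return a random delay in [0, window_seconds) seconds."""
    if window_seconds <= 0:
        return 0.0
    # random() returns a float in [0.0, 1.0), so the result stays below the window.
    return rng.random() * window_seconds


def run_with_jitter(command, window_seconds, sleep=time.sleep, rng=random):
    """Sleep for a random delay, then run command (a list of arguments).

    Returns the exit status of the command.
    """
    sleep(pick_delay(window_seconds, rng))
    return subprocess.call(command)


# Hypothetical use from a script started by cron on every client:
#   run_with_jitter(["/usr/local/sbin/nightly-job"], 300)
# spreads the start of the job, and therefore its log burst, over five minutes.
```

Each machine draws its own delay independently, so the clients no longer hit the log server in the same second. The window only needs to be wide enough to flatten the burst.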

...

Oct 7 06:35:27 bigyo syslog-ng[24969]: STATS: dropped 9

Is there a way to overcome this?

Fix the root of the problem. Of course we could assist you in addressing the problem by tuning the server, if the former suggestions are not appropriate.

...

On average the log traffic is fairly low, but huge bursts do happen as described above.

Did you identify other bursts besides the reboot- and cronjob-related ones?

...

Setting log_fifo_size on the server didn't help much; it logs straight onto disk:

Others have given you ideas on how to tune the server side.

...

[stock Debian Sarge part distributing local logs elided] options { keep_hostname (yes); }; source s_cl { tcp (max_connections (255)); }; destination d_cl { file ("/var/log/cluster/$HOST" template ("$DATE $MSG\n") group ("adm") perm (0640) create_dirs (yes) dir_perm (750)); }; log { source (s_cl); destination (d_cl); };

You could add flags(final) to speed up the parsing a bit; provided you have more log statements.
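As a sketch of what that could look like, assuming the elided stock Debian part contains further log statements that come after this one:

```
# Messages matched here are not processed by any later log statement.
log { source (s_cl); destination (d_cl); flags (final); };
```

The flag only affects log statements that come later in the file, so the statement carrying it would have to precede the ones it is meant to short-circuit.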

...

The clients are configured like this (full file): options { use_dns (no); }; source s_all { internal (); unix-stream ("/dev/log"); file ("/proc/kmsg" log_prefix ("kernel: ")); }; destination bigyo { tcp ("bigyo"); }; log { source (s_all); destination (bigyo); };

Looks fine.

Best regards,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc
