Configuration tuning for reliability
Dear list,

Apologies for the long-winded post, but I'd really appreciate your comments on:

1) A configuration that did not perform adequately
2) My understanding of the relevant tunables

Number of clients: 40
Log sources per client: 35. 34 of these are pure file source driver; system logs use the standard Red Hat config.
Dest: single loghost over TCP/IP; the loghost uses the mysql driver.

Due to other pressures I made no attempt to tune parameters that influence reliability before the test, and performance was commensurately poor (about 40% of entries discarded at the client, confirmed by packet capture on the loghost).

Client global opts:
* log_msg_size(24576)
* log_fifo_size(1000)
* log_fetch_limit(10)
* flush_lines(0)

Client logging options:
* each source does a program-override for loghost filtering, no other processing
* flags(flow-control) NOT SET

Loghost global opts:
* log_msg_size(32768)
* log_fifo_size(1000)

Loghost logging opts:
* flags(flow-control) NOT SET
* streams go through a simple filter (on program name), rewrite, parse, then the mysql dest - the loghost load has always been negligible.
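For illustration, the client side of this setup looks roughly like the sketch below (file paths, source/destination names and the loghost hostname are made up; only the option values come from the description above):

options {
    log_msg_size(24576);
    log_fifo_size(1000);
    log_fetch_limit(10);
    flush_lines(0);
};

# one of the ~34 file sources; each overrides the program name so the
# loghost can filter on it
source s_app_foo {
    file("/var/log/app/foo.log" program_override("app_foo"));
};

destination d_loghost {
    tcp("loghost.example.com" port(514));
};

# note: no flags(flow-control) here - this is the untuned, lossy setup
log {
    source(s_app_foo);
    destination(d_loghost);
};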
From what I NOW understand I need to do something like this:
Client:

log_iw_size >= SOURCES_PER_CLIENT * log_fetch_limit
eg 35 * 10 = 350

log_fifo_size >= SOURCES_PER_CLIENT * log_fetch_limit
eg 35 * 10 = 350

AND

log_fifo_size >= SOURCES_PER_CLIENT * log_iw_size
eg 35 * 350 = 12250

So it appears to me that setting log_fifo_size to > 12250 would be correct.

Loghost:

Less idea about this; do I need:

log_iw_size >= NUMBER_OF_CLIENTS * log_fetch_limit ( * SOURCES_PER_CLIENT ? )
eg 40 * 10 * 35 = 14000

And log_fifo_size >= log_iw_size ?

Is flow control important between the network source and the mysql dests?

Further information:
* Some of the larger logfiles output ~4 lines/sec
* Approx 4G aggregate logs generated over 14 hours by the 40 hosts
* Reliability is more important than speed - these logs are not analysed in real time. That said, the client should not have to spend hours completing the log transfer once its workload has been processed.

Again, apologies for the length of the post and many thanks in advance for any help.

Ben Tisdall
PhotoBox
You should not be having problems with your load. We had a thread earlier this year ("UDP packet loss with syslog-ng") in which Lars identified similar performance issues on RHEL. His problems were solved by setting net.core.rmem_default to 2MB using sysctl. I would try setting that and then checking your performance.
On Thursday, November 11, 2010 08:29:32 Martin Holste wrote:
You should not be having problems with your load. We had a thread earlier this year ("UDP packet loss with syslog-ng") in which Lars identified similar performance issues on RHEL. His problems were solved by setting the net.core.rmem_default to 2MB using sysctl. I would try setting that and then checking your performance.
Make sure to also set the so_rcvbuf in syslog-ng on any high-volume, socket-based log sources. You need to have a really big buffer or you will get terrible performance. We've been making some efforts to get this into the documentation. I think it's small by default so it doesn't consume a ton of RAM on boxes that are not used for log collection.

-- Matthew Hall
On Thursday, November 11, 2010 08:51:11 Matthew Hall wrote:
By really big I mean 16,777,216. -- Matthew Hall
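As a rough sketch, so_rcvbuf() is set on the socket source driver itself; something like the following (the address and port are placeholders, and note that the kernel caps SO_RCVBUF at net.core.rmem_max, so that sysctl may need raising alongside net.core.rmem_default):

source s_net {
    udp(ip(0.0.0.0) port(514) so_rcvbuf(16777216));
    tcp(ip(0.0.0.0) port(514) so_rcvbuf(16777216));   # large kernel receive buffer per socket
};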
Hi,

On Thu, Nov 11, 2010 at 5:54 PM, Matthew Hall <mhall@mhcomputing.net> wrote:
By really big I mean 16,777,216.
IMHO this is actually *very* bad advice: don't mix the fire-and-forget UDP logging case with flow-controlled TCP!

Going back to the original mail:
Client:
log_iw_size >= SOURCES_PER_CLIENT * log_fetch_limit
eg 35 * 10 = 350
log_iw_size is used only for flow-controlled log paths. It is a per-source option just like log_fetch_limit, so you shouldn't use the above math: log_iw_size just has to be >= log_fetch_limit in your case, as each of your file sources uses its own incoming window.
log_fifo_size >= SOURCES_PER_CLIENT * log_fetch_limit
eg 35 * 10 = 350
AND
log_fifo_size >= SOURCES_PER_CLIENT * log_iw_size
eg 35 * 350 = 12250
So it appears to me that setting log_fifo_size to > 12250 would be correct.
log_fifo_size is used for in-memory buffering for a given destination. When the FIFO is full and flow control is enabled, syslog-ng won't read further logs from the sources. Here the math should be:

log_fifo_size >= number_of_sources * log_iw_size

so 350 should be the actual setting when log_iw_size is set to 10. Of course increasing log_fifo_size could be useful, but be aware that the contents of the FIFO are lost when you use syslog-ng OSE and restart it.

Summarising the above: my recommendation would be to enable flow control for the file sources and set log_iw_size to be >= log_fetch_limit. For your loghost destination, enable flow control and set log_fifo_size to be at least as big as the accumulated size of all incoming windows.
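Putting that client-side recommendation into config form, a minimal sketch might look like this (file and destination names, paths and the loghost address are made up; the numbers assume log_fetch_limit(10) and 35 file sources feeding one TCP destination):

source s_app_foo {
    file("/var/log/app/foo.log"
        program_override("app_foo")
        log_iw_size(10));           # per-source window, >= log_fetch_limit
};

destination d_loghost {
    tcp("loghost.example.com" port(514)
        log_fifo_size(350));        # >= 35 sources * log_iw_size(10)
};

log {
    source(s_app_foo);
    destination(d_loghost);
    flags(flow-control);            # without this the window/FIFO sizing has no effect
};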
Loghost
Less idea about this, do I need:
log_iw_size >= NUMBER_OF_CLIENTS * log_fetch_limit ( * SOURCES_PER_CLIENT ? )
eg 40 * 10 * 35 = 14000
Similar math should be used here as on the client side. SOURCES_PER_CLIENT doesn't matter: every client is just a single source from the point of view of the server (client-side TCP destinations are mapped 1:1 to server-side TCP connections), so log_iw_size should be just 10, not 14k! See below for a more detailed explanation.

As on the client side, on the server every TCP connection has its own incoming window, while all of your sources feed the same destination FIFO. You've got 40 clients and log_iw_size is set to 10 on the syslog-ng server, so at a given moment up to 40 * 10 messages could be read into the destination FIFO. log_fifo_size therefore has to be at least 400 (the default is 1000, so this is already met).

When you use flow control (and you definitely should!), then when the mysql destination can't handle the load, syslog-ng stops reading from sources which have reached the log_iw_size limit. This also slows down the syslog clients (but only when the send buffer of the client and the receive buffer of the server are both full; otherwise the TCP/IP stack keeps sending and receiving logs on the wire). When this happens, then depending on the size of the receive and send buffers a *lot* of messages (tens or hundreds of thousands!) could be in transit, and those are in peril: when syslog-ng gets restarted on either side, these messages are lost :( For real reliability every message should get acked by the application layer.

When you aim for reliability, flow-controlled logging with fairly small receive/send buffers is the way to go. Of course, depending on your network latency, the buffers should be increased for better performance. There is no generic rule for sizing buffers and incoming windows; everyone has to experiment to find the right balance.

Disclaimer: I'm not an expert in the subject so feel free to correct me :)

Regards,
Sandor
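Translated into a server-side sketch under the same assumptions (the listening port, SQL credentials, table and column layout are invented; only log_iw_size, log_fifo_size, max_connections and flow-control reflect the recommendation, and the filter/rewrite/parser steps from the original setup are omitted for brevity):

source s_clients {
    tcp(ip(0.0.0.0) port(514)
        log_iw_size(10)             # per-connection incoming window
        max_connections(50));       # headroom above the 40 clients
};

destination d_mysql {
    sql(type(mysql)
        host("localhost") username("syslog") password("secret")
        database("logs") table("messages")
        columns("datetime", "host", "program", "message")
        values("${R_DATE}", "${HOST}", "${PROGRAM}", "${MESSAGE}")
        log_fifo_size(1000));       # >= 40 clients * log_iw_size(10)
};

log {
    source(s_clients);
    destination(d_mysql);
    flags(flow-control);
};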
Thanks to both for your contributions - in this case, Sandor, I think your advice is the most appropriate. I'll feed back to the list after the next test, having applied the changes.
The changes were successful - we did not drop a single log during the last test. Many thanks, Sandor!
participants (4):
- Ben Tisdall
- Martin Holste
- Matthew Hall
- Sandor Geller