[syslog-ng] Configuration tuning for reliability

Fri Nov 12 14:45:03 CET 2010

Hi,

On Thu, Nov 11, 2010 at 5:54 PM, Matthew Hall <mhall at mhcomputing.net> wrote:
> On Thursday, November 11, 2010 08:51:11 Matthew Hall wrote:
>> On Thursday, November 11, 2010 08:29:32 Martin Holste wrote:
>> > You should not be having problems with your load.  We had a thread
>> > earlier this year ("UDP packet loss with syslog-ng") in which Lars
>> > identified similar performance issues on RHEL.  His problems were
>> > solved by setting the net.core.rmem_default to 2MB using sysctl.  I
>> > would try setting that and then checking your performance.
>>
>> Make sure to also set the so_rcvbuf in syslog-ng on any high volume
>> socket based log sources.
>>
>> You need to have a really big buffer or you will get terrible
>> performance. We've been making some efforts to get this into the
>> documentation.
>>
>> I think it's small by default so it doesn't consume a ton of RAM on boxes
>> that are not used for log collection.
>
> By really big I mean 16,777,216.

IMHO this is actually *very* bad advice. Don't mix the
fire-and-forget UDP logging case with flow-controlled TCP!

Going back to the original mail:

> Client:
>
> log_iw_size >= SOURCES_PER_CLIENT * log_fetch_limit
>
> eg 35 * 10 = 350

log_iw_size is used only for flow-controlled log paths. log_iw_size is
a per-source option, just like log_fetch_limit, so you shouldn't use
the above math. log_iw_size has to be >= log_fetch_limit in your case,
as each of your file sources uses its own incoming window.

> log_fifo_size >= SOURCES_PER_CLIENT * log_fetch_limit
>
> eg 35 * 10 = 350
>
> AND
>
> log_fifo_size >= SOURCES_PER_CLIENT * log_iw_size
>
> eg 35 * 350 = 12250
>
> So it appears to me that setting log_fifo_size to > 12250 would be correct.

log_fifo_size is used for in-memory buffering for a given destination.
When the FIFO is full and flow control is enabled then syslog-ng won't
read further logs from the sources. Here the math should be

log_fifo_size >= number_of_sources * log_iw_size

so 350 (35 sources * 10) should be the actual setting when log_iw_size
is set to 10. Of course, increasing log_fifo_size could be useful, but
you should be aware that the contents of the FIFO are lost when
syslog-ng OSE is restarted.

Summarising the above: my recommendation would be to enable flow
control for file sources and set log_iw_size >= log_fetch_limit.
For your loghost destination, enable flow control and set log_fifo_size
to be at least as big as the sum of all incoming windows.
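
To make the client-side recommendation concrete, here is a minimal
sketch of how it could look in syslog-ng.conf. The source names, file
paths, loghost address and port are made up for the example, and only
two of the 35 sources are shown; the option values come from the math
above. I'm assuming the 3.x spelling flags(flow-control) here, so
check the manual for your version.

```
# Two of the 35 file sources (hypothetical names and paths).
# Each file source has its own incoming window.
source s_app1 {
    file("/var/log/app1.log" log_fetch_limit(10) log_iw_size(10));
};
source s_app2 {
    file("/var/log/app2.log" log_fetch_limit(10) log_iw_size(10));
};

# All sources share this destination FIFO:
# 35 sources * log_iw_size(10) = 350 slots at minimum.
destination d_loghost {
    tcp("loghost.example.com" port(514) log_fifo_size(350));
};

# Flow control is switched on for the log path.
log {
    source(s_app1);
    source(s_app2);
    destination(d_loghost);
    flags(flow-control);
};
```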

> Loghost
>
> Less idea about this, do I need:
>
> log_iw_size >= NUMBER_OF_CLIENTS * log_fetch_limit ( * SOURCES_PER_CLIENT ? )
>
> eg 40 * 10 * 35 = 14000

Similar math should be used here as on the client side.
sources_per_client doesn't matter: from the server's point of view
every client is just a single source (client-side TCP destinations are
mapped 1:1 to server-side TCP sources), so log_iw_size should be just
10, not 14k! See below for a more detailed explanation.

As on the client side, every TCP connection on the server has its own
incoming window, while all of your sources share the same destination
FIFO. You've got 40 clients and log_iw_size is set to 10 on the
syslog-ng server, so at any given moment up to 40 * 10 messages could
be read into the destination FIFO. log_fifo_size has to be set to at
least 400 (the default is 1000, so this is definitely met).

When you use flow control (and you definitely should!) and the mysql
destination can't handle the load, syslog-ng will stop reading from
sources which have reached the log_iw_size limit. This will also slow
down the syslog clients, but only when the send buffer of the client
and the receive buffer of the server are both full; otherwise the
TCP/IP stack still allows sending/receiving logs on the wire.

When this happens, depending on the size of the receive and send
buffers, a *lot* of messages (tens or hundreds of thousands!) could be
in transit, so they are in peril: when syslog-ng gets restarted on
either side these messages are lost :( To increase reliability, every
message should get acked by the application layer.
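
A rough sketch of the server side described above could look like the
following. The source and destination names, the port, and all of the
mysql connection details (host, credentials, database, table, columns)
are placeholders I made up; only the 40 clients, log_iw_size(10) and
the FIFO size of at least 400 come from the discussion. It follows the
per-connection window interpretation given above. As far as I know,
newer syslog-ng releases document a different relationship between
log_iw_size and max_connections(), so please check the manual for your
version before copying the numbers.

```
# One TCP source; every client connection is treated as its own source.
source s_clients {
    tcp(ip(0.0.0.0) port(514) max_connections(40) log_iw_size(10));
};

# Shared destination FIFO: 40 clients * log_iw_size(10) = 400 at minimum.
# 1000 is the default mentioned above, spelled out here for clarity.
destination d_mysql {
    sql(type(mysql)
        host("localhost") username("syslog") password("secret")
        database("logs") table("messages")
        columns("datetime", "host", "program", "message")
        values("$R_DATE", "$HOST", "$PROGRAM", "$MSGONLY")
        log_fifo_size(1000));
};

log {
    source(s_clients);
    destination(d_mysql);
    flags(flow-control);
};
```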

When you aim for reliability, flow-controlled logging with fairly
small receive / send buffers is the way to go. Of course, depending
on your network latency, the buffers may need to be increased for
better performance. There is no generic rule for sizing buffers /
incoming windows; everyone has to experiment to find the right
balance.
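
For illustration only, the socket buffers can be set explicitly with
so_rcvbuf() on the server's TCP source and so_sndbuf() on the client's
TCP destination. The 65536 value below is just a placeholder starting
point for experiments, not a recommendation, and the names and address
are made up. On Linux the kernel may also cap these values (for
example via net.core.rmem_max).

```
# Server-side syslog-ng.conf: modest receive buffer on the TCP source.
source s_clients {
    tcp(ip(0.0.0.0) port(514) so_rcvbuf(65536) log_iw_size(10));
};

# Client-side syslog-ng.conf: modest send buffer on the TCP destination.
destination d_loghost {
    tcp("loghost.example.com" port(514) so_sndbuf(65536));
};
```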

Disclaimer: I'm not an expert in the subject so feel free to correct me :)

Regards,

Sandor