[syslog-ng] librdkafka and messages dropped

Tue Jan 14 13:21:21 UTC 2020

patch attached (doh!)

On Tue, Jan 14, 2020 at 1:16 PM John Skopis <jspam at skopis.com> wrote:

> I think I can explain the behavior I am observing.
>
> rdkafka has a buffer (queue.buffering.max.messages). I guess this buffer
> is drained by rd_kafka_poll, which triggers delivery callback.
> When kafka has a failure (spurious or otherwise) _kafka_delivery_report_cb
> reinjects the message into the queue. The amount of messages reinjected can
> be very large, causing the queue to fill much faster than input log rate
> (depends on kafka timeout and max inflight requests I guess).
> When the queue is full no further messages are inserted into the queue,
> i.e. syslog-ng drop counter increments and queue is stuck at max fifo size.
> rd_kafka_poll is only called by _drain_responses, which is scheduled when
> messages are inserted into the queue.
> Messages can't be inserted to the queue because the queue is full;
> Messages can't be drained from the rdkafka buffer because it's not possible
> to add new messages, and rd_kafka_poll is only scheduled on insert.
>
>
> You can reproduce by setting queue.buffering.max.messages to 30 and
> log-fifo-size to 10. Stop kafka and log messages until the fifo fills up.
> All messages are dropped after this point. When you restart kafka rdkafka
> produce into kafka but syslog-ng never gets the delivery_cb because
> rd_kafka_poll isn't called.
>
> I am not familiar with ivkis (or syslog-ng tbh) codebase but it seems the
> issue is that the timer is not scheduled because the worker is not
> scheduled. The only way I can see to fix is in the attached patch -- move
> the callback out of the worker and into mainloop. I know it's probably not
> ideal but it does seem to fix my issue.
>
> I did try upgrading our ivykis version to 0.42.4 and it didn't seem to
> help.
>
> Thanks,
> John
>
> On Mon, Jan 13, 2020 at 9:53 AM Attila Szakacs (aszakacs) <
> Attila.Szakacs at oneidentity.com> wrote:
>
>> Hi John,
>>
>> Could you gather debug logs with `syslog-ng -Fedtv`? It would provide a
>> nice starting point.
>>
>> Thanks,
>> Attila
>> ------------------------------
>> *From:* syslog-ng <syslog-ng-bounces at lists.balabit.hu> on behalf of John
>> Skopis <jspam at skopis.com>
>> *Sent:* Monday, January 13, 2020 10:11 AM
>> *To:* syslog-ng at lists.balabit.hu <syslog-ng at lists.balabit.hu>
>> *Subject:* [syslog-ng] librdkafka and messages dropped
>>
>> CAUTION: This email originated from outside of the organization. Do not
>> follow guidance, click links, or open attachments unless you recognize the
>> sender and know the content is safe.
>>
>> Hello,
>>
>> We have a syslog-ng aggregator that accepts messages in a few different
>> formats and logs the message format to kafka. The mapping looks something
>> like:
>>
>> format1 -> topic1
>> format2 -> topic2
>> ...
>>
>> Each agent will emit messages in ~4 different formats and send to the
>> aggregator.
>> The aggregator is configured with a kafka_c destination with 8
>> threads/destination.
>>
>> We were previously using the java kafka client without issue.
>> We recently upgraded to 3.24.1 + librdkafka 1.22
>>
>> After upgrading we are seeing an issue where (seemingly randomly)
>> syslog-ng stops producing messages into kafka.
>>
>> Checking syslog-ng-ctl stats I can see
>> d_kafka_format1 has dropped message counter increasing
>> d_kafka_format1 queued messages counter increases to 80000 (10000*8
>> threads iw-size)
>> d_kafka_format1 processed messages does not increase
>>
>> There are messages printed to the log that the destination queue is full
>> for all of the rdkafka threads,
>>
>> There are spurious failures to a kafka broker but it seems rdkafka
>> reconnects.
>>
>> Seems like once the destination queue becomes full syslog-ng never
>> recovers and the instance must be restarted.
>> It also seems like rdkafka stops trying to reconnect to kafka (possibly?).
>>
>> Is this expected behavior or is there some bug around reconnecting to
>> kafka after a spurious timeout?
>> Just speculating here but does rdkafka stop trying to reconnect after N
>> times?
>> Are there any examples of destroy/create destination client after failure
>> thresholds?
>>
>> Thanks
>>
>>
>> ______________________________________________________________________________
>> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
>> Documentation:
>> http://www.balabit.com/support/documentation/?product=syslog-ng
>> FAQ: http://www.balabit.com/wiki/syslog-ng-faq
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.balabit.hu/pipermail/syslog-ng/attachments/20200114/9fb4696c/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mod-rdkafka.patch
Type: text/x-patch
Size: 6206 bytes
Desc: not available
URL: <http://lists.balabit.hu/pipermail/syslog-ng/attachments/20200114/9fb4696c/attachment-0001.bin>