[syslog-ng] librdkafka and messages dropped
Attila Szakacs (aszakacs)
Attila.Szakacs at oneidentity.com
Wed Jan 15 09:02:42 UTC 2020
Hi John,
Thanks for the detailed explanation, and the patch, hats off!
While I am reproducing the issue, would you mind opening a PR with this change, so it gets more spotlight?
If we cannot start periodic timers from worker threads, that is kind of a big problem, so this is really interesting.
Thanks,
Attila
________________________________
From: syslog-ng <syslog-ng-bounces at lists.balabit.hu> on behalf of John Skopis <jspam at skopis.com>
Sent: Tuesday, January 14, 2020 2:21 PM
To: Syslog-ng users' and developers' mailing list <syslog-ng at lists.balabit.hu>
Subject: Re: [syslog-ng] librdkafka and messages dropped
CAUTION: This email originated from outside of the organization. Do not follow guidance, click links, or open attachments unless you recognize the sender and know the content is safe.
patch attached (doh!)
On Tue, Jan 14, 2020 at 1:16 PM John Skopis <jspam at skopis.com<mailto:jspam at skopis.com>> wrote:
I think I can explain the behavior I am observing.
rdkafka has a buffer (queue.buffering.max.messages). I guess this buffer is drained by rd_kafka_poll, which triggers delivery callback.
When kafka has a failure (spurious or otherwise) _kafka_delivery_report_cb reinjects the message into the queue. The amount of messages reinjected can be very large, causing the queue to fill much faster than input log rate (depends on kafka timeout and max inflight requests I guess).
When the queue is full no further messages are inserted into the queue, i.e. syslog-ng drop counter increments and queue is stuck at max fifo size.
rd_kafka_poll is only called by _drain_responses, which is scheduled when messages are inserted into the queue.
Messages can't be inserted to the queue because the queue is full; Messages can't be drained from the rdkafka buffer because it's not possible to add new messages, and rd_kafka_poll is only scheduled on insert.
You can reproduce by setting queue.buffering.max.messages to 30 and log-fifo-size to 10. Stop kafka and log messages until the fifo fills up. All messages are dropped after this point. When you restart kafka rdkafka produce into kafka but syslog-ng never gets the delivery_cb because rd_kafka_poll isn't called.
I am not familiar with ivkis (or syslog-ng tbh) codebase but it seems the issue is that the timer is not scheduled because the worker is not scheduled. The only way I can see to fix is in the attached patch -- move the callback out of the worker and into mainloop. I know it's probably not ideal but it does seem to fix my issue.
I did try upgrading our ivykis version to 0.42.4 and it didn't seem to help.
Thanks,
John
On Mon, Jan 13, 2020 at 9:53 AM Attila Szakacs (aszakacs) <Attila.Szakacs at oneidentity.com<mailto:Attila.Szakacs at oneidentity.com>> wrote:
Hi John,
Could you gather debug logs with `syslog-ng -Fedtv`? It would provide a nice starting point.
Thanks,
Attila
________________________________
From: syslog-ng <syslog-ng-bounces at lists.balabit.hu<mailto:syslog-ng-bounces at lists.balabit.hu>> on behalf of John Skopis <jspam at skopis.com<mailto:jspam at skopis.com>>
Sent: Monday, January 13, 2020 10:11 AM
To: syslog-ng at lists.balabit.hu<mailto:syslog-ng at lists.balabit.hu> <syslog-ng at lists.balabit.hu<mailto:syslog-ng at lists.balabit.hu>>
Subject: [syslog-ng] librdkafka and messages dropped
CAUTION: This email originated from outside of the organization. Do not follow guidance, click links, or open attachments unless you recognize the sender and know the content is safe.
Hello,
We have a syslog-ng aggregator that accepts messages in a few different formats and logs the message format to kafka. The mapping looks something like:
format1 -> topic1
format2 -> topic2
...
Each agent will emit messages in ~4 different formats and send to the aggregator.
The aggregator is configured with a kafka_c destination with 8 threads/destination.
We were previously using the java kafka client without issue.
We recently upgraded to 3.24.1 + librdkafka 1.22
After upgrading we are seeing an issue where (seemingly randomly) syslog-ng stops producing messages into kafka.
Checking syslog-ng-ctl stats I can see
d_kafka_format1 has dropped message counter increasing
d_kafka_format1 queued messages counter increases to 80000 (10000*8 threads iw-size)
d_kafka_format1 processed messages does not increase
There are messages printed to the log that the destination queue is full for all of the rdkafka threads,
There are spurious failures to a kafka broker but it seems rdkafka reconnects.
Seems like once the destination queue becomes full syslog-ng never recovers and the instance must be restarted.
It also seems like rdkafka stops trying to reconnect to kafka (possibly?).
Is this expected behavior or is there some bug around reconnecting to kafka after a spurious timeout?
Just speculating here but does rdkafka stop trying to reconnect after N times?
Are there any examples of destroy/create destination client after failure thresholds?
Thanks
______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng<https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.balabit.hu%2Fmailman%2Flistinfo%2Fsyslog-ng&data=02%7C01%7CAttila.Szakacs%40oneidentity.com%7Ce19cb6505cdb4b2a226608d798f4aee3%7C91c369b51c9e439c989c1867ec606603%7C0%7C1%7C637146048986792401&sdata=zn1gK82FIcLYJAvXzmGfOxY%2Bp2GMUw3FlrK0Nfm0jpo%3D&reserved=0>
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng<https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.balabit.com%2Fsupport%2Fdocumentation%2F%3Fproduct%3Dsyslog-ng&data=02%7C01%7CAttila.Szakacs%40oneidentity.com%7Ce19cb6505cdb4b2a226608d798f4aee3%7C91c369b51c9e439c989c1867ec606603%7C0%7C1%7C637146048986792401&sdata=TpPr6q0A%2BW10R5AAXjQU5zY1ieh9zBQUS35oNSteyUo%3D&reserved=0>
FAQ: http://www.balabit.com/wiki/syslog-ng-faq<https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.balabit.com%2Fwiki%2Fsyslog-ng-faq&data=02%7C01%7CAttila.Szakacs%40oneidentity.com%7Ce19cb6505cdb4b2a226608d798f4aee3%7C91c369b51c9e439c989c1867ec606603%7C0%7C1%7C637146048986802348&sdata=xt8KsVCHz9TTrn2Z9UJK%2BQbSHIv4ckdmsKoRGyX1L4Y%3D&reserved=0>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.balabit.hu/pipermail/syslog-ng/attachments/20200115/cfbf0ef1/attachment.html>
More information about the syslog-ng
mailing list