[syslog-ng] librdkafka and messages dropped

John Skopis jspam at skopis.com
Wed Jan 15 10:31:25 UTC 2020


I will open a PR, sure.

After fixing this issue I encountered another issue where when a message is
re-injected into the queue more than once it triggers a use after free and
crash because the message payload is free'd in rdkafka.

After thinking about it a bit I'm not sure the message needs to be
re-injected into the queue at all because rdkafka will attempt to retry the
configured amount of times. Meanwhile if the internal Kafka buffer fills up
(possibly with messages that are being retired) it signals back pressure to
syslog-ng, which can in turn trigger flow control when the syslog-ng queue
becomes full.

Assuming this is desirable the callback only needs to decrement the message
reference and increment stats counter for dropped/written.

Thanks


On Wed, Jan 15, 2020, 09:02 Attila Szakacs (aszakacs) <
Attila.Szakacs at oneidentity.com> wrote:

> Hi John,
>
> Thanks for the detailed explanation, and the patch, hats off!
> While I am reproducing the issue, would you mind opening a PR with this
> change, so it gets more spotlight?
>
> If we cannot start periodic timers from worker threads, that is kind of a
> big problem, so this is really interesting.
>
> Thanks,
> Attila
> ------------------------------
> *From:* syslog-ng <syslog-ng-bounces at lists.balabit.hu> on behalf of John
> Skopis <jspam at skopis.com>
> *Sent:* Tuesday, January 14, 2020 2:21 PM
> *To:* Syslog-ng users' and developers' mailing list <
> syslog-ng at lists.balabit.hu>
> *Subject:* Re: [syslog-ng] librdkafka and messages dropped
>
> CAUTION: This email originated from outside of the organization. Do not
> follow guidance, click links, or open attachments unless you recognize the
> sender and know the content is safe.
>
> patch attached (doh!)
>
> On Tue, Jan 14, 2020 at 1:16 PM John Skopis <jspam at skopis.com> wrote:
>
> I think I can explain the behavior I am observing.
>
> rdkafka has a buffer (queue.buffering.max.messages). I guess this buffer
> is drained by rd_kafka_poll, which triggers delivery callback.
> When kafka has a failure (spurious or otherwise) _kafka_delivery_report_cb
> reinjects the message into the queue. The amount of messages reinjected can
> be very large, causing the queue to fill much faster than input log rate
> (depends on kafka timeout and max inflight requests I guess).
> When the queue is full no further messages are inserted into the queue,
> i.e. syslog-ng drop counter increments and queue is stuck at max fifo size.
> rd_kafka_poll is only called by _drain_responses, which is scheduled when
> messages are inserted into the queue.
> Messages can't be inserted to the queue because the queue is full;
> Messages can't be drained from the rdkafka buffer because it's not possible
> to add new messages, and rd_kafka_poll is only scheduled on insert.
>
>
> You can reproduce by setting queue.buffering.max.messages to 30 and
> log-fifo-size to 10. Stop kafka and log messages until the fifo fills up.
> All messages are dropped after this point. When you restart kafka rdkafka
> produce into kafka but syslog-ng never gets the delivery_cb because
> rd_kafka_poll isn't called.
>
> I am not familiar with ivkis (or syslog-ng tbh) codebase but it seems the
> issue is that the timer is not scheduled because the worker is not
> scheduled. The only way I can see to fix is in the attached patch -- move
> the callback out of the worker and into mainloop. I know it's probably not
> ideal but it does seem to fix my issue.
>
> I did try upgrading our ivykis version to 0.42.4 and it didn't seem to
> help.
>
> Thanks,
> John
>
> On Mon, Jan 13, 2020 at 9:53 AM Attila Szakacs (aszakacs) <
> Attila.Szakacs at oneidentity.com> wrote:
>
> Hi John,
>
> Could you gather debug logs with `syslog-ng -Fedtv`? It would provide a
> nice starting point.
>
> Thanks,
> Attila
> ------------------------------
> *From:* syslog-ng <syslog-ng-bounces at lists.balabit.hu> on behalf of John
> Skopis <jspam at skopis.com>
> *Sent:* Monday, January 13, 2020 10:11 AM
> *To:* syslog-ng at lists.balabit.hu <syslog-ng at lists.balabit.hu>
> *Subject:* [syslog-ng] librdkafka and messages dropped
>
> CAUTION: This email originated from outside of the organization. Do not
> follow guidance, click links, or open attachments unless you recognize the
> sender and know the content is safe.
>
> Hello,
>
> We have a syslog-ng aggregator that accepts messages in a few different
> formats and logs the message format to kafka. The mapping looks something
> like:
>
> format1 -> topic1
> format2 -> topic2
> ...
>
> Each agent will emit messages in ~4 different formats and send to the
> aggregator.
> The aggregator is configured with a kafka_c destination with 8
> threads/destination.
>
> We were previously using the java kafka client without issue.
> We recently upgraded to 3.24.1 + librdkafka 1.22
>
> After upgrading we are seeing an issue where (seemingly randomly)
> syslog-ng stops producing messages into kafka.
>
> Checking syslog-ng-ctl stats I can see
> d_kafka_format1 has dropped message counter increasing
> d_kafka_format1 queued messages counter increases to 80000 (10000*8
> threads iw-size)
> d_kafka_format1 processed messages does not increase
>
> There are messages printed to the log that the destination queue is full
> for all of the rdkafka threads,
>
> There are spurious failures to a kafka broker but it seems rdkafka
> reconnects.
>
> Seems like once the destination queue becomes full syslog-ng never
> recovers and the instance must be restarted.
> It also seems like rdkafka stops trying to reconnect to kafka (possibly?).
>
> Is this expected behavior or is there some bug around reconnecting to
> kafka after a spurious timeout?
> Just speculating here but does rdkafka stop trying to reconnect after N
> times?
> Are there any examples of destroy/create destination client after failure
> thresholds?
>
> Thanks
>
>
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> <https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.balabit.hu%2Fmailman%2Flistinfo%2Fsyslog-ng&data=02%7C01%7CAttila.Szakacs%40oneidentity.com%7Ce19cb6505cdb4b2a226608d798f4aee3%7C91c369b51c9e439c989c1867ec606603%7C0%7C1%7C637146048986792401&sdata=zn1gK82FIcLYJAvXzmGfOxY%2Bp2GMUw3FlrK0Nfm0jpo%3D&reserved=0>
> Documentation:
> http://www.balabit.com/support/documentation/?product=syslog-ng
> <https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.balabit.com%2Fsupport%2Fdocumentation%2F%3Fproduct%3Dsyslog-ng&data=02%7C01%7CAttila.Szakacs%40oneidentity.com%7Ce19cb6505cdb4b2a226608d798f4aee3%7C91c369b51c9e439c989c1867ec606603%7C0%7C1%7C637146048986792401&sdata=TpPr6q0A%2BW10R5AAXjQU5zY1ieh9zBQUS35oNSteyUo%3D&reserved=0>
> FAQ: http://www.balabit.com/wiki/syslog-ng-faq
> <https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.balabit.com%2Fwiki%2Fsyslog-ng-faq&data=02%7C01%7CAttila.Szakacs%40oneidentity.com%7Ce19cb6505cdb4b2a226608d798f4aee3%7C91c369b51c9e439c989c1867ec606603%7C0%7C1%7C637146048986802348&sdata=xt8KsVCHz9TTrn2Z9UJK%2BQbSHIv4ckdmsKoRGyX1L4Y%3D&reserved=0>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.balabit.hu/pipermail/syslog-ng/attachments/20200115/2c8e8dff/attachment-0001.html>


More information about the syslog-ng mailing list