[syslog-ng] disk-buffer in elasticsearch2 destination loses messages if docker container is killed

Mon Sep 17 12:04:16 UTC 2018

Hi,

After repeating the test without TLS between syslog-agent & syslog-relay,
with 2229 messages logged at a rate of 20 msgs/sec, in debug mode
and with two tcpdumps running, these are the conclusions:

- Syslog-agent sends every single message (checked with tcpdump file and
syslog-agent log file)
- Syslog-relay returns ACK for every single message, but sometimes the
packet contains more than one message (checked with tcpdump file)
- Checking the syslog-relay log after killing & starting the docker
container, I found the trace 'Reliable disk-buffer state loaded;
filename='/syslog-ng-00000.rqf', queue_length='99', size='173844'' and the
queued messages (msg 236 to msg 334) are processed. The next message
processed is no. 440, and it comes from the active network destination
again.
- Once all logs have been processed, in elasticsearch (syslog-relay
destination) I can find messages from 1 to 334, and messages from 440 to
2229.
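
For anyone repeating this check, the missing range can be computed from the
sequence numbers found in elasticsearch. A minimal Python sketch, assuming
each test message carries a numeric sequence number and that the received
numbers have already been exported to a list (the function name
missing_ranges and the export step are hypothetical, not part of syslog-ng
or of the test described here):

```python
from typing import Iterable, List, Tuple


def missing_ranges(received: Iterable[int], first: int, last: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) ranges in [first, last] absent from received."""
    seen = set(received)
    gaps: List[Tuple[int, int]] = []
    start = None  # first number of the gap currently being scanned, if any
    for n in range(first, last + 1):
        if n not in seen:
            if start is None:
                start = n
        elif start is not None:
            # n was received, so the open gap ended at n - 1
            gaps.append((start, n - 1))
            start = None
    if start is not None:
        # the gap runs to the end of the expected range
        gaps.append((start, last))
    return gaps


if __name__ == "__main__":
    # Numbers reported in this message: 1..334 and 440..2229 were received.
    received = list(range(1, 335)) + list(range(440, 2230))
    gaps = missing_ranges(received, 1, 2229)
    lost = sum(end - start + 1 for start, end in gaps)
    print(gaps, lost)
```

With the numbers reported above (1 to 334 and 440 to 2229 received, out of
2229 sent), this reports the single gap 335 to 439, i.e. 105 messages.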

How is it possible that syslog-relay returns ACK for messages 335 to 439 to
syslog-agent, but they are not in the queue, nor in the .rqf file? Where
did these messages go? I know they are sent before starting to fill
syslog-agent disk-buffer file (checked with rqf file from syslog-agent when
I kill the syslog-relay container). It seems that there's some kind of race
condition that prevents those messages from being processed/queued while
syslog-relay is being killed.

My procedure consists of:

- Start tcpdump on both syslog-agent & syslog-relay hosts
- Start logger process
- Once logs are being inserted in elasticsearch, I kill the syslog-relay
docker container, wait about 10 seconds, and run a new container (using a
mapped volume where .rqf file and .persist file are)
- Once logs are being inserted in elasticsearch again, I stop the logger
process.
- I wait until no log is left to be processed
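
The logger process used in the test is not shown in this thread. As a rough
stand-in, a numbered-line writer like the following makes gaps easy to spot
afterwards. This is only a sketch: the message text, the seq= field and both
function names are assumptions for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, TextIO


def numbered_lines(count: int, prefix: str = "test message") -> Iterator[str]:
    """Yield count log lines, each tagged with a sequence number starting at 1."""
    for n in range(1, count + 1):
        yield f"{prefix} seq={n}"


def write_at_rate(
    lines: Iterable[str],
    out: TextIO,
    rate_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write each line to out, pausing 1/rate_per_sec seconds after each one.

    Returns the number of lines written. The sleep function is injectable
    so the pacing can be disabled when testing.
    """
    interval = 1.0 / rate_per_sec
    written = 0
    for line in lines:
        out.write(line + "\n")
        out.flush()  # make the line visible to the file reader immediately
        written += 1
        sleep(interval)
    return written
```

Calling write_at_rate(numbered_lines(2229), f, rate_per_sec=20) with f opened
on a file watched by the wildcard-file source would approximate the
20 msgs/sec run. The real rate will be slightly lower, because the time
spent writing is added to each sleep interval.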

Would you test anything else? I'm running out of ideas.

PS: The same test with the logger process writing at 5 msgs/sec loses
between 0 and 5 messages.

Regards.

2018-09-14 14:05 GMT+02:00 Jose Angel Santiago <jasantiago at stratio.com>:

> Hi,
>
> I've got more accurate information about where my lost messages are.
>
> Now I'm using disk-buffer in syslog-agent and syslog-relay, and I've
> checked that the lost messages are the ones sent by the syslog-agent while
> the syslog-relay docker container is being killed. I can see those messages
> in the syslog-agent log (I've got both agent & relay in debug mode) with
> their corresponding "Outgoing message" line, but those messages never reach
> the relay.
>
> Could it be that the relay docker container still returns ACK to the agent
> (the agent resolves the relay FQDN with a custom DNS) while the syslog-ng
> process within the container is being stopped? I'm about to test again
> using tcpdump to confirm this theory.
>
> BTW, disk-buffers work OK; sometimes I get some duplicated messages when
> restarting the relay, but that's not a problem for me. Forget about my
> .persist file re-creation theory, it doesn't happen.
>
> Regards.
>
>
>
> 2018-09-13 18:07 GMT+02:00 Péter, Kókai <peter.kokai at oneidentity.com>:
>
>> Hello,
>>
>> It would not make sense to replace the persist file after restart, so it
>> is not something that syslog-ng does. It happens only if that file is
>> corrupted, and in that case there should be a log message about it at
>> startup. Have you checked the syslog-ng logs? (It would be better to
>> enable debug and/or verbose logs and, if possible, share them with us.)
>>
>> Could you reproduce the same behavior without docker (if possible)?
>>
>>
>>
>> Best Regards,
>> Peter Kokai
>>
>> On Thu, Sep 13, 2018 at 4:44 PM Jose Angel Santiago <
>> jasantiago at stratio.com> wrote:
>>
>>> Hi,
>>>
>>> I guess I know what is happening: when I start the docker container
>>> from scratch, even if I provide a persist file and a buffer file within
>>> the mapped volume, syslog-ng recreates them, so all messages in the buffer
>>> file which were not processed by the relay are lost.
>>>
>>> Is there any way to tell syslog-ng to use an already existing .persist
>>> file so it doesn't recreate the .rqf file?
>>>
>>> Regards.
>>>
>>>
>>>
>>> 2018-09-13 16:23 GMT+02:00 Budai, László <laszlo.budai at oneidentity.com>:
>>>
>>>> Hi,
>>>>
>>>> One problem could be a flush-limit greater than 1. In that case
>>>> syslog-ng would use an HttpBulkMessageProcessor: syslog-ng passes the
>>>> message to the HttpBulkMessageProcessor and sends back a positive ACK to
>>>> the LogSource (so the message is removed from the disk-buffer), and if
>>>> the docker container is killed, all the messages stored in the
>>>> HttpBulkMessageProcessor are lost.
>>>> But in your case syslog-ng should use the HttpSingleMessageProcessor,
>>>> which means that the messages are sent one by one.
>>>> Could you check the disk-buffer with the dqtool?
>>>>
>>>>
>>>> L.
>>>>
>>>> On Thu, Sep 13, 2018 at 3:50 PM, Jose Angel Santiago <
>>>> jasantiago at stratio.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running syslog-ng (with an elasticsearch2 destination configured)
>>>>> within a docker container, and I'm trying to avoid losing messages if I
>>>>> kill the docker container and start it again.
>>>>>
>>>>> This is my scenario:
>>>>>
>>>>> - A service which produces 20 lines of log per second
>>>>> - A syslog-ng instance reading from a wildcard-file source (but
>>>>> actually it only reads logs from the above service, let's call it
>>>>> syslog-agent), which sends all logs to another syslog-ng instance (the one
>>>>> running in a docker container, let's call it syslog-relay) through a
>>>>> network destination.
>>>>> - The syslog-relay sends messages to an elasticsearch instance, with
>>>>> the following configuration:
>>>>>
>>>>> options {
>>>>>     chain-hostnames(no);
>>>>>     use-dns(no);
>>>>>     keep-hostname(yes);
>>>>>     owner("syslog-ng");
>>>>>     group("stratio");
>>>>>     perm(0640);
>>>>>     time-reap(30);
>>>>>     mark-freq(10);
>>>>>     stats-freq(0);
>>>>>     bad-hostname("^gconfd$");
>>>>>     flush-lines(100);
>>>>>     log-fifo-size(1000);
>>>>>     };
>>>>>
>>>>> destination d_elastic_default_0 {
>>>>>     elasticsearch2(
>>>>>         cluster("myelastic")
>>>>>         cluster-url("https://myelastic.logs:9200")
>>>>>         client_mode("https")
>>>>>         index("default")
>>>>>         type("log")
>>>>>         flush-limit(1)
>>>>>         disk-buffer(
>>>>>             mem-buf-size(16M)
>>>>>             disk-buf-size(16M)
>>>>>             reliable(yes)
>>>>>             dir("/syslog-ng/log")
>>>>>         )
>>>>>         http-auth-type("clientcert")
>>>>>         java-keystore-filepath("/etc/syslog-ng/certificates/syslog-relay.jks")
>>>>>         java-keystore-password("XXXXXX")
>>>>>         java-truststore-filepath("/etc/syslog-ng/certificates/ca-bundle.jks")
>>>>>         java-truststore-password("XXXXXXXXXX")
>>>>>     );
>>>>> };
>>>>>
>>>>> - The dir "/syslog-ng/log" is mapped to the path "/tmp/buffer" on the
>>>>> host where the docker container is running, so when I kill the docker
>>>>> container, the buffer file is not lost.
>>>>> - I've set flush-limit to 1 because I thought that way I would lose
>>>>> 1 message at most.
>>>>>
>>>>> This architecture is working fine (flush-limit=1 makes it very slow, but
>>>>> for this test that is OK), but if I kill the syslog-relay docker
>>>>> container, wait 5 to 10 seconds and start it again from scratch, I can
>>>>> see that several hundred logs are missing in elasticsearch. I check it by
>>>>> stopping the logger service and letting the syslog-ng agent & relay
>>>>> finish processing the enqueued messages.
>>>>>
>>>>> I can see in the syslog-agent stats that all log messages have been
>>>>> processed, so it seems the problem is on the syslog-relay.
>>>>>
>>>>> Is this behaviour expected? If so, how can I protect against loss of
>>>>> messages in case of a syslog-relay docker container unexpected kill?
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> | Jose Angel Santiago
>>>>>
>>>>> [image: Logo_signature2.png] <http://www.stratio.com/>
>>>>>
>>>>> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
>>>>>
>>>>> 28224 Pozuelo de Alarcón, Madrid, Spain
>>>>>
>>>>> +34 918 286 473 | www.stratio.com
>>>>> <https://twitter.com/stratiobd>
>>>>> <https://www.linkedin.com/company/stratiobd>
>>>>> <https://www.youtube.com/c/StratioBD>
>>>>>
>>>>> ____________________________________________________________
>>>>> __________________
>>>>> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
>>>>> Documentation: http://www.balabit.com/support/documentation/?product=
>>>>> syslog-ng
>>>>> FAQ: http://www.balabit.com/wiki/syslog-ng-faq
>>>>>
>>>>>
>>>>>
>>>>

-- 

| Jose Angel Santiago

[image: Logo_signature2.png] <http://www.stratio.com/>

Vía de las dos Castillas, 33, Ática 4, 3ª Planta

28224 Pozuelo de Alarcón, Madrid, Spain

+34 918 286 473 | www.stratio.com
<https://twitter.com/stratiobd> <https://www.linkedin.com/company/stratiobd>
<https://www.youtube.com/c/StratioBD>