Hi,

After repeating the test without TLS between syslog-agent & syslog-relay, with 2229 messages logged with a 20 msgs/sec logging ratio, in debug mode and with two tcpdumps running, these are the conclussions:

- Syslog-agent sends every single message (checked with tcpdump file and syslog-agent log file)
- Syslog-relay returns ACK for every single message, but sometimes the package contains more than one message (checked with tcpdump file)
- Checking the syslog-relay log after killing & starting the docker container, I found the trace 'Reliable disk-buffer state loaded; filename='/syslog-ng-00000.rqf', queue_length='99', size='173844' and the queued messages (from msg 236 to msg 334 are processed). The next message processed is the nº 440. and it comes from the active network destination again.
- Once all logs have been processed, in elasticsearch (syslog-relay destination) I can find messages from 1 to 334, and messages from 440 to 2229.

How is it possible that syslog-relay returns ACK for messages 335 to 439 to syslog-agent, but they are not in the queue, nor in the .rqf file? Where did these messages go? I know they are sent before starting to fill syslog-agent disk-buffer file (checked with rqf file from syslog-agent when I kill the syslog-relay container). It seems that there's some kind of race condition that makes those messages are not processed/queued  while syslog-relay is being killed.

Mi procedure consists of:

. Start tcpdump on both syslog-agent & syslog-relay hosts
- Start logger process
- Once logs are being inserted in elasticsearch, I kill the syslog-relay docker container, wait about 10 seconds, and run a new container (using a mapped volume where .rqf file and .persist file are)
- Once logs are being inserted in elasticsearch again, I stop the logger process.
- I wait until no log is left to be processed

Would you test anything else? I'm running out of ideas.

PD: Same test with logger process writing 5 msgs/sec ratio produces from 0 to 5 lost messages.

Regards.




2018-09-14 14:05 GMT+02:00 Jose Angel Santiago <jasantiago@stratio.com>:
Hi,

I've got more accurate information about where are my lost messages.

Now I'm using disk-buffer in syslog-agent and syslog-relay, and I've checked that lost messages are the ones sent by the syslog-agent when the syslog-relay docker container is being killed. I can see those messages on syslog-agent log (I've got both agent & relay in debug mode) with its corresponding "Outgoing message" line, but those messages never reach the relay.

Could it be that the relay docker container still returns ACK to the agent (the agent resolves relay fqdn with a custom DNS) while syslog-ng process within the container is being stopped? I'm about to test again using tcpdump to confirm this theory,

BTW, disk-buffers works ok, sometimes I get some duplicated messages when restarting the relay but that's not a problem for me. Forget about my .persist file re-creation theory, it doesn't happen.

Regards.



2018-09-13 18:07 GMT+02:00 Péter, Kókai <peter.kokai@oneidentity.com>:
Hello,

It would not make sense to replace the persist file after restart, so it is not something that syslog-ng does. Only if that file is corrupted, in that case at startup there should be a log about it, have you checked the syslog-ng logs ? (it would be better to enable debug and/or verbose logs, and if possible share it with us.)

Could you reproduce the same behavior without docker (if possible) ?



Best Regards,
Peter Kokai

On Thu, Sep 13, 2018 at 4:44 PM Jose Angel Santiago <jasantiago@stratio.com> wrote:
Hi,

I guess I know what is happening, when I start from scratch the docker container, even I provide a persist file and a buffer file within the mapped volume, syslog-ng recreates them so all messages in buffer file which were not processed by the relay are lost.

Is there any way to tell syslog-ng to use an already existing .persist file so it doesn't recreate the .rqf file?

Regards.



2018-09-13 16:23 GMT+02:00 Budai, László <laszlo.budai@oneidentity.com>:
Hi,

one problem could be if the flush-limit would be greater than 1... in that case syslog-ng would use a HttpBulkMessageProcessor.
In this case syslog-ng pass the message to the HttpBulkMessageProcessor and sends back a positive ACK to the LogSource (so the message is removed from the diskbuffer), and if the dockerimage is killed, all the messages stored in the HttpBulkMessageProcessor are lost.
But in your case syslog-ng should use the HttpSingleMessageProcessor... which means that the messages are sent one-by-one...
Could you check the diskbuffer with the dqtool?


L.

On Thu, Sep 13, 2018 at 3:50 PM, Jose Angel Santiago <jasantiago@stratio.com> wrote:
Hi,

I'm running syslog-ng (with an elasticsearch2 destination configured) within a docker container, and I'm trying to avoid loss of messages if I kill the docker container and I start it again.

This is my scenary:

- A service which produces 20 lines of log per second
- A sislog-ng instance reading from a wildcard-file source (but actually it only reads logs from the above service, let's call it syslog-agent), which sends all logs to another syslog-ng instance (the one running in a docker container, let's call it syslog-relay) though a network destination.
- The syslog-relay sends messages to an elasticsearch instance, with following configuration:

options {
    chain-hostnames(no);
    use-dns(no);
    keep-hostname(yes);
    owner("syslog-ng");
    group("stratio");
    perm(0640);
    time-reap(30);
    mark-freq(10);
    stats-freq(0);
    bad-hostname("^gconfd$");
    flush-lines(100);
    log-fifo-size(1000);
    };

destination d_elastic_default_0 {
    elasticsearch2(
        cluster("myelastic")
        cluster-url("https://myelastic.logs:9200")
        client_mode("https")
        index("default")
        type("log")
        flush-limit(1)
        disk-buffer(
            mem-buf-size(16M)
            disk-buf-size(16M)
            reliable(yes)
            dir("/syslog-ng/log")
        )
        http-auth-type("clientcert")
        java-keystore-filepath("/etc/syslog-ng/certificates/syslog-relay.jks")
        java-keystore-password("XXXXXX")
        java-truststore-filepath("/etc/syslog-ng/certificates/ca-bundle.jks")
        java-truststore-password("XXXXXXXXXX")
    );
};


- The dir "/syslog-ng/log" is mapped to a path "/tmp/buffer" from the host where the docker container is running, so when I kill the docker container, the buffer file is not lost.
- I've set flush-limit to 1 because I thought that I may lost 1 message only as much.

This architecture is working fine (flush-limit=1 makes very slow, but for this test is ok), but if I kill the syslog-relay docker container, wait 5 to 10 seconds and start it again from scratch, I can see that several hundreds of logs are missing in elasticsearch. I check it by stopping the logger service and letting syslog-ng agent & relay to finish the process enqueued messages.

I can see in the syslog-agent stats that all logs messages have been processed, so it seems the problem is on the syslog-relay.  

Is this behaviour expected? If so, how can I protect against loss of messages in case of a syslog-relay docker container unexpected kill?

Thanks in advance.


--

| Jose Angel Santiago

Logo_signature2.png

Vía de las dos Castillas, 33, Ática 4, 3ª Planta

28224 Pozuelo de Alarcón, Madrid, Spain

+34 918 286 473 | www.stratio.com



______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq




______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq





--

| Jose Angel Santiago

Logo_signature2.png

Vía de las dos Castillas, 33, Ática 4, 3ª Planta

28224 Pozuelo de Alarcón, Madrid, Spain

+34 918 286 473 | www.stratio.com


______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq


______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.balabit.com/wiki/syslog-ng-faq





--

| Jose Angel Santiago

Logo_signature2.png

Vía de las dos Castillas, 33, Ática 4, 3ª Planta

28224 Pozuelo de Alarcón, Madrid, Spain

+34 918 286 473 | www.stratio.com





--

| Jose Angel Santiago

Logo_signature2.png

Vía de las dos Castillas, 33, Ática 4, 3ª Planta

28224 Pozuelo de Alarcón, Madrid, Spain

+34 918 286 473 | www.stratio.com