[syslog-ng] periodic disconnect of TCP sessions

Fri Nov 15 14:08:01 UTC 2019

Laci, thank you for your thoughts and questions. Some answers:

My organization controls all the clients, but there are a _lot_ of them. Most are network devices or applications.

Yes, the reason for multiple instances is HA. We have one in each of our data centers. A hardware load balancer provides the DNS entry and resolution (a "wide IP"), and it IS aware of the states of the servers - it checks both every few seconds to ensure that it can establish a TCP connection. If it cannot, the DNS name will not be switched to that server's IP.
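A TCP connect monitor like the one described above only checks that a three-way handshake to the syslog port completes. The Python sketch below shows that idea. The function names, the 2-second timeout, and the "only hand out healthy addresses" helper are illustrative assumptions. The load balancer's real monitor settings are not given in this thread.

```python
import socket


def tcp_health_check(host, port, timeout=2.0):
    """Return True if a TCP handshake to host:port completes within timeout.

    Hypothetical stand-in for the load balancer's monitor; real monitors may
    also send data or check responses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def pick_answer(servers, port, healthy=tcp_health_check):
    """Return only the server addresses that the DNS name may resolve to right now."""
    return [s for s in servers if healthy(s, port)]
```

A probe like this confirms that the listener is accepting connections. It says nothing about how existing long-lived sessions are distributed between the servers.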

System load is not high. Both are VMs with multiple (virtual) CPUs, and syslog-ng rarely uses more than 10% of CPU. We get about 10k UDP messages per minute and about 1000 per minute over TCP. We see none of the usual stress indicators: no UDP receive buffer errors (RcvbufErrors), no bottlenecks that I can see.

When initially set up, having two copies of every syslog message was not a problem; it's just disk space. But when we started indexing them in Splunk, it became a concern, since Splunk is licensed by the volume of data indexed.

Concurrent connections: fewer than 20 TCP connections at a time.

I suppose what I'm looking for is more granular load balancing, if that makes sense. With UDP it's very granular, since each packet is sent statelessly to wherever the "wide IP" points at that second. With TCP, though, long-lived connections are the rule: great for throughput, but they kind of defeat the purpose. If syslog-ng kept a count of messages received over each TCP connection and did an orderly shutdown after "n" had been received, the client would reconnect seamlessly and all would be well. Alternatively, it could shut down a TCP connection after it had been open for "n" seconds.
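The proposed behaviour can be sketched as a toy Python receiver, not as a syslog-ng patch. Each session gets a budget of messages and seconds. When the budget is spent, the handler returns and the socket is closed normally. `MAX_MESSAGES`, `MAX_AGE_SECONDS`, `ConnectionBudget`, and `LimitedSyslogHandler` are hypothetical names, and the sketch assumes newline-terminated messages. syslog-ng OSE 3.22 has no equivalent options, as the reply below confirms.

```python
import socketserver
import time

# Hypothetical limits standing in for the "n" in the proposal above.
MAX_MESSAGES = 1000
MAX_AGE_SECONDS = 300.0


class ConnectionBudget:
    """Decides when one TCP session has used up its message or time allowance."""

    def __init__(self, max_messages, max_age, now=None):
        self.max_messages = max_messages
        self.max_age = max_age
        self.opened_at = time.monotonic() if now is None else now
        self.count = 0

    def record(self):
        self.count += 1

    def exhausted(self, now=None):
        now = time.monotonic() if now is None else now
        return self.count >= self.max_messages or now - self.opened_at >= self.max_age


class LimitedSyslogHandler(socketserver.StreamRequestHandler):
    """Reads LF-terminated syslog lines and ends the session once its budget is spent."""

    max_messages = MAX_MESSAGES
    max_age = MAX_AGE_SECONDS
    received = []  # demonstration sink shared by all connections

    def handle(self):
        budget = ConnectionBudget(self.max_messages, self.max_age)
        for line in self.rfile:  # one message per line
            self.received.append(line.rstrip(b"\n"))
            budget.record()
            if budget.exhausted():
                # Returning from handle() lets socketserver shut down and close
                # the socket, so the client sees an orderly FIN and can reconnect.
                break
```

Two caveats support the view that this is fragile. First, the age limit is only checked when a message arrives, so an idle session stays open. Second, any data the client has already sent but the handler has not yet read is discarded when the socket closes. Plain TCP syslog has no application-level acknowledgement that would let the sender notice the loss.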

Thanks,
Jon

Date: Fri, 15 Nov 2019 10:16:18 +0000
From: "Laszlo Szemere (lszemere)" <Laszlo.Szemere at oneidentity.com>
To: "syslog-ng at lists.balabit.hu" <syslog-ng at lists.balabit.hu>
Subject: Re: [syslog-ng] periodic disconnect of TCP sessions

Hello Jon,

There is currently no way to deliberately terminate an active TCP session. What I was thinking is that there might be a better approach to the original problem, and that terminating the TCP session is just a fragile workaround.

I have a couple of questions about your setup, so I can get a better understanding of the problem:

- Are all of the clients controlled by you? (Maybe they are syslog-ng instances in the first place?)
- Is there a particular reason to have multiple syslog-ng instances? At first glance (with the parallel sending) it looked like an HA(ish) setup, but the later round-robin solution ruled that out. It also looks like the DNS server is not aware of the states of the servers.
- What is the current load on the system? Have you experienced any throttling or bottlenecks? (Why do I ask? If the reason behind using multiple syslog-ng instances is to load balance the traffic for the Splunk servers, then that can be achieved in a different way.)
- How many concurrent connections do you expect at peak load?

Best regards,
Laci

________________________________________
From: syslog-ng <syslog-ng-bounces at lists.balabit.hu> on behalf of Wilson, Jonathan <jonathan.wilson at vumc.org>
Sent: Wednesday, November 13, 2019 23:03
To: syslog-ng at lists.balabit.hu
Subject: [syslog-ng] periodic disconnect of TCP sessions


Hello all,

I am using a pair of syslog-ng OSE 3.22.1 servers that write log files, which are then scanned by a Splunk Universal Forwarder. They receive messages over TCP, TCP with TLS, and UDP. Devices and systems that send us syslog messages have always simply sent to both syslog-ng servers; however, this resulted in double indexing of the log data in Splunk.

To deal with this we set up a DNS name that round-robins across the two syslog-ng servers’ IPs every 30 seconds. The devices and systems that send to us now send to that DNS name. That neatly prevents the double indexing. If the messages are coming in over stateless UDP, the messages are load balanced in that they all go to one server for 30 seconds, then the other. However, TCP sessions are much longer lived, and some senders send many messages every second – they will latch onto one of our syslog servers and stay connected to it all day.

What I am looking for is a way to limit the lifetime of a TCP connection into syslog-ng, either by time or by number of messages received; after the connection is dropped, the sender will reconnect to whichever server is indicated by the round-robin DNS name, and over time about half of the messages will go to each server.
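A small deterministic model illustrates the "about half" expectation. It rests on simplified assumptions: the sender emits a steady rate, it resolves the round-robin name only when it (re)connects, it sees each 30-second DNS flip immediately, and it reconnects instantly. `simulate` and its parameters are hypothetical and are not measurements of this setup.

```python
import math


def simulate(rate_per_s, flip_every_s, max_per_conn, duration_s):
    """Count messages that land on each of two servers behind a flipping DNS name.

    Assumptions (illustrative only): steady sender, no resolver caching,
    instant reconnect, DNS alternates server 0 / server 1 every flip_every_s.
    """
    counts = [0, 0]
    server = None
    sent_on_conn = 0
    for i in range(rate_per_s * duration_s):
        t = i / rate_per_s  # send time of message i, in seconds
        if server is None or sent_on_conn >= max_per_conn:
            server = int(t // flip_every_s) % 2  # whatever DNS answers right now
            sent_on_conn = 0
        counts[server] += 1
        sent_on_conn += 1
    return counts


# One hour at 10 msg/s: an uncapped session stays on one server,
# while a 100-message cap alternates with the DNS answer.
print(simulate(10, 30, math.inf, 3600))  # [36000, 0]
print(simulate(10, 30, 100, 3600))       # [18000, 18000]
```

The even split in the second case depends on the cap lining up with the 30-second flips. Real resolvers also cache answers for the record's TTL, so results in practice would be noisier.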

Is there already a way to do this? Failing that, can you suggest a place to start in patching the source?

Thanks,
Jon