Hi Martin,

On Wed, Sep 08, 2010 at 01:48:08PM -0500, Martin Holste wrote:
I will share my experience thus far with the exact problem you're tackling and what's been working for us.
Thanks. I appreciate your willingness to jump in and discuss tricky problems.
Use the program() destination and open(FH, "-|") in Perl to read it. This saves the UDP packet creation overhead as well as ensures that there are no lost logs.
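For anyone following along, Martin's approach can be sketched roughly as below. With a syslog-ng destination such as `destination d_perl { program("/usr/local/bin/reader.pl"); };` (the script path here is a hypothetical placeholder), syslog-ng spawns the script once and writes one log message per line to its stdin, so the Perl side is just a read loop. The `massage` step is an invented example of the "simple massaging" mentioned later in the thread, not Martin's actual code:

```perl
#!/usr/bin/perl
# Minimal sketch of a program() destination reader: syslog-ng spawns this
# script and feeds it one log message per line on stdin.
use strict;
use warnings;

$| = 1;   # unbuffered output, so a crash strands as little as possible

# Example massage step (hypothetical): flatten tabs so the output
# stays a valid tab-separated line.
sub massage {
    my ($line) = @_;
    $line =~ s/\t/ /g;
    return $line;
}

while (my $line = <STDIN>) {
    chomp $line;
    print massage($line), "\n";   # or accumulate lines here for batching
}
```

Because the pipe is synchronous, syslog-ng's own output buffer absorbs bursts while this loop is busy, which is why keeping the per-line work cheap matters.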
Good to know. If I use this method, how can I tell when I have collected one of my 60-second batches?
I have experimented with N preforked Perl child workers, each listening on its own "sub" pipe in round-robin order (modulo on Perl's $. variable). But I quickly found what you've already pointed out: this is a synchronous pipe, so there's no point in round-robin-ing, since the parent can't move on to the next child pipe until the first child is done reading anyway.
That method would not work for me anyway, because I need all of my messages in a single memory space so I can crunch them down to look for anomalies. If they ended up scattered across a bunch of child processes, that would not get me very far.
That's fine, since I have never found the Syslog-NG -> Perl end of things to be a bottleneck. In our setup, I have Perl do some simple massaging of the logs and then write out to a tab-separated file in one minute batches.
Good to know where the bottlenecks aren't! :) Note that in my case I am only concerned with making sure I don't bog down the syslog-ng daemon with the slowness of my Perl code. If my code chokes the daemon, that's a disaster; it is OK if I occasionally lose some messages on the way to the Perl end.
I then load the file using MySQL's LOAD DATA INFILE, which can get you a sustained 100k messages per second into a database if you're light on the indexing. There's also no reason you couldn't simply write the logs from Perl to a flat file in SQLite format, which would let you skip the MySQL step entirely. It really depends on what final format you want the logs in.
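As a rough illustration of the LOAD DATA step from Perl (the file path and table name below are hypothetical placeholders, not anything from Martin's setup), the statement for a tab-separated batch file can be built like this:

```perl
#!/usr/bin/perl
# Sketch: build a LOAD DATA LOCAL INFILE statement for a one-minute
# tab-separated batch file. File and table names are placeholders.
use strict;
use warnings;

sub load_sql {
    my ($file, $table) = @_;
    # LOCAL lets the client side read the file; the field/line
    # terminators match the tab-separated batch format.
    return "LOAD DATA LOCAL INFILE '$file' INTO TABLE $table"
         . " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'";
}

print load_sql('/var/tmp/batch.tsv', 'syslog'), "\n";
```

In practice you would hand that string to `$dbh->do(...)` on a DBI connection (with `mysql_local_infile` enabled for the MySQL driver); keeping indexes minimal on the target table is what makes the high sustained rates possible.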
I have two cases I am trying to solve.

1) Crunch the logs in 60-second batches to look for anomalies. For this I will need:
* all messages available in the memory of a single Perl process / thread / etc. to perform the computations
-and either-
* some way to pull in messages from the next batch while processing the last batch (in Java I used two threads and this worked fine for a past project)
-or-
* some way of batching incoming messages and knowing when a batch is done, so I can spend the next ~55 seconds on processing before preparing to receive a new batch. So far I don't have a reliable way of knowing I've received the entire batch from the daemon.

2) Write the logs to the DB. For this I am hoping to use the daemon's native support if possible; if not, I will do it from Perl. If I do it from Perl, I will still want batching so I can do a bulk write and bulk commit via LOAD DATA INFILE or another high-speed technique such as Oracle bulk load, etc.
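One way the "process the last batch while receiving the next" requirement could be sketched (this is my own illustration, not anything from the thread): the reader assigns each line to a 60-second window, and on a window boundary forks a child that inherits a complete copy of the batch in its own memory space, while the parent keeps reading without pausing. All names below are invented for the sketch:

```perl
#!/usr/bin/perl
# Sketch: collect lines into 60-second batches; on each batch boundary,
# fork a child to crunch the finished batch while the parent keeps reading.
use strict;
use warnings;

$SIG{CHLD} = 'IGNORE';   # auto-reap finished batch children

sub batch_id { return int($_[0] / 60) }   # epoch seconds -> batch number

my $current;
my @batch;
while (my $line = <STDIN>) {
    my $id = batch_id(time());
    if (defined $current && $id != $current) {
        my $pid = fork();
        if (defined $pid && $pid == 0) {
            # child: gets its own copy of @batch in a single memory
            # space; run the anomaly crunching here, then exit
            exit 0;
        }
        @batch = ();   # parent: start collecting the next batch
    }
    $current = $id;
    push @batch, $line;
}
```

The fork gives each batch the "single memory space" property per batch, at the cost of copying; whether that beats two threads as in the Java project depends on batch size.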
In any case, I would discourage you from trying the async framework route as it adds way too much overhead.
Agreed. I looked at Moose, AnyEvent, POE, etc. and concluded they were too complicated and would not provide much benefit over a simple select() loop for my case.
If you do in fact find a bottleneck with pipes, I would think that a solution involving UDP sent to a local port could work with some fancy iptables load balancing. You would be limited to netstat counters to detect losses, but it would probably work. But unless you hit a pipe bottleneck, I think all of that is more trouble than it is worth.
That's not going to help much in my case, because I have no way of crunching the logs to find anomalies if they end up scattered across the memory of different processes.
--Martin
Matthew.
On Wed, Sep 8, 2010 at 12:02 AM, <syslogng@feystorm.net> wrote:
Sent: Tuesday, 7 September 2010 19:42:52
From: Matthew Hall <mhall@mhcomputing.net>
To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu>
Subject: Re: [syslog-ng] Buffering AF_UNIX Destination, Batch Post Processing Messages
Syslog-ng will queue all the destination messages until the oldest message is 60 seconds old, and then flushes them all out at once.
This part is tricky. How do I tell if I have received all the messages? How do I know when I have hit the end of the batch? Is it possible to have the daemon insert a marker message, or is there some other way I can check for this?
I do not believe there is an elegant way. The best idea I can come up with is to put a timeout on the receiving end, so that when the stream goes quiet for more than X seconds or so, the reader treats that as end of batch. You might also be able to request that the mark option be allowed for non-local destinations. That would let you set a mark interval of 1 second; when you receive 2 mark messages back-to-back, that is end-of-batch (it would basically mean there was no data in between).
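The two-marks-in-a-row idea is a tiny state machine, sketched below. The mark text `-- MARK --` is an assumption here; the actual mark message text would depend on the configuration:

```perl
#!/usr/bin/perl
# Sketch: detect end-of-batch as two consecutive mark messages.
# The "-- MARK --" pattern is an assumed mark text.
use strict;
use warnings;

sub make_detector {
    my $prev_was_mark = 0;
    return sub {
        my ($line) = @_;
        my $is_mark    = ($line =~ /-- MARK --/) ? 1 : 0;
        my $batch_done = ($is_mark && $prev_was_mark) ? 1 : 0;
        $prev_was_mark = $is_mark;
        return $batch_done;   # true only on the second back-to-back mark
    };
}

my $end_of_batch = make_detector();
# while (my $line = <STDIN>) {
#     if ($end_of_batch->($line)) { ... process the batch, reset ... }
#     else                        { ... accumulate non-mark lines ... }
# }
```

Compared with the pure-timeout approach, this gives a deterministic boundary: a single stray mark between real messages never triggers it, only two in a row.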
Thanks, Matthew.