Hi,

As Márton has already written about it [1], lots of us here at BalaBit spent a lot of time last year creating a pattern database for some 200+ often-used applications. Like every manual process, this was a tedious task that begged to be automated. Of course, it cannot be fully automated, as no algorithm can replace an actual person understanding the structure of the logs a piece of software produces (or even looking into the source code to see how they're generated). Still, a tool that can detect similar messages in a log file and generate a pattern database for them would have been really handy. Thus pdbtool patternize was born.

The tool I've written is part of the pdbtool utility and can be used to generate a pattern database from a bunch of unknown messages. It uses the algorithm Risto Vaarandi developed for SLCT [2]; the main idea is to use a data clustering technique to find similar log messages and replace the differing parts with wildcard characters. In our case, the wildcards are @ESTRING:: @ parsers, but otherwise the solution is pretty much the same.
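To give you an idea of how the clustering works, here's a minimal Python sketch of the SLCT approach (just an illustration, not the actual pdbtool code; it assumes messages are split at spaces and represents patterns as plain word tuples):

    from collections import Counter

    WILDCARD = '@ESTRING:: @'

    def patternize(messages, support=0.03):
        # A word (or a pattern candidate) is "frequent" if it occurs
        # in at least this many messages.
        threshold = support * len(messages)

        # Pass 1: count how often each word occurs at each position.
        word_counts = Counter()
        for msg in messages:
            for pos, word in enumerate(msg.split(' ')):
                word_counts[(pos, word)] += 1

        # Pass 2: turn each message into a cluster candidate by keeping
        # its frequent words and replacing the rest with wildcards.
        candidates = Counter()
        for msg in messages:
            candidate = tuple(
                word if word_counts[(pos, word)] >= threshold else WILDCARD
                for pos, word in enumerate(msg.split(' ')))
            candidates[candidate] += 1

        # Only the candidates that themselves reach the support
        # threshold become patterns.
        return [c for c, n in candidates.items() if n >= threshold]

The real tool of course emits proper pattern database entries instead of word tuples, but these two counting passes are the heart of the approach.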
It's far from perfect:

 - the code could be optimized in some places;
 - it needs to load everything into memory, so the size of the log file it can process is limited by the RAM in your machine (swapping slows things down really badly);
 - log messages are split into words only at spaces, which makes it unable to detect "username=@QSTRING::'@"-type patterns.

Still, it has already produced some impressive results on the test databases I've tried it with: it managed to categorize 95-98% of ~2M lines of logs into 40-50 patterns in a reasonable time, and the patterns themselves were pretty readable as well.

There are two options that can be set for the patternization. The first one ("-S") is the support value and accepts a floating point number: the percentage of the lines that have to match a pattern candidate for it to be included in the resulting pattern database. It allows tuning the trade-off between the number of patterns (too many would be hard to maintain) and the coverage these patterns provide. My first tests show that the optimal support value for most log types is around 2.5-5%, but you should check this with your own logs.
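For example, something along these lines patternizes a log file with a 3% support value (the exact invocation may still change a bit, and I'm taking the log file as the last argument here):

    pdbtool patternize -S 3.0 /var/log/messages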
The other option ("-o") enables a different mode of operation. In this case, after a clustering step is completed, the tool does not exit after printing the generated patterns; instead, it starts another clustering run on the messages that are not covered by the patterns generated so far, and it keeps doing this as long as new patterns can be generated. This way, much better coverage can be achieved while still keeping the number of patterns low: the larger groups are detected early, but the small groups aren't left out either. Based on my tests, a somewhat larger support value, around 10-15%, is optimal for this mode.
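In terms of the sketch above, this mode boils down to the following (again just an illustration, building on the patternize() function from before; matches() treats a wildcard as matching any single word):

    def matches(msg, pattern):
        # A pattern matches a message if they have the same number of
        # words and every non-wildcard word agrees.
        words = msg.split(' ')
        return (len(words) == len(pattern) and
                all(p in (WILDCARD, w) for p, w in zip(pattern, words)))

    def patternize_iteratively(messages, support=0.12):
        patterns = []
        remaining = list(messages)
        while remaining:
            new = patternize(remaining, support)
            if not new:
                break
            patterns.extend(new)
            # The next round only sees the messages that none of the
            # patterns generated so far cover.
            remaining = [m for m in remaining
                         if not any(matches(m, p) for p in new)]
        return patterns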
The code is available in my public syslog-ng repository [3] and I'm more than eager for some feedback. It still needs a larger review from someone more experienced in syslog-ng internals, and it leaks a little memory here and there, but overall it's mature enough to play around with, even with large log databases.

greets,
Peter

[1] http://marci.blogs.balabit.com/2009/12/pattern-database-first-snapshot.html
[2] http://ristov.users.sourceforge.net/slct/
[3] http://git.balabit.hu/?p=gyp/syslog-ng-3.1.git;a=summary