Hi,

As Márton has already written about it [1], lots of us here at BalaBit spent a lot of time last year creating a pattern database for some 200+ often-used applications. Like every manual process, this was a tedious task that begged to be automated. Of course, it cannot be fully automated, as no algorithm can replace an actual person understanding the structure of the logs a piece of software produces (or even looking into the source code to see how they're generated). Still, a tool that can detect similar messages in a log file and generate a pattern database for them would have been really handy. Thus pdbtool patternize was born.

The tool I've written is part of the pdbtool utility and can be used to generate a pattern database from a bunch of unknown messages. It uses the algorithm Risto Vaarandi developed for SLCT [2]; the main idea is to use a data clustering technique to find similar log messages and replace the differing parts with wildcard characters. In our case, the wildcards are @ESTRING:: @ parsers, but otherwise the solution is pretty much the same.
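To give you an idea of how the clustering works, here's a minimal Python sketch of the SLCT approach (just an illustration, not the actual pdbtool code; it assumes messages are split at spaces and represents patterns as plain word tuples):

    from collections import Counter

    WILDCARD = '@ESTRING:: @'

    def patternize(messages, support=0.03):
        # A word (or a pattern candidate) is "frequent" if it occurs
        # in at least this many messages.
        threshold = support * len(messages)

        # Pass 1: count how often each word occurs at each position.
        word_counts = Counter()
        for msg in messages:
            for pos, word in enumerate(msg.split(' ')):
                word_counts[(pos, word)] += 1

        # Pass 2: turn each message into a cluster candidate by keeping
        # its frequent words and replacing the rest with wildcards.
        candidates = Counter()
        for msg in messages:
            candidate = tuple(
                word if word_counts[(pos, word)] >= threshold else WILDCARD
                for pos, word in enumerate(msg.split(' ')))
            candidates[candidate] += 1

        # Only the candidates that themselves reach the support
        # threshold become patterns.
        return [c for c, n in candidates.items() if n >= threshold]

The real tool of course emits proper pattern database entries instead of word tuples, but these two counting passes are the heart of the approach.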
It's far from perfect:

 - the code could be optimized in some places;
 - it needs to load everything into memory, so the size of the log file it can process is limited by the RAM in your machine (swapping slows things down really badly);
 - log messages are split into words only at spaces, which makes it unable to detect "username=@QSTRING::'@"-type patterns.

Still, it has already produced some impressive results on the test databases I've tried it with: it managed to categorize 95-98% of ~2M lines of logs into 40-50 patterns in a reasonable time, and the patterns themselves were pretty readable as well.

There are two options that can be set for the patternization. The first one ("-S") is the support value and accepts a floating point number: the percentage of the lines that have to match a pattern candidate for it to be included in the resulting pattern database. It allows tuning the trade-off between the number of patterns (too many would be hard to maintain) and the coverage these patterns provide. My first tests show that the optimal support value for most log types is around 2.5-5%, but you should check this with your own logs.
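For example, something along these lines patternizes a log file with a 3% support value (the exact invocation may still change a bit, and I'm taking the log file as the last argument here):

    pdbtool patternize -S 3.0 /var/log/messages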
The other option ("-o") enables a different mode of operation. In this case, after a clustering step is completed, the tool does not exit after printing the generated patterns; instead, it starts another clustering run on the messages that are not covered by the patterns generated so far, and it keeps doing this as long as new patterns can be generated. This way, much better coverage can be achieved while still keeping the number of patterns low: the larger groups are detected early, but the small groups aren't left out either. Based on my tests, a somewhat larger support value, around 10-15%, is optimal for this mode.
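In terms of the sketch above, this mode boils down to the following (again just an illustration, building on the patternize() function from before; matches() treats a wildcard as matching any single word):

    def matches(msg, pattern):
        # A pattern matches a message if they have the same number of
        # words and every non-wildcard word agrees.
        words = msg.split(' ')
        return (len(words) == len(pattern) and
                all(p in (WILDCARD, w) for p, w in zip(pattern, words)))

    def patternize_iteratively(messages, support=0.12):
        patterns = []
        remaining = list(messages)
        while remaining:
            new = patternize(remaining, support)
            if not new:
                break
            patterns.extend(new)
            # The next round only sees the messages that none of the
            # patterns generated so far cover.
            remaining = [m for m in remaining
                         if not any(matches(m, p) for p in new)]
        return patterns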
The code is available in my public syslog-ng repository [3] and I'm more than eager for some feedback. It still needs a larger review from someone more experienced in syslog-ng internals, and it leaks a little memory here and there, but overall it's mature enough to play around with, even with large log databases.

greets,
Peter

[1] http://marci.blogs.balabit.com/2009/12/pattern-database-first-snapshot.html
[2] http://ristov.users.sourceforge.net/slct/
[3] http://git.balabit.hu/?p=gyp/syslog-ng-3.1.git;a=summary