[syslog-ng] [PATCH][patternize] support for custom delimiters

Balint Kovacs balint.kovacs at balabit.com
Mon Feb 7 17:35:14 CET 2011


Hi,

with some time on my hands over the weekend I was finally able to finish 
support for multiple delimiters in 'pdbtool patternize'.

Rationale:

The current implementation of patternize splits log messages only at 
spaces, which is inaccurate. While working with patterns, I found that 
most of the volatile content in messages is enclosed in or followed by 
some kind of separator character(s). So the best way of handling this 
would be to let the user define the delimiters and to provide a sane 
default.

Implementation:

The delimiter list is extracted from each message separately after 
tokenization (this way the delimiters do not affect frequent word 
collection) and then added to the end of the cluster key in the SLCT 
phase to ensure uniqueness. When printing the pattern, the delimiter 
list is popped from the cluster key; as it is in the same order as the 
extracted words, we print the delimiter with the same index as the 
word's within the for loop.
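
To illustrate the idea, here is a simplified Python sketch. It is not the 
C code from the patch: the function names, the tuple-based cluster key, 
keeping empty words between adjacent delimiters, and the @ANYSTRING@ 
fallback for a wildcard at the very end of a message are all assumptions 
made for this example.

```python
# Illustrative sketch only, not the code from the patch.
DELIMITERS = " :&~?![]=,;()'\""   # space plus the default list from this mail
PARSER_MARKER = "\x1a"            # "substitute" char standing in for a wildcard


def tokenize(message, delimiters=DELIMITERS):
    """Return (words, delims); delims[i] is the char that followed words[i]."""
    words, delims, current = [], [], []
    for ch in message:
        if ch in delimiters:
            words.append("".join(current))
            delims.append(ch)
            current = []
        else:
            current.append(ch)
    if current:  # trailing word with no delimiter after it
        words.append("".join(current))
        delims.append("")
    return words, delims


def cluster_key(words, delims, frequent):
    """Mask infrequent words and append the delimiter list to the key."""
    masked = tuple(w if (w == "" or w in frequent) else PARSER_MARKER
                   for w in words)
    return masked + (tuple(delims),)


def render_pattern(key):
    """Pop the delimiter list off the key and print word i with delimiter i."""
    *masked, delims = key
    parts = []
    for i, word in enumerate(masked):
        if word != PARSER_MARKER:
            parts.append(word + delims[i])
        elif delims[i]:
            parts.append("@ESTRING::" + delims[i] + "@")
        else:
            # Assumption: a wildcard at the very end has no delimiter to stop at.
            parts.append("@ANYSTRING@")
    return "".join(parts)


words, delims = tokenize("established; server_fd='46'")
pattern = render_pattern(cluster_key(words, delims, {"established", "server_fd"}))
print(pattern)
```

With "established" and "server_fd" treated as frequent words, the sketch 
prints established; server_fd='@ESTRING::'@, which has the same shape as 
the patterns shown later in this mail.
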
The user input needs to be sanitized: we have to make sure that the 
space char is always in the delimiter list and that there are no 
duplicates. I have also changed the parser marker char from * to 
0x1A ("substitute" char) to prevent parser "false positives" in case a 
word starts with *. I also added the class and provider properties to 
each rule to make the generated XML XSD-compliant. The default delimiter 
list is :&~?![]=,;()'"
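
The sanitizing step could look roughly like this in Python. This is only 
a sketch under assumptions: the function name and the choice to put the 
space first are made up for the example, and the patch itself may order 
or store the list differently.

```python
# Illustrative sketch only, not the code from the patch.
DEFAULT_DELIMITERS = ":&~?![]=,;()'\""  # default list quoted in this mail


def sanitize_delimiters(user_input=None):
    """Return a delimiter string that always contains a space, without duplicates."""
    raw = DEFAULT_DELIMITERS if user_input is None else user_input
    # dict.fromkeys keeps the first occurrence of each char, in order.
    return "".join(dict.fromkeys(" " + raw))


print(repr(sanitize_delimiters("::=,=")))
print(repr(sanitize_delimiters()))
```

An empty user-supplied list falls back to splitting at spaces only, which 
matches the old behaviour.
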

And the results:

Running time and memory consumption have almost doubled(!). As far as I 
could test, this comes from the fact that we have many more tokens than 
when splitting at spaces, so with the current structure we can't really 
do much about that. BUT accuracy has increased quite a lot; I think 
there is a visible difference.
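
The token growth is easy to see on a small fragment of the sample message 
quoted in this mail. The Python sketch below only counts non-empty 
tokens; it does not measure the running time or memory of pdbtool, and 
the helper name is made up for the example.

```python
import re

# Illustrative sketch only: counts tokens, does not benchmark pdbtool.
DELIMITERS = " :&~?![]=,;()'\""  # space plus the default list from this mail
FRAGMENT = "server_fd='46', server_address='AF_INET(1.2.3.4:80)'"


def count_tokens(message, delimiters):
    """Count the non-empty tokens left after splitting at any delimiter char."""
    pattern = "[" + re.escape(delimiters) + "]"
    return len([tok for tok in re.split(pattern, message) if tok])


space_count = count_tokens(FRAGMENT, " ")
delim_count = count_tokens(FRAGMENT, DELIMITERS)
print(space_count, delim_count)
```

For this fragment the count goes from 2 tokens (spaces only) to 6 tokens 
(default delimiters).
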

For the log message:

core.session(3): (svc/intra_HTTP:1486416/http): Server connection 
established; server_fd='46', server_address='AF_INET(1.2.3.4:80)', 
server_zone='Zone(internet, 0.0.0.0/0)', 
server_local='AF_INET(1.2.3.4:52007)', server_protocol='TCP'

With tokenizing at spaces:

core.session(3): @ESTRING:: @Server connection established; @ESTRING:: 
@@ESTRING:: @server_zone='Zone(internet, 0.0.0.0/0)', @ESTRING:: 
@server_protocol='TCP'

And with the new default delimiters:

core.session(3): (svc/intra_HTTP:@ESTRING::)@: Server connection 
established; server_fd='@ESTRING::'@, 
server_address='AF_INET(@ESTRING:::@80)', server_zone='Zone(internet, 
0.0.0.0/0)', server_local='AF_INET(1.2.3.4:@ESTRING::)@', 
server_protocol='TCP'

(please note that the log sample used for generating these patterns was 
grepped on intra_HTTP; that's why the Zorp service name and server port 
80 are found as frequent words and not ESTRING'd)

Regards,
Balint


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-patternize-implemented-support-for-custom-delimiters.patch
Type: text/x-patch
Size: 0 bytes
Desc: not available
Url : http://lists.balabit.hu/pipermail/syslog-ng/attachments/20110207/30cee09a/attachment.bin 

