[syslog-ng] [PATCH][patternize] support for custom delimiters
Balint Kovacs
balint.kovacs at balabit.com
Mon Feb 7 17:35:14 CET 2011
Hi,
with some time on my hands over the weekend, I was finally able to
finish support for multiple delimiters in 'pdbtool patternize'.
Rationale:
The current implementation of patternize only splits log messages at
spaces, which is inaccurate. While working with patterns, I found that
most of the volatile content in messages is enclosed in or followed by
some kind of separator character(s). So the best way of handling this
is to let the user define the delimiters and to provide a sane
default.
Implementation:
The delimiter list is extracted from each message separately after
tokenization (this way delimiters do not affect frequent word
collection) and then appended to the end of the cluster key in the SLCT
phase to ensure uniqueness. When printing the pattern, the delimiter
list is popped from the cluster key, and since it is in the same order
as the extracted words, we print the delimiter with the same index as
the word's within the for loop.
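To illustrate the idea, here is a minimal Python sketch of the steps
above. It is not the C code from the patch: the function names, the
SEP field separator, the simplified "frequent word" set and the use of
@ANYSTRING@ for a trailing wildcard are all assumptions made for the
example.

```python
MARKER = "\x1a"  # wildcard marker: the "substitute" char
SEP = "\x1e"     # hypothetical field separator inside the cluster key
DEFAULT_DELIMITERS = " :&~?![]=,;()'\""  # space plus the default list


def tokenize(message, delimiters=DEFAULT_DELIMITERS):
    """Split at every delimiter char; return (words, delimiter string).

    delims[i] is the delimiter that followed words[i], so the two
    sequences stay in the same order.
    """
    words, delims, current = [], [], []
    for ch in message:
        if ch in delimiters:
            words.append("".join(current))
            delims.append(ch)
            current = []
        else:
            current.append(ch)
    words.append("".join(current))
    return words, "".join(delims)


def cluster_key(words, delims, frequent):
    """Keep frequent (and empty) words, replace the rest with MARKER,
    then append the delimiter list as the last field of the key."""
    fields = [w if (w == "" or w in frequent) else MARKER for w in words]
    return SEP.join(fields + [delims])


def render_pattern(key):
    """Pop the delimiter list off the key and print word i with delimiter i."""
    *fields, delims = key.split(SEP)
    out = []
    for i, field in enumerate(fields):
        delim = delims[i] if i < len(delims) else ""
        if field == MARKER:
            # Assumption: a wildcard with no following delimiter
            # falls back to @ANYSTRING@ in this sketch.
            out.append(("@ESTRING::%s@" % delim) if delim else "@ANYSTRING@")
        else:
            out.append(field + delim)
    return "".join(out)


words, delims = tokenize("server_fd='46', server_protocol='TCP'")
key = cluster_key(words, delims, {"server_fd", "server_protocol", "TCP"})
print(render_pattern(key))
# prints: server_fd='@ESTRING::'@, server_protocol='TCP'
```

Because the delimiter list is part of the key, "a=b" and "a:b" end up in
different clusters even when both words are frequent.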
The user input needs to be sanitized: we have to make sure that the
space char is always in the delimiter list and that there are no
duplicates. I have also changed the parser marker char from * to
0x1A ("substitute" char) to prevent parser "false positives" in case a
word starts with *. I also added the class and provider properties to
each rule to make the generated XML XSD-compliant. The default delimiter
list is :&~?![]=,;()'"
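The sanitizing step could look roughly like this in Python. This is
only a sketch of the two rules above (space always present, no
duplicates); the function name and the choice to put the space first
are assumptions, not what the patch necessarily does.

```python
def sanitize_delimiters(user_input):
    """Return the delimiter list with a space guaranteed and duplicates
    dropped, keeping the first occurrence of each character."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so
    # prepending " " both forces the space in and deduplicates it.
    return "".join(dict.fromkeys(" " + user_input))


print(repr(sanitize_delimiters(":=:,")))
# prints: ' :=,'
```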
And the results:
Running time and memory consumption have almost doubled(!). As far as I
could test, this is because we have many more tokens than when
splitting at spaces, so with the current structure we can't really do
much about that. BUT accuracy has increased quite a lot; I think there
is a visible difference.
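The growth in token count is easy to see on a fragment of a message.
The Python snippet below is only an illustration: it assumes every
delimiter character ends a token (so empty tokens between adjacent
delimiters are counted too), which may differ from how the real
tokenizer counts.

```python
DEFAULT_DELIMITERS = " :&~?![]=,;()'\""  # space plus the default list


def count_tokens(message, delimiters):
    """Number of tokens when every delimiter char ends a token."""
    return 1 + sum(1 for ch in message if ch in delimiters)


fragment = "server_fd='46', server_protocol='TCP'"
print(count_tokens(fragment, " "))                 # prints: 2
print(count_tokens(fragment, DEFAULT_DELIMITERS))  # prints: 9
```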
For the log message:
core.session(3): (svc/intra_HTTP:1486416/http): Server connection
established; server_fd='46', server_address='AF_INET(1.2.3.4:80)',
server_zone='Zone(internet, 0.0.0.0/0)',
server_local='AF_INET(1.2.3.4:52007)', server_protocol='TCP'
With tokenizing at spaces:
core.session(3): @ESTRING:: @Server connection established; @ESTRING::
@@ESTRING:: @server_zone='Zone(internet, 0.0.0.0/0)', @ESTRING::
@server_protocol='TCP'
And with the new default delimiters:
core.session(3): (svc/intra_HTTP:@ESTRING::)@: Server connection
established; server_fd='@ESTRING::'@,
server_address='AF_INET(@ESTRING:::@80)', server_zone='Zone(internet,
0.0.0.0/0)', server_local='AF_INET(1.2.3.4:@ESTRING::)@',
server_protocol='TCP'
(Please note that the log sample used for generating these patterns was
grepped on intra_HTTP; that's why the Zorp service name and server port
80 are found as frequent words and not ESTRING'd.)
Regards,
Balint
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-patternize-implemented-support-for-custom-delimiters.patch
Type: text/x-patch
Size: 0 bytes
Desc: not available
Url : http://lists.balabit.hu/pipermail/syslog-ng/attachments/20110207/30cee09a/attachment.bin