Pattern Database first snapshot available
Hi,
Last week BalaBit made available some 8000 patterns (covering more than 200 applications) for syslog-ng patterndb (or db_parser, as you like to call it). The patterns are available under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 (CC by-NC-SA) license. The patterns in their current form are just snapshots of the ongoing effort to provide good-quality patterns for various applications. You can download the snapshot of patterns from our website: http://www.balabit.com/downloads/files/patterndb-snapshot/patterndb-20091209...
The patterns are partially hand-crafted and partially generated automatically from logfiles and from the regexp-based logcheck database. Some of the patterns also contain example messages, which we use to automatically test the patterns and syslog-ng's db_parser. You can merge the XML files using "pdbtool merge".
I would also like to set up a public git repository where anyone interested can follow the patterndb development and submit patterns or fixes. A patterndb website containing all patterndb-related information, links, forums, wikis and other useful documentation is under construction as well. Until then, the syslog-ng mailing list is a good place for questions, ideas and discussions.
As always, feedback is very welcome!
Happy parsing!
Marton
-- Key fingerprint = F78C 25CA 5F88 6FAF EA21 779D 3279 9F9E 1155 670D
This is an awesome start, and I'm big into patterndb so this is really encouraging. Off the bat, I'd say that it would be more helpful if the <values></values> tags were populated with the .dict values that are being extracted so that you can construct output patterns properly.
Along with that, if you have a different name for every .dict value extracted, it becomes labor-intensive to capture them in your output template. I prefer a method in which I have arbitrarily capped the number of values to be extracted at six strings and six integers. I then label the values I extract as s0-s5 and i0-i5. That way I only need one template for all patterns extracted. Separating the strings and integers makes database insertion easy because my tables then look like <header columns> MSG, pattern_class_id, pattern_rule_id, i0 .. i5, s0 .. s5. Searching for fields then becomes possible if you know which field belongs to which pattern rule ID. I also prefer to have the rule IDs as integers to keep my DB columns smaller.
Here's an example for Cisco FWSM deny and NAT translation teardown messages that I've been using:
  <ruleset name="FWSM" id='2'>
    <pattern>%FWSM</pattern>
    <rules>
      <rule provider="local" class='2' id='2'>
        <patterns>
          <pattern>Deny@QSTRING:i0: @src@QSTRING:s0: :@@IPv4:i1:@/@NUMBER:i2:@ dst@QSTRING:s1: :@@IPv4:i3:@/@NUMBER:i4:@ by access-group @QSTRING:s2:"@</pattern>
        </patterns>
      </rule>
      <rule provider="local" class='3' id='3'>
        <patterns>
          <pattern>Teardown@QSTRING:i0: @connection @NUMBER::@ for@QSTRING:s0: :@@IPv4:i1:@/@NUMBER:i2:@ to@QSTRING:s1: :@@IPv4:i3:@/@NUMBER:i4:@ duration@QSTRING:s2: @bytes @NUMBER:i5:@</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>
My back-end script does a bit of magic with IPv4 char -> uint parsing for better DB storage. (If anyone at Balabit would like to toss in a little feature for easy outputting as inet_aton/inet_ntoa from socket.h, that would be cool!)
So, if I'm looking for all denied packets from IP address 1.1.1.1, I would search my DB where class_id=2 and i1=INET_ATON("1.1.1.1").
Have any others been using db-parser values? Any methods to share?
--Martin
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
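The fixed-column layout described in the message above (header columns plus i0 .. i5 and s0 .. s5, with IPv4 addresses stored as integers) can be sketched as follows. This is only an illustration under assumptions: SQLite stands in for the actual database, the table name `logs`, the helper names, and the sample row are all invented, and `ipv4_to_int` mimics what MySQL's INET_ATON() returns.

```python
import ipaddress
import sqlite3

INT_COLS = [f"i{n}" for n in range(6)]   # i0 .. i5
STR_COLS = [f"s{n}" for n in range(6)]   # s0 .. s5


def ipv4_to_int(addr: str) -> int:
    """Dotted-quad text to unsigned 32-bit integer, like MySQL's INET_ATON()."""
    return int(ipaddress.IPv4Address(addr))


def int_to_ipv4(value: int) -> str:
    """Inverse conversion, like MySQL's INET_NTOA()."""
    return str(ipaddress.IPv4Address(value))


def create_table(conn: sqlite3.Connection) -> None:
    """Create one table whose generic columns serve every pattern class."""
    cols = [f"{c} INTEGER" for c in INT_COLS] + [f"{c} TEXT" for c in STR_COLS]
    conn.execute(
        "CREATE TABLE logs (msg TEXT, class_id INTEGER, rule_id INTEGER, "
        + ", ".join(cols) + ")"
    )


def insert_log(conn, msg, class_id, rule_id, ints=(), strs=()):
    """Pad both value lists to six entries so one INSERT fits every pattern."""
    ints = (list(ints) + [None] * 6)[:6]
    strs = (list(strs) + [None] * 6)[:6]
    placeholders = ", ".join(["?"] * 15)  # 3 header columns + 6 ints + 6 strings
    conn.execute(f"INSERT INTO logs VALUES ({placeholders})",
                 [msg, class_id, rule_id, *ints, *strs])


conn = sqlite3.connect(":memory:")
create_table(conn)
# Invented sample row for the "deny" rule (class 2): i1 holds the source address.
insert_log(conn, "sample deny message", 2, 2,
           ints=[None, ipv4_to_int("1.1.1.1"), 1025, ipv4_to_int("10.0.0.5"), 80],
           strs=["outside", "inside", "acl-in"])
# Equivalent of: WHERE class_id=2 AND i1=INET_ATON("1.1.1.1")
rows = conn.execute(
    "SELECT msg FROM logs WHERE class_id = ? AND i1 = ?",
    (2, ipv4_to_int("1.1.1.1")),
).fetchall()
```

Because every rule writes into the same twelve generic columns, a single INSERT statement covers all patterns; the cost is that the meaning of each column depends on the rule ID.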
On Tue, 2009-12-15 at 13:00 -0600, Martin Holste wrote:
> This is an awesome start, and I'm big into patterndb so this is really encouraging. Off the bat, I'd say that it would be more helpful if the <values></values> tags were populated with the .dict values that are being extracted so that you can construct output patterns properly.
The <values></values> element can be used to specify additional values that you want to set but that do not appear in the message itself. For example, suppose you want to classify login messages, and a certain message does not contain the username, but you know that it always reports a specific user. In this case you can use <values> to assign the .dict.username variable (for example) to that specific user, and later you can be sure that it exists. I am still not sure if I completely understand your suggestion...
> Along with that, if you have a different name for every .dict value extracted, it becomes labor-intensive to capture them in your output template. I prefer a method in which I have arbitrarily capped the number of values to be extracted at six strings and six integers. I then label the values I extract as s0-s5 and i0-i5. That way I only need one template for all patterns extracted. Separating the strings and integers makes database insertion easy because my tables then look like <header columns> MSG, pattern_class_id, pattern_rule_id, i0 .. i5, s0 .. s5. Searching for fields then becomes possible if you know which field belongs to which pattern rule ID. I also prefer to have the rule IDs as integers to keep my DB columns smaller.
The reason for using UUIDs was to be able to provide globally unique IDs; simple integers would be hard to maintain. I was also thinking of using OIDs for IDs, but UUIDs were an easier choice. Technically you can use simple integers or any other string, as syslog-ng currently does not check the ID. I will think about it... :)
Using integers would also be better for DB indexing purposes. If you want to use integers, you can assign a <value name="my_id">42</value> to each pattern as a workaround and later use "my_id" in your templates.
I prefer using more meaningful names, because this way you can normalize your logs: it won't matter whether it is a PIX, iptables, etc. log message, you can always refer to the source/destination address by its name. It requires storing different types of logs in different tables, but you can freely change your application without changing your log processing scripts.
You can also combine these two methods: use meaningful names in the patterns and use <values> to assign them to numbered values, like this:
<value name="s1">${.dict.source_ip}</value>
Of course, it would require a bit more memory and CPU cycles, and you are free to name your values as you want. I think it is really a question about the patterns we try to build and distribute. Maybe I can add a rewrite mechanism to pdbtool that would rename the pattern names to numbered value names. This way we can publish patterns with meaningful names and anyone can later rename them to numbered names. Would it fit your needs?
> My back-end script does a bit of magic with IPv4 char -> uint parsing for better DB storage. (If anyone at Balabit would like to toss in a little feature for easy outputting as inet_aton/inet_ntoa from socket.h, that would be cool!) So, if I'm looking for all denied packets from IP address 1.1.1.1, I would search my DB where class_id=2 and i1=INET_ATON("1.1.1.1").
I also have plans to store parsed values as different data types rather than always as strings. IP addresses and numbers are very good candidates for this. I put it on my todo list. :)
Thanks for your comments, I really appreciate them.
best,
Marton
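The integer-ID workaround mentioned in this reply (adding a <value name="my_id">42</value> to each pattern) could be applied to a whole pattern file with a small script. The following is a sketch only: the function name, the sample XML, and the placeholder IDs are invented for illustration, and it assumes that <values> sits directly inside each <rule> element.

```python
import xml.etree.ElementTree as ET

# Invented sample; the id attributes stand in for real UUIDs.
SAMPLE = """<ruleset name="example" id="ruleset-id-placeholder">
  <rules>
    <rule provider="local" class="system" id="rule-id-placeholder-1">
      <patterns><pattern>connection from @IPv4:source_ip:@</pattern></patterns>
    </rule>
    <rule provider="local" class="system" id="rule-id-placeholder-2">
      <patterns><pattern>session closed</pattern></patterns>
      <values/>
    </rule>
  </rules>
</ruleset>"""


def assign_integer_ids(xml_text: str, value_name: str = "my_id", start: int = 1) -> str:
    """Give every <rule> a <value name=value_name> holding a sequential integer."""
    root = ET.fromstring(xml_text)
    next_id = start
    for rule in root.iter("rule"):
        values = rule.find("values")
        if values is None:  # an empty <values/> is falsy, so test against None
            values = ET.SubElement(rule, "values")
        # Leave rules alone that already carry the value.
        if any(v.get("name") == value_name for v in values.findall("value")):
            continue
        value = ET.SubElement(values, "value", {"name": value_name})
        value.text = str(next_id)
        next_id += 1
    return ET.tostring(root, encoding="unicode")


updated = assign_integer_ids(SAMPLE)
```

The rule's own `id` attribute is left untouched, so the globally unique ID and the compact local integer can coexist.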
On Fri, 2009-12-18 at 17:39 +0100, ILLES, Marton wrote:
> You can also combine these two methods: use meaningful names in the patterns and use <values> to assign them to numbered values, like this:
> <value name="s1">${.dict.source_ip}</value>
> Of course, it would require a bit more memory and CPU cycles, and you are free to name your values as you want. I think it is really a question about the patterns we try to build and distribute. Maybe I can add a rewrite mechanism to pdbtool that would rename the pattern names to numbered value names. This way we can publish patterns with meaningful names and anyone can later rename them to numbered names. Would it fit your needs?
I guess it'd be simpler to reuse the numbered "match" support in syslog-ng, the same one the regexps use. You can reference the matches as $1 .. $255 and they are quite simple to use. I almost created a patch, but in the end I didn't.
With the new NVTable code, it could even use the same memory and store only a reference:
log_msg_set_match_indirect(msg, index, ...)
-- Bazsi
On Fri, 2009-12-18 at 17:44 +0100, Balazs Scheidler wrote:
> I guess it'd be simpler to reuse the numbered "match" support in syslog-ng, the same one the regexps use. You can reference the matches as $1 .. $255 and they are quite simple to use. I almost created a patch, but in the end I didn't.
> With the new NVTable code, it could even use the same memory and store only a reference:
> log_msg_set_match_indirect(msg, index, ...)
True, but I think it would also make sense to use numbered names that also distinguish between types, as that is important for SQL tables. So: names that are both numbered and typed. With NVTable using references, it would require only a little more overhead.
M
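The idea of publishing patterns with meaningful names and rewriting them to numbered, typed names could look roughly like the sketch below. Nothing here is an existing pdbtool feature: the function, the regular expression, and the choice of which parser types count as integers are all assumptions for illustration. The simple regex expects parsers separated by literal text or single characters and does not handle escaped "@@" or parsers placed directly next to each other.

```python
import re

# Assumption: values from these parser types are stored as integers,
# everything else as strings.
INTEGER_TYPES = {"NUMBER", "IPv4"}

# Matches @TYPE:name:parameter@ where the name is not empty.
PARSER_RE = re.compile(
    r"@(?P<type>[A-Za-z0-9]+):(?P<name>[A-Za-z0-9_.]+):(?P<param>[^@]*)@"
)


def renumber(pattern: str):
    """Rename parser names to i0, i1, ... / s0, s1, ... and return the mapping."""
    counters = {"i": 0, "s": 0}
    mapping = {}

    def replace(match):
        prefix = "i" if match.group("type") in INTEGER_TYPES else "s"
        new_name = f"{prefix}{counters[prefix]}"
        counters[prefix] += 1
        mapping[match.group("name")] = new_name
        return f"@{match.group('type')}:{new_name}:{match.group('param')}@"

    return PARSER_RE.sub(replace, pattern), mapping


# Invented pattern with meaningful names, loosely modelled on a firewall deny message.
named = ("Deny @QSTRING:protocol: @ src @IPv4:source_ip:@/@NUMBER:source_port:@"
         " dst @IPv4:dest_ip:@/@NUMBER:dest_port:@")
numbered, mapping = renumber(named)
```

The returned mapping records which meaningful name ended up in which numbered slot, which is the information a client needs later to query the generic columns.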
> The <values></values> element can be used to specify additional values that you want to set but that do not appear in the message itself. For example, suppose you want to classify login messages, and a certain message does not contain the username, but you know that it always reports a specific user. In this case you can use <values> to assign the .dict.username variable (for example) to that specific user, and later you can be sure that it exists.
> I am still not sure if I completely understand your suggestion...
Oh right, I completely forgot that you added the values system after the tag system and that the empty <values/> tags were to indicate that no values were being added.
> The reason for using UUIDs was to be able to provide globally unique IDs; simple integers would be hard to maintain. I was also thinking of using OIDs for IDs, but UUIDs were an easier choice. Technically you can use simple integers or any other string, as syslog-ng currently does not check the ID. I will think about it... :)
Yes, it is true that UUIDs would be easier for global community purposes; it's just an awfully large value to store as per-message overhead.
> Using integers would also be better for DB indexing purposes. If you want to use integers, you can assign a <value name="my_id">42</value> to each pattern as a workaround and later use "my_id" in your templates.
That's a good idea and would probably fit my needs just fine.
> I prefer using more meaningful names, because this way you can normalize your logs: it won't matter whether it is a PIX, iptables, etc. log message, you can always refer to the source/destination address by its name. It requires storing different types of logs in different tables, but you can freely change your application without changing your log processing scripts.
If you are using multiple tables, then it is most certainly better to normalize the names as you've done. My app is large enough that I was concerned about open file limits in the database with too many different tables. Specifically, if you are logging 1000 possible classes, each with its own output variables, then you would need 1000 tables x table rotation (if any). On MySQL, this means 3 open files per table, for at least 3000 open files. Even if the DB and the OS can handle that many open files, you still incur a fair amount of overhead when a query accesses a table that is not in the open table cache. Additionally, it might make the client code much more difficult to write because you have variable column names. I suppose it wouldn't be too bad to have a directory lookup for the column names to dynamically build your SQL, but I was trying to simplify the database as much as possible at the expense of making the patterns a bit more complex.
As a side note, your method would be much more appropriate for inserting into emerging hash-style databases like TokuDB, Hypertable, TokyoCabinet, and MongoDB, or even document-based databases like CouchDB. The problem with such methods currently is that the insertion rate is fairly slow for a busy syslog server (anything over 10k messages/sec).
> You can also combine these two methods: use meaningful names in the patterns and use <values> to assign them to numbered values, like this:
> <value name="s1">${.dict.source_ip}</value>
This is an excellent idea and I will probably move to something like it.
> Of course, it would require a bit more memory and CPU cycles, and you are free to name your values as you want. I think it is really a question about the patterns we try to build and distribute. Maybe I can add a rewrite mechanism to pdbtool that would rename the pattern names to numbered value names. This way we can publish patterns with meaningful names and anyone can later rename them to numbered names. Would it fit your needs?
I think the most valuable thing to put in the published patterns is an XML attribute or element designating each value as a string or an int. Then users can decide how they want to handle the values and optimize their storage schema accordingly.
> I also have plans to store parsed values as different data types rather than always as strings. IP addresses and numbers are very good candidates for this. I put it on my todo list. :)
Awesome! Thanks for all of your hard work on this.
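The "directory lookup" for column names mentioned in the message above can be sketched as a small mapping from pattern class to generic columns, used to build SQL dynamically. This is a hypothetical illustration: the directory contents, the table name `logs`, and the function name are assumptions, with the class 2 field positions read off the FWSM deny pattern earlier in the thread (i1 source address, i3 destination address, s2 access group).

```python
# Hypothetical directory: class ID -> {meaningful field name: generic column}.
FIELD_DIRECTORY = {
    2: {  # FWSM deny
        "source_ip": "i1",
        "source_port": "i2",
        "dest_ip": "i3",
        "dest_port": "i4",
        "access_group": "s2",
    },
}


def build_query(class_id: int, **conditions):
    """Translate meaningful field names into generic columns for one class.

    Column names come only from FIELD_DIRECTORY, never from user input;
    values are passed as bound parameters.
    """
    columns = FIELD_DIRECTORY[class_id]
    clauses = ["class_id = ?"]
    params = [class_id]
    for field, value in conditions.items():
        clauses.append(f"{columns[field]} = ?")  # KeyError on an unknown field
        params.append(value)
    sql = "SELECT msg FROM logs WHERE " + " AND ".join(clauses)
    return sql, params


# 16843009 is 1.1.1.1 as an unsigned 32-bit integer.
sql, params = build_query(2, source_ip=16843009)
```

With such a directory, client code can keep using meaningful names while the database keeps one table with generic columns, instead of one table (and on MySQL roughly three open files) per class.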
participants (3)
- Balazs Scheidler
- ILLES, Marton
- Martin Holste