[announce] patterndb project
Hi,

By now probably most of you know about patterndb, a powerful framework in syslog-ng that lets you extract structured information from log messages and perform classification at high speed:

http://www.balabit.com/dl/html/syslog-ng-ose-v3.1-guide-admin-en.html/concep...

Until now, syslog-ng offered the feature, but no release-quality patterns were produced by the syslog-ng developers. Some samples based on the logcheck database were created, but otherwise every syslog-ng user had to create her samples manually, possibly repeating work performed by others.

Since this calls out to be a community project, I'm hereby starting one.

Goals
=====

Create release-quality pattern databases that can simply be deployed to an existing syslog-ng installation. The goal of the patterns is to extract structured information from the free-form syslog messages, e.g. create name-value pairs based on the syslog message.

Since the key factor when doing something like this is the naming of fields, we're going to create generic naming guidelines that can be applied to any application in the industry.

It is not our goal to implement correlation or any other advanced form of analysis, although we feel that with the results of this project, event correlation and analysis can be performed much more easily than without it.

Related projects
================

I know there are other efforts in the field; why not simply join them?

CEF - the log message format of a proprietary log analysis engine, primarily meant to hold IP security device logs (firewalls, IPSs, virus gateways etc). The patterndb project aims to create patterns for a wider range of device logs and to be more generic in its approach. On the other hand, we feel it might be useful to create a solution for converting db-parser output to the CEF format.

CEE - the Common Event Expression project by Mitre focuses on creating a name-value pair dictionary for all kinds of devices/log messages out there. I might be missing something, but I haven't found concrete results so far, apart from a nice-looking white paper. If CEE delivers something, patterndb would probably adopt its naming/taxonomy structure. But I guess not all devices will start logging in the shiny new format, so the logs of existing devices would still need to be converted, and the patterndb work wouldn't be wasted.

Infrastructure
==============

Our original patterndb-related plan was to create an easy-to-use web-based interface for editing patterns, but since that project is progressing slowly, I'm calling for a minimalist approach: git-based version control of simple plain text files. Of course, once the nice web-based interface is finished, we're going to be ready to use it.

First steps
===========

I have created a git repository at:

http://git.balabit.hu/bazsi/syslog-ng-patterndb.git

This contains the initial version of the naming policy document, a simple SIEM-style schema, and a user login-logout naming schema.

If you are interested, please read the file README.txt in the git archive, or if you prefer a web browser, use this link:

http://git.balabit.hu/?p=bazsi/syslog-ng-patterndb.git;a=blob;f=README.txt;h...

Licensing
=========

I do not have a decision yet, but this is certainly going to use one of the open source licenses or Creative Commons. Let me know if you have a preference in this area.

Getting involved
================

Join the syslog-ng mailing list and start discussing! If you have existing patterns, great. If you don't, it is not too late to join.

http://lists.balabit.hu/mailman/listinfo/syslog-ng

-- Bazsi
This is awesome. As I've written about previously, I've used the pattern-db enough to know how powerful and efficient it is, and I am doing all my logging with it. My main use is for log classification and field parsing, which normalizes logs down to something that can easily be put in a database. The classification helps not only with quickly identifying types of logs, but also with higher-level ideas like log retention (so I archive important logs) and permissions (so people like web developers can have access to certain logs). The field parsing is great for things like Snort and firewall logs, as well as web server logs.

If you use a NoSQL-style database, such as MongoDB or CouchDB, you don't have to worry about fitting fields into a rigid schema since there is no concept of "columns." That works out great for pattern-db because you can specify any field/value pairs in the pattern and then have Mongo write it as-is, so that some records will be {_id:1, program:"snort", srcip:x.x.x.x} and others will be {_id:2, program:"sendmail", to_address:"person@example.com"}. The key is that you don't have to know ahead of time what fields you will be parsing in order to design a db schema. That means when new patterns are released, the fields can be named anything without breaking your schema.

My initial concern with the format of the pattern-db XML is the CLSID-style IDs. I understand the advantages of CLSIDs, but it is very expensive to create database indexes on them because of their enormous length. I would prefer to have an integer ID in the pattern XML somewhere. Other opinions?

On Fri, Jun 25, 2010 at 10:23 AM, Balazs Scheidler <bazsi@balabit.hu> wrote:
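The schemaless idea above can be illustrated with a minimal sketch (plain dicts standing in for MongoDB documents; the field names are taken from the example records in the message, the helper function is hypothetical):

```python
# Two parsed log records with entirely different field sets, exactly as a
# schemaless store would accept them -- no shared "columns" required.
records = [
    {"_id": 1, "program": "snort", "srcip": "10.0.0.1"},
    {"_id": 2, "program": "sendmail", "to_address": "person@example.com"},
]

def fields_used(recs):
    """Collect every field name seen across all records."""
    names = set()
    for r in recs:
        names.update(r)
    return names

# New patterns can introduce new fields without any schema migration.
print(sorted(fields_used(records)))
```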
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
Hello,

On 2010-06-29 17:11, Martin Holste wrote:
My initial concern with the format of the pattern-db XML is with the CLSID-style ID's. I understand the advantages of CLSID's, but it is very expensive to create database indexes on them because of their enormous length. I would prefer to have an integer ID in the pattern XML somewhere. Other opinions?
Well, the current solution is the only guarantee that the IDs are unique. In my own rules I use a different naming for IDs, to make them more human-readable. I use a combination of my nickname, the program name, and a number. For example:

<ruleset name='sshd' id='czp-sshd'>
  <rule provider='CzP' id='czp-sshd-1' class='violation'>
  <rule provider='CzP' id='czp-sshd-2' class='system'>

This is way shorter than the IDs in the sample database. And when used in a config file, it is a lot easier to read. Of course, it is far from perfect, but a lot more convenient.

Bye,
CzP
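The nick-program-number convention above can be checked mechanically; here is a rough sketch (the regex and function name are made up for illustration, and multi-dash program names are not handled):

```python
import re

# Hypothetical validator for IDs like 'czp-sshd-1': nickname, program
# name, then a sequence number, separated by dashes.
ID_RE = re.compile(r"^(?P<nick>[a-z0-9]+)-(?P<program>[a-z0-9_.]+)-(?P<num>\d+)$")

def parse_rule_id(rule_id):
    """Split a human-readable rule ID into (nick, program, number)."""
    m = ID_RE.match(rule_id)
    if not m:
        raise ValueError("rule id does not follow nick-program-number: %r" % rule_id)
    return m.group("nick"), m.group("program"), int(m.group("num"))

print(parse_rule_id("czp-sshd-1"))  # ('czp', 'sshd', 1)
```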
I agree it's really nice to have those kinds of attributes in there. Maybe what I'm talking about then is a serial number in addition to the CLSID, and in addition to whatever human-readable name. So something like:

<rule provider='CzP' class='violation' name='czp-sshd-1' id=...CLSID... serial=1234567890>

So you could use the name attribute for the human-readable part, keep the IDs the way they currently are, and have a serial number for indexing.

On Tue, Jun 29, 2010 at 12:08 PM, Peter Czanik <czanik@balabit.hu> wrote:
On Tue, 2010-06-29 at 10:11 -0500, Martin Holste wrote:
This is awesome. As I've written about previously, I've used the pattern-db enough to know how powerful and efficient it is, and I am doing all my logging with it. My main use is for log classification and field parsing, which normalizes logs down to something that can easily be put in a database. The classification helps not only with quickly identifying types of logs, but also with higher-level ideas like log retention (so I archive important logs) and permissions (so people like web developers can have access to certain logs). The field parsing is great for things like Snort and firewall logs, as well as web server logs.
If you use a NoSQL-style database, such as MongoDB or CouchDB, you don't have to worry about fitting fields into a rigid schema since there is no concept of "columns." That works out great for pattern-db because you can specify any field/value pairs in the pattern and then have Mongo write it as-is, so that some records will be {_id:1, program:"snort", srcip:x.x.x.x} and others will be {_id:2, program:"sendmail", to_address:"person@example.com"}. The key is that you don't have to know ahead of time what fields you will be parsing in order to design a db schema. That means when new patterns are released, the fields can be named anything without breaking your schema.
Great to know. I noted MongoDB/CouchDB as a possible project for plugin development (hint: see the syslog-ng OSE 3.2 tree). This could perhaps be an alternative to my schema-based SQL destination (on the current roadmap).
My initial concern with the format of the pattern-db XML is with the CLSID-style ID's. I understand the advantages of CLSID's, but it is very expensive to create database indexes on them because of their enormous length. I would prefer to have an integer ID in the pattern XML somewhere. Other opinions?
I'm not too attached to UUIDs (I guess that's what you mean by CLSID), but if we are not using something like a UUID, then we need a central place to administer the IDs. Do you think that's acceptable?
-- Bazsi
Cool, I'll have a look at the OSE 3.2 roadmap.

I should note that while I've done extensive testing in MongoDB, I'm currently using MySQL and a standard SQL schema for production. The main reason is speed, though I expect MongoDB to catch up eventually. CouchDB is extremely slow, comparatively, for sustained inserts, and I doubt it will ever be a viable option for high-performance logging. At any rate, a SQL schema would be fine with me.

Yes, I mean UUID when I say CLSID. I think that requiring a central place to administer the IDs is actually a strength, not a weakness, because it encourages collaboration and peer review. Getting an ID means that the signature has been vetted. The EmergingThreats.net Snort signatures are born of such a process and are much stronger because of the open discussion, debate, and peer review.

On Wed, Jun 30, 2010 at 10:31 AM, Balazs Scheidler <bazsi@balabit.hu> wrote:
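The index-size concern behind the UUID-vs-integer debate is easy to quantify; a quick sketch (standard library only):

```python
import struct
import uuid

# A textual UUID key costs 36 bytes per index entry (32 hex digits plus
# 4 dashes), while a 32-bit serial number fits in 4 bytes.
u = str(uuid.uuid4())
serial = struct.pack(">I", 1234567890)  # big-endian unsigned 32-bit int

print(len(u), len(serial))  # 36 4
```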
On Wed, 2010-06-30 at 21:13 -0500, Martin Holste wrote:
Cool, I'll have a look at the OSE 3.2 roadmap.
I should note that while I've done extensive testing in MongoDB, I'm currently using MySQL and a standard SQL schema for production. The main reason is speed, though I expect MongoDB to catch up eventually. CouchDB is extremely slow, comparatively, for sustained inserts, and I doubt it will ever be a viable option for high-performance logging. At any rate, a SQL schema would be fine with me.
Yes, I mean UUID when I say CLSID. I think that requiring a central place to administer the ID's is actually a strength, not a weakness, because it encourages collaboration and peer review. By getting an ID, it means that the signature has been vetted. The EmergingThreats.net Snort signatures are borne from such a process and are much stronger because of the open discussion, debate, and peer review.
I understand, and I guess we could create a policy that makes it possible to create a private ID space (similar to private IP addresses), which is guaranteed not to collide with "official" IDs.

What about:

application-name[@provider.tld]

* official samples would only contain "application-name"
* private samples would have their domain name appended

For instance, the official ID for OpenSSH log patterns would be:

opensshd

Whereas if you wanted to create your own samples for application foo, that would look like:

foo@balabit.com

What do you think?

-- Bazsi
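The application-name[@provider.tld] scheme above can be sketched as a tiny parser (the function name is made up; official IDs simply lack a provider part):

```python
def classify_pattern_id(pattern_id):
    """Split an 'application-name[@provider.tld]' style ID into
    (application, provider); provider is None for official IDs."""
    app, sep, provider = pattern_id.partition("@")
    return app, (provider if sep else None)

print(classify_pattern_id("opensshd"))       # ('opensshd', None)
print(classify_pattern_id("foo@balabit.com"))  # ('foo', 'balabit.com')
```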
Shouldn't that go under the provider attribute? My point with IDs vs UUIDs was that I prefer a numeric ID. Just as with IP space, we could provide a "number space" for local signatures. For instance, 0 through 2,000,000,000 would be public space, and 2,000,000,000 through 2^32 would be private space.

I think the "opensshd" component would be assigned to the "name" attribute, or something similar, or maybe would be the "class" attribute.

On Thu, Jul 1, 2010 at 5:53 AM, Balazs Scheidler <bazsi@balabit.hu> wrote:
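The IP-address-like number split proposed above could look like this as code (the boundary and function are the message's suggestion, not an adopted convention):

```python
PRIVATE_START = 2_000_000_000   # boundary suggested in the message above
MAX_ID = 2**32 - 1              # 32-bit serial numbers

def id_space(rule_id):
    """Classify a numeric rule ID as 'public' or 'private'."""
    if not 0 <= rule_id <= MAX_ID:
        raise ValueError("rule id out of 32-bit range")
    return "private" if rule_id >= PRIVATE_START else "public"

print(id_space(1234567890))      # public
print(id_space(3_000_000_000))   # private
```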
On Thu, 2010-07-01 at 11:03 -0500, Martin Holste wrote:
Shouldn't that go under the provider attribute? My point with the ID's vs UUID was that I prefer a numeric ID. Just as with IP space, we could provide a "number space" for local signatures. For instance, 0 through 2,000,000,000 would be public space, and 2,000,000,000 through 2^32 would be private space.
I think the "opensshd" component would be assigned to the "name" attribute, or something similar, or maybe would be the "class" attribute.
Let me think this through, and also discuss with the guys who originally designed the XML format, and come up with a consistent recommendation on IDs.

Any other comments on the "patterndb" policy document at

http://git.balabit.hu/?p=bazsi/syslog-ng-patterndb.git;a=blob;f=README.txt;h...

Perhaps about the two schemas I've described at the same location in SCHEMAS.txt?

-- Bazsi
I've read through them and I think they're definitely on the right track. One thing that might be good to consider would be a way to store hierarchical information. For example, the secevt class in the schema doc is really a network class, as it only requires the network tuple. So, you could have a hierarchy like this:

class Net {
  required_fields: [ proto, srcip, dstip, srcport, dstport ],
  optional_fields: [ in_iface, out_iface, details ]
}
class NAT {
  required_fields: [ nat_srcip, nat_dstip, nat_srcport, nat_dstport ]
}
class Security {
  required_fields: [ verdict ],
  optional_fields: [ zone ]
}

This implies that the class Net.NAT.Security requires proto, srcip, dstip, srcport, dstport, nat_srcip, nat_dstip, nat_srcport, nat_dstport, and verdict, but Net.Security only requires proto, srcip, dstip, srcport, dstport, and verdict.

By that token, there's no difference between Security.NAT.Net and Net.NAT.Security, so these are really more like tags than a hierarchy.

On Thu, Jul 1, 2010 at 3:23 PM, Balazs Scheidler <bazsi@balabit.hu> wrote:
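The union-of-required-fields semantics described above can be sketched in a few lines (class contents copied from the message; the helper is illustrative):

```python
# Each class contributes required (and optional) fields; a combined class
# like Net.NAT.Security is just the union of its parts.
CLASSES = {
    "Net": {"required": {"proto", "srcip", "dstip", "srcport", "dstport"},
            "optional": {"in_iface", "out_iface", "details"}},
    "NAT": {"required": {"nat_srcip", "nat_dstip", "nat_srcport", "nat_dstport"},
            "optional": set()},
    "Security": {"required": {"verdict"}, "optional": {"zone"}},
}

def required_fields(*class_names):
    """Union of required fields over the named classes."""
    fields = set()
    for name in class_names:
        fields |= CLASSES[name]["required"]
    return fields

# Order does not matter, which is why these behave like tags:
assert required_fields("Net", "NAT", "Security") == \
       required_fields("Security", "NAT", "Net")
```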
On Thu, 2010-07-01 at 17:35 -0500, Martin Holste wrote:
I've read through them and I think they're definitely on the right track. One thing that might be good to consider would be a way to store hierarchical information. For example, the secevt class in the schema doc is really a network class as it only requires the network tuple. So, you could have a hierarchy like this:
class Net {
  required_fields: [ proto, srcip, dstip, srcport, dstport ],
  optional_fields: [ in_iface, out_iface, details ]
}
class NAT {
  required_fields: [ nat_srcip, nat_dstip, nat_srcport, nat_dstport ]
}
class Security {
  required_fields: [ verdict ],
  optional_fields: [ zone ]
}
This implies that the class Net.NAT.Security requires proto, srcip, dstip, srcport, dstport, nat_srcip, nat_dstip, nat_srcport, nat_dstport, and verdict, but Net.Security only requires proto, srcip, dstip, srcport, dstport, and verdict.
By that token, there's no difference between Security.NAT.Net and Net.NAT.Security, so these are really more like tags than a hierarchy.
Hmm... interesting idea, making a hierarchy of schemas. Again, some more food for thought.

Is there a reason you want this behaviour implied, rather than explicit? Maybe using a different syntax would make it more obvious: when combining schemas, instead of using '.' as a separator, use '+':

Net+NAT+Security

This would make it obvious that you can write the schema names in any order (since this is true for the mathematical '+' sign everyone knows).

Also, how would you represent this in a pattern? Right now you can assign multiple tags and any number of name-value pairs. Combining these in the way you described would only be needed if someone wants to use the same SQL/CSV table with the nonexistent columns skipped. Or?

-- Bazsi
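Since the '+' combination above is order-independent, a rule's schema tags act as a set, and a rule can be validated by checking that it supplies the union of the tags' required fields. A small sketch (schema contents follow the Net/NAT/Security example; the validator itself is hypothetical):

```python
# Required fields per schema, as in the earlier example.
SCHEMAS = {
    "Net": {"proto", "srcip", "dstip", "srcport", "dstport"},
    "NAT": {"nat_srcip", "nat_dstip", "nat_srcport", "nat_dstport"},
    "Security": {"verdict"},
}

def missing_fields(extracted, tags):
    """Return required fields the extracted name-value pairs do not cover."""
    required = set().union(*(SCHEMAS[t] for t in tags))
    return required - set(extracted)

record = {"proto": "tcp", "srcip": "10.0.0.1", "dstip": "192.0.2.7",
          "srcport": "54321", "dstport": "443", "verdict": "DENY"}
assert missing_fields(record, ["Net", "Security"]) == set()
# Adding the NAT tag without NAT fields flags exactly what's missing:
assert missing_fields(record, ["Net", "NAT", "Security"]) == SCHEMAS["NAT"]
```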
I prefer the dot notation just because it's what I'm used to. However, an XML schema could represent this as repeated child elements, like: <rule><class>Net</class><class>NAT</class><class>Security</class></rule>. A user would see these three classes listed and know that the respective required fields exist as name/value pairs within the pattern. Likewise, an author would only be able to put class="Net" if his or her pattern does in fact provide name/value extractions for the "Net" tuple. That provides the guidance needed for deciding how to classify the patterns.

I'm not sure if there would be any effective difference between a "class" element and the existing tag element, so maybe it's just a matter of stipulating that contributors need to appropriately tag their signatures with the correct classes inherent within them. In fact, it probably wouldn't be hard at all to write a quick script to auto-tag signatures as they are submitted, based on the name/value pairs provided in the signature. So the only real thing a contributor would need to be aware of would be the official terms to use for the names, e.g. standardizing on "srcport" versus "source_port."

So, that means that the community would be responsible for:

1. Creating a standard list of names to use, adhering to the data type contained within (strings, ints, etc.).
2. Creating a convention for which names are required (and optional) for which classes or tags.
3. Maintaining the officially approved and vetted list of signatures that adhere to the above conventions.

This is basically what you've already stated you want to do, right?

One of the nice things about XML is that you can create schema definition files (XSDs) which can validate a given XML file. So, the output of the naming conventions could be an XSD file distributed with Syslog-NG so that end users can quickly verify signatures before they submit them.

On Sat, Jul 3, 2010 at 6:32 AM, Balazs Scheidler <bazsi@balabit.hu> wrote:
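The auto-tagging script suggested above could look roughly like this. This is not a real XSD validator, just a simplified stand-in: it pulls the field names out of a rule's patterns and derives tags from them. The field-to-tag mapping and the sample rule are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from extracted field names to schema tags.
FIELD_TO_TAG = {"srcip": "Net", "dstip": "Net", "verdict": "Security"}

RULE = """<rule id='example-1'>
  <patterns><pattern>DENY @IPv4:srcip@ -> @IPv4:dstip@</pattern></patterns>
</rule>"""

def auto_tags(rule_xml):
    """Derive tags from the @PARSER:name@ fields a rule's patterns extract."""
    root = ET.fromstring(rule_xml)
    tags = set()
    for pat in root.iter("pattern"):
        for name in re.findall(r"@[A-Za-z0-9]+:([^@:]+)@", pat.text or ""):
            tag = FIELD_TO_TAG.get(name)
            if tag:
                tags.add(tag)
    return tags

print(auto_tags(RULE))  # {'Net'}
```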
Hi,

On Sun, 2010-07-04 at 11:27 -0500, Martin Holste wrote:
I prefer the dot notation just because it's what I'm used to. However, an XML schema could represent this as repeated child elements, like: <rule><class>Net</class><class>NAT</class><class>Security</class></rule>. A user would see these three classes listed and know that the respective required fields exist as name/value pairs within the pattern. Likewise, an author would only be able to put class="Net" if his or her pattern does in fact provide name/value extractions for the "Net" tuple. That provides the guidance needed for deciding how to classify the patterns. I'm not sure if there would be any effective difference between a "class" element and the existing tag element, so maybe it's just a matter of stipulating that contributors need to appropriately tag their signatures with the correct classes inherent within them.
Maybe I'm missing something, but as I see it, the current "tags" function present in patterndb v3 (e.g. syslog-ng OSE 3.1 or later) is exactly what you describe. Since an example is worth a thousand words, here is an untested pattern covering an SSH login event, converting values into the proposed usracct schema:

  <rule id="..." class="system">
    <patterns>
      <pattern>Accepted @STRING:usracct.authmethod@ for @STRING:usracct.username@ from @IPv4:temp.src_ip@ port @NUMBER:temp.src_port@ @STRING:usracct.service@</pattern>
    </patterns>
    <values>
      <value name="usracct.type">login</value>
      <value name="usracct.sessionid">$PID</value>
      <value name="usracct.application">$PROGRAM</value>
      <value name="usracct.device">${temp.src_ip}:${temp.src_port}</value>
    </values>
    <tags>
      <tag>usracct</tag>
    </tags>
  </rule>

If I understand you correctly, you were referring to the "class" attribute of the rule element, and want to extend that. The way I see it, the "tags" feature is far superior to using classes, so maybe a deprecation of the class attribute would be needed. For example, a theoretical v4 format:

  <rule id="...">
    <patterns>
      <pattern>Accepted @STRING:usracct.authmethod@ for @STRING:usracct.username@ from @IPv4:temp.src_ip@ port @NUMBER:temp.src_port@ @STRING:usracct.service@</pattern>
    </patterns>
    <values>
      <value name="usracct.type">login</value>
      <value name="usracct.sessionid">$PID</value>
      <value name="usracct.application">$PROGRAM</value>
      <value name="usracct.device">${temp.src_ip}:${temp.src_port}</value>
    </values>
    <tags>
      <tag>usracct</tag>
      <!-- here's the only change: the class attribute became a specially named tag -->
      <tag>class.system</tag>
    </tags>
  </rule>

But anyway, the idea of splitting complex schemas into smaller, combinable elements is a great one. Splitting the current "secevt" schema into three separate schemas, Net, Security and NAT, and letting the user combine them as needed sounds good. Example:

  <rule id="...">
    <patterns>
      <pattern>... packet filter log, with NAT and verdict </pattern>
    </patterns>
    <values>
      ...
    </values>
    <tags>
      <tag>Net</tag>
      <tag>NAT</tag>
      <tag>Security</tag>
    </tags>
  </rule>

But this is already possible with v3.1. The only problem with using three tags instead of one is how to store the extracted information in a way that it can be combined later. The logical method for storing tagged data with a set of NV pairs is to put them in a properly structured SQL table. E.g. with the three tags above you'd get three tables: one for the Net fields, one for NAT and another for Security, which makes a problem obvious: it is one message after all, and quite possibly when you want to create a report you'd need to query the database with the following question:

  * please give me records that have all three tags, with all of their fields combined.

E.g. if these are indeed stored in three tables, you have to join them, possibly using a unique message identifier. For example:

  SELECT * FROM Net, NAT, Security
   WHERE Net.MSGID=NAT.MSGID AND Net.MSGID=Security.MSGID;

And voila, you have your log message. Of course using a non-SQL database could make this even simpler, or with a handcrafted sql() destination you could put all these fields in the same table. (My aim is to create a generic SQL destination, in which case you don't have to care how tables are laid out.) The only missing bit here is that right now syslog-ng is unable to generate a unique message ID on its own, but that's not very difficult to add.

What do you think? Based on this idea, I'm proposing to split the current secevt schema into three smaller ones: flowevt, natevt and secevt. Please check the git archive where I've pushed the current version. Also, if this is something we can agree on, I'll add some information about this to the "policy" document.
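To make the per-schema storage and recombination concrete, here is a small sketch using an in-memory SQLite database. The table and column names (src_ip, xlated_ip, verdict) are illustrative, not part of any proposed schema; the point is only the join on a shared message ID:

```python
import sqlite3

# One message lands in three tables (Net, NAT, Security), tied together
# by a shared MSGID; joining on it recombines the full event.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Net (MSGID TEXT, src_ip TEXT, dst_ip TEXT)")
cur.execute("CREATE TABLE NAT (MSGID TEXT, xlated_ip TEXT)")
cur.execute("CREATE TABLE Security (MSGID TEXT, verdict TEXT)")

# One packet-filter event, split across the three schema tables.
cur.execute("INSERT INTO Net VALUES ('msg-1', '10.0.0.5', '192.0.2.9')")
cur.execute("INSERT INTO NAT VALUES ('msg-1', '198.51.100.7')")
cur.execute("INSERT INTO Security VALUES ('msg-1', 'DROP')")

# Recombine the event: join the tables on the shared message ID.
row = cur.execute(
    "SELECT Net.src_ip, NAT.xlated_ip, Security.verdict "
    "FROM Net, NAT, Security "
    "WHERE Net.MSGID = NAT.MSGID AND Net.MSGID = Security.MSGID"
).fetchone()
print(row)  # all fields of the one message, combined
```

This also shows why a unique, syslog-ng-generated message ID matters: without it, there is nothing to join on.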
In fact, it probably wouldn't be hard at all to write a quick script to auto-tag signatures as they are submitted, based on the name/value pairs provided in the signature. So the only real thing a contributor would need to be aware of would be the official terms to use for the names, e.g. standardizing on "srcport" versus "source_port."
So, that means that the community would be responsible for: 1. Creating a standard list of names to use, adhering to the data type contained within (strings, ints, etc.).
yes.
2. Create a convention for which names are required (and optional) for which classes or tags.
Yes. But please also note that CEE is doing something similar; in the absence of anything concrete, I'd start using our own set of name-value pairs, and if CEE produces something, it'd be a simple search-and-replace to switch to the "official" names.
3. Maintain the officially approved and vetted list of signatures that adhere to the above conventions.
yes.
This is basically what you've already stated you want to do, right?
One of the nice things about XML is that you can create schema definition files (XSD's) which can validate a given XML file. So, the output of the naming conventions could be an XSD file that can be distributed with Syslog-NG so that end users can quickly verify signatures before they submit them.
There's one such schema in the syslog-ng source tree, in the directory doc/xsd right now. -- Bazsi
Maybe I'm missing something, but as I see it, the current "tags" function present in patterndb v3 (e.g. syslog-ng OSE 3.1 or later) is exactly what you describe.
Yes, precisely! I got a paragraph into my response and realized that, so I stated that I wasn't sure if there was any difference between a class element and a tag element. Thinking about that more, I don't see enough value in the formality of declaring a class element above and beyond what the tag element already accomplishes, as long as the XSD can prove the proper conformity.
If I understand you correctly, you were referring to the "class" attribute of the rule element, and want to extend that. The way I see it, the "tags" feature is far superior to using classes, so maybe a deprecation of the class attribute would be needed.
Agreed.
But this is already possible with v3.1. The only problem with using three tags instead of one, is how to store the extracted information in a way that it can be combined later.
I'll show you a schema that I've been using which compromises between flexibility and formalism. Here's my Snort parser:

  <ruleset name="snort" id='8'>
    <pattern>snort</pattern>
    <rules>
      <rule provider="LOCAL" class='8' id='8'>
        <patterns>
          <pattern>@QSTRING:s0:[]@ @ESTRING:s1:[@Classification:@QSTRING:s2: ]@ [Priority: @NUMBER:i0:@]: @QSTRING:i1:{}@ @IPv4:i2:@:@NUMBER:i3:@ -> @IPv4:i4:@:@NUMBER:i5:@</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>

And here's my MySQL table schema:

  CREATE TABLE `syslogs_template` (
    `id` BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    `timestamp` INT UNSIGNED NOT NULL DEFAULT 0,
    `host_id` INT UNSIGNED NOT NULL DEFAULT '1',
    `program_id` INT UNSIGNED NOT NULL DEFAULT '1',
    `class_id` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
    `rule_id` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
    msg TEXT,
    i0 INT UNSIGNED, i1 INT UNSIGNED, i2 INT UNSIGNED,
    i3 INT UNSIGNED, i4 INT UNSIGNED, i5 INT UNSIGNED,
    s0 VARCHAR(255), s1 VARCHAR(255), s2 VARCHAR(255),
    s3 VARCHAR(255), s4 VARCHAR(255), s5 VARCHAR(255)
  ) ENGINE=MyISAM;

So what I've done is trade the ability to create as many fields as I want, and human readability, for the guarantee of only inserting into a single destination table. In my schema I've got the capacity to store up to six integer fields and six string fields, in addition to the other syslog header data. (I never use priority, so I eventually dropped the priority column to save space.)

The tricky thing about this setup is that when you go to query, you first have to translate the field name into what is actually stored in the database, so srcip becomes "i3." I have separate, tiny lookup tables that I use for presenting the actual text. I do the same with program names by storing only the CRC32 value of the program as its program ID and keeping a lookup table for the actual text. (I use a CRC algorithm instead of an auto-generated ID so that cluster nodes don't have to sync their ID values.)
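The CRC-based program ID can be sketched in a few lines. The helper name and the sample program names below are illustrative; the point is that every node derives the same numeric ID from the program name independently, with no counter to synchronize:

```python
import zlib

def program_id(program: str) -> int:
    # zlib.crc32 returns an unsigned 32-bit value, stable across runs and hosts
    return zlib.crc32(program.encode("utf-8"))

# The tiny lookup table kept alongside the logs: id -> original text.
lookup = {}
for name in ("sshd", "snort", "named"):
    lookup[program_id(name)] = name

# Any cluster node computes the same ID for "sshd" without coordination:
print(program_id("sshd") == program_id("sshd"))
```

The trade-off versus an auto-increment ID is a (small) collision risk across distinct program names, which a 32-bit CRC cannot rule out.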
If you were interested in heading down this road, then I would suggest adding elements to the pattern XML schema to specify what the actual values of the fields are, like this:

  <fields>
    <field name="i0">sig_priority</field>
    <field name="i1">proto</field>
    <field name="i2">srcip</field>
    <field name="i3">srcport</field>
    <field name="i4">dstip</field>
    <field name="i5">dstport</field>
    <field name="s0">sig_name</field>
    <field name="s1">sig_sid</field>
    <field name="s2">sig_classification</field>
  </fields>

Or you could have a more formal format like this:

  <fields>
    <integers>
      <integer id="0">sig_priority</integer>
      ...
    </integers>
    <strings>
      <string id="0">sig_name</string>
      ...
    </strings>
  </fields>
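The query-time translation this <fields> mapping implies can be sketched as a plain dictionary lookup. The mapping values mirror the example element above; the helper itself is hypothetical:

```python
# Map human-readable schema names onto the generic i0..i5 / s0..s5 columns
# of the condensed table, as declared in the <fields> element.
FIELD_MAP = {
    "sig_priority": "i0", "proto": "i1",
    "srcip": "i2", "srcport": "i3",
    "dstip": "i4", "dstport": "i5",
    "sig_name": "s0", "sig_sid": "s1", "sig_classification": "s2",
}

def to_column(schema_name: str) -> str:
    """Translate a schema field name into its physical column name."""
    return FIELD_MAP[schema_name]

# A condition like "WHERE srcip = ?" is rewritten against the real column:
print(to_column("srcip"))  # i2
```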
The only missing bit here is that right now syslog-ng is unable to generate a unique message ID on its own, but that's not very difficult to add.
I would continue to leave that to the app, or in my case, the SQL schema as I think it's better handled there.
What do you think? Based on this idea, I'm proposing to split the current secevt schema into 3 smaller ones: flowevt, natevt and secevt.
Please check the git archive where I've pushed the current version.
I gave it a quick look and it seems right to me. One topic for discussion: should each class have an optional "details" field available, or should that be an implicitly available field for all log classes? In the case of my SQL schema above, it could be amended to include a column defined as:

  details VARCHAR(255)

as the last column of the schema.

Obviously, the big thing missing from my current schema is a way to store N tag values. Bitmasks provide an interesting possibility, but that would limit us to a fixed number of classes (either 32 or 64, I guess). Could almost every class be covered by 64 classes? If so, a tag_ids BIGINT UNSIGNED column would give a convenient place to put all of the tag info. For instance, if we had the following lookup table:

  id  name
  1   Net
  2   NAT
  4   Security

then a row with tag_ids=7 would mean it has all three tags, and a row with tag_ids&2 would have the NAT flag set because it would match the bitwise AND for that bit. To query by name, you would do this:

  SELECT * FROM logs JOIN classes ON (logs.tag_ids & classes.id)
   WHERE classes.name="NAT";

The major problem with this is that databases won't be able to use an index when doing bitwise comparisons, so searching based on only one known tag would be extremely slow compared to a const index lookup when you're looking for a perfect match of all given tags. This query, by contrast, would use an index and be very fast:

  SELECT * FROM logs
   WHERE tag_ids=(SELECT SUM(id) FROM classes
                   WHERE class="Security" OR class="NAT" OR class="Net");

On the other hand, if we're willing to give up a lot of space and speed, you could use a VARCHAR column with a CSV of the tags ("Net,NAT,Security"), or assign numeric IDs to save space ("1,2,4"). If you stored (",1,2,4,") then you could do your searches as WHERE tag_ids LIKE "%,2,%", which would still be slow but easy to program.
Of course, the classic answer is to use a giant index map table with (log_id, tag_id) as the columns which represents the one-to-many relationship. That works but doesn't scale particularly well, and brings with it the burden of managing an appendix table for every log table, which gets annoying when dealing with rollover, etc.
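The bitmask encoding above is easy to sanity-check outside the database. This sketch mirrors the Net/NAT/Security lookup table from the discussion; the helper names are mine:

```python
# Each tag gets a power-of-two ID; a row stores the OR of its tag IDs in a
# single integer column, and membership tests are bitwise ANDs, matching
# the SQL predicate logs.tag_ids & classes.id.
TAGS = {"Net": 1, "NAT": 2, "Security": 4}

def encode(tags):
    """Combine a list of tag names into one tag_ids value."""
    mask = 0
    for t in tags:
        mask |= TAGS[t]
    return mask

def has_tag(tag_ids, tag):
    """True if the bit for the given tag is set in tag_ids."""
    return bool(tag_ids & TAGS[tag])

row_tag_ids = encode(["Net", "NAT", "Security"])
print(row_tag_ids)                  # 7: all three tags set
print(has_tag(row_tag_ids, "NAT"))  # True: the 2-bit is set
```

Note that SUM(id) in the "fast" SQL query only equals the OR of the bits because the IDs are distinct powers of two; with overlapping bits the two would diverge.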
On Sun, 2010-07-04 at 20:12 -0500, Martin Holste wrote:
Maybe I'm missing something, but as I see it, the current "tags" function present in patterndb v3 (e.g. syslog-ng OSE 3.1 or later) is exactly what you describe.
Yes, precisely! I got a paragraph into my response and realized that, so I stated that I wasn't sure if there was any difference between a class element and a tag element. Thinking about that more, I don't see enough value in the formality of declaring a class element above and beyond what the tag element already accomplishes, as long as the XSD can prove the proper conformity.
Right now we supply an XSD, but it doesn't check schema validity. I'm not completely convinced that I'd add this to the schema itself, but I do see the value of being able to validate it. So agreed: whether via an XSD or something different, a validation tool would be useful.
If I understand you correctly, you were referring to the "class" attribute of the rule element, and want to extend that. The way I see it, the "tags" feature is far superior to using classes, so maybe a deprecation of the class attribute would be needed.
Agreed.
Great.
But this is already possible with v3.1. The only problem with using three tags instead of one, is how to store the extracted information in a way that it can be combined later.
I'll show you a schema that I've been using which compromises between flexibility and formalism. Here's my Snort parser:
<ruleset name="snort" id='8'>
  <pattern>snort</pattern>
  <rules>
    <rule provider="LOCAL" class='8' id='8'>
      <patterns>
        <pattern>@QSTRING:s0:[]@ @ESTRING:s1:[@Classification:@QSTRING:s2: ]@ [Priority: @NUMBER:i0:@]: @QSTRING:i1:{}@ @IPv4:i2:@:@NUMBER:i3:@ -> @IPv4:i4:@:@NUMBER:i5:@</pattern>
      </patterns>
    </rule>
  </rules>
</ruleset>
And here's my MySQL table schema:
CREATE TABLE `syslogs_template` (
  `id` BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  `timestamp` INT UNSIGNED NOT NULL DEFAULT 0,
  `host_id` INT UNSIGNED NOT NULL DEFAULT '1',
  `program_id` INT UNSIGNED NOT NULL DEFAULT '1',
  `class_id` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
  `rule_id` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
  msg TEXT,
  i0 INT UNSIGNED, i1 INT UNSIGNED, i2 INT UNSIGNED,
  i3 INT UNSIGNED, i4 INT UNSIGNED, i5 INT UNSIGNED,
  s0 VARCHAR(255), s1 VARCHAR(255), s2 VARCHAR(255),
  s3 VARCHAR(255), s4 VARCHAR(255), s5 VARCHAR(255)
) ENGINE=MyISAM;
What I see with this is that once we have a schema description language (right now SCHEMAS.txt, but a more formal language later), we could map each name-value pair to multiple representations, and basically the "storage" functionality decides which mapping to use. A naive schema-based SQL destination would simply create as many tables as there are schemas. A better optimized one would use the NV -> field mapping that you propose, and a NoSQL implementation would just scale to any number of NV pairs without having to rename the fields. This mapping support would also be useful if we want to generate CEF/CEE formatted events.
So what I've done is trade the ability to create as many fields as I want, and human readability, for the guarantee of only inserting into a single destination table. In my schema, I've got the capacity to store up to six integer fields and six string fields, in addition to the other syslog header data. (I never use priority, so I eventually dropped the priority column to save space.)
The tricky thing about this setup is that when you go to query, you first have to translate the field into what is actually stored in the database, so srcip becomes "i3." I have separate, tiny lookup tables that I use for presenting the actual text. I do the same with program names by storing only the CRC32 value of the program as its program ID and keeping a lookup table for the actual text. (I use a CRC algorithm instead of an auto-generated ID so that ID's between cluster nodes don't have to sync their values.)
If you were interested in heading down this road, then I would suggest adding elements to the pattern XML schema to specify what the actual values of the fields are, like this:
<fields>
  <field name="i0">sig_priority</field>
  <field name="i1">proto</field>
  <field name="i2">srcip</field>
  <field name="i3">srcport</field>
  <field name="i4">dstip</field>
  <field name="i5">dstport</field>
  <field name="s0">sig_name</field>
  <field name="s1">sig_sid</field>
  <field name="s2">sig_classification</field>
</fields>
Or you could have a more formal format like this:
<fields>
  <integers>
    <integer id="0">sig_priority</integer>
    ...
  </integers>
  <strings>
    <string id="0">sig_name</string>
    ...
  </strings>
</fields>
The only missing bit here is that right now syslog-ng is unable to generate a unique message ID on its own, but that's not very difficult to add.
I would continue to leave that to the app, or in my case, the SQL schema as I think it's better handled there.
The problem is that I'd like to support the multiple tables idea as well, e.g. store each schema in a separate table. In this case you need a unique id in order to join the tables. Also, if this would be combined with the MSGID field of RFC5424, this could be used to fetch the original raw message easily.
What do you think? Based on this idea, I'm proposing to split the current secevt schema into 3 smaller ones: flowevt, natevt and secevt.
Please check the git archive where I've pushed the current version.
I gave it a quick look and it seems right to me. One topic for discussion: should each class have an optional "details" field available, or should that be an implicitly available field to all log classes?
True enough... the "details" field is used to shove all non-structured but related information into the event, and if a given event uses multiple schemas, we'd have three "details" fields, each possibly containing overlapping information. Hmm, maybe "details" should sit above all schemas, e.g. instead of calling it "secevt.details", it should be called just "details". It is a single pattern that extracts all the fields, after all, so the pattern author can decide which information wouldn't fit into any of the schemas and put that in details.
Obviously, the big thing missing from my current schema is a way to store N number of tag values. Bitmasks provide an interesting possibility, but that would limit us to X number of classes (either 32 or 64, I guess). Could almost every class be covered in 64 classes? If so, a tag_ids BIGINT UNSIGNED column would allow a convenient place to put all of the tag info. For instance, if we had the following lookup table:
id  name
1   Net
2   NAT
4   Security
Then a row with tag_ids=7 would mean it has all three tags, and a row with tag_ids&2 would have the NAT flag set because it would match the boolean AND for that bit. To query by name, you would do this:
SELECT * FROM logs JOIN classes ON (logs.tag_ids&classes.id) WHERE classes.name="NAT";
The major problem with this is that databases won't be able to use an index when doing boolean logic comparisons, so searching based on only one known tag would be extremely slow compared to const index lookup if you're looking for a perfect match of all given tags. So this query would use an index and be very fast:
SELECT * FROM logs WHERE tag_ids=(SELECT SUM(id) FROM classes WHERE class="Security" OR class="NAT" OR class="Net");
On the other hand, if we're willing to give up a lot of space and the speed, you could do VARCHAR columns with a CSV of the tags ("Net,NAT,Security") or assign numeric ID's to save space with ("1,2,4"). If you used (",1,2,4,") then you could do your searches as WHERE tag_ids LIKE "%,2,%" which would still be slow but easy to program.
Of course, the classic answer is to use a giant index map table with (log_id, tag_id) as the columns which represents the one-to-many relationship. That works but doesn't scale particularly well, and brings with it the burden of managing an appendix table for every log table, which gets annoying when dealing with rollover, etc.
Well, I believe that in SQL, the best we could probably come up with is a "list of tags field" and use free-text indexing. -- Bazsi
A naive schema based SQL destination would simply create as many tables as there are schemas. A better optimized one would use the NV -> field mapping that you propose, and a NoSQL implementation would just scale to any number of NV pairs without having to rename the fields.
This mapping support would also be useful if we want to generate CEF/CEE formatted events.
Hm, so maybe we need to decouple the actual DB stuff from the XML schema and declare it out of scope, since it's really up to the implementer to figure that out, and the specific implementation will likely change for each setup. I think what's essential is providing the list of name-value pairs and whether they are integer or string. Maybe there could be a "contrib" section on your site with contributed scripts for stamping out the various configurations (e.g. multi-table SQL, NoSQL, etc.).
The problem is that I'd like to support the multiple tables idea as well, e.g. store each schema in a separate table. In this case you need a unique id in order to join the tables. Also, if this would be combined with the MSGID field of RFC5424, this could be used to fetch the original raw message easily.
It looks to me like MSGID is better suited for a tag than being part of the ID itself. From the RFC: "It is intended for filtering messages on a relay or collector." A unique ID across multiple tables is not a problem as long as there is one master table where you would put the syslog header fields, with an auto-increment column to generate the ID. If you absolutely wanted syslog-ng to generate the ID, I suppose you could append a CRC of the $MSG to the epoch timestamp, though that isn't foolproof.
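The epoch-plus-CRC fallback can be sketched in a few lines. The helper name is mine, and as noted it is not foolproof: two identical messages arriving in the same second collide.

```python
import time
import zlib

def message_id(msg, epoch=None):
    """Epoch timestamp joined with a CRC32 of the message body (hex)."""
    if epoch is None:
        epoch = int(time.time())
    return f"{epoch}-{zlib.crc32(msg.encode('utf-8')):08x}"

mid = message_id("Accepted password for joe from 10.0.0.5 port 4242 ssh2",
                 epoch=1278374400)
print(mid)  # deterministic for a fixed epoch and message
```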
hmm... hmm, maybe "details" should be above all schemas, e.g instead of calling it "secevt.details", it should be called "details", it is a single pattern the extracts all the fields after all, so the pattern author can decide which information wouldn't fit into any of the schemas and put that in details.
Yep, I think details would be a good spot for all miscellany, as well as other meta-data that is inherent to a specific log class that doesn't fit in a predefined field.
Well, I believe that in SQL, the best we could probably come up with is a "list of tags field" and use free-text indexing.
Yes, for instance, the Sphinx full-text search engine has a Multi-Value Attribute (MVA) config attribute which is specifically designed for efficiently storing a list of n-number of tag ID's for a given record.
On Mon, 2010-07-05 at 12:05 -0500, Martin Holste wrote:
A naive schema based SQL destination would simply create as many tables as there are schemas. A better optimized one would use the NV -> field mapping that you propose, and a NoSQL implementation would just scale to any number of NV pairs without having to rename the fields.
This mapping support would also be useful if we want to generate CEF/CEE formatted events.
Hm, so maybe we need to decouple the actual DB stuff from the XML schema and declare it out of scope, since it's really up to the implementer to figure that out, and the specific implementation will likely change for each setup. I think what's essential is providing the list of name-value pairs and whether they are integer or string. Maybe there could be a "contrib" section on your site with contributed scripts for stamping out the various configurations (e.g. multi-table SQL, NoSQL, etc.).
I'd like to create a generic SQL destination, which would magically work without having to explicitly configure the table schema (e.g. no need to generate the configuration). If type information is present, then the field names for your condensed table could be generated on the fly. I think I'd leave this question open for a while, until we get that generic SQL destination.
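Generating field definitions on the fly from typed NV pairs might look roughly like this. Everything here is hypothetical (syslog-ng exposes no such API); it only illustrates deriving DDL from a schema description:

```python
# Map schema-description types onto SQL column types; the two types here
# mirror the integer/string distinction discussed in the thread.
TYPE_MAP = {"int": "INT UNSIGNED", "string": "VARCHAR(255)"}

def create_table_sql(table, fields):
    """fields: list of (name, type) pairs from the schema description."""
    cols = ", ".join(f"{name} {TYPE_MAP[ftype]}" for name, ftype in fields)
    return f"CREATE TABLE {table} ({cols})"

sql = create_table_sql("secevt", [("src_ip", "string"), ("src_port", "int")])
print(sql)
```

A naive destination would emit one such statement per schema; an optimized one would first run the names through an NV-to-field mapping like the one proposed earlier.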
The problem is that I'd like to support the multiple tables idea as well, e.g. store each schema in a separate table. In this case you need a unique id in order to join the tables. Also, if this would be combined with the MSGID field of RFC5424, this could be used to fetch the original raw message easily.
It looks to me like MSGID is better suited for a tag than being part of the ID itself. From the RFC: "It is intended for filtering messages on a relay or collector." A unique ID across multiple tables is not a problem as long as there is one master table where you would put the syslog header fields with an auto-increment column to generate the ID. If you absolutely wanted Syslog-NG to generate the ID, I suppose you could append a CRC of the $MSG to the epoch timestamp, though that isn't foolproof.
Right, I was under a wrong impression of what MSGID is. Not that I understand or agree with the way it was defined, though. Anyway, I wouldn't want to store the syslog message in the database only to get an ID, and the use of this ID would be optional.
hmm... hmm, maybe "details" should be above all schemas, e.g instead of calling it "secevt.details", it should be called "details", it is a single pattern the extracts all the fields after all, so the pattern author can decide which information wouldn't fit into any of the schemas and put that in details.
Yep, I think details would be a good spot for all miscellany, as well as other meta-data that is inherent to a specific log class that doesn't fit in a predefined field.
Agreed.
Well, I believe that in SQL, the best we could probably come up with is a "list of tags field" and use free-text indexing.
Yes, for instance, the Sphinx full-text search engine has a Multi-Value Attribute (MVA) config attribute which is specifically designed for efficiently storing a list of n-number of tag ID's for a given record.
That's what I thought. I'm going to update the document with these decisions. Thanks for your feedback, I really appreciate it. -- Bazsi
On Wed, 2010-07-07 at 13:37 +0200, Balazs Scheidler wrote:
On Mon, 2010-07-05 at 12:05 -0500, Martin Holste wrote:
A naive schema based SQL destination would simply create as many tables as there are schemas. A better optimized one would use the NV -> field mapping that you propose, and a NoSQL implementation would just scale to any number of NV pairs without having to rename the fields.
This mapping support would also be useful if we want to generate CEF/CEE formatted events.
Hm, so maybe we need to decouple the actual DB stuff from the XML schema and declare it out of scope, since it's really up to the implementer to figure that out, and the specific implementation will likely change for each setup. I think what's essential is providing the list of name-value pairs and whether they are integer or string. Maybe there could be a "contrib" section on your site with contributed scripts for stamping out the various configurations (e.g. multi-table SQL, NoSQL, etc.).
I'd like to create a generic SQL destination, which would magically work without having to explicitly configure the table schema (e.g. no need to generate the configuration)
If type information is present, then the field names for your condensed table could be generated on the fly. I think I'd leave this question open for a while, until we get that generic SQL destination.
The problem is that I'd like to support the multiple tables idea as well, e.g. store each schema in a separate table. In this case you need a unique id in order to join the tables. Also, if this would be combined with the MSGID field of RFC5424, this could be used to fetch the original raw message easily.
It looks to me like MSGID is better suited for a tag than being part of the ID itself. From the RFC: "It is intended for filtering messages on a relay or collector." A unique ID across multiple tables is not a problem as long as there is one master table where you would put the syslog header fields with an auto-increment column to generate the ID. If you absolutely wanted Syslog-NG to generate the ID, I suppose you could append a CRC of the $MSG to the epoch timestamp, though that isn't foolproof.
Right, I was under the wrong impression what MSGID is. Not that I understand or agree with the way it was defined though.
Anyway, I wouldn't want to store the syslog message in the database only to get an ID, and the use of this ID would be optional.
hmm... hmm, maybe "details" should be above all schemas, e.g instead of calling it "secevt.details", it should be called "details", it is a single pattern the extracts all the fields after all, so the pattern author can decide which information wouldn't fit into any of the schemas and put that in details.
Yep, I think details would be a good spot for all miscellany, as well as other meta-data that is inherent to a specific log class that doesn't fit in a predefined field.
Agreed.
Well, I believe that in SQL, the best we could probably come up with is a "list of tags field" and use free-text indexing.
Yes, for instance, the Sphinx full-text search engine has a Multi-Value Attribute (MVA) config attribute which is specifically designed for efficiently storing a list of n-number of tag ID's for a given record.
That's what I thought.
I'm going to update the document with these decisions. Thanks for your feedback, I really appreciate it.
I've updated the patterndb policy document with the latest discussion points at http://git.balabit.hu/

I still have some open points:

  * ruleset and rule IDs (UUID vs something else)
  * ruleset organization

I'd appreciate feedback on the current policy. -- Bazsi
Looking good. One picky thing: the line containing "NV pair names should only contain alphanumeric characters (a-zA-Z0-9)" should maybe include the underscore and dot in the regexp to avoid confusion, or at least the underscore.

Also, I think "generic" may not be the term you're looking for when describing your initial schema design. To me, "per-schema tables" better describes the layout, as technically my method of dumping all logs into one table is more "generic" in that it's a one-size-fits-all table setup.

I'm noticing that it's a bit difficult to discuss the patterndb schema and DB layouts, because I keep wanting to refer to DB schemas, which is confusing. Could we instead call the patterndb schemas "rule sets," as per the original patterndb.xml, instead of schemas? That way we know that "schema" can only refer to DB tables. It is clearer to me to say "one type of schema is to have one table per rule set."

On Fri, Jul 9, 2010 at 6:26 AM, Balazs Scheidler <bazsi@balabit.hu> wrote:
On Wed, 2010-07-07 at 13:37 +0200, Balazs Scheidler wrote:
On Mon, 2010-07-05 at 12:05 -0500, Martin Holste wrote:
A naive schema based SQL destination would simply create as many tables as there are schemas. A better optimized one would use the NV -> field mapping that you propose, and a NoSQL implementation would just scale to any number of NV pairs without having to rename the fields.
This mapping support would also be useful if we want to generate CEF/CEE formatted events.
Hm, so maybe we need to decouple the actual DB stuff from the XML schema and declare it out of scope, since it's really up to the implementer to figure that out, and the specific implementation will likely change for each setup. I think what's essential is providing the list of name-value pairs and whether they are integer or string. Maybe there could be a "contrib" section on your site with contributed scripts for stamping out the various configurations (e.g. multi-table SQL, NoSQL, etc.).
I'd like to create a generic SQL destination, which would magically work without having to explicitly configure the table schema (e.g. no need to generate the configuration)
If type information is present, then the field names for your condensed table could be generated on the fly. I think I'd leave this question open for a while, until we get that generic SQL destination.
The problem is that I'd like to support the multiple tables idea as well, e.g. store each schema in a separate table. In this case you need a unique id in order to join the tables. Also, if this would be combined with the MSGID field of RFC5424, this could be used to fetch the original raw message easily.
It looks to me like MSGID is better suited for a tag than being part of the ID itself. From the RFC: "It is intended for filtering messages on a relay or collector." A unique ID across multiple tables is not a problem as long as there is one master table where you would put the syslog header fields with an auto-increment column to generate the ID. If you absolutely wanted Syslog-NG to generate the ID, I suppose you could append a CRC of the $MSG to the epoch timestamp, though that isn't foolproof.
Right, I was under the wrong impression what MSGID is. Not that I understand or agree with the way it was defined though.
Anyway, I wouldn't want to store the syslog message in the database only to get an ID, and the use of this ID would be optional.
hmm... hmm, maybe "details" should be above all schemas, e.g instead of calling it "secevt.details", it should be called "details", it is a single pattern the extracts all the fields after all, so the pattern author can decide which information wouldn't fit into any of the schemas and put that in details.
Yep, I think details would be a good spot for all miscellany, as well as other meta-data that is inherent to a specific log class that doesn't fit in a predefined field.
Agreed.
Well, I believe that in SQL, the best we could probably come up with is a "list of tags field" and use free-text indexing.
Yes, for instance, the Sphinx full-text search engine has a Multi-Value Attribute (MVA) config attribute which is specifically designed for efficiently storing a list of n-number of tag ID's for a given record.
That's what I thought.
I'm going to update the document with these decisions. Thanks for your feedback, I really appreciate it.
I've updated the patterndb policy document with the latest discussion points at
I still have some open points:

  * ruleset and rule IDs (UUID vs something else)
  * ruleset organization
I'd appreciate feedback on the current policy.
-- Bazsi
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
On Sat, 2010-07-10 at 14:56 -0500, Martin Holste wrote:
Looking good. One picky thing: the line stating that "NV pair names should only contain alphanumeric characters (a-zA-Z0-9)" should probably include the underscore and the dot in the character class to avoid confusion, or at least the underscore.
done.
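For reference, the amended naming rule could be checked with a pattern like the one below; the exact character set and the treatment of "." as a hierarchy separator are the project's choice, so this regex is just one plausible reading of the policy:

```python
import re

# Alphanumerics plus underscore within a component, components joined
# by dots (e.g. "usracct.username") — an assumed interpretation.
NV_NAME = re.compile(r"^[a-zA-Z0-9_]+(\.[a-zA-Z0-9_]+)*$")

print(bool(NV_NAME.match("usracct.username")))  # → True
print(bool(NV_NAME.match("bad name")))          # → False
```

Anchoring both ends and forbidding empty components also rejects names like "usracct." or ".username".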
Also, I think "generic" may not be the term you're looking for when describing your initial schema design. To me, "per-schema tables" better describes the layout, as technically, my method of dumping all logs into one table is more "generic" in that it's a one-size-fits-all table setup.
done.
I'm noting that it's a bit difficult to discuss the patterndb schema and DB layouts, because I keep wanting to refer to DB schemas, which is confusing. Could we instead call the patterndb schemas "rule sets," as per the original patterndb.xml, instead of schemas? That way we know that "schema" can only refer to DB tables. It is clearer to me to say "one type of schema is to have one table per rule set."
Well, the ruleset in patterndb refers to the application, rather than the different log message types it emits (e.g. a ruleset has a given PROGRAM name which applies to all rules within the same ruleset).

It is quite a bit of work to rewrite the relevant sections; I'm not against renaming, though.

The CEE project uses:
* taxonomy = the meaning of the event (e.g. user login)
* dictionary = the name-value pairs

The problem with the CEE naming: taxonomy could be translated to our "combination-of-schemas", more specifically the set of tags associated with a message. And the dictionary itself is taxonomy independent, which I feel can be problematic in the long run.

-- Bazsi
Good points. I guess "schema" is still the best term, since both patterndb and the database use schemas to convey data relationships.
participants (3):
- Balazs Scheidler
- Martin Holste
- Peter Czanik