[patterndb] classification

Balazs Scheidler

3 Sep 2010

9:11 p.m.

Hi,

As you probably know, one goal for patterndb is to implement message classification. That is, in addition to extracting information from log messages, it also associates a "class" with each message, later available in the "${.classifier.class}" value.

Right now, syslog-ng doesn't really care what this string is. But the XML schema validating the patterndb file lists the following four classes (taken from the logcheck project):

  violation - security violation
  security  - other security events
  system    - system information
  unknown   - no rule matches

On one hand, the tagging functionality (i.e. the ability to also associate tags with each message) is superior to classes. On the other hand, all tags are equivalent, so if a message has 5 tags, syslog-ng currently only provides functions to _filter_ based on tags, but not to use them as a macro.

So for example it is possible to do:

  destination d_class_files { file("/var/log/messages.${.classifier.class}.log"); };

But this is difficult to do with tags (except by using filters and different destinations), as there's no such functionality.

Another problem is that tags and classes are completely independent: in order to filter on the class of the message, one would have to use a match() filter like this:

  filter f_class { match("violation" value(".classifier.class")); };

My conclusion is that classes are better when used in templates, tags are better when filtering. The two should be merged somehow.

So I'm thinking about how to move forward. Here are the alternatives I'm considering:

1) the class of the message is always a tag in some generated format (e.g. if a message has class XXX, then a tag named ".class.XXX" would be automatically associated with the message). This is somewhat cumbersome.

2) the class of the message is created as a tag as well, with the same name as the class. E.g. we'd have a tag named "violation", but that'd preclude the use of the "violation" name as a tag.

3) drop the class stuff and implement a macro trick that makes it possible to use tags in macro context.

One way to do this:

  destination d_class_files { file("/var/log/messages.$(expand-tag-name violation security system unknown).log"); };

The "expand-tag-name" macro function would look for the tags listed as parameters, and if the message carries one of them, it would expand to that tag's name. This is not intuitive, and if someone wants to use such an expansion in a lot of templates, it is also irritating and difficult to get right.

On an independent matter, the set of classes may need some thought. As I said, the original list is borrowed from logcheck, but I think it probably needs to be expanded. Last time I got patterns for DNS queries, and although I could shove them into "system", right now I feel that the point of classification is to categorize events by "importance", in a similar spirit to syslog severity, but one that works even if the application developer uses a bogus severity when sending syslog messages.

So one email, two questions, feedback appreciated.

Thanks.

-- 
Bazsi
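To make alternative 3 concrete, here is a minimal Python sketch of what the proposed "expand-tag-name" function might do. It is only an illustration of the idea in the mail, not syslog-ng code: the function names are hypothetical, and "first listed candidate wins" and "empty string when nothing matches" are assumptions the proposal does not spell out.

```python
def expand_tag_name(message_tags, *candidates):
    """Return the first candidate tag that the message carries.

    Assumption: candidates are checked in the order given and an empty
    string is returned when none of them is present on the message.
    """
    for name in candidates:
        if name in message_tags:
            return name
    return ""


def class_log_path(message_tags):
    """Build the per-class file name used in the example destination."""
    suffix = expand_tag_name(
        message_tags, "violation", "security", "system", "unknown"
    )
    return "/var/log/messages.%s.log" % suffix
```

A message tagged with both "ssh" and "security" would end up in /var/log/messages.security.log. The sketch also shows the awkward part of the proposal: every template has to repeat the full candidate list.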

Matthew Hall

3 Sep

9:24 p.m.

On Fri, Sep 03, 2010 at 09:11:59PM +0200, Balazs Scheidler wrote:

...

Hi,

Hey Bazsi,

...

So one email, two questions, feedback appreciated.

Not sure if it's an option, but the idea which occurs to me is that you are looking for a way of setting and optionally mapping some keys like ".classifier.class" or "mhall_special_tag" to some value like "{ violation, security, ... }".

So my suggestion would be to remap the ".classifier.class" into the tag system for compatibility, then extend the tag system to be a hash table.

The nice thing about the hash table would be, you could still support existing tags. For example if I tagged a message as "mhall_special_tag", in the hash table you could map that:

  mhall_special_tag -> PLACEHOLDER

Then for fancier tags like ".classifier.class" you could map that:

  .classifier.class -> { violation, security, ... }

Then you could provide some kind of utilities for it to expand what you need.

1) an operation to check if a key is set
2) an operation to get the value set for some key
3) an operation to check if a value is set
4) etc...

Then when I want to break out messages to classifier based dirs, I could just call operation (2) to get the value of ".classifier.class".

If I wanted to make a filter that grabbed messages with mhall_special_tag set, I could do that using operation (1). Etc etc.
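The hash table idea and its three operations can be sketched in a few lines of Python. This is a hypothetical illustration only: the class name, method names, and the use of a sentinel object for PLACEHOLDER are assumptions, not anything that exists in syslog-ng.

```python
# Sentinel stored for plain tags that have no value of their own.
PLACEHOLDER = object()


class MessageAttributes:
    """One table holding both plain tags and keyed values."""

    def __init__(self):
        self._table = {}

    def add_tag(self, name):
        # A plain tag is a key mapped to PLACEHOLDER.
        self._table[name] = PLACEHOLDER

    def set_value(self, key, value):
        # A "fancier" tag such as ".classifier.class" carries a value.
        self._table[key] = value

    def has_key(self, key):
        """Operation (1): check if a key is set."""
        return key in self._table

    def get(self, key, default=None):
        """Operation (2): get the value set for some key."""
        value = self._table.get(key, PLACEHOLDER)
        return default if value is PLACEHOLDER else value

    def has_value(self, value):
        """Operation (3): check if a value is set under any key."""
        return any(
            v is not PLACEHOLDER and v == value for v in self._table.values()
        )
```

With this shape, a filter on "mhall_special_tag" is a has_key() call, and a per-class file name comes from get(".classifier.class").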

...

Thanks. Bazsi

HTH, Matthew.

Balazs Scheidler

6 Sep

10:42 a.m.

On Fri, 2010-09-03 at 12:24 -0700, Matthew Hall wrote:

...

On Fri, Sep 03, 2010 at 09:11:59PM +0200, Balazs Scheidler wrote:

...
Hi,

Hey Bazsi,

...
So one email, two questions, feedback appreciated.

Not sure if it's an option but the idea which occurs to me is that you are looking for a way of setting and optionally mapping some keys like ".classifier.class" or "mhall_special_tag" to some value like "{ violation, security, ... }".

So my suggestion would be to remap the ".classifier.class" into the tag system for compatibility, then extend the tag system to be a hash table.

The nice thing about the hash table would be, you could still support existing tags. For example if I tagged a message as "mhall_special_tag", in the hash table you could map that:

mhall_special_tag -> PLACEHOLDER

Then for fancier tags like ".classifier.class" you could map that:

.classifier.class -> { violation, security, ... }

Then you could provide some kind of utilities for it to expand what you need.

1) an operation to check if a key is set
2) an operation to get the value set for some key
3) an operation to check if a value is set
4) etc...

Then when I want to break out messages to classifier based dirs, I could just call operation (2) to get the value of ".classifier.class".

If I wanted to make a filter that grabbed messages with mhall_special_tag set, I could do that using operation (1).

Hmm.. the message itself is already a hashtable (not exactly, but semantically they are the same). What you say with the above is that tags should be present/non-present attributes of the message, right?

The problem I see with this is the namespace, I wouldn't want to collide tag names with built-in macros or name-value pairs.

-- 
Bazsi

Matthew Hall

11:39 a.m.

On Mon, Sep 06, 2010 at 10:42:58AM +0200, Balazs Scheidler wrote:

...

Hmm.. the message itself is already a hashtable (not exactly, but semantically they are the same).

Makes sense. Abstractly you could represent the message in one hash table, and think of the patterns, templates, and rewrite rules as being a way of transforming the input hash to the output hash.

...

What you say with the above is that tags should be present/non-present attributes of the message, right?

You could put the tags and the classifications into a common hash table, where tags could be represented as keys with no value, and attributes like ".classifier.class" as being keys with values.

...

The problem I see with this is the namespace, I wouldn't want to collide tag names with built-in macros or name-value pairs.

One option would be sigils like perl, '$ @ # &' etc.

Another option would be namespace enforcement similar to how C code identifiers are named or what you talked about in your own blog post about identifier naming a number of weeks ago. ;-)

...

Bazsi

Matthew.

Martin Holste

11:59 p.m.

I think something like what Matthew has described would work to deal with the namespace issues. The hash table system seems like a good way of doing the CEE values that have been talked about, and could also pave the way for some pretty powerful stuff.

On Mon, Sep 6, 2010 at 4:39 AM, Matthew Hall <mhall@mhcomputing.net> wrote:

...

On Mon, Sep 06, 2010 at 10:42:58AM +0200, Balazs Scheidler wrote:

...
Hmm.. the message itself is already a hashtable (not exactly, but semantically they are the same).

Makes sense. Abstractly you could represent the message in one hash table, and think of the patterns, templates, and rewrite rules as being a way of transforming the input hash to the output hash.

...
What you say with the above is that tags should be present/non-present attributes of the message, right?

You could put the tags and the classifications into a common hash table, where tags could be represented as keys with no value, and attributes like ".classifier.class" as being keys with values.

...
The problem I see with this is the namespace, I wouldn't want to collide tag names with built-in macros or name-value pairs.

One option would be sigils like perl, '$ @ # &' etc.

Another option would be namespace enforcement similar to how C code identifiers are named or what you talked about in your own blog post about identifier naming a number of weeks ago. ;-)

...
Bazsi

Matthew.
______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.campin.net/syslog-ng/faq.html

Anton Chuvakin

10 Sep

7:48 a.m.

Balazs and others:

For the benefit of the logging community, I am sharing a few ideas from the upcoming CEE taxonomy docs (all these are pre-DRAFTS):

"The CEE Event Taxonomy defines a collection of "tags" that can be used to categorize events. Its goal is to provide a common vocabulary, through sets of tags, to help classify and relate records that pertain to similar types of events. Using Taxonomy tags, event producers can provide obvious and consistent event categorization identifiers. For example, users and event consumers can leverage these categories to improve event correlation or easily locate certain classes of events."

"The CEE Taxonomy defines a tag set as way to categorize events. Each tag set consists of one or more tags. Similar to an event field, each tag entry has an identifying long and short name. These tag sets allow each event to be associated with multiple tags representing multiple categories. This gives the event consumers the flexibility to identify similar events based upon their needs. "

"Common tag sets include event action, status, and object, and might include other categorizations such as attack type, device type, or other categorizations that are required by the event consumer. "

"A tag relation describes the association that a tag has with another tag. Individual tag relations are defined in a Relation element, with the type attribute specifying the relation type (e.g., subclass) and the element's text references the Tag to which the current Tag is related. Relations are grouped together within a single Relations element."

Examples:

  <Tag>
    <Name>AccountObject</Name>
    <ShortName>acct</ShortName>
    <TagSet>object</TagSet>
    <Description>A user account</Description>
  </Tag>

  <Tag>
    <Name>LogonAction</Name>
    <ShortName>logon</ShortName>
    <AltName>login</AltName>
    <TagSet>ActionTagSet</TagSet>
    <Description>
      An entity (typically a user, application, or system) gains access to a system or application by properly authenticating to a user account and starting a session, usually using a password or other credential
    </Description>
    <Relations>
      <Relation type="opposite">LogoffAction</Relation>
    </Relations>
  </Tag>

Further

"The CEE Dictionary defines a collection of event fields, field sets, and field value types. A field is used to describe one characteristic or property of an event (e.g., start time, account name). Each field definition may be associated with a value type, which defines the format for valid values for that field. For example, a "filename" field has values of a "string" type. Field sets, like tag sets, simply allow related fields to be grouped."

Let me know if you'd like to see anything else...

Best,
-- 
Dr. Anton Chuvakin
Site: http://www.chuvakin.org
Blog: http://www.securitywarrior.org
LinkedIn: http://www.linkedin.com/in/chuvakin
Consulting: http://www.securitywarriorconsulting.com
Twitter: @anton_chuvakin
Google Voice: +1-510-771-7106

Balazs Scheidler

20 Sep

5:41 p.m.

Hi Anton,

Thanks for letting us know. The things you posted about CEE so far definitely influence me while working on patterndb (both as the pattern-collecting project and as code within syslog-ng itself). So, I wanted to thank you for taking the time to post this.

On Thu, 2010-09-09 at 22:48 -0700, Anton Chuvakin wrote:

...

Balazs and others:

For the benefit of the logging community, I am sharing a few ideas from the upcoming CEE taxonomy docs (all these are pre-DRAFTS):

"The CEE Event Taxonomy defines a collection of "tags" that can be used to categorize events. Its goal is to provide a common vocabulary, through sets of tags, to help classify and relate records that pertain to similar types of events. Using Taxonomy tags, event producers can provide obvious and consistent event categorization identifiers. For example, users and event consumers can leverage these categories to improve event correlation or easily locate certain classes of events."

"The CEE Taxonomy defines a tag set as way to categorize events. Each tag set consists of one or more tags. Similar to an event field, each tag entry has an identifying long and short name. These tag sets allow each event to be associated with multiple tags representing multiple categories. This gives the event consumers the flexibility to identify similar events based upon their needs. "

"Common tag sets include event action, status, and object, and might include other categorizations such as attack type, device type, or other categorizations that are required by the event consumer. "

"A tag relation describes the association that a tag has with another tag. Individual tag relations are defined in a Relation element, with the type attribute specifying the relation type (e.g., subclass) and the element's text references the Tag to which the current Tag is related. Relations are grouped together within a single Relations element."

Examples:

<Tag> <Name>AccountObject</Name> <ShortName>acct</ShortName> <TagSet>object</TagSet> <Description>A user account</Description> </Tag>

<Tag> <Name>LogonAction</Name> <ShortName>logon</ShortName> <AltName>login</AltName> <TagSet>ActionTagSet</TagSet> <Description> An entity (typically a user, application, or system) gains access to a system or application by properly authenticating to a user account and starting a session, usually using a password or other credential </Description> <Relations> <Relation type="opposite">LogoffAction</Relation> </Relations> </Tag>

Further

"The CEE Dictionary defines a collection of event fields, field sets, and field value types. A field is used to describe one characteristic or property of an event (e.g., start time, account name). Each field definition may be associated with a value type, which defines the format for valid values for that field. For example, a "filename" field has values of a "string" type. Field sets, like tag sets, simply allow related fields to be grouped."

Let me know if you'd like to see anything else...

Best,

-- Bazsi

Balazs Scheidler

13 Sep

4:24 p.m.

On Mon, 2010-09-06 at 16:59 -0500, Martin Holste wrote:

...

I think something like what Matthew has described would work to deal with the namespace issues. The hash table system seems like a good way of doing the CEE values that have been talked about, and could also pave the way for some pretty powerful stuff.

Well, as I see it, we do have this hashtable: it is the current name-value support. Tags are an independent namespace, that's right, but do we really need to assign values to tags? Right now the patterndb concept links tags and name-value pairs together: the name of a tag is the same as the prefix of the associated name-value pairs.

I don't really see the immediate benefit, but I might be missing something.

-- 
Bazsi

Anton Chuvakin

3 Sep

9:35 p.m.

All,

...

As you probably know one goal for patterndb is to implement message classification.

First, I worry when I hear about building a new taxonomy for log messages from scratch when CEE (cee.mitre.org) is almost ready. An arch spec just went out: http://cee.mitre.org/docs/CEE_Architecture_Overview_May_2010.pdf

...

E.g. in addition to extracting information from log messages, it also associates a "class", later available in the "${.classifier.class}" value.

That is useful but one class likely won't cut it as a lot of messages will be cross-class

...

violation - security violation security - other security events system - system information unknown - no rule matches

Both 'system' and 'security' is a very common situation. User logins - need I say more? :-) And telling 'violation' from 'security' is probably a lost cause.

...

One one hand, the tagging functionality (e.g. the ability to also associate tags with each message) is superior to classes.

Absolutely, tag clouds would be a much better bet than a tree of categories.

...

But it is difficult to do with tags (except for using filters and different destinations), as there's no such functionality. Another problem is that tags/classes are completely independent, in order to filter on the class of the message, one would have to use a match() filter like this:

Actually, that is a positive - especially when you include custom tags , like regulatory relevance or relevance to a particular unit inside the organization.

...

My conclusion is that classes are better when used in templates, tags are better when filtering. The two should be merged somehow.

Tags can be organized in 'bunches' that serve as classes.

...

3) drop the class stuff and implement a macro trick that makes it possible to use tags in macro context

I'd avoid hard-coded classes altogether and go with all tags, possibly organized in "classes of tags" or bunches or whatever.

...

On an independent matter, the set of classes may need some thought. As

Ah, that's because it will fail - multi-mapping will kill it. This was pretty much our starting point in CEE as many of us spent time doing it at SIEM players. So, SIEM vendors have been trying to build HUGE trees of events and ultimately they became unwieldy. Tags will be more manageable and simple relationships can be established between them.

...

probably needs to be expanded. Last time I got patterns for DNS queries, and although I could shove them into "system", right now I feel that the point of classification is to categorize events by

Well, now multiply it by roughly 120,000 events types that leading SIEM vendors categorized over the years and you'd know you don't want that :-)

...

"importance", in a similar spirit to syslog severity, but one that works even if the application developer uses a bogus severity when sending syslog messages.

Importance is a HUGE challenge. Not sure what to add to this one, as it is largely an unsolved problem due to very different contexts for message analysis. Even a mere 'connection established' can be 10 of 10 for somebody in some circumstances. One can try to glue importance to tags (like exploit > connection) and not to individual messages; it might work sometimes.

Best,
-- 
Dr. Anton Chuvakin
Site: http://www.chuvakin.org
Blog: http://www.securitywarrior.org
LinkedIn: http://www.linkedin.com/in/chuvakin
Consulting: http://www.securitywarriorconsulting.com
Twitter: @anton_chuvakin
Google Voice: +1-510-771-7106

Balazs Scheidler

10:03 p.m.

On Fri, 2010-09-03 at 12:35 -0700, Anton Chuvakin wrote:

...

All,

...
As you probably know one goal for patterndb is to implement message classification.

First, I worry when I hear about building a new taxonomy for log messages from scratch when CEE (cee.mitre.org) is almost ready. An arch spec just went out: http://cee.mitre.org/docs/CEE_Architecture_Overview_May_2010.pdf

Last I checked there was nothing concrete published from CEE. But I'll definitely read it. However quickly browsing through the PDF I couldn't find the taxonomy portion, is this "almost ready" stuff available somewhere?

...

...
E.g. in addition to extracting information from log messages, it also associates a "class", later available in the "${.classifier.class}" value.

That is useful but one class likely won't cut it as a lot of messages will be cross-class

...
violation - security violation security - other security events system - system information unknown - no rule matches

Both 'system' and 'security' is a very common situation. User logins - need I say more? :-) And telling 'violation' from 'security' is probably a lost cause.

Yeah, I know that. This was coming from logcheck and until now I didn't mean to improve it in any way.

...

...
One one hand, the tagging functionality (e.g. the ability to also associate tags with each message) is superior to classes.

Absolutely, tag clouds would be a much better bet than a tree of categories.

...
But it is difficult to do with tags (except for using filters and different destinations), as there's no such functionality. Another problem is that tags/classes are completely independent, in order to filter on the class of the message, one would have to use a match() filter like this:

Actually, that is a positive - especially when you include custom tags , like regulatory relevance or relevance to a particular unit inside the organization.

...
My conclusion is that classes are better when used in templates, tags are better when filtering. The two should be merged somehow.

Tags can be organized in 'bunches' that serve as classes.

You mean, every tag would belong to a bunch and a given message could only be part of a single bunch? Thus any single tag would indicate the bunch the message belongs to? Or, I might be completely missing something.

...

...
3) drop the class stuff and implement a macro trick that makes it possible to use tags in macro context

I'd avoid hard-coded classes altogether and go with all tags, possibly organized in "classes of tags" or bunches or whatever.

...

...
On an independent matter, the set of classes may need some thought. As

Ah, that's because it will fail - multi-mapping will kill it. This was pretty much our starting point in CEE as many of us spent time doing it at SIEM players. So, SIEM vendors have been trying to build HUGE trees of events and ultimately they became unwieldy. Tags will be more manageable and simple relationships can be established between them.

...
probably needs to be expanded. Last time I got patterns for DNS queries, and although I could shove them into "system", right now I feel that the point of classification is to categorize events by

Well, now multiply it by roughly 120,000 events types that leading SIEM vendors categorized over the years and you'd know you don't want that :-)

Right.

...

...
"importance", in a similar spirit to syslog severity, but one that works even if the application developer uses a bogus severity when sending syslog messages.

Importance is a HUGE challenge. Not sure what to add to this one, as it is largely an unsolved problem due to very different contexts for message analysis. Even a mere 'connection established' can be 10 of 10 for somebody in some circumstances. One can try to glue importance to tags (like exploit > connection) and not to individual messages; it might work sometimes.

Hmm... good idea. -- Bazsi

Anton Chuvakin

10:25 p.m.

...

However quickly browsing through the PDF I couldn't find the taxonomy portion, is this "almost ready" stuff available somewhere?

Not public yet, but will be very soon. Let me see what I can send over at this stage. The main idea for CEE taxonomy is "OAS" for object/action/status "tags" being mandatory for each message. We found this to be both more useful and more doable than a single class for the message. Essentially, you should be able to unambiguously determine what every log message in the world (!) means by reading the OAS triad.

...

...
Tags can be organized in 'bunches' that serve as classes. You mean, every tag would belong to a bunch and a given message could only be part of a single bunch?

No, it will be many-to-many where a message can carry many tags, but it can be filtered both by tags and bunches. Bunch of tags is simply a "next level tag" like:

message 1 linux user login failed
tagged: authentication, user, failure, PCI DSS compliance

authentication tag is part of "AAA bunch", "Action" bunches
PCI DSS compliance tag is part of "Regulations" bunch
failure is part of "status"

In CEE, OAS triad will likely be used as "default tags" for all messages.
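The many-to-many relationship between tags and bunches can be sketched in Python as follows. The table only encodes the memberships named in the example above ("user" is left without a bunch because none is given for it), and the function names are hypothetical.

```python
# Tag -> bunches it belongs to, taken from the "message 1" example.
TAG_BUNCHES = {
    "authentication": {"AAA", "Action"},
    "PCI DSS compliance": {"Regulations"},
    "failure": {"status"},
}

MESSAGE_1_TAGS = {"authentication", "user", "failure", "PCI DSS compliance"}


def bunches_of(tags):
    """Collect every bunch reachable from the tags on a message."""
    found = set()
    for tag in tags:
        found.update(TAG_BUNCHES.get(tag, ()))
    return found


def matches(tags, tag=None, bunch=None):
    """Filter by a tag, by a bunch, or by both at once."""
    if tag is not None and tag not in tags:
        return False
    if bunch is not None and bunch not in bunches_of(tags):
        return False
    return True
```

Because a message can land in several bunches at once, a bunch behaves like a derived tag rather than like a single exclusive class.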

...

...
...
"importance", in a similar spirit to syslog severity, but one that works even if the application developer uses a bogus severity when sending syslog messages.

Importance is a HUGE challenge. Not sure what to add to this one, as it is largely an unsolved problem due to very different contexts for message analysis. Even a mere 'connection established' can be 10 of 10 for somebody in some circumstances. One can try to glue importance to tags (like exploit > connection) and not to individual messages; it might work sometimes.

Hmm... good idea.

Maybe.. this issue took about 3 years of discussion among CEE team - and there is still no resolution to "universal syslog/log message severity scoring"

Let me know how else I can help.

-- 
Dr. Anton Chuvakin
Site: http://www.chuvakin.org
Blog: http://www.securitywarrior.org
LinkedIn: http://www.linkedin.com/in/chuvakin
Consulting: http://www.securitywarriorconsulting.com
Twitter: @anton_chuvakin
Google Voice: +1-510-771-7106

Balazs Scheidler

4 Sep

8:02 a.m.

On Fri, 2010-09-03 at 13:25 -0700, Anton Chuvakin wrote:

...

...
However quickly browsing through the PDF I couldn't find the taxonomy portion, is this "almost ready" stuff available somewhere?

Not public yet, but will be very soon. Let me see what I can send over at this stage. The main idea for CEE taxonomy is "OAS" for object/action/status "tags" being mandatory for each message. We found this to be both more useful and more doable than a single class for the message. Essentially, you should be able to unambiguously determine what every log message in the world (!) means by reading the OAS triad.

...
...
Tags can be organized in 'bunches' that serve as classes. You mean, every tag would belong to a bunch and a given message could only be part of a single bunch?

No, it will be many-to-many where a message can carry many tags, but it can be filtered both by tags and bunches. Bunch of tags is simply a "next level tag" like:

message 1 linux user login failed tagged: authentication, user, failure, PCI DSS compliance

authentication tag is part of "AAA bunch", "Action" bunches PCI DSS compliance tag is part of "Regulations" bunch failure is part of "status"

In CEE, OAS triad will likely be used as "default tags" for all messages.

Is it a recursive hierarchy? E.g. is it possible to organize bunches to even higher level bunches?

Also what I see unsolved is how the user can easily sort messages into files/tables by bunch. E.g. something like:

  destination d_files_by_bunch { file("/var/log/messages.$bunch"); };

Although if I were to define multi-value name-value pairs the one above could expand to multiple file writes. This way writing by tags or by bunches should be very simple. Interesting idea...
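The "expand to multiple file writes" idea can be illustrated with a short Python sketch. It assumes a hypothetical multi-valued "$bunch" macro and simply produces one file name per value; sorting the values is an arbitrary choice made here to keep the output stable.

```python
from string import Template


def expand_destinations(path_template, bunches):
    """Return one file path per bunch the message belongs to.

    Assumption: "$bunch" is a multi-valued macro, so a single
    destination fans out into several writes.
    """
    template = Template(path_template)
    return [template.substitute(bunch=name) for name in sorted(bunches)]
```

A message belonging to the "AAA" and "status" bunches would then be written to both /var/log/messages.AAA and /var/log/messages.status, which is exactly the duplication of volume that a multi-value expansion implies.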

...

...
...
...
"importance", in a similar spirit to syslog severity, but one that works even if the application developer uses a bogus severity when sending syslog messages.

Importance is a HUGE challenge. Not sure what to add to this one, as it is largely an unsolved problem due to very different contexts for message analysis. Even a mere 'connection established' can be 10 of 10 for somebody in some circumstances. One can try to glue importance to tags (like exploit > connection) and not to individual messages; it might work sometimes.

Hmm... good idea.

Maybe.. this issue took about 3 years of discussion among CEE team - and there is still no resolution to "universal syslog/log message severity scoring"

Let me know how else I can help.

Yeah, but using tags/bunches one can define which is more important to her. -- Bazsi

Anton Chuvakin

7:57 p.m.

...

...
In CEE, OAS triad will likely be used as "default tags" for all messages.

Is it a recursive hierarchy? e.g. is it possible to organize bunches to even higher level bunches?

Actually, we have not thought about it yet :-(

...

Also what I see unsolved is how the user can easily sort messages into files/tables by bunch.

This probably has to be done inside the log analysis tool that is aware of tags and their bunches.

...

Although if I were to define multi-value name-value pairs the one above could expand to multiple file writes. This way writing by tags or by bunches should be very simple.

Multi-value N=V are evil. They kill log parsers and RDBMS :-) We did think a lot about this conundrum of src_IP="10.10.1.2,10.10.1.3" and might well recommend that it never happens. If we have to deaggregate logs (thus exploding the volume) the whole thing would be a mess...

-- 
Dr. Anton Chuvakin
Site: http://www.chuvakin.org
Blog: http://www.securitywarrior.org
LinkedIn: http://www.linkedin.com/in/chuvakin
Consulting: http://www.securitywarriorconsulting.com
Twitter: @anton_chuvakin
Google Voice: +1-510-771-7106

Martin Holste

5 Sep

3:40 a.m.

...

Multi-value N=V are evil. They kill log parsers and RDBMS :-) We did think a lot about this conundrum of src_IP="10.10.1.2,10.10.1.3" and might well recommend that it never happens. If we have to deaggregate logs (thus exploding the volume) the whole thing would be a mess...

Yes, they are evil. I was re-reading the recent thread "[syslog-ng] [announce] patterndb project," and I think we were in agreement that tags are still a good thing, though. So, how do we store the multi-value N=V but also have the flexibility of tags?

My thought is maybe we go with a "primary" tag which is the class, and then the <tags> can be output via macro $TAG. ($TAG will contain all values in <tags>, right?) So for the macro-based file name, you would only use

  file("/var/log/messages.${.classifier.class}.log")

and do your tag grepping normally, where .classifier.class would be the primary tag. I think this would work out better in the long run than trying to concatenate tags for the class, because keeping track of the order would be complicated, and it would definitely be better than sticking to logcheck's very limited range of class selections.

Balazs Scheidler

6 Sep

10:48 a.m.

On Sat, 2010-09-04 at 20:40 -0500, Martin Holste wrote:

...

...
Multi-value N=V are evil. They kill log parsers and RDBMS :-) We did think a lot about this conundrum of src_IP="10.10.1.2,10.10.1.3" and might well recommend that it never happens. If we have to deaggregate logs (thus exploding the volume) the whole thing would be a mess...

Yes, they are evil. I was re-reading the recent thread "[syslog-ng] [announce] patterndb project," and I think we were in agreement that tags are still a good thing, though. So, how do we store the multi-value N=V but also have the flexibility of tags? My thought is maybe we go with a "primary" tag which is the class, and then the

What I'm thinking right now is to create the possibility to create a "tagdb", independently from the patterndb database (although they must play hand-in-hand).

This tagdb would define the tag hierarchy (tags in bunches basically) and could perhaps also associate type with the tags.

For example, Anton said that CEE is moving in the direction to provide OAS (=object, action, status) tag triplets for each log message. This type information could be represented with the hierarchy, or the "type" field.

For example (representing tag types with a hierarchy):

  <tagdb>
    <bunch name="object">
      <tag name="flowevt"/>
    </bunch>
    <bunch name="status">
    </bunch>
    <bunch name="action">
      <tag name="secevt"/>
    </bunch>
  </tagdb>

For example (representing tag types explicitly):

  <tagdb>
    <bunch name="security">
      <tag type="object" name="flowevt"/>
      <tag type="action" name="secevt"/>
    </bunch>
    <bunch name="storage">
      <tag type="object" name="file"/>
      <tag type="object" name="database"/>
    </bunch>
    <tag type="class" name="violation"/>
    <tag type="class" name="security"/>
    <tag type="class" name="system"/>
    <tag type="class" name="unknown"/>
    <tag name="just-a-simple-tag-without-type"/>
  </tagdb>

The two are more-or-less equivalent if a single tag can belong to multiple bunches, which I guess it can. The difference is that the "type" property of the tag can be used more easily by syslog-ng itself.

The behaviour of syslog-ng would be (typed tags):

1) if a message is tagged with a tag type=="class", it'd become .classifier.class
2) patterndb could validate easily that each message gets an object/status/action tag

The behaviour of syslog-ng would be (hierarchy based tags):

1) there would be builtin bunches that must exist
2) based on the built-in bunches syslog-ng could enforce the same as the typed bunches

For some reason I rather like typed tags, even though it is somewhat more bureaucratic: users/pattern authors should be free to create their tags without limitation.

Opinions?
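The two "typed tags" behaviours can be tried out with a small Python sketch that loads the explicit-type example with the standard xml.etree.ElementTree module. The tagdb format is only a proposal in this mail, and the function names and the choice to return None when no class tag is present are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The "typed" tagdb example from the mail, unchanged.
TAGDB_XML = """
<tagdb>
  <bunch name="security">
    <tag type="object" name="flowevt"/>
    <tag type="action" name="secevt"/>
  </bunch>
  <bunch name="storage">
    <tag type="object" name="file"/>
    <tag type="object" name="database"/>
  </bunch>
  <tag type="class" name="violation"/>
  <tag type="class" name="security"/>
  <tag type="class" name="system"/>
  <tag type="class" name="unknown"/>
  <tag name="just-a-simple-tag-without-type"/>
</tagdb>
"""


def load_tag_types(xml_text):
    """Map every tag name to its type (None for untyped tags)."""
    root = ET.fromstring(xml_text.strip())
    return {tag.get("name"): tag.get("type") for tag in root.iter("tag")}


def classifier_class(message_tags, tag_types):
    """Behaviour 1: the first tag of type "class" becomes the class."""
    for tag in message_tags:
        if tag_types.get(tag) == "class":
            return tag
    return None  # assumption: no class tag means no class is derived


def missing_oas(message_tags, tag_types):
    """Behaviour 2: report which of object/action/status is absent."""
    present = {tag_types.get(tag) for tag in message_tags}
    return sorted({"object", "action", "status"} - present)
```

Run against the example, a message tagged flowevt, secevt and violation gets the class "violation" and is reported as missing a "status" tag, since the sample tagdb defines no tag of that type.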

...

<tags> can be output via macro $TAG. ($TAG will contain all values in <tags>, right?)

It is $TAGS and it already exists in 3.1.2. It expands to a comma-separated list of tags without further escaping (so tags should not contain spaces if your storage is a text file, otherwise the files become really difficult to process later).
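A small Python sketch of why the missing escaping matters. The record layout (host, program, tag list separated by whitespace) is an assumption made up for this illustration, and join_tags only mimics the comma-separated expansion described above.

```python
def join_tags(tags):
    """Mimic the described $TAGS expansion: comma separated, no escaping."""
    return ",".join(tags)


def parse_record(line):
    """Parse a whitespace-delimited record: host, program, tag list."""
    host, program, tag_field = line.split()
    return host, program, tag_field.split(",")


# Tags without spaces survive a round trip through the text file.
good = "host1 sshd " + join_tags(["secevt", "login"])
print(parse_record(good))

# A tag containing a space shifts the field boundaries and breaks the parser.
bad = "host1 sshd " + join_tags(["secevt", "failed login"])
try:
    parse_record(bad)
except ValueError as exc:
    print("cannot parse:", exc)
```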

...

So for the macro-based file name, you would only use file("/var/log/messages.${.classifier.class}.log") and do your tag grepping normally, where .classifier.class would be the primary tag. I think this would work out better in the long run than trying to concatenate tags for the class, because keeping track of the order would be complicated, and it would definitely be better than sticking to logcheck's very limited range of class selections.
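As a rough illustration of the "primary tag as class" idea, the Python sketch below expands the file name template from a message's name-value pairs and filters on the remaining tags separately. The expand and has_tag helpers are hypothetical and are not syslog-ng's template engine; expanding unknown names to an empty string is a choice made for this sketch only.

```python
import re


def expand(template, values):
    """Replace each ${name} with values[name]; unknown names become ""."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: values.get(m.group(1), ""),
                  template)


def has_tag(values, tag):
    """Filter-style check against the comma-separated tag list."""
    return tag in values.get("TAGS", "").split(",")


# One class (the primary tag) plus any number of ordinary tags.
msg = {".classifier.class": "security", "TAGS": "flowevt,deny,success"}

print(expand("/var/log/messages.${.classifier.class}.log", msg))
print(has_tag(msg, "deny"))
```

The class picks exactly one destination file, so tag order never matters for the file name, while the other tags stay available for filtering.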

-- Bazsi

Martin Holste

11:55 p.m.

...

What I'm thinking right now is to create the possibility to create a "tagdb", independently from the patterndb database (although they must play hand-in-hand).

This tagdb would define the tag hierarchy (tags in bunches basically) and could perhaps also associate type with the tags.

That would be really nice, but it sounds like a lot of effort will be required on your part. Still, sounds good if you're up for the maintenance.

...

<tagdb>
  <bunch name="security">
    <tag type="object" name="flowevt"/>
    <tag type="action" name="secevt"/>
  </bunch>
  <bunch name="storage">
    <tag type="object" name="file"/>
    <tag type="object" name="database"/>
  </bunch>
  <tag type="class" name="violation"/>
  <tag type="class" name="security"/>
  <tag type="class" name="system"/>
  <tag type="class" name="unknown"/>
  <tag name="just-a-simple-tag-without-type"/>
</tagdb>

This seems workable, but to me, all that is required is a standard list of classes and tags to use as a guide for contributions. People can pick the most important tag to be the class name, and then just apply the rest as tags. A worthwhile discussion could take place on whether the most general or most specific tag should be used for the class. This format would still comply with the CEE requirements as long as all of the tags needed are present. So, it would look more like:

.classifier.class="security"
<tags>
  <tag>flowevt</tag>
  <tag>deny</tag>
  <tag>success</tag>
</tags>

Or, you could be explicit with the CEE values:

<tag>object.flowevt</tag>

...

For some reason I rather like type tags, even though it is somewhat more bureaucratic: users/pattern authors should be free to create their tags without limitation.

Opinions?

I agree.

Balazs Scheidler

13 Sep

4:26 p.m.

On Mon, 2010-09-06 at 16:55 -0500, Martin Holste wrote:

...

...
What I'm thinking right now is to create the possibility to create a "tagdb", independently from the patterndb database (although they must play hand-in-hand).

This tagdb would define the tag hierarchy (tags in bunches basically) and could perhaps also associate type with the tags.

That would be really nice, but it sounds like a lot of effort will be required on your part. Still, sounds good if you're up for the maintenance.

I'd think that maintaining the set of tags would be needed for patterndb as well. I wouldn't go beyond what is needed there, even though I'd like to make it possible to extend the tag cloud from user-supplied configuration.

...

...
<tagdb>
  <bunch name="security">
    <tag type="object" name="flowevt"/>
    <tag type="action" name="secevt"/>
  </bunch>
  <bunch name="storage">
    <tag type="object" name="file"/>
    <tag type="object" name="database"/>
  </bunch>
  <tag type="class" name="violation"/>
  <tag type="class" name="security"/>
  <tag type="class" name="system"/>
  <tag type="class" name="unknown"/>
  <tag name="just-a-simple-tag-without-type"/>
</tagdb>

This seems workable, but to me, all that is required is a standard list of classes and tags to use as a guide for contributions. People can pick the most important tag to be the class name, and then just apply the rest as tags. A worthwhile discussion could take place on whether the most general or most specific tag should be used for the class. This format would still comply with the CEE requirements as long as all of the tags needed are present. So, it would look more like:

.classifier.class="security"
<tags>
  <tag>flowevt</tag>
  <tag>deny</tag>
  <tag>success</tag>
</tags>

Or, you could be explicit with the CEE values: <tag>object.flowevt</tag>

...
For some reason I rather like type tags, even though it is somewhat more bureaucratic: users/pattern authors should be free to create their tags without limitation.

Opinions?

I agree.

Meanwhile I've talked with Marton (the original author behind tags and patterndb). His opinion was that the "type" field is difficult to define semantically, and that it makes it difficult to handle situations where the same tag would have multiple types, while the original tags/bunches would nicely handle N:M relationships between tags. So at the end of this (in-person) discussion we agreed that we don't need a type field, just a set of predefined "root" bunches.

-- Bazsi
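The root-bunch outcome can be sketched in Python as follows. This is an assumption-laden illustration, not a description of any implementation: the bunch contents are made up (in particular, placing "deny" in two bunches is hypothetical), and bunches_of and uncovered_roots are names invented here. The point is that plain set membership gives the N:M relationship for free, with no per-tag type field.

```python
# Hypothetical predefined "root" bunches; a tag may sit in several of them.
ROOT_BUNCHES = {
    "object": {"flowevt", "file", "database"},
    "action": {"secevt", "deny"},
    "status": {"success", "deny"},  # "deny" is in two bunches (N:M)
    "class": {"violation", "security", "system", "unknown"},
}


def bunches_of(tag):
    """Return every root bunch that contains the tag."""
    return sorted(name for name, members in ROOT_BUNCHES.items()
                  if tag in members)


def uncovered_roots(message_tags, required=("object", "action", "status")):
    """List the required root bunches that none of the message's tags hit."""
    tags = set(message_tags)
    return [root for root in required if not ROOT_BUNCHES[root] & tags]


print(bunches_of("deny"))                     # one tag, two bunches
print(uncovered_roots(["flowevt"]))           # action and status still missing
print(uncovered_roots(["flowevt", "deny"]))   # all three roots covered
```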

Balazs Scheidler

6 Sep

10:26 a.m.

On Sat, 2010-09-04 at 10:57 -0700, Anton Chuvakin wrote:

...

...
...
In CEE, OAS triad will likely be used as "default tags" for all messages.

Is it a recursive hierarchy? e.g. is it possible to organize bunches to even higher level bunches?

Actually, we have not thought about it yet :-(

...
Also what I see unsolved is how the user can easily sort messages into files/tables by bunch.

This probably has to be done inside the log analysis tool that is aware of tags and their bunches.

...
Although if I were to define multi-value name-value pairs the one above could expand to multiple file writes. This way writing by tags or by bunches should be very simple.

Multi-value N=V are evil. They kill log parsers and RDBMS :-) We did think a lot about this conundrum of src_IP="10.10.1.2,10.10.1.3" and might well recommend that it never happens. If we have to deaggregate logs (thus exploding the volume) the whole thing would be a mess...

Right, understood, agreed.

-- Bazsi
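The volume problem with deaggregating multi-value pairs can be illustrated with a short Python sketch. The record below is made up for the example (only the src_IP value comes from the message above), and deaggregate is a hypothetical helper: it emits one single-valued record per combination of values, so the row count is the product of the value counts.

```python
from itertools import product


def deaggregate(record, sep=","):
    """Split every multi-value field and emit one record per combination."""
    keys = list(record)
    value_lists = [record[key].split(sep) for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Two source addresses and three (assumed) ports in a single log record.
record = {
    "src_IP": "10.10.1.2,10.10.1.3",
    "dst_port": "22,80,443",
    "action": "deny",
}

rows = deaggregate(record)
print(len(rows))  # 2 * 3 * 1 combinations
for row in rows:
    print(row)
```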

17 comments

4 participants

Anton Chuvakin
Balazs Scheidler
Martin Holste
Matthew Hall