[syslog-ng] [announce] patterndb project

Balazs Scheidler bazsi at balabit.hu
Mon Jul 5 11:36:46 CEST 2010


On Sun, 2010-07-04 at 20:12 -0500, Martin Holste wrote:
> > Maybe I'm missing something, but as I see the current "tags" function
> > which is present in patterndb v3 (e.g. syslog-ng OSE 3.1 or later) is
> > exactly what you describe.
> >
> 
> Yes, precisely!  I got a paragraph into my response and realized that,
> so I stated that I wasn't sure if there was any difference between a
> class element and a tag element.  Thinking about that more, I don't
> see enough value in the formality of declaring a class element above
> and beyond what the tag element already accomplishes, as long as the
> XSD can enforce proper conformance.

Right now we supply an XSD, but schema validity isn't actually checked
anywhere. I'm not completely convinced that I'd add this to the schema
itself, but I do see the value of being able to validate it, so agreed:
whether it is based on the XSD or something else, a validation tool
would be useful.

> 
> > If I understand you correctly, you were referring to the "class"
> > attribute of the rule element, and extending that. The way I see it, the
> > "tags" feature is far superior to using classes; maybe a deprecation of
> > the class attribute would be needed.
> 
> Agreed.

Great.

> 
> > But this is already possible with v3.1. The only problem with using three tags
> > instead of one is how to store the extracted information in a way that it can
> > be combined later.
> 
> I'll show you a schema that I've been using which compromises between
> flexibility and formalism.  Here's my Snort parser:
> 
> <ruleset name="snort" id='8'>
>   <pattern>snort</pattern>
>   <rules>
>     <rule provider="LOCAL" class='8' id='8'>
>       <patterns>
>         <pattern>@QSTRING:s0:[]@ @ESTRING:s1:[@Classification:@QSTRING:s2: ]@ [Priority: @NUMBER:i0:@]: @QSTRING:i1:{}@ @IPv4:i2:@:@NUMBER:i3:@ -> @IPv4:i4:@:@NUMBER:i5:@</pattern>
>       </patterns>
>     </rule>
>   </rules>
> </ruleset>
> 
> And here's my MySQL table schema:
> 
> CREATE TABLE `syslogs_template` (
>   `id` bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
>   `timestamp` INT UNSIGNED NOT NULL DEFAULT 0,
>   `host_id` INT UNSIGNED NOT NULL DEFAULT '1',
>   `program_id` INT UNSIGNED NOT NULL DEFAULT '1',
>   `class_id` SMALLINT unsigned NOT NULL DEFAULT '1',
>   `rule_id` SMALLINT unsigned NOT NULL DEFAULT '1',
>   msg TEXT,
>   i0 INT UNSIGNED,
>   i1 INT UNSIGNED,
>   i2 INT UNSIGNED,
>   i3 INT UNSIGNED,
>   i4 INT UNSIGNED,
>   i5 INT UNSIGNED,
>   s0 VARCHAR(255),
>   s1 VARCHAR(255),
>   s2 VARCHAR(255),
>   s3 VARCHAR(255),
>   s4 VARCHAR(255),
>   s5 VARCHAR(255)
> ) ENGINE=MyISAM;

What I see with this is that once we have a schema description language
(right now that's SCHEMAS.txt, but a more formal language later), we
could map each name-value pair to multiple representations, and
basically the "storage" functionality is what decides which mapping to
use.

A naive schema-based SQL destination would simply create as many tables
as there are schemas. A better optimized one would use the NV -> field
mapping that you propose, and a NoSQL implementation would just scale to
any number of NV pairs without having to rename the fields.

This mapping support would also be useful if we want to generate CEF/CEE
formatted events.
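
To give a concrete example of what I mean (a sketch only; the table and
column names below are made up, not taken from the actual schemas):

-- naive mapping: one table per schema, one column per schema field
CREATE TABLE secevt_events (
  msg_id   BIGINT UNSIGNED NOT NULL,
  event    VARCHAR(255),
  username VARCHAR(255),
  srcip    INT UNSIGNED
) ENGINE=MyISAM;

-- optimized mapping: the same name-value pairs squeezed into the
-- generic iN/sN columns of a single table, like your syslogs_template
INSERT INTO syslogs_template (msg, s0, s1, i2)
  VALUES ('raw message here', 'login_failure', 'jdoe', INET_ATON('10.0.0.1'));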

> 
> So what I've done is trade away the ability to create as many fields
> as I want, and some human readability, in exchange for the guarantee
> of only ever inserting into a single destination table.  In my schema,
> I've got the capacity to store up to six integer fields and six string
> fields, in addition to the other syslog header data.  (I never use
> priority, so I eventually dropped the priority column to save space.)
> 
> The tricky thing about this setup is that when you go to query, you
> first have to translate the field name into what is actually stored in
> the database, so srcip becomes "i3."  I have separate, tiny lookup
> tables that I use for presenting the actual text.  I do the same with
> program names by storing only the CRC32 value of the program as its
> program ID and keeping a lookup table for the actual text.  (I use a
> CRC algorithm instead of an auto-generated ID so that cluster nodes
> don't have to sync their ID values.)
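
Just to make sure I follow, the lookup you describe would be something
like this, I guess (the "programs" table is my invention):

CREATE TABLE programs (
  id      INT UNSIGNED NOT NULL PRIMARY KEY,
  program VARCHAR(255) NOT NULL
) ENGINE=MyISAM;

INSERT IGNORE INTO programs (id, program) VALUES (CRC32('snort'), 'snort');

SELECT l.*
  FROM syslogs_template l
  JOIN programs p ON l.program_id = p.id
 WHERE p.program = 'snort';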
> 
> If you were interested in heading down this road, then I would suggest
> adding elements to the pattern XML schema to specify what the actual
> values of the fields are, like this:
> 
> <fields>
>   <field name="i0">sig_priority</field>
>   <field name="i1">proto</field>
>   <field name="i2">srcip</field>
>   <field name="i3">srcport</field>
>   <field name="i4">dstip</field>
>   <field name="i5">dstport</field>
>   <field name="s0">sig_name</field>
>   <field name="s1">sig_sid</field>
>   <field name="s2">sig_classification</field>
> </fields>
> 
> Or you could have a more formal format like this:
> 
> <fields>
>   <integers>
>     <integer id="0">sig_priority</integer>
>     ...
>   </integers>
>   <strings>
>     <string id="0">sig_name</string>
>     ...
>   </strings>
> </fields>
> 
> > The only missing bit here is that right now syslog-ng is unable to generate a
> > unique message ID on its own, but that's not very difficult to add.
> >
> 
> I would continue to leave that to the app, or in my case, the SQL
> schema as I think it's better handled there.

The problem is that I'd like to support the multiple-tables idea as
well, e.g. storing each schema in a separate table. In that case you
need a unique id in order to join the tables. Also, if this were
combined with the MSGID field of RFC 5424, it could be used to fetch
the original raw message easily.
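
E.g. something along these lines (the table and column names here are
completely made up, just to show the join on a per-message unique id):

SELECT r.msg, f.srcip, n.xlated_ip
  FROM raw_messages r
  JOIN flowevt_log f ON f.msg_id = r.id
  LEFT JOIN natevt_log n ON n.msg_id = r.id
 WHERE r.id = 12345;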

> 
> > What do you think? Based on this idea, I'm proposing to split the current
> > secevt schema into 3 smaller ones: flowevt, natevt and secevt.
> >
> > Please check the git archive where I've pushed the current version.
> >
> 
> I gave it a quick look and it seems right to me.  One topic for
> discussion: should each class have an optional "details" field
> available, or should that be an implicitly available field to all log
> classes?  

True enough... the "details" field is used to shove all non-structured
but related information into the event, and if a given event used
multiple schemas, we'd end up with three "details" fields, each possibly
containing overlapping information.

Hmm, maybe "details" should live above all schemas, e.g. instead of
calling it "secevt.details" it should simply be called "details". It is
a single pattern that extracts all the fields after all, so the pattern
author can decide which information doesn't fit into any of the schemas
and put that into details.

> 
> Obviously, the big thing missing from my current schema is a way to
> store N number of tag values.  Bitmasks provide an interesting
> possibility, but that would limit us to X number of classes (either 32
> or 64, I guess).  Could almost every class be covered in 64 classes?
> If so, a tag_ids BIGINT UNSIGNED column would allow a convenient place
> to put all of the tag info.  For instance, if we had the following
> lookup table:
> 
> id  name
> 1  Net
> 2  NAT
> 4  Security
> 
> Then a row with tag_ids=7 would mean it has all three tags, and a row
> where tag_ids&2 is non-zero would have the NAT flag set because that
> bit matches the bitwise AND.  To query by name, you would do this:
> 
> SELECT * FROM logs JOIN classes ON (logs.tag_ids&classes.id) WHERE
> classes.name="NAT";
> 
> The major problem with this is that databases won't be able to use an
> index when doing bitwise comparisons, so searching based on only one
> known tag would be extremely slow compared to a const index lookup
> where you're looking for an exact match of all given tags.  So this
> query would use an index and be very fast:
> 
> SELECT * FROM logs WHERE tag_ids=(SELECT SUM(id) FROM classes WHERE
> name="Security" OR name="NAT" OR name="Net");
> 
> On the other hand, if we're willing to give up a lot of space and some
> speed, you could do VARCHAR columns with a CSV of the tags
> ("Net,NAT,Security"), or assign numeric IDs to save space
> ("1,2,4").  If you used (",1,2,4,") then you could do your searches as
> WHERE tag_ids LIKE "%,2,%", which would still be slow but easy to
> program.
> 
> Of course, the classic answer is to use a giant index map table with
> (log_id, tag_id) as the columns which represents the one-to-many
> relationship.  That works but doesn't scale particularly well, and
> brings with it the burden of managing an appendix table for every log
> table, which gets annoying when dealing with rollover, etc.

Well, I believe that in SQL the best we could probably come up with is
a "list of tags" field combined with free-text indexing.
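
Something like this in MySQL terms (only a sketch; note that the default
ft_min_word_len of 4 would have to be lowered for short tag names like
"NAT" to be indexed at all):

ALTER TABLE syslogs_template ADD COLUMN tags VARCHAR(255);
ALTER TABLE syslogs_template ADD FULLTEXT INDEX tags_idx (tags);

-- tags stored as a space separated list, e.g. 'Net NAT Security'
SELECT * FROM syslogs_template
 WHERE MATCH(tags) AGAINST('+NAT +Security' IN BOOLEAN MODE);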


-- 
Bazsi


