[syslog-ng] [announce] patterndb project

Mon Jul 5 03:12:18 CEST 2010

> Maybe I'm missing something, but as I see the current "tags" function
> which is present in patterndb v3 (e.g. syslog-ng OSE 3.1 or later) is
> exactly what you describe.
>

Yes, precisely!  I got a paragraph into my response and realized that,
so I stated that I wasn't sure if there was any difference between a
class element and a tag element.  Thinking about that more, I don't
see enough value in the formality of declaring a class element above
and beyond what the tag element already accomplishes, as long as the
XSD can prove the proper conformity.

> If I understand you correctly, you were referring to the "class"
> attribute of the rule element, and extend that. The way I think is that the
> "tags" feature is far superior than using classes, maybe a deprecation of
> the class attribute is would be needed.

Agreed.

> But this is already possible with v3.1. The only problem with using three tags
> instead of one, is how to store the extracted information in a way that it can
> be combined later.

I'll show you a schema that I've been using which compromises between
flexibility and formalism.  Here's my Snort parser:

<ruleset name="snort" id='8'>
  <pattern>snort</pattern>
  <rules>
    <rule provider="LOCAL" class='8' id='8'>
      <patterns>
        <pattern>@QSTRING:s0:[]@
@ESTRING:s1:[@Classification:@QSTRING:s2: ]@ [Priority: @NUMBER:i0:@]:
@QSTRING:i1:{}@ @
IPv4:i2:@:@NUMBER:i3:@ -> @IPv4:i4:@:@NUMBER:i5:@</pattern>
        </patterns>
    </rule>
  </rules>
</ruleset>

And here's my MySQL table schema:

CREATE TABLE `syslogs_template` (
  `id` bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  `timestamp` INT UNSIGNED NOT NULL DEFAULT 0,
  `host_id` INT UNSIGNED NOT NULL DEFAULT '1',
  `program_id` INT UNSIGNED NOT NULL DEFAULT '1',
  `class_id` SMALLINT unsigned NOT NULL DEFAULT '1',
  `rule_id` SMALLINT unsigned NOT NULL DEFAULT '1',
  msg TEXT,
  i0 INT UNSIGNED,
  i1 INT UNSIGNED,
  i2 INT UNSIGNED,
  i3 INT UNSIGNED,
  i4 INT UNSIGNED,
  i5 INT UNSIGNED,
  s0 VARCHAR(255),
  s1 VARCHAR(255),
  s2 VARCHAR(255),
  s3 VARCHAR(255),
  s4 VARCHAR(255),
  s5 VARCHAR(255)
) ENGINE=MyISAM;

So what I've done is traded the ability to create as many fields as I
want and human readability for the ability to guarantee only inserting
into a single destination table.  In my schema, I've got the capacity
to store up to six integer fields and six string fields, in addition
to the other syslog header data.  (I never use priority, so I
eventually dropped the priority column to save space.)

The tricky thing about this setup is that when you go to query, you
first have to translate the field into what is actually stored in the
database, so srcip becomes "i3."  I have separate, tiny lookup tables
that I use for presenting the actual text.  I do the same with program
names by storing only the CRC32 value of the program as its program ID
and keeping a lookup table for the actual text.  (I use a CRC
algorithm instead of an auto-generated ID so that ID's between cluster
nodes don't have to sync their values.)

If you were interested in heading down this road, then I would suggest
adding elements to the pattern XML schema to specify what the actual
values of the fields are, like this:

<fields>
  <field name="i0">sig_priority</field>
  <field name="i1">proto</field>
  <field name="i2">srcip</field>
  <field name="i3">srcport</field>
  <field name="i4">dstip</field>
  <field name="i5">dstport</field>
  <field name="s0">sig_name</field>
  <field name="s1">sig_sid</field>
  <field name="s2">sig_classification</field>
</fields>

Or you could have a more formal format like this:

<fields>
  <integers>
    <integer id=0>sig_priority</integer>
    ...
  </integers>
  <strings>
    <string id=0>sig_name</string>
    ...
  </strings>
</fields>

> The only missing bit here is that right now syslog-ng is unable to generate a
> unique message ID on its own, but that's not very difficult to add.
>

I would continue to leave that to the app, or in my case, the SQL
schema as I think it's better handled there.

> What do you think? Based on this idea, I'm proposing to split the current
> secevt schema into 3 smaller ones: flowevt, natevt and secevt.
>
> Please check the git archive where I've pushed the current version.
>

I gave it a quick look and it seems right to me.  One topic for
discussion: should each class have an optional "details" field
available, or should that be an implicitly available field to all log
classes?  In the case of my above SQL schema, it could be amended to
include a column defined:

details VARCHAR(255)

as the last column of the schema.

Obviously, the big thing missing from my current schema is a way to
store N number of tag values.  Bitmasks provide an interesting
possibility, but that would limit us to X number of classes (either 32
or 64, I guess).  Could almost every class be covered in 64 classes?
If so, a tag_ids BIGINT UNSIGNED column would allow a convenient place
to put all of the tag info.  For instance, if we had the following
lookup table:

id  name
1  Net
2  NAT
4  Security

Then a row with tag_ids=7 would mean it has all three tags, and a row
with tag_ids&2 would have the NAT flag set because it would match the
boolean AND for that bit.  To query by name, you would do this:

SELECT * FROM logs JOIN classes ON (logs.tag_ids&classes.id) WHERE
classes.name="NAT";

The major problem with this is that databases won't be able to use an
index when doing boolean logic comparisons, so searching based on only
one known tag would be extremely slow compared to const index lookup
if you're looking for a perfect match of all given tags.  So this
query would use an index and be very fast:

SELECT * FROM logs WHERE tag_ids=(SELECT SUM(id) FROM classes WHERE
class="Security" OR class="NAT" OR class="Net");

On the other hand, if we're willing to give up a lot of space and the
speed, you could do VARCHAR columns with a CSV of the tags
("Net,NAT,Security") or assign numeric ID's to save space with
("1,2,4").  If you used (",1,2,4,") then you could do your searches as
WHERE tag_ids LIKE "%,2,%" which would still be slow but easy to
program.

Of course, the classic answer is to use a giant index map table with
(log_id, tag_id) as the columns which represents the one-to-many
relationship.  That works but doesn't scale particularly well, and
brings with it the burden of managing an appendix table for every log
table, which gets annoying when dealing with rollover, etc.