[syslog-ng] [RFC]: Pattern matching & correlation ideas

Fri Sep 7 18:22:55 CEST 2012

Hi,

I agree with you in that the syntax and the structure of patterndb 
desperately needs an overhaul and I like the approach your suggestions 
show. However, I've got some problems with your proposal. See my 
comments below (in no particular order).

1) the lisp-y syntax

This is fundamentally different from the main config file of syslog-ng 
and alien to those not used to working in lisp-like languages -- who, I 
guess, make up the majority of syslog-ng users. If we are to change the 
format of patterndb, I'd suggest something like what we have in the main 
config file or something JSON-ish, similar to what logstash has: 
http://logstash.net/docs/1.1.1/filters/grok. At least I'd be just as 
annoyed to type all those parentheses as I am typing the tons of <tags> 
of XML :) I want to write patterns and not code or a huge XML. The 
actual container format just needs to get out of my way as much as possible.

2) I like the action functions

I think these are the three main operations we need (set/clear/append), 
however, I wouldn't call appending conjoining but that's just me :)

3) What about pattern hierarchy == efficient matching?

Your proposal allows the user to define complex conditions for a pattern 
match. On the other hand, the patterns we have right now work in a way 
that allows us to organize them in a radix tree and use a greedy, 
non-backtracking algorithm for matching which makes this procedure 
incredibly fast. If we allowed more complex conditions, however, we'd 
need to fall back to linear matching: with 5000 patterns, we'd 
have to try each and every pattern against each incoming message. Which is 
slow.
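
To make the difference concrete, here is a minimal sketch of the prefix-tree idea in plain C. It is not patterndb's actual radix tree: it only handles literal text (none of patterndb's parsers), and all names in it (TrieNode, trie_insert, trie_match) are made up for illustration. The point is that a lookup walks the message once, so its cost depends on the message length and not on how many patterns are loaded.

```c
#include <assert.h>
#include <stdlib.h>

/* One node per literal byte; pattern_id >= 0 marks the end of a pattern.
   Hypothetical structure, not the one patterndb uses. */
typedef struct TrieNode
{
  struct TrieNode *child[256];
  int pattern_id;
} TrieNode;

static TrieNode *
trie_node_new(void)
{
  TrieNode *n = calloc(1, sizeof(TrieNode));

  assert(n != NULL);
  n->pattern_id = -1;
  return n;
}

/* Insert a literal pattern; patterns sharing a prefix share nodes. */
static void
trie_insert(TrieNode *root, const char *pattern, int id)
{
  assert(root != NULL && pattern != NULL);
  for (const unsigned char *p = (const unsigned char *) pattern; *p; p++)
    {
      if (!root->child[*p])
        root->child[*p] = trie_node_new();
      root = root->child[*p];
    }
  root->pattern_id = id;
}

/* Walk the message once, without backtracking. Returns the id of the
   longest stored pattern that is a prefix of msg, or -1 if none is. */
static int
trie_match(const TrieNode *root, const char *msg)
{
  int best = -1;

  assert(root != NULL && msg != NULL);
  for (const unsigned char *p = (const unsigned char *) msg; ; p++)
    {
      if (root->pattern_id >= 0)
        best = root->pattern_id;
      if (!*p || !root->child[*p])
        break;
      root = root->child[*p];
    }
  return best;
}
```

With arbitrary per-pattern conditions there is no shared structure like this to walk, which is why the fallback is trying the patterns one by one.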

4) this is horrible: (match "this " (capture :qstring :as "object") " is 
good")

Sorry for my bluntness, but it is :) It is indeed lisp-y, but it is hard 
to read and tedious to write with all those parentheses and 
"capture"s. I personally like the current syntax of the patterns 
themselves and I'd keep it as it is. (Grok -- again, of logstash fame -- 
also has something similar and it seems to be working for them, too: 
https://github.com/logstash/logstash/blob/v1.0.17/patterns/linux-syslog)

5) I agree that correlation should be handled separately -- but we need 
IDs/names for that!

I totally agree with you that correlation should be separated from 
parsing; I always have a hard time wrapping my mind around the way 
correlation works in patterndb. But to do that (and to do more filtering 
or anything else with the parsing results), a pattern needs a name or ID. 
Sure, it can be added by a (set! :pattern-name "foo") command in your 
example, but I think it needs a more prominent place.

So, as a summary: I think your approach has its place but cannot and 
should not replace patterndb. It can be incredibly flexible, and as a 
result we would not have to bastardize patterndb to support every weird 
use case that comes up; we could simply point users to these custom 
parsers instead. But this flexibility has the price of having to do 
one-by-one matching between patterns and messages, which brings in a huge 
performance penalty. We still have to give an easy-to-use solution for 
users who simply want to write patterns which they later use for 
filtering. The current XML syntax is tedious to use, I agree, but what 
you suggest is, in my opinion, even more so.

greets,
Peter

On 09/05/2012 12:31 PM, Gergely Nagy wrote:
> In the past few months, I talked a lot about patterndb and related
> things with colleagues - during coffee break, over a beer, etc -, and
> last night, I got as far as drafting a proof of concept tool that
> realizes some of the ideas we've had. Some of these might have been
> discussed on this list too, in the past, I really can't remember all the
> influences I'm afraid.
>
> Anyway! I like the concept of patterndb, but I absolutely hate XML. It's
> not a natural format for describing how to match patterns, and what to
> do with them. Not for me, anyway. I'd like something that is much closer
> to how I normally think, something that feels more like a programming
> language, a domain specific one, engineered for this task only, but
> still somewhat familiar. I also want it to be fast.
>
> For a long time, I also wanted to play with developing both a Domain
> Specific Language (DSL for short) and a compiler, but never had the
> opportunity. Pattern matching is one now!
>
> How about a small language, that we'd compile down to C, automatically
> add some boilerplate, and we'd get a syslog-ng parser plugin in return?
> That would mean the language is fairly easy to extend, it produces
> native code, which will hopefully run as fast - if not faster - than
> patterndb, and we skip the entire XML pile too!
>
> To demonstrate, this is what I've been thinking of:
>
> ,----
> | (cond :message
> |   (match "foo (bar) ([:number:])")
> |     (do
> |       (set! :bar "$1")
> |       (set! :stuff "$2")
> |       (conj! :tags "test stuff")))
> |
> | (deftest message-match
> |   "this is a foo bar 1234 message!"
> |
> |   (== :bar "bar")
> |   (== :stuff "1234")
> |   (contains? :tags "test stuff"))
> `----
>
> This would compile down to roughly the following C code (with the test
> excluded, for now):
>
> ,----
> | gboolean
> | m_something(LogMsg *msg, GString *subject)
> | {
> |   GString *m1; /* ([:number:]) */
> |
> |   /* The subject must be at least as long as the static strings in the
> |   pattern, if it's shorter, we don't match */
> |   if (subject->len < 8)
> |     return FALSE;
> |
> |   /* If we don't find part of the pattern, bail out. */
> |   if (strncmp(subject->str, "foo bar ", 8) != 0)
> |     return FALSE;
> |
> |   if (!find_number(subject->str + 8, &m1))
> |     return FALSE;
> |
> |   /* The whole stuff matched, yay! Let's fill in the fields. */
> |
> |   /* "bar" is a static string, fill it in as-is, no need to extract it
> |   from the subject. */
> |   log_msg_set_value(msg, "bar", "bar", 3);
> |
> |   /* :stuff is a number, so that needs to come from the
> |   subject. Thankfully find_number() already did the extraction for
> |   us. */
> |   log_msg_set_value(msg, "stuff", m1->str, m1->len);
> |
> |   log_msg_set_tag_by_name(msg, "test stuff");
> |
> |   return TRUE;
> | }
> `----
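
For anyone who wants to try the idea without a syslog-ng source tree, the following is a minimal standalone sketch of the same matcher in plain C. It is an illustration only: the Fields struct stands in for the LogMsg, find_number() is a guess at what such a helper could look like (the original does not define it), and neither is syslog-ng API. Like the generated code above, it anchors the match at the start of the subject.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the decimal digits at the start of s.
   Stores the count in *len and returns 1 if there is at least one. */
static int
find_number(const char *s, size_t *len)
{
  size_t n = 0;

  assert(s != NULL && len != NULL);
  while (isdigit((unsigned char) s[n]))
    n++;
  *len = n;
  return n > 0;
}

/* Stand-in for the message fields the generated code would set. */
typedef struct
{
  char bar[16];
  char stuff[32];
} Fields;

static int
m_something(const char *subject, Fields *out)
{
  static const char prefix[] = "foo bar ";
  const size_t prefix_len = sizeof(prefix) - 1;
  size_t num_len;

  assert(subject != NULL && out != NULL);

  /* strncmp() stops at the terminating NUL, so a subject shorter than
     the prefix simply fails to compare equal. */
  if (strncmp(subject, prefix, prefix_len) != 0)
    return 0;

  if (!find_number(subject + prefix_len, &num_len))
    return 0;

  /* Refuse numbers that would not fit into the fixed-size field. */
  if (num_len >= sizeof(out->stuff))
    return 0;

  /* "bar" is a static string, the number is copied from the subject. */
  strcpy(out->bar, "bar");
  memcpy(out->stuff, subject + prefix_len, num_len);
  out->stuff[num_len] = '\0';
  return 1;
}
```

Calling m_something("foo bar 1234 message!", &fields) fills in both fields; a subject such as "foo baz 1234" is rejected at the prefix comparison.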
>
> I believe this is pretty efficient, and the code generator can comment
> the generated source nicely too. If we add named capture-groups, then
> the variables used can have meaningful names too!
>
> For example: (match "this (?<object>[:qstring:]) is good")
>
> In this case, the variable would be called m_object.
>
> However, the pattern-string is a bit awkward, when the rest is lispy, it
> also complicates the generator, so I was thinking of turning the pattern
> into a lispy syntax too:
>
> ,----
> | (match "foo " (capture "bar") " " (capture :number))
> `----
>
> Or, with named capture groups added:
>
> ,----
> | (match "this " (capture :qstring :as "object") " is good")
> `----
>
> Of course, the action to take on a match can also contain another
> cond+match pair, so it can be nested as deep as one wishes to, the
> compiler will compile each cond into a separate function, and just call
> the appropriate one. Or perhaps inline them - that's an implementation
> detail, and doesn't really matter.
>
> The big advantage I see, is a DSL that is much closer to how I think,
> one that has the potential to produce a compact parser, one that is also
> easy to debug with conventional tools (gdb ;) because it compiles down
> to C. There is less run-time overhead too.
>
> Also, if implemented correctly, the generator would have the parser and
> the code generator entirely independent, so adding a different syntax
> would be as easy as writing a parser that produces the same abstract
> syntax tree the generator works with. This way, for those who're more
> familiar with C-like languages, the above matcher could be rewritten
> like this:
>
> ,----
> | switch ($message)
> |   {
> |     case match("foo ", capture("bar"), " ", capture(:number:))
> |       {
> |         set("bar", "$1");
> |         set("stuff", "$2");
> |         append($tags, "test stuff");
> |       }
> |   }
> `----
>
> And it would compile down to the exact same C code, accompanied by an
> appropriate autotools-based build system, so all you'd have to do in the
> end is to write the matcher, and issue the following commands:
>
> ,----
> | $ matcher-generate test-patterns.pm
> | $ cd test-patterns
> | $ autoreconf -i && ./configure && make && make install
> `----
>
> And finally, modify your syslog-ng.conf:
>
> ,----
> | @module test-patterns
> | parser p_test { parser(test-patterns); };
> `----
>
> It does have downsides, though, namely that you need to regenerate &
> recompile the module and restart syslog-ng each time you modify the
> source, which is less convenient than just restarting syslog-ng
> itself. One also needs to learn a 'new' language to write pattern
> matchers in (but one has to learn patterndb too, anyway, so this isn't
> that big a disadvantage, especially since a more language-like thing is,
> in my opinion, easier to learn :).
>
> However, I believe that the advantages are worth it. For me, they
> certainly are, so I already started to hash out a proof of concept. So
> far, my PoC code can generate C functions, and supports a small subset
> of the DSL explained below.
>
> Do note that all this does not include correlation, because I believe
> that correlation should be separate from parsing, and a similar
> technique could be used to write advanced correlation setups - I will go
> into detail once I have a working proof of concept compiler for the
> parser.
>
> As a start, the DSL would support the following constructs:
>
> Top-level constructs:
> ---------------------
>
> * (cond :field condition action ...)
>
>    Where :field can be any field, condition is a single condition
>    function (see below) and action is a single action too (see even
>    further below).
>
>    Any number of condition-action pairs can be specified, the first one
>    matching will win, and the rest won't be tried. These two must always
>    be paired together.
>
> * (deftest test-name
>      "source string"
>
>      test-conditions)
>
>    Right now, let's ignore this. But in the long run, I want to be able to
>    write down reasonably complex tests too. Not entirely sure yet what I
>    need it to do though.
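
The "first matching pair wins" rule of (cond) can be sketched in plain C as a loop over condition/action pairs. This is only one way a generator could lay it out, under the assumption that conditions and actions become functions; the names CondClause and cond_run and the toy conditions are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*ConditionFn)(const char *field);
typedef void (*ActionFn)(int *result);

/* One condition-action pair of a (cond ...) form. */
typedef struct
{
  ConditionFn condition;
  ActionFn action;
} CondClause;

/* Try the clauses in order and run the action of the first one whose
   condition holds; later clauses are not tried. Returns the index of
   the clause that fired, or -1 if none did. */
static int
cond_run(const CondClause *clauses, size_t n, const char *field, int *result)
{
  assert(clauses != NULL && result != NULL);
  for (size_t i = 0; i < n; i++)
    {
      if (clauses[i].condition(field))
        {
          clauses[i].action(result);
          return (int) i;
        }
    }
  return -1;
}

/* Toy conditions, loosely mirroring (match "foo "), (exists?), (always). */
static int
cond_match_foo(const char *field)
{
  return field != NULL && strncmp(field, "foo ", 4) == 0;
}

static int
cond_exists(const char *field)
{
  return field != NULL;
}

static int
cond_always(const char *field)
{
  (void) field;
  return 1;
}

/* Toy actions that just record which clause fired. */
static void act_one(int *r) { *r = 1; }
static void act_two(int *r) { *r = 2; }
static void act_three(int *r) { *r = 3; }
```

Putting the catch-all clause last gives the :default behaviour: it only fires when nothing before it matched.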
>
> Condition functions:
> --------------------
>
> * (match pattern-spec)
>
>    Matches a pattern-spec, simple as that. See below for the definition
>    of the pattern-spec!
>
> * (not-match pattern-spec)
>
>    The opposite of (match): action is triggered if the pattern does not
>    match.
>
> * (exists?)
>
>    Triggers the action if the field in the condition exists.
>
> * (not-exists?)
>
>    Opposite of (exists?).
>
> * :default or (always)
>
>    Always triggers, so one can do catch-all actions.
>
> Action functions:
> -----------------
>
> * (set! :field value)
> * (clear! :field)
>
> These two should speak for themselves, I believe.
>
> * (conj! :field values)
>
> Conjoin (append) values to the specified field. Fields that need it
> (such as :tags) will be treated specially; for all others, it just
> appends a separator (",") and the values.
>
> * (do actions...)
>
> Does all the specified actions.
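
The plain-field case of (conj!) could look roughly like the sketch below. Two things in it are assumptions rather than anything stated above: an empty field gets no leading separator, and the field lives in a fixed-size buffer (real code would more likely use a growable string). conj_append is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append value to the NUL-terminated string in field, which has room
   for cap bytes in total. A "," separator is inserted only when the
   field is not empty (an assumption, see above). Returns 1 on success,
   0 if the result would not fit, in which case field is left as is. */
static int
conj_append(char *field, size_t cap, const char *value)
{
  size_t flen, vlen, sep;

  assert(field != NULL && value != NULL);

  flen = strlen(field);
  vlen = strlen(value);
  sep = flen > 0 ? 1 : 0;

  /* Existing text + separator + new value + terminating NUL. */
  if (flen + sep + vlen + 1 > cap)
    return 0;

  if (sep)
    field[flen] = ',';
  memcpy(field + flen + sep, value, vlen + 1);
  return 1;
}
```

Appending "a" and then "bc" to an empty buffer leaves "a,bc" in it.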
>
> Pattern spec:
> -------------
>
> The pattern can be built up from the sequence of the following things,
> in any order:
>
> * A plain string
>
> * (capture pattern-spec [:as name])
>
>    This just marks the pattern-spec as something to capture. While the
>    implementation may produce code that captures things anyway, the only
>    guarantee that something will be available for the actions, is to wrap
>    it in (capture). The pattern-spec can be anything, it can contain
>    nested captures.
>
>    If :as name is specified, the capture will be named, and actions can
>    refer to the captured thing by name. Otherwise, they need to refer to
>    it by number. Each capture - named or anonymous - has a different
>    number, starting from one, increasing with each occurrence of
>    (capture).
>
> * Any of the following special keywords:
>
>    * :number
>    * :string
>    * :qstring
>    * :ipv4-address
>    * :ipv6-address
>    * :mac-address
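
The capture numbering rule can be illustrated with a small tree walk in C. The SpecNode layout and number_captures() are hypothetical, and the sketch assumes that a nested capture is numbered right after the capture enclosing it (that is, in order of the opening parenthesis); the description above does not spell out the nested case.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SPEC_LITERAL, SPEC_KEYWORD, SPEC_CAPTURE } SpecKind;

/* One element of a pattern-spec sequence (hypothetical layout). */
typedef struct SpecNode
{
  SpecKind kind;
  const char *text;           /* literal, keyword name, or :as name (may be NULL) */
  struct SpecNode *children;  /* for captures: first element of the nested spec */
  struct SpecNode *next;      /* next element of the same sequence */
  int capture_no;             /* set by number_captures(); 0 = not a capture */
} SpecNode;

/* Walk the sequence in order and give every capture, named or not, the
   next free number. An enclosing capture is numbered before the
   captures nested inside it. Returns the next unused number. */
static int
number_captures(SpecNode *seq, int next_no)
{
  assert(next_no >= 1);
  for (SpecNode *n = seq; n != NULL; n = n->next)
    {
      n->capture_no = 0;
      if (n->kind == SPEC_CAPTURE)
        {
          n->capture_no = next_no++;
          next_no = number_captures(n->children, next_no);
        }
    }
  return next_no;
}
```

Starting the walk with number_captures(spec, 1) numbers the captures from one, and the return value minus one is the total number of captures in the pattern.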
>
> Future ideas
> ------------
>
> Later, once the basics are ready and work, it would make sense to
> introduce a way to share common blocks of code: functions and perhaps
> variables.
>
> Conclusion
> ==========
>
> XML sucks, DSL rocks.
>
> Feedback appreciated, be that on the syntax, or the initially proposed
> features/functions/etc, or anything else.
>
