[syslog-ng] [RFC]: Pattern matching & corellation ideas
Peter Gyongyosi
gyp at balabit.hu
Fri Sep 7 18:22:55 CEST 2012
Hi,
I agree with you in that the syntax and the structure of patterndb
desperately needs an overhaul and I like the approach your suggestions
show. However, I've got some problems with your proposal. See my
comments below (in no particular order).
1) the lisp-y syntax
This is fundamentally different from the main config file of syslog-ng
and alien to those not used to working in lisp-like languages -- who, I
guess, make up the majority of syslog-ng users. If we are to change the
format of patterndb, I'd suggest something like what we have in the main
config file or something JSON-ish, similar to what logstash has:
http://logstash.net/docs/1.1.1/filters/grok. At least I'd be just as
annoyed typing all those parentheses as I am typing the tons of <tags>
of XML :) I want to write patterns, not code or a huge XML file. The
actual container format just needs to get out of my way as much as possible.
2) I like the action functions
I think these are the three main operations we need (set/clear/append),
however, I wouldn't call appending conjoining but that's just me :)
3) What about pattern hierarchy == efficient matching?
Your proposal allows the user to define complex conditions for a pattern
match. On the other hand, the patterns we have right now work in a way
that allows us to organize them in a radix tree and use a greedy,
non-backtracking algorithm for matching which makes this procedure
incredibly fast. If we allowed more complex conditions, we'd need to
fall back to linear matching: with 5000 patterns, we'd have to try each
and every pattern against each incoming message, which is slow.
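To illustrate the difference, here is a toy sketch in Python. It is not syslog-ng's actual code: Node, insert, match and the NUMBER marker are names made up for this example, and an uncompressed character trie stands in for the real radix tree. The point it tries to show is that a greedy, non-backtracking walk reads each character of the message once, no matter how many patterns are loaded.

```python
# Toy sketch only: a character trie with a single "parser" edge type (NUMBER).
# All names here are illustrative assumptions, not syslog-ng internals.

NUMBER = object()  # marker for an edge that consumes a run of digits


class Node:
    def __init__(self):
        self.children = {}  # single character or NUMBER -> Node
        self.rule = None    # rule name if a pattern ends at this node


def insert(root, tokens, rule):
    """Add a pattern given as a list of literal strings and NUMBER markers."""
    node = root
    for tok in tokens:
        steps = [NUMBER] if tok is NUMBER else list(tok)
        for step in steps:
            node = node.children.setdefault(step, Node())
    node.rule = rule


def match(root, msg):
    """Greedy, non-backtracking walk: a single pass over msg."""
    node, i = root, 0
    while i < len(msg):
        nxt = node.children.get(msg[i])
        if nxt is not None:
            # A literal edge fits the next character: follow it.
            node, i = nxt, i + 1
        elif NUMBER in node.children and msg[i].isdigit():
            # Swallow the whole run of digits, then follow the NUMBER edge.
            while i < len(msg) and msg[i].isdigit():
                i += 1
            node = node.children[NUMBER]
        else:
            # No edge fits; give up instead of backing up to try another path.
            return None
    return node.rule


root = Node()
insert(root, ["foo bar ", NUMBER], "foo-bar")
insert(root, ["foo baz ", NUMBER, " done"], "foo-baz")
print(match(root, "foo baz 7 done"))
```

The linear alternative would be a loop that tries every rule against every message, so its cost grows with the number of rules. In the walk above, shared prefixes such as "foo ba" are only compared once.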
4) this is horrible: (match "this " (capture :qstring :as "object") " is
good")
Sorry for my bluntness, but it is :) It indeed is lisp-y, but it is hard
to read and tedious to write with all those parentheses and
"capture"s. I personally like the current syntax of the patterns
themselves and I'd keep it as it is. (Grok -- again, of logstash fame --
also has something similar and it seems to be working for them, too:
https://github.com/logstash/logstash/blob/v1.0.17/patterns/linux-syslog)
5) I agree that correlation should be handled separately -- but we need
IDs/names for that!
I totally agree with you that correlation should be separated from
parsing; I always have a hard time wrapping my mind around the way
correlation works in patterndb. But to do that (and to do more filtering
or anything else with the parsed result), a pattern needs a name or ID.
Sure, it can be added by a (set! :pattern-name "foo") command in your
example, but I think it needs a more prominent place.
So, as a summary: I think your approach has its place but cannot and
should not replace patterndb. It can be incredibly flexible, and as a
result we would not have to bastardize patterndb to support every weird
use case that comes up; we could simply point the user to these custom
parsers. But this flexibility comes at the price of one-by-one
matching between patterns and messages, which brings a huge
performance penalty. We still have to give an easy-to-use solution to
users who simply want to write patterns which they later use for
filtering. The current XML syntax is tedious to use, I agree, but what
you suggest is, in my opinion, even more so.
greets,
Peter
On 09/05/2012 12:31 PM, Gergely Nagy wrote:
> In the past few months, I talked a lot about patterndb and related
> things with colleagues - during coffee break, over a beer, etc -, and
> last night, I got as far as drafting a proof of concept tool that
> realizes some of the ideas we've had. Some of these might have been
> discussed on this list too, in the past, I really can't remember all the
> influences I'm afraid.
>
> Anyway! I like the concept of patterndb, but I absolutely hate XML. It's
> not a natural format for describing how to match patterns, and what to
> do with them. Not for me, anyway. I'd like something that is much closer
> to how I normally think, something that feels more like a programming
> language, a domain specific one, engineered for this task only, but
> still somewhat familiar. I also want it to be fast.
>
> For a long time, I also wanted to play with developing both a Domain
> Specific Language (DSL for short) and a compiler, but never had the
> opportunity. Pattern matching is one now!
>
> How about a small language, that we'd compile down to C, automatically
> add some boilerplate, and we'd get a syslog-ng parser plugin in return?
> That would mean the language is fairly easy to extend, it produces
> native code, which will hopefully run as fast - if not faster - than
> patterndb, and we skip the entire XML pile too!
>
> To demonstrate, this is what I've been thinking of:
>
> ,----
> | (cond :message
> |   (match "foo (bar) ([:number:])")
> |   (do
> |     (set! :bar "$1")
> |     (set! :stuff "$2")
> |     (conj! :tags "test stuff")))
> |
> | (deftest message-match
> |   "this is a foo bar 1234 message!"
> |
> |   (== :bar "bar")
> |   (== :stuff "1234")
> |   (contains? :tags "test stuff"))
> `----
>
> This would compile down to roughly the following C code (with the test
> excluded, for now):
>
> ,----
> | gboolean
> | m_something(LogMsg *msg, GString *subject)
> | {
> |   GString *m1; /* ([:number:]) */
> |
> |   /* The subject must be at least as long as the static strings in the
> |      pattern; if it's shorter, we don't match. */
> |   if (subject->len < 8)
> |     return FALSE;
> |
> |   /* If we don't find part of the pattern, bail out. */
> |   if (strncmp(subject->str, "foo bar ", 8) != 0)
> |     return FALSE;
> |
> |   if (!find_number(subject->str + 8, &m1))
> |     return FALSE;
> |
> |   /* The whole stuff matched, yay! Let's fill in the fields. */
> |
> |   /* "bar" is a static string, fill it in as-is, no need to extract it
> |      from the subject. */
> |   log_msg_set_value(msg, "bar", "bar", 3);
> |
> |   /* :stuff is a number, so that needs to come from the
> |      subject. Thankfully find_number() already did the extraction for
> |      us. */
> |   log_msg_set_value(msg, "stuff", m1->str, m1->len);
> |
> |   log_msg_set_tag_by_name(msg, "test stuff");
> |
> |   return TRUE;
> | }
> `----
>
> I believe this is pretty efficient, and the code generator can comment
> the generated source nicely too. If we add named capture-groups, then
> the variables used can have meaningful names too!
>
> For example: (match "this (?<object>[:qstring:]) is good")
>
> In this case, the variable would be called m_object.
>
> However, the pattern-string is a bit awkward, when the rest is lispy, it
> also complicates the generator, so I was thinking of turning the pattern
> into a lispy syntax too:
>
> ,----
> | (match "foo " (capture "bar") " " (capture :number))
> `----
>
> Or, with named capture groups added:
>
> ,----
> | (match "this " (capture :qstring :as "object") " is good")
> `----
>
> Of course, the action to take on a match can also contain another
> cond+match pair, so it can be nested as deep as one wishes to, the
> compiler will compile each cond into a separate function, and just call
> the appropriate one. Or perhaps inline them - that's an implementation
> detail, and doesn't really matter.
>
> The big advantage I see, is a DSL that is much closer to how I think,
> one that has the potential to produce a compact parser, one that is also
> easy to debug with conventional tools (gdb ;) because it compiles down
> to C. There is less run-time overhead too.
>
> Also, if implemented correctly, the generator would have the parser and
> the code generator entirely independent, so adding a different syntax
> would be as easy as writing a parser that produces the same abstract
> syntax tree the generator works with. This way, for those who're more
> familiar with C-like languages, the above matcher could be rewritten
> like this:
>
> ,----
> | switch ($message)
> | {
> |   case match("foo ", capture("bar"), " ", capture(:number:))
> |   {
> |     set("bar", "$1");
> |     set("stuff", "$2");
> |     append($tags, "test stuff");
> |   }
> | }
> `----
>
> And it would compile down to the exact same C code, accompanied by an
> appropriate autotools-based build system, so all you'd have to do in the
> end is to write the matcher, and issue the following commands:
>
> ,----
> | $ matcher-generate test-patterns.pm
> | $ cd test-patterns
> | $ autoreconf -i && ./configure && make && make install
> `----
>
> And finally, modify your syslog-ng.conf:
>
> ,----
> | @module test-patterns
> | parser p_test { parser(test-patterns); };
> `----
>
> It does have downsides, though, namely that you need to regenerate &
> recompile the module and restart syslog-ng each time you modify the
> source, which is less convenient than just restarting syslog-ng
> itself. One also needs to learn a 'new' language to write pattern
> matchers in (but one has to learn patterndb too, anyway, so this isn't
> that big a disadvantage, especially since a more language-like thing is,
> in my opinion, easier to learn :).
>
> However, I believe that the advantages are worth it. For me, they
> certainly are, so I already started to hash out a proof of concept. So
> far, my PoC code can generate C functions, and supports a small subset
> of the DSL explained below.
>
> Do note that all this does not include correlation, because I believe
> that correlation should be separate from parsing, and a similar
> technique could be used to write advanced correlation setups - I will go
> into detail once I have a working proof of concept compiler for the
> parser.
>
> As a start, the DSL would support the following constructs:
>
> Top-level constructs:
> ---------------------
>
> * (cond :field condition action ...)
>
> Where :field can be any field, condition is a single condition
> function (see below) and action is a single action too (see even
> further below).
>
> Any number of condition-action pairs can be specified, the first one
> matching will win, and the rest won't be tried. These two must always
> be paired together.
>
> * (deftest test-name
> "source string"
>
> test-conditions)
>
> Right now, let's ignore this. But in the long run, I want to be able to
> write down reasonably complex tests too. Not entirely sure yet what I
> need it to do though.
>
> Condition functions:
> --------------------
>
> * (match pattern-spec)
>
> Matches a pattern-spec, simple as that. See below for the definition
> of the pattern-spec!
>
> * (not-match pattern-spec)
>
> The opposite of (match): action is triggered if the pattern does not
> match.
>
> * (exists?)
>
> Triggers the action if the field in the condition exists.
>
> * (not-exists?)
>
> Opposite of (exists?).
>
> * :default or (always)
>
> Always triggers, so one can do catch-all actions.
>
> Action functions:
> -----------------
>
> * (set! :field value)
> * (clear! :field)
>
> These two should speak for themselves, I believe.
>
> * (conj! :field values)
>
> Conjoin (append) values to the specified field. Fields that need it
> (such as :tags) will be treated specially; otherwise it just appends a
> separator (",") and the values.
>
> * (do actions...)
>
> Does all the specified actions.
>
> Pattern spec:
> -------------
>
> The pattern can be built up from the sequence of the following things,
> in any order:
>
> * A plain string
>
> * (capture pattern-spec [:as name])
>
> This just marks the pattern-spec as something to capture. While the
> implementation may produce code that captures things anyway, the only
> guarantee that something will be available for the actions, is to wrap
> it in (capture). The pattern-spec can be anything, it can contain
> nested captures.
>
> If :as name is specified, the capture will be named, and actions can
> refer to the captured thing by name. Otherwise, they need to refer to it
> by number. Each capture - named or anonymous - has a different
> number, starting from one, increasing with each occurrence of
> (capture).
>
> * Any of the following special keywords:
>
> * :number
> * :string
> * :qstring
> * :ipv4-address
> * :ipv6-address
> * :mac-address
>
> Future ideas
> ------------
>
> Later, once the basics are ready and work, it would make sense to
> introduce a way to share common blocks of code: functions and perhaps
> variables.
>
> Conclusion
> ==========
>
> XML sucks, DSL rocks.
>
> Feedback appreciated, be that on the syntax, or the initially proposed
> features/functions/etc, or anything else.
>