[syslog-ng] [RFC]: Pattern matching & correlation ideas
Gergely Nagy
algernon at balabit.hu
Fri Sep 7 20:26:17 CEST 2012
Peter Gyongyosi <gyp at balabit.hu> writes:
> 1) the lisp-y syntax
Yep, it is different, because of two factors: I like lisp, and I started
coding the PoC in Clojure, and having a compatible syntax made the
prototyping much much faster.
But as I said in the RFC, I understand the syntax may not be easy for
non-lispy folk, so the whole compiler business is being coded with this
in mind: the parser is entirely separate from the rest. As long as
there's a parser to translate the source to an intermediate format,
we'll be fine, the rest of the toolchain will handle it.
Right now, I have clojure macros that translate a DSL to an
intermediate format, which gets further translated into a lower level
representation (this is where the "(match)" stuff gets analyzed), which
is then optimised, eliminating unused stuff, combining others, and so on
and so forth, and in the end, the final step turns it into C.
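The staged pipeline described above (front end, intermediate form, optimiser, back end) might be sketched roughly as follows. This is only an illustration in Python, not the Clojure PoC: the node types `Lit` and `Cap`, the function names, and the emitted `match_literal` / `capture_*` C helpers are all hypothetical. The optimiser shows two of the mentioned ideas in miniature: unused captures lose their name, and adjacent literals are combined.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical intermediate representation; it does not mirror the real PoC.
@dataclass(frozen=True)
class Lit:
    text: str

@dataclass(frozen=True)
class Cap:
    kind: str             # e.g. "word" or "string"
    name: Optional[str]   # None means "match, but do not store"

def parse(dsl):
    """Front end: turn a toy DSL (strings and (kind, name) tuples) into IR."""
    return [Lit(x) if isinstance(x, str) else Cap(x[0], x[1]) for x in dsl]

def optimise(ir, used_names):
    """Unname captures nobody reads, and merge adjacent literals."""
    out = []
    for node in ir:
        if isinstance(node, Cap) and node.name not in used_names:
            node = Cap(node.kind, None)
        if isinstance(node, Lit) and out and isinstance(out[-1], Lit):
            out[-1] = Lit(out[-1].text + node.text)
        else:
            out.append(node)
    return out

def emit_c(ir):
    """Back end: emit C-like text that calls made-up runtime helpers."""
    lines = []
    for node in ir:
        if isinstance(node, Lit):
            # json.dumps is used only as a convenient way to quote the string.
            lines.append(f"if (!match_literal(&p, {json.dumps(node.text)})) return 0;")
        else:
            target = json.dumps(node.name) if node.name else "NULL"
            lines.append(f"if (!capture_{node.kind}(&p, {target})) return 0;")
    lines.append("return 1;")
    return "\n".join(lines)

def compile_pattern(dsl, used_names):
    """Run the whole toolchain: parse, optimise, emit."""
    return emit_c(optimise(parse(dsl), used_names))
```

Because only `parse` knows about the surface syntax, swapping in a different front end (or a different back end than `emit_c`) would leave the other stages untouched, which is the point made above.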
I also have a Lua and a Guile generator PoC'd up, so it is entirely
possible to compile down to another, dynamic language, which can then be
embedded in syslog-ng, and voila, no compiler is necessary!
But I digress.
> http://logstash.net/docs/1.1.1/filters/grok.
Haven't looked at it in detail yet, but JSON has similar disadvantages:
instead of parentheses, you'll have a ton of {} and [].
Having had a second look at some of the recipes... eeep, no, thank
you. It has the same feel as the current patterndb, except instead of an
XML container, it's JSON. The fundamental problem still remains: it uses
format-string-like syntax. That's the most horrible, inconvenient and
inflexible thing ever invented.
(Did I mention that I passionately hate format strings? Not just when
they're used for parsing, but for formatting too.)
> I want to write patterns and not code or a huge XML. The actual
> container format just needs to get out my way as much as possible.
Yeah, understandable. While playing with the PoC, I came to the
conclusion that the current language is too verbose. Thankfully, because
it's all a bunch of clojure macros, I could build further macros to
abstract away a bunch of things, and without *any* change to the code, I
was able to rewrite this patterndb rule:
<rule provider='patterndb' id='4dd5a329-da83-4876-a431-ddcb59c2858c' class='system'>
  <patterns>
    <pattern>Accepted @ESTRING:usracct.authmethod: @for @ESTRING:usracct.username: @from @ESTRING:usracct.device: @port @ESTRING:: @@ANYSTRING:usracct.service@</pattern>
  </patterns>
  <values>
    <value name='usracct.type'>login</value>
    <value name='usracct.sessionid'>$PID</value>
    <value name='usracct.application'>$PROGRAM</value>
    <value name='secevt.verdict'>ACCEPT</value>
  </values>
  <tags>
    <tag>usracct</tag>
    <tag>secevt</tag>
  </tags>
</rule>
To this:
(defruleset "4dd5a329-da83-4876-a431-ddcb59c2858c"
  {:class :system
   :provider :PoC}
  (with-pattern "Accepted " (word :usracct.authmethod) " for "
                (word :usracct.username) " from "
                (word :usracct.device) " port "
                (string :usracct.service)
    (do->
      (set! :usracct.type "login"
            :usracct.sessionid "$PID"
            :usracct.application "$PROGRAM"
            :secevt.verdict "ACCEPT")
      (tag! :usracct :secevt))))
There ain't that many parentheses anymore, and I think it's sufficiently
clear even for those who don't speak a bit of lisp. Just read it as-is,
and you'll pretty much know what the ruleset does.
> 2) I like the action functions
>
> I think these are the three main operations we need
> (set/clear/append), however, I wouldn't call appending conjoining but
> that's just me :)
I tried to stay as close to the Clojure terminology as possible. It's
one line in the current PoC to make append! an alias to conj!:
(def append! conj!)
Mind you, due to practical reasons, I ended up using append! in the PoC
too.
> 3) What about pattern hierarchy == efficient matching?
>
> Your proposal allows the user to define complex conditions for a
> pattern match. On the other hand, the patterns we have right now work
> in a way that allows us to organize them in a radix tree and use a
> greedy, non-backtracking algorithm for matching which makes this
> procedure incredibly fast.
That's where the optimisation step comes in. In due time, I will be able
to teach the optimiser to use a radix tree whenever possible, and only
fall back when the complexity demands that.
> Whereas if we'd allow more complex conditions, we'd need to fall back
> to a linear matching: if we have 5000 patterns, we'd have to match
> each and every pattern to each incoming message. Which is slow.
Indeed. Which is why the language is limited enough to allow the
optimiser to (reasonably easily) figure out what algorithm to
use. I do not want to cap complexity merely because it makes it
possible to write less efficient, or even horribly inefficient, parsers.
Sometimes that complexity is necessary, and I want to allow complex
patterns too, while maintaining the ability to generate very fast code
for the simple ones.
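To make the "fast path when possible, fall back otherwise" idea concrete, here is a small Python sketch. It is a deliberate simplification and not how patterndb or the PoC works: a plain character trie keyed on each simple rule's leading literal stands in for the radix tree, and "complex" rules are just arbitrary predicates tried one by one. The class name `Matcher` and its methods are hypothetical.

```python
class Matcher:
    """Simple rules go into a trie; complex rules are matched linearly."""

    def __init__(self):
        self.trie = {}       # nested dicts, one level per character
        self.fallback = []   # (rule_id, predicate) pairs, tried in order

    def add(self, rule_id, rule):
        if isinstance(rule, str):
            # Simple rule: index it by its leading literal.
            node = self.trie
            for ch in rule:
                node = node.setdefault(ch, {})
            node.setdefault(None, []).append(rule_id)  # None marks "a rule ends here"
        else:
            # Complex rule: any callable taking the message.
            self.fallback.append((rule_id, rule))

    def match(self, message):
        # Fast path: one greedy walk over the message, no backtracking.
        node, best = self.trie, None
        for ch in message:
            if None in node:
                best = node[None][0]
            if ch not in node:
                break
            node = node[ch]
        else:
            # Message exhausted without a mismatch: check the final node too.
            if None in node:
                best = node[None][0]
        if best is not None:
            return best
        # Slow path: only reached when no simple rule matched.
        for rule_id, predicate in self.fallback:
            if predicate(message):
                return rule_id
        return None
```

With this split, the cost of the trie walk depends on the message length rather than on how many simple rules exist, and the linear cost is paid only for the rules that actually need it.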
As an example, it is entirely possible to translate simpler rulesets
from my language to patterndb. If a ruleset can be translated to
patterndb syntax, then the same algorithms can be used too. Perhaps I
can even reuse the already existing code...
Or, as an intermediate step in the PoC, I can teach my generator to emit
patterndb rules instead of C, if what I wrote is expressible that
way. :)
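A rough Python sketch of such a patterndb back end, under stated assumptions: a pattern is a list of string literals and `(kind, name)` captures; a `word` is assumed to end at the next space, so it maps to `@ESTRING:name: @`; a trailing `string` maps to `@ANYSTRING:name@`; and a literal `@` is escaped as `@@`. Anything else is reported as not expressible. The function and exception names are made up for this example.

```python
class NotExpressible(Exception):
    """Raised when a pattern cannot be written as a patterndb pattern."""

def to_patterndb(pattern):
    out = []
    skip = 0  # characters of the next literal already consumed as a delimiter
    for i, item in enumerate(pattern):
        nxt = pattern[i + 1] if i + 1 < len(pattern) else None
        if isinstance(item, str):
            out.append(item[skip:].replace("@", "@@"))
            skip = 0
            continue
        kind, name = item
        if kind == "word" and isinstance(nxt, str) and nxt.startswith(" "):
            # ESTRING consumes its delimiter, so drop it from the next literal.
            out.append(f"@ESTRING:{name}: @")
            skip = 1
        elif kind == "string" and nxt is None:
            out.append(f"@ANYSTRING:{name}@")
        else:
            raise NotExpressible(f"cannot express ({kind} :{name}) here")
    return "".join(out)

rule = ["Accepted ", ("word", "usracct.authmethod"), " for ",
        ("word", "usracct.username"), " from ",
        ("word", "usracct.device"), " port ",
        ("string", "usracct.service")]
print(to_patterndb(rule))
```

For the ruleset shown earlier this yields a pattern of the same shape as the XML original, minus the unnamed `@ESTRING:: @` that skipped the port number, since the Lisp-style version has no counterpart for it.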
> 4) this is horrible: (match "this " (capture :qstring :as "object") "
> is good")
>
> Sorry for my bluntness, but it is :) It indeed is lisp-y, but it is
> hard to read and tedious to write with all those parentheses and
> "capture"s.
Yep, I ended up dropping this syntax, and on the lowest level of the
PoC, this is now:
(match "this " (capture-as "object" :string) " is good")
But with a macro, it can be turned into:
(match "this " (string :object) " is good")
Same number of parentheses, but shorter, and easier to understand for a
human.
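The shorthand-to-low-level rewrite is the kind of thing a macro does at compile time. As an illustration only (Python has no macros, so this is an ordinary function over nested tuples, and the names `SHORTHANDS` and `expand` are hypothetical), the expansion could look like this:

```python
# Shorthand forms that stand for a capture of the same-named kind.
SHORTHANDS = {"string", "word"}

def expand(form):
    """Rewrite (string name) into (capture-as name string), recursively."""
    if isinstance(form, tuple) and form:
        head, *args = form
        if head in SHORTHANDS and len(args) == 1:
            return ("capture-as", args[0], head)
        return (head, *(expand(arg) for arg in args))
    return form  # plain literals pass through unchanged

short = ("match", "this ", ("string", "object"), " is good")
print(expand(short))
```

The rest of the toolchain only ever sees the expanded form, which is why the friendlier syntax needs no change to the compiler itself.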
> I personally like the current syntax of the patterns themselves and
> I'd keep it as it is. (Grok -- again, of logstash fame -- also has
> something similar and it seems to be working for them, too:
>
> https://github.com/logstash/logstash/blob/v1.0.17/patterns/linux-syslog
This is horrible. Sorry, but...
SYSLOGBASE2 (?:%{SYSLOGTIMESTAMP:timestamp}|%{TIMESTAMP_ISO8601:timestamp8601}) (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
I look at this, and I have no idea what it means. Let's translate this
part to my lispy syntax:
(match (syslog-timestamp :timestamp) (syslog-facility :fac)
       (string :host) (string :program) ":")
(match (iso8601-timestamp :timestamp) (syslog-facility :fac)
       (string :host) (string :program) ":")
Two patterns, because my syntax does not have an explicit OR, but that
can be fixed in different ways, which I will not detail here. The most
obvious would be to introduce an OR operator into the language, but I
don't really like that: it makes it too easy to write patterns that are
hard to optimize.
Nevertheless, I don't find the grok syntax readable. It's the same "let's
shovel everything into a format-string-like abomination!" nonsense that
plagues many many things, including our patterndb.
That's what I'm trying to move away from, not the XML container. XML is
far less evil than this :P
> 5) I agree that correlation should be handled separately -- but we
> need IDs/names for that!
>
> I totally agree with you that correlation should be separated from
> parsing, I always have a hard time to wrap my mind around the way
> correlation works in patterndb. But to do that (and to do more
> filtering or anything with this parsing), a pattern needs a name or
> ID. Sure, it can be added by a (set! :pattern-name "foo") command in
> your example but I think it needs a more prevalent place.
Yeah, I came to the same conclusion. The (cond ...) stuff in the
original example was replaced by something along these lines:
(ruleset "id" :message
  (match ...) (action...)
  ...)
So there's an explicit id there now.
> So, as a summary: I think your approach has its place but can not and
> should not replace patterndb.
I'll try my best to prove you wrong on the 'can' part. :)
> It can be incredibly flexible and as a result we would not have to
> bastardize patterndb to support every weird use case that comes up
> rather simply point the user to use these custom parsers -- but this
> flexibility has the price of having to do one-by-one matching between
> patterns and messages which brings in a huge performance penalty.
I'm not convinced that's the case. For patterns that the optimiser finds
too complex - yes. But remember: this whole thingamabob gets compiled
down to a separate module. There's absolutely no reason not to use a
better algorithm when the pattern allows us to.
> We still have to give an easy-to-use solution for users who simply
> want to write patterns which they later use for filtering. The current
> XML syntax is tedious to use, I agree, but what you suggest is, in my
> opinion, even more so.
Eee, we'll see. I don't really see that many people writing patterndb
rules. I think I could count all of them on two hands, and one hand
would be BalaBit employees.
The way to make pattern writing easier is not really the language
itself (though it does help if it is not cryptic; both grok and
patterndb are compact, but cryptic), but the provided tools. Give people
good tools, and they won't care the least bit about what language the
tool produces as output.
Which brings me to another benefit of using a Clojure-compatible syntax
for the PoC: it's easy to manipulate from Clojure *AND* ClojureScript
too. It wouldn't be too hard to knock up a little web app that presents
you with a bunch of logs, and you can interactively develop patterns,
without ever having to look at the code produced under the hood.
Same could be done with Grok or PatternDB too, I suppose, but I'm not
going to touch either from an application running in the browser.
--
|8]