[RFC]: Pattern matching & correlation ideas
In the past few months, I talked a lot about patterndb and related things with colleagues - during coffee break, over a beer, etc -, and last night, I got as far as drafting a proof of concept tool that realizes some of the ideas we've had. Some of these might have been discussed on this list too, in the past, I really can't remember all the influences I'm afraid.

Anyway! I like the concept of patterndb, but I absolutely hate XML. It's not a natural format for describing how to match patterns, and what to do with them. Not for me, anyway. I'd like something that is much closer to how I normally think, something that feels more like a programming language, a domain specific one, engineered for this task only, but still somewhat familiar. I also want it to be fast.

For a long time, I have also wanted to play with developing both a Domain Specific Language (DSL for short) and a compiler, but never had the opportunity. Pattern matching is one now!

How about a small language that we'd compile down to C, automatically add some boilerplate, and get a syslog-ng parser plugin in return? That would mean the language is fairly easy to extend, it produces native code, which will hopefully run as fast - if not faster - than patterndb, and we skip the entire XML pile too!

To demonstrate, this is what I've been thinking of:

,----
| (cond :message
|       (match "foo (bar) ([:number:])")
|       (do
|        (set! :bar "$1")
|        (set! :stuff "$2")
|        (conj! :tags "test stuff")))
|
| (deftest message-match
|   "this is a foo bar 1234 message!"
|
|   (== :bar "bar")
|   (== :stuff "1234")
|   (contains? :tags "test stuff"))
`----

This would compile down to roughly the following C code (with the test excluded, for now):

,----
| gboolean
| m_something(LogMsg *msg, GString *subject)
| {
|   GString *m1; /* ([:number:]) */
|
|   /* The subject must be at least as long as the static strings in the
|      pattern; if it's shorter, we don't match. */
|   if (subject->len < 8)
|     return FALSE;
|
|   /* If we don't find part of the pattern, bail out. */
|   if (strncmp(subject->str, "foo bar ", 8) != 0)
|     return FALSE;
|
|   if (!find_number(subject->str + 8, &m1))
|     return FALSE;
|
|   /* The whole thing matched, yay! Let's fill in the fields. */
|
|   /* "bar" is a static string, fill it in as-is, no need to extract it
|      from the subject. */
|   log_msg_set_value(msg, "bar", "bar", 3);
|
|   /* :stuff is a number, so that needs to come from the
|      subject. Thankfully find_number() already did the extraction for
|      us. */
|   log_msg_set_value(msg, "stuff", m1->str, m1->len);
|
|   log_msg_set_tag_by_name(msg, "test stuff");
|
|   return TRUE;
| }
`----

I believe this is pretty efficient, and the code generator can comment the generated source nicely too. If we add named capture groups, then the variables used can have meaningful names too!

For example: (match "this (?<object>[:qstring:]) is good")

In this case, the variable would be called m_object.

However, the pattern-string is a bit awkward when the rest is lispy, and it also complicates the generator, so I was thinking of turning the pattern into a lispy syntax too:

,----
| (match "foo " (capture "bar") " " (capture :number))
`----

Or, with named capture groups added:

,----
| (match "this " (capture :qstring :as "object") " is good")
`----

Of course, the action to take on a match can also contain another cond+match pair, so it can be nested as deeply as one wishes; the compiler will compile each cond into a separate function, and just call the appropriate one. Or perhaps inline them - that's an implementation detail, and doesn't really matter.
The big advantage I see is a DSL that is much closer to how I think, one that has the potential to produce a compact parser, and one that is also easy to debug with conventional tools (gdb ;) because it compiles down to C. There is less run-time overhead too.

Also, if implemented correctly, the generator would keep the parser and the code generator entirely independent, so adding a different syntax would be as easy as writing a parser that produces the same abstract syntax tree the generator works with. This way, for those who're more familiar with C-like languages, the above matcher could be rewritten like this:

,----
| switch ($message)
|   {
|   case match("foo ", capture("bar"), " ", capture(:number:))
|     {
|       set("bar", "$1");
|       set("stuff", "$2");
|       append($tags, "test stuff");
|     }
|   }
`----

And it would compile down to the exact same C code, accompanied by an appropriate autotools-based build system, so all you'd have to do in the end is write the matcher, and issue the following commands:

,----
| $ matcher-generate test-patterns.pm
| $ cd test-patterns
| $ autoreconf -i && ./configure && make && make install
`----

And finally, modify your syslog-ng.conf:

,----
| @module test-patterns
| parser p_test { parser(test-patterns); };
`----

It does have downsides, though, namely that you need to regenerate & recompile the module and restart syslog-ng each time you modify the source, which is less convenient than just restarting syslog-ng itself. One also needs to learn a 'new' language to write pattern matchers in (but one has to learn patterndb too, anyway, so this isn't that big a disadvantage, especially since a more language-like thing is, in my opinion, easier to learn :).

However, I believe that the advantages are worth it. For me, they certainly are, so I have already started to hash out a proof of concept. So far, my PoC code can generate C functions, and supports a small subset of the DSL explained below.
Do note that all this does not include correlation, because I believe that correlation should be separate from parsing, and a similar technique could be used to write advanced correlation setups - I will go into detail once I have a working proof of concept compiler for the parser.

As a start, the DSL would support the following constructs:

Top-level constructs:
---------------------

* (cond :field condition action ...)

  Where :field can be any field, condition is a single condition function (see below) and action is a single action too (see even further below).

  Any number of condition-action pairs can be specified; the first one matching will win, and the rest won't be tried. These two must always be paired together.

* (deftest test-name "source string"
    test-conditions)

  Right now, let's ignore this. But in the long run, I want to be able to write down reasonably complex tests too. Not entirely sure yet what I need it to do, though.

Condition functions:
--------------------

* (match pattern-spec)

  Matches a pattern-spec, simple as that. See below for the definition of the pattern-spec!

* (not-match pattern-spec)

  The opposite of (match): the action is triggered if the pattern does not match.

* (exists?)

  Triggers the action if the field in the condition exists.

* (not-exists?)

  Opposite of (exists?).

* :default or (always)

  Always triggers, so one can do catch-all actions.

Action functions:
-----------------

* (set! :field value)
* (clear! :field)

  These two should speak for themselves, I believe.

* (conj! :field values)

  Conjoin (append) values to the specified field. Fields that need special treatment (:tags) will get it; otherwise it just appends a separator (",") and the values.

* (do actions...)

  Does all the specified actions.

Pattern spec:
-------------

The pattern can be built up from a sequence of the following things, in any order:

* A plain string

* (capture pattern-spec [:as name])

  This just marks the pattern-spec as something to capture.
  While the implementation may produce code that captures things anyway, the only guarantee that something will be available for the actions is to wrap it in (capture). The pattern-spec can be anything; it can contain nested captures.

  If :as name is specified, the capture will be named, and actions can refer to the captured thing by name. Otherwise, they need to refer to it by number. Each capture - named or anonymous - has a distinct number, starting from one, increasing with each occurrence of (capture).

* Any of the following special keywords:

  * :number
  * :string
  * :qstring
  * :ipv4-address
  * :ipv6-address
  * :mac-address

Future ideas
------------

Later, once the basics are ready and work, it would make sense to introduce a way to share common blocks of code: functions and perhaps variables.

Conclusion
==========

XML sucks, DSL rocks.

Feedback appreciated, be that on the syntax, or the initially proposed features/functions/etc, or anything else.

--
|8], who perhaps spent a little too much time near LISP-y stuff
On 2012-09-05, Gergely Nagy wrote:
And it would compile down to the exact same C code, accompanied by an appropriate autotools-based build system, so all you'd have to do in the end is to write the matcher, and issue the following commands:
,----
| $ matcher-generate test-patterns.pm
| $ cd test-patterns
| $ autoreconf -i && ./configure && make && make install
`----
And finally, modify your syslog-ng.conf:
,----
| @module test-patterns
| parser p_test { parser(test-patterns); };
`----
It does have downsides, though, namely that you need to regenerate & recompile the module and restart syslog-ng each time you modify the source, which is less convenient than just restarting syslog-ng itself. One also needs to learn a 'new' language to write pattern matchers in (but one has to learn patterndb too, anyway, so this isn't that big a disadvantage, especially since a more language-like thing is, in my opinion, easier to learn :).
For me, this is a huge disadvantage, because that'd introduce the need to have a compiler handy, or to distribute a binary instead of a plaintext config file.

Just my $0.02,
Jakub.

--
Jakub Jankowski|shasta@toxcorp.com|http://toxcorp.com/
GPG: FCBF F03D 9ADB B768 8B92 BB52 0341 9037 A875 942D
Jakub Jankowski <shasta@toxcorp.com> writes:
It does have downsides, though, namely that you need to regenerate & recompile the module and restart syslog-ng each time you modify the source, which is less convenient than just restarting syslog-ng itself. One also needs to learn a 'new' language to write pattern matchers in (but one has to learn patterndb too, anyway, so this isn't that big a disadvantage, especially since a more language-like thing is, in my opinion, easier to learn :).
For me, this is a huge disadvantage, because that'd introduce the need to have compiler handy, or to distribute binary instead of a plaintext config file.
Yep, that sadly is there, which is why this will be an option, alongside patterndb.

On the other hand, it would also be possible to skip the compile step, and write a module that would just run the thing. That'd have the disadvantage (compared to the compiled version) of being somewhat slower and a little more complex to write, but it would allow you to distribute only plain text config files.

Since there's an intermediate syntax tree anyway (to separate the parser and the code generator), it's not terribly hard to write an interpreter on top of that, one that doesn't generate the C code, but runs the tree directly.

I'll keep that in mind when I proceed, and will try to write the interpreter along with the generator. Thanks for the suggestion!

--
|8]
Hi,

I agree with you that the syntax and the structure of patterndb desperately need an overhaul, and I like the approach your suggestions show. However, I've got some problems with your proposal. See my comments below (in no particular order).

1) the lisp-y syntax

This is fundamentally different from the main config file of syslog-ng and alien to those not used to doing stuff in lisp-like languages -- who, I guess, make up the majority of syslog-ng users. If we are to change the format of patterndb, I'd suggest something like what we have in the main config file, or something JSON-ish, similar to what logstash has: http://logstash.net/docs/1.1.1/filters/grok. At least I'd be just as annoyed typing all those parentheses as I am typing the tons of <tags> of XML :) I want to write patterns, not code or a huge XML. The actual container format just needs to get out of my way as much as possible.

2) I like the action functions

I think these are the three main operations we need (set/clear/append); however, I wouldn't call appending conjoining, but that's just me :)

3) What about pattern hierarchy == efficient matching?

Your proposal allows the user to define complex conditions for a pattern match. On the other hand, the patterns we have right now work in a way that allows us to organize them in a radix tree and use a greedy, non-backtracking algorithm for matching, which makes this procedure incredibly fast. Whereas if we allowed more complex conditions, we'd need to fall back to linear matching: if we have 5000 patterns, we'd have to match each and every pattern against each incoming message. Which is slow.

4) this is horrible: (match "this " (capture :qstring :as "object") " is good")

Sorry for my bluntness, but it is :) It indeed is lisp-y, but it is hard to read and tedious to write with all those parentheses and "capture"s. I personally like the current syntax of the patterns themselves and I'd keep it as it is.
(Grok -- again, of logstash fame -- also has something similar, and it seems to be working for them, too: https://github.com/logstash/logstash/blob/v1.0.17/patterns/linux-syslog)

5) I agree that correlation should be handled separately -- but we need IDs/names for that!

I totally agree with you that correlation should be separated from parsing; I always have a hard time wrapping my mind around the way correlation works in patterndb. But to do that (and to do more filtering or anything else with this parsing), a pattern needs a name or ID. Sure, it can be added by a (set! :pattern-name "foo") command in your example, but I think it needs a more prominent place.

So, as a summary: I think your approach has its place but cannot and should not replace patterndb. It can be incredibly flexible, and as a result we would not have to bastardize patterndb to support every weird use case that comes up; we could simply point the user to these custom parsers. But this flexibility has the price of having to do one-by-one matching between patterns and messages, which brings a huge performance penalty. We still have to give an easy-to-use solution to users who simply want to write patterns which they later use for filtering. The current XML syntax is tedious to use, I agree, but what you suggest is, in my opinion, even more so.

greets,
Peter
Peter Gyongyosi <gyp@balabit.hu> writes:
1) the lisp-y syntax
Yep, it is different, because of two factors: I like lisp, and I started coding the PoC in Clojure, and having a compatible syntax made the prototyping much, much faster.

But as I said in the RFC, I understand the syntax may not be easy for non-lispy folk, so the whole compiler business is being coded with this in mind: the parser is entirely separate from the rest. As long as there's a parser to translate the source into an intermediate format, we'll be fine; the rest of the toolchain will handle it.

Right now, I have clojure macros that translate the DSL to an intermediate format, which gets further translated into a lower-level representation (this is where the "(match)" stuff gets analyzed), which is then optimised - eliminating unused stuff, combining others, and so on - and in the end, the final step turns it into C.

I also have a Lua and a Guile generator PoC'd up, so it is entirely possible to compile down to another, dynamic language, which can then be embedded in syslog-ng, and voila, no compiler is necessary! But I digress.
Haven't looked at it in detail yet, but JSON has similar disadvantages: instead of parentheses, you'll have a ton of {} and []. Having had a second look at some of the recipes... eeep, no, thank you. It has the same feel as the current patterndb, except instead of an XML container, it's JSON. The fundamental problem still remains: it uses format-string-like syntax. That's the most horrible, inconvenient and inflexible thing ever invented. (Did I mention that I passionately hate format strings? Not just when they're used for parsing, but for formatting too.)
I want to write patterns, not code or a huge XML. The actual container format just needs to get out of my way as much as possible.
Yeah, understandable. While playing with the PoC, I came to the conclusion that the current language is too verbose. Thankfully, because it's all a bunch of clojure macros, I could build further macros to abstract away a bunch of things, and without *any* change to the code, I was able to rewrite this patterndb rule:

<rule provider='patterndb' id='4dd5a329-da83-4876-a431-ddcb59c2858c' class='system'>
 <patterns>
  <pattern>Accepted @ESTRING:usracct.authmethod: @for @ESTRING:usracct.username: @from @ESTRING:usracct.device: @port @ESTRING:: @@ANYSTRING:usracct.service@</pattern>
 </patterns>
 <values>
  <value name='usracct.type'>login</value>
  <value name='usracct.sessionid'>$PID</value>
  <value name='usracct.application'>$PROGRAM</value>
  <value name='secevt.verdict'>ACCEPT</value>
 </values>
 <tags>
  <tag>usracct</tag>
  <tag>secevt</tag>
 </tags>
</rule>

To this:

(defruleset "4dd5a329-da83-4876-a431-ddcb59c2858c"
  {:class :system :provider :PoC}
  (with-pattern "Accepted " (word :usracct.authmethod)
                " for " (word :usracct.username)
                " from " (word :usracct.device)
                " port " (string :usracct.service)
    (do-> (set! :usracct.type "login"
                :usracct.sessionid "$PID"
                :usracct.application "$PROGRAM"
                :secevt.verdict "ACCEPT")
          (tag! :usracct :secevt))))

There ain't that many parentheses anymore, and I think it's sufficiently clear even for those who don't speak a bit of lisp. Just read it as-is, and you'll pretty much know what the ruleset does.
2) I like the action functions
I think these are the three main operations we need (set/clear/append), however, I wouldn't call appending conjoining but that's just me :)
I tried to stay as close to the Clojure terminology as possible. It's one line in the current PoC to make append! an alias for conj!:

(def append! conj!)

Mind you, for practical reasons, I ended up using append! in the PoC too.
3) What about pattern hierarchy == efficient matching?
Your proposal allows the user to define complex conditions for a pattern match. On the other hand, the patterns we have right now work in a way that allows us to organize them in a radix tree and use a greedy, non-backtracking algorithm for matching which makes this procedure incredibly fast.
That's where the optimisation step comes in. In due time, I will be able to teach the optimiser to use a radix tree whenever possible, and only fall back when the complexity demands it.
Whereas if we'd allow more complex conditions, we'd need to fall back to a linear matching: if we have 5000 patterns, we'd have to match each and every pattern to each incoming message. Which is slow.
Indeed. Which is why the language is limited enough to allow the optimiser to (reasonably easily) figure out what algorithm to use. I do not want to forbid complexity just because it makes it possible to write less efficient - or even horribly inefficient - parsers. Sometimes that is necessary, and I want to allow complex patterns too, while maintaining the ability to generate very fast code for the simple ones.

As an example, it is entirely possible to translate simpler rulesets from my language to patterndb. If a ruleset can be translated to patterndb syntax, then the same algorithms can be used too. Perhaps I can even reuse the already existing code... Or, as an intermediate step in the PoC, I can teach my generator to emit patterndb rules instead of C, if what I wrote is expressible that way. :)
4) this is horrible: (match "this " (capture :qstring :as "object") " is good")
Sorry for my bluntness, but it is :) It indeed is lisp-y, but it is hard to read and tedious to write with all those parentheses and "capture"s.
Yep, I ended up dropping this syntax, and at the lowest level of the PoC, this is now:

(match "this " (capture-as "object" :string) " is good")

But with a macro, it can be turned into:

(match "this " (string :object) " is good")

Same number of parentheses, but shorter, and easier for a human to understand.
I personally like the current syntax of the patterns themselves and I'd keep it as it is. (Grok -- again, of logstash fame -- also has something similar and it seems to be working for them, too:
https://github.com/logstash/logstash/blob/v1.0.17/patterns/linux-syslog
This is horrible. Sorry, but...

SYSLOGBASE2 (?:%{SYSLOGTIMESTAMP:timestamp}|%{TIMESTAMP_ISO8601:timestamp8601}) (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:

I look at this, and I have no idea what it means. Let's translate this part to my lispy syntax:

(match (syslog-timestamp :timestamp) (syslog-facility :fac) (string :host) (string :program):)
(match (iso8601-timestamp :timestamp) (syslog-facility :fac) (string :host) (string :program):)

Two lines, because my syntax does not have an explicit OR, but that can be fixed in different ways, which I will not detail here. The most obvious would be to introduce an OR operator into the language, but I don't really like that; it makes it too easy to write patterns that are hard to optimize.

Nevertheless, I don't find the grok syntax readable. It's the same "let's shovel everything into a format-string-like abomination!" nonsense that plagues many, many things, including our patterndb. That's what I'm trying to move away from, not the XML container. XML is far less evil than this :P
5) I agree that correlation should be handled separately -- but we need IDs/names for that!
I totally agree with you that correlation should be separated from parsing; I always have a hard time wrapping my mind around the way correlation works in patterndb. But to do that (and to do more filtering or anything else with this parsing), a pattern needs a name or ID. Sure, it can be added by a (set! :pattern-name "foo") command in your example, but I think it needs a more prominent place.
Yeah, I came to the same conclusion. The (cond ...) stuff in the original example was replaced by something along these lines: (ruleset "id" :message (match ...) (action...) ...) So there's an explicit id there now.
So, as a summary: I think your approach has its place but cannot and should not replace patterndb.
I'll try my best to prove you wrong on the 'can' part. :)
It can be incredibly flexible, and as a result we would not have to bastardize patterndb to support every weird use case that comes up; we could simply point the user to these custom parsers. But this flexibility has the price of having to do one-by-one matching between patterns and messages, which brings a huge performance penalty.
I'm not convinced that's the case. For patterns that the optimiser finds too complex - yes. But remember: this whole thingamabob gets compiled down to a separate module. There's absolutely no reason not to use a better algorithm when the pattern allows us to.
We still have to give an easy-to-use solution to users who simply want to write patterns which they later use for filtering. The current XML syntax is tedious to use, I agree, but what you suggest is, in my opinion, even more so.
Eee, we'll see. I don't really see that many people writing patterndb rules. I think I could count all of them on two hands, and one hand would be BalaBit employees.

The way to make pattern writing easier is not really the language itself (it does help if it is not cryptic; both grok and patterndb are - compact, but cryptic), but the provided tools. Give people good tools, and they won't care the least bit about what language the tool produces as output.

Which brings me to another benefit of using a Clojure-compatible syntax for the PoC: it's easy to manipulate from Clojure *AND* ClojureScript too. It wouldn't be too hard to knock up a little web app that presents you with a bunch of logs and lets you interactively develop patterns, without ever having to look at the code produced under the hood.

The same could be done with Grok or PatternDB too, I suppose, but I'm not going to touch either from an application running in the browser.

--
|8]
The way to make pattern writing easier, is not really the language itself (it does help if it is not cryptic; both grok and patterndb are. Compact, but cryptic), but the provided tools. Give people good tools, and they won't care the least bit about what language the tool produces as output.
Which brings me to another benefit of using a Clojure-compatible syntax for the PoC: it's easy to manipulate from Clojure *AND* ClojureScript too. It wouldn't be too hard to knock up a little web app that presents you with a bunch of logs, and you can interactively develop patterns, without ever having to look at the code produced under the hood.
Same could be done with Grok or PatternDB too, I suppose, but I'm not going to touch either from an application running in the browser.
And that is exactly what we are doing. We are building a database that has CEE classifications of events and all of the tag-related goodness. Then a script pulls all of this out of the database and produces the XML. I never look at XML again. Just patterns, tags, and classifications. We are adding tags for event notification, incident reporting, threshold measurements etc. All of this controls syslog-ng's message routing to program destinations that then make tickets, trigger nagios probes and plug into the rest of our infrastructure. Who wants to look at ANY container language? As soon as I have more than a couple of hundred patterns I need an interface tool kit anyway. -- Evan
Hi, On 09/07/2012 08:26 PM, Gergely Nagy wrote:
Peter Gyongyosi <gyp@balabit.hu> writes:
1) the lisp-y syntax

Yep, it is different, because of two factors: I like lisp, and I started coding the PoC in Clojure, and having a compatible syntax made the prototyping much much faster.
But as I said in the RFC, I understand the syntax may not be easy for non-lispy folk, so the whole compiler business is being coded with this in mind: the parser is entirely separate from the rest. As long as there's a parser to translate the source to an intermediate format, we'll be fine, the rest of the toolchain will handle it.
Right now, I have clojure macros that translate a DSL to an intermediate format, which gets further translated into a lower level representation (this is where the "(match)" stuff gets analyzed), which is then optimised, eliminating unused stuff, combining others, and so on and so forth, and in the end, the final step turns it into C.
I also have a Lua and a Guile generator PoC'd up, so it is entirely possible to compile down to another, dynamic language, which can then be embedded in syslog-ng, and voila, no compiler is necessary!
I think that'd be great and it's a must.
But I digress.
http://logstash.net/docs/1.1.1/filters/grok. Haven't looked at it in detail yet, but JSON has similar disadvantages: instead of parentheses, you'll have a ton of {} and [].
Having had a second look at some of the recipes... eeep, no, thank you. It has the same feel as the current patterndb, except instead of an XML container, it's JSON. The fundamental problem still remains: it uses format-string-like syntax. That's the most horrible, inconvenient and inflexible thing ever invented.
(Did I mention that I passionately hate format strings? Not just when they're used for parsing, but for formatting too.)
I think this is where our main differences come in: I do not hate format strings and I think they're quite readable and compact. But I have to agree that your updated syntax in the example below is easier to read.
I want to write patterns and not code or a huge XML. The actual container format just needs to get out of my way as much as possible. Yeah, understandable. While playing with the PoC, I came to the conclusion that the current language is too verbose. Thankfully, because it's all a bunch of clojure macros, I could build further macros to abstract away a bunch of things, and without *any* change to the code, I was able to rewrite this patterndb rule:
<rule provider='patterndb' id='4dd5a329-da83-4876-a431-ddcb59c2858c' class='system'>
  <patterns>
    <pattern>Accepted @ESTRING:usracct.authmethod: @for @ESTRING:usracct.username: @from @ESTRING:usracct.device: @port @ESTRING:: @@ANYSTRING:usracct.service@</pattern>
  </patterns>
  <values>
    <value name='usracct.type'>login</value>
    <value name='usracct.sessionid'>$PID</value>
    <value name='usracct.application'>$PROGRAM</value>
    <value name='secevt.verdict'>ACCEPT</value>
  </values>
  <tags>
    <tag>usracct</tag>
    <tag>secevt</tag>
  </tags>
</rule>
To this:
(defruleset "4dd5a329-da83-4876-a431-ddcb59c2858c" {:class :system :provider :PoC}
  (with-pattern "Accepted " (word :usracct.authmethod)
                " for " (word :usracct.username)
                " from " (word :usracct.device)
                " port " (string :usracct.service)
    (do-> (set! :usracct.type "login"
                :usracct.sessionid "$PID"
                :usracct.application "$PROGRAM"
                :secevt.verdict "ACCEPT")
          (tag! :usracct :secevt))))
There ain't that many parentheses anymore, and I think it's sufficiently clear even for those who don't speak a bit of lisp. Just read it as-is, and you'll pretty much know what the ruleset does.
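To make the idea concrete, here is a rough sketch of the matching logic the generator might emit for the "Accepted ... for ... from ... port ..." rule above. This is an illustration only, not the real generated code: the helper names (match_word(), match_literal()), the buffer-based interface, and the omission of the syslog-ng LogMsg plumbing are all assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Match a run of non-space characters ("word" in the DSL) and copy it
   into the caller's buffer; return the new cursor, or NULL on failure. */
static const char *
match_word(const char *p, char *out, size_t outlen)
{
    size_t n = strcspn(p, " ");
    if (n == 0 || n >= outlen)
        return NULL;
    memcpy(out, p, n);
    out[n] = '\0';
    return p + n;
}

/* Consume a literal; return NULL when the subject doesn't start with it. */
static const char *
match_literal(const char *p, const char *lit)
{
    size_t n = strlen(lit);
    return strncmp(p, lit, n) == 0 ? p + n : NULL;
}

/* Compiled form of the pattern: literals are strncmp'd, captures are
   extracted as we go, and any failure bails out immediately. */
int
match_accepted(const char *subject, char *method, char *user, char *device,
               size_t bufsize)
{
    const char *p = subject;
    if (!(p = match_literal(p, "Accepted "))) return 0;
    if (!(p = match_word(p, method, bufsize))) return 0;
    if (!(p = match_literal(p, " for "))) return 0;
    if (!(p = match_word(p, user, bufsize))) return 0;
    if (!(p = match_literal(p, " from "))) return 0;
    if (!(p = match_word(p, device, bufsize))) return 0;
    return 1;
}
```

In the real plugin the extracted values would of course go through log_msg_set_value() instead of plain buffers, as in the earlier hand-written example.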
OK, I'm convinced, I could live with such a syntax. If I were to design it, I'd create something JSON-y instead of Lisp-y, but I think it's just a matter of personal preference (and the fact that it's been about a decade since I've done anything with Lisp whereas I have to handle something JSON-like weekly). But that's just the two of us: what do others think?
3) What about pattern hierarchy == efficient matching?
Your proposal allows the user to define complex conditions for a pattern match. On the other hand, the patterns we have right now work in a way that allows us to organize them in a radix tree and use a greedy, non-backtracking algorithm for matching, which makes this procedure incredibly fast.

That's where the optimisation step comes in. In due time, I will be able to teach the optimiser to use a radix tree whenever possible, and only fall back when the complexity demands it.
Whereas if we'd allow more complex conditions, we'd need to fall back to linear matching: if we have 5000 patterns, we'd have to match each and every pattern against each incoming message. Which is slow.

Indeed. Which is why the language is limited enough to allow the optimiser to (reasonably easily) figure out what algorithm to use. I do not want to limit complexity just because that makes it possible to write less efficient - or even horribly inefficient - parsers. Sometimes that is necessary, and I want to allow complex patterns too, while maintaining the ability to generate very fast code for the simple ones.
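The radix-tree idea boils down to this: patterns that share literal prefixes get merged, so matching walks one tree per message instead of trying all 5000 patterns in turn. A minimal sketch (the node layout and function names are made up for illustration; the real structure would also carry parser nodes for captures, not just literals):

```c
#include <string.h>

struct node {
    const char  *edge;      /* literal to consume at this node */
    int          rule_id;   /* >= 0 when a full pattern ends here */
    struct node *children;  /* array of child nodes */
    int          nchildren;
};

/* Walk the tree, consuming shared literal prefixes once; returns the
   matching rule's id, or -1 when no pattern matches the message. */
static int
radix_match(const struct node *n, const char *msg)
{
    size_t len = strlen(n->edge);

    if (strncmp(msg, n->edge, len) != 0)
        return -1;
    msg += len;
    if (*msg == '\0' && n->rule_id >= 0)
        return n->rule_id;
    for (int i = 0; i < n->nchildren; i++) {
        int r = radix_match(&n->children[i], msg);
        if (r >= 0)
            return r;
    }
    return -1;
}
```

With 5000 patterns that mostly share prefixes ("Accepted ", "Failed ", and so on), the common prefix is compared exactly once per message, which is where the speedup over linear matching comes from.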
As an example, it is entirely possible to translate simpler rulesets from my language to patterndb. If a ruleset can be translated to patterndb syntax, then the same algorithms can be used too. Perhaps I can even reuse the already existing code...
Or, as an intermediate step in the PoC, I can teach my generator to emit patterndb rules instead of C, if what I wrote is expressible that way. :)
Oh, I haven't thought of that, although it is indeed doable. I like the idea of automatic optimization.
We still have to give an easy-to-use solution for users who simply want to write patterns which they later use for filtering. The current XML syntax is tedious to use, I agree, but what you suggest is, in my opinion, even more so. Eee, we'll see. I don't really see that many people writing patterndb rules. I think I could count all of them on two hands, and one hand would be BalaBit employees.
The way to make pattern writing easier, is not really the language itself (it does help if it is not cryptic; both grok and patterndb are. Compact, but cryptic), but the provided tools. Give people good tools, and they won't care the least bit about what language the tool produces as output.
Which brings me to another benefit of using a Clojure-compatible syntax for the PoC: it's easy to manipulate from Clojure *AND* ClojureScript too. It wouldn't be too hard to knock up a little web app that presents you with a bunch of logs, and you can interactively develop patterns, without ever having to look at the code produced under the hood.
Same could be done with Grok or PatternDB too, I suppose, but I'm not going to touch either from an application running in the browser.
Yes, you're absolutely right, although I don't really see what difference the underlying format makes -- but if it makes it more likely that you (or someone else) would come up with such a tool then it's a great plus by itself. greets, Peter
Peter Gyongyosi <gyp@balabit.hu> writes:
http://logstash.net/docs/1.1.1/filters/grok. Haven't looked at it in detail yet, but JSON has similar disadvantages: instead of parentheses, you'll have a ton of {} and [].
Having had a second look at some of the recipes... eeep, no, thank you. It has the same feel as the current patterndb, except instead of an XML container, it's JSON. The fundamental problem still remains: it uses format-string-like syntax. That's the most horrible, inconvenient and inflexible thing ever invented.
(Did I mention that I passionately hate format strings? Not just when they're used for parsing, but for formatting too.)
I think this is where our main differences come in: I do not hate format strings and I think they're quite readable and compact. But I have to agree that your updated syntax in the example below is easier to read.
Well, format strings are fine and all up until a point. Once you try to shovel all kinds of things into them, they start to get more and more complex, and then they become a terrible choice. As in, they're very fine for output, where they make it easy to translate strings. But when matching patterns... not so much. Nevertheless, I suppose it's up to one's own preferences.
(defruleset "4dd5a329-da83-4876-a431-ddcb59c2858c" {:class :system :provider :PoC}
  (with-pattern "Accepted " (word :usracct.authmethod)
                " for " (word :usracct.username)
                " from " (word :usracct.device)
                " port " (string :usracct.service)
    (do-> (set! :usracct.type "login"
                :usracct.sessionid "$PID"
                :usracct.application "$PROGRAM"
                :secevt.verdict "ACCEPT")
          (tag! :usracct :secevt))))
There ain't that many parentheses anymore, and I think it's sufficiently clear even for those who don't speak a bit of lisp. Just read it as-is, and you'll pretty much know what the ruleset does.
OK, I'm convinced, I could live with such a syntax. If I were to design it, I'd create something JSON-y instead of Lisp-y, but I think it's just a matter of personal preference (and the fact that it's been about a decade since I've done anything with Lisp whereas I have to handle something JSON-like weekly). But that's just the two of us: what do others think?
Well, JSON-like isn't much different:

{"ruleset": {"id": "4dd5a329-da83-4876-a431-ddcb59c2858c",
             "class": "system",
             "provider": "PoC",
             "rules": [{"pattern": ["Accepted ", {"usracct.authmethod": "word"},
                                    " for ", {"usracct.username": "word"},
                                    " from ", {"usracct.device": "word"},
                                    " port ", {"usracct.service": "string"}],
                        "actions": [{"set": {"usracct.type": "login",
                                             "usracct.sessionid": "$PID",
                                             "usracct.application": "$PROGRAM",
                                             "secevt.verdict": "ACCEPT"},
                                     "tag": ["usracct", "secevt"]}]}]}}

Or something along those lines... Writing a parser that turns this into the very same AST is about ~15 minutes of work. (At the moment, the internal AST can be serialized to and from JSON trivially, with about 3 lines of code, but the AST is far more verbose.)

Thing is, the input doesn't matter much. I like lisp-y, because I like lisp, and the PoC is in Clojure, and that also gives me a lot more power: I can use Clojure functions and macros, thereby reducing copy-paste waste within my rulesets, without having to extend the DSL itself. But we can write an input parser that turns patterndb, grok or whatever else you can think of into our internal AST, just like we can output pretty much anything that supports all the stuff described by the rulesets.
Or, as an intermediate step in the PoC, I can teach my generator to emit patterndb rules instead of C, if what I wrote is expressible that way. :)
Oh, I haven't thought of that, although it is indeed doable. I like the idea of automatic optimization.
FWIW, my PoC can generate patterndb rules, and soon enough, it will be able to read them too. C will be considerably harder, but I'm progressing with that too.
Yes, you're absolutely right, although I don't really see what difference the underlying format makes -- but if it makes it more likely that you (or someone else) would come up with such a tool then it's a great plus by itself.
If the format is easier to handle programmatically, then it's easier to make a tool to fiddle with it, imo. Patterndb is - I believe - hard to handle. It's not hard to generate from another format, mind you, but to parse it, and interpret it... that's a tough one. I plan to write a trivial interpreter along with my PoC, which will be slow and inefficient, but enough to show how easy it is to work with the format. -- |8]
Peter Gyongyosi <gyp@balabit.hu> writes:
3) What about pattern hierarchy == efficient matching?
Another thing that just popped into my mind: it wouldn't be much effort to teach the generator to generate different code when optimised for the best-case scenario and for the worst-case scenario.

For example, let's say we have sshd logs running through the parser, and our rule wants to process the successful login events. Worst-case scenario: we have a lot of failed logins. If it's a locked down system with one or two users only, who rarely log in, then there may very well be far more failed logins than successful ones. So in this case, we'd want the algorithm to eliminate failing matches ASAP. Since we know the length of the line, and the minimum length of a successful login, we can already skip any message shorter than that with a simple integer comparison.

However, in the best case, when most logs are successful logins, this would be a waste of time. So if we can tell the compiler which scenario to optimise against, that can boost performance too. And this is something that's reasonably easy to do with the generator approach. -- |8]
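The worst-case-optimised check described above is tiny in code: reject any message shorter than the minimum possible length of a successful-login line with a single integer comparison, before doing any string work at all. A sketch (the minimum-length pattern and function name are illustrative, not generator output):

```c
#include <string.h>

/* Shortest conceivable "Accepted ..." line: one-character fields.
   sizeof on a string literal includes the trailing NUL, hence -1. */
#define MIN_ACCEPTED_LEN (sizeof("Accepted x for y from z port 0") - 1)

int
maybe_accepted_login(const char *msg, size_t len)
{
    if (len < MIN_ACCEPTED_LEN)
        return 0;                              /* cheap early rejection */
    return strncmp(msg, "Accepted ", 9) == 0;  /* then the real matching */
}
```

In the best-case scenario (mostly successful logins) the generator would simply drop the length test, since it would almost never fire and just adds a branch to the hot path.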
participants (4): Evan Rempel, Gergely Nagy, Jakub Jankowski, Peter Gyongyosi