Peter Gyongyosi <gyp@balabit.hu> writes:
1) the lisp-y syntax
Yep, it is different, for two reasons: I like lisp, and I started coding the PoC in Clojure, where having a compatible syntax made the prototyping much, much faster. But as I said in the RFC, I understand the syntax may not be easy for non-lispy folk, so the whole compiler business is being coded with this in mind: the parser is entirely separate from the rest. As long as there's a parser to translate the source to an intermediate format, we'll be fine; the rest of the toolchain will handle it. Right now, I have clojure macros that translate a DSL to an intermediate format, which gets further translated into a lower-level representation (this is where the "(match)" stuff gets analyzed), which is then optimised (eliminating unused stuff, combining others, and so on and so forth), and the final step turns it into C. I also have a Lua and a Guile generator PoC'd up, so it is entirely possible to compile down to another, dynamic language, which can then be embedded in syslog-ng, and voila, no compiler is necessary! But I digress.
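To make the staging above a bit more concrete, here is a minimal sketch in Python (not the PoC's Clojure) of a front-end, an optimiser pass, and two interchangeable back-ends. Everything in it is an assumption made for illustration: the function names, the tuple-based intermediate format, and the emitted `expect_literal` / `capture_*` calls are hypothetical and do not come from the PoC.

```python
import json

def parse_rule(rule):
    """Front-end: turn a tuple-based rule into a flat intermediate list.

    Strings become ("literal", text); pairs such as ("word", "user")
    become ("capture", kind, name). A different surface syntax would only
    need a different front-end producing the same list.
    """
    ir = []
    for item in rule:
        if isinstance(item, str):
            ir.append(("literal", item))
        else:
            kind, name = item
            ir.append(("capture", kind, name))
    return ir

def optimise(ir):
    """Optimiser: drop empty literals and merge adjacent ones."""
    out = []
    for node in ir:
        if node[0] == "literal":
            if node[1] == "":
                continue  # an empty literal matches nothing useful
            if out and out[-1][0] == "literal":
                out[-1] = ("literal", out[-1][1] + node[1])
                continue
        out.append(node)
    return out

def emit_c(ir):
    """Back-end 1: one C-like call per node (hypothetical runtime API)."""
    lines = []
    for node in ir:
        if node[0] == "literal":
            lines.append("expect_literal(ctx, %s);" % json.dumps(node[1]))
        else:
            lines.append("capture_%s(ctx, %s);" % (node[1], json.dumps(node[2])))
    return "\n".join(lines)

def emit_lua(ir):
    """Back-end 2: same IR, Lua-like output (hypothetical runtime API)."""
    lines = []
    for node in ir:
        if node[0] == "literal":
            lines.append("ctx:expect_literal(%s)" % json.dumps(node[1]))
        else:
            lines.append("ctx:capture(%s, %s)" % (json.dumps(node[1]), json.dumps(node[2])))
    return "\n".join(lines)

rule = ("Accepted ", "", ("word", "usracct.authmethod"), " for", " ",
        ("word", "usracct.username"))
ir = optimise(parse_rule(rule))
print(emit_c(ir))
print(emit_lua(ir))
```

The only point of the sketch is the shape: both back-ends consume the same optimised list, so swapping the parser or the generator does not touch the other stages.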
Haven't looked at it in detail yet, but JSON has similar disadvantages: instead of parentheses, you'll have a ton of {} and []. Having had a second look at some of the recipes... eeep, no, thank you. It has the same feel as the current patterndb, except instead of an XML container, it's JSON. The fundamental problem still remains: it uses format-string-like syntax. That's the most horrible, inconvenient and inflexible thing ever invented. (Did I mention that I passionately hate format strings? Not just when they're used for parsing, but for formatting too.)
I want to write patterns and not code or a huge XML. The actual container format just needs to get out of my way as much as possible.
Yeah, understandable. While playing with the PoC, I came to the conclusion that the current language is too verbose. Thankfully, because it's all a bunch of clojure macros, I could build further macros to abstract away a bunch of things, and without *any* change to the code, I was able to rewrite this patterndb rule:

    <rule provider='patterndb' id='4dd5a329-da83-4876-a431-ddcb59c2858c' class='system'>
      <patterns>
        <pattern>Accepted @ESTRING:usracct.authmethod: @for @ESTRING:usracct.username: @from @ESTRING:usracct.device: @port @ESTRING:: @@ANYSTRING:usracct.service@</pattern>
      </patterns>
      <values>
        <value name='usracct.type'>login</value>
        <value name='usracct.sessionid'>$PID</value>
        <value name='usracct.application'>$PROGRAM</value>
        <value name='secevt.verdict'>ACCEPT</value>
      </values>
      <tags>
        <tag>usracct</tag>
        <tag>secevt</tag>
      </tags>
    </rule>

To this:

    (defruleset "4dd5a329-da83-4876-a431-ddcb59c2858c"
      {:class :system
       :provider :PoC}
      (with-pattern "Accepted " (word :usracct.authmethod) " for "
                    (word :usracct.username) " from "
                    (word :usracct.device) " port "
                    (string :usracct.service)
        (do-> (set! :usracct.type "login"
                    :usracct.sessionid "$PID"
                    :usracct.application "$PROGRAM"
                    :secevt.verdict "ACCEPT")
              (tag! :usracct :secevt))))

There ain't that many parentheses anymore, and I think it's sufficiently clear even for those who don't speak a bit of lisp. Just read it as-is, and you'll pretty much know what the ruleset does.
2) I like the action functions
I think these are the three main operations we need (set/clear/append); however, I wouldn't call appending "conjoining", but that's just me :)
I tried to stay as close to the Clojure terminology as possible. It's one line in the current PoC to make append! an alias to conj!:

    (def append! conj!)

Mind you, for practical reasons, I ended up using append! in the PoC too.
3) What about pattern hierarchy == efficient matching?
Your proposal allows the user to define complex conditions for a pattern match. On the other hand, the patterns we have right now work in a way that allows us to organize them in a radix tree and use a greedy, non-backtracking algorithm for matching which makes this procedure incredibly fast.
That's where the optimisation step comes in. In due time, I will be able to teach the optimiser to use a radix tree whenever possible, and only fall back when the complexity demands that.
Whereas if we'd allow more complex conditions, we'd need to fall back to a linear matching: if we have 5000 patterns, we'd have to match each and every pattern to each incoming message. Which is slow.
Indeed. Which is why the language is limited enough to allow the optimiser to (reasonably easily) figure out what algorithm to use. I do not want to limit complexity just because it makes it possible to write less efficient - or even horribly inefficient - parsers. Sometimes that is necessary, and I want to allow complex patterns too, while maintaining the ability to generate very fast code for the simple ones. As an example, it is entirely possible to translate simpler rulesets from my language to patterndb. If a ruleset can be translated to patterndb syntax, then the same algorithms can be used too. Perhaps I can even reuse the already existing code... Or, as an intermediate step in the PoC, I can teach my generator to emit patterndb rules instead of C, if what I wrote is expressible that way. :)
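A rough sketch of that trade-off, in Python rather than the PoC's Clojure, and heavily simplified: messages are split on whitespace, and a prefix tree over whole tokens stands in for patterndb's character-level radix tree. The pattern encoding, the ("word", name) and ("regex", name, expr) forms, and all function names are assumptions made up for this example. Simple rulesets get the greedy, non-backtracking tree walk; anything else falls back to trying the rules one by one.

```python
import re

def is_simple(pattern):
    """Simple = only literal tokens and plain ("word", name) captures."""
    return all(isinstance(t, str) or t[0] == "word" for t in pattern)

def new_node():
    return {"lit": {}, "cap": None, "id": None}

def build_tree(rules):
    """Build a token-level prefix tree shared by all rules."""
    root = new_node()
    for rule_id, pattern in rules.items():
        node = root
        for tok in pattern:
            if isinstance(tok, str):
                node = node["lit"].setdefault(tok, new_node())
            else:
                if node["cap"] is None:
                    node["cap"] = new_node()
                node = node["cap"]
        node["id"] = rule_id
    return root

def match_tree(root, rules, message):
    """Greedy walk: prefer a literal edge, else a capture edge, never back up.

    The work done depends on the message length, not on the rule count.
    """
    node, values = root, []
    for tok in message.split():
        if tok in node["lit"]:
            node = node["lit"][tok]
        elif node["cap"] is not None:
            values.append(tok)
            node = node["cap"]
        else:
            return None
    if node["id"] is None:
        return None
    names = [t[1] for t in rules[node["id"]] if not isinstance(t, str)]
    return node["id"], dict(zip(names, values))

def match_linear(rules, message):
    """Fallback: try every rule in turn (cost grows with the rule count)."""
    toks = message.split()
    for rule_id, pattern in rules.items():
        if len(pattern) != len(toks):
            continue
        fields = {}
        for p, t in zip(pattern, toks):
            if isinstance(p, str):
                if p != t:
                    break
            elif p[0] == "regex" and not re.fullmatch(p[2], t):
                break
            else:
                fields[p[1]] = t
        else:
            return rule_id, fields
    return None

def make_matcher(rules):
    """Pick the tree when every rule is simple, else fall back to linear."""
    if all(is_simple(p) for p in rules.values()):
        root = build_tree(rules)
        return lambda msg: match_tree(root, rules, msg)
    return lambda msg: match_linear(rules, msg)

rules = {
    "ssh-accept": ["Accepted", ("word", "method"), "for", ("word", "user")],
    "ssh-fail": ["Failed", ("word", "method"), "for", ("word", "user")],
}
print(make_matcher(rules)("Accepted password for root"))
```

Note that the greedy walk can reject a message that a backtracking matcher would accept (a literal edge is taken and later turns out to be a dead end); that is the price of never backing up, and a real optimiser would have to account for it.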
4) this is horrible: (match "this " (capture :qstring :as "object") " is good")
Sorry for my bluntness, but it is :) It indeed is lisp-y, but it is hard to read and tedious to write with all those parentheses and "capture"s.
Yep, I ended up dropping this syntax, and on the lowest level of the PoC, this is now:

    (match "this " (capture-as "object" :string) " is good")

But with a macro, it can be turned into:

    (match "this " (string :object) " is good")

Same number of parentheses, but shorter, and easier to understand for a human.
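For those who don't speak lisp, the rewrite from the short form to the low-level form can be sketched in Python. This is not how the PoC does it (there, Clojure macros do the work); forms are modelled as nested tuples, keywords as strings starting with ":", and the `SUGAR` set and `expand` function are names invented for this illustration.

```python
# Shorthand heads that expand to a (capture-as NAME :TYPE) form.
# "string" is taken from the example above; "word" is assumed to work alike.
SUGAR = {"string", "word"}

def expand(form):
    """Recursively rewrite (TYPE :name) into (capture-as "name" :TYPE)."""
    if not isinstance(form, tuple) or not form:
        return form  # literals and empty forms are left untouched
    head, *args = form
    if (head in SUGAR and len(args) == 1
            and isinstance(args[0], str) and args[0].startswith(":")):
        return ("capture-as", args[0][1:], ":" + head)
    return (head, *(expand(a) for a in args))

short = ("match", "this ", ("string", ":object"), " is good")
print(expand(short))
```

The expansion happens before anything else sees the rule, so the rest of the toolchain only ever has to understand the low-level form.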
I personally like the current syntax of the patterns themselves and I'd keep it as it is. (Grok -- again, of logstash fame -- also has something similar and it seems to be working for them, too:
https://github.com/logstash/logstash/blob/v1.0.17/patterns/linux-syslog
This is horrible. Sorry, but...

    SYSLOGBASE2 (?:%{SYSLOGTIMESTAMP:timestamp}|%{TIMESTAMP_ISO8601:timestamp8601}) (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:

I look at this, and I have no idea what it means. Let's translate this part to my lispy syntax:

    (match (syslog-timestamp :timestamp) (syslog-facility :fac) (string :host) (string :program):)
    (match (iso8601-timestamp :timestamp) (syslog-facility :fac) (string :host) (string :program):)

Two lines, because my syntax does not have an explicit OR, but that can be fixed in different ways, which I will not detail here. The most obvious would be to introduce an OR operator into the language, but I don't really like that: it makes it too easy to write patterns that are hard to optimize. Nevertheless, I don't find the grok syntax readable. It's the same "let's shovel everything into a format-string-like abomination!" nonsense that plagues many, many things, including our patterndb. That's what I'm trying to move away from, not the XML container. XML is far less evil than this :P
5) I agree that correlation should be handled separately -- but we need IDs/names for that!
I totally agree with you that correlation should be separated from parsing; I always have a hard time wrapping my mind around the way correlation works in patterndb. But to do that (and to do more filtering or anything else with this parsing), a pattern needs a name or ID. Sure, it can be added by a (set! :pattern-name "foo") command in your example, but I think it needs a more prominent place.
Yeah, I came to the same conclusion. The (cond ...) stuff in the original example was replaced by something along these lines:

    (ruleset "id" :message (match ...) (action...) ...)

So there's an explicit id there now.
So, as a summary: I think your approach has its place but cannot and should not replace patterndb.
I'll try my best to prove you wrong on the 'can' part. :)
It can be incredibly flexible, and as a result we would not have to bastardize patterndb to support every weird use case that comes up, but could rather simply point the user to these custom parsers -- but this flexibility has the price of having to do one-by-one matching between patterns and messages, which brings in a huge performance penalty.
I'm not convinced that's the case. For patterns that the optimiser finds too complex - yes. But remember: this whole thingamabob gets compiled down to a separate module. There's absolutely no reason not to use a better algorithm when the pattern allows us to.
We still have to give an easy-to-use solution for users who simply want to write patterns which they later use for filtering. The current XML syntax is tedious to use, I agree, but what you suggest is, in my opinion, even more so.
Eee, we'll see. I don't really see that many people writing patterndb rules. I think I could count all of them on two hands, and one hand would be BalaBit employees. The way to make pattern writing easier is not really the language itself (it does help if it is not cryptic; both grok and patterndb are compact, but cryptic), but the provided tools. Give people good tools, and they won't care the least bit about what language the tool produces as output. Which brings me to another benefit of using a Clojure-compatible syntax for the PoC: it's easy to manipulate from Clojure *AND* ClojureScript too. It wouldn't be too hard to knock up a little web app that presents you with a bunch of logs and lets you interactively develop patterns, without ever having to look at the code produced under the hood. The same could be done with Grok or PatternDB too, I suppose, but I'm not going to touch either from an application running in the browser.

-- |8]