Thoughts on patterndb syntax

I've been playing with 3.2beta1 recently and getting my feet wet with the patterndb support, which I haven't had a chance to work with before. I have a few thoughts regarding the patterndb rule syntax, mostly targeted at making things a little bit easier to work with.

- Rule IDs

Is there any particular reason why unique IDs were selected as rule identifiers? They're not particularly meaningful to people, and they're hard to talk about. It's much easier to say, "we're suddenly seeing lots of matches for ssh-accept-connection" than it is to say, "we're suddenly seeing lots of matches for 4dd5a329-da83-4876-a431-ddcb59c2858c". With the current version of syslog-ng it looks like it's possible to use arbitrary identifiers in place of UUIDs, and that's what I'm doing for my local rulesets.

This even makes classification metadata more useful, because .classifier.rule_id=ssh-accept-connection is immediately meaningful, while a UUID is useless unless I go grepping around the database.

- Whitespace

Log messages tend to be long, which makes them unwieldy in a number of situations. It would be nice if instead of this:

    <pattern>...some very long log message that makes my life difficult and my patterns hard to read ...</pattern>

I could do this:

    <pattern collapse_whitespace="yes">...some very long log message that
    has been conveniently wrapped so that it's easier to edit, email, and
    otherwise work with.</pattern>

Specifically, enabling "collapse_whitespace" would transform any sequence of whitespace to a single " ".

As proposed here this would be a completely backwards-compatible change because the default behavior would remain the same.

- Reusable patterns

We use Cisco firewall modules on our network. When these devices log a connection-related message, the source and destination address look something like this:

    ircs:3610:67.186.94.126/41004

That's:

    interface:vlan:ip/port

Which becomes:

    @ESTRING:fwsm.src_if::@@NUMBER:fwsm.src_vlan@:@IPv4:fwsm.src_ip@/@NUMBER:fwsm.src_port@

When this occurs throughout the ruleset, and multiple times within a single message, it really lowers the readability of the rules. I wish there was a way to modularize this so that I could create custom types, something like this:

    <type name="FWSM_ADDRSPEC"> @ESTRING:iface::@@NUMBER:vlan@:@IPv4:ip@/@NUMBER:port@ </type>

And then do this:

    <pattern>Accepted src @FWSM_ADDRSPEC:fwsm.src:@ dst @FWSM_ADDRSPEC:fwsm.dst:@</pattern>

And get this:

    fwsm.src.iface
    fwsm.src.ip
    fwsm.src.port

Etc.

Anyway, that's all for now.
This even makes classification metadata more useful, because .classifier.rule_id=ssh-accept-connection is immediately meaningful, while a UUID is useless unless I go grepping around the database.
You can do whatever you want with the rule ID as far as I know. I use straight integers for my rule IDs so that I can use an int column in my database schema. That said, I haven't found a particularly good use for the rule IDs yet--I guess it's more for posterity. Note that for the kinds of things you're doing, <tags> is a good way of attaching arbitrary values that you can grep for later, because you can standardize them across different rules and you can attach an arbitrary number of them.
When this occurs throughout the ruleset, and multiple times within a single message, it really lowers the readability of the rules.
I guess for me, readability is pretty far down on the list of features I want poor Bazsi slaving away on, and that's mainly because pdbtool does such a good job of verifying that my patterns match on exactly what I think they do. The other thing is that I think a lot of us are planning on using patternize to do auto pattern generation, and so if all goes to plan, humans won't have to be looking at these very often. On the other hand, I recognize that the easier it is to author rules, the more community contribution there will be.
On Oct 20, 2010, at 8:59 PM, Martin Holste wrote:
When this occurs throughout the ruleset, and multiple times within a single message, it really lowers the readability of the rules.
I guess for me, readability is pretty far down on the list of features I want poor Bazsi slaving away on, and that's mainly because pdbtool does such a good job of verifying that my patterns match on exactly what I think they do. The other thing is that I think a lot of us are planning on using patternize to do auto pattern generation, and so if all goes to plan, humans won't have to be looking at these very often. On the other hand, I recognize that the easier it is to author rules, the more community contribution there will be.
A few things that IMO would help with this would be:

1) LETTERS - like STRING, but ONLY matches on letters.
2) Ability to set the @ESTRING delimiter to \t. Right now, to get it to work, I use vim and Ctrl-V <tab> to insert a literal tab. Using \t doesn't work.
3) Partial match. From my experimentation it seems that to get a match you have to describe the entire message. If it isn't too much of a performance hit, I'd like to be able to declare a rule to be a partial match. Currently I can define everything up to the match and @ANYSTRING the remainder, but it feels ... off.

The way I read the ability that Lars listed, defining a pattern and then re-using it could be very powerful in that it could enable essentially a grammar. For example, I have a set of clusters that all use tab-delimited log messages, and the first N fields are the same. While it would certainly be more readable to have a "sub-pattern" that I could use to start each rule with, it would also seem to be more expressive.

Then again, I am currently working heavily with Apache access logs and looking to see what I can tease out ability-wise with patterndb. Thus my use case may not be that common. Yet.

Cheers,
Bill
Hi Lars,

First of all, thanks for your message. Any kind of feedback is very much appreciated; it basically makes me want to work on the code further :) So I'd like to urge everyone to post their opinions, they really make my day.

Comments on your points are below, inserted into your message.

On Wed, 2010-10-20 at 21:57 -0400, Lars Kellogg-Stedman wrote:
I've been playing with 3.2beta1 recently and getting my feet wet with the patterndb support, which I haven't had a chance to work with before. I have a few thoughts regarding the patterndb rule syntax, mostly targeted at making things a little bit easier to work with.
- Rule IDs
Is there any particular reason why unique IDs were selected as rule identifiers? They're not particularly meaningful to people, and they're hard to talk about. It's much easier to say, "we're suddenly seeing lots of matches for ssh-accept-connection" than it is to say, "we're suddenly seeing lots of matches for 4dd5a329-da83-4876-a431-ddcb59c2858c". With the current version of syslog-ng it looks like it's possible to use arbitrary identifiers in place of UUIDs, and that's what I'm doing for my local rulesets.
This even makes classification metadata more useful, because .classifier.rule_id=ssh-accept-connection is immediately meaningful, while a UUID is useless unless I go grepping around the database.
Well, I haven't really made up my mind about how rule IDs should be used. They were proposed by Marci (who implemented the patterndb syntax in the first place). syslog-ng doesn't really care, but they must be unique. They are useful because we can attach a lot of information to a patterndb rule. For example, if you use a "<description>" tag, you can retrieve that description when you browse the log, based on the unique ID. This is what SSB (our syslog-ng appliance product) does, for example.
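As a rough illustration of looking up a description by rule ID, here is a minimal Python sketch. It assumes a patterndb file in which each <rule> element carries an "id" attribute and an optional <description> child. The sample XML, the rule ID, and the helper name descriptions_by_rule_id are made up for this example, and this is not a claim about how SSB implements the lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal patterndb fragment used only for illustration.
SAMPLE_PATTERNDB = """\
<patterndb version="3" pub_date="2010-10-21">
  <ruleset name="sshd" id="ssh-ruleset">
    <pattern>sshd</pattern>
    <rules>
      <rule provider="local" id="ssh-accept-connection" class="system">
        <patterns>
          <pattern>Accepted publickey for @ANYSTRING:details@</pattern>
        </patterns>
        <description>Successful SSH public key login</description>
      </rule>
    </rules>
  </ruleset>
</patterndb>
"""


def descriptions_by_rule_id(xml_text):
    """Map each rule ID to its description, rejecting duplicate IDs."""
    root = ET.fromstring(xml_text)
    result = {}
    for rule in root.iter("rule"):
        rule_id = rule.get("id")
        if rule_id in result:
            # Rule IDs must be unique for the lookup to be meaningful.
            raise ValueError(f"duplicate rule id: {rule_id}")
        result[rule_id] = (rule.findtext("description") or "").strip()
    return result


DESCRIPTIONS = descriptions_by_rule_id(SAMPLE_PATTERNDB)
```

With a table like this, a log browser could show DESCRIPTIONS[rule_id] next to each classified message, which works the same way whether the ID is a UUID or a readable name.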
- Whitespace
Log messages tend to be long, which makes them unwieldy in a number of situations. It would be nice if instead of this:
<pattern>...some very long log message that makes my life difficult and my patterns hard to read ...</pattern>
I could do this:
<pattern collapse_whitespace="yes">...some very long log message that
has been conveniently wrapped so that it's easier to edit, email, and
otherwise work with.</pattern>
Specifically, enabling "collapse_whitespace" would transform any sequence of whitespace to a single " ".
As proposed here this would be a completely backwards-compatible change because the default behavior would remain the same.
Interesting idea and of course doable, but then if there's indeed multiple spaces in the message, you get in trouble.
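To make the trade-off concrete, here is a small Python sketch of what the proposed "collapse_whitespace" option might do to a pattern before it is loaded. The function name and the sample patterns are assumptions for illustration only; syslog-ng has no such option in the version discussed here.

```python
import re


def collapse_whitespace(pattern):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", pattern)


# A pattern wrapped across two lines for readability.
wrapped = "Accepted src @IPv4:src@\n    dst @IPv4:dst@"
collapsed = collapse_whitespace(wrapped)

# The catch: a pattern that must match two literal spaces loses one of them,
# so it would no longer match a message that really contains "code:  42".
double_space = collapse_whitespace("code:  42")
```

The wrapped pattern comes out as a single line, which is the convenience being asked for, while the second call shows the problem with messages that genuinely contain runs of spaces.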
- Reusable patterns
We use Cisco firewall modules on our network. When these devices log a connection-related message, the source and destination address look something like this:
ircs:3610:67.186.94.126/41004
That's:
interface:vlan:ip/port
Which becomes:
@ESTRING:fwsm.src_if::@@NUMBER:fwsm.src_vlan@:@IPv4:fwsm.src_ip@/@NUMBER:fwsm.src_port@
When this occurs throughout the ruleset, and multiple times within a single message, it really lowers the readability of the rules. I wish there was a way to modularize this so that I could create custom types, something like this:
<type name="FWSM_ADDRSPEC"> @ESTRING:iface::@@NUMBER:vlan@:@IPv4:ip@/@NUMBER:port@ </type>
And then do this:
<pattern>Accepted src @FWSM_ADDRSPEC:fwsm.src:@ dst @FWSM_ADDRSPEC:fwsm.dst:@</pattern>
And get this:
fwsm.src.iface fwsm.src.ip fwsm.src.port
This is a good idea. I'm also toying with the idea of extending the patterndb syntax a little bit. Nothing concrete yet, but reusable components will necessarily be a part of the picture.

-- Bazsi
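One way to picture the custom type idea quoted above is as a macro expansion step that runs before the patterns are loaded. The Python sketch below is only that: the CUSTOM_TYPES table, the expand_custom_types helper, and the dotted naming scheme are assumptions for illustration, not an existing patterndb feature. The last field of the type is named "port" here so that the expansion yields fwsm.src.port as listed in the proposal, and literal "@@" escapes are not handled.

```python
import re

# Hypothetical custom type table: type name -> pattern fragment.
CUSTOM_TYPES = {
    "FWSM_ADDRSPEC": "@ESTRING:iface::@@NUMBER:vlan@:@IPv4:ip@/@NUMBER:port@",
}

# Matches @PARSER:name@ or @PARSER:name:param@ (param may be empty).
PARSER_REF = re.compile(r"@([A-Za-z0-9_]+):([^:@]*)(?::([^@]*))?@")


def expand_custom_types(pattern, types=CUSTOM_TYPES):
    """Replace references to custom types with their prefixed definitions."""

    def prefix_names(body, prefix):
        # Rename every field in the type body to "<prefix>.<field>".
        def add_prefix(m):
            parser, name, param = m.group(1), m.group(2), m.group(3)
            suffix = "" if param is None else ":" + param
            return f"@{parser}:{prefix}.{name}{suffix}@"

        return PARSER_REF.sub(add_prefix, body)

    def expand(m):
        parser, name = m.group(1), m.group(2)
        if parser not in types:
            return m.group(0)  # built-in parser: leave untouched
        return prefix_names(types[parser], name)

    return PARSER_REF.sub(expand, pattern)


EXPANDED = expand_custom_types(
    "Accepted src @FWSM_ADDRSPEC:fwsm.src:@ dst @FWSM_ADDRSPEC:fwsm.dst:@"
)
```

A preprocessing step like this would keep the rule files readable without requiring any change to the matching engine, at the cost of an extra build step before the database is deployed.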
Interesting idea and of course doable, but then if there's indeed multiple spaces in the message, you get in trouble.
If you were to only give linebreaks special treatment -- so that "this\nthat" would become "this that" -- then you've probably solved both problems; messages can be wrapped for readability and you can still include arbitrary stretches of whitespace in the expression.
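A minimal Python sketch of this line-break-only variant follows. The function name is made up, and folding the indentation next to a line break into the single space is an added assumption so that wrapped and indented patterns join cleanly.

```python
import re


def join_wrapped_lines(pattern):
    """Turn each line break (plus adjacent indentation) into one space.

    Runs of spaces that are not next to a line break are left alone.
    """
    return re.sub(r"[ \t]*\r?\n[ \t]*", " ", pattern)


joined = join_wrapped_lines("this\nthat")
preserved = join_wrapped_lines("a  b\n   c")
```

Here "this\nthat" becomes "this that", while the two spaces between "a" and "b" survive because they do not touch a line break.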
On Thu, 2010-10-21 at 12:26 -0400, Lars Kellogg-Stedman wrote:
Interesting idea and of course doable, but then if there's indeed multiple spaces in the message, you get in trouble.
If you were to only give linebreaks special treatment -- so that "this\nthat" would become "this that" -- then you've probably solved both problems; messages can be wrapped for readability and you can still include arbitrary stretches of whitespace in the expression.
Hmm... and what about multi-line messages? Sorry to raise one problem at a time, but this is how they come to my scattered and distracted mind. (After returning from Netfilter Workshop, where I spent last week, this week is close to horrible :)

-- Bazsi
I always dealt with messages containing heinous characters (such as \t and \n) by running them through a rewrite rule to strip them out and replace them with ' ', then collecting them in an output file with this template:

    template t_raw { template("${MSGONLY}\n"); };

After that, you can just create the PatternDB based on the content of the file and you should be OK.

Scarier question: how do you detect multiline log messages when the logs arrive over a TCP socket? :-)

Matthew.

On Thu, Oct 28, 2010 at 08:40:07PM +0200, Balazs Scheidler wrote:
On Thu, 2010-10-21 at 12:26 -0400, Lars Kellogg-Stedman wrote:
Interesting idea and of course doable, but then if there's indeed multiple spaces in the message, you get in trouble.
If you were to only give linebreaks special treatment -- so that "this\nthat" would become "this that" -- then you've probably solved both problems; messages can be wrapped for readability and you can still include arbitrary stretches of whitespace in the expression.
Hmm... and what about multi-line messages? Sorry to raise one problem at a time, but this is how they come to my scattered and distracted mind. (After returning from Netfilter Workshop, where I spent last week, this week is close to horrible :)
-- Bazsi
both problems; messages can be wrapped for readability and you can still include arbitrary stretches of whitespace in the expression.
Hmm... and what about multi-line messages?
That's why I'm suggesting this as an optional behavior -- that way (a) it doesn't affect existing configurations, and (b) you can actually put multi-line patterns in place if you need them. If you know you need to match multi-line patterns, don't enable whitespace collapsing.

Another idea would be to adopt the syntax used by Perl and Python's "extended" regular expressions -- this requires that any whitespace be specified explicitly. So this:

    this \s
    is \s
    a \s
    test

Is the same as:

    this \s is \s a \s test

And of course you can make newlines explicit via \n and so forth.

Actually, the more we wander down this path the more I wish I could just use PCRE-style regular expressions as long as I was willing to put up with the performance impact.
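For reference, this is how the "extended" syntax behaves in Python's re module with the re.VERBOSE flag. It is a short sketch of the regular expression behavior only; the variable names are arbitrary and nothing here is patterndb syntax.

```python
import re

# In verbose mode, whitespace inside the pattern is ignored, so the
# expression can be wrapped freely; real whitespace must be written as \s.
VERBOSE_PATTERN = re.compile(
    r"""
    this \s
    is   \s
    a    \s
    test
    """,
    re.VERBOSE,
)

# The same expression on a single line.
ONE_LINE_PATTERN = re.compile(r"this \s is \s a \s test", re.VERBOSE)

SAMPLE = "this is a test"
both_match = bool(VERBOSE_PATTERN.fullmatch(SAMPLE)) and bool(
    ONE_LINE_PATTERN.fullmatch(SAMPLE)
)
```

Both expressions match the same input, and neither matches "thisisatest", because every required whitespace character has to be spelled out explicitly.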
On Oct 28, 2010, at 12:58 PM, Lars Kellogg-Stedman wrote:
And of course you can make newlines explicit via \n and so forth. Actually, the more we wander down this path the more I wish I could just use PCRE-style regular expressions as long as I was willing to put up with the performance impact.
I'd go for that too. Especially if it were implemented in the "create the regex once and re-use" fashion.
On Thu, 2010-10-28 at 13:13 -0600, Bill Anderson wrote:
On Oct 28, 2010, at 12:58 PM, Lars Kellogg-Stedman wrote:
And of course you can make newlines explicit via \n and so forth. Actually, the more we wander down this path the more I wish I could just use PCRE-style regular expressions as long as I was willing to put up with the performance impact.
I'd go for that too. Especially if it were implemented in the "create the regex once and re-use" fashion.
I agree that the /x syntax actually produces quite readable regexps. Otherwise regexps are quite unreadable and hard to maintain.

-- Bazsi
Otherwise regexps are quite unreadable and hard to maintain.
I'm not sure that:

    Accepted publickey for (?<user>\S+) from (?<ipaddr>\S+) port (?<port>\d+) (?<version>.*)

Is any less readable than:

    Accepted publickey for @ESTRING:user: @ from @IPv4:ipaddr:@ port @NUMBER:port:@ @ANYSTRING:version:@

In general, I don't think the patterndb syntax adds anything in terms of readability or maintainability. I assume that regular expressions were rejected primarily for performance reasons, which may be a bigger concern in some environments than others. The performance of modern hardware means that in our environment this isn't a particular concern (but we're not a large environment by any definition).

I would argue that having to learn an entirely new syntax for this one application actually makes it less readable, since one can't apply lessons learned from working with other tools.
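For comparison, here is the regular expression variant as a runnable Python sketch, compiled once and reused as suggested earlier in the thread. Note that Python's re module spells named groups (?P<name>...) rather than the PCRE-style (?<name>...) used above. The sample message, the address, and the helper name are made up for illustration.

```python
import re

# Compiled once at import time and reused for every message.
SSH_ACCEPT = re.compile(
    r"Accepted publickey for (?P<user>\S+) from (?P<ipaddr>\S+) "
    r"port (?P<port>\d+) (?P<version>.*)"
)


def parse_ssh_accept(message):
    """Return the named fields as a dict, or None if the message differs."""
    match = SSH_ACCEPT.fullmatch(message)
    return match.groupdict() if match else None


# Hypothetical sample message; 192.0.2.10 is a documentation address.
SAMPLE = "Accepted publickey for lars from 192.0.2.10 port 41004 ssh2"
FIELDS = parse_ssh_accept(SAMPLE)
```

Unlike @IPv4@, the \S+ group accepts any non-space token, so this version is looser about what counts as an address.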
On Thu, 2010-10-28 at 16:25 -0400, Lars Kellogg-Stedman wrote:
Otherwise regexps are quite unreadable and hard to maintain.
I'm not sure that:
Accepted publickey for (?<user>\S+) from (?<ipaddr>\S+) port (?<port>\d+) (?<version>.*)
Is any less readable than:
Accepted publickey for @ESTRING:user: @ from @IPv4:ipaddr:@ port @NUMBER:port:@ @ANYSTRING:version:@
In general, I don't think the patterndb syntax adds anything in terms of readability or maintainability. I assume that regular expressions were rejected primarily for performance reasons, which may be a bigger concern in some environments than others. The performance of modern hardware means that in our environment this isn't a particular concern (but we're not a large environment by any definition).
I would argue that having to learn an entirely new syntax for this one application actually makes it less readable, since one can't apply lessons learned from working with other tools.
Well, the two are not the same. For example, this regexp parses an IPv6 address:

    '/^(?:(?>(?>([a-f0-9]{1,4})(?>:(?1)){7})|(?>(?!(?:.*[a-f0-9](?>:|$)){8,})((?1)(?>:(?1)){0,6})?::(?2)?))|(?>(?>(?>(?1)(?>:(?1)){5}:)|(?>(?!(?:.*[a-f0-9]:){6,})((?1)(?>:(?1)){0,4})?::(?>(?3):)?))?(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(?>\.(?4)){3}))$/iD'

I wouldn't say this is readable, especially if repeated in a lot of patterns scattered around in a file.

-- Bazsi
On Wed, 2010-10-20 at 21:57 -0400, Lars Kellogg-Stedman wrote:
I've been playing with 3.2beta1 recently and getting my feet wet with the patterndb support, which I haven't had a chance to work with before. I have a few thoughts regarding the patterndb rule syntax, mostly targeted at making things a little bit easier to work with.
- Rule IDs
Is there any particular reason why unique IDs were selected as rule identifiers? They're not particularly meaningful to people, and they're hard to talk about. It's much easier to say, "we're suddenly seeing lots of matches for ssh-accept-connection" than it is to say, "we're suddenly seeing lots of matches for 4dd5a329-da83-4876-a431-ddcb59c2858c". With the current version of syslog-ng it looks like it's possible to use arbitrary identifiers in place of UUIDs, and that's what I'm doing for my local rulesets.
This even makes classification metadata more useful, because .classifier.rule_id=ssh-accept-connection is immediately meaningful, while a UUID is useless unless I go grepping around the database.
I've removed the requirement to use UUIDs for these IDs from the XML schema. Until I have a better idea, it just requires any kind of string. Here's the patch:

    commit f334d4363b2dd38190e74d502f8fc266628944a7
    Author: Balazs Scheidler <bazsi@balabit.hu>
    Date:   Thu Oct 21 17:25:44 2010 +0200

        patterndb-3.xsd: do not require UUID format for rule/ruleset IDs

        For now, we're going to use UUIDs in patterndb, but that may change later.

-- Bazsi
participants (5)
- Balazs Scheidler
- Bill Anderson
- Lars Kellogg-Stedman
- Martin Holste
- Matthew Hall