[RFC] value-pairs and key rewriting
Hi!
Now that value-pairs() is in 3.3, it's time to dig up an idea Bazsi and I were discussing way back when we first talked about value-pairs: a way to change the keys in a value-pairs set, without the need to explicitly specify them all using pair().
It's actually easier to explain this by explaining the need behind this feature: there's the MongoDB destination, and by default, SDATA goes under the "sdata" key, somewhat like this:
    { "sdata": { "test": "value" } }
Now, if I'd rather have those values under, say, "sd", I can't do that with the current driver, because I can't tell value-pairs() that "sdata" should be mapped to "sd" instead. The best I can do is exclude ".SDATA.*", and either use "$SDATA" and post-process it, or list all the .SDATA.* keys explicitly. Neither of which is good enough.
So, I propose that we should have a way to remedy this problem, and this remedy should be called "rekey()".
The way I imagine it is something like this:
    value-pairs (
      scope("selected-macros" "nv-pairs")
      rekey(
        regexp("^\.SDATA\.(.*)" "sd.$1")
        prefix(".secevt.*" "events")
        prefix("[A-Z]*" "syslog.")
      )
    )
This would do the following:
- Any key that begins with ".SDATA." will have that part replaced with "sd."
- Keys matching ".secevt.*" (shell glob, not regexp) will be prefixed with "events". Thus ".secevt.verdict" would become "events.secevt.verdict".
- Keys that begin with an uppercase letter (in practice, the all-uppercase macro names) would be prefixed with "syslog.", thus "HOST" would become "syslog.HOST".
The transformations would be applied to the raw set of keys, in the order they're listed in the configuration file. Initially, only regexp() and prefix() would be implemented, with the possibility of adding more, if the need arises.
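The proposed semantics can be sketched in a few lines of Python. This is only an illustration of the idea, not syslog-ng code: the helper names regexp(), prefix() and rekey() mirror the proposed configuration keywords, and the sketch assumes that each rule sees the key as rewritten by the rules before it.

```python
import fnmatch
import re


def regexp(pattern, replacement):
    """Rewrite keys matching a regular expression (hypothetical helper)."""
    compiled = re.compile(pattern)
    # The proposal writes groups as "$1"; Python's re module expects "\1".
    py_replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
    return lambda key: compiled.sub(py_replacement, key)


def prefix(glob, new_prefix):
    """Prepend new_prefix to keys matching a shell glob (hypothetical helper)."""
    return lambda key: new_prefix + key if fnmatch.fnmatchcase(key, glob) else key


def rekey(pairs, transforms):
    """Apply every transformation to every key, in the order listed."""
    result = {}
    for key, value in pairs.items():
        for transform in transforms:
            key = transform(key)  # assumption: rules are chained
        result[key] = value
    return result


RULES = [
    regexp(r"^\.SDATA\.(.*)", "sd.$1"),
    prefix(".secevt.*", "events"),
    prefix("[A-Z]*", "syslog."),
]

# Made-up sample values, only to exercise the three rules.
renamed = rekey(
    {".SDATA.test": "value", ".secevt.verdict": "ACCEPT", "HOST": "mouse"},
    RULES,
)
```

Note that the "[A-Z]*" glob only checks the first character, which is why the regexp rule rewriting to a lowercase "sd." prefix keeps those keys out of the "syslog." namespace.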
This would also solve another problem I encountered recently: if the value-pairs() result set contains both "$SDATA" and "$SDATA.*" (which is the case if one specifies scope("selected-macros" "nv-pairs") and the incoming message has structured data), then we'll have a key conflict in the MongoDB destination, because internally "foo.bar" gets translated to (using JSON notation):
    { "foo": { "bar": ... } }
Now, in the case of SDATA, this translates to something like the following:
    {
      SDATA: "[foo=bar]",    // $SDATA
      SDATA: {
        "foo": "bar"         // $SDATA.foo
      }
    }
This is because the MongoDB destination strips the leading dot at the moment (because that would be invalid too), and we end up with conflicting types: one string, and one object. The driver does not support overriding right now, so this is a problem.
I could, of course, change the driver to replace the dot with an underscore, but that would be costlier than the current stripping, and would still be ugly, in my opinion.
It's much nicer to allow the users to rewrite the keys instead, or prefix them.
That's about how far I got with thinking for now. Critique, comments and ideas would be most appreciated.
(PS: This is, of course, strictly 3.4 material, as 3.3 is in a feature freeze)
-- |8]
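The key conflict described in the message above can be reproduced with a small Python sketch. It is not the MongoDB driver's code: nest() is a hypothetical helper that strips one leading dot and splits on the remaining dots, and raising an error on a conflict is an assumption made here to make the problem visible.

```python
def nest(pairs):
    """Build a nested document from dotted keys (hypothetical helper)."""
    document = {}
    for key, value in pairs.items():
        if key.startswith("."):
            key = key[1:]  # the leading dot is stripped, as described above
        parts = key.split(".")
        node = document
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # The name already holds a string, so it cannot be an object too.
                raise ValueError("key conflict at %r" % part)
            node = child
        leaf = parts[-1]
        if leaf in node:
            raise ValueError("key conflict at %r" % leaf)
        node[leaf] = value
    return document


def conflicts(pairs):
    """Return True if nesting the pairs runs into a key conflict."""
    try:
        nest(pairs)
    except ValueError:
        return True
    return False


# "$SDATA" and ".SDATA.foo" both end up under the name "SDATA".
clash = conflicts({"SDATA": "[foo=bar]", ".SDATA.foo": "bar"})

# With the structured-data keys renamed to "sd.*" first, both fit.
rekeyed = nest({"SDATA": "[foo=bar]", "sd.foo": "bar"})
```

Renaming the ".SDATA.*" keys before nesting, which is what rekey() is meant to allow, keeps the string and the object under different names.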
Yep, I think you're on the right track in that some rewriting will definitely be necessary for Mongo.
I'm a bit concerned with performance, but Mongo will probably be the bottleneck when things don't fit in RAM anyway.
On Tue, May 10, 2011 at 2:07 PM, Gergely Nagy <algernon@balabit.hu> wrote:
Hi!
Now that value-pairs() is in 3.3, it's time to dig up an idea Bazsi and I were discussing way back when we first talked about value-pairs: a way to change the keys in a value-pairs set, without the need to explicitly specify them all using pair().
It's actually easier to explain this by explaining the need behind this feature: there's the MongoDB destination, and by default, SDATA goes under the "sdata" key, somewhat like this:
{ "sdata": { "test": "value" } }
Now, if I'd rather have those values under, say "sd", I can't do that with the current driver, because I can't tell value-pairs() that "sdata" should be mapped to "sd" instead. The best I can do, is exclude ".SDATA.*", and either use "$SDATA", and post-process it, or list all the .SDATA.* keys explicitly. Neither of which is good enough.
So, I propose that we should have a way to remedy this problem, and this remedy should be called "rekey()".
The way I imagine it, is something like this:
    value-pairs (
      scope("selected-macros" "nv-pairs")
      rekey(
        regexp("^\.SDATA\.(.*)" "sd.$1")
        prefix(".secevt.*" "events")
        prefix("[A-Z]*" "syslog.")
      )
    )
This would do the following:
- Any key that begins with ".SDATA." will have that part replaced with "sd."
- Keys matching ".secevt.*" (shell glob, not regexp) will be prefixed with "events". Thus ".secevt.verdict" would become "events.secevt.verdict".
- Keys that begin with an uppercase letter (in practice, the all-uppercase macro names) would be prefixed with "syslog.", thus "HOST" would become "syslog.HOST".
The transformations would be applied to the raw set of keys, in the order they're listed in the configuration file. Initially, regexp() and prefix() would be implemented only, with the possibility of adding more, if the need arises.
This would also solve another problem I encountered recently: if the value-pairs() result set contains both "$SDATA" and "$SDATA.*" (which is the case if one specifies scope("selected-macros" "nv-pairs") and the incoming message has structured data), then we'll have a key conflict in the MongoDB destination, because internally "foo.bar" gets translated to (using JSON notation):
{ "foo": { "bar": ... } }
Now, in the case of SDATA, this translates to something like the following:
    {
      SDATA: "[foo=bar]",    // $SDATA
      SDATA: {
        "foo": "bar"         // $SDATA.foo
      }
    }
This is because the MongoDB destination strips the leading dot at the moment (because that would be invalid too), and we end up with conflicting types: one string, and one object. The driver does not support overriding right now, so this is a problem.
I could, of course, change the driver to replace the dot with an underscore, but that would be costlier than the current stripping, and would still be ugly, in my opinion.
It's much nicer to allow the users to rewrite the keys instead, or prefix them.
That's about how far I got with thinking for now. Critique, comments and ideas would be most appreciated.
(PS: This is, of course, strictly 3.4 material, as 3.3 is in a feature freeze)
-- |8]
Martin Holste <mcholste@gmail.com> writes:
Yep, I think you're on the right track in that some rewriting will definitely be necessary for Mongo.
I'm hoping that it will be useful for JSON as well - but that might need some more work to be really useful. (The $(format-json) template function only supports a single level of values; embedded objects aren't supported at all yet. Key rewriting would be more useful if they were, like in Mongo's case.)
I'm a bit concerned with performance, but Mongo will probably be the bottleneck when things don't fit in RAM anyway.
I'm not too worried about Mongo performance; there are many ways it can be made to scale well. The key rewriting will come with a cost of course (especially the regexp-based one; I'm reasonably sure I can cook up prefix() to be quite efficient), but I, for one, don't mind that all that much: my immediate need is differently structured data, and I have enough free resources to throw at the task. ;)
-- |8]
Hi!
A while ago, I posted a proposal about key rewriting for value-pairs. Today I'm happy to announce that I have some half-baked code, and a couple of ideas on how to proceed.
Below, I'll share a few technical details about the current implementation, its limits, and my idea of the way forward.
To reiterate, this is roughly the syntax I described earlier:
The way I imagine it, is something like this:
    value-pairs (
      scope("selected-macros" "nv-pairs")
      rekey(
        regexp("^\.SDATA\.(.*)" "sd.$1")
        prefix(".secevt.*" "events")
        prefix("[A-Z]*" "syslog.")
      )
    )
As of this writing, this is what's implemented:
    value-pairs(
      scope("everything")
      rekey(
        add-prefix(".classifier.*" ".syslog-ng")
        shift(".sdata.*" 1)
        add-prefix(".*" "private")
      )
    )
After constructing the full scope of keys to work with, value-pairs() will iterate over them, and apply all the rekey transformations in the order listed.
The two available transformation functions are:
* add-prefix(glob, prefix): with which one can match keys based on a shell glob, and add a prefix to them.
* shift(glob, amount): with which one can match keys based on a shell glob, and cut the first "amount" bytes off the front of them.
This means that given a structure like the following:
    {
      ".classifier.rule_id": "foobar",
      ".sdata.foo": "bar",
      ".sdata.bar": "baz",
      ".my-stuff.this": "that"
    }
.classifier.rule_id will first be transformed to .syslog-ng.classifier.rule_id (due to the first rule); it doesn't match the second, and the third transforms it again to private.syslog-ng.classifier.rule_id.
The second item, .sdata.foo, doesn't match the first rule; the second rule will transform it to sdata.foo, and then it doesn't match the last rule anymore. Same goes for the third item in our list.
The last item only matches the third rule, so it will be transformed into private.my-stuff.this.
That's about it!
In the near future, I want to implement two more transform functions:
* replace(prefix, new_prefix): which takes two strings, and if a key starts with prefix, that prefix will be replaced by new_prefix (they can be of different length).
* regexp(pattern, replace): which does pretty much the same as replace(), but instead of matching on a prefix, does a whole PCRE match and replace.

Performance
===========
Performance is not the greatest, but I haven't measured yet, so everything below should be taken with a grain of salt.
Key rewriting has inevitable costs, ones that we can't easily get around: we need to match each key we work with against a pattern (or at least a prefix in the best case), and then apply a transformation, which most often will result in extra memory allocations.
I tried to limit allocations to a minimum, and cache & lookup instead, whenever possible. But that, too, has a cost, even if slightly less than always allocating memory for the same transformations.
There are probably a few ways in which performance could be (and will be) improved, but at the moment, the focus is bringing the full set of features in, and cleaning up the mess I made afterwards. Only then, when the mess is gone, will I start to think about making it the fastest possible.

Implementation
==============
At the current stage of this work, the implementation is a bit messy and inefficient, but it's not all that horrible, in my opinion.
It works by calling vp_transform_apply(vp, key) on every key that is inserted into the final scope. If no transformations were specified, this function returns immediately, and we go on as if nothing happened. If we do have transformations, then each matching one gets applied in turn, until we reach the end (I might introduce an optional "final" flag for the transformation functions, so that 'final' flagged transformations will short-circuit the loop if they match a key).
The transformation functions try their best not to duplicate strings or allocate memory:
* shift() simply returns the same pointer it received, just shifted N bytes (if N is < 0, the whole string is returned, but otherwise no attempt is made to verify that the string is long enough to shift N bytes - yet).
* add-prefix() will try to look up a match from an internal hash, and add the new transformation there, if one wasn't found.
This makes shift() far lighter on resources than add-prefix(): it doesn't need any memory allocation at all!
When syslog-ng is shutting down, value_pairs_free() will call the ->destroy callbacks of the various transformers, which are responsible for freeing up the transformer-specific structures (eg, add-prefix's hash table).
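The apply loop and the two transformation functions described above can be sketched in Python. This is a rough model, not the C code on the branch: the function name vp_transform_apply is borrowed from the message but takes a plain list here, make_add_prefix() and make_shift() are hypothetical helpers, and a dict stands in for add-prefix's internal hash.

```python
import fnmatch


def make_add_prefix(glob, new_prefix):
    """add-prefix(glob, prefix): prepend new_prefix to keys matching the glob."""
    cache = {}  # stands in for the internal hash table mentioned above

    def transform(key):
        if not fnmatch.fnmatchcase(key, glob):
            return key
        if key not in cache:
            # Build the prefixed key only the first time this key is seen.
            cache[key] = new_prefix + key
        return cache[key]

    transform.cache = cache  # exposed so the caching can be inspected
    return transform


def make_shift(glob, amount):
    """shift(glob, amount): drop the first `amount` characters of matching keys."""

    def transform(key):
        if amount < 0 or not fnmatch.fnmatchcase(key, glob):
            return key
        # In C this would be pointer arithmetic on a borrowed string; a Python
        # slice copies instead, and is safe even if the key is too short.
        return key[amount:]

    return transform


def vp_transform_apply(transforms, key):
    """Run every transformation over the key, in the order listed."""
    if not transforms:
        return key  # nothing configured: return immediately
    for transform in transforms:
        key = transform(key)
    return key


RULES = [
    make_add_prefix(".classifier.*", ".syslog-ng"),
    make_shift(".sdata.*", 1),
    make_add_prefix(".*", "private"),
]

KEYS = [".classifier.rule_id", ".sdata.foo", ".sdata.bar", ".my-stuff.this"]
rewritten = {key: vp_transform_apply(RULES, key) for key in KEYS}
```

Running the four sample keys from the message through these rules gives the same results as the walkthrough: the shifted ".sdata.*" keys lose their leading dot and therefore no longer match the final ".*" rule.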
There's quite a bit to improve still, though: in order to support transformation functions that expect something other than a shell glob as their first argument, the matching must be abstracted away as well, among other things.
I'm also thinking about rewriting the current - quite hackish - ValuePairTransformer structure into something that resembles object oriented design instead: we'd have a basic ValuePairsTransformer, from which the various transformation functions would inherit. We'd end up with pretty much the same thing, just in a cleaner design.
For the adventurous types, the code is available from my git repo at git://git.balabit.hu/algernon/syslog-ng-3.3.git on the vp/rekey branch.
While this is 3.4 material, it's on my 3.3 branch for now, because Bazsi's 3.4 tree doesn't have some of the latest 3.3 stuff, which I indirectly or directly depend on (eg, all the newish value-pairs fixes and enhancements), and I didn't feel like cherry-picking the good stuff.
-- |8]
Ladies and Gentlemen, welcome to the latest issue of a tired mouse's brain dump! (Also known as "Quid scriptura, parva mus?")
A while ago, I posted a proposal about key rewriting for value-pairs. Today I'm happy to announce that I have some half-baked code, and a couple of ideas on how to proceed.
Good news, everyone! The code's far better baked now! But we still need a little bit of dressing here and there to make it not only delicious, but attractive to the eyes as well.
A new class of object-like things was introduced, the root of which is the ValuePairsTransform structure: it has a match, a transform and a destroy function, and a match_str property. But I plan to get rid of match_str, and move it to the appropriate sub-classes instead - it just so happens that I sat down to write this memo before doing that.
Anyway, the ValuePairsTransform family of objects is at the core of the key rewriting functionality: whenever the value-pairs framework would insert a key into the final set, it will transform the key using a list of transformers. It will first call the appropriate object's ->match() function, which decides whether a particular key is interesting or not. If it is, we quickly transform it with ->transform(). And in the very end, we free up all memory with ->destroy().
One thing of note about the transform function is that it MUST return a const gchar * - and it is its responsibility to free that at ->destroy() time, and not earlier. If it needs freeing at all, that is. I might relax this, and add support for explicitly freeing the cached values, so that once we're done processing a message, we can free the associated cached data. Though, care was taken to design the cache so that it will not eat all that much memory.
The reason it must return a const gchar * is because I wanted to allow the key to be a borrowed pointer: the shift() transform function, for example, takes advantage of this.
Now, ValuePairsTransform and its descendants live in lib/vptransform.[ch]. They were moved out of value-pairs.[ch] as part of the cleanup process.
Apart from baking the code into something that starts to resemble a delicious cake, a few more ingredients were added too! Well, one. But it's a start, and I'm not a cook anyway...
This is the replace() transform function, with which one can select a prefix to match on, and replace it with another. Thus, we can do this:
    value-pairs (
      scope("selected-macros" "nv-pairs")
      rekey(
        replace("." "_")
      )
    );
And that will replace leading dots with an underscore. This can then replace a similar thing in the current mongodb destination, giving more freedom to the administrator, and making the destination driver's code simpler as well.
This is all available on the vp/rekey branch of my git tree at git://git.balabit.hu/algernon/syslog-ng-3.3.git - for the brave and adventurous, who are not afraid of mice.
For fun and profit, we can do some interesting transformation chains now:
    value-pairs(
      scope("everything")
      rekey(
        add-prefix(".secevt" "events")
        add-prefix(".classifier" "syslog-ng")
        shift(".sdata.*" 1)
        replace("." "_")
      )
    )
This will turn a key like ".secevt.verdict" into "events.secevt.verdict"; a key like ".classifier.rule_id" into "syslog-ng.classifier.rule_id"; everything under ".sdata" will be moved to the "sdata" namespace, and the rest of the keys that begin with a dot will begin with an underscore instead.
Funky, isn't it?
Performance should be a bit better than the last time, but there's still room for improvement: I see a few possible ways to get rid of the ugly self->cache hash tables I'm using in some of the transformers. But that will be the topic of another brain dump some other time.
For now, let's rejoice that the cake is almost fully baked now, the code's a lot cleaner, replace() is done, and there's very little left to do!
-- |8]
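The match/transform/destroy design and the chain from this message can be modelled in Python as follows. It is a sketch under assumptions, not the code in lib/vptransform.[ch]: the subclass names (GlobTransform, AddPrefix, Shift, Replace) and the apply_all() helper are made up for illustration, and the add-prefix globs are written with a trailing ".*" because the message does not say how a pattern without a wildcard is matched.

```python
import fnmatch


class ValuePairsTransform:
    """Base of the family: match(), transform() and destroy()."""

    def match(self, key):
        raise NotImplementedError

    def transform(self, key):
        raise NotImplementedError

    def destroy(self):
        pass  # subclasses release whatever they cached


class GlobTransform(ValuePairsTransform):
    """Keeps the match pattern in the subclass, matching by shell glob."""

    def __init__(self, glob):
        self.glob = glob

    def match(self, key):
        return fnmatch.fnmatchcase(key, self.glob)


class AddPrefix(GlobTransform):
    def __init__(self, glob, prefix):
        super().__init__(glob)
        self.prefix = prefix
        self.cache = {}

    def transform(self, key):
        if key not in self.cache:
            self.cache[key] = self.prefix + key
        return self.cache[key]

    def destroy(self):
        self.cache.clear()


class Shift(GlobTransform):
    def __init__(self, glob, amount):
        super().__init__(glob)
        self.amount = amount

    def transform(self, key):
        return key[self.amount:] if self.amount >= 0 else key


class Replace(ValuePairsTransform):
    """replace(prefix, new_prefix): swap a leading string for another."""

    def __init__(self, old_prefix, new_prefix):
        self.old_prefix = old_prefix
        self.new_prefix = new_prefix

    def match(self, key):
        return key.startswith(self.old_prefix)

    def transform(self, key):
        return self.new_prefix + key[len(self.old_prefix):]


def apply_all(transforms, key):
    """Call match() on each transformer, and transform() when it matches."""
    for t in transforms:
        if t.match(key):
            key = t.transform(key)
    return key


CHAIN = [
    AddPrefix(".secevt.*", "events"),
    AddPrefix(".classifier.*", "syslog-ng"),
    Shift(".sdata.*", 1),
    Replace(".", "_"),
]
```

With this chain, only keys that still start with a dot after the first three rules reach the underscore replacement, which matches the behaviour described in the message.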
Thanks, I've been following the posts here, and although I couldn't get a stab at the code yet, I wanted to react to let you know that this is not forgotten :)
I'm doing a forward merge of 3.3 into 3.4 right now.
On Fri, 2011-06-03 at 15:53 +0200, Gergely Nagy wrote:
Ladies and Gentlemen, welcome to the latest issue of a tired mouse's brain dump! (Also known as "Quid scriptura, parva mus?")
-- Bazsi
Balazs Scheidler <bazsi@balabit.hu> writes:
Thanks, I've been following the posts here, and although I couldn't get a stab at the code yet, I wanted to react to let you know that this is not forgotten :)
I'm doing a forward merge of 3.3 into 3.4 right now.
Wonderful, thanks!
The value-pairs/rekey branch of my syslog-ng-3.4 tree[1] is now up to date, and further development will continue there. I'll keep the 3.3 branch around for a little while, but no further changes will appear there.
[1]: git://git.balabit.hu/algernon/syslog-ng-3.4.git
-- |8]
participants (3)
- Balazs Scheidler
- Gergely Nagy
- Martin Holste