[syslog-ng] [RFC] value-pairs and key rewriting

Fri May 27 17:07:21 CEST 2011

Hi!

A while ago, I posted a proposal about key rewriting for
value-pairs. Today I'm happy to announce that I have some half-baked
code, and a couple of ideas on how to proceed.

Below, I'll share a few technical details about the current
implementation, its limits, and my idea of the way forward.

To reiterate, this is roughly the syntax I described earlier:

> The way I imagine it, is something like this:
>
> value-pairs (
>   scope("selected-macros" "nv-pairs")
>   rekey(
>     regexp("^\.SDATA\.(.*)" "sd.$1")
>     prefix(".secevt.*" "events")
>     prefix("[A-Z]*" "syslog.")
>   )
> )

As of this writing, this is what's implemented:

value-pairs(
  scope("everything")
  rekey(
    add_prefix(".classifier.*" ".syslog-ng")
    shift(".sdata.*" 1)
    add_prefix(".*" "private")
  )
)

After constructing the full scope of keys to work with, value-pairs()
will iterate over them, and apply all the rekey transformations in the
order listed.

The two available transformation functions are:

* add-prefix(glob, prefix): matches keys against a shell glob, and
  prepends the prefix to every key that matches.
* shift(glob, amount): matches keys against a shell glob, and drops the
  first `amount` bytes from every key that matches.

This means that given a structure like the following:

{
 ".classifier.rule_id": "foobar",
 ".sdata.foo": "bar",
 ".sdata.bar": "baz",
 ".my-stuff.this": "that"
}

.classifier.rule_id is first transformed to
.syslog-ng.classifier.rule_id by the first rule. It doesn't match the
second rule, and the third transforms it again, to
private.syslog-ng.classifier.rule_id.

The second item, .sdata.foo, doesn't match the first rule. The second
rule shifts it to sdata.foo, after which it no longer matches the last
rule. The same goes for the third item, which ends up as sdata.bar.

The last item only matches the third rule, so it will be transformed
into private.my-stuff.this.
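
To double-check the walk-through above, here is a small Python sketch
(not syslog-ng code) that replays the three rules over the example
structure. It assumes that glob matching behaves like Python's
fnmatchcase() and that shift() drops leading bytes; the helper names
add_prefix, shift and rekey are made up for this illustration.

```python
from fnmatch import fnmatchcase


def add_prefix(glob, prefix):
    """Return a rule that prepends `prefix` to keys matching `glob`."""
    def rule(key):
        return prefix + key if fnmatchcase(key, glob) else key
    return rule


def shift(glob, amount):
    """Return a rule that drops the first `amount` characters of matching keys."""
    def rule(key):
        if amount > 0 and fnmatchcase(key, glob):
            return key[amount:]
        return key
    return rule


def rekey(pairs, rules):
    """Apply every rule, in the order listed, to every key."""
    result = {}
    for key, value in pairs.items():
        for rule in rules:
            key = rule(key)  # later rules see the already rewritten key
        result[key] = value
    return result


# Same rules and data as in the example above.
RULES = [
    add_prefix(".classifier.*", ".syslog-ng"),
    shift(".sdata.*", 1),
    add_prefix(".*", "private"),
]

PAIRS = {
    ".classifier.rule_id": "foobar",
    ".sdata.foo": "bar",
    ".sdata.bar": "baz",
    ".my-stuff.this": "that",
}
```

Calling rekey(PAIRS, RULES) yields the keys
private.syslog-ng.classifier.rule_id, sdata.foo, sdata.bar and
private.my-stuff.this, with the values untouched.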

That's about it!

In the near future, I want to implement two more transform functions:

* replace(prefix, new_prefix): takes two strings, and if a key starts
  with prefix, that prefix is replaced by new_prefix (the two can be of
  different lengths).
* regexp(pattern, replace): does pretty much the same as replace(), but
  instead of matching on a prefix, it does a whole PCRE match and
  replace.
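
Since neither of these exists yet, the following Python sketch is only
a guess at the intended semantics. Two assumptions to keep in mind:
Python's re module is not PCRE, and it writes backreferences as \1
where the earlier proposal used $1. The factory names replace and
regexp merely mirror the planned function names.

```python
import re


def replace(prefix, new_prefix):
    """Return a rule that swaps a leading `prefix` for `new_prefix`."""
    def rule(key):
        if key.startswith(prefix):
            # The two prefixes may differ in length, so build a new string.
            return new_prefix + key[len(prefix):]
        return key
    return rule


def regexp(pattern, replacement):
    """Return a rule that rewrites the first match of `pattern` in a key."""
    compiled = re.compile(pattern)  # compile once, reuse for every key

    def rule(key):
        return compiled.sub(replacement, key, count=1)
    return rule


# Both rules turn ".sdata.foo" into "sd.foo" and leave other keys alone.
sdata_by_prefix = replace(".sdata.", "sd.")
sdata_by_regexp = regexp(r"^\.sdata\.(.*)", r"sd.\1")
```

The prefix variant only needs a string comparison, while the regexp
variant runs a full pattern match for every key, so the former is
likely the cheaper choice whenever a fixed prefix is enough.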

Performance
===========

Performance is not the greatest, but I haven't measured yet, so
everything below should be taken with a grain of salt.

Key rewriting has inevitable costs, ones that we can't easily get
around: we need to match each key we work with against a pattern (or at
least a prefix in the best case), and then apply a transformation, which
most often will result in extra memory allocations.

I tried to keep allocations to a minimum, and to cache and look up
instead whenever possible. But that, too, has a cost, even if it is
slightly less than always allocating memory for the same
transformations.

There are probably a few ways in which performance could be (and will
be) improved, but at the moment, the focus is on bringing the full set
of features in, and cleaning up the mess I made afterwards. Only then,
when the mess is gone, will I start to think about making it as fast as
possible.

Implementation
==============

At the current stage of this work, the implementation is a bit messy and
inefficient, but it's not all that horrible, in my opinion. It works by
calling vp_transform_apply(vp, key) on every key that is inserted into
the final scope. If no transformations were specified, this function
returns immediately, and we go on as if nothing happened.

If we do have transformations, then each matching one gets applied in
turn, until we reach the end (I might introduce an optional "final" flag
for the transformation functions, so that 'final' flagged
transformations will short-circuit the loop if they match a key). The
transformation functions try their best not to duplicate strings or
allocate memory:

* shift() simply returns the same pointer it received, advanced by N
  bytes (if N < 0, the whole string is returned; otherwise no attempt
  is made, yet, to verify that the string is at least N bytes long).

* add-prefix() looks the key up in an internal hash first, and only if
  no entry is found does it build the prefixed key and store it there.

This allows me to make shift() far lighter on resources than
add-prefix(): it doesn't need any memory allocation at all!
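
The difference between the two can be illustrated in Python, with the
caveat that this is an analogy and not the C implementation: a
memoryview slice stands in for the shifted pointer, and a dict stands
in for the internal hash. The class name AddPrefix, its allocations
counter and the function shift_view are invented for this sketch.

```python
def shift_view(view, amount):
    """Return a view into the same buffer, `amount` bytes further in.

    No bytes are copied. Unlike raw pointer arithmetic in C, Python
    slicing is bounds-safe, so an oversized `amount` gives an empty view.
    """
    if amount < 0:
        return view  # negative amounts leave the key untouched
    return view[amount:]


class AddPrefix:
    """Prefixing rule that caches every key it has already built."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.cache = {}        # original key -> prefixed key
        self.allocations = 0   # counts how often a new string was built

    def __call__(self, key):
        cached = self.cache.get(key)
        if cached is None:
            cached = self.prefix + key  # the only place a string is built
            self.cache[key] = cached
            self.allocations += 1
        return cached
```

Applying AddPrefix("private") to the same key twice builds the string
once and returns the very same object the second time, at the price of
a hash lookup per call. shift_view() never builds anything.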

When syslog-ng is shutting down, value_pairs_free() will call the
->destroy callbacks of the various transformers, which are responsible
for freeing up the transformer-specific structures (e.g., add-prefix's
hash table).

There's quite a bit to improve still, though: in order to support
transformation functions that expect something other than a shell glob
as their first argument, the matching must be abstracted away as well,
among other things.

I'm also thinking about rewriting the current - quite hackish -
ValuePairTransformer structure into something that resembles
object-oriented design instead: we'd have a basic
ValuePairsTransformer, from which the various transformation functions
would inherit.

We'd end up with pretty much the same thing, just in a cleaner design.
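
As a rough picture of what such a design could look like, here is a
Python sketch. It is not the planned C code: the method names matches,
transform and destroy, the subclasses, and the transform_apply helper
are all assumptions, and the `final` attribute models the optional
short-circuit flag floated earlier, which is not implemented.

```python
from fnmatch import fnmatchcase


class ValuePairsTransformer:
    """Base class: owns the matching, subclasses own the rewriting."""

    def __init__(self, glob, final=False):
        self.glob = glob
        self.final = final  # hypothetical: stop the chain after a match

    def matches(self, key):
        # Subclasses could override this to match on something other
        # than a shell glob.
        return fnmatchcase(key, self.glob)

    def transform(self, key):
        raise NotImplementedError

    def destroy(self):
        """Release transformer-specific state; nothing to do by default."""


class Shift(ValuePairsTransformer):
    def __init__(self, glob, amount, final=False):
        super().__init__(glob, final)
        self.amount = amount

    def transform(self, key):
        return key[self.amount:] if self.amount > 0 else key


class AddPrefix(ValuePairsTransformer):
    def __init__(self, glob, prefix, final=False):
        super().__init__(glob, final)
        self.prefix = prefix
        self.cache = {}

    def transform(self, key):
        if key not in self.cache:
            self.cache[key] = self.prefix + key
        return self.cache[key]

    def destroy(self):
        self.cache.clear()  # stands in for freeing the hash table


def transform_apply(transformers, key):
    """Run `key` through the chain; return at once if the chain is empty."""
    if not transformers:
        return key
    for transformer in transformers:
        if transformer.matches(key):
            key = transformer.transform(key)
            if transformer.final:
                break
    return key
```

With the first rule of the earlier example marked final,
.classifier.rule_id would stop at .syslog-ng.classifier.rule_id
instead of also picking up the private prefix.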

For the adventurous types, the code is available from my git repo at
 git://git.balabit.hu/algernon/syslog-ng-3.3.git
on the vp/rekey branch.

While this is 3.4 material, it's on my 3.3 branch for now, because
Bazsi's 3.4 tree doesn't have some of the latest 3.3 stuff, which I
directly or indirectly depend on (e.g., all the newish value-pairs
fixes and enhancements), and I didn't feel like cherry-picking the good
stuff.

-- 
|8]