RFC: Applying transformations to a whole log message
Hi!

In the GeoIP thread[1], I started to play with the idea of introducing another way to modify messages.

So far, we have rewrite, which can set new values associated with a message, or change existing ones - one at a time. We also have template functions, which one can use to control how a specific value will be formatted. Again, pretty much one at a time.

What syslog-ng lacks right now is a way to apply a transformation to a message as a whole, a transformation that will take effect right there, right then, instead of making a modified copy like value-pairs() does. (value-pairs() also suffers from the problem that to be useful, it needs explicit support elsewhere: among the template functions, or within the destination driver.)

What I wish for is to be able to apply any number of transformation functions to a whole LogMessage. Whether the transformations rewire the key names, or change values, I'd love to be able to just tell syslog-ng, "here, take this message, go out and prosper, make it better, whatever it takes!" - and it would do just that.

To give a few examples, I'd love to be able to do any and all of the following:

* Ask syslog-ng to take a message, and look up every IP address associated with it (for simplicity's sake, let's assume every such address is stored within a key that ends with "_IP", and no other keys end with the same suffix) in a GeoIP database, and put the result in keys that have a "geo." prefix, followed by the original key name.

* Ask syslog-ng to take a message, and rename the keys according to various rules I set - similar to how value-pairs()' rekeying works, possibly following the same syntax. For example, I want to take all ".json.*" keys, remove the prefix, and uppercase the names. Then, I want to replace all leading dots with an underscore in whatever keys remain.

* Ask syslog-ng to remove keys completely. I don't care about the DATE field, because I receive CEE-enabled messages only, and they come with a high-precision date field anyway, called a "timestamp".

* I also want to drop every key where the value matches a certain pattern. Or perhaps not drop them, but anonymize the value.

  For example, I might not like the word "plasson", so much so that whatever key contains it, I never want to see it.

  I also want to pull a prank, because it just happens to be April 1st, so I want to replace every occurrence of "Linux" with "Emacs" within a LogMessage, in every single key.

* Since we're applying transformations, we might as well do what rewrite does too, and be able to set stuff - we already do subst.

  For example, I want to anonymize all the IP addresses. I don't mind the country codes being exposed, but I don't want the IPs in my logs.

* I want to be able to compose all of the above, chain them together, so one gets executed after the other, and in the end, the LogMessage will end up with their combined result.

For the above, I propose the following syntax:

,----
| map m_do_stuff {
|   geoip("*_IP", target-prefix("geo."))
|   rekey(".json.*",
|         shift(6) uppercase())
|   rekey(".*", replace(".", "_"))
|   filter-out(key("DATE"))
|   filter-out(value("plasson", type(substring)))
|   subst(value("Linux", "Emacs", type(substring)))
|   set(key("*_IP", "<anonymized>"))
| };
`----

Of course, this differs a bit from the syntax used in rewrite, and to be honest, intentionally so. I could never learn to love rewrite's way of set("new-value", value("key-name")). Nevertheless, the syntax can be changed to be similar to rewrite; the functionality would remain the same even then.

And how to use this?

  destination d_something { source(s_something); map(m_do_stuff); ... }

And in case we want to tie it to a condition, then:

  map(m_do_stuff, condition(filter(f_filter_condition)))

I don't think I'd want to support specifying maps in-line, but I suppose that could be done as well.
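To make the intended semantics of the m_do_stuff chain a little more concrete, here is a rough Python model of it. This is only a sketch under loud assumptions: a LogMessage is modelled as a plain dict of strings, patterns are treated as shell-style globs, FAKE_GEOIP stands in for a real GeoIP database, and every function name (make_map, replace_leading, filter_out_key and so on) is made up for illustration. None of this is syslog-ng code or an existing API.

```python
import fnmatch

# Hypothetical stand-in for a GeoIP database (documentation-range addresses).
FAKE_GEOIP = {"192.0.2.1": "HU", "198.51.100.7": "US"}


def matching(msg, pattern):
    """Snapshot of the keys matching a glob, so a step can mutate msg safely."""
    return [k for k in msg if fnmatch.fnmatchcase(k, pattern)]


def geoip(pattern, target_prefix):
    def step(msg):
        for key in matching(msg, pattern):
            msg[target_prefix + key] = FAKE_GEOIP.get(msg[key], "unknown")
    return step


def rekey(pattern, *renamers):
    def step(msg):
        for key in matching(msg, pattern):
            new_key = key
            for rename in renamers:  # renamers are applied left to right
                new_key = rename(new_key)
            msg[new_key] = msg.pop(key)
    return step


def shift(n):
    return lambda key: key[n:]


def uppercase():
    return str.upper


def replace_leading(old, new):
    # Only a leading occurrence is replaced, as described in the text.
    return lambda key: new + key[len(old):] if key.startswith(old) else key


def filter_out_key(pattern):
    def step(msg):
        for key in matching(msg, pattern):
            del msg[key]
    return step


def filter_out_value(substring):
    def step(msg):
        for key in [k for k, v in msg.items() if substring in v]:
            del msg[key]
    return step


def subst(old, new):
    def step(msg):
        for key in msg:  # values change, the key set does not
            msg[key] = msg[key].replace(old, new)
    return step


def set_key(pattern, value):
    def step(msg):
        for key in matching(msg, pattern):
            msg[key] = value
    return step


def make_map(*steps):
    """Compose steps; each one mutates the same message, in order."""
    def apply(msg):
        for step in steps:
            step(msg)
        return msg
    return apply


m_do_stuff = make_map(
    geoip("*_IP", target_prefix="geo."),
    rekey(".json.*", shift(6), uppercase()),
    rekey(".*", replace_leading(".", "_")),
    filter_out_key("DATE"),
    filter_out_value("plasson"),
    subst("Linux", "Emacs"),
    set_key("*_IP", "<anonymized>"),
)

msg = {
    "SRC_IP": "192.0.2.1",
    ".json.user": "alice",
    ".SDATA.origin": "plasson-host",
    "DATE": "May 10 04:08:00",
    "MESSAGE": "Linux rocks",
}
result = m_do_stuff(msg)  # modifies msg in place and returns it
print(result)
```

One thing the model makes visible: the geoip step stores its result under "geo.SRC_IP", which still ends in "_IP", so in this model the final set() glob matches it as well and overwrites the country code. Whether patterns should see keys created by earlier steps is something the real design would have to decide.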
Basically, this would be rewrite on steroids, with the ability to modify keys and values in bulk as well. The major advantage this would have is the ability to work not only on a single key, but to apply transformations to any number of key-value pairs, changing either of them or both.

If designed well, it could even be fast, on par with rewrite when it has to do similar things. I mean, the following two should be equally fast:

,----
| map m_set_host {
|   set(key("HOST", "myhost"));
| };
`----

,----
| rewrite r_set_host {
|   set("myhost", value("HOST"));
| };
`----

For this to work, and for optimisations to be made possible, the implementation will have to be clever, and able to take shortcuts. I have a few ideas about that too, but that's a topic for a later time: let's see first if the idea is deemed useful, and if the syntax I came up with makes sense to anyone else.

What do YOU think? Would you have a use for a way to configure bulk transformations? If so, what other transformations would you find interesting?

--
|8]
This is definitely something that's needed, but I'm a bit concerned with the complexity. I want to propose another idea, which is just off the top of my head: What if something like the program() destination could be used to do the message transformations, so that your favorite script or C program can be used inline as a log preprocessor as well as a destination?

The reason I think this could be helpful is that you could then re-use utility scripts and code you already have lying around without having to learn the new system. Granted, in a lot of cases, the proposed built-in system would be fairly straightforward, but for advanced usage, like tying in with external databases, it could be very helpful to have the ability to offload the transforming to an arbitrary script or program. I think the challenge would be with latency and potential queue clogging, but that can be managed.

On Thu, May 10, 2012 at 4:08 AM, Gergely Nagy <algernon@balabit.hu> wrote:
Martin Holste <mcholste@gmail.com> writes:
> This is definitely something that's needed, but I'm a bit concerned with the complexity. I want to propose another idea, which is just off the top of my head: What if something like the program() destination could be used to do the message transformations, so that your favorite script or C program can be used inline as a log preprocessor as well as a destination?
That would make it necessary to serialize LogMessages, pass them to the program, then deserialize the result - which would be costly, and that's something I can already do: I can send JSON to a program, and set up my system to get JSON back, parse it and be happy. It's not efficient, and requires a separate program running. It's much, much faster if some of these things can be done *inside* syslog-ng. It may not suit every possible need, but it covers a large set, and I hope to make it so that adding new functionality would be very, very easy.
> The reason I think this could be helpful is that you could then re-use utility scripts and code you already have lying around without having to learn the new system.
That's already possible with a little glue-code. It could be made simpler, so that you could use program() as a kind of pipe, and that's something that might be worth exploring, but it's not a replacement for what I wish to do with map{}.
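As a rough idea of what such glue code could look like, here is a minimal Python sketch of an external filter that reads one JSON-serialized message per line and writes the transformed message back out as JSON. The one-object-per-line framing, the transform() rules and all function names are assumptions made for illustration; how syslog-ng would be wired up to feed and read back such a program is not shown.

```python
import io
import json
import sys


def transform(record):
    """Example rules only: drop DATE, anonymize every key ending in _IP."""
    record.pop("DATE", None)
    for key in record:
        if key.endswith("_IP"):
            record[key] = "<anonymized>"
    return record


def process_line(line):
    # This is the round trip paid per message: deserialize, change, serialize.
    return json.dumps(transform(json.loads(line)))


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between messages
        stdout.write(process_line(line) + "\n")
        stdout.flush()  # do not let buffering hold messages back


# Small in-memory demonstration; a real filter would simply call main().
demo_in = io.StringIO('{"DATE": "May 10", "SRC_IP": "192.0.2.1", "MESSAGE": "hi"}\n')
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Every message goes through json.loads() and json.dumps() here, on top of being serialized and parsed again on the syslog-ng side, which is the overhead the in-process approach avoids.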
> Granted, in a lot of cases, the proposed built-in system would be fairly straightforward, but for advanced usage, like tying in with external databases, it could be very helpful to have the ability to offload the transforming to an arbitrary script or program. I think the challenge would be with latency and potential queue clogging, but that can be managed.
Indeed. This would be another useful feature, perhaps even easier to implement than the map{} stuff I proposed, but it has its disadvantages (speed & efficiency for one).

--
|8]
participants (2)
- Gergely Nagy
- Martin Holste