advice/assistance with parsing attempt requested
i've spent the better part of the past week reading and trying to understand both the documentation and list posts trying to sort this out. if anyone can offer some advice as to whether this is possible or not, and if so, what i'm doing wrong, i would really appreciate it! …

i have a simple enough task, or so i thought! i've got a syslog stream being received by syslog-ng with too much data. what i'd like to do is parse out pieces of the stream and write only those to a file. the tricky part is that the order of the stream is very variable, so sometimes the strings preceding the desired named-parser values are present and sometimes not. furthermore, the extra data is also quite variable. can this challenge even be addressed with syslog-ng ose? if so, can it be done with patterndb without creating a pattern for EVERY variation of possible streams?

for clarification, we've tried to leverage an external perl script which does this using regexes, but it seems that it can't keep up with the stream; we only receive 10% of the original events in the output. if this (an external parsing script) is the only way this can be done, we will continue our efforts to enhance the external script, but if this is possible to do natively within syslog-ng, i'd rather do that.

with the following configuration, i am able to generate output log entries which correctly contain the global macros $DATE $FULLHOST $PROGRAM, as well as the strings preceding the named parser variables, but not the named parser macros.
my output looks like this:

Dec 2 11:11:11 127.0.0.1 ABC: 0 namedparser1= namedparser2= namedparser3= namedparser4= namedparser5=

*****examples of entries in source stream*****

Dec 2 11:11:11 127.0.0.1 ABC: 0 namedparser1=namedparser1value extra1=extravalue1 namedparser2=namedparser2value namedparser3=namedparser3value extra2=extravalue2 namedparser4=namedparser4value namedparser5=namedparser5value extra3=extravalue3
Dec 2 11:11:11 127.0.0.1 ABC: 0 extra1=extravalue1 namedparser3=namedparser3value extra2=extravalue2 namedparser4=namedparser4value namedparser5=namedparser5value extra3=extravalue3 extra4=extravalue4
Dec 2 11:11:11 127.0.0.1 ABC: 0 namedparser1=namedparser1value extra1=extravalue1 namedparser2=namedparser2value namedparser3=namedparser3value extra2=extravalue2 namedparser4=namedparser4value extra3=extravalue3

*****examples of desired output*****

Dec 2 11:11:11 127.0.0.1 ABC: 0 namedparser1=namedparser1value namedparser2=namedparser2value namedparser3=namedparser3value namedparser4=namedparser4value namedparser5=namedparser5value
Dec 2 11:11:11 127.0.0.1 ABC: 0 namedparser3=namedparser3value namedparser4=namedparser4value namedparser5=namedparser5value
Dec 2 11:11:11 127.0.0.1 ABC: 0 namedparser1=namedparser1value namedparser2=namedparser2value namedparser3=namedparser3value namedparser4=namedparser4value

*****included in conf file***** (note: closing paren added to db_parser())

parser pattern_db {
    db_parser(file("/opt/syslog-ng/config/patterndb.xml"));
};

template reduced {
    template("$DATE $FULLHOST $PROGRAM: 0 namedparser1=$NAMEDPARSER1 namedparser2=$NAMEDPARSER2 namedparser3=$NAMEDPARSER3 namedparser4=$NAMEDPARSER4 namedparser5=$NAMEDPARSER5\n");
    template_escape(no);
};

destination d_logfile {
    file("/opt/syslog-ng/logs/logfile" template(reduced));
};

log {
    source(source);
    parser(pattern_db);
    destination(d_logfile);
};

*****patterndb.xml contents***** (note: the <patterns> and <rules> elements must be closed with </patterns> and </rules>)

<patterndb version='3' pub_date=''>
  <ruleset name='globe' id='1234567890'>
    <pattern>ABC</pattern>
    <rules>
      <rule provider='someone' id='123' class='system'>
        <patterns>
          <pattern>ABC namedparser1=@ESTRING:NAMEDPARSER1:\ @ namedparser2=@ESTRING:NAMEDPARSER2:\ @ namedparser3=@ESTRING:NAMEDPARSER3:\ @ namedparser4=@ESTRING:NAMEDPARSER4:\ @ namedparser5=@ESTRING:NAMEDPARSER5:\ @</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>
</patterndb>

MANY thanks in advance!
On Dec 6, 2010, at 4:18 AM, <syslog-ng2010@hushmail.com> wrote:
i've spent the better part of the past week reading and trying to understand both the documentation and list posts trying to sort this out, if anyone can offer some advice as to whether this is possible or not and if so, what i'm doing wrong; i would really appreciate it! …
i have a simple enough task, or so i thought! i've got a syslog stream being received by syslog-ng with too much data. what i'd like to do is parse out pieces of the stream and write only those to a file. the tricky part is that the order of the stream is very variable, so sometimes the strings preceding the desired named-parser values are present and sometimes not. furthermore, the extra data is also quite variable. can this challenge even be addressed with syslog-ng ose? if so, can it be done with patterndb without creating a pattern for EVERY variation of possible streams?
I believe you can use the parser and filter in combination to log on match essentially. With this you would only need to set up patterns for the possible combinations you actually want to log/reduce.
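A rough sketch of that combination, assuming the OP's rule keeps its class='system' attribute; db-parser tags classified messages with a .classifier.<class> tag, which a filter can then select on (worth verifying against your OSE version's docs):

```
parser p_db { db_parser(file("/opt/syslog-ng/config/patterndb.xml")); };

# only messages the pattern db actually classified reach the destination;
# tags(".classifier.system") matches the rule's class='system' attribute
filter f_matched { tags(".classifier.system"); };

log {
    source(source);
    parser(p_db);
    filter(f_matched);
    destination(d_logfile);
};
```

Unmatched messages simply fall off this log path, so you only pay for patterns you actually want to keep.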
for clarification, we've tried to leverage an external perl script which does this using regexes, but it seems that it can't keep up with the stream; we only receive 10% of the original events in the output. if this (an external parsing script) is the only way this can be done, we will continue our efforts to enhance the external script, but if this is possible to do natively within syslog-ng, i'd rather do that.
What is the volume of events here per second and per minute? Perl may not be the right tool for the job here (assuming you can't get it done natively in syslog-ng). If there are too many patterns for you to create, you might consider sending the base matches to an external daemon that processes them and sends them back into syslog for storage. Then again, I'd be weighing the cost of patterns vs. an external script or daemon. How much time to simply input the patterns? If you can do it in a script, you can have the script write your patterndb file for starters. Then there is the cost of adding new entries when they come around (assuming they do) vs. adding to the code.

Another option, if you don't want to keep the extras, might be to use a rewrite rule to remove extra1=extravalue1 prior to running the parser.

Cheers, Bill

-- Bill Anderson, RHCE Linux Systems Engineer bill.anderson@bodybuilding.com
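The rewrite-rule idea above might look like the following sketch, assuming the extras always match the extra<digit>=value shape from the OP's samples (the rule name here is made up):

```
rewrite r_drop_extras {
    # strip every "extraN=value " token from the message body before parsing;
    # flags(global) repeats the substitution for each occurrence in the message
    subst("extra[0-9]+=[^ ]* ?", "", value("MESSAGE"), flags(global));
};

log {
    source(source);
    rewrite(r_drop_extras);
    parser(pattern_db);
    destination(d_logfile);
};
```

With the extras gone, the remaining key/value pairs are closer to a fixed shape, which cuts down the number of patterndb permutations needed.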
Good points, Bill. This is a cool challenge! If the values can really come in any order and you don't necessarily know all possible extra values ahead of time, then there's a good chance that regexp is your only hope, through Perl or other means. Pattern-db is really not set up to do this kind of thing, because the order changes.

This must be pretty high volume, as I've got Perl doing regexp on around 3-4k large messages per second with no problems. If that's the case, maybe you want a hybrid solution of some sort where you do some of the formatting in pattern-db, but then output to Perl for the final parsing and writing. Another tactic might be to do multi-core processing with Perl by having Syslog-NG pipe to a master Perl process which uses round-robin load-balancing and the IO::AIO CPAN module to asynchronously send the logs to child processes where the actual PCRE matches take place. Something like:

Logs -> Syslog-NG -> Perl master -> AIO to Perl Child n -> write file to disk

Can you send a snippet of what your Perl script looks like? One regexp should be able to parse the message into an array, and a simple hash lookup should be enough to toss the "extra" key/val pairs. Here's how I would do it:

my %keep = (
    namedparser1 => 1,
    namedparser2 => 1,
    namedparser3 => 1,
    namedparser4 => 1,
    namedparser5 => 1,
);
my $test_msg = q{extra1=extravalue1 namedparser3=namedparser3value extra2=extravalue2 namedparser4=namedparser4value namedparser5=namedparser5value extra3=extravalue3 extra4=extravalue4};
my @arr = $test_msg =~ /(\w+)=(\w+)/g;
my @kept;
for (my $i = 0; $i < $#arr; $i += 2) {
    if ($keep{ $arr[$i] }) {
        push @kept, $arr[$i] . "=" . $arr[$i+1];
    }
}
print join(" ", @kept) . "\n";

Which should print:

namedparser3=namedparser3value namedparser4=namedparser4value namedparser5=namedparser5value

On Mon, Dec 6, 2010 at 8:38 AM, Bill Anderson <Bill.Anderson@bodybuilding.com> wrote:
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
On Dec 6, 2010, at 11:15 AM, Martin Holste wrote:
Good points, Bill.
This is a cool challenge!
Aye, it sure is. :)
If the values can really come in any order and you don't necessarily know all possible extra values ahead of time, then there's a good chance that regexp is your only hope, through Perl or other means. Pattern-db is really not set up to do this kind of thing, because the order changes.
Agreed, if the order is going to be fully dynamic I would write a Python script to generate the permutations as a patterndb file and go that route. ;) If that wasn't desired (or for some other reason didn't work), I'd probably go with a Python daemon or C++ (I've done a lot of log parsing using Qt, for example).
This must be pretty high volume, as I've got Perl doing regexp on around 3-4k large messages per second with no problems. If that's the case, maybe you want a hybrid solution of some sort where you do some of the formatting in pattern-db, but then output to Perl for the final parsing and writing.
Agreed, Perl is plenty quick, hence my wondering about the actual volume. If it is too much for Perl I'd go w/C++.
Logs -> Syslog-NG -> Perl master -> AIO to Perl Child n -> write file to disk
Personally, I'd make the last step routing back into syslog-ng with a source on a custom port and let syslog-ng handle the writing to disk. That way you can still use macros such as timestamps, etc. Then again, that may be because I do that all the time. ;) A log statement that takes everything from the custom source and logs to a file should work beautifully; no need for filters, though you could still do additional processing if needed.

That said, I'd also consider running a daemon that accepted all the input, formatted it, and then sent it to syslog-ng, pointing the clients at the custom daemon if that was possible. One advantage to the daemon route is that it wouldn't *have* to reside on the same system.

Cheers, Bill

-- Bill Anderson, RHCE Linux Systems Engineer bill.anderson@bodybuilding.com
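A minimal sketch of that loopback route; the port number, names, and paths here are arbitrary:

```
# custom port the Perl workers send their reduced messages back to
source s_loopback { udp(ip(127.0.0.1) port(10514)); };

destination d_reduced { file("/opt/syslog-ng/logs/reduced.log"); };

# everything arriving on the loopback port goes straight to disk;
# syslog-ng still supplies $DATE, $FULLHOST, etc. for any template
log { source(s_loopback); destination(d_reduced); };
```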
Agreed, Perl is plenty quick, hence my wondering about the actual volume. If it is too much for Perl I'd go w/C++.
From what I can tell, PCRE in Perl (or Python or whatever) is really close to C/C++ speeds because they're essentially using the same library and therefore mostly the same syscalls. I'd be really interested if anyone has benchmarks. I'd expect something like 10% better performance in C, but not much more, assuming that the vast majority of CPU time is spent on PCRE.
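If anyone wants to produce numbers, Perl's core Benchmark module makes a quick harness; the sample message and iteration count below are arbitrary:

```perl
use strict;
use warnings;
use Benchmark qw(timethis);

my $msg = q{namedparser1=v1 extra1=e1 namedparser2=v2};

# sanity check: the regex splits the message into 3 key/value pairs
my @pairs = $msg =~ /(\w+)=(\w+)/g;
print scalar(@pairs) / 2, " pairs\n";    # prints "3 pairs"

# time 100k iterations of the same global match
timethis( 100_000, sub { my @p = $msg =~ /(\w+)=(\w+)/g } );
```

Comparing the same loop compiled in C against libpcre directly would show how much of the gap is regex engine vs. interpreter overhead.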
Personally, I'd make the last step routing back into syslog-ng with a source on a custom port and letting syslog handle the writing to disk. That way you can still use macros such as timestamps, etc.. Then again, that may be because I do that all the time. ;) A log statement that takes everything from the custom source and logs to a file should work beautifully; no need for filters though you could still do additional processing if needed. That said I'd also consider running a daemon that accepted all the input, formatted it, and then sent it to syslog-ng, pointing the clients at the custom daemon if that was possible.
One advantage to the daemon route is that it wouldn't *have* to reside on the same system.
Yep, you could definitely let Syslog-NG handle the last mile as well. I was trying to keep the scope as narrow as possible in my example. I wonder if you could build an NFA state machine by conditionally looping output from a pattern-db parsed message into a source in Syslog-NG with a different pattern-db, depending on the previous output. Something like a token parser pdb that does an ESTRING up until " " and another one that only expects the key/val pair to be sent to it as the message. So it comes in as k1=v1 k2=v2 and the first kv gets gobbled up and then sent to another pdb source with a pdb which only matches if the message starts with certain terms. Then the rest of the original message is looped back to itself using @ANYSTRING@ to capture the remainder, that is, minus the kv which was sent to the kv pdb. It would keep recursively looping like that until there's no message left. If that all worked, your pattern db would be extremely simple as it would just be a pattern per key you were looking for, and order would no longer be an issue. Of course there's still the problem of demuxing the whole thing back into a coherent message, but I think that could be done a number of ways by passing the MSGID token with each part and using the new conditionals present in OSE 3.2. If OSE 3.3 can really do close to 1 million msgs/sec, then the overhead of resubmitting the same log many times may be bearable, especially with the threading.
On Dec 6, 2010, at 12:37 PM, Martin Holste wrote:
Agreed, Perl is plenty quick, hence my wondering about the actual volume. If it is too much for Perl I'd go w/C++.
From what I can tell, PCRE in Perl (or Python or whatever) is really close to C/C++ speeds because they're essentially using the same library and therefore mostly the same syscalls. I'd be really interested if anyone has benchmarks. I'd expect something like 10% better performance in C, but not much more, assuming that the vast majority of CPU time is spent on PCRE.
Yeah I was thinking the overhead might be in what is done, as opposed to just the RE portion. Of course, the OP script might be implemented rather differently. ;)
Personally, I'd make the last step routing back into syslog-ng with a source on a custom port and letting syslog handle the writing to disk. That way you can still use macros such as timestamps, etc.. Then again, that may be because I do that all the time. ;) A log statement that takes everything from the custom source and logs to a file should work beautifully; no need for filters though you could still do additional processing if needed. That said I'd also consider running a daemon that accepted all the input, formatted it, and then sent it to syslog-ng, pointing the clients at the custom daemon if that was possible.
One advantage to the daemon route is that it wouldn't *have* to reside on the same system.
Yep, you could definitely let Syslog-NG handle the last mile as well. I was trying to keep the scope as narrow as possible in my example.
I wonder if you could build an NFA state machine by conditionally looping output from a pattern-db parsed message into a source in Syslog-NG with a different pattern-db, depending on the previous output. Something like a token parser pdb that does an ESTRING up until " " and another one that only expects the key/val pair to be sent to it as the message. So it comes in as k1=v1 k2=v2 and the first kv gets gobbled up and then sent to another pdb source with a pdb which only matches if the message starts with certain terms. Then the rest of the original message is looped back to itself using @ANYSTRING@ to capture the remainder, that is, minus the kv which was sent to the kv pdb. It would keep recursively looping like that until there's no message left. If that all worked, your pattern db would be extremely simple as it would just be a pattern per key you were looking for, and order would no longer be an issue.
Maybe I'm nuts, but that sounds awesome to me. :D
Of course there's still the problem of demuxing the whole thing back into a coherent message, but I think that could be done a number of ways by passing the MSGID token with each part and using the new conditionals present in OSE 3.2.
Well, there is message correlation in 3.2.1 right? muahahaha
If OSE 3.3 can really do close to 1 million msgs/sec, then the overhead of resubmitting the same log many times may be bearable, especially with the threading.
True the rate might be the downside to that mechanism. However, the terseness of the messages might make up for some of it.
Hi,

Although I really like the ideas floating around, the best way to address this issue is to write a welf parser plugin for syslog-ng which simply produces name-value pairs from the input, without having to pipe them out to an external process. The round-trip (pipe-write, pipe-read, process, pipe-write, pipe-read) is simply enormous. And 3.2 already has plugins in place, so we only need someone volunteering to write a welf parser. :)

Something along the lines of:

parser { welf-parser(prefix(".welf")); };

This would turn all name-value pairs in the input into syslog-ng name-value pairs prefixed with '.welf', e.g. name1=value1 would become an NV pair with the name ${.welf.name1} and value "value1".

Does that make sense? Or am I missing something?

On Mon, 2010-12-06 at 13:01 -0700, Bill Anderson wrote:
-- Bazsi
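For readers wondering what the proposed plugin's behavior amounts to, here is a rough Perl model of it (the real thing would be a C plugin inside syslog-ng; the function name here is made up):

```perl
use strict;
use warnings;

# model of the proposed welf-parser(prefix(".welf")): split the message
# body into name=value pairs and store each under the prefixed name
sub welf_parse {
    my ( $msg, $prefix ) = @_;
    my %nv;
    while ( $msg =~ /(\w+)=(\S+)/g ) {
        $nv{ $prefix . $1 } = $2;
    }
    return \%nv;
}

my $nv = welf_parse( "name1=value1 name2=value2", ".welf." );
print "$_ => $nv->{$_}\n" for sort keys %$nv;
# .welf.name1 => value1
# .welf.name2 => value2
```

The selection step the OP wants then becomes a template referencing only ${.welf.namedparser1} through ${.welf.namedparser5}, regardless of field order.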
That would be awesome; I just can't code in C or C++. I would suppose, though, that an interested party could copy most of the CSV parser code and would just have to implement a function to sub-parse on the equal-sign delimiter.

On Wed, Dec 8, 2010 at 2:47 PM, Balazs Scheidler <bazsi@balabit.hu> wrote:
Although I really like the ideas floating around, the best way to address this issue is to write a welf parser plugin to syslog-ng which simply produces name-value pairs from the input, without having to pipe them out to an external process.
On Wed, Dec 08, 2010 at 09:47:48PM +0100, Balazs Scheidler wrote:
Although I really like the ideas floating around, the best way to address this issue is to write a welf parser plugin to syslog-ng which simply produces name-value pairs from the input, without having to pipe them out to an external process.
As a C programmer who makes heavy use of welf, it's on my roadmap. But so far I've been able to make my WELF usage work with patterndb, because my fields have a predictable order. So hopefully I'll be able to get to it some time in the next month or two, if nobody else does it before then. Matthew.
participants (5)
- Balazs Scheidler
- Bill Anderson
- Martin Holste
- Matthew Hall
- syslog-ng2010@hushmail.com