Not all messages getting to program destination
Setup: v3.1, with a program destination (a python script) and a secondary file destination in the _same_ log block (so it should be getting everything the program destination gets), no final flag. Stats are turned on, with a 60s interval.

Problem: In the script I am tallying lines parsed per minute. The lines parsed per minute are not matching a grep -c on the file destination, yet stats indicate no dropped or queued messages to either destination. If I cat the file into the same destination script on the command line, I get the same counts as I do grepping the file directly, so I would expect the script to return the same results as a syslog destination as it does from the shell.

Where else can I look, or what else can I turn on, to see why the discrepancy exists?

Cheers,
Bill

--
Bill Anderson, RHCE
Linux Systems Engineer
bill.anderson@bodybuilding.com
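[For concreteness, a minimal sketch of the kind of configuration being described; the source, destination, and path names here are invented for illustration, not Bill's actual config:]

```
options { stats_freq(60); };  # stats on, 60s interval

source s_net { tcp(ip(0.0.0.0) port(514)); };

destination d_script { program("/usr/local/bin/tally.py"); };
destination d_file   { file("/var/log/app/messages.log"); };

# Both destinations in the same log block, no flags(final),
# so each should see every message the other does.
log {
    source(s_net);
    destination(d_script);
    destination(d_file);
};
```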
Could you paste a snippet (even if paths, etc. are edited) so we could try to repro / interpret it in detail?

Matthew.

On Tue, Apr 12, 2011 at 04:51:38PM -0600, Bill Anderson wrote:
Swap out the python script with a different python script which simply writes to a file, and verify that looks the way it should (matches your grep). That should narrow things down.

On Tue, Apr 12, 2011 at 5:51 PM, Bill Anderson <Bill.Anderson@bodybuilding.com> wrote:
______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.campin.net/syslog-ng/faq.html
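[The "simply writes to a file" stand-in suggested above could look like the following minimal sketch; the file path and function name are invented for illustration:]

```python
#!/usr/bin/env python
# Stand-in program() destination: copy stdin to a file line by line,
# flushing after each line so nothing sits in a userspace buffer.
# Returns the line count so it can be compared against grep -c.
import sys

def copy_stdin_to_file(out_path, stream=None):
    stream = stream if stream is not None else sys.stdin
    count = 0
    with open(out_path, "a") as out:
        for line in stream:
            out.write(line)
            out.flush()
            count += 1
    return count

# In a real deployment this would run under a __main__ guard as
# copy_stdin_to_file(sys.argv[1]), wired up via e.g.
#   program("/usr/local/bin/sink.py /var/log/copy.log")
```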
On Apr 13, 2011, at 9:52 AM, Martin Holste wrote:
Swap out the python script with a different python script which simply writes to a file and verify that looks the way it should (matches your grep). That should narrow things down.
It ran overnight and now I am actually getting dropped counters, so at least that makes sense, though I don't know why I wasn't before. Unfortunately the dropped counters vary greatly, from 0 most minutes to ~26k. Looks like I may need to recode in C or C++; that's far too much variance for my tastes. Alternatively, some digging into flow-control might work, or a python script that dumps the message to either my rabbitmq system or the redis cache, with a python script pulling from there.

It's strange that the same python script can handle far more than that rate on stdin normally. In fact, if I grep out of the logfile the minutes where I get the >20k dropped changes and pipe them in, it handles them in just a few seconds.

Perhaps flow control may be the more sane route. Any tips in that direction?

Thanks,
Bill

--
Bill Anderson, RHCE
Linux Systems Engineer
bill.anderson@bodybuilding.com
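[In syslog-ng, flow-control is enabled per log path with a flag, usually together with a larger destination queue. A sketch, assuming the script destination is named d_script as in the earlier example; the queue size is illustrative:]

```
destination d_script {
    # deeper output queue to absorb bursts
    program("/usr/local/bin/tally.py" log_fifo_size(25000));
};

log {
    source(s_net);
    destination(d_script);
    # Apply backpressure to the source instead of dropping
    # when the destination queue fills up.
    flags(flow-control);
};
```

Note that flow-control can only slow down connection-oriented (TCP/unix-stream) sources; UDP senders cannot be throttled.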
Hm, you really shouldn't be dropping with a simple script. How many msgs/second?

On Wed, Apr 13, 2011 at 12:11 PM, Bill Anderson <Bill.Anderson@bodybuilding.com> wrote:
On Wed, Apr 13, 2011 at 03:28:18PM -0500, Martin Holste wrote:
Hm, you really shouldn't be dropping with a simple script. How many msgs/second?
Try recording a few hundred MB and replaying it through the script. Print a time stamp for every 10K processed and look for spikes / dips with a spreadsheet. Try a profiler, etc. Maybe something in the script is not stable under load. Or maybe syslog-ng is getting too many UDP datagrams with too small a socket buffer and losing them.

Matthew.
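[A replay harness in the spirit of this suggestion; the batch size and function name are illustrative. It emits one throughput line per 10K messages, in a tab-separated form that pastes straight into a spreadsheet:]

```python
#!/usr/bin/env python
# Count lines from a stream and report throughput every `batch` lines,
# so spikes and dips under load show up when graphed.
import sys
import time

def measure(stream, batch=10000, report=None):
    emit = report if report is not None else sys.stdout.write
    last = time.time()
    total = 0
    for _ in stream:
        total += 1
        if total % batch == 0:
            now = time.time()
            elapsed = now - last
            rate = batch / elapsed if elapsed > 0 else float("inf")
            emit("%d\t%.1f msgs/sec\n" % (total, rate))
            last = now
    return total

# Usage (with measure(sys.stdin) under a __main__ guard):
#   cat recorded.log | ./harness.py > throughput.tsv
```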
On Apr 13, 2011, at 5:28 PM, Matthew Hall wrote:
On Wed, Apr 13, 2011 at 03:28:18PM -0500, Martin Holste wrote:
Hm, you really shouldn't be dropping with a simple script. How many msgs/second?
Try recording a few hundred MB and replaying through the script.
Print a time stamp for every 10K processed and look for spike / dip with a spreadsheet. Try a profiler, etc.
I had done all of that (though using several GB of data), hence the confusion. Average traffic for this script is 200-400 msgs/second, with peaks of 1500/s.
Maybe something in the script is not stable under load.
Or maybe syslog-ng is getting too many UDPs with too small of a socket buffer and losing them.
If that were the case I'd expect those messages to also not make it into the log file. Then again, I'm using TCP for these. ;)

After a couple days of effort I've eliminated the problem. It turns out I'd get bursts of over 1200/second for a while; things apparently just piled up, and once behind it started dropping. I've increased log_fifo_size to 21k, moved some of the conditional logic into syslog, and reduced the fields in the message to just those absolutely required by the script, and the problem has disappeared entirely. The first two changes reduced it; the last finished it off.

Related question: is there a reliable way to reset the statistics counters?

Cheers,
Bill

--
Bill Anderson, RHCE
Linux Systems Engineer
bill.anderson@bodybuilding.com
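[The three changes described read roughly like the following sketch; the filter logic, template fields, and names are illustrative guesses, not the actual config:]

```
# Conditional logic moved from the script into syslog-ng:
filter f_script { program("myapp") and level(info..err); };

# Only the fields the script actually needs:
template t_slim { template("${ISODATE} ${HOST} ${MESSAGE}\n"); };

destination d_script {
    program("/usr/local/bin/tally.py"
            log_fifo_size(21000)   # absorb the >1200/s bursts
            template(t_slim));
};

log { source(s_net); filter(f_script); destination(d_script); };
```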
On Fri, 2011-04-15 at 11:30 -0600, Bill Anderson wrote:
Related question: is there a reliable way to reset the statistics counters?
Hmm.. not currently (apart from restarting syslog-ng), but it is a good idea nevertheless.

--
Bazsi
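[Since the counters only grow until a restart, one workaround is to keep the previous snapshot around and report per-interval deltas instead. The snapshot shape here (counter name -> value) and the counter names are invented; how the snapshots are collected, e.g. by parsing the periodic stats lines from the internal() source, is deployment-specific:]

```python
# Emulate a counter reset by differencing two monotonically
# growing stats snapshots.
def stats_delta(previous, current):
    """Per-counter increase between two snapshots."""
    return dict((name, current.get(name, 0) - previous.get(name, 0))
                for name in current)

# Hypothetical snapshots taken one stats interval apart:
before = {"program(d_script);dropped": 100, "program(d_script);processed": 5000}
after_snapshot = {"program(d_script);dropped": 126, "program(d_script);processed": 6200}
```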
participants (4)
-
Balazs Scheidler
-
Bill Anderson
-
Martin Holste
-
Matthew Hall