[syslog-ng] Syslog-ng 3.0.2 statistics

Fri Jun 19 17:14:30 CEST 2009

Thanks for the reply Bazsi.

I think we are running into multiple issues.  Here are my findings from
yesterday with regard to the config file on the archive box and the UDP
receive errors in the kernel.

Archive host:
TCP seemed to resolve the issue.  This was a poor test, though, because once
TCP was configured we lost our source_spoof function and everything dumped
to a single file.

I then tried dumping everything to a single file over UDP, and this resolved
the MPS discrepancy as well.  Are there limits on how many filters and
destinations you can have?  We have about 140 filters and the same number of
destinations.  We are breaking everything out by device; it's pretty
simplistic, but there are a lot of them.

UDP buffer problems:
I read yesterday that under heavy UDP syslog traffic the kernel's UDP
receive buffer can have problems.  I checked the relays and the archive box
for UDP receive errors, and sure enough we're seeing 20% UDP receive errors
on the relay and 60% on the archive.  I adjusted the relay box to an 8 MB
UDP receive buffer in the kernel, and this has dropped the UDP receive
errors to 8-10%.

There was still a question of how Xen in the virtualized environment might
play a role in these issues, and after a bit of research it looks like there
are logged bugs with regard to domUs and UDP.

My thought now is to go with Martin's suggestion: scrap the virtual
environment and use the whole box.  It just seems like such a waste; it's a
16-proc, 32 GB, 13 TB box left over from a project gone awry.  Lots of
horsepower for a single syslog server that can't thread.  Is there a way to
utilize this horsepower for syslog-ng?

And the hits just keep on coming...

On Thu, Jun 18, 2009 at 9:34 PM, Balazs Scheidler <bazsi at balabit.hu> wrote:

> On Wed, 2009-06-17 at 11:18 -0700, Aaron Robel wrote:
> > More tests...
> >
> > I ran tcpdump on both relay-01 and the archive box.  There were zero
> > discrepancies between the tcpdumps.  This tells me that the "virtual
> > network" is good.
> >
> > Here is the latest message discrepancy:
> > relay-01 - 3745 mps (this is according to the processed destination,
> > my archive box)
> > archive - 1900 mps (this is according to the processed source)
>
>
> The problem is that udp packets can be dropped even before they reach
> syslog-ng, thus syslog-ng will not know about them. And this can happen
> even if the packet is physically transmitted to the host correctly.
>
> And that's probably what happens if you see those numbers: the udp
> receive buffer fills up (in the kernel) before syslog-ng has a chance to
> read messages out.
>
> Try increasing the udp receive buffer (so_rcvbuf option), but do not
> forget to increase the maximum value the kernel allows. (This is OS
> dependent; on Linux it is /proc/sys/net/core/rmem_max.)
>
> Also, if you control the relay as well, just use tcp.
>
> >
> > Another question about syslog processing, does syslog-ng record
> > processed stats for the source based on what it wrote to the file
> > destinations? Or, is it simply on how many messages it receives on the
> > source? If it's simply how many messages it's received then all my
> > filters and destinations can be ruled out.  I was concerned that
> > having 150 filters and 150 destinations within the syslog_config might
> > hit a limitation.  What I've done is separated out every network
> > device to a separate file to make searches and our web front
> > end (phplogcon) perform better.
> >
> >
> > On Wed, Jun 17, 2009 at 10:24 AM, Aaron Robel <megawott at gmail.com>
> > wrote:
> >         So, I did a couple tests.
> >
> >         I started by watching realtime logs flow in on both the relay
> >         and archive.  This showed that sure enough we are not getting
> >         all our messages to the back end.
> >
> >         I then removed the following options:
> >         time_sleep(10);
> >         log_fetch_limit(250);
> >         log_fifo_size(2000);
> >
> >         flush_lines(2000);
> >         flush_timeout(200);
> >
> >         Then performed the test again.  The results were much better,
> >         but we are still missing about 1 out of every 6 to 8 messages.
> >         CPU, as expected, has also dramatically increased from 10% to
> >         60% utilization.
> >
> >         I thought my next step would be to compare tcpdumps on both
> >         boxes to rule out the network, then to progress onto more
> >         dramatic options.  Any other ideas on what may be happening
> >         are greatly appreciated.
> >
> >         Just when I thought this project was about to be wrapped up,
> >         it drags me back in...
> >
> >
> >         On Wed, Jun 17, 2009 at 10:05 AM, Martin Holste
> >         <mcholste at gmail.com> wrote:
> >                 I highly doubt that the UDP is being dropped on the
> >                 "network" (quoted since it's all in a VM), but you can
> >                 always check by running iptraf on the receiving
> >                 interfaces to get a ballpark figure of how many UDP
> >                 packets are coming in on 514.  To find out if
> >                 Syslog-NG is the bottleneck, try a test config that is
> >                 as simple as possible, e.g. configure with just one
> >                 source and one file destination and see what the stats
> >                 do then.  If possible, you could also try sending all
> >                 of the logs to a stock syslogd daemon (see a previous
> >                 thread about this) which is faster for simple file
> >                 writing operations.  The truth may be that a VM is not
> >                 a good environment for high-performance log
> >                 collection, and that turning all those VMs into one
> >                 physical host might outperform your VM cluster.  Please
> >                 keep me posted--I'm interested in how this plays out.
> >
> >                 --Martin
> >
> >
> >
> >                 On Wed, Jun 17, 2009 at 11:26 AM, Aaron Robel
> >                 <megawott at gmail.com> wrote:
> >                         You make a good point. I initially thought the
> >                         same thing and did some checking on the
> >                         bandwidth usage, and we aren't saturating any
> >                         of the links or even getting close.  I also
> >                         didn't see any errors or drops on the
> >                         interfaces.  The big question for me is how
> >                         this all plays out in the virtualized
> >                         environment: could I be running into a
> >                         limitation there? (A rhetorical question.)
> >                         All of these hosts live physically on the same
> >                         piece of hardware and on the same vlan.  I'll
> >                         keep poking around in that arena to see if
> >                         anything turns up. Maybe I'll play with tcp to
> >                         the archive host; I just worry about
> >                         performance implications.
> >
> >                         Do you see anything else in my options config
> >                         that looks amiss?
> >
> >                         Thanks for the suggestion Joe.
> >
> >                         Hardware stats:
> >                         relays:
> >                         2 3gig procs
> >                         4 gig mem
> >                         1 TB disk
> >
> >                         archive
> >                         4 3 gig procs
> >                         6 gig mem
> >                         5.5 TB disk
> >
> >                         Network bandwidth stats:
> >                         relay 01:  in-850KBps out-300KBps (I'm
> >                         assuming the discrepancy here is due to the
> >                         fifo and flush settings.)
> >                         relay 02:  in-60KBps out-55KBps
> >                         relay 03:  in-nil out-nil
> >
> >                         Archive:
> >                         network utilization: 600KBps
> >
> >
> >                         On Wed, Jun 17, 2009 at 8:58 AM, Fegan, Joe
> >                         <Joe.Fegan at hp.com> wrote:
> >
> >
> >                                 Knee jerk reaction: are you using udp?
> >                                 You probably know that udp is a
> >                                 connection-less,
> >                                 fire-and-forget protocol so if the
> >                                 packet gets lost neither the sender
> >                                 nor the intended recipient will know
> >                                 (or care).
> >
> >
> >                                 ______________________________________
> >                                 From:
> >                                 syslog-ng-bounces at lists.balabit.hu
> >                                 [mailto:syslog-ng-bounces at lists.balabit.hu]
> >                                 On Behalf Of Aaron Robel
> >                                 Sent: 17 June 2009 16:20
> >                                 To: syslog-ng at lists.balabit.hu
> >                                 Subject: [syslog-ng] Syslog-ng 3.0.2
> >                                 statistics
> >
> >
> >
> >
> >                                 Hello,
> >
> >                                 My apologies in advance, this is my
> >                                 first posting and I'm quite the rook'
> >                                 when it comes to Linux and Syslog-ng.
> >                                 I keep wondering why this is my
> >                                 project.
> >
> >                                 I have a 4 server syslog deployment
> >                                 with 3 front end "relay" boxes and 1
> >                                 backend archive box all within a
> >                                 virtualized SLES environment.
> >
> >                                 Recently I noticed that the relays
> >                                 together are averaging about 2500
> >                                 messages per second (mps).  The
> >                                 majority of the messages are coming
> >                                 from a single relay, about 2000 mps.
> >                                 Yet the archive box is only averaging
> >                                 about 400 mps.
> >
> >                                 Since we are running 3.0.2 I decided
> >                                 to turn up the stats_level to (1).  I
> >                                 don't see any drops on the roughly
> >                                 150 file destinations that I've built.
> >
> >                                 What does stamp, processed, stored,
> >                                 etc.. mean?  I couldn't find any
> >                                 detailed documentation about the
> >                                 different statistics.
> >
> >                                 Why am I getting such a large
> >                                 discrepancy between "stamp" and
> >                                 "processed" in the log stats?
> >
> >                                 Finally, since I'm sending this
> >                                 email anyway, does anyone see an
> >                                 issue with the way I've got the flow
> >                                 control set up in the global options?
> >
> >                                 Here are my stats in question off my
> >                                 archive box:
> >
> >                                 processed='src.udp(s_network#0)=22020892',
> >                                 stamp='src.udp(s_network#0)=1245249328'
> >
> >                                 Here are the globals from the archive
> >                                 box:
> >                                 options {
> >                                         time_sleep(10);
> >                                         log_fetch_limit(250);
> >                                         log_fifo_size(2000);
> >                                         use_dns(no);
> >                                         keep_timestamp(yes);
> >                                         dns_cache(no);
> >                                         long_hostnames(off);
> >                                         flush_lines(2000);
> >                                         flush_timeout(200);
> >                                         perm(0644);
> >                                         stats_freq(1800);
> >                                         stats_level(1);
> >                                         time_reopen(10);
> >                                         create_dirs(yes);
> >                                         dir_perm(0755);
> >                                 };
> >
> >
> >                                 Thanks!
> >
> >
> >
> >
> >                                 ______________________________________________________________________________
> >                                 Member info:
> >                                 https://lists.balabit.hu/mailman/listinfo/syslog-ng
> >                                 Documentation:
> >                                 http://www.balabit.com/support/documentation/?product=syslog-ng
> >                                 FAQ:
> >                                 http://www.campin.net/syslog-ng/faq.html
> >
> >
> >
> >
> >
> >
> >                         --
> >                         Aaron Robel
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >         --
> >         Aaron Robel
> >
> >
> >
> >
> >
> > --
> > Aaron Robel
> >
> >
> --
> Bazsi
>
>
>
>
>

-- 
Aaron Robel