[syslog-ng] Solaris 10 UDP overflows, message drops

Mishou Michael Michael.Mishou at csirc.irs.gov
Tue Apr 26 18:05:01 CEST 2011


For those following this thread, I have applied the "thundering herd"
UDP patch and experienced no change in the drops experienced by
syslog-ng 3.1.2.  Sorry I took so long to respond, the patching was a
much more time-involved process than I thought it would be.

At this point, based on Michael Hocke's response, I'm thinking that
perhaps there is just too much UDP traffic for single-threaded syslog-ng
to deal with in light of what filtering and parsing it does up front
(for macro usage). 

I'm going to experiment with syslog-ng and the loggen tool to find a
point at which a single syslog-ng instance starts dropping inbound UDP
traffic with a simple configuration writing to disk.  Once I have that
number, I have a few options:

1.  Experiment with syslog-ng 3.3 and the new threaded code to see if I
have performance gains.  I'm hesitant to push Alpha code in production,
if anyone has any experience with 3.3 in semi-production environment
running consistently I'd love to hear it.

2.  So I don't have to change the configuration on a lot of clients, use
PF to rewrite incoming UDP messages from specific, busy clients to other
syslog-ng listeners, configured exactly as my main instance (which will
handle all the non-insanely-busy clients).  I could run multiple
listeners in this manner, and not need threading to take advantage of
multiple processors, though obviously each process would still be
limited to the magic number determined above.  I have 10 or so really
busy clients, so this is one solution I'm leaning towards if syslog-ng
3.1.2 can handle just one of them.

3.  Give up on syslog-ng until 3.3, or move to some other solution.  Not
sure what I could do here, rsyslog is the other major contender I guess,
not sure what gains I would get.  Could also do native syslog server and
post-process to different buckets/relay which is what we mainly use
syslog-ng for.

4.  Get a faster box (not likely to happen).

If anyone has any thoughts on any of the above I'd love to hear them.
Also, if this is unique to Solaris SPARC systems (similarly spec'd x86
Solaris systems having none of these limitations) I'd love to know that
as well.  Is there any way anyone knows to figure out at what point the
SPARC is hitting a ceiling?  The CPU is not pegged, so why would we be
experiencing CPU-based drops?  Maybe the code is not efficient for how
SPARC does things, or how some syscall is implemented on Solaris?

--Mike

--------------------
Mike Mishou - IRS CSIRC
(202) 283-2189 -- Desk
(202) 384-7817 -- Mobile
(202) 283-4809 -- 24x7 Hotline

-----Original Message-----
From: syslog-ng-bounces at lists.balabit.hu
[mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Mishou Michael
Sent: Monday, April 18, 2011 11:44 AM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops

Fred,
 
Great find!  For those following on the TV at home, here is the link to
the patch notes that I found:
 
https://getupdates.oracle.com/readme/144488-11
 
Which contains this tantalizing tidbit:
 
6638967 UDP recv (think DNS) suffers from thundering herd problem
(bug report for above: http://bit.ly/eD57KB+ )
 
I'm going to install this patch and see what comes of it.  That
certainly seems like it could be related.

Martin,

I checked based on Matthew's suggestion of the ipInDiscards counter
incrementing, and it's not, so no dice with the checksum errors, good
call though!
 
Matthew, 
 
Following your suggestions I set up all of the network-based
destinations to be /dev/null and the problem persists, shows no change
in terms of how fast the buffers fill up or how many overflows are
generated per second at all.  As for loggen, keep in mind that I was
using that before to write to /dev/null and also directly to disk and
topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
on the right track with some OS issues?
 
I used Google to find scripts to run for Dtrace to track syscalls, and
one called procsystime stood out
(http://www.brendangregg.com/DTrace/procsystime)  Here's a sample of the
output over about 20 seconds each time, while the overflows are
happening.  It's hard for me to tell if the write() syscall is really
freaking slow here or if this is as it should be and not weird.  Either
way, write() doesn't seem to be the issue if you look at the /dev/null
output for which UDP overflows still occur.

This first one is the output for the simple config I posted earlier,
writing to disk:
 
# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C
 
Elapsed Times for processes syslog-ng,
 
         SYSCALL          TIME (ns)
          getpid             193600
          fchmod             443000
            bind             757700
           fcntl             786800
      setsockopt            1355700
          fchown            1446200
         connect            1542000
       so_socket            2626100
           close            4382500
          open64            4691400
          stat64            4957400
             brk           23978700
         pollsys          113785800
          llseek          158217200
           gtime          244735300
        recvfrom          404501900
           write        11343815700
          TOTAL:        12312217000
 
CPU Times for processes syslog-ng,
 
         SYSCALL          TIME (ns)
          getpid              42400
          fchmod             286700
           fcntl             341800
            bind             691000
          fchown            1163100
      setsockopt            1230300
         connect            1468100
       so_socket            2543900
          open64            4047800
           close            4075300
          stat64            4626600
             brk           17442100
           gtime           31685200
          llseek           88638300
         pollsys           92742600
        recvfrom          332658600
           write         3376428300
          TOTAL:         3960112100
 
Syscall Counts for processes syslog-ng,
 
         SYSCALL              COUNT
            bind                 19
         connect                 19
       so_socket                 19
      setsockopt                 38
          fchmod                 48
          open64                 48
          getpid                 58
           close                 89
          fchown                 96
          stat64                 96
           fcntl                172
             brk               2342
         pollsys               4741
        recvfrom              22003
          llseek              22701
           write              51022
           gtime              85661
          TOTAL:             189172

And this second sample is for writing directly to /dev/null, you'll
notice write() is taking a lot less time but that totally makes sense
since disks are no longer being used (the overflows are still happening
just as aggressively however):

# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes syslog-ng,

         SYSCALL          TIME (ns)
           chmod              24400
             brk              28900
          fchmod              45200
           fcntl              49000
           chown              70900
           close             154800
          fchown             165800
           mkdir             297700
          open64             797000
          stat64             846900
         pollsys           14180100
          llseek           25238500
           write           54238800
           gtime           59281800
        recvfrom           87493900
       nanosleep        17615931600
          TOTAL:        17858845300

CPU Times for processes syslog-ng,

         SYSCALL          TIME (ns)
           fcntl              20600
           chmod              20700
             brk              22400
          fchmod              29400
           chown              62400
          fchown             134100
           close             137000
           mkdir             292500
          open64             774500
          stat64             799900
         pollsys           10720500
          llseek           12170100
           gtime           13406200
           write           18375500
       nanosleep           30881900
        recvfrom           72251100
          TOTAL:          160098800

Syscall Counts for processes syslog-ng,

         SYSCALL              COUNT
           chmod                  1
           mkdir                  1
             brk                  2
           chown                  2
          fchmod                  5
          open64                  5
           close                  6
          fchown                 10
           fcntl                 10
          stat64                 14
         pollsys                949
       nanosleep                950
          llseek               4774
           write               4774
        recvfrom               4776
           gtime              17891
          TOTAL:              34170

I will let you all know what happens post-patch.  And if there are any
suggestions for other things to try with dtrace, they are welcome, as
I'm a novice with it and Solaris administration in general.

Thanks for all the excellent help so far!

--Mike


________________________________

From: syslog-ng-bounces at lists.balabit.hu
[mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
Sent: Friday, April 15, 2011 9:28 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops


I am experiencing the same problem with Sun V490 except the server has
about 16gb memory. We are using UDP and losing about 85% of the traffic.
The udpinoverflows is darn near equal to the total number of packets
coming in. I am not at work now so cannot provide accurate statistics at
this time. The NIC statistics are perfect, we aren't getting any errors
with regards to the UDP area etc.

There is a kernel patch that came out about a week or two ago that deals
in this area, but I have not yet applied it. I want to apply the patch
first before adjusting other kernel parameters. We have Solaris 10,
update 9. Version of syslog-ng is 3.1.2. It is really terrible. 

By terrible, I mean the packet loss, not the product:)) It is probably
something I don't have set up correctly.

Mike, check out that latest patch, it can't hurt. I had to open a case
with Sun to find out about it:))




On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
wrote:


	On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
	> I left out the resources I have to work with on this system,
and how
	> bad/good things are with syslog-ng running (and dropping),
I'll include
	> those now.  As you can see, it's an older server, but it has a
ton of
	> RAM and the CPUs should have enough pop for this I think.
	
	
	> I'm just not sure what to do next to troubleshoot.  I'm hoping
someone
	> here can point me in the right direction, or at least confirm
that they
	> are running syslog-ng in a similar configuration without drops
so I know
	> that it's at least possible?
	>
	> Regards,
	>
	> --Mike
	
	
	I think the next suspect would be the disks. Can you disable
anything that writes to disk or tell it to write to /dev/null and see if
it still blows up?
	
	Also, it's Solaris, so you could start using some of the dtrace
scripts to look for what syscalls / other ops are running too slow, and
when it gets stuck what type of socket / disk file / what IO is it
doing?
	
	Matthew.
	
	
________________________________________________________________________
______
	Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
	Documentation:
http://www.balabit.com/support/documentation/?product=syslog-ng
	FAQ: http://www.campin.net/syslog-ng/faq.html
	
	


________________________________________________________________________
______
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation:
http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.campin.net/syslog-ng/faq.html



More information about the syslog-ng mailing list