[syslog-ng] Solaris 10 UDP overflows, message drops

Mon Apr 18 17:43:50 CEST 2011

Fred,

Great find!  For those following on the TV at home, here is the link to
the patch notes that I found:

https://getupdates.oracle.com/readme/144488-11

Which contains this tantalizing tidbit:

6638967 UDP recv (think DNS) suffers from thundering herd problem
(bug report for above: http://bit.ly/eD57KB+ )

I'm going to install this patch and see what comes of it.  That
certainly seems like it could be related.

Martin,

I checked based on Matthew's suggestion of the ipInDiscards counter
incrementing, and it's not, so no dice with the checksum errors, good
call though!

Matthew, 

Following your suggestions I set up all of the network-based
destinations to be /dev/null and the problem persists, shows no change
in terms of how fast the buffers fill up or how many overflows are
generated per second at all.  As for loggen, keep in mind that I was
using that before to write to /dev/null and also directly to disk and
topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
on the right track with some OS issues?

I used Google to find scripts to run for Dtrace to track syscalls, and
one called procsystime stood out
(http://www.brendangregg.com/DTrace/procsystime)  Here's a sample of the
output over about 20 seconds each time, while the overflows are
happening.  It's hard for me to tell if the write() syscall is really
freaking slow here or if this is as it should be and not weird.  Either
way, write() doesn't seem to be the issue if you look at the /dev/null
output for which UDP overflows still occur.

This first one is the output for the simple config I posted earlier,
writing to disk:

# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes syslog-ng,

         SYSCALL          TIME (ns)
          getpid             193600
          fchmod             443000
            bind             757700
           fcntl             786800
      setsockopt            1355700
          fchown            1446200
         connect            1542000
       so_socket            2626100
           close            4382500
          open64            4691400
          stat64            4957400
             brk           23978700
         pollsys          113785800
          llseek          158217200
           gtime          244735300
        recvfrom          404501900
           write        11343815700
          TOTAL:        12312217000

CPU Times for processes syslog-ng,

         SYSCALL          TIME (ns)
          getpid              42400
          fchmod             286700
           fcntl             341800
            bind             691000
          fchown            1163100
      setsockopt            1230300
         connect            1468100
       so_socket            2543900
          open64            4047800
           close            4075300
          stat64            4626600
             brk           17442100
           gtime           31685200
          llseek           88638300
         pollsys           92742600
        recvfrom          332658600
           write         3376428300
          TOTAL:         3960112100

Syscall Counts for processes syslog-ng,

         SYSCALL              COUNT
            bind                 19
         connect                 19
       so_socket                 19
      setsockopt                 38
          fchmod                 48
          open64                 48
          getpid                 58
           close                 89
          fchown                 96
          stat64                 96
           fcntl                172
             brk               2342
         pollsys               4741
        recvfrom              22003
          llseek              22701
           write              51022
           gtime              85661
          TOTAL:             189172

And this second sample is for writing directly to /dev/null, you'll
notice write() is taking a lot less time but that totally makes sense
since disks are no longer being used (the overflows are still happening
just as aggressively however):

# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes syslog-ng,

         SYSCALL          TIME (ns)
           chmod              24400
             brk              28900
          fchmod              45200
           fcntl              49000
           chown              70900
           close             154800
          fchown             165800
           mkdir             297700
          open64             797000
          stat64             846900
         pollsys           14180100
          llseek           25238500
           write           54238800
           gtime           59281800
        recvfrom           87493900
       nanosleep        17615931600
          TOTAL:        17858845300

CPU Times for processes syslog-ng,

         SYSCALL          TIME (ns)
           fcntl              20600
           chmod              20700
             brk              22400
          fchmod              29400
           chown              62400
          fchown             134100
           close             137000
           mkdir             292500
          open64             774500
          stat64             799900
         pollsys           10720500
          llseek           12170100
           gtime           13406200
           write           18375500
       nanosleep           30881900
        recvfrom           72251100
          TOTAL:          160098800

Syscall Counts for processes syslog-ng,

         SYSCALL              COUNT
           chmod                  1
           mkdir                  1
             brk                  2
           chown                  2
          fchmod                  5
          open64                  5
           close                  6
          fchown                 10
           fcntl                 10
          stat64                 14
         pollsys                949
       nanosleep                950
          llseek               4774
           write               4774
        recvfrom               4776
           gtime              17891
          TOTAL:              34170

I will let you all know what happens post-patch.  And if there are any
suggestions for other things to try with dtrace, they are welcome, as
I'm a novice with it and Solaris administration in general.

Thanks for all the excellent help so far!

--Mike

________________________________

From: syslog-ng-bounces at lists.balabit.hu
[mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
Sent: Friday, April 15, 2011 9:28 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops

I am experiencing the same problem with Sun V490 except the server has
about 16gb memory. We are using UDP and losing about 85% of the traffic.
The udpinoverflows is darn near equal to the total number of packets
coming in. I am not at work now so cannot provide accurate statistics at
this time. The NIC statistics are perfect, we aren't getting any errors
with regards to the UDP area etc.

There is a kernel patch that came out about a week or two ago that deals
in this area, but I have not yet applied it. I want to apply the patch
first before adjusting other kernel parameters. We have Solaris 10,
update 9. Version of syslog-ng is 3.1.2. It is really terrible. 

By terrible, I mean the packet loss, not the product:)) It is probably
something I don't have set up correctly.

Mike, check out that latest patch, it can't hurt. I had to open a case
with Sun to find out about it:))

On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
wrote:

	On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
	> I left out the resources I have to work with on this system,
and how
	> bad/good things are with syslog-ng running (and dropping),
I'll include
	> those now.  As you can see, it's an older server, but it has a
ton of
	> RAM and the CPUs should have enough pop for this I think.

	> I'm just not sure what to do next to troubleshoot.  I'm hoping
someone
	> here can point me in the right direction, or at least confirm
that they
	> are running syslog-ng in a similar configuration without drops
so I know
	> that it's at least possible?
	>
	> Regards,
	>
	> --Mike

	I think the next suspect would be the disks. Can you disable
anything that writes to disk or tell it to write to /dev/null and see if
it still blows up?

	Also, it's Solaris, so you could start using some of the dtrace
scripts to look for what syscalls / other ops are running too slow, and
when it gets stuck what type of socket / disk file / what IO is it
doing?

	Matthew.

________________________________________________________________________
______
	Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
	Documentation:
http://www.balabit.com/support/documentation/?product=syslog-ng
	FAQ: http://www.campin.net/syslog-ng/faq.html