[syslog-ng] Solaris 10 UDP overflows, message drops

Mon Apr 18 17:57:41 CEST 2011

EXCELLENT debug and measurement work.

Based on this evidence and testing I vote in favor of a UDP issue a la Fred.

Matthew.

On Mon, Apr 18, 2011 at 11:43:50AM -0400, Mishou Michael wrote:
> Fred,
>  
> Great find!  For those following on the TV at home, here is the link to
> the patch notes that I found:
>  
> https://getupdates.oracle.com/readme/144488-11
>  
> Which contains this tantalizing tidbit:
>  
> 6638967 UDP recv (think DNS) suffers from thundering herd problem
> (bug report for above: http://bit.ly/eD57KB+ )
>  
> I'm going to install this patch and see what comes of it.  That
> certainly seems like it could be related.
> 
> Martin,
> 
> I checked based on Matthew's suggestion of the ipInDiscards counter
> incrementing, and it's not, so no dice with the checksum errors, good
> call though!
>  
> Matthew, 
>  
> Following your suggestions I set up all of the network-based
> destinations to be /dev/null and the problem persists, shows no change
> in terms of how fast the buffers fill up or how many overflows are
> generated per second at all.  As for loggen, keep in mind that I was
> using that before to write to /dev/null and also directly to disk and
> topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
> on the right track with some OS issues?
>  
> I used Google to find scripts to run for Dtrace to track syscalls, and
> one called procsystime stood out
> (http://www.brendangregg.com/DTrace/procsystime)  Here's a sample of the
> output over about 20 seconds each time, while the overflows are
> happening.  It's hard for me to tell if the write() syscall is really
> freaking slow here or if this is as it should be and not weird.  Either
> way, write() doesn't seem to be the issue if you look at the /dev/null
> output for which UDP overflows still occur.
> 
> This first one is the output for the simple config I posted earlier,
> writing to disk:
>  
> # /root/procsystime -aTn syslog-ng
> Hit Ctrl-C to stop sampling...
> ^C
>  
> Elapsed Times for processes syslog-ng,
>  
>          SYSCALL          TIME (ns)
>           getpid             193600
>           fchmod             443000
>             bind             757700
>            fcntl             786800
>       setsockopt            1355700
>           fchown            1446200
>          connect            1542000
>        so_socket            2626100
>            close            4382500
>           open64            4691400
>           stat64            4957400
>              brk           23978700
>          pollsys          113785800
>           llseek          158217200
>            gtime          244735300
>         recvfrom          404501900
>            write        11343815700
>           TOTAL:        12312217000
>  
> CPU Times for processes syslog-ng,
>  
>          SYSCALL          TIME (ns)
>           getpid              42400
>           fchmod             286700
>            fcntl             341800
>             bind             691000
>           fchown            1163100
>       setsockopt            1230300
>          connect            1468100
>        so_socket            2543900
>           open64            4047800
>            close            4075300
>           stat64            4626600
>              brk           17442100
>            gtime           31685200
>           llseek           88638300
>          pollsys           92742600
>         recvfrom          332658600
>            write         3376428300
>           TOTAL:         3960112100
>  
> Syscall Counts for processes syslog-ng,
>  
>          SYSCALL              COUNT
>             bind                 19
>          connect                 19
>        so_socket                 19
>       setsockopt                 38
>           fchmod                 48
>           open64                 48
>           getpid                 58
>            close                 89
>           fchown                 96
>           stat64                 96
>            fcntl                172
>              brk               2342
>          pollsys               4741
>         recvfrom              22003
>           llseek              22701
>            write              51022
>            gtime              85661
>           TOTAL:             189172
> 
> And this second sample is for writing directly to /dev/null, you'll
> notice write() is taking a lot less time but that totally makes sense
> since disks are no longer being used (the overflows are still happening
> just as aggressively however):
> 
> # /root/procsystime -aTn syslog-ng
> Hit Ctrl-C to stop sampling...
> ^C
> 
> Elapsed Times for processes syslog-ng,
> 
>          SYSCALL          TIME (ns)
>            chmod              24400
>              brk              28900
>           fchmod              45200
>            fcntl              49000
>            chown              70900
>            close             154800
>           fchown             165800
>            mkdir             297700
>           open64             797000
>           stat64             846900
>          pollsys           14180100
>           llseek           25238500
>            write           54238800
>            gtime           59281800
>         recvfrom           87493900
>        nanosleep        17615931600
>           TOTAL:        17858845300
> 
> CPU Times for processes syslog-ng,
> 
>          SYSCALL          TIME (ns)
>            fcntl              20600
>            chmod              20700
>              brk              22400
>           fchmod              29400
>            chown              62400
>           fchown             134100
>            close             137000
>            mkdir             292500
>           open64             774500
>           stat64             799900
>          pollsys           10720500
>           llseek           12170100
>            gtime           13406200
>            write           18375500
>        nanosleep           30881900
>         recvfrom           72251100
>           TOTAL:          160098800
> 
> Syscall Counts for processes syslog-ng,
> 
>          SYSCALL              COUNT
>            chmod                  1
>            mkdir                  1
>              brk                  2
>            chown                  2
>           fchmod                  5
>           open64                  5
>            close                  6
>           fchown                 10
>            fcntl                 10
>           stat64                 14
>          pollsys                949
>        nanosleep                950
>           llseek               4774
>            write               4774
>         recvfrom               4776
>            gtime              17891
>           TOTAL:              34170
> 
> I will let you all know what happens post-patch.  And if there are any
> suggestions for other things to try with dtrace, they are welcome, as
> I'm a novice with it and Solaris administration in general.
> 
> Thanks for all the excellent help so far!
> 
> --Mike
> 
> 
> ________________________________
> 
> From: syslog-ng-bounces at lists.balabit.hu
> [mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
> Sent: Friday, April 15, 2011 9:28 PM
> To: Syslog-ng users' and developers' mailing list
> Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops
> 
> 
> I am experiencing the same problem with Sun V490 except the server has
> about 16gb memory. We are using UDP and losing about 85% of the traffic.
> The udpinoverflows is darn near equal to the total number of packets
> coming in. I am not at work now so cannot provide accurate statistics at
> this time. The NIC statistics are perfect, we aren't getting any errors
> with regards to the UDP area etc.
> 
> There is a kernel patch that came out about a week or two ago that deals
> in this area, but I have not yet applied it. I want to apply the patch
> first before adjusting other kernel parameters. We have Solaris 10,
> update 9. Version of syslog-ng is 3.1.2. It is really terrible. 
> 
> By terrible, I mean the packet loss, not the product:)) It is probably
> something I don't have set up correctly.
> 
> Mike, check out that latest patch, it can't hurt. I had to open a case
> with Sun to find out about it:))
> 
> 
> 
> 
> On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
> wrote:
> 
> 
> 	On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
> 	> I left out the resources I have to work with on this system,
> and how
> 	> bad/good things are with syslog-ng running (and dropping),
> I'll include
> 	> those now.  As you can see, it's an older server, but it has a
> ton of
> 	> RAM and the CPUs should have enough pop for this I think.
> 	
> 	
> 	> I'm just not sure what to do next to troubleshoot.  I'm hoping
> someone
> 	> here can point me in the right direction, or at least confirm
> that they
> 	> are running syslog-ng in a similar configuration without drops
> so I know
> 	> that it's at least possible?
> 	>
> 	> Regards,
> 	>
> 	> --Mike
> 	
> 	
> 	I think the next suspect would be the disks. Can you disable
> anything that writes to disk or tell it to write to /dev/null and see if
> it still blows up?
> 	
> 	Also, it's Solaris, so you could start using some of the dtrace
> scripts to look for what syscalls / other ops are running too slow, and
> when it gets stuck what type of socket / disk file / what IO is it
> doing?
> 	
> 	Matthew.
> 	
> 	
> ________________________________________________________________________
> ______
> 	Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> 	Documentation:
> http://www.balabit.com/support/documentation/?product=syslog-ng
> 	FAQ: http://www.campin.net/syslog-ng/faq.html
> 	
> 	
> 
> 
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>