[syslog-ng] Solaris 10 UDP overflows, message drops

Mon Apr 18 18:01:07 CEST 2011

However. Reading the patch description seems to indicate it is actually thread related.

And syslog ng before 3.3 is single thread or two thread with the second thread doing DB writes.

On Mon, Apr 18, 2011 at 08:57:41AM -0700, Matthew Hall wrote:
> EXCELLENT debug and measurement work.
> 
> Based on this evidence and testing I vote in favor of a UDP issue a la Fred.
> 
> Matthew.
> 
> On Mon, Apr 18, 2011 at 11:43:50AM -0400, Mishou Michael wrote:
> > Fred,
> >  
> > Great find!  For those following on the TV at home, here is the link to
> > the patch notes that I found:
> >  
> > https://getupdates.oracle.com/readme/144488-11
> >  
> > Which contains this tantalizing tidbit:
> >  
> > 6638967 UDP recv (think DNS) suffers from thundering herd problem
> > (bug report for above: http://bit.ly/eD57KB+ )
> >  
> > I'm going to install this patch and see what comes of it.  That
> > certainly seems like it could be related.
> > 
> > Martin,
> > 
> > I checked based on Matthew's suggestion of the ipInDiscards counter
> > incrementing, and it's not, so no dice with the checksum errors, good
> > call though!
> >  
> > Matthew, 
> >  
> > Following your suggestions I set up all of the network-based
> > destinations to be /dev/null and the problem persists, shows no change
> > in terms of how fast the buffers fill up or how many overflows are
> > generated per second at all.  As for loggen, keep in mind that I was
> > using that before to write to /dev/null and also directly to disk and
> > topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
> > on the right track with some OS issues?
> >  
> > I used Google to find scripts to run for Dtrace to track syscalls, and
> > one called procsystime stood out
> > (http://www.brendangregg.com/DTrace/procsystime)  Here's a sample of the
> > output over about 20 seconds each time, while the overflows are
> > happening.  It's hard for me to tell if the write() syscall is really
> > freaking slow here or if this is as it should be and not weird.  Either
> > way, write() doesn't seem to be the issue if you look at the /dev/null
> > output for which UDP overflows still occur.
> > 
> > This first one is the output for the simple config I posted earlier,
> > writing to disk:
> >  
> > # /root/procsystime -aTn syslog-ng
> > Hit Ctrl-C to stop sampling...
> > ^C
> >  
> > Elapsed Times for processes syslog-ng,
> >  
> >          SYSCALL          TIME (ns)
> >           getpid             193600
> >           fchmod             443000
> >             bind             757700
> >            fcntl             786800
> >       setsockopt            1355700
> >           fchown            1446200
> >          connect            1542000
> >        so_socket            2626100
> >            close            4382500
> >           open64            4691400
> >           stat64            4957400
> >              brk           23978700
> >          pollsys          113785800
> >           llseek          158217200
> >            gtime          244735300
> >         recvfrom          404501900
> >            write        11343815700
> >           TOTAL:        12312217000
> >  
> > CPU Times for processes syslog-ng,
> >  
> >          SYSCALL          TIME (ns)
> >           getpid              42400
> >           fchmod             286700
> >            fcntl             341800
> >             bind             691000
> >           fchown            1163100
> >       setsockopt            1230300
> >          connect            1468100
> >        so_socket            2543900
> >           open64            4047800
> >            close            4075300
> >           stat64            4626600
> >              brk           17442100
> >            gtime           31685200
> >           llseek           88638300
> >          pollsys           92742600
> >         recvfrom          332658600
> >            write         3376428300
> >           TOTAL:         3960112100
> >  
> > Syscall Counts for processes syslog-ng,
> >  
> >          SYSCALL              COUNT
> >             bind                 19
> >          connect                 19
> >        so_socket                 19
> >       setsockopt                 38
> >           fchmod                 48
> >           open64                 48
> >           getpid                 58
> >            close                 89
> >           fchown                 96
> >           stat64                 96
> >            fcntl                172
> >              brk               2342
> >          pollsys               4741
> >         recvfrom              22003
> >           llseek              22701
> >            write              51022
> >            gtime              85661
> >           TOTAL:             189172
> > 
> > And this second sample is for writing directly to /dev/null, you'll
> > notice write() is taking a lot less time but that totally makes sense
> > since disks are no longer being used (the overflows are still happening
> > just as aggressively however):
> > 
> > # /root/procsystime -aTn syslog-ng
> > Hit Ctrl-C to stop sampling...
> > ^C
> > 
> > Elapsed Times for processes syslog-ng,
> > 
> >          SYSCALL          TIME (ns)
> >            chmod              24400
> >              brk              28900
> >           fchmod              45200
> >            fcntl              49000
> >            chown              70900
> >            close             154800
> >           fchown             165800
> >            mkdir             297700
> >           open64             797000
> >           stat64             846900
> >          pollsys           14180100
> >           llseek           25238500
> >            write           54238800
> >            gtime           59281800
> >         recvfrom           87493900
> >        nanosleep        17615931600
> >           TOTAL:        17858845300
> > 
> > CPU Times for processes syslog-ng,
> > 
> >          SYSCALL          TIME (ns)
> >            fcntl              20600
> >            chmod              20700
> >              brk              22400
> >           fchmod              29400
> >            chown              62400
> >           fchown             134100
> >            close             137000
> >            mkdir             292500
> >           open64             774500
> >           stat64             799900
> >          pollsys           10720500
> >           llseek           12170100
> >            gtime           13406200
> >            write           18375500
> >        nanosleep           30881900
> >         recvfrom           72251100
> >           TOTAL:          160098800
> > 
> > Syscall Counts for processes syslog-ng,
> > 
> >          SYSCALL              COUNT
> >            chmod                  1
> >            mkdir                  1
> >              brk                  2
> >            chown                  2
> >           fchmod                  5
> >           open64                  5
> >            close                  6
> >           fchown                 10
> >            fcntl                 10
> >           stat64                 14
> >          pollsys                949
> >        nanosleep                950
> >           llseek               4774
> >            write               4774
> >         recvfrom               4776
> >            gtime              17891
> >           TOTAL:              34170
> > 
> > I will let you all know what happens post-patch.  And if there are any
> > suggestions for other things to try with dtrace, they are welcome, as
> > I'm a novice with it and Solaris administration in general.
> > 
> > Thanks for all the excellent help so far!
> > 
> > --Mike
> > 
> > 
> > ________________________________
> > 
> > From: syslog-ng-bounces at lists.balabit.hu
> > [mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
> > Sent: Friday, April 15, 2011 9:28 PM
> > To: Syslog-ng users' and developers' mailing list
> > Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops
> > 
> > 
> > I am experiencing the same problem with Sun V490 except the server has
> > about 16gb memory. We are using UDP and losing about 85% of the traffic.
> > The udpinoverflows is darn near equal to the total number of packets
> > coming in. I am not at work now so cannot provide accurate statistics at
> > this time. The NIC statistics are perfect, we aren't getting any errors
> > with regards to the UDP area etc.
> > 
> > There is a kernel patch that came out about a week or two ago that deals
> > in this area, but I have not yet applied it. I want to apply the patch
> > first before adjusting other kernel parameters. We have Solaris 10,
> > update 9. Version of syslog-ng is 3.1.2. It is really terrible. 
> > 
> > By terrible, I mean the packet loss, not the product:)) It is probably
> > something I don't have set up correctly.
> > 
> > Mike, check out that latest patch, it can't hurt. I had to open a case
> > with Sun to find out about it:))
> > 
> > 
> > 
> > 
> > On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
> > wrote:
> > 
> > 
> > 	On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
> > 	> I left out the resources I have to work with on this system,
> > and how
> > 	> bad/good things are with syslog-ng running (and dropping),
> > I'll include
> > 	> those now.  As you can see, it's an older server, but it has a
> > ton of
> > 	> RAM and the CPUs should have enough pop for this I think.
> > 	
> > 	
> > 	> I'm just not sure what to do next to troubleshoot.  I'm hoping
> > someone
> > 	> here can point me in the right direction, or at least confirm
> > that they
> > 	> are running syslog-ng in a similar configuration without drops
> > so I know
> > 	> that it's at least possible?
> > 	>
> > 	> Regards,
> > 	>
> > 	> --Mike
> > 	
> > 	
> > 	I think the next suspect would be the disks. Can you disable
> > anything that writes to disk or tell it to write to /dev/null and see if
> > it still blows up?
> > 	
> > 	Also, it's Solaris, so you could start using some of the dtrace
> > scripts to look for what syscalls / other ops are running too slow, and
> > when it gets stuck what type of socket / disk file / what IO is it
> > doing?
> > 	
> > 	Matthew.
> > 	
> > 	
> > ________________________________________________________________________
> > ______
> > 	Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> > 	Documentation:
> > http://www.balabit.com/support/documentation/?product=syslog-ng
> > 	FAQ: http://www.campin.net/syslog-ng/faq.html
> > 	
> > 	
> > 
> > 
> > ______________________________________________________________________________
> > Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> > Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> > FAQ: http://www.campin.net/syslog-ng/faq.html
> > 
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>