[syslog-ng] Solaris 10 UDP overflows, message drops
Matthew Hall
mhall at mhcomputing.net
Mon Apr 18 18:01:07 CEST 2011
However. Reading the patch description seems to indicate it is actually thread related.
And syslog ng before 3.3 is single thread or two thread with the second thread doing DB writes.
On Mon, Apr 18, 2011 at 08:57:41AM -0700, Matthew Hall wrote:
> EXCELLENT debug and measurement work.
>
> Based on this evidence and testing I vote in favor of a UDP issue a la Fred.
>
> Matthew.
>
> On Mon, Apr 18, 2011 at 11:43:50AM -0400, Mishou Michael wrote:
> > Fred,
> >
> > Great find! For those following on the TV at home, here is the link to
> > the patch notes that I found:
> >
> > https://getupdates.oracle.com/readme/144488-11
> >
> > Which contains this tantalizing tidbit:
> >
> > 6638967 UDP recv (think DNS) suffers from thundering herd problem
> > (bug report for above: http://bit.ly/eD57KB+ )
> >
> > I'm going to install this patch and see what comes of it. That
> > certainly seems like it could be related.
> >
> > Martin,
> >
> > I checked based on Matthew's suggestion of the ipInDiscards counter
> > incrementing, and it's not, so no dice with the checksum errors, good
> > call though!
> >
> > Matthew,
> >
> > Following your suggestions I set up all of the network-based
> > destinations to be /dev/null and the problem persists, shows no change
> > in terms of how fast the buffers fill up or how many overflows are
> > generated per second at all. As for loggen, keep in mind that I was
> > using that before to write to /dev/null and also directly to disk and
> > topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
> > on the right track with some OS issues?
> >
> > I used Google to find scripts to run for Dtrace to track syscalls, and
> > one called procsystime stood out
> > (http://www.brendangregg.com/DTrace/procsystime) Here's a sample of the
> > output over about 20 seconds each time, while the overflows are
> > happening. It's hard for me to tell if the write() syscall is really
> > freaking slow here or if this is as it should be and not weird. Either
> > way, write() doesn't seem to be the issue if you look at the /dev/null
> > output for which UDP overflows still occur.
> >
> > This first one is the output for the simple config I posted earlier,
> > writing to disk:
> >
> > # /root/procsystime -aTn syslog-ng
> > Hit Ctrl-C to stop sampling...
> > ^C
> >
> > Elapsed Times for processes syslog-ng,
> >
> > SYSCALL TIME (ns)
> > getpid 193600
> > fchmod 443000
> > bind 757700
> > fcntl 786800
> > setsockopt 1355700
> > fchown 1446200
> > connect 1542000
> > so_socket 2626100
> > close 4382500
> > open64 4691400
> > stat64 4957400
> > brk 23978700
> > pollsys 113785800
> > llseek 158217200
> > gtime 244735300
> > recvfrom 404501900
> > write 11343815700
> > TOTAL: 12312217000
> >
> > CPU Times for processes syslog-ng,
> >
> > SYSCALL TIME (ns)
> > getpid 42400
> > fchmod 286700
> > fcntl 341800
> > bind 691000
> > fchown 1163100
> > setsockopt 1230300
> > connect 1468100
> > so_socket 2543900
> > open64 4047800
> > close 4075300
> > stat64 4626600
> > brk 17442100
> > gtime 31685200
> > llseek 88638300
> > pollsys 92742600
> > recvfrom 332658600
> > write 3376428300
> > TOTAL: 3960112100
> >
> > Syscall Counts for processes syslog-ng,
> >
> > SYSCALL COUNT
> > bind 19
> > connect 19
> > so_socket 19
> > setsockopt 38
> > fchmod 48
> > open64 48
> > getpid 58
> > close 89
> > fchown 96
> > stat64 96
> > fcntl 172
> > brk 2342
> > pollsys 4741
> > recvfrom 22003
> > llseek 22701
> > write 51022
> > gtime 85661
> > TOTAL: 189172
> >
> > And this second sample is for writing directly to /dev/null, you'll
> > notice write() is taking a lot less time but that totally makes sense
> > since disks are no longer being used (the overflows are still happening
> > just as aggressively however):
> >
> > # /root/procsystime -aTn syslog-ng
> > Hit Ctrl-C to stop sampling...
> > ^C
> >
> > Elapsed Times for processes syslog-ng,
> >
> > SYSCALL TIME (ns)
> > chmod 24400
> > brk 28900
> > fchmod 45200
> > fcntl 49000
> > chown 70900
> > close 154800
> > fchown 165800
> > mkdir 297700
> > open64 797000
> > stat64 846900
> > pollsys 14180100
> > llseek 25238500
> > write 54238800
> > gtime 59281800
> > recvfrom 87493900
> > nanosleep 17615931600
> > TOTAL: 17858845300
> >
> > CPU Times for processes syslog-ng,
> >
> > SYSCALL TIME (ns)
> > fcntl 20600
> > chmod 20700
> > brk 22400
> > fchmod 29400
> > chown 62400
> > fchown 134100
> > close 137000
> > mkdir 292500
> > open64 774500
> > stat64 799900
> > pollsys 10720500
> > llseek 12170100
> > gtime 13406200
> > write 18375500
> > nanosleep 30881900
> > recvfrom 72251100
> > TOTAL: 160098800
> >
> > Syscall Counts for processes syslog-ng,
> >
> > SYSCALL COUNT
> > chmod 1
> > mkdir 1
> > brk 2
> > chown 2
> > fchmod 5
> > open64 5
> > close 6
> > fchown 10
> > fcntl 10
> > stat64 14
> > pollsys 949
> > nanosleep 950
> > llseek 4774
> > write 4774
> > recvfrom 4776
> > gtime 17891
> > TOTAL: 34170
> >
> > I will let you all know what happens post-patch. And if there are any
> > suggestions for other things to try with dtrace, they are welcome, as
> > I'm a novice with it and Solaris administration in general.
> >
> > Thanks for all the excellent help so far!
> >
> > --Mike
> >
> >
> > ________________________________
> >
> > From: syslog-ng-bounces at lists.balabit.hu
> > [mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
> > Sent: Friday, April 15, 2011 9:28 PM
> > To: Syslog-ng users' and developers' mailing list
> > Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops
> >
> >
> > I am experiencing the same problem with Sun V490 except the server has
> > about 16gb memory. We are using UDP and losing about 85% of the traffic.
> > The udpinoverflows is darn near equal to the total number of packets
> > coming in. I am not at work now so cannot provide accurate statistics at
> > this time. The NIC statistics are perfect, we aren't getting any errors
> > with regards to the UDP area etc.
> >
> > There is a kernel patch that came out about a week or two ago that deals
> > in this area, but I have not yet applied it. I want to apply the patch
> > first before adjusting other kernel parameters. We have Solaris 10,
> > update 9. Version of syslog-ng is 3.1.2. It is really terrible.
> >
> > By terrible, I mean the packet loss, not the product:)) It is probably
> > something I don't have set up correctly.
> >
> > Mike, check out that latest patch, it can't hurt. I had to open a case
> > with Sun to find out about it:))
> >
> >
> >
> >
> > On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
> > wrote:
> >
> >
> > On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
> > > I left out the resources I have to work with on this system,
> > and how
> > > bad/good things are with syslog-ng running (and dropping),
> > I'll include
> > > those now. As you can see, it's an older server, but it has a
> > ton of
> > > RAM and the CPUs should have enough pop for this I think.
> >
> >
> > > I'm just not sure what to do next to troubleshoot. I'm hoping
> > someone
> > > here can point me in the right direction, or at least confirm
> > that they
> > > are running syslog-ng in a similar configuration without drops
> > so I know
> > > that it's at least possible?
> > >
> > > Regards,
> > >
> > > --Mike
> >
> >
> > I think the next suspect would be the disks. Can you disable
> > anything that writes to disk or tell it to write to /dev/null and see if
> > it still blows up?
> >
> > Also, it's Solaris, so you could start using some of the dtrace
> > scripts to look for what syscalls / other ops are running too slow, and
> > when it gets stuck what type of socket / disk file / what IO is it
> > doing?
> >
> > Matthew.
> >
> >
> > ________________________________________________________________________
> > ______
> > Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> > Documentation:
> > http://www.balabit.com/support/documentation/?product=syslog-ng
> > FAQ: http://www.campin.net/syslog-ng/faq.html
> >
> >
> >
> >
> > ______________________________________________________________________________
> > Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> > Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> > FAQ: http://www.campin.net/syslog-ng/faq.html
> >
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>
More information about the syslog-ng
mailing list