[syslog-ng] Solaris 10 UDP overflows, message drops
Matthew Hall
mhall at mhcomputing.net
Mon Apr 18 17:57:41 CEST 2011
EXCELLENT debug and measurement work.
Based on this evidence and testing I vote in favor of a UDP issue a la Fred.
Matthew.
On Mon, Apr 18, 2011 at 11:43:50AM -0400, Mishou Michael wrote:
> Fred,
>
> Great find! For those following on the TV at home, here is the link to
> the patch notes that I found:
>
> https://getupdates.oracle.com/readme/144488-11
>
> Which contains this tantalizing tidbit:
>
> 6638967 UDP recv (think DNS) suffers from thundering herd problem
> (bug report for above: http://bit.ly/eD57KB+ )
>
> I'm going to install this patch and see what comes of it. That
> certainly seems like it could be related.
>
> Martin,
>
> I checked based on Matthew's suggestion of the ipInDiscards counter
> incrementing, and it's not, so no dice with the checksum errors, good
> call though!
>
> Matthew,
>
> Following your suggestions I set up all of the network-based
> destinations to be /dev/null and the problem persists, shows no change
> in terms of how fast the buffers fill up or how many overflows are
> generated per second at all. As for loggen, keep in mind that I was
> using that before to write to /dev/null and also directly to disk and
> topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
> on the right track with some OS issues?
>
> I used Google to find scripts to run for Dtrace to track syscalls, and
> one called procsystime stood out
> (http://www.brendangregg.com/DTrace/procsystime) Here's a sample of the
> output over about 20 seconds each time, while the overflows are
> happening. It's hard for me to tell if the write() syscall is really
> freaking slow here or if this is as it should be and not weird. Either
> way, write() doesn't seem to be the issue if you look at the /dev/null
> output for which UDP overflows still occur.
>
> This first one is the output for the simple config I posted earlier,
> writing to disk:
>
> # /root/procsystime -aTn syslog-ng
> Hit Ctrl-C to stop sampling...
> ^C
>
> Elapsed Times for processes syslog-ng,
>
> SYSCALL TIME (ns)
> getpid 193600
> fchmod 443000
> bind 757700
> fcntl 786800
> setsockopt 1355700
> fchown 1446200
> connect 1542000
> so_socket 2626100
> close 4382500
> open64 4691400
> stat64 4957400
> brk 23978700
> pollsys 113785800
> llseek 158217200
> gtime 244735300
> recvfrom 404501900
> write 11343815700
> TOTAL: 12312217000
>
> CPU Times for processes syslog-ng,
>
> SYSCALL TIME (ns)
> getpid 42400
> fchmod 286700
> fcntl 341800
> bind 691000
> fchown 1163100
> setsockopt 1230300
> connect 1468100
> so_socket 2543900
> open64 4047800
> close 4075300
> stat64 4626600
> brk 17442100
> gtime 31685200
> llseek 88638300
> pollsys 92742600
> recvfrom 332658600
> write 3376428300
> TOTAL: 3960112100
>
> Syscall Counts for processes syslog-ng,
>
> SYSCALL COUNT
> bind 19
> connect 19
> so_socket 19
> setsockopt 38
> fchmod 48
> open64 48
> getpid 58
> close 89
> fchown 96
> stat64 96
> fcntl 172
> brk 2342
> pollsys 4741
> recvfrom 22003
> llseek 22701
> write 51022
> gtime 85661
> TOTAL: 189172
>
> And this second sample is for writing directly to /dev/null, you'll
> notice write() is taking a lot less time but that totally makes sense
> since disks are no longer being used (the overflows are still happening
> just as aggressively however):
>
> # /root/procsystime -aTn syslog-ng
> Hit Ctrl-C to stop sampling...
> ^C
>
> Elapsed Times for processes syslog-ng,
>
> SYSCALL TIME (ns)
> chmod 24400
> brk 28900
> fchmod 45200
> fcntl 49000
> chown 70900
> close 154800
> fchown 165800
> mkdir 297700
> open64 797000
> stat64 846900
> pollsys 14180100
> llseek 25238500
> write 54238800
> gtime 59281800
> recvfrom 87493900
> nanosleep 17615931600
> TOTAL: 17858845300
>
> CPU Times for processes syslog-ng,
>
> SYSCALL TIME (ns)
> fcntl 20600
> chmod 20700
> brk 22400
> fchmod 29400
> chown 62400
> fchown 134100
> close 137000
> mkdir 292500
> open64 774500
> stat64 799900
> pollsys 10720500
> llseek 12170100
> gtime 13406200
> write 18375500
> nanosleep 30881900
> recvfrom 72251100
> TOTAL: 160098800
>
> Syscall Counts for processes syslog-ng,
>
> SYSCALL COUNT
> chmod 1
> mkdir 1
> brk 2
> chown 2
> fchmod 5
> open64 5
> close 6
> fchown 10
> fcntl 10
> stat64 14
> pollsys 949
> nanosleep 950
> llseek 4774
> write 4774
> recvfrom 4776
> gtime 17891
> TOTAL: 34170
>
> I will let you all know what happens post-patch. And if there are any
> suggestions for other things to try with dtrace, they are welcome, as
> I'm a novice with it and Solaris administration in general.
>
> Thanks for all the excellent help so far!
>
> --Mike
>
>
> ________________________________
>
> From: syslog-ng-bounces at lists.balabit.hu
> [mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
> Sent: Friday, April 15, 2011 9:28 PM
> To: Syslog-ng users' and developers' mailing list
> Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops
>
>
> I am experiencing the same problem with Sun V490 except the server has
> about 16gb memory. We are using UDP and losing about 85% of the traffic.
> The udpinoverflows is darn near equal to the total number of packets
> coming in. I am not at work now so cannot provide accurate statistics at
> this time. The NIC statistics are perfect, we aren't getting any errors
> with regards to the UDP area etc.
>
> There is a kernel patch that came out about a week or two ago that deals
> in this area, but I have not yet applied it. I want to apply the patch
> first before adjusting other kernel parameters. We have Solaris 10,
> update 9. Version of syslog-ng is 3.1.2. It is really terrible.
>
> By terrible, I mean the packet loss, not the product:)) It is probably
> something I don't have set up correctly.
>
> Mike, check out that latest patch, it can't hurt. I had to open a case
> with Sun to find out about it:))
>
>
>
>
> On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
> wrote:
>
>
> On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
> > I left out the resources I have to work with on this system,
> and how
> > bad/good things are with syslog-ng running (and dropping),
> I'll include
> > those now. As you can see, it's an older server, but it has a
> ton of
> > RAM and the CPUs should have enough pop for this I think.
>
>
> > I'm just not sure what to do next to troubleshoot. I'm hoping
> someone
> > here can point me in the right direction, or at least confirm
> that they
> > are running syslog-ng in a similar configuration without drops
> so I know
> > that it's at least possible?
> >
> > Regards,
> >
> > --Mike
>
>
> I think the next suspect would be the disks. Can you disable
> anything that writes to disk or tell it to write to /dev/null and see if
> it still blows up?
>
> Also, it's Solaris, so you could start using some of the dtrace
> scripts to look for what syscalls / other ops are running too slow, and
> when it gets stuck what type of socket / disk file / what IO is it
> doing?
>
> Matthew.
>
>
> ________________________________________________________________________
> ______
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation:
> http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>
>
>
>
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>
More information about the syslog-ng
mailing list