[syslog-ng] Solaris 10 UDP overflows, message drops
Mishou Michael
Michael.Mishou at csirc.irs.gov
Mon Apr 18 17:43:50 CEST 2011
Fred,
Great find! For those following on the TV at home, here is the link to
the patch notes that I found:
https://getupdates.oracle.com/readme/144488-11
Which contains this tantalizing tidbit:
6638967 UDP recv (think DNS) suffers from thundering herd problem
(bug report for above: http://bit.ly/eD57KB+ )
I'm going to install this patch and see what comes of it. That
certainly seems like it could be related.
Martin,
I checked based on Matthew's suggestion of the ipInDiscards counter
incrementing, and it's not, so no dice with the checksum errors, good
call though!
Matthew,
Following your suggestions I set up all of the network-based
destinations to be /dev/null and the problem persists, shows no change
in terms of how fast the buffers fill up or how many overflows are
generated per second at all. As for loggen, keep in mind that I was
using that before to write to /dev/null and also directly to disk and
topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't
on the right track with some OS issues?
I used Google to find scripts to run for Dtrace to track syscalls, and
one called procsystime stood out
(http://www.brendangregg.com/DTrace/procsystime) Here's a sample of the
output over about 20 seconds each time, while the overflows are
happening. It's hard for me to tell if the write() syscall is really
freaking slow here or if this is as it should be and not weird. Either
way, write() doesn't seem to be the issue if you look at the /dev/null
output for which UDP overflows still occur.
This first one is the output for the simple config I posted earlier,
writing to disk:
# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for processes syslog-ng,
SYSCALL TIME (ns)
getpid 193600
fchmod 443000
bind 757700
fcntl 786800
setsockopt 1355700
fchown 1446200
connect 1542000
so_socket 2626100
close 4382500
open64 4691400
stat64 4957400
brk 23978700
pollsys 113785800
llseek 158217200
gtime 244735300
recvfrom 404501900
write 11343815700
TOTAL: 12312217000
CPU Times for processes syslog-ng,
SYSCALL TIME (ns)
getpid 42400
fchmod 286700
fcntl 341800
bind 691000
fchown 1163100
setsockopt 1230300
connect 1468100
so_socket 2543900
open64 4047800
close 4075300
stat64 4626600
brk 17442100
gtime 31685200
llseek 88638300
pollsys 92742600
recvfrom 332658600
write 3376428300
TOTAL: 3960112100
Syscall Counts for processes syslog-ng,
SYSCALL COUNT
bind 19
connect 19
so_socket 19
setsockopt 38
fchmod 48
open64 48
getpid 58
close 89
fchown 96
stat64 96
fcntl 172
brk 2342
pollsys 4741
recvfrom 22003
llseek 22701
write 51022
gtime 85661
TOTAL: 189172
And this second sample is for writing directly to /dev/null, you'll
notice write() is taking a lot less time but that totally makes sense
since disks are no longer being used (the overflows are still happening
just as aggressively however):
# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for processes syslog-ng,
SYSCALL TIME (ns)
chmod 24400
brk 28900
fchmod 45200
fcntl 49000
chown 70900
close 154800
fchown 165800
mkdir 297700
open64 797000
stat64 846900
pollsys 14180100
llseek 25238500
write 54238800
gtime 59281800
recvfrom 87493900
nanosleep 17615931600
TOTAL: 17858845300
CPU Times for processes syslog-ng,
SYSCALL TIME (ns)
fcntl 20600
chmod 20700
brk 22400
fchmod 29400
chown 62400
fchown 134100
close 137000
mkdir 292500
open64 774500
stat64 799900
pollsys 10720500
llseek 12170100
gtime 13406200
write 18375500
nanosleep 30881900
recvfrom 72251100
TOTAL: 160098800
Syscall Counts for processes syslog-ng,
SYSCALL COUNT
chmod 1
mkdir 1
brk 2
chown 2
fchmod 5
open64 5
close 6
fchown 10
fcntl 10
stat64 14
pollsys 949
nanosleep 950
llseek 4774
write 4774
recvfrom 4776
gtime 17891
TOTAL: 34170
I will let you all know what happens post-patch. And if there are any
suggestions for other things to try with dtrace, they are welcome, as
I'm a novice with it and Solaris administration in general.
Thanks for all the excellent help so far!
--Mike
________________________________
From: syslog-ng-bounces at lists.balabit.hu
[mailto:syslog-ng-bounces at lists.balabit.hu] On Behalf Of Fred Connolly
Sent: Friday, April 15, 2011 9:28 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops
I am experiencing the same problem with Sun V490 except the server has
about 16gb memory. We are using UDP and losing about 85% of the traffic.
The udpinoverflows is darn near equal to the total number of packets
coming in. I am not at work now so cannot provide accurate statistics at
this time. The NIC statistics are perfect, we aren't getting any errors
with regards to the UDP area etc.
There is a kernel patch that came out about a week or two ago that deals
in this area, but I have not yet applied it. I want to apply the patch
first before adjusting other kernel parameters. We have Solaris 10,
update 9. Version of syslog-ng is 3.1.2. It is really terrible.
By terrible, I mean the packet loss, not the product:)) It is probably
something I don't have set up correctly.
Mike, check out that latest patch, it can't hurt. I had to open a case
with Sun to find out about it:))
On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall at mhcomputing.net>
wrote:
On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
> I left out the resources I have to work with on this system,
and how
> bad/good things are with syslog-ng running (and dropping),
I'll include
> those now. As you can see, it's an older server, but it has a
ton of
> RAM and the CPUs should have enough pop for this I think.
> I'm just not sure what to do next to troubleshoot. I'm hoping
someone
> here can point me in the right direction, or at least confirm
that they
> are running syslog-ng in a similar configuration without drops
so I know
> that it's at least possible?
>
> Regards,
>
> --Mike
I think the next suspect would be the disks. Can you disable
anything that writes to disk or tell it to write to /dev/null and see if
it still blows up?
Also, it's Solaris, so you could start using some of the dtrace
scripts to look for what syscalls / other ops are running too slow, and
when it gets stuck what type of socket / disk file / what IO is it
doing?
Matthew.
________________________________________________________________________
______
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation:
http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.campin.net/syslog-ng/faq.html
More information about the syslog-ng
mailing list