Fred, Great find! For those following on the TV at home, here is the link to the patch notes that I found: https://getupdates.oracle.com/readme/144488-11 Which contains this tantalizing tidbit: 6638967 UDP recv (think DNS) suffers from thundering herd problem (bug report for above: http://bit.ly/eD57KB+ ) I'm going to install this patch and see what comes of it. That certainly seems like it could be related. Martin, I checked based on Matthew's suggestion of the ipInDiscards counter incrementing, and it's not, so no dice with the checksum errors, good call though! Matthew, Following your suggestions I set up all of the network-based destinations to be /dev/null and the problem persists, shows no change in terms of how fast the buffers fill up or how many overflows are generated per second at all. As for loggen, keep in mind that I was using that before to write to /dev/null and also directly to disk and topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't on the right track with some OS issues? I used Google to find scripts to run for Dtrace to track syscalls, and one called procsystime stood out (http://www.brendangregg.com/DTrace/procsystime) Here's a sample of the output over about 20 seconds each time, while the overflows are happening. It's hard for me to tell if the write() syscall is really freaking slow here or if this is as it should be and not weird. Either way, write() doesn't seem to be the issue if you look at the /dev/null output for which UDP overflows still occur. This first one is the output for the simple config I posted earlier, writing to disk: # /root/procsystime -aTn syslog-ng Hit Ctrl-C to stop sampling... ^C Elapsed Times for processes syslog-ng, SYSCALL TIME (ns) getpid 193600 fchmod 443000 bind 757700 fcntl 786800 setsockopt 1355700 fchown 1446200 connect 1542000 so_socket 2626100 close 4382500 open64 4691400 stat64 4957400 brk 23978700 pollsys 113785800 llseek 158217200 gtime 244735300 recvfrom 404501900 write 11343815700 TOTAL: 12312217000 CPU Times for processes syslog-ng, SYSCALL TIME (ns) getpid 42400 fchmod 286700 fcntl 341800 bind 691000 fchown 1163100 setsockopt 1230300 connect 1468100 so_socket 2543900 open64 4047800 close 4075300 stat64 4626600 brk 17442100 gtime 31685200 llseek 88638300 pollsys 92742600 recvfrom 332658600 write 3376428300 TOTAL: 3960112100 Syscall Counts for processes syslog-ng, SYSCALL COUNT bind 19 connect 19 so_socket 19 setsockopt 38 fchmod 48 open64 48 getpid 58 close 89 fchown 96 stat64 96 fcntl 172 brk 2342 pollsys 4741 recvfrom 22003 llseek 22701 write 51022 gtime 85661 TOTAL: 189172 And this second sample is for writing directly to /dev/null, you'll notice write() is taking a lot less time but that totally makes sense since disks are no longer being used (the overflows are still happening just as aggressively however): # /root/procsystime -aTn syslog-ng Hit Ctrl-C to stop sampling... ^C Elapsed Times for processes syslog-ng, SYSCALL TIME (ns) chmod 24400 brk 28900 fchmod 45200 fcntl 49000 chown 70900 close 154800 fchown 165800 mkdir 297700 open64 797000 stat64 846900 pollsys 14180100 llseek 25238500 write 54238800 gtime 59281800 recvfrom 87493900 nanosleep 17615931600 TOTAL: 17858845300 CPU Times for processes syslog-ng, SYSCALL TIME (ns) fcntl 20600 chmod 20700 brk 22400 fchmod 29400 chown 62400 fchown 134100 close 137000 mkdir 292500 open64 774500 stat64 799900 pollsys 10720500 llseek 12170100 gtime 13406200 write 18375500 nanosleep 30881900 recvfrom 72251100 TOTAL: 160098800 Syscall Counts for processes syslog-ng, SYSCALL COUNT chmod 1 mkdir 1 brk 2 chown 2 fchmod 5 open64 5 close 6 fchown 10 fcntl 10 stat64 14 pollsys 949 nanosleep 950 llseek 4774 write 4774 recvfrom 4776 gtime 17891 TOTAL: 34170 I will let you all know what happens post-patch. And if there are any suggestions for other things to try with dtrace, they are welcome, as I'm a novice with it and Solaris administration in general. Thanks for all the excellent help so far! --Mike ________________________________ From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Fred Connolly Sent: Friday, April 15, 2011 9:28 PM To: Syslog-ng users' and developers' mailing list Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops I am experiencing the same problem with Sun V490 except the server has about 16gb memory. We are using UDP and losing about 85% of the traffic. The udpinoverflows is darn near equal to the total number of packets coming in. I am not at work now so cannot provide accurate statistics at this time. The NIC statistics are perfect, we aren't getting any errors with regards to the UDP area etc. There is a kernel patch that came out about a week or two ago that deals in this area, but I have not yet applied it. I want to apply the patch first before adjusting other kernel parameters. We have Solaris 10, update 9. Version of syslog-ng is 3.1.2. It is really terrible. By terrible, I mean the packet loss, not the product:)) It is probably something I don't have set up correctly. Mike, check out that latest patch, it can't hurt. I had to open a case with Sun to find out about it:)) On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall@mhcomputing.net> wrote: On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote: > I left out the resources I have to work with on this system, and how > bad/good things are with syslog-ng running (and dropping), I'll include > those now. As you can see, it's an older server, but it has a ton of > RAM and the CPUs should have enough pop for this I think. > I'm just not sure what to do next to troubleshoot. I'm hoping someone > here can point me in the right direction, or at least confirm that they > are running syslog-ng in a similar configuration without drops so I know > that it's at least possible? > > Regards, > > --Mike I think the next suspect would be the disks. Can you disable anything that writes to disk or tell it to write to /dev/null and see if it still blows up? Also, it's Solaris, so you could start using some of the dtrace scripts to look for what syscalls / other ops are running too slow, and when it gets stuck what type of socket / disk file / what IO is it doing? Matthew. ________________________________________________________________________ ______ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html