Hi,

On Tue, 2011-04-26 at 12:05 -0400, Mishou Michael wrote:
For those following this thread: I have applied the "thundering herd" UDP patch and saw no change in the drops experienced by syslog-ng 3.1.2. Sorry I took so long to respond; the patching was a much more time-consuming process than I thought it would be.
At this point, based on Michael Hocke's response, I'm thinking that perhaps there is just too much UDP traffic for single-threaded syslog-ng to deal with, given the filtering and parsing it does up front (for macro usage).
I'm going to experiment with syslog-ng and the loggen tool to find a point at which a single syslog-ng instance starts dropping inbound UDP traffic with a simple configuration writing to disk. Once I have that number, I have a few options:
1. Experiment with syslog-ng 3.3 and the new threaded code to see if I get performance gains. I'm hesitant to push alpha code into production; if anyone has any experience with 3.3 running consistently in a semi-production environment, I'd love to hear it.
I think the most difficult part of compiling syslog-ng for Solaris is ivykis, the new I/O backend library that we've started using for threading (it supports epoll, /dev/poll, kqueue etc). The ivykis version that we use is available on git.balabit.hu, but you need a complete toolchain (autoconf, automake, libtool, gcc, gmake) to compile it.
2. So I don't have to change the configuration on a lot of clients, use PF to rewrite incoming UDP messages from specific, busy clients to other syslog-ng listeners, configured exactly as my main instance (which will handle all the non-insanely-busy clients). I could run multiple listeners in this manner, and not need threading to take advantage of multiple processors, though obviously each process would still be limited to the magic number determined above. I have 10 or so really busy clients, so this is one solution I'm leaning towards if syslog-ng 3.1.2 can handle just one of them.
This could work.
3. Give up on syslog-ng until 3.3, or move to some other solution. Not sure what I could do here; rsyslog is the other major contender, I guess, but I'm not sure what gains I would get. I could also run the native syslog server and post-process to different buckets/relay, which is what we mainly use syslog-ng for.
4. Get a faster box (not likely to happen).
If anyone has any thoughts on any of the above, I'd love to hear them. Also, if this is unique to Solaris SPARC systems (similarly spec'd x86 Solaris systems having none of these limitations), I'd love to know that as well. Does anyone know a way to figure out at what point the SPARC is hitting a ceiling? The CPU is not pegged, so why would we be experiencing CPU-based drops? Maybe the code is not efficient for how SPARC does things, or for how some syscall is implemented on Solaris?
Yes, I think this is the root cause of the problem.

--
Bazsi
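For the experiment mentioned earlier in the thread (finding the rate at which a single receiver starts dropping inbound UDP), the idea can be roughed out in Python as below. This is only a sketch under stated assumptions: it is not loggen and it does not exercise syslog-ng itself. The function names, the message format, and the in-process loopback receiver are all made up for illustration, and a Python sender may well hit its own ceiling before the receiver does.

```python
import socket
import threading
import time


def make_message(seq: int) -> bytes:
    # BSD-syslog-style line carrying a sequence number (format is illustrative).
    return f"<13>Apr 26 12:05:00 testhost loadtest: seq={seq}".encode("ascii")


def parse_seq(datagram: bytes) -> int:
    # Recover the sequence number written by make_message().
    return int(datagram.rsplit(b"seq=", 1)[1])


def count_missing(received, sent_count: int) -> int:
    # Datagrams that were sent but never seen by the receiver.
    return sent_count - len(set(received))


def run_trial(rate: float, count: int, host: str = "127.0.0.1") -> int:
    """Send `count` datagrams at roughly `rate` per second; return how many were lost."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind((host, 0))          # let the OS pick a free port
    recv_sock.settimeout(0.5)
    port = recv_sock.getsockname()[1]

    received = []
    done = threading.Event()

    def receiver():
        while True:
            try:
                data, _ = recv_sock.recvfrom(2048)
            except socket.timeout:
                if done.is_set():      # sender finished and the socket went quiet
                    return
                continue
            received.append(parse_seq(data))

    thread = threading.Thread(target=receiver)
    thread.start()

    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    interval = 1.0 / rate
    next_send = time.monotonic()
    for seq in range(count):
        send_sock.sendto(make_message(seq), (host, port))
        next_send += interval
        delay = next_send - time.monotonic()
        if delay > 0:                  # if we are behind schedule, do not sleep
            time.sleep(delay)

    done.set()
    thread.join()
    send_sock.close()
    recv_sock.close()
    return count_missing(received, count)


def find_drop_point(rates, count: int = 10000):
    """Return the first rate in `rates` that shows any loss, or None if none does."""
    for rate in rates:
        if run_trial(rate, count) > 0:
            return rate
    return None
```

Calling something like find_drop_point([1000, 5000, 20000]) ramps through the given rates. To test a real syslog-ng instance, the receiver half would be replaced by the daemon and the loss would be counted from what it wrote to disk.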