Profiling syslog-ng
I'm trying to profile syslog-ng on a set of servers that generate high levels of logging traffic (a few GB/day). Currently, syslog-ng is consuming anywhere from 40-80% of a CPU on a dual Pentium IV Xeon machine.

gprof(1) generates a profile, except for time consumed - it says "no time accumulated" even after running a profile-enabled syslog-ng for several minutes. For example:

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00      667     0.00     0.00  xalloc
 0.00      0.00      0.00      593     0.00     0.00  prepare
 0.00      0.00      0.00      434     0.00     0.00  gc_mark
 0.00      0.00      0.00      372     0.00     0.00  ol_string_free
 0.00      0.00      0.00      313     0.00     0.00  do_prepare_write

I've found similar complaints in the binutils list archives:

http://sourceware.org/ml/binutils/2005-07/msg00364.html
http://sourceware.org/ml/binutils/2002-01/msg00121.html
http://sourceware.org/ml/binutils/2001-10/msg00315.html

Personally, we're using glibc 2.3.2.ds1-22 and building the Debian syslog-ng 1.6.9-1 package using gcc 3.3.5-3 or 3.4.3-13.

I've also tried oprofile, but it only(?) seems to report call counts, not time consumed. For example:

samples  %        symbol name
102829   67.7349  io_iter
 16560   10.9083  prepare
  2975    1.9597  parse_log_msg
  2518    1.6586  expand_macro

I've read io_iter() and it doesn't seem to contain any smoking guns that would account for 68% of total syslog-ng CPU usage.

Any suggestions on how to profile syslog-ng? I can post my configuration if it'd be helpful, but we've twisted many knobs (gc idle/busy thresholds, buffering via different sync() levels, disabling remote syslogging via UDP, etc.) without a major change in CPU consumption.

john
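(An aside for anyone hitting the same "no time accumulated" problem: gprof's timing data comes from glibc's profiling support, which samples the program via ITIMER_PROF, so a quick way to separate a broken toolchain from a syslog-ng-specific issue is to profile a trivial CPU-bound program built the same way. The program below is only a sanity-check sketch, not something from this thread; the file name and build line are illustrative.)

/* busy.c - trivial CPU burner for checking that gprof time sampling works.
 * Build: gcc -pg -O0 -o busy busy.c
 * Run:   ./busy          (writes gmon.out in the current directory)
 * View:  gprof ./busy gmon.out
 * If even this reports "no time accumulated", the gcc/glibc profiling
 * support is at fault rather than syslog-ng.
 */
#include <stdio.h>

static double spin(long iterations)
{
    double acc = 0.0;
    long i;

    for (i = 1; i <= iterations; i++)
        acc += 1.0 / (double)i;        /* cheap floating-point busy work */
    return acc;
}

int main(void)
{
    double total = 0.0;
    int round;

    for (round = 0; round < 50; round++)
        total += spin(10000000L);
    printf("%f\n", total);             /* print so the loop can't be elided */
    return 0;
}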
On 2/10/06, John Morrissey <jwm@horde.net> wrote:
I'm trying to profile syslog-ng on a set of servers that generate high levels of logging traffic (a few GB/day). Currently, syslog-ng is consuming anywhere from 40-80% of a CPU on a dual Pentium IV Xeon machine.
Even for the slowest possible P4, that seems immensely high. Can you give more information about what on these servers generates the log events, the average log message length, and the average and peak events/second rate?
Any suggestions on how to profile syslog-ng? I can post my configuration if it'd be helpful,
Please do.
but we've twisted many knobs (gc idle/busy thresholds, buffering via different sync() levels, disabling remote syslogging via UDP, etc.) without a major change in CPU consumption.
Just as a SWAG, the first thing I'd wonder is if there's a regex match in your configuration chewing cycles.

Kevin
On Fri, Feb 10, 2006 at 09:14:44PM -0600, Kevin wrote:
On 2/10/06, John Morrissey <jwm@horde.net> wrote:
I'm trying to profile syslog-ng on a set of servers that generate high levels of logging traffic (a few GB/day). Currently, syslog-ng is consuming anywhere from 40-80% of a CPU on a dual Pentium IV Xeon machine.
Even for the slowest possible P4, that seems immensely high. Can you give more information about what on these servers generates the log events, the average log message length, and the average and peak events/second rate?
They're our inbound mail cluster (running Postfix), handling a few million mail messages a day. They usually have several hundred processes running simultaneously, almost all of which log.

As far as log messages go, 13-19,000 messages/minute is a general average, with frequent peaks of 23-28,000 messages/minute, and occasional heavy peaks of up to 35,000 messages/minute for a couple minutes at a time. Yesterday, one machine syslogged about 23 million messages (~16,000/minute average for the day).

Here's a breakdown of log lengths:

< 100 bytes - 34%
  100       - 45%
  200       -  7%
  300       - 12%
  400       -  2%
can post my configuration if it'd be helpful,
Please do.
It's up at http://horde.net/~jwm/syslog-ng.conf.

john
On Sat, 2006-02-11 at 11:09 -0500, John Morrissey wrote:
On Fri, Feb 10, 2006 at 09:14:44PM -0600, Kevin wrote:
On 2/10/06, John Morrissey <jwm@horde.net> wrote:
I'm trying to profile syslog-ng on a set of servers that generate high levels of logging traffic (a few GB/day). Currently, syslog-ng is consuming anywhere from 40-80% of a CPU on a dual Pentium IV Xeon machine.
Even for the slowest possible P4, that seems immensely high. Can you give more information about what on these servers generates the log events, the average log message length, and the average and peak events/second rate?
Can you check if the attached patch decreases your load? It avoids getting back to the poll loop after every log message and loops on the currently processed fd until it either returns EAGAIN or it reaches a hard-coded limit (currently 30).

Make sure that you touch sources.c.x after applying the patch, or the build scripts will require scsh to be installed.

If I'm speculating correctly, this will move io_iter lower in your profile. (I would say instead of 70%, it will use about 7%.)

Thanks in advance.

-- Bazsi
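(The actual patch was attached to the mail and is not reproduced in the archive; the sketch below only illustrates the shape of the change being described - drain a ready fd for up to a fixed number of reads instead of returning to poll() after each one. MAX_FETCH stands in for the hard-coded limit of 30, and handle_message() is a hypothetical callback.)

/* Illustrative sketch only: drain one non-blocking fd for several messages
 * per poll() wakeup instead of returning to the poll loop after every read.
 */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FETCH 30

static void drain_fd(int fd, void (*handle_message)(const char *, ssize_t))
{
    char buf[8192];
    int fetch_count = 0;

    while (fetch_count < MAX_FETCH) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            handle_message(buf, n);   /* hand the data to the log parser */
            fetch_count++;
        } else if (n == 0) {
            break;                    /* EOF: peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                    /* nothing more queued right now */
        } else if (errno != EINTR) {
            break;                    /* real error; let the caller clean up */
        }
    }
}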
On Sun, Feb 12, 2006 at 08:07:14PM +0100, Balazs Scheidler wrote:
Can you check if the attached patch decreases your load? It avoids getting back to the poll loop after every log message and loops on the currently processed fd until it either returns EAGAIN or it reaches a hard-coded number. (which is currently 30)
Unfortunately, it didn't help that much. I strace(1)ed it for a few minutes; here are some read(2) statistics. It shows how many reads were done before encountering an EAGAIN:

Count - # Reads
44268 - 1 reads
  370 - 2 reads
  279 - 3 reads
    9 - 4 reads
    6 - 5 reads
    1 - 6 reads
[tail removed - ~34 lines with a count < 10]

Repeated reads might not make sense for our load pattern, since there are many, many processes all syslogging a line or two at once. Few processes (as you can see from the stats) syslog several messages in immediate succession.

On Sun, Feb 12, 2006 at 02:53:55PM +0100, Balazs Scheidler wrote:
Strange, it's been a while since I last profiled syslog-ng. What were your build options?
They're pretty stock. We used the Debian package for 1.6.9, adding 'debug' to DEB_BUILD_OPTIONS, which seems to use:

./configure --prefix=/ --mandir=/usr/share/man --sysconfdir=/etc \
  --enable-debug

and '-g -Wall' for CFLAGS. We added -pg manually.
It would be nice to know how blocking affects this, io_iter is the central poll loop and unless syslog-ng is spinning for some reason it should be really low in the profile. Hm... as I see you have max-connections set to 2000, how many connections are there in reality? Can you check lsof -p <syslog-ng pid> how many open fds syslog-ng is polling?
Currently, it has about 1,000 file descriptors open. We had to increase it since we were hitting the stock limit (1024?) at peak times.
io_iter basically traverses all registered fds and constructs a "struct pollfd" array every time it is invoked. If there are a lot of fds, this might be a problem.
I agree; poll(2) is fairly cheap(?), so regenerating the struct pollfd is a likely suspect. Could the struct pollfd be cached, regenerated every n msec? That way latency for new processes isn't too bad, but it keeps syslog-ng from spinning so much?

Or perhaps it could keep a struct pollfd that is only regenerated when file descriptors are opened/closed. Our processes are fairly long-lived, so this too could keep the struct from being regenerated so often. Even if logging processes are not long-lived, I doubt it would be regenerated more often than once every second or few.

Here are some poll(2) stats. It seems that over half of poll(2) calls return a single file descriptor with activity.

Count - # of descriptors returned by poll(2)
20140 - 1
 8372 - 2
 3941 - 3
 1601 - 4
  635 - 5
  260 - 6
  103 - 7
   62 - 8
   30 - 9
   15 - 10
[tail removed - ~40 lines with a count < 10]

Thanks for your help!

john
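(To make the caching idea above concrete: a sketch along the following lines, with invented names like pollfds_dirty and wait_for_events(). It is purely illustrative, not syslog-ng/libol code, and as discussed later in the thread libol's design makes this harder to retrofit than it looks here.)

/* Sketch of "rebuild the struct pollfd array only when the fd set changes". */
#include <poll.h>
#include <stdlib.h>

static struct pollfd *cached_pfds;   /* array handed to poll() */
static nfds_t cached_count;
static int pollfds_dirty = 1;        /* set to 1 by the open/close paths */

static int rebuild_pollfds(const int *fds, nfds_t nfds)
{
    nfds_t i;

    free(cached_pfds);
    cached_pfds = calloc(nfds, sizeof(*cached_pfds));
    if (cached_pfds == NULL)
        return -1;
    for (i = 0; i < nfds; i++) {
        cached_pfds[i].fd = fds[i];
        cached_pfds[i].events = POLLIN;
    }
    cached_count = nfds;
    pollfds_dirty = 0;
    return 0;
}

static int wait_for_events(const int *fds, nfds_t nfds, int timeout_ms)
{
    /* Only pay the rebuild cost when a connection was opened or closed. */
    if (pollfds_dirty && rebuild_pollfds(fds, nfds) < 0)
        return -1;                    /* out of memory */
    return poll(cached_pfds, cached_count, timeout_ms);
}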
What is the best way to log mysql using syslog-ng?
On Mon, 2006-02-13 at 10:57 -0500, John Morrissey wrote:
On Sun, Feb 12, 2006 at 08:07:14PM +0100, Balazs Scheidler wrote: On Sun, Feb 12, 2006 at 02:53:55PM +0100, Balazs Scheidler wrote:
Strange, it's been a while since I last profiled syslog-ng. What were your build options?
They're pretty stock. We used the Debian package for 1.6.9, adding 'debug' to DEB_BUILD_OPTIONS, which seems to use:
./configure --prefix=/ --mandir=/usr/share/man --sysconfdir=/etc \
  --enable-debug
and '-g -Wall' for CFLAGS. We added -pg manually.
It should be ok. Maybe some kind of kernel patch (grsec or something like that?)
io_iter basically traverses all registered fds and constructs a "struct pollfd" array every time it is invoked. If there are a lot of fds, this might be a problem.
I agree; poll(2) is fairly cheap(?), so regenerating the struct pollfd is a likely suspect. Could the struct pollfd be cached, regenerated every n msec? That way latency for new processes isn't too bad, but it keeps syslog-ng from spinning so much?
Or perhaps it could keep a struct pollfd that is only regenerated when file descriptors are opened/closed. Our processes are fairly long-lived, so this too could keep the struct from being regenerated so often. Even if logging processes are not long-lived, I doubt it would be regenerated more often than once every second or few.
The problem is that it is not so simple: this code is in libol, it is generic, and the program relies on this behaviour in a lot of places. (To request a read callback, it simply changes the value of a want_read flag; no functions are called to update the pollfd state.)

io_iter currently iterates through the registered fds 3 times: first to call prepare callbacks, then to count the final set of pollfd structs, and then to fill that up. We might spare one of the loops (the counting part, with some reallocs), but the basic structure has to remain the same.

I have now looked at the development branch of syslog-ng, and the main loop in GLib basically does the same: it traverses the registered sources multiple times for each iteration. (I have not looked that closely, though; it probably spares the counting part.)

As it seems, syslog-ng does not scale well to many parallel connections and prefers load patterns with fewer connections and more traffic on each. I also looked at epoll now, but it is not a trivial change to convert either libol or glib to use it. (I looked at ivykis as well, but again it would be a big change.)

I'll look into removing one of the loops, which could help a bit.

-- Bazsi
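(Paraphrasing the three passes described above: this is not the real libol io_iter(), just an outline of its shape with made-up types and names.)

/* Outline of the three-pass poll loop described above; illustrative only. */
#include <poll.h>
#include <stdlib.h>

struct fd_source {
    int fd;
    int want_read;                    /* flipped directly by other code */
    int want_write;
    void (*prepare)(struct fd_source *self);
    struct fd_source *next;
};

static void io_iter_outline(struct fd_source *sources)
{
    struct fd_source *s;
    struct pollfd *pfds;
    nfds_t count = 0, i = 0;

    /* Pass 1: let each source update its want_read/want_write flags. */
    for (s = sources; s != NULL; s = s->next)
        s->prepare(s);

    /* Pass 2: count how many pollfd entries are needed this time around. */
    for (s = sources; s != NULL; s = s->next)
        if (s->want_read || s->want_write)
            count++;

    /* Pass 3: build the array from scratch and hand it to poll(). */
    pfds = calloc(count, sizeof(*pfds));
    if (pfds == NULL)
        return;
    for (s = sources; s != NULL; s = s->next) {
        if (!s->want_read && !s->want_write)
            continue;
        pfds[i].fd = s->fd;
        pfds[i].events = (s->want_read ? POLLIN : 0)
                       | (s->want_write ? POLLOUT : 0);
        i++;
    }
    poll(pfds, count, -1);
    /* ...revents would then be dispatched back to the sources... */
    free(pfds);
}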
Can you check if the attached patch helps your situation? On Tue, 2006-02-14 at 10:48 +0100, Balazs Scheidler wrote:
I'll look into removing one of the loops, which could help a bit.
-- Bazsi
On Tue, Feb 14, 2006 at 11:13:11AM +0100, Balazs Scheidler wrote:
On Tue, 2006-02-14 at 10:48 +0100, Balazs Scheidler wrote:
I'll look into removing one of the loops, which could help a bit.
Can you check if the attached patch helps your situation?
It's not helping a noticeable amount, sorry. Check out my other post re: usleep()ing - it's somewhat hackish, but it's made a huge difference for us.

john
On Tue, Feb 14, 2006 at 10:48:33AM +0100, Balazs Scheidler wrote:
On Mon, 2006-02-13 at 10:57 -0500, John Morrissey wrote:
On Sun, Feb 12, 2006 at 08:07:14PM +0100, Balazs Scheidler wrote:
Strange, it's been a while since I last profiled syslog-ng. What were your build options?
They're pretty stock. We used the Debian package for 1.6.9, adding 'debug' to DEB_BUILD_OPTIONS, which seems to use:
./configure --prefix=/ --mandir=/usr/share/man --sysconfdir=/etc \
  --enable-debug
and '-g -Wall' for CFLAGS. We added -pg manually.
It should be ok. Maybe some kind of kernel patch (grsec or something like that?)
Nope, it's essentially the stock Debian 2.6.8 kernel (the only changes we make are to add drivers and optimize for the Pentium IV Xeon CPU).
the problem is that it is not so simple, this code is in libol, it is generic and the program relies on this behaviour at a lot of places. (to request a read callback, it simply changes the value for a want_read value, no functions called to update the pollfd state)
FWIW, I modified syslog-ng 1.6.9 to usleep(10000) at the end of the io_iter() loop in src/main.c. This dropped CPU consumption by about fourfold - instead of spinning in that io_iter() loop, reading one or two messages per poll(), it's now reading several (3-15) messages per poll().

Increasing the time to 20000 usec cuts CPU in half again - to about 1/8 of the original consumption. It's now using ~10-15% of a CPU under the same heavy log load from ~750 connections. Granted, this increases latency, but I don't think 10 or 20 msec delays will kill anything.

john
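(In code terms the change amounts to little more than the following; this is a paraphrase rather than the actual src/main.c modification, and io_iter_once() is a stand-in for libol's real loop body.)

/* Paraphrase of the throttling hack described above (not the real diff):
 * sleep briefly after each io_iter() pass so more messages accumulate and
 * each poll() wakeup handles a batch instead of one or two lines.
 */
#include <unistd.h>

#define LOOP_SLEEP_USEC 10000          /* 10 ms; 20 ms was also tried */

static void io_iter_once(void)
{
    /* stand-in for libol's io_iter(): poll registered fds and dispatch */
}

static void main_loop_sketch(void)
{
    for (;;) {
        io_iter_once();
        usleep(LOOP_SLEEP_USEC);       /* trade a little latency for batching */
    }
}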
On Wed, 2006-02-15 at 14:28 -0500, John Morrissey wrote:
On Tue, Feb 14, 2006 at 10:48:33AM +0100, Balazs Scheidler wrote:
On Mon, 2006-02-13 at 10:57 -0500, John Morrissey wrote:
On Sun, Feb 12, 2006 at 08:07:14PM +0100, Balazs Scheidler wrote: the problem is that it is not so simple, this code is in libol, it is generic and the program relies on this behaviour at a lot of places. (to request a read callback, it simply changes the value for a want_read value, no functions called to update the pollfd state)
FWIW, I modified syslog-ng 1.6.9 to usleep(10000) at the end of the io_iter() loop in src/main.c. This dropped CPU consumption by about fourfold - instead of spinning in that io_iter() loop, reading one or two messages per poll(), it's now reading several (3-15) messages per poll().
Increasing the time to 20000 usec cuts CPU in half again - to about 1/8 of the original consumption. It's now using ~10-15% of a CPU under the same heavy log load from ~750 connections. Granted, this increases latency, but I don't think 10 or 20 msec delays will kill anything.
The idea is cool, although I have to admit it is really a hack :)

This might be useful to others as well. I will probably add it as a global option. Thanks for tracking this down.

-- Bazsi
On Thu, Feb 16, 2006 at 09:56:30AM +0100, Balazs Scheidler wrote:
FWIW, I modified syslog-ng 1.6.9 to usleep(10000) at the end of the io_iter() loop in src/main.c. This dropped CPU consumption by about fourfold - instead of spinning in that io_iter() loop, reading one or two messages per poll(), it's now reading several (3-15) messages per poll().
Increasing the time to 20000 usec cuts CPU in half again - to about 1/8 of the original consumption. It's now using ~10-15% of a CPU under the same heavy log load from ~750 connections. Granted, this increases latency, but I don't think 10 or 20 msec delays will kill anything.
The idea is cool, although I have to admit it is really a hack :)
:-)
This might be useful to others as well. I will probably add it as a global option. Thanks for tracking this down.
I've been tinkering with different delay times. 40 msec greatly increases the number of descriptors per poll(), to 30-50, but decreases CPU utilization under very heavy log load (950 descriptors) to ~12%.

Do you have a guesstimate at what point I should stop, where the latency of the delay plus processing might be too much?

thanks,
john
On Thu, 2006-02-16 at 08:11 -0500, John Morrissey wrote:
On Thu, Feb 16, 2006 at 09:56:30AM +0100, Balazs Scheidler wrote:
I've been tinkering with different delay times. 40 msec greatly increases the number of descriptors per poll(), to 30-50, but decreases CPU utilization under very heavy log load (950 descriptors) to ~12%. Do you have a guesstimate at what point I should stop, where the latency of the delay plus processing might be too much?
Each channel in syslog-ng (at least in 1.6.x) is read-polled once the connection is opened. The increase in latency does not really affect the sender itself, as no application-layer acknowledgement is made (though I'm planning on this one); the only result of waiting more is that it increases the memory load, as received messages are stored in socket buffers. If you are using datagram sockets this might mean some more dropped messages; for stream sockets this might block sending applications after the window is filled.

The other effect is on output: outputs are also polled, files are written when poll indicates writability, and this also gets delayed because of the sleep, e.g. messages are kept in memory for a bit more time as well. I'm not sure this is really a problem, as messages are accumulated in the input buffers... I think the output side is affected only because the input becomes more bursty - a lot more messages are processed in a single poll loop - so you have to make sure your output fifo sizes are sized accordingly.

So I'd say for stream connections the negative impact is negligible, apart from the additional latency (timestamps + file writes). I would not recommend more than 0.1 sec (100 ms), as that might skew local timestamps significantly and would mean 100 ms worth of logs sitting in your input buffers.

-- Bazsi
Increasing the time to 20000 usec cuts CPU in half again - to about 1/8 of the original consumption. It's now using ~10-15% of a CPU under the same heavy log load from ~750 connections. Granted, this increases latency, but I don't think 10 or 20 msec delays will kill anything.
The idea is cool, although I have to admit it is really a hack :)
What's wrong with adding epoll, besides non-portability?

Also, 20 msec may not hurt latency-wise on 2.4.x kernels; however, 2.6.x kernels busy-loop with HZ>100. Reducing HZ to something sensible like 100 is better for servers anyway. Could the OP also put his .config online somewhere?
This might be useful to others as well. I will probably add it as a global option. Thanks for tracking this down.
Could you please oprofile syslog-ng with your setup and configuration? Also, if you have some spare cycles to burn, have oprofile annotate your source with the profiler results. I happen to find oprofile a bit better to use for independent task/thread profiling sometimes than pure gprof. YMMV. I don't have time right now, that's why I don't do it.

Regards,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
On Mon, 2006-02-20 at 01:41 +0100, Roberto Nibali wrote:
Increasing the time to 20000 usec cuts CPU in half again - to about 1/8 of the original consumption. It's now using ~10-15% of a CPU under the same heavy log load from ~750 connections. Granted, this increases latency, but I don't think 10 or 20 msec delays will kill anything.
The idea is cool, although I have to admit it is really a hack :)
What's wrong/the issue with adding epoll? Besides non-portability? Also 20msec may not hurt on 2.4.x kernels regarding latency however, 2.6.x kernels busy loop with HZ>100. Reducing HZ to something sensible like 100 is better for servers anyway. Could the OP also put his .config online somewhere?
The problem with epoll is that syslog-ng uses GLib for its main loop, and GLib does not support epoll; it does not seem to be trivial to add, either.

-- Bazsi
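(For context on why epoll is not a drop-in replacement: a custom GSource in GLib registers GPollFDs that the main context re-polls on every loop iteration, roughly as in the sketch below. This is only an illustration against the public GLib API, not syslog-ng code; the build line is an assumption.)

/* glib-poll.c - minimal custom GSource watching stdin (illustrative only).
 * Build (assumption): gcc glib-poll.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <unistd.h>

static GPollFD stdin_pfd;

static gboolean prepare_cb(GSource *source, gint *timeout)
{
    *timeout = -1;                          /* no timeout; just wait for the fd */
    return FALSE;
}

static gboolean check_cb(GSource *source)
{
    return (stdin_pfd.revents & G_IO_IN) != 0;
}

static gboolean dispatch_cb(GSource *source, GSourceFunc callback, gpointer user_data)
{
    char buf[1024];
    ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));  /* consume readable data */
    (void)n;
    return TRUE;                            /* keep the source installed */
}

static GSourceFuncs funcs = { prepare_cb, check_cb, dispatch_cb, NULL };

int main(void)
{
    GSource *src = g_source_new(&funcs, sizeof(GSource));

    stdin_pfd.fd = STDIN_FILENO;
    stdin_pfd.events = G_IO_IN;
    g_source_add_poll(src, &stdin_pfd);     /* re-polled on every iteration */
    g_source_attach(src, NULL);

    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}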
On Thu, Feb 16, 2006 at 09:56:30AM +0100, Balazs Scheidler wrote:
On Wed, 2006-02-15 at 14:28 -0500, John Morrissey wrote:
FWIW, I modified syslog-ng 1.6.9 to usleep(10000) at the end of the io_iter() loop in src/main.c. This dropped CPU consumption by about fourfold - instead of spinning in that io_iter() loop, reading one or two messages per poll(), it's now reading several (3-15) messages per poll().
The idea is cool, although I have to admit it is really a hack :)
This might be useful to others as well. I will probably add it as a global option. Thanks for tracking this down.
I pulled the time_sleep() changes from a recent snapshot (thanks!) and am backporting them to 1.6.x[1]. I noticed this hunk:

+  if (backend->configuration->time_sleep > 0)
+    {
+      struct timespec ts;
+
+      ts.tv_sec = 0;
+      ts.tv_nsec = backend->configuration->time_sleep * 1000;
+
+      nanosleep(&ts, NULL);

The docs say the time_sleep() value is in milliseconds, but 1 msec == 1 000 000 nsec - shouldn't this multiplication factor have an extra three zeros? :-)

john

[1] http://horde.net/~jwm/software/misc/10_configurable_sleep.diff
On Mon, 2006-03-13 at 14:33 -0500, John Morrissey wrote:
I pulled the time_sleep() changes from a recent snapshot (thanks!) and am backporting them to 1.6.x[1]. I noticed this hunk:
+  if (backend->configuration->time_sleep > 0)
+    {
+      struct timespec ts;
+
+      ts.tv_sec = 0;
+      ts.tv_nsec = backend->configuration->time_sleep * 1000;
+
+      nanosleep(&ts, NULL);
The docs say the time_sleep() value is in milliseconds, but 1 msec == 1 000 000 nsec - shouldn't this multiplication factor have an extra three zeros? :-)
Thanks for spotting this, I've just committed a fix. -- Bazsi
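(For anyone backporting the same change: the fix is presumably just the multiplier, i.e. milliseconds converted to nanoseconds with a factor of 1,000,000, along these lines. This is a sketch of the arithmetic, not the committed code.)

/* Sketch of the corrected ms -> ns conversion; not the actual commit. */
#include <time.h>

static void sleep_milliseconds(long msec)
{
    struct timespec ts;

    /* tv_nsec must stay below 1,000,000,000, so msec should be < 1000;
     * the values discussed in this thread (10-100 ms) are well within that. */
    ts.tv_sec  = 0;
    ts.tv_nsec = msec * 1000000L;      /* 1 ms == 1,000,000 ns */
    nanosleep(&ts, NULL);
}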
On Mon, 2006-03-13 at 14:33 -0500, John Morrissey wrote:
On Thu, Feb 16, 2006 at 09:56:30AM +0100, Balazs Scheidler wrote:
On Wed, 2006-02-15 at 14:28 -0500, John Morrissey wrote:
FWIW, I modified syslog-ng 1.6.9 to usleep(10000) at the end of the io_iter() loop in src/main.c. This dropped CPU consumption by about fourfold - instead of spinning in that io_iter() loop, reading one or two messages per poll(), it's now reading several (3-15) messages per poll().
The idea is cool, although I have to admit it is really a hack :)
This might be useful to others as well. I will probably add it as a global option. Thanks for tracking this down.
[1] http://horde.net/~jwm/software/misc/10_configurable_sleep.diff
I have integrated this into my CVS tree with the change to use 1000000 as multiplier instead of 1000. Tomorrow's snapshot should contain the change. -- Bazsi
On Sun, Feb 12, 2006 at 08:07:14PM +0100, Balazs Scheidler wrote:
Can you check if the attached patch decreases your load? It avoids getting back to the poll loop after every log message and loops on the currently processed fd until it either returns EAGAIN or it reaches a hard-coded number. (which is currently 30)
Going back to this patch for a minute...

I'm now tuning our syslog hub, running the same 1.6.x. All of our servers forward their logs to this hub, mainly via UDP but there are a handful of TCP connections. This instance of syslog-ng is consuming ~30-35% of a CPU.

I would like to get it lower, which is probably a reasonable goal given the gains seen on the other machines (~80-90% of a CPU down to ~15%). I haven't profiled it yet, but strace(1) shows this one in a tight:

  poll()
  recvfrom()
  poll()

loop. It seems poll() overhead could be cut with this patch that you posted (decrease-poll-load.diff). However, it looks like this code hunk in src/sources.c:

122   if (closure->dgram || (!eol && closure->pos == closure->max_log_line)) {
[snip]
137     do_handle_line(closure, closure->pos, closure->buffer, salen ? (abstract_addr *) &sabuf : NULL, salen);
138     closure->pos = 0;
139     return ST_OK | ST_GOON;

doesn't treat datagram sockets the same as streams - instead of reading until a threshold is hit or we receive EAGAIN, a single datagram is read and processed and it returns, triggering another io_recv(). Could this be changed to give datagrams the same treatment as streams?

thanks,
john
On Mon, 2006-03-13 at 15:41 -0500, John Morrissey wrote:
On Sun, Feb 12, 2006 at 08:07:14PM +0100, Balazs Scheidler wrote:
Can you check if the attached patch decreases your load? It avoids getting back to the poll loop after every log message and loops on the currently processed fd until it either returns EAGAIN or it reaches a hard-coded number. (which is currently 30)
Going back to this patch for a minute...
I'm now tuning our syslog hub, running the same 1.6.x. All of our servers forward their logs to this hub, mainly via UDP but there are a handful of TCP connections. This instance of syslog-ng is consuming ~30-35% of a CPU.
I would like to get it lower, which is probably a reasonable goal given the gains seen on the other machines (~80-90% of a CPU down to ~15%). I haven't profiled it yet, but strace(1) shows this one in a tight:
  poll()
  recvfrom()
  poll()
loop. It seems poll() overhead could be cut with this patch that you posted (decrease-poll-load.diff). However, it looks like this code hunk in src/sources.c:
122   if (closure->dgram || (!eol && closure->pos == closure->max_log_line)) {
[snip]
137     do_handle_line(closure, closure->pos, closure->buffer, salen ? (abstract_addr *) &sabuf : NULL, salen);
138     closure->pos = 0;
139     return ST_OK | ST_GOON;
doesn't treat datagram sockets the same as streams - instead of reading until a threshold is hit or we receive EAGAIN, a single datagram is read and processed and it returns, triggering another io_recv(). Could this be changed to give datagrams the same treatment as streams?
The latest 1.9.x snapshot already has this behaviour, although I'd understand you hesitate going from 1.6.x -> 1.9.x in a sudden move.

What about the following (compile-tested) patch:

diff -u -r1.37.4.8 sources.c
--- sources.c   23 Feb 2006 17:44:46 -0000      1.37.4.8
+++ sources.c   13 Mar 2006 23:17:04 -0000
@@ -117,7 +117,7 @@
   if (closure->pad_size && eol) {
     do_handle_line(closure, eol - closure->buffer, closure->buffer, salen ? (abstract_addr *) &sabuf : NULL, salen);
     closure->pos = 0;
-    return ST_OK | ST_GOON;
+    goto next_fetch;
   }
   if (closure->dgram || (!eol && closure->pos == closure->max_log_line)) {
     /* we don't have a terminating nl nor \0, and our buffer is
@@ -136,7 +136,7 @@
   }
   do_handle_line(closure, closure->pos, closure->buffer, salen ? (abstract_addr *) &sabuf : NULL, salen);
   closure->pos = 0;
-  return ST_OK | ST_GOON;
+  goto next_fetch;
 }
 start = closure->buffer;
 while (eol) {
@@ -162,6 +162,8 @@
 }
 memmove(closure->buffer, start, &closure->buffer[closure->pos] - start);
 closure->pos = &closure->buffer[closure->pos] - start;
+
+ next_fetch:
 fetch_count++;
 }
 return ST_OK | ST_GOON;

-- Bazsi
On Tue, Mar 14, 2006 at 12:19:52AM +0100, Balazs Scheidler wrote:
The latest 1.9.x snapshot already has this behaviour, although I'd understand you hesitate going from 1.6.x -> 1.9.x in a sudden move.
What about the following (compile-tested) patch:
This looks good and is working well - thanks!

One more thing I've noticed: syslog-ng is making huge numbers of time(2) calls. Our hub is calling time() 14-18,000 times/second(!) at peak. Is there a way to optimize this away, or is 1.9.x more frugal in calling it?

I'm thinking a global time_t might be useful, with periodic updates triggered by a timer every n msec. I'm not sure what guarantees are provided for timers, though, especially under heavy load. Thoughts?

Thanks again for all your help.

john
On Tue, 2006-03-14 at 16:07 -0500, John Morrissey wrote:
On Tue, Mar 14, 2006 at 12:19:52AM +0100, Balazs Scheidler wrote:
The latest 1.9.x snapshot already has this behaviour, although I'd understand you hesitate going from 1.6.x -> 1.9.x in a sudden move.
What about the following (compile-tested) patch:
This looks good and is working well - thanks!
One more thing I've noticed: syslog-ng is making huge numbers of time(2) calls. Our hub is calling time() 14-18,000 times/second(!) at peak. Is there a way to optimize this away, or is 1.9.x more frugal in calling it?
Maybe it calls somewhat less, but not significantly as it still calls time() for each message, and maybe even multiple times. How expensive do you think calling time() is?
I'm thinking a global time_t might be useful, with periodic updates triggered by a timer every n msec. I'm not sure what guarantees are provided for timers, though, especially under heavy load. Thoughts?
A global variable updated in every poll loop might be OK, but timers otherwise are not too reliable. To make things worse, syslog-ng 1.9.9 supports second fractions (i.e. individual messages have different timestamps), so using a global variable would be bad there.

-- Bazsi
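(The "update a cached value once per poll loop" idea looks roughly like this; purely illustrative, not how syslog-ng is implemented, and as noted above it would be wrong for versions with sub-second timestamps.)

/* Illustrative only: cache the result of time(2) and refresh it once per
 * poll-loop iteration, so per-message code can read it without a syscall.
 * This trades sub-second timestamp accuracy for far fewer time() calls.
 */
#include <time.h>

static time_t cached_now;

/* Call once at the top of each poll-loop iteration. */
static void refresh_cached_time(void)
{
    cached_now = time(NULL);
}

/* Per-message code reads the cached value instead of calling time(2). */
static time_t message_timestamp(void)
{
    return cached_now;
}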
On Tue, Mar 14, 2006 at 11:42:48PM +0100, Balazs Scheidler wrote:
One more thing I've noticed: syslog-ng is making huge numbers of time(2) calls. Our hub is calling time() 14-18,000 times/second(!) at peak. Is there a way to optimize this away, or is 1.9.x more frugal in calling it?
Maybe it calls somewhat less, but not significantly as it still calls time() for each message, and maybe even multiple times.
How expensive do you think calling time() is?
I did some crude benchmarking (running time() in a for loop) and it seems each call consumes about 5 usec of time on the machines I tried it on (~2.5GHz Xeons running Debian sarge). Most of the time consumed is in system state.

__GI_time() isn't very high in oprofile sampling (~1.8% of samples), so it doesn't seem that expensive. I'll leave it alone for now. :-)

Thanks for all your help!

samples  %        app name       symbol name
239113   21.6752  syslog-ng      io_iter
 67510    6.1197  libc-2.3.2.so  strlen
 42656    3.8667  syslog-ng      do_prepare_write
 38041    3.4484  libc-2.3.2.so  re_search_internal
 37159    3.3684  libc-2.3.2.so  _int_malloc
 33938    3.0764  syslog-ng      affile_find_writer
 30920    2.8028  libc-2.3.2.so  strptime_internal
 30305    2.7471  syslog-ng      prepare
 27297    2.4744  libc-2.3.2.so  transit_state
 23447    2.1254  syslog-ng      parse_log_msg
 23210    2.1039  libc-2.3.2.so  memcpy
 20568    1.8645  syslog-ng      c_vformat_write
 19815    1.7962  libc-2.3.2.so  __GI_time
 19386    1.7573  libc-2.3.2.so  __offtime
 17908    1.6233  libc-2.3.2.so  strcmp
 16430    1.4894  syslog-ng      c_vformat_length
 15923    1.4434  libc-2.3.2.so  check_matching
 15899    1.4412  syslog-ng      expand_macros
 15876    1.4391  libc-2.3.2.so  __GI_getenv
 15568    1.4112  libc-2.3.2.so  memset
 13982    1.2674  libc-2.3.2.so  build_upper_buffer
 13210    1.1975  syslog-ng      do_distribute_log
 12668    1.1483  syslog-ng      format_size_in_decimal
 12397    1.1238  libc-2.3.2.so  __memchr
 12311    1.1160  libc-2.3.2.so  _IO_vfprintf_internal
 11740    1.0642  libc-2.3.2.so  re_string_reconstruct
 11189    1.0143  libc-2.3.2.so  __cfree

john
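(The crude benchmark being described is essentially a loop like the one below - a sketch of the idea, not the code actually used; it measures the average wall-clock cost per time() call with gettimeofday().)

/* Rough sketch of a time(2) micro-benchmark like the one described above. */
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
    const long iterations = 1000000L;
    struct timeval start, end;
    volatile time_t sink = 0;          /* keep the loop from being optimized out */
    double elapsed_us;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++)
        sink = time(NULL);
    gettimeofday(&end, NULL);
    (void)sink;

    elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
               + (end.tv_usec - start.tv_usec);
    printf("%.3f usec per time() call\n", elapsed_us / iterations);
    return 0;
}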
On Fri, 2006-02-10 at 16:11 -0500, John Morrissey wrote:
I'm trying to profile syslog-ng on a set of servers that generate high levels of logging traffic (a few GB/day). Currently, syslog-ng is consuming anywhere from 40-80% of a CPU on a dual Pentium IV Xeon machine.
gprof(1) generates a profile, except for time consumed - it says "no time accumulated" even after running a profile-enabled syslog-ng for several minutes. For example:
Flat profile:
Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00      667     0.00     0.00  xalloc
 0.00      0.00      0.00      593     0.00     0.00  prepare
 0.00      0.00      0.00      434     0.00     0.00  gc_mark
 0.00      0.00      0.00      372     0.00     0.00  ol_string_free
 0.00      0.00      0.00      313     0.00     0.00  do_prepare_write
I've found similar complaints in the binutils list archives:
http://sourceware.org/ml/binutils/2005-07/msg00364.html
http://sourceware.org/ml/binutils/2002-01/msg00121.html
http://sourceware.org/ml/binutils/2001-10/msg00315.html
Personally, we're using glibc 2.3.2.ds1-22 and building the Debian syslog-ng 1.6.9-1 package using gcc 3.3.5-3 or 3.4.3-13.
Strange, it's been a while since I last profiled syslog-ng. What were your build options?
I've also tried oprofile, but it only(?) seems to report call counts, not time consumed. For example:
samples  %        symbol name
102829   67.7349  io_iter
 16560   10.9083  prepare
  2975    1.9597  parse_log_msg
  2518    1.6586  expand_macro
I've read io_iter() and it doesn't seem to contain any smoking guns that would account for 68% of total syslog-ng CPU usage.
It would be nice to know how blocking affects this; io_iter is the central poll loop, and unless syslog-ng is spinning for some reason it should be really low in the profile. Hm... as I see you have max-connections set to 2000, how many connections are there in reality? Can you check with lsof -p <syslog-ng pid> how many open fds syslog-ng is polling?

io_iter basically traverses all registered fds and constructs a "struct pollfd" array every time it is invoked. If there are a lot of fds, this might be a problem.
Any suggestions on how to profile syslog-ng? I can post my configuration if it'd be helpful, but we've twisted many knobs (gc idle/busy thresholds, buffering via different sync() levels, disabling remote syslogging via UDP, etc.) without a major change in CPU consumption.
I would say call-graph enabled oprofile should do it. My first guess is that it does not handle 1000s of connections too well, especially as that list is traversed every time a single message is received. -- Bazsi
participants (5):

- Andrew Meyer
- Balazs Scheidler
- John Morrissey
- Kevin
- Roberto Nibali