Solaris 10 UDP overflows, message drops
All,

I've done a lot of reading, and I can't figure out what I can do to this config in order to fix the UDP drops reported as udpInOverflows by netstat -s. Here are some statistics on the amount of traffic we receive via syslog-ng. It's pretty busy, but in my reading I'm finding that some folks are handling much more. These stats are based on a ~30 second window of traffic during peak times, though our traffic doesn't vary much with time of day. I used tcpdump with a BPF filter to capture only inbound udp/514, so this is what the interface is seeing in the way of syslog.

Elapsed: 00:00:34
Packets: 200000
Avg. packets/sec: 5836.546
Avg. packet size: 303.182 bytes
Bytes: 60636477
Avg. bytes/sec: 1769537.884
Avg. MBit/sec: 14.156

So, about 6k messages per second. Here are the drop numbers over a time sample (taken right after a process restart; you can see the buffer takes a moment to fill up [64 MB so_rcvbuf]):

# while true; do echo -en "$(date) :: "; netstat -s | grep udpInOverflows | head -n 1 | sed 's|.*=||'; sleep 10; done
Fri Apr 15 14:12:46 GMT 2011 :: 472517477
Fri Apr 15 14:12:56 GMT 2011 :: 472517477
Fri Apr 15 14:13:06 GMT 2011 :: 472517477
Fri Apr 15 14:13:16 GMT 2011 :: 472517477
Fri Apr 15 14:13:26 GMT 2011 :: 472543152
Fri Apr 15 14:13:36 GMT 2011 :: 472592800
Fri Apr 15 14:13:46 GMT 2011 :: 472638848
Fri Apr 15 14:13:56 GMT 2011 :: 472684407

So that's about 5k overflows a second, which jibes with our calculations, suggesting we're getting only ~10% of our messages logged to disk.

I inherited a config with _very_ many filter statements, but decided to cut all of that out to see whether the UDP drops continue (they do). I've attached a sanitized config to this message, and everything here was measured with that config running (I thought eliminating the filters would really help, but it didn't).

We're running Solaris 10 SPARC. The syslog-ng version is:

# /usr/local/sbin/syslog-ng -V
syslog-ng 3.1.2
Installer-Version: 3.1.2
Revision: ssh+git://bazsi@git.balabit//var/scm/git/syslog-ng/syslog-ng-ose--mainline--3.1#master#8bf13c304b6ab5fc1a372b49d55c78370efe14ca
Compile-Date: Oct 25 2010 23:56:18
Enable-Threads: off
Enable-Debug: off
Enable-GProf: off
Enable-Memtrace: off
Enable-Sun-STREAMS: on
Enable-Sun-Door: on
Enable-IPv6: on
Enable-Spoof-Source: on
Enable-TCP-Wrapper: off
Enable-SSL: on
Enable-SQL: off
Enable-Linux-Caps: off
Enable-Pcre: on

The following options are set for the OS:

# ndd /dev/udp udp_max_buf
1073741824
# ndd /dev/udp udp_recv_hiwat
65536

Some options lines from the config based on what I've seen:

* note the TCP stuff can be safely ignored, it's legacy from some testing but isn't currently seeing traffic
* all 3 udp sources set with so_rcvbuf(67108864) (64 MB)

options {
    # things I've changed/tweaked
    flush_lines(1000);
    flush_timeout(20);
    log_fifo_size (67108864);
    log_msg_size(8192);
    chain_hostnames(yes);
    # end my changes
    <snip>
};

So I'm totally stumped. I can set the buffers with so_rcvbuf() to 1 GB and it still doesn't matter: they eventually fill up and I start losing packets. I'm hoping that someone can point me to some tweaks that will reduce or eliminate the drops. Is it unreasonable to expect to be able to process this many messages per second via UDP? Maybe that's the problem. I might experiment with the default syslog to see if it can write this many messages without drops. This doesn't seem like an insane amount of traffic, but perhaps my expectations are unrealistic; that's what I'm hoping someone can tell me.

Regards,

--Mike
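As a cross-check on the numbers in this message, the overflow rate can be worked out directly from the counter samples. The sketch below is only arithmetic on the figures quoted above; the variable and function names are made up for illustration, and treating the first non-zero interval as a partial one (the buffer was still filling) is an assumption.

```python
# Counter values copied from the netstat loop output above, sampled every 10 s.
samples = [
    472517477, 472517477, 472517477, 472517477,
    472543152, 472592800, 472638848, 472684407,
]
INTERVAL_S = 10
INCOMING_PPS = 5836.546  # "Avg. packets/sec" from the tcpdump capture


def overflow_rates(counters, interval_s):
    """Per-second overflow rate between each pair of consecutive samples."""
    return [(b - a) / interval_s for a, b in zip(counters, counters[1:])]


rates = overflow_rates(samples, INTERVAL_S)

# Assumption: the first non-zero interval is partial (the socket buffer was
# still filling), so it is left out of the steady-state average.
steady = [r for r in rates if r > 0][1:]
steady_rate = sum(steady) / len(steady)
drop_fraction = steady_rate / INCOMING_PPS

print("overflows/sec per interval:", rates)
print("steady-state overflows/sec: %.1f" % steady_rate)
print("estimated share of datagrams dropped: %.0f%%" % (100 * drop_fraction))
```

On these samples the steady-state rate comes out a little under 5k overflows per second, i.e. roughly four out of five incoming datagrams, which is the same order of loss as the estimate in the message.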
Probably you need to adjust so_sndbuf and so_rcvbuf:

http://www.balabit.com/sites/default/files/documents/syslog-ng-ose-v3.2-guid...

That should make it run better.

Matthew.

On Fri, Apr 15, 2011 at 10:52:59AM -0400, Mishou Michael wrote:
<snip>
Matthew,

Thanks for the suggestion. I'm not using so_sndbuf anywhere in this configuration, just receiving and writing directly to disk. As for so_rcvbuf, I've already tried that per the initial message, no dice. Even if I run a so_rcvbuf size that is 10 times the recommended value in the configuration note you linked to, it still fills up and then drops/udpInOverflows start to occur at the rate of about 5k/sec.

Is there something else I'm missing in the config perhaps? Setting so_rcvbuf to a 64 MB buffer only delays the problem for a few seconds until the buffer fills again. If I set it to 1 GB (tried this, have a ton of RAM to work with) it delays the problem for about 10 minutes, then the drops start. It seems as if the buffer is not being emptied fast enough, but the CPU is by no means pegged by syslog-ng.

I left out the resources I have to work with on this system, and how bad/good things are with syslog-ng running (and dropping), so I'll include those now. As you can see, it's an older server, but it has a ton of RAM and the CPUs should have enough pop for this, I think.

# uname -a
SunOS ms00310 5.10 Generic_127111-10 sun4u sparc SUNW,Sun-Fire-V490 Solaris

# psrinfo -v | grep MHz
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,
The sparcv9 processor operates at 1350 MHz,

# swap -s
total: 4042128k bytes allocated + 967184k reserved = 5009312k used, 48662184k available

# ps -e -o pcpu -o pid -o user -o args | grep syslog
0.0 70 root vxconfigd -x syslog -m boot
0.0 6110 root grep syslog
7.7 22802 root /usr/local/sbin/syslog-ng -f /etc/crap_config.txt -p /var/run/syslog-ng
0.0 22801 root /usr/local/sbin/syslog-ng -f /etc/crap_config.txt -p /var/run/syslog-ng

# top -b -n 5
last pid: 6355; load avg: 1.36, 1.34, 1.34; up 58+23:59:11 17:37:27
94 processes: 91 sleeping, 3 on cpu
CPU states: 82.1% idle, 7.0% user, 10.9% kernel, 0.0% iowait, 0.0% swap
Memory: 32G phys mem, 16G free mem, 32G total swap, 32G free swap

  PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
22802 root       2  50    0 3067M 3063M cpu/2   79:50  7.81% syslog-ng
29459 root      15  40    0  193M  166M cpu/1  150.5H  4.49% issCSF
 6352 root       1  55    0 3376K 2032K cpu/19   0:00  0.20% top
 4229 root      82  59    0  327M  324M sleep  661:05  0.17% java
 2695 root       6  59    0 8000K 2984K sleep  802:59  0.11% rmserver

I'm just not sure what to do next to troubleshoot. I'm hoping someone here can point me in the right direction, or at least confirm that they are running syslog-ng in a similar configuration without drops so I know that it's at least possible?

Regards,

--Mike

-----Original Message-----
From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Matthew Hall
Sent: Friday, April 15, 2011 12:12 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops

Probably you need to adjust so_sndbuf and so_rcvbuf:

http://www.balabit.com/sites/default/files/documents/syslog-ng-ose-v3.2-guide-admin-en.html/index.html-single.html#reference_source_tcpudp

That should make it run better.

Matthew.

On Fri, Apr 15, 2011 at 10:52:59AM -0400, Mishou Michael wrote:
<snip>
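The fill times Mike describes (drops starting well under a minute after a restart with 64 MB, and after about 10 minutes with 1 GB) can be compared against a simple model: buffer size divided by the net rate at which bytes accumulate. The sketch below is a back-of-the-envelope estimate, not a description of how the Solaris UDP stack accounts for buffer space. The function name and the `drain_fraction` parameter are hypothetical, and per-datagram kernel overhead is ignored.

```python
ARRIVAL_BYTES_PER_SEC = 1769537.884  # "Avg. bytes/sec" from the tcpdump capture


def fill_seconds(buffer_bytes, arrival_bps, drain_fraction=0.0):
    """Seconds until a receive buffer of buffer_bytes is full.

    drain_fraction is the (assumed) share of incoming bytes that the
    reader manages to consume; 0.0 means nothing is being read at all.
    """
    net_rate = arrival_bps * (1.0 - drain_fraction)
    if net_rate <= 0:
        return float("inf")  # the reader keeps up, the buffer never fills
    return buffer_bytes / net_rate


for label, size in (("64 MB", 64 * 2**20), ("1 GB", 2**30)):
    no_drain = fill_seconds(size, ARRIVAL_BYTES_PER_SEC)
    some_drain = fill_seconds(size, ARRIVAL_BYTES_PER_SEC, drain_fraction=0.15)
    print("%s: %.0f s with no reads, %.0f s if 15%% is drained"
          % (label, no_drain, some_drain))
```

With these inputs a 1 GB buffer fills in roughly ten to twelve minutes, close to what was observed. That is consistent with the reader consuming only a small fraction of the incoming traffic, so a larger buffer can only postpone the overflow, not prevent it.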
Hi Mike,

I'm heading out of town on a trip, so not enough time to read the whole thread. You may or may not have tried some of this, but I had similar issues a while back and noted it here:

http://nms.gdd.net/index.php/Install_Guide_for_LogZilla_v3.1#UDP_Buffers

Hope it helps :-)

Clayton Dukes

On Fri, Apr 15, 2011 at 2:01 PM, Mishou Michael <Michael.Mishou@csirc.irs.gov> wrote:
<snip>
On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
I left out the resources I have to work with on this system, and how bad/good things are with syslog-ng running (and dropping), I'll include those now. As you can see, it's an older server, but it has a ton of RAM and the CPUs should have enough pop for this I think.
I'm just not sure what to do next to troubleshoot. I'm hoping someone here can point me in the right direction, or at least confirm that they are running syslog-ng in a similar configuration without drops so I know that it's at least possible?
Regards,
--Mike
I think the next suspect would be the disks. Can you disable anything that writes to disk or tell it to write to /dev/null and see if it still blows up? Also, it's Solaris, so you could start using some of the dtrace scripts to look for what syscalls / other ops are running too slow, and when it gets stuck what type of socket / disk file / what IO is it doing? Matthew.
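One rough way to sanity-check the disk theory outside of syslog-ng is to time how fast the box can append lines of about the captured message size to the real log filesystem, and compare that with the ~6k messages/sec arrival rate. The sketch below is an assumption-laden stand-in for the /dev/null experiment, not a replacement for it: the function name is hypothetical, the 303-byte line size comes from the tcpdump average, and Python's buffered writes do not behave like syslog-ng's write path.

```python
import os
import tempfile
import time


def write_rate(path, n_messages=20000, msg_size=303):
    """Append n_messages lines of msg_size bytes to path; return lines/sec."""
    line = b"x" * (msg_size - 1) + b"\n"
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_messages):
            f.write(line)
        f.flush()
        try:
            os.fsync(f.fileno())  # include the time to reach the disk
        except OSError:
            pass  # some special files (e.g. the null device) may not support fsync
    elapsed = time.perf_counter() - start
    return n_messages / elapsed


if __name__ == "__main__":
    # Baseline with no disk involved, then a scratch file. Point the second
    # call at the filesystem that holds the logs to make the test meaningful.
    print("null device: %.0f lines/sec" % write_rate(os.devnull))
    with tempfile.TemporaryDirectory() as d:
        print("scratch file: %.0f lines/sec" % write_rate(os.path.join(d, "t.log")))
```

If the scratch-file figure on the log filesystem is far above 6k lines/sec, raw disk throughput is unlikely to be the choke point, and attention shifts to how the data is being read and written (for example, many small synchronous writes).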
I am experiencing the same problem with a Sun V490, except the server has about 16 GB of memory. We are using UDP and losing about 85% of the traffic. The udpInOverflows count is darn near equal to the total number of packets coming in. I am not at work now so cannot provide accurate statistics at this time. The NIC statistics are perfect; we aren't getting any errors in the UDP area, etc.

There is a kernel patch that came out about a week or two ago that deals with this area, but I have not yet applied it. I want to apply the patch first before adjusting other kernel parameters. We have Solaris 10, update 9. Version of syslog-ng is 3.1.2. It is really terrible.

By terrible, I mean the packet loss, not the product:)) It is probably something I don't have set up correctly.

Mike, check out that latest patch, it can't hurt. I had to open a case with Sun to find out about it:))

On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall@mhcomputing.net> wrote:
<snip>
Shot in the dark: have you checked to be sure there aren't checksum errors in the packets? Some kernels will drop bad checksummed packets.

On Fri, Apr 15, 2011 at 8:27 PM, Fred Connolly <fred.connolly@gmail.com> wrote:
<snip>
Checksum issues should trigger ifInDiscards type MIB counters instead of udpOverflow type counters. Hit the thing with syslog-ng's loggen utility and figure out when it keels over. There's GOT to be a choke point that can be found scientifically. Matthew. On Sat, Apr 16, 2011 at 10:25:53PM -0500, Martin Holste wrote:
Shot in the dark: have you checked to be sure there aren't checksum errors in the packets? Some kernels will drop bad checksummed packets.
On Fri, Apr 15, 2011 at 8:27 PM, Fred Connolly <fred.connolly@gmail.com> wrote:
I am experiencing the same problem with Sun V490 except the server has about 16gb memory. We are using UDP and losing about 85% of the traffic. The udpinoverflows is darn near equal to the total number of packets coming in. I am not at work now so cannot provide accurate statistics at this time. The NIC statistics are perfect, we aren't getting any errors with regards to the UDP area etc.
There is a kernel patch that came out about a week or two ago that deals in this area, but I have not yet applied it. I want to apply the patch first before adjusting other kernel parameters. We have Solaris 10, update 9. Version of syslog-ng is 3.1.2. It is really terrible.
By terrible, I mean the packet loss, not the product:)) It is probably something I don't have set up correctly.
Mike, check out that latest patch, it can't hurt. I had to open a case with Sun to find out about it:))
On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall@mhcomputing.net> wrote:
On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
I left out the resources I have to work with on this system, and how bad/good things are with syslog-ng running (and dropping), I'll include those now. As you can see, it's an older server, but it has a ton of RAM and the CPUs should have enough pop for this I think.
I'm just not sure what to do next to troubleshoot. I'm hoping someone here can point me in the right direction, or at least confirm that they are running syslog-ng in a similar configuration without drops so I know that it's at least possible?
Regards,
--Mike
I think the next suspect would be the disks. Can you disable anything that writes to disk or tell it to write to /dev/null and see if it still blows up?
Also, it's Solaris, so you could start using some of the dtrace scripts to look for what syscalls / other ops are running too slow, and when it gets stuck what type of socket / disk file / what IO is it doing?
Matthew.
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.campin.net/syslog-ng/faq.html
My apologies to Matthew and Martin. I did not intend to jump in and take over the thread. I just wanted to make Matthew and other Solaris 10 syslog-ng users aware that there is a recent kernel patch to address UDP issues. The patch number is 144488-11. It is a kernel patch, so you might as well apply the latest Recommended Patches, as it is included in that patch set.
I will take Matthew's good suggestion and check out the loggen utility. I was not aware that it existed and am going to look for the documentation as soon as I hit Send.
I will also open up a separate thread to avoid any confusion:))
On Sat, Apr 16, 2011 at 11:25 PM, Martin Holste <mcholste@gmail.com> wrote:
Shot in the dark: have you checked to be sure there aren't checksum errors in the packets? Some kernels will drop bad checksummed packets.
On Fri, Apr 15, 2011 at 8:27 PM, Fred Connolly <fred.connolly@gmail.com> wrote:
<snip>
Your reply was fine and perfectly valid. I didn't think you were hijacking the thread at all. If the patch could fix it, that's a perfectly OK topic to bring up.
I was just thinking of some general non-Solaris-specific techniques. Your Solaris-specific advice was a valuable contribution. So I hope you'll continue to keep reading and keep replying, and not be afraid to add to threads.
Regards,
Matthew.
On Sun, Apr 17, 2011 at 11:36:00AM -0400, Fred Connolly wrote:
<snip>
Fred,
Great find! For those following on the TV at home, here is the link to the patch notes that I found:
https://getupdates.oracle.com/readme/144488-11
Which contains this tantalizing tidbit:
6638967 UDP recv (think DNS) suffers from thundering herd problem (bug report for above: http://bit.ly/eD57KB+ )
I'm going to install this patch and see what comes of it. That certainly seems like it could be related.
Martin,
I checked based on Matthew's suggestion of the ipInDiscards counter incrementing, and it's not, so no dice with the checksum errors. Good call though!
Matthew,
Following your suggestions I set up all of the network-based destinations to be /dev/null and the problem persists. There is no change at all in how fast the buffers fill up or in how many overflows are generated per second. As for loggen, keep in mind that I was using that before to write to /dev/null and also directly to disk, and topped loggen out (locally) at ~8k msgs/sec, so I wonder if Fred isn't on the right track with some OS issues?
I used Google to find DTrace scripts for tracking syscalls, and one called procsystime stood out (http://www.brendangregg.com/DTrace/procsystime). Here's a sample of the output over about 20 seconds each time, while the overflows are happening. It's hard for me to tell if the write() syscall is really freaking slow here or if this is as it should be and not weird. Either way, write() doesn't seem to be the issue if you look at the /dev/null output, for which UDP overflows still occur.
This first one is the output for the simple config I posted earlier, writing to disk:
# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for processes syslog-ng,
  SYSCALL         TIME (ns)
  getpid             193600
  fchmod             443000
  bind               757700
  fcntl              786800
  setsockopt        1355700
  fchown            1446200
  connect           1542000
  so_socket         2626100
  close             4382500
  open64            4691400
  stat64            4957400
  brk              23978700
  pollsys         113785800
  llseek          158217200
  gtime           244735300
  recvfrom        404501900
  write         11343815700
  TOTAL:        12312217000
CPU Times for processes syslog-ng,
  SYSCALL         TIME (ns)
  getpid              42400
  fchmod             286700
  fcntl              341800
  bind               691000
  fchown            1163100
  setsockopt        1230300
  connect           1468100
  so_socket         2543900
  open64            4047800
  close             4075300
  stat64            4626600
  brk              17442100
  gtime            31685200
  llseek           88638300
  pollsys          92742600
  recvfrom        332658600
  write          3376428300
  TOTAL:         3960112100
Syscall Counts for processes syslog-ng,
  SYSCALL             COUNT
  bind                   19
  connect                19
  so_socket              19
  setsockopt             38
  fchmod                 48
  open64                 48
  getpid                 58
  close                  89
  fchown                 96
  stat64                 96
  fcntl                 172
  brk                  2342
  pollsys              4741
  recvfrom            22003
  llseek              22701
  write               51022
  gtime               85661
  TOTAL:             189172
And this second sample is for writing directly to /dev/null. You'll notice write() is taking a lot less time, which totally makes sense since disks are no longer being used (the overflows are still happening just as aggressively, however):
# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for processes syslog-ng,
  SYSCALL         TIME (ns)
  chmod               24400
  brk                 28900
  fchmod              45200
  fcntl               49000
  chown               70900
  close              154800
  fchown             165800
  mkdir              297700
  open64             797000
  stat64             846900
  pollsys          14180100
  llseek           25238500
  write            54238800
  gtime            59281800
  recvfrom         87493900
  nanosleep     17615931600
  TOTAL:        17858845300
CPU Times for processes syslog-ng,
  SYSCALL         TIME (ns)
  fcntl               20600
  chmod               20700
  brk                 22400
  fchmod              29400
  chown               62400
  fchown             134100
  close              137000
  mkdir              292500
  open64             774500
  stat64             799900
  pollsys          10720500
  llseek           12170100
  gtime            13406200
  write            18375500
  nanosleep        30881900
  recvfrom         72251100
  TOTAL:          160098800
Syscall Counts for processes syslog-ng,
  SYSCALL             COUNT
  chmod                   1
  mkdir                   1
  brk                     2
  chown                   2
  fchmod                  5
  open64                  5
  close                   6
  fchown                 10
  fcntl                  10
  stat64                 14
  pollsys               949
  nanosleep             950
  llseek               4774
  write                4774
  recvfrom             4776
  gtime               17891
  TOTAL:              34170
I will let you all know what happens post-patch. And if there are any suggestions for other things to try with DTrace, they are welcome, as I'm a novice with it and Solaris administration in general.
Thanks for all the excellent help so far!
--Mike
________________________________
From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Fred Connolly
Sent: Friday, April 15, 2011 9:28 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops
<snip>
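One way to read the two procsystime samples in Mike's message is to divide elapsed time by call count for the syscalls that matter. The sketch below does only that arithmetic; the numbers are copied from the samples, while the function names and the 20-second window (taken from "about 20 seconds") are assumptions for illustration.

```python
# (elapsed_ns, call_count) pairs copied from the two procsystime samples.
DISK = {
    "write": (11343815700, 51022),
    "recvfrom": (404501900, 22003),
}
DEVNULL = {
    "write": (54238800, 4774),
    "recvfrom": (87493900, 4776),
    "nanosleep": (17615931600, 950),
}
DEVNULL_TOTAL_ELAPSED_NS = 17858845300

def per_call_us(sample):
    """Average elapsed microseconds per call for each syscall in a sample."""
    return {name: ns / count / 1000.0 for name, (ns, count) in sample.items()}

def calls_per_sec(sample, name, window_sec):
    """Average call rate over a sampling window of window_sec seconds."""
    return sample[name][1] / window_sec

if __name__ == "__main__":
    for label, sample in (("disk", DISK), ("/dev/null", DEVNULL)):
        for name, us in sorted(per_call_us(sample).items()):
            print(f"{label:10s} {name:10s} {us:10.1f} us/call")
    sleep_share = DEVNULL["nanosleep"][0] / DEVNULL_TOTAL_ELAPSED_NS
    print(f"nanosleep share of elapsed syscall time: {sleep_share:.1%}")
    # ASSUMPTION: 20 s window, per "about 20 seconds" in the message.
    rate = calls_per_sec(DEVNULL, "recvfrom", 20.0)
    print(f"recvfrom calls/sec (/dev/null run): {rate:.0f}")
```

On these figures a write() to disk averages about 222 microseconds and recvfrom() about 18 microseconds in both runs. In the /dev/null run, nanosleep() averages about 18.5 ms per call and accounts for roughly 98.6% of the elapsed syscall time, and recvfrom() runs only about 240 times per second, far below the ~5.8k packets/sec Mike measured on the wire. That fits his reading that write() is not where the time goes.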
EXCELLENT debug and measurement work. Based on this evidence and testing I vote in favor of a UDP issue a la Fred. Matthew. On Mon, Apr 18, 2011 at 11:43:50AM -0400, Mishou Michael wrote:
<snip>
However, reading the patch description, the fix seems to be thread-related, and syslog-ng before 3.3 is either single-threaded or two-threaded, with the second thread doing DB writes.
On Mon, Apr 18, 2011 at 08:57:41AM -0700, Matthew Hall wrote:
<snip>
For those following this thread, I have applied the "thundering herd" UDP patch and experienced no change in the drops experienced by syslog-ng 3.1.2. Sorry I took so long to respond, the patching was a much more time-involved process than I thought it would be. At this point, based on Michael Hocke's response, I'm thinking that perhaps there is just too much UDP traffic for single-threaded syslog-ng to deal with in light of what filtering and parsing it does up front (for macro usage). I'm going to experiment with syslog-ng and the loggen tool to find a point at which a single syslog-ng instance starts dropping inbound UDP traffic with a simple configuration writing to disk. Once I have that number, I have a few options: 1. Experiment with syslog-ng 3.3 and the new threaded code to see if I have performance gains. I'm hesitant to push Alpha code in production, if anyone has any experience with 3.3 in semi-production environment running consistently I'd love to hear it. 2. So I don't have to change the configuration on a lot of clients, use PF to rewrite incoming UDP messages from specific, busy clients to other syslog-ng listeners, configured exactly as my main instance (which will handle all the non-insanely-busy clients). I could run multiple listeners in this manner, and not need threading to take advantage of multiple processors, though obviously each process would still be limited to the magic number determined above. I have 10 or so really busy clients, so this is one solution I'm leaning towards if syslog-ng 3.1.2 can handle just one of them. 3. Give up on syslog-ng until 3.3, or move to some other solution. Not sure what I could do here, rsyslog is the other major contender I guess, not sure what gains I would get. Could also do native syslog server and post-process to different buckets/relay which is what we mainly use syslog-ng for. 4. Get a faster box (not likely to happen). If anyone has any thoughts on any of the above I'd love to hear them. 
Also, if this is unique to Solaris SPARC systems (similarly spec'd x86 Solaris systems having none of these limitations) I'd love to know that as well. Is there any way anyone knows to figure out at what point the SPARC is hitting a ceiling? The CPU is not pegged, so why would we be experiencing CPU-based drops? Maybe the code is not efficient for how SPARC does things, or how some syscall is implemented on Solaris? --Mike -------------------- Mike Mishou - IRS CSIRC (202) 283-2189 -- Desk (202) 384-7817 -- Mobile (202) 283-4809 -- 24x7 Hotline -----Original Message----- From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Mishou Michael Sent: Monday, April 18, 2011 11:44 AM To: Syslog-ng users' and developers' mailing list Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops Fred, Great find! For those following on the TV at home, here is the link to the patch notes that I found: https://getupdates.oracle.com/readme/144488-11 Which contains this tantalizing tidbit: 6638967 UDP recv (think DNS) suffers from thundering herd problem (bug report for above: http://bit.ly/eD57KB+ ) I'm going to install this patch and see what comes of it. That certainly seems like it could be related. Martin, I checked based on Matthew's suggestion of the ipInDiscards counter incrementing, and it's not, so no dice with the checksum errors, good call though! Matthew, Following your suggestions I set up all of the network-based destinations to be /dev/null and the problem persists, shows no change in terms of how fast the buffers fill up or how many overflows are generated per second at all. As for loggen, keep in mind that I was using that before to write to /dev/null and also directly to disk and topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't on the right track with some OS issues? 
I used Google to find scripts to run for Dtrace to track syscalls, and one called procsystime stood out (http://www.brendangregg.com/DTrace/procsystime)

Here's a sample of the output over about 20 seconds each time, while the overflows are happening. It's hard for me to tell if the write() syscall is really freaking slow here or if this is as it should be and not weird. Either way, write() doesn't seem to be the issue if you look at the /dev/null output for which UDP overflows still occur.

This first one is the output for the simple config I posted earlier, writing to disk:

# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes syslog-ng,

  SYSCALL     TIME (ns)
  getpid      193600
  fchmod      443000
  bind        757700
  fcntl       786800
  setsockopt  1355700
  fchown      1446200
  connect     1542000
  so_socket   2626100
  close       4382500
  open64      4691400
  stat64      4957400
  brk         23978700
  pollsys     113785800
  llseek      158217200
  gtime       244735300
  recvfrom    404501900
  write       11343815700
  TOTAL:      12312217000

CPU Times for processes syslog-ng,

  SYSCALL     TIME (ns)
  getpid      42400
  fchmod      286700
  fcntl       341800
  bind        691000
  fchown      1163100
  setsockopt  1230300
  connect     1468100
  so_socket   2543900
  open64      4047800
  close       4075300
  stat64      4626600
  brk         17442100
  gtime       31685200
  llseek      88638300
  pollsys     92742600
  recvfrom    332658600
  write       3376428300
  TOTAL:      3960112100

Syscall Counts for processes syslog-ng,

  SYSCALL     COUNT
  bind        19
  connect     19
  so_socket   19
  setsockopt  38
  fchmod      48
  open64      48
  getpid      58
  close       89
  fchown      96
  stat64      96
  fcntl       172
  brk         2342
  pollsys     4741
  recvfrom    22003
  llseek      22701
  write       51022
  gtime       85661
  TOTAL:      189172

And this second sample is for writing directly to /dev/null, you'll notice write() is taking a lot less time but that totally makes sense since disks are no longer being used (the overflows are still happening just as aggressively however):

# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes syslog-ng,

  SYSCALL     TIME (ns)
  chmod       24400
  brk         28900
  fchmod      45200
  fcntl       49000
  chown       70900
  close       154800
  fchown      165800
  mkdir       297700
  open64      797000
  stat64      846900
  pollsys     14180100
  llseek      25238500
  write       54238800
  gtime       59281800
  recvfrom    87493900
  nanosleep   17615931600
  TOTAL:      17858845300

CPU Times for processes syslog-ng,

  SYSCALL     TIME (ns)
  fcntl       20600
  chmod       20700
  brk         22400
  fchmod      29400
  chown       62400
  fchown      134100
  close       137000
  mkdir       292500
  open64      774500
  stat64      799900
  pollsys     10720500
  llseek      12170100
  gtime       13406200
  write       18375500
  nanosleep   30881900
  recvfrom    72251100
  TOTAL:      160098800

Syscall Counts for processes syslog-ng,

  SYSCALL     COUNT
  chmod       1
  mkdir       1
  brk         2
  chown       2
  fchmod      5
  open64      5
  close       6
  fchown      10
  fcntl       10
  stat64      14
  pollsys     949
  nanosleep   950
  llseek      4774
  write       4774
  recvfrom    4776
  gtime       17891
  TOTAL:      34170

I will let you all know what happens post-patch. And if there are any suggestions for other things to try with dtrace, they are welcome, as I'm a novice with it and Solaris administration in general.

Thanks for all the excellent help so far!

--Mike

________________________________
From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Fred Connolly
Sent: Friday, April 15, 2011 9:28 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops

I am experiencing the same problem with Sun V490 except the server has about 16gb memory. We are using UDP and losing about 85% of the traffic. The udpinoverflows is darn near equal to the total number of packets coming in. I am not at work now so cannot provide accurate statistics at this time.

The NIC statistics are perfect, we aren't getting any errors with regards to the UDP area etc. There is a kernel patch that came out about a week or two ago that deals in this area, but I have not yet applied it. I want to apply the patch first before adjusting other kernel parameters. We have Solaris 10, update 9. Version of syslog-ng is 3.1.2.

It is really terrible. By terrible, I mean the packet loss, not the product:)) It is probably something I don't have set up correctly.

Mike, check out that latest patch, it can't hurt. I had to open a case with Sun to find out about it:))

On Fri, Apr 15, 2011 at 3:45 PM, Matthew Hall <mhall@mhcomputing.net> wrote:

On Fri, Apr 15, 2011 at 02:01:50PM -0400, Mishou Michael wrote:
> I left out the resources I have to work with on this system, and how
> bad/good things are with syslog-ng running (and dropping), I'll include
> those now. As you can see, it's an older server, but it has a ton of
> RAM and the CPUs should have enough pop for this I think.
> I'm just not sure what to do next to troubleshoot. I'm hoping someone
> here can point me in the right direction, or at least confirm that they
> are running syslog-ng in a similar configuration without drops so I know
> that it's at least possible?
>
> Regards,
>
> --Mike

I think the next suspect would be the disks. Can you disable anything that writes to disk or tell it to write to /dev/null and see if it still blows up?

Also, it's Solaris, so you could start using some of the dtrace scripts to look for what syscalls / other ops are running too slow, and when it gets stuck what type of socket / disk file / what IO is it doing?

Matthew.

______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.campin.net/syslog-ng/faq.html
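One way to read the first procsystime sample above (the run writing to disk) is to divide each syscall's elapsed time by its call count. The sketch below does only that arithmetic, using the figures from the message. The helper names are made up for illustration, and the 20-second window is an assumption based on the "about 20 seconds" mentioned in the message.

```python
# Figures copied from the first procsystime sample (syslog-ng writing to disk).
ELAPSED_NS = {"write": 11_343_815_700, "recvfrom": 404_501_900}
CALLS = {"write": 51_022, "recvfrom": 22_003}


def avg_elapsed_us(syscall: str) -> float:
    """Average elapsed time per call, in microseconds (hypothetical helper)."""
    return ELAPSED_NS[syscall] / CALLS[syscall] / 1_000


def share_of_window(syscall: str, window_s: float = 20.0) -> float:
    """Fraction of the sampling window spent inside the syscall.

    window_s is an assumption: the message only says "about 20 seconds".
    """
    return ELAPSED_NS[syscall] / 1e9 / window_s


if __name__ == "__main__":
    for name in ("write", "recvfrom"):
        print(f"{name:9s} {avg_elapsed_us(name):8.1f} us/call "
              f"{share_of_window(name):6.1%} of window")
```

Under these assumptions write() averages roughly 222 microseconds per call and recvfrom() roughly 18 microseconds per call, and write() alone accounts for a bit over half of the assumed window. Whether that is slow for this hardware is exactly the open question in the message; the numbers only make the comparison easier to discuss.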
(A few preliminary answers follow - I'll have another look at this later tonight from home, once I tested a few things on my local solaris vm)

"Mishou Michael" <Michael.Mishou@csirc.irs.gov> writes:

> I'm going to experiment with syslog-ng and the loggen tool to find a point at which a single syslog-ng instance starts dropping inbound UDP traffic with a simple configuration writing to disk. Once I have that number, I have a few options:
>
> 1. Experiment with syslog-ng 3.3 and the new threaded code to see if I have performance gains. I'm hesitant to push Alpha code in production, if anyone has any experience with 3.3 in semi-production environment running consistently I'd love to hear it.

I've been running 3.3 on most systems I administer (2 of my own servers + a few I administer for friends; and all of my virtual machines). It's been serving me fine for the past 4 months now.

However, most of my systems are also linux systems, where syslog-ng is much better tested (and I'm not using UDP at all).

Personally, I'd give it a test run, as current 3.3 is fairly stable.

> 3. Give up on syslog-ng until 3.3, or move to some other solution. Not sure what I could do here, rsyslog is the other major contender I guess, not sure what gains I would get. Could also do native syslog server and post-process to different buckets/relay which is what we mainly use syslog-ng for.

I wouldn't consider rsyslog. It's a nightmare to maintain that, and an even bigger nightmare to get it to perform well in any but the most trivial situations. (Or it might be just me being too used to good documentation and readable config files, but I'm fairly sure it's not just that :P)

-- |8]
Gergely,

Thanks for any testing you can do. I'm not sure if a SPARC processor is an important testing component or not, I suppose your VMs will help determine this since you'll be using x86. If there's any testing I can do to help things along, please let me know.

Yes, I'm (very) scared of rsyslog as a maintainable solution, the configs for syslog-ng are *so* much easier to read and understand. I'll try 3.3 and report back how threading helps things out, I'm glad to hear that it's been pretty stable for you, that was my major concern in testing 3.3 since eventually we'll need this to be in production with our basic (from a config complexity standpoint) requirements.

I'll report back how 3.3 works out for me after I get it compiled and up today.

Regards,

--Mike
Just a heads up Mike. I tried doing the same thing with regards to using loggen to find the best rate on my V490. My version of loggen did not have the --active-connections parameter for sure, and I think it didn't have the --idle connection parameter either. I set the -I to 600 for 10 minutes, and that didn't work either. It ran until I manually killed it about 25 minutes later.

Then for the output all I got was : count=14877 diff=15930 rate = 627.75

I haven't found what they mean yet. I reckon count would be the number of packets sent, not sure what diff is, but I know what the msg/sec is:))

I am curious to see what you come up with. Oh, did you use the SunFreeware version or did you compile it yourself?
Zeek,

I didn't compile it myself, I'm using the 3.1.2 from sunfreeware.com. I'm actually having a heck of a time figuring out how to compile 3.3 from the alpha2 tarball on Solaris 10. I don't think I'm helping myself by having all the gcc tools installed from sunfreeware.com, maybe I need to start over. I'm so much more comfortable on Linux, where stuff just compiles magically and I don't have to do anything special.

When you are using loggen, you should write to disk on the receiving end and compare the number of messages received to messages sent. Clayton Dukes (on this list) has a good writeup of how to use loggen to generate some relevant performance numbers here: http://nms.gdd.net/index.php/Install_Guide_for_LogZilla_v3.1#UDP_Buffers

If I had to guess, --active-connections parameter wouldn't apply to UDP transport. Sounds like a TCP thing.

Hope this helps!

--Mike
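The comparison Mike describes (messages sent by loggen versus messages that actually reached disk on the receiver) comes down to simple arithmetic. A minimal sketch, assuming the destination file holds one message per line; the function names are hypothetical, and the sent count has to be taken from whatever the sending side reports.

```python
from pathlib import Path


def count_received(log_path: str) -> int:
    """Count messages in the receiver's log file, assuming one per line."""
    with Path(log_path).open("rb") as fh:
        return sum(1 for _ in fh)


def loss_percent(sent: int, received: int) -> float:
    """Percentage of sent messages that never reached the log file."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only (not measurements).
    print(f"{loss_percent(sent=100_000, received=15_000):.1f}% lost")
```

Repeating the run at increasing send rates and noting where the loss percentage leaves zero gives the kind of breaking-point number discussed earlier in the thread.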
Thanks Mike.. I am in the same boat as you are. I have the same hardware, same OS (Solaris 10, Update 9) etc. I also am using the sunfreeware version. One thing that kind of concerns me is the lack of response from other Sun users. There does not seem to be too many that use it for a central log host. Maybe we found out why:)) I too am going down the compiling 3.3 route and am hoping to start next week.

Agree with the active connections. I mentioned that because we have both. Also, I am going to get rid of the UDP logging since I'm dropping so many packets and move to TCP.

Thanks for that link. Looks pretty good!! Have you been able to get to the link in that document? The one that says Topics in High Performance Messaging? It is supposed to take you to the 29west.com site but I keep getting redirected to a different site. Don't know if I have malware or if the site is no longer around.

If I get around to compiling 3.3 before you do, I will post back here if you want and let you know how I did it, if you want. I have a friend that is pretty good at it and am hoping it isn't a big deal to do.

Fred
Zeek Anow <zeekstern@gmail.com> writes:

> If I get around to compiling 3.3 before you do, I will post back here if you want and let you know how I did it, if you want. I have a friend that is pretty good at it and am hoping it isn't a big deal to do.

FYI, I'd strongly suggest going with the 3.3 sources from git, as opposed to alpha2: there have been a few important fixes in between.

-- |8]
Thanks. Will do. What is git:))
Zeek Anow <zeekstern@gmail.com> writes:

> Thanks. Will do.
> What is git:))

A version control system :]

Come to think of it, it's probably easier for everyone if I prepare a tarball from Bazsi's latest version, with bells & whistles included.

I'll try not to forget, but if I don't post back with a link within a day, please remind me O:)

-- |8]
That would be great. Thanks!!
Mike - You asked earlier about if anyone knows a way to figure out when our pieces of junk starts to get pegged:)) Check this out:

http://blogs.balabit.com/2011/02/07/syslog-ng-performance-tuning

I know it is for TCP, but you can still probably get some use out of it. The only thing I question is why he never changed the log_iw_size.

Hope it helps..
Sorry for the extra posts. Can't think straight today:))

Are you using flow-control? I went back to your first couple of posts and didn't see it mentioned. At first I thought it was for only TCP, but then I found it in the manual under log path flags and it didn't say it was only for TCP.

flow-control: Enables flow-control to the log path, meaning that syslog-ng will stop reading messages from the sources of this log statement if the destinations are not able to process the messages at the required speed. If disabled, syslog-ng will drop messages if the destination queues are full. If enabled, syslog-ng will only drop messages if the destination queues/window sizes are improperly sized.
Hi,

UDP is a connection-less protocol, meaning that the sender has no knowledge if the messages reach the destination or not. In the perspective of flow-control, this means the following (someone please correct me if I'm wrong):

- when using flow-control on the client that sends messages via UDP, flow-control can slow down reading messages from the client's sources if the client cannot send out the messages fast enough. But the client does not know (hence it cannot slow down) if the server cannot handle the messages it receives.

- when using flow-control on the server that receives UDP messages, flow-control can slow down reading from the server's sources (thus probably cause the server to drop UDP messages) if the destination on the server (file, database, whatever) cannot handle the messages send to the destination fast enough.

Robert
On Fri, 2011-04-29 at 08:51 +0200, Fekete Robert wrote:

> Hi,
>
> UDP is a connection-less protocol, meaning that the sender has no knowledge if the messages reach the destination or not. In the perspective of flow-control, this means the following (someone please correct me if I'm wrong):
>
> - when using flow-control on the client that sends messages via UDP, flow-control can slow down reading messages from the client's sources if the client cannot send out the messages fast enough. But the client does not know (hence it cannot slow down) if the server cannot handle the messages it receives.

Except when it explicitly receives an ICMP port unreachable, in which case syslog-ng will cease sending. Port unreachable is not something you can count on, but it works when the issue is that the server is not running.

> - when using flow-control on the server that receives UDP messages, flow-control can slow down reading from the server's sources (thus probably cause the server to drop UDP messages) if the destination on the server (file, database, whatever) cannot handle the messages send to the destination fast enough.

And in this case it can help, since syslog-ng will not be actively dropping messages; rather, the kernel will do that, even before the messages reach syslog-ng.

-- Bazsi
Gergely Nagy <algernon@balabit.hu> writes:
> Zeek Anow <zeekstern@gmail.com> writes:
>> Thanks. Will do. What is git:))
> A version control system :]
> Come to think of it, it's probably easier for everyone if I prepare a tarball from Bazsi's latest version, with bells & whistles included.
http://static.madhouse-project.org/tmp/syslog-ng-3.3.0alpha2+20110426.tar.gz

sha1sum: 5d2bb56c9168e0e3eb95ee4342bd0171f6c24c5b
md5sum: d52ab242544366e5c6454a00e89b3ce6

Snapshot was taken from the state as of the 26th of April (the last change in the repository). Hopefully this'll be easier to compile than 3.3.0alpha2; a bunch of fixes went in that should help in this area.

(Apologies for not putting this on a .balabit.hu domain, that needs more administration than I was prepared to do for this task)

-- |8]
Gergely,

Got it, trying to compile now. Was having issues with the configure/compilation of ivykis with the alpha2 release. Will update everyone soon, was out end of last week unplanned. Thank you!

--Mike
All,

Sure enough, as Bazsi had suggested, the stopping point appears to be ivykis compilation on Solaris 10. I have a full suite of gcc tools (including automake, autoheader, autoconf, flex, gcc, etc.) installed on this machine from Sunfreeware.com. The configure proceeds without any errors. When I try to get it to compile, I get the following error:

make[7]: Entering directory `/root/syslog_tools/syslog-ng-3.3.0alpha2+20110426/lib/ivykis/modules'
/bin/bash ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I.. -D_GNU_SOURCE -I../lib/include -I../lib/include -I../lib -I../modules/include -fPIC -O3 -m32 -Wall -fPIC -O3 -m32 -D_REENTRANT -D_POSIX_C_SOURCE=199506L -D_POSIX_PTHREAD_SEMANTICS -D_XPG4_2 -MT iv_event.lo -MD -MP -MF .deps/iv_event.Tpo -c -o iv_event.lo iv_event.c
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -D_GNU_SOURCE -I../lib/include -I../lib/include -I../lib -I../modules/include -fPIC -O3 -m32 -Wall -fPIC -O3 -m32 -D_REENTRANT -D_POSIX_C_SOURCE=199506L -D_POSIX_PTHREAD_SEMANTICS -D_XPG4_2 -MT iv_event.lo -MD -MP -MF .deps/iv_event.Tpo -c iv_event.c -fPIC -DPIC -o iv_event.o
iv_event.c: In function `iv_event_run_pending_events':
iv_event.c:62: error: unrecognizable insn:
(insn 306 305 307 11 (set (reg/f:SI 192) (high:SI (const:SI (plus:SI (symbol_ref:SI ("__tls") [flags 0x12] <var_decl 7f6c6980 __tls>) (const_int 104 [0x68]))))) -1 (nil) (nil))
iv_event.c:62: internal compiler error: in extract_insn, at recog.c:2083

I'm not even sure how to begin troubleshooting this, Google and Bing yield nothing obvious either. Ivykis must be fairly new, there's not much in their indexes concerning it at all for that matter. Have you guys seen this when compiling with that snapshot that Gergely prepared? Any hints for me?

The configure (which works):

./configure --enable-pcre --disable-ipv6 --enable-dynamic-linking --enable-sun-streams --disable-mongodb

The configure report:

syslog-ng Open Source Edition 3.3.0alpha2 configured
Compiler options:
  compiler : gcc -std=gnu99
  compiler options : -fPIC -O3 -m32 -Wall -fPIC -O3 -m32 -D_REENTRANT -D_PTHREADS -I/usr/local/include/glib-2.0 -I/usr/local/lib/glib-2.0/include -I/usr/local/include/eventlog -I/usr/local/include -I$(top_srcdir)/lib/ivykis/lib/include -I$(top_builddir)/lib/ivykis/lib/include -I$(top_srcdir)/lib/ivykis/modules/include -D_GNU_SOURCE -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
  linker flags : -lpthread
  prefix : /usr/local
  linking mode : dynamic
  __thread keyword : yes
Submodules:
  ivykis : internal
  libmongo-client : internal
Features:
  Debug symbols : no
  GCC profiling : no
  Memtrace : no
  IPV6 support : no
  spoof-source support : no
  tcp-wrapper support : no
  Linux capability support : no
  PCRE support : yes
  Env wrapper support : no
Modules:
  Default module list : affile,afprog,afsocket,afuser,basicfuncs,csvparser,dbparser,syslogformat
  Sun STREAMS support (module): yes
  SSL support (module) : no
  SQL support (module) : no
  PACCT module (EXPERIMENTAL) : no
  MongoDB destination (module): no

The various vars I've had to set to get things working (had to add /usr/ccs/bin to get ar in path, had to add -fPIC to fix a TLS error, stuck on current error).

# echo $CFLAGS
-fPIC -O3 -m32
# echo $PATH # all the GNU tools are in /usr/local
/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/usr/ucb:/usr/ccs/bin

Thanks for any help anyone can give (ongoing!). I'm also waiting on a 4.0 PE eval download to get allowed. Hopefully I can give some updated performance numbers in light of that getting installed soon (with the write() and time() fixes Bazsi alluded to earlier).
--Mike Mishou
I found http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22286 and disabled optimization with -O0 in CFLAGS/CPPFLAGS, still getting the same exact error.

I found this which seems to address the problem: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21613. Seems to only be an issue when using -fPIC with GCC 3.4 (I'm using 3.4.6). I have to use -fPIC to use the TLS references (I think). I'm not sure where to go from here, I don't fully understand the workaround in bug 21613 listing (at the bottom), I guess I could move to a newer version of GCC? I'm not sure how much I'll break doing that, but it's worth a shot I suppose.

--Mike Mishou
On Mon, 2011-05-02 at 13:01 -0400, Mishou Michael wrote:
I found http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22286 and disabled optimization with -O0 in CFLAGS/CPPFLAGS, still getting the same exact error.
I found this which seems to address the problem: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21613. Seems to only be an issue when using -fPIC with GCC 3.4 (I'm using 3.4.6). I have to use -fPIC to use the TLS references (I think). I'm not sure where to go from here, I don't fully understand the workaround in bug 21613 listing (at the bottom), I guess I could move to a newer version of GCC? I'm not sure how much I'll break doing that, but it's worth a shot I suppose.
Before touching the toolchain (which I don't recommend while porting ivykis to various platforms and failed miserably), we've patched ivykis to work on systems where __thread is not available.

I'm not sure if Gergely has prepared the patched ivykis or the upstream one. (We were exchanging patches with upstream, but not everything was integrated yet.)

The proper version is here:

http://git.balabit.hu/?p=bazsi/ivykis.git;a=summary

which is equivalent to:

git://git.balabit.hu/bazsi/ivykis.git

But if you don't have git, you can grab a tarball from the gitweb interface, e.g.:

http://git.balabit.hu/?p=bazsi/ivykis.git;a=snapshot;h=1d9e413f31e09a2c82128...

If you still cannot get it to compile, it'd be helpful if you could include the config.log / config.status files from the syslog-ng root directory; they should contain the lib/ivykis checks too. The most interesting part is whether it finds support for __thread variables.

Now that I think of it, it probably does, and that's the cause of the error: your compiler doesn't really support __thread (hence the error), but is able to compile a simple test program.

You could try to edit lib/ivykis/lib/iv_thr.h and add:

#undef HAVE_TLS
#define HAVE_TLS 0

after the config.h header is included.

-- Bazsi
On Mon, 2011-05-02 at 20:09 +0200, Balazs Scheidler wrote:
Before touching the toolchain (which I don't recommend while porting ivykis to various platforms and failed miserably),
Argh, I successfully edited this sentence into complete gibberish. So: I don't recommend trying to fix the toolchain; we've tried that and failed miserably.
-- Bazsi
Hi All,

Can anybody help me, please? I have configured syslog-ng on an x86 server. Below is the configuration, but logs are not arriving under the /syslog-ng folder. If something is wrong in the configuration below, can you please provide a step-by-step procedure to configure this?

# cat /etc/syslog-ng/syslog-ng.conf
options {
    sync (0);
    time_reopen (10);
    log_fifo_size (1000);
    long_hostnames (off);
    use_dns (no);
    use_fqdn (no);
    create_dirs (yes);
    keep_hostname (yes);
};

source s_sys {
    file ("/proc/kmsg" log_prefix("kernel: "));
    sun-stream ("/dev/log");
    internal();
};

# External Source
source s_ext {
    # Standard Syslog
    udp(); # All interfaces
    tcp(); # All interfaces on tcp port
    sun-stream("/dev/log");
};

destination d_cons { file("/dev/console"); };
destination d_mesg { file("/var/adm/messages"); };
destination d_mail { file("/var/log/syslog"); };
destination d_auth { file("/var/log/authlog"); };
destination d_mlop { usertty("operator"); };
destination d_mlrt { usertty("root"); };
destination d_mlal { usertty("*"); };
destination d_ext {
    file("/syslog-ng/$HOST/$YEAR/$MONTH/$DAY/$FACILITY$YEAR$MONTH$DAY" \
        owner(root) group(root) perm(0650) dir_perm(0750) create_dirs(yes));
    create_dirs(yes));
};

filter f_filter1 { level(err) or (level(notice) and facility (auth, kern)); };
filter f_filter2 { level(err) or (facility(kern) and level(notice)) or (facility(daemon) and level(notice)) or (facility(mail) and level(crit)); };
filter f_filter3 { level(alert) or (facility(kern) and level(err)) or (facility(daemon) and level(err)); };
filter f_filter4 { level(alert); };
filter f_filter5 { level(emerg); };
filter f_filter6 { facility(kern) and level(notice); };
filter f_filter7 { facility(mail) and level(debug); };
filter f_filter8 { facility(user) and level(err); };
filter f_filter9 { facility(user) and level(alert); };

log { source(s_sys); filter(f_filter1); destination(d_cons); };
log { source(s_sys); filter(f_filter2); destination(d_mesg); };
log { source(s_sys); filter(f_filter3); destination(d_mlop); };
log { source(s_sys); filter(f_filter4); destination(d_mlrt); };
log { source(s_sys); filter(f_filter5); destination(d_mlal); };
log { source(s_sys); filter(f_filter6); destination(d_auth); };
log { source(s_sys); filter(f_filter7); destination(d_mail); };
log { source(s_sys); filter(f_filter8); destination(d_cons); destination(d_mesg); };
log { source(s_ext); destination(d_ext); };

# isainfo -kv
64-bit amd64 kernel modules

# cat /etc/release
Solaris 10 10/08 s10x_u6wos_07b X86
Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 27 October 2008

# pkginfo -l SMCsyslng
PKGINST: SMCsyslng
NAME: syslogng
CATEGORY: application
ARCH: x86
VERSION: 2.0.5
BASEDIR: /usr/local
VENDOR: BalaBit IT Ltd
PSTAMP: Steve Christensen
INSTDATE: Apr 20 2011 16:24
EMAIL: steve@smc.vnet.net
STATUS: completely installed
FILES: 64 installed pathnames
       3 shared pathnames
       15 directories
       2 executables
       2163 blocks used (approx)

-----Original Message-----
From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Mishou Michael
Sent: Tuesday, April 26, 2011 11:28 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops

Gergely,

Thanks for any testing you can do. I'm not sure if a SPARC processor is an important testing component or not, I suppose your VMs will help determine this since you'll be using x86. If there's any testing I can do to help things along, please let me know.

Yes, I'm (very) scared of rsyslog as a maintainable solution, the configs for syslog-ng are *so* much easier to read and understand. I'll try 3.3 and report back how threading helps things out, I'm glad to hear that it's been pretty stable for you, that was my major concern in testing 3.3 since eventually we'll need this to be in production with our basic (from a config complexity standpoint) requirements.

I'll report back how 3.3 works out for me after I get it compiled and up today.

Regards,
--Mike

-----Original Message-----
From: syslog-ng-bounces@lists.balabit.hu [mailto:syslog-ng-bounces@lists.balabit.hu] On Behalf Of Gergely Nagy
Sent: Tuesday, April 26, 2011 12:19 PM
To: Syslog-ng users' and developers' mailing list
Subject: Re: [syslog-ng] Solaris 10 UDP overflows, message drops

(A few preliminary answers follow - I'll have another look at this later tonight from home, once I tested a few things on my local solaris vm)

"Mishou Michael" <Michael.Mishou@csirc.irs.gov> writes:
I'm going to experiment with syslog-ng and the loggen tool to find a point at which a single syslog-ng instance starts dropping inbound UDP traffic with a simple configuration writing to disk. Once I have that number, I have a few options:
1. Experiment with syslog-ng 3.3 and the new threaded code to see if I have performance gains. I'm hesitant to push Alpha code in production, if anyone has any experience with 3.3 in semi-production environment running consistently I'd love to hear it.
I've been running 3.3 on most systems I administer (2 of my own servers + a few I administer for friends; and all of my virtual machines). It's been serving me fine for the past 4 months now. However, most of my systems are also linux systems, where syslog-ng is much better tested (and I'm not using UDP at all). Personally, I'd give it a test run, as current 3.3 is fairly stable.
3. Give up on syslog-ng until 3.3, or move to some other solution. Not sure what I could do here, rsyslog is the other major contender I guess, not sure what gains I would get. Could also do native syslog server and post-process to different buckets/relay which is what we mainly use syslog-ng for.
I wouldn't consider rsyslog. It's a nightmare to maintain that, and an even bigger nightmare to get it to perform well in any but the most trivial situations. (Or it might be just me being too used to good documentation and readable config files, but I'm fairly sure it's not just that :P) -- |8]
Gergely Nagy <algernon@balabit.hu> writes:
(A few preliminary answers follow - I'll have another look at this later tonight from home, once I tested a few things on my local solaris vm)
I had a few tries with my VMs (both x86 and sparc), but sadly didn't manage to come to any meaningful resolution yet. I have a few more tricks up my sleeve, and will explore a couple of more ideas in the coming days - will report back as soon as I have something. -- |8]
Hi, On Tue, 2011-04-26 at 12:05 -0400, Mishou Michael wrote:
For those following this thread, I have applied the "thundering herd" UDP patch and experienced no change in the drops experienced by syslog-ng 3.1.2. Sorry I took so long to respond, the patching was a much more time-involved process than I thought it would be.
At this point, based on Michael Hocke's response, I'm thinking that perhaps there is just too much UDP traffic for single-threaded syslog-ng to deal with in light of what filtering and parsing it does up front (for macro usage).
I'm going to experiment with syslog-ng and the loggen tool to find a point at which a single syslog-ng instance starts dropping inbound UDP traffic with a simple configuration writing to disk. Once I have that number, I have a few options:
1. Experiment with syslog-ng 3.3 and the new threaded code to see if I have performance gains. I'm hesitant to push Alpha code in production, if anyone has any experience with 3.3 in semi-production environment running consistently I'd love to hear it.
I think the most difficult part of compiling syslog-ng for Solaris is ivykis, the new I/O backend library that we've started using for threading (it supports epoll, /dev/poll, kqueue etc). The ivykis version that we use is available on git.balabit.hu, but you need a complete toolchain (autoconf, automake, libtool, gcc, gmake) to compile it.
2. So I don't have to change the configuration on a lot of clients, use PF to rewrite incoming UDP messages from specific, busy clients to other syslog-ng listeners, configured exactly as my main instance (which will handle all the non-insanely-busy clients). I could run multiple listeners in this manner, and not need threading to take advantage of multiple processors, though obviously each process would still be limited to the magic number determined above. I have 10 or so really busy clients, so this is one solution I'm leaning towards if syslog-ng 3.1.2 can handle just one of them.
This could work.
3. Give up on syslog-ng until 3.3, or move to some other solution. Not sure what I could do here, rsyslog is the other major contender I guess, not sure what gains I would get. Could also do native syslog server and post-process to different buckets/relay which is what we mainly use syslog-ng for.
4. Get a faster box (not likely to happen).
If anyone has any thoughts on any of the above I'd love to hear them. Also, if this is unique to Solaris SPARC systems (similarly spec'd x86 Solaris systems having none of these limitations) I'd love to know that as well. Is there any way anyone knows to figure out at what point the SPARC is hitting a ceiling? The CPU is not pegged, so why would we be experiencing CPU-based drops? Maybe the code is not efficient for how SPARC does things, or how some syscall is implemented on Solaris?
Yes, I think this is the root cause of the problem. -- Bazsi
On Mon, 2011-04-18 at 11:43 -0400, Mishou Michael wrote:
Fred,
Great find! For those following on the TV at home, here is the link to the patch notes that I found:
https://getupdates.oracle.com/readme/144488-11
Which contains this tantalizing tidbit:
6638967 UDP recv (think DNS) suffers from thundering herd problem (bug report for above: http://bit.ly/eD57KB+ )
I'm going to install this patch and see what comes of it. That certainly seems like it could be related.
Martin,
I checked based on Matthew's suggestion of the ipInDiscards counter incrementing, and it's not, so no dice with the checksum errors, good call though!
Matthew,
Following your suggestions I set up all of the network-based destinations to be /dev/null and the problem persists, shows no change in terms of how fast the buffers fill up or how many overflows are generated per second at all. As for loggen, keep in mind that I was using that before to write to /dev/null and also directly to disk and topped loggen out (locally) at ~8k/msgs/sec, so I wonder if Fred isn't on the right track with some OS issues?
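(Editor's sketch, not part of the original exchange.) To put a number on "how many overflows are generated per second" when comparing two configurations, the cumulative udpInOverflows counter can be turned into per-interval rates. The readings below are the ones from the netstat loop posted at the start of this thread, taken 10 seconds apart right after a restart; the function name overflow_rates is made up for illustration.

```python
def overflow_rates(samples, interval_s):
    """Per-second increase between consecutive readings of a cumulative counter."""
    return [(later - earlier) / interval_s
            for earlier, later in zip(samples, samples[1:])]


# udpInOverflows readings from the first message in this thread, one every 10 s.
SAMPLES = [472517477, 472517477, 472517477, 472517477,
           472543152, 472592800, 472638848, 472684407]

if __name__ == "__main__":
    for rate in overflow_rates(SAMPLES, interval_s=10):
        print(f"{rate:8.1f} overflows/s")
```

Running the same calculation against samples taken with the disk destination and with the /dev/null destination gives two lists of rates that can be compared directly, which is less error-prone than eyeballing the raw counter.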
I used Google to find scripts to run for Dtrace to track syscalls, and one called procsystime stood out (http://www.brendangregg.com/DTrace/procsystime) Here's a sample of the output over about 20 seconds each time, while the overflows are happening. It's hard for me to tell if the write() syscall is really freaking slow here or if this is as it should be and not weird. Either way, write() doesn't seem to be the issue if you look at the /dev/null output for which UDP overflows still occur.
This first one is the output for the simple config I posted earlier, writing to disk:
# /root/procsystime -aTn syslog-ng
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for processes syslog-ng,

SYSCALL          TIME (ns)
getpid              193600
fchmod              443000
bind                757700
fcntl               786800
setsockopt         1355700
fchown             1446200
connect            1542000
so_socket          2626100
close              4382500
open64             4691400
stat64             4957400
brk               23978700
pollsys          113785800
llseek           158217200
gtime            244735300
recvfrom         404501900
write          11343815700
TOTAL:         12312217000
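(Editor's sketch, not part of the original exchange.) One way to judge whether write() is "really freaking slow" is to express each syscall as a share of the total elapsed time. The figures below are copied from the procsystime report above; the helper name syscall_shares is made up for illustration. Note that this is elapsed time inside each syscall, which I am assuming includes time spent blocked, not only CPU time.

```python
# Elapsed time per syscall in nanoseconds, copied from the procsystime run above
# (simple config, writing to disk).
ELAPSED_NS = {
    "getpid": 193600, "fchmod": 443000, "bind": 757700, "fcntl": 786800,
    "setsockopt": 1355700, "fchown": 1446200, "connect": 1542000,
    "so_socket": 2626100, "close": 4382500, "open64": 4691400,
    "stat64": 4957400, "brk": 23978700, "pollsys": 113785800,
    "llseek": 158217200, "gtime": 244735300, "recvfrom": 404501900,
    "write": 11343815700,
}
REPORTED_TOTAL_NS = 12312217000  # the TOTAL line of the same report


def syscall_shares(elapsed_ns):
    """Fraction of the summed elapsed time spent in each syscall."""
    total = sum(elapsed_ns.values())
    return {name: ns / total for name, ns in elapsed_ns.items()}


if __name__ == "__main__":
    shares = syscall_shares(ELAPSED_NS)
    # Show the four most expensive syscalls first.
    for name in sorted(shares, key=shares.get, reverse=True)[:4]:
        print(f"{name:<10s}{shares[name]:7.1%}")
```

On these numbers write() accounts for roughly 92% of the measured elapsed time, recvfrom for about 3% and gtime for about 2%.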
Hmm... I think I know what the culprit is, based on these statistics.

1) syslog-ng < 3.3 issued a single write() syscall for each and every line written to the log files. It had overhead on Linux too, but not this bad.

2) time() is invoked by syslog-ng a lot, and that is clearly visible in your profile. The number of these invocations was also reduced, unluckily only in 3.3.

_Perhaps_ it'd make sense to backport the performance improvements in question, so you wouldn't have to fight 3.3 related issues, although I'm more than happy if you choose that.

This is the bulk of the change that should more-or-less cleanly apply to 3.2:

Author: NagyAttila <naat@balabit.hu> 2010-11-15 18:00:10
Committer: Balazs Scheidler <bazsi@balabit.hu> 2010-12-21 16:31:09
Parent: 88a2a660255147c5ebd35951d1ccead9fd779e13 (LogProto: apply_state shouldn't allow file offsets over the end-of-file)
Child: fe541bcb0368342228eebbb8dcf08e7a5f5f6a05 (LogProto: simplify prepare method)
Branches: many (84)
Follows: v3.2.1
Precedes: v3.3.0alpha1

    Performance improvement: file write uses writev instead of per-message write() to write larger chunks

There were three fixes on the code since then:

3fa8e900453fb6af1767caf42e17ab6d8c42452b syslog() destination driver: fixed potential framing problem on contended connections
20e523a53dd8259061b2d277927f79681b8a9334 LogProtoFileWriter: flush should not attempt to call writev() if there are no buffers
3abdd8773662f9d779429262262a1ed9229e98e6 logproto: Handle EAGAIN and EINTR correctly in _text_client_flush()

The time related changes are much more scattered, so I wouldn't recommend going there first, and seeing your profile, they shouldn't matter that much.

Also, please note that in order to really make use of the new code, you need to set flush_lines() to non-zero; something like 100-1000 should make a big difference.
And one last point: please don't take it as a plug, but it _could_ make sense to check the PE version first. You can get a free eval (in binary), and PE 4.0 already had these patches (they were ported to OSE later). If you really do see the performance boost on your system, you can still decide whether to take the PE route or stick with OSE, possibly with the patches ported.

-- Bazsi
On Apr 15, 2011, at 2:01 PM, Mishou Michael wrote:
I left out the resources I have to work with on this system, and how bad/good things are with syslog-ng running (and dropping), I'll include those now. As you can see, it's an older server, but it has a ton of RAM and the CPUs should have enough pop for this I think.
Hi Mishou,

I fought this battle for quite a long time when I built a syslog server using syslog-ng on Solaris 10 running on a Sun Fire V210 (dual 1.5GHz US-IIIi processors, 4GB memory). This syslog server is used to collect an immense amount of Cisco firewall messages (in the neighborhood of 14000 messages per second).

At first I tried to fiddle around with the UDP buffers in the system and the so_rcvbuf setting in syslog-ng.conf, but to no avail. Any increase of the buffer would just delay the moment when UDP packets started to drop again.

I then found an old Sun x86 server (a V60x) lying around (dual Xeon 3GHz, 6GB memory) and replaced the V210 with it, suspecting that even my very simple syslog-ng configuration (no filters or anything) was just overwhelming the V210. That did the trick. It was just a matter of processing power.

Not sure if this applies to your situation, but it kind of has the same smell to it. Hope this helps a bit.

- Michael
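The observation that a bigger buffer only delays the drops follows from simple arithmetic: if packets arrive faster than they are consumed, the buffer fills at the surplus rate no matter how large it is. A rough Python sketch, using figures from the first message of this thread (average packet size from the tcpdump capture, overflow rate from two consecutive udpInOverflows samples) as stand-ins for the surplus. The function name is made up for this illustration, and the result is only an estimate: the captured packet size includes headers, and the kernel's per-packet buffer accounting is ignored.

```python
def seconds_until_overflow(buffer_bytes, surplus_pps, avg_packet_bytes):
    """Estimate how long a receive buffer lasts when arrivals exceed the drain rate.

    surplus_pps is arrival rate minus drain rate, in packets per second.
    """
    if surplus_pps <= 0:
        return float("inf")  # the reader keeps up, so the buffer never fills
    return buffer_bytes / (surplus_pps * avg_packet_bytes)


# Figures taken from the first message in this thread.
AVG_PACKET_BYTES = 303.182
# Steady-state overflow rate: two udpInOverflows samples taken 10 s apart.
SURPLUS_PPS = (472684407 - 472638848) / 10

if __name__ == "__main__":
    for label, size in (("64 MB", 64 * 2**20), ("1 GB", 2**30)):
        secs = seconds_until_overflow(size, SURPLUS_PPS, AVG_PACKET_BYTES)
        print(f"{label} buffer lasts about {secs:.0f} s")
```

Under these assumptions a 64 MB buffer lasts on the order of a minute and a 1 GB buffer on the order of ten minutes, after which drops resume at the same rate. Only raising the drain rate (cheaper writes or a faster machine) changes the steady state.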
participants (11)
- Balazs Scheidler
- Clayton Dukes
- Fekete Robert
- Fred Connolly
- Gergely Nagy
- Martin Holste
- Matthew Hall
- Michael Hocke
- Mishou Michael
- sramesh.kumar@wipro.com
- Zeek Anow