Leaking objects in size-4096 cache in slabinfo
I'm running syslog-ng 3.0.4 OSE on Gentoo and have a central syslog host that accepts about 3Mbps of log traffic via a TCP pipe. We've recently been having oom-killer problems and I finally tracked it down to the fact that we seem to be leaking objects in the size-4096 cache. Now, slabinfo is new to me so I could be interpreting this wrong, but restarting syslog-ng definitely reclaims this cache.

What data can I provide to help track down this problem?

I noticed that there seems to be a similar error in 3.1 beta, reported here: http://thread.gmane.org/gmane.comp.syslog-ng/8544

-Doug
On 12/15/2009 12:03 PM, Doug Warner wrote:
I'm running syslog-ng 3.0.4 OSE on Gentoo and have a central syslog host that accepts about 3Mbps of log traffic via a TCP pipe. We've recently been having oom-killer problems and I finally tracked it down to the fact that we seem to be leaking objects in the size-4096 cache. Now, slabinfo is new to me so I could be interpreting this wrong, but restarting syslog-ng definitely reclaims this cache.
What data can I provide to help track down this problem?
I noticed that there seems to be a similar error in 3.1 beta, reported here: http://thread.gmane.org/gmane.comp.syslog-ng/8544
Attached is valgrind output for the process after a couple of hours. Let me know if I can be any more help.

-Doug
On Tue, 2009-12-15 at 14:12 -0500, Doug Warner wrote:
On 12/15/2009 12:03 PM, Doug Warner wrote:
I'm running syslog-ng 3.0.4 OSE on Gentoo and have a central syslog host that accepts about 3Mbps of log traffic via a TCP pipe. We've recently been having oom-killer problems and I finally tracked it down to the fact that we seem to be leaking objects in the size-4096 cache. Now, slabinfo is new to me so I could be interpreting this wrong, but restarting syslog-ng definitely reclaims this cache.
What data can I provide to help track down this problem?
I noticed that there seems to be a similar error in 3.1 beta, reported here: http://thread.gmane.org/gmane.comp.syslog-ng/8544
Attached is a valgrind for the process after a couple hours. Let me know if I can be any more help.
slabinfo contains information about kernel memory allocations, which can certainly be attributed to syslog-ng when the kernel allocates on its behalf, but it would be important to know what kind of kernel objects syslog-ng allocates. One thing is sure: plain user-space memory allocated by syslog-ng will never show up in the size-4096 cache, since the kernel allocates pages directly when userspace programs need memory. And a size-4096 entry in the slab cache is certainly associated with a kmalloc()-ed object. Possibly a socket structure (but AFAIR that has its own slab).

Can you check:

* whether syslog-ng has too many open sockets (visible for example with lsof)
* whether it reads its input queue (although network packets would again not show up in the slab), using netstat -antp
* possibly the network statistics: netstat -ns

The 3.1 problem was completely 3.1 specific and has nothing to do with 3.0 (and in the meanwhile it was solved too, off-list).

-- Bazsi
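The Recv-Q check suggested above boils down to summing column 2 of `netstat -ant` output. A minimal sketch against hypothetical sample output (the addresses and queue sizes are invented for illustration):

```shell
# Hypothetical `netstat -ant` lines; field 2 is Recv-Q in bytes.
netstat_sample='tcp 51200 0 10.0.0.1:514 10.0.0.7:41852 ESTABLISHED
tcp 1048576 0 10.0.0.1:514 10.0.0.9:41911 ESTABLISHED
tcp 0 0 10.0.0.1:514 10.0.0.12:42007 ESTABLISHED'

# Sum the receive queues; a total that keeps growing means the
# daemon is not draining its sources fast enough.
printf '%s\n' "$netstat_sample" | awk '{sum += $2} END {print sum}'
```

On a live host the equivalent would be something like `netstat -ant | awk 'NR>2 {sum+=$2} END {print sum}'`, skipping the two header lines.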
On 12/15/2009 03:17 PM, Balazs Scheidler wrote:
Possibly a socket structure (but AFAIR it has its own slab). Can you check if: * syslog-ng has too many open sockets (visible for example with lsof)
We have about 290 open TCP connections.
* whether it reads its input queue (although network packets would again not show up in the slab), using netstat -antp
Most have smallish Recv-Qs (<2000 bytes), but a couple of hosts have receive queues of 50KB, and one of >1MB. Is this an indication that my syslog-ng instance is overloaded and can't get the data out to disk fast enough?
* possibly check the network statistics: netstat -ns
Nothing looks too odd here. Thanks for your help debugging this issue!

-Doug
On Wed, 2009-12-16 at 09:54 -0500, Doug Warner wrote:
On 12/15/2009 03:17 PM, Balazs Scheidler wrote:
Possibly a socket structure (but AFAIR it has its own slab). Can you check if: * syslog-ng has too many open sockets (visible for example with lsof)
We have about 290 open TCP connections.
* whether it reads its input queue (although network packets would again not show up in the slab), using netstat -antp
Most have smallish Recv-Qs (<2000 bytes), but a couple of hosts have receive queues of 50KB, and one of >1MB. Is this an indication that my syslog-ng instance is overloaded and can't get the data out to disk fast enough?
* possibly check the network statistics: netstat -ns
Nothing looks too odd here.
Thanks for your help debugging this issue!
I've checked the kernel source: sockets and skbuffs have their own slabs, so they wouldn't show up in size-4096. However, the skbuff data portion is stored in kmalloc()-ed memory, so that could show up in size-4096.

The question is how much memory is showing in size-4096 (multiply the counters by 4096 to see how much is actually allocated) and whether that corresponds with the receive queues that you see.

Also note that if this is a 32-bit machine with lots of memory (> 4GB), then the additional RAM is actually _reducing_ low memory size, which is what the slab uses. Can you check /proc/meminfo to see how much low memory you have?

-- Bazsi
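The arithmetic suggested here, object count times object size, can be sketched against a hypothetical /proc/slabinfo line (the object count below is invented to roughly match the ~479MB figure reported later in the thread):

```shell
# Hypothetical slabinfo line: name, active_objs, num_objs, objsize, ...
slab_sample='size-4096 122711 122711 4096 1 1'

# Multiply object count by object size to get bytes, then scale to MB.
printf '%s\n' "$slab_sample" | awk '{printf "%d MB\n", $3 * $4 / (1024 * 1024)}'
```

On a live host, `awk '/^size-4096/ {print $2 * $4}' /proc/slabinfo` would print the bytes held by active objects in that cache.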
On 12/18/2009 03:03 AM, Balazs Scheidler wrote:
I've checked the kernel source: sockets and skbuffs have their own slabs, so they wouldn't show up in size-4096.
However, the skbuff data portion is stored in kmalloc()-ed memory, so that could show up in size-4096.
The question is how much memory is showing in size-4096 (multiply the counters by 4096 to see how much is actually allocated) and whether that corresponds with the receive queues that you see.
Also note that if this is a 32-bit machine with lots of memory (> 4GB), then the additional RAM is actually _reducing_ low memory size, which is what the slab uses. Can you check /proc/meminfo to see how much low memory you have?
We're running x86_64, so all RAM should be low-mem (although we only have 2GB). Currently it looks like we have ~326MB in Recv-Q and our size-4096 is ~479MB.

I guess the big concern here is why our recv-q is so high and what we can do to reduce it. These are the current global options (nothing set on the src or dest, since this syslog server only handles the tcp connections):

options {
    chain_hostnames(off);
    dns_cache(yes);
    log_fifo_size(5000);
    flush_lines(200);
    flush_timeout(1000);
    use_fqdn(yes);
    stats_freq(300);
};

Most destinations are files, but we do have one program. The box doesn't seem to be particularly loaded otherwise.

Do these values seem like we're trying to make syslog-ng cache too much information? Are there other settings we should tweak to try to get the messages through syslog-ng faster?

-Doug
On Fri, 2009-12-18 at 09:50 -0500, Doug Warner wrote:
On 12/18/2009 03:03 AM, Balazs Scheidler wrote:
I've checked the kernel source: sockets and skbuffs have their own slabs, so they wouldn't show up in size-4096.
However, the skbuff data portion is stored in kmalloc()-ed memory, so that could show up in size-4096.
The question is how much memory is showing in size-4096 (multiply the counters by 4096 to see how much is actually allocated) and whether that corresponds with the receive queues that you see.
Also note that if this is a 32-bit machine with lots of memory (> 4GB), then the additional RAM is actually _reducing_ low memory size, which is what the slab uses. Can you check /proc/meminfo to see how much low memory you have?
We're running x86_64, so all RAM should be low-mem (although we only have 2GB).
Currently it looks like we have ~326MB in Recv-Q and our size-4096 is ~479MB.
This is reasonable, as the space to be allocated is rounded up, thus it is possible that syslog-ng is indeed the culprit.
I guess the big concern here is why our recv-q is so high and what we can do to reduce it.
These are the current global options (nothing set on the src or dest since this syslog server only handles the tcp connections):
options { chain_hostnames(off); dns_cache(yes); log_fifo_size(5000); flush_lines(200); flush_timeout(1000); use_fqdn(yes); stats_freq(300); };
Most destinations are files but we do have one program. The box doesn't seem to be particularly loaded otherwise.
Do these values seem like we're trying to make syslog-ng cache too much information? Are there other settings we should tweak to try to get the messages through syslog-ng faster?
Do you have flow control enabled on any of your log paths? Is syslog-ng burning your CPU; is it CPU bound?

-- Bazsi
On 12/18/2009 11:41 AM, Balazs Scheidler wrote:
I guess the big concern here is why our recv-q is so high and what we can do to reduce it.
These are the current global options (nothing set on the src or dest since this syslog server only handles the tcp connections):
options { chain_hostnames(off); dns_cache(yes); log_fifo_size(5000); flush_lines(200); flush_timeout(1000); use_fqdn(yes); stats_freq(300); };
Most destinations are files but we do have one program. The box doesn't seem to be particularly loaded otherwise.
Do these values seem like we're trying to make syslog-ng cache too much information? Are there other settings we should tweak to try to get the messages through syslog-ng faster? Do you have flow control enabled on any of your log paths? Is syslog-ng burning your CPU; is it CPU bound?
Periodically when syslog-ng seems to be working fine we'll see it using a bit of CPU (~20%), but more commonly right now we see it using very little while the recv-q builds up. None of our src/dest in syslog-ng have any flags set other than the above global options, so whatever the default is for 3.0.4 OSE should be what we're using.

-Doug
On Fri, 2009-12-18 at 13:30 -0500, Doug Warner wrote:
On 12/18/2009 11:41 AM, Balazs Scheidler wrote:
I guess the big concern here is why our recv-q is so high and what we can do to reduce it.
These are the current global options (nothing set on the src or dest since this syslog server only handles the tcp connections):
options { chain_hostnames(off); dns_cache(yes); log_fifo_size(5000); flush_lines(200); flush_timeout(1000); use_fqdn(yes); stats_freq(300); };
Most destinations are files but we do have one program. The box doesn't seem to be particularly loaded otherwise.
Do these values seem like we're trying to make syslog-ng cache too much information? Are there other settings we should tweak to try to get the messages through syslog-ng faster? Do you have flow control enabled on any of your log paths? Is syslog-ng burning your CPU; is it CPU bound?
Periodically when syslog-ng seems to be working fine we'll see syslog-ng using a bit of CPU (~20%), but more commonly right now we see it using very little and the recv-q building up.
None of our src/dest in syslog-ng have any flags set other than the above global options, so whatever the default is for 3.0.4 OSE should be what we're using.
Strange. Can you start strace on the syslog-ng process, preferably with timestamps (strace -ttT -s 256 -p <pid>), to see what it is doing?

-- Bazsi
On 12/21/2009 02:50 AM, Balazs Scheidler wrote:
Periodically when syslog-ng seems to be working fine we'll see syslog-ng using
a bit of CPU (~20%), but more commonly right now we see it using very little and the recv-q building up.
None of our src/dest in syslog-ng have any flags set other than the above global options, so whatever the default is for 3.0.4 OSE should be what we're using. Strange. Can you start strace on the syslog-ng process, preferably with timestamps (strace -ttT -s 256 -p <pid>), to see what it is doing?
Sure thing; sending off-list.

-Doug
On 12/21/2009 09:41 AM, Doug Warner wrote:
On 12/21/2009 02:50 AM, Balazs Scheidler wrote:
Periodically when syslog-ng seems to be working fine we'll see syslog-ng using
a bit of CPU (~20%), but more commonly right now we see it using very little and the recv-q building up.
None of our src/dest in syslog-ng have any flags set other than the above global options, so whatever the default is for 3.0.4 OSE should be what we're using. Strange. Can you start strace on the syslog-ng process, preferably with timestamps (strace -ttT -s 256 -p <pid>), to see what it is doing?
Sure thing; sending off-list.
Is there anything more I can provide to help debug this? Thanks for your time and help!

-Doug
On 01/05/2010 11:53 AM, Doug Warner wrote:
On 12/21/2009 09:41 AM, Doug Warner wrote:
On 12/21/2009 02:50 AM, Balazs Scheidler wrote:
Periodically when syslog-ng seems to be working fine we'll see syslog-ng using
a bit of CPU (~20%), but more commonly right now we see it using very little and the recv-q building up.
None of our src/dest in syslog-ng have any flags set other than the above global options, so whatever the default is for 3.0.4 OSE should be what we're using. Strange. Can you start strace on the syslog-ng process, preferably with timestamps (strace -ttT -s 256 -p <pid>), to see what it is doing?
Just as a follow-up, I was able to solve this by setting flow-control on my log path, increasing log_fetch_limit() and log_iw_size() on my tcp source, and increasing log_fifo_size() globally. Doing these four things brought my memory consumption back in line and syslog-ng seems to be processing the logs at the proper pace again.

-Doug
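In config form, the changes described here might look something like the following sketch. All the numeric values are hypothetical; the thread does not say which numbers were actually used:

```
source s_net {
    tcp(port(514)
        log_fetch_limit(100)   # raised from the default of 10
        log_iw_size(1000));    # per-source receive window
};

log {
    source(s_net);
    destination(d_files);
    flags(flow-control);       # stall senders instead of dropping
};

options {
    log_fifo_size(30000);      # global default, raised from 5000
};
```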
So, if I have a TLS incoming server where I expect to see thousands, or tens of thousands, of clients, what should I set my flow-control parameters to? I assume that I'll want to set max-connections to the number of clients, so let's say 50k. What would the other parameters need to be set to: log_fifo_size, log_fetch_limit, log_iw_size?

thx

On 1/11/10 8:38 AM, Doug Warner wrote:
On 01/05/2010 11:53 AM, Doug Warner wrote:
On 12/21/2009 09:41 AM, Doug Warner wrote:
On 12/21/2009 02:50 AM, Balazs Scheidler wrote:
Periodically when syslog-ng seems to be working fine we'll see syslog-ng using
a bit of CPU (~20%), but more commonly right now we see it using very little and the recv-q building up.
Non of our src/dest in syslog-ng have any flags set other than the above global options; so whatever is the default for 3.0.4 OSE should be what we're using.
Strange. Can you start strace on the syslog-ng process, preferably with timestamps (strace -ttT -s 256 -p <pid>), to see what it is doing?
Just as a follow-up, I was able to solve this by setting flow-control on my log path and increasing my log_fetch_limit() and log_iw_size() on my tcp source and increasing log_fifo_size() globally. Doing these 4 things brought my memory consumption back in line and syslog-ng seems to be processing the logs at the proper pace again.
-Doug
On Mon, 2010-01-11 at 11:38 -0500, Doug Warner wrote:
On 01/05/2010 11:53 AM, Doug Warner wrote:
On 12/21/2009 09:41 AM, Doug Warner wrote:
On 12/21/2009 02:50 AM, Balazs Scheidler wrote:
Periodically when syslog-ng seems to be working fine we'll see syslog-ng using
a bit of CPU (~20%), but more commonly right now we see it using very little and the recv-q building up.
None of our src/dest in syslog-ng have any flags set other than the above global options, so whatever the default is for 3.0.4 OSE should be what we're using. Strange. Can you start strace on the syslog-ng process, preferably with timestamps (strace -ttT -s 256 -p <pid>), to see what it is doing?
Just as a follow-up, I was able to solve this by setting flow-control on my log path and increasing my log_fetch_limit() and log_iw_size() on my tcp source and increasing log_fifo_size() globally. Doing these 4 things brought my memory consumption back in line and syslog-ng seems to be processing the logs at the proper pace again.
Great to know, and sorry for not answering any sooner; I was a bit overwhelmed with other stuff. Since the settings that you changed are somewhat contradictory, I guess a smaller set of changes would have resolved the issue as well.

log_fetch_limit() - with this you increased the number of messages syslog-ng receives in a single main loop iteration. The default value here is 10, which is quite conservative. If you have a lot of connections, increasing this value to a very high number could cause some connections to starve.

Enabling flow control only helps throughput if syslog-ng would otherwise drop messages actively (e.g. receive one and then decide there's no space in the destination FIFO). If you enable flow control, the kernel drops/stalls messages instead of syslog-ng, which lets syslog-ng spend its time on output instead of receiving messages just to drop them.

I think in your specific case the increase of fetch-limit was the solution. But increasing fifo_size() when fetch_limit() is increased is a wise thing to do. I doubt enabling flow-control would have a real effect.

-- Bazsi
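A rough sizing sketch for the window/FIFO relationship discussed above. With flow control, the destination FIFO generally needs room for a full receive window from every source at once; the 290 connections are the count reported earlier in the thread, while the per-source window of 1000 is a hypothetical value:

```shell
# Minimum log_fifo_size if every one of the 290 sources could have a
# full receive window of messages in flight at the same time.
connections=290
log_iw_size=1000            # hypothetical per-source window
echo $((connections * log_iw_size))
```

The same arithmetic scales to Rory's 50k-client scenario, which is why very large client counts usually force either a small per-source window or a very large FIFO.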
participants (3)

- Balazs Scheidler
- Doug Warner
- Rory Toma