3.3.0beta1 leaking memeory (Re: syslog-ng 3.3.0beta & ESX crashes)
Hi there, I've mailed Gergely the Valgrind output to analyze 3.3.0beta1 memory leak(s), so if anybody else are interested in it, please contact me for them. I've since compiled 3.2.4+noescape patch (since only 3.3.0+ have the default option for SQL insert I need/use for the autoincrementing ID field in systemevents), and using the same config file (other than @version 3.2 and the noescape('default') for my postgresql insertions) and the system is stable with no descernable memory growth compared to syslog-ng 3.3.0beta1's explosion in memory size. Hendrik On Mon, Jul 18, 2011 at 3:02 PM, Gergely Nagy <algernon@balabit.hu> wrote:
Hendrik Visage <hvjunk@gmail.com> writes:
Hi Gergely,
Looks like I have a syslog-ng 3.3beta1 (fetched from GITHEAD) that is having a serious memory leak (and I'm not yet pushing all our logs to it ;(
Any advice in how to help you track this problem?
Hrm. Well, valgrind would be best, but that slows things down to almost a grounding halt. (But in case you want to try that: G_SLICE=always-malloc valgrind --leak-check=full -v --show-reachable-yes --log-file=valgrind.log syslog-ng -F would be the command to run ;)
Other than that, a config could be useful, so I can try and reproduce the problem (or at least identify possible areas where the leak might be coming from, and go from there), and see if I can fix it.
-- |8]
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
Hendrik Visage <hvjunk@gmail.com> writes:
Hi there,
I've mailed Gergely the Valgrind output to analyze 3.3.0beta1 memory leak(s), so if anybody else are interested in it, please contact me for them.
I just had a look at said logs, and the leak is spectacularly visible. If all goes well, I'll be able to post a patch tonight. -- |8]
Thanks Gergely! On Wed, Jul 20, 2011 at 2:06 PM, Gergely Nagy <algernon@balabit.hu> wrote:
Hendrik Visage <hvjunk@gmail.com> writes:
Hi there,
I've mailed Gergely the Valgrind output to analyze 3.3.0beta1 memory leak(s), so if anybody else are interested in it, please contact me for them.
I just had a look at said logs, and the leak is spectacularly visible. If all goes well, I'll be able to post a patch tonight.
-- |8]
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
Hendrik Visage <hvjunk@gmail.com> writes:
I've mailed Gergely the Valgrind output to analyze 3.3.0beta1 memory leak(s), so if anybody else are interested in it, please contact me for them.
I had a little time to look into the memory leak issue, and I can confirm: the leak happens all the time with latest git. What happens is, that log_reader_work_perform() calls log_reader_fetch_log(), which in turn calls log_reader_handle_line(), which allocates a LogMessage. This LogMessage is never freed, only forgotten, thus we end up with stuff like this: ==13955== 39,240,752 (15,004,656 direct, 24,236,096 indirect) bytes in 19,953 blocks are definitely lost in loss record 623 of 623 ==13955== at 0x4C274A8: malloc (vg_replace_malloc.c:236) ==13955== by 0x5941754: g_malloc (in /lib/libglib-2.0.so.0.2400.1) ==13955== by 0x4E5EABD: log_msg_alloc (logmsg.c:925) ==13955== by 0x4E5EB3F: log_msg_new (logmsg.c:950) ==13955== by 0x4E6765C: log_reader_handle_line (logreader.c:494) ==13955== by 0x4E6787D: log_reader_fetch_log (logreader.c:572) ==13955== by 0x4E66A17: log_reader_work_perform (logreader.c:116) ==13955== by 0x4E66B74: log_reader_io_process_input (logreader.c:191) ==13955== by 0x4E972A5: iv_run_tasks (iv_task.c:53) ==13955== by 0x4E95F7A: iv_main (iv_main.c:253) ==13955== by 0x4E6F8F8: main_loop_run (mainloop.c:670) ==13955== by 0x401AE8: main (main.c:263) There's also another leak somewhere around patterndb (but I haven't looked into this one yet): ==13955== 24,265,280 bytes in 19,955 blocks are indirectly lost in loss record 622 of 623 ==13955== at 0x4C275A2: realloc (vg_replace_malloc.c:525) ==13955== by 0x594158E: g_realloc (in /lib/libglib-2.0.so.0.2400.1) ==13955== by 0x4E7333A: nv_table_realloc (nvtable.c:685) ==13955== by 0x4E5DBC1: log_msg_set_value_indirect (logmsg.c:491) ==13955== by 0x8826CF6: pdb_rule_set_lookup (patterndb.c:1254) ==13955== by 0x88273E3: pattern_db_process (patterndb.c:1458) ==13955== by 0x881F408: log_db_parser_process (dbparser.c:217) ==13955== by 0x4E5FE15: log_parser_queue (logparser.c:49) ==13955== by 0x4E5C971: log_pipe_queue (logpipe.h:288) ==13955== by 0x4E5CBF8: log_multiplexer_queue (logmpx.c:125) ==13955== by 0x4E771A7: log_pipe_queue (logpipe.h:288) ==13955== by 0x4E77149: log_pipe_forward_msg (logpipe.h:275) The config to trigger this is pretty simple: | @version: 3.3 | @include "scl.conf" | @module tfjson | | source s_local { internal(); }; | source s_network { tcp(port(10514) tags("tcp-tag")); }; | destination d_local { | file("/tmp/ose-messages" | template("$(format-json --scope all-nv-pairs --scope core)\n")); | }; | parser p_loggen { | db_parser(file("/home/algernon/install/ose/syslog-ng-3.3/etc/loggen.pdb")); | }; | | log { source(s_local); source(s_network); parser(p_loggen); | destination(d_local); }; There's also a few other leaks, but those seem reasonably minor (eg, internal messages seem leaking aswell), and are probably related to one of the above leaks. I have no fix yet, but I thought I'd share my findings nevertheless. The loggen.pdb file is attached, for reference, and triggering the bug was simply done by throwing a ton of logs at syslog-ng with loggen. -- |8]
Gergely Nagy <algernon@balabit.hu> writes:
Hendrik Visage <hvjunk@gmail.com> writes:
I've mailed Gergely the Valgrind output to analyze 3.3.0beta1 memory leak(s), so if anybody else are interested in it, please contact me for them.
I had a little time to look into the memory leak issue, and I can confirm: the leak happens all the time with latest git.
After a bit of bisecting, this is what i found to be the cause of the massive leak: b8cc9fe3bb41d862918d9e39b0ded812dd756ef5 is the first bad commit commit b8cc9fe3bb41d862918d9e39b0ded812dd756ef5 Author: Balazs Scheidler <bazsi@balabit.hu> Date: Thu Jul 14 12:26:01 2011 +0200 delegate flow-control early-ack decision to destination drivers Sources always expect an acknowledgement for each message they produce. This acknowledgement is usually generated when all the destinations have finished their processing of the given message, unless flow-control is disabled. Earlier the LogMultiplexer object was responsible for ACKing in the name of destinations where flow-control was not desired. This patch changes that and delegates the early-ack (or break-ack) decision to destination drivers, based on the information stored in the LogPathOptions structure, where the user-requested mode of operation is stored. This patch splits the "flow_control" member of the LogPathOptions structure into two: - ack_needed: specifies whether the source still expects an ACK on the current path. - flow_control_requested: specifies whether the user has enabled flow control on the current path. Based on these, the destination driver is able to make a decision when to perform the ACK: - log_msg_break_ack(): is a new function call that can be used to generate the ACK and prepare for _sending_ the same message for further processing. - log_msg_ack(): is the good-old ack generation function that can be used when the processing is finally finished, and no processing is to be done. Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> :040000 040000 a26bae67903b4109d7b38fbc7c5189c16cc34537 498a1f804d5f3cc578d4fc0dd3833e9b2088a4f8 M lib :040000 040000 7ee20d0fad233792bc68b1db27e3377a22a21499 14bfecc8d10a9a4a36fbe4f38d68b2cef2f6aa63 M modules :040000 040000 eb95b6f26303ac0e879321b6f4366ff8dae66f69 19c30ee65e21c7ef10148630546658e28ac0f5b9 M tests I'm looking into what exactly this patch does, and where the LogMessage freeing should be added. Reverting the above patch from git head does, indeed, make the leak go away. -- |8]
Hi, Thanks for the info, I'll try to look at this today. On Fri, 2011-07-22 at 14:04 +0200, Gergely Nagy wrote:
Gergely Nagy <algernon@balabit.hu> writes:
Hendrik Visage <hvjunk@gmail.com> writes:
I've mailed Gergely the Valgrind output to analyze 3.3.0beta1 memory leak(s), so if anybody else are interested in it, please contact me for them.
I had a little time to look into the memory leak issue, and I can confirm: the leak happens all the time with latest git.
After a bit of bisecting, this is what i found to be the cause of the massive leak:
b8cc9fe3bb41d862918d9e39b0ded812dd756ef5 is the first bad commit commit b8cc9fe3bb41d862918d9e39b0ded812dd756ef5 Author: Balazs Scheidler <bazsi@balabit.hu> Date: Thu Jul 14 12:26:01 2011 +0200
delegate flow-control early-ack decision to destination drivers
Sources always expect an acknowledgement for each message they produce. This acknowledgement is usually generated when all the destinations have finished their processing of the given message, unless flow-control is disabled.
Earlier the LogMultiplexer object was responsible for ACKing in the name of destinations where flow-control was not desired. This patch changes that and delegates the early-ack (or break-ack) decision to destination drivers, based on the information stored in the LogPathOptions structure, where the user-requested mode of operation is stored.
This patch splits the "flow_control" member of the LogPathOptions structure into two:
- ack_needed: specifies whether the source still expects an ACK on the current path. - flow_control_requested: specifies whether the user has enabled flow control on the current path.
Based on these, the destination driver is able to make a decision when to perform the ACK:
- log_msg_break_ack(): is a new function call that can be used to generate the ACK and prepare for _sending_ the same message for further processing. - log_msg_ack(): is the good-old ack generation function that can be used when the processing is finally finished, and no processing is to be done.
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
:040000 040000 a26bae67903b4109d7b38fbc7c5189c16cc34537 498a1f804d5f3cc578d4fc0dd3833e9b2088a4f8 M lib :040000 040000 7ee20d0fad233792bc68b1db27e3377a22a21499 14bfecc8d10a9a4a36fbe4f38d68b2cef2f6aa63 M modules :040000 040000 eb95b6f26303ac0e879321b6f4366ff8dae66f69 19c30ee65e21c7ef10148630546658e28ac0f5b9 M tests
I'm looking into what exactly this patch does, and where the LogMessage freeing should be added.
Reverting the above patch from git head does, indeed, make the leak go away.
-- Bazsi
On Mon, 2011-08-01 at 12:39 +0200, Balazs Scheidler wrote:
Hi,
Thanks for the info, I'll try to look at this today.
On Fri, 2011-07-22 at 14:04 +0200, Gergely Nagy wrote:
Gergely Nagy <algernon@balabit.hu> writes:
Hendrik Visage <hvjunk@gmail.com> writes:
The patch wasn't the root cause for the leak, rather it was in the ref counter cache used to spare some atomic operations in the log processing fast path. This patch fixes that (I've pushed it just now): commit 13a87ed5548cb7b152b708d24bb5b3684771d7ca Author: Balazs Scheidler <bazsi@balabit.hu> Date: Mon Aug 1 20:13:45 2011 +0200 refcache: process the ref count differences in two steps The refcache may have caused a leak, when the ack callback performed an additional unref operation (which it does), because folding in the ref difference counter didn't take the additional unref into account. Thus the ref difference needs to be folded back into the atomic counter in two steps: 1) we add in the difference that was present when entering the function (e.g. the diffs accumulated before the function was entered) 2) we add in the difference that was created by the ACK callback Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> -- Bazsi
participants (3)
-
Balazs Scheidler
-
Gergely Nagy
-
Hendrik Visage