[syslog-ng] Possible memleak in 3.3 HEAD

19 Sep 2011

      Hi,

I think I'm having a memleak in 3.3's HEAD. My production system, which 
receives ~300 messages/s throughout the day (so ~26.4 million 
messages daily) leaks ~700MB of memory per day.

I did some investigation, here are the details:

# syslog-ng -V
syslog-ng 3.3.0beta2
Installer-Version: 3.3.0beta2
Revision: ssh+git://bazsi@git.balabit//var/scm/git/syslog-ng/syslog-ng-ose--mainline--3.3#no_branch#d6eea6a1130f8343b6600976d9f2319f9ed86c88
Compile-Date: Sep  9 2011 16:32:59
Default-Modules: affile,afprog,afsocket,afuser,basicfuncs,csvparser,dbparser,syslogformat,afsql
Available-Modules: syslogformat,affile,afmongodb,basicfuncs,csvparser,dbparser,afsql,convertfuncs,afprog,confgen,afsocket,afuser,afsocket-tls,dummy
Enable-Debug: off
Enable-GProf: off
Enable-Memtrace: off
Enable-IPv6: on
Enable-Spoof-Source: on
Enable-TCP-Wrapper: on
Enable-Linux-Caps: on
Enable-Pcre: off
#

My build is patched with:
- https://github.com/algernon/libmongo-client/commit/93c5cff05feec5d7f421acfdf...
  (so it compiles on rhel5)
- https://lists.balabit.hu/pipermail/syslog-ng/2011-April/016375.html (I need
  this feature)

I don't think any of these is the reason, though.

My test configuration is basically this:

$ cat /etc/syslog-ng/syslog-ng.conf
@version: 3.3
options {
        time_reap (1);
        log_fifo_size (100000);
        use_dns (no);
        use_fqdn (no);
        create_dirs (yes);
        keep_hostname (yes);
        log_msg_size(16384);
        owner ("root");
        group ("root");
        perm (0644);
        dir_owner ("root");
        dir_group ("root");
        dir_perm (0755);
};
source s_www {
    syslog(ip(0.0.0.0) port(515) flags(no-multi-line) log_iw_size(6000) max-connections(30) tags("src.syslog", "src.port.515", "www", "www.accesslog"));
};
filter f_svc_www_accesslog     { tags("www.accesslog"); };

parser p_apache_split_vhost_access {
    csv-parser(
        columns("APACHE.ACCESS.VHOST", "APACHE.ACCESS.LOG_MESSAGE")
        flags(greedy)
        delimiters(" ")
    );
};
parser dbp_apache_split_vhost_access {
  db-parser(
    file("/etc/syslog-ng/pdb1.xml")
  );
};
destination d_access_split { 
    file("/logs/$S_YEAR/$S_MONTH/$S_DAY/${APACHE.ACCESS.VHOST:-default}-access.log-${S_YEAR}${S_MONTH}${S_DAY}" 
        template("${APACHE.ACCESS.LOG_MESSAGE}\n"));
};
log { source(s_www); 
    filter(f_svc_www_accesslog); 
    parser(dbp_apache_split_vhost_access); 
    destination(d_access_split);
    flags(final);
};
$

Those two parsers (csv-parser and db-parser) are doing exactly the same, 
they split my Apache HTTPd access logs into two parts: first column (which is
the vhost name) and the rest of the line. I have them both because I needed to
know if that's a leak in any particular parser type. (Apparently not)

I'm not very experienced with valgrind, but I did some tests, and what I've found
is that the number of "definitely lost" blocks depends on the number of files
created by d_access_split destination.

So I took 10000 lines of some real httpd logs and did two tests:
1) replace real vhost names with completely random strings, to maximize number
of files created by d_access_split
2) replace real vhost names with constant string, so my destination always writes
to the same file.

And this is what I got, using "G_SLICE=always-malloc valgrind --leak-check=full 
-v --show-reachable=yes --track-origins=yes 
--log-file=valgrind.log syslog-ng -F --no-caps -p /var/run/syslog-ng.pid"

1) random files:

==22416== 3,675,952 bytes in 9,989 blocks are definitely lost in loss record 578 of 578
==22416==    at 0x4021EC2: calloc (vg_replace_malloc.c:418)
==22416==    by 0x40FF57D: g_malloc0 (in /lib/libglib-2.0.so.0.1200.3)
==22416==    by 0x4052899: log_queue_fifo_new (logqueue-fifo.c:438)
==22416==    by 0x4046057: log_dest_driver_acquire_queue_method (driver.c:153)
==22416==    by 0x47081D1: affile_dw_init (driver.h:178)
==22416==    by 0x4707C41: affile_dd_open_writer (logpipe.h:239)
==22416==    by 0x405C258: main_loop_call (mainloop.c:145)
==22416==    by 0x4707A6B: affile_dd_queue (affile.c:1127)
==22416==    by 0x40451C8: log_dest_group_queue (logpipe.h:288)
==22416==    by 0x404B9DD: log_multiplexer_queue (logpipe.h:288)
==22416==    by 0x404EC16: log_parser_queue (logpipe.h:288)
==22416==    by 0x4047473: log_filter_pipe_queue (logpipe.h:288)
==22416== 
==22416== LEAK SUMMARY:
==22416==    definitely lost: 3,676,156 bytes in 9,992 blocks
==22416==    indirectly lost: 6,541 bytes in 56 blocks
==22416==      possibly lost: 16,967 bytes in 117 blocks
==22416==    still reachable: 76,908 bytes in 3,286 blocks
==22416==         suppressed: 0 bytes in 0 blocks

2) single file (constant filename):

==19460== 368 bytes in 1 blocks are definitely lost in loss record 525 of 577
==19460==    at 0x4021EC2: calloc (vg_replace_malloc.c:418)
==19460==    by 0x40FF57D: g_malloc0 (in /lib/libglib-2.0.so.0.1200.3)
==19460==    by 0x4052899: log_queue_fifo_new (logqueue-fifo.c:438)
==19460==    by 0x4046057: log_dest_driver_acquire_queue_method (driver.c:153)
==19460==    by 0x47081D1: affile_dw_init (driver.h:178)
==19460==    by 0x4707C41: affile_dd_open_writer (logpipe.h:239)
==19460==    by 0x405C258: main_loop_call (mainloop.c:145)
==19460==    by 0x4707A6B: affile_dd_queue (affile.c:1127)
==19460==    by 0x40451C8: log_dest_group_queue (logpipe.h:288)
==19460==    by 0x404B9DD: log_multiplexer_queue (logpipe.h:288)
==19460==    by 0x404EC16: log_parser_queue (logpipe.h:288)
==19460==    by 0x4047473: log_filter_pipe_queue (logpipe.h:288)
(...)
==19460== LEAK SUMMARY:
==19460==    definitely lost: 572 bytes in 4 blocks
==19460==    indirectly lost: 6,541 bytes in 56 blocks
==19460==      possibly lost: 12,919 bytes in 106 blocks
==19460==    still reachable: 76,908 bytes in 3,286 blocks
==19460==         suppressed: 0 bytes in 0 blocks

Could anyone proficient in syslog-ng source code take a look on this?
Can I do anything else to help you diagnose this? (I can provide complete 
valgrind output logs if this helps.)

Also, I'll try to reproduce this on other versions later this week, unless
someone fixes this earlier. :)

Cheers,

-- 
Jakub Jankowski|shasta@toxcorp.com|http://toxcorp.com/
GPG: FCBF F03D 9ADB B768 8B92 BB52 0341 9037 A875 942D