[syslog-ng] Fastest, most scalable solution

Thu Jul 23 04:53:20 CEST 2009

I think for serious (over 20k messages per second) bulk processing on
a single machine, flat files are the only way to go.  If you segregate
the files into separate directories enough, you can get some
pseudo-indexing for faster searches, but you'll never be able to get
any real kind of indexing for true GROUP BY-style reporting.  Flat
files would probably suffice for fairly fast, basic grep-like
searches, but if you want to graph anything, you'll need a database
somewhere along the line.
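
To make the directory-segregation idea concrete, here is a minimal Python sketch of one possible layout. The layout itself (date and hour first, then host, then program) and the function name log_path are assumptions for illustration, not something I've benchmarked; the point is only that a search for one host in one hour touches a single small file.

```python
from datetime import datetime
from pathlib import PurePosixPath


def log_path(root, host, program, ts):
    """Build a per-hour, per-host, per-program flat-file path.

    Hypothetical layout: <root>/YYYY/MM/DD/HH/<host>/<program>.log
    """
    # Replace path separators so a hostile host/program name cannot
    # escape the directory tree.
    safe_host = host.replace("/", "_")
    safe_prog = program.replace("/", "_")
    return (PurePosixPath(root)
            / ts.strftime("%Y/%m/%d/%H")
            / safe_host
            / f"{safe_prog}.log")


if __name__ == "__main__":
    when = datetime(2009, 7, 23, 4, 53, 20)
    print(log_path("/var/log/archive", "web01", "sshd", when))
```

With a layout like this, a grep for one host on one day only has to walk that day's directories instead of the whole archive, but it still gives you nothing resembling GROUP BY.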

The code for the backend I'm working on is nearly done, and I'm seeing
about 500 messages per second per CPU (MP/S/CPU), bursting to about
750 MP/S/CPU, scaling linearly with 0% loss.
This is with my crazy massive indexing schema in which I have about 7
data field rows for each syslog message (as parsed by db-parser) in
addition to the standard syslog data fields of host, program, level,
etc.  I split syslogs into four tables: meta, message, integer fields,
and char fields.  The result is that the 88 million syslogs in a test
run last night created about 900 million rows in the database.  I'm
sure that you could get upwards of 2-3k MP/S/CPU if you just write
flat, non-indexed tables instead of the schema I'm using.
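
As a rough illustration of the four-table split, here is a Python sketch that turns one parsed message into rows for each table. The table names, column order, and the function split_message are hypothetical and only meant to show the shape of the data, not the actual schema.

```python
def split_message(msg_id, host, program, level, timestamp, text, fields):
    """Split one parsed syslog message into rows for four tables.

    `fields` is a dict of the extra values extracted by the parser.
    Table and column names here are illustrative assumptions.
    """
    rows = {"meta": [], "message": [], "int_fields": [], "char_fields": []}
    # One row of standard syslog metadata per message.
    rows["meta"].append((msg_id, timestamp, host, program, level))
    # One row holding the raw message text.
    rows["message"].append((msg_id, text))
    # One row per extracted field, routed by value type.
    for name, value in fields.items():
        if isinstance(value, int) and not isinstance(value, bool):
            rows["int_fields"].append((msg_id, name, value))
        else:
            rows["char_fields"].append((msg_id, name, str(value)))
    return rows


if __name__ == "__main__":
    example = split_message(
        1, "fw01", "kernel", 6, 1248317600,
        "DROP src=10.0.0.1 dport=22",
        {"src_ip": "10.0.0.1", "dst_port": 22},
    )
    for table, table_rows in example.items():
        print(table, table_rows)
```

For scale, 900 million rows from 88 million messages works out to roughly 10 rows per message on average, which is why the indexing cost dominates.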

The write strategy I'm using is to use program() in syslog-ng to write
logs to the STDIN of a Perl script, which round-robin load balances
across N forked Perl child processes. The children do some minor
rewriting and write out tab-separated files ready for database bulk
import. I'm
using MySQL 5.1 and mysqlimport to do LOAD DATA batches in totally
separate Perl processes.  The tables are separated by type, log class,
and hour. The most interesting thing performance-wise is that the
actual disk utilization is not the limiting factor, but rather the CPU
usage in the MySQL indexing process.  MySQL MyISAM tables are the best
way to go if you want your data in a database because MySQL InnoDB,
PostgreSQL, MSSQL, and Oracle all have clustered (or cluster-like)
indexes.  The clustered indexes make bulk inserts much, much slower
because the data has to be arranged by primary key instead of just
being written straight into the table as it is received.  This
makes inserts something like 1-2 orders of magnitude slower.
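
The dispatch and file-writing steps can be sketched roughly as follows. This is Python rather than the Perl I'm actually running, it uses in-memory queues instead of forked children, and the names round_robin, tsv_row, and batch_name (plus the file naming pattern) are made up for the example. The escaping follows MySQL's default LOAD DATA text format: tab-separated fields, backslash as the escape character, and \N for NULL.

```python
from itertools import cycle


def tsv_escape(value):
    """Escape one field for MySQL's default LOAD DATA text format."""
    if value is None:
        return r"\N"  # LOAD DATA reads \N as NULL
    text = str(value)
    # Escape the backslash first so later escapes are not doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n"))


def tsv_row(fields):
    """Join escaped fields into one tab-separated, newline-ended row."""
    return "\t".join(tsv_escape(f) for f in fields) + "\n"


def round_robin(lines, n_workers):
    """Assign each input line to one of n_workers queues in rotation."""
    queues = [[] for _ in range(n_workers)]
    for queue, line in zip(cycle(queues), lines):
        queue.append(line)
    return queues


def batch_name(table_type, log_class, epoch_seconds):
    """Name a batch file by table type, log class, and hour (assumed scheme)."""
    hour_start = epoch_seconds - (epoch_seconds % 3600)
    return f"{table_type}_{log_class}_{hour_start}.tsv"


if __name__ == "__main__":
    print(round_robin(["m1", "m2", "m3", "m4", "m5"], 2))
    print(repr(tsv_row(["fw01", None, "a\tb"])))
    print(batch_name("meta", "firewall", 7205))
```

The file name matters because mysqlimport derives the target table name from the file name with its extension stripped, so one file per type, class, and hour maps directly onto one table.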

One idea I toyed with but decided against was to use Sphinx to index
the log data.  I decided against it because Sphinx won't index
non-alphanumeric data, like IP addresses.  If its source code could
be altered so that it would index things like IP addresses, MAC
addresses, email addresses, and DNS names, then syslog-ng ->
non-indexed MySQL -> Sphinx would be a good solution.  I've also
considered using SQLite, but I don't think you'd gain much over
MyISAM.
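
To show what goes wrong with IP addresses, here is a small Python sketch comparing two tokenizers. Neither is Sphinx's actual code; the first is a generic split on non-alphanumeric characters, and the second is a hypothetical alternative that keeps dots, colons, at-signs, underscores, and hyphens when they sit between alphanumeric runs.

```python
import re


def split_on_non_alnum(text):
    """Generic tokenizer: every non-alphanumeric character is a separator."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def keep_network_tokens(text):
    """Hypothetical tokenizer that keeps . : @ _ - inside a token."""
    return re.findall(r"[0-9A-Za-z]+(?:[.:@_-][0-9A-Za-z]+)*", text)


if __name__ == "__main__":
    line = "denied 10.1.2.3 from aa:bb:cc:dd:ee:ff"
    print(split_on_non_alnum(line))
    print(keep_network_tokens(line))
```

With the first tokenizer, a search for 10.1.2.3 degrades into a search for the separate numbers 10, 1, 2, and 3, which matches far too much to be useful on log data.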

Is anyone else attempting high-throughput indexing?  What's worked for
you?  How about large-scale reporting/graphing strategies?

--Martin

On Wed, Jul 22, 2009 at 5:17 PM, Clayton Dukes<cdukes at gmail.com> wrote:
> Hi folks,
> I'm thinking about completely re-writing my app (php-syslog-ng) in
> order to provide high end scalability and a better (ajax) front end.
>
> At the base of it, my biggest concern is scalability.
> I want to write the backend to be able to handle many thousands of
> messages per second.
>
> In your opinion, what is the best way to accomplish this?
> Should I:
> Log everything to MySQL directly? If so, can MySQL handle that insert rate?
> Log everything to disk and use flat files to do analysis and reporting?
> Log everything to disk and use MySQL's load_data_infile?
>
> Other suggestions?
>
> Also, is anyone out there using MySQL for this level of insert rates?
> Do you have any recommendations on table structures, indexes, etc.?
>
> --
> ______________________________________________________________
>
> Clayton Dukes
> ______________________________________________________________