Fastest, most scalable solution
Hi folks,

I'm thinking about completely rewriting my app (php-syslog-ng) in order to provide high-end scalability and a better (AJAX) front end. At the base of it, my biggest concern is scalability: I want the backend to be able to handle many thousands of messages per second.

In your opinion, what is the best way to accomplish this? Should I:
- Log everything to MySQL directly? If so, can MySQL handle that insert rate?
- Log everything to disk and use flat files to do analysis and reporting?
- Log everything to disk and use MySQL's LOAD DATA INFILE?
- Other suggestions?

Also, is anyone out there using MySQL at this level of insert rates? Do you have any recommendations on table structures, indexes, etc.?

--
Clayton Dukes
I think for serious (over 20k messages per second) bulk processing on a single machine, flat files are the only way to go. If you segregate the files into enough separate directories, you can get some pseudo-indexing for faster searches, but you'll never get any real indexing for true GROUP BY-style reporting. Flat files would probably suffice for fairly fast, basic grep-like searches, but if you want to graph anything, you'll need a database somewhere along the line.

I've almost got the code for the backend I'm working on down now, and I'm seeing performance of about 500 messages per second per CPU, bursting to about 750 MP/S/CPU, scaling linearly with 0% loss. This is with my crazy massive indexing schema, in which I have about 7 extra data field rows for each syslog message (as parsed by db-parser) in addition to the standard syslog fields of host, program, level, etc. I split syslogs into four tables: meta, message, integer fields, and char fields. The result is that the 88 million syslogs in a test run last night created about 900 million rows in the database. I'm sure you could get upwards of 2-3k MP/S/CPU if you just write flat, non-indexed tables instead of the schema I'm using.

The write strategy I'm using is program() in syslog-ng to write logs to the STDIN of a Perl script, which round-robin load balances across N forked Perl child processes that do some minor rewriting and write out tab-separated files ready for database bulk import. I'm using MySQL 5.1 and mysqlimport to do LOAD DATA batches in totally separate Perl processes. The tables are separated by type, log class, and hour. The most interesting thing performance-wise is that the actual disk utilization is not the limiting factor, but rather the CPU usage in the MySQL indexing process.

MySQL MyISAM tables are the best way to go if you want your data in a database, because MySQL InnoDB, PostgreSQL, MSSQL, and Oracle all have clustered (or cluster-like) indexes. Clustered indexes make bulk inserts much, much slower because the data has to be arranged by primary key instead of being written straight into the table as it is received. That makes inserts something like 1-2 orders of magnitude slower.

One idea I toyed with but decided against was to use Sphinx to index the log data. I decided against it because Sphinx won't index non-alphanumeric data, like IP addresses. If its source code could be altered so that it would index things like IP addresses, MAC addresses, email addresses, and DNS names, then syslog-ng -> non-indexed MySQL -> Sphinx would be a good solution. I've also considered using SQLite, but I don't think you'd gain much over MyISAM.

Is anyone else attempting high-throughput indexing? What's worked for you? How about large-scale reporting/graphing strategies?

--Martin
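A minimal sketch of the fan-out stage described above, for illustration only: a parent reads the syslog-ng program() feed on STDIN and round-robins lines to forked Perl workers, each writing tab-separated rows ready for mysqlimport. The worker count, file paths, and pass-through parsing are assumptions, not the actual backend code.

#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

my $workers = 4;                          # assumption: tune to CPU count
my @pipes;

for my $i (0 .. $workers - 1) {
    pipe(my $reader, my $writer) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                      # child: consume lines, emit TSV
        close $writer;
        open my $out, '>>', "/var/tmp/syslog_batch_$i.tsv" or die $!;
        while (my $line = <$reader>) {
            # Minor rewriting would happen here; if syslog-ng already emits
            # tab-separated fields via its template, just pass them through.
            print {$out} $line;
        }
        close $out;
        exit 0;
    }
    close $reader;                        # parent keeps only the write end
    $writer->autoflush(1);
    push @pipes, $writer;
}

my $n = 0;
while (my $line = <STDIN>) {              # feed from syslog-ng program()
    print { $pipes[ $n++ % $workers ] } $line;
}
close $_ for @pipes;                      # EOF tells the children to finish
wait() for 1 .. $workers;

A separate process would then pick up the finished .tsv files and run mysqlimport against the hourly tables.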
On Jul 22, 2009, at 3:17 PM, Clayton Dukes wrote:
Hi folks, I'm thinking about completely rewriting my app (php-syslog-ng) in order to provide high-end scalability and a better (AJAX) front end.
At the base of it, my biggest concern is scalability. I want to write the backend to be able to handle many thousands of messages per second.
In your opinion, what is the best way to accomplish this? Should I: Log everything to MySQL directly? If so, can MySQL handle that insert rate?
It can. I have Perl-based analyzers that eat a feed from syslog-ng and insert into MySQL.
Log everything to disk and use flat files to do analysis and reporting? Log everything to disk and use MySQL's load_data_infile?
Other suggestions?
Also, is anyone out there using MySQL for this level of insert rates? Do you have any recommendations on table structures, indexes, etc.?
I'm at 16 million rows for today already, and it's only 10 am here.

Look into bulk insert formats. Keep in mind that MySQL updates indexes after every insert completes, so row-by-row inserts will choke. Also, insert times increase slightly as the table grows, so you'll see a performance hit late in the day. You want to keep indexes as light as possible for this.

- billn
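For illustration, here is the difference between row-by-row inserts and a bulk format in Perl DBI, against a hypothetical logs table (host, msgtime, msg); the table, columns, and credentials are made up, not the schema actually in use.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=syslog;host=localhost',
                       'loguser', 'secret', { RaiseError => 1 });

# Slow path: one statement per message, so indexes get updated per row.
my $sth = $dbh->prepare('INSERT INTO logs (host, msgtime, msg) VALUES (?, ?, ?)');
# $sth->execute($host, $time, $msg);   # called once per message -- will choke

# Bulk path: batch messages into a tab-separated file and load it in one shot.
# (LOAD DATA LOCAL may need local_infile enabled on the server and driver.)
$dbh->do(q{
    LOAD DATA LOCAL INFILE '/var/tmp/syslog_batch_0.tsv'
    INTO TABLE logs
    FIELDS TERMINATED BY '\t'
    (host, msgtime, msg)
});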
Bill,

I'm finding that the indexing process after LOAD DATA is still extremely intensive, though I'm getting about 12 million rows per hour inserted successfully (albeit with an extremely high load on the box). If you do LOAD DATA into non-indexed tables, MySQL barely sweats, and I think the upper bound on insert rates would be more like 50-100 million rows per hour.

What does your schema look like? Are you splitting the messages into separate tables? I'm finding that the big challenge with getting the tables into third normal form is that you need to pre-generate your message IDs so that you can do a LOAD DATA. An auto-increment value on a column means you won't be able to split messages into more than one table and map them together easily, but if you only use the normal syslog header fields, you can get them all into one table without too much of a problem. However, I'm trying to get extracted field information inserted as well, which requires more than just one standard table and many more indexes.

Thanks,
Martin
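A minimal sketch of the ID pre-generation described above: the batch writer assigns its own message id while emitting two tab-separated files, so the meta row and its extracted field rows can be joined after LOAD DATA without relying on auto_increment. The file names, field layout, and id scheme are assumptions for illustration.

#!/usr/bin/perl
use strict;
use warnings;

open my $meta,   '>', '/var/tmp/batch_meta.tsv'   or die $!;
open my $fields, '>', '/var/tmp/batch_fields.tsv' or die $!;

# Crude unique-enough seed per batch; a real setup would reserve id ranges
# per worker so parallel batches never collide.
my $id = time() * 1_000_000;

while (my $line = <STDIN>) {
    chomp $line;
    my ($host, $program, $msg, @extracted) = split /\t/, $line;
    $id++;
    print {$meta} join("\t", $id, $host, $program, $msg), "\n";
    for my $kv (@extracted) {                 # extracted fields as key=value
        my ($key, $value) = split /=/, $kv, 2;
        print {$fields} join("\t", $id, $key, $value), "\n";
    }
}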
I can't release my schema, but I can talk about some of this. I sort into daily tables currently, so each day is separate from the previous. I log data like analyzer rule matches, source host, and timestamps, with indexes on those fields only. It's pretty light overall, with about 20 analyzers around the world inserting at once. I'm not sure what your setup is, but mine is for live logging as close to real time as I can get, given buffer fill cycles and the like.

An auto-increment value is only a problem if you really, truly care about it, which I don't. It's there for expiration purposes in my case, so I can delete by unique id within each day's table. Even if you start combining days into monthly tables, just give them a new unique id and index against it, because you're just using it as a convenient handle for reference from the application anyway.

One slight advantage I have by logging the analyzer rule id is that I don't need to store extracted information; I just reference the rule and re-extract it later. On-demand CPU usage vs. storage/index costs. I think you'll find that unless you have a pressing need to constantly be rooting through logs (I don't, since I have analyzers kicking off trouble tickets), you might want to reassess this.

- billn
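A rough sketch of the per-day layout described above: one MyISAM table per day, an auto_increment id used only as an expiration handle, and light indexes on the fields the analyzers match on. Column names and types are guesses for illustration, not the real schema.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

my $dbh   = DBI->connect('DBI:mysql:database=syslog;host=localhost',
                         'loguser', 'secret', { RaiseError => 1 });
my $table = 'logs_' . strftime('%Y%m%d', localtime);   # e.g. logs_20090724

$dbh->do(qq{
    CREATE TABLE IF NOT EXISTS $table (
        id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        rule_id  INT UNSIGNED    NOT NULL,
        host     VARCHAR(64)     NOT NULL,
        msgtime  DATETIME        NOT NULL,
        msg      TEXT,
        PRIMARY KEY (id),
        KEY idx_rule (rule_id),
        KEY idx_host (host),
        KEY idx_time (msgtime)
    ) ENGINE=MyISAM
});

Expiration is then a DELETE by id range within a day's table, or simply dropping old daily tables.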
Thanks for describing your setup a bit; I appreciate it. Do you do any graphing or reporting with your setup, or is it mainly for grepping and alerting? I'm primarily interested in graphing, so indexing becomes a priority. I'm also interested in being able to make field-specific queries like "source port=80" and to be able to graph such results on the fly.
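As an illustration of the kind of on-the-fly query meant here: count messages whose extracted integer field "srcport" equals 80, bucketed per minute so the result can feed a graph. The meta/fields_int table names and columns are assumptions, not an actual schema.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=syslog;host=localhost',
                       'loguser', 'secret', { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(q{
    SELECT DATE_FORMAT(m.msgtime, '%Y-%m-%d %H:%i:00') AS minute,
           COUNT(*)                                    AS hits
    FROM   meta m
    JOIN   fields_int f ON f.msg_id = m.id
    WHERE  f.field_name = 'srcport'
      AND  f.value      = 80
      AND  m.msgtime   >= NOW() - INTERVAL 1 HOUR
    GROUP  BY minute
    ORDER  BY minute
});

printf "%s\t%d\n", @$_ for @$rows;    # minute, count -- ready for a grapher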
I'm predominantly into alerting. I don't graph my traffic, but I do keep an eye on inbound rates, because occasionally someone puts a box or app into debug mode and runs line-rate messages at an analyzer. Some of the bodies have never been found.

- billn
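Not anyone's actual code, just a toy illustration of watching inbound rates: count messages per host over a fixed window on the syslog-ng feed and complain when one host suddenly floods (say, something left in debug mode). The threshold, window, and the assumption that the host is the first tab-separated field are made up.

#!/usr/bin/perl
use strict;
use warnings;

my $window    = 60;        # seconds per measurement window
my $threshold = 10_000;    # messages per host per window before complaining
my %count;
my $deadline  = time() + $window;

while (my $line = <STDIN>) {
    my ($host) = split /\t/, $line, 2;
    $count{$host}++;
    if (time() >= $deadline) {
        for my $h (sort keys %count) {
            warn "$h sent $count{$h} messages in ${window}s\n"
                if $count{$h} > $threshold;
        }
        %count    = ();
        $deadline = time() + $window;
    }
}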
participants (3)
- Bill Nash
- Clayton Dukes
- Martin Holste