This sounds like MySQL-vs-PgSQL flame war bait, but I do want to address some important ELSA architecture decisions. I will readily admit that I do not use PgSQL much at all, so I will not try to address those issues. I will also note that I almost decided to use straight flat-files instead of any RDBMS, but the XML pipe performance was slower for Sphinx, and an RDBMS does make a lot of other administrative things handier, so I decided to go with one. I don't think you're fully grasping just how Sphinx-dependent the whole setup is and how very, very little the DB has to do with performance since it is doing no indexing other than a simple primary key for the row. Sphinx is doing all queries as well except for the final row retrievals based on primary key (like I said, I could've used flat files). PgSQL will beat MySQL for many, many features, especially on writes (http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL), which you would think would make it a shoe-in for ELSA. But since all writes have to be in a batch anyway (as you said, neither MySQL nor PgSQL can handle the load with ordinary inserts), none of those extra write performance improvements help. The reason Facebook, Craigslist, and ELSA are using MySQL is because those features aren't needed. We need things like HandlerSocket because we're not doing transaction processing. If I were to switch to anything, it would be MongoDB because a schema-less implementation is helpful when you don't want a snowflake schema and you don't know all of your column names ahead of time. The only reason I didn't use it to begin with was the poor Sphinx XML pipe performance and the overhead of having to marshal everything through XML.
Given that postgresql outperforms mysql in high writes and high concurrency situations and that more time is spent computing indexes rather that retrieving data from the db, sphinx on pgsql should be quite fast actually.
The database has nothing to do with "computing indexes" as all time is spent "retrieving data from the db."
Also on multi-core systems postgresql is significantly faster. The sphinx folks confirm that db communication isn't the bottleneck.
I have used Sphinx a lot with ODBC on Sybase, and the numbers were very, very close to the numbers for using XML pipe with simple data, which is why I am assuming that ODBC was the bottleneck. If you have a reference regarding the bottleneck, please forward along. My recommendation is to use MySQL 5.5 which has been generally available for awhile now. There is something like %30 better performance all around according to most estimates on the MySQL Performance Blog, but major rewrites in the threading model make it scale to multiple CPU's much better. For those reading this thread and running MySQL, I highly recommend upgrading to 5.5.
Another Factor to consider is regulations or company policy regarding log storage. MyISAM is terrible when it cones to this, that's the tradeoff.
If you need encryption, do it at the file system or partition level, preferably through the RAID card or LVM. Regulations would only apply to encryption. In most orgs I've been around, few know PgSQL, so it's actually a much tougher sell than MySQL, which most people have at least heard of. I'm not saying that makes MySQL better, but I find it to be more accepted as a solution by upper-management. Now that Oracle is behind it (which I am personally not a fan of), it's even easier to for management to approve it.
For data integrity you'd be wiser to choose InnoDB or postgresql. Also o note is that postgresql can store the data compressed to begin with thus obviating the need for An archive.
Very true! Unfortunately, there are huge performance hits for both of these. InnoDB also takes up much, much more disk space than MyISAM (not sure about the PG disk space). ELSA is about performance and fast search. Logs you can't find might as well have been lost or never recorded to begin with.
I've tested both directly an for my original needs neither was fast enough though postgresql was much faster than mysql in every case I tested. If you weren't using MySQL 5.5, then I am sure this is true. Even if you were running 5.5, I'd still believe it's true. Also, MySQL does need a fair amount of ini file changes as the out-of-the-box setup is not optimized for large data sets. They include templates for various configurations in the RPM or tarball, and I recommend using the "/usr/share/mysql/my-huge.cnf" config as a template.
I would be wary of tossing out assumed numbers about performance. I've seen postgresql copy and even faster of_upload handles well over a gig of data per minute when the input file is much larger than system memory.
Assuming is bad! Do you have some numbers? My logs average about 300 bytes in message length, so 1 gig of data translates to roughly 60k rows/second. (1073741824 / 300 = 3579139.413 logs per minute, which is 59652.323 logs per second.) As I'm typing, here's what ELSA is currently logging for its rates for the to-be-indexed log table: Loaded 371906 records in 2.89521217346191 seconds (128455.525093796 per second) And here's the log for the archive table (compression is done as the inserts occur): Loaded 371906 records in 5.32420706748962 seconds (69851.903820742 per second) And here's the Sphinx indexing number next from the log: Indexed 371906 rows in 6.03515 seconds (61623.23033 rows/sec) So, that's about 3 seconds for the load and 6 seconds for the index for a total of 9 seconds for ELSA to full-text index (371906 * 300) = 111,571,800 bytes of log data. 9 seconds translated to a minute to compare with your number would be 6.66 * 111,571,800 = 743812000 bytes per minute. That's only 75% as fast as the PgSQL bulk load, but that means it's _full text indexed_ and still only %25 slower!
Given this, I would say try it and see.
As I said, the database is barely needed at all, so the only feature that matters at all is primary key lookup speed (for Sphinx index creation and query result finalizing) and bulk import speed. I would wager that with proper key buffers configured in MySQL, the difference for primary key lookup would be negligible, because it's almost entirely disk-bound, but it would be interesting to see if the Sphinx indexing numbers were significantly better on PgSQL, because 2x time is spent indexing than loading.