[syslog-ng] [RFC]: afsql improvement plans

Fri Apr 1 15:16:37 CEST 2011

You've laid everything out nicely, but you're missing a huge one:
HandlerSocket!  I would forget about all of the INSERT variations and
focus on a HanderSocket implementation.  Check out the MySQL
Performance Blog articles on it, and you'll see that according to
their numbers, you can get LOAD DATA speed with normal inserts using
it.  The only catch with it is you can't use auto-increment columns,
so $SEQNUM or something would have to be used for an id.

On Fri, Apr 1, 2011 at 8:05 AM, Gergely Nagy <algernon at balabit.hu> wrote:
> Hi!
>
> I'm about to blow up^W^Wrefactor the afsql destination to increase both
> performance and functionality. To do this properly, and to satisfy most
> needs, I'd like to ask the greater community (especially those of you
> who have more experience with logging to SQL than I do!)  about the
> plans laid out on the paragraphs below.
>
> Those of you who do log into SQL one way or the other, but aren't
> interested in technical mumbo-jumbo, please skip to the end of this
> e-mail, I'd have a few questions, for which all responses would be most
> appreciated!
>
> Where, when, how??
> ==================
>
> There are multiple goals to accomplish, which I believe are not
> exclusive: we want to keep the sql() statement in the configuration
> file, and maintain backwards compatibility aswell. On the other hand, we
> want to improve performance too, and that is best accomplished by using
> the native database libraries. To this end, I propose we introduce new
> database types: "mysql-native", "postgresql-native", "sqlite-native",
> and so on and so forth.
>
> The first step will be to rearrange the current afsql driver, so that it
> has a generic configuration interface, but can have multiple backends
> (dbi - the only existing backend right now; and one for each native db
> client). I'd change the AFSqlDestDriver structure to only contain the
> configurable data and nothing else. The backends would subclass this
> type (much like how AFSqlDestDriver subclasses LogDriver) and add their
> own.
>
> The reason I want to do this, making the AFSqlDestDriver a config-only
> interface instead of keeping larger chunks of code there (for example,
> the threading code) is because I do not want to limit my possibilities:
> I can imagine an SQL driver utilizing multiple writer threads
> easily. However, neither model works for all databases (concurrent
> writes to SQLite are usually not a good idea, for example).
>
> And that's about the generic interface! Lets dive into the more
> interesting parts: the native drivers. Since MySQL is the most used
> server out there, that will be the first I tackle.
>
> My goal with that is to implement a driver that is not only faster than
> our current destinations, but one that is also faster than the perl
> scripts people have been using until now, when they wanted top
> performance. All that while maintaining reasonable flexibility.
>
> An ambitious plan, but hopefully not impossible.
>
> I've been reading the MySQL documentation for the past few hours to
> explore my options, and found a few ways to increase our insert speed:
>
> mysql_list_tables()
> -------------------
>
> Instead of doing a full select to verify that a table exists, we can
> just try to list the table. This is probably not a significant increase,
> but every bit counts.
>
> Prepared statements
> -------------------
>
> We'll use prepared statements wherever possible - I expect this alone
> will result in a measurable performance increase, since the server will
> not have to parse each and every insert - only once, we'll bind
> variables thereafter.
>
> LOCK TABLES / transactions
> --------------------------
>
> Currently, syslog-ng can do bulk-inserts, wrapped in a transaction. But
> that only works if the MySQL table we insert to is on a storage engine
> that supports transactions (InnoDB). However, MyISAM has been the
> default for quite a while, and that does not support transactions to the
> best of my knowledge.
>
> In this case, insert performance can be increased by locking the tables
> for a few hundred inserts at a time, and unlocking after: poor man's
> transactions, to some extent.
>
> This, however, is more of a workaround than a real improvement.
>
> Insertion plans
> ===============
>
> There's a couple of alternative ways to improve insert speed by changing
> how we actually do them. All have their pros and cons, which I will try
> to enumerate below.
>
> LOAD DATA INFILE
> ----------------
>
> The fastest way by far is LOAD DATA INFILE, which is what people use in
> the various perl scripts, as far as I remember. The downside is that -
> as far as I saw so far - this does need a temporary file, which has its
> downsides, like writing to disk (unless put on tmpfs or similar). An
> option which I haven't tried would be to use a pipe(), so that the data
> never hits the disk.
>
> But nevertheless, the mysql-native db type will have an option to use
> LOAD DATA INFILE.
>
>  * Pros:
>   + The fastest way to insert
>  * Cons:
>   - Needs a temporary file, with all its drawbacks, some of which can
>   be worked around.
>   - Serialisation is costlier than simple INSERTs.
>   - No fine-grained error checking: if something goes wrong, the whole
>   LOAD will have to be dropped (unless we add code that splits bulks up
>   into smaller chunks, and retries - but that's something I want to
>   avoid if possible)
>
> For LOAD DATA to work, we will have to format the data to insert into a
> suitable format, and if there's an error in the serialisation somewhere,
> we'll have a hard time to figure out where the problem is, and syslog-ng
> will just give up.
>
> Also, beating our data into a suitable format isn't exactly cheap,
> either (think escaping: processing large amounts of string data,
> possibly changing strings - there's both processing and memory
> ramifications involved).
>
> The need for a file is also a (minor) issue. I suppose named pipes would
> work, that way data never hits the disk between syslog-ng and
> mysql. However, that still involves copying data around. We'll see if
> these affect LOAD DATA's performance badly enough (they probably don't).
>
> INSERT INTO ... VALUES (...), (...), ...
> ----------------------------------------
>
> This is considerably slower than LOAD DATA, but still faster than using
> INSERTs (even if we use transactions / locking for those).
>
> The downside is that - like with LOAD DATA - serialisation is still
> costlier than INSERTs.
>
>  * Pros:
>   + Reasonably fast
>   + Does not require neither a temporary file, nor gross hacks
>  * Cons:
>   - Slower than LOAD DATA
>   - Serialisation can be costly.
>   - No fine-grained error checking
>
> Serialisation is a little less costly than in LOAD DATA's case, but this
> can be worked around a little with prepared statements (with a fallback
> to parsed ones, in case we have to insert less data than originally
> planned).
>
> The same issues with error handling persist, though.
>
> INSERT ... DELAYED
> ------------------
>
> The DELAYED strategy works best if we have multiple writer threads, as
> in this case, INSERT becomes a fire & forget call, and we can return to
> our business far sooner. This does not work all that well with bulk
> inserts, though.
>
>  * Pros:
>   + Reasonably fast with multiple writers
>   + Simple & fast serialisation
>  * Cons:
>   - No error checking at all (except for connection and syntax
>   errors)
>   - Probably still slower than LOAD DATA
>
> The advantage of this is that we can get the data from syslog-ng to the
> mysql server faster than any of the other methods, and then the work is
> off-loaded to the server. This has the potential of being faster than
> any of the above, but with the cost of hardly any possibilities to
> recover from an error, as we get no feedback after inserting: did it
> succeed? did it not? We'll have no idea.
>
> The Plan
> ========
>
> The plan is to implement all the strategies above, and measure them
> against each other and external scripts. This will take a little time,
> because I want to get the architecture right, and figure out whether we
> want a single or multiple writer threads (per-destination, of course),
> or if I want to make that a configurable option aswell.
>
> Once done, and people are happy with it, I'll move on to the next
> database (PostgreSQL, to scratch my own itch too).
>
> The Questions
> =============
>
> What I'd like to know most, is, that would YOU expect from a native
> mysql destination?
>
> Apart from speed, what would you want from it, that the current driver
> does not support, but which is possible via external scripts?
>
> If you happened to read through my wall of text above, did I miss
> anything important?
>
> Cheers,
> --
> |8]
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.campin.net/syslog-ng/faq.html
>
>