[syslog-ng] syslog-ng 3.2.4 MySQL connection loss Server has gone away

Thu Sep 29 10:06:07 CEST 2011

On Thu, 2011-09-29 at 00:21 -0700, Erik Maciejewski wrote:
> On Sep 28, 2011, at 12:06 AM, Balazs Scheidler wrote:
> 
> > Hi,
> > 
> > Sorry for the late answer, but better late than never :)
> > 
> > On Sun, 2011-09-04 at 23:15 -0700, Erik Maciejewski wrote:
> >> Hello,
> >> 
> >> I have been researching an issue with syslog-ng 3.2.4 (both the Linux glibc2.3.6 and platform 
> >> independent compiled from source versions) and MySQL on CentOS 5.6 in which syslog-ng tries to 
> >> insert a log messages using a dead TCP or unix socket connection to a MySQL database. The issue 
> >> results in the max attempt to insert a log message and then the message subsequently dropped. 
> >> This seems to occur regularly in predominately low message volume environments, but has the 
> >> potential to affect all environments using a MySQL (or other db) destination. The root cause 
> >> of the issue seems to be directly related to the health of the connection between syslog-ng 
> >> and MySQL and can be affected by the "wait_timeout" value used by MySQL to kill off inactive 
> >> connections. I want to provide justification for implementing a fix to syslog-ng as I believe 
> >> manipulating a, possibly tuned, environment variable for MySQL would seem like the wrong 
> >> approach to take for remedial action.
> >> 
> >> I feel that many MySQL instances are implemented in shared environments and are themselves 
> >> shared by many applications. New connections to a MySQL instance are generally regarded as 
> >> low cost and, it would seem, more often than not never used in a persisted fashion when 
> >> supporting distributed applications. That being said, many times the "wait_timeout" 
> >> value is set very low so that the MySQL instance can effectively control the number of
> >> simultaneous connections.
> >> 
> >> 
> >> By taking somewhat of a black-box approach to the issue (I'm just starting to explore 
> >> syslog-ng), I noticed that there doesn't seem to be any health checking of the TCP or 
> >> unix socket connection in the SQL destination implementation after the initial 
> >> connection is made (most of the db actions being abstracted away by libdbi). Even 
> >> if a syslog-ng database thread is suspended and reactivated due to error, there 
> >> does not seem to be any attempt to check the health of the database connection.
> >> 
> >> I have implemented a fix in my compiled version that will check a database 
> >> connection's health and attempt to reconnect upon discovering a dead 
> >> connection. I would like to know the policies/procedures/best practices 
> >> for submitting such issues and/or fixes for discussion, review, and 
> >> implementation. If this is a known issue or I am just wrong, all the better! 
> > 
> >> Thank you guys for keeping this project going and I'm looking forward to 
> >> implementing syslog-ng in a production environment in the very near future!
> > 
> > 
> > syslog-ng doesn't explicitly check the aliveness of an SQL connection,
> > however it does drop and reestablish connection if an error occurs.
> > 
> > If you look at afsql_dd_database_thread() function in afsql.c, you'll
> > see this code:
> > 
> >      if (!afsql_dd_insert_db(self))
> >        {
> >          afsql_dd_disconnect(self);
> >          afsql_dd_suspend(self);
> >        }
> > 
> > afsql_dd_insert_db() should return FALSE for any failures, and syslog-ng
> > basically assumes that if the database side initiates the closure of a
> > connection, that'll trickle up to syslog-ng as an error return to
> > dbi_conn_query(), which will in turn cause the SQL destination to
> > suspend its operations for time_reopen() amount of time and then
> > reconnect.
> > 
> > What kind of fix did you implement yourself? Can you post the patch?
> > Thanks.
> > 
> > -- 
> > Bazsi
> > 
> > 
> No worries :)
> 
> Basically, I would propose explicitly checking the health of the connection 
> (using libdbi). By doing so, I believe there is a likelihood that log messages 
> will not be lost in the event that the connection to the database is terminated 
> (ex: db process restart or db connection TTL).
> 
> Towards the beginning of the function afsql_dd_insert_db() in afsql.c I have:
> 
>     if (!self->dbi_ctx)
>       {
> 	    /*handle new db connection*/
> 	
> From my understanding, this section of code checks to see if a valid connection
> instance exists. If not, a new instance is created. I'm not too familiar with 
> libdbi, but it doesn't seem like a valid connection instance equates to a valid 
> and healthy connection. Upon the first attempt to log a message to the db, this 
> instance checking works because the connection instance does not yet exist. Once 
> the initial connection instance is made, a connection to the database is also 
> made (and the first INSERT statement is executed), but the connection instance does 
> not guarantee that the connection will remain intact. If the connection is terminated 
> outside of syslog-ng, it would seem, the "connection instance" is still valid and 
> subsequently a new connection is never attempted and FALSE is never returned. 
> 
> The table validation check, also in afsql_dd_insert_db(): 
> 
>     if (!afsql_dd_validate_table(self, table->str))
>         {
>           /*handle table validation error*/
> 
> does not catch this either as the table was successfully validated from the first 
> successful insert.
> 
> Keeping inline with the concept that afsql_dd_insert_db() should return FALSE for any
> failures, I am proposing the use of the libdbi function dbi_conn_ping() to check if 
> the db connection is alive within afsql_dd_insert_db(). Please let me know if I am way
> off base here or if I am missing some underlying concepts. I'm sorry that I am not quite 
> familiar with git yet, but my proposed patch, with a/modules/afsql/afsql.c being the
> current file from syslog-ng 3.2.4, would be as follows:
> 
> diff -u a/modules/afsql/afsql.c b/modules/afsql/afsql.c 
> --- a/modules/afsql/afsql.c	
> +++ b/modules/afsql/afsql.c	
> @@ -572,7 +572,7 @@
>    gint i;
>    LogPathOptions path_options = LOG_PATH_OPTIONS_INIT;
> 
> -  if (!self->dbi_ctx)
> +  if (!self->dbi_ctx || dbi_conn_ping(self->dbi_ctx) != 1)
>      {
>        self->dbi_ctx = dbi_conn_new(self->type);
>        if (self->dbi_ctx)

This would probably slow down the insert rate a lot as it'd require a
roundtrip to the SQL server before each INSERT.

What I would be interested in why syslog-ng doesn't reestablish the SQL
connection at the first error. It should do so.

If an error happens that doesn't mean that the last message is lost. It
is retried for 3 times (by default).

-- 
Bazsi