[syslog-ng] syslog-ng 3.2.4 MySQL connection loss Server has gone away

Fri Sep 30 07:02:59 CEST 2011

On Sep 29, 2011, at 1:06 AM, Balazs Scheidler wrote:

> On Thu, 2011-09-29 at 00:21 -0700, Erik Maciejewski wrote:
>> On Sep 28, 2011, at 12:06 AM, Balazs Scheidler wrote:
>> 
>>> Hi,
>>> 
>>> Sorry for the late answer, but better late than never :)
>>> 
>>> On Sun, 2011-09-04 at 23:15 -0700, Erik Maciejewski wrote:
>>>> Hello,
>>>> 
>>>> I have been researching an issue with syslog-ng 3.2.4 (both the Linux glibc2.3.6 and platform 
>>>> independent compiled from source versions) and MySQL on CentOS 5.6 in which syslog-ng tries to 
>>>> insert a log messages using a dead TCP or unix socket connection to a MySQL database. The issue 
>>>> results in the max attempt to insert a log message and then the message subsequently dropped. 
>>>> This seems to occur regularly in predominately low message volume environments, but has the 
>>>> potential to affect all environments using a MySQL (or other db) destination. The root cause 
>>>> of the issue seems to be directly related to the health of the connection between syslog-ng 
>>>> and MySQL and can be affected by the "wait_timeout" value used by MySQL to kill off inactive 
>>>> connections. I want to provide justification for implementing a fix to syslog-ng as I believe 
>>>> manipulating a, possibly tuned, environment variable for MySQL would seem like the wrong 
>>>> approach to take for remedial action.
>>>> 
>>>> I feel that many MySQL instances are implemented in shared environments and are themselves 
>>>> shared by many applications. New connections to a MySQL instance are generally regarded as 
>>>> low cost and, it would seem, more often than not never used in a persisted fashion when 
>>>> supporting distributed applications. That being said, many times the "wait_timeout" 
>>>> value is set very low so that the MySQL instance can effectively control the number of
>>>> simultaneous connections.
>>>> 
>>>> 
>>>> By taking somewhat of a black-box approach to the issue (I'm just starting to explore 
>>>> syslog-ng), I noticed that there doesn't seem to be any health checking of the TCP or 
>>>> unix socket connection in the SQL destination implementation after the initial 
>>>> connection is made (most of the db actions being abstracted away by libdbi). Even 
>>>> if a syslog-ng database thread is suspended and reactivated due to error, there 
>>>> does not seem to be any attempt to check the health of the database connection.
>>>> 
>>>> I have implemented a fix in my compiled version that will check a database 
>>>> connection's health and attempt to reconnect upon discovering a dead 
>>>> connection. I would like to know the policies/procedures/best practices 
>>>> for submitting such issues and/or fixes for discussion, review, and 
>>>> implementation. If this is a known issue or I am just wrong, all the better! 
>>> 
>>>> Thank you guys for keeping this project going and I'm looking forward to 
>>>> implementing syslog-ng in a production environment in the very near future!
>>> 
>>> 
>>> syslog-ng doesn't explicitly check the aliveness of an SQL connection,
>>> however it does drop and reestablish connection if an error occurs.
>>> 
>>> If you look at afsql_dd_database_thread() function in afsql.c, you'll
>>> see this code:
>>> 
>>>     if (!afsql_dd_insert_db(self))
>>>       {
>>>         afsql_dd_disconnect(self);
>>>         afsql_dd_suspend(self);
>>>       }
>>> 
>>> afsql_dd_insert_db() should return FALSE for any failures, and syslog-ng
>>> basically assumes that if the database side initiates the closure of a
>>> connection, that'll trickle up to syslog-ng as an error return to
>>> dbi_conn_query(), which will in turn cause the SQL destination to
>>> suspend its operations for time_reopen() amount of time and then
>>> reconnect.
>>> 
>>> What kind of fix did you implement yourself? Can you post the patch?
>>> Thanks.
>>> 
>>> -- 
>>> Bazsi
>>> 
>>> 
>> No worries :)
>> 
>> Basically, I would propose explicitly checking the health of the connection 
>> (using libdbi). By doing so, I believe there is a likelihood that log messages 
>> will not be lost in the event that the connection to the database is terminated 
>> (ex: db process restart or db connection TTL).
>> 
>> Towards the beginning of the function afsql_dd_insert_db() in afsql.c I have:
>> 
>>    if (!self->dbi_ctx)
>>      {
>> 	    /*handle new db connection*/
>> 	
>> From my understanding, this section of code checks to see if a valid connection
>> instance exists. If not, a new instance is created. I'm not too familiar with 
>> libdbi, but it doesn't seem like a valid connection instance equates to a valid 
>> and healthy connection. Upon the first attempt to log a message to the db, this 
>> instance checking works because the connection instance does not yet exist. Once 
>> the initial connection instance is made, a connection to the database is also 
>> made (and the first INSERT statement is executed), but the connection instance does 
>> not guarantee that the connection will remain intact. If the connection is terminated 
>> outside of syslog-ng, it would seem, the "connection instance" is still valid and 
>> subsequently a new connection is never attempted and FALSE is never returned. 
>> 
>> The table validation check, also in afsql_dd_insert_db(): 
>> 
>>    if (!afsql_dd_validate_table(self, table->str))
>>        {
>>          /*handle table validation error*/
>> 
>> does not catch this either as the table was successfully validated from the first 
>> successful insert.
>> 
>> Keeping inline with the concept that afsql_dd_insert_db() should return FALSE for any
>> failures, I am proposing the use of the libdbi function dbi_conn_ping() to check if 
>> the db connection is alive within afsql_dd_insert_db(). Please let me know if I am way
>> off base here or if I am missing some underlying concepts. I'm sorry that I am not quite 
>> familiar with git yet, but my proposed patch, with a/modules/afsql/afsql.c being the
>> current file from syslog-ng 3.2.4, would be as follows:
>> 
>> diff -u a/modules/afsql/afsql.c b/modules/afsql/afsql.c 
>> --- a/modules/afsql/afsql.c	
>> +++ b/modules/afsql/afsql.c	
>> @@ -572,7 +572,7 @@
>>   gint i;
>>   LogPathOptions path_options = LOG_PATH_OPTIONS_INIT;
>> 
>> -  if (!self->dbi_ctx)
>> +  if (!self->dbi_ctx || dbi_conn_ping(self->dbi_ctx) != 1)
>>     {
>>       self->dbi_ctx = dbi_conn_new(self->type);
>>       if (self->dbi_ctx)
> 
> This would probably slow down the insert rate a lot as it'd require a
> roundtrip to the SQL server before each INSERT.
> 
> What I would be interested in why syslog-ng doesn't reestablish the SQL
> connection at the first error. It should do so.
> 
> If an error happens that doesn't mean that the last message is lost. It
> is retried for 3 times (by default).
> 
> -- 
> Bazsi
> 
> 

In my test environment, syslog-ng does not reestablish the SQL connection and 
this behavior is very easy to reproduce. I was able to replicate this behavior on 
a clean install of CentOS 5.7 using syslog-ng OSE 3.2.4 (linux glibc2.3.6 amd64 
(server) and linux glibc2.3.6 i386 (server)) and MySQL 5.0 (default in CentOS 
base repo, but shouldn't matter). Once I start logging anything to MySQL, all I 
have to do is to restart the mysqld process and syslog-ng will start dropping log 
messages (maxing out attempts). The only explanation I can find as to the cause 
of this condition is what I outlined in my previous email.

When this condition is entered, all query attempts will fail despite the  
MAX_FAILED_ATTEMPTS  value. The function, afsql_dd_insert_db() is not 
designed to return FALSE when an insert statement fails to execute, but instead, 
a number of attempts are made using the same connection instance regardless 
of whether or not syslog-ng is actually connected to the db. This means the 
connection instance is never released and the thread never suspended for a 
future attempt at connecting.

Maybe it makes more sense to implement dbi_conn_ping() after the first 
unsuccessful query attempt in afsql_dd_insert_db(). That way the extra work 
needed to check if the connection is alive is only performed if the first attempt of 
the query fails. The only wasted round trip will occur when at least one failed 
attempt of the query has been made and the connection is actually alive. 
dbi_conn_ping() has some interesting functionality (advantages?) where, if 
certain libdbi driver options are set, a reconnect to the db can occur automatically 
if the connection is found to be down (obviously this functionality is driver specific 
as dbi_conn_ping() simply invokes the driver specific ping function). This might 
save a thread from unnecessary suspension in the future.

Alternatively, after detecting a failed insert statement, afsql_dd_insert_db() could 
return FALSE (after incrementing the attempt counter). This will maintain the 
query attempt count for the duration of subsequent attempts while still allowing 
for the normal behavior of database connection attempts if, in fact, the db is 
unreachable for a period of time. This approach would seem to involve the most 
risk as it changes the way that insert statement errors are handled, while the 
former only changes behavior if the connection is found to be down after a normal 
insert error.

Regards,
Erik