[syslog-ng] syslog-ng 3.2.4 MySQL connection loss Server has gone away

Tue Oct 4 10:37:54 CEST 2011

A non-text attachment was scrubbed...
Name: 0001-afsql-Database-connection-health-check.patch
Type: application/octet-stream
Size: 2228 bytes
Desc: not available
Url : http://lists.balabit.hu/pipermail/syslog-ng/attachments/20111004/b0e258f8/attachment-0001.obj 
-------------- next part --------------

On Oct 3, 2011, at 12:31 AM, Balazs Scheidler wrote:

> Hi,
> 
> On Thu, 2011-09-29 at 22:02 -0700, Erik Maciejewski wrote:
>> On Sep 29, 2011, at 1:06 AM, Balazs Scheidler wrote:
>> 
>>> On Thu, 2011-09-29 at 00:21 -0700, Erik Maciejewski wrote:
>>>> On Sep 28, 2011, at 12:06 AM, Balazs Scheidler wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Sorry for the late answer, but better late than never :)
>>>>> 
>>>>> On Sun, 2011-09-04 at 23:15 -0700, Erik Maciejewski wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I have been researching an issue with syslog-ng 3.2.4 (both the Linux glibc2.3.6 and platform 
>>>>>> independent compiled from source versions) and MySQL on CentOS 5.6 in which syslog-ng tries to 
>>>>>> insert a log messages using a dead TCP or unix socket connection to a MySQL database. The issue 
>>>>>> results in the max attempt to insert a log message and then the message subsequently dropped. 
>>>>>> This seems to occur regularly in predominately low message volume environments, but has the 
>>>>>> potential to affect all environments using a MySQL (or other db) destination. The root cause 
>>>>>> of the issue seems to be directly related to the health of the connection between syslog-ng 
>>>>>> and MySQL and can be affected by the "wait_timeout" value used by MySQL to kill off inactive 
>>>>>> connections. I want to provide justification for implementing a fix to syslog-ng as I believe 
>>>>>> manipulating a, possibly tuned, environment variable for MySQL would seem like the wrong 
>>>>>> approach to take for remedial action.
>>>>>> 
>>>>>> I feel that many MySQL instances are implemented in shared environments and are themselves 
>>>>>> shared by many applications. New connections to a MySQL instance are generally regarded as 
>>>>>> low cost and, it would seem, more often than not never used in a persisted fashion when 
>>>>>> supporting distributed applications. That being said, many times the "wait_timeout" 
>>>>>> value is set very low so that the MySQL instance can effectively control the number of
>>>>>> simultaneous connections.
>>>>>> 
>>>>>> 
>>>>>> By taking somewhat of a black-box approach to the issue (I'm just starting to explore 
>>>>>> syslog-ng), I noticed that there doesn't seem to be any health checking of the TCP or 
>>>>>> unix socket connection in the SQL destination implementation after the initial 
>>>>>> connection is made (most of the db actions being abstracted away by libdbi). Even 
>>>>>> if a syslog-ng database thread is suspended and reactivated due to error, there 
>>>>>> does not seem to be any attempt to check the health of the database connection.
>>>>>> 
>>>>>> I have implemented a fix in my compiled version that will check a database 
>>>>>> connection's health and attempt to reconnect upon discovering a dead 
>>>>>> connection. I would like to know the policies/procedures/best practices 
>>>>>> for submitting such issues and/or fixes for discussion, review, and 
>>>>>> implementation. If this is a known issue or I am just wrong, all the better! 
>>>>> 
>>>>>> Thank you guys for keeping this project going and I'm looking forward to 
>>>>>> implementing syslog-ng in a production environment in the very near future!
>>>>> 
>>>>> 
>>>>> syslog-ng doesn't explicitly check the aliveness of an SQL connection,
>>>>> however it does drop and reestablish connection if an error occurs.
>>>>> 
>>>>> If you look at afsql_dd_database_thread() function in afsql.c, you'll
>>>>> see this code:
>>>>> 
>>>>>    if (!afsql_dd_insert_db(self))
>>>>>      {
>>>>>        afsql_dd_disconnect(self);
>>>>>        afsql_dd_suspend(self);
>>>>>      }
>>>>> 
>>>>> afsql_dd_insert_db() should return FALSE for any failures, and syslog-ng
>>>>> basically assumes that if the database side initiates the closure of a
>>>>> connection, that'll trickle up to syslog-ng as an error return to
>>>>> dbi_conn_query(), which will in turn cause the SQL destination to
>>>>> suspend its operations for time_reopen() amount of time and then
>>>>> reconnect.
>>>>> 
>>>>> What kind of fix did you implement yourself? Can you post the patch?
>>>>> Thanks.
>>>>> 
>>>>> -- 
>>>>> Bazsi
>>>>> 
>>>>> 
>>>> No worries :)
>>>> 
>>>> Basically, I would propose explicitly checking the health of the connection 
>>>> (using libdbi). By doing so, I believe there is a likelihood that log messages 
>>>> will not be lost in the event that the connection to the database is terminated 
>>>> (ex: db process restart or db connection TTL).
>>>> 
>>>> Towards the beginning of the function afsql_dd_insert_db() in afsql.c I have:
>>>> 
>>>>   if (!self->dbi_ctx)
>>>>     {
>>>> 	    /*handle new db connection*/
>>>> 	
>>>> From my understanding, this section of code checks to see if a valid connection
>>>> instance exists. If not, a new instance is created. I'm not too familiar with 
>>>> libdbi, but it doesn't seem like a valid connection instance equates to a valid 
>>>> and healthy connection. Upon the first attempt to log a message to the db, this 
>>>> instance checking works because the connection instance does not yet exist. Once 
>>>> the initial connection instance is made, a connection to the database is also 
>>>> made (and the first INSERT statement is executed), but the connection instance does 
>>>> not guarantee that the connection will remain intact. If the connection is terminated 
>>>> outside of syslog-ng, it would seem, the "connection instance" is still valid and 
>>>> subsequently a new connection is never attempted and FALSE is never returned. 
>>>> 
>>>> The table validation check, also in afsql_dd_insert_db(): 
>>>> 
>>>>   if (!afsql_dd_validate_table(self, table->str))
>>>>       {
>>>>         /*handle table validation error*/
>>>> 
>>>> does not catch this either as the table was successfully validated from the first 
>>>> successful insert.
>>>> 
>>>> Keeping inline with the concept that afsql_dd_insert_db() should return FALSE for any
>>>> failures, I am proposing the use of the libdbi function dbi_conn_ping() to check if 
>>>> the db connection is alive within afsql_dd_insert_db(). Please let me know if I am way
>>>> off base here or if I am missing some underlying concepts. I'm sorry that I am not quite 
>>>> familiar with git yet, but my proposed patch, with a/modules/afsql/afsql.c being the
>>>> current file from syslog-ng 3.2.4, would be as follows:
>>>> 
>>>> diff -u a/modules/afsql/afsql.c b/modules/afsql/afsql.c 
>>>> --- a/modules/afsql/afsql.c	
>>>> +++ b/modules/afsql/afsql.c	
>>>> @@ -572,7 +572,7 @@
>>>>  gint i;
>>>>  LogPathOptions path_options = LOG_PATH_OPTIONS_INIT;
>>>> 
>>>> -  if (!self->dbi_ctx)
>>>> +  if (!self->dbi_ctx || dbi_conn_ping(self->dbi_ctx) != 1)
>>>>    {
>>>>      self->dbi_ctx = dbi_conn_new(self->type);
>>>>      if (self->dbi_ctx)
>>> 
>>> This would probably slow down the insert rate a lot as it'd require a
>>> roundtrip to the SQL server before each INSERT.
>>> 
>>> What I would be interested in why syslog-ng doesn't reestablish the SQL
>>> connection at the first error. It should do so.
>>> 
>>> If an error happens that doesn't mean that the last message is lost. It
>>> is retried for 3 times (by default).
>>> 
>>> -- 
>>> Bazsi
>>> 
>>> 
>> 
>> In my test environment, syslog-ng does not reestablish the SQL connection and 
>> this behavior is very easy to reproduce. I was able to replicate this behavior on 
>> a clean install of CentOS 5.7 using syslog-ng OSE 3.2.4 (linux glibc2.3.6 amd64 
>> (server) and linux glibc2.3.6 i386 (server)) and MySQL 5.0 (default in CentOS 
>> base repo, but shouldn't matter). Once I start logging anything to MySQL, all I 
>> have to do is to restart the mysqld process and syslog-ng will start dropping log 
>> messages (maxing out attempts). The only explanation I can find as to the cause 
>> of this condition is what I outlined in my previous email.
>> 
>> When this condition is entered, all query attempts will fail despite the  
>> MAX_FAILED_ATTEMPTS  value. The function, afsql_dd_insert_db() is not 
>> designed to return FALSE when an insert statement fails to execute, but instead, 
>> a number of attempts are made using the same connection instance regardless 
>> of whether or not syslog-ng is actually connected to the db. This means the 
>> connection instance is never released and the thread never suspended for a 
>> future attempt at connecting.
> 
> Ah, I understand.
> 
>> 
>> Maybe it makes more sense to implement dbi_conn_ping() after the first 
>> unsuccessful query attempt in afsql_dd_insert_db(). That way the extra work 
>> needed to check if the connection is alive is only performed if the first attempt of 
>> the query fails. The only wasted round trip will occur when at least one failed 
>> attempt of the query has been made and the connection is actually alive. 
>> dbi_conn_ping() has some interesting functionality (advantages?) where, if 
>> certain libdbi driver options are set, a reconnect to the db can occur automatically 
>> if the connection is found to be down (obviously this functionality is driver specific 
>> as dbi_conn_ping() simply invokes the driver specific ping function). This might 
>> save a thread from unnecessary suspension in the future.
> 
> I like this one, could you please post a patch to implement it?
> Thanks.
> 
>> 
>> Alternatively, after detecting a failed insert statement, afsql_dd_insert_db() could 
>> return FALSE (after incrementing the attempt counter). This will maintain the 
>> query attempt count for the duration of subsequent attempts while still allowing 
>> for the normal behavior of database connection attempts if, in fact, the db is 
>> unreachable for a period of time. This approach would seem to involve the most 
>> risk as it changes the way that insert statement errors are handled, while the 
>> former only changes behavior if the connection is found to be down after a normal 
>> insert error.
> 
> -- 
> Bazsi
> 
>