[syslog-ng] syslog-ng 3.2.4 MySQL connection loss Server has gone away
Balazs Scheidler
bazsi at balabit.hu
Sat Nov 12 13:51:45 CET 2011
Thanks Eric, I've applied this to 3.3.
On Tue, 2011-10-04 at 01:37 -0700, Erik Maciejewski wrote:
> On Oct 3, 2011, at 12:31 AM, Balazs Scheidler wrote:
>
> > Hi,
> >
> > On Thu, 2011-09-29 at 22:02 -0700, Erik Maciejewski wrote:
> >> On Sep 29, 2011, at 1:06 AM, Balazs Scheidler wrote:
> >>
> >>> On Thu, 2011-09-29 at 00:21 -0700, Erik Maciejewski wrote:
> >>>> On Sep 28, 2011, at 12:06 AM, Balazs Scheidler wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Sorry for the late answer, but better late than never :)
> >>>>>
> >>>>> On Sun, 2011-09-04 at 23:15 -0700, Erik Maciejewski wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have been researching an issue with syslog-ng 3.2.4 (both the Linux glibc2.3.6 and platform
> >>>>>> independent compiled from source versions) and MySQL on CentOS 5.6 in which syslog-ng tries to
> >>>>>> insert a log messages using a dead TCP or unix socket connection to a MySQL database. The issue
> >>>>>> results in the max attempt to insert a log message and then the message subsequently dropped.
> >>>>>> This seems to occur regularly in predominately low message volume environments, but has the
> >>>>>> potential to affect all environments using a MySQL (or other db) destination. The root cause
> >>>>>> of the issue seems to be directly related to the health of the connection between syslog-ng
> >>>>>> and MySQL and can be affected by the "wait_timeout" value used by MySQL to kill off inactive
> >>>>>> connections. I want to provide justification for implementing a fix to syslog-ng as I believe
> >>>>>> manipulating a, possibly tuned, environment variable for MySQL would seem like the wrong
> >>>>>> approach to take for remedial action.
> >>>>>>
> >>>>>> I feel that many MySQL instances are implemented in shared environments and are themselves
> >>>>>> shared by many applications. New connections to a MySQL instance are generally regarded as
> >>>>>> low cost and, it would seem, more often than not never used in a persisted fashion when
> >>>>>> supporting distributed applications. That being said, many times the "wait_timeout"
> >>>>>> value is set very low so that the MySQL instance can effectively control the number of
> >>>>>> simultaneous connections.
> >>>>>>
> >>>>>>
> >>>>>> By taking somewhat of a black-box approach to the issue (I'm just starting to explore
> >>>>>> syslog-ng), I noticed that there doesn't seem to be any health checking of the TCP or
> >>>>>> unix socket connection in the SQL destination implementation after the initial
> >>>>>> connection is made (most of the db actions being abstracted away by libdbi). Even
> >>>>>> if a syslog-ng database thread is suspended and reactivated due to error, there
> >>>>>> does not seem to be any attempt to check the health of the database connection.
> >>>>>>
> >>>>>> I have implemented a fix in my compiled version that will check a database
> >>>>>> connection's health and attempt to reconnect upon discovering a dead
> >>>>>> connection. I would like to know the policies/procedures/best practices
> >>>>>> for submitting such issues and/or fixes for discussion, review, and
> >>>>>> implementation. If this is a known issue or I am just wrong, all the better!
> >>>>>
> >>>>>> Thank you guys for keeping this project going and I'm looking forward to
> >>>>>> implementing syslog-ng in a production environment in the very near future!
> >>>>>
> >>>>>
> >>>>> syslog-ng doesn't explicitly check the aliveness of an SQL connection,
> >>>>> however it does drop and reestablish connection if an error occurs.
> >>>>>
> >>>>> If you look at afsql_dd_database_thread() function in afsql.c, you'll
> >>>>> see this code:
> >>>>>
> >>>>> if (!afsql_dd_insert_db(self))
> >>>>> {
> >>>>> afsql_dd_disconnect(self);
> >>>>> afsql_dd_suspend(self);
> >>>>> }
> >>>>>
> >>>>> afsql_dd_insert_db() should return FALSE for any failures, and syslog-ng
> >>>>> basically assumes that if the database side initiates the closure of a
> >>>>> connection, that'll trickle up to syslog-ng as an error return to
> >>>>> dbi_conn_query(), which will in turn cause the SQL destination to
> >>>>> suspend its operations for time_reopen() amount of time and then
> >>>>> reconnect.
> >>>>>
> >>>>> What kind of fix did you implement yourself? Can you post the patch?
> >>>>> Thanks.
> >>>>>
> >>>>> --
> >>>>> Bazsi
> >>>>>
> >>>>>
> >>>> No worries :)
> >>>>
> >>>> Basically, I would propose explicitly checking the health of the connection
> >>>> (using libdbi). By doing so, I believe there is a likelihood that log messages
> >>>> will not be lost in the event that the connection to the database is terminated
> >>>> (ex: db process restart or db connection TTL).
> >>>>
> >>>> Towards the beginning of the function afsql_dd_insert_db() in afsql.c I have:
> >>>>
> >>>> if (!self->dbi_ctx)
> >>>> {
> >>>> /*handle new db connection*/
> >>>>
> >>>> From my understanding, this section of code checks to see if a valid connection
> >>>> instance exists. If not, a new instance is created. I'm not too familiar with
> >>>> libdbi, but it doesn't seem like a valid connection instance equates to a valid
> >>>> and healthy connection. Upon the first attempt to log a message to the db, this
> >>>> instance checking works because the connection instance does not yet exist. Once
> >>>> the initial connection instance is made, a connection to the database is also
> >>>> made (and the first INSERT statement is executed), but the connection instance does
> >>>> not guarantee that the connection will remain intact. If the connection is terminated
> >>>> outside of syslog-ng, it would seem, the "connection instance" is still valid and
> >>>> subsequently a new connection is never attempted and FALSE is never returned.
> >>>>
> >>>> The table validation check, also in afsql_dd_insert_db():
> >>>>
> >>>> if (!afsql_dd_validate_table(self, table->str))
> >>>> {
> >>>> /*handle table validation error*/
> >>>>
> >>>> does not catch this either as the table was successfully validated from the first
> >>>> successful insert.
> >>>>
> >>>> Keeping inline with the concept that afsql_dd_insert_db() should return FALSE for any
> >>>> failures, I am proposing the use of the libdbi function dbi_conn_ping() to check if
> >>>> the db connection is alive within afsql_dd_insert_db(). Please let me know if I am way
> >>>> off base here or if I am missing some underlying concepts. I'm sorry that I am not quite
> >>>> familiar with git yet, but my proposed patch, with a/modules/afsql/afsql.c being the
> >>>> current file from syslog-ng 3.2.4, would be as follows:
> >>>>
> >>>> diff -u a/modules/afsql/afsql.c b/modules/afsql/afsql.c
> >>>> --- a/modules/afsql/afsql.c
> >>>> +++ b/modules/afsql/afsql.c
> >>>> @@ -572,7 +572,7 @@
> >>>> gint i;
> >>>> LogPathOptions path_options = LOG_PATH_OPTIONS_INIT;
> >>>>
> >>>> - if (!self->dbi_ctx)
> >>>> + if (!self->dbi_ctx || dbi_conn_ping(self->dbi_ctx) != 1)
> >>>> {
> >>>> self->dbi_ctx = dbi_conn_new(self->type);
> >>>> if (self->dbi_ctx)
> >>>
> >>> This would probably slow down the insert rate a lot as it'd require a
> >>> roundtrip to the SQL server before each INSERT.
> >>>
> >>> What I would be interested in why syslog-ng doesn't reestablish the SQL
> >>> connection at the first error. It should do so.
> >>>
> >>> If an error happens that doesn't mean that the last message is lost. It
> >>> is retried for 3 times (by default).
> >>>
> >>> --
> >>> Bazsi
> >>>
> >>>
> >>
> >> In my test environment, syslog-ng does not reestablish the SQL connection and
> >> this behavior is very easy to reproduce. I was able to replicate this behavior on
> >> a clean install of CentOS 5.7 using syslog-ng OSE 3.2.4 (linux glibc2.3.6 amd64
> >> (server) and linux glibc2.3.6 i386 (server)) and MySQL 5.0 (default in CentOS
> >> base repo, but shouldn't matter). Once I start logging anything to MySQL, all I
> >> have to do is to restart the mysqld process and syslog-ng will start dropping log
> >> messages (maxing out attempts). The only explanation I can find as to the cause
> >> of this condition is what I outlined in my previous email.
> >>
> >> When this condition is entered, all query attempts will fail despite the
> >> MAX_FAILED_ATTEMPTS value. The function, afsql_dd_insert_db() is not
> >> designed to return FALSE when an insert statement fails to execute, but instead,
> >> a number of attempts are made using the same connection instance regardless
> >> of whether or not syslog-ng is actually connected to the db. This means the
> >> connection instance is never released and the thread never suspended for a
> >> future attempt at connecting.
> >
> > Ah, I understand.
> >
> >>
> >> Maybe it makes more sense to implement dbi_conn_ping() after the first
> >> unsuccessful query attempt in afsql_dd_insert_db(). That way the extra work
> >> needed to check if the connection is alive is only performed if the first attempt of
> >> the query fails. The only wasted round trip will occur when at least one failed
> >> attempt of the query has been made and the connection is actually alive.
> >> dbi_conn_ping() has some interesting functionality (advantages?) where, if
> >> certain libdbi driver options are set, a reconnect to the db can occur automatically
> >> if the connection is found to be down (obviously this functionality is driver specific
> >> as dbi_conn_ping() simply invokes the driver specific ping function). This might
> >> save a thread from unnecessary suspension in the future.
> >
> > I like this one, could you please post a patch to implement it?
> > Thanks.
> >
> >>
> >> Alternatively, after detecting a failed insert statement, afsql_dd_insert_db() could
> >> return FALSE (after incrementing the attempt counter). This will maintain the
> >> query attempt count for the duration of subsequent attempts while still allowing
> >> for the normal behavior of database connection attempts if, in fact, the db is
> >> unreachable for a period of time. This approach would seem to involve the most
> >> risk as it changes the way that insert statement errors are handled, while the
> >> former only changes behavior if the connection is found to be down after a normal
> >> insert error.
> >
> > --
> > Bazsi
> >
> >
>
> ______________________________________________________________________________
> Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
> Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
> FAQ: http://www.balabit.com/wiki/syslog-ng-faq
>
--
Bazsi
More information about the syslog-ng
mailing list