[syslog-ng] Problems with failed connections and time_reopen()?

Matt Wise matt at nextdoor.com
Fri May 10 19:39:54 CEST 2013


Any thoughts, guys? Using the ELB would be a lot better for us in the event that one of our Flume log nodes goes down, especially since we can't give syslog-ng a secondary IP address to connect to in the event of failure.

--Matt

On May 8, 2013, at 8:52 AM, Matt Wise <matt at nextdoor.com> wrote:

> In both test cases, I initiated the failure by restarting the syslog endpoint (which is actually a Flume agent). When running through the ELB, the syslog-ng client never catches the connection failure and continues to try to send data through a TCP connection that's in the CLOSE_WAIT state. When not using the ELB, the syslog-ng client notices immediately that the connection has failed and begins to reconnect in earnest.
> 
> --Matt
> 
> On May 7, 2013, at 9:29 PM, Balazs Scheidler <bazsi77 at gmail.com> wrote:
> 
>> In both cases the client initiated the close operation, not the load balancer or the server. Where does the connection stall, then?
>> 
>> On May 7, 2013 11:17 PM, "Matt Wise" <matt at nextdoor.com> wrote:
>> Here's the dump THROUGH the ELB:
>> 
>>> 21:11:26.208951 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [S], seq 267618391, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
>>> 21:11:26.290452 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [S.], seq 848900027, ack 267618392, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 8], length 0
>>> 21:11:26.290509 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 1, win 115, length 0
>>> 21:11:26.291460 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 1:227, ack 1, win 115, length 226
>>> 21:11:26.375765 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 227, win 62, length 0
>>> 21:11:26.401850 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], seq 1:1461, ack 227, win 62, length 1460
>>> 21:11:26.401871 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], seq 1461:2921, ack 227, win 62, length 1460
>>> 21:11:26.401898 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [P.], seq 2921:3515, ack 227, win 62, length 594
>>> 21:11:26.402343 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 1461, win 137, length 0
>>> 21:11:26.402356 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 2921, win 160, length 0
>>> 21:11:26.402361 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 3515, win 183, length 0
>>> 21:11:26.484345 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], seq 227:3147, ack 3515, win 183, length 2920
>>> 21:11:26.484365 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 3147:3690, ack 3515, win 183, length 543
>>> 21:11:26.566175 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 3147, win 85, length 0 
>>> 21:11:26.569031 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], seq 3515:4975, ack 3690, win 96, length 1460
>>> 21:11:26.569046 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [P.], seq 4975:5221, ack 3690, win 96, length 246
>>> 21:11:26.569222 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 4975, win 206, length 0
>>> 21:11:26.569234 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 5221, win 229, length 0
>>> 21:11:28.478081 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 3690:3727, ack 5221, win 229, length 37
>>> 21:11:28.603557 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 3727, win 96, length 0 
>>> 21:11:50.707433 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [P.], seq 5221:5258, ack 3727, win 96, length 37
>>> 21:11:50.707460 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 5258, win 229, length 0
>>> 21:11:50.707577 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 3727:3764, ack 5258, win 229, length 37
>>> 21:11:50.707599 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [F.], seq 3764, ack 5258, win 229, length 0
>>> 21:11:50.789084 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 3764, win 96, length 0 
>>> 21:11:50.789847 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [F.], seq 5258, ack 3765, win 96, length 0
>>> 21:11:50.789868 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 5259, win 229, length 0
>> 
>> Here's a direct connection:
>> 
>>> 21:15:14.495542 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [S], seq 379756253, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
>>> 21:15:14.576380 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [S.], seq 521570022, ack 379756254, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
>>> 21:15:14.576409 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 1, win 115, length 0
>>> 21:15:14.576940 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 1:227, ack 1, win 115, length 226
>>> 21:15:14.657397 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 227, win 123, length 0
>>> 21:15:14.683465 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], seq 1:1461, ack 227, win 123, length 1460
>>> 21:15:14.683481 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], seq 1461:2921, ack 227, win 123, length 1460
>>> 21:15:14.683485 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [P.], seq 2921:3515, ack 227, win 123, length 594
>>> 21:15:14.683683 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 1461, win 137, length 0
>>> 21:15:14.683696 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 2921, win 160, length 0
>>> 21:15:14.683702 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 3515, win 183, length 0
>>> 21:15:14.766227 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], seq 227:3147, ack 3515, win 183, length 2920
>>> 21:15:14.766243 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 3147:3690, ack 3515, win 183, length 543
>>> 21:15:14.846942 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3147, win 169, length 0
>>> 21:15:14.849068 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], seq 3515:4975, ack 3690, win 191, length 1460
>>> 21:15:14.849082 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [P.], seq 4975:5221, ack 3690, win 191, length 246
>>> 21:15:14.849251 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 4975, win 206, length 0
>>> 21:15:14.849262 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 5221, win 229, length 0
>>> 21:15:18.394716 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 3690:3727, ack 5221, win 229, length 37
>>> 21:15:18.511442 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3727, win 191, length 0
>>> 21:15:52.957532 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [P.], seq 5221:5258, ack 3727, win 191, length 37
>>> 21:15:52.957587 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 5258, win 229, length 0
>>> 21:15:52.957716 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 3727:3764, ack 5258, win 229, length 37
>>> 21:15:52.957742 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [F.], seq 3764, ack 5258, win 229, length 0
>>> 21:15:53.039203 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3764, win 191, length 0
>>> 21:15:53.039468 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [F.], seq 5258, ack 3764, win 191, length 0
>>> 21:15:53.039484 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 5259, win 229, length 0
>>> 21:15:53.039492 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3765, win 191, length 0
>> 
>> 
>> Any thoughts? By the way, I'm trying out 3.3.9, but I'm running into other issues.
>> 
>> On May 7, 2013, at 1:55 PM, Balazs Scheidler <bazsi77 at gmail.com> wrote:
>> 
>>> 
>>> On May 7, 2013 10:51 PM, "Matt Wise" <matt at nextdoor.com> wrote:
>>> >
>>> > I've done some more testing and now have narrowed the problem down to our Amazon ELB. Because the OSS version of Syslog-ng does not support failing over destinations from hostA to hostB when one fails, we are using an ELB in front of our syslog servers.
>>> >
>>> > When we have no ELB in place, our syslog-ng client detects the network drop immediately and begins to try to reconnect. When the ELB is in the way, it never detects the network connection drop. I don't understand why. I've tested a bit manually, using openssl to connect to our remote endpoint both through the ELB and directly, and I don't see any difference in the way network connections are killed off. Any thoughts here?
>>> >
>>> 
>>> Hmm interesting. The difference might be how connections are terminated. Can you check that using tcpdump?
>>> 
>>> > --matt
>>> >
>>> > On May 6, 2013, at 9:53 AM, Matt Wise <matt at nextdoor.com> wrote:
>>> >
>>> > > We're running Syslog-NG 3.3.4 in our mixed Ubuntu 10/12 environment. We use SSL for all of our syslog-to-syslog connections, and have logging going to two different data pipelines.
>>> > >
>>> > >  Data Dest #1: SyslogNG Client ----(SSL)----> SyslogNG Server ------> Logstash File-read-in-service
>>> > >  Data Dest #2: SyslogNG Client ----(SSL)----> Stunnel Service ------> Flume Syslog Service
>>> > >
>>> > > The data streams work fine most of the time, but if we restart either the remote syslog-ng server or the stunnel service, it seems that the syslog-ng clients don't try to reconnect to the endpoints for a LONG time (or ever). I end up seeing the connection on the client go into a CLOSE_WAIT state, and syslog-ng keeps thinking that it's sending log events through the connection, so it seems to never try to reconnect.
>>> > >
>>> > > I've tried setting time_reopen() to 0, 1 and 5... no luck or change in behavior.
>>> > >
>>> > > Any thoughts?
>>> > >
>>> > > --Matt
>>> > >
>>> >
>>> > ______________________________________________________________________________
>>> > Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
>>> > Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
>>> > FAQ: http://www.balabit.com/wiki/syslog-ng-faq
>>> >
>>> 
>> 
>> 
>> 
> 
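
For anyone reproducing the CLOSE_WAIT symptom described in this thread: CLOSE_WAIT means the local side has received the peer's FIN but has not yet closed its own socket. The sketch below is only an illustration on loopback, not how syslog-ng works internally. It shows one way a client can notice a peer close by peeking at the socket; the helper names peer_has_closed and demo are hypothetical and not part of syslog-ng or any library.

```python
import select
import socket


def peer_has_closed(sock, timeout=1.0):
    """Return True if the peer has shut down its side of the connection.

    MSG_PEEK is used so that no pending application data is consumed.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False  # nothing arrived within the timeout: no FIN seen
    try:
        data = sock.recv(1, socket.MSG_PEEK)
    except ConnectionResetError:
        return True  # the peer reset the connection
    return data == b""  # an empty read means the peer sent FIN


def demo():
    # Loopback listener standing in for the remote log endpoint (assumption).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    before = peer_has_closed(client, timeout=0.2)

    # The "server" goes away; the client socket is now in CLOSE_WAIT
    # until the client itself calls close().
    server_side.close()
    after = peer_has_closed(client, timeout=1.0)

    client.close()
    listener.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

A client that only writes and never reads or polls the socket for readability has no direct way to see that FIN, which is consistent with a connection lingering in CLOSE_WAIT. Whether that is what happens behind the ELB here is an open question in the thread.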
