Any thoughts, guys? Using the ELB would be a lot better for us in the event that one of our Flume log nodes goes down, especially since we can't give syslog-ng a secondary IP address to connect to in the event of failure.
--Matt
On May 8, 2013, at 8:52 AM, Matt Wise <matt@nextdoor.com> wrote:
In both test cases, I initiated the failure by restarting the syslog endpoint (which is actually a Flume agent). When running through the ELB, the syslog-ng client never catches the connection failure and continues to try to send data through a TCP connection that's in the CLOSE_WAIT state. When not using the ELB, the syslog-ng client notices immediately that the connection has failed and begins to reconnect in earnest.
--Matt
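To illustrate the CLOSE_WAIT symptom described above, here is a minimal Python sketch, assuming a loopback listener standing in for the restarted endpoint. The names peer_has_closed and demo are hypothetical and are not part of syslog-ng. It shows that a client which checks its socket for readability sees end-of-stream once the peer has closed, which is one way a sender could notice the failure instead of continuing to write.

```python
import select
import socket


def peer_has_closed(sock):
    """Return True if the peer has closed (local side would be in CLOSE_WAIT)."""
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # no pending data and no FIN seen yet
    try:
        # MSG_PEEK inspects pending data without consuming it; an empty
        # result means end-of-stream, i.e. the peer sent a FIN.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except ConnectionError:
        return True  # a reset also means the connection is gone


def demo():
    # Loopback listener used as a stand-in for the remote endpoint (assumption).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    before = peer_has_closed(client)      # connection is healthy
    server_side.close()                   # simulate the endpoint restart
    select.select([client], [], [], 2.0)  # wait for the FIN to arrive
    after = peer_has_closed(client)       # client socket is now half-closed

    client.close()
    listener.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

A sender that only ever writes and never checks for readability would not get this signal, which is consistent with the stall reported here, though the sketch does not show what syslog-ng itself does.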
On May 7, 2013, at 9:29 PM, Balazs Scheidler <bazsi77@gmail.com> wrote:
In both cases the client initiated the close operation, not the load balancer or the server. Where does the connection stall, then?
On May 7, 2013 11:17 PM, "Matt Wise" <matt@nextdoor.com> wrote:
Here's the dump THROUGH the ELB:
21:11:26.208951 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [S], seq 267618391, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
21:11:26.290452 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [S.], seq 848900027, ack 267618392, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 8], length 0
21:11:26.290509 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 1, win 115, length 0
21:11:26.291460 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 1:227, ack 1, win 115, length 226
21:11:26.375765 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 227, win 62, length 0
21:11:26.401850 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], seq 1:1461, ack 227, win 62, length 1460
21:11:26.401871 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], seq 1461:2921, ack 227, win 62, length 1460
21:11:26.401898 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [P.], seq 2921:3515, ack 227, win 62, length 594
21:11:26.402343 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 1461, win 137, length 0
21:11:26.402356 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 2921, win 160, length 0
21:11:26.402361 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 3515, win 183, length 0
21:11:26.484345 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], seq 227:3147, ack 3515, win 183, length 2920
21:11:26.484365 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 3147:3690, ack 3515, win 183, length 543
21:11:26.566175 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 3147, win 85, length 0
21:11:26.569031 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], seq 3515:4975, ack 3690, win 96, length 1460
21:11:26.569046 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [P.], seq 4975:5221, ack 3690, win 96, length 246
21:11:26.569222 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 4975, win 206, length 0
21:11:26.569234 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 5221, win 229, length 0
21:11:28.478081 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 3690:3727, ack 5221, win 229, length 37
21:11:28.603557 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 3727, win 96, length 0
21:11:50.707433 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [P.], seq 5221:5258, ack 3727, win 96, length 37
21:11:50.707460 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 5258, win 229, length 0
21:11:50.707577 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [P.], seq 3727:3764, ack 5258, win 229, length 37
21:11:50.707599 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [F.], seq 3764, ack 5258, win 229, length 0
21:11:50.789084 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [.], ack 3764, win 96, length 0
21:11:50.789847 IP ELB.com.rfe > CLIENT.foo.com.43414: Flags [F.], seq 5258, ack 3765, win 96, length 0
21:11:50.789868 IP CLIENT.foo.com.43414 > ELB.com.rfe: Flags [.], ack 5259, win 229, length 0
Here's a direct connection:
21:15:14.495542 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [S], seq 379756253, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
21:15:14.576380 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [S.], seq 521570022, ack 379756254, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
21:15:14.576409 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 1, win 115, length 0
21:15:14.576940 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 1:227, ack 1, win 115, length 226
21:15:14.657397 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 227, win 123, length 0
21:15:14.683465 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], seq 1:1461, ack 227, win 123, length 1460
21:15:14.683481 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], seq 1461:2921, ack 227, win 123, length 1460
21:15:14.683485 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [P.], seq 2921:3515, ack 227, win 123, length 594
21:15:14.683683 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 1461, win 137, length 0
21:15:14.683696 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 2921, win 160, length 0
21:15:14.683702 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 3515, win 183, length 0
21:15:14.766227 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], seq 227:3147, ack 3515, win 183, length 2920
21:15:14.766243 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 3147:3690, ack 3515, win 183, length 543
21:15:14.846942 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3147, win 169, length 0
21:15:14.849068 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], seq 3515:4975, ack 3690, win 191, length 1460
21:15:14.849082 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [P.], seq 4975:5221, ack 3690, win 191, length 246
21:15:14.849251 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 4975, win 206, length 0
21:15:14.849262 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 5221, win 229, length 0
21:15:18.394716 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 3690:3727, ack 5221, win 229, length 37
21:15:18.511442 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3727, win 191, length 0
21:15:52.957532 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [P.], seq 5221:5258, ack 3727, win 191, length 37
21:15:52.957587 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 5258, win 229, length 0
21:15:52.957716 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [P.], seq 3727:3764, ack 5258, win 229, length 37
21:15:52.957742 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [F.], seq 3764, ack 5258, win 229, length 0
21:15:53.039203 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3764, win 191, length 0
21:15:53.039468 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [F.], seq 5258, ack 3764, win 191, length 0
21:15:53.039484 IP CLIENT.foo.com.18497 > ELB.com.rfe: Flags [.], ack 5259, win 229, length 0
21:15:53.039492 IP ELB.com.rfe > CLIENT.foo.com.18497: Flags [.], ack 3765, win 191, length 0
Any thoughts? By the way, I'm trying out 3.3.9, but running into other issues..
On May 7, 2013, at 1:55 PM, Balazs Scheidler <bazsi77@gmail.com> wrote:
On May 7, 2013 10:51 PM, "Matt Wise" <matt@nextdoor.com> wrote:
I've done some more testing and now have narrowed the problem down to our Amazon ELB. Because the OSS version of Syslog-ng does not support failing over destinations from hostA to hostB when one fails, we are using an ELB in front of our syslog servers.
When we have no ELB in place, our syslog-ng client detects the network drop immediately and begins to try to reconnect. When the ELB is in the way, it never detects the network connection drop. I don't understand why. I've tested a bit manually, using openssl to connect to our remote endpoint both through the ELB and directly, and I don't see any difference in the way network connections are killed off. Any thoughts here?
Hmm interesting. The difference might be how connections are terminated. Can you check that using tcpdump?
--matt
On May 6, 2013, at 9:53 AM, Matt Wise <matt@nextdoor.com> wrote:
We're running Syslog-NG 3.3.4 in our mixed Ubuntu 10/12 environment. We use SSL for all of our syslog-to-syslog connections, and have logging going to two different data pipelines.
Data Dest #1: SyslogNG Client ----(SSL)----> SyslogNG Server ------> Logstash File-read-in-service
Data Dest #2: SyslogNG Client ----(SSL)----> Stunnel Service ------> Flume Syslog Service
The data streams work fine most of the time, but if we restart either the remote syslog-ng server or the stunnel service, it seems that the syslog-ng clients don't try to reconnect to the endpoints for a LONG time (or ever). I end up seeing the connection on the client go into a CLOSE_WAIT state, and syslog-ng keeps thinking that it's sending log events through the connection, so it seems to never try to reconnect.
I've tried setting time_reopen() to 0, 1 and 5... no luck or change in behavior.
Any thoughts?
--Matt
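The symptom described above, where the client keeps writing into a connection the peer has already closed, can be reproduced without syslog-ng. This is a minimal Python sketch, assuming a loopback listener in place of the remote server; write_after_peer_close is a hypothetical name and the message payload is made up. How many writes succeed and which error is raised depend on the OS and on timing, so the sketch only reports them.

```python
import select
import socket
import time


def write_after_peer_close(max_attempts=50):
    """Write to a socket whose peer has closed; return (ok_writes, error)."""
    # Loopback listener used as a stand-in for the remote server (assumption).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    server_side.close()  # simulate the remote service restarting
    listener.close()
    select.select([client], [], [], 2.0)  # wait until the FIN has arrived

    ok_writes = 0
    error = None
    for _ in range(max_attempts):
        try:
            # The kernel accepts this write even though the peer is gone;
            # the failure only surfaces after the peer answers with a reset.
            client.send(b"<13>test message\n")
            ok_writes += 1
        except OSError as exc:
            error = exc
            break
        time.sleep(0.05)  # give the reset time to come back

    client.close()
    return ok_writes, error


if __name__ == "__main__":
    print(write_after_peer_close())
```

On a loopback connection at least one write is accepted before an error is raised. A proxy sitting between client and server may not send a reset at all, in which case a write-only client would have nothing to react to; that is a possible explanation for the stall, not a confirmed one.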
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq