multiple file sources, worked - some have now gone silent
I'm looking at a syslog-ng client that has multiple file() sources feeding a single destination unix-domain socket with flags(flow-control) and some disk-buffer.
Two days ago it was sending all these files and tracking their rotations fine etc. The destination server restarted, so the socket disappeared for a while, and the client ran out of the default buffering. The intention is that it recovers when the destination returns and resumes reading the files where it left off.
When the server restarted data started flowing again. But I've noticed it's no longer sending data from ALL of the file sources. Some have gone silent. The log is saying nothing about the frozen sources. The statistics counters simply stopped incrementing for them.
This is kind of fatal. Has anyone seen this before or know how to avoid it? The faulty instance is still running, so if there's any way of interrogating it to find out its mental state, that's still possible.
syslog-ng 3.12.1
OS: Solaris
-- Declan White
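One way to look for the symptom described above from outside the process is to compare two snapshots of the statistics counters taken some time apart. The following is only a sketch: it assumes the semicolon-separated layout printed by `syslog-ng-ctl stats` (SourceName;SourceId;SourceInstance;State;Type;Number). The sample snapshots, the source and destination names, and the `parse_stats` / `stalled_sources` helpers are hypothetical, and which counters are actually present depends on the version and the configured stats level.

```python
def parse_stats(text):
    """Parse semicolon-separated stats text into {(name, id, instance, type): count}."""
    counters = {}
    for line in text.splitlines():
        fields = line.strip().split(";")
        if len(fields) != 6 or fields[0] == "SourceName":
            continue  # skip the header line and anything malformed
        name, ident, instance, _state, ctype, number = fields
        try:
            counters[(name, ident, instance, ctype)] = int(number)
        except ValueError:
            continue  # non-numeric counter value, ignore it
    return counters


def stalled_sources(before, after, prefix="src."):
    """Return 'processed' source counters that did not advance between two snapshots."""
    stalled = []
    for key, old in before.items():
        name, _ident, _instance, ctype = key
        if not name.startswith(prefix) or ctype != "processed":
            continue
        if after.get(key, old) <= old:
            stalled.append(key)
    return sorted(stalled)


# Hypothetical sample data, shaped like two runs of `syslog-ng-ctl stats`.
SNAPSHOT_1 = """SourceName;SourceId;SourceInstance;State;Type;Number
src.file;s_files#0;/var/log/app-a.log;a;processed;100
src.file;s_files#1;/var/log/app-b.log;a;processed;250
destination;d_relay;;a;processed;350
"""

SNAPSHOT_2 = """SourceName;SourceId;SourceInstance;State;Type;Number
src.file;s_files#0;/var/log/app-a.log;a;processed;180
src.file;s_files#1;/var/log/app-b.log;a;processed;250
destination;d_relay;;a;processed;430
"""

if __name__ == "__main__":
    for key in stalled_sources(parse_stats(SNAPSHOT_1), parse_stats(SNAPSHOT_2)):
        print("no progress:", key)
```

A counter that has not moved is not proof of a frozen source, since the file may simply be idle; it only narrows down which sources to compare against the actual growth of the files on disk.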
Hi, this is very unfortunate. I'm sure a core dump of the process would be helpful to the developers. Not sure if gcore or similar is available on Solaris though. cheers
The data in the core dump would need to stay in my hands, so that's no good.
I'm going to have to toss syslog-ng out :( Silencing sources without logging anything, when things went wrong in the most common way, is a complete deal breaker.
- Declan
On Wed, Feb 14, 2018 at 09:52:37PM +0100, Fabien Wernli wrote:
Hi,
this is very unfortunate. I'm sure a core dump of the process would be helpful to the developers. Not sure if gcore or similar is available on Solaris though.
cheers
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
-- Declan White
Hello!
From the description you gave it is hard to find out what is happening without some more information. Can you share your configuration, please? Do you use the internal source, wildcard file-source?
We also have a tool called 'syslog-ng-debun' that can be used to collect information about the environment. It can collect sensitive data, so be sure to replace or remove anything sensitive (e.g. IP addresses and passwords in the config) before sending the debun output.
https://github.com/balabit/syslog-ng/blob/master/contrib/syslog-ng-debun
https://github.com/balabit/syslog-ng/blob/master/contrib/README.syslog-ng-de...
As you are using syslog-ng 3.12.1 on Solaris, have you built it from source or used a package?
Best Regards,
Gabor
On Fri, Feb 16, 2018 at 09:44:22AM +0100, Nagy, Gábor wrote:
Hello!
From the description you gave it is hard to find out what is happening without some more information.
Thanks for replying.
Can you share your configuration, please? Do you use the internal source, wildcard file-source?
Attaching config.
We also have a tool called 'syslog-ng-debun' that can be used to collect information about the environment. It can collect sensitive data, so be sure to replace or remove anything sensitive (e.g. IP addresses and passwords in the config) before sending the debun output. https://github.com/balabit/syslog-ng/blob/master/contrib/syslog-ng-debun https://github.com/balabit/syslog-ng/blob/master/contrib/README.syslog-ng-de...
I'll have a snuffle, but the box it blew up on is sensitive, replicating the issue will be non-trivial, and I don't have much time, at all.
Sol10 x86 stripped down base build + sunstudio12 compiler.
Packages compiled for syslog-ng:
pkg-config-0.29 coreutils-8.29 binutils-2.29.1 gmp-6.1.2 mpfr-3.1.6 mpc-1.1.0 gcc-7.2.0 chrpath-0.16 glib-2.50.3-gcc pcre-8.41-gcc json-c-0.13-gcc
--enable-java=no --with-mongoc=no --with-jsonc=system
Yes, Solaris is a sinking ship, but the evacuation will take some time. I need the applications on either side of the evacuation route to match first.
The log contains messages like:
Feb 12 22:28:24.07 host1 syslog-ng[12121]: Destination reliable queue full, dropping message; filename='/var/syslog-ng/syslog-ng-00000.rqf', queue_len='3929', mem_buf_size='10000', disk_buf_size='2000000', persist_name='afsocket_dd_qfile(stream,localhost.afunix:/var/syslog-ng/logserver.socket)'
I don't know why it needs to drop messages when the source is a file and the flow-control is on.
As you are using syslog-ng 3.12.1 on Solaris, have you built it from source or used a package?
Source. Had to patch it to get it to compile. Patches attached. Some are patches already in > 3.12.1.
Tried to remove GCCisms but failed. Had to compile GCC and many other things too. (Fun fact - GCC now contains GCCisms in libcpp, so can only be compiled with GCC.)
Tried a later syslog-ng version but the tarball was missing 'configure'. One of them was missing the bundled json-c. Needed an empty "json_object_private.h" in the include path (should be another patch, but it was easier just to touch the file).
Openssl now compulsory in syslog-ng but doesn't compile against Solaris openssl, as it assumes some optional openssl features are present (EC algorithms - patentfoo/geopolitics..), so I found an old 0.9.8 just to get it to compile, but I don't want to use SSL anyway (it's dangerous to leave 'custom' SSL libs around to age).
I'm building it to install in an isolated directory, so it can be tested in there independently of any already installed/running version on the same machine, as a different user. To make it a deployable I'm running 'chrpath' on all the ELF dynamic objects to replace the RPATH/RUNPATH so it uses its own personal library directory for its own libraries. It's more permanent and effective than the equivalent LD_LIBRARY_PATH wrapper script, and guarantees the deployable is self-contained.
This causes fun, as syslog-ng mostly relies on the compiled-in install paths being the same as the config runpaths, and most of these paths cannot be controlled in the config file. This makes the command line veeeerrrryyyy looooooong, and there's no way to stop it trying to chdir to the wrong place on startup (so core dumps probably end up in whatever directory you were in when you started it?). There may be more than one syslog-ng being run by more than one non-root user.
You've already seen me abandon unix-stream as a source in previous listmails - it breaks when a C library-call tracer is *not* attached, probably for threading reasons (tracer will be serialising), at which point I turned and ran.
You've also seen me fix the SGID dir usage case (syslog-ng really needs a way to set its own umask), and hit the framing differences between unix-stream() and network()/syslog() (maybe frame/no-frame could just be added as flags?).
You've also seen me hit SGID issues, as syslog-ng does not trust the user's umask and overrides it with something too restrictive.
My use case is strangely simple. I want changes to a list of files on one host replicated to another host, reliably. Reliably means accounting for any network and host disruption, file truncation or rotation. This may seem straightforward but there is no such software. People I've tracked down in the same situation are just running rsync in while(1) loops, which doesn't scale. (Also, I've seen rsync protocol-deadlock on big-v-little-endian + 32v64 + differing-raw-directory-order weirdness before).
I tried rsyslog (which required configuration in env vars as well as command line options) and watched its 'reliable protocol' module go insane, flinging messages at a failed connection socket, spinning on the CPU flinging the same bytes, then timing out and declaring success.
So as you can see I've been having fun. I can only logically conclude I've run over Murphy's cat.
If this is all blowing up because the patches I applied to get it to compile weren't thread safe, that would be appropriately ironic.
- Declan
-- Declan White IT Services - Unix Engineer BBC Service Operations T: +44 (0)2036 181487 E: declan.white@atos.net W: uk.atos.net
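The use case described in the message above (picking up appended data from a list of files across truncation and rotation) can be illustrated with a small polling reader. This is a minimal sketch, not a description of how syslog-ng's file() source works: the `FileFollower` class is hypothetical, it identifies a file by device and inode number, and it treats a size smaller than the saved offset as a truncation.

```python
import os


class FileFollower:
    """Follow one path and return newly appended bytes on each poll."""

    def __init__(self, path):
        self.path = path
        self.file_id = None  # (st_dev, st_ino) of the file we last read from
        self.offset = 0      # position of the next unread byte in that file

    def poll(self):
        """Return bytes added since the last poll, restarting after rotation or truncation."""
        try:
            fh = open(self.path, "rb")
        except FileNotFoundError:
            return b""  # mid-rotation: the replacement file does not exist yet
        with fh:
            # fstat() on the open handle avoids a race between stat() and open().
            st = os.fstat(fh.fileno())
            file_id = (st.st_dev, st.st_ino)
            if file_id != self.file_id:
                # A different file now lives at this path: treat it as a rotation.
                self.file_id = file_id
                self.offset = 0
            elif st.st_size < self.offset:
                # Same file, but shorter than what was already read: truncation.
                self.offset = 0
            fh.seek(self.offset)
            data = fh.read()
        self.offset += len(data)
        return data
```

The sketch has known gaps that a reliable relay would have to close: anything appended to the old file between the last poll and the rotation is missed unless the old handle is kept open and drained, a file that is truncated and rewritten to at least the old length between polls goes unnoticed, and the file identity and offset would need to be persisted to resume after a restart.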
Hi Declan,
On Thu, Feb 15, 2018 at 6:13 PM, Declan White <declanw@is.bbc.co.uk> wrote:
The data in the core dump would need to stay in my hands, so that's no good.
I'm going to have to toss syslog-ng out :( Silencing sources without logging anything, when things went wrong in the most common way, is a complete deal breaker.
- Declan
I understand the sentiment and rest assured what you described is not an intended behaviour.
Frankly, you are not very helpful here. It is an integral part of open source that users are also contributors and help form the product and/or fix bugs. Asserting that there's a problem on your side without providing details to help us fix it, and then calling it a deal breaker, will not solve the issue.
The premium edition may or may not be for you, but there you could at least have some expectations wrt. deal breakers and stuff, as you would be paying money in exchange for service and product. And I am not saying that we leave the open source as garbage. We do everything to keep it as stable and featureful as possible.
Cheers,
-- Bazsi
On Fri, Feb 16, 2018 at 01:28:02PM +0100, Balazs Scheidler wrote:
Hi Declan,
I understand the sentiment and rest assured what you described is not an intended behaviour.
Frankly, you are not very helpful here. It is an integral part of open source that users are also contributors and help form the product and/or fix bugs. Asserting that there's a problem on your side without providing details to help us fix it, and then calling it a deal breaker, will not solve the issue.
And I understand that sentiment :) But when I found it had silently broken doing something simple, my priority switched from tracing it/helping you fix it, to running away screaming. I am in survival mode now.
I spent a month of late nights trying to get rsyslog to work. When it vomited thready madness I wisely ran away screaming.
I spent the next month of late nights trying to get syslog-ng working (and patching each of the dependencies that themselves didn't compile, and then compiling GCC itself when I found you were serious about being GCC-only). It then silently broke when the receiver blipped, and complained about dropping messages, just shoveling only basic files, in reliable mode, on default settings.
Such an obvious breakage in such an obvious usage case almost certainly means it doesn't do this on Linux or your other test platforms, and that means it will take you a long time to find out what it is about my build/env/OSver/threadmodel that is triggering this. And that's assuming anyone still cares about Sol10.
That would take me some time working with you to find. I do not have that time. I have negative time. I am shedding tears of hysterical laughter and coding a new file relay protocol in perl right now, doing daily status reports to higher-ups about the huge delays making a simple reliable file relay.
The premium edition may or may not be for you, but there you could at least have some expectations wrt. deal breakers and stuff, as you would be paying money in exchange for service and product. And I am not saying that we leave the open source as garbage. We do everything to keep it as stable and featureful as possible.
Yes, I dangled that option at the higher-ups. I'm desperate for a reliable relay protocol (hence the attempt at rsyslog). But seeing as I can't guarantee syslog-ng will relay UNTRUSTED application logs in a verbatim-recoverable manner (e.g. NUL chars, logs with no newlines), I would eventually have to move away from syslog-ng anyway.
The stars are not aligned. The stars are in fact on fire.
-- Declan White
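The requirement in the message above, relaying arbitrary bytes (NUL characters, chunks with no trailing newline) so that they can be recovered verbatim, is usually met by length-prefixed framing rather than newline-delimited framing. The sketch below illustrates the idea with a decimal length, a space, then the payload, which is similar in spirit to the octet-counting framing described in RFC 6587. The `encode_frame` and `decode_frames` helpers are hypothetical and say nothing about what syslog-ng does internally.

```python
def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length in ASCII decimal and a single space."""
    return str(len(payload)).encode("ascii") + b" " + payload


def decode_frames(buffer: bytes):
    """Split a receive buffer into complete payloads.

    Returns (payloads, remainder); the remainder is an incomplete frame
    that should be kept and prepended to the next chunk received.
    """
    payloads = []
    pos = 0
    while True:
        space = buffer.find(b" ", pos)
        if space == -1:
            break  # the length field has not fully arrived yet
        length_field = buffer[pos:space]
        if not length_field.isdigit():
            raise ValueError("malformed length field: %r" % length_field)
        start = space + 1
        end = start + int(length_field)
        if end > len(buffer):
            break  # the payload has not fully arrived yet
        payloads.append(buffer[start:end])
        pos = end
    return payloads, buffer[pos:]
```

Because the receiver never inspects the payload, embedded NULs and missing newlines pass through unchanged; the cost is that both ends must agree on the framing, which is the kind of mismatch between unix-stream() and network()/syslog() mentioned earlier in the thread.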
Hello Declan!
Thanks for sharing the details of these issues. Reading your letters I tried to focus on points where we could help. I have highlighted them below with "|" and answered them the best I could.
Mainly I see 2 kinds of issues:
* Compilation problems on Solaris
Although we don't support Solaris officially, we have received and solved compile-related issues on Solaris. I would recommend submitting these on GitHub with details so they can be discussed.
* The reported destination dropping messages (or "silent sources" as originally reported) in syslog-ng.
The log contains messages like:
Feb 12 22:28:24.07 host1 syslog-ng[12121]: Destination reliable queue full, dropping message; filename='/var/syslog-ng/syslog-ng-00000.rqf', queue_len='3929', mem_buf_size='10000', disk_buf_size='2000000', persist_name='afsocket_dd_qfile(stream,localhost.afunix:/var/syslog-ng/logserver.socket)'
I don't know why it needs to drop messages when the source is a file and the flow-control is on.
The log message is not actually dropped; this shows that the debug message can be misleading. Message dropping is handled in a layer above the place where this function is called. I think this should be fixed.
Tried a later syslog-ng version but the tarball was missing 'configure'. One of them was missing the bundled json-c. Needed an empty "json_object_private.h" in the include path (should be another patch, but it was easier just to touch the file).
Can you please specify which tarballs you mean?
Some words about your use case:
My use case is strangely simple. I want changes to a list of files on one host replicated to another host, reliably. Reliably means accounting for any network and host disruption, file truncation or rotation. This may seem straightforward but there is no such software. People I've tracked down in the same situation are just running rsync in while(1) loops, which doesn't scale. (Also, I've seen rsync protocol-deadlock on big-v-little-endian + 32v64 + differing-raw-directory-order weirdness before).
... If this is all blowing up because the patches I applied to get it to compile weren't thread safe, that would be appropriately ironic.
As you wrote, your log messages do not necessarily conform to any syslog protocol (which would not be a problem in itself) and could be very special too ("... UNTRUSTED application logs in a verbatim-recoverable manner (e.g. NUL chars, logs with no newlines) ...").
Your use case is quite special: file replication/transfer without any constraint on the format of the log messages, while file truncation can happen.
Please note that syslog-ng is designed to read file sources sequentially; if your applications can truncate the file at any time, that could (in some cases) lead to message loss! In our admin guide we state that after a rotation you must reload/restart syslog-ng.
Best Regards,
Gabor
On Fri, Feb 16, 2018 at 7:18 PM, Declan White <declanw@is.bbc.co.uk> wrote:
participants (4)
- Balazs Scheidler
- Declan White
- Fabien Wernli
- Nagy, Gábor