guys, I've been pulling my hair out here. I'm going to go insane...

Here's the story. I first set up syslog-ng-1.6.6 for remote syslogging a year ago. It seemed to work out great - all my stuff appeared to be logged remotely and everything was dandy. Then I decided to come up with some sort of check that would tell me if a machine was no longer logging remotely because of network problems or whatever. I have a perl logchecker which I basically told to check for the cron.hourly log message and, if it finds it, to grep for the source host and add that host to a file. I was then going to check this list against a pre-defined array of hosts that should be logging, and so on. This is how I discovered that not everything is really logged.

All my clients have this conf file:

options {
    chain_hostnames(0);
    time_reopen(10);
    time_reap(360);
    log_fifo_size(2048);
    sync(1);
    stats(0);
    create_dirs(yes);
    owner(root);
    group(root);
    perm(0644);
    dir_perm(0755);
    use_dns(yes);
    dns_cache(yes);
};

source src { internal(); unix-stream("/dev/log"); file("/proc/kmsg"); };

destination messages { file("/var/log/messages"); };
destination auth { file("/var/log/auth"); };
destination cron { file("/var/log/cron"); };
destination mail { file("/var/log/mail"); };
destination loghost { tcp("**********" port(5000)); };

filter auth { facility(auth, authpriv); };
filter cron { facility(cron); };
filter mail { facility(mail); };
filter nagios {
    not ( match("Accepted publickey for nagios from ********")
       or match("COMMAND=/usr/local/nagios/home/check_duplex")
       or match("session opened for user nagios")
       or match("session closed for user nagios"));
};

log { source(src); filter(cron); destination(cron); destination(loghost); flags(final); };
log { source(src); filter(mail); destination(mail); destination(loghost); flags(final); };
log { source(src); filter(auth); filter(nagios); destination(auth); destination(loghost); flags(final); };
log { source(src); filter(nagios); destination(messages); destination(loghost); };

If I
check the cron log locally on a client I can see that all the hourly logs are there:

Feb 27 12:01:01 sbeta crond[31964]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 13:01:01 sbeta crond[9232]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 14:01:01 sbeta crond[18969]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 15:01:01 sbeta crond[28630]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 16:01:01 sbeta crond[5311]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 17:01:01 sbeta crond[14977]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 18:01:01 sbeta crond[22335]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 19:01:01 sbeta crond[31995]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 20:01:01 sbeta crond[9336]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 21:01:01 sbeta crond[19002]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 22:01:01 sbeta crond[28663]: (root) CMD (run-parts /etc/cron.hourly)
Feb 27 23:01:01 sbeta crond[5928]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 00:01:01 sbeta crond[15495]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 01:01:01 sbeta crond[25160]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 02:01:01 sbeta crond[2394]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 03:01:01 sbeta crond[12087]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 04:01:01 sbeta crond[21753]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 05:01:01 sbeta crond[31573]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 06:01:01 sbeta crond[8840]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 07:01:01 sbeta crond[18506]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 08:01:01 sbeta crond[28166]: (root) CMD (run-parts /etc/cron.hourly)
Feb 28 09:01:01 sbeta crond[5433]: (root) CMD (run-parts /etc/cron.hourly)

However, checking the remote cron logs for that machine for the whole of February shows this:

Feb 21 12:01:01 sbeta crond[23754]: (root) CMD (run-parts /etc/cron.hourly)
Feb 24 05:01:01 sbeta crond[5012]: (root) CMD (run-parts /etc/cron.hourly)
Feb 24 17:01:01 sbeta crond[26525]: (root) CMD (run-parts /etc/cron.hourly)
Feb 25 03:01:01 sbeta crond[28212]: (root) CMD (run-parts /etc/cron.hourly)
Feb 25 13:01:01 sbeta crond[30890]: (root) CMD (run-parts /etc/cron.hourly)
Feb 26 01:01:01 sbeta crond[20620]: (root) CMD (run-parts /etc/cron.hourly)
Feb 16 19:01:01 sbeta crond[6299]: (root) CMD (run-parts /etc/cron.hourly)
Feb 17 00:01:01 sbeta crond[22138]: (root) CMD (run-parts /etc/cron.hourly)
Feb 17 04:01:01 sbeta crond[28292]: (root) CMD (run-parts /etc/cron.hourly)
Feb 17 08:01:01 sbeta crond[1917]: (root) CMD (run-parts /etc/cron.hourly)
Feb 17 12:01:01 sbeta crond[7812]: (root) CMD (run-parts /etc/cron.hourly)
Feb 17 16:01:01 sbeta crond[14136]: (root) CMD (run-parts /etc/cron.hourly)
Feb 17 19:01:01 sbeta crond[10669]: (root) CMD (run-parts /etc/cron.hourly)
Feb 18 00:01:01 sbeta crond[26489]: (root) CMD (run-parts /etc/cron.hourly)
Feb 18 04:01:01 sbeta crond[32742]: (root) CMD (run-parts /etc/cron.hourly)
Feb 18 08:01:01 sbeta crond[6852]: (root) CMD (run-parts /etc/cron.hourly)
Feb 18 12:01:01 sbeta crond[13106]: (root) CMD (run-parts /etc/cron.hourly)
Feb 18 15:01:01 sbeta crond[9768]: (root) CMD (run-parts /etc/cron.hourly)
Feb 18 20:01:01 sbeta crond[25588]: (root) CMD (run-parts /etc/cron.hourly)
Feb 12 08:01:01 sbeta crond[22554]: (root) CMD (run-parts /etc/cron.hourly)
Feb 12 12:01:01 sbeta crond[28808]: (root) CMD (run-parts /etc/cron.hourly)
Feb 12 16:01:01 sbeta crond[2735]: (root) CMD (run-parts /etc/cron.hourly)
Feb 12 20:01:01 sbeta crond[9063]: (root) CMD (run-parts /etc/cron.hourly)
Feb 13 00:01:01 sbeta crond[15268]: (root) CMD (run-parts /etc/cron.hourly)
Feb 14 17:01:01 sbeta crond[22845]: (root) CMD (run-parts /etc/cron.hourly)
Feb 14 21:01:02 sbeta crond[29177]: (root) CMD (run-parts /etc/cron.hourly)
Feb 15 22:01:01 sbeta crond[12945]: (root) CMD (run-parts /etc/cron.hourly)
Feb 16 05:01:01 sbeta crond[17917]: (root) CMD (run-parts /etc/cron.hourly)
Feb 16 16:01:01 sbeta crond[29758]: (root) CMD (run-parts /etc/cron.hourly)
Feb 16 18:01:01 sbeta crond[16687]: (root) CMD (run-parts /etc/cron.hourly)

As you can see, it seemingly randomly logs this stuff now and then. So this machine hasn't been sending logs remotely since Feb 21? As I said, all clients run the same version of syslog-ng and the exact same config - yet some of them seem to log cron stuff every hour and others don't. I get a lot of those 'Error connecting to remote host AF_INET' messages, BUT I get those for EVERY host, so why is there no consistency?

I know what everybody is going to say - your network is hosed - but here is the problem with that theory. I can log on to that box and I will immediately see this on the loghost in /var/log/auth:

Feb 28 10:43:06 sbeta sshd[21815]: Accepted publickey for stucky from ******** port 47427 ssh2
Feb 28 10:43:06 sbeta sshd(pam_unix)[21819]: session opened for user stucky by (uid=0)

This will work every time for all other machines as well. I'd understand if it didn't log at all, or only with a slight delay of, say, 1 minute. Besides, it appears cron stuff really gets lost - but then why doesn't any of the auth/authpriv stuff get lost? I can't help feeling it has something to do with logging cron, but looking at the config I clearly told it to log to the local file first and then again to the remote host, just like the auth/authpriv stuff.
Here is my logserver's config:

options {
    chain_hostnames(0);
    time_reopen(10);
    time_reap(360);
    log_fifo_size(2048);
    use_fqdn(no);
    use_dns(yes);
    dns_cache(yes);
    keep_hostname(yes);
    long_hostnames(off);
    sync(1);
    stats(0);
    create_dirs(yes);
    perm(0640);
    dir_perm(0750);
};

source src {
    internal();
    unix-stream("/dev/log");
    file("/proc/kmsg");
    udp(ip("*********"));
    tcp(ip("*********") port(5000) max-connections(1000) keep-alive(yes));
};

destination messages { file("/var/log/messages"); };
destination auth { file("/var/log/auth"); };
destination cron { file("/var/log/cron"); };
destination mail { file("/var/log/mail"); };
destination arch { file("/usr/local/var/log_archive/$HOST/$YEAR/$MONTH/$DAY/archive"); };

filter auth { facility(auth, authpriv); };
filter cron { facility(cron); };
filter mail { facility(mail); };
filter messages { not facility(auth, authpriv, cron, mail); };
filter nagios {
    not ( match("Accepted publickey for nagios from *******")
       or match("COMMAND=/usr/local/nagios/home/check_duplex")
       or match("session opened for user nagios")
       or match("session closed for user nagios"));
};
filter junk {
    not ( match("Accepted publickey for oracle from ***********")
       or match("Accepted publickey for oracle from ***********")
       or match("Accepted publickey for stucky from ***********")
       or match("COMMAND=/usr/local/nagios/home/check_duplex")
       or match(".+reconnecting to LDAP server.+sleeping")
       or match("session opened for user root")
       or match(".+AF_INET client connected from .+")
       or match(".+AF_INET client dropped connection from.+")
       or match(".+Connection broken to AF_INET.+"));
};

log { source(src); filter(cron); destination(cron); destination(arch); flags(final); };
log { source(src); filter(mail); destination(mail); destination(arch); flags(final); };
log { source(src); filter(auth); filter(nagios); destination(auth); destination(arch); flags(final); };
log { source(src); filter(junk); filter(nagios); destination(messages); destination(arch); };

And here's the really messed up thing. EVERY time I set up a tcpdump on the loghost at the top of the hour to see whether the box actually sends stuff to it - it does! Yet, if I wait a while and check the logs again, I can see that a lot of hourly cron logs are skipped.

Can anybody please take a look at my configs and at least tell me that they are not completely wrong? I'm running out of ideas.

-- stucky
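The freshness check described above can be sketched roughly like this (a minimal sketch in Python rather than perl; the host list and the sample lines are hypothetical, and the regex assumes the exact crond message format shown in the logs above):

```python
import re

EXPECTED_HOSTS = {"sbeta", "alpha", "gamma"}  # hypothetical list of hosts that should log

# Matches e.g. "Feb 28 09:01:01 sbeta crond[5433]: (root) CMD (run-parts /etc/cron.hourly)"
CRON_RE = re.compile(r"^\w{3}\s+\d+ [\d:]{8} (\S+) crond\[\d+\]: "
                     r"\(root\) CMD \(run-parts /etc/cron\.hourly\)")

def hosts_seen(lines):
    """Return the set of hosts that produced an hourly cron entry."""
    seen = set()
    for line in lines:
        m = CRON_RE.match(line)
        if m:
            seen.add(m.group(1))
    return seen

# Made-up sample of the remote cron log for the last hour:
sample = [
    "Feb 28 09:01:01 sbeta crond[5433]: (root) CMD (run-parts /etc/cron.hourly)",
    "Feb 28 09:01:01 alpha crond[1234]: (root) CMD (run-parts /etc/cron.hourly)",
]
missing = EXPECTED_HOSTS - hosts_seen(sample)
print(sorted(missing))  # → ['gamma']  (hosts that stopped logging remotely)
```

In practice the `sample` list would come from the last hour's slice of the loghost's /var/log/cron.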
On Tue, Feb 28, 2006 at 11:09:43AM -0800, stucky wrote:
guys
I've been pulling my hair out here. I'm going to go insane...
Here's the story.
I first setup syslog-ng-1.6.6 for remote syslogging a year ago.
It seemed to work out great - all my stuff appeared to be logged remotely and everything was dandy. Then I decided to come up with some sort of check that would tell me if a machine was no longer logging remotely because of network problems or whatever.
I'm thinking that since you use TCP, perhaps the messages only come through once in a while: the first message sent when a TCP connection comes up used to be lost (it might still be, but I remember Bazsi coming up with a workaround/fix at some point), so you might actually lose messages the way you report.

Either try using UDP for a short while, or try tcp-keep-alive() to keep the connection up. Also review the changelogs to see if a newer version fixes this behavior (probably a good idea to upgrade anyway).

HTH,
-- Nate

He may look like an idiot and talk like an idiot but don't let that fool you. He really is an idiot. - Groucho Marx
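The failure mode Nate describes - a message handed to a TCP connection that is already dead, with no error ever reported to the sender - can be reproduced with plain sockets. This is a self-contained sketch, not syslog-ng code; the port is chosen by the OS:

```python
import socket
import time

# A throwaway "log server" that accepts one connection and drops it at once.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))          # let the OS pick a free port
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()
conn.close()                        # the server side goes away
time.sleep(0.2)                     # give the FIN time to reach the client

# The first send() after the peer closed still succeeds: the kernel buffers
# the data and only learns the connection is broken from the resulting RST.
# From the application's point of view the message was "delivered".
sent = cli.send(b"<78>cron message\n")
print("send() returned", sent)
```

Only a *second* send on the same socket would raise EPIPE, which is why a sender that reconnects for every burst can silently lose the first message each time.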
Nate, thanks for your reply! I am already using keepalive on the server:

tcp (ip("**********") port(5000) max-connections(1000) keep-alive(yes)); };

It doesn't look like I can specify this option on the client, though. If I try something like:

destination loghost { tcp("*********" port(5000) keep-alive(yes)); };

it complains about a syntax error. I have never seen an option called 'tcp-keep-alive()' but I tried it, without luck. It always complains about a syntax error. I guess you meant 'keep-alive(yes)', right?

As far as upgrading is concerned, I want to, but I wanted to wait till 2.0 is out and stable. Any ETA?

-- stucky

On 3/2/06, Nate Campi <nate@campin.net> wrote:
On Tue, Feb 28, 2006 at 11:09:43AM -0800, stucky wrote:
guys
I've been pulling my hair out here. I'm going to go insane...
Here's the story.
I first setup syslog-ng-1.6.6 for remote syslogging a year ago.
It seemed to work out great - all my stuff appeared to be logged remotely and everything was dandy. Then I decided to come up with some sort of check that would tell me if a machine was no longer logging remotely because of network problems or whatever.
I'm thinking that since you use TCP that perhaps the messages only come through once in a while, and since the first message sent when a TCP connection comes up used to be lost (might still be, but I remember Bazsi coming up with a workaround/fix at some point) then you might actually lose messages the way you report.
Either try using UDP for a short while or try tcp-keep-alive() to keep the connection up. Also review the changelogs to see if a newer version fixes this behavior (probably a good idea to upgrade anyways).
HTH, -- Nate
He may look like an idiot and talk like an idiot but don't let that fool you. He really is an idiot. - Groucho Marx
_______________________________________________ syslog-ng maillist - syslog-ng@lists.balabit.hu https://lists.balabit.hu/mailman/listinfo/syslog-ng Frequently asked questions at http://www.campin.net/syslog-ng/faq.html
On Thu, 2006-03-02 at 16:15 -0800, stucky wrote:
thanks for your reply ! I am already using keepalive on the server :
tcp (ip("**********") port(5000) max-connections(1000) keep-alive(yes)); };
It doesn't look like I can specify this option on the client though. If I try something like : destination loghost { tcp("*********" port(5000) keep-alive(yes)); };
It complains about a syntax error. I have never seen an option called 'tcp-keep-alive()' but I tried it without luck. It always complains about a syntax error. I guess you meant 'keep-alive(yes)' right ?
tcp-keep-alive was added in 1.6.3, as I read my changelogs.

keep-alive and tcp-keep-alive are different beasts: the first means that syslog-ng should keep all connections open through reloads; the second enables the SO_KEEPALIVE socket option.
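In 1.6.3 and later the two options can in principle be combined on a destination. A sketch only, with a placeholder address (192.0.2.10); the exact option spellings should be checked against the reference manual for the installed version:

```
destination loghost {
    # keep-alive(yes): hold the connection open across SIGHUP reloads
    # tcp-keep-alive(yes): enable the kernel's SO_KEEPALIVE probes,
    # so a silently dead peer is eventually detected and reconnected
    tcp("192.0.2.10" port(5000) keep-alive(yes) tcp-keep-alive(yes));
};
```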
As far as upgrading is concerned I want to but I wanted to wait till 2.0 is out and stable. Any ETA ?
It all depends on testers. Judging by the feedback, it is either rock solid or no-one is using it; I suspect it is the latter. In the current situation, "it works for me" messages would help a lot - the first can be this very mail, as I've been using 1.9.x snapshots for about a year now.

-- Bazsi
Hey gurus,

I will try 1.9.9 again when I have more time. I tried it once and it literally killed me with stats, although I had stats(0) set. It was insane, so I turned it off again. As I said, when I have more time.

But I was wondering whether some of the log delivery problems may have to do with Oracle RAC. Those machines seem particularly bad, and a netstat -ee reveals the great number of open TCP connections used by 'oracle' for the cluster stuff. I have an average of 300 lines like this in netstat -ee:

tcp 0 0 {host-vip}:1521 {host-real}:44853 ESTABLISHED oracle 1176037944

The real and VIP addresses are on the same box. My log is stuffed with the dreaded:

Mar 9 18:51:26 {host} syslog-ng[29085]: Connection broken to AF_INET(localhost:5000), reopening in 10 seconds

I'm thinking this box ran out of sockets, but I'm not too sure how to properly troubleshoot/prove that. netstat -s seems to show random drops, but on all machines. Would anyone have more hints on how to check whether I have to tune the kernel for these boxes? Sorry, but this is the first time I'm digging that deep into the stack...

PS: any chance 2.0 will include temporary message suppression code - syslog style (the only thing I really liked about old syslog)?

On 3/3/06, Balazs Scheidler <bazsi@balabit.hu> wrote:
On Thu, 2006-03-02 at 16:15 -0800, stucky wrote:
thanks for your reply ! I am already using keepalive on the server :
tcp (ip("**********") port(5000) max-connections(1000) keep-alive(yes)); };
It doesn't look like I can specify this option on the client though. If I try something like : destination loghost { tcp("*********" port(5000) keep-alive(yes)); };
It complains about a syntax error. I have never seen an option called 'tcp-keep-alive()' but I tried it without luck. It always complains about a syntax error. I guess you meant 'keep-alive(yes)' right ?
tcp-keep-alive was added in 1.6.3 as I read my changelogs.
keep-alive and tcp-keep-alive are different beasts: the first means that syslog-ng should keep all connections open through reloads; the second enables the SO_KEEPALIVE socket option.
As far as upgrading is concerned I want to but I wanted to wait till 2.0 is out and stable. Any ETA ?
It all depends on testers. Judging by the feedback, it is either rock solid or no-one is using it. I suspect it is the latter.
In the current situation, "it works for me" messages would help a lot - the first can be this very mail, as I've been using 1.9.x snapshots for about a year now.
-- Bazsi
-- stucky
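The "ran out of sockets" theory above can be given a rough first check by tallying connection states straight from /proc/net/tcp, where the fourth field is the hex socket state and 01 means ESTABLISHED. A sketch; the sample input below is made up:

```python
from collections import Counter

# Socket states as encoded in /proc/net/tcp (hex values from the kernel headers)
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

def state_counts(proc_tcp_text):
    """Tally socket states from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_tcp_text.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

# Made-up sample resembling /proc/net/tcp output:
sample = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:1388 00000000:0000 0A 00000000:00000000
   1: 0100007F:05F1 0100007F:AF35 01 00000000:00000000
   2: 0100007F:05F1 0100007F:AF36 01 00000000:00000000
"""
print(state_counts(sample))  # → Counter({'ESTABLISHED': 2, 'LISTEN': 1})
```

On the RAC boxes themselves one would feed it open("/proc/net/tcp").read() and compare the totals against the limits in /proc/sys/fs/file-nr and the net.ipv4.ip_local_port_range sysctl; a count near either limit would support the theory.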
On Thu, Mar 09, 2006 at 06:56:47PM -0800, stucky wrote:
I will try 1.9.9 again when I have more time. I tried it once and it literally killed me with stats although I had stats(0) set. It was insane so I turned it off again.
I noticed the same behavior when upgrading from 1.6.x to 1.9.x and filed a Debian bug report: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=352362

john
--
John Morrissey
jwm@horde.net
www.horde.net/
On Fri, 2006-03-10 at 09:04 -0500, John Morrissey wrote:
On Thu, Mar 09, 2006 at 06:56:47PM -0800, stucky wrote:
I will try 1.9.9 again when I have more time. I tried it once and it literally killed me with stats although I had stats(0) set. It was insane so I turned it off again.
I noticed the same behavior when upgrading from 1.6.x to 1.9.x and filed a Debian bug report:
I would prefer bugreports posted to this list; for instance, I did not know about this one until now.

I've just committed this fix, which also fixes a minor issue that syslog-ng did not care about the changed value of stats_freq on SIGHUP (tomorrow's snapshot should include this patch):

--- orig/src/main.c
+++ mod/src/main.c
@@ -140,12 +140,14 @@ main_loop_run(GlobalConfig *cfg)
 {
   GMainLoop *main_loop;
   gint iters;
+  guint stats_timer_id = 0;

   msg_notice("syslog-ng starting up",
              evt_tag_str("version", VERSION),
              NULL);

   main_loop = g_main_loop_new(NULL, TRUE);
-  g_timeout_add(cfg->stats_freq * 1000, stats_timer, NULL);
+  if (cfg->stats_freq > 0)
+    stats_timer_id = g_timeout_add(cfg->stats_freq * 1000, stats_timer, NULL);
   while (g_main_loop_is_running(main_loop))
     {
       if (cfg->time_sleep > 0)
@@ -163,6 +165,12 @@ main_loop_run(GlobalConfig *cfg)
           msg_notice("SIGHUP received, reloading configuration", NULL);
           cfg = cfg_reload_config(cfgfilename, cfg);
           sig_hup_received = FALSE;
+          if (cfg->stats_freq > 0)
+            {
+              if (stats_timer_id != 0)
+                g_source_remove(stats_timer_id);
+              stats_timer_id = g_timeout_add(cfg->stats_freq * 1000, stats_timer, NULL);
+            }
         }
       if (sig_term_received)
         {

-- Bazsi
On Mon, Mar 13, 2006 at 05:44:16PM +0100, Balazs Scheidler wrote:
On Fri, 2006-03-10 at 09:04 -0500, John Morrissey wrote:
I noticed the same behavior when upgrading from 1.6.x to 1.9.x and filed a Debian bug report:
I would prefer bugreports posted to this list, for instance I did not know about this one until now.
Sure. I wasn't certain where to report it; I used the Debian BTS since I originally observed this while using their syslog-ng package, and Debian developers often like it if you submit a Debian bug first when their packages are involved. Thanks for the fix!

john
--
John Morrissey
jwm@horde.net
www.horde.net/
participants (4):
- Balazs Scheidler
- John Morrissey
- Nate Campi
- stucky