RE: [syslog-ng]PTR lookups; leak/crash fixes
[snip]
But now there is a bunch of extra CPU usage (syslog-ng has accumulated 600 minutes of CPU time after running for just one day; this is a 333 MHz machine and there are only a couple dozen logging hosts, most of which come relayed through other syslog-ngs anyway).
[snip] I keep thinking I'm doing something wrong... I have syslog-ng running on a Sun Ultra 10, and it's been running for 192 days, and I have 700 minutes time and 4.5M size... Of course, I only log from 6 other boxes to it for now. :)

Consider this a success story :) I have used syslog-ng to split up incoming logs per facility and priority, so that if I see a message in the crit file, I can go to the corresponding facility file and see if there were any more messages. This functionality has come in very handy a few times. :)

/Mick
Ohrberg, Mick on Thu 5/10 13:20 -0500:
I keep thinking I'm doing something wrong... I have syslog-ng running on a Sun Ultra 10, and it's been running for 192 days, and I have 700 minutes time and 4.5M size... Of course, I only log from 6 other boxes to it for now. :)
Consider this a success story :) I have used syslog-ng to split up incoming logs per facility and priority, so that if I see a message in the crit file, I can go to the corresponding facility file and see if there were any more messages. This functionality has come in very handy a few times. :)
Just for kicks, try HUPping it after you changed the config file to write everything to a file host/program/year.month.day. I bet it starts eating your CPU and RAM until your computer explodes ;)

But we'll see over the next few days... also it HUPs each night when the rotation occurs, so perhaps that does a lot of freeing and we'll get some stable behavior.
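For reference, a per-host/program/date destination of the shape Scott describes can be sketched with syslog-ng's filename macros. This is a hedged sketch, not Scott's actual config: the macro names match the 1.4.x documentation, but the path prefix and create_dirs choice are assumptions.

```
# Hypothetical sketch of "host/program/year.month.day" file splitting
destination perhost {
    file("/var/log/HOSTS/$HOST/$PROGRAM/$YEAR.$MONTH.$DAY"
         create_dirs(yes));   # make the per-host directories on demand
};
```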
I keep thinking I'm doing something wrong... I have syslog-ng running on a Sun Ultra 10, and it's been running for 192 days, and I have 700 minutes time and 4.5M size... Of course, I only log from 6 other boxes to it for now. :)
Consider this a success story :) I have used syslog-ng to split up incoming logs per facility and priority, so that if I see a message in the crit file, I can go to the corresponding facility file and see if there were any more messages. This functionality has come in very handy a few times. :)
Just for kicks, try HUPping it after you changed the config file to write everything to a file host/program/year.month.day
I bet it starts eating your CPU and RAM until your computer explodes ;)
But we'll see over the next few days...also it HUPs each night when the rotation occurs, so perhaps that does a lot of freeing and we'll get some stable behavior.
Actually, I'm testing a Linux box under almost exactly that configuration, and (going to check) after about 24 hours running under stress-test conditions (about 2-4 messages a second from local, about 5-10 messages a second from other machines (UDP)), I've got the following:

VSZ RSS TTY STAT START TIME COMMAND
1596 652 ? S Oct04 0:35 syslog-ng

I haven't actually had any trouble with syslog-ng on Linux since around 1.4.5 or so... Anyhoo, YMMV I suppose.

-- A.L.Lambert
A.L.Lambert on Thu 5/10 14:32 -0500:
Just for kicks, try HUPping it after you changed the config file to write everything to a file host/program/year.month.day
I bet it starts eating your CPU and RAM until your computer explodes ;)
But we'll see over the next few days...also it HUPs each night when the rotation occurs, so perhaps that does a lot of freeing and we'll get some stable behavior.
Actually, I'm testing a Linux box under almost exactly that configuration, and (going to check) after about 24 hours running under stress-test conditions (about 2-4 messages a second from local, about 5-10 messages a second from other machines (UDP)), I've got the following:

VSZ RSS TTY STAT START TIME COMMAND
1596 652 ? S Oct04 0:35 syslog-ng
I haven't actually had any trouble with syslog-ng on Linux since around 1.4.5 or so... Anyhoo, YMMV I suppose.
Wow, I was crashing constantly (like, every half hour) without the assertion fix :) Try accepting from other machines via TCP, that seems to invoke the strange behavior better. I'm also using chain_hostnames which might have something to do with it... more string ops. Here's my ng process:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 27.4 4.3 6268 5572 ? R Oct04 642:10 syslog-ng

on 333Mhz PII. The VM use seems steady though... we'll see tomorrow.
Actually, I'm testing a Linux box under almost exactly that configuration, and (going to check) after about 24 hours running under stress-test conditions (about 2-4 messages a second from local, about 5-10 messages a second from other machines (UDP)), I've got the following:

VSZ RSS TTY STAT START TIME COMMAND
1596 652 ? S Oct04 0:35 syslog-ng
I haven't actually had any trouble with syslog-ng on Linux since around 1.4.5 or so... Anyhoo, YMMV I suppose.
Wow, I was crashing constantly (like, every half hour) without the assertion fix :)
Try accepting from other machines via TCP, that seems to invoke the strange behavior better.
I never could get TCP syslogging to work at all close to right with syslog-ng. Of course, I didn't try all _THAT_ hard to get it going, and it was a few versions back when I tried... And I don't really want to invoke any more strange behavior than I've already got, truth be told. :)
I'm also using chain_hostnames which might have something to do with it...more string ops.
Actually, I wouldn't think that's it, since I'm using that too. The config from my test box is below:

long_hostnames(off);
sync(0);
time_reopen(10);
log_fifo_size(1000);
create_dirs(on);
use_fqdn(yes);
use_dns(no);
chain_hostnames(yes);

That, on a fairly severely modified ImmunixOS 6.2 base with full errata patches in place, a PIII 500 w/ a lot of RAM, and I got 0 problems. Going to start rotating it into heavier production use before too long, so I might see some more "bad things" when the load goes up, but so far, it's lookin' good. :)
Here's my ng process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 27.4 4.3 6268 5572 ? R Oct04 642:10 syslog-ng
on 333Mhz PII. The VM use seems steady though...we'll see tomorrow.
Good luck! -- A.L.Lambert
Scott McDermott on Thu 5/10 15:59 -0400:
I bet it starts eating your CPU and RAM until your computer explodes ;)
But we'll see over the next few days...also it HUPs each night when the rotation occurs, so perhaps that does a lot of freeing and we'll get some stable behavior.
[...] I haven't actually had any trouble with syslog-ng on Linux since around 1.4.5 or so... Anyhoo, YMMV I suppose.
[...] Try accepting from other machines via TCP, that seems to invoke the strange behavior better. I'm also using chain_hostnames which might have something to do with it...more string ops. Here's my ng process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 27.4 4.3 6268 5572 ? R Oct04 642:10 syslog-ng
on 333Mhz PII. The VM use seems steady though...we'll see tomorrow.
No such luck.

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 33.8 6.4 8976 8280 ? R Oct04 1169:12 syslog-ng

Just keeps going up and up, with both CPU and RAM.
Scott McDermott on Fri 6/10 10:33 -0400:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 27.4 4.3 6268 5572 ? R Oct04 642:10 syslog-ng
on 333Mhz PII. The VM use seems steady though...we'll see tomorrow.
No such luck.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 33.8 6.4 8976 8280 ? R Oct04 1169:12 syslog-ng
Just keeps going up and up, with both CPU and RAM.
That was Friday evening. Now it's Monday morning:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 48.5 13.4 17912 17244 ? R Oct04 3764:46 syslog-ng

I guess I go back to a cron job restarting syslog-ng daily...
I am noticing that if I kill off the server and restart using remote syslog-ng, the client syslog-ng doesn't reconnect when I send a message. How long does it take before the client syslog-ng will recognize that the server syslog-ng is there again? (Or does it not reconnect?) Second question: what happens to the logged messages in the time when the client can't talk to the server?

Matthew M. Copeland
matthew.copeland@honeywell.com on Mon 9/10 15:37 -0500:
I am noticing that if I kill off the server and restart using remote syslog-ng, the client syslog-ng doesn't reconnect when I send a message. How long does it take before the client syslog-ng will recognize that the server syslog-ng is there again? (Or does it not reconnect?) Second question: what happens to the logged messages in the time when the client can't talk to the server?
The client does reconnect...but I'm not sure about messages in the interim. I know that I will continue to receive messages afterwards, though. Whether a few were dropped, I don't know, but I doubt it (if you are using TCP).
I am noticing that if I kill off the server and restart using remote syslog-ng, the client syslog-ng doesn't reconnect when I send a message. How long does it take before the client syslog-ng will recognize that the server syslog-ng is there again? (Or does it not reconnect?) Second question: what happens to the logged messages in the time when the client can't talk to the server?
By default, 60 seconds, but you can set it with the time_reopen() option. syslog-ng keeps buffering messages in its destination fifos, whose size can be set using log_fifo_size().

-- Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
url: http://www.balabit.hu/pgpkey.txt
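A sketch of the two options Bazsi names, on a hypothetical client; the server address and port are made up, and the values shown are just the thread's examples (time_reopen defaults to 60 as stated above):

```
options {
    time_reopen(10);      # retry a lost TCP destination every 10 s instead of 60
    log_fifo_size(1000);  # buffer up to 1000 messages per destination fifo
};
destination loghost { tcp("10.0.0.1" port(514)); };
```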
I am running an instance of syslog-ng 1.4.7 which is set to only accept remote syslog-ng connections for the source (at least I think it is). I can accept the connections, but after maybe 700 or 800K, syslog-ng dies. Any thoughts?

Here is a copy of the syslog-ng.conf file I am using for the server:

options { long_hostnames(off); sync(0); };
source src { tcp(port(999) max_connections(310)); internal(); };
destination everything { file("/var/log/think"); };
log { source(src); destination(everything); };

Matthew M. Copeland
On Thu, Oct 19, 2000 at 03:25:33PM +0000, matthew.copeland@honeywell.com wrote:
I am running an instance of syslog-ng 1.4.7 which is set to only accept remote syslog-ng connections for the source (at least I think it is). I can accept the connections, but after maybe 700 or 800K, syslog-ng dies. Any thoughts?
Here is a copy of the syslog-ng.conf file I am using for the server.
How is it dying? A SIGSEGV, a failed assertion, or something else? Which version of libol? And is that 700K or 800K connections, or messages?

-- Bazsi
I am using syslog-ng 1.4.7 and libol 0.2.20 and am attempting to do some tests to make sure that I can get the level of performance that I need. I am having the worst trouble, though, trying to isolate whether a problem is occurring in syslog-ng or within the networking code of the machines that I am using. I've managed to hack things in the network code for some of the machines to get better performance for some of the tests, but we have dropped down the requirements now.

I am sending along some test code that should compile on Red Hat 6.2, Red Hat 7.0, and hopefully others (haven't tried). The problem I am running into is that I lose anywhere between 30 and 800 messages. As the message count and the number of clients go up, I lose more messages.

Some results I have gotten: with 100 threads connecting to the server, I can send 60,000 messages in about 30 seconds without a problem (200 bytes), but when I increase the number to 300 threads, and I have it sending for 30 seconds for a total of 36,000 messages, I start to lose messages. Notice in the first case I was sending more messages and had it sleeping for less time between instances; in the second case it is sleeping longer and sending fewer messages per second per thread.

I have tried it with multiple different machines of varying speeds, and it happens on all of them. For the latest run, the sending machine is an AMD Athlon 1 GHz with 256M RAM, with an ATA100 hard drive running on ATA66, running Red Hat 7.0. For the server, I have a PII 300 MHz with 180M RAM and an ATA33 hard drive running without acceleration. The server is running Red Hat 6.2. I have tried the numbers with the sender and receiver switched, but I lose more messages that way.

If you look in the code at logconf.h, you can change the #defines. logtest.c just creates the threads. log_thread.c has the send_messages function that simulates a syslog-ng client.

If anyone can provide any help, it would be greatly appreciated, since I have to be able to show this working before I can get the full sign-off to use it for the applications we are building.

Matthew M. Copeland
Software Engineer
matthew.copeland@honeywell.com
I am using syslog-ng 1.4.7 and libol 0.2.20 and am attempting to do some tests to make sure that I can get the level of performance that I need. I am having the worst trouble, though, trying to isolate whether a problem is occurring in syslog-ng or within the networking code of the machines that I am using. I've managed to hack things in the network code for some of the machines to get better performance for some of the tests, but we have dropped down the requirements now.
I am sending along some test code that should compile on Red Hat 6.2, Red Hat 7.0, and hopefully others (haven't tried). The problem I am running into is that I lose anywhere between 30 and 800 messages. As the message count and the number of clients go up, I lose more messages.
Some results I have gotten: with 100 threads connecting to the server, I can send 60,000 messages in about 30 seconds without a problem (200 bytes), but when I increase the number to 300 threads, and I have it sending for 30 seconds for a total of 36,000 messages, I start to lose messages. Notice in the first case I was sending more messages and had it sleeping for less time between instances; in the second case it is sleeping longer and sending fewer messages per second per thread.
I have tried it with multiple different machines of varying speeds, and it happens on all of them. For the latest run, the sending machine is an AMD Athlon 1 GHz with 256M RAM, with an ATA100 hard drive running on ATA66, running Red Hat 7.0. For the server, I have a PII 300 MHz with 180M RAM and an ATA33 hard drive running without acceleration. The server is running Red Hat 6.2. I have tried the numbers with the sender and receiver switched, but I lose more messages that way.
If you look in the code at logconf.h, you can change the #defines. logtest.c just creates the threads. log_thread.c has the send_messages function that simulates a syslog-ng client. If anyone can provide any help, it would be greatly appreciated, since I have to be able to show this working before I can get the full sign-off to use it for the applications we are building.
Can you send me your syslog-ng config file on the server? What I'd be especially interested in is your log_fifo_size() setting.

-- Bazsi
Can you send me your syslog-ng config file on the server? What I'd be especially interested in is your log_fifo_size() setting.
Sure. I didn't reset log_fifo_size, since I was under the impression that it would only be useful on the client side.

Matthew M. Copeland
On Tue, Oct 24, 2000 at 08:59:12PM +0000, matthew.copeland@honeywell.com wrote:
Can you send me your syslog-ng config file on the server? What I'd be especially interested in is your log_fifo_size() setting.
Sure. I didn't reset log_fifo_size, since I was under the impression that it would only be useful on the client side.
Try setting it to a larger value; the buffering in syslog-ng doesn't really differentiate between files and destination connections. So set log_fifo_size to 310*20 = 6200.

-- Bazsi
Sure. I didn't reset log_fifo_size, since I was under the impression that it would only be useful on the client side.
Try setting it to a larger value; the buffering in syslog-ng doesn't really differentiate between files and destination connections. So set log_fifo_size to 310*20 = 6200.
Okay, I gave that a try, and it seems to help some, meaning that if I increase it enough, I don't lose more than about 25-50 messages at most. It's odd, because I can lose anywhere between 1 and 50 messages or so, and never get all of them under this test (300 thread clients, 4 messages a second, for 30 seconds == 36,000 messages). I would think that at least once I would get all of the messages, given all the times that I am only losing 2 or 3. I can get all of the messages with fewer clients and more messages per second. Any ideas?

Matthew M. Copeland
On Thu, 26 Oct 2000 matthew.copeland@honeywell.com wrote:
Sure. I didn't reset log_fifo_size, since I was under the impression that it would only be useful on the client side.
Try setting it to a larger value; the buffering in syslog-ng doesn't really differentiate between files and destination connections. So set log_fifo_size to 310*20 = 6200.
Okay, I gave that a try, and it seems to help some, meaning that if I increase it enough, I don't lose more than about 25-50 messages at most. It's odd, because I can lose anywhere between 1 and 50 messages or so, and never get all of them under this test (300 thread clients, 4 messages a second, for 30 seconds == 36,000 messages). I would think that at least once I would get all of the messages, given all the times that I am only losing 2 or 3. I can get all of the messages with fewer clients and more messages per second. Any ideas?
You're almost certainly overflowing the internal queue in either the server or the clients or both. Try logging the messages on the clients to a file as well as the network, and see if the local syslog-ngs are dropping messages before the server has any chance.

You can put some printf statements into the libol code to trace what is happening with each log entry and the output queue. I did this to mine and I can see in the output that syslog-ng sometimes (often) simply throws messages away. The interesting parts are in pkt_buffer.c and queue.c in libol:

if (self->queue_size == self->queue_max) { /* fifo full */  <== oops
    ol_string_free(string);   <== this tosses your message
    return ST_FAIL | ST_OK;   <== return code is ignored
}

I've raised this issue on this list before, but have been ignored. Regardless of how high your fifo size is, syslog-ng will lose messages if the sources generate messages faster than the destination can consume them. Raising the fifo size only masks transients, but does not help in the steady state.

The symptom is easily seen if you send very small messages, such as a three-digit sequence and a newline. This puts as much stress as possible on syslog-ng. If you are suffering from this problem, you will notice in your logs that large blocks of messages are missing. You can generate these messages very quickly with a Perl script writing to a named pipe.

-jwb
You're almost certainly overflowing the internal queue in either the server or the clients or both. Try logging the messages on the clients to a file as well as the network, and see if the local syslog-ngs are dropping messages before the server has any chance.
I can give this a try on the server side, but on the client side I have a custom program that is doing the sending. It could be that the buffers are overflowing there, but I am finding it hard to tell. Also, if it were on the client side, then in theory I should get one of the attempts that receives all of the messages, as the OS scheduler schedules the sending threads differently each run, so I am thinking it must be on the server side somewhere.
You can put some printf statements into the libol code to trace what is happening with each log entry and the output queue. I did this to mine and I can see in the output that syslog-ng sometimes (often) simply throws messages away. The interesting parts are in pkt_buffer.c and queue.c in libol:
if (self->queue_size == self->queue_max) { /* fifo full */  <== oops
    ol_string_free(string);   <== this tosses your message
    return ST_FAIL | ST_OK;   <== return code is ignored
}
Okay, I set log_fifo_size to something like 72,000; does that not integrate well with queue_max? I will give this a try, though, and see if this is where I am losing messages.
I've raised this issue on this list before, but have been ignored. Regardless of how high your fifo size is, syslog-ng will lose messages if the sources generate messages faster than the destination can consume them. Raising the fifo size only masks transients, but does not help in the steady state.
The symptom is easily seen if you send very small messages, such as a three-digit sequence and a newline. This puts as much stress as possible on syslog-ng. If you are suffering from this problem, you will notice in your logs that large blocks of messages are missing. You can generate these messages very quickly with a Perl script writing to a named pipe.
-jwb
On Thu, Oct 26, 2000 at 04:18:23PM +0000, matthew.copeland@honeywell.com wrote:
You're almost certainly overflowing the internal queue in either the server or the clients or both. Try logging the messages on the clients to a file as well as the network, and see if the local syslog-ngs are dropping messages before the server has any chance.
I can give this a try on the server side, but on the client side I have a custom program that is doing the sending. It could be that the buffers are overflowing there, but I am finding it hard to tell. Also, if it were on the client side, then in theory I should get one of the attempts that receives all of the messages, as the OS scheduler schedules the sending threads differently each run, so I am thinking it must be on the server side somewhere.
I tried to run your program, but it couldn't create more than 256 threads. I've recompiled my kernel but haven't had enough time to get to it again.
You can put some printf statements into the libol code to trace what is happening with each log entry and the output queue. I did this to mine and I can see in the output that syslog-ng sometimes (often) simply throws messages away. The interesting parts are in pkt_buffer.c and queue.c in libol:
if (self->queue_size == self->queue_max) { /* fifo full */  <== oops
    ol_string_free(string);   <== this tosses your message
    return ST_FAIL | ST_OK;   <== return code is ignored
}
Okay, I set log_fifo_size to something like 72,000; does that not integrate well with queue_max? I will give this a try, though, and see if this is where I am losing messages.
queue_max is equal to log_fifo_size here.

-- Bazsi
You're almost certainly overflowing the internal queue in either the server or the clients or both. Try logging the messages on the clients to a file as well as the network, and see if the local syslog-ngs are dropping messages before the server has any chance.
You can put some printf statements into the libol code to trace what is happening with each log entry and the output queue. I did this to mine and I can see in the output that syslog-ng sometimes (often) simply throws messages away. The interesting parts are in pkt_buffer.c and queue.c in libol:
if (self->queue_size == self->queue_max) { /* fifo full */  <== oops
    ol_string_free(string);   <== this tosses your message
    return ST_FAIL | ST_OK;   <== return code is ignored
}
I've raised this issue on this list before, but have been ignored.
I wouldn't say you have been ignored; I once explained why your arguments were false. Here's the URL:

http://lists.balabit.hu/pipermail/syslog-ng/2000-October/000890.html

Carefully sizing the output fifo will not drop messages that have already been fetched. The shortest message can be 4 bytes, so with the 1024-byte input buffer (contrary to what I said in the message above, it's not 2048), you can have at most 256 lines in a single packet. Take the number of sources and multiply it by 256, and the resulting fifo will hold all input data.
Regardless of how high your fifo size is, syslog-ng will lose messages if the sources generate messages faster than the destination can consume them. Raising the fifo size only masks transients, but does not help in the steady state.
If the fifo is large enough to hold all messages in a single poll() run, TCP will flow-control the input stream.

-- Bazsi
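Putting Bazsi's arithmetic into a config sketch: the 310 comes from Matthew's max_connections earlier in the thread, and the 256 is the worst case of 1024-byte reads divided by 4-byte minimum messages. The port number is Matthew's; the exact figure is this thread's reasoning, not a documented formula.

```
# 310 sources x (1024 bytes / 4-byte minimum message) = 310 x 256 = 79360 slots
options { log_fifo_size(79360); };
source src { tcp(port(999) max_connections(310)); internal(); };
```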
I found out the problem with the test program you sent to the list. After correcting the problem, all messages were received by syslog-ng just fine. You forgot to add a "\n" to the end of each message; since TCP is used as the transport protocol, messages may be piggybacked.

- int length = sprintf(message, "%d-%d %s", thread_number,(int)counter,buf);
+ int length = sprintf(message, "%d-%d %s\n", thread_number,(int)counter,buf);

tudor:~/src/syslog-ng-1.4.8/src$ wc -l valami
36300 valami

with logconf.h:

#define REMOTE_SERVER "tudor"
#define REMOTE_PORT 2000
#define MESSAGE_SIZE 100
#define USECONDS_BETWEEN_MESSAGES 250000
#define SECONDS_TO_TRANSMIT 30
#define NUMBER_OF_THREADS 300

syslog-ng.conf:

options { gc_idle_threshold(30); gc_busy_threshold(3000); };
source src { tcp(localport(2000) max_connections(300)); };
source xxx { internal(); };
destination dst { file("valami" log_fifo_size(6000)); };
log { source(src); destination(dst); };

-- Bazsi
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 27.4 4.3 6268 5572 ? R Oct04 642:10 syslog-ng
on 333Mhz PII. The VM use seems steady though...we'll see tomorrow.
No such luck.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 33.8 6.4 8976 8280 ? R Oct04 1169:12 syslog-ng
Just keeps going up and up, with both CPU and RAM.
That was Friday evening. Now it's Monday morning:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18717 48.5 13.4 17912 17244 ? R Oct04 3764:46 syslog-ng
I guess I go back to a cron job restarting syslog-ng daily...
Could you please first provide a core dump of syslog-ng? (Of course, only if you think it doesn't contain sensitive information.)

-- Bazsi
participants (7)
- A.L.Lambert
- Balazs Scheidler
- j
- Jeffrey W. Baker
- matthew.copeland@honeywell.com
- Ohrberg, Mick
- Scott McDermott