Hello, I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate. Briefly, my testing paradigm is as follows: On a client, I run a program which generates 5000 messages (LOCAL6/INFO) in rapid succession. The client is a Sparc Ultra 10, Solaris 7, and so uses the reliable door IPC mechanism for local message delivery. The loghost is an IBM 43P, AIX4.3.3, listening on TCP port 514. The messages have sequence numbers in their body, for example: Jan 4 10:34:24 local@peewee/peewee TEST[2362]: (3241) TEST MESSAGE (3241) Jan 4 10:34:24 local@peewee/peewee TEST[2362]: (3242) TEST MESSAGE (3242) I use a perl script to detect which are missing from the local (Sparc) logs and the remote (IBM) logs. Here is a sample result: SPARC (Client) ----------------------------------- Missing trials: Total found: 5000 (missing 0) IBM (remote Loghost) ----------------------------------- Missing trials: 3231-3240 3475-3523 3637-3662 3776-3805 4019-4087 4201-4232 4346-4381 4516-4532 4746-4791 4905-4907 Total found: 4682 (missing 318) Searching the IBM log corroborates this result. Therefore, log messages are being dropped somewhere. As this example shows, they are not lost on the Sparc; only on the IBM loghost. This effect is exacerbated by multiple clients- the client test program can listen on a port for a "go" signal so that I can coordinate a synchronized stress-test on the loghost. Massive losses using UDP are not surprising; but I would expect that with TCP the messages are not lost. Here are excerpts from the client and loghost config files to demostrate my belief that I'm using TCP: Client (SPARC): ----------------------------------- # ... destination d_vader { tcp("130.126.112.253" port(514)); }; destination d_local6 { file("/var/log/syslog/local6"); }; log { source(local); filter(f_local6); destination(d_vader); destination(d_local6); }; Loghost (IBM) ----------------------------------- source local { unix-dgram("/dev/log"); tcp(ip(0.0.0.0) port(514)); internal(); }; #... destination d_local6 { file("/var/log/syslog/local6"); }; log { source(local); filter(f_local6); destination(d_local6); }; Please ignore that I'm calling messages received from a network source "local"-- this is only during testing. Would anybody like to share some insight into this problem? I have another comment: When I run syslog-ng on Solaris (not AIX, though), I get the following result. # /usr/local/sbin/syslog-ng Terminated A daemon process is left running, however, and seems to operate normally (save for the previous issue). Following is 'truss' output for the initial syslog-ng process, containing a few lines preceeding the receipt of a kill signal: open("/etc/passwd", O_RDONLY) = 4 fstat64(4, 0xFFBEEB88) = 0 ioctl(4, TCGETA, 0xFFBEEB14) Err#25 ENOTTY read(4, " r o o t : x : 0 : 1 : S".., 8192) = 756 llseek(4, 0xFFFFFFFFFFFFFD2D, SEEK_CUR) = 33 close(4) = 0 ioctl(3, TCGETA, 0xFFBEF294) Err#25 ENOTTY llseek(3, 0, SEEK_CUR) = 4128 close(3) = 0 fork1() = 2539 signotifywait() = 15 lwp_mutex_wakeup(0xFF3DFDF0) = 0 lwp_mutex_lock(0xFF3DFDF0) = 0 lwp_mutex_wakeup(0xFF14D4D0) = 0 lwp_mutex_lock(0xFF14D4D0) = 0 lwp_mutex_wakeup(0xFF150C58) = 0 lwp_mutex_lock(0xFF150C58) = 0 lwp_sema_post(0xFF00FE78) = 0 lwp_sema_wait(0xFF00FE78) = 0 lwp_mutex_wakeup(0xFF150C58) = 0 lwp_mutex_lock(0xFF150C58) = 0 Received signal #15, SIGTERM [default] siginfo: SIGTERM pid=2539 uid=0 lwp_sigredirect(1, SIGTERM) = 0 *** process killed *** Does anyone know offhand why this should happen? Thanks very much for any advice you can offer. -- Jonathan Marks Systems Administrator, Production Systems Group Computing and Communication Services Office University of Illinois at Urbana-Champaign
On Thu, 4 Jan 2001, Jon Marks wrote:
Hello,
I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate.
I've reported the same thing on this list in the past. Indeed, my machines can even lose messages when the input is a pipe and the output is a local file! The culprit is this bit of code at line 71 in libol: if (self->queue_size == self->queue_max) { /* fifo full */ ol_string_free(string); return ST_FAIL | ST_OK; } If the queue is full, your message is dismissed and cannot be recovered. If you put a printf() statement in that code block, I bet you will see that you are hitting this misfeature. I tried fixing this problem, but there is not a simple way. It requires significant changes to syslog-ng's message handling loop. -jwb
Hello,
I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate.
I've reported the same thing on this list in the past. Indeed, my machines can even lose messages when the input is a pipe and the output is a local file!
And I replied that you are _not_ correct. Several times. Please stop spreading misinformation.
The culprit is this bit of code at line 71 in libol:
if (self->queue_size == self->queue_max) { /* fifo full */ ol_string_free(string); return ST_FAIL | ST_OK; }
If the queue is full, your message is dismissed and cannot be recovered. If you put a printf() statement in that code block, I bet you will see that you are hitting this misfeature. I tried fixing this problem, but there is not a simple way. It requires significant changes to syslog-ng's message handling loop.
You have to raise the log_fifo_size variable. Try to estimate the number of possible received messages in a single poll run. Each message source receives maximum 1024 bytes of data. A single log message is about 20 to 30 bytes. So on a single run each log source may emit 1024/20 = 50 messages. Your log fifo must be able to hold this number of messages for each source. So for two sources, you'll need at least 100 slots in your fifo: destination d_tcp { tcp(x.x.x.x log_fifo_size(100)); }; -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1 url: http://www.balabit.hu/pgpkey.txt
Balazs Scheidler on Thu 4/01 19:20 +0100:
I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate.
I've reported the same thing on this list in the past. Indeed, my machines can even lose messages when the input is a pipe and the output is a local file!
And I replied that you are _not_ correct. Several times. Please stop spreading misinformation.
Actually, he is correct, from a simple point of view. Message *are* lost if the queue is not large enough. This is quite clear below; the string is freed and the loop continues, without the log ever being written to backing store anywhere.
The culprit is this bit of code at line 71 in libol:
if (self->queue_size == self->queue_max) { /* fifo full */ ol_string_free(string); return ST_FAIL | ST_OK; }
If the queue is full, your message is dismissed and cannot be recovered. If you put a printf() statement in that code block, I bet you will see that you are hitting this misfeature. I tried fixing this problem, but there is not a simple way. It requires significant changes to syslog-ng's message handling loop.
You have to raise the log_fifo_size variable.
This shouldn't require configuration. Ideally, log_fifo_size should grow and shrink dynamically.
On Thu, 4 Jan 2001, Balazs Scheidler wrote:
Hello,
I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate.
I've reported the same thing on this list in the past. Indeed, my machines can even lose messages when the input is a pipe and the output is a local file!
And I replied that you are _not_ correct. Several times. Please stop spreading misinformation.
Repeating yourself doesn't make you right. Look at the code again: if (self->queue_size == self->queue_max) { /* fifo full */ ol_string_free(string); return ST_FAIL | ST_OK; } What do you think happens when this code is hit? The log message is discarded and lost. It cannot be read any other way. When this block of code is executed, messages are lost. Execution tracing shows that this block of code is sometimes executed. Thus, messages will sometimes be lost. QED. Your repeated assertions are no solution. In practice, syslog-ng drops log messages. I have seen it with my own eyes, and I have proved it logically above. You have not given me any reason to believe otherwise. Your suggestion to raise the fifo size does not work for fast, steady-state log message generation. I can raise the fifo size to 1,000,000 messages, and I get a syslog-ng process that consumes all available memory and still drops messages. -jwb
Hi All! On 2001 Jan 04, Jeffrey W. Baker wrote:
if (self->queue_size == self->queue_max) { /* fifo full */ ol_string_free(string); return ST_FAIL | ST_OK; }
What do you think happens when this code is hit? The log message is discarded and lost. It cannot be read any other way. When this block of code is executed, messages are lost. Execution tracing shows that this block of code is sometimes executed. Thus, messages will sometimes be lost. QED.
Yes.
Your repeated assertions are no solution. In practice, syslog-ng drops log messages. I have seen it with my own eyes, and I have proved it logically above. You have not given me any reason to believe otherwise. Your suggestion to raise the fifo size does not work for fast, steady-state log message generation. I can raise the fifo size to 1,000,000 messages, and I get a syslog-ng process that consumes all available memory and still drops messages.
Yes again, and becouse of this, THERE ARE _NO_ SOLUTION for this problem. If you pour water to a horn in full speed, the water will overflow. Another example. (From a candy camera) There was a conveyer belt. And a cakes go to this belt in every ten seconds. The task was to put the cakes into boxes. Everybody do it rights. But after a while the conveyer belt goes faster. Every people reach his/her limit. After this limit (when belt goes faster than this) cakes dropped at the end of the conveyer belt. This happens becouse input is faster than output. And _nobody_ can do anything to interfere this. But I have no state, when syslog-ng is the bottleneck. More often the HDD or the network card(s). -- PGP ID 0x8D143771, /C5 95 43 F8 6F 19 E8 29 53 5E 96 61 05 63 42 D0 GPG ID ABA0E8B2, 45CF B559 8281 8091 8469 CACD DB71 AEFC ABA0 E8B2 I refused to attend his funeral. But I wrote a very nice letter explaining that I approved of it. -- Mark Twain
SZALAY Attila on Thu, Jan 04, 2001 at 08:58:46PM +0100: Hi,
Yes again, and becouse of this, THERE ARE _NO_ SOLUTION for this problem.
If you pour water to a horn in full speed, the water will overflow.
unless of course, the horn has a way to stop the flow of water until it has capacity to process some more of it. :) The only way to solve this problem (IMHO) would be to make i/o to syslog-ng a blocking operation. Not many situations where this is what you would want. If your system is not fast enough to process the amount of messages that need to go through, and/or doesn't have enough RAM to hold the messages until they can go to disk/network, reduce log-traffic or size your system accordingly. Slowing down the logging clients hardly can be called a solution. Greetings, Gregor. -- Gregor Binder <gbinder@sysfive.com> http://www.sysfive.com/~gbinder/ sysfive.com GmbH UNIX. Networking. Security. Applications. Gaertnerstrasse 125b, 20253 Hamburg, Germany TEL +49-40-63647482
Yes again, and becouse of this, THERE ARE _NO_ SOLUTION for this problem.
If you pour water to a horn in full speed, the water will overflow.
unless of course, the horn has a way to stop the flow of water until it has capacity to process some more of it. :)
The only way to solve this problem (IMHO) would be to make i/o to syslog-ng a blocking operation. Not many situations where this is what you would want.
one of the design goals of syslog-ng was _not_ to block programs sending log messages. I could stop reading the source socket, but that would cause blocked programs on the sending side, which would mean lower performance of your computing system. that said, I may implement this blocking fashion in syslog-ng sometimes. -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1 url: http://www.balabit.hu/pgpkey.txt
Actually what you really want is a synchronous transport bus to carry the data instead of lowly TCP/IP on top of the flimsy Ethernet. The fix is probably more likely to be to build in some of the message passing control features of tools like ISIS or TibCO's TIB Bus. Todd Glassey ----- Original Message ----- From: "Gregor Binder" <gbinder@sysfive.com> To: <syslog-ng@lists.balabit.hu> Sent: Thursday, January 04, 2001 2:51 PM Subject: Re: [syslog-ng]TCP transport reliability
SZALAY Attila on Thu, Jan 04, 2001 at 08:58:46PM +0100:
Hi,
Yes again, and becouse of this, THERE ARE _NO_ SOLUTION for this problem.
If you pour water to a horn in full speed, the water will overflow.
unless of course, the horn has a way to stop the flow of water until it has capacity to process some more of it. :)
The only way to solve this problem (IMHO) would be to make i/o to syslog-ng a blocking operation. Not many situations where this is what you would want.
If your system is not fast enough to process the amount of messages that need to go through, and/or doesn't have enough RAM to hold the messages until they can go to disk/network, reduce log-traffic or size your system accordingly. Slowing down the logging clients hardly can be called a solution.
Greetings, Gregor.
-- Gregor Binder <gbinder@sysfive.com> http://www.sysfive.com/~gbinder/ sysfive.com GmbH UNIX. Networking. Security. Applications. Gaertnerstrasse 125b, 20253 Hamburg, Germany TEL +49-40-63647482
_______________________________________________ syslog-ng maillist - syslog-ng@lists.balabit.hu https://lists.balabit.hu/mailman/listinfo/syslog-ng
Hello,
I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate.
I've reported the same thing on this list in the past. Indeed, my machines can even lose messages when the input is a pipe and the output is a local file!
And I replied that you are _not_ correct. Several times. Please stop spreading misinformation.
Repeating yourself doesn't make you right. Look at the code again:
if (self->queue_size == self->queue_max) { /* fifo full */ ol_string_free(string); return ST_FAIL | ST_OK; }
What do you think happens when this code is hit? The log message is discarded and lost. It cannot be read any other way. When this block of code is executed, messages are lost. Execution tracing shows that this block of code is sometimes executed. Thus, messages will sometimes be lost. QED.
You are right that messages may be lost in a special case. One of the design goals of syslog-ng was that sending processes should never block. Have you ever had a computer whose syslogd was blocked? Sending processes
Your repeated assertions are no solution. In practice, syslog-ng drops log messages. I have seen it with my own eyes, and I have proved it logically above. You have not given me any reason to believe otherwise. Your suggestion to raise the fifo size does not work for fast, steady-state log message generation. I can raise the fifo size to 1,000,000 messages, and I get a syslog-ng process that consumes all available memory and still drops messages.
-jwb
_______________________________________________ syslog-ng maillist - syslog-ng@lists.balabit.hu https://lists.balabit.hu/mailman/listinfo/syslog-ng
-- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1 url: http://www.balabit.hu/pgpkey.txt
I'm testing syslog-ng with TCP transport, and I'm a little surprised- perhaps I'm doing something wrong? I'm finding severe message loss, which I had imagined TCP would eliminate. Briefly, my testing paradigm is as follows:
What are you using to send log messages? shell script with logger, or a custom program or? can you post its source?
lwp_sema_post(0xFF00FE78) = 0 lwp_sema_wait(0xFF00FE78) = 0 lwp_mutex_wakeup(0xFF150C58) = 0 lwp_mutex_lock(0xFF150C58) = 0 Received signal #15, SIGTERM [default] siginfo: SIGTERM pid=2539 uid=0 lwp_sigredirect(1, SIGTERM) = 0 *** process killed ***
Does anyone know offhand why this should happen?
Thanks very much for any advice you can offer.
syslog-ng forks when started but waits for its child to signal whether initializing the configuration was successful. When the config is initialized the child sends a SIGTERM is sent to its parent, exactly the signal it was waiting for. -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1 url: http://www.balabit.hu/pgpkey.txt
participants (7)
-
Balazs Scheidler
-
Gregor Binder
-
Jeffrey W. Baker
-
Jon Marks
-
Scott McDermott
-
SZALAY Attila
-
todd glassey