[syslog-ng]TCP transport reliability

Jon Marks j-marks@uiuc.edu
Thu, 4 Jan 2001 11:09:59 -0600


Hello,

I'm testing syslog-ng with TCP transport, and I'm a little surprised-
perhaps I'm doing something wrong? I'm finding severe message loss,
which I had imagined TCP would eliminate. Briefly, my testing paradigm is
as follows:

On a client, I run a program which generates 5000 messages (LOCAL6/INFO)
in rapid succession. The client is a Sparc Ultra 10, Solaris 7, 
and so uses the reliable door IPC mechanism for local message delivery.
The loghost is an IBM 43P, AIX4.3.3, listening on TCP port 514. The
messages have sequence numbers in their body, for example:

Jan  4 10:34:24 local@peewee/peewee TEST[2362]: (3241) TEST MESSAGE (3241)
Jan  4 10:34:24 local@peewee/peewee TEST[2362]: (3242) TEST MESSAGE (3242)

I use a perl script to detect which are missing from the local (Sparc)
logs and the remote (IBM) logs. Here is a sample result:

SPARC (Client)
-----------------------------------

Missing trials:

Total found: 5000 (missing 0)



IBM (remote Loghost)
-----------------------------------

Missing trials:
3231-3240 3475-3523 3637-3662 3776-3805 4019-4087 4201-4232 4346-4381
4516-4532 4746-4791 4905-4907

Total found: 4682 (missing 318)



Searching the IBM log corroborates this result. Therefore, log messages
are being dropped somewhere. As this example shows, they are not lost on
the Sparc; only on the IBM loghost. This effect is exacerbated by multiple
clients- the client test program can listen on a port for a "go" signal
so that I can coordinate a synchronized stress-test on the loghost.
Massive losses using UDP are not surprising; but I would expect that
with TCP the messages are not lost.

Here are excerpts from the client and loghost config files to demostrate
my belief that I'm using TCP:

Client (SPARC):
-----------------------------------

# ...
destination d_vader { tcp("130.126.112.253" port(514)); };
destination d_local6 { file("/var/log/syslog/local6"); };
log { source(local); filter(f_local6); destination(d_vader);
      destination(d_local6); };


Loghost (IBM)
-----------------------------------

source local { unix-dgram("/dev/log");
               tcp(ip(0.0.0.0) port(514));
                internal(); };
#...
destination d_local6 { file("/var/log/syslog/local6"); };
log { source(local); filter(f_local6);
      destination(d_local6); };


Please ignore that I'm calling messages received from a network source
"local"-- this is only during testing. Would anybody like to share some
insight into this problem?

I have another comment: When I run syslog-ng on Solaris (not AIX, though),
I get the following result.

# /usr/local/sbin/syslog-ng
Terminated

A daemon process is left running, however, and seems to operate normally
(save for the previous issue). Following is 'truss' output for the initial
syslog-ng process, containing a few lines preceeding the receipt of a
kill signal:

open("/etc/passwd", O_RDONLY)                   = 4
fstat64(4, 0xFFBEEB88)                          = 0
ioctl(4, TCGETA, 0xFFBEEB14)                    Err#25 ENOTTY
read(4, " r o o t : x : 0 : 1 : S".., 8192)     = 756
llseek(4, 0xFFFFFFFFFFFFFD2D, SEEK_CUR)         = 33
close(4)                                        = 0
ioctl(3, TCGETA, 0xFFBEF294)                    Err#25 ENOTTY
llseek(3, 0, SEEK_CUR)                          = 4128
close(3)                                        = 0
fork1()                                         = 2539
signotifywait()                                 = 15
lwp_mutex_wakeup(0xFF3DFDF0)                    = 0
lwp_mutex_lock(0xFF3DFDF0)                      = 0
lwp_mutex_wakeup(0xFF14D4D0)                    = 0
lwp_mutex_lock(0xFF14D4D0)                      = 0
lwp_mutex_wakeup(0xFF150C58)                    = 0
lwp_mutex_lock(0xFF150C58)                      = 0
lwp_sema_post(0xFF00FE78)                       = 0
lwp_sema_wait(0xFF00FE78)                       = 0
lwp_mutex_wakeup(0xFF150C58)                    = 0
lwp_mutex_lock(0xFF150C58)                      = 0
    Received signal #15, SIGTERM [default]
      siginfo: SIGTERM pid=2539 uid=0
lwp_sigredirect(1, SIGTERM)                     = 0
        *** process killed ***


Does anyone know offhand why this should happen? 

Thanks very much for any advice you can offer.

-- 
Jonathan Marks

Systems Administrator, Production Systems Group
Computing and Communication Services Office
University of Illinois at Urbana-Champaign