Hi, I get a kind of message loss trouble like in some previous message with the subject "remote logging not reliable", but in his case, the remote logging was done other tcp and this was pointed as the probable cause of message loss. In my case only udp is involved. Description of the infra: About 2500 unix hosts sending logs via their original syslog daemon (Solaris or RedHat). They are set up that way: ===== kern.info @logserver *.error;user.none @logserver auth.info @logserver ===== The central syslog-ng (1.9.9) server is a solaris 8 host configured that way: ===== options { time_reopen (1); time_reap(600); stats_freq(60); log_fifo_size (25000000); keep_hostname (yes); long_hostnames (no); use_dns (yes); dns_cache (yes); dns_cache_size(3000); use_fqdn (no); # utilisation du nom court de la machine owner("root"); # Logs owner group("sys"); # Logs group owner perm(0755); dir_owner("sysexplo"); # Directory Owner dir_group("sys"); # Directpry Group dir_perm(0775); # Directory Perm create_dirs (yes); use_time_recvd(yes); # gc_idle_threshold(1000); # gc_busy_threshold(100000); }; # # Configuration directives for remote logs # source lan { udp (port(514)); }; destination hostfiles { file("/projets/SYS/sysexplo/syslogdata/$YEAR$MONTH$DAY.logremote/$HOST" owner("sysexplo") group("sys") perm(0755) template("$ISODATE $HOST $MSG\n") ); }; log { source(lan); destination(hostfiles); flags(final); }; # #local logs # source localmsg { sun-stream("/dev/log" door("/etc/.syslog_door")); internal(); }; destination syslog { file("/var/log/syslog"); }; destination authlog { file("/var/adm/authlog"); }; destination messages { file("/var/adm/messages"); }; # filters to mimic traditional Solaris logging filter f_mail { facility(mail); }; filter f_auth { level(info) and facility(auth, authpriv); }; filter f_not_mail { not facility(mail); }; log { source(localmsg); filter(f_auth); destination(authlog); }; log { source(localmsg); filter(f_mail); destination(syslog); }; log { source(localmsg); filter(f_not_mail); destination(messages); }; log { source(localmsg); destination(hostfiles); flags(final); }; # Also save logs from local host # ===== This generates between 3 to 8 GiB logs per day on the logserver. Everything went fine until we wanted to check a specific event on a specific host which generates something like 10 lines of auth.info logs for which only 5 lines were saved on the log server. It appeared that dome packets were randomly ignored. But not sure if it was a network issue (udp) or syslog-ng we produced the following tool to dump what arrives on the udp socket: ==== syslog-dump.c #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <sys/errno.h> #define BUFFLEN 2047 int main (int argc, char **argv) { int syslog_socket; int syslog_port=514; struct sockaddr_in syslog_addr; unsigned char syslog_buffer[BUFFLEN+1]; int len,ret; /* create socket */ syslog_socket=socket(PF_INET, SOCK_DGRAM, 0); if (syslog_socket==-1) { printf("Error creating socket"); exit(1); } bzero(&syslog_addr,sizeof(syslog_addr)); syslog_addr.sin_family = AF_INET; syslog_addr.sin_addr.s_addr = htonl(INADDR_ANY); syslog_addr.sin_port = htons(syslog_port); ret=bind(syslog_socket, (struct sockaddr *) &syslog_addr, sizeof(syslog_addr)); if (ret==-1) { perror("Error binding to socket"); exit(1); } while (len=read(syslog_socket, syslog_buffer, BUFFLEN)){ printf("%s\n",syslog_buffer); bzero(syslog_buffer, BUFFLEN+1); } } ==== We run this instead of syslog-ng for a few minutes and produces som 20000 lines of log on 2 hosts in 20 seconds using logger. During the tests I also kept receiving the normal messages from my 2500 hosts in addition to my test messages. I tested it for more than half an hour and never lost any message using my syslog-dump program (something like 20 consecutive tests). The next half hour I repeated the test with syslog-ng. Unfortunately I lost around 10% to 20% of my test messages every time. I started with syslog-ng 1.6.9 but upgraded to 1.9.9 to be sure I had all the improvements. Has anybody an idea of where my packets get lost, is there a tunning only solution, should I start looking at the code to find my lost packets ? I'd appreciate a small explanation of syslog-ng internal if that is the case :) Aside from that I would like to say that I'm quite happy with syslog-ng features. Regards, Vincent.
On Sun, 5 Mar 2006, Vincent Haverlant wrote:
Hi,
I get a kind of message loss trouble like in some previous message with the subject "remote logging not reliable", but in his case, the remote logging was done other tcp and this was pointed as the probable cause of message loss. In my case only udp is involved.
is your OS dropping packets before syslog-ng gets a chance to grab them? maybe run this command: netstat -su and look for dropped packets. if you see some, then you can increase your UDP buffer size to help smooth this out in the case of bursts. Mike
Hi Mike, Le Sun Mar 5 15:12:11 2006, Mike a écrit: | | | On Sun, 5 Mar 2006, Vincent Haverlant wrote: | | >Hi, | > | >I get a kind of message loss trouble like in some previous message with the subject "remote logging not reliable", but in his case, the remote | >logging was done other tcp and this was pointed as the probable cause of message loss. In my case only udp is involved. | > | | is your OS dropping packets before syslog-ng gets a chance to grab them? Definitely no or at least not in any visible way: - no drop in netstat -s for either UDP nort IPv4 - if I snoop at the same time on the receiving interface, I get the right number of packets. | if you see some, then you can increase your UDP buffer size to help smooth this out in the case of bursts. I'd love to do that anyway but in fact this is only doable by a setsockopt syscall (SO_RCVBUF) and there's not option in the config file to do it. I'll try to do it but I'm not sure the cause is here. Vincent. -- .~. Vincent Haverlant -- Galadril -- #ICQ: 35695155 /V\ MSN: vincent_msn@haverlant.org -- http://www.haverlant.org/ /( )\ Parinux member: http://www.parinux.org/ ^^-^^ GPG: 8FEA 52C2 5C54 A201 2375 0FA5 AF2E 1881 92D0 EE84
| >Hi, | > | >I get a kind of message loss trouble like in some previous message with the subject "remote logging not reliable", but in his case, the remote | >logging was done other tcp and this was pointed as the probable cause of message loss. In my case only udp is involved. | > | | is your OS dropping packets before syslog-ng gets a chance to grab them?
Definitely no or at least not in any visible way: - no drop in netstat -s for either UDP nort IPv4 - if I snoop at the same time on the receiving interface, I get the right number of packets.
| if you see some, then you can increase your UDP buffer size to help smooth this out in the case of bursts.
I'd love to do that anyway but in fact this is only doable by a setsockopt syscall (SO_RCVBUF) and there's not option in the config file to do it. I'll try to do it but I'm not sure the cause is here.
Vincent.
Heya, this tells you how to adjust the buffer size system wide. http://www.29west.com/docs/THPM/udp-buffer-sizing.html but if netstat -su is not showing packet loss, then adjusting this value probably won't help you. but it may be worth a shot.. Mike
Le Mon Mar 6 06:59:01 2006, Mike a écrit: | Heya, | | this tells you how to adjust the buffer size system wide. http://www.29west.com/docs/THPM/udp-buffer-sizing.html | | but if netstat -su is not showing packet loss, then adjusting this value probably won't help you. but it may be worth a shot.. | | Mike Hi, This is a very good and interesting link for the generic tuning of the network stack. However what is says is how to set the udp max buffer size: # ndd /dev/udp udp_max_buf 1048576 which according to http://docs.sun.com/app/docs/doc/806-6779/6jfmsfr8b?a=view sets the "maximum allowed buffer size for an udp socket", not the actual value. The default receive buffer size is set using the parameter udp_recv_hiwat which is currently set to its maximum value of 65536. To increase it the only way is the setsockopt syscall. Anyway I don't get that many udpInOverflows (a few everyday) but it does not explain the few thousand messages I miss in my logs files. Regards, Vincent. -- .~. Vincent Haverlant -- Galadril -- #ICQ: 35695155 /V\ MSN: vincent_msn@haverlant.org -- http://www.haverlant.org/ /( )\ Parinux member: http://www.parinux.org/ ^^-^^ GPG: 8FEA 52C2 5C54 A201 2375 0FA5 AF2E 1881 92D0 EE84
Hey,
Hi,
This is a very good and interesting link for the generic tuning of the network stack. However what is says is how to set the udp max buffer size: # ndd /dev/udp udp_max_buf 1048576 which according to http://docs.sun.com/app/docs/doc/806-6779/6jfmsfr8b?a=view sets the "maximum allowed buffer size for an udp socket", not the actual value.
The default receive buffer size is set using the parameter udp_recv_hiwat which is currently set to its maximum value of 65536. To increase it the only way is the setsockopt syscall.
Anyway I don't get that many udpInOverflows (a few everyday) but it does not explain the few thousand messages I miss in my logs files.
Regards, Vincent.
hmmm. strange..on Linux it seems to use that max value without modifying applications. I could be wrong tho. options { . . . use_dns (yes); dns_cache (yes); dns_cache_size(3000); use_fqdn (no); # utilisation du nom court de la machine . . . . }; hmm..I am kinda wondering about the DNS usage now tho. I have never used this feature before, but from what the docs say, syslog-ng will block on DNS queries...can you tell if any of DNS queries are failing? maybe you could add in some options here: dns_cache_expire(n) Number of seconds while a successful lookup is cached. dns_cache_expire_failed(n) Number of seconds while a failed lookup is cached. but if syslog-ng blocks on DNS queries, I would imagine that you would see your udpInOverflows value increase.... anyone know the default value of dns_cache_expire() off the top of your heads? I would have to admit that I am grabbing at straws now tho. Mike
Le Mon Mar 6 08:07:51 2006, Mike a écrit: | | >The default receive buffer size is set using the parameter | >udp_recv_hiwat which is currently set to its maximum value of 65536. To | >increase it the only way is the setsockopt syscall. | > | >Anyway I don't get that many udpInOverflows (a few everyday) but it does | >not explain the few thousand messages I miss in my logs files. | > | >Regards, | >Vincent. | | | hmmm. strange..on Linux it seems to use that max value without modifying applications. I could be wrong tho. | I think it is the same issue on linux: (extract from man 7 socket) SO_RCVBUF Sets or gets the maximum socket receive buffer in bytes. The default value is set by the rmem_default sysctl and the maximum allowed value is set by the rmem_max sysctl. The buffer size is controled by two parameters: rmem_default and rmem_max. The difference with Solaris is that these two parameters are global to all ip protocols except set otherwise for tcp with tcp_rmem. | options { | . | . | . | use_dns (yes); | dns_cache (yes); | dns_cache_size(3000); | use_fqdn (no); # utilisation du nom court de la machine | . | . | . | . | }; | | hmm..I am kinda wondering about the DNS usage now tho. I have never used this feature before, but from what the docs say, syslog-ng will block on DNS | queries...can you tell if any of DNS queries are failing? I saw that too, unfortunately even after turning it off, I still miss between 5 to 15% messages in my test, which I will admit is a burst test but bursts can happen when you have 2500 hosts. | maybe you could add in some options here: | dns_cache_expire(n) | Number of seconds while a successful lookup is cached. | | | dns_cache_expire_failed(n) | Number of seconds while a failed lookup is cached. | | but if syslog-ng blocks on DNS queries, I would imagine that you would see your udpInOverflows value increase.... | | anyone know the default value of dns_cache_expire() off the top of your heads? dns_cache_expire -> 3600 dns_cache_expire_failed -> 60 I'm quite at loss as to what to do now... Vincent. -- .~. Vincent Haverlant -- Galadril -- #ICQ: 35695155 /V\ MSN: vincent_msn@haverlant.org -- http://www.haverlant.org/ /( )\ Parinux member: http://www.parinux.org/ ^^-^^ GPG: 8FEA 52C2 5C54 A201 2375 0FA5 AF2E 1881 92D0 EE84
Just as a point of reference I'm having the same problem with my configuration. Virtually the same infrastructure, but about 3x the number of linux hosts dumping into syslog. I hadn't taken the next step of actually testing to see if packets were being dropped, etc., but the syslog-ng log reports lots of dropped messages despite using numerous filters to reduce the traffic going to logfiles and the database. Paul Krizak 5900 E. Ben White Blvd. MS 625 Advanced Micro Devices Austin, TX 78741 Linux/Unix Systems Engineering Phone: (512) 602-8775 Microprocessor Solutions Sector Vincent Haverlant wrote:
Le Mon Mar 6 08:07:51 2006, Mike a écrit: | | >The default receive buffer size is set using the parameter | >udp_recv_hiwat which is currently set to its maximum value of 65536. To | >increase it the only way is the setsockopt syscall. | > | >Anyway I don't get that many udpInOverflows (a few everyday) but it does | >not explain the few thousand messages I miss in my logs files. | > | >Regards, | >Vincent. | | | hmmm. strange..on Linux it seems to use that max value without modifying applications. I could be wrong tho. |
I think it is the same issue on linux: (extract from man 7 socket) SO_RCVBUF Sets or gets the maximum socket receive buffer in bytes. The default value is set by the rmem_default sysctl and the maximum allowed value is set by the rmem_max sysctl.
The buffer size is controled by two parameters: rmem_default and rmem_max. The difference with Solaris is that these two parameters are global to all ip protocols except set otherwise for tcp with tcp_rmem.
| options { | . | . | . | use_dns (yes); | dns_cache (yes); | dns_cache_size(3000); | use_fqdn (no); # utilisation du nom court de la machine | . | . | . | . | }; | | hmm..I am kinda wondering about the DNS usage now tho. I have never used this feature before, but from what the docs say, syslog-ng will block on DNS | queries...can you tell if any of DNS queries are failing?
I saw that too, unfortunately even after turning it off, I still miss between 5 to 15% messages in my test, which I will admit is a burst test but bursts can happen when you have 2500 hosts.
| maybe you could add in some options here: | dns_cache_expire(n) | Number of seconds while a successful lookup is cached. | | | dns_cache_expire_failed(n) | Number of seconds while a failed lookup is cached. | | but if syslog-ng blocks on DNS queries, I would imagine that you would see your udpInOverflows value increase.... | | anyone know the default value of dns_cache_expire() off the top of your heads?
dns_cache_expire -> 3600 dns_cache_expire_failed -> 60
I'm quite at loss as to what to do now... Vincent.
Hello Guys, Here's an update on my investigation (still using latest 1.9.9 snapshot on Solaris 8): - I have proof (snoop/tcpdump) that all sent packets indeed arrive at the network interface of my syslog server - I have validated, using a simple while { recvfrom; printf } loop that all udp packets CAN go through the network layer with the default system buffer settings without any packet loss. - By adding a debugging mesage right after the recvfrom call in syslog-ng (in fd_do_read), I can validate that no message that is received that way is lost within syslog-ng. However I still observe a loss of messages. - I do not see a significant increase in udpInOverflows during the tests. Even though through the day, this value may increase of a few units. Default buffer settings are: ndd /dev/udp udp_recv_hiwat 65536 ndd /dev/udp udp_max_buf 1048576 Meaning that the default socket buffer size is 64k. IT CANNOT BE INCREASED FURTHER UP. I thus decided to the following code: ---- int rcvbuflen=1048576; setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuflen, sizeof(rcvbuflen)); ---- in g_sockaddr_inet_bind_prepare which apparently solved my problem or at least allowed my test to pass. This does not guarantee that when the traffic increases further up for any reason, I will not jump once again to the limit of the socket buffer. I still have one problem though: why didn't the packet show up in udpInOverflow when they should have ? I'll keep you updated when I find out. Also I'll try to provide a patch for the setsockop stuff but not being a good developper it might be better if one of you official syslog-ng team member took it from here. Keep up the good work... Vincent. -- .~. Vincent Haverlant -- Galadril -- #ICQ: 35695155 /V\ MSN: vincent_msn@haverlant.org -- http://www.haverlant.org/ /( )\ Parinux member: http://www.parinux.org/ ^^-^^ GPG: 8FEA 52C2 5C54 A201 2375 0FA5 AF2E 1881 92D0 EE84
On Thu, 2006-03-09 at 11:27 +0100, Vincent Haverlant wrote:
I still have one problem though: why didn't the packet show up in udpInOverflow when they should have ? I'll keep you updated when I find out. Also I'll try to provide a patch for the setsockop stuff but not being a good developper it might be better if one of you official syslog-ng team member took it from here.
I would be interested why it did not show up. I'm not using Solaris regularly, but I would look at the values for these stats values (picked from the output of netstat -s) ipInDiscards Probe that fires whenever an input IP datagram is discarded for reasons unrelated to the packet (for example, for lack of buffer space). This probe does not fire for any datagram discarded while awaiting reassembly. udpInOverflows Probe that fires whenever a UDP datagram is received, but subsequently dropped due to lack of buffer space. -- Bazsi
Le Mon Mar 13 17:56:59 2006, Balazs Scheidler a écrit: | On Thu, 2006-03-09 at 11:27 +0100, Vincent Haverlant wrote: | | > I still have one problem though: why didn't the packet show up in | > udpInOverflow when they should have ? I'll keep you updated when I find | > out. Also I'll try to provide a patch for the setsockop stuff but not | > being a good developper it might be better if one of you official | > syslog-ng team member took it from here. | | I would be interested why it did not show up. I'm not using Solaris | regularly, but I would look at the values for these stats values (picked | from the output of netstat -s) Yep, I'm still working on a perfectly reproductible test case before opening a request on Sun support. I'll keep you informed. | ipInDiscards | | Probe that fires whenever an input IP datagram is discarded for reasons | unrelated to the packet (for example, for lack of buffer space). This | probe does not fire for any datagram discarded while awaiting | reassembly. | | udpInOverflows | | Probe that fires whenever a UDP datagram is received, but subsequently | dropped due to lack of buffer space. That's pretty clear, I should see the udpInOverflows counter intrementing. Vincent. -- .~. Vincent Haverlant -- Galadril -- #ICQ: 35695155 /V\ MSN: vincent_msn@haverlant.org -- http://www.haverlant.org/ /( )\ Parinux member: http://www.parinux.org/ ^^-^^ GPG: 8FEA 52C2 5C54 A201 2375 0FA5 AF2E 1881 92D0 EE84
participants (4)
-
Balazs Scheidler
-
Mike
-
Paul Krizak
-
Vincent Haverlant