Hi Krisztian and Balazs,

First: a big thank-you for your work on the tproxy code!

I'm trying to use tproxy to implement a fully transparent layer 7 proxy as follows: TCP connections arrive and are REDIRECTed to a single local port. A userspace process listen()s on that port, and makes ongoing (transparent) connections on new TCP sockets by calling bind(), tproxy setsockopt()s and connect(). In general it works well, but I'm having a few issues which I think are possibly SMP-related. I believe I've reduced some of these to a simple test case, sources for which are attached. I'm using Linux kernel 2.4.27 and all four patches in the cttproxy-2.4.27-2.0.0 patch set.

To run the test case, you need two machines; I think the 'client' must be SMP. The 'server', 10.0.3.2, listens on a single TCP port and has a simple loop which accept()s and close()s TCP connections that it receives. The SMP 'client', 10.0.3.3, has two processes, each connecting to the server. The clients loop through the port range 32768-49152. They bind() on 10.0.3.3, receiving some port from the kernel. They then assign a transparent port in the loop port range on unregistered IP 10.0.3.253, and connect() to the server. (The server has a route set up so that it knows to return traffic for 10.0.3.253 to the client box.)

The problem: once in a while, one of the client processes takes 3s to connect() to the server. Then, the resulting TCP connection is NOT TRANSPARENT (i.e. 10.0.3.3 is used, not 10.0.3.253). This can be seen by running "tcpdump host 10.0.3.3" on either box. However, none of the client process system calls fail at any point.

In the case that CONFIG_IP_NF_NAT_NRES is set, at the same time this happens, the _other process_ has a -EINVAL failure in ip_tproxy_setsockopt_flags(), with corresponding "failed to register NAT reservation" error in dmesg. When CONFIG_IP_NF_NAT_NRES is unset, this failure doesn't happen. But either way, on the _original process_, the non-transparent TCP connection happens.

What I believe is happening is as follows: there is evidence in dmesg that the first SYN packet of the connect() passes through the LOCAL_OUT iptables hooks (I see "ip_tproxy_fn(): new connection, hook=3" and "ip_tproxy_fn(): new connection, hook=4"), but for some reason the packet never actually makes it onto the wire. I can't see where it goes missing. But anyway, connect() waits 3s and resends the SYN. This time, as the second packet goes through iptables, for some reason it's not translated. It makes it onto the wire and the rest of the connection proceeds untranslated.

I haven't been able to progress much further debugging this, and wondered if you had any ideas? My principal concern is that the userspace processes don't receive an error and have no proper way of telling that the connection is going out non-transparently. Am I making a stupid mistake somewhere?

One other curious thing here: MUST_BE_READ_LOCKED(&ip_tproxy_lock) in ip_tproxy_relatedct_add() fails. Could this be related in any way?

Finally, what is the purpose of the new CONFIG_IP_NF_NAT_NRES option?

Thank-you for reading this, and for any advice you have!

Jim Minter <jim@minter.demon.co.uk>

== 8< == CLIENT CODE == 8< ==

#include <arpa/inet.h>
#include <linux/netfilter_ipv4/ip_tproxy2.h> // the v2.0 header
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#include <cerrno>
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

void error(const char *c) { perror(c); exit(1); }

void log(const char *fmt, ...)
{
  char buf[10];
  struct timeval timeval;
  gettimeofday(&timeval, NULL);
  strftime(buf, sizeof(buf), "%T", localtime(&timeval.tv_sec));
  printf("%s.%06u ", buf, (unsigned)timeval.tv_usec);

  va_list args;
  va_start(args, fmt);
  vprintf(fmt, args);
  va_end(args);
}

int main(int argc, char **argv)
{
  int lo = 32768;
  int hi = 49152;

  setbuf(stderr, NULL);
  setlinebuf(stdout);

  while(1)
    for(int port = lo; port < hi; port++)
    {
      fprintf(stderr, ".");

      int s = socket(PF_INET, SOCK_STREAM, 0);
      if(s == -1)
        error("socket");

      { // seems to be necessary...?
        int param = 1;
        if(setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &param, sizeof(param)))
          error("setsockopt SO_REUSEADDR");
      }

      { // bind to our local IP, and output the port the kernel gave us
        struct sockaddr_in sockaddr_in;
        sockaddr_in.sin_family = AF_INET;
        sockaddr_in.sin_port = 0;
        sockaddr_in.sin_addr.s_addr = inet_addr("10.0.3.3");
        if(bind(s, (struct sockaddr *)&sockaddr_in, sizeof(sockaddr_in)))
          error("bind");

        socklen_t sl = sizeof(sockaddr_in);
        if(getsockname(s, (struct sockaddr *)&sockaddr_in, &sl))
          error("getsockname");
        log("%u\n", ntohs(sockaddr_in.sin_port));
      }

      { // now get ourselves a looped port on unregistered IP 10.0.3.253
        struct in_tproxy in_tproxy;

        in_tproxy.op = TPROXY_ASSIGN;
        in_tproxy.v.addr.faddr.s_addr = inet_addr("10.0.3.253");
        in_tproxy.v.addr.fport = htons(port);
        if(setsockopt(s, SOL_IP, IP_TPROXY, &in_tproxy, sizeof(in_tproxy)))
          error("setsockopt TPROXY_ASSIGN");
        log("a\n");

        in_tproxy.op = TPROXY_CONNECT;
        in_tproxy.v.addr.faddr.s_addr = inet_addr("10.0.3.2");
        in_tproxy.v.addr.fport = htons(7000);
        if(setsockopt(s, SOL_IP, IP_TPROXY, &in_tproxy, sizeof(in_tproxy)))
          error("setsockopt TPROXY_CONNECT");
        log("b\n");

        in_tproxy.op = TPROXY_FLAGS;
        in_tproxy.v.flags = ITP_CONNECT | ITP_ONCE;
        if(setsockopt(s, SOL_IP, IP_TPROXY, &in_tproxy, sizeof(in_tproxy)))
        {
          perror("setsockopt TPROXY_FLAGS");
          close(s);
          continue;
        }
        log("c\n");
      }

      { // now connect
        struct sockaddr_in sockaddr_in;
        sockaddr_in.sin_family = AF_INET;
        sockaddr_in.sin_port = htons(7000);
        sockaddr_in.sin_addr.s_addr = inet_addr("10.0.3.2");
        if(connect(s, (struct sockaddr *)&sockaddr_in, sizeof(sockaddr_in)))
          error("connect");
        log("d\n");
      }

      { // wait for other side to close
        char buf;
        if(read(s, &buf, 1) != 0)
          error("read");
      }

      close(s);
      log("e\n");
    }

  return 0;
}

== 8< == CLIENT CODE ENDS == 8< ==

== 8< == SERVER CODE == 8< ==

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <cstdlib>

void error(const char *c) { perror(c); exit(1); }

int main()
{
  int s = socket(PF_INET, SOCK_STREAM, 0);
  if(s == -1)
    error("socket");

  {
    int param = 1;
    if(setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &param, sizeof(param)))
      error("setsockopt SO_REUSEADDR");
  }

  {
    struct sockaddr_in sockaddr_in;
    sockaddr_in.sin_family = AF_INET;
    sockaddr_in.sin_port = htons(7000);
    sockaddr_in.sin_addr.s_addr = INADDR_ANY;
    if(bind(s, (struct sockaddr *)&sockaddr_in, sizeof(sockaddr_in)))
      error("bind");
  }

  if(listen(s, SOMAXCONN))
    error("listen");

  while(1)
  {
    int fd = accept(s, NULL, 0);
    if(fd == -1)
      error("accept");
    close(fd);
  }
}

== 8< == SERVER CODE ENDS == 8< ==
Hi Jim,

On Wednesday, 2004-12-15 at 12:09, jim@minter.demon.co.uk wrote:
I'm trying to use tproxy to implement a fully transparent layer 7 proxy as follows: TCP connections arrive and are REDIRECTed to a single local port. A userspace process listen()s on that port, and makes ongoing (transparent) connections on new TCP sockets by calling bind(), tproxy setsockopt()s and connect(). In general it works well, but I'm having a few issues which I think are possibly SMP-related. I believe I've reduced some of these to a simple test case, sources for which are attached. I'm using Linux kernel 2.4.27 and all four patches in the cttproxy-2.4.27-2.0.0 patch set. To run the test case, you need two machines; I think the 'client' must be SMP.
OK, so everything in my reply is pure theory; I did not test the samples (yet).
The 'server', 10.0.3.2, listens on a single TCP port and has a simple loop which accept()s and close()s TCP connections that it receives.
The SMP 'client', 10.0.3.3, has two processes each connecting to the server. The clients loop through a port range 32768-49152. They bind() on 10.0.3.3, receiving some port from the kernel. They then assign a transparent port in the loop port range on unregistered IP 10.0.3.253, and connect() to the server. (The server has a route set up so that it knows to return traffic on 10.0.3.253 to the client box).
The problem: once in a while, one of the client processes takes 3s to connect() to the server. Then, the resulting TCP connection is NOT TRANSPARENT (i.e. 10.0.3.3 is used, not 10.0.3.253). This can be seen by running "tcpdump host 10.0.3.3" on either box. However, none of the client process system calls fail at any point.
Unfortunately, handling errors is the most problematic part of tproxy. The difficulty lies in the fact that when the setsockopt() calls return, we have no way of knowing whether the not-yet-established connection will clash with another connection in the conntrack hash or not. This is because the connection won't be created until the first packet leaves the machine, which is shortly after you call connect(). If the tproxy Netfilter hook detects that it cannot apply a NAT mapping, it just drops the packet (and probably the conntrack entry as well), since it has no way of notifying the user-space process.

If you start two client processes, you'll have a good chance of trying to assign "colliding" foreign addresses. If you set REUSEADDR, tproxy will allow you to assign the same foreign address more than once, since you've explicitly requested to do so by setting REUSEADDR (let's assume you've chosen port x). However, as soon as you try to use them, you'll experience problems, since the reply tuples of the connections would be the same. Of course connection tracking won't allow this, so trying to apply the NAT mapping will fail for one of the client processes. (I don't know yet why the packets leave the machine with an unmodified source IP; in theory they should be dropped, or at least NAT-ted to the wrong source port number...)
In the case that CONFIG_IP_NF_NAT_NRES is set, at the same time this happens, the _other process_ has a -EINVAL failure in ip_tproxy_setsockopt_flags(), with corresponding "failed to register NAT reservation" error in dmesg. When CONFIG_IP_NF_NAT_NRES is unset, this failure doesn't happen. But either way, on the _original process_, the non-transparent TCP connection happens.
NAT reservations make it possible for tproxy to fail early. If NAT reservations are enabled, tproxy registers "reservations" for foreign addresses to be used later. If such a registration fails, that means that the foreign address is already reserved for some other connection. This is why in that case even the setsockopt() call fails. This is good, since it provides you a way of detecting the error.
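To make that concrete, here is a minimal sketch (untested) of how a client could exploit the early failure. It reuses the struct in_tproxy API from the attached test program, and makes two assumptions worth flagging: that EINVAL from the FLAGS setsockopt() specifically indicates a reservation clash, and that TPROXY_ASSIGN may be re-issued on the same socket (if not, open a fresh socket per attempt):

    /* Sketch only: retry foreign ports until the NAT reservation
       (registered by the TPROXY_FLAGS call when CONFIG_IP_NF_NAT_NRES
       is enabled) succeeds.  Assumes the same headers as the attached
       client, including <cerrno>. */
    int assign_foreign(int s, in_addr_t faddr, int lo, int hi,
                       in_addr_t daddr, unsigned short dport)
    {
        struct in_tproxy itp;

        for(int port = lo; port <= hi; port++)
        {
            itp.op = TPROXY_ASSIGN;
            itp.v.addr.faddr.s_addr = faddr;
            itp.v.addr.fport = htons(port);
            if(setsockopt(s, SOL_IP, IP_TPROXY, &itp, sizeof(itp)))
                return -1;

            itp.op = TPROXY_CONNECT;
            itp.v.addr.faddr.s_addr = daddr;
            itp.v.addr.fport = htons(dport);
            if(setsockopt(s, SOL_IP, IP_TPROXY, &itp, sizeof(itp)))
                return -1;

            itp.op = TPROXY_FLAGS;
            itp.v.flags = ITP_CONNECT | ITP_ONCE;
            if(setsockopt(s, SOL_IP, IP_TPROXY, &itp, sizeof(itp)) == 0)
                return port;    /* reservation held; safe to connect() */

            if(errno != EINVAL) /* assumed: EINVAL = reservation clash */
                return -1;
            /* otherwise try the next foreign port */
        }
        return -1;              /* whole range reserved elsewhere */
    }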
What I believe is happening is as follows: there is evidence in dmesg that the first SYN packet of the connect() passes through the LOCAL_OUT iptables hooks (I see "ip_tproxy_fn(): new connection, hook=3" and "ip_tproxy_fn(): new connection, hook=4"), but for some reason the packet never actually makes it onto the wire.
Don't you have any kind of errors in the kernel logs when this happens? Tproxy could drop the packet, but you should get an error message in that case.
I can't see where it goes missing. But anyway, connect() waits 3s and resends the SYN. This time, as the second packet goes through the iptables, for some reason it's not translated. It makes it onto the wire and the rest of the connection proceeds untranslated.
_This_ is strange... Could you send me a tcpdump capture of that traffic and the matching tproxy debug output?
I haven't been able to progress much further debugging this, and wondered if you had any ideas? My principal concern is that the userspace processes don't receive an error and have no proper way of telling that the connection is going out non-transparently. Am I making a stupid mistake somewhere?
I have a few recommendations:

* Try to avoid explicitly specifying the foreign (fake) port number at all costs. If you assign a foreign port of zero, connection tracking will select a free port number when applying the NAT mapping. This way you won't have such weird problems. (A sketch of this follows the list.)

* Each and every connection _must_ have unique endpoints. When you run two instances of your client, you'll run into a theoretical problem as well: sometimes you try to establish two TCP connections with exactly the same endpoints. This is clearly invalid, and wouldn't be possible without using tproxy, of course.
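As a minimal sketch (untested), the zero-port variant changes only the TPROXY_ASSIGN step of the attached client; no calls beyond those already in the test program are needed:

    /* Sketch: let connection tracking pick a free foreign port when
       the NAT mapping is applied, instead of forcing one ourselves. */
    struct in_tproxy in_tproxy;
    in_tproxy.op = TPROXY_ASSIGN;
    in_tproxy.v.addr.faddr.s_addr = inet_addr("10.0.3.253");
    in_tproxy.v.addr.fport = htons(0);   /* 0 = kernel-chosen port */
    if(setsockopt(s, SOL_IP, IP_TPROXY, &in_tproxy, sizeof(in_tproxy)))
        error("setsockopt TPROXY_ASSIGN");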
One other curious thing here: MUST_BE_READ_LOCKED(&ip_tproxy_lock) in ip_tproxy_relatedct_add() fails. Could this be related in any way?
Not really, that call is completely bogus IMHO. We probably don't need that check there, I'll remove it.
Finally, what is the purpose of the new CONFIG_IP_NF_NAT_NRES option?
See above. :)

--
Regards, Krisztian KOVACS
Hi Krisztian! Many thanks for your reply. hidden@balabit.hu wrote: <reply snipped in places>
Unfortunately handling errors is the most problematic part of tproxy. The difficulty lies in the fact that when the setsockopt() calls return, we have no way of knowing if the not-yet-established connection will clash with another connection in the conntrack hash or not. This is because the connection won't be created until the first packet leaves the machine, which is shortly after you call connect(). If the tproxy Netfilter hook detects that it cannot apply a NAT mapping, it just drops the packet (and probably the conntrack entry as well) since it has no way of notifying the user-space process.
Agreed. It's not possible to pre-add the mapping to the conntrack table at the setsockopt() stage, I take it. I'll be keen to move to using NAT reservations as evidently it will help me in the long run -- it's just that as this bug shows up with and without them, at this stage I'm not using them, for simplicity.
If you start two client processes, you'll have a good chance of trying to assign "colliding" foreign addresses. If you set REUSEADDR, tproxy will allow you to assign the same foreign address more than once, since you've explicitly requested to do so by setting REUSEADDR (let's assume you've chosen port x). However, as soon as you try to use them, you'll experience problems, since the reply tuples of the connections would be the same. Of course connection tracking won't allow this, so trying to apply the NAT mapping will fail for one of the client processes. (I don't know yet why the packets leave the machine with an unmodified source IP, in theory they should be dropped, or at least NAT-ted to the wrong source port number...)
Fair enough. If I adjust the program such that one process asks tproxy to assign odd-numbered foreign ports, and the other process even-numbered foreign ports, the problem still happens just as quickly -- so it's not a simple collision fault!

As an aside, the Linux TCP/IP stack allows a single IP address to make >65,536 TCP connections at once. It does this by allowing >1 sockets to share the same local port [in the auto-bind code called by TCP connect()], as long as they're connecting to different remote end-points. The return packets are demultiplexed by remote end-point as well as the local one. Additionally, some OSes even allow the user to pre-bind sockets to a local port of _their choice_ before making a connect(), easily allowing >1 connections at once per local port!
What I believe is happening is as follows: there is evidence in dmesg that the first SYN packet of the connect() passes through the LOCAL_OUT iptables hooks (I see "ip_tproxy_fn(): new connection, hook=3" and "ip_tproxy_fn(): new connection, hook=4"), but for some reason the packet never actually makes it onto the wire.
Don't you have any kind of errors in the kernel logs when this happens? Tproxy could drop the packet, but you should get an error message in that case.
No errors at all :o(. The curious thing is that I added extra printk's to all the cases in the tproxy code where I could see "return NF_DROP" (or equivalent), and none of these printed -- so I presume the packet drop is elsewhere (I don't know where).
_This_ is strange... Could you send me a tcpdump capture of that traffic and the matching tproxy debug output?
Will do, in a separate post.
I have a few recommendations:
* Try to avoid explicitly specifying the foreign (fake) port number at all costs. If you assign a foreign port of zero, connection tracking will select a free port number when applying the NAT mapping. This way you won't have such weird problems.
I agree, I'd love to, but my app isn't able to choose the fake ports it uses -- my only option is detecting errors and dropping the connection if necessary.
* Each and every connection _must_ have unique endpoints. When you run two instances of your client, you'll run into a theoretical problem as well: sometimes you try to establish two TCP connections with exactly the same endpoints. This is clearly invalid, and wouldn't be possible without using tproxy, of course.
Yes, you're right. It is possible to run into this case with the test programs I sent if you wait long enough, but I'm not too worried about this just now as it doesn't appear to result in any more non-NATted traffic.
One other curious thing here: MUST_BE_READ_LOCKED(&ip_tproxy_lock) in ip_tproxy_relatedct_add() fails. Could this be related in any way?
Not really, that call is completely bogus IMHO. We probably don't need that check there, I'll remove it.
OK. Food for thought :o). I'll get back to you with some tcpdumps, etc. Cheers, Jim
Hi Jim,

On Friday, 2004-12-17 at 15:19, jim@minter.demon.co.uk wrote:
Fair enough. If I adjust the program such that one process asks tproxy to assign odd numbered foreign ports, and the other process even numbered foreign ports, the problem still happens just as quickly -- so it's not a simple collision fault!
As an aside, the Linux TCP/IP stack allows a single IP address to make >65,536 TCP connections at once. It does this by allowing >1 sockets to share the same local port [in the auto-bind code called by TCP connect()], as long as they're connecting to different remote end-points. The return packets are demultiplexed by remote end-point as well as the local one. Additionally, some OSes even allow the user to pre-bind sockets to a local port of _their choice_ before making a connect(), easily allowing >1 connections at once per local port!
Of course, this is clear. This is what REUSEADDR was invented for, and tproxy allows you to assign the same foreign address to multiple sockets as well (of course with some restrictions). Unfortunately, in the IP stack of the kernel these things are much simpler: if you set REUSEADDR, you're allowed to bind() to an address already taken. However, you'll get an error when trying to connect() to the same destination host. In the case of tproxy this is much more difficult, since you won't be able to detect clashes before it's too late (without NAT reservations, that is).
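To illustrate the plain-stack behaviour, independent of tproxy, here is a minimal sketch (the addresses are made up, error handling is elided, and the exact errno is an assumption, though on Linux the clash typically surfaces as EADDRNOTAVAIL from connect()):

    /* Sketch: REUSEADDR lets several sockets bind() the same local
       address; the clash check happens at connect() time, when the
       stack needs a unique 4-tuple. */
    #include <arpa/inet.h>
    #include <cstdio>
    #include <cstring>
    #include <sys/socket.h>

    static int bound_socket()
    {
        int s = socket(PF_INET, SOCK_STREAM, 0);
        int on = 1;
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        struct sockaddr_in local;
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_port = htons(5555);
        local.sin_addr.s_addr = inet_addr("10.0.3.3");
        if(bind(s, (struct sockaddr *)&local, sizeof(local)))
            perror("bind");     /* with REUSEADDR, repeat binds succeed */
        return s;
    }

    static int connect_to(int s, const char *ip)
    {
        struct sockaddr_in remote;
        memset(&remote, 0, sizeof(remote));
        remote.sin_family = AF_INET;
        remote.sin_port = htons(7000);
        remote.sin_addr.s_addr = inet_addr(ip);
        return connect(s, (struct sockaddr *)&remote, sizeof(remote));
    }

    int main()
    {
        int a = bound_socket(), b = bound_socket(), c = bound_socket();

        connect_to(a, "10.0.3.2");      /* ok */
        connect_to(b, "10.0.3.4");      /* ok: different remote endpoint,
                                           so the 4-tuple is still unique */
        if(connect_to(c, "10.0.3.2"))   /* same 4-tuple as 'a' */
            perror("connect");          /* typically EADDRNOTAVAIL */
        return 0;
    }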
What I believe is happening is as follows: there is evidence in dmesg that the first SYN packet of the connect() passes through the LOCAL_OUT iptables hooks (I see "ip_tproxy_fn(): new connection, hook=3" and "ip_tproxy_fn(): new connection, hook=4"), but for some reason the packet never actually makes it onto the wire.
Don't you have any kind of errors in the kernel logs when this happens? Tproxy could drop the packet, but you should get an error message in that case.
No errors at all :o(. The curious thing is that I added extra printk's to all the cases in the tproxy code where I could see "return NF_DROP" (or equivalent), and none of these printed -- so I presume the packet drop is elsewhere (I don't know where).
OK, do you have any DNAT/MASQUERADE rules in your iptables config? Or what kind of NAT rules do you use?

Another shortcoming of the NAT-based operation of tproxy is the following: you have to make sure that you do not reuse the _local_ address before the conntrack entry of the previous connection from that address times out. So, if you make a lot of connections from the same IP, and the local autobind port range is not enough for you, you'll have to use additional local IP addresses as well. (Note that these do not need to be routable IP addresses.) For example, if you make 400 short-lived connections per second, and have configured the local port range to contain 50000 ports, it will take 125 seconds for the port range to turn over. The timeout of conntrack entries in TIME_WAIT state is 120 seconds, so with 400 cps you're already likely to have problems.
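The same arithmetic as a tiny self-contained sketch (the figures are those from the paragraph above, not measurements):

    /* Sketch: local port range turnover vs. conntrack TIME_WAIT. */
    #include <cstdio>

    int main()
    {
        const double ports     = 50000; /* local autobind range size  */
        const double time_wait = 120;   /* conntrack TIME_WAIT, secs  */
        const double cps       = 400;   /* new connections per second */

        /* prints: turnover 125 s vs timeout 120 s; max safe rate ~417 cps */
        printf("turnover %.0f s vs timeout %.0f s; max safe rate ~%.0f cps\n",
               ports / cps, time_wait, ports / time_wait);
        return 0;
    }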
_This_ is strange... Could you send me a tcpdump capture of that traffic and the matching tproxy debug output?
Will do, in a separate post.
I have a few recommendations:
* Try to avoid explicitly specifying the foreign (fake) port number at all costs. If you assign a foreign port of zero, connection tracking will select a free port number when applying the NAT mapping. This way you won't have such weird problems.
I agree, I'd love to, but my app isn't able to choose the fake ports it uses -- my only option is detecting errors and dropping the connection if necessary.
You're right; unfortunately there are cases when this is not an option.

--
Regards, Krisztian KOVACS
Hi Krisztian :o)
OK, do you have any DNAT/MASQUERADE rules in your iptables config? Or what kind of NAT rules do you use?
None!
Another shortcoming of the NAT-based operation of tproxy is the following: you have to make sure that you do not reuse the _local_ address before the conntrack entry of the previous connection from that address times out. So, if you make a lot of connections from the same IP, and the local autobind port range is not enough for you, you'll have to use additional local IP addresses as well. (Note that these do not need to be routable IP addresses.)
I'm aware of this -- the examples I've put together (see below) are taken immediately after booting the kernel, and problems occur well before the local TCP port range is exhausted.
_This_ is strange... Could you send me a tcpdump capture of that traffic and the matching tproxy debug output?
Will do, in a separate post.
I've put together a fine collection of logs and tcpdumps from a 20s run of my test programs. They show the problem occurring six times and the tar file is 2.2M. Is there somewhere I can e-mail/FTP this to, for you to see? Cheers, Jim
Hi,
I've put together a fine collection of logs and tcpdumps from a 20s run of my test programs. They show the problem occurring six times and the tar file is 2.2M. Is there somewhere I can e-mail/FTP this to, for you to see?
You should be able to get this (all being well) at: http://www.minter.demon.co.uk/tproxy-bug.tar.bz2 See the README file within the package. Cheers, Jim
Hi Jim,

On Friday, 2004-12-17 at 16:55, jim@minter.demon.co.uk wrote:
I've put together a fine collection of logs and tcpdumps from a 20s run of my test programs. They show the problem occurring six times and the tar file is 2.2M. Is there somewhere I can e-mail/FTP this to, for you to see?
You should be able to get this (all being well) at: http://www.minter.demon.co.uk/tproxy-bug.tar.bz2
See the README file within the package.
OK, thanks, I've downloaded the tarball. BTW, the syslog is indeed not very useful, since it is horribly incomplete...

Could you try what happens if you omit the ITP_ONCE flag from the FLAGS setsockopt(), and set only ITP_CONNECT?

--
Regards, Krisztian KOVACS
Hi! hidden@balabit.hu wrote:
OK, thanks, I've downloaded the tarball. BTW, the syslog is indeed not very useful, since it is horribly incomplete...
Sorry :o(. I'm currently recompiling the kernel with a larger log buffer and will rerun the tests and post an updated tarball.
Could you try what happens if you omit the ITP_ONCE flag from the FLAGS setsockopt(), and set only ITP_CONNECT?
OK, in this case we don't get any un-NATted packets at the remote host, but sooner or later one of the processes gets stuck in a connect() call and never returns: presumably every time it attempts to issue a SYN packet, this packet gets lost somewhere? Maybe with proper logging it will be clearer what's going on here. Jim
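(For reference, the variant tested here amounts to just this change in the attached client's FLAGS step -- a sketch, with everything else unchanged:)

    in_tproxy.op = TPROXY_FLAGS;
    in_tproxy.v.flags = ITP_CONNECT;   /* was: ITP_CONNECT | ITP_ONCE */
    if(setsockopt(s, SOL_IP, IP_TPROXY, &in_tproxy, sizeof(in_tproxy)))
    {
        perror("setsockopt TPROXY_FLAGS");
        close(s);
        continue;
    }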
Hi Jim,

On Monday, 2004-12-20 at 11:48, jim@minter.demon.co.uk wrote:
hidden@balabit.hu wrote:
OK, thanks, I've downloaded the tarball. BTW, the syslog is indeed not very useful, since it is horribly incomplete...
Sorry :o(. I'm currently recompiling the kernel with a larger log buffer and will rerun the tests and post an updated tarball.
I'm afraid it won't help much, but let's see.
Could you try what happens if you omit the ITP_ONCE flag from the FLAGS setsockopt(), and set only ITP_CONNECT?
OK, in this case we don't get any un-NATted packets at the remote host, but sooner or later one of the processes gets stuck in a connect() call and never returns: presumably every time it attempts to issue a SYN packet, this packet gets lost somewhere? Maybe with proper logging it will be clearer what's going on here.
OK, thanks. So, in the meantime I reproduced the problem (and tested without ITP_ONCE as well). Seems interesting, since I get a lot of "failed to apply NAT mapping" errors...

--
Regards, Krisztian KOVACS
Hi Krisztian!
Sorry :o(. I'm currently recompiling the kernel with a larger log buffer and will rerun the tests and post an updated tarball.
I'm afraid it won't help much, but let's see.
OK, http://www.minter.demon.co.uk/tproxy-bug-2.tar.bz2 [3.1M] is available; the system log files are considerably longer this time, but it's possible that you'll find that they're still not complete :o/
Could you try what happens if you omit the ITP_ONCE flag from the FLAGS setsockopt(), and set only ITP_CONNECT?
OK, in this case we don't get any un-NATted packets at the remote host, but sooner or later one of the processes gets stuck in a connect() call and never returns: presumably every time it attempts to issue a SYN packet, this packet gets lost somewhere? Maybe with proper logging it will be clearer what's going on here.
OK, thanks. So, in the meantime I reproduced the problem (and tested without ITP_ONCE as well). Seems interesting, since I get a lot of "failed to apply NAT mapping" errors...
The above tarball also has a log of a run without the ITP_ONCE flag. It's encouraging that you've been able to reproduce the problem at your end -- was it on an SMP box? By "failed to apply NAT mapping" error, I assume you mean the "IP_TPROXY: error applying NAT mapping" error? Just to confirm, I'm not getting any of these error messages at all (perhaps because I've configured NAT reservations off?) Cheers, Jim
On Mon, Dec 20, 2004 at 10:48:50AM +0000, jim@minter.demon.co.uk wrote:
Sorry :o(. I'm currently recompiling the kernel with a larger log buffer and will rerun the tests and post an updated tarball.
If you just run 'dmesg', it uses a read buffer that is not big enough to pull everything from the kernel's log buffer. Try 'dmesg -s 10000000' or something like that.

--L