Hi Krisztian! Many thanks for your reply.

hidden@balabit.hu wrote: <reply snipped in places>
Unfortunately, handling errors is the most problematic part of tproxy. The difficulty is that when the setsockopt() calls return, we have no way of knowing whether the not-yet-established connection will clash with another connection in the conntrack hash. This is because the connection isn't created until the first packet leaves the machine, which happens shortly after you call connect(). If the tproxy Netfilter hook detects that it cannot apply a NAT mapping, it just drops the packet (and probably the conntrack entry as well), since it has no way of notifying the user-space process.
Agreed. I take it it's not possible to pre-add the mapping to the conntrack table at the setsockopt() stage. I'm keen to move to NAT reservations, since they'll evidently help me in the long run -- it's just that this bug shows up both with and without them, so at this stage I'm not using them, for simplicity.
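So, as I understand it, the sequence where things go wrong looks roughly like this (my reading of the above, not verified against the source):

    setsockopt(IP_TPROXY, ...)  -> succeeds; nothing checked against conntrack yet
    connect()                   -> first SYN enters LOCAL_OUT
    tproxy hook runs            -> tries to apply the NAT mapping
    mapping clashes             -> packet (and probably the conntrack entry) dropped
    user space                  -> never notified; connect() eventually times out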
If you start two client processes, you have a good chance of trying to assign "colliding" foreign addresses. If you set REUSEADDR, tproxy will let you assign the same foreign address more than once, since you've explicitly requested it (let's assume you've chosen port x). However, as soon as you try to use the sockets you'll run into problems, since the reply tuples of the two connections would be identical. Connection tracking of course won't allow this, so applying the NAT mapping will fail for one of the client processes. (I don't know yet why the packets leave the machine with an unmodified source IP; in theory they should be dropped, or at least NAT-ted to the wrong source port number...)
Fair enough. If I adjust the program so that one process asks tproxy to assign odd-numbered foreign ports, and the other process even-numbered foreign ports, the problem still happens just as quickly -- so it's not a simple collision fault!

As an aside, the Linux TCP/IP stack allows a single IP address to make more than 65,536 TCP connections at once. It does this by allowing more than one socket to share the same local port [in the auto-bind code called by TCP connect()], as long as they're connecting to different remote endpoints. The return packets are demultiplexed by the remote endpoint as well as the local one. Some OSes even allow the user to pre-bind sockets to a local port of _their choice_ before calling connect(), easily allowing more than one connection at once per local port!
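For the curious, the pre-bind trick looks roughly like this (a minimal sketch; addresses and ports are made up, both sockets need SO_REUSEADDR set before bind(), and the remote endpoints must differ or the second connect() fails):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Open a TCP socket bound to local port lport and connect it to
     * rip:rport. SO_REUSEADDR lets several such sockets share the same
     * local port, as long as each connects to a different remote end. */
    static int connect_from_port(unsigned short lport,
                                 const char *rip, unsigned short rport)
    {
        struct sockaddr_in la, ra;
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        memset(&la, 0, sizeof(la));
        la.sin_family = AF_INET;
        la.sin_port = htons(lport);
        la.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(fd, (struct sockaddr *) &la, sizeof(la)) < 0) {
            close(fd);
            return -1;          /* EADDRINUSE if the OS disallows the trick */
        }

        memset(&ra, 0, sizeof(ra));
        ra.sin_family = AF_INET;
        ra.sin_port = htons(rport);
        inet_pton(AF_INET, rip, &ra.sin_addr);
        if (connect(fd, (struct sockaddr *) &ra, sizeof(ra)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int main(void)
    {
        /* same local port, two different remote endpoints */
        int a = connect_from_port(40000, "192.0.2.1", 80);
        int b = connect_from_port(40000, "192.0.2.2", 80);

        printf("fd a=%d, fd b=%d\n", a, b);
        return 0;
    }

As far as I know, the second bind() only succeeds because both sockets set SO_REUSEADDR and neither is listening; connecting both to the same remote endpoint would fail at the second connect().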
What I believe is happening is as follows: there is evidence in dmesg that the first SYN packet of the connect() passes through the LOCAL_OUT and POST_ROUTING iptables hooks (I see "ip_tproxy_fn(): new connection, hook=3" and "ip_tproxy_fn(): new connection, hook=4"), but for some reason the packet never actually makes it onto the wire.
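(For reference, the hook numbers in those messages come from the five IPv4 netfilter hook points, so hook=3 and hook=4 mean the SYN reached both output-side hooks:

    /* from include/linux/netfilter_ipv4.h */
    #define NF_IP_PRE_ROUTING   0   /* after sanity checks, before routing */
    #define NF_IP_LOCAL_IN      1   /* packets addressed to this host      */
    #define NF_IP_FORWARD       2   /* packets routed through the box      */
    #define NF_IP_LOCAL_OUT     3   /* locally generated, before routing   */
    #define NF_IP_POST_ROUTING  4   /* last hook before the wire           */

so whatever drops the packet does so at or after POST_ROUTING.)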
Don't you get any errors in the kernel logs when this happens? Tproxy could be dropping the packet, but you should see an error message in that case.
No errors at all :o(. The curious thing is that I added extra printk()s to every place in the tproxy code where I could see "return NF_DROP" (or equivalent), and none of them printed -- so I presume the packet is being dropped elsewhere (I don't know where).
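For the record, the instrumentation was of this form (illustrative only -- the condition and message here are made up, the pattern is what matters):

    /* immediately before each drop site in the tproxy hook functions: */
    if (error_condition) {
        printk(KERN_DEBUG "ip_tproxy: dropping packet at %s:%d\n",
               __FUNCTION__, __LINE__);
        return NF_DROP;
    }

None of these ever fired, so the drop happens outside these paths.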
_This_ is strange... Could you send me a tcpdump capture of that traffic and the matching tproxy debug output?
Will do, in a separate post.
I have a few recommendations:
* Avoid explicitly specifying the foreign (fake) port number at all costs. If you assign a foreign port of zero, connection tracking will select a free port number when applying the NAT mapping. This way you won't run into such weird problems.
I agree, I'd love to, but my app isn't able to choose the fake ports it uses -- my only option is to detect errors and drop the connection if necessary. (For reference, I've put a sketch of what the port-zero assignment would look like after this list.)
* Each and every connection _must_ have unique endpoints. When you run two instances of your client, you'll run into a theoretical problem as well: sometimes you'll try to establish two TCP connections with exactly the same endpoints. This is clearly invalid, and of course wouldn't be possible without tproxy.
Yes, you're right. It's possible to hit this case with the test programs I sent if you wait long enough, but I'm not too worried about it just now, as it doesn't appear to result in any additional non-NATted traffic.
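For completeness, here's roughly what the port-zero assignment from the first recommendation would look like. I'm going from the tproxy 2.x patch headers from memory, so treat this as a sketch: the header path and the names (struct in_tproxy, TPROXY_ASSIGN, TPROXY_FLAGS, ITP_CONNECT) may differ between tproxy versions.

    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/netfilter_ipv4/ip_tproxy.h>  /* from the tproxy patch */

    /* Assign fake source address faddr with foreign port 0 to socket fd,
     * and ask for the mapping to be applied at connect() time; with port
     * zero, conntrack picks a free port, avoiding collisions entirely. */
    static int tproxy_assign_auto_port(int fd, const char *faddr)
    {
        struct in_tproxy itp;

        itp.op = TPROXY_ASSIGN;
        inet_pton(AF_INET, faddr, &itp.v.addr.faddr);
        itp.v.addr.fport = 0;   /* zero = kernel chooses the port */
        if (setsockopt(fd, SOL_IP, IP_TPROXY, &itp, sizeof(itp)) < 0)
            return -1;

        itp.op = TPROXY_FLAGS;
        itp.v.flags = ITP_CONNECT;
        return setsockopt(fd, SOL_IP, IP_TPROXY, &itp, sizeof(itp));
    }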
One other curious thing here: the MUST_BE_READ_LOCKED(&ip_tproxy_lock) assertion in ip_tproxy_relatedct_add() fails. Could this be related in any way?
Not really; that call is completely bogus, IMHO. We probably don't need that check there, so I'll remove it.
OK. Food for thought :o). I'll get back to you with some tcpdumps, etc.

Cheers,
Jim