[tproxy] tproxy race condition? [RESEND]

Thu, 16 Dec 2004 15:18:54 +0100

  Hi Jim,

On Wed, 2004-12-15 at 12:09, jim@minter.demon.co.uk wrote:
> I'm trying to use tproxy to implement a fully transparent layer 7 proxy as
> follows: TCP connections arrive and are REDIRECTed to a single local port.
> A userspace process listen()s on that port, and makes ongoing (transparent)
> connections on new TCP sockets by calling bind(), tproxy setsockopt()s and
> connect().  In general it works well, but I'm having a few issues which I
> think are possibly SMP-related.  I believe I've reduced some of these to a
> simple test case, sources for which are attached.  I'm using Linux kernel
> 2.4.27 and all four patches in cttproxy-2.4.27-2.0.0 patch.  To run the
> test case, you need two machines; I think the 'client' must be SMP.

  OK, so everything in my reply is pure theory, I did not test the
samples (yet).

> The 'server', 10.0.3.2, listens on a single TCP port and has a simple loop
> which accept()s and close()s TCP connections that it receives.
> 
> The SMP 'client', 10.0.3.3, has two processes each connecting to the
> server.  The clients loop through a port range 32768-49152.  They bind() on
> 10.0.3.3, receiving some port from the kernel.  They then assign a
> transparent port in the loop port range on unregistered IP 10.0.3.253, and
> connect() to the server.  (The server has a route set up so that it knows
> to return traffic on 10.0.3.253 to the client box).
> 
> The problem: once in a while, one of the client processes takes 3s to
> connect() to the server.  Then, the resulting TCP connection is NOT
> TRANSPARENT (i.e. 10.0.3.3 is used, not 10.0.3.253).  This can be seen by
> running "tcpdump host 10.0.3.3" on either box.  However, none of the client
> process system calls fail at any point.

  Unfortunately, error handling is the most problematic part of tproxy.
The difficulty is that when the setsockopt() calls return, we have no
way of knowing whether the not-yet-established connection will clash
with another connection in the conntrack hash. This is because the
connection won't be created until the first packet leaves the machine,
which is shortly after you call connect(). If the tproxy Netfilter hook
detects that it cannot apply a NAT mapping, it just drops the packet
(and probably the conntrack entry as well), since it has no way of
notifying the user-space process.

  If you start two client processes, you'll have a good chance of trying
to assign "colliding" foreign addresses. If you set REUSEADDR, tproxy
will allow you to assign the same foreign address more than once, since
you've explicitly requested to do so by setting REUSEADDR (let's assume
you've chosen port x). However, as soon as you try to use them, you'll
experience problems, since the reply tuples of the connections would be
the same. Of course connection tracking won't allow this, so trying to
apply the NAT mapping will fail for one of the client processes. (I
don't know yet why the packets leave the machine with an unmodified
source IP, in theory they should be dropped, or at least NAT-ted to the
wrong source port number...)
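
  To make the collision concrete, here is a toy model of it. It is not
the conntrack code, only a sketch under the assumption that the reply
tuple is (server address, server port, foreign address, foreign port)
and that it must be unique. The class and method names and the server
port 8080 are made up for illustration.

```python
from typing import NamedTuple, Set


class ReplyTuple(NamedTuple):
    """What the server sends back: server endpoint -> foreign endpoint."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class ToyConntrack:
    """Hypothetical stand-in for the conntrack hash (not kernel code)."""

    def __init__(self) -> None:
        self._reply_tuples: Set[ReplyTuple] = set()

    def confirm(self, foreign_ip: str, foreign_port: int,
                server_ip: str, server_port: int) -> bool:
        """Return True if the NAT mapping could be applied, False on a clash."""
        reply = ReplyTuple(server_ip, server_port, foreign_ip, foreign_port)
        if reply in self._reply_tuples:
            # A second connection with the same reply tuple is refused.
            return False
        self._reply_tuples.add(reply)
        return True


ct = ToyConntrack()
first = ct.confirm("10.0.3.253", 40000, "10.0.3.2", 8080)   # process A, port x
second = ct.confirm("10.0.3.253", 40000, "10.0.3.2", 8080)  # process B, same x
other = ct.confirm("10.0.3.253", 40001, "10.0.3.2", 8080)   # different port
```

  In this model the first call succeeds, the second one clashes, and
the third one succeeds because its foreign port differs.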

> In the case that CONFIG_IP_NF_NAT_NRES is set, at the same time this
> happens, the _other process_ has a -EINVAL failure in
> ip_tproxy_setsockopt_flags(), with corresponding "failed to register NAT
> reservation" error in dmesg.  When CONFIG_IP_NF_NAT_NRES is unset, this
> failure doesn't happen.  But either way, on the _original process_, the
> non-transparent TCP connection happens.

  NAT reservations make it possible for tproxy to fail early. If NAT
reservations are enabled, tproxy registers "reservations" for foreign
addresses to be used later. If such a registration fails, the foreign
address is already reserved for some other connection. That is why the
setsockopt() call itself fails in that case. This is good, since it
gives you a way of detecting the error.
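
  A rough sketch of the "fail early" idea, again as a toy model rather
than the real reservation code. It assumes a reservation is keyed on
the foreign (address, port) pair only, which may not match the kernel
implementation. ToyReservations and reserve_first_free() are
hypothetical names; the only thing taken from the report is that the
failure surfaces as EINVAL.

```python
import errno
from typing import Iterable, Optional, Set, Tuple


class ToyReservations:
    """Hypothetical reservation table: one owner per foreign (ip, port)."""

    def __init__(self) -> None:
        self._reserved: Set[Tuple[str, int]] = set()

    def reserve(self, ip: str, port: int) -> None:
        key = (ip, port)
        if key in self._reserved:
            # Mirrors the early failure the caller sees from setsockopt().
            raise OSError(errno.EINVAL, "failed to register NAT reservation")
        self._reserved.add(key)

    def release(self, ip: str, port: int) -> None:
        self._reserved.discard((ip, port))


def reserve_first_free(table: ToyReservations, ip: str,
                       candidates: Iterable[int]) -> Optional[int]:
    """Try each candidate port; return the first one reserved, else None."""
    for port in candidates:
        try:
            table.reserve(ip, port)
            return port
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise  # anything else is a real error, not a clash
    return None


table = ToyReservations()
table.reserve("10.0.3.253", 40000)  # taken by the other process
chosen = reserve_first_free(table, "10.0.3.253", [40000, 40001])
```

  The point is only that a caller who gets the error at setsockopt()
time can react to it (here by moving on to the next port), which is not
possible when the failure happens later in the Netfilter hook.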

> What I believe is happening is as follows: There is evidence in dmesg that
> the first SYN packet of the connect() passes through the LOCAL_OUT iptables
> hooks (I see "ip_tproxy_fn(): new connection, hook=3" and "ip_tproxy_fn():
> new connection, hook=4"), but for some reason the packet never actually
> makes it onto the wire.

  Do you see any errors in the kernel logs when this happens? Tproxy
could drop the packet, but you should get an error message in that
case.

>  I can't see where it goes missing.  But anyway,
> connect() waits 3s and resends the SYN.  This time, as the second packet
> goes through the iptables, for some reason it's not translated.  It makes
> it onto the wire and the rest of the connection proceeds untranslated.

  _This_ is strange... Could you send me a tcpdump capture of that
traffic and the matching tproxy debug output?

> I haven't been able to progress much further debugging this, and wondered
> if you had any ideas?  My principal concern is that the userspace processes
> don't receive an error and have no proper way of telling that the
> connection is going untransparent.  Am I making a stupid mistake somewhere?

  I have a few recommendations:

      * Try to avoid explicitly specifying the foreign (fake) port
        number at all costs. If you assign a foreign port of zero,
        connection tracking will select a free port number when applying
        the NAT mapping. This way you won't have such weird problems.
      * Each and every connection _must_ have unique endpoints. When you
        run two instances of your client, you'll run into a theoretical
        problem as well: sometimes you try to establish two TCP
        connections with exactly the same endpoints. This is clearly
        invalid, and wouldn't be possible without using tproxy, of
        course.
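
  If the test really has to pick explicit foreign ports, one possible
way to keep the endpoints unique is to give each client process its own
slice of the port range. This is only a sketch: ports_for_worker() is a
hypothetical helper, and treating 32768-49152 as an inclusive range is
an assumption. Assigning port zero remains the simpler option.

```python
def ports_for_worker(worker: int, n_workers: int,
                     lo: int = 32768, hi: int = 49152) -> range:
    """Ports lo+worker, lo+worker+n_workers, ... up to and including hi.

    Workers 0..n_workers-1 get disjoint slices, so two processes never
    ask for the same foreign port.
    """
    if not 0 <= worker < n_workers:
        raise ValueError("worker must be in 0..n_workers-1")
    return range(lo + worker, hi + 1, n_workers)


# Two client processes, as in the test case.
ports_a = ports_for_worker(0, 2)
ports_b = ports_for_worker(1, 2)
overlap = set(ports_a) & set(ports_b)
```

  With two workers, one process gets the even ports and the other the
odd ones, so overlap is empty and every port in the range is still
used by exactly one process.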

> One other curious thing here: MUST_BE_READ_LOCKED(&ip_tproxy_lock) in
> ip_tproxy_relatedct_add() fails.  Could this be related in any way?

  Not really; that call is completely bogus IMHO. We probably don't
need that check there, so I'll remove it.

> Finally, what is the purpose of the new CONFIG_IP_NF_NAT_NRES option?

  See above. :)

-- 
 Regards,
   Krisztian KOVACS