[zorp] Zorp 2.1.5.5 can't handle load

Balazs Scheidler zorp@lists.balabit.hu
Fri, 21 May 2004 10:58:08 +0200


On Thu, 2004-05-20 at 22:51, Sheldon Hearn wrote:
> Hi folks,
> 
> I'm worried.  I'm in a situation where I've put a proxy cluster into
> production without adequately testing the Zorp component under load.
> 
> I spent a lot of time testing load balancing, but didn't check that Zorp
> could cope with a large number of concurrent connections.
> 
> We're running zorp-2.1.5.5 on a Linux 2.4.25 (Gentoo) kernel with
> glibc-2.3.2 (Gentoo r9).
> 
> The http proxy dies with sig11 (all registers printed zero in the stack
> dump sent to syslog) when it reaches some small number of concurrent
> threads over 130.
> 
> So we tried using just the TCP plug proxy, even for HTTP connections,
> but can't get a single instance using more than 1020 threads.
> 
> We have 4 zorp boxes handling a 100Mbps uplink, load-balanced with LVS. 
> LVS ipvsadm also shows that the Zorp boxes aren't handling more than
> about 1000 concurrent connections.
> 
> The visible symptom of all this is that some connection attempts aren't
> even accepted, while others are accepted but not serviced.
> 
> I've done a lot of Googling, and all the stuff on how to increase the
> number of threads allowed per process doesn't seem to apply;
> PTHREAD_THREADS_MAX is already large in the glibc sources, and NR_TASKS
> doesn't exist in the kernel source.
> 
> I've bumped up ulimits for file descriptors and processes per user, but
> these don't help.
> 
> Help.  I realise I went into production prematurely, but now that I'm
> here, it's a horrible place and I'm worried that I overestimated Zorp's
> ability to cope with load.  Am I expecting too much from Zorp, or is
> this just something that more experienced Linux folks would know about?
> 
> Any ideas on how to get Zorp to handle the kind of concurrency other
> people on the list must be getting[1] would be greatly appreciated.  
> 
> Either I need to get Zorp to service a larger number of concurrent
> requests, or I need to know why it's not coping when it reaches the
> limit on concurrent requests.  I tried lowering --threads to 200, but my
> connection attempts still either aren't accepted or time out waiting for
> a response.

The solution is to split your single Zorp instance into smaller instances
working on the same set of connections. This can be achieved by running,
for example, 16 HTTP instances listening on different ports (for example
50080-50095), and then using 16 packet filter rules to distribute the load
between the processes, based on source port for example.

Here is how this can be achieved:

def define_services():
	# the service definition is shared by all instances
	Service("http", HttpProxy, ...)

# each instanceN below runs as a separate Zorp process,
# listening on its own port
def instance1():
	define_services()
	Listener(SockAddrInet('1.2.3.4', 50080), 'http')

def instance2():
	define_services()
	Listener(SockAddrInet('1.2.3.4', 50081), 'http')

def instance3():
	define_services()
	Listener(SockAddrInet('1.2.3.4', 50082), 'http')

etc.
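
Each instanceN above runs as a separate Zorp process. Assuming the usual
zorpctl layout (this is only a sketch, the exact options depend on your
setup), the matching /etc/zorp/instances.conf entries might look roughly
like this:

# one line per instance; zorpctl starts a separate zorp process for each
instance1 --policy /etc/zorp/policy.py --threads 500
instance2 --policy /etc/zorp/policy.py --threads 500
instance3 --policy /etc/zorp/policy.py --threads 500
...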

You can use the stock --sport match with port ranges to distribute the
load, but it's better to use the u32 match, where you can do things like:
source port modulo 16 decides which listener to redirect to.

iptables -t tproxy -A PREROUTING -p tcp -m u32 --u32 '0>>22&0x3C@0>>16&0xF=0' -j TPROXY --on-port 50080
iptables -t tproxy -A PREROUTING -p tcp -m u32 --u32 '0>>22&0x3C@0>>16&0xF=1' -j TPROXY --on-port 50081
iptables -t tproxy -A PREROUTING -p tcp -m u32 --u32 '0>>22&0x3C@0>>16&0xF=2' -j TPROXY --on-port 50082
iptables -t tproxy -A PREROUTING -p tcp -m u32 --u32 '0>>22&0x3C@0>>16&0xF=3' -j TPROXY --on-port 50083
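
and so on, up to a source-port remainder of 15. Rather than typing all 16
rules by hand, a small shell loop (just a sketch, assuming bash and the
port layout above) can generate them:

# add one TPROXY rule per source-port remainder, ports 50080-50095
for i in $(seq 0 15); do
    iptables -t tproxy -A PREROUTING -p tcp \
        -m u32 --u32 "0>>22&0x3C@0>>16&0xF=$i" \
        -j TPROXY --on-port $((50080 + i))
done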

Creating 16 processes will probably suffice.
How many new connections per second do you have?

We have somewhere between 500-600 new connections/sec distributed across 4
computers running 16 processes each, and latency is OK. By the way: which
tproxy version are you using?

Do you have more system or userspace CPU time? (vmstat will tell you that)
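
For example (vmstat 1 prints a line every second; under the "cpu" heading,
"us" is userspace time and "sy" is kernel/system time):

# watch the us and sy columns while the boxes are under load
vmstat 1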

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1