# Benchmarking TCP Proxies written in different languages: C, C++, Rust, Golang, Java, Python
TL;DR: you can jump right to Benchmarks and look into the methodology later.
There are three types of load to compare different aspects of TCP proxies:

- `moderate load` - 25k RPS (requests per second), with connections re-used for 50 requests.
  - In this mode, we benchmark handling traffic over persistent connections.
  - A moderate request rate is chosen to benchmark proxies under normal conditions.
- `max load` - sending as many requests as the server can handle.
  - The intent is to test the proxies under stress conditions.
  - Also, we find the max throughput of the service (the saturation point).
- `no-keepalive` - using each connection for a single request.
  - So we can compare the performance characteristics of establishing new connections.
  - Establishing a connection is an expensive operation: it involves resource allocation and dispatching tasks between worker threads, as well as clean-up operations once a connection is closed.
To compare different solutions, we use the following set of metrics:

- Latency (in microseconds, or `µs`):
  - `p50` (median) - a value that is greater than 50% of observed latency samples.
  - `p90` - 90th percentile, or a value that is better than 9 out of 10 latency samples. Usually a good proxy for latency as perceived by humans.
  - Tail latency: `p99` - 99th percentile, the threshold for the worst 1% of samples.
  - Outlier latencies: `p99.9` and `p99.99` - may be important for systems with multiple network hops or large fan-outs (e.g., a request gathering data from tens or hundreds of microservices).
  - `max` - the worst case.
  - `tm99.9` - trimmed mean, or the mean of all samples excluding the best and worst 0.1%. It is more useful than the traditional mean, as it removes the potentially disproportionate influence of outliers: https://en.wikipedia.org/wiki/Truncated_mean
  - `stddev` - the standard deviation of latency. The lower, the better: https://en.wikipedia.org/wiki/Standard_deviation
- Throughput: `rps` (requests per second)
- CPU utilization
- Memory utilization
We primarily focus on latency, while keeping an eye on its cost in terms of CPU and memory. For the max load, we also assess the maximum possible throughput of the system.
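As a rough illustration of how these percentiles relate to raw samples, here is a nearest-rank sketch (the file name `latencies.txt` and the one-sample-per-line format are assumptions; real tooling interpolates between ranks):

```shell
# Nearest-rank p50/p90/p99 from a file of latency samples (one value per line, in µs).
sort -n latencies.txt | awk '
  { a[NR] = $1 }                  # collect samples in ascending order
  END {
    print "p50:", a[int(NR * 0.50)]
    print "p90:", a[int(NR * 0.90)]
    print "p99:", a[int(NR * 0.99)]
  }'
```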
Why do we need to observe the trimmed mean if we already have the median (i.e., `p50`)?

`p50` (or percentiles in general) may not necessarily capture performance regressions. For instance:

- `1,2,3,4,5,6,7,8,9,10` - `p50` is `5`, the trimmed mean is `5.5`
- `5,5,5,5,5,6,7,8,9,10` - `p50` is still `5`, however the trimmed mean is `6.25`
The same applies to any other percentile. If a team only uses `p90` or `p99` to monitor their system's performance, they may miss dramatic regressions without being aware of it. Of course, we could use multiple fences (`p10`, `p25`, etc.) - but why, if a single metric will do? In contrast, the traditional mean is susceptible to noise and outliers, and not as good at capturing the central tendency.
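The second sequence above can be verified with a quick pipeline (for a ten-sample illustration we trim just the single smallest and largest values; a real `tm99.9` trims the best and worst 0.1% of samples):

```shell
# Mean of the samples after dropping the smallest and largest values.
printf '%s\n' 5 5 5 5 5 6 7 8 9 10 \
  | sort -n \
  | sed '1d;$d' \
  | awk '{ sum += $1; n++ } END { print sum / n }'
# → 6.25
```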
These benchmarks compare TCP proxies written in different languages, all of which use non-blocking I/O. Why TCP proxies? A TCP proxy is the simplest application dealing with network I/O: all it does is establish connections and forward traffic. Why non-blocking I/O? You can read this post, which tries to demonstrate why non-blocking I/O is a much better option for network applications.
Let's say you're building a network service. These TCP proxy benchmarks are the lower boundary for the request latency it may have; everything else (e.g., parsing, validating, packing, traversing, and constructing data) is added on top of that.
So the following solutions are being compared:

- Baseline (`perf-gauge <-> nginx`) - direct communication with Nginx to establish the baseline: https://nginx.org/en/
- HAProxy (`perf-gauge <-> HAProxy <-> nginx`) - HAProxy in TCP-proxy mode, to compare with a mature solution written in `C`: http://www.haproxy.org/
- `draft-http-tunnel` - a simple C++ solution with very basic functionality (asio), running in TCP mode: https://github.com/cmello/draft-http-tunnel/
- `http-tunnel` - a simple HTTP tunnel written in Rust (tokio), running in TCP mode: https://github.com/xnuter/http-tunnel/
- `tcp-proxy` - a Golang solution: https://github.com/jpillora/go-tcp-proxy
- `NetCrusher` - a Java solution (Java NIO): https://github.com/NetCrusherOrg/NetCrusher-java/
- `pproxy` - a Python solution based on `asyncio`, running in TCP Proxy mode: https://pypi.org/project/pproxy/
Thanks to Cesar Mello, who coded the TCP proxy in C++ to make this benchmark possible.
Benchmarking network services is tricky, especially if we need to measure differences at microsecond granularity. To rule out network delays and noise, we can employ one of the following options:
- use co-located servers, e.g., VMs on the same physical machine or in the same rack.
- use a single VM, but assign CPU cores to different components to avoid overlap.
Neither is ideal, but the latter seems easier. We need to make sure that the instance type is CPU-optimized and won't suffer from noisy-neighbor issues. In other words, it must have exclusive access to all cores, as we're going to drive CPU utilization close to 100%.
E.g., if we use an 8-core machine, we can use the following assignment scheme:

- Cores 0-1: Nginx (serves `10kb` of payload per request)
- Cores 2-3: TCP proxy
- Cores 4-7: `perf-gauge` - the load generator
This can be achieved by using cpusets:

```
apt-get install cgroup-tools
```
Then we can create non-overlapping CPU sets and run the components without competing for CPU, while ruling out any network noise.
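Under the assignment scheme above, the cpusets could be created roughly like this (a sketch using the cgroup v1 `cgcreate`/`cgset`/`cgexec` tools from `cgroup-tools`; the group names are arbitrary, the machine is assumed to have a single NUMA node, and the launched command is a placeholder):

```shell
# Create one cpuset group per component (requires root).
cgcreate -g cpuset:nginx
cgcreate -g cpuset:proxy
cgcreate -g cpuset:perf-gauge

# Pin each group to its cores; cpuset.mems must also be set (NUMA node 0 here).
cgset -r cpuset.cpus=0-1 -r cpuset.mems=0 nginx
cgset -r cpuset.cpus=2-3 -r cpuset.mems=0 proxy
cgset -r cpuset.cpus=4-7 -r cpuset.mems=0 perf-gauge

# Launch each component inside its cpuset (the nginx command is a placeholder).
cgexec -g cpuset:nginx nginx -c /etc/nginx/nginx.conf
```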
`perf-gauge` can emit metrics to Prometheus. To launch a stack, you can use https://github.com/xnuter/prom-stack. I just forked `prom-stack` and removed everything but `prometheus`, `push-gateway`, and `grafana`. You can clone the stack and launch `make`. Then set the variable with the host, for instance:

```
export PROMETHEUS_HOST=10.138.0.2
```
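Putting those steps together (the clone destination and the Prometheus host address are just examples for this setup):

```shell
# Launch the trimmed-down Prometheus/push-gateway/Grafana stack.
git clone https://github.com/xnuter/prom-stack.git
cd prom-stack
make

# Point the benchmark tooling at the Prometheus host (example address).
export PROMETHEUS_HOST=10.138.0.2
```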
Please note that we disable logging for all configurations to minimize the number of variables and the level of noise.
- Perf-gauge
- Nginx
- TCP Proxies
Okay, we finally got to the benchmark results. All benchmark results are split into two batches:
- Baseline, C, C++, Rust - comparing high-performance solutions
- Rust, Golang, Java, Python - comparing memory-safe languages
Yep, Rust belongs to both worlds.