No one ever gets in trouble for posting micro-benchmarks and making broad assumptions about the cause of observed results! This post will focus on a couple of such benchmarks pertaining to blocking operations on otherwise asynchronous runtimes. Along the way I’ll give only sparse background on the projects I’ve been working on, but plenty of links if you are interested in reading further. This blog post is something of a followup to an URLO post, Futures 0.3, async/await experience snapshot, and I’ll cross-post this one to URLO as well.
I don’t care much for laptop benchmark results for software not intended to run on laptops. So before I forget, all tests below were conducted on:
ec2 m5dn.large instances
2x CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
ubuntu 18.04 amd64 hvm-ssd 20191113
kernel 4.15.0-1054-aws
rustc 1.42.0-nightly (da3629b05 2019-12-29)
tokio 0.2.8
hyper 0.13.1
blocking-permit@923bddda
body-image-futio@3bc3760e
I started the blocking-permit crate as a prototype while tokio’s much-anticipated 0.2 release didn’t yet have any comparable facility (and I wasn’t sure how to constructively contribute). I continued and released the crate for a few features I want, and which the latest tokio version doesn’t yet have. That gap could certainly be closed, but perhaps others will find these same features useful: a DispatchPool and a BlockingPermit that are portable across the tokio and async-std runtimes.

I provide some comparative micro-benchmarks via cargo bench with this crate, below. In these benchmarks I compare different strategies for running blocking operations (see the conceptual sketch after this list). These include:
- permit: a BlockingPermit is requested via blocking_permit_future (this crate) and a fixed size Semaphore, to limit concurrency, before running the operation on a reactor thread, in the tokio case via tokio::task::block_in_place, or otherwise by running it directly.
- dispatch_rx: the operation is dispatched to a DispatchPool via dispatch_rx (this crate) and the result is awaited.
- spawn_blocking: tokio::task::spawn_blocking dispatches the operation to a similarly dedicated set of tokio-managed threads.
- direct: as a baseline, the operation is simply run on a reactor thread with no coordination.

On a second axis is the type of the blocking workload under test: a noop, a CPU-expensive computation, and a thread sleep, as reflected in the benchmark names below.
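As a conceptual sketch of the dispatching strategies, the essence is to run the blocking closure on another thread and await its result on the async side. This is not the blocking-permit or tokio implementation, just the general shape, assuming current tokio for the oneshot channel; a real dispatch pool reuses a fixed set of worker threads rather than spawning one per operation:

use tokio::sync::oneshot;

// Conceptual sketch only: hand a blocking closure to another thread and
// await the result via a oneshot channel.
async fn dispatch_blocking<T, F>(op: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    std::thread::spawn(move || {
        // A real pool would use an existing worker thread here.
        let _ = tx.send(op());
    });
    rx.await.expect("dispatch thread dropped the result")
}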
The total thread count is kept consistent across all strategies. The
dispatching strategies (dispatch_rx and spawn_blocking) employ 4 core
threads and 4 “extra” dispatch-dedicated threads. The direct and permit
strategies employ 8 core threads, and 4 Semaphore
permits if applicable. For
the sleep operations, this is extended to 40 “extra” threads or permits.
Note these tests were run on a virtual host with two virtual CPUs.
To simulate concurrency, batches of 100 operations (200 for sleep) are spawned and then awaited concurrently via futures::stream::FuturesUnordered, for each iteration of the benchmark. Some overhead could likely be attributed to the latter’s bookkeeping.
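For reference, a minimal sketch of that batching pattern, assuming tokio for spawning; the closure body and names here are illustrative, not the benchmark code itself:

use futures::stream::{FuturesUnordered, StreamExt};

// Spawn a batch of operations, then await them all concurrently by
// draining the FuturesUnordered stream.
async fn run_batch(batch_size: usize) {
    let mut batch = FuturesUnordered::new();
    for _ in 0..batch_size {
        // Each spawned task yields a JoinHandle, itself a future.
        batch.push(tokio::spawn(async {
            // ...the blocking operation under test, via some strategy...
        }));
    }
    while let Some(res) = batch.next().await {
        res.expect("task panicked");
    }
}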
cargo bench --features=tokio-omnibus
test noop_threaded_direct ... bench: 52,220 ns/iter (+/- 10,434)
test noop_threaded_dispatch_rx ... bench: 159,234 ns/iter (+/- 57,529)
test noop_threaded_permit ... bench: 72,940 ns/iter (+/- 3,534)
test noop_threaded_spawn_blocking ... bench: 147,804 ns/iter (+/- 8,325)
test r_expensive_threaded_direct ... bench: 1,253,295 ns/iter (+/- 189,310)
test r_expensive_threaded_dispatch_rx ... bench: 1,544,520 ns/iter (+/- 164,754)
test r_expensive_threaded_permit ... bench: 1,272,118 ns/iter (+/- 89,466)
test r_expensive_threaded_spawn_blocking ... bench: 1,574,170 ns/iter (+/- 162,377)
test sleep_threaded_direct ... bench: 649,913 ns/iter (+/- 8,448)
test sleep_threaded_dispatch_rx ... bench: 992,531 ns/iter (+/- 61,243)
test sleep_threaded_permit ... bench: 800,128 ns/iter (+/- 41,226)
test sleep_threaded_spawn_blocking ... bench: 1,624,739 ns/iter (+/- 377,409)
Comparing our dispatch_rx with spawn_blocking, the noop and expensive results are fairly close, but sleep (at 40 threads) is considerably slower with tokio’s spawn_blocking. As the higher thread counts are likely representative of real-world use with blocking APIs, I find this noteworthy.
The DispatchPool of the previous blocking-permit 0.1.0 release used crossbeam’s MPMC channel, which was much slower than both the above (1.0.0) results and tokio’s spawn_blocking pool, particularly at higher thread counts, as with sleep.
Comparing permit to direct strategies gives some sense of Semaphore overhead. Comparing permit to either dispatching strategy is likely to be of limited value due to limitations of these benchmarks; most importantly, there is no actual I/O for which reactor threads need to be kept available.
Note that tokio::task::block_in_place not only informs tokio of an imminent blocking operation but actively starts the process of enlisting a new reactor thread. Without some fixed number of Semaphore permits, the number of threads is effectively unbounded. While tokio 0.1 had a configurable cap on the total number of threads, this was found to be a liability as a dead- or live-lock potential, and was removed in tokio 0.2. If you want your process to have a known maximum (and small) number of threads, like I do, then you need to either not use block_in_place or constrain the blocking concurrency by some other means. If the sum of such semaphore permits is less than the configured total number of reactor (core) threads, then we are guaranteed to always have a reactor thread, and disused blocking threads are kept alive and rotated in as subsequent reactors. This turns out to be reasonably (but not perfectly) efficient.
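To make that concrete, here is a sketch of bounding blocking concurrency with a semaphore, written against current tokio rather than blocking-permit’s actual BlockingPermit types (the 0.2 Semaphore API differed slightly); it assumes a multi-threaded runtime:

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task;

// A fixed-size Semaphore bounds how many reactor threads may block at
// once, so the number of replacement threads stays bounded as well.
async fn run_bounded<F, T>(permits: Arc<Semaphore>, op: F) -> T
where
    F: FnOnce() -> T,
{
    // Wait for one of the fixed number of permits.
    let _permit = permits.acquire().await.expect("semaphore closed");
    // Inform tokio this thread will block; another thread takes over
    // reactor duties while `op` runs here.
    task::block_in_place(op)
}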
I’m surprised that something like the permit strategy wasn’t mentioned in Stop worrying about blocking: the new async-std runtime, inspired by Go (async-std blog, by Stjepan Glavina), or in async-std#631, as it would seem to be directly applicable as a strategy there as well. (Much more on “worrying about blocking” in the conclusion.)
The above permit benchmarks use tokio’s async Semaphore, which was originally available in the 0.2 alphas, made private in the 0.2.0 release, and finally re-exported with interface changes in the 0.2.5 PATCH release. Our BlockingPermitFuture wraps the underlying permit future to provide a consistent, extended interface. In tokio’s case, that future is an unnamed type returned from an async fn, so I need to move it to the heap as a Pin<Box<dyn Future<_>>>.
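That boxing workaround looks roughly like the following sketch; the alias and function names here are hypothetical, not the crate’s actual types:

use std::future::Future;
use std::pin::Pin;

// An `async fn` returns an unnameable future type; Box::pin moves it to
// the heap and pins it so it can be named and stored as a trait object.
type BoxedPermitFuture<T> = Pin<Box<dyn Future<Output = T> + Send>>;

fn erase<F, T>(fut: F) -> BoxedPermitFuture<T>
where
    F: Future<Output = T> + Send + 'static,
{
    Box::pin(fut)
}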
While the fate of tokio’s Semaphore was uncertain, I also integrated the futures-intrusive Semaphore, and this is retained for compatibility with other runtimes. A different complication with this Semaphore is that its future type isn’t Sync, but in my usage with hyper, Sync becomes an unexpected requirement. The permit strategy benchmarks are re-run with this Semaphore type, below:
cargo bench --features=futures-intrusive,tokio-threaded permit
test noop_threaded_permit ... bench: 74,282 ns/iter (+/- 3,821)
test r_expensive_threaded_permit ... bench: 1,304,930 ns/iter (+/- 80,482)
test sleep_threaded_permit ... bench: 6,811,144 ns/iter (+/- 1,446,636)
Note that the sleep workload deteriorates significantly with the futures-intrusive Semaphore. All permit-using benchmarks are currently slower with this Semaphore than with tokio’s. Removing the Sync wrapper in these benchmarks does not measurably improve them.
Since the prior URLO post, the massive async/await upgrade has been completed, including a reasonably pleasant experience of re-writing all of its tests and benchmarks in generic, async/await style (here with a total of 34 uses of await and test LoC changes: 1636 insertions, 622 deletions). I’ve released (besides blocking-permit) a new set of body-image-* 2.0.0 crates using the latest tokio, hyper, and futures.
First, it’s worth summarizing that the original stream-to-sink forwarding benchmarks, available in the body-image-futio 1.3.0 release (using tokio 0.1), have all improved at least somewhat in the 2.0.0 release (using tokio 0.2). But I want to focus on a new set of benchmarks, added in 2.0.0, which better simulate my use case for the project.
A new set of client benchmarks compares the performance of making HTTP requests to a server which returns 8MiB chunked responses, and caching the body in the various transparent ways that the body-image crate allows: scattered allocations in ram, or written to the file system as received, and optionally memory mapped. The client then summarizes a stream of the response body (either in memory or read back from the filesystem) by finding the highest byte value (always 0xFF, as proof of reading) and verifying the total length.
As in the prior blocking-permit benchmarks, concurrency is simulated by batching, in this case 16 request-response-summarize operations per benchmark iteration. The total thread count is kept to 4 threads (2 core, 2 “extra” dispatch-dedicated threads) and 2 (tokio) Semaphore permits. The tests were run on the same ec2 host configuration described above.
In the results below, a separate server is run via cargo run --release --example server on a separate virtual host of identical type, provided to the benchmarks via a BENCH_SERVER_URL environment variable. If a benchmark is run without this variable, a server is run in process but still communicates with the clients via the localhost TCP stack.
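A sketch of that endpoint selection (not the benchmark’s actual code; the function name is hypothetical, and starting the in-process server is elided):

use std::env;

// If BENCH_SERVER_URL is unset, the caller would instead start an
// in-process server on localhost and use its address.
fn bench_server_url() -> Option<String> {
    env::var("BENCH_SERVER_URL").ok()
}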
Terms used in the benchmark names:

- ram: Bytes chunks are left scattered and pushed to a Vec<Bytes> without copying. Once the body is complete, it is summarized by reading each chunk in turn, without copying.
- gather: the scattered chunks are copied into a single contiguous Bytes buffer in one allocation. Note this resembles many hyper body handling examples and may be an unfortunate requirement of parsing APIs, etc.
- mmap: the body, written to the filesystem, is memory mapped and advised with MADV_SEQUENTIAL to suggest aggressive read-ahead, before it is read as a single contiguous UniBodyBuf item, with no additional copying (a minimal sketch of this appears a bit further below).
- copy: the memory mapped region is copied into a Bytes item before processing it. The extra copy isn’t recommended but could possibly justify the mmap if there are multiple read passes and one requires Bytes.
- direct: the write and read operations are simply run on a reactor thread (blocking it).
- permit: a BlockingPermit is obtained via blocking_permit_future before running the write and read operations on a reactor thread, in this tokio case via tokio::task::block_in_place.
- dispatch: the write and read operations are dispatched to a DispatchPool via dispatch_rx.

% BENCH_SERVER_URL=http://10.0.0.115:9087 cargo bench client
test client_01_ram ... bench: 58,528,073 ns/iter (+/- 3,776,825)
test client_01_ram_gather ... bench: 81,692,816 ns/iter (+/- 3,984,386)
test client_10_fs_direct ... bench: 119,846,890 ns/iter (+/- 1,999,313)
test client_10_fs_permit ... bench: 141,334,426 ns/iter (+/- 3,529,708)
test client_12_fs_dispatch ... bench: 151,809,679 ns/iter (+/- 6,100,962)
test client_15_mmap_direct_copy ... bench: 137,397,757 ns/iter (+/- 2,643,698)
test client_15_mmap_permit_copy ... bench: 157,148,683 ns/iter (+/- 6,222,127)
test client_16_mmap_direct ... bench: 110,491,801 ns/iter (+/- 2,796,833)
test client_17_mmap_permit ... bench: 124,647,965 ns/iter (+/- 4,390,988)
The aggregate mean transfer and processing rate, in terms of the original body size, for the fastest, ram case is 2.14 GiB/sec, or 17.1 Gib/sec. Amazon AWS suggests that one can get “up to 25 Gbps” of (raw TCP) bandwidth for these instances, independent of EBS traffic, made possible because client and server instances are in the same VPC and availability zone. Since the client instance’s 2 CPUs are approximately saturated in these tests (mostly system time, actually), I would expect results to be slower if TLS were also involved (it’s not).
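For reference, that rate falls out of the numbers above as follows (16 responses of 8 MiB per benchmark iteration):

16 × 8 MiB = 128 MiB per iteration
128 MiB / 58.5 ms ≈ 2188 MiB/s ≈ 2.14 GiB/s
2.14 GiB/s × 8 ≈ 17.1 Gib/s (in bits)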
An interesting, if entirely anecdotal, comparison is the maximum of ~70 MB/s reported (reading off the graph’s y-axis) in that same async-std blog article, running on much larger “m5a.16xlarge” and “m5a.8xlarge” instances. Perhaps that benchmark wasn’t as optimized for large body payloads, or has other constraints?
The fs (and mmap) results are based on an ext4 filesystem on EBS “gp2” SSD network-attached storage. The “m5dn.large” ec2 instance type also has access to directly attached NVMe SSD (instance store). However, mounting and using that storage for the body-image temporary files was not found to significantly change the results. At the observed rate, gp2 storage does not appear to be a significant bottleneck for these sequential writes and sequential (and mapped) reads, but see below for additional optimizations.
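Before moving on to tmpfs, here is the minimal sketch of the mmap term promised above, assuming the memmap2 and libc crates; this is an illustration, not body-image’s actual code:

use std::fs::File;
use memmap2::Mmap;

// Map a completed body file and hint sequential read-ahead to the kernel
// via MADV_SEQUENTIAL, before the single contiguous read pass.
fn map_body(file: &File) -> std::io::Result<Mmap> {
    let map = unsafe { Mmap::map(file)? };
    unsafe {
        libc::madvise(
            map.as_ptr() as *mut libc::c_void,
            map.len(),
            libc::MADV_SEQUENTIAL,
        );
    }
    Ok(map)
}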
Yes, tmpfs can behave as a RAM disk, so let me explain the use case lest I’m accused of cheating just for the benchmarks. As async/await can be described as cooperative multi-tasking, this use of tmpfs can be described as cooperative swapping. Linux tmpfs will start using available swap space if the instance experiences memory pressure. By writing body chunks to tmpfs, I’m telling the kernel that these are currently low-priority pages that it can remove from RAM if needed. Once the body is downloaded, and processing begins by reading or memory mapping, I’m informing the kernel that these pages are again high-priority. This cooperative swapping will behave much better than the uncooperative swapping that could occur if I just kept all bodies in RAM and caused memory pressure that way. Then any operation in the entire process, on reactor or other threads, becomes a potential blocking operation, as it may need to wait for executable or heap pages to swap back in! The kernel doesn’t really have much to go on for page prioritization purposes.
Another nice feature of tmpfs on Linux is that it supports Transparent Huge Pages (default 2MiB) without any other required configuration. Huge pages drastically reduce kernel virtual memory bookkeeping overhead for memory mapping, at the cost of over-allocating up to 1 page of memory per mapped file. There may be workarounds for space efficiency as well, but I’m currently not worried much about it.
Below shows the setup for tmpfs with huge pages and then the updated benchmark results:
% sudo mount -t tmpfs -o size=500M,huge=always,uid=1001,gid=1001 tmpfs \
../target/testmp
% BENCH_SERVER_URL=http://10.0.0.115:9087 cargo bench client
test client_01_ram ... bench: 58,392,758 ns/iter (+/- 16,045,924)
test client_01_ram_gather ... bench: 85,527,831 ns/iter (+/- 18,187,177)
test client_10_fs_direct ... bench: 94,056,654 ns/iter (+/- 2,915,506)
test client_10_fs_permit ... bench: 120,463,257 ns/iter (+/- 2,742,582)
test client_12_fs_dispatch ... bench: 130,157,569 ns/iter (+/- 5,993,078)
test client_15_mmap_direct_copy ... bench: 104,351,507 ns/iter (+/- 2,596,867)
test client_15_mmap_permit_copy ... bench: 129,762,411 ns/iter (+/- 6,165,487)
test client_16_mmap_direct ... bench: 79,705,427 ns/iter (+/- 3,013,775)
test client_17_mmap_permit ... bench: 99,289,803 ns/iter (+/- 4,002,980)
As expected, the fs and mmap results are all improved. Here mmap_direct actually outperforms ram_gather, and is only a 36% latency overhead beyond ram (without gather). Now, what if I only need to use fs (and mmap) on the 10% or 1% of responses that exceed some configured size in the Tunables, the remainder staying in ram? That net overhead should be more like 3.6% or 0.36%. With tmpfs, this seems like an effective way to have my cake and eat it too: a very limited resource cost for peace of mind that my process isn’t going to exceed RAM or thrash if it requires any swap space at all, and no need to enforce some low maximum HTTP body size or low concurrency.
Dear Reader,
Having made it this far, did you notice that in all cases tested, the direct
strategy (blocking a reactor thread) performs best?
It’s true. I have almost nothing positive, performance-wise, to show for a significant amount of effort spent implementing the permit and dispatch strategies employed in these tests. One minor win is that from preliminary benchmarking I made the decision (in the 3rd rewrite, sigh) to compose the permit and dispatch strategies as wrappers over the direct AsyncBodyImage (our Stream) and AsyncBodySink types, so the latter may be directly selected at compile time, without further overhead or complication.
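That composition looks roughly like the following sketch; the wrapper name is hypothetical and this is not body-image-futio’s actual implementation, just the shape of wrapping the direct Stream type:

use futures::stream::Stream;
use std::pin::Pin;
use std::task::{Context, Poll};

// A strategy wrapper delegating to the direct (inner) stream type. When
// the direct strategy is selected at compile time, the inner type is
// used unwrapped and none of this exists.
struct PermitWrapper<S> {
    inner: S,
    // ...permit acquisition or dispatch state would live here...
}

impl<S: Stream + Unpin> Stream for PermitWrapper<S> {
    type Item = S::Item;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<S::Item>> {
        // A real wrapper would obtain a permit (or dispatch) before
        // delegating to the inner stream.
        Pin::new(&mut self.inner).poll_next(cx)
    }
}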
One other possible advantage of permit, as hinted in these benchmark results, is that the variance (if not the mean) of request/response handling is reduced. That might be important in some use cases, for example for fairness to multiple users.
How could direct be the best performing? What have I done wrong, if the demigods of Rust async (well, some of them) tell us that blocking (their? our?) reactor threads is a bad idea? I speculate that this is both true and that I’ll be using the direct strategy in production, for all of the following reasons:
Memory cache locality and coherence is rather important in I/O heavy applications, so moving a read or write operation with a pointer to a buffer in some CPU’s cache or NUMA zone to another thread that might get scheduled anywhere is inherently costly. And this is typically exacerbated on virtualized instances. I believe this is one reason why permit tends to outperform the dispatch strategy in the above client benchmarks.
8KiB or 512KiB filesystem sequential writes to SSDs (network attached or local) simply aren’t blocking enough to matter, at least as compared with the overhead of any other strategy.
Tokio’s threaded runtime, with a sufficient number of reactor threads and its work stealing abilities, is quite capable of handling an occasionally blocked reactor thread without replacement or slowdown, at least for this tested workload.
I think there may still be plenty of other use cases for either the BlockingPermit or the DispatchPool, but they will tend to be much more coarse-grained. For example, I envision dispatching a one-step blocking operation to uncompress HTTP body payloads en masse, in the background of a particular server, once bodies are completely downloaded.
These findings similarly make me suspect that efforts to provide a direct drop-in std replacement in the form of AsyncRead and AsyncWrite and associated types and functions are probably attacking the problem at too low a level. At the very least, I don’t expect those traits as currently conceived will be useful for body-image-futio.
Some related questions:
Does hyper-tls or native-tls or tokio-tls offload packet decryption from the reactor thread? If any of these do, I haven’t been able to find where it happens. Please point it out for me.
Does hyper’s HTTP 1 chunked transfer encoding implementation attempt to offload Base64 encode/decode from the reactor thread? Should it?
Does the newer async-compression worry about blocking a reactor thread? Should it?
Or is the general strategy with these just that only a small block, say 8KiB or 64KiB, is decrypted/decoded/decompressed for each Stream::poll_next()? That would sound a lot like the above best-performing direct strategy with blocking filesystem I/O.
Why use the new body-image-* 2.0.0 crates?
If you can do all of your dynamic response body production, POST request body consumption, or client request/response body handling in a purely streaming, chunk-at-a-time fashion, then that’s going to be the best thing to do. Fire and forget. The pattern is generally limited to HTTP proxies and load balancers.
If however, you need to do things like:
parse body payloads, producing large parse graphs in RAM, or
mutate the start of a body based on content in the middle or end of the body, or
produce dynamic content bodies much faster than your clients, or server POST endpoints, can consume them over the network
…then you might find the above tested setups and tunable options with body-image actually gain you efficiencies and/or make you sleep better at night not worrying over memory consumption, unbounded thread growth, or uncontrolled swapping. Let me know what you find!