No one ever gets in trouble for posting micro-benchmarks and making broad assumptions about the cause of observed results! This post will focus on a couple of such benchmarks pertaining to blocking operations on otherwise asynchronous runtimes. Along the way I’ll give only sparse background on the projects I’ve been working on, but plenty of links if you are interested in reading further. This blog post is something of a followup to an URLO post, Futures 0.3, async/await experience snapshot, and I’ll cross-post this one to URLO as well.
I don’t care much for laptop benchmark results for software not intended to run on laptops. So before I forget, all tests below were conducted on:
ec2 m5dn.large instances
2x CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
ubuntu 18.04 amd64 hvm-ssd 20191113
kernel 4.15.0-1054-aws
rustc 1.42.0-nightly (da3629b05 2019-12-29)
tokio 0.2.8
hyper 0.13.1
blocking-permit@923bddda
body-image-futio@3bc3760e
I started the blocking-permit crate as a prototype while tokio’s much-anticipated 0.2 release didn’t yet have any comparable facility (and I wasn’t sure how to constructively contribute). I continued and released the crate for a few features I want, and which the latest tokio version doesn’t yet have. That gap could certainly be closed, but perhaps others will find these same features useful: a DispatchPool and a BlockingPermit that are portable across the tokio and async-std runtimes.

I provide some comparative micro-benchmarks via cargo bench with this crate, below. In these benchmarks I compare different strategies for running blocking operations (see the conceptual sketch after this list). These include:
- permit: a BlockingPermit is requested via blocking_permit_future (this crate) and a fixed size Semaphore, to limit concurrency, before running the operation on a reactor thread, in the tokio case via tokio::task::block_in_place, or otherwise by running it directly.
- dispatch_rx: the operation is dispatched to a DispatchPool via dispatch_rx (this crate) and the result is awaited.
- spawn_blocking: tokio::task::spawn_blocking dispatches the operation to a similarly dedicated set of tokio-managed threads.
- direct: as a baseline, the operation is simply run on a reactor thread with no coordination.

On a second axis is the type of the blocking workload under test: a noop, a CPU-expensive computation, and a thread sleep, as reflected in the benchmark names below.
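As a conceptual sketch of the dispatching strategies, the essence is to run the blocking closure on another thread and await its result on the async side. This is not the blocking-permit or tokio implementation, just the general shape, assuming current tokio for the oneshot channel; a real dispatch pool reuses a fixed set of worker threads rather than spawning one per operation:

use tokio::sync::oneshot;

// Conceptual sketch only: hand a blocking closure to another thread and
// await the result via a oneshot channel.
async fn dispatch_blocking<T, F>(op: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    std::thread::spawn(move || {
        // A real pool would use an existing worker thread here.
        let _ = tx.send(op());
    });
    rx.await.expect("dispatch thread dropped the result")
}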
The total thread count is kept consistent across all strategies. The
dispatching strategies (dispatch_rx and spawn_blocking) employ 4 core
threads and 4 “extra” dispatch-dedicated threads. The direct and permit
strategies employ 8 core threads, and 4 Semaphore
permits if applicable. For
the sleep operations, this is extended to 40 “extra” threads or permits.
Note these tests were run on a virtual host with two virtual CPUs.
To simulate concurrency, batches of 100 operations (200 for sleep) are spawned and then awaited concurrently via futures::stream::FuturesUnordered, for each iteration of the benchmark. Some overhead could likely be attributed to the latter’s bookkeeping.
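For reference, a minimal sketch of that batching pattern, assuming tokio for spawning; the closure body and names here are illustrative, not the benchmark code itself:

use futures::stream::{FuturesUnordered, StreamExt};

// Spawn a batch of operations, then await them all concurrently by
// draining the FuturesUnordered stream.
async fn run_batch(batch_size: usize) {
    let mut batch = FuturesUnordered::new();
    for _ in 0..batch_size {
        // Each spawned task yields a JoinHandle, itself a future.
        batch.push(tokio::spawn(async {
            // ...the blocking operation under test, via some strategy...
        }));
    }
    while let Some(res) = batch.next().await {
        res.expect("task panicked");
    }
}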
cargo bench --features=tokio-omnibus
test noop_threaded_direct ... bench: 52,220 ns/iter (+/- 10,434)
test noop_threaded_dispatch_rx ... bench: 159,234 ns/iter (+/- 57,529)
test noop_threaded_permit ... bench: 72,940 ns/iter (+/- 3,534)
test noop_threaded_spawn_blocking ... bench: 147,804 ns/iter (+/- 8,325)
test r_expensive_threaded_direct ... bench: 1,253,295 ns/iter (+/- 189,310)
test r_expensive_threaded_dispatch_rx ... bench: 1,544,520 ns/iter (+/- 164,754)
test r_expensive_threaded_permit ... bench: 1,272,118 ns/iter (+/- 89,466)
test r_expensive_threaded_spawn_blocking ... bench: 1,574,170 ns/iter (+/- 162,377)
test sleep_threaded_direct ... bench: 649,913 ns/iter (+/- 8,448)
test sleep_threaded_dispatch_rx ... bench: 992,531 ns/iter (+/- 61,243)
test sleep_threaded_permit ... bench: 800,128 ns/iter (+/- 41,226)
test sleep_threaded_spawn_blocking ... bench: 1,624,739 ns/iter (+/- 377,409)
Comparing our dispatch_rx with spawn_blocking, the noop and expensive results are fairly close, but sleep (at 40 threads) is considerably slower with tokio’s spawn_blocking. As the higher thread counts are likely representative of real-world use with blocking APIs, I find this noteworthy.
The DispatchPool of the previous blocking-permit 0.1.0 release used crossbeam’s MPMC channel, which was much slower than both the above (1.0.0) results and tokio’s spawn_blocking pool, particularly at higher thread counts, as with sleep.
Comparing permit to direct strategies gives some sense of Semaphore overhead. Comparing permit to either dispatching strategy is likely to be of limited value due to limitations of these benchmarks; most importantly, there is no actual I/O for which reactor threads need to be kept available.
Note that tokio::task::block_in_place not only informs tokio of an imminent blocking operation but actively starts the process of enlisting a new reactor thread. Without some fixed number of Semaphore permits, the number of threads is effectively unbounded. While tokio 0.1 had a configurable cap on the total number of threads, this was found to be a liability as a dead- or live-lock potential, and was removed in tokio 0.2. If you want your process to have a known maximum (and small) number of threads, like I do, then you need to either not use block_in_place or constrain the blocking concurrency by some other means. If the sum of such semaphore permits is less than the configured total number of reactor (core) threads, then we are guaranteed to always have a reactor thread, and disused blocking threads are kept alive and rotated in as subsequent reactors. This turns out to be reasonably (but not perfectly) efficient.
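To make that concrete, here is a sketch of bounding blocking concurrency with a semaphore, written against current tokio rather than blocking-permit’s actual BlockingPermit types (the 0.2 Semaphore API differed slightly); it assumes a multi-threaded runtime:

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task;

// A fixed-size Semaphore bounds how many reactor threads may block at
// once, so the number of replacement threads stays bounded as well.
async fn run_bounded<F, T>(permits: Arc<Semaphore>, op: F) -> T
where
    F: FnOnce() -> T,
{
    // Wait for one of the fixed number of permits.
    let _permit = permits.acquire().await.expect("semaphore closed");
    // Inform tokio this thread will block; another thread takes over
    // reactor duties while `op` runs here.
    task::block_in_place(op)
}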
I’m surprised that something like the permit strategy wasn’t mentioned in Stop worrying about blocking: the new async-std runtime, inspired by Go (async-std blog, by Stjepan Glavina), or in async-std#631, as it would seem to be directly applicable as a strategy there as well. (Much more on “worrying about blocking” in the conclusion.)
The above permit benchmarks use tokio’s async Semaphore, which was originally available in the 0.2 alphas, made private in the 0.2.0 release, and finally re-exported with interface changes in the 0.2.5 PATCH release. Our BlockingPermitFuture wraps the underlying permit future to provide a consistent, extended interface. In tokio’s case, that future is an unnamed type returned from an async fn, so I need to move it to the heap as a Pin<Box<dyn Future<_>>>.
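That boxing workaround looks roughly like the following sketch; the alias and function names here are hypothetical, not the crate’s actual types:

use std::future::Future;
use std::pin::Pin;

// An `async fn` returns an unnameable future type; Box::pin moves it to
// the heap and pins it so it can be named and stored as a trait object.
type BoxedPermitFuture<T> = Pin<Box<dyn Future<Output = T> + Send>>;

fn erase<F, T>(fut: F) -> BoxedPermitFuture<T>
where
    F: Future<Output = T> + Send + 'static,
{
    Box::pin(fut)
}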
While the fate of tokio’s Semaphore was uncertain, I also integrated the futures-intrusive Semaphore, and this is retained for compatibility with other runtimes. A different complication with this Semaphore is that its future type isn’t Sync, but in my usage with hyper, Sync becomes an unexpected requirement. The permit strategy benchmarks are re-run with this Semaphore type, below:
cargo bench --features=futures-intrusive,tokio-threaded permit
test noop_threaded_permit ... bench: 74,282 ns/iter (+/- 3,821)
test r_expensive_threaded_permit ... bench: 1,304,930 ns/iter (+/- 80,482)
test sleep_threaded_permit ... bench: 6,811,144 ns/iter (+/- 1,446,636)
Note that the sleep workload deteriorates significantly with the futures-intrusive Semaphore. All permit-using benchmarks are currently slower with this Semaphore than with tokio’s. Removing the Sync wrapper in these benchmarks does not measurably improve them.
Since the prior URLO post, the massive async/await upgrade has been completed, including a reasonably pleasant experience of re-writing all of its tests and benchmarks in generic, async/await style (here with a total of 34 uses of await and test LoC changes: 1636 insertions, 622 deletions). I’ve released (besides blocking-permit) a new set of body-image-* 2.0.0 crates using the latest tokio, hyper, and futures.
First, it’s worth summarizing that the original stream-to-sink forwarding benchmarks, available in the body-image-futio 1.3.0 release (using tokio 0.1), have all improved at least somewhat in the 2.0.0 release (using tokio 0.2). But I want to focus on a new set of benchmarks, added in 2.0.0, which better simulate my use case for the project.
A new set of client benchmarks compares the performance of making HTTP requests to a server which returns 8MiB chunked responses, and caching the body in the various transparent ways that the body-image crate allows: scattered allocations in ram, or written to the file system as received, and optionally memory mapped. The client then summarizes a stream of the response body (either in memory or read back from the filesystem) by finding the highest byte value (always 0xFF, as proof of reading) and verifying the total length.
As in the prior blocking-permit benchmarks, concurrency is simulated by batching, in this case 16 request-response-summarize operations per benchmark iteration. The total thread count is kept to 4 threads (2 core, 2 “extra” dispatch-dedicated threads) and 2 (tokio) Semaphore permits. The tests were run on the same ec2 host configuration described above.
In the results below, a separate server is run via cargo run --release --example server on a separate virtual host of identical type, provided to the benchmarks via a BENCH_SERVER_URL environment variable. If a benchmark is run without this variable, a server is run in process but still communicates with the clients via the localhost TCP stack.
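A sketch of that endpoint selection (not the benchmark’s actual code; the function name is hypothetical, and starting the in-process server is elided):

use std::env;

// If BENCH_SERVER_URL is unset, the caller would instead start an
// in-process server on localhost and use its address.
fn bench_server_url() -> Option<String> {
    env::var("BENCH_SERVER_URL").ok()
}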
Terms used in the benchmark names:

- ram: Bytes chunks are left scattered and pushed to a Vec<Bytes> without copying. Once the body is complete, it is summarized by reading each chunk in turn, without copying.
- gather: the scattered chunks are copied into a single contiguous Bytes buffer in one allocation. Note this resembles many hyper body handling examples and may be an unfortunate requirement of parsing APIs, etc.
- mmap: the body, written to the filesystem, is memory mapped and advised with MADV_SEQUENTIAL to suggest aggressive read-ahead, before it is read as a single contiguous UniBodyBuf item, with no additional copying (a minimal sketch of this appears a bit further below).
- copy: the memory mapped region is copied into a Bytes item before processing it. The extra copy isn’t recommended but could possibly justify the mmap if there are multiple read passes and one requires Bytes.
- direct: the write and read operations are simply run on a reactor thread (blocking it).
- permit: a BlockingPermit is obtained via blocking_permit_future before running the write and read operations on a reactor thread, in this tokio case via tokio::task::block_in_place.
- dispatch: the write and read operations are dispatched to a DispatchPool via dispatch_rx.

% BENCH_SERVER_URL=http://10.0.0.115:9087 cargo bench client
test client_01_ram ... bench: 58,528,073 ns/iter (+/- 3,776,825)
test client_01_ram_gather ... bench: 81,692,816 ns/iter (+/- 3,984,386)
test client_10_fs_direct ... bench: 119,846,890 ns/iter (+/- 1,999,313)
test client_10_fs_permit ... bench: 141,334,426 ns/iter (+/- 3,529,708)
test client_12_fs_dispatch ... bench: 151,809,679 ns/iter (+/- 6,100,962)
test client_15_mmap_direct_copy ... bench: 137,397,757 ns/iter (+/- 2,643,698)
test client_15_mmap_permit_copy ... bench: 157,148,683 ns/iter (+/- 6,222,127)
test client_16_mmap_direct ... bench: 110,491,801 ns/iter (+/- 2,796,833)
test client_17_mmap_permit ... bench: 124,647,965 ns/iter (+/- 4,390,988)
The aggregate mean transfer and processing rate, in terms of the original body size, for the fastest, ram case is 2.14 GiB/sec, or 17.1 Gib/sec. Amazon AWS suggests that one can get “up to 25 Gbps” of (raw TCP) bandwidth for these instances, independent of EBS traffic, made possible because client and server instances are in the same VPC and availability zone. Since the client instance’s 2 CPUs are approximately saturated in these tests (mostly system time, actually), I would expect results to be slower if TLS were also involved (it’s not).
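For reference, that rate falls out of the numbers above as follows (16 responses of 8 MiB per benchmark iteration):

16 × 8 MiB = 128 MiB per iteration
128 MiB / 58.5 ms ≈ 2188 MiB/s ≈ 2.14 GiB/s
2.14 GiB/s × 8 ≈ 17.1 Gib/s (in bits)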
An interesting, if entirely anecdotal, comparison is the maximum of ~70 MB/s reported (reading off the graph’s y-axis) in that same async-std blog article, running on much larger “m5a.16xlarge” and “m5a.8xlarge” instances. Perhaps that benchmark wasn’t as optimized for large body payloads, or has other constraints?
The fs (and mmap) results are based on an ext4 filesystem on EBS “gp2” SSD network-attached storage. The “m5dn.large” ec2 instance type also has access to directly attached NVMe SSD (instance store). However, mounting and using that storage for the body-image temporary files was not found to significantly change the results. At the observed rate, gp2 storage does not appear to be a significant bottleneck for these sequential writes and sequential (and mapped) reads, but see below for additional optimizations.
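Before moving on to tmpfs, here is the minimal sketch of the mmap term promised above, assuming the memmap2 and libc crates; this is an illustration, not body-image’s actual code:

use std::fs::File;
use memmap2::Mmap;

// Map a completed body file and hint sequential read-ahead to the kernel
// via MADV_SEQUENTIAL, before the single contiguous read pass.
fn map_body(file: &File) -> std::io::Result<Mmap> {
    let map = unsafe { Mmap::map(file)? };
    unsafe {
        libc::madvise(
            map.as_ptr() as *mut libc::c_void,
            map.len(),
            libc::MADV_SEQUENTIAL,
        );
    }
    Ok(map)
}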
Yes, tmpfs can behave as a RAM disk, so let me explain the use case lest I’m accused of cheating just for the benchmarks. As async/await can be described as cooperative multi-tasking, this use of tmpfs can be described as cooperative swapping. Linux tmpfs will start using available swap space if the instance experiences memory pressure. By writing body chunks to tmpfs, I’m telling the kernel that these are currently low-priority pages that it can remove from RAM if needed. Once the body is downloaded, and processing begins by reading or memory mapping, I’m informing the kernel that these pages are again high-priority. This cooperative swapping will behave much better than the uncooperative swapping that could occur if I just kept all bodies in RAM and caused memory pressure that way. Then any operation in the entire process, on reactor or other threads, becomes a potential blocking operation, as it may need to wait for executable or heap pages to swap back in! The kernel doesn’t really have much to go on for page prioritization purposes.
Another nice feature of tmpfs on Linux is that it supports Transparent Huge Pages (default 2MiB) without any other required configuration. Huge pages drastically reduce kernel virtual memory bookkeeping overhead for memory mapping, at the cost of over-allocating up to 1 page of memory per mapped file. There may be workarounds for space efficiency as well, but I’m currently not worried much about it.
Below shows the setup for tmpfs with huge pages and then the updated benchmark results:
% sudo mount -t tmpfs -o size=500M,huge=always,uid=1001,gid=1001 tmpfs \
../target/testmp
% BENCH_SERVER_URL=http://10.0.0.115:9087 cargo bench client
test client_01_ram ... bench: 58,392,758 ns/iter (+/- 16,045,924)
test client_01_ram_gather ... bench: 85,527,831 ns/iter (+/- 18,187,177)
test client_10_fs_direct ... bench: 94,056,654 ns/iter (+/- 2,915,506)
test client_10_fs_permit ... bench: 120,463,257 ns/iter (+/- 2,742,582)
test client_12_fs_dispatch ... bench: 130,157,569 ns/iter (+/- 5,993,078)
test client_15_mmap_direct_copy ... bench: 104,351,507 ns/iter (+/- 2,596,867)
test client_15_mmap_permit_copy ... bench: 129,762,411 ns/iter (+/- 6,165,487)
test client_16_mmap_direct ... bench: 79,705,427 ns/iter (+/- 3,013,775)
test client_17_mmap_permit ... bench: 99,289,803 ns/iter (+/- 4,002,980)
As expected, the fs and mmap results are all improved. Here mmap_direct actually outperforms ram_gather, and is only a 36% latency overhead beyond ram (without gather). Now, what if I only need to use fs (and mmap) on the 10% or 1% of responses that exceed some configured size in the Tunables, the remainder staying in ram? That net overhead should be more like 3.6% or 0.36%. With tmpfs, this seems like an effective way to have my cake and eat it too: a very limited resource cost for peace of mind that my process isn’t going to exceed RAM or thrash if it requires any swap space at all, and no need to enforce some low maximum HTTP body size or low concurrency.
Dear Reader,
Having made it this far, did you notice that in all cases tested, the direct
strategy (blocking a reactor thread) performs best?
It’s true. I have almost nothing positive, performance-wise, to show for a significant amount of effort spent implementing the permit and dispatch strategies employed in these tests. One minor win is that from preliminary benchmarking I made the decision (in the 3rd rewrite, sigh) to compose the permit and dispatch strategies as wrappers over the direct AsyncBodyImage (our Stream) and AsyncBodySink types, so the latter may be directly selected at compile time, without further overhead or complication.
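That composition looks roughly like the following sketch; the wrapper name is hypothetical and this is not body-image-futio’s actual implementation, just the shape of wrapping the direct Stream type:

use futures::stream::Stream;
use std::pin::Pin;
use std::task::{Context, Poll};

// A strategy wrapper delegating to the direct (inner) stream type. When
// the direct strategy is selected at compile time, the inner type is
// used unwrapped and none of this exists.
struct PermitWrapper<S> {
    inner: S,
    // ...permit acquisition or dispatch state would live here...
}

impl<S: Stream + Unpin> Stream for PermitWrapper<S> {
    type Item = S::Item;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<S::Item>> {
        // A real wrapper would obtain a permit (or dispatch) before
        // delegating to the inner stream.
        Pin::new(&mut self.inner).poll_next(cx)
    }
}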
One other possible advantage of permit, as hinted in these benchmark results, is that the variance (if not the mean) of request/response handling is reduced. That might be important in some use cases, for example for fairness to multiple users.
How could direct be the best performing? What have I done wrong, if the demigods of Rust async (well, some of them) tell us that blocking (their? our?) reactor threads is a bad idea? I speculate that this is both true and that I’ll be using the direct strategy in production, for all of the following reasons:
Memory cache locality and coherence is rather important in I/O heavy applications, so moving a read or write operation with a pointer to a buffer in some CPU’s cache or NUMA zone to another thread that might get scheduled anywhere is inherently costly. And this is typically exacerbated on virtualized instances. I believe this is one reason why permit tends to outperform the dispatch strategy in the above client benchmarks.
8KiB or 512KiB filesystem sequential writes to SSDs (network attached or local) simply aren’t blocking enough to matter, at least as compared with the overhead of any other strategy.
Tokio’s threaded runtime, with a sufficient number of reactor threads and its work stealing abilities, is quite capable of handling an occasionally blocked reactor thread without replacement or slowdown, at least for this tested workload.
I think there may still be plenty of other use cases for either the BlockingPermit or the DispatchPool, but they will tend to be much more coarse-grained. For example, I envision dispatching a one-step blocking operation to uncompress HTTP body payloads en masse, in the background of a particular server, once bodies are completely downloaded.
These findings similarly make me suspect that efforts to provide a direct drop-in std replacement in the form of AsyncRead and AsyncWrite and associated types and functions are probably attacking the problem at too low a level. At the very least, I don’t expect those traits as currently conceived will be useful for body-image-futio.
Some related questions:
Does hyper-tls or native-tls or tokio-tls offload packet decryption from the reactor thread? If any of these do, I haven’t been able to find where it happens. Please point it out for me.
Does hyper’s HTTP 1 chunked transfer encoding implementation attempt to offload Base64 encode/decode from the reactor thread? Should it?
Does the newer async-compression worry about blocking a reactor thread? Should it?
Or is the general strategy with these just that only a small block, say 8KiB or 64KiB, is decrypted/decoded/decompressed for each Stream::poll_next()? That would sound a lot like the above best-performing direct strategy with blocking filesystem I/O.
Why use the new body-image-* 2.0.0 crates?
If you can do all of your dynamic response body production, POST request body consumption, or client request/response body handling in a purely streaming, chunk-at-a-time fashion, then that’s going to be the best thing to do. Fire and forget. The pattern is generally limited to HTTP proxies and load balancers.
If however, you need to do things like:
parse body payloads, producing large parse graphs in RAM, or
mutate the start of a body based on content in the middle or end of the body, or
produce dynamic content bodies much faster than your clients, or server POST endpoints, can consume them over the network
…then you might find the above tested setups and tunable options with body-image actually gain you efficiencies and/or make you sleep better at night not worrying over memory consumption, unbounded thread growth, or uncontrolled swapping. Let me know what you find!