Cancellation Safety in Different Rust Async Runtimes
By Nikita Bishonen • 8 minutes read
For more context, I recommend that my dear friends read or watch the great talk by Rain, which lays out an approach to “cancellation safety” in async Rust.
TL;DR for those who prefer to hack and not to read:
- git clone https://gitlab.com/blogging1/cancellation-safety;
- cargo test --no-fail-fast --workspace (or your runtime of choice with -p cancel_%RUNTIME% instead of --workspace);
- Read errors and tests;
- Read lib.rs for runtimes you are interested in;
- Play with the code to fix all the tests;
The Problem
“Cancellation safety” usually means the absence of unexpected “effects” when a sequence of asynchronous operations is stopped before it reaches its final state. A panic! in the middle of a synchronous sequence of operations is a close analogy.
Its fundamental principles are tied to the Future trait and to the implementation details of the asynchronous runtime. Here I would like to experiment in an attempt to find common patterns and differences in how Tokio, smol and glommio handle cancellation:
- Future::poll is the key aspect here: at compile time every async Future implementation becomes a state machine that represents a sequence of operations. “Cancellation” means that the runtime (reactor, event loop or scheduler) stops polling this state machine and (hopefully) destroys its state;
- Pin and Unpin have their own place, as !Unpin types have to uphold additional guarantees to be considered “safe” (to mitigate “effects”);
- The Waker implementation may play a big role in handling “cancellation” requests;
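To make the first point concrete, here is a minimal sketch that uses only the standard library. NoopWake, YieldNow, three_steps and poll_then_drop are hypothetical names made up for this illustration; they are not part of any runtime. The sketch polls an async fn by hand a couple of times and then drops it, which is all that “cancellation” amounts to at this level.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough when we poll by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical helper: Pending on the first poll, Ready on the second.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        Poll::Pending
    }
}

/// Sets a flag when dropped, so we can observe the state being destroyed.
struct Noisy<'a>(&'a Cell<bool>);
impl Drop for Noisy<'_> {
    fn drop(&mut self) {
        self.0.set(true);
    }
}

async fn three_steps(steps_done: &Cell<u32>, dropped: &Cell<bool>) {
    let _local_state = Noisy(dropped); // lives inside the state machine
    for _ in 0..3 {
        YieldNow(false).await; // suspension point, i.e. a possible cancellation point
        steps_done.set(steps_done.get() + 1);
    }
}

/// Polls the future at most `polls` times, then drops it.
/// Returns (steps completed, whether the local state was dropped).
fn poll_then_drop(polls: u32) -> (u32, bool) {
    let steps = Cell::new(0);
    let dropped = Cell::new(false);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    {
        let mut fut = Box::pin(three_steps(&steps, &dropped));
        for _ in 0..polls {
            if fut.as_mut().poll(&mut cx).is_ready() {
                break;
            }
        }
    } // `fut` is dropped here: nobody will ever poll it again
    (steps.get(), dropped.get())
}

fn main() {
    println!("cancelled after 2 polls: {:?}", poll_then_drop(2));
    println!("allowed to finish:       {:?}", poll_then_drop(10));
}
```

With two polls only one of the three steps completes, yet the local state is still destroyed when the future is dropped.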
The Example
In our adventure we will join our strong and proud friends who live in Moria:
pub async fn mine_with_tool<F, FT>(dwarf: &mut Dwarf, mut pickaxe: F) ... {
    let mut bag: Bag = Vec::new();
    for i in 0..MAX_ALLOWED_SHIFTS {
        pickaxe().await;
        ...
        println!("Here at the Gates the king AWAITS");
        ...
        bag.push(Ferrum::Dirty);
    }
    dwarf.bag = Some(bag);
}
So our asynchronous process imitates a dwarf working in the mines. mine_with_tool itself is a composite Future implementation that internally takes MAX_ALLOWED_SHIFTS steps to reach its final (completed) state. In simple terms, line 4, pickaxe().await;, is the point where the state machine is polled and makes progress. It also means that this exact spot can be used to “cancel” the execution.
Here is how it looks in the high-level intermediate representation (HIR). The code is cleaned up to make it less verbose; run cargo +nightly rustc -- -Z unpretty=hir inside the moria folder to see the full output:
async fn mine_with_tool<F, FT>(dwarf: &'_ mut Dwarf, pickaxe: F) -> /*impl Trait*/
where F: FnMut() -> FT, FT: std::future::Future |mut _task_context: ResumeTy|
{
    ...
    let mut bag: Bag = Vec::new();
    {
        ...
        loop {
            match next(&mut iter) {
                None {} => break,
                Some { 0: i } => {
                    match into_future(pickaxe()) {
                        mut __awaitee =>
                            loop {
                                match unsafe {
                                    poll(new_unchecked(&mut __awaitee),
                                        get_context(_task_context))
                                } {
                                    Ready { 0: result } => break result,
                                    Pending {} => { }
                                }
                                _task_context = (yield ());
                            },
                    };
                    ...
                    bag.push(Ferrum::Dirty);
                }
            }
        },
    ...
    dwarf.bag = Some(bag);
    ...
}
Almost no “async” magic anymore: we have a loop, an unsafe and a yield ().
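Roughly the same loop can also be written by hand as a state machine. The sketch below is a simplification under stated assumptions, not what the compiler really emits: Mine, State and YieldNow are hypothetical names, the pickaxe is replaced by a future that yields once, and u32 shift numbers stand in for Ferrum.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for `pickaxe()`: Pending on the first poll, Ready on the second.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        Poll::Pending
    }
}

/// One variant per place where execution can be suspended.
enum State {
    Start,
    AwaitingPickaxe { shift: u32, pickaxe: YieldNow },
    Done,
}

struct Mine {
    state: State,
    bag: Vec<u32>, // the "local variable" that lives inside the future
    max_shifts: u32,
}

impl Future for Mine {
    type Output = Vec<u32>;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Vec<u32>> {
        let this = &mut *self; // allowed because every field is Unpin
        loop {
            match &mut this.state {
                State::Start => {
                    this.state = if this.max_shifts == 0 {
                        State::Done
                    } else {
                        State::AwaitingPickaxe { shift: 0, pickaxe: YieldNow(false) }
                    };
                }
                State::AwaitingPickaxe { shift, pickaxe } => {
                    // The `.await`: give control back while the inner future is pending.
                    if Pin::new(pickaxe).poll(cx).is_pending() {
                        return Poll::Pending;
                    }
                    this.bag.push(*shift);
                    let next = *shift + 1;
                    this.state = if next < this.max_shifts {
                        State::AwaitingPickaxe { shift: next, pickaxe: YieldNow(false) }
                    } else {
                        State::Done
                    };
                }
                State::Done => return Poll::Ready(std::mem::take(&mut this.bag)),
            }
        }
    }
}

/// Polls `Mine` until it is ready; returns the bag and the number of polls it took.
fn run_to_completion(max_shifts: u32) -> (Vec<u32>, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Mine { state: State::Start, bag: Vec::new(), max_shifts };
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(bag) = Pin::new(&mut fut).poll(&mut cx) {
            return (bag, polls);
        }
    }
}

fn main() {
    println!("{:?}", run_to_completion(3));
}
```

Every return of Poll::Pending is a spot where the caller may simply never come back, and the bag field goes away together with the struct.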
Let’s jump to the “safety” aspect. In my humble opinion, if there is no unsafe code and the Rust compiler is pleased, the operation is safe. Yet that doesn’t mean the program behaves the way programmers expect it to.
... fn mine_with_tool<...>(dwarf: &mut Dwarf ... {
    let mut bag: Bag = Vec::new();
    ...
    bag.push(Ferrum::Dirty);
    ...
    dwarf.bag = Some(bag);
}
Try to spot for yourself what makes this Future implementation “cancellation not-safe” (I prefer not-safe here, as unsafe is an important but unrelated Rust term).
The answer lies in how this asynchronous operation stores its state and tracks its own progress. Both things happen internally, while it holds a mutable reference to the state given by the caller, dwarf: &mut Dwarf. Let’s look at two examples of how this function is used with the smol runtime to see why such an implementation can surprise the caller if it is cancelled.
pub async fn work(dwarf: &mut Dwarf) {
    super::mine(dwarf)
        .or(async {
            Timer::after(HALF_SHIFT).await;
        })
        .await;
}
Here the dwarf works for half of the shift time and makes some progress. But when we run the test:
let mut dwallin = Dwarf::new(Name::Dwalin);
timeout::work(&mut dwallin).await;
// Bag is empty, but I heard the song!
assert!(dwallin.bag.is_some());
we see that its assertion that some work has been done fails. The reason is that the timeout fired before Dwalin was able to finish his work, and everything he had put into the “bag” held in the internal state of the state machine was dropped when the operation was cancelled at the first poll after the timeout. (Try changing Timer::after(HALF_SHIFT) to use FULL_SHIFT and see if it helps.) So this is what we may call a cancellation not-safe (or incorrect, as Rain proposed) implementation of an asynchronous operation: it leads to behaviour that we would not expect.
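The same effect can be reproduced without any runtime. In the sketch below, Or is a hand-rolled stand-in for a race combinator, timer counts polls instead of wall-clock time, u32 shift numbers stand in for Ferrum, and mine_incremental is one possible (assumed) fix. None of these names come from smol or from the repository.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending on the first poll, Ready on the second.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        Poll::Pending
    }
}

struct Dwarf {
    bag: Option<Vec<u32>>,
}

/// Same shape as the article's function: progress is kept in a local until the very end.
async fn mine(dwarf: &mut Dwarf, shifts: u32) {
    let mut bag = Vec::new();
    for shift in 0..shifts {
        YieldNow(false).await;
        bag.push(shift);
    }
    dwarf.bag = Some(bag);
}

/// Assumed alternative: every finished shift is written to caller-owned state right away.
async fn mine_incremental(dwarf: &mut Dwarf, shifts: u32) {
    for shift in 0..shifts {
        YieldNow(false).await;
        dwarf.bag.get_or_insert_with(Vec::new).push(shift);
    }
}

/// "Timer" measured in polls rather than in time.
async fn timer(ticks: u32) {
    for _ in 0..ticks {
        YieldNow(false).await;
    }
}

/// Minimal race: ready as soon as either side is ready; the loser is dropped with `Or`.
struct Or<A, B> {
    a: Pin<Box<A>>,
    b: Pin<Box<B>>,
}
impl<A: Future<Output = ()>, B: Future<Output = ()>> Future for Or<A, B> {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.a.as_mut().poll(cx).is_ready() {
            return Poll::Ready(());
        }
        self.b.as_mut().poll(cx)
    }
}

/// Busy-polling executor; fine here because every future makes progress on each poll.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn work(incremental: bool, shifts: u32, timeout_ticks: u32) -> Option<Vec<u32>> {
    let mut dwarf = Dwarf { bag: None };
    if incremental {
        block_on(Or { a: Box::pin(mine_incremental(&mut dwarf, shifts)), b: Box::pin(timer(timeout_ticks)) });
    } else {
        block_on(Or { a: Box::pin(mine(&mut dwarf, shifts)), b: Box::pin(timer(timeout_ticks)) });
    }
    dwarf.bag
}

fn main() {
    println!("local bag, short timeout:   {:?}", work(false, 4, 2));
    println!("local bag, long timeout:    {:?}", work(false, 4, 8));
    println!("incremental, short timeout: {:?}", work(true, 4, 2));
}
```

The incremental variant trades atomicity for progress: the caller may now observe a half-filled bag, which may or may not be what the system wants.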
Something more interesting (and closer to the real world) can happen if we make our Future implementation more complex and “dirty”:
pub async fn work(dwarf: &Mutex<Dwarf>) {
    let mut dwarf_guard = dwarf.lock().await;
    let mut old_bag = dwarf_guard.bag.take().unwrap();
    timeout(HALF_SHIFT, super::mine(&mut dwarf_guard))
        .await
        .err()
        .unwrap();
    if let Some(new_bag) = dwarf_guard.bag.as_mut() {
        new_bag.append(&mut old_bag)
    }
}
I know, Dwarves are not good at asynchronous programming, but it illustrates the problem really well. Our operation takes state out of its input, holds it internally, then merges it with its own computation results and gives it back.
+------------+    +--------------+    +------+    +--------------+    +------------+
| Lock Mutex | -> | Take Old Bag | -> | Mine | -> | Take New Bag | -> | Merge Bags |
+------------+    +--------------+    +------+    +--------------+    +------------+
      +--------------+    +-----+
   -> | Release Lock | -> | End |
      +--------------+    +-----+
The issue is that the operation is “safe” from the mutex point of view, as we know no one else will change the state of the dwarf. Yet it is not “correct” if we cancel super::mine before it fully completes: old_bag will be dropped, as new_bag will never exist (dwarf_guard.bag.as_mut() is None). Because the work future is a composition of itself with the super::mine future and the timeout future, our logic becomes broken: both futures behave incorrectly from the system’s perspective (while it is totally expected, valid and safe Rust code, IMHO).
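One possible way to make such a take-and-merge operation tolerate cancellation is a drop guard. This is only a sketch under simplifying assumptions: the mutex is left out, u32 values stand in for Ferrum, cancellation is simulated by dropping the future after a fixed number of polls, and BagGuard is a hypothetical name.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending on the first poll, Ready on the second.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        Poll::Pending
    }
}

struct Dwarf {
    bag: Option<Vec<u32>>,
}

/// Owns the bag that was taken out and merges it back when dropped,
/// no matter whether `work` completed or was cancelled half-way.
struct BagGuard<'a> {
    dwarf: &'a mut Dwarf,
    old_bag: Vec<u32>,
}
impl Drop for BagGuard<'_> {
    fn drop(&mut self) {
        // New items first, old items appended, like `new_bag.append(&mut old_bag)`.
        let mut merged = self.dwarf.bag.take().unwrap_or_default();
        merged.append(&mut self.old_bag);
        self.dwarf.bag = Some(merged);
    }
}

/// Publishes its result only at the very end, like the article's `mine`.
async fn mine(dwarf: &mut Dwarf, shifts: u32) {
    let mut bag = Vec::new();
    for shift in 0..shifts {
        YieldNow(false).await;
        bag.push(shift);
    }
    dwarf.bag = Some(bag);
}

async fn work(dwarf: &mut Dwarf, shifts: u32) {
    let old_bag = dwarf.bag.take().unwrap_or_default();
    let guard = BagGuard { dwarf, old_bag };
    mine(&mut *guard.dwarf, shifts).await;
    // `guard` is dropped here on completion, or together with the future on cancellation.
}

/// Polls `work` at most `max_polls` times, drops it, and returns what is left in the bag.
fn run(old_bag: Vec<u32>, shifts: u32, max_polls: u32) -> Option<Vec<u32>> {
    let mut dwarf = Dwarf { bag: Some(old_bag) };
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    {
        let mut fut = Box::pin(work(&mut dwarf, shifts));
        for _ in 0..max_polls {
            if fut.as_mut().poll(&mut cx).is_ready() {
                break;
            }
        }
    } // simulated cancellation if the loop ran out of polls
    dwarf.bag
}

fn main() {
    println!("cancelled: {:?}", run(vec![100], 2, 1));
    println!("completed: {:?}", run(vec![100], 2, 10));
}
```

The guard keeps the old bag from disappearing; the partial work done inside mine is still lost, since that future only publishes its result at the end.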
The Comparison
Runtime Architecture Overview
Tokio: The Industry Standard
Tokio is the most widely used async runtime in the Rust ecosystem. Its architecture focuses on:
- Multi-threaded, work-stealing thread pool executor aimed at I/O-bound work;
- Readiness-based I/O model (built on mio) for non-blocking operations;
- Robust task cancellation with explicit abort capabilities;
- Comprehensive ecosystem of supporting libraries;
Key characteristics:
- Cancellation tokens via tokio_util::sync::CancellationToken (from the companion tokio-util crate);
- tokio::spawn() for creating tasks;
- tokio::select! for concurrent operations;
- Built-in timeout utilities;
smol: The Minimal Approach
smol takes a radically different approach with:
- Single-threaded executor by default;
- Simplified API focusing on essential async operations;
- No runtime of its own: it re-exports existing executor and I/O crates;
- Lightweight dependencies and minimal overhead;
Key characteristics:
- smol::spawn() for task creation;
- smol::Timer for async delays;
- smol::channel for message passing;
- No explicit cancellation tokens in core API;
Glommio: I/O Performance Focus
Glommio represents a specialized runtime designed for high-performance I/O workloads:
- Local executor model with share-nothing-first approach;
- I/O-optimized with dedicated thread-per-core architecture;
- No shared memory between threads by default;
- Local-only futures for better cache locality;
Key characteristics:
- glommio::LocalExecutor for single-threaded execution;
- glommio::spawn_local() for tasks;
- glommio::timer for delays;
- glommio::channels::local_channel for local-only communication;
Cancellation Scenarios Analysis
Most of the scenarios work similarly across runtimes:
- Simple Drop Cancellation: the tests drop futures explicitly, and all runtimes show partial work loss when futures are dropped unexpectedly;
- Timeout Cancellation: all provide explicit timeout handling, and the tests show that a timeout doesn’t preserve partial results;
- Mutex-Protected Operations: tokio::sync::Mutex, smol::lock::Mutex and glommio::sync::RwLock all share similar semantics; the tests demonstrate that lock holding patterns and cleanup work the way they should, while we still lose our progress;
- Channel-Based Communication: very similar, only with locality nuances;
But some of them have nuances:
- Macros like select! and Concurrent Operations with Cancellation Tokens are specific to the Tokio runtime ecosystem, which seems to be neither a good nor a bad thing. As Rain described in her talk, and from what I have heard from other developers, such macros are sometimes banned outright in projects due to their non-explicit nature (and, for example, replaced with futures_concurrency);
- “Explicit” Cancel:
  - in Tokio it is done via JoinHandle::abort, and dropping the handle will not cause a cancellation; handle.await will return a JoinError if the future was cancelled (or the normal result if it finished). It is also worth noting that spawn_blocking tasks are not “cancellable” because they are not asynchronous (but abort will prevent the task from starting if it hasn’t started yet!);
  - in smol you can call Task::cancel and wait for the cancellation (which may return Some if the task finished); it is similar to dropping the future;
  - in glommio the approach is identical to smol, with an Option being returned on awaiting the cancellation;
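The idea behind a cancellation token can be sketched without Tokio. The flag below is a hand-rolled, single-threaded stand-in (an Rc<Cell<bool>>), not the tokio_util::sync::CancellationToken API, and mine_until_cancelled and run are hypothetical names. The point is only that a cooperative check lets the future stop by itself and hand back whatever it has mined so far, instead of being dropped in the middle.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending on the first poll, Ready on the second.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        Poll::Pending
    }
}

/// Checks the flag before every shift and returns the partial bag when asked to stop.
async fn mine_until_cancelled(cancelled: Rc<Cell<bool>>, shifts: u32) -> Vec<u32> {
    let mut bag = Vec::new();
    for shift in 0..shifts {
        if cancelled.get() {
            break; // cooperative cancellation: we decide where it is fine to stop
        }
        YieldNow(false).await;
        bag.push(shift);
    }
    bag
}

/// Drives the future to completion, raising the flag once `cancel_after_polls` polls have happened.
fn run(shifts: u32, cancel_after_polls: u32) -> Vec<u32> {
    let cancelled = Rc::new(Cell::new(false));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(mine_until_cancelled(cancelled.clone(), shifts));
    let mut polls = 0;
    loop {
        if polls == cancel_after_polls {
            cancelled.set(true);
        }
        if let Poll::Ready(bag) = fut.as_mut().poll(&mut cx) {
            return bag;
        }
        polls += 1;
    }
}

fn main() {
    println!("cancelled early: {:?}", run(5, 3));
    println!("never cancelled: {:?}", run(5, 100));
}
```

Note that the future keeps being polled after the flag is raised; the caller has to wait for it to notice, which is the price of keeping the partial result.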
On the documentation side, this topic is only covered in the Tokio API docs. smol has one small (ha-ha) note on the matter: “Note that canceling a task actually wakes it and reschedules one last time. Then, the executor can destroy the task by simply dropping its Runnable or by invoking run().”, while in glommio I was unable to find any mention of cancellation safety or correctness. I think there are two factors why it is what it is:
- Tokio is much more popular and widely used, and therefore has more resources to spend on documentation;
- Tokio has a much more “dangerous” API and internal executor model, which makes it easier to run into cancellation safety issues when using it;
I hope you found something new in this blog post. Write your thoughts in the comments and check the additional resources if you want to.
Additional Resources and Sources of inspiration
- Rain’s great talk;
- Current comment on state of the book section about cancellation safety;
- Tokio Docs on that matter;
- Structured Concurrency lib;