Cancellation Safety in Different Rust Async Runtimes

By Nikita Bishonen • 8 minutes read



For more context, I recommend that my dear friends read or watch a great talk by Rain that presents a view on “cancellation safety” in async Rust.

TL;DR for those who prefer to hack and not to read:

The Problem

“Cancellation safety” usually means the absence of unexpected “effects” when a sequence of asynchronous operations is stopped before it reaches its final state. As an analogy, think of a panic! in the middle of a synchronous sequence of operations.

Its fundamental principles are tied to the Future trait and to the implementation details of the asynchronous runtime. Here I would like to experiment, in an attempt to find common patterns and differences in how Tokio, smol and glommio handle cancellation.

The Example

In our adventure we will join our strong and proud friends who live in Moria:

pub async fn mine_with_tool<F, FT>(dwarf: &mut Dwarf, mut pickaxe: F) ... {
    let mut bag: Bag = Vec::new();
    for i in 0..MAX_ALLOWED_SHIFTS {
        pickaxe().await;
        ...
        println!("Here at the Gates the king AWAITS");
        ...
        bag.push(Ferrum::Dirty);
    }
    dwarf.bag = Some(bag);
}

So our asynchronous process imitates a dwarf working in the mines. mine_with_tool itself is a composite Future implementation that internally takes MAX_ALLOWED_SHIFTS steps to reach its final (completed) state. In simple terms, line 4, pickaxe().await;, is the point where the state machine is polled and makes progress. It also means that this exact spot can be used to “cancel” the execution: if the future is dropped while it is suspended there, the rest of the body never runs.
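To make the “cancellation point” idea tangible, here is a minimal std-only sketch that polls a simplified miner by hand and then drops it. Swing, NoopWaker, mine and shifts_finished_after are names I made up for this illustration; they are not part of the moria code, and Swing only imitates a pickaxe() call that needs two polls to finish.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough when we poll by hand.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for `pickaxe()`: Pending on the first poll, Ready on the second.
struct Swing(bool);
impl Future for Swing {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            Poll::Pending
        }
    }
}

/// Simplified miner: counts the shifts whose loop body ran to the end.
async fn mine(finished_shifts: &Cell<u32>, shifts: u32) {
    for _ in 0..shifts {
        Swing(false).await; // the only suspension point, so the only cancellation point
        finished_shifts.set(finished_shifts.get() + 1); // runs only if we are polled again
    }
}

/// Polls `mine` at most `polls` times, drops it, and reports the finished shifts.
pub fn shifts_finished_after(polls: u32, shifts: u32) -> u32 {
    let finished = Cell::new(0);
    let mut fut = Box::pin(mine(&finished, shifts));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    for _ in 0..polls {
        if fut.as_mut().poll(&mut cx).is_ready() {
            break;
        }
    }
    drop(fut); // cancellation: a pending future is dropped and never polled again
    finished.get()
}
```

With three shifts, shifts_finished_after(1, 3) returns 0 and shifts_finished_after(2, 3) returns 1: the code after the .await runs only when somebody polls the future again, and dropping the future is all it takes to stop it.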

Here is how it looks in the high-level intermediate representation (the code is cleaned up to make it less verbose; please run cargo +nightly rustc -- -Z unpretty=hir inside the moria folder to see the full output):

async fn mine_with_tool<F, FT>(dwarf: &'_ mut Dwarf, pickaxe: F) -> /*impl Trait*/
    where F: FnMut() -> FT, FT: std::future::Future |mut _task_context: ResumeTy|
{
    ...
    let mut bag: Bag = Vec::new();
    {
        ...
        loop {
            match next(&mut iter) {
                None {} => break,
                Some {  0: i } => {
                    match into_future(pickaxe()) {
                        mut __awaitee =>
                            loop {
                                match unsafe {
                                        poll(new_unchecked(&mut __awaitee),
                                            get_context(_task_context))
                                    } {
                                    Ready {  0: result } => break result,
                                    Pending {} => { }
                                }
                                _task_context = (yield ());
                            },
                    };
                    ...
                    bag.push(Ferrum::Dirty);
                }
            }
        },
        ...
    dwarf.bag = Some(bag);
    ...
}

Almost no “async” magic anymore: we have a loop, an unsafe and a yield ().

Let’s jump to the “safety” aspect. In my humble opinion, if there is no unsafe code and the Rust compiler is pleased, the operation is safe. Yet that doesn’t mean the program behaves the way programmers expect it to.
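For comparison, a hand-written future with roughly the same shape can be sketched in plain std Rust. This is my own simplified analogue, not what the compiler actually generates: Mine, Swing, NoopWaker and drive are hypothetical names, and the bag holds u32 values instead of Ferrum. The point is only to show that everything that must survive a suspension point becomes a field of the future.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for the future returned by `pickaxe()`.
struct Swing { polled: bool }
impl Future for Swing {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            Poll::Ready(())
        } else {
            self.polled = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Hand-written analogue of the state machine: locals become fields.
pub struct Mine<'a> {
    out: &'a mut Option<Vec<u32>>, // plays the role of `dwarf`
    bag: Vec<u32>,                 // the local `bag`
    shift: u32,                    // the loop counter `i`
    shifts: u32,
    awaitee: Option<Swing>,        // the `__awaitee` of the current iteration
}

impl<'a> Mine<'a> {
    pub fn new(out: &'a mut Option<Vec<u32>>, shifts: u32) -> Self {
        Mine { out, bag: Vec::new(), shift: 0, shifts, awaitee: None }
    }
}

impl Future for Mine<'_> {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = &mut *self; // allowed because every field is Unpin
        while this.shift < this.shifts {
            let swing = this.awaitee.get_or_insert(Swing { polled: false });
            if Pin::new(swing).poll(cx).is_pending() {
                return Poll::Pending; // plays the role of `yield ()`
            }
            this.awaitee = None;
            this.bag.push(this.shift);
            this.shift += 1;
        }
        // Only now does the caller see anything.
        *this.out = Some(std::mem::take(&mut this.bag));
        Poll::Ready(())
    }
}

/// Polls `Mine` at most `polls` times, then drops it; returns what the caller sees.
pub fn drive(shifts: u32, polls: u32) -> Option<Vec<u32>> {
    let mut out = None;
    {
        let mut fut = Mine::new(&mut out, shifts);
        let waker = Waker::from(Arc::new(NoopWaker));
        let mut cx = Context::from_waker(&waker);
        for _ in 0..polls {
            if Pin::new(&mut fut).poll(&mut cx).is_ready() {
                break;
            }
        }
    } // `fut`, and the `bag` field inside it, is dropped here
    out
}
```

Calling drive(3, 3) yields None, while drive(3, 4) yields Some(vec![0, 1, 2]): until the last poll, the bag exists only as a field of the future.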

... fn mine_with_tool<...>(dwarf: &mut Dwarf ... {
    let mut bag: Bag = Vec::new();
        ...
        bag.push(Ferrum::Dirty);
        ...
    dwarf.bag = Some(bag);
}

Try to spot for yourself what makes this Future implementation “cancellation not-safe” (I prefer not-safe here, as unsafe is an important but unrelated Rust term).

The answer lies in how this asynchronous operation stores its state and tracks its own progress. Both happen internally, while it holds a mutable reference to the state given by the caller, dwarf: &mut Dwarf. Let’s look at two examples of this future being used inside the smol runtime to see why such an implementation can surprise the caller if it is cancelled.

pub async fn work(dwarf: &mut Dwarf) {
    super::mine(dwarf)
        .or(async {
            Timer::after(HALF_SHIFT).await;
        })
        .await;
}

Here the dwarf works for half of the shift and makes some progress. But when we run the test:

let mut dwallin = Dwarf::new(Name::Dwalin);
timeout::work(&mut dwallin).await;
// Bag is empty, but I heard the song!
assert!(dwallin.bag.is_some());

we see that its assertion that some work has been done fails. The reason is that the timeout fired before Dwalin was able to finish his work, and everything he had put into the “bag” in the internal state of the state machine was dropped once we cancelled the operation at the first poll after the timeout. (Try changing Timer::after(HALF_SHIFT) to use FULL_SHIFT and see if it helps 😉). So this is what we may call a cancellation not-safe (or incorrect, as Rain proposed) implementation of an asynchronous operation: it leads to behaviour that we would not expect.

Something more interesting (and closer to the real world) happens if we make our Future implementation more complex and “dirty”:
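One possible way to make the operation cancellation-correct, sketched here under my own assumptions, is to write every finished shift through the caller's reference right away instead of collecting it in a local. Dwarf below is a simplified hypothetical stand-in with a Vec<u32> bag, and Swing, NoopWaker, poll_then_drop, local_after and write_through_after are helper names invented for this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for `pickaxe()`: needs two polls to finish.
struct Swing(bool);
impl Future for Swing {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Simplified, hypothetical `Dwarf`: only the field this post cares about.
#[derive(Default)]
pub struct Dwarf {
    pub bag: Option<Vec<u32>>,
}

/// Same shape as `mine_with_tool`: progress stays in a local until the last line.
async fn mine_local(dwarf: &mut Dwarf, shifts: u32) {
    let mut bag = Vec::new();
    for i in 0..shifts {
        Swing(false).await;
        bag.push(i);
    }
    dwarf.bag = Some(bag);
}

/// Sketch of a cancellation-correct variant: each shift is stored in `dwarf` immediately.
async fn mine_write_through(dwarf: &mut Dwarf, shifts: u32) {
    for i in 0..shifts {
        Swing(false).await;
        dwarf.bag.get_or_insert_with(Vec::new).push(i);
    }
}

/// Polls `fut` at most `polls` times and then drops it.
fn poll_then_drop<F: Future>(fut: F, polls: u32) {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    for _ in 0..polls {
        if fut.as_mut().poll(&mut cx).is_ready() {
            break;
        }
    }
}

pub fn local_after(polls: u32) -> Option<Vec<u32>> {
    let mut dwarf = Dwarf::default();
    poll_then_drop(mine_local(&mut dwarf, 3), polls);
    dwarf.bag
}

pub fn write_through_after(polls: u32) -> Option<Vec<u32>> {
    let mut dwarf = Dwarf::default();
    poll_then_drop(mine_write_through(&mut dwarf, 3), polls);
    dwarf.bag
}
```

After three polls out of the four needed, local_after(3) is None, while write_through_after(3) is Some(vec![0, 1]). The trade-off is that the caller can now observe a partially filled bag, which may or may not be what the rest of the system expects.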

pub async fn work(dwarf: &Mutex<Dwarf>) {
    let mut dwarf_guard = dwarf.lock().await;
    let mut old_bag = dwarf_guard.bag.take().unwrap();
    timeout(HALF_SHIFT, super::mine(&mut dwarf_guard))
        .await
        .err()
        .unwrap();
    if let Some(new_bag) = dwarf_guard.bag.as_mut() {
        new_bag.append(&mut old_bag)
    }
}

I know, Dwarfs are not good at asynchronous programming, but it illustrates the problem really well. Our operation takes state out of the input, holds it internally, then merges it with its own computation results and gives it back.

┌────────────────────────────────────────────────────────────────────────────────────────┐
│  ┌────────────┐    ┌──────────────┐    ┌───────┐    ┌─────────────┐    ┌──────────────┐│
│  │ Lock Mutex │ -> │ Take Old Bag │ -> │ Mine  │ -> │Take New Bag │ -> │  Merge Bags  ││
│  └────────────┘    └──────────────┘    └───────┘    └─────────────┘    └──────────────┘│
│  ┌────────────┐    ┌──────────────┐                                                    │
│->│Release Lock│ -> │    End       │                                                    │
│  └────────────┘    └──────────────┘                                                    │
└────────────────────────────────────────────────────────────────────────────────────────┘

The issue is that the operation is “safe” from the mutex point of view, as we know no one else will change the state of the dwarf. Yet it is not “correct” if we cancel super::mine before it fully completes: old_bag will be dropped, because new_bag never comes into existence (dwarf_guard.bag.as_mut() is None). As the work future is a composition of itself with the super::mine future and the timeout future, our logic breaks, because both futures behave incorrectly from the system perspective (while this is totally expected, valid and safe Rust code, imho).
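One common way to protect taken-out state like old_bag is a guard whose Drop puts it back, because Drop runs both on normal completion and when the future is dropped. The sketch below is a simplified std-only illustration under my own assumptions: it works on a plain &mut Dwarf instead of a Mutex and a runtime timeout, and RestoreBag, Swing, NoopWaker, Dwarf and work_with_poll_budget are hypothetical names.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for `pickaxe()`: needs two polls to finish.
struct Swing(bool);
impl Future for Swing {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Simplified, hypothetical `Dwarf`.
pub struct Dwarf {
    pub bag: Option<Vec<u32>>,
}

/// Same shape as the original miner: sets `dwarf.bag` only at the very end.
async fn mine(dwarf: &mut Dwarf, shifts: u32) {
    let mut bag = Vec::new();
    for i in 0..shifts {
        Swing(false).await;
        bag.push(i);
    }
    dwarf.bag = Some(bag);
}

/// Puts the old bag back when dropped, on completion or on cancellation.
struct RestoreBag<'a> {
    dwarf: &'a mut Dwarf,
    old_bag: Vec<u32>,
}

impl Drop for RestoreBag<'_> {
    fn drop(&mut self) {
        // New finds (if any) first, then the old ones, like `new_bag.append(&mut old_bag)`.
        let bag = self.dwarf.bag.get_or_insert_with(Vec::new);
        bag.append(&mut self.old_bag);
    }
}

async fn work(dwarf: &mut Dwarf, shifts: u32) {
    let old_bag = dwarf.bag.take().unwrap_or_default();
    let mut guard = RestoreBag { dwarf, old_bag };
    mine(&mut *guard.dwarf, shifts).await;
    // `guard` is dropped here, or earlier if the whole future is dropped while pending.
}

/// Polls `work` (3 shifts) at most `polls` times, drops it, returns the dwarf's bag.
pub fn work_with_poll_budget(polls: u32) -> Option<Vec<u32>> {
    let mut dwarf = Dwarf { bag: Some(vec![100]) };
    {
        let mut fut = Box::pin(work(&mut dwarf, 3));
        let waker = Waker::from(Arc::new(NoopWaker));
        let mut cx = Context::from_waker(&waker);
        for _ in 0..polls {
            if fut.as_mut().poll(&mut cx).is_ready() {
                break;
            }
        }
    } // dropping `fut` runs `RestoreBag::drop` if `work` was still pending
    dwarf.bag
}
```

With this guard, work_with_poll_budget(2) (cancelled) leaves Some(vec![100]) in place, and work_with_poll_budget(4) (completed) gives Some(vec![0, 1, 2, 100]). The new finds of a cancelled mine are still lost, so this only repairs the old_bag half of the problem.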

The Comparison

Runtime Architecture Overview

Tokio: The Industry Standard

Tokio is the most widely used async runtime in the Rust ecosystem. Its architecture focuses on:

Key characteristics:

smol: The Minimal Approach

smol takes a radically different approach with:

Key characteristics:

Glommio: I/O Performance Focus

Glommio represents a specialized runtime designed for high-performance I/O workloads:

Key characteristics:

Cancellation Scenarios Analysis

Most of the scenarios work similarly across runtimes:

But some of them have nuances:

On the documentation side, this topic is only covered in the Tokio API docs. smol has one small (ha-ha) note on this matter: “Note that canceling a task actually wakes it and reschedules one last time. Then, the executor can destroy the task by simply dropping its Runnable or by invoking run().” In glommio I was unable to find any mention of cancellation safety or correctness. I think there are two reasons why it is what it is:

  1. Tokio is much more popular and widely used, and thus has more resources to spend on documentation;
  2. Tokio has a much more “dangerous” API and an internal executor model that make it easier to run into cancellation safety issues.

I hope you found something new in this blog post. Write your thoughts in the comments and check the additional resources if you want to.

Additional Resources and Sources of inspiration
