Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boxing large futures is unsubstantiated #83

Open
zetanumbers opened this issue Oct 16, 2024 · 2 comments
Open

Boxing large futures is unsubstantiated #83

zetanumbers opened this issue Oct 16, 2024 · 2 comments

Comments

@zetanumbers
Copy link

In the context of da62e65, pay attention to this code:

// src/runnable.rs:515

// Allocate large futures on the heap.
let ptr = if mem::size_of::<Fut>() >= 2048 {
    let future = |meta| {
        let future = future(meta);
        Box::pin(future)
    };

    RawTask::<_, Fut::Output, S, M>::allocate(future, schedule, self)
} else {
    RawTask::<Fut, Fut::Output, S, M>::allocate(future, schedule, self)
};

The future is boxed whenever its size is greater or equal to 2048 bytes. Comment tries to elaborate on this condition but it seems not to notice that RawTask::allocate already places the future on the heap next to the header. I don't see any reason to additionally box the future even if it is large by itself.

However one might argue that running future shares space with its output result after the completion, thus boxed future would deallocate the space which that future would have occupied as soon as it completes, possibly saving some space until its Task object is consumed in some way. Then however the right condition would be to compare future's size with its output type's size, to make sure there is a substantial difference between them to save on:

mem::size_of::<Fut>() >= mem::size_of::<Result<Fut::Output, Panic>> + 2048

Even then that approach would be unsubstantiated, due to not having data to show any practical benefit. There could be not enough delay while completed task is held in memory to cause substantial memory usage cost.

It also introduces an additional pointer indirection which cannot be avoided, even if user would have liked to do so. However in case this branch is removed, user would be able to box up the future on their side to reach the old behavior. Given this library's purpose is to use it to "create a custom async executor", I am inclined to believe latter choice would be more appropriate, while this behavior and possible optimization should probably be noticed in the documentation as well.

I've even made some simple benchmarks to check how the code would perform with and without the Box::pin branch:

#[bench]
fn large_task_create(b: &mut Bencher) {
    b.iter(|| {
        let data = [0; 0x40000];
        let _ = async_task::spawn(
            async move {
                black_box(data);
            },
            drop,
        );
    });
}

#[bench]
fn large_task_run(b: &mut Bencher) {
    b.iter(|| {
        let data = [0; 0x40000];
        let (runnable, task) = async_task::spawn(
            async move {
                black_box(data);
            },
            drop,
        );
        runnable.run();
        future::block_on(task);
    });
}

I have found no noticeable difference, while roughly accounting for instability of benchmark results.

This issue also relates to #66, which highlights one of drawbacks of this decision.

@taiki-e
Copy link
Collaborator

taiki-e commented Oct 16, 2024

AFAIK, boxing is to avoid stack overflows.
However, I think the boundary should be larger in the release mode like tokio does.
https://github.com/tokio-rs/tokio/blob/512e9decfb683d22f4a145459142542caa0894c9/tokio/src/runtime/mod.rs#L400

Another approach that would may work is to use #[inline(always)] to avoid intermediate allocations on the stack.

@zetanumbers
Copy link
Author

AFAIK, boxing is to avoid stack overflows.

That would be good, but I only can see the opposite being true. Particularly if you change data array size from 0x4_0000 to 0x10_0000 from aforementioned benchmarks you'll get the stack overflow until you actually delete that branch. This indicates that the optimizer is able to perform in-place initialization, even if it doesn't use Box type, which does make sense considering there's no significant difference for LLVM. However it also seem like that the optimizer struggles with the added complexity from this if branch.

There could have been a possibility that this change did help to avoid stack overflow, considering it did not create a new closure back then change was introduced, but called Box::pin directly inside of the discussed branch (see d3011bf). Even if that was true (besides that nobody paid attention or got any stack overflows I've encountered), you can make a conclusion this feature is not functional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants