| Document Number: | P3456R0 |
|---|---|
| Date: | 2024-10-15 |
| Target: | SG1 |
| Revises: | |
| Reply to: | Gor Nishanov (gorn@microsoft.com) |
This document compares the features offered by the global threadpool on Windows and macOS with the facilities proposed in the P2079R4 System Execution Context paper. It explores extensions to P2079R4 that capture most of the primitives offered by the operating systems and outlines implementations for Windows, macOS, and Linux.
Win32 and Darwin offer global system threadpools. The following table summarizes their features alongside p2079r4 and Boost.Asio:
| | p2079r4 | Windows | macOS | Asio |
|---|---|---|---|---|
| schedule | + | + | + | + |
| schedule_at | | + | + | + |
| defer | | | | + |
| bulk | + | | + | |
| i/o | | + | + | + |
| priorities | | + | + | |
In subsequent sections we outline how much parity with system threadpools can be achieved with a C++26 system context and what can be added later.
As a starting point, we will use p2079r4's system_scheduler and, over the course of this paper, expand it to cover most of the functionality offered by operating system global threadpools and Boost.Asio.
system_scheduler get_system_scheduler();
class system_scheduler {
public:
system_scheduler() = delete;
bool operator==(const system_scheduler&) const noexcept;
std::execution::forward_progress_guarantee get_forward_progress_guarantee() noexcept;
sender auto schedule();
sender auto bulk(integral auto i, auto f);
};
The Win32 threadpool offers three priority levels: high, normal, and low. When a thread comes to pick up work, it takes work from a lower-priority queue only when there are no work items in the higher-priority queues. macOS offers an additional priority level, background, that is lower than low; all work with background priority will not run when low-power mode is enabled.
To reach parity, we recommend extending get_system_scheduler() to take an optional priority parameter that takes advantage of the underlying system threadpools. On systems without an OS threadpool, this can easily be supported by providing a queue per priority level.
enum class system_scheduler_priority {
  background,
low,
normal,
high,
};
system_scheduler get_system_scheduler(system_scheduler_priority priority = system_scheduler_priority::normal);
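The queue-per-priority fallback mentioned above can be sketched as follows. This is a minimal illustration, not the proposed implementation; `priority_queues` and its members are illustrative names, and synchronization between worker threads is omitted for brevity:

```cpp
#include <array>
#include <deque>
#include <functional>
#include <optional>
#include <utility>

// One FIFO queue per priority level; a worker thread always drains
// higher-priority queues before lower ones, matching the Win32 behavior
// described earlier.
struct priority_queues {
    // index 0 = background, 1 = low, 2 = normal, 3 = high
    std::array<std::deque<std::function<void()>>, 4> queues;

    void push(int priority, std::function<void()> f) {
        queues[priority].push_back(std::move(f));
    }

    // Pop the next item from the highest non-empty priority queue.
    std::optional<std::function<void()>> pop() {
        for (int p = 3; p >= 0; --p)
            if (!queues[p].empty()) {
                auto f = std::move(queues[p].front());
                queues[p].pop_front();
                return f;
            }
        return std::nullopt;
    }
};
```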
On Darwin, it can be implemented as:
class system_scheduler {
...
using native_handle_type = dispatch_queue_t;
native_handle_type native_handle();
friend system_scheduler get_system_scheduler();
private:
  explicit system_scheduler(native_handle_type h) : handle{h} {}
native_handle_type handle{};
};
system_scheduler get_system_scheduler(system_scheduler_priority priority = system_scheduler_priority::normal) {
  static const array<long, 4> arr = {
    DISPATCH_QUEUE_PRIORITY_BACKGROUND,
    DISPATCH_QUEUE_PRIORITY_LOW,
    DISPATCH_QUEUE_PRIORITY_DEFAULT,
    DISPATCH_QUEUE_PRIORITY_HIGH,
  };
  auto queue = dispatch_get_global_queue(arr[to_underlying(priority)], 0);
  return system_scheduler(queue);
}
On Windows:
class system_scheduler {
...
using native_handle_type = TP_CALLBACK_ENVIRON*;
native_handle_type native_handle();
friend system_scheduler get_system_scheduler();
private:
  explicit system_scheduler(native_handle_type h) : handle{h} {}
native_handle_type handle{};
};
TP_CALLBACK_ENVIRON env_for_priority(TP_CALLBACK_PRIORITY pri) {
TP_CALLBACK_ENVIRON result;
InitializeThreadpoolEnvironment(&result);
SetThreadpoolCallbackPriority(&result, pri);
return result;
}
// Windows offers no background priority level; map background to low.
array<TP_CALLBACK_ENVIRON, 4> arr = {
  env_for_priority(TP_CALLBACK_PRIORITY_LOW),
  env_for_priority(TP_CALLBACK_PRIORITY_LOW),
env_for_priority(TP_CALLBACK_PRIORITY_NORMAL),
env_for_priority(TP_CALLBACK_PRIORITY_HIGH),
};
system_scheduler get_system_scheduler(system_scheduler_priority priority = system_scheduler_priority::normal)
{
  return system_scheduler(&arr[to_underlying(priority)]);
}
One design point is whether normal should be the default enum value, i.e., system_scheduler_priority{} == system_scheduler_priority::normal. If we choose to do that, the enum will look like:
enum class system_scheduler_priority : int {
background = -2,
low = -1,
normal = 0,
high = 1,
};
Bulk returns a sender describing the task of invoking the provided function with every index in the provided shape along with the values sent by the input sender. For example, the following coroutine will perform matrix multiplication using bulk.
task<void> matmul(float* xout, float* x, float* w, int n, int d) {
auto sh = get_system_scheduler();
// Parallel matrix multiplication using bulk.
// W (d,n) * x (n) -> xout (d)
co_await sh.bulk(d, [&](size_t i) {
float val = 0.0f;
for (int j = 0; j < n; j++)
val += w[i * n + j] * x[j];
xout[i] = val;
});
}
Darwin libdispatch offers dispatch_apply_f, which submits a single function to a dispatch queue and causes the function to be executed the specified number of times.
void dispatch_apply_f(size_t iterations, dispatch_queue_t queue, void *context, void (*work)(void *context, size_t iteration));
Windows does not have an equivalent, but such a facility can be readily implemented as an algorithm on top of non-bulk schedule.
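A minimal sketch of such an algorithm follows. It uses std::thread as a stand-in for posting a work item via schedule(); `bulk_via_schedule` is an illustrative name, and a real implementation would dispatch the worker loops onto the threadpool and complete a sender when the counter is exhausted:

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Emulate bulk(shape, f) on top of a plain "schedule one work item"
// primitive: spawn a bounded number of workers, each repeatedly claiming
// the next unprocessed index from a shared atomic counter.
void bulk_via_schedule(std::size_t shape, std::function<void(std::size_t)> f,
                       unsigned workers = std::thread::hardware_concurrency()) {
    if (workers == 0) workers = 1;          // hardware_concurrency may be 0
    std::atomic<std::size_t> next{0};       // next index to hand out
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            // Each worker claims indices until the shape is exhausted.
            for (std::size_t i; (i = next.fetch_add(1)) < shape; )
                f(i);
        });
    for (auto& t : pool) t.join();          // completion = all items done
}
```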
In a few tests we ran on Darwin, running the bulk algorithm over a scheduler is more performant than dispatch_apply_f.
Additionally, the bulk algorithm supports cancellation, which can prompt early exit from the calculation when the result is no longer needed.
Another observation is that on MSVC, running parallel algorithms (which internally use the global threadpool) completely starves all other threadpool work until the parallel computation is completed. This appears to be suboptimal behavior.
We would like to propose that bulk should:
This feature is not available in Win32 or Darwin; it is only offered by boost::asio and earlier revisions of the executors proposal.
It allows deferring execution of a work item until immediately after the current work item returns control to the threadpool.
The implementation is straightforward: a scheduler maintains a thread-local queue that accumulates deferred items and executes them all immediately after the currently executing work item finishes.
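The thread-local queue can be sketched as follows. This is an illustration of the mechanism only (`deferred`, `defer`, and `run_work_item` are hypothetical names); a real scheduler would wire this into its sender machinery:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Per-thread queue of deferred work items.
thread_local std::vector<std::function<void()>> deferred;

// defer() merely records the item; nothing runs yet.
void defer(std::function<void()> f) { deferred.push_back(std::move(f)); }

// The threadpool thread runs a work item, then drains all deferrals
// before picking up the next queued item. A deferred item may itself
// defer further work, so drain in batches until the queue is empty.
void run_work_item(std::function<void()> item) {
    item();
    while (!deferred.empty()) {
        auto batch = std::move(deferred);
        deferred.clear();
        for (auto& g : batch) g();
    }
}
```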
system_scheduler get_system_scheduler();
class system_scheduler {
public:
system_scheduler() = delete;
bool operator==(const system_scheduler&) const noexcept;
std::execution::forward_progress_guarantee get_forward_progress_guarantee() noexcept;
sender auto schedule();
sender auto defer(); // <-- freshly added
sender auto bulk(integral auto i, auto f);
};
Win32, Darwin, and Asio threadpools support scheduling work items based on time (absolute or relative).
The underlying implementation maintains a data structure that keeps the scheduled work items in order of expiration. Usually it is a pairing heap (O(1) insert, O(log n) extraction of the expiring item).
To reduce the impact of O(log n), Windows gives the user an opportunity to indicate whether it is OK to delay execution of a work item within a specified window, thus allowing some of the timed items to be grouped together and reducing the total number of elements maintained by the heap.
Windows also supports cancellation of timed work items, whereas Darwin does not.
Thus, on Darwin, support for timed operations will need to rely on a pairing heap maintained by the C++ runtime if we would like to support cancellation. Since the heap is then in the C++ library implementor's hands, we can also add support for a coalescing window, increasing performance when the number of timed items is high.
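The coalescing idea can be sketched as follows (an assumption about one plausible implementation, not the paper's specified design; std::map stands in for the pairing heap). A timer due at `deadline` with tolerance `window` may fire anywhere in [deadline, deadline + window], so it can piggyback on any already-registered expiry inside that range, reducing the number of distinct expiration points:

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <vector>

using steady = std::chrono::steady_clock;

struct timer_queue {
    // Expiration point -> callbacks coalesced onto that point.
    std::map<steady::time_point, std::vector<std::function<void()>>> slots;

    void add(steady::time_point deadline, std::chrono::milliseconds window,
             std::function<void()> cb) {
        // First existing slot at or after the deadline (a timer must never
        // fire early, so earlier slots are not candidates).
        auto it = slots.lower_bound(deadline);
        if (it != slots.end() && it->first <= deadline + window)
            it->second.push_back(std::move(cb));   // coalesce onto that slot
        else
            slots[deadline].push_back(std::move(cb)); // new expiration point
    }
};
```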
template <typename Rep, typename Ratio>
schedule_after_sender system_scheduler::schedule_after(std::chrono::duration<Rep, Ratio> delay, milliseconds window = {}) const noexcept;
template <typename Clock, typename Duration>
schedule_at_sender system_scheduler::schedule_at(std::chrono::time_point<Clock, Duration> abs_time, milliseconds window = {}) const noexcept;
We chose to keep the window in milliseconds as it matches the Win32 API. While it would be possible to specify it in microseconds or nanoseconds (or as a templated duration), we settled on milliseconds based on simplicity and the Windows experience.
One fundamental feature of OS-provided threadpools is that they have visibility into what threads are doing and can make intelligent decisions about when to remove or add a thread to the threadpool.
We explored two contending approaches: installing system-wide instrumentation to observe thread state, and sampling per-thread statistics from /proc.
The first approach seems suboptimal as it forces system-wide overhead on all processes and typically requires root privilege to install.
Sampling /proc/self/task/{}/stat seems more promising. This is the approach used by libdispatch on Linux. Moreover, the overhead of sampling can be reduced significantly by utilizing io_uring to sample thousands of thread stats in one kernel transition.
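For illustration, a synchronous (non-io_uring) sampler can be sketched as below. This is Linux-specific and `sample_thread_states` is a hypothetical helper; the state letter ('R' running, 'S' sleeping, 'D' uninterruptible wait, ...) is the field immediately after the parenthesized command name in each stat line:

```cpp
#include <filesystem>
#include <fstream>
#include <map>
#include <string>

// Count this process's threads by state via /proc/self/task/<tid>/stat.
std::map<char, int> sample_thread_states() {
    std::map<char, int> counts;
    for (auto& dir : std::filesystem::directory_iterator("/proc/self/task")) {
        std::ifstream stat(dir.path() / "stat");
        std::string line;
        if (!std::getline(stat, line)) continue;
        // The comm field may contain spaces and parentheses, so locate the
        // last ')' and read the state character two positions after it.
        auto pos = line.rfind(')');
        if (pos != std::string::npos && pos + 2 < line.size())
            ++counts[line[pos + 2]];
    }
    return counts;
}
```

An io_uring-based variant would queue one read per stat file and reap all completions in a single kernel transition.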
Knowledge of the threadpool thread state allows implementing the desired policies to grow and shrink the threadpool as needed.
One open question is what to do when a pending work item (scheduled via schedule() or as a result of an I/O completion) is cancelled. We can see several strategies: conservative (leave the item in the queue and complete it as cancelled when dequeued), eager (run the cancellation immediately on the cancelling thread), defer (run it via the defer queue of the cancelling thread), and prioritized (move the cancelled item to the front of the queue).
Of these options, the conservative one looks least problematic: while it does not speed up completion of the cancelled item if there are many items ahead of it in the queue, it does not disturb the execution of other items, and if there are many cancellations, they will be handled by the threadpool concurrently.
Eager can lead to deadlocks, stack overflows, and slower cancellation, as the cancelled item would be executed by the thread initiating the cancellation.
Defer is immune from the deadlock but has the same issue of running the cancellation on the same thread.
Finally, prioritized favors cancelled items over regular items. It may or may not be beneficial over the conservative approach depending on the exact scenario.
Thus, the conservative approach is recommended as the default.
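The conservative strategy can be sketched as follows (all names are illustrative): cancelling a pending item only flips a flag; the item keeps its place in the queue, and the threadpool thread that dequeues it completes it in a cancelled state instead of invoking the work function.

```cpp
#include <atomic>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

struct work_item {
    std::atomic<bool> cancelled{false};
    std::function<void()> run;        // invoked on normal completion
    std::function<void()> on_cancel;  // invoked on cancelled completion
};

// Threadpool drain loop: cancelled items are not removed early; they are
// completed (as cancelled) in queue order, so other items are undisturbed.
void drain(std::deque<std::shared_ptr<work_item>>& queue) {
    while (!queue.empty()) {
        auto item = std::move(queue.front());
        queue.pop_front();
        if (item->cancelled.load())
            item->on_cancel();
        else
            item->run();
    }
}
```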
The most likely implementation of a timer (in cases when such a facility is not available in the OS) would be to maintain a heap of timer items in order of expiration and use a threadpool control thread or other facilities to wait until the nearest timer expires. Once it expires, the work item is queued into a threadpool queue as if it were scheduled via schedule_sender.
If a timed work item is cancelled prior to expiration, it is also posted into the threadpool, in a cancelled state.
libunifex explored how sender/receiver I/O may look. For example:
task<void> read(scheduler auto sched) {
auto in = open_file_read_only(sched, "file.txt");
std::array<char, 1024> buffer;
co_await async_read_some_at(
in, 0, as_writable_bytes(span{buffer.data(), buffer.size()}));
}
While it is unlikely that we will be able to specify an I/O suite covering the various aspects of I/O in time for C++26, we can expose native_handle() on the system_scheduler, which would allow developers to write libraries on top providing I/O support that executes completions on the system scheduler.
Earlier, in the priority section, we already got a sneak preview of native_handle. In this section we show how it can be used to support I/O.
On Windows, native_handle is an alias for TP_CALLBACK_ENVIRON*, which carries information about which threadpool to use and the priority of the items being scheduled. It can be used with existing Win32 threadpool APIs as follows:
auto io = CreateThreadpoolIo(file, cb, context,
get_system_scheduler().native_handle());
Of course, we expect that users will have nice C++ wrappers around the raw Win32 APIs, but the purpose of this example is to show how to make the C++ system context interact with components that expect a Windows threadpool environment (TP_CALLBACK_ENVIRON*) as an argument.
Similarly, on Darwin, it would look like:
dispatch_read(f, 1024, get_system_scheduler().native_handle(),
^(dispatch_data_t data, int error) {...});
It is likely that a Linux implementation will use io_uring to send commands and receive completions, and it will subsequently post the completion handler to the threadpool as if by schedule(). Thus, on Linux, the native handle needs to carry enough context for users to be able to develop their own I/O wrappers on top.
Thus, in order to expose most of the features of the underlying OS global threadpool, the system_context would look like:
enum class system_scheduler_priority {
  background,
low,
normal,
high,
};
system_scheduler get_system_scheduler(system_scheduler_priority priority = system_scheduler_priority::normal);
class system_scheduler {
public:
system_scheduler() = delete;
bool operator==(const system_scheduler&) const noexcept;
std::execution::forward_progress_guarantee get_forward_progress_guarantee() noexcept;
using native_handle_type = implementation-defined;
native_handle_type native_handle();
sender auto schedule();
sender auto defer();
sender auto bulk(integral auto i, auto f);
  template <typename Rep, typename Ratio>
  schedule_after_sender schedule_after(
      std::chrono::duration<Rep, Ratio> delay,
      milliseconds window = {}) const noexcept;
  template <typename Clock, typename Duration>
  schedule_at_sender schedule_at(
      std::chrono::time_point<Clock, Duration> abs_time,
      milliseconds window = {}) const noexcept;
};
This paper was coming in hot for the deadline. We intend to clean it up and improve it before it is seen in Poland.
[p2300] https://wg21.link/p2300 std::execution
[p2079r4] https://wg21.link/p2079r4 System execution context
[asio-executors] https://github.com/chriskohlhoff/executors/
[win32 threadpool details] https://www.youtube.com/watch?v=CzgNVuXVMWo
[libdispatch] https://developer.apple.com/documentation/dispatch?language=objc
[io_uring] https://kernel.dk/io_uring.pdf