ISO/IEC JTC1 SC22 WG21 N2974 = 09-0164 - 2009-09-27
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Introduction
Action Analysis Async/Future
Actions at the Send
Actions at the Receive
Action Combinations
Other Issues
Limiting Threads
Thread Pools
Speculative Concurrency and Execution
Design and Implementation Choices
Current Operations
Alternate Operations
Various Uses
Code for Query-Like Methods
The intent of the `async` and `future` combination is to potentially initiate a task concurrently, and then retrieve its return value or exception at a later time. The semantics of `async` and `future` have come under considerable debate. Part of the debate arises from the expected use, part arises from implicit assumptions about implementation, and part arises from a lack of clarity in the implications of certain operations. This paper is intended to clarify the latter two parts of the debate.
The implementation of `async` and `future` requires actions at two points, the 'send' and the 'receive'. The 'send' is just the call to `async`. The 'receive' is the call to `get` or any other call that might be used to decide whether or not to call `get`. For the purposes of this discussion, we assume a 'unique' future.
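For concreteness, a minimal sketch of the two points, using the spellings that appear later in this paper (the task `g` is a placeholder):

```cpp
#include <future>

int g(); // some task

void example() {
    std::unique_future<int> f = std::async( g ); // the 'send'
    // ... other work ...
    int v = f.get();                             // the 'receive'
}
```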
Clarifying the actions and their consequences really requires a table, but that table is too big to understand at once, so we first describe the attributes and then present the table.
The available actions are as follows. With the 'enqueue' action, a call to `get` may remove the task from the queue and execute it sequentially. Thus, this action enables speculative concurrency. With an operation to simply remove the task from the queue, this technique enables speculative execution as well. With the 'defer' action, execution waits for the receiver; if `get` is never called, the task is never executed.
The states indicate the communication between asynchronous threads that might affect the future's storage. They are otherwise self-explanatory.
The use of `async` necessarily imposes some restrictions on the type of synchronization that the tasks may employ. A 'get-free' task must not wait for any action that follows the receiver's call to `get`; because `get` does not return until the task completes, such waiting would deadlock. Most sender actions have this constraint. An 'async-free' task must not wait for any action that follows the call to `async`; this is the stronger constraint. A 'delayable' task is one on whose timely start the program does not depend.
The following table provides the various attributes of the various actions.

| Action | Benefits | States | Restrictions |
|---|---|---|---|
| Fork | responsive | running, complete | get-free |
| Delay | thread limiting | running, complete | get-free, delayable |
| Enqueue | thread limiting, speculative concurrency, speculative execution | enqueued, running, complete | get-free, delayable |
| Invoke | indirect sequential | | async-free |
| Defer | direct sequential, thread limiting, speculative concurrency, speculative execution | | get-free, delayable |
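For reference in the code fragments later in this paper, the vocabulary of this table can be rendered as plain enumerations; this is a sketch only, not proposed interface:

```cpp
// The sender actions and task states named in the table above.
enum action { fork, delay, enqueue, invoke, defer };
enum state  { enqueued, running, complete };
```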
Again, clarifying the actions and their consequences requires first describing the attributes and then presenting the table.
Implementing the receive actions requires varying amounts of synchronization.
The following table shows the possible receive actions for each send action, and lists the synchronization required for that combination. An empty entry indicates that the combination is not possible.
| Sender Action | Retrieve | Wait | Check | Dequeue | Call | Morph |
|---|---|---|---|---|---|---|
| Fork | acquire | wait | test | | | |
| Delay | acquire | wait | test | | | |
| Enqueue | acquire | wait | test | lock | lock | |
| Invoke | reuse | | | | | |
| Defer | reuse | | | | reuse | reuse |
There are some combinations of sender and receiver actions that are at best marginally useful. We highlight these combinations with the intent that they may not be supported.
The 'invoke' action has only one advantage over the 'defer' action, which is to shift the task restrictions from get-free to async-free and to shift the use of thread-local variables from the `get` to the `async`. However, these advantages are actually disadvantages, because no other action has those properties. The 'invoke' action also prevents speculative execution, which is a desirable capability. Furthermore, the 'invoke/retrieve' combination has higher overhead than the 'defer/call' combination, which does support speculative execution. That is, the 'defer' action is strictly more useful than the 'invoke' action.
Delayed forking of a task seems less useful than enqueuing the task. However, it has a lower synchronization, storage, and implementation burden. Furthermore, for those situations in which it is desirable that receivers not execute tasks sequentially, delayed forking provides assurance that the task will execute on another thread.
The 'defer' sender action leaves significant choices to the receiver. A receiver 'fork' followed by 'wait' seems rarely desirable in comparison to 'call'; it appears to offer only a more controlled environment for the use of thread-local storage, at the high cost of increased thread overhead for no additional concurrency. A 'fork' or 'delay' sender action seems more appropriate to those circumstances.
There are two approaches to limiting the number of threads.
When applications desire to use speculative concurrency, tasks that are speculative should be assigned lower priority than those that are not speculative. Tasks that may never be executed at all should have an even lower priority. It may be difficult to determine when a task may never be executed, so one possible approach is to execute such tasks only after some demand has been made for the future's result.
It is often difficult to determine a priori when potential concurrency in an application is valuable. One mechanism to obtain decent performance in this environment is to make heavy use of nested tasks with 'enqueue/call', relying on the majority of tasks to be executed sequentially, while still leaving opportunity for tasks to be executed concurrently. However, in this environment, first-come-first-served scheduling tends to produce a pathologically large number of concurrent tasks. Better is to have each processor maintain its own double-ended queue. A processor usually pushes and pops from the front of that queue. However, when the queue is empty, the processor can steal from the rear of another processor's queue. With this approach, processors tend to steal large tasks that other processors are not working on, and the number of tasks executed concurrently tends to be much more manageable.
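As a sketch of that structure (the class and its locking are illustrative only; a production scheduler would likely use a lock-free deque), each processor owns one of these queues:

```cpp
#include <deque>
#include <functional>
#include <mutex>

// Illustrative work-stealing queue: the owner pushes and pops at the
// front; idle processors steal from the rear, tending to take the
// large, old tasks the owner is not working on.
class work_queue {
    std::deque< std::function<void()> > tasks;
    std::mutex guard;
public:
    void push( std::function<void()> t ) {      // owner only
        std::lock_guard<std::mutex> hold( guard );
        tasks.push_front( t );
    }
    bool pop( std::function<void()>& t ) {      // owner only
        std::lock_guard<std::mutex> hold( guard );
        if ( tasks.empty() ) return false;
        t = tasks.front();
        tasks.pop_front();
        return true;
    }
    bool steal( std::function<void()>& t ) {    // other processors
        std::lock_guard<std::mutex> hold( guard );
        if ( tasks.empty() ) return false;
        t = tasks.back();
        tasks.pop_back();
        return true;
    }
};
```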
Whenever multiple sender actions exist, the implementation needs policies for choosing which action to take. The simplest policy is to execute a user-specified action. However, more useful policies offer the implementation a range of actions within which it can choose to best use system resources.
We anticipate policies that differ in the set of actions they are permitted to perform.
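One plausible encoding (the names are illustrative, not proposed) represents a policy as the set of sender actions from which the implementation may choose:

```cpp
// Sketch: a policy is a bitmask of permitted sender actions.
enum policy {
    may_fork    = 1 << 0,
    may_delay   = 1 << 1,
    may_enqueue = 1 << 2,
    may_defer   = 1 << 3,
    any_action  = may_fork | may_delay | may_enqueue | may_defer
};
```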
Surprisingly, thread pools have no impact on the discussion above. However, the implementation of both `async` and `future` may be significantly affected. In particular, the need to establish a happens-before relationship between the last thread-local-storage destruction and returning the future value requires an intrusive implementation of `async`, a modification of futures to store a `std::thread` for joining, or new language facilities.
Existing future designs and implementations are based on a set of actions that the future supports. As the set of actions expands, the representation may also increase. In this section, we look at some of the representations.

The representation and storage required by a future depend on the actions supported by that future. The representation is the fields allocated within the class itself. The storage is the dynamic data.
| Future Design | Sender Actions | Receiver Actions | Representation | Storage |
|---|---|---|---|---|
| N2671 `unique_future` (working draft) | Fork, Delay | Retrieve, Wait, Check | storage pointer | value/exception, mutex&condvar |
| N2889 `joining_future` | Fork, Delay, Defer | Wait&Retrieve, Call | storage pointer*, `std::thread`, `std::function` | value/exception, mutex&condvar |
| N2901 `unique_future` | Fork, Delay, Defer | Retrieve, Wait, Check, Call, Dequeue, Morph | storage pointer*, `std::function` | value/exception, mutex&condvar |
| this paper, excepting invoke | Fork, Delay, Enqueue, Defer | Retrieve, Wait, Check, Call | storage pointer*, `std::function` | `std::thread`, `std::function`, value/exception, mutex&condvar, queue handling |
FIX: We don't need all fields in all copies of the future.
The draft future design from N2671 is underspecified with respect to `async` and thus cannot distinguish between the fork, delay, and invoke actions. Paper N2889 added an additional future that supported the 'fork/delay' and 'defer&call' actions. Paper N2901 implicitly extended futures to support 'fork/call' and 'defer&call'.
In this section we look at operations on `unique_future` as defined in the working draft.

The use of `unique_future` in data structures is hampered by the lack of a coordinated set of operations: a default constructor, a move assignment operator, and a test for the default. The `std::thread` class has such operations so as to enable storing threads in data structures.
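By analogy, such a coordinated set would permit code like the following sketch (the member `is_default` is hypothetical, as are `g` and `use`):

```cpp
#include <future>
#include <vector>

int g();         // some task
void use( int ); // some consumer

void sketch() {
    std::vector< std::unique_future<int> > table( 10 ); // default construction
    table[3] = std::async( g );                         // move assignment
    if ( ! table[4].is_default() )                      // test for the default
        use( table[4].get() );
}
```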
Furthermore, an 'empty' value associated with default constructors can lead to defined exceptional behavior rather than the current undefined behavior for double `get` operations.

```cpp
std::unique_future<moveable> f = std::async( g );
moveable v = f.get(); // fine
moveable w = f.get(); // 'cannot get from empty' exception
```
These problems are addressed by N2888 and successors.
Anthony Williams points out a problem with the return type of `get`:

> ... given "`T&& get();`", "`T& v = f.get();`" requires that the reference is valid beyond the call to `get()`; specifically, it requires that it is valid long enough for the constructor of `v` to use it.
>
> This means that `get()` cannot free the internal storage for the value, even if it marks the future as "empty".
>
> Consequently, you may have "empty" futures that actually have a very large amount of internal storage associated with them, and even a valid object of some resource-owning type if that type does not have a move constructor. In this case, I would rather that the future remained "ready", and a subsequent call to `get()` was permitted.
He also points out that the freedom to "implement the `get()` as a call to a stored `std::function` and never store the result" requires returning by value rather than rvalue reference.
The `is_ready` operation is ambiguous with respect to the 'call' action. Does `f.is_ready()` returning true mean that the value has been computed and stored, or that obtaining the value does not require waiting on another thread? The former definition could lead to problematic behavior because, for 'deferred' tasks, `f.is_ready()` will never become true and hence the call to `get` may be indefinitely delayed.
On the other hand, the latter definition could lead to unexpected delays due to unintended sequential evaluation. Furthermore, the latter definition could lead to retractions. Consider an 'enqueued' future. Its `is_ready()` returns true, then the scheduler assigns that future to a thread, and some subsequent `is_ready()` calls would return false. However, the caller may have already decided to call `get` and thus be subject to synchronization delays.
Assuming the latter definition for `is_ready`, sequential tasks will not have been executed at the return from a call to `is_ready`. In that case, what should `has_value` and `has_exception` return? The information necessary for a precise answer is not available. We must either provide a convention that `!f.has_value() && !f.has_exception()` means that the value has not been computed, or we must extend the interface. Since exceptions are by definition rare, these operations seem to be of little value, and given their semantic difficulties they may well warrant removal from the interface.
The `wait` operations were designed to reflect internal condition variable waits to the user of the future. Unfortunately, the timed waits have no corresponding operations on `std::thread`. To enable timed waits, one must implement waiting on a condition variable, even when timed waits are not actually used and a `join` on the `std::thread` would be sufficient. Even using a condition variable may yield unexpected waiting when the currently-still-necessary `join` on the controlling thread waits for destruction of thread-local variables.
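A sketch of that burden, assuming a shared state holding a mutex and condition variable (the names are illustrative):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch: state carried solely to support timed waits, even when a
// join on the underlying std::thread would otherwise suffice.
struct shared_state {
    std::mutex m;
    std::condition_variable cv;
    bool ready;

    shared_state() : ready( false ) {}

    void set_ready() {                       // called by the task when done
        std::lock_guard<std::mutex> hold( m );
        ready = true;
        cv.notify_all();
    }

    template< class Rep, class Period >
    bool timed_wait( const std::chrono::duration<Rep, Period>& d ) {
        std::unique_lock<std::mutex> lock( m );
        return cv.wait_for( lock, d, [this]{ return ready; } );
    }
};
```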
Anthony Williams comments:

> This suggests to me that `std::thread` is lacking a timed-join function. Such a function is implementable on both POSIX and Windows platforms. On Windows there is no overhead, as the platform supports it natively. On POSIX the cost is larger. It does mean that `std::thread` can't be a simple wrapper around `pthread_t` (e.g. so you can store a semaphore for `sem_timedwait`, or use a condition variable), but you generally need more than that anyway since `pthread_t` doesn't support all the operations required of `std::thread::id`.
>
> Anyway, you could add this overhead exclusively to those threads that were tied to async-launched futures, even if it wasn't added to `std::thread`.
Reformulating the future operations may well provide interactions that are more robust to varying implementations. In order to outline these operations, we concentrate on the various ways futures might be used.
Another important fact is whether or not the computation has completed, that is, whether or not the value or exception is stored. That operation is `is_done`.
Ideally, though, user-level operations should not be calling functions that are specific to particular sender actions. We want operations that mean roughly the same thing regardless of the actions used to create the future. So, we try to recast the operations in those terms.
Before we can simplify uses, we must define some tools.
Central to how a future might be used is whether or not the future 'owns' the task. A future that is 'deferred', and hence may be 'called', owns the task. On the other hand, a future cannot own a task that is currently executing on a thread. More subtle is an enqueued task, which is neither owned by the future nor committed to another thread. In this case, we need an operation to resolve the situation, that is, to make the future take the task from the queue and thus make it callable. That operation is `be_owner`, which returns true if either the future already owned the task or it was successful in obtaining ownership.

```cpp
bool is_owner() { return sender_action == defer; }

bool be_owner() {
    if ( sender_action == defer )
        return true;       // already own the task
    else if ( sender_action != enqueue )
        return false;      // task committed to another thread
    else
        return dequeue();  // whether or not it worked
}
```
Likewise, whether or not a task has already started, and is thus committed to concurrency, is important.

```cpp
bool is_started() { return sender_action != defer && state != enqueued; }
```
Likewise, whether or not a task has already completed, and thus its value is ready for copy/move, is important.

```cpp
bool is_done() { return sender_action != defer && state == complete; }
```
The various use cases are coded in terms of the operations above, with sender and receiver actions encoded as methods.
Of course, the most prevalent operation will be simply to obtain the value as soon as possible. The code is:

```cpp
if ( f.be_owner() )
    notional_work += f.call();
else {
    f.wait();
    notional_work += f.retrieve();
}
```
In the case of 'enqueue' futures, this operation provides the mechanism for speculative concurrency. There does seem to be advantage to folding these operations into a single method.

```cpp
notional_work += f.value();
```
An alternate form returns a reference to an object stored within the future. This approach may be more efficient in some cases, less in others.

```cpp
notional_work += f.refer();
```
For speculative execution, simply not asking for the value will serve in the case of 'deferred' futures. However, for 'enqueued' futures, efficiency is improved with an explicit action.

```cpp
f.be_owner(); // proceed to future destruction
```

This code would be clearer as a distinct method.

```cpp
f.abandon();
```
For algorithms that can profitably use any result computed, but need not use the result, the following code serves.

```cpp
if ( f.be_owner() )
    ; // ignore task not yet computed
else {
    f.wait();
    notional_work += f.retrieve();
}
```

The else clause is effectively a `value()`. Again, there does seem to be advantage to folding these operations into a single method.

```cpp
if ( f.started_or_never() )
    notional_work += f.value();
```
For algorithms that can profitably use any result available immediately, but need not use the result, the following code serves. Note that there will be some wasted computation with this approach.

```cpp
if ( f.is_done() )
    notional_work += f.retrieve();
else
    f.be_owner(); // abandon the task
```
There does seem to be advantage to folding the `is_done` and `be_owner` operations into a single method.

```cpp
if ( f.done_or_never() )
    notional_work += f.value();
```
For algorithms that can profitably use a result that is otherwise being computed, but need not use the result, the following code serves.

```cpp
if ( f.is_started() ) {
    f.wait();
    notional_work += f.retrieve();
}
```
Which can be simplified to:

```cpp
if ( f.is_started() )
    notional_work += f.value();
```
For algorithms that can profitably use a result that has otherwise been computed, but need not use the result, the following code serves.

```cpp
if ( f.is_done() )
    notional_work += f.retrieve();
```

Or:

```cpp
if ( f.is_done() )
    notional_work += f.value();
```
If a GUI management thread is to be responsive, it needs to avoid doing any extensive computation. That is, the task must execute on another thread. We assume that the management thread will periodically inspect the future and, if the computation is done, act on it. The following code will serve.

```cpp
if ( f.is_owner() )
    f.fork(); // or f.delay()
else if ( f.is_done() )
    notional_work += f.retrieve();
```
The first condition handles the case where a future has been 'deferred', and morphs it into a future with asynchronous execution. We avoid the 'enqueue' action because its advantage over 'fork' is that it can call tasks synchronously, which is not wanted here.
There does seem to be advantage to folding the `is_owner`, `fork`, and `is_done` operations into a single method. However, there is one complication. We may want to apply a policy, not an action, in scheduling the task for execution. This single method would thus need a policy parameter.

```cpp
if ( f.be_done( immediately ) )
    notional_work += f.retrieve();
```
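A sketch of that folded method, in terms of the operations defined above and the illustrative policy encoding sketched earlier (the `schedule` helper and the `immediately` policy value are hypothetical):

```cpp
// Sketch: fold is_owner, fork/delay, and is_done into one method.
// 'schedule' starts an owned task using an action the policy permits,
// e.g. fork under the hypothetical 'immediately' policy.
bool be_done( policy p ) {
    if ( is_owner() )
        schedule( p );
    return is_done(); // true once the value or exception is stored
}
```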