ISO/IEC JTC1 SC22 WG21 N2889 = 09-0079 - 2009-06-21
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Problem Description
Solution Domain
Thread Resources
Solution Value
Related Work
Proposed Solution
Acknowledgements
The async Function
Thread Joining
Execution Policies
Eager and Lazy Evaluation
Direct Execution
New Future Type
Proposed Wording
30.6.1 Overview [futures.overview]
30.6.? Function template async [futures.async]
30.6.? Class template joining_future [futures.joining_future]
Problem Description

One of the simplest methods for exploiting parallelism is to call one subroutine in parallel with another. However, with the current threading facilities, doing so is a difficult task.
There have been repeated requests for a simpler mechanism, all of which were rejected by the committee as not being within the spirit of the Kona compromise. However, there are now national body comments requesting such a facility.
UK-182 30.3.3.2.2
Future, promise and packaged_task provide a framework for creating future values, but a simple function to tie all three components together is missing. Note that we only need a simple facility for C++0x. Advanced thread pools are to be left for TR2.
async( F&& f, Args && ... );
Semantics are similar to creating a thread object with a packaged_task invoking f with forward<Args>(args...), but details are left unspecified to allow different scheduling and thread spawning implementations. It is unspecified whether a task submitted to async is run on its own thread or a thread previously used for another async task. If a call to async succeeds, it shall be safe to wait for it from any thread. The state of thread_local variables shall be preserved during async calls. No two incomplete async tasks shall see the same value of this_thread::get_id(). [Note: this effectively forces new tasks to be run on a new thread, or a fixed-size pool with no queue. If the library is unable to spawn a new thread or there are no free worker threads then the async call should fail.]
Solution Domain

Concurrency and parallelism represent a broad domain of problems and solutions. Mechanisms are generally appropriate to only a limited portion of that domain, so mechanisms should explicitly state the domain in which they are intended to be useful.

The anticipated domain for the following async solution is extracting a limited amount of concurrency from existing sequential programs. That is, some function calls will be made asynchronous where appropriate, so as to extract high-level concurrency from the program's structure, and not from its data structures. The facility is not intended to compete with OpenMP or with automatic parallelizers that extract loop-level parallelism.
To be concrete, the async facility would be appropriate to the recursive calls to quicksort, but not to the iteration in a partition.
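As a sketch only, the quicksort case might look as follows under the proposed facility; the partition() helper and the exact async signature are assumptions for illustration.

    extern int* partition(int* first, int* last);  // loop-level work; remains serial

    void quicksort(int* first, int* last)
    {
        if (last - first < 2)
            return;
        int* mid = partition(first, last);
        auto handle = std::async(quicksort, first, mid);  // one recursive half, possibly concurrent
        quicksort(mid + 1, last);                         // the other half runs in the host thread
        handle.get();                                     // join before returning
    }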
In this domain's programming model, nested asynchronous calls are not only supported, but desired, as they provide the implementation the opportunity to reuse threads for many potentially, but not actually, asynchronous calls.
Thread Resources

The central technical problem in providing an asynchronous execution facility is to provide it in a manner that does not require the use of thread pools, while at the same time avoiding problems synchronizing with the destructors for any thread-local variables used by any threads created to perform the asynchronous work. See N2880: C++ object lifetime interactions with the threads API, Hans-J. Boehm, Lawrence Crowl, ISO/IEC JTC1 WG21 N2880, 2009-05-01.
While not explicit, the essential lesson of N2880 is as follows.
Threads have variables in the form of thread-local variables, parameters, and automatic variables. To ensure that the resources held by those variables are released, one must join with the thread so that those variables are destroyed. To ensure that destructors of those variables are well-defined, one must join with the thread before its referenced environment is destroyed.
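As a minimal sketch of the hazard, consider a worker that refers to the caller's automatic variable; the use() helper below is hypothetical.

    #include <thread>

    extern void use(int&);   // hypothetical consumer of the caller's variable

    void caller()
    {
        int local = 42;                               // part of the referenced environment
        std::thread worker([&local]{ use(local); });  // worker refers to 'local'
        // ... other work ...
        worker.join();  // must precede the destruction of 'local' at end of scope
    }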
Several consequences follow from this observation.
Solution Value

In addition to the technical details, the committee must consider the value in any solution that meets the procedural bounds of the Kona compromise and the technical bounds embodied in N2880. In particular, external facilities like Cilk, the Threading Building Blocks, and the Parallel Patterns Library are known to be better able to handle fine-grained parallelism. So, is the solution space of sufficient value, relative to these external facilities, for standardization in C++0x?

The value in a solution is relative not only to external facilities, but also to facilities in the current standard. Our concurrency primitive, std::thread, does not return values, and getting a value out through std::packaged_task and std::unique_future may take more training than many programmers are willing to accept. So, is the solution space of sufficient value, relative to these internal facilities, for standardization in C++0x?
In this paper, we presume that the value in the solution comes from its improvement over existing internal facilities. The wording of the UK national body comment implies the same conclusion. On that basis, we propose the following solution.
Related Work

Oliver Kowalke is implementing boost.task (formerly known as boost.threadpool). In this library, launch_in_thread() reuses existing threads. The function returns a handle object for both the thread and the return value. This library also allows task interruption. It is available at the Boost Vault (http://www.boostpro.com/vault/ — section 'Concurrent Programming') or from the Boost sandbox (svn — https://svn.boost.org/svn/boost/sandbox/task/).
Herb Sutter has proposed an alternate solution in draft text, generally taking a different choice for those issues in which consensus has not formed. This paper should appear as N2901.
Proposed Solution

The proposed solution consists of a set of async functions to launch asynchronous work and a future to manage the function result. This solution derives from an extensive discussion on the C++ threads standardisation <cpp-threads@decadentplace.org.uk> mailing list. That discussion has not yet reached consensus. We highlight points of disagreement below. Note that the presentation in this paper is substantially expanded from earlier drafts, clarifying several issues, so the disagreements may be weaker than they were in discussion.
Acknowledgements

Thanks to the following contributors to the discussion on this topic: Hans Boehm, Beman Dawes, Peter Dimov, Pablo Halpern, Howard Hinnant, Oliver Kowalke, Doug Lea, Arch Robison, Bjarne Stroustrup, Alexander Terekhov, and Anthony Williams. In particular, we are extremely grateful to Herb Sutter for forcing a thorough analysis of the issues.
The async Function

The async functions use the standard techniques for deferring function execution. The function and its arguments are listed separately as parameters to the async functions, and are later combined at the point of invocation to call the designated work.
For example, consider computing the sum of a very large array. The first task is to avoid computing asynchronously when the overhead would be significant relative to the work. The second task is to split the work into two pieces, one executed by the host thread and one executed asynchronously.
    int parallel_sum(int* data, int size)
    {
        int sum = 0;
        if ( size < 1000 ) {
            for ( int i = 0; i < size; ++i )
                sum += data[i];
        } else {
            auto handle = std::async(parallel_sum, data+size/2, size-size/2);
            sum += parallel_sum(data, size/2);
            sum += handle.get();
        }
        return sum;
    }
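For concreteness, a hypothetical driver for the example might look as follows; the array size and contents are arbitrary assumptions.

    #include <vector>

    int main()
    {
        std::vector<int> data(100000, 1);   // arbitrary test data
        int total = parallel_sum(data.data(), static_cast<int>(data.size()));
        return total == 100000 ? 0 : 1;
    }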
Thread Joining

Because the Kona compromise prohibits thread pools, and because we must join with any thread created, any asynchronous execution facility must ensure, at the very least, that any thread created is joined before the resulting handle is destroyed. (And, of course, the programmer must destroy the handle, not abandon it to the free store.)
A consequence of the joining is that threads cannot be reused. Otherwise, some section of the program would lose control of the resources accreted in the thread being reused.
This issue has not yet reached consensus.
Given that the thread must be joined, there are two implementation strategies: intrusively implement async, or keep the std::thread within the future for later joining.

In the intrusive async, the implementation within the thread must call the set_value or set_exception function of the promise corresponding to the future. That is, the promise effectively joins the thread before the future becomes ready.
When storing the std::thread within the future, the implementation of async is a straightforward composition of packaged_task, unique_future, and std::thread.
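That composition can be sketched as follows, using the C++11 names std::packaged_task and std::future in place of unique_future; the joining_result wrapper is an assumption standing in for the joining_future proposed below, and the sketch presumes a non-void, movable result type.

    #include <future>
    #include <thread>
    #include <utility>

    // A future-like handle that owns the thread producing its value.
    template <class R>
    class joining_result {
        std::future<R> value_;
        std::thread worker_;
    public:
        joining_result(std::future<R> v, std::thread w)
            : value_(std::move(v)), worker_(std::move(w)) {}
        joining_result(joining_result&&) = default;
        ~joining_result() { if (worker_.joinable()) worker_.join(); }
        R get() { R r = value_.get(); worker_.join(); return r; }
    };

    // async as a straightforward composition of packaged_task, future, and thread.
    template <class F>
    joining_result<typename std::result_of<F()>::type> compose_async(F f)
    {
        typedef typename std::result_of<F()>::type R;
        std::packaged_task<R()> task(std::move(f));
        std::future<R> v = task.get_future();
        return joining_result<R>(std::move(v), std::thread(std::move(task)));
    }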
One consequence of storing the std::thread within the future is that either unique_future must be substantially modified or we must introduce a new future type. Another consequence is that the waiting function changes from a condition-variable wait to a thread join.
The std::thread class provides neither a timed_join nor a try_join, and so a joining future cannot implement the full interface of unique_future.
Execution Policies

The async functions have a policy parameter. Three policies are defined in this paper. The intent of this proposal is to closely follow the parameters and overloads of the std::thread constructors. We expect this consistency to provide the least surprise to users.
Because std::thread has a variadic constructor, the std::async function has a variadic overload. A consequence is that the standard technique for implementing a default policy via a defaulted parameter does not work. Hence the proposal places the policy at the front of the parameter list and implements the default policy with a separate overload that does not have that parameter.
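Under the proposed order, calls would look as follows; the names mirror the wording later in this paper, and work1 is a hypothetical worker function.

    extern int work1(int value);   // hypothetical worker

    void sample()
    {
        auto a = std::async(std::fully_threaded, work1, 7);  // explicit policy, placed first
        auto b = std::async(work1, 7);  // separate overload; impl_discretion is implied
        a.get();
        b.get();
    }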
This placement seems unnatural to many members of the committee, and they desire to place the policy parameter at the end. The only way to provide the policy parameter at the end and remain consistent with the std::thread constructors is to remove the variadic constructor from std::thread. Doing so would not lose a great deal of syntactic conciseness, because the lambda facility can encapsulate many parameters.
The async form above can be written as follows.

    auto handle = std::async(
        [=]{ return parallel_sum( data+size/2, size-size/2 ); },
        fully_threaded );
We have no objection to that approach. Indeed, it would make the referencing environment of the executed function quite explicit in the form of the lambda-capture. Should the variadic std::thread constructor be removed, we will modify the proposal to move the policy parameter to the end of the list and default it. Alternatively, one could have inconsistent parameters for the std::thread constructors and the std::async overloads. This choice has not reached consensus.
Eager and Lazy Evaluation

When the work is invoked serially, we propose to do so at the point of value request, rather than at the point of initiation. That is, work is invoked lazily rather than eagerly. This approach may seem surprising, but there are reasons to prefer invocation-on-request.
With eager evaluation, the async functions have already committed to a serial execution at the point of the call, and executing the work serially at the call to async might introduce deadlock. In contrast, executing the work serially at the call to get cannot introduce any deadlock that was not already present, because the calling thread is necessarily blocked.

Lazy evaluation also rewards calling async earlier. When there are sufficient processor resources, the function executes concurrently and speculatively. When there are not sufficient resources, the function will execute only when truly needed.
Eager semantics seem more natural when programmers think of "waiting to use the return value". On the other hand, lazy semantics seem more natural when programmers think of "moving the call earlier". Consider the following examples.
    extern int work1(int value);   // assumed workers
    extern int work2(int value);

    int original( int a, int b )
    {
        int c = work1( a );
        int d = work2( b );
        return c + d;   // combined result (assumed; the paper's fragment omits the returns)
    }

    int eager( int a, int b )
    {
        auto handle = async( work1, a );
        int d = work2( b );
        int c = handle.get();
        return c + d;
    }

    int lazy( int a, int b )
    {
        auto handle = async( work2, b );
        int c = work1( a );
        int d = handle.get();
        return c + d;
    }
Note also that in the proposed lazy semantics, any serial execution will be in the context of the thread that executes the get(). While we expect that this thread will nearly always be the same as the thread that executed async(), it need not be, because a future can be moved.
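A sketch of that point, assuming the proposed fully_synchronous policy and using a C++14 move-capture for brevity:

    #include <thread>
    #include <utility>

    extern int work1(int value);   // hypothetical worker

    void example()
    {
        auto handle = std::async(std::fully_synchronous, work1, 42);  // deferred
        std::thread consumer([h = std::move(handle)]() mutable {
            int r = h.get();  // work1(42) executes here, in the consumer thread
            (void)r;
        });
        consumer.join();
    }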
There are consequences to lazy evaluation. In particular, the future returned from async must either be a modified version of the existing unique_future or be of a new type. The reason is that lazy evaluation requires that the future carry a std::function to represent the computation needed.
Direct Execution

A desirable implementation in the case of synchronous execution is direct execution, in which the call to the std::function representing the work returns its result or exception directly to the caller. In lazy evaluation, direct execution is straightforward; the implementation of a synchronous get() simply calls the std::function and returns its result. Any exception is simply propagated as in a normal function call.
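A minimal sketch of direct execution under lazy evaluation; lazy_future is an illustrative name, not part of the proposal.

    #include <functional>
    #include <utility>

    template <class R>
    class lazy_future {
        std::function<R()> work_;
    public:
        explicit lazy_future(std::function<R()> w) : work_(std::move(w)) {}
        // Direct execution: an ordinary call, with ordinary result and
        // exception propagation, and no intermediate storage.
        R get() { return work_(); }
    };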
In eager evaluation, one must necessarily save the result in a variable for a later copy/move from the get() call. However, propagating the exception at the async() call would introduce a second place in which the programmer must protect against an exception. That burden is undesirable, so the async() call should also save any exception for later propagation by the get() call. All of this means that eager evaluation cannot exploit direct execution.
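A sketch of the eager bookkeeping; eager_result is an illustrative name, and the sketch assumes a default-constructible, movable R.

    #include <exception>
    #include <utility>

    template <class R>
    class eager_result {
        R value_;
        std::exception_ptr error_;
    public:
        template <class F>
        void run(F f) {
            try        { value_ = f(); }                       // saved for a later copy/move
            catch(...) { error_ = std::current_exception(); }  // saved, not thrown here
        }
        R get() {
            if (error_) std::rethrow_exception(error_);  // replayed at the get() call
            return std::move(value_);
        }
    };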
Direct execution has a consequence, however. Since the value or exception status is unknown until the get call, the has_value and has_exception queries cannot provide meaningful results before then. That is, direct execution invalidates part of the interface to unique_future.
Note, however, that the has_value and has_exception queries are meaningful with lazy evaluation so long as the first call to them invokes the std::function in an indirect manner.
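A sketch of that indirect variant, extending the lazy sketch above; the names are illustrative, and R is again assumed default-constructible.

    #include <exception>
    #include <functional>
    #include <utility>

    template <class R>
    class lazy_indirect_future {
        std::function<R()> work_;
        bool ran_ = false;
        R value_;
        std::exception_ptr error_;
        void run_once() {                 // first query invokes the work indirectly
            if (ran_) return;
            ran_ = true;
            try        { value_ = work_(); }
            catch(...) { error_ = std::current_exception(); }
        }
    public:
        explicit lazy_indirect_future(std::function<R()> w) : work_(std::move(w)) {}
        bool has_value()     { run_once(); return !error_; }   // meaningful after first call
        bool has_exception() { run_once(); return static_cast<bool>(error_); }
        R get() {
            run_once();
            if (error_) std::rethrow_exception(error_);
            return std::move(value_);
        }
    };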
New Future Type

As hinted several times earlier, we must make a choice in the future type returned by async: the existing unique_future, a modified unique_future, or a new future type. Based on the discussion above, the existing unique_future cannot support the proposed semantics. Furthermore, a modified unique_future would necessarily induce more overhead on the original intended uses of unique_future.
The direct overhead might be as low as a few words and a couple of tests or virtual calls. Unfortunately, tests and virtual calls tend to introduce pipeline bubbles, and virtual calls tend to be barriers to optimization. So, the indirect overhead might be substantially higher. However, we have no measurements comparing that overhead to the normal cost of unique_future. Even so, the original uses of unique_future, such as in coordinating thread pool invocations, are likely to be in more performance-sensitive code than are uses of async. Therefore, avoiding potential performance impact to thread pools implies a new future type.
Modifying unique_future implies revisiting aspects of the working draft that we thought were stable. Introducing a new future type would avoid potentially destabilizing the draft. On balance, we believe that a new future type is the best overall solution to the conflicting desirable features in the return type of the async function. This choice has not reached consensus.
Given that we have a new future, we remove timed_wait, is_ready, has_value, and has_exception from the interface. That is, the new future interface, a joining_future, is modeled in part on thread, which has a unique owner and is therefore only movable.
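A sketch of the intended move-only usage, assuming the async and joining_future declarations given in the wording below:

    #include <utility>

    extern int work1(int value);   // hypothetical worker

    void sample_usage()
    {
        std::joining_future<int> f1 = std::async(work1, 7);
        // auto f2 = f1;              // ill-formed: copy construction is deleted
        auto f2 = std::move(f1);      // ownership of the associated state transfers
        int r = f2.get();             // joins with the thread, or runs the deferred work
        (void)r;
    }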
Proposed Wording

The proposed wording is as follows. It consists primarily of two new subsections.
30.6.1 Overview [futures.overview]

Add to the synopsis the appropriate entries from the following sections.
30.6.? Function template async [futures.async]

Add the following section.
    enum async_policy { fully_threaded, fully_synchronous, impl_discretion };
    template<class F>
        requires Callable<F>
    joining_future<Callable::result_type>
    async(async_policy policy, F f);

    template<typename F, typename ... Args>
        requires Callable<F, Args...>
    joining_future<Callable::result_type>
    async(async_policy policy, F&& f, Args&&...);

    template<class F>
        requires Callable<F>
    joining_future<Callable::result_type>
    async(F f);

    template<typename F, typename ... Args>
        requires Callable<F, Args...>
    joining_future<Callable::result_type>
    async(F&& f, Args&&...);

Requires: F and each type Ti in Args shall be CopyConstructible if an lvalue and otherwise MoveConstructible. INVOKE(f, w1, w2, ..., wN) (20.7.2) shall be a valid expression for some values w1, w2, ..., wN, where N == sizeof...(Args).
Effects: Constructs an object of type joining_future<Callable::result_type> ([futures.joining_future]). If policy is fully_threaded, creates an object of type thread and executes INVOKE(f, t1, t2, ..., tN) in a new thread of execution, where t1, t2, ..., tN are the values in args.... Any return value is captured by the joining_future. Any exception not caught by f is captured by the joining_future. If policy is fully_synchronous, the thread calling joining_future::get() ([futures.joining_future]) executes INVOKE(f, t1, t2, ..., tN) in the caller's own thread of execution, where t1, t2, ..., tN are the values in args.... The invocation is said to be deferred. If policy is impl_discretion, the implementation may choose either policy above at any call to async. [Note: Implementations should defer invocations when no more concurrency can be effectively exploited. —end note] If there is no policy parameter, the behavior is as if an impl_discretion argument had been specified.

Synchronization: The invocation of async happens before (1.10 [intro.multithread]) the invocation of f. [Note: This statement applies even when the corresponding joining_future is moved to another thread. —end note]

Throws: std::system_error if policy is fully_threaded and the implementation is unable to start a new thread.

Error conditions:
— resource_unavailable_try_again — if policy is fully_threaded and either the system lacked the necessary resources to create another thread, or the system-imposed limit on the number of threads in a process would be exceeded.

[Example: Two items of work can be executed in parallel as below.

    extern int work1(int value);
    extern int work2(int value);

    int work(int value)
    {
        auto handle = std::async(std::impl_discretion, work2, value);
        int tmp = work1(value);
        return tmp + handle.get();
    }

—end example] [Note: The statement

    return work1(value) + handle.get();

might not result in parallelism, because get() may be evaluated before work1(), thus forcing work2 to be evaluated before work1(). —end note]
30.6.? Class template joining_future [futures.joining_future]

Add the following section after the one above.
    namespace std {
        template<class R>
        class joining_future
        {
        public:
            joining_future(joining_future&&);
            joining_future(const joining_future& rhs) = delete;
            ~joining_future();
            joining_future& operator=(const joining_future& rhs) = delete;

            // retrieving the value
            see below get();

            // functions to check state and wait for ready
        };
    }
The implementation shall provide the template joining_future and two specializations, joining_future<R&> and joining_future<void>. These differ only in the return type and return value of the member function get, as set out in its description, below.
joining_future(joining_future&& rhs);
Effects: move constructs a joining_future object whose associated state is the same as the state of rhs before the construction. The associated state derives from the async call that provided the original future. The state consists of one or more of any thread created by the call, the function object and its arguments, the return value of its invocation, or the exception of its invocation.

Postcondition: rhs can be safely destroyed.

    ~joining_future();
Effects: destroys *this and its associated state if no other object refers to that state. If the invocation has been deferred, but not yet executed via get, the invocation is not executed.

Synchronization: If the invocation has been deferred, then the associated async call happens before (1.10 [intro.multithread]) the destructor return. Otherwise, as if associated-thread.join().

    R&& joining_future::get();
    R& joining_future<R&>::get();
    void joining_future<void>::get();
Note: as described above, the template and its two required specializations differ only in the return type and return value of the member function get.

Effects: If the invocation has been deferred, then executes INVOKE(f, t1, t2, ..., tN), where t1, t2, ..., tN are the values in args....

Returns: If the invocation has been deferred, then joining_future::get() returns an rvalue-reference to the result of the invocation, joining_future<R&>::get() returns a reference to the result of the invocation, and joining_future<void>::get() returns nothing. Otherwise, joining_future::get() returns an rvalue-reference to the value stored in the asynchronous result, joining_future<R&>::get() returns the stored reference, and joining_future<void>::get() returns nothing.

Throws: If the invocation has been deferred, then any exception from the invocation. Otherwise, the stored exception, if an exception was stored and not retrieved before.

Synchronization: If the invocation has been deferred, then the return from the invocation happens before (1.10 [intro.multithread]) the get returns. Otherwise, as if associated-thread.join().

Remark: the effect of calling get() a second time on the same joining_future object is unspecified.