An Asynchronous Call for C++

ISO/IEC JTC1 SC22 WG21 N2973 = 09-0163 - 2009-09-27

Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

This paper is a revision of N2889 = 09-0079 - 2009-06-21.

Problem Description
    Solution Domain
    Thread Resources
    Solution Value
    Related Work
Proposed Solution
    Acknowledgements
    The async Function
    Thread Joining
    Execution Policies
    Eager and Lazy Evaluation
    Direct Execution
    Modified Future Types
Proposed Wording
    30.3.1 Class thread [thread.thread.class]
    30.3.1.2 thread constructors [thread.thread.constr]
    30.6.1 Overview [futures.overview]
    30.6.4 Class template unique_future [futures.unique_future]
    30.6.5 Class template shared_future [futures.shared_future]
    30.6.? Function template async [futures.async]

Problem Description

One of the simplest methods for exploiting parallelism is to call one subroutine in parallel with another. However, with the current threading facilities, doing so is a difficult task.

There have been repeated requests for a simpler mechanism, all of which were rejected by the committee as not being within the spirit of the Kona compromise. However, there are now national body comments requesting such a facility.

UK-182 30.3.3.2.2

Future, promise and packaged_task provide a framework for creating future values, but a simple function to tie all three components together is missing. Note that we only need a simple facility for C++0x. Advanced thread pools are to be left for TR2.

async( F&& f, Args && ... ); Semantics are similar to creating a thread object with a packaged_task invoking f with forward<Args>(args...) but details are left unspecified to allow different scheduling and thread spawning implementations. It is unspecified whether a task submitted to async is run on its own thread or a thread previously used for another async task. If a call to async succeeds, it shall be safe to wait for it from any thread. The state of thread_local variables shall be preserved during async calls. No two incomplete async tasks shall see the same value of this_thread::get_id(). [Note: this effectively forces new tasks to be run on a new thread, or a fixed-size pool with no queue. If the library is unable to spawn a new thread or there are no free worker threads then the async call should fail.]

Solution Domain

Concurrency and parallelism represent a broad domain of problems and solutions. Mechanisms are generally appropriate to a limited portion of that domain. So, mechanisms should explicitly state the domain in which they are intended to be useful.

One anticipated domain for the following async solution is extracting a limited amount of concurrency from existing sequential programs.

That is, some function calls will be made asynchronous where appropriate, to extract high-level concurrency from program structure, and not from its data structures. The facility is not intended to compete with OpenMP or with automatic parallelizers that extract loop-level parallelism. To be concrete, the async facility would be appropriate for the recursive calls in quicksort, but not for the iteration within a partition.
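For illustration only, the quicksort case above can be sketched as follows. This is not part of the proposal; it assumes a facility along the lines of std::async, and the function name and the 1000-element threshold are our own illustrative choices.

```cpp
#include <algorithm>
#include <future>
#include <vector>
#include <cassert>

// Illustrative sketch: the recursive calls run asynchronously,
// while the partition itself remains sequential.
void parallel_quicksort(std::vector<int>::iterator first,
                        std::vector<int>::iterator last)
{
    if (last - first < 2) return;
    int pivot = *(first + (last - first) / 2);
    // three-way partition, done serially as the text recommends
    auto mid1 = std::partition(first, last, [=](int x){ return x < pivot; });
    auto mid2 = std::partition(mid1, last, [=](int x){ return x == pivot; });
    if (last - first > 1000) {
        // large enough to be worth running the recursive calls concurrently
        auto handle = std::async([=]{ parallel_quicksort(mid2, last); });
        parallel_quicksort(first, mid1);
        handle.get();
    } else {
        parallel_quicksort(first, mid1);
        parallel_quicksort(mid2, last);
    }
}
```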

In this domain, the programming model is:

In this model, nested asynchronous calls are not only supported, but desired, as they provide the implementation the opportunity to reuse threads for many potentially, but not actually, asynchronous calls.

Another anticipated domain is to offload work from threads that must remain responsive, such as GUI threads. In this environment, programmers often must insist on concurrency.

Thread Resources

The central technical problem in providing an asynchronous execution facility is to provide it in a manner that does not require the use of thread pools, while at the same time avoiding problems synchronizing with the destructors for any thread-local variables used by any threads created to perform the asynchronous work. See N2880: C++ object lifetime interactions with the threads API, Hans-J. Boehm, Lawrence Crowl, ISO/IEC JTC1 WG21 N2880, 2009-05-01.

While not explicit, the essential lesson of N2880 is as follows.

Threads have variables in the form of thread-local variables, parameters, and automatic variables. To ensure that the resources held by those variables are released, one must join with the thread so that those variables are destroyed. To ensure that destructors of those variables are well-defined, one must join with the thread before its referenced environment is destroyed.
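The lesson above can be sketched concretely. In this sketch (the function name is ours), the spawned thread references variables owned by the calling scope, so the caller must join before those variables are destroyed.

```cpp
#include <numeric>
#include <thread>
#include <vector>
#include <cassert>

// The thread's lambda references 'data' and 'sum' from this scope,
// so we must join before they are destroyed at the end of the scope.
long sum_in_thread(const std::vector<int>& data)
{
    long sum = 0;
    std::thread t([&]{ sum = std::accumulate(data.begin(), data.end(), 0L); });
    t.join();  // joining destroys the thread's variables and makes 'sum' safe to read
    return sum;
}
```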

Some consequences of this observation are:

Solution Value

In addition to the technical details, the committee must consider the value in any solution that meets the procedural bounds of the Kona compromise and the technical bounds embodied in N2880. In particular, external facilities like Cilk, the Threading Building Blocks, and the Parallel Patterns Library are known to be better able to handle fine-grained parallelism. So, is the solution space of sufficient value, relative to these external facilities, for standardization in C++0x?

The value in a solution is relative not only to external facilities, but also relative to facilities in the current standard. Our concurrency primitive, std::thread, does not return values, and getting a value out through std::packaged_task and std::unique_future may take more training than many programmers are willing to accept. So, is the solution space of sufficient value, relative to these internal facilities, for standardization in C++0x?

In this paper, we presume that the value in the solution comes from its improvement over existing internal facilities. The wording of the UK national body comment implies the same conclusion. On that basis, we propose the following solution.

Related Work

Oliver Kowalke is implementing boost.task (formerly known as boost.threadpool). In this library, launch_in_thread() reuses existing threads. The function returns a handle object for both the thread and the return value. This library also allows task interruption. It is available at the Boost Vault (http://www.boostpro.com/vault/ — section 'Concurrent Programming') or from the Boost sandbox (svn — https://svn.boost.org/svn/boost/sandbox/task/).

Herb Sutter has proposed an alternate solution in N2901, which should appear in updated form as N2970.

Proposed Solution

The proposed solution consists of a set of async functions to launch asynchronous work and a future to manage the function result.

Acknowledgements

This solution derives from an extensive discussion on the C++ threads standardisation <cpp-threads@decadentplace.org.uk> mailing list. That discussion has not yet reached consensus. We highlight points of disagreement below. Note that the presentation in this paper is substantially expanded from earlier drafts, clarifying several issues, so the disagreements may be weaker than they were in discussion.

Thanks to the following contributors to the discussion on this topic: Hans Boehm, Beman Dawes, Peter Dimov, Pablo Halpern, Howard Hinnant, Oliver Kowalke, Doug Lea, Arch Robison, Bjarne Stroustrup, Alexander Terekhov, and Anthony Williams. In particular, we are extremely grateful to Herb Sutter for forcing a thorough analysis into the issues.

The async Function

The async functions use the standard techniques for deferring function execution. The function may be wrapped in a lambda expression to encapsulate arguments to the function as well.

For example, consider computing the sum of a very large array. The first task is to not compute asynchronously when the overhead would be significant or when processor resources might be fully engaged. The second task is to split the work into two pieces, one executed by the host thread and one executed asynchronously.


int parallel_sum(int* data, int size)
{
    int sum = 0;
    if ( size < 1000 )
        for ( int i = 0; i < size; ++i )
            sum += data[i];
    else {
        auto handle = std::async(
            [=]{ return parallel_sum( data+size/2, size-size/2 ); } );
        sum += parallel_sum(data, size/2);
        sum += handle.get();
    }
    return sum;
}

Thread Joining

Because the Kona compromise prohibits thread pools and because we must join with any thread created, any asynchronous execution facility must ensure, at the very least, that any thread created is joined before the resulting handle is destroyed. (And, of course, the programmer must destroy the handle, not abandon it to free store.)

A consequence of the joining is that std::threads cannot be reused. Otherwise, some section of the program would lose control of the resources accreted in the std::thread being reused. Note, though, that it is possible to reuse operating system threads in the implementation of std::thread.

Given that the thread must join, there are two implementation strategies: intrusively implement async, or keep the std::thread within the future for later joining.

In the intrusive async, the implementation within the thread must

That is, the promise effectively joins the thread before the future becomes ready.

When storing the std::thread within the future, the implementation of async is a straightforward composition of std::thread, packaged_task, and a modified unique_future.
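A minimal sketch of that composition follows. The names joining_future and spawn are ours, and the sketch leans on the modern std::future for value transport; a real implementation would need the modified unique_future described in this paper.

```cpp
#include <future>
#include <thread>
#include <utility>
#include <cassert>

// Illustrative future type that stores the std::thread and joins it
// both before handing out the value and in its destructor.
template <class R>
class joining_future {
    std::future<R> fut_;
    std::thread thr_;
public:
    joining_future(std::future<R> f, std::thread t)
        : fut_(std::move(f)), thr_(std::move(t)) {}
    joining_future(joining_future&&) = default;
    ~joining_future() { if (thr_.joinable()) thr_.join(); }
    R get() {
        R r = fut_.get();
        if (thr_.joinable()) thr_.join();  // join before returning the value
        return r;
    }
};

// Composition: packaged_task captures the result, thread runs the task,
// and the returned handle owns the thread for later joining.
template <class F>
joining_future<decltype(std::declval<F>()())> spawn(F f)
{
    using R = decltype(f());
    std::packaged_task<R()> task(std::move(f));
    auto fut = task.get_future();
    std::thread t(std::move(task));  // the task runs in a new thread
    return joining_future<R>(std::move(fut), std::move(t));
}
```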

One consequence of storing the std::thread within the future is that either unique_future must be substantially modified or that we introduce a new future type.

This paper proposes a solution that does not choose between these implementation strategies. The solution avoids this choice by simply requiring the appropriate synchronization, as if the thread were joined at the critical times.

Unlike in prior proposals, we now require that the last future to own associated state from an async call also "join with" any thread created by the async call. This new behavior is necessary to prevent undefined behavior when the future is simply abandoned.

Execution Policies

The async functions have a policy parameter. Three policies are defined in this paper.

Always invoke the function in a new thread.
    Note that we do not choose "another thread" as a consequence of the discussion above.
Always invoke the function serially.
    The value in this policy is primarily in temporarily reducing local concurrency in experiments to achieve higher system performance.
Invoke the function at the discretion of the implementation.
    The implementation may use either of the above policies on a per-call basis. The value in this policy is that it enables the implementation to better manage resources. It is the policy used when the programmer does not specify one.
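The three policies can be mocked for illustration. The mapping below onto std::async and std::launch is ours and is only an approximation of the proposed semantics; the function is named async_call to avoid confusion with the proposed std::async.

```cpp
#include <future>
#include <utility>
#include <cassert>

// Policy names from this paper; the implementation is an illustrative mock.
enum async_policy { async_threaded, async_serial, async_discretion };

template <class Callable>
auto async_call(Callable f, async_policy policy = async_discretion)
    -> std::future<decltype(f())>
{
    switch (policy) {
    case async_threaded:
        // always a new thread of execution
        return std::async(std::launch::async, std::move(f));
    case async_serial:
        // lazy: f runs in the thread that calls get()
        return std::async(std::launch::deferred, std::move(f));
    default:
        // the implementation's discretion
        return std::async(std::move(f));
    }
}
```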

The intent of this proposal is to closely follow the parameters and overloads of the std::thread constructors. Consensus has emerged to remove variadic constructors from std::thread to both maintain consistency and simplify the form of std::async. Programmers can encapsulate function calls with parameters in a lambda expression to obtain the desired behavior.

Eager and Lazy Evaluation

When the work is invoked serially, we propose to do so at the point of value request, rather than at the point of initiation. That is, work is invoked lazily rather than eagerly. This approach may seem surprising, but there are reasons to prefer invocation-on-request.

Eager semantics seem more natural when programmers think of "waiting to use the return value". On the other hand, lazy semantics seem more natural when programmers think of "moving the call earlier". Consider the following examples.


int original( int a, int b ) {
    int c = work1( a );
    int d = work2( b );
    return c + d;
}

int eager( int a, int b ) {
    auto handle = async( [=]{ return work1( a ); } );
    int d = work2( b );
    int c = handle.get();
    return c + d;
}

int lazy( int a, int b ) {
    auto handle = async( [=]{ return work2( b ); } );
    int c = work1( a );
    int d = handle.get();
    return c + d;
}

Note also that in the proposed lazy semantics, any serial execution will be in the context of the thread that executes the get(). While we expect that this thread will nearly always be the same as the thread that executes async(), it need not be because a future can be moved.

There are consequences to lazy evaluation. In particular, unique_future must be modified to carry an std::function in its associated state to represent the computation needed.
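The required modification can be sketched as follows. The class name deferred_result is ours, and the sketch ignores exceptions, synchronization, and non-default-constructible result types; it only shows the associated state carrying a std::function that runs lazily in the thread that requests the value.

```cpp
#include <functional>
#include <utility>
#include <cassert>

// Sketch: associated state holding a deferred computation that is
// executed on first request, in the requesting thread.
template <class R>
class deferred_result {
    std::function<R()> fn_;  // the deferred computation
    bool ready_ = false;
    R value_{};              // simplification: requires default-constructible R
public:
    explicit deferred_result(std::function<R()> f) : fn_(std::move(f)) {}
    bool is_deferred() const { return !ready_; }
    R get() {
        if (!ready_) {       // first request: run the deferred function
            value_ = fn_();
            ready_ = true;
        }
        return value_;
    }
};
```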

Direct Execution

A desirable implementation in the case of synchronous execution is direct execution, in which the call to the std::function representing the work returns its result or exception directly to the caller.

In lazy evaluation, direct execution is straightforward; the implementation of a synchronous get() simply calls the std::function and returns its result. Any exception is simply propagated as in a normal function call.

However, the rvalue-reference return type of get effectively eliminates direct execution as an implementation possibility.

Modified Future Types

To enable sequential evaluation, the following modifications to the existing futures are proposed.

Proposed Wording

The proposed wording is as follows. It consists primarily of two new subsections.

30.3.1 Class thread [thread.thread.class]

Within paragraph 1, edit the synopsis as follows.


namespace std {
  class thread {
  public:
    // types:
    class id;
    typedef implementation-defined native_handle_type; // See 30.2.3

    // construct/copy/destroy:
    thread();
    template <class F> explicit thread(F f);
    ~thread();
    thread(const thread&) = delete;
    thread(thread&&);
    thread& operator=(const thread&) = delete;
    thread& operator=(thread&&);

    // members:
    void swap(thread&&);
    bool joinable() const;
    void join();
    void detach();
    id get_id() const;
    native_handle_type native_handle(); // See 30.2.3

    // static members:
    static unsigned hardware_concurrency();
  };
}

30.3.1.2 thread constructors [thread.thread.constr]

Edit the constructor prototypes as follows.


template <class F> explicit thread(F f);

Edit paragraph 4 as follows.

Requires: F shall be CopyConstructible if an lvalue and otherwise MoveConstructible. f() shall be a valid expression.

Edit paragraph 5 as follows.

Effects: Constructs an object of type thread and executes f() in a new thread of execution. Any return value from f is ignored. If f terminates with an uncaught exception, std::terminate() shall be called.

30.6.1 Overview [futures.overview]

Add to the synopsis the appropriate entries from the following sections.

30.6.4 Class template unique_future [futures.unique_future]

Edit the synopsis as follows. The edit adds is_deferred and removes const qualification from member functions is_ready, wait, wait_for, and wait_until.


namespace std {
  template <class R>
  class unique_future {
  public:
    unique_future(unique_future &&);
    unique_future(const unique_future& rhs) = delete;
    unique_future();
    unique_future& operator=(const unique_future& rhs) = delete;
    // retrieving the value
    see below get();
    // functions to check state and wait for ready
    bool is_deferred() const;
    bool is_ready();
    bool has_exception() const;
    bool has_value() const;
    void wait();
    template <class Rep, class Period>
      bool wait_for(
        const chrono::duration<Rep, Period>& rel_time);
    template <class Clock, class Duration>
      bool wait_until(
        const chrono::time_point<Clock, Duration>& abs_time);
  };
}

After paragraph 4, add a new paragraph as follows. This paragraph is a requirement on the destructor.

Synchronization: If no other thread refers to the associated state, and that state is associated with a thread created by an async call ([futures.async]), as if associated-thread.join().

After paragraph 5, add a new paragraph as follows. This paragraph is a requirement on get().

Effects: As if is_ready().

Delete paragraph 6. This synchronization now happens as a consequence of "as if" is_ready().

Synchronization: if *this is associated with a promise object, the completion of set_value() or set_exception() to that promise happens before (1.10) get() returns.

After paragraph 9, add a new prototype.


bool is_deferred() const;

After that prototype, add a new paragraph.

Returns: true if and only if the associated state has a deferred function and that function has not executed.

Edit the is_ready() prototype as follows.


bool is_ready();

After that prototype add a new paragraph.

Effects: if is_deferred(), then execute the deferred function. Otherwise, no effect.

After that paragraph add a new paragraph.

Postcondition: is_deferred() == false.

After that paragraph add a new paragraph.

Synchronization: if *this is associated with a promise object, the completion of set_value() or set_exception() to that promise happens before (1.10) is_ready() returns. If the future is associated with a thread created by an async call ([futures.async]), as if associated-thread.join().

Edit the wait() prototype as follows.


void wait();

Edit paragraph 13 as follows.

Effects: Blocks until *this is ready. As if is_ready().

Delete paragraph 14. This synchronization now happens as a consequence of "as if" is_ready.

Synchronization: if *this is associated with a promise object, the completion of set_value() or set_exception() to that promise happens before (1.10) get() returns.

Edit the wait_for() prototype as follows.


template <class Rep, class Period>
bool wait_for(const chrono::duration<Rep, Period>& rel_time);

Edit paragraph 16 as follows.

Effects: Blocks until *this is ready or until rel_time has elapsed. As if is_ready().

Edit the wait_until() prototype as follows.


template <class Clock, class Duration>
bool wait_until(const chrono::time_point<Clock, Duration>& abs_time);

30.6.5 Class template shared_future [futures.shared_future]

Edit paragraph 3 as follows. That paragraph is a requirement on the construction from a unique_future.

Effects: As if rhs.is_ready(). Move constructs a shared_future object whose associated state is the same as the state of rhs after the is_ready() effects.

After paragraph 5, add a new paragraph as follows. This paragraph is a requirement on the destructor.

Synchronization: If no other thread refers to the associated state, and that state is associated with a thread created by an async call ([futures.async]), as if associated-thread.join().

After paragraph 10, add a new paragraph. This paragraph describes is_ready().

Synchronization: if *this is associated with a promise object, the completion of set_value() or set_exception() to that promise happens before (1.10) is_ready() returns. If the future is associated with a thread created by an async call ([futures.async]), as if associated-thread.join().

Edit paragraph 13 as follows. This paragraph describes wait().

Effects: Blocks until *this is ready. As if is_ready().

Delete paragraph 14. The synchronization is handled by the "as if" is_ready().

Synchronization: if *this is associated with a promise object, the completion of set_value() or set_exception() to that promise happens before (1.10) get() returns.

Edit paragraph 16 as follows. This paragraph describes wait_for().

Effects: Blocks until *this is ready or until rel_time has elapsed. As if is_ready().

30.6.? Function template async [futures.async]

Add the following section.


enum async_policy {
    async_threaded,
    async_serial,
    async_discretion
};
template<class Callable>
unique_future<typename Callable::result_type>
async(Callable f, async_policy policy = async_discretion);

Requires: Callable shall be CopyConstructible if an lvalue and otherwise MoveConstructible. The expression f() shall be a valid expression.

Effects: Constructs an object of type unique_future<typename Callable::result_type> ([futures.unique_future]). If policy is async_threaded, creates an object of type thread and executes f() in a new thread of execution. Any return value is captured by the unique_future. Any exception not caught by f is captured by the unique_future. The thread is associated with the unique_future, and affects the behavior of the unique_future. If policy is async_serial, then f is associated with the returned unique_future. The invocation is said to be deferred. If policy is async_discretion, the implementation may choose either policy above at any call to async. [Note: Implementations should defer invocations when no more concurrency can be effectively exploited. —end note]

Synchronization: The invocation of the async happens before (1.10 [intro.multithread]) the invocation of f. [Note: This statement applies even when the corresponding unique_future is moved to another thread. —end note]

Throws: std::system_error if policy is async_threaded and the implementation is unable to start a new thread.

Error conditions: resource_unavailable_try_again — if policy is async_threaded and either the system lacked the necessary resources to create another thread, or the system-imposed limit on the number of threads in a process would be exceeded.

[Example: Two items of work below may be executed concurrently.


extern int work1(int value);
extern int work2(int value);
int work(int value) {
  auto handle = std::async( [=]{ return work2(value); } );
  int tmp = work1(value);
  return tmp + handle.get();
}

—end example] [Note: The statement


return work1(value) + handle.get();

might not result in concurrency because get() may be evaluated before work1(), thus forcing work2 to be evaluated before work1(). —end note]