Document Number: N2907=09-0097
Date: 2009-06-18
Author: Anthony Williams
Just Software Solutions Ltd

N2907 - Managing the lifetime of thread_local variables with contexts

This paper discusses a suggestion I made on the LWG reflector and cpp-thread mailing list to address the issues raised in N2880 surrounding the lifetime of thread_local variables.

The basic idea of this proposal is that the lifetime of thread_local variables is tied to the lifetime of an instance of the new class thread_local_context. Each thread has an implicit instance of such a class constructed prior to the invocation of the thread function, and destroyed after completion of the thread function, but additional instances can be created in order to deliberately limit the lifetime of thread_local variables: when a thread_local_context object is destroyed, all the thread_local variables tied to it are also destroyed.

Addressing the concerns of N2880

This enables us to address several of the concerns of N2880. Firstly, if we use a mechanism other than thread::join to wait for a thread to complete its work, such as waiting for a unique_future to be ready, then N2880 correctly highlights that under the current working paper the destructors of thread_local variables may still be running after the waiting thread has resumed. By judicious use of a thread_local_context instance and block scoping, we can ensure that the thread_local variables are destroyed before the future value is set. For example:

#include <future>
#include <iostream>
#include <thread>

int find_the_answer();
void thread_func(std::promise<int> * p)
{
    int local_result;
    {
        thread_local_context context; // create a new context for thread_locals
        local_result=find_the_answer();
    } // destroy thread_local variables along with the context object
    p->set_value(local_result);
}

int main()
{
    std::promise<int> p;
    std::thread t(thread_func,&p);
    t.detach(); // we're going to wait on the future
    std::cout<<p.get_future().get()<<std::endl;
}

When the call to get() returns, we know that not only is the future value ready, but the thread_local variables on the other thread have also been destroyed.
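The ordering argument rests only on block scoping, so it can be checked with a stand-in for the proposed class. The following is a minimal sketch, assuming a hypothetical scoped_cleanup class whose destructor merely sets a flag where thread_local_context would destroy thread_local variables. The names scoped_cleanup and run_and_observe, and the body given to find_the_answer, are illustrative and not part of the proposal.

```cpp
#include <atomic>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for thread_local_context: the destructor records
// that per-task cleanup has run.
class scoped_cleanup
{
    std::atomic<bool>& flag_;
public:
    explicit scoped_cleanup(std::atomic<bool>& flag) : flag_(flag) {}
    ~scoped_cleanup() { flag_.store(true); }
    scoped_cleanup(const scoped_cleanup&) = delete;
    scoped_cleanup& operator=(const scoped_cleanup&) = delete;
};

int find_the_answer() { return 42; } // illustrative body

// Returns the future's value, paired with whether cleanup had already
// happened at the moment get() returned on the waiting thread.
std::pair<int, bool> run_and_observe()
{
    std::atomic<bool> cleaned(false);
    std::promise<int> p;
    std::future<int> f = p.get_future();
    std::thread t([&] {
        int local_result;
        {
            scoped_cleanup context(cleaned);
            local_result = find_the_answer();
        } // cleanup runs here, before the promise is satisfied
        p.set_value(local_result);
    });
    int value = f.get();        // wait on the future, not on join
    bool seen = cleaned.load(); // cleanup is already visible here
    t.join(); // keep p and cleaned alive until the thread has finished
    return std::make_pair(value, seen);
}
```

The sketch joins the thread after reading the future only so that the promise and the flag, which live on the caller's stack, outlive the worker. The observation itself is made before the join.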

Reusing threads

A second concern of N2880 was the potential for accumulating large numbers of thread_local variables when threads are reused for multiple independent tasks, such as in a thread pool. In that case, the thread pool implementation can wrap each task in a scope containing a thread_local_context variable, so that a task's thread_local variables are destroyed promptly when the task completes. For example:

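#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::mutex task_mutex;
std::queue<std::function<void()>> tasks;
std::condition_variable task_cond;
bool done=false; // protected by task_mutex

void worker_thread()
{
    std::unique_lock<std::mutex> lk(task_mutex);
    while(!done)
    {
        // wake for new work, or for shutdown so that the loop condition is rechecked
        task_cond.wait(lk,[]{return done || !tasks.empty();});
        if(tasks.empty())
            continue; // woken because done was set
        std::function<void()> task=tasks.front();
        tasks.pop(); // std::queue provides pop(), not pop_front()
        lk.unlock();
        {
            thread_local_context context; // fresh thread_locals for this task
            task();
        } // this task's thread_local variables are destroyed here
        lk.lock();
    }
}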

With this scheme, the thread_local variables are destroyed between each task invocation when the thread_local_context object is destroyed, so if the sets of variables used by the tasks do not overlap then the problem of increasing memory usage is avoided.
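The underlying problem can be observed with standard thread_local alone. The following is a minimal sketch with hypothetical names (scratch, task, run_tasks_on_one_thread): each task leaves data in a thread_local buffer, and since nothing destroys that buffer between tasks, it keeps growing for as long as the thread is reused.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Per-thread scratch buffer; without a context it lives until thread exit.
thread_local std::vector<int> scratch;

// A task that leaves data behind in the thread_local buffer and reports
// how large the buffer is afterwards.
std::size_t task()
{
    scratch.push_back(0);
    return scratch.size();
}

// Runs n tasks back to back on one worker thread and records the buffer
// size that each task observed.
std::vector<std::size_t> run_tasks_on_one_thread(int n)
{
    std::vector<std::size_t> sizes;
    std::thread worker([&sizes, n] {
        for (int k = 0; k < n; ++k)
            sizes.push_back(task());
    });
    worker.join();
    return sizes;
}
```

Under the proposal, wrapping each call to task() in a scope holding a thread_local_context would be expected to make every task observe a freshly constructed buffer instead.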

Consequences for implementations

Obviously, such a class would have to be tightly integrated with the compiler's mechanism for thread_local variables, so that they can be destroyed at the appropriate points and constructed again if necessary. This is a key point: for the second scenario to work, any thread_local variable used during the lifetime of a freshly constructed thread_local_context must be created afresh, even if it was already created and destroyed during the lifetime of a prior context object on the same thread.

This does mean that implementations are effectively restricted to initializing thread_local variables on first use, with a mechanism that allows the destructor of a thread_local_context object to reset the "first use" flag. If thread_local_context is implemented with compiler intrinsics, the compiler may still find optimization opportunities, such as batching initializations or checking the "first use" flag less often.
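The "first use" flag with a reset can be illustrated in library code. The following is a minimal sketch, assuming a hypothetical lazy_slot template that plays the role of the compiler-managed storage for a single variable. A real implementation would live in the compiler and runtime, and the names lazy_slot, counted and constructions_over_two_contexts are illustrative only.

```cpp
#include <new>

// Hypothetical slot for one variable: raw storage plus a "first use" flag
// (here, a pointer that is null until the object has been constructed).
template<typename T>
class lazy_slot
{
    alignas(T) unsigned char storage_[sizeof(T)];
    T* object_ = nullptr;
public:
    lazy_slot() = default;
    lazy_slot(const lazy_slot&) = delete;
    lazy_slot& operator=(const lazy_slot&) = delete;
    ~lazy_slot() { reset(); }

    // Construct on first use; later calls return the same object.
    T& get()
    {
        if (!object_)
            object_ = ::new (static_cast<void*>(storage_)) T();
        return *object_;
    }

    // What a context destructor would do: destroy the object and clear
    // the flag so that the next use constructs it afresh.
    void reset()
    {
        if (object_)
        {
            object_->~T();
            object_ = nullptr;
        }
    }
};

// Instrumented type that counts constructions and live objects.
struct counted
{
    static int made;
    static int live;
    int value = 0;
    counted() { ++made; ++live; }
    ~counted() { --live; }
};
int counted::made = 0;
int counted::live = 0;

// Simulates two consecutive contexts on one thread and returns how many
// times the variable was constructed.
int constructions_over_two_contexts()
{
    lazy_slot<counted> slot;
    slot.get().value = 7; // first use in context 1: constructed
    slot.get();           // flag already set: no second construction
    slot.reset();         // context 1 destroyed
    slot.get();           // first use in context 2: constructed afresh
    slot.reset();         // context 2 destroyed
    return counted::made;
}
```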

Nesting of thread_local_context object lifetimes

There are interesting issues surrounding the behaviour of code with nested thread_local_context objects. Is such nesting allowed at all? What happens to thread_local variables that have already been assigned values when a thread_local_context object is constructed? What about pointers to such variables?

I believe there are several possible answers to these questions, and I will address each in turn.

Is nesting allowed?

Certainly it could be argued that things are simpler if nesting is disallowed, and the use cases primarily point to thread_local_context being used high up in the call chain either directly in the thread function or not many levels down. However, I think this is an unnecessary restriction. What I do believe is important however is that lifetimes are properly nested, and a couple of simple rules should be enforced:

If these rules are not obeyed then std::terminate should be called in the destructor of the thread_local_context object being executed when the violation is discovered.
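A check for properly nested lifetimes can be made in the destructor at little cost. The following is a minimal sketch, assuming only the requirement stated above that lifetimes be properly nested. It uses a hypothetical class nesting_checked_context that tracks nesting and nothing else: each thread keeps a pointer to its innermost context, and a destructor that finds some other object in that position calls std::terminate. The depth() query is an addition for illustration.

```cpp
#include <exception>

// Hypothetical stand-in that models only the nesting check.
class nesting_checked_context
{
    // Innermost live context on the current thread, or null if none.
    static thread_local nesting_checked_context* innermost_;
    nesting_checked_context* parent_;
public:
    nesting_checked_context() : parent_(innermost_) { innermost_ = this; }

    ~nesting_checked_context()
    {
        // Destroyed out of order, or on a thread other than the one that
        // constructed it: this object is not the innermost context here.
        if (innermost_ != this)
            std::terminate();
        innermost_ = parent_;
    }

    nesting_checked_context(const nesting_checked_context&) = delete;
    nesting_checked_context& operator=(const nesting_checked_context&) = delete;

    // Number of live contexts on the calling thread (illustrative helper).
    static int depth()
    {
        int n = 0;
        for (const nesting_checked_context* c = innermost_; c != nullptr; c = c->parent_)
            ++n;
        return n;
    }
};

thread_local nesting_checked_context* nesting_checked_context::innermost_ = nullptr;
```

With automatic storage duration the check can never fire, since block scoping already yields properly nested lifetimes. It matters only for objects created with new or held as members.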

What happens to thread_local variables with values assigned prior to construction of a thread_local_context?

The importance of this question can be neatly demonstrated by the following example. Note that this example does not use a nested context, but the same issues apply, and the answer should be the same in examples that do use nested contexts (if we permit them).

#include <iostream>

static thread_local int i=0;

int main()
{
    i=42;
    {
        thread_local_context context;
        std::cout<<i<<",";
        i=123;
    }
    std::cout<<i<<std::endl;
}

What does this program print?

  1. 42,123
  2. 0,42
  3. 0,123
  4. Undefined behaviour / std::terminate called

I can see use cases for both option 1 (42,123) and option 2 (0,42). Option 3 is only there as a straw man — the whole point of the context objects is that thread_local objects created within the lifetime of the context object are then destroyed when the context object is destroyed.

Though potentially tempting, I think that undefined behaviour or termination is undesirable as it would be hard to identify the problem when looking at the source code, and it would be easy to trigger such behaviour by calling a function that used thread_local prior to the construction of the context.

So, which of options 1 and 2 do we go for? I favour option 2: the construction of the context object creates a "clean slate" for thread_local variables.

The downside of doing so is that any library that uses thread_local data structures as a cache for optimization purposes (such as an allocator with thread-local heaps) will have to recreate those structures within each context, even though it might be desirable to preserve such structures across contexts. For example, with the worker_thread in the code above it might be desirable to preserve per-thread heaps across task invocations to avoid repeatedly constructing/destructing the heap.

However, I believe that this downside is outweighed by the clarity of the code: with option 2, within a new thread_local_context you know that you have a "clean slate", and that no thread_local variables have values left over from another scope. With option 1, our worker thread example would start "leaking" values from one task to the next whenever a variable also happened to be used in the code outside the context. With option 2 this is not possible, as each task gets a new copy of all the variables.
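The difference between options 1 and 2 can be simulated in ordinary code. The following is a minimal sketch, assuming a hypothetical simulated_tl_int that keeps one value per active context and a hypothetical context_sim that takes the chosen policy as a parameter. None of these names are part of the proposal, and a real implementation would not involve user-visible objects like these.

```cpp
#include <sstream>
#include <string>
#include <vector>

// The two candidate behaviours discussed above.
enum class policy { preserve_values, clean_slate };

// Simulated "thread_local int i=0": one value per active context, with the
// last element belonging to the innermost context.
struct simulated_tl_int
{
    std::vector<int> per_context{0}; // implicit outermost context, i=0
    int& current() { return per_context.back(); }
};

// Hypothetical stand-in for thread_local_context.
class context_sim
{
    simulated_tl_int& var_;
    bool fresh_;
public:
    context_sim(simulated_tl_int& var, policy p)
        : var_(var), fresh_(p == policy::clean_slate)
    {
        if (fresh_)
            var_.per_context.push_back(0); // option 2: start from the initializer
    }
    ~context_sim()
    {
        if (fresh_)
            var_.per_context.pop_back();   // option 2: discard the inner value
    }
    context_sim(const context_sim&) = delete;
    context_sim& operator=(const context_sim&) = delete;
};

// Mirrors the example program and returns what it would print.
std::string run_example(policy p)
{
    simulated_tl_int i;
    std::ostringstream out;
    i.current() = 42;
    {
        context_sim context(i, p);
        out << i.current() << ",";
        i.current() = 123;
    }
    out << i.current();
    return out.str();
}
```

Here run_example(policy::preserve_values) yields "42,123" and run_example(policy::clean_slate) yields "0,42", matching options 1 and 2.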

What about pointers to thread_local variables?

Let's look at our example again, but this time we'll also store the address of i in a normal local variable p, and dereference this pointer inside the context.

#include <iostream>

static thread_local int i=0;

int main()
{
    i=42;
    int* p=&i;
    {
        thread_local_context context;
        std::cout<<i<<",";
        std::cout<<*p<<",";
        *p=99;
        i=123;
    }
    std::cout<<i<<std::endl;
}

What does this example print now?

  1. 42,42,123
  2. 0,42,99
  3. 0,0,42
  4. Undefined behaviour

I believe that options 1 and 2 here are the behaviours that best correspond to options 1 and 2 for the lifetime issues. If we preserve values from the parent context (option 1), then *p and i refer to the same variable. On the other hand, if we go for option 2 (the "clean slate" option), then *p refers to the variable from the outer context, whereas in the nested context i refers to the new variable (which thus has a different address).

I think the third and fourth alternatives are understandable from an implementation perspective if we go for the "clean slate" option, but not desirable. The third alternative corresponds to an implementation that magically saves the values of the thread_local variables when the new context is initialized, and reuses the same addresses to refer to the value of that thread_local variable in the current context. For example, this could be done on a segmented architecture where thread_local variables live in a special segment, and that segment is remapped for the new context, and the mapping restored when the context is destroyed. However, I think this is undesirable behaviour: we allow pointers to thread_local variables to be passed between threads, and I think this is directly analogous, so we should also allow pointers to thread_local variables to be passed between contexts in a single thread. The fourth option (undefined behaviour) is just a "give implementors freedom" option, but I think it is undesirable for the reasons just given, and because I think undefined behaviour should not be introduced without very good cause.
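The pointer behaviour of the "clean slate" option can be simulated by giving each context its own storage at a stable address. The following is a minimal sketch, assuming a std::deque of ints stands in for the per-context instances of i, since std::deque leaves existing elements in place when elements are added or removed at the back. The function name clean_slate_with_pointer is hypothetical.

```cpp
#include <deque>
#include <sstream>
#include <string>

// Mirrors the pointer example under the "clean slate" option and returns
// what it would print, or an empty string if the two instances of i
// unexpectedly shared an address.
std::string clean_slate_with_pointer()
{
    std::deque<int> i_per_context(1, 0);   // outer context: i=0
    std::ostringstream out;

    i_per_context.back() = 42;             // i=42;
    int* p = &i_per_context.back();        // int* p=&i;

    i_per_context.push_back(0);            // context constructed: fresh i
    bool distinct = (p != &i_per_context.back());
    out << i_per_context.back() << ",";    // i is the new variable
    out << *p << ",";                      // *p is still the outer variable
    *p = 99;                               // writes to the outer variable
    i_per_context.back() = 123;            // writes to the inner variable
    i_per_context.pop_back();              // context destroyed: inner i gone

    out << i_per_context.back();           // the outer i, as modified through p
    return distinct ? out.str() : std::string();
}
```

The returned string is "0,42,99": the pointer taken in the outer context stays valid and keeps referring to the outer variable for the whole lifetime of the nested context.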

Interaction with std::packaged_task and the proposed std::async function

If this proposal is adopted, then it could be used as part of an implementation of std::async (as proposed in N2889 and N2901) to ensure that the associated future did not become ready before the thread-local variables for the asynchronous task had been destroyed.

This proposal could also be integrated with std::packaged_task to ensure that the contained task was run in its own context, and that the context was destroyed (and the future result value stored) before the future became ready. This would allow end users to write a simple function for spawning a task with a return value on a new thread without having to worry about the issue of destruction of thread-local variables. However, it could potentially yield surprising behaviour if the task was invoked directly on an existing thread, particularly if the "clean slate" option was chosen.

Proposed Wording

I have no proposed wording at this time. If the committee agrees to proceed with this, then I can work to provide wording.

Acknowledgements

Thanks to Alberto Ganesh Barbati, Peter Dimov, Lawrence Crowl, Beman Dawes and others who have commented on this proposal on the mailing lists and via personal email.