ISO/IEC JTC1 SC22 WG21 N2148 = 07-0008 - 2007-01-13
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Concurrency introduces potential deadlock or data races in the dynamic initialization and destruction of static-duration objects. Because the opportunity for programmers to manually synchronize such objects is limited by a lack of syntactic handles, the language must introduce new syntax, define synchronization, or limit programs.
This proposal breaks the problem into three reasonably separable parts: initialization of function-local static-duration objects, initialization of non-local static-duration objects, and destruction of all static-duration objects. Two of these problems have two proposed solutions.
This proposal exploits properties of a new algorithm for fast synchronized initialization by Mike Burrows of Google. (See the appendix for an implementation and an argument for its correctness.) The algorithm has three attributes significant to its use here.
The core problem with function-local static-duration object initialization is that the containing function may be invoked concurrently, and thus the definition may execute concurrently. Failing to synchronize may introduce a race condition. The obvious solution is to synchronize. The problem is that such synchronization may introduce deadlock if the synchronization involves holding a lock while the initializer executes. We present two proposed solutions. The first solution assumes a synchronization mechanism that does not hold a lock during initialization. The second solution assumes that the mechanism may hold a lock, and thus requires new mechanisms to prevent deadlock.
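As a hedged illustration of the lock-held hazard, consider two functions whose initializers call each other from different threads; the names f and g are hypothetical:

int g();

int f() {
    static int a = g();  // thread 1: acquires a's lock, then calls g()
    return a;
}

int g() {
    static int b = f();  // thread 2: acquires b's lock, then calls f()
    return b;
}

If thread 1 enters f() while thread 2 enters g(), and the implementation holds each variable's lock while its initializer runs, thread 1 holds a's lock and blocks on b's lock while thread 2 holds b's lock and blocks on a's lock: a lock-order deadlock. Absent any synchronization, the same program instead has a race on the initializations.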
The proposal relies on Burrows' algorithm to avoid holding a lock while the initializer executes. Any deadlock that occurs as a result of initialization must necessarily have been a race condition in the absence of synchronization.
6.7 Declaration statement [stmt.dcl]
In paragraph 4, edit
Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the object is being initialized, the concurrent execution waits for completion of the initialization. The implementation shall not introduce any locking around execution of the initializer. If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.
The definition of a function-local static-duration variable
may indicate the synchronization of its initialization
with a synchronizer.
- If the synchronizer is protected,
the initialization of the variable is locked.
Each variable has its own lock.
In the event that the initializer recursively invokes itself,
the behavior is undefined.
(Deadlock is probably preferable to any other action.)
- If the synchronizer is private,
the initialization of the variable is not protected implicitly,
and the program is undefined
if it executes the definition of the variable
concurrently or recursively.
- If the synchronizer is not specified,
the initialization is implicitly protected.
[Note: We expect compilers to provide options
that change this behavior to permit migrating code
from its current default.]
For example:
int f() {
    static int u protected = g();  // synchronized
    static int v = h();            // synchronized
    static int w private = i();    // unsynchronized
    static nonPOD x private (3);   // unsynchronized
    return u + v + w + x.to_int();
}
A function-local static-duration object that is not also a variable has no syntactic handle for a synchronizer. In this case, the default synchronizer applies.
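A hedged example of such an object, assuming the rule that a temporary bound to a static-duration reference persists for the lifetime of the reference:

#include <string>

const std::string& greeting() {
    // The temporary std::string bound to the reference has static storage
    // duration, but it is not itself a variable, so there is no place to
    // write a synchronizer for it; the default synchronization applies.
    static const std::string& s = std::string("hello");
    return s;
}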
Implementation
There is existing experience implementing the semantics for
function-local static-duration object initialization.
The current GNU compiler implements protected semantics.
The current Sun compiler implements private semantics.
Thus, there is implementation experience for both approaches individually.
There is currently no implementation of the proposed synchronizer syntax.
6.7 Declaration statement [stmt.dcl]
In paragraph 4, edit
Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the object is being initialized, the concurrent execution waits for completion of the initialization. If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.
8.5 Initializers [dcl.init]
In paragraph 1, edit
A declarator can specify an initial value for the identifier being declared. The identifier designates an object or reference being initialized. The process of initialization described in the remainder of 8.5 applies also to initializations specified by other syntactic contexts, such as the initialization of function parameters with argument expressions (5.2.2) or the initialization of return values (6.6.3). For function-scoped static-duration objects, the initializer may specify a synchronizer.
- initializer:
  - initializer-synchronizer initializer-specification
  - initializer-synchronizer
  - initializer-specification
- initializer-synchronizer:
  - private
  - protected
- initializer-specification:
  - = initializer-clause
  - ( expression-list )
- initializer-clause:
  - assignment-expression
  - { initializer-list ,opt }
  - { }
- initializer-list:
  - initializer-clause
  - initializer-list , initializer-clause
At the end of section 8.5, add a new paragraph 6
The definition of a function-scoped static-duration variable
may indicate the synchronization of its initialization
with an initializer-synchronizer.
- If the synchronizer is protected,
the initialization of the variable
is protected by implicit double-checked locking.
Each variable has its own lock.
In the event that the initializer recursively invokes itself,
the behavior is undefined.
- If the synchronizer is private,
the initialization of the variable
is not protected implicitly,
and the program is undefined
if it executes the definition of the variable
concurrently or recursively.
- If the synchronizer is not specified,
the initialization is implicitly protected.
At the end of section 8.5, add a new paragraph 7
A function-local static-duration object
that is not also a variable
is synchronized as if it were a variable
with an unspecified synchronizer.
12.6 Initialization [class.init]
In paragraph 1, edit
When no initializer-specification
is specified for an object of (possibly cv-qualified) class type
(or array thereof),
or the initializer-specification
has the form (),
the object is initialized as specified in 8.5.
Non-Local Initialization
There are two problems
with non-local static-duration object initialization.
The first problem is that such initialization may occur concurrently.
There are two sources of this concurrency:
dynamic opening of a shared library
in a multi-threaded context,
and creation of multiple threads within initializers,
with the potentially implied concurrent execution of other initializers.
The second problem is that static-duration object initialization
may constitute a large fraction of process time,
particularly when users are waiting for program start,
and hence may benefit from parallel execution.
The standard currently admits unordered initialization.
When initializing one object,
reference to a relatively unordered object
may result in access before dynamic initialization
but after zero initialization.
Since the concept of zero initialization
is of little use for many objects,
particularly those that require dynamic construction,
we consider this feature of the current C++ standard to be of limited utility.
Therefore, we propose to make such references undefined.
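For example, under this proposal the following cross-translation-unit reference (the file and variable names are hypothetical) would be undefined:

// tu1.cpp
extern int a;    // defined in tu2.cpp; relatively unordered with b
int b = a + 1;   // undefined under this proposal: a may still hold only
                 // its zero-initialized value when b is initialized

// tu2.cpp
int compute();
int a = compute();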
Making such references undefined
would normally imply that
std::cout
and std::operator new
could not be used by any dynamic initializer.
To resolve this problem,
the standard should provide additional guarantees,
and it currently appears to do so.
In the sequential 2003 standard,
unspecified order of initialization
essentially required some serial order of initialization.
In the parallel proposed standard,
the unspecified order of initialization
admits concurrent initialization.
The necessary additional restriction
is that a single object may be initialized at most once.
This restriction may require locking.
With concurrent initialization,
any write (or read in the presence of writes) to statically-initialized objects
must be properly locked.
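A minimal sketch of the locking this implies, assuming a registry shared by initializers; all names here are hypothetical, and the mutex is statically initialized so that it is usable at any point during startup:

#include <pthread.h>

static pthread_mutex_t registry_mu = PTHREAD_MUTEX_INITIALIZER;
static int registry_count;  // zero-initialized before any dynamic init

int register_object() {
    // Initializers may now run concurrently, so this write is locked.
    pthread_mutex_lock(&registry_mu);
    int id = registry_count++;
    pthread_mutex_unlock(&registry_mu);
    return id;
}

// These may be initialized concurrently if they live in different
// translation units; both registrations remain safe.
int first_id  = register_object();
int second_id = register_object();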
An alternate approach,
not proposed here,
is to require a single global order for all initialization.
The implementation must initialize
static-duration objects before any of their use
within main
or the functions it calls.
Implementation
To the best of our knowledge,
no compiler synchronizes
the initialization of non-local static-duration objects.
However, the technical issues
are similar to existing function-local initializations,
and so present no new challenges.
3.6.2 Initialization of non-local objects [basic.start.init]
In paragraph 1, edit
Dynamic initialization of an object
defined in namespace scope
is either ordered or unordered.
Definitions of explicitly specialized class template static data members
have ordered initialization.
Other class template static data members
(i.e., implicitly or explicitly instantiated specializations)
have unordered initialization.
Other objects defined in namespace scope have ordered initialization.
Objects within a single translation unit
and with ordered initialization
are relatively ordered;
otherwise they are relatively unordered.
Relatively ordered objects
shall be initialized in the order of their definitions
within the translation unit.
The order of initialization
is unspecified for relatively unordered objects.
If the initialization of an object
uses a relatively unordered object,
the behavior is undefined;
no diagnostic is required.
[Note: This definition permits concurrent initialization.]
Destruction
The primary problem with destruction of static-duration objects
is preventing execution of the destructors before threads finish.
In many applications, getting threads to finish
requires terminating them.
For various reasons, this shutdown needs to be cooperative.
This need in turn implies that all code be able to handle shutdown.
This requirement is new; nearly all existing code is not written to meet it.
In addition, the difficulty of such code is equivalent to that of exception safety,
which is known to be a hard problem.
The concern is that
few programs will ever be able to successfully finish in this environment.
Therefore, it would be better to never execute static destructors.
Recognizing the potential controversy in not executing static destructors,
we propose two options, one without destruction, and one with.
Option: No Destructor Execution
The imposed burden on most programs
in not executing static-duration object destructors
is in fact fairly small.
Few of the destructors have external effect.
Those that do can use the std::atexit
mechanism
to clean up their state.
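For example, a destructor whose only external effect is flushing a log file (the names here are hypothetical) can be replaced by a handler registered with std::atexit during initialization:

#include <cstdio>
#include <cstdlib>

static std::FILE* log_file;

static void close_log() {  // runs during termination, in place of a destructor
    if (log_file != 0)
        std::fclose(log_file);
}

static int open_log() {
    log_file = std::fopen("app.log", "w");
    std::atexit(&close_log);  // registered once, during initialization
    return 0;
}

static int log_initialized = open_log();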
3.6.3 Termination [basic.start.term]
In paragraph 1, edit
Destructors (12.4) for initialized objects of static storage duration
(declared at block scope or at namespace scope)
are not called
as a result of returning from main
or as a result of calling
std::exit (18.4).
Delete paragraph 2.
If a function contains a local object of static storage duration
that has been destroyed
and the function is called
during the destruction of an object with static storage duration,
the program has undefined behavior
if the flow of control
passes through the definition of the previously destroyed local object.
Delete paragraph 3.
If a function is registered with std::atexit (see <cstdlib>, 18.4)
then following the call to std::exit,
any objects with static storage duration
initialized prior to the registration of that function
shall not be destroyed
until the registered function is called from the termination process
and has completed.
For an object with static storage duration
constructed after a function is registered with std::atexit,
then following the call to std::exit,
the registered function is not called
until the execution of the object's destructor has completed.
If std::atexit is called during the construction of an object,
the complete object to which it belongs shall be destroyed
before the registered function is called.
18.4 Start and termination [lib.support.start.term]
In paragraph 8, bullet 1, edit
First, functions registered by calling atexit are called.
(Objects are not destroyed as a result of calling exit().)218)
Functions registered with atexit
are called in the reverse order of their registration,
except that a function is called after any previously registered functions
that had already been called at the time it was registered.219)
Option: Destructor Execution
As stated before, the primary problem with destruction
is threads accessing objects during or after their destruction.
To prevent this problem,
we require that all user threads terminate
before destruction begins.
The mechanism to terminate threads is under discussion
and as yet has no firm proposal.
Destruction also introduces the potential
to consume a large fraction of process time.
Therefore, similar to initialization,
we enable concurrent destruction.
Objects defined at namespace scope with relatively ordered initializations
must be destructed in reverse order of their initialization.
The complication to this approach
is destruction of function-local static-duration objects
and calls to functions registered with std::atexit
.
Since the order
of initialization of function-local static-duration objects
and of calls to std::atexit
is defined by execution,
rather than by declaration,
we call these execution-ordered objects.
(For expository purposes, calls to std::atexit
correspond to initialization of a virtual object.)
Non-local static-duration objects are called declaration-ordered objects.
For execution-ordered objects
initialized from within the dynamic context
of a declaration-ordered initialization,
their destruction
shall occur in reverse order of completion of their initialization
immediately before the destruction of the context object.
To make this approach viable,
initialization and destruction of execution-ordered objects
must obey the same restrictions as those on declaration-ordered objects.
Specifically, the initialization and destruction of an execution-ordered object
must not use an object that is relatively unordered
with respect to the context object
and must synchronize use of objects with trivial destructors.
Finally, execution-ordered objects may be initialized
outside the context of initialization of a declaration-ordered object,
that is, they may be initialized from ordinary code.
These objects are destructed
in reverse order of the completion of their initialization
before destruction of any declaration-ordered object.
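A hedged example of the resulting order; the types Logger and App are hypothetical:

#include <cstdio>

struct Logger {
    Logger()  { std::printf("logger up\n"); }
    ~Logger() { std::printf("logger down\n"); }
};

Logger& log() {
    static Logger l;  // execution-ordered: initialized during app's initialization
    return l;
}

struct App {
    App()  { log(); std::printf("app up\n"); }
    ~App() { std::printf("app down\n"); }
};

App app;  // declaration-ordered

int main() { return 0; }

Under this proposal, l is destroyed immediately before app, because l's initialization completed within the dynamic context of app's initialization. Had log() instead been called first from main, l would have been destroyed before any declaration-ordered object.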
Implementation
One potential concern
is the run-time cost of managing destruction order.
To address this issue, we provide an implementation outline below.
- Define a static-duration object
that lists pending destructions for groups of declaration-ordered objects.
Call this list the declaration-destructor list.
- Define a static-duration object
that lists pending destructions for execution-ordered objects
not initialized in the context of a declaration-ordered initialization.
Call this list the execution-destructor list.
- Define a thread-duration object that lists pending destructions.
Call this list the threaded-destructor list.
- When starting initialization of a group of relatively-ordered objects
(an initialization region),
create an empty threaded-destructor list.
- For each dynamic initialization within an initialization region,
non-atomically insert the destruction
at the head of the threaded-destructor list.
This list will capture the function-local objects
initialized as a consequence of the initialization of non-local objects.
The code can be as simple as an insertion onto a singly-linked list
with nodes statically allocated.
- When finishing an initialization region,
atomically move the threaded-destructor list
to the declaration-destructor list
as a group.
The code can be as simple as
an atomic insertion onto a singly-linked list
with nodes statically allocated.
The atomic insertion can be done with a compare-and-swap-with-release loop,
which will terminate rapidly.
A read-acquire on the head of the list will be necessary
before traversing the list;
see the sketch following this list.
- For each dynamic initialization not within an initialization region,
atomically insert the destruction at the head of the execution-destructor list.
The code can be as simple as an atomic insertion onto a singly-linked list
with nodes statically allocated.
The insertion has the same basic algorithm as above.
- Upon program exit,
iterate over the execution-destructor list
and call the corresponding destructors.
After those complete,
iterate over the declaration-destructor list
and start the corresponding group destruction concurrently.
Within each group, iterate sequentially over the destructor list.
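A compressed sketch of the list handling above; the node layout and function names are hypothetical, and C++11 std::atomic stands in for whatever compare-and-swap-with-release and read-acquire primitives the implementation actually has:

#include <atomic>

struct DtorNode {            // statically allocated alongside each object
    void (*destroy)(void*);  // destructor thunk
    void* object;
    DtorNode* next;
};

// The execution-destructor list: atomic insertion at the head.
static std::atomic<DtorNode*> execution_dtors(0);

void push_execution_dtor(DtorNode* n) {
    // Compare-and-swap-with-release loop; it terminates rapidly because a
    // failure means some other thread's insertion just succeeded.
    DtorNode* head = execution_dtors.load(std::memory_order_relaxed);
    do {
        n->next = head;
    } while (!execution_dtors.compare_exchange_weak(
                 head, n,
                 std::memory_order_release, std::memory_order_relaxed));
}

// Traversal requires a read-acquire on the head before walking the list.
void run_execution_dtors() {
    DtorNode* n = execution_dtors.exchange(0, std::memory_order_acquire);
    for (; n != 0; n = n->next)
        n->destroy(n->object);
}

Splicing a completed threaded-destructor list onto the declaration-destructor list is the same loop, with the group's tail node's next field pointed at the old head.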
3.6.3 Termination [basic.start.term]
In paragraph 1, edit
Destructors (12.4) for initialized objects of static storage duration
(declared at block scope or at namespace scope)
are called as a result of returning from main
and as a result of calling std::exit
(18.4).
These objects
are destroyed in the reverse order
of the completion of their constructor or
of the completion of their dynamic initialization.
Dynamic initializations of local static objects
executing inside the dynamic scope
of the initialization of a non-local static object
are relatively ordered to the non-local static object.
Dynamic initializations of local static objects
executing outside the dynamic scope
of the initialization of a non-local static object
are relatively ordered to main.
Objects with relatively ordered initialization
shall be destroyed in reverse order of completion of their initialization.
Objects with relatively unordered initialization
may be destroyed concurrently,
except that all objects relatively ordered to main
shall be destroyed before any non-local static-duration objects.
If an object is initialized statically,
the object is destroyed in the same order
as if the object was dynamically initialized.
For an object of array or class type,
all subobjects of that object are destroyed
before any local object with static storage duration
initialized during the construction of the subobjects
is destroyed.
In paragraph 2, edit
If a function contains a local object of static storage duration
that has been destroyed
and the function is called
during the destruction of an object with static storage duration,
the program has undefined behavior
if the flow of control
passes through the definition of the previously destroyed local object.
Likewise, the behavior is undefined
if the function-local object is used indirectly
(i.e. through a pointer).
In paragraph 3, edit
A call to std::atexit
from inside the dynamic scope of
the initialization of a non-local static object
is relatively ordered to the non-local static object.
A call to std::atexit
from outside the dynamic scope of
the initialization of a non-local static object
is relatively ordered to main
.
If a function is registered with std::atexit (see <cstdlib>, 18.4)
then following the call to std::exit,
any relatively ordered objects with static storage duration
initialized prior to the registration of that function
shall not be destroyed
until the registered function is called from the termination process
and has completed.
For a relatively ordered object
with static storage duration
constructed after a function is registered with std::atexit,
then following the call to std::exit,
the registered function is not called
until the execution of the object's destructor has completed.
If std::atexit
is called during the construction of an object,
the complete object to which it belongs shall be destroyed
before the registered function is called.
18.4 Start and termination [lib.support.start.term]
In paragraph 8, add a new bullet 0
First, terminate all threads.
In paragraph 8, existing bullet 1, edit
Next, objects with static storage duration are destroyed and
functions registered by calling atexit
are called.219)
See 3.6.3 for the order of destructions and calls.
Appendix: A Fast Implementation of Synchronized Initialization
Mike Burrows, m3b at Google.
Overview
I show a fast technique for concurrent synchronized initialization,
in the form of an alternate implementation of pthread_once.
Licensing
It is our intent to make the following technique freely available.
To that end, some licensing appears to be required.
Copyright (c) 2007, Google Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
- Neither the name of Google Inc. nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Header fast_pthread_once.h
The header defines the synchronization type,
defines the synchronization sentinel value,
declares the synchronization function,
and defines an inline fast synchronization function.
#ifndef FAST_PTHREAD_ONCE_H
#define FAST_PTHREAD_ONCE_H
#include <signal.h>
#include <stdint.h>
typedef sig_atomic_t fast_pthread_once_t;
#define FAST_PTHREAD_ONCE_INIT SIG_ATOMIC_MAX
extern __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;
#ifdef __cplusplus
extern "C" {
#endif
extern void fast_pthread_once( fast_pthread_once_t *once, void (*func)(void) );
inline static void fast_pthread_once_inline(
fast_pthread_once_t *once, void (*func)(void) )
{
fast_pthread_once_t x = *once; /* unprotected access */
if ( x > _fast_pthread_once_per_thread_epoch ) {
fast_pthread_once( once, func );
}
}
#ifdef __cplusplus
}
#endif
#endif /* FAST_PTHREAD_ONCE_H */
Source fast_pthread_once.c
The source is written in C.
The lines of the primary function
are numbered for reference in the subsequent correctness argument.
#include "fast_pthread_once.h"
#include <pthread.h>
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
/* protects global_epoch and all fast_pthread_once_t writes */
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
/* signalled whenever a fast_pthread_once_t is finalized */
#define BEING_INITIALIZED (FAST_PTHREAD_ONCE_INIT - 1)
static fast_pthread_once_t global_epoch = 0; /* under mu */
__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;
static void check( int x )
{
if ( x == 0 )
abort();
}
void fast_pthread_once( fast_pthread_once_t *once, void (*func)(void) )
{
/*01*/ fast_pthread_once_t x = *once; /* unprotected access */
/*02*/ if ( x > _fast_pthread_once_per_thread_epoch ) {
/*03*/ check( pthread_mutex_lock(&mu) == 0 );
/*04*/ if ( *once == FAST_PTHREAD_ONCE_INIT ) {
/*05*/ *once = BEING_INITIALIZED;
/*06*/ check( pthread_mutex_unlock(&mu) == 0 );
/*07*/ (*func)();
/*08*/ check( pthread_mutex_lock(&mu) == 0 );
/*09*/ global_epoch++;
/*10*/ *once = global_epoch;
/*11*/ check( pthread_cond_broadcast(&cv) == 0 );
/*12*/ } else {
/*13*/ while ( *once == BEING_INITIALIZED ) {
/*14*/ check( pthread_cond_wait(&cv, &mu) == 0 );
/*15*/ }
/*16*/ }
/*17*/ _fast_pthread_once_per_thread_epoch = global_epoch;
/*18*/ check (pthread_mutex_unlock(&mu) == 0);
}
}
Correctness Argument
The above code implements pthread_once()
in a way that is both fast (a few simple instructions, no memory barriers)
and portable.
Careful use of memory atomicity and monotonicity
takes the place of a memory barrier.
The fast path for the inline function in the header
has the following code, as generated by gcc 3.2.2 on an x86:
mov %gs:(%esi),%eax                # access thread-local storage
cmp %eax, pthread_once_t_location  # access the pthread_once_t variable
jg  slow_path                      # decide whether to call the function
This code is one instruction longer than a test that is not thread-safe.
This code touches two cache lines,
while an unsafe version would touch only one.
However, the additional cache line is thread-specific,
and not especially likely to cause misses.
As one might expect,
repeated calls to the inline version
achieve around 1 billion fast-path calls per second
on a 2.8GHz processor with a hot cache.
In the sketch proof that follows,
assumptions and deductions are given names in square brackets.
This sketch will no doubt appear to be overly pedantic to some
and sloppy to others.
It probably contains errors and misuse of terminology;
I hope these faults are not fatal.
Its purpose is to convince the reader
that with sufficient effort a full proof could be constructed,
and perhaps to form the basis for such a proof
for anyone eager for such mental exercise.
The portability assumptions beyond those in a straightforward implementation
are as follows:
- [thread_local]:
There is a way to access thread-specific data or thread-local storage.
The technique is useful only if such an access is faster than a memory barrier.
For the disassembly above,
the compiler used its ability to access thread-local storage
using variables declared with __thread,
so the thread-local access is a normal "mov" instruction
with a different segment register.
- [atomicity]:
There exists an integral type T (such as sig_atomic_t)
such that loads and stores of a variable of type T are atomic.
It is not possible to read from such a variable any value
that was not written to the variable,
even if multiple reads and writes occur on many processors.
Another way to say this is that variables of type T
do not exhibit word tearing when accessed with full-width accesses.
- [*once_bound]:
The number of pthread_once_t variables in a programme
is bounded and can be counted in an instance of type T without overflow.
If T is 32-bits,
this implementation requires that
the number of pthread_once_t variables in a programme
not exceed 2**31-3.
(Recall that pthread_once_t variables
are intended to have static storage class,
so it would be remarkable for such a huge number to exist.)
- [monotonicity]:
Accesses to a single variable V of type T by a single thread X
appear to occur in programme order.
For example, if V is initially 0, then X writes 1, and then 2 to V,
no thread (including but not limited to X)
can read a value from V and then subsequently read a lower value from V.
(Notice that this does not prevent arbitrary load and store reordering;
it constrains ordering only between actions on a single memory location.
This assumption is reasonable on all architectures that I currently know about.
I suspect that the Java and CLR memory models require this assumption also.)
Let us say that
every fast_pthread_once variable *once
is "finalized"
when the associated initialization function "func"
returns (line 7 in the code above).
We want to argue [correctness] (the combination of [safety] and [liveness])
of the code above, using the assumptions above, plus a few others given below.
For [safety], we require [safety_1] that
each variable *once is finalized at most once.
We also require [safety_2] that
any thread that calls fast_pthread_once on *once
does not return until *once has been finalized at least once,
and all the modifications made by func
are visible to that thread.
For [liveness], we require that the code does not deadlock or loop forever.
In the code, all variable accesses are consistent,
except the load of *once on line 1, because:
- All modifications to *once and all loads but line 1
occur under mu,
so all accesses to *once except the load on line 1 are consistent.
- All accesses to global_epoch occur under mu,
so all accesses to global_epoch are consistent.
- The variable _fast_pthread_once_per_thread_epoch
is a per-thread variable accessed only by its owning thread,
so all accesses to _fast_pthread_once_per_thread_epoch
are consistent (requires [thread_local]).
The consistency of variable accesses will not be mentioned further,
except for *once on line 1.
[release_consistency]
We assume that we have mutexes that provide at least release consistency.
We use pthread_mutex_t in the example.
[fairness]
We assume that runnable threads eventually run.
[atomicity2]
We assume that fast_pthread_once variables are of the type T,
from [atomicity].
[initialization]
We assume that fast_pthread_once_t variables
are originally set to FAST_PTHREAD_ONCE_INIT.
[global_epoch_value_range]
global_epoch counts the number of *once variables
that have been finalized;
it is incremented (line 9) each time func is called (line 7)
and never modified otherwise.
From [*once_bound],
we know the value of global_epoch
is in {0, ..., FAST_PTHREAD_ONCE_INIT-2},
because global_epoch is initially 0,
and FAST_PTHREAD_ONCE_INIT
is the maximum value for the type of *once.
[*once_value_sequence]
Every *once variable is initialized to
FAST_PTHREAD_ONCE_INIT [initialization].
There are only two places where *once
is changed:
- Line 5: *once is set to BEING_INITIALIZED
(==FAST_PTHREAD_ONCE_INIT-1).
- Line 10: *once is set to the value of global_epoch,
which is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}
(from [global_epoch_value_range]).
The sets {FAST_PTHREAD_ONCE_INIT},
{BEING_INITIALIZED}, and
{0, ..., FAST_PTHREAD_ONCE_INIT-2}
have no intersection because
BEING_INITIALIZED==FAST_PTHREAD_ONCE_INIT-1,
and FAST_PTHREAD_ONCE_INIT
is the maximum value for the type of *once.
The first assignment (line 5) to BEING_INITIALIZED
is predicated on *once being FAST_PTHREAD_ONCE_INIT,
and the update is atomic due to the use of a mutex.
Therefore, a *once
may make the transition from FAST_PTHREAD_ONCE_INIT
to BEING_INITIALIZED at most once.
The second assignment (line 10) occurs
only if the executing thread performed the first assignment;
therefore the second assignment occurs at most once.
So any given *once location takes on a sequence of values
that is a prefix of the sequence:
FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E
where E < FAST_PTHREAD_ONCE_INIT,
E < BEING_INITIALIZED,
and E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}.
[safety_1]
The function func
is called at most once for each *once variable,
because it is called only if the executing thread
performed the assignment at line 5,
which is performed at most once,
from [*once_value_sequence].
This shows [safety_1].
[slow_path_safety_2]
Any thread W that reaches line 10 has called func;
therefore *once is finalized
and the modifications from func are visible to W.
If W reaches line 18,
W either passed through line 10,
or passed through line 13 and found *once != BEING_INITIALIZED.
We know from [monotonicity] and [*once_value_sequence] that
*once cannot be FAST_PTHREAD_ONCE_INIT (line 4)
or BEING_INITIALIZED (line 13),
so it must be some value E,
a value in {0, ..., FAST_PTHREAD_ONCE_INIT-2}.
The location *once
takes on such a value only after being finalized.
Moreover, accesses by a thread after line 13 seeing E
are ordered after the assignment of E to *once
and the modifications made by the call to func.
This is from [release_consistency]
and the facts that accesses at line 13 occur with the mutex held
and that the thread that finalized *once
released the same mutex after doing so.
Thus, if W reaches line 18,
it does so after *once has been finalized,
and with all modifications made by func visible to it.
This shows [safety_2] for the slow path, which we call [slow_path_safety_2].
[fast_path_safety_2]
We now need to show that any thread X that
takes the fast path (just lines 1 and 2, with a false predicate at line 2),
does so only if *once is finalized
and func's modifications are visible to X.
The value read from *once at line 1 is not read under mu,
so the value read may be inconsistent.
However, it is known to be one of the values
FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E
(where E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}
and E < FAST_PTHREAD_ONCE_INIT
and E < BEING_INITIALIZED)
because of [*once_bound], [atomicity2], and [*once_value_sequence].
Further, we know that X's _fast_pthread_once_per_thread_epoch
is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}
because it is assigned only from global_epoch (line 17)
and we have [global_epoch_value_range].
Therefore, X sees the line 2 predicate as false only if both:
- X read some value E from *once
(rather than reading FAST_PTHREAD_ONCE_INIT
or BEING_INITIALIZED),
and
- X's _fast_pthread_once_per_thread_epoch exceeds the value E;
both values were ultimately derived from global_epoch,
which is consistently accessed under mu.
From the first of these
plus assumptions [*once_bound], [monotonicity], and [*once_value_sequence],
we know that for X to take the fast path,
some thread Y executed line 10, and therefore Y finalized *once.
From the second of these,
we know that X must have acquired mu (mu is held at line 17)
after Y finalized *once and subsequently released mu.
So from [release_consistency],
all the values modified by func are visible to X.
This gives us [fast_path_safety_2].
[safety]
Combining [safety_1], [slow_path_safety_2], and [fast_path_safety_2],
we have [safety].
[mu_deadlock_freedom]
From the code, while mu is held,
no other mutexes are acquired nor any other blocking call made,
so mu cannot participate in a deadlock.
[liveness]
Liveness comes from these observations:
- the scheduler does not delay arbitrarily [fairness],
- the mutex cannot cause deadlock [mu_deadlock_freedom],
- all changes that might cause line 13 to see a false predicate
must be made by line 10,
and the same thread must then signal the condition variable (line 11)
thus waking any waiter (line 14), and
- there is only one loop, at line 13.
This loop is entered only if
some thread Y
has set *once != FAST_PTHREAD_ONCE_INIT (from line 4).
The loop will terminate if Y ever reaches lines 10, 11 and 18 (in sequence),
which is shown by the foregoing observations.
[correctness]
We have [safety] and [liveness].
Test fast_pthread_once_test.c
#include <stdio.h>
#include <string.h>  /* strcmp, strspn, strlen */
#include <stdlib.h>  /* atoi */
#include <assert.h>
#include "fast_pthread_once.h"
fast_pthread_once_t once = FAST_PTHREAD_ONCE_INIT;
void run_once(void)
{
static int run_count = 0;
assert( run_count == 0 );
run_count++;
}
int main( int argc, char *argv[] )
{
int use_inline = 0;
int n = 1000 * 1000 * 1000;
int i;
for ( i = 1; i < argc; i++ ) {
const char *arg = argv[i];
if ( strcmp(arg, "-i") == 0 ) {
use_inline = 1;
} else if ( strspn(arg, "0123456789") == strlen(arg) ) {
n = atoi(arg);
} else {
fprintf( stderr, "usage: %s [-i] [number]\n", argv[0] );
return 2;
}
}
if ( use_inline ) {
for ( i = 0; i != n; i++ ) {
fast_pthread_once_inline(&once, run_once);
}
} else {
for ( i = 0; i != n; i++ ) {
fast_pthread_once(&once, run_once);
}
}
return 0;
}
Improvements
There are two primary sources of inefficiency in the code.
First, there is a single mutex for maintaining the global epoch.
Second, all waiting threads are woken for each completed initialization.
We do not believe these inefficiencies will matter in practice,
because they become significant only for
many threads over many long-running initializations.
(Recall that the mutex is not held during the actual initialization.)
However, for pathological programs
the following improvements should suffice.
- Use atomic operations (without memory barriers)
to turn the *once location itself into a mutex
and update global_epoch atomically.
- Use many condition variables
(but still only one mutex),
hashed by pthread_once_t address,
as in the sketch below.
Unlike having multiple mutexes, this would not affect the fast path,
and could reduce the effect of premature waking by an arbitrary factor.
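A hedged sketch of the second improvement, keeping the single mutex mu but hashing the once-control's address to one of several statically initialized condition variables; the table size 8 and the function name cv_for are hypothetical:

#include <pthread.h>
#include <stdint.h>
#include "fast_pthread_once.h"

/* Statically initialized, so usable at any point during startup. */
#define C PTHREAD_COND_INITIALIZER
static pthread_cond_t cv_table[8] = { C, C, C, C, C, C, C, C };
#undef C

/* Hash the once-control's address to pick a bucket. */
static pthread_cond_t* cv_for(fast_pthread_once_t* once) {
    uintptr_t h = (uintptr_t)once;
    return &cv_table[(h >> 4) & 7];  /* drop low bits, mask to table size */
}

In fast_pthread_once above, line 11 would become
check( pthread_cond_broadcast(cv_for(once)) == 0 );
and line 14 would become
check( pthread_cond_wait(cv_for(once), &mu) == 0 );
Waiters still hold the single mutex mu, so the fast path is unchanged; only threads whose once-controls hash to the same bucket can be woken prematurely.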