Dynamic Initialization and Destruction with Concurrency

ISO/IEC JTC1 SC22 WG21 N2444 = 07-0314 - 2007-10-16

Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

This document is a revision of N2382 = 07-0242 - 2007-08-05.

Introduction
Function-Local Initialization: Implementation; 6.7 Declaration statement [stmt.dcl]
Non-Local Initialization: Implementation; 3.6.2 Initialization of non-local objects [basic.start.init]
Destruction: Implementation; 3.6.3 Termination [basic.start.term]; 18.4 Start and termination [lib.support.start.term]
Appendix: A Fast Implementation of Synchronized Initialization: Overview; Licensing; Header fast_pthread_once.h; Source fast_pthread_once.c; Correctness Argument; Test fast_pthread_once_test.c; Improvements

Introduction

Concurrency introduces potential deadlock or data races in the dynamic initialization and destruction of static-duration objects. Because the opportunity for programmers to manually synchronize such objects is limited by a lack of syntactic handles, the language must introduce new syntax, define synchronization, or limit programs.

This proposal breaks the problem into three reasonably separable parts: initialization of function-local static-duration objects, initialization of non-local static-duration objects, and destruction of all static-duration objects.

This proposal exploits properties of a new algorithm for fast synchronized initialization by Mike Burrows of Google. (See the appendix for an implementation and an argument for its correctness.) The algorithm has three attributes signficant to its use here.

The "already initialized" check (e.g. for function-local static-duration objects already initialized) requires only one additional regular load over a non-concurrent check, presuming a fast implementation of N2280 Thread-Local Storage.
The initialization does not occur while holding any lock required by the synchronization mechanism.
The algorithm is portable.

Function-Local Initialization

The core problem with function-local static-duration object initialization is that the containing function may be invoked concurrently, and thus the definition may execute concurrently. Failing to synchronize may introduce a race condition. The obvious solution is to synchronize. The problem is that such synchronization may introduce deadlock if the synchronization involves holding a lock while the initializer executes.

The proposal relies on Burrows' algorithm to avoid holding a lock while the initializer executes. Any deadlock that occurs as a result of initialization, must necessarily be a race condition in the absence of synchronization.

Implementation

The GNU compiler currently implements synchronized function-local variable initialization, though it holds the lock during initialization.

6.7 Declaration statement [stmt.dcl]

In paragraph 4, edit

Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the object is being initialized, the concurrent execution waits for completion of the initialization. The implementation shall not introduce any locking around execution of the initializer. If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.

Non-Local Initialization

There are two problems with non-local static-duration object initialization. The first problem is that such initialization may occur concurrently. There are two sources to the concurrency, dynamic opening of a shared library in a multi-threaded context and creating of multiple threads in initializers, and hence the potentially implied concurrent execution of other initializers. The second problem is that static-duration object initialization may constitute a large fraction of process time, particularly when users are waiting for program start, and hence may benefit from parallel execution.

The standard currently admits unordered initialization. When initializing one object, reference to a relatively unordered object may result in access before dynamic initialization but after zero initialization. Since the concept of zero initialization has limited utility for many objects, particularly those that require dynamic construction, we consider this feature of the current C++ standard to be of limited utility. Therefore, we propose to make such references undefined.

Making such references undefined would normally imply that std::cout and std::operator new could not be used by any dynamic initializer. To resolve this problem, the standard should provide additional guarantees, and it currently appears to do so.

In the sequential 2003 standard, unspecified order of initialization essentially required some serial order of initialization. In the parallel proposed standard, the unspecified order of initialization admits concurrent initialization. The necessary additional restriction is that a single object may be initialized at most once. This restriction may require locking. With concurrent initialization, any write (or read in the presence of writes) to statically-initialized objects must be properly locked. An alternate approach, not proposed here, is to require a single global order for all initialization.

The implementation must initialize static-duration objects before any of their use within main or the functions it calls.

Implementation

To the best of our knowledge, no compiler synchronizes the initialization of non-local static-duration objects. However, the technical issues are similar to existing function-local initializations, and so present no new challenges.

3.6.2 Initialization of non-local objects [basic.start.init]

In paragraph 1, edit

Dynamic initialization of an a non-function-local static-duration object is either ordered or unordered. Definitions of explicitly specialized class template static data members have ordered initialization. Other class template static data members (i.e., implicitly or explicitly instantiated specializations) have unordered initialization. Other objects defined in namespace scope have ordered initialization. Such objects ~~Objects~~ within a single translation unit and with ordered initialization are relatively ordered, otherwise they are relatively unordered. Relatively ordered objects shall be initialized in the order of their definitions within the translation unit. The order of initialization is unspecified for relatively unordered objects ~~with unordered initialization and for objects defined in different translation units~~. If the initialization of an object uses a relatively unordered object, the behavior is undefined; no diagnostic is required. [Note: This definition permits concurrent initialization.]

Destruction

The primary problem with destruction of static-duration objects is access to static-duration objects after their destructors have executed, thus resulting in undefined behavior. To prevent this problem, we require that all user threads finish before destruction begins. For threads that do not naturally finish, mechanisms to terminate threads are proposed in N2320 Multi-threading Library for Standard C++ and its successors.

Destruction also introduces the potential to consume a large fraction of process time. Therefore, similar to initialization, we enable concurrent destruction.

Objects defined at namespace scope with relatively ordered initializations must be destructed in reverse order of their initialization.

The complication to this approach is destruction of function-local static-duration objects and calls to functions registered with std::atexit. Since the order of initialization of function-local static-duration objects and of calls to std::atexit is defined by execution, rather than by declaration, we call these execution-ordered objects. (For expository purposes, calls to std::atexit correspond to initialization of a virtual object.) Non-local static-duration objects are called declaration-ordered objects. For execution-ordered objects initialized from within the dynamic context of a declaration-ordered initialization, their destruction shall occur in reverse order of completion of their initialization immediately before the destruction of the context object.

To make this approach viable, initialization and destruction of execution-ordered objects must obey the same restrictions as those on declaration-ordered objects. Specifically, the initialization and destruction of an execution-ordered object must not use an object that is relatively unordered with respect to the context object and must synchronize use of objects with trivial destructors.

Finally, execution-ordered objects may be initialized outside the context of initialization of a declaration-ordered object, that is, they may be initialized from ordinary code. These objects are destructed in reverse order of the completion of their initialization before destruction of any declaration-ordered object.

Implementation

One potential concern is the run-time cost of managing destruction order. To address this issue, we provide an implementation outline below.

Define a static-duration object that lists pending destructions for groups of declaration-ordered objects. Call this list the declaration-destructor list.
Define a static-duration object that lists pending destructions for execution-ordered objects not initialized in the context of a declaration-ordered initialization. Call this list the execution-destructor list.
Define a thread-duration object that lists pending destructions. Call this list the threaded-destructor list.
When starting initialization of a group of relatively-ordered objects, create an empty threaded-destructor list. The set of static-duration object initializations executed within the dynamic scope of the initializations of this group of objects is called an initialization region.
For each dynamic initialization within an initialization region, non-atomically insert the destruction at the head of the threaded-destructor list. This list will capture the function-local objects initialized as a consequence of the initialization of non-local objects. The code can be as simple as an insertion onto a singly-linked list with nodes statically allocated.
When finishing an initialization region, atomically move the threaded-destructor list to the declaration-destructor list as a group. The code can be as simple as an atomic insertion onto a singly-linked list with nodes statically allocated. The atomic insertion can be done with a compare-and-swap-with-release loop, which will terminate rapidly. A read-acquire on the head of the loop will be necessary before traversing the list.
For each dynamic initialization not within an initialization region, atomically insert the destruction at the head of the execution-destructor list. The code can be as simple as an atomic insertion onto a singly-linked list with nodes statically allocated. The insertion has the same basic algorithm as above.
Upon program exit, iterate over the execution-destructor list and call the corresponding destructors. After those complete, iterate over the declaration-destructor list and start the corresponding group destruction concurrently. Within each group, iterate sequentially over the destructor list.

3.6.3 Termination [basic.start.term]

In paragraph 1, edit

Destructors (12.4) for initialized objects of static storage duration ~~(declared at block scope or at namespace scope)~~ are called as a result of returning from main and as a result of calling std::exit (18.4). ~~These objects are destroyed in the reverse order of the completion of their constructor or of the completion of their dynamic initialization.~~ Dynamic initializations of local static objects executing inside the dynamic scope of the initialization a non-local static object are relatively ordered to the non-local static object. Dynamic initializations of local static objects executing outside the dynamic scope of the initialization a non-local static object are relatively ordered to main. Objects with relatively ordered initialization shall be destroyed in reverse order of completion of their initialization. All objects relatively ordered to main shall be destroyed before any non-local static-duration objects; otherwise, if the destruction of an object uses a relatively unordered object, the behavior is undefined; no diagnostic is required. [Note: This definition permits concurrent destruction.] If an object is initialized statically, the object is destroyed in the same order as if the object was dynamically initialized. For an object of array or class type, all subobjects of that object are destroyed before any local object with static storage duration initialized during the construction of the subobjects is destroyed.

In paragraph 2, edit

If a function contains a local object of static storage duration that has been destroyed and the function is called during the destruction of an object with static storage duration, the program has undefined behavior if the flow of control passes through the definition of the previously destroyed local object. Likewise, the behavior is undefined if the function-local object is used indirectly (i.e. through a pointer) after its destruction.

In paragraph 3, edit

A call to std::atexit from inside the dynamic scope of the initialization of a non-local static object is relatively ordered to the non-local static object. A call to std::atexit from outside the dynamic scope of the initialization of a non-local static object is relatively ordered to main. If a function is registered with std::atexit (see <cstdlib>, 18.4) then following the call to std::exit, any relatively ordered objects with static storage duration initialized prior to the registration of that function shall not be destroyed until the registered function is called from the termination process and has completed. For an a relatively ordered object with static storage duration constructed after a function is registered with std::atexit, then following the call to std::exit, the registered function is not called until the execution of the object's destructor has completed. If std::atexit is called during the construction of an object, the complete object to which it belongs shall be destroyed before the registered function is called.

18.4 Start and termination [lib.support.start.term]

In paragraph 8, add a new bullet 0

The program shall ensure that all threads, except the main thread, have terminated before calling exit or returning from main.

In paragraph 8, existing bullet 1, edit

~~First,~~ Next, objects with static storage duration are destroyed and functions registered by calling atexit are called.219) See 3.6.3 for the order of destructions and calls. ~~Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor.~~ (Automatic objects are not destroyed as a result of calling exit().)218) Functions registered with atexit are called in the reverse order of their registration, except that a function is called after any previously registered functions that had already been called at the time it was registered.219) A function registered with atexit before a non-local object obj1 of static storage duration is initialized will not be called until obj1's destruction has completed. A function registered with atexit after a non-local object obj2 of static storage duration is initialized will be called before obj2's destruction starts. A local static object obj3 is destroyed at the same time it would be if a function calling the obj3 destructor were registered with atexit at the completion of the obj3 constructor.

Appendix: A Fast Implementation of Synchronized Initialization

Mike Burrows, m3b at Google.

Overview

I show a fast form of concurrent synchronized initialization in the form of an alternate implementation of pthread_once.

Licensing

It is our intent to make the following technique freely available. To that end, some licensing appears to be required.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Header fast_pthread_once.h

The header defines the synchronization type, defines a synchronization sentinal values, declares the synchronization function, and defines an inline fast synchronization function.


#ifndef FAST_PTHREAD_ONCE_H
#define FAST_PTHREAD_ONCE_H

#include <signal.h>
#include <stdint.h>

typedef sig_atomic_t fast_pthread_once_t;
#define FAST_PTHREAD_ONCE_INIT SIG_ATOMIC_MAX
extern __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

#ifdef __cplusplus
extern "C" {
#endif

extern void fast_pthread_once( pthread_once_t *once, void (*func)(void) );

inline static void fast_pthread_once_inline(
    fast_pthread_once_t *once, void (*func)(void) )
{
    fast_pthread_once_t x = *once;        /* unprotected access */
    if ( x > _fast_pthread_once_per_thread_epoch ) {
        fast_pthread_once( once, func );
    }
}

#ifdef __cplusplus
}
#endif

#endif FAST_PTHREAD_ONCE_H

Source fast_pthread_once.c

The source is written in C. The lines of the primary function are numbered for reference in the subsequent correctness argument.


#include "fast_pthread_once.h"

#include <pthread.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
  /* protects global_epoch and all fast_pthread_once_t writes */

static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  /* signalled whenever a fast_pthread_once_t is finalized */

#define BEING_INITIALIZED (FAST_PTHREAD_ONCE_INIT - 1)

static fast_pthread_once_t global_epoch = 0; /* under mu */

__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

static void check( int x )
{
    if ( x == 0 )
        abort();
}

void fast_pthread_once( fast_pthread_once_t *once, void (*func)(void) )
{
/*01*/    fast_pthread_once_t x = *once;        /* unprotected access */
/*02*/    if ( x > _fast_pthread_once_per_thread_epoch ) {
/*03*/        check( pthread_mutex_lock(&mu) == 0 );
/*04*/        if ( *once == FAST_PTHREAD_ONCE_INIT ) {
/*05*/            *once = BEING_INITIALIZED;
/*06*/            check( pthread_mutex_unlock(&mu) == 0 );
/*07*/            (*func)();
/*08*/            check( pthread_mutex_lock(&mu) == 0 );
/*09*/            global_epoch++;
/*10*/            *once = global_epoch;
/*11*/            check( pthread_cond_broadcast(&cv) == 0 );
/*12*/        } else {
/*13*/            while ( *once == BEING_INITIALIZED ) {
/*14*/                check( pthread_cond_wait(&cv, &mu) == 0 );
/*15*/            }
/*16*/        }
/*17*/        _fast_pthread_once_per_thread_epoch = global_epoch;
/*18*/        check (pthread_mutex_unlock(&mu) == 0);
          }
}

Correctness Argument

The above code implements pthread_once() in a way that is both fast (a few simple instructions, no memory barriers) and portable. Careful use of memory atomicity and monotonicity are used in place of a memory barrier.

The fast-path for the inline function in the header has the following code as generated by gcc 3.2.2 on an x86:

mov %gs:(%esi),%eax # access to thread-local storage

cmp %eax, pthread_once_t_location # accesses the pthread_once_t variable

jg slow_path # decides whether to call the function

This code is one more instruction than a test that is not thread safe. This code touches two cache lines, while an unsafe version would touch only one. However, the additional cache line is thread-specific, and not especially likely to cause misses. As one might expect, repeated calls to the inline version achieve around 1 billion fast-path calls per second on a 2.8GHz processor with a hot cache.

In the sketch proof that follows, assumptions and deductions are given names in square brackets. This sketch will no doubt appear to be overly pedantic to some and sloppy to others. It probably contains errors and misuse of terminology; I hope these faults are not fatal. Its purpose is to convince the reader that with sufficient effort a full proof could be constructed, and perhaps to form the basis for such a proof for anyone eager for such mental exercise.

The portability assumptions beyond those in a straightforward implementation are as follows:

[thread_local]: There is a way to access thread-specific data or thread-local storage. The technique is useful only if such an access is faster than a memory barrier. For the disassembly above, the compiler used its ability to access thread-local storage using variables declared with __thread, so the thread-local access is a normal "mov" instruction with a different segment register.
[atomicity]: There exists an integral type T (such as sig_atomic_t) such that loads and stores of a variable of type T are atomic. It is not possible to read from such a variable any value that was not written to the variable, even if multiple reads and writes occur on many processors. Another way to say this is that variables of type T do not exhibit word tearing when accessed with full-width accesses.
[*once_bound]: The number of pthread_once_t variables in a programme is bounded and can be counted in an instance of type T without overflow. If T is 32-bits, this implementation requires that the number of pthread_once_t variables in a programme not exceed 2**31-3. (Recall that pthread_once_t variables are intended to have static storage class, so it would be remarkable for such a huge number to exist.)
[monotonicity]: Accesses to a single variable V of type T by a single thread X appear to occur in programme order. For example, if V is initially 0, then X writes 1, and then 2 to V, no thread (including but not limited to X) can read a value from V and then subsequently read a lower value from V. (Notice that this does not prevent arbitrary load and store reordering; it constrains ordering only between actions on a single memory location. This assumption is reasonable on all architectures that I currently know about. I suspect that the Java and CLR memory models require this assumption also.)

Let us say that every fast_pthread_once variable *once is "finalized" when the associated initialization function "func" returns (line 7 in the code above). We want to argue [correctness] (the combination of [safety] and [liveness]) of the code above, using the assumptions above, plus a few others given below. For [safety], we require [safety_1] that each variable *once is finalized at most once. We also require [safety_2] that any thread that calls fast_pthread_once on *once does not return until *once has been finalized at least once, and all the modifications made by func are visible to that thread. For [liveness], we require that the code does not deadlock or loop forever.

For the code, all variable accesses are consistent, except the load of *once on line 1 because:

All modifications to *once and all loads but line 1 occur under mu, so all accesses to *once except the load on line 1 are consistent.
All accesses to global_epoch occur under mu, so all accesses to global_epoch are consistent.
The variable _fast_pthread_once_per_thread_epoch is a per-thread variable accessed only by its owning thread, so all accesses to _fast_pthread_once_per_thread_epoch are consistent (requires [thread_local]). The consistency of variable accesses will not be mentioned further, except for *once on line 1.

[release_consistency] We assume that we have mutexes that provide at least release consistency. We use pthread_mutex_t in the example.

[fairness] We assume that runnable threads eventually run.

[atomicity2] We assume that fast_pthread_once variables are of the type T, from [atomicity].

[initialization] We assume that fast_pthread_once_t variables are originally set to FAST_PTHREAD_ONCE_INIT.

[global_epoch_value_range] global_epoch counts the number of *once variables that have been finalized; it is incremented (line 9) each time func is called (line 7) and never modified otherwise. From [*once_bound], we know the value of global_epoch is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}, because global_epoch is initially 0, and FAST_PTHREAD_ONCE_INIT is the maximum value for the type of *once.

[*once_value_sequence] Every *once variable is initialized to FAST_PTHREAD_ONCE_INIT [initialization]. There are only two places where *once is changed:

Line 5:: *once is set to BEING_INITIALIZED (==FAST_PTHREAD_ONCE_INIT-1)
Line 10:: *once is set to the value of global_epoch, which is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} (from global_epoch_value_range)

The sets {FAST_PTHREAD_ONCE_INIT}, {BEING_INITIALIZED}, and {0, ..., FAST_PTHREAD_ONCE_INIT-2} have no intersection because BEING_INITIALIZED==FAST_PTHREAD_ONCE_INIT-1, and FAST_PTHREAD_ONCE_INIT is the maximum value for the type of *once. The first assignment (line 5) to BEING_INITIALIZED is predicated on *once being FAST_PTHREAD_ONCE_INIT, and the update is atomic due to the use of a mutex. Therefore, a *once may make the transition from FAST_PTHREAD_ONCE_INIT to BEING_INITIALIZED at most once. The second assignment (line 10) occurs only if the executing thread performed the first assignment; therefore the second assignment occurs at most once. So any given *once location takes on a sequence of values that is a prefix of the sequence:

FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E

where E < FAST_PTHREAD_ONCE_INIT, E < BEING_INITIALIZED, and E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}.

[safety_1] The function func is called at most once for each *once variable, because it is called only if the executing thread performed the assignment at line 5, which is performed at most once, from [*once_value_sequence]. This shows [safety_1].

[slow_path_safety_2] Any thread W that reaches line 10 has called func; therefore *once is finalized and the modifications from func are visible to W. If W reaches line 18, W either passed through line 10, or passed through line 13 and found *once != BEING_INITIALIZED. We know from [monotonicity] and [*once_value_sequence] that *once cannot be FAST_PTHREAD_ONCE_INIT (line 4) or BEING_INITIALIZED (line 13), so it must be some value E, a value in {0, ..., FAST_PTHREAD_ONCE_INIT-2}. The location *once takes on such a value only after being finalized. Moreover, accesses by a thread after line 13 seeing E are ordered after the assignment of E to *once and the modifications made by the call to func. This is from [release_consistency] and the facts that accesses at line 13 occur with the mutex held that the thread that finalized *once released the same mutex after doing so. Thus, if W reaches line 18, it does so after *once has been finalized, and with all modifications made by func visible to it. This shows [safety_2] for the slow path, which we call [slow_path_safety_2].

[fast_path_safety_2] We now need to show that any thread X that takes the fast path (just lines 1 and 2, with a false predicate at line 2), does so only if *once is finalized and func's modifications are visible to X. The value read from *once at line 1 is not read under mu, so the value read may be inconsistent. However, it is known to be one of the values

FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E

(where E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} and E < FAST_PTHREAD_ONCE_INIT and E < BEING_INITIALIZED) because of [*once_bound], [atomicity2], and [*once_value_sequence]. Further, we know that X's _fast_pthread_once_per_thread_epoch is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} because it is assigned only from global_epoch (line 17) and we have [global_epoch_value_range]. Therefore, X sees the line 2 predicate as false only if both:

X read some value E from *once (rather than reading FAST_PTHREAD_ONCE_INIT or BEING_INITIALIZED), and
X's _fast_pthread_once_per_thread_epoch exceeds the value E; both values were ultimately derived from global_epoch, which is consistently accessed under mu.

From the first of these plus assumptions [*once_bound], [monotonicity], and [*once_value_sequence], we know that for X to take the fast path, some thread Y executed line 10, and therefore Y finalized *once. From the second of these, we know that X must have acquired mu (mu is held at line 17) after Y finalized *once and subsequently released mu. So from [release_consistency], all the values modified by func are visible to X. This gives us [fast_path_safety_2].

[safety] Combining [safety_1], [slow_path_safety_2], and [fast_path_safety_2], we have [safety].

[mu_deadlock_freedom] From the code, while mu is held, no other mutexes are acquired nor any other blocking call made, so mu cannot participate in a deadlock.

[liveness] Liveness comes from these observations

the scheduler does not delay arbitrarily [fairness],
the mutex cannot cause deadlock [mu_deadlock_freedom],
all changes that might cause line 13 to see a false predicate must be made by line 10, and the same thread must then signal the condition variable (line 11) thus waking any waiter (line 14), and
there is only one loop, at line 13. This loop will is entered only if some thread Y has set *once != FAST_PTHREAD_ONCE_INIT (from line 4). The loop will terminate if Y ever reaches lines 10, 11 and 18 (in sequence), which is shown by the foregoing observations.

[correctness] We have [safety] and [liveness].

Test fast_pthread_once_test.c


#include <stdio.h>
#include <assert.h>
#include "fast_pthread_once.h"

fast_pthread_once_t once = FAST_PTHREAD_ONCE_INIT;

void run_once(void)
{
    static int run_count = 0;
    assert( run_count == 0 );
    run_count++;
}

int main( int argc, char *argv[] )
{
    int use_inline = 0;
    int n = 1000 * 1000 * 1000;
    int i;
    for ( i = 1; i < argc; i++ ) {
        const char *arg = argv[i];
        if ( strcmp(arg, "-i") == 0 ) {
            use_inline = 1;
        } else if ( strspn(arg, "0123456789") == strlen(arg) ) {
            n = atoi(arg);
        } else {
            fprintf( stderr, "usage: %s [-i] [number]\n", argv[0] );
            return 2;
        }
    }
    if ( use_inline ) {
        for ( i = 0; i != n; i++ ) {
            fast_pthread_once_inline(&once, run_once);
        }
    } else {
        for ( i = 0; i != n; i++ ) {
            fast_pthread_once(&once, run_once);
        }
    }
    return 0;
}

Improvements

There are two primary sources of inefficiency in code. First, there is a single mutex for maintaining the global epoch. Second, all waiting threads are woken for each completed initialization. We do not believe these inefficiencies will matter in practice, because they become significant only for many threads over many long-running initializations. (Recall that the mutex is not held during the actual initialization.) However, for pathological programs the following improvements should suffice.

Use atomic operations (without memory barriers) to turn the *once location into a mutex and update global_epoch atomically.
Use many condition variables (but still only one mutex), hashed by pthread_once_t address. Unlike having multiple mutexes, this would not affect the fast path, and could reduce the effect of premature waking by an arbitrary factor.

mov	%gs:(%esi),%eax	# access to thread-local storage
cmp	%eax, pthread_once_t_location	# accesses the pthread_once_t variable
jg	slow_path	# decides whether to call the function