1. Introduction
C++23 introduced a new formatted output facility,
([P2093]).
It was defined in terms of formatting into a temporary
to simplify
the specification and to clearly indicate the requirement for non-interleaved
output. Unfortunately, it was discovered that this approach does not allow for a
more efficient implementation strategy, such as writing directly to a stream
buffer under a lock, as reported in [LWG4042]. This paper proposes a solution
to address this shortcoming.
2. Changes since R1
-
Made the new behavior an opt in for user-defined formatters to prevent potential deadlocks when they perform locking in their
functions.format -
Added a missing
argument in the call tostream
in the Effects clause ofprint
.println -
Added instructions to update the
feature testing macro.__cpp_lib_print -
Provided an example illustrating a problem with interleaved output.
-
Provided an example illustrating a problem with locking in C++ and Java.
3. Changes since R0
-
Added preliminary results for libstdc++ provided by Jonathan Wakely.
-
Replaced the definition of
with a more efficient one that doesn’t callprintln
.format -
Fixed typos.
4. Problems
As reported in [LWG4042],
/
is currently defined in
terms of formatting into a temporary
, e.g. [print.fun]:
void vprint_nonunicode ( FILE * stream , string_view fmt , format_args args ); Preconditions:
is a valid pointer to an output C stream.
stream Effects: Writes the result of
to
vformat ( fmt , args ) .
stream Throws: Any exception thrown by the call to
([format.err.report]).
vformat if writing to
system_error fails. May throw
stream .
bad_alloc
This prohibits a more efficient implementation strategy of formatting directly
into a stream buffer under a lock (
/
in POSIX, [STDIO-LOCK]) like C stdio and other formatting facilities do.
The inability to achieve this with the current wording stems from the observable effects: throwing an exception from a user-defined formatter currently prevents any output from a formatting function, whereas with the direct method, the output written to the stream before the exception occurred is preserved. Most errors are caught at compile time, making this situation uncommon. The current behavior can be easily replicated by explicitly formatting into an intermediate string or buffer.
Another problem is that such double buffering may require unbounded memory
allocations, making
unsuitable for resource-constrained
applications creating incentives for continued use of unsafe APIs. In the direct
method, there are usually no memory allocations.
5. Proposal
The current paper proposes expressing the desire to have non-interleaved
output in a way that permits a more efficient implementation similar
to
’s. It is based on the locking mechanism provided by C streams,
quoting Section 7.21.2 Streams of the C standard ([N2310-STREAMS]):
7 Each stream has an associated lock that is used to prevent data races when multiple threads of execution access a stream, and to restrict the interleaving of stream operations performed by multiple threads. Only one thread may hold this lock at a time. The lock is reentrant: a single thread may hold the lock multiple times at a given time.
8 All functions that read, write, position, or query the position of a stream lock the stream before accessing it. They release the lock associated with the stream when the access is complete.
As shown in Performance, this can give more than 20% speed up even compared to writing to a stack-allocated buffer.
All of the following languages use an implementation consistent with the current proposal (no intermediate buffering):
-
C (
)printf -
Rust (
)println ! -
Java (
)System . out . format
IOStreams don’t provide atomicity which is even weaker than the guarantees provided by these languages and the current proposal. For example:
#include <iostream>#include <thread>void worker () { for ( int i = 0 ; i < 3 ; ++ i ) { // Simulate work. std :: this_thread :: sleep_for ( std :: chrono :: milliseconds ( 200 )); std :: cout << "thread " << std :: this_thread :: get_id () << ": work" << i << " done \n " ; } } int main () { auto t1 = std :: jthread ( worker ); auto t2 = std :: jthread ( worker ); }
may produce the following output:
thread 140239754491456 : work0 done thread 140239746098752 : work0 done thread thread 140239746098752 : work140239754491456 : work1 done 1 done thread 140239754491456 : work2 done thread 140239746098752 : work2 done
Neither
nor
have this issue.
One problem with locking a stream is that it may introduce potential for deadlocks in case a user-defined formatter is also doing locking internally. For example:
struct deadlockable { int value = 0 ; mutable std :: mutex mutex ; }; template <> struct std :: formatter < deadlockable > { constexpr auto parse ( std :: format_parse_context & ctx ) { return ctx . begin (); } auto format ( const deadlockable & d , std :: format_context & ctx ) const { std :: lock_guard < std :: mutex > lock ( d . mutex ); return std :: format_to ( ctx . out (), "{}" , d . value ); } }; deadlockable d ; auto t = std :: thread ([ & ]() { std :: ( "start \n " ); std :: lock_guard < std :: mutex > lock ( d . mutex ); for ( int i = 0 ; i < 1000000 ; ++ i ) d . value += 10 ; std :: ( "done \n " ); }); for ( int i = 0 ; i < 100 ; ++ i ) std :: ( "{}" , d ); t . join ();
The same problem exists in other languages, for example:
class Deadlockable { public int value ; public String toString () { synchronized ( this ) { return Integer . toString ( value ); } } } class Hello { public static void main ( String [] args ) throws InterruptedException { Deadlockable d = new Deadlockable (); Thread t = new Thread ( new Runnable () { private Deadlockable d ; public Runnable init ( Deadlockable d ) { this . d = d ; return this ; } @Override public void run () { System . out . println ( "start" ); synchronized ( d ) { for ( int i = 0 ; i < 1000000 ; ++ i ) d . value += 10 ; System . out . format ( "done" ); } } }. init ( d )); t . start (); for ( int i = 0 ; i < 100 ; ++ i ) System . out . format ( "%s" , d ); t . join (); } }
Found one Java - level deadlock : ============================= "main" : waiting to lock monitor 0x0000600002fb4750 ( object 0x000000070fe120e8 , a Deadlockable ), which is held by "Thread-0" "Thread-0" : waiting for ownable synchronizer 0x000000070fe08998 , ( a java . util . concurrent . locks . ReentrantLock$NonfairSync ), which is held by "main" Java stack information for the threads listed above : =================================================== "main" : at Deadlockable . toString ( Hello . java : 5 ) - waiting to lock < 0x000000070fe120e8 > ( a Deadlockable ) at java . util . Formatter$FormatSpecifier . printString ( java . base @21.0.2 / Formatter . java : 3158 ) at java . util . Formatter$FormatSpecifier . ( java . base @21.0.2 / Formatter . java : 3036 ) at java . util . Formatter . format ( java . base @21.0.2 / Formatter . java : 2791 ) at java . io . PrintStream . implFormat ( java . base @21.0.2 / PrintStream . java : 1367 ) at java . io . PrintStream . format ( java . base @21.0.2 / PrintStream . java : 1346 ) at Hello . main ( Hello . java : 32 ) "Thread-0" : at jdk . internal . misc . Unsafe . park ( java . base @21.0.2 / Native Method ) - parking to wait for < 0x000000070fe08998 > ( a java . util . concurrent . locks . ReentrantLock$NonfairSync ) at java . util . concurrent . locks . LockSupport . park ( java . base @21.0.2 / LockSupport . java : 221 ) at java . util . concurrent . locks . AbstractQueuedSynchronizer . acquire ( java . base @21.0.2 / AbstractQueuedSynchronizer . java : 754 ) at java . util . concurrent . locks . AbstractQueuedSynchronizer . acquire ( java . base @21.0.2 / AbstractQueuedSynchronizer . java : 990 ) at java . util . concurrent . locks . ReentrantLock$Sync . lock ( java . base @21.0.2 / ReentrantLock . java : 153 ) at java . util . concurrent . locks . ReentrantLock . lock ( java . base @21.0.2 / ReentrantLock . java : 322 ) at jdk . internal . misc . InternalLock . lock ( java . base @21.0.2 / InternalLock . java : 74 ) at java . io . PrintStream . format ( java . base @21.0.2 / PrintStream . java : 1344 ) at Hello$1 . run ( Hello . java : 27 ) - locked < 0x000000070fe120e8 > ( a Deadlockable ) at java . lang . Thread . runWith ( java . base @21.0.2 / Thread . java : 1596 ) at java . lang . Thread . run ( java . base @21.0.2 / Thread . java : 1583 ) Found 1 deadlock .
This is obviously bad code because it unnecessarily calls
/
under a lock but it is still undesirable to have it
deadlocked.
To prevent deadlocks while still providing major performance improvements and
preventing dynamic allocations for the common case, this paper proposes making
user-defined formatters opt into the new behavior. All standard formatters are
nonlocking and will be opted in which means that
can be used as
a replacement for all current uses of
without concerns that it causes
unbounded memory allocation. The opt in is done via the
member variable named
:
struct foo {}; template <> struct fmt :: formatter < foo > { static constexpr bool locking = false; // ... };
6. Performance
The following benchmark demonstrates the difference in performance between
different implementation strategies using the reference implementation of
from [FMT]. This benchmark is based on the one from [P2093] but
modified to avoid the small string optimization effects. It formats a simple
message and prints it to the output stream redirected to
. It uses
the Google Benchmark library [GOOGLE-BENCH] to measure timings:
#include <cstdio>#include <benchmark/benchmark.h>#include <fmt/format.h>void printf ( benchmark :: State & s ) { while ( s . KeepRunning ()) std :: printf ( "The answer to life, the universe, and everything is %d. \n " , 42 ); } BENCHMARK ( printf ); void vprint_string ( fmt :: string_view fmt , fmt :: format_args args ) { auto s = fmt :: vformat ( fmt , args ); int result = fwrite ( s . data (), 1 , s . size (), stdout ); if ( result < s . size ()) throw fmt :: format_error ( "fwrite error" ); } template < typename ... T > void print_string ( fmt :: format_string < T ... > fmt , T && ... args ) { vprint_string ( fmt , fmt :: make_format_args ( args ...)); } void print_string ( benchmark :: State & s ) { while ( s . KeepRunning ()) { print_string ( "The answer to life, the universe, and everything is {}. \n " , 42 ); } } BENCHMARK ( print_string ); void vprint_stack ( fmt :: string_view fmt , fmt :: format_args args ) { auto buf = fmt :: memory_buffer (); fmt :: vformat_to ( std :: back_inserter ( buf ), fmt , args ); int result = fwrite ( buf . data (), 1 , buf . size (), stdout ); if ( result < buf . size ()) throw fmt :: format_error ( "fwrite error" ); } template < typename ... T > void print_stack ( fmt :: format_string < T ... > fmt , T && ... args ) { vprint_stack ( fmt , fmt :: make_format_args ( args ...)); } void print_stack ( benchmark :: State & s ) { while ( s . KeepRunning ()) { print_stack ( "The answer to life, the universe, and everything is {}. \n " , 42 ); } } BENCHMARK ( print_stack ); void print_direct ( benchmark :: State & s ) { while ( s . KeepRunning ()) fmt :: ( "The answer to life, the universe, and everything is {}. \n " , 42 ); } BENCHMARK ( print_direct ); BENCHMARK_MAIN ();
Here
formats into a temporary string,
formats into
a buffer allocated on stack and
formats directly into the C
stream buffer under a lock.
is included for comparison.
The benchmark was compiled with Apple clang version 15.0.0 (clang-1500.1.0.2.5)
with
and run on macOS 14.2.1 with M1 Pro CPU. Below are the
results:
Run on ( 8 X 24 MHz CPU s ) CPU Caches : L1 Data 64 KiB L1 Instruction 128 KiB L2 Unified 4096 KiB ( x8 ) Load Average : 5.03 , 3.99 , 3.89 ------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------- printf 81.8 ns 81.5 ns 8496899 print_string 88.5 ns 88.2 ns 7993240 print_stack 63.8 ns 61.9 ns 11524151 print_direct 51.3 ns 51.0 ns 13846580
Note that estimated CPU frequency is incorrect.
On Linux (Ubuntu 22.04.3 LTS) with gcc 11.4.0, glibc/libstdc++ and Intel Core
i9-9900K CPU the results are similar except that
is slightly faster
than
with the stack-allocated buffer optimization:
Run on ( 16 X 3600 MHz CPU s ) CPU Caches : L1 Data 32 KiB ( x8 ) L1 Instruction 32 KiB ( x8 ) L2 Unified 256 KiB ( x8 ) L3 Unified 16384 KiB ( x1 ) Load Average : 0.00 , 0.00 , 0.00 ------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------- printf 52.1 ns 52.1 ns 13386398 print_string 65.7 ns 65.7 ns 10674838 print_stack 55.8 ns 55.8 ns 12535414 print_direct 46.3 ns 46.3 ns 15087266
Direct output is 42-72% faster than writing to a temporary string and 21-24% faster than writing to a stack-allocated buffer on this benchmark.
Preliminary testing in libstc++ showed ~25% improvement compared to the existing implementation.
7. Implementation
This proposal has been implemented in the open-source {fmt} library ([FMT]) bringing major performance improvements.
8. Wording
Update the value of the feature-testing macro
to the date of
adoption in [version.syn].
Modify [print.fun] as indicated:
...
template < class ... Args > void ( FILE * stream , format_string < Args ... > fmt , Args && ... args );
Effects:
Letlocksafe
be true
if formatter < remove_cvref_t < Arg >>:: locking
is
valid and evaluates to false
for all Arg
in Args
, false
otherwise.
If the ordinary literal encoding ([lex.charset]) is UTF-8, equivalent to:
vprint_unicode ( stream , fmt . str , make_format_args ( args ...)); locksafe ? vprint_unicode_locking ( stream , fmt . str , make_format_args ( args ...)) : vprint_unicode ( stream , fmt . str , make_format_args ( args ...));
Otherwise, equivalent to:
vprint_nonunicode ( stream , fmt . str , make_format_args ( args ...)); locksafe ? vprint_nonunicode_locking ( stream , fmt . str , make_format_args ( args ...)) : vprint_nonunicode ( stream , fmt . str , make_format_args ( args ...));
...
template < class ... Args > void println ( FILE * stream , format_string < Args ... > fmt , Args && ... args );
Effects: Equivalent to:
( stream , "{} \n " , format ( fmt , std :: forward < Args > ( args )...)); ( stream , runtime_format ( string ( fmt . get ()) + '\n' ), std :: forward < Args > ( args )...);
Effects: Equivalent to:void vprint_unicode ( FILE * stream , string_view fmt , format_args args );
vprint_unicode_locking ( stream , "{}" , vformat ( fmt , args ));
void vprint_unicode_locking ( FILE * stream , string_view fmt , format_args args );
Preconditions:
is a valid pointer to an output C stream.
Effects:
The function initializes an automatic variable via
string out = vformat ( fmt , args );
Let
denote the character representation of formatting arguments
provided by
formatted according to specifications given in
.
stream
.
If stream
refers to a terminal capable of displaying Unicode, writes out
to
the terminal using the native Unicode API; if out
contains invalid code units,
the behavior is undefined and implementations are encouraged to diagnose it.
Otherwise writes out
to stream
unchanged. If the native Unicode API is used,
the function flushes stream
before writing out
.
Releases the lock
.
...
Effects: Equivalent to:void vprint_nonunicode ( FILE * stream , string_view fmt , format_args args );
vprint_nonunicode_locking ( "{}" , vformat ( fmt , args ));
void vprint_nonunicode_locking ( FILE * stream , string_view fmt , format_args args );
Preconditions:
is a valid pointer to an output C stream.
Effects:
Writes the result of
Locks
to
.
, writes the character representation of formatting arguments
provided by
formatted according to specifications given in
to
and releases the lock.
Throws: Any exception thrown by the call to
([format.err.report]).
if writing to
fails. May throw
.
...
Modify [format.formatter.spec] as indicated:
...
Let
be either
or
. Each specialization of
is
either enabled or disabled, as described below. A debug-enabled specialization
of
additionally provides a public, constexpr, non-static member
function
which modifies the state of the
to be
as if the type of the std-format-spec parsed by the last call to parse were
. Each header that declares the template
provides the following
enabled specializations:
-
The debug-enabled specializations
template <> struct formatter < char , char > ; template <> struct formatter < char , wchar_t > ; template <> struct formatter < wchar_t , wchar_t > ; ...
The parse member functions of these formatters interpret the format specification as a std-format-spec as described in [format.string.std].
These formatters define astatic constexpr bool
member variable named locking
that evaluates to false
.
[Note 1: Specializations such as
and
that would require implicit multibyte /
wide string or character conversion are disabled. — end note]
...
9. Acknowledgements
Thanks to Jonathan Wakely for implementing the proposal in libstdc++, providing benchmark results and suggesting various improvements to the paper.
Thanks to Ben Craig for proposing to make user-defined formatters opt into the new behavior to prevent potential deadlocks.