"In character, in manner, in style, in all things, the supreme excellence is simplicity." — Henry Wadsworth Longfellow
1. Introduction
The C++20 formatting facility (`std::format`) allows formatting of `char` as an
integer via format specifiers such as `d` and `x`. Unfortunately, [P0645], which
introduced the facility, didn’t take into account that the signedness of `char` is
implementation-defined and specified this formatting in terms of `to_chars`
with the value implicitly converted (promoted) to `int`. This had some
undesirable effects that were discovered with usage experience and resolved in
the {fmt} library ([FMT]). This paper proposes applying a similar fix to `std::format`.
First, `std::format` normally produces consistent output across platforms for
the same integral types and the same IEEE 754 floating point types. Formatting
`char` as an integer breaks this nice property, making the output
implementation-defined even if the `char` size is effectively the same.
Second, `char` is used as a code unit type in `std::format` and other text
processing facilities. In these use cases one normally needs to output `char`
either as (a part of) text, which is the default, or as a bit pattern. Having it
sometimes be output as a signed integer is surprising to users. It is
particularly surprising when formatted in a non-decimal base. For example,
assuming UTF-8 literal encoding:
for (char c : std::string("🤷")) { std::print("\\x{:02x}", c); }
will print either
\xf0\x9f\xa4\xb7
or
\x-10\x-61\x-5c\x-49
depending on the platform. Since the result is implementation-defined, the user may not even be aware of this issue, which can then manifest itself when the code is compiled and run on a different platform or with different compiler flags.
This particular case can be fixed by adding a cast to `unsigned char`, but a
cast may not be as easy to apply when formatting ranges as simply using format
specifiers.
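To make the two outputs reproducible on any platform, the following minimal sketch calls `std::to_chars` directly and models the two possible signedness choices with `signed char` and `unsigned char`. The helper names `hex02` and `escape_bytes` are hypothetical and exist only for illustration; they are not part of the standard library or {fmt}.

```cpp
#include <charconv>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: hexadecimal text for an int, zero-padded to width 2,
// roughly what the {:02x} specifier produces for the values used here.
inline std::string hex02(int value) {
    char buf[16];
    auto res = std::to_chars(buf, buf + sizeof buf, value, 16);
    if (res.ec != std::errc{}) return {};
    std::string text(buf, res.ptr);
    if (text.size() < 2) text.insert(0, 1, '0');  // only single non-negative digits
    return text;
}

// Hypothetical helper: CodeUnit stands in for the platform's char.
// Use signed char to model a signed-char platform, unsigned char otherwise.
template <typename CodeUnit>
std::string escape_bytes(std::string_view text) {
    std::string out;
    for (char c : text) {
        // Reinterpret the byte as the modelled type, then let it promote to
        // int, which mirrors the C++20 wording described above.
        int promoted = static_cast<CodeUnit>(c);
        out += "\\x";
        out += hex02(promoted);
    }
    return out;
}
```

Passing the UTF-8 bytes of "🤷" as `"\xf0\x9f\xa4\xb7"` yields `\xf0\x9f\xa4\xb7` for `escape_bytes<unsigned char>` and `\x-10\x-61\x-5c\x-49` for `escape_bytes<signed char>`. The `unsigned char` instantiation is also what the cast workaround amounts to.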
2. Changes from R2
- Added LEWG poll results.
3. Changes from R1
- Added instructions to bump the `__cpp_lib_format` feature test macro per LEWG feedback.
- Added a missing cast for the case of formatting `char` as `wchar_t` per LEWG feedback.
4. Changes from R0
- Changed the title from "Dude, where’s my char?" to "Fix formatting of code units as integers" per SG16 feedback.
- Added all affected format specifiers to the before/after table per SG16 feedback.
- Clarified how this compares with `printf` format specifiers.
- Added SG16 poll results for R0.
- Fixed handling of the case of formatting `char` as `wchar_t` per SG16 feedback.
5. Polls
LEWG poll results for R1:
POLL: Forward P2909R1 to LWG for C++26 (and as a defect) (to be confirmed by Electronic Polling)
SF | F | N | A | SA |
---|---|---|---|---|
5 | 11 | 1 | 0 | 0 |
Outcome: Strong consensus in favour
POLL: For a feature test Macro we prefer a new Macro (over bumping “__cpp_lib_format”)
SF | F | N | A | SA |
---|---|---|---|---|
0 | 3 | 4 | 3 | 2 |
Outcome: No consensus
SG16 poll results for R0:
Poll 1: Modify P2909R0 "Dude, where’s my char‽" to maintain
semi-consistency with printf such that the `b`, `B`, `o`, `x`, and `X`
conversions convert all integer types as unsigned.
SF | F | N | A | SA |
---|---|---|---|---|
1 | 2 | 0 | 2 | 2 |
Outcome: No consensus for change
Poll 2: Modify P2909R0 "Dude, where’s my char‽" to remove the change
to handling of the `d` specifier.
SF | F | N | A | SA |
---|---|---|---|---|
2 | 1 | 2 | 1 | 1 |
Outcome: No consensus for change
Poll 3: Forward P2909R0 "Dude, where’s my char‽", amended with a descriptive title, an expanded before/after table, and fixed CharT wording, to LEWG with the recommendation to adopt it as a Defect Report.
SF | F | N | A | SA |
---|---|---|---|---|
2 | 2 | 2 | 1 | 0 |
Outcome: Weak consensus - LEWG may want to look at this closely
6. Proposal
This paper proposes making code unit types formatted as unsigned integers instead of implementation-defined.
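Code | Before | After |
---|---|---|
`for (char c : std::string("🤷")) std::print("\\x{:02x}", c);` | `\xf0\x9f\xa4\xb7` or `\x-10\x-61\x-5c\x-49` (implementation-defined) | `\xf0\x9f\xa4\xb7` |
The `char` value `'\xf0'` formatted with the `b`, `B`, `d`, `o`, `x` and `X` specifiers | `11110000 11110000 240 360 f0 F0` or `-10000 -10000 -16 -20 -10 -10` (implementation-defined) | `11110000 11110000 240 360 f0 F0` |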
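The "After" column can be approximated with a short sketch that converts a code unit to its corresponding unsigned type before handing it to `std::to_chars`. The function name `code_unit_to_string` is hypothetical and is not how any particular implementation is structured. It covers only the digits for bases 2, 8, 10 and 16, so upper-case output (`X`), prefixes and padding are out of scope.

```cpp
#include <charconv>
#include <string>
#include <system_error>
#include <type_traits>

// Hypothetical helper (not a standard API): formats a code unit as an
// integer in the given base after converting it to the corresponding
// unsigned type, which is the behaviour proposed here.
template <typename CharT>
std::string code_unit_to_string(CharT c, int base) {
    using Unsigned = std::make_unsigned_t<CharT>;
    // Widen to unsigned long long so a single to_chars overload serves
    // every code unit type.
    unsigned long long value = static_cast<Unsigned>(c);
    char buf[72];  // enough for 64 binary digits
    auto res = std::to_chars(buf, buf + sizeof buf, value, base);
    return res.ec == std::errc{} ? std::string(buf, res.ptr) : std::string{};
}
```

With this conversion, `code_unit_to_string('\xf0', 16)` gives `f0` whether `char` is signed or unsigned, and the same holds for an explicitly `signed char` argument holding `-16`.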
This somewhat improves consistency with the `printf` `%x` and `%o` (but not `%d`)
specifiers, which always treat arguments as unsigned. For example:
printf("%x", '\x80');
prints
ffffff80
regardless of whether `char` is signed or unsigned.
This is not a goal, though, but a side effect of picking a consistent
platform-independent representation for code unit types. Unlike `printf`,
`std::format` doesn’t need to convey signedness or other type information in
format specifiers. The latter is an artefact of varargs limitations.
The current paper updates the `__cpp_lib_format` feature test macro instead of
introducing a new one, since the amount of work needed to check the macro and
perform a different action based on it is comparable to switching to a type whose
signedness doesn’t depend on the implementation (`signed char` or
`unsigned char`).
7. Wording
Update the value of the feature-testing macro `__cpp_lib_format` to the date of
adoption in [version.syn].
Change in [tab:format.type.char]:
Table 69: Meaning of type options for `charT` [tab:format.type.char]
Type | Meaning |
---|---|
none, `c` | Copies the character to the output. |
`b`, `B`, `d`, `o`, `x`, `X` | As specified in Table 68 with `value` converted to the corresponding unsigned type. |
`?` | Copies the escaped character ([format.string.escaped]) to the output. |
Change in [format.arg]:
template<class T> explicit basic_format_arg(T& v) noexcept;
...
Effects: Let `TD` be `remove_const_t<T>`.
- If `TD` is `bool` or `char_type`, initializes `value` with `v`;
- otherwise, if `TD` is `char` and `char_type` is `wchar_t`, initializes `value` with `static_cast<wchar_t>(static_cast<unsigned char>(v))` (replacing `static_cast<wchar_t>(v)`);
...
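The effect of the extra cast can be seen in isolation with two small functions that mirror the old and new initializers. This is only a sketch: `widen_before` and `widen_after` are hypothetical names, not members of `basic_format_arg`.

```cpp
// Hypothetical stand-in for the old initializer: a negative char value is
// converted to wchar_t directly, so the result depends on whether char is
// signed and on the signedness and width of wchar_t.
inline wchar_t widen_before(char v) {
    return static_cast<wchar_t>(v);
}

// Hypothetical stand-in for the new initializer: going through
// unsigned char first yields a value in the range 0..UCHAR_MAX on every
// platform.
inline wchar_t widen_after(char v) {
    return static_cast<wchar_t>(static_cast<unsigned char>(v));
}
```

Assuming 8-bit `char`, `widen_after('\xf0')` is always the code unit `0xf0`, while `widen_before('\xf0')` may be a different value when `char` is signed. Both functions agree for non-negative inputs such as `'a'`.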
8. Impact on existing code
This is a breaking change, but it only affects the output of negative/large
code units when output via opt-in format specifiers. No issues were
reported when the change was shipped in {fmt}, and the number of uses of
`std::format` is orders of magnitude smaller at the moment.
9. Implementation
The proposed change has been implemented in the {fmt} library ([FMT]).