Extending <charconv> support to more character types
- Document number: P3876R0
- Date: 2025-11-13
- Audience: SG16
- Project: ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to: Jan Schultke <janschultke@gmail.com>
- Co-authors: Peter Bindels <dascandy@gmail.com>
- GitHub Issue: wg21.link/P3876/github
- Source: github.com/eisenwave/cpp-proposals/blob/master/src/charconv-ext.cow
This paper proposes support for char8_t and other character types in std::to_chars and std::from_chars.
Contents
Introduction
Design
Which character types to support
char8_t
char16_t and char32_t
wchar_t
to_chars
from_chars
Unicode error handling
"Fixing" std::format(std::wformat_string)
Function signature and result types
Result type
Summary
constexpr floating-point overloads
More composable interface taking std::span or std::string_view
Impact on existing code
Implementation experience
Implementation survey
New alias templates
Wording
[version.syn]
[charconv.syn]
[charconv.to.chars]
[charconv.from.chars]
[format.string.std]
[diff.cpp26.format]
References
1. Introduction
Support for char8_t and other Unicode character types
in std::to_chars and std::from_chars is clearly useful.
File formats such as JSON require the use of Unicode character encodings,
so an application that deals with JSON may want to use char8_t in its APIs
and internally.
However, when attempting to use std::to_chars for this purpose,
one quickly runs into problems
because std::to_chars only writes to a range of char.
The user could use the char overload
and then transcode to UTF-8 as char8_t,
but the standard library provides no transcoding facilities yet.
Even if there were such support, using char would be an unnecessary middle man.
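A minimal sketch of the workaround available today, assuming an ASCII-compatible ordinary literal encoding. The helper name widen_to_chars is hypothetical and not part of any library; it only illustrates the extra copy through char that the text calls a middle man:

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical helper: formats `value` with the existing char overload of
// std::to_chars and then copies each code unit into a string of CharT.
// ASSUMPTION: the ordinary literal encoding is ASCII-compatible, so the
// numeric value of every produced char equals its Unicode code point.
template <class CharT>
std::basic_string<CharT> widen_to_chars(long long value, int base = 10)
{
    char buffer[66]; // enough for 64 binary digits plus a sign
    const std::to_chars_result result =
        std::to_chars(buffer, buffer + sizeof buffer, value, base);
    if (result.ec != std::errc{}) {
        return {};
    }
    std::basic_string<CharT> out;
    for (const char* p = buffer; p != result.ptr; ++p) {
        // The extra pass over the output that a direct overload would avoid.
        out.push_back(static_cast<CharT>(static_cast<unsigned char>(*p)));
    }
    return out;
}
```

The same template can be instantiated with char8_t in C++20. On a platform where char is not ASCII-based, the plain cast would be wrong and real transcoding would be needed.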
In general, std::to_chars and std::from_chars are important
cornerstones upon which other facilities are built,
or could be built in the future.
The lack of support for char8_t (and other character types)
severely limits what can be done elsewhere:
- std::to_chars accepting char8_t is arguably a prerequisite for std::format with char8_t format strings because conversions of arithmetic types are specified in terms of std::to_chars.
- std::print(u8"") would similarly need std::to_chars to function with char8_t.
- A hypothetical std::u8to_string could not easily be created because std::to_string is specified to return std::format("{}", val), i.e. in terms of std::to_chars.
- A hypothetical string parsing counterpart to std::format would presumably be specified in terms of std::from_chars, but this would be problematic if parsing char8_t strings is to be supported.
Providing support for Unicode character types would be relatively simple.
All characters produced by std::to_chars
and all characters accepted by std::from_chars fall into the Basic Latin (ASCII) block
and are part of the basic character set ([lex.charset]).
This means that any existing implementation for ASCII-encoded char could be made to work
with Unicode characters trivially.
2. Design
The design strategy is to prioritize simplicity and performance.
std::to_chars and std::from_chars are meant to be low-level,
high-performance conversion functions.
Decoding non-ASCII representations of digits,
handling UTF-8 encoding errors in detail, etc. are out of the question.
Most of the design choices are obvious,
but unfortunately,
the existing <charconv> functions have been designed as non-templates,
which we cannot reasonably perpetuate.
Most of the difficult design choices revolve around how to add the new overloads
without breaking changes to code which uses <charconv>.
2.1. Which character types to support
All character types should be supported by std::to_chars and std::from_chars.
Find rationale for each type below.
2.1.1. char8_t
Due to how common UTF-8 is and due to char8_t now regularly being used
to represent UTF-8 text in C++ software,
the motivation in §1. Introduction mostly refers to char8_t.
In fact, there is a dedicated [SG16-Issue] for it.
2.1.2. char16_t and char32_t
However, other Unicode encodings such as UTF-16 and UTF-32 are regularly used as well,
and if support for UTF-8 exists,
it is trivial to support these other encodings (through char16_t and char32_t)
because the conversion functions only deal with code points in the Basic Latin block anyway,
where code units are interchangeable.
Overall, the goal should be for a std::to_chars implementation
to emit the same code units/points for any Unicode character type,
and for std::from_chars to consume the same code units/points.
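The interchangeability of code units in the Basic Latin block can be checked at compile time. The following is a minimal sketch; digit_code_unit is a hypothetical helper introduced only for this illustration and is not part of the proposal:

```cpp
// Hypothetical helper: the code unit for a digit in the range 0..35,
// expressed purely through Unicode code points.
template <class CharT>
constexpr CharT digit_code_unit(int digit)
{
    return static_cast<CharT>(digit < 10 ? 0x30 + digit          // U+0030..U+0039
                                         : 0x61 + (digit - 10)); // U+0061..U+007A
}

// The same numeric value is the correct code unit in UTF-8, UTF-16, and UTF-32.
static_assert(u8'7' == 0x37 && u'7' == 0x37 && U'7' == 0x37,
              "U+0037 DIGIT SEVEN has the same code unit in every encoding form");
static_assert(digit_code_unit<char16_t>(7) == u'7', "UTF-16 digit");
static_assert(digit_code_unit<char32_t>(7) == U'7', "UTF-32 digit");
static_assert(digit_code_unit<char32_t>(35) == U'z', "UTF-32 letter digit");
```

Because the mapping never leaves the Basic Latin block, a single table or formula serves every Unicode character type.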
2.1.3. wchar_t
wchar_t support is slightly less motivated, and wchar_t isn't used much
outside of Windows environments.
However, it is not difficult to provide support for wchar_t,
and Windows C++ software may benefit from this support
(e.g. when feeding the output of std::to_chars into Windows API functions
accepting wide-character strings).
2.2. to_chars
The output format of std::to_chars for all character types should be identical
to that for char.
This is easily implementable because all characters produced by std::to_chars
are Basic Latin characters in the basic character set.
2.3. from_chars
The formats accepted by std::from_chars should be identical to those for char,
which are specified in terms of functions like strtol in the "C" locale.
std::from_chars for Unicode characters should not accept any further constructs,
such as non-ASCII representations of digits,
because this goes against its stated design goal of being a low-level,
high-performance utility for parsing numbers.
2.3.1. Unicode error handling
It is possible that a user attempts to invoke std::from_chars
on a malformed Unicode string.
However, this does not mean that any special consideration needs to be paid
to UTF-8 or other encodings.
std::from_chars simply assumes that the given character range
contains a pattern (for integers, a sequence of digits with optional prefix)
at the start of the range;
this pattern is made entirely of characters in the Basic Latin block.
All Unicode encodings are designed so that only code points in the Basic Latin block
can be encoded with code units in the range [0, 0x80).
This means that simply treating greater code units as not part of the
pattern (which any implementation for ASCII-based char does already)
is a proper way of Unicode error handling.
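The error handling strategy above can be sketched as follows. The names parse_result and parse_decimal are hypothetical, and the sketch deliberately omits overflow detection and sign handling; it only shows that ill-formed or non-ASCII code units end the pattern without any encoding-specific logic:

```cpp
// Hypothetical result type for this sketch only.
template <class CharT>
struct parse_result {
    const CharT* ptr; // first code unit that is not part of the pattern
    unsigned value;   // parsed value (0 if no digit matched)
};

// Parses a run of decimal digits at the start of [first, last).
// Any code unit outside U+0030..U+0039, including UTF-8 lead or continuation
// bytes and unpaired UTF-16 surrogates, simply ends the pattern.
// Overflow handling is omitted for brevity.
template <class CharT>
parse_result<CharT> parse_decimal(const CharT* first, const CharT* last)
{
    unsigned value = 0;
    for (; first != last; ++first) {
        // Negative values of signed character types become large unsigned
        // values here and are therefore rejected by the range check.
        const auto unit = static_cast<unsigned long>(*first);
        if (unit < 0x30 || unit > 0x39) {
            break;
        }
        value = value * 10 + static_cast<unsigned>(unit - 0x30);
    }
    return {first, value};
}
```

Given the UTF-16 input 1, 2, 3 followed by an unpaired surrogate, the sketch yields 123 and a pointer to the surrogate, which is the same outcome as for any other non-digit.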
2.4. "Fixing" std::format(std::wformat_string)
Since this proposal argues for wchar_t support in std::to_chars,
it makes sense to re-specify std::format to call std::to_chars "directly".
The current [N5014] wording in [format.string.std] paragraph 20 works as follows:
The meaning of some non-string presentation types is defined in terms of a call to
to_chars. In such cases, let [first, last) be a range large enough to hold the to_chars output and value be the formatting argument value. Formatting is done as if by calling to_chars as specified and copying the output through the output iterator of the format context.
For std::wformat_string,
this means that the to_chars overload for char is called,
and the resulting char values are copied into the output.
This could hypothetically result in nonsensical and malformed output
if the ordinary literal encoding and wide literal encoding are completely different,
such as if char is EBCDIC and wchar_t is UTF-16.
This is technically allowed by [lex.charset],
although no known implementation exists that makes such an exotic design choice.
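To illustrate the hypothetical failure mode, the following sketch simulates such a platform. It assumes a made-up function ebcdic_digits standing in for the char overload of to_chars on an EBCDIC system (where the digits 0 to 9 are encoded as 0xF0 to 0xF9), and uses char16_t as a stand-in for a UTF-16 wchar_t. Neither function exists in any library:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the char overload of to_chars on an EBCDIC
// platform: produces the EBCDIC code units of the decimal digits of `value`.
inline std::vector<unsigned char> ebcdic_digits(unsigned value)
{
    std::vector<unsigned char> reversed;
    do {
        reversed.push_back(static_cast<unsigned char>(0xF0 + value % 10));
        value /= 10;
    } while (value != 0);
    return std::vector<unsigned char>(reversed.rbegin(), reversed.rend());
}

// Copies the char output into UTF-16 code units as is, without transcoding,
// which is how this paper reads the current wording for wide format strings.
inline std::u16string copy_without_transcoding(const std::vector<unsigned char>& units)
{
    std::u16string out;
    for (unsigned char unit : units) {
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

Under these assumptions, formatting 123 yields the code units U+00F1, U+00F2, U+00F3 (the accented letters ñ, ò, ó) rather than the digits 1, 2, 3.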
If we specified std::format to instead call the std::to_chars
overload for the same character type as the format string (as proposed),
this would be an observable change,
but would only impact hypothetical implementations where
formatting with a std::wformat_string is utterly broken anyway.
2.5. Function signature and result types
std::to_chars and std::from_chars
are not function templates despite working with a wide variety of integer types
(at least, there need to be 11 overloads:
one for each signed and unsigned integer type and one for char).
If we also added a non-template overload for each character type,
this would result in an absurd overload set of 55 functions
(11 integer types × 5 character types).
Such a huge overload set is clearly undesirable,
so function templates are necessary.
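As a rough illustration of how one function template can replace the whole overload set, consider the sketch below. Everything in namespace sketch is hypothetical: the names, the result type, and the signature are assumptions made for this example and are not the declarations proposed in this paper.

```cpp
#include <cstddef>
#include <system_error>
#include <type_traits>

namespace sketch {

// Hypothetical result type; illustrative only.
template <class CharT>
struct result {
    CharT* ptr;
    std::errc ec;
};

// One function template instead of one overload per (integer, character) pair.
// Digits are produced as code points U+0030..U+0039 and U+0061..U+007A.
// ASSUMPTION: 2 <= base <= 36 and [first, last) is a valid range.
template <class CharT, class Int>
result<CharT> to_chars(CharT* first, CharT* last, Int value, int base = 10)
{
    static_assert(std::is_integral<Int>::value && !std::is_same<Int, bool>::value,
                  "Int must be an integer type other than bool");
    using UInt = typename std::make_unsigned<Int>::type;

    // The cast to long long is only evaluated for signed types.
    const bool negative = std::is_signed<Int>::value && static_cast<long long>(value) < 0;
    UInt magnitude = static_cast<UInt>(value);
    if (negative) {
        // Unsigned negation is well-defined even for the minimum value.
        magnitude = static_cast<UInt>(static_cast<UInt>(0) - magnitude);
    }

    CharT reversed[sizeof(Int) * 8 + 1]; // worst case: base 2 plus a sign
    std::size_t length = 0;
    do {
        const unsigned digit = static_cast<unsigned>(magnitude % static_cast<UInt>(base));
        reversed[length++] =
            static_cast<CharT>(digit < 10 ? 0x30 + digit : 0x61 + (digit - 10));
        magnitude = static_cast<UInt>(magnitude / static_cast<UInt>(base));
    } while (magnitude != 0);
    if (negative) {
        reversed[length++] = static_cast<CharT>(0x2D); // U+002D HYPHEN-MINUS
    }

    if (static_cast<std::size_t>(last - first) < length) {
        return {last, std::errc::value_too_large};
    }
    while (length != 0) {
        *first++ = reversed[--length];
    }
    return {first, std::errc{}};
}

} // namespace sketch
```

A call such as sketch::to_chars(buf, buf + 8, -255, 16) with a char16_t buffer deduces both template parameters, so no per-type overloads are needed. A production implementation would of course be far more optimized.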
2.5.1. Result type
The existing std::to_chars_result and std::from_chars_result
classes cannot be turned into class templates without breaking both API and ABI.
That is because any existing aliases or uses of these types in
function parameters, return types, etc. would break if they were turned into templates.
Name mangling would also change.
There is also no good name for a new class template,
and if one were used only for Unicode characters,
the asymmetry with the char overloads would be even more apparent.
However, we could create one result type per character type,
as well as two alias templates
which select the appropriate result class.
Another possible option is to create a base class as follows:
This would also allow deduction of the character type from the result type,
unlike adding a new set of independent types.
However, this also technically breaks API because moving members into a base class
changes aggregate initialization.
Overall, the safest option is to make no changes to the existing result types.
2.5.2. Summary
In code, the design can be summarized as follows:
2.6. constexpr floating-point overloads
If [P3652R1] "Constexpr floating-point <charconv> functions" is accepted,
all new templated overloads should be made constexpr.
There is no good reason why only the char overload should be constexpr.
Otherwise, users needing constant evaluation would have to call the char
overload and transcode from the ordinary literal encoding to the desired encoding.
2.7. More composable interface taking std::span or std::string_view
It is worth noting that there is a stale proposal
[P2584R0] "A More Composable from_chars"
which proposes additional overloads taking std::string_view,
superseding the even more stale
[P2007R0] "std::from_chars should work with std::string_view".
Such changes are orthogonal to what is proposed here.
However, it needs to be considered what impact such new overloads
would have on the functions added here.
In particular, [P2584R0] proposes an interface such as:
If this were added, a non-breaking change would require adding four more overloads
taking std::u8string_view, std::u16string_view, std::u32string_view,
and std::wstring_view.
A similar change to std::to_chars would actually expand the overload set by
20 function templates (5 character types × (1 integer overload + 3 floating-point overloads)),
resulting in 11 + 4 + 20 = 35 candidates in the overload set
(including the ones proposed here).
With the benefit of foresight, perhaps we should aim at a smaller overload set
and take a range parameter instead.
In any case, those changes are not within the scope of this proposal.
3. Impact on existing code
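The proposal is a pure extension
of the std::to_chars and std::from_chars overload sets.
The existing non-template overloads for char
and various arithmetic types are preserved.
The re-specification of std::format is technically an observable change,
but one with no known impact on existing code.
See §2.4. "Fixing" std::format(std::wformat_string) and § [diff.cpp26.format].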
4. Implementation experience
Any existing implementation of std::to_chars and std::from_chars
for a platform with ASCII-based char (Windows, POSIX, etc.)
is numerically implementing what is proposed here.
That is, the implementation may not use char8_t,
but it produces or consumes values with the same numeric values.
4.1. Implementation survey
Find below a summary of existing implementations of std::to_chars and std::from_chars
in the three major standard libraries.
This is necessary to understand what difficulties implementations would face
when supporting additional character types.
| Functions | libstdc++ | libc++ | MSVC STL |
|---|---|---|---|
| to_chars (integer) | std/charconv | to_chars_integral.h | inc/charconv |
| to_chars (floating-point) | floating_to_chars.cc | to_chars_floating_point.h | inc/charconv |
| from_chars (integer) | std/charconv | to_chars_integral.h | inc/charconv |
| from_chars (floating-point) | floating_from_chars.cc | from_chars_floating_point.h | inc/charconv |
All implementations are quite similar:
the underlying function performing the conversion is a function template
with a type parameter for the arithmetic type,
to handle integer types or floating-point types in bulk.
These could easily be turned into templates which also have a character type parameter.
The only difficulty would be converting the existing uses of ordinary
character and string literals into correctly typed literals for the character type.
For example, in libc++, such a literal may have to be wrapped in a static_cast
to avoid implicit conversion warnings.
A static_cast is correct in this case because libc++ only supports ASCII-encoded
char and wchar_t,
so that all character types are numerically interchangeable for code points
in the Basic Latin block.
The only standard library implementation that supports non-ASCII char
is the IBM XL C++ for z/OS,
but according to
IBM's documentation,
no <charconv> implementation exists yet.
Even if char is non-ASCII,
the "fix" is usually as simple as a :
4.2. New alias templates
The proposed alias templates can be implemented as follows:
The implementation of the alias template for std::from_chars results is analogous.
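One possible shape of such an alias template is sketched below. The class names follow the Result: items in the wording, but they are declared here as local stand-ins in a hypothetical namespace; the trait name to_chars_result_for and the alias name to_chars_result_t are assumptions made for this example. The char8_t case is omitted to keep the sketch valid in C++17.

```cpp
#include <system_error>
#include <type_traits>

namespace sketch {

// Stand-ins for the per-character-type result classes.
struct to_chars_result    { char*     ptr; std::errc ec; };
struct u16to_chars_result { char16_t* ptr; std::errc ec; };
struct u32to_chars_result { char32_t* ptr; std::errc ec; };
struct wto_chars_result   { wchar_t*  ptr; std::errc ec; };

// The primary template is left undefined so that non-character types are rejected.
template <class CharT> struct to_chars_result_for;
template <> struct to_chars_result_for<char>     { using type = to_chars_result; };
template <> struct to_chars_result_for<char16_t> { using type = u16to_chars_result; };
template <> struct to_chars_result_for<char32_t> { using type = u32to_chars_result; };
template <> struct to_chars_result_for<wchar_t>  { using type = wto_chars_result; };

// Hypothetical alias template selecting the result class for CharT.
template <class CharT>
using to_chars_result_t = typename to_chars_result_for<CharT>::type;

} // namespace sketch
```

Explicit specializations keep the mapping open to inspection and produce a hard error for unsupported types.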
5. Wording
The following changes are relative to [N5014].
[version.syn]
In [version.syn], bump the feature-test macro:
and
macros are not bumped.
[charconv.syn]
In [charconv.syn], modify the synopsis as follows:
Immediately preceding [charconv.syn] paragraph 2, insert a paragraph as follows:
The exposition-only concept
is modeled by any character type ([basic.fundamental]).
[charconv.to.chars]
Immediately following [charconv.to.chars] paragraph 1, insert the following paragraph:
The output style of all functions named to_chars is specified in terms of
characters in the basic character set (and thus in terms of their Unicode code points)
or directly in terms of code points.
The output code points are inserted into the range [first, last)
by encoding them in the respective literal encoding for character literals
of the type of *first.
Immediately following [charconv.to.chars] paragraph 3, insert the following item:
Result:
to_chars_result if T is char, u8to_chars_result if T is char8_t, u16to_chars_result if T is char16_t, u32to_chars_result if T is char32_t, and wto_chars_result if T is wchar_t.
Modify the overload for integer types as follows:
Constraints:
T is a character type ([basic.fundamental]).
The type of value is a signed or unsigned integer type or char.
Preconditions:
base has a value between 2 and 36 (inclusive).
Effects:
The value of value is converted to a string of digits in the given base
(with no redundant leading zeroes).
Digits
in the range 0..9
are represented as U+0030..U+0039 DIGIT ZERO..NINE, and digits
in the range 10..35 are represented as
U+0061..U+007A LATIN SMALL LETTER A..Z.
If value is less than zero,
the representation starts with
U+002D HYPHEN-MINUS.
Throws: Nothing.
The "(inclusive)" is removed because the range notation "A..B" is universally inclusive, with no disambiguation required. Such range notation is already used in [lex.charset] without any attempt at disambiguation.
Modify the overloads for floating-point types as follows:
Constraints:
T is a character type ([basic.fundamental]).
The type of value is a cv-unqualified floating-point type.
Effects:
value is converted to a string
in the style of printf in the "C" locale.
The conversion specifier is f or e,
chosen according to the requirement for a shortest representation (see above);
a tie is resolved in favor of f.
Throws: Nothing.
Constraints:
T is a character type ([basic.fundamental]).
The type of value is a cv-unqualified floating-point type.
Preconditions:
fmt has the value of one of the enumerators of chars_format.
Effects:
value is converted to a string
in the style of printf in the "C" locale.
Throws: Nothing.
Constraints:
T is a character type ([basic.fundamental]).
The type of value is a cv-unqualified floating-point type.
Preconditions:
fmt has the value of one of the enumerators of chars_format.
Effects:
value is converted to a string
in the style of printf in the "C" locale
with the given precision.
Throws: Nothing.
In C, the f conversion specifier of printf is worded as follows:
f, F — A double argument representing a floating-point number is converted to decimal notation in the style [-]ddd.ddd, where the number of digits after the decimal-point character is equal to the precision specification.
This abstract description of the output style
(where presumably, "-" and "." are intended to represent characters
in the basic character set)
can be applied to the new overloads working with char8_t and other character types,
just like it is applied to char.
It may be beneficial to reword the whole subclause in terms of code points and decoupled from C wording, but this would take considerable effort and isn't necessary for this proposal.
If [P3652R1] has been accepted
or a later paper marked the existing floating-point
overloads constexpr,
modify all the added overloads as follows:
[charconv.from.chars]
Modify [charconv.from.chars] paragraph 1 as follows:
All functions named from_chars
analyze the string [first, last) for a pattern,
where [first, last) is required to be a valid range.
If no code units match the pattern, value is unmodified,
the member ptr of the return value is first
and the member ec is equal to errc::invalid_argument.
[Note:
If the pattern allows for an optional sign,
but the string has no digit code units following the sign,
no code units match the pattern.
— end note]
Otherwise, the code units matching the pattern are interpreted
as a representation of a value of the type of value.
The member ptr of the return value points to the first code unit
not matching the pattern, or has the value last
if all code units match.
If the parsed value is not in the range representable by the type of value,
value is unmodified and the member ec of the return value
is equal to errc::result_out_of_range.
Otherwise, value is set to the parsed value,
after rounding according to round_to_nearest ([round.style]),
and the member ec is value-initialized.
Immediately following [charconv.from.chars] paragraph 1, insert a new paragraph:
The pattern of all functions named from_chars is specified in terms of
characters in the basic character set (and thus in terms of their Unicode code points)
or directly in terms of code points.
The analyzed pattern consists of those code points,
encoded as code units in the respective literal encoding for character literals
of the cv-unqualified type of *first.
[Note: In either form of specification, the pattern consists of code units encoding characters in the basic character set ([lex.charset]), meaning that each code unit encodes exactly one such character. Illegal code units or code units representing characters outside the basic character set are not handled specially; those code units are simply not part of the pattern.
[Example:
— end example] — end note]
Immediately following the inserted paragraph, insert the following item:
Result:
from_chars_result if T is char, u8from_chars_result if T is char8_t, u16from_chars_result if T is char16_t, u32from_chars_result if T is char32_t, and wfrom_chars_result if T is wchar_t.
Modify the overload for integer types as follows:
Constraints:
T is a character type ([basic.fundamental]).
The type of value is a signed or unsigned integer type or char.
Preconditions:
base has a value between 2 and 36 (inclusive).
Effects:
The pattern is a sequence of digits in the given base,
where leading zeroes are ignored.
The code points U+0030..U+0039 DIGIT ZERO..NINE represent digits in the range 0..9;
both U+0041..U+005A LATIN CAPITAL LETTER A..Z and
U+0061..U+007A LATIN SMALL LETTER A..Z represent digits in the range 10..35.
If value is of signed type,
the pattern starts with an optional U+002D HYPHEN-MINUS
which causes the resulting value to be negative.
Throws: Nothing.
Changes to the wording regarding base prefixes
have no impact on the proposed change
because the Effects: are rewritten from scratch.
It would be possible to keep basing the wording on strtol,
but this is quite problematic.
[LWG4430] already fixed the accidental parsing of
binary prefixes in from_chars,
and additional changes will be required for the 0o prefix,
which is added to C2y for octal numbers.
There are so many deviations from the strtol pattern that it arguably provides
negative value to specify in terms of it.
Modify the overload for floating-point types as follows:
Constraints:
T is a character type ([basic.fundamental]).
The type of value is a cv-unqualified floating-point type.
Preconditions:
fmt has the value of one of the enumerators of chars_format.
Effects:
The pattern is the expected form of the subject sequence in the "C" locale,
as described for strtod, except that
- U+002B PLUS SIGN may only appear in the exponent part;
- if fmt has chars_format::scientific set but not chars_format::fixed, the otherwise optional exponent part shall appear;
- if fmt has chars_format::fixed set but not chars_format::scientific, the optional exponent part shall not appear; and
- if fmt is chars_format::hex, the prefix 0x is assumed to precede the string, but is not part of the pattern.
In any case, the resulting value is one of at most two floating-point values
closest to the value of the string matching the pattern.
Throws: Nothing.
[format.string.std]
Modify [format.string.std] paragraph 20 as follows:
The meaning of some non-string presentation types is defined
in terms of a call to to_chars.
In such cases, let
[first, last) be a range
of elements of type charT,
large enough to hold the to_chars output,
and let value be the formatting argument value.
Formatting is done as if by calling to_chars as specified
and copying the output through the output iterator of the format context.
[Note: Additional padding and adjustments are performed prior to copying the output through the output iterator as specified by the format specifiers. — end note]
[diff.cpp26.format]
Add a new subclause in [diff.cpp26] as follows:
[format] formatting [diff.cpp26.format]
Affected subclause: [format.string.std]
Change: Output of std::format and related functions given a std::wformat_string.
Rationale: Enabling consistent behavior for all character types.
Effect on original feature:
The values produced by to_chars when formatting integral
and floating-point values were previously converted to wchar_t
without transcoding from the ordinary to the wide literal encoding.
Now, to_chars is called directly
with a buffer whose type depends on the type of format string,
which may produce different code units for narrow and wide format strings.
[Example:
— end example]