1. Revision history
1.1. Changes since R2
-
Return a subrange from scan, instead of just an iterator: discussion in § 4.5 Argument passing, and return type of scan.
-
Default CharT to char in scanner for consistency with formatter (previously no default for CharT).
-
Add design discussion about thousands separators in § 4.3.5.1 Design discussion: Thousands separator grouping checking and § 4.3.5.2 Design discussion: Separate flag for thousands separators.
-
Add design discussion about additional error information in § 4.6.2 Design discussion: Additional information.
-
Add clarification about field width calculation in § 4.3.4 Width and precision.
-
Add note about scope at the end of § 2 Introduction.
-
Fix/clarify error handling in example § 3.5 Alternative error handling.
-
Address SG16 feedback:
-
Add definition of "whitespace", and clarify matching of non-whitespace literal characters, in § 4.2 Format strings.
-
Add section about text encoding § 4.11 Encoding, and an example about handling the reading of code units § 4.3.8 Type specifiers: CharT.
-
Add example about using locales in § 4.10 Locales.
-
Add potential future extension: § 6.3 Reading code points (or even grapheme clusters?)
-
1.2. Changes since R1
-
Thoroughly describe the design
-
Add examples
-
Add specification (synopses only)
-
Design changes:
-
Return an expected containing a tuple from std::scan, instead of using output parameters
-
Make std::scan take a range instead of a string_view
-
Remove support for partial successes
2. Introduction
With the introduction of std::format [P0645],
standard C++ has a convenient, safe, performant, extensible,
and elegant facility for text formatting,
improving on std::ostream and the printf family of functions.
The story is different for simple text parsing: the standard only
provides std::istream and the scanf family, both of which have issues.
This asymmetry is also arguably an inconsistency in the standard library.
According to [CODESEARCH], a C and C++ codesearch engine based on the ACTCD19
dataset, there are 389,848 calls to sprintf and 87,815 calls to sscanf at
the time of writing. So although formatted input functions are less popular than
their output counterparts, they are still widely used.
The lack of a general-purpose parsing facility based on format strings has been raised in [P1361] in the context of formatting and parsing of dates and times.
This paper explores the possibility of adding a symmetric parsing facility,
to complement the std::format family, called std::scan.
This facility is based on the same design principles and
shares many features with std::format.
This facility is not a parser per se, as it is probably not sufficient
for parsing something more complicated, e.g. JSON.
This is not a parser combinator library.
This is intended to be an almost-drop-in replacement for sscanf,
capable of being a building block for a more complicated parser.
3. Examples
3.1. Basic example
if (auto result = std::scan<std::string, int>("answer = 42", "{} = {}")) {
  //                        ~~~~~~~~~~~~~~~~   ~~~~~~~~~~~    ~~~~~~~
  //                          output types        input       format
  //                                                          string

  const auto& [key, value] = result->values();
  //           ~~~~~~~~~~
  //            scanned
  //             values

  // result == true
  // result->range() gives an empty range (result->begin() == result->end())
  // key == "answer"
  // value == 42
} else {
  // We’ll end up here if we had an error
  // Inspect the returned scan_error with result.error()
}
3.2. Reading multiple values at once
auto input = "25 54.32E-1 Thompson 56789 0123";

auto result = std::scan<int, float, string_view, int, float, int>(
  input, "{:d}{:f}{:9}{:2i}{:g}{:o}");

// result is a std::expected; operator-> requires that it contains a value
// (the check is omitted here for brevity)
auto [i, x, str, j, y, k] = result->values();

// i == 25
// x == 54.32e-1
// str == "Thompson"
// j == 56
// y == 789.0
// k == 0123
3.3. Reading from an arbitrary range
std::string input{"123 456"};
if (auto result = std::scan<int>(std::views::reverse(input), "{}")) {
  // If only a single value is returned, it can be inspected with result->value()
  // result->value() == 654
}
3.4. Reading multiple values in a loop
std::vector<int> read_values;
std::ranges::forward_range auto range = ...;

auto input = std::ranges::subrange{range};

while (auto result = std::scan<int>(input, "{}")) {
  read_values.push_back(result->value());
  input = result->range();
}
3.5. Alternative error handling
// Since std::scan returns a std::expected,
// its monadic interface can be used

auto result = std::scan<int>(..., "{}")
  .transform([](auto result) {
    return result.value();
  });
if (!result) {
  // handle error
}
int num = *result;

// With [P2561]:
int num = std::scan<int>(..., "{}").try?.value();
3.6. Scanning a user-defined type
struct mytype {
  int a{}, b{};
};

// Specialize std::scanner to add support for user-defined types
// Inherit from std::scanner<string> to get format string parsing (scanner::parse()) from it
template <>
struct std::scanner<mytype> : std::scanner<std::string> {
  template <typename Context>
  auto scan(mytype& val, Context& ctx) const
      -> std::expected<typename Context::iterator, std::scan_error> {
    return std::scan<int, int>(ctx.range(), "[{}, {}]")
      .transform([&val](const auto& result) {
        std::tie(val.a, val.b) = result.values();
        return result.begin();
      });
  }
};

auto result = std::scan<mytype>("[123, 456]", "{}");
// result->value().a == 123
// result->value().b == 456
4. Design
The new parsing facility is intended to complement the existing C++ I/O streams
library, integrate well with the chrono library, and provide an API similar to
std::format. This section discusses the major features of its design.
4.1. Overview
The main user-facing part of the library described in this paper
is the function template std::scan, the input counterpart of std::format.
The signature of std::scan is as follows:
template <class... Args, scannable_range<char> Range>
auto scan(Range&& range, format_string<Args...> fmt)
  -> expected<scan_result<ranges::borrowed_ssubrange_t<Range>, Args...>, scan_error>;

template <class... Args, scannable_range<wchar_t> Range>
auto scan(Range&& range, wformat_string<Args...> fmt)
  -> expected<scan_result<ranges::borrowed_ssubrange_t<Range>, Args...>, scan_error>;
std::scan reads values of type Args... from the range it’s given,
according to the instructions given to it in the format string, fmt.
std::scan returns a std::expected, containing either a scan_result, or a scan_error.
The scan_result object contains a subrange pointing to the unparsed input,
and a tuple of Args..., containing the scanned values.
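To make the shape of this return value concrete, here is a minimal sketch. It is not the proposed API: toy_result, toy_scan_int and read_all_ints are names invented for this illustration, std::optional stands in for std::expected (so that the sketch compiles as C++17), and only decimal int values and ASCII spaces are handled.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical stand-in for scan_result: the scanned value plus the unparsed input.
struct toy_result {
    int value;
    std::string_view rest;
};

// Skips leading ASCII spaces, then parses one decimal int with std::from_chars.
// Returns std::nullopt on failure (standing in for the scan_error case).
std::optional<toy_result> toy_scan_int(std::string_view input) {
    while (!input.empty() && input.front() == ' ') input.remove_prefix(1);
    int value{};
    const char* first = input.data();
    const char* last = first + input.size();
    const auto res = std::from_chars(first, last, value);
    if (res.ec != std::errc{}) return std::nullopt;
    return toy_result{value, std::string_view(res.ptr, static_cast<std::size_t>(last - res.ptr))};
}

// Reads ints until scanning fails, feeding the leftover input back in each time.
std::vector<int> read_all_ints(std::string_view input) {
    std::vector<int> out;
    while (auto r = toy_scan_int(input)) {
        out.push_back(r->value);
        input = r->rest;
    }
    return out;
}
```

The value is only reachable through a successful result, and the leftover view lets the caller continue where the previous call stopped.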
4.2. Format strings
As with printf, the scanf syntax has the advantage of being familiar to many
programmers. However, it has similar limitations:
-
Many format specifiers like hh, h, l, j, etc. are used only to convey type information. They are redundant in type-safe parsing and would unnecessarily complicate specification and parsing.
-
There is no standard way to extend the syntax for user-defined types.
-
Using '%' in a custom format specifier poses difficulties, e.g. for get_time-like time parsing.
Therefore, we propose a syntax based on std::format and [PARSE]. This syntax
employs '{' and '}' as replacement field delimiters instead of '%'. It
will provide the following advantages:
-
An easy-to-parse mini-language focused on the data format rather than conveying the type information
-
Extensibility for user-defined types
-
Positional arguments
-
Support for both locale-specific and locale-independent parsing (see § 4.10 Locales)
-
Consistency with std::format.
At the same time, most of the specifiers will remain quite similar to the ones
in scanf, which can simplify a, possibly automated, migration.
Maintaining similarity with scanf, for any literal non-whitespace character in
the format string, an identical character is consumed from the input range.
For whitespace characters, all available whitespace characters are consumed.
In this proposal, "whitespace" is defined to be the Unicode code points with the Pattern_White_Space property, as defined by UAX #31 (UAX31-R3a). Those code points are currently:
-
ASCII whitespace characters (U+0009 to U+000D, U+0020)
-
U+0085 (next line)
-
U+200E and U+200F (LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK)
-
U+2028 and U+2029 (LINE SEPARATOR and PARAGRAPH SEPARATOR)
Unicode defines a lot of different things
in the realm of whitespace, all for different kinds of use cases.
The Pattern_White_Space-property is chosen for its stability (it’s guaranteed to not change),
and because its intended use is for classifying things that should be treated as
whitespace in machine-readable syntaxes.
std::isspace is insufficient for usage in a Unicode world,
because it only accepts a single code unit as input.
auto r0 = std::scan<char>("abcd", "ab{}d");
// r0->value() == 'c'

auto r1 = std::scan<string, string>("abc \n def", "{} {}");
const auto& [s1, s2] = r1->values();
// s1 == "abc", s2 == "def"
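The matching rule for literal characters can be sketched as follows. This is an illustration only: match_literal and is_ascii_space are hypothetical names, the whitespace test is simplified to the ASCII subset of Pattern_White_Space, and a whitespace character in the format is assumed to also match zero whitespace characters in the input (as scanf does).

```cpp
#include <optional>
#include <string_view>

// ASCII-only simplification (U+0009 to U+000D, U+0020);
// the proposal uses the full Pattern_White_Space set.
constexpr bool is_ascii_space(char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Matches the literal (non-replacement-field) part of a format string
// against the input. Returns the unconsumed input, or std::nullopt on mismatch.
std::optional<std::string_view> match_literal(std::string_view fmt, std::string_view input) {
    for (char f : fmt) {
        if (is_ascii_space(f)) {
            // A whitespace character consumes all available whitespace.
            while (!input.empty() && is_ascii_space(input.front())) input.remove_prefix(1);
        } else {
            // Any other character must be matched by an identical input character.
            if (input.empty() || input.front() != f) return std::nullopt;
            input.remove_prefix(1);
        }
    }
    return input;
}
```

For instance, match_literal(" ", " \n def") leaves "def", which is why a single space in "{} {}" is enough to step over the " \n " between the two words.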
As mentioned above, the format string syntax consists of replacement fields
delimited by curly brackets ({ and }).
Each of these replacement fields corresponds to a value to be scanned from the input range.
The replacement field syntax is quite similar to std::format, as can be seen in the table below.
Elements that are in one but not the other are highlighted.
replacement field syntax
| replacement field syntax
|
---|---|
|
|
4.3. Format string specifiers
Below is a somewhat detailed description of each of the specifiers
in a std::scan replacement field.
This design attempts to maintain decent compatibility with std::format
whenever practical, while also bringing in some ideas from scanf.
4.3.1. Manual indexing
replacement-field ::= '{' [arg-id] [':' format-spec] '}'
Like std::format, std::scan supports manual indexing of
arguments in format strings. If manual indexing is used,
all of the argument indices have to be spelled out.
The same index can only be used once.
auto r = std::scan<int, int, int>("0 1 2", "{1} {0} {2}");
auto [i0, i1, i2] = r->values();
// i0 == 1, i1 == 0, i2 == 2
4.3.2. Fill and align
fill-and-align ::= [fill] align
fill           ::= any character other than '{' or '}'
align          ::= one of '<' '>' '^'
The fill and align options are valid for all argument types.
The fill character is denoted by the fill-option, or if it is absent,
the space character ' '.
The fill character can be any single Unicode scalar value.
The field width is determined the same way as it is for std::format.
If an alignment is specified, the value to be parsed is assumed to be properly aligned with the specified fill character.
If a field width is specified, it will be the maximum number of characters
to be consumed from the input range.
In that case, if no alignment is specified, the default alignment for the type
is considered (see std::format).
For the ^ alignment, the number of fill characters needs to be
the same as if formatted with std::format:
⌊n/2⌋ characters before, and ⌈n/2⌉ characters after the value,
where n is the total number of fill characters needed to reach the field width.
If no field width is specified, an equal number of alignment characters on both
sides are assumed.
This spec is compatible with std::format,
i.e., the same format string (wrt. fill and align)
can be used with both std::format and std::scan,
with round-trip semantics.
Note: For format type specifiers other than c
(default for char and wchar_t, can be specified for string and string_view),
leading whitespace is skipped regardless of alignment specifiers.
auto r0 = std::scan<int>(" 42", "{}"); // r0->value() == 42, r0->range() == ""
auto r1 = std::scan<char>(" x", "{}"); // r1->value() == ' ', r1->range() == " x"
auto r2 = std::scan<char>("x ", "{}"); // r2->value() == 'x', r2->range() == " "

auto r3 = std::scan<int>(" 42", "{:6}"); // r3->value() == 42, r3->range() == ""
auto r4 = std::scan<char>("x ", "{:6}"); // r4->value() == 'x', r4->range() == ""

auto r5 = std::scan<int>("***42", "{:*>}"); // r5->value() == 42
auto r6 = std::scan<int>("***42", "{:*>5}"); // r6->value() == 42
auto r7 = std::scan<int>("***42", "{:*>4}"); // r7->value() == 4
auto r8 = std::scan<int>("42", "{:*>}"); // r8->value() == 42
auto r9 = std::scan<int>("42", "{:*>5}"); // ERROR (mismatching field width)

auto rA = std::scan<int>("42***", "{:*<}"); // rA->value() == 42, rA->range() == ""
auto rB = std::scan<int>("42***", "{:*<5}"); // rB->value() == 42, rB->range() == ""
auto rC = std::scan<int>("42***", "{:*<4}"); // rC->value() == 42, rC->range() == "*"
auto rD = std::scan<int>("42", "{:*<}"); // rD->value() == 42
auto rE = std::scan<int>("42", "{:*<5}"); // ERROR (mismatching field width)

auto rF = std::scan<int>("42", "{:*^}"); // rF->value() == 42, rF->range() == ""
auto rG = std::scan<int>("*42*", "{:*^}"); // rG->value() == 42, rG->range() == ""
auto rH = std::scan<int>("*42**", "{:*^}"); // rH->value() == 42, rH->range() == "*"
auto rI = std::scan<int>("**42*", "{:*^}"); // ERROR (not enough fill characters after value)

auto rJ = std::scan<int>("**42**", "{:*^6}"); // rJ->value() == 42, rJ->range() == ""
auto rK = std::scan<int>("*42**", "{:*^5}"); // rK->value() == 42, rK->range() == ""
auto rL = std::scan<int>("**42*", "{:*^6}"); // ERROR (not enough fill characters after value)
auto rM = std::scan<int>("**42*", "{:*^5}"); // ERROR (not enough fill characters after value)
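The centering rule behind the last few examples can be worked through with a small sketch. center_padding and is_centered are hypothetical helpers, and the sketch assumes one field width unit per char (plain ASCII). It follows the std::format rule for '^': with n total fill characters, ⌊n/2⌋ go before the value and ⌈n/2⌉ after it.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Returns {fill characters before, fill characters after} for '^' alignment.
constexpr std::pair<std::size_t, std::size_t> center_padding(std::size_t width,
                                                             std::size_t value_width) {
    const std::size_t n = width > value_width ? width - value_width : 0;
    return {n / 2, n - n / 2};  // floor(n/2) before, ceil(n/2) after
}

// Checks whether `field` is `value` centered with `fill` in exactly `width` characters.
bool is_centered(std::string_view field, std::string_view value, char fill, std::size_t width) {
    const auto [before, after] = center_padding(width, value.size());
    if (field.size() != before + value.size() + after) return false;
    for (std::size_t i = 0; i < before; ++i)
        if (field[i] != fill) return false;
    if (field.substr(before, value.size()) != value) return false;
    for (std::size_t i = before + value.size(); i < field.size(); ++i)
        if (field[i] != fill) return false;
    return true;
}
```

With a width of 5 and the two-character value "42", n is 3, so one fill character is expected before and two after: "*42**" is accepted and "**42*" is not.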
Note: This behavior, while compatible with std::format,
is very complicated, and potentially hard to understand for users.
Since scanf doesn’t support parsing of fill characters this way,
it’s possible to leave this feature out for v1, and come back to this later:
it’s not a breaking change to add formatting specifiers that add new behavior.
4.3.3. Sign, #, and 0
format-spec ::= ... [sign] ['#'] ['0'] ...
sign        ::= one of '+' '-' ' '
These flags would have no effect in std::scan, so they are disabled.
Signs (both + and -), base prefixes, trailing decimal points, and leading zeroes
are always allowed for arithmetic values.
Disabling them would be a bad default for a higher-level facility
like std::scan, so flags explicitly enabling them are not needed.
Note: This is incompatible with std::format format strings.
4.3.4. Width and precision
width     ::= positive-integer
              OR '{' [arg-id] '}'
precision ::= '.' nonnegative-integer
              OR '.' '{' [arg-id] '}'
The width specifier is valid for all argument types.
The meaning of this specifier somewhat deviates from std::format.
The width and precision specifiers of it are combined into
a single width specifier in std::scan.
This specifier indicates the expected field width of the value to be
scanned, taking into account possible fill characters used for alignment.
If no fill characters are expected, it specifies the maximum width for the field.
In std::format, the width-field provides the minimum, and the precision-field the maximum
width for a value. In std::scan, the width-field provides the maximum.
auto str = std::format("{:2}", 123);
// str == "123"
// because only the minimum width was set by the format string

auto result = std::scan<int>(str, "{:2}");
// result->value() == 12
// result->range() == "3"
// because the maximum width was set to 2 by the format string
For compatibility with std::format,
the width specifier is in field width units,
which is specified to be 1 per Unicode (extended) grapheme cluster,
except some grapheme clusters are 2 ([format.string.std] ¶ 13):
For a sequence of characters in UTF-8, UTF-16, or UTF-32, an implementation should use as its field width the sum of the field widths of the first code point of each extended grapheme cluster. Extended grapheme clusters are defined by UAX #29 of the Unicode Standard. The following code points have a field width of 2:
any code point with the East_Asian_Width="W" or East_Asian_Width="F" Derived Extracted Property as described by UAX #44 of the Unicode Standard
U+4dc0 – U+4dff (Yijing Hexagram Symbols)
U+1f300 – U+1f5ff (Miscellaneous Symbols and Pictographs)
U+1f900 – U+1f9ff (Supplemental Symbols and Pictographs)
The field width of all other code points is 1.
For a sequence of characters in neither UTF-8, UTF-16, nor UTF-32, the field width is unspecified.
This essentially maps 1 field width unit = 1 user perceived character.
It should be noted, that with this definition, grapheme clusters like emoji have a field width of 2.
This behavior is present in std::format today, but can potentially be surprising to users.
Other options could be considered, if compatibility with std::format
can be set aside.
These options include:
-
Plain bytes or code units
-
Unicode code points
-
Unicode (extended) grapheme clusters
-
std::format-like field width units, except only looking at code points, instead of grapheme clusters
-
Exclusively using UAX #11 (East Asian Width) widths
Specifying the width with another argument, like in std::format, is disallowed.
4.3.5. Localized (L)
format-spec ::= ... ['L'] ...
Enables scanning of values in locale-specific forms.
-
For integer types, allows for digit group separator characters, equivalent to numpunct::thousands_sep of the used locale. If digit group separator characters are used, their grouping must match numpunct::grouping.
-
For floating-point types, the same as above. In addition, the locale-specific radix separator character is used, from numpunct::decimal_point.
-
For bool, the textual representation uses the appropriate strings from numpunct::truename and numpunct::falsename.
4.3.5.1. Design discussion: Thousands separator grouping checking
As proposed, when using localized scanning, the grouping of thousands
separators in the input must exactly match the value retrieved from
numpunct::grouping. This behavior is consistent with iostreams.
It may, however, be undesirable: it is possible that the user
would supply values with incorrect thousands separator grouping,
but that need not be an error. The number is still unambiguously
parseable, with the check for grouping only done after parsing.
struct custom_numpunct : std::numpunct<char> {
  std::string do_grouping() const override {
    return "\3";
  }

  char do_thousands_sep() const override {
    return ',';
  }
};

auto loc = std::locale(std::locale::classic(), new custom_numpunct);

// As proposed:
// Check grouping, error if invalid
auto r0 = std::scan<int>(loc, "123,45", "{:L}");
// r0.has_value() == false

// ALTERNATIVE:
// Do not check grouping, only skip it
auto r1 = std::scan<int>(loc, "123,45", "{:L}");
// r1.has_value() == true
// r1->value() == 12345

// Current proposed behavior, _somewhat_ consistent with iostreams:
istringstream iss{"123,45"};
iss.imbue(locale(locale::classic(), new custom_numpunct));

int i{};
iss >> i;
// i == 12345
// iss.fail() == !iss == true
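The observation that the number stays parseable, with grouping checked only afterwards, can be sketched without any locale machinery. grouped_int and parse_grouped are hypothetical names. The sketch assumes unsigned decimal input, a fixed group size (like a numpunct::grouping() of "\3" when group is 3), and inputs short enough not to overflow long long.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

struct grouped_int {
    long long value;   // the digits, with separators skipped
    bool grouping_ok;  // whether the separators were placed as expected
};

// Parses digits and separators first; the grouping is only checked afterwards.
std::optional<grouped_int> parse_grouped(std::string_view input, char sep = ',',
                                         std::size_t group = 3) {
    long long value = 0;
    std::vector<std::size_t> groups{0};  // digit count of each group, left to right
    for (char c : input) {
        if (c == sep) {
            groups.push_back(0);
        } else if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            ++groups.back();
        } else {
            return std::nullopt;  // neither a digit nor a separator
        }
    }
    std::size_t digits = 0;
    for (std::size_t g : groups) digits += g;
    if (digits == 0) return std::nullopt;

    // The leading group may be shorter; every later group must be full.
    bool ok = groups.front() >= 1 && (groups.size() == 1 || groups.front() <= group);
    for (std::size_t i = 1; i < groups.size(); ++i) ok = ok && groups[i] == group;
    return grouped_int{value, ok};
}
```

A caller that wants the behavior as proposed would treat grouping_ok == false as an error; a caller that prefers the alternative would ignore the flag and use the value.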
This highlights a problem with using std::expected: we can either have a value, or an error.
IOStreams can both return an error, and a value.
This issue is also present with range errors with ints and floats,
see § 4.6.2 Design discussion: Additional information for more.
4.3.5.2. Design discussion: Separate flag for thousands separators
It may also be desirable to split up the behavior of skipping and checking
of thousands separators from the realm of localization. For example,
in the POSIX-extended version of scanf, there’s the ' format specifier,
which allows opting into reading of thousands separators.
When a locale isn’t used, a fixed set of thousands separator options could be used instead,
similar to those of a specific locale (for example, a ',' separator with "\3" grouping).
This would enable skipping of thousands separators without involving locale.
// NOT PROPOSED,
// hypothetical example, with a ' format specifier
auto r = std::scan<int>("123,456", "{:'}");
// r->value() == 123456
4.3.6. Type specifiers: strings
Type | Meaning |
---|---|
none, s | Copies from the input until a whitespace character is encountered. |
? | Copies an escaped string from the input. |
c | Copies from the input until the field width is exhausted. Does not skip preceding whitespace. Errors, if no field width is provided. |
The s specifier is consistent with std::istream and std::string:
std::string word;
std::istringstream{"Hello world"} >> word;
// word == "Hello"

auto r = std::scan<string>("Hello world", "{:s}");
// r->value() == "Hello"
Note: The c specifier is consistent with scanf,
but is not supported for strings by std::format.
4.3.7. Type specifiers: integers
Integer values are scanned as if by using std::from_chars, except:
-
A positive + sign and a base prefix are always allowed to be present.
-
Preceding whitespace is skipped.
Type | Meaning |
---|---|
b, B | from_chars with base 2. The base prefix is 0b or 0B. |
o | from_chars with base 8. For non-zero values, the base prefix is 0. |
x, X | from_chars with base 16. The base prefix is 0x or 0X. |
d | from_chars with base 10. No base prefix. |
u | from_chars with base 10. No base prefix. No sign allowed. |
i | Detect base from a possible prefix, default to decimal. |
c | Copies a character from the input. |
none | Same as d |
Note: The flags u and i are not supported by std::format.
These flags are consistent with scanf.
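The differences from plain std::from_chars listed above (skipped whitespace, an accepted + sign, and base prefixes) can be sketched as a thin layer on top of it. scan_int_detect_base is a hypothetical helper, not part of the proposal. It assumes that the sign precedes the base prefix, only skips ASCII spaces, and returns std::nullopt for every kind of failure.

```cpp
#include <charconv>
#include <limits>
#include <optional>
#include <string_view>
#include <system_error>

std::optional<int> scan_int_detect_base(std::string_view in) {
    // Preceding whitespace is skipped (ASCII space only in this sketch).
    while (!in.empty() && in.front() == ' ') in.remove_prefix(1);

    // std::from_chars accepts '-' but not '+', so the sign is handled here.
    bool negative = false;
    if (!in.empty() && (in.front() == '+' || in.front() == '-')) {
        negative = in.front() == '-';
        in.remove_prefix(1);
    }

    // Detect the base from a possible prefix, defaulting to decimal.
    int base = 10;
    if (in.size() > 2 && in[0] == '0' && (in[1] == 'x' || in[1] == 'X')) {
        base = 16;
        in.remove_prefix(2);
    } else if (in.size() > 2 && in[0] == '0' && (in[1] == 'b' || in[1] == 'B')) {
        base = 2;
        in.remove_prefix(2);
    } else if (in.size() > 1 && in[0] == '0') {
        base = 8;
        in.remove_prefix(1);
    }

    // Reject a second sign, which from_chars would otherwise accept.
    if (in.empty() || in.front() == '-') return std::nullopt;

    long long magnitude{};
    const auto res = std::from_chars(in.data(), in.data() + in.size(), magnitude, base);
    if (res.ec != std::errc{}) return std::nullopt;

    const long long value = negative ? -magnitude : magnitude;
    if (value < std::numeric_limits<int>::min() || value > std::numeric_limits<int>::max())
        return std::nullopt;
    return static_cast<int>(value);
}
```

Fixed-base specifiers such as x or b would follow the same pattern, except that the base is given up front and only the matching prefix is accepted.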
4.3.8. Type specifiers: CharT
Type | Meaning |
---|---|
none, c | Copies a character from the input. |
b, B, o, x, X, d, u, i | Same as for integers. |
? | Copies an escaped character from the input. |
Scanning a CharT with the c type specifier
will just read a single code unit of type CharT.
This can lead to invalid encoding in the scanned values.
// As proposed:
// U+12345 is 0xF0 0x92 0x8D 0x85 in UTF-8
auto r = std::scan<char, std::string>("\u{12345}", "{}{}");
auto& [ch, str] = r->values();
// ch == '\xF0'
// str == "\x92\x8d\x85" (invalid utf-8)

// This is the same behavior as with iostreams today
4.3.9. Type specifiers: bool
Type | Meaning |
---|---|
| Allows for textual representation, i.e. true or false
|
, , , , , ,
| Allows for integral representation, i.e. or
|
none | Allows for both textual and integral representation: i.e. true , , false , or .
|
4.3.10. Type specifiers: floating-point types
Similar to integer types,
floating-point values are scanned as if by using std::from_chars, except:
-
A positive + sign is always allowed to be present.
-
Preceding whitespace is skipped.
Type | Meaning |
---|---|
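a, A | from_chars with chars_format::hex, with 0x/0X-prefix allowed. |
e, E | from_chars with chars_format::scientific. |
f, F | from_chars with chars_format::fixed. |
g, G | from_chars with chars_format::general. |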
none | with , with / -prefix allowed.
|
4.4. Ranges
We propose, that std::scan would take a range as its input.
This range should satisfy the requirements of std::ranges::forward_range to
enable look-ahead, which is necessary for parsing.
template <class Range, class CharT>
concept scannable_range =
  ranges::forward_range<Range> &&
  same_as<ranges::range_value_t<Range>, CharT>;
For a range to be a scannable_range, its character type (range value type, code unit type)
needs to also be correct, i.e. it needs to match the character type of the format string.
Mixing and matching character types between the input range and the format string is not supported.
scan<int>("42", "{}");   // OK
scan<int>(L"42", L"{}"); // OK
scan<int>(L"42", "{}");  // Error: wchar_t[N] is not a scannable_range<char>
It should be noted, that standard range facilities related to iostreams, namely
std::istreambuf_iterator, only model input_iterator.
Thus, they can’t be used with std::scan, and therefore, for example,
std::cin can’t be read directly using std::scan.
The reference implementation deals with this by providing a range type, that wraps a
std::basic_istreambuf, and provides a forward_range-compatible interface to it.
At this point, this is deemed out of scope for this proposal.
To prevent excessive code bloat, implementations are encouraged to type-erase the range
provided to std::scan, in a similar fashion as inside std::format.
This can be achieved with something similar to any_view from Range-v3.
The reference implementation does something similar to this, inside the implementation of vscan,
where ranges that are both contiguous and sized are internally passed along as string_views,
and as type-erased forward_ranges otherwise.
It should be noted, that if the range is not type-erased, the library internals need to be exposed to the user (in a header), and be instantiated for every different kind of range type the user uses.
4.5. Argument passing, and return type of scan
In an earlier revision of this paper, output parameters were used to return the scanned values
from std::scan. In this revision, we propose returning the values instead, wrapped in an expected.
// R2 (current)
auto result = std::scan<int>(input, "{}");
auto [i] = result->values();
// or:
auto i = result->value();

// R1 (previous)
int i;
auto result = std::scan(input, "{}", i);
The rationale behind this change is as follows:
-
It was easy to accidentally use uninitialized values (as evident by the example above). In this revision, the values can only be accessed when the operation is successful.
-
Modern C++ API design principles favor return values over output parameters.
-
The earlier design was conceived at a time, when C++17 support and usage wasn’t as prevalent as it is today. Back then, the only way to use a return-value API was through std::tie, which wasn’t ergonomic.
-
Previously, there were real performance implications when using complicated tuples, both at compile-time and runtime. These concerns have since been alleviated, as compiler technology has improved.
The return type of scan,
scan_result, contains a subrange over the unparsed input.
With this, a new type alias is introduced,
ranges::borrowed_ssubrange_t, that is defined as follows:
template <typename R>
using borrowed_ssubrange_t = std::conditional_t<
  ranges::borrowed_range<R>,
  ranges::subrange<ranges::iterator_t<R>, ranges::sentinel_t<R>>,
  ranges::dangling>;
Note: The name borrowed_ssubrange_t is absolutely horrendous, and is begging for a better alternative.
Compare this with borrowed_subrange_t, which is defined as
subrange<iterator_t<R>, iterator_t<R>>,
when the range models borrowed_range.
This is novel in the Ranges space: previously all algorithms have either returned an iterator,
or a subrange of two iterators. We believe that std::scan
warrants a diversion:
if (for
and
)
is false
,
will need to go through the rest of the input, in order to get the end iterator to return.
A superior alternative is to simply return the sentinel, since that’s always correct
(the leftover range always has the same end as the source range) and requires no additional computation.
See this StackOverflow answer by Barry Revzin for more context: [BARRY-SO-ANSWER].
4.5.1. Design alternatives
As proposed, std::scan returns an expected,
containing either an iterator and a tuple, or a scan_error.
An alternative could be returning a tuple,
with a result object as its first (0th) element, and the parsed values occupying the rest.
This would enable neat usage of structured bindings:
// NOT PROPOSED, design alternative
auto [r, i] = std::scan<int>("42", "{}");
However, there are two possible issues with this design:
-
It’s easy to accidentally skip checking whether the operation succeeded, and access the scanned values regardless. This could be a potential security issue (even though the values would always be at least value-initialized, not default-initialized). Returning an expected forces checking for success.
-
The numbering of the elements in the returned tuple would be off-by-one compared to the indexing used in format strings:
auto r = std::scan<int>("42", "{0}");
// std::get<0>(r) refers to the result object
// std::get<1>(r) refers to {0}
For the same reason as enumerated in 2. above, the scan_result
type as proposed doesn’t follow the tuple protocol, so that structured bindings can’t be used with it:
// NOT PROPOSED
auto result = std::scan<int>("42", "{0}");
// std::get<0>(*result) would refer to the iterator
// std::get<1>(*result) would refer to {0}
4.6. Error handling
Contrasting with std::format, this proposed library communicates errors with return values,
instead of throwing exceptions. This is because error conditions are expected to be much
more frequent when parsing user input, as opposed to text formatting.
With the introduction of std::expected, error handling using return values is also more ergonomic than before,
and it provides a vocabulary type we can use here, instead of designing something novel.
scan_error holds an enumerated error code value, and a message string.
The message is used in the same way as the message in std::exception:
it gives more details about the error, but its contents are unspecified.
// Not a specification, just exposition
class scan_error {
public:
  enum code_type {
    good,

    // EOF:
    // tried to read from an empty range,
    // or the input ended unexpectedly.
    // Naming alternative: end_of_input
    end_of_range,

    invalid_format_string,

    invalid_scanned_value,

    value_out_of_range
  };

  constexpr scan_error() = default;
  constexpr scan_error(code_type, const char*);

  constexpr explicit operator bool() const noexcept;

  constexpr code_type code() const noexcept;
  constexpr const char* msg() const;
};
4.6.1. Design discussion: Essence of std::scan_error
The reason why we propose adding the type std::scan_error
instead of just using std::errc is,
that we want to avoid losing information. The enumerators of std::errc
are insufficient for
this use, as evident by the table below: there are no clear one-to-one mappings between
scan_error::code_type and std::errc, but a single std::errc enumerator
would need to cover a lot of cases.
The const char* message in scan_error
is extremely useful for user code, for use in logging and debugging.
Even with the scan_error::code_type
enumerators, more information is often needed, to isolate any possible problem.
Possible mappings from scan_error::code_type to std::errc could be:
|
|
---|---|
|
|
|
|
| |
| |
|
|
There are multiple dimensions of design decisions to be done here:
-
Should scan_error use a custom enumeration?
  -
  Yes. (currently proposed, our preference)
  -
  No, use std::errc. Loses precision in error codes
-
Should scan_error contain a message?
  -
  Yes, a const char*. (currently proposed, weak preference)
  -
  Yes, a std::string. Potentially more expensive.
  -
  No. Worse user experience for loss of diagnostic information
4.6.2. Design discussion: Additional information
Only having scan_error::value_out_of_range does not give a way to differentiate
between different kinds of out-of-range errors, like overflowing
(absolute value too large, either positive or negative), or underflowing
(value not representable, between zero and the smallest subnormal).
Both iostreams (through std::num_get), and
the strtol family of functions support
differentiating between different kinds of overflow and underflow,
through the magnitude of the returned value.
std::from_chars currently does not (see [LWG3081]).
scanf does not, either.
// larger than INT32_MAX
std::string source{"999999999999999999999999999999"};

{
  std::istringstream iss{source};
  int i{};
  iss >> i;
  // iss.fail() == true
  // i == INT32_MAX
}

{
  // (assuming sizeof(long) == 4)
  auto i = std::strtol(source.c_str(), nullptr, 10);
  // i == LONG_MAX
  // errno == ERANGE
}

{
  int i{};
  // from_chars_result declares ptr first, then ec
  auto [ptr, ec] = std::from_chars(source.data(), source.data() + source.size(), i);
  // ec == std::errc::result_out_of_range
  // i == 0 (!)
}

{
  int i{};
  auto r = std::sscanf(source.c_str(), "%d", &i);
  // r == 1 (?)
  // i == -1 (?)
  // errno == ERANGE
}
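As an illustration of what "through the magnitude of the returned value" means for the strtol family, the following sketch tells the two overflow directions apart. range_status and classify_long are hypothetical names. The sketch does not detect input that contains no number at all, which strtol reports by returning 0 without setting errno to ERANGE.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

enum class range_status { ok, positive_overflow, negative_overflow };

// On overflow, std::strtol sets errno to ERANGE and returns LONG_MAX or LONG_MIN,
// so the returned value itself says in which direction the input was out of range.
range_status classify_long(const std::string& source) {
    errno = 0;
    const long v = std::strtol(source.c_str(), nullptr, 10);
    if (errno != ERANGE) return range_status::ok;
    return v == LONG_MAX ? range_status::positive_overflow
                         : range_status::negative_overflow;
}
```

An interface that returns either a value or an error, but never both, has no equivalent channel for this extra bit of information unless the error type carries it.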
This is predicated on an issue with using std::expected:
we can only ever either return an error, or a value.
Those aforementioned facilities can both return an error code,
while simultaneously communicating additional information
about possible errors through the scanned value.
Nevertheless, there’s a simple reason for using std::expected:
it prevents user errors. Because an expected can indeed
only hold either a value or an error, there’s never a situation
where a user accidentally forgets to check for an error,
and mistakenly uses the scanned value directly instead:
int i{};
std::cin >> i;
// We would need to check std::cin.operator bool() first,
// to determine whether i was successfully read:
// that’s very easy to forget

auto r = std::scan<int>(..., "{}");
int i = r->value();
//       ^
// dereference
// does not allow for accidentally accessing the value if we had an error
It’s a tradeoff.
Either we allow for an additional avenue for error reporting through the scanned value,
or we use std::expected to prevent reading the values during an error.
Currently, this paper proposes doing the latter.
4.7. Binary footprint and type erasure
We propose using a type erasure technique to reduce the per-call binary code size. The scanning function that uses variadic templates can be implemented as a small inline wrapper around its non-variadic counterpart:
template <scannable_range<char> Range>
auto vscan(Range&& range, string_view fmt, scan_args_for<Range> args)
  -> expected<ranges::borrowed_ssubrange_t<Range>, scan_error>;

template <typename... Args, scannable_range<char> SourceRange>
auto scan(SourceRange&& source, format_string<Args...> format)
  -> expected<scan_result<ranges::borrowed_ssubrange_t<SourceRange>, Args...>, scan_error> {
  auto args = make_scan_args<SourceRange, Args...>();
  auto result = vscan(std::forward<SourceRange>(source), format, args);
  return make_scan_result(std::move(result), std::move(args));
}
As shown in [P0645] this dramatically reduces binary code size, which will make
comparable to
on this metric.
type erases the arguments that are to be scanned.
This is similar to
, used with
.
Note: This implementation of
is more complicated
compared to
, which can be described as a one-liner calling
.
This is because the
returned by
needs to outlive the call to
, and then be converted to a
and returned from
.
Whereas with
, the
returned by
is immediately consumed by
, and not used elsewhere.
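The general technique can be sketched independently of the proposed interface. The following is a simplified illustration, assuming C++17. The names erased_arg, erase, vscan_fields, and scan_fields are hypothetical, and the sketch reads whitespace-separated fields rather than interpreting a format string.

```cpp
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical type-erased argument: a tag plus an untyped pointer to the destination.
struct erased_arg {
    enum class kind { integer, word } type;
    void* target;
};

inline erased_arg erase(int& v) { return {erased_arg::kind::integer, &v}; }
inline erased_arg erase(std::string& v) { return {erased_arg::kind::word, &v}; }

// Non-template core: its code exists once, however many argument lists are used.
// Reads space-separated fields and returns the number of fields read successfully.
inline std::size_t vscan_fields(std::string_view input, const erased_arg* args,
                                std::size_t count) {
    std::size_t done = 0;
    for (; done < count; ++done) {
        while (!input.empty() && input.front() == ' ') input.remove_prefix(1);
        if (input.empty()) break;
        const std::string_view field = input.substr(0, input.find(' '));
        input.remove_prefix(field.size());
        const erased_arg& a = args[done];
        if (a.type == erased_arg::kind::integer) {
            int& out = *static_cast<int*>(a.target);
            const char* last = field.data() + field.size();
            const auto res = std::from_chars(field.data(), last, out);
            if (res.ec != std::errc{} || res.ptr != last) break;
        } else {
            static_cast<std::string*>(a.target)->assign(field.data(), field.size());
        }
    }
    return done;
}

// Thin variadic wrapper: only this small function is instantiated per call signature.
template <class... Args>
std::size_t scan_fields(std::string_view input, Args&... args) {
    static_assert(sizeof...(Args) > 0, "at least one argument is required");
    const erased_arg erased[] = {erase(args)...};
    return vscan_fields(input, erased, sizeof...(Args));
}
```

Each distinct argument list instantiates only the wrapper, which builds an array and forwards it. All of the parsing logic lives in the single non-template function.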
4.8. Safety
is arguably less safe than
because
([ATTR]) implemented by GCC and Clang
doesn’t catch the whole class of buffer overflow bugs, e.g.
char s[10];
std::sscanf(input, "%s", s); // s may overflow.
Specifying the maximum length in the format string above solves the issue but is error-prone, especially since one has to account for the terminating null.
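For illustration, a bounded version of the call above might look as follows. This is a sketch only, and read_word_bounded is a hypothetical helper. Note that the field width has to be kept in sync with the buffer size by hand, and must be one less than that size.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: reads one word into a fixed 10-byte buffer and reports its length.
// The width in "%9s" is the buffer size minus one, leaving room for the terminating null.
inline std::size_t read_word_bounded(const char* input) {
    char s[10] = {};
    if (std::sscanf(input, "%9s", s) != 1) {
        return 0;  // nothing was read
    }
    return std::strlen(s);
}
```

If the buffer is later resized and the format string is not updated, the compiler does not diagnose the mismatch.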
Unlike
, the proposed facility relies on variadic templates instead of
the mechanism provided by
. The type information is captured
automatically and passed to scanners, guaranteeing type safety and making many of
the
specifiers redundant (see § 4.2 Format strings). Memory management is
automatic to prevent buffer overflow errors.
4.9. Extensibility
We propose an extension API for user-defined types similar to
,
used with
. It separates format string processing and parsing, enabling
compile-time format string checks, and allows extending the format specification
language for user types. It enables scanning of user-defined types.
auto r = scan<tm>(input, "Date: {0:%Y-%m-%d}");
This is done by providing a specialization of
for
:
template <>
struct scanner<tm> {
  constexpr auto parse(scan_parse_context& ctx)
    -> expected<scan_parse_context::iterator, scan_error>;

  template <class ScanContext>
  auto scan(tm& t, ScanContext& ctx) const
    -> expected<typename ScanContext::iterator, scan_error>;
};
The
function parses the
portion of the format
string corresponding to the current argument, and
parses the
input range
and stores the result in
.
An implementation of
can potentially use the istream extraction
for user-defined type
, if available.
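As a rough illustration of reusing istream extraction, consider the following sketch, assuming C++17. The type point and the helper scan_via_istream are hypothetical. The helper copies the input into a std::istringstream, which a real implementation would likely want to avoid.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical user-defined type with an existing istream extraction operator.
struct point {
    int x{};
    int y{};
};

inline std::istream& operator>>(std::istream& is, point& p) {
    char comma{};
    if (is >> p.x >> comma >> p.y && comma != ',') {
        is.setstate(std::ios_base::failbit);
    }
    return is;
}

// Hypothetical helper: reads a T from the front of `input` via operator>>.
// Returns the number of characters consumed, or std::string_view::npos on failure.
template <class T>
std::size_t scan_via_istream(std::string_view input, T& value) {
    std::istringstream is{std::string(input)};
    if (!(is >> value)) {
        return std::string_view::npos;
    }
    // tellg() can fail once the end of the stream has been reached,
    // so that case is treated as "everything was consumed".
    if (is.eof()) {
        return input.size();
    }
    return static_cast<std::size_t>(is.tellg());
}
```

The consumed-character count is what would let a scanner advance its position in the input range after the extraction.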
4.10. Locales
As pointed out in [N4412]:
There are a number of communications protocol frameworks in use that employ text-based representations of data, for example XML and JSON. The text is machine-generated and machine-read and should not depend on or consider the locales at either end.
To address this,
provided control over the use of locales. We propose
doing the same for the current facility by performing locale-independent parsing
by default and designating separate format specifiers for locale-specific ones.
In particular, locale-specific behavior can be opted into by using the
format specifier, and supplying a
object.
std::locale::global(std::locale::classic());

// {} uses no locale
// {:L} uses the global locale
auto r0 = std::scan<double, double>("1.23 4.56", "{} {:L}");
// r0->values(): (1.23, 4.56)

// {} uses no locale
// {:L} uses the supplied locale
auto r1 = std::scan<double, double>(std::locale{"fi_FI"}, "1.23 4,56", "{} {:L}");
// r1->values(): (1.23, 4.56)
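The same opt-in idea can be approximated with today's facilities. The following is a sketch under stated assumptions: it uses C++17 and std::istringstream, the names comma_decimal_point, parse_double, and make_comma_locale are hypothetical, and a custom facet stands in for a named locale such as fi_FI, which may not be installed on a given system.

```cpp
#include <locale>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a locale that uses ',' as the decimal point.
struct comma_decimal_point : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
};

// Hypothetical helper: parses a double according to the rules of `loc`.
inline std::optional<double> parse_double(const std::string& text, const std::locale& loc) {
    std::istringstream is{text};
    is.imbue(loc);
    double value{};
    if (!(is >> value)) {
        return std::nullopt;
    }
    return value;
}

// Locale-independent by default: always the classic "C" locale,
// regardless of the global locale.
inline std::optional<double> parse_double(const std::string& text) {
    return parse_double(text, std::locale::classic());
}

// Locale-specific behavior is opt-in, by passing a locale explicitly.
inline std::locale make_comma_locale() {
    return std::locale(std::locale::classic(), new comma_decimal_point);
}
```

With this arrangement, machine-generated text parses the same way everywhere unless the caller explicitly asks otherwise.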
4.11. Encoding
In a similar manner as with
, input given to
is assumed
to be in the (ordinary/wide) literal encoding.
Errors in encoding are handled in a "garbage in, garbage out" manner:
invalidly encoded code points are treated as if they were the Unicode noncharacter U+FFFF,
which doesn’t match any other character or pattern.
std::scan could be a use case
for erroneous behavior in the library. That’s because we provide well-defined behavior
for handling invalid encoding, but it’s still likely to be an error.
As motivation for erroneous behavior,
Unicode conformance requirement C10 says that ill-formed input shall be treated as an error,
and shall not be interpreted as a character.
// Invalid UTF-8
auto r = std::scan<std::string>("a\xc3 ", "{}");
// r->value() == "a\xc3"
// Erroneous behavior?
Other potential options for handling invalid encoding would be:
- treat it as UB
- always sanitize input encoding (potentially very slow when done character-by-character with forward_ranges)
- check for encoding when reading code units and strings, while potentially introducing a format specifier for "raw mode", which skips these checks
Note: This topic is under active contention in SG16. See also example in § 4.3.8 Type specifiers: CharT.
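To make the notion of invalid encoding concrete, the following sketch checks whether a sequence of code units is well-formed UTF-8. It assumes C++17, and is_well_formed_utf8 is a hypothetical helper, not part of this proposal. It only shows what such a check involves and does not suggest which of the options above should be chosen.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `s` is well-formed UTF-8: correct sequence lengths, no overlong
// encodings, no surrogates, and nothing above U+10FFFF.
inline bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this sequence length
        if (lead < 0x80) { len = 1; cp = lead; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += len;
    }
    return true;
}
```

Running a check like this on every read is the cost referred to by the "always sanitize" option. A "raw mode" specifier would skip it.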
4.12. Performance
The API allows efficient implementation that minimizes virtual function calls
and dynamic memory allocations, and avoids unnecessary copies. In particular,
since it doesn’t need to guarantee the lifetime of the input across multiple
function calls,
can take
avoiding an extra string copy
compared to
. Since, in the default case, it also doesn’t
deal with locales, it can internally use something like
.
We can also avoid unnecessary copies required by
when parsing strings,
e.g.
auto r = std::scan<std::string_view, int>("answer = 42", "{} = {}");
This has lifetime implications similar to returning match objects in [P1433] and iterators or subranges in the ranges library and can be mitigated in the same way.
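The copy-free approach, and its lifetime caveat, can be sketched with existing facilities. The following assumes C++17, and parse_key_value is a hypothetical helper that handles only the fixed "key = value" shape.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>
#include <utility>

// Hypothetical helper: splits "key = value" without allocating. The returned
// string_view points into `input`, so it is valid only while that buffer is alive.
inline std::optional<std::pair<std::string_view, int>> parse_key_value(std::string_view input) {
    const std::size_t sep = input.find(" = ");
    if (sep == std::string_view::npos) return std::nullopt;
    const std::string_view key = input.substr(0, sep);
    const std::string_view rest = input.substr(sep + 3);
    int value{};
    const auto res = std::from_chars(rest.data(), rest.data() + rest.size(), value);
    if (res.ec != std::errc{}) return std::nullopt;
    return std::pair{key, value};
}
```

No characters are copied: the key is a view over the original buffer. Passing a temporary std::string as the input would leave that view dangling.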
It should be noted that, as proposed, this library does not support
checking at compile time whether scanning a
would dangle, or
if it’s possible at all (it’s not possible to read a
from a non-
).
This is because the concept
is defined in terms of the scanned type
and the input range character type
, not the type of the input range itself.
4.13. Integration with chrono
The proposed facility can be integrated with
([P0355])
via the extension mechanism, similarly to the integration between chrono and text
formatting proposed in [P1361]. This will improve consistency between parsing
and formatting, make parsing multiple objects easier, and allow avoiding dynamic
memory allocations without resorting to the deprecated
.
Before:
std::istringstream is("start = 10:30");
std::string key;
char sep;
std::chrono::seconds time;
is >> key >> sep >> std::chrono::parse("%H:%M", time);
After:
auto result = std::scan<std::string, std::chrono::seconds>("start = 10:30", "{0} = {1:%H:%M}");
const auto& [key, time] = result->values();
Note that the
version additionally validates the separator.
4.14. Impact on existing code
The proposed API is defined in a new header and should have no impact on existing code.
5. Existing work
[SCNLIB] is a C++ library that, among other things,
provides an interface similar to the one described in this paper.
As of the publication of this paper, the
-branch of [SCNLIB] contains the reference implementation for this proposal.
[FMT] has a prototype implementation of an earlier version of the proposal.
6. Future extensions
To keep the scope of this paper somewhat manageable, we’ve chosen to only include functionality we consider fundamental. This leaves the design space open for future extensions and other proposals. However, we are not categorically against exploring this design space, if it is deemed critical for v1.
All of the possible future extensions described below are implemented in [SCNLIB].
6.1. Integration with std::istreams
Today, in C++, standard I/O is largely done with iostreams, and not with ranges.
The library proposed in this paper doesn’t support that use case well.
The proposed concept of
requires
,
so facilities like
, which only models
,
can’t be used.
Integration with iostreams is needed to enable working with files and
.
This can be worked around with something like
,
and using its result with
, but error recovery with that gets very tricky very fast.
A possible solution would be a more robust
, that models at least
,
either through caching the read characters in the view itself, or by utilizing the stream buffer. [SCNLIB] implements this by providing a generic
, which wraps an
and a buffer, and provides an interface that models
.
6.2. scanf-like [character set] matching
supports the
format specifier, which allows for matching for a set of accepted
characters. Unfortunately, because some of the syntax for specifying that set is
implementation-defined, the utility of this functionality is hampered.
Properly specified, this could be useful.
auto r = scan<string>("abc123", "{:[a-zA-Z]}");
// r->value() == "abc", r->range() == "123"

// Compare with:
char buf[N];
sscanf("abc123", "%[a-zA-Z]", buf);

// ...

auto _ = scan<string>(..., "{:[^\n]}");
// match until newline
It should be noted that, while the syntax is quite similar, this is not a regular expression. This syntax is intentionally far more limited, as it is meant for simple character matching.
[SCNLIB] implements this syntax, providing support for matching single characters/code points
(
), code point ranges (
), and regex-like wildcards (
or
.
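A minimal sketch of how such set matching could behave is shown below, assuming C++17 and byte-wise comparison of single code units. The helpers in_char_set and match_char_set are hypothetical. They support only literal characters, ranges such as a-z, and a leading ^ for negation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: does `c` belong to the set described by `spec`?
// `spec` is the text between the brackets, e.g. "a-zA-Z" or "^\n".
inline bool in_char_set(char c, std::string_view spec) {
    bool negate = false;
    if (!spec.empty() && spec.front() == '^') {
        negate = true;
        spec.remove_prefix(1);
    }
    bool found = false;
    for (std::size_t i = 0; i < spec.size(); ++i) {
        if (i + 2 < spec.size() && spec[i + 1] == '-') {
            // A range such as a-z: both endpoints are inclusive.
            if (spec[i] <= c && c <= spec[i + 2]) found = true;
            i += 2;
        } else if (spec[i] == c) {
            found = true;
        }
    }
    return found != negate;
}

// Returns the longest prefix of `input` whose characters all belong to the set.
inline std::string_view match_char_set(std::string_view input, std::string_view spec) {
    std::size_t n = 0;
    while (n < input.size() && in_char_set(input[n], spec)) ++n;
    return input.substr(0, n);
}
```

Matching stops at the first character outside the set, leaving the remainder of the input unread.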
6.3. Reading code points (or even grapheme clusters?)
is nowadays the type denoting a Unicode code point.
Reading individual code points, or even Unicode grapheme clusters, could be a useful feature.
Currently, this proposal only supports reading of individual code units (
or
).
[SCNLIB] supports reading Unicode code points with
.
6.4. Reading strings and chars of different width
In C++, we have character types other than
and
, too:
namely
,
, and
.
Currently, this proposal only supports reading strings with the same
character type as the input range, and reading
characters from
narrow
-oriented input ranges, as does
.
somewhat supports this with the
-flag (and the absence of one in
).
Providing support for reading differently-encoded strings could be useful.
// Currently supported:
auto r0 = scan<wchar_t>("abc", "{}");

// Not supported:
auto r1 = scan<char>(L"abc", L"{}");
auto r2 = scan<string, wstring, u8string, u16string, u32string>("abc def ghi jkl mno", "{} {} {} {} {}");
auto r3 = scan<string, wstring, u8string, u16string, u32string>(L"abc def ghi jkl mno", L"{} {} {} {} {}");
6.5. Scanning of ranges
Introduced in [P2286] for
, enabling the user to use
to scan ranges, could be useful.
6.6. Default values for scanned values
Currently, the values returned by
are value-constructed,
and assigned over if a value is read successfully.
It may be useful to be able to provide an initial value different from a value-constructed
one, for example, for preallocating a
, and possibly reusing it:
string str;
str.reserve(n);
auto r0 = scan<string>(..., "{}", {std::move(str)});
// ...
r0->value().clear();
auto r1 = scan<string>(..., "{}", {std::move(r0->value())});
6.7. Assignment suppression / discarding values
supports discarding scanned values with the
specifier in the format string. [SCNLIB] provides similar functionality through a special type,
:
7. Specification
At this point, only the synopses are provided.
Note the similarity with [P0645] (
) in some parts.
The changes to the wording include additions to the header
, and a new header,
.
7.1. Modify "Header <ranges> synopsis" [ranges.syn]
#include <compare>
#include <initializer_list>
#include <iterator>

namespace std::ranges {
  // ...

  template <range R>
  using borrowed_iterator_t = see below; // freestanding

  template <range R>
  using borrowed_subrange_t = see below; // freestanding

  template <range R>
  using borrowed_ssubrange_t = see below; // freestanding

  // ...
}
7.2. Modify "Dangling iterator handling", paragraph 3 [range.dangling]
For a type
that models
:
- if R models borrowed_range, then borrowed_iterator_t<R> denotes iterator_t<R>, borrowed_subrange_t<R> denotes subrange<iterator_t<R>>, and borrowed_ssubrange_t<R> denotes subrange<iterator_t<R>, sentinel_t<R>>;
- otherwise, borrowed_iterator_t<R>, borrowed_subrange_t<R>, and borrowed_ssubrange_t<R> all denote dangling.
7.3. Header <scan> synopsis
#include <expected>
#include <format>
#include <ranges>

namespace std {
  class scan_error;

  template <class Range, class... Args>
  class scan_result;

  template <class Range, class CharT>
  concept scannable_range =
    ranges::forward_range<Range> &&
    same_as<ranges::range_value_t<Range>, CharT>;

  template <class Range, class... Args>
  using scan_result_type =
    expected<scan_result<ranges::borrowed_ssubrange_t<Range>, Args...>, scan_error>;

  template <class... Args, scannable_range<char> Range>
  scan_result_type<Range, Args...> scan(Range&& range, format_string<Args...> fmt);

  template <class... Args, scannable_range<wchar_t> Range>
  scan_result_type<Range, Args...> scan(Range&& range, wformat_string<Args...> fmt);

  template <class... Args, scannable_range<char> Range>
  scan_result_type<Range, Args...> scan(const locale& loc, Range&& range, format_string<Args...> fmt);

  template <class... Args, scannable_range<wchar_t> Range>
  scan_result_type<Range, Args...> scan(const locale& loc, Range&& range, wformat_string<Args...> fmt);

  template <class Range, class CharT>
  class basic_scan_context;

  template <class Context>
  class basic_scan_args;

  template <class Range>
  using scan_args_for =
    basic_scan_args<basic_scan_context<unspecified, ranges::range_value_t<Range>>>;

  template <class Range>
  using vscan_result_type = expected<ranges::borrowed_ssubrange_t<Range>, scan_error>;

  template <scannable_range<char> Range>
  vscan_result_type<Range> vscan(Range&& range, string_view fmt, scan_args_for<Range> args);

  template <scannable_range<wchar_t> Range>
  vscan_result_type<Range> vscan(Range&& range, wstring_view fmt, scan_args_for<Range> args);

  template <scannable_range<char> Range>
  vscan_result_type<Range> vscan(const locale& loc, Range&& range, string_view fmt, scan_args_for<Range> args);

  template <scannable_range<wchar_t> Range>
  vscan_result_type<Range> vscan(const locale& loc, Range&& range, wstring_view fmt, scan_args_for<Range> args);

  template <class T, class CharT = char>
  struct scanner;

  template <class T, class CharT>
  concept scannable = see below;

  template <class CharT>
  using basic_scan_parse_context = basic_format_parse_context<CharT>;

  using scan_parse_context = basic_scan_parse_context<char>;
  using wscan_parse_context = basic_scan_parse_context<wchar_t>;

  template <class Context>
  class basic_scan_arg;

  template <class Visitor, class Context>
  decltype(auto) visit_scan_arg(Visitor&& vis, basic_scan_arg<Context> arg);

  template <class Context, class... Args>
  class scan-arg-store; // exposition only

  template <class Range, class... Args>
  constexpr see below make_scan_args();

  template <class Range, class Context, class... Args>
  expected<scan_result<Range, Args...>, scan_error>
    make_scan_result(expected<Range, scan_error>&& source,
                     scan-arg-store<Context, Args...>&& args);
}
7.4. Class scan_error synopsis
namespace std {
  class scan_error {
  public:
    enum code_type {
      good,
      end_of_range,
      invalid_format_string,
      invalid_scanned_value,
      value_out_of_range
    };

    constexpr scan_error() = default;
    constexpr scan_error(code_type error_code, const char* message);

    constexpr explicit operator bool() const noexcept;

    constexpr code_type code() const noexcept;
    constexpr const char* msg() const;

  private:
    code_type code_;      // exposition only
    const char* message_; // exposition only
  };
}
7.5. Class template scan_result synopsis
namespace std {
  template <class Range, class... Args>
  class scan_result {
  public:
    using range_type = Range;

    constexpr scan_result() = default;
    constexpr ~scan_result() = default;

    constexpr scan_result(range_type r, tuple<Args...>&& values);

    template <class OtherR, class... OtherArgs>
    constexpr explicit(see below) scan_result(OtherR&& it, tuple<OtherArgs...>&& values);

    constexpr scan_result(const scan_result&) = default;
    template <class OtherR, class... OtherArgs>
    constexpr explicit(see below) scan_result(const scan_result<OtherR, OtherArgs...>& other);

    constexpr scan_result(scan_result&&) = default;
    template <class OtherR, class... OtherArgs>
    constexpr explicit(see below) scan_result(scan_result<OtherR, OtherArgs...>&& other);

    constexpr scan_result& operator=(const scan_result&) = default;
    template <class OtherR, class... OtherArgs>
    constexpr scan_result& operator=(const scan_result<OtherR, OtherArgs...>& other);

    constexpr scan_result& operator=(scan_result&&) = default;
    template <class OtherR, class... OtherArgs>
    constexpr scan_result& operator=(scan_result<OtherR, OtherArgs...>&& other);

    constexpr range_type range() const;

    constexpr see below begin() const;
    constexpr see below end() const;

    template <class Self>
    constexpr auto&& values(this Self&&);

    template <class Self>
      requires sizeof...(Args) == 1
    constexpr auto&& value(this Self&&);

  private:
    range_type range_;      // exposition only
    tuple<Args...> values_; // exposition only
  };
}
7.6. Class template basic_scan_context synopsis
namespace std {
  template <class Range, class CharT>
  class basic_scan_context {
  public:
    using char_type = CharT;
    using range_type = Range;
    using iterator = ranges::iterator_t<range_type>;
    using sentinel = ranges::sentinel_t<range_type>;
    template <class T>
    using scanner_type = scanner<T, char_type>;

    constexpr basic_scan_arg<basic_scan_context> arg(size_t id) const noexcept;
    std::locale locale();

    constexpr iterator current() const;
    constexpr range_type range() const;
    constexpr void advance_to(iterator it);

  private:
    iterator current_;                         // exposition only
    sentinel end_;                             // exposition only
    std::locale locale_;                       // exposition only
    basic_scan_args<basic_scan_context> args_; // exposition only
  };
}
7.7. Class template basic_scan_args synopsis
namespace std {
  template <class Context>
  class basic_scan_args {
    size_t size_;                   // exposition only
    basic_scan_arg<Context>* data_; // exposition only

  public:
    basic_scan_args() noexcept;

    template <class... Args>
    basic_scan_args(scan-arg-store<Context, Args...>& store) noexcept;

    basic_scan_arg<Context> get(size_t i) noexcept;
  };

  template <class Context, class... Args>
  basic_scan_args(scan-arg-store<Context, Args...>) -> basic_scan_args<Context>;
}
7.8. Concept scannable
namespace std {
  template <class T, class Context,
            class Scanner = typename Context::template scanner_type<remove_const_t<T>>>
  concept scannable-with = // exposition only
    semiregular<Scanner> &&
    requires(Scanner& s, const Scanner& cs, T& t, Context& ctx,
             basic_format_parse_context<typename Context::char_type>& pctx) {
      { s.parse(pctx) } -> same_as<expected<typename decltype(pctx)::iterator, scan_error>>;
      { cs.scan(t, ctx) } -> same_as<expected<typename Context::iterator, scan_error>>;
    };

  template <class T, class CharT>
  concept scannable =
    scannable-with<remove_reference_t<T>, basic_scan_context<unspecified>>;
}
7.9. Class template basic_scan_arg synopsis
namespace std {
  template <class Context>
  class basic_scan_arg {
  public:
    class handle;

  private:
    using char_type = typename Context::char_type; // exposition only

    variant<
      monostate,
      signed char*, short*, int*, long*, long long*,
      unsigned char*, unsigned short*, unsigned int*, unsigned long*, unsigned long long*,
      bool*, char_type*, void**,
      float*, double*, long double*,
      basic_string<char_type>*, basic_string_view<char_type>*,
      handle> value; // exposition only

    template <class T>
    explicit basic_scan_arg(T& v) noexcept; // exposition only

  public:
    basic_scan_arg() noexcept;

    explicit operator bool() const noexcept;
  };
}
7.10. Exposition-only class template scan-arg-store synopsis
namespace std {
  template <class Context, class... Args>
  class scan-arg-store { // exposition only
    tuple<Args...> args;                                  // exposition only
    array<basic_scan_arg<Context>, sizeof...(Args)> data; // exposition only
  };
}