1. Introduction
[P0645] has proposed a text formatting facility that provides a safe and
extensible alternative to the printf family of functions. This paper explores
the possibility of adding a symmetric parsing facility which is based on the
same design principles and shares many features with [P0645], namely safety,
extensibility, locale control, performance, a small binary footprint, and
integration with chrono.
According to [CODESEARCH], a C and C++ codesearch engine based on the ACTCD19
dataset, there are 389,848 calls to sprintf and 87,815 calls to sscanf at
the time of writing. So although formatted input functions are less popular than
their output counterparts, they are still widely used.
The lack of a general-purpose parsing facility based on format strings was raised in [P1361] in the context of formatting and parsing of dates and times.
Although having a symmetric parsing facility seems beneficial, not all languages
provide it out-of-the-box. For example, Python doesn’t have a scanf equivalent
in the standard library but there is a separate parse package ([PARSE]).
Example:
    std::string key;
    int value;
    std::scan("answer = 42", "{} = {}", key, value);
    //        ~~~~~~~~~~~~~  ~~~~~~~~~  ~~~~~~~~~~
    //            input        format    arguments
    //
    // Result: key == "answer", value == 42
2. Design
The new parsing facility is intended to complement the existing C++ I/O streams
library, integrate well with the chrono library, and provide an API similar to
std::format. This section discusses major features of its design.
2.1. Format strings
As with printf, the scanf syntax has the advantage of being familiar to many
programmers. However, it has similar limitations:
- Many format specifiers like hh, h, l, j, etc. are used only to convey type information. They are redundant in type-safe parsing and would unnecessarily complicate specification and parsing.
- There is no standard way to extend the syntax for user-defined types.
- Using '%' in a custom format specifier poses difficulties, e.g. for get_time-like time parsing.
Therefore we propose a syntax based on [PARSE] and [P0645]. This syntax
employs '{' and '}' as replacement field delimiters instead of '%'. It
will provide the following advantages:
- An easy to parse mini-language focused on the data format rather than conveying the type information
- Extensibility for user-defined types
- Positional arguments
- Support for both locale-specific and locale-independent parsing (see §2.4 Locales)
- Consistency with std::format proposed by [P0645].
At the same time, most of the specifiers will remain the same as in scanf, which
can simplify migration and possibly allow it to be automated.
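To make the replacement field syntax more concrete, the following is a minimal sketch of how a single field delimited by '{' and '}' could be split into an optional argument index and a format specification. The names replacement_field and parse_field are hypothetical and are not part of the proposal; the sketch also assumes that '}' does not occur inside the format specification.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical result type: the pieces of one replacement field.
struct replacement_field {
  std::optional<int> arg_id;  // empty for automatic indexing, as in "{}"
  std::string_view spec;      // text after ':', empty if absent
  std::size_t length = 0;     // characters consumed, including both braces
};

// Hypothetical helper: parses a field at the start of `fmt`.
// Simplification: the first '}' always closes the field.
inline std::optional<replacement_field> parse_field(std::string_view fmt) {
  if (fmt.empty() || fmt.front() != '{') return std::nullopt;
  std::size_t close = fmt.find('}');
  if (close == std::string_view::npos) return std::nullopt;

  std::string_view body = fmt.substr(1, close - 1);
  replacement_field field;
  field.length = close + 1;

  std::size_t colon = body.find(':');
  std::string_view id = body.substr(0, colon);
  if (colon != std::string_view::npos) field.spec = body.substr(colon + 1);

  if (!id.empty()) {
    int value = 0;
    auto [ptr, ec] = std::from_chars(id.data(), id.data() + id.size(), value);
    if (ec != std::errc() || ptr != id.data() + id.size() || value < 0)
      return std::nullopt;
    field.arg_id = value;
  }
  return field;
}

// Usage sketch.
inline void parse_field_example() {
  auto field = parse_field("{0:%Y-%m-%d} tail");
  assert(field && field->arg_id == 0 && field->spec == "%Y-%m-%d");
  assert(field->length == 12);
}
```

Because the type of each argument is known from the call, the specification only has to describe the data format, which keeps such a parser small.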
2.2. Safety
scanf is arguably more unsafe than printf because
__attribute__((format(scanf, ...))) ([ATTR]) implemented by GCC and Clang
doesn’t catch the whole class of buffer overflow bugs, e.g.

    char s[10];
    std::sscanf(input, "%s", s); // s may overflow.

Specifying the maximum length in the format string above solves the issue but is error-prone, especially since one has to account for the terminating null.
Unlike scanf, the proposed facility relies on variadic templates instead of
the mechanism provided by <cstdarg>. The type information is captured
automatically and passed to scanners, guaranteeing type safety and making many of
the scanf specifiers redundant (see §2.1 Format strings). Memory management is
automatic to prevent buffer overflow errors.
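The following is a minimal sketch of how variadic templates can capture argument types so that no length modifiers or buffer sizes are needed. It is not the proposed API: toy_scan, scan_one and next_token are hypothetical names, and the sketch only handles space-separated int and std::string values without a format string.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: skips leading spaces, returns the next
// space-delimited token and removes it from `input`.
inline std::string_view next_token(std::string_view& input) {
  std::size_t start = input.find_first_not_of(' ');
  if (start == std::string_view::npos) {
    input = {};
    return {};
  }
  input.remove_prefix(start);
  std::string_view token = input.substr(0, input.find(' '));
  input.remove_prefix(token.size());
  return token;
}

// One overload per supported type. The compiler selects the overload from
// the argument type, so nothing like "%d" vs "%ld" has to be spelled out.
inline bool scan_one(std::string_view token, int& out) {
  const char* last = token.data() + token.size();
  auto [ptr, ec] = std::from_chars(token.data(), last, out);
  return ec == std::errc() && ptr == last;
}

inline bool scan_one(std::string_view token, std::string& out) {
  if (token.empty()) return false;
  out.assign(token);  // the string grows as needed: no fixed-size buffer
  return true;
}

// Hypothetical toy_scan: fills each argument, left to right, from the next
// token and stops at the first failure.
template <typename... Args>
bool toy_scan(std::string_view input, Args&... args) {
  return (scan_one(next_token(input), args) && ...);
}

// Usage sketch.
inline void toy_scan_example() {
  std::string key;
  int value = 0;
  bool ok = toy_scan("answer 42", key, value);
  assert(ok && key == "answer" && value == 42);
}
```

Passing an unsupported argument type is a compile-time error in this sketch, since no scan_one overload matches it.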
2.3. Extensibility
We propose an extension API for user-defined types similar to the one of [P0645]. It separates format string processing from input parsing, which enables compile-time format string checks, and allows extending the format specification language for user types.
The general syntax of a replacement field in a format string is the same as in [P0645]:
    replacement-field ::= '{' [arg-id] [':' format-spec] '}'
where format-spec is predefined for built-in types, but can be customized
for user-defined types. For example, the syntax can be extended for
get_time-like date and time parsing
    auto t = tm();
    scan(input, "Date: {0:%Y-%m-%d}", t);
by providing a specialization of scanner for tm:
    template <>
    struct scanner<tm> {
      constexpr scan_parse_context::iterator parse(scan_parse_context& ctx);

      template <class ScanContext>
      typename ScanContext::iterator scan(tm& t, ScanContext& ctx);
    };
The scanner<tm>::parse function parses the format-spec portion of the format
string corresponding to the current argument, and scanner<tm>::scan parses the
input range and stores the result in t.
An implementation of scanner<T>::scan can potentially use istream extraction
(operator>>) for a user-defined type T if available.
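The two-step extension protocol can be sketched for a small user-defined type as follows. This is an illustration under simplifying assumptions, not the proposed interface: toy_parse_context and toy_scan_context are hypothetical stand-ins for the real context types, the member functions return bool instead of iterators, and point is an invented example type whose format-spec is an optional single separator character.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Simplified stand-ins for the context types (assumptions of this sketch).
struct toy_parse_context { std::string_view spec; };   // format-spec text
struct toy_scan_context  { std::string_view input; };  // remaining input

template <typename T> struct scanner;  // primary template, specialized per type

struct point { int x = 0, y = 0; };    // hypothetical user-defined type

template <>
struct scanner<point> {
  char sep = ',';

  // Step 1: interpret the format-spec (here: an optional separator char).
  bool parse(toy_parse_context& ctx) {
    if (ctx.spec.size() > 1) return false;
    if (!ctx.spec.empty()) sep = ctx.spec.front();
    return true;
  }

  // Step 2: consume "<int><sep><int>" from the input and store the result.
  bool scan(point& p, toy_scan_context& ctx) {
    return read_int(ctx.input, p.x) && read_char(ctx.input, sep) &&
           read_int(ctx.input, p.y);
  }

 private:
  static bool read_int(std::string_view& in, int& out) {
    auto [ptr, ec] = std::from_chars(in.data(), in.data() + in.size(), out);
    if (ec != std::errc()) return false;
    in.remove_prefix(static_cast<std::size_t>(ptr - in.data()));
    return true;
  }
  static bool read_char(std::string_view& in, char c) {
    if (in.empty() || in.front() != c) return false;
    in.remove_prefix(1);
    return true;
  }
};

// Usage sketch: parse the spec once, then scan the input.
inline void scanner_example() {
  scanner<point> s;
  toy_parse_context pc{";"};
  toy_scan_context sc{"3;-4 rest"};
  point p;
  bool ok = s.parse(pc) && s.scan(p, sc);
  assert(ok && p.x == 3 && p.y == -4 && sc.input == " rest");
}
```

Keeping parse separate from scan means the format-spec can be validated without any input being available, which is what makes compile-time format string checks possible.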
2.4. Locales
As pointed out in [N4412]:
There are a number of communications protocol frameworks in use that employ text-based representations of data, for example XML and JSON. The text is machine-generated and machine-read and should not depend on or consider the locales at either end.
To address this, [P0645] provided control over the use of locales. We propose doing the same for the current facility by performing locale-independent parsing by default and designating separate format specifiers for locale-specific parsing.
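The difference between the two modes can be illustrated with existing standard facilities. In this sketch std::from_chars plays the role of the locale-independent default and an imbued std::istringstream plays the role of the opt-in locale-specific path. The function names and the dot_grouping facet are hypothetical; the facet stands in for a locale that groups digits with '.' so that the example does not depend on which named locales are installed.

```cpp
#include <cassert>
#include <charconv>
#include <locale>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>

// Locale-independent: std::from_chars never consults a locale.
inline std::optional<int> scan_int_classic(std::string_view input) {
  int value = 0;
  const char* last = input.data() + input.size();
  auto [ptr, ec] = std::from_chars(input.data(), last, value);
  if (ec != std::errc() || ptr != last) return std::nullopt;
  return value;
}

// Hypothetical facet: groups digits in threes with '.' as the separator.
struct dot_grouping : std::numpunct<char> {
  char do_thousands_sep() const override { return '.'; }
  std::string do_grouping() const override { return "\3"; }
};

// Locale-specific: uses the numeric facets of the supplied locale and
// requires the whole input to be consumed.
inline std::optional<int> scan_int_localized(std::string_view input,
                                             const std::locale& loc) {
  std::istringstream is{std::string(input)};
  is.imbue(loc);
  int value = 0;
  if (!(is >> value) || !is.eof()) return std::nullopt;
  return value;
}

// Usage sketch: the same text is rejected by default and accepted on opt-in.
inline void locale_example() {
  std::locale loc(std::locale::classic(), new dot_grouping);
  assert(!scan_int_classic("1.234"));
  assert(scan_int_localized("1.234", loc) == 1234);
}
```

Machine-generated text such as JSON would go through the first path, so its interpretation does not change with the user's locale.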
2.5. Performance
The API allows efficient implementation that minimizes virtual function calls
and dynamic memory allocations, and avoids unnecessary copies. In particular,
since it doesn’t need to guarantee the lifetime of the input across multiple
function calls, std::scan can take string_view, avoiding an extra string copy
compared to std::istringstream.
We can also avoid unnecessary copies required by scanf when parsing strings,
e.g.
    std::string_view key;
    int value;
    std::scan("answer = 42", "{} = {}", key, value);
This has lifetime implications similar to returning match objects in [P1433] and iterators or subranges in the ranges library and can be mitigated in the same way.
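The lifetime implication can be seen in a small sketch that returns views into the input instead of copies. The helper split_key_value is hypothetical and handles only the fixed " = " separator; it is meant to show that the results alias the input buffer, not to model the proposed API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical helper: splits "key = value" without copying. Both results
// are views into `input` and are valid only while its buffer is alive.
inline std::optional<std::pair<std::string_view, std::string_view>>
split_key_value(std::string_view input) {
  constexpr std::string_view sep = " = ";
  std::size_t pos = input.find(sep);
  if (pos == std::string_view::npos) return std::nullopt;
  return std::pair{input.substr(0, pos), input.substr(pos + sep.size())};
}

// Usage sketch.
inline void split_example() {
  std::string_view input = "answer = 42";  // string literal: static storage
  auto kv = split_key_value(input);
  assert(kv && kv->first == "answer" && kv->second == "42");
  assert(kv->first.data() == input.data());  // same buffer, no copy made
  // If `input` referred to a temporary std::string instead, the views in
  // `kv` would dangle once that temporary was destroyed.
}
```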
2.6. Binary footprint
We propose using a type erasure technique to reduce per-call binary code size. The scanning function that uses variadic templates can be implemented as a small inline wrapper around its non-variadic counterpart:
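    // Non-variadic core: compiled once for all argument types.
    string_view::iterator vscan(string_view input, string_view fmt, scan_args args);

    // Thin inline wrapper; arguments are output parameters, so they are
    // taken by non-const reference.
    template <typename... Args>
    inline auto scan(string_view input, string_view fmt, Args&... args) {
      return vscan(input, fmt, make_scan_args(args...));
    }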
As shown in [P0645], this dramatically reduces binary code size, which will make
scan comparable to scanf on this metric.
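One possible shape of the type erasure is sketched below. It is an assumption about how such a layer could look, not the layout used by any implementation: erased_arg, make_erased_arg, erased_vscan and erased_scan are hypothetical names, and the input is split on spaces rather than driven by a format string.

```cpp
#include <array>
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <string_view>
#include <system_error>

// Type-erased reference to one output argument (hypothetical layout).
struct erased_arg {
  void* target;
  bool (*scan)(std::string_view token, void* target);
};

inline bool scan_value(std::string_view token, int& out) {
  const char* last = token.data() + token.size();
  auto [ptr, ec] = std::from_chars(token.data(), last, out);
  return ec == std::errc() && ptr == last;
}

inline bool scan_value(std::string_view token, std::string& out) {
  out.assign(token);
  return true;
}

// The only per-type code: a captureless lambda that restores the type.
template <typename T>
erased_arg make_erased_arg(T& value) {
  return {&value, [](std::string_view token, void* p) {
            return scan_value(token, *static_cast<T*>(p));
          }};
}

// Non-template core: compiled once, independent of the argument types.
inline bool erased_vscan(std::string_view input, const erased_arg* args,
                         std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::size_t start = input.find_first_not_of(' ');
    if (start == std::string_view::npos) return false;
    input.remove_prefix(start);
    std::string_view token = input.substr(0, input.find(' '));
    input.remove_prefix(token.size());
    if (!args[i].scan(token, args[i].target)) return false;
  }
  return true;
}

// Small inline wrapper: only builds the array of erased arguments.
template <typename... Args>
inline bool erased_scan(std::string_view input, Args&... args) {
  std::array<erased_arg, sizeof...(Args)> erased = {make_erased_arg(args)...};
  return erased_vscan(input, erased.data(), erased.size());
}

// Usage sketch.
inline void erased_scan_example() {
  std::string key;
  int value = 0;
  bool ok = erased_scan("answer 42", key, value);
  assert(ok && key == "answer" && value == 42);
}
```

Each distinct call signature instantiates only the small wrapper, while the parsing loop exists once in the binary.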
2.7. Integration with chrono
The proposed facility can be integrated with std::chrono::parse ([P0355])
via the extension mechanism similarly to integration between chrono and text
formatting proposed in [P1361]. This will improve consistency between parsing
and formatting, make parsing multiple objects easier, and allow avoiding dynamic
memory allocations without resorting to the deprecated strstream.
Before:
    std::istringstream is("start = 10:30");
    std::string key;
    char sep;
    std::chrono::seconds time;
    is >> key >> sep >> std::chrono::parse("%H:%M", time);
After:
    std::string key;
    std::chrono::seconds time;
    std::scan("start = 10:30", "{0} = {1:%H:%M}", key, time);
Note that the scan version additionally validates the separator.
2.8. Impact on existing code
The proposed API is defined in a new header and should have no impact on existing code.
3. Existing work
[SCNLIB] is a C++ library that, among other things, provides a scan function
similar to the one proposed here. [FMT] has a prototype implementation of the
proposal.
4. Questions
Q1: Do we want this?
Q2: API options:
- Pass arguments by reference and return an iterator:

      std::string key;
      int value;
      auto end = std::scan(input, "{} = {}", key, value);

  This is similar to what scanf, istream operator>>, and std::chrono::parse do.
- Return an object wrapping an iterator and parsed values:

      auto result = std::scan<std::string, int>(input, "{} = {}");
      auto end = result.end;
      std::string key = std::get<0>(result.values);
      int value = std::get<1>(result.values);

  This option is more cumbersome to use because it requires passing all argument types as template arguments to scan. It may also require an extra move or copy to extract the argument’s value and impose additional requirements on the argument types (at least default constructibility). Syntactically it can be simplified using structured bindings.
Q3: naming:
- scan
- parse
- other
The name "parse" is a bit problematic because of ambiguity between format string parsing and input parsing.
Main API | | | |
Extension point | | | |
Parse format string | | | ? |
Extension function | | | |
Format string parse context | | | ? |
Context | | | |