Use cases include the increasing number of text-based interchange formats such as JSON or XML that do not require internationalization support, but do require high throughput when produced by a server.
There are a lot of existing functions in C++ to perform such conversions, but none offers a high-performance solution. At a minimum, an implementation by an ordinary user of the language using an elementary textbook algorithm should not be able to outperform a quality standard library implementation. The requirements are thus:
C++ already provides at least the facilities in the following table, each with shortcomings highlighted in the second column.
facility | shortcomings |
---|---|
sprintf | format string, locale, buffer overrun |
snprintf | format string, locale |
sscanf | format string, locale |
atol | locale, does not signal errors |
strtol | locale, ignores whitespace and 0x prefix |
strstream | locale, ignores whitespace |
stringstream | locale, ignores whitespace, memory allocation |
num_put / num_get facets | locale, virtual function |
to_string | locale, memory allocation |
stoi etc. | locale, memory allocation, ignores whitespace and 0x prefix, exception on error |
As a rough performance comparison, the following simple numeric formatting task was implemented: Output the integer numbers 0 ... 1 million, separated by a single space character, into a contiguous array buffer of 10 MB. This task was executed 10 times. The execution environment was gcc 4.9 on Intel Core i5 M450.
strstream | 864 ms | uses std::strstream with application-provided buffer |
streambuf | 540 ms | uses simple custom streambuf with std::num_put<> facet |
direct | 285 ms | open-coded "divide by 10" algorithm, using the interface described below |
fixed-point | 125 ms | fixed-point algorithm found in an older AMD optimization guide, using the interface described below |
There are various approaches for even more efficient algorithms; see, for example, https://gist.github.com/anonymous/7700052 .
The following discussion assumes that a common interface style should
be established that covers (built-in) integer and floating-point
types. The type T
designates such an arithmetic type.
Note that given these restrictions, output of T to a string has a
small maximum length in all cases. The styles for input vs. output
will differ due to the differing functionality.
The fundamental interface for a string is that it is caller-allocated,
contiguous in memory, and not necessarily 0-terminated. That means,
it can be represented by a range [begin
,end
)
where begin
and end
are of type char
*
.
Given this framework, the following subsections discuss various specific interface styles for both output and input. In each case, the signature of an integer output or input function is shown. Criteria for comparison include impact on compiler optimizations, indication of output buffer overflow, and composability (as a measure of ease-of-use).
This subsection discusses various specific interface styles for output. In each case, the signature of an integer output function is shown. There is one failure mode for output: overflow of the provided output buffer. Criteria for comparison include impact on compiler optimizations, indication of output buffer overflow, and composability (as a measure of ease-of-use). For exposition of the latter, consecutive output of two numbers is shown, without any separator.
Conceptually, an output function has four parameters and two
results. The parameters are the begin
and
end
pointers of the buffer, the value, and the desired
base. The results are the updated begin
pointer and an
overflow indication.
char * to_string(char * begin, char * end, T value, int base = 10);
This interface style returns the updated begin
pointer.
That is, the resulting string is in [begin
,
return-value) and [return-value, end
)
is unused space in the string. Such an interface style is used for
many standard library algorithms, e.g. find
[alg.find].
All parameters are passed by value which helps the optimizer.
Overflow is indicated by return-value == end
.
The situation that the output exactly fits into the provided buffer
cannot be distinguished from overflow. Two consecutive outputs can be
produced trivially using:
p = to_string(p, end, value1); p = to_string(p, end, value2);
void to_string(char *& begin, char * end, T value, int base = 10);
This interface style updates the begin
pointer in place.
That is, the resulting string is in [old-begin
,
begin
) and [begin
,end
) is
unused space in the string. Aliasing rules allow that updates to
begin
change the data where begin points. To avoid
redundant updates, the implementation can copy begin
to a
local variable. Overflow is indicated by begin
reaching
end
. The situation that the output exactly fits into the
provided buffer cannot be distinguished from overflow. Two
consecutive outputs can be produced trivially using:
to_string(p, end, value1); to_string(p, end, value2);
void to_string(std::string_view& s, T value, int base = 10);This interface style groups the
begin
and
end
pointers into a string_view
which is
updated in-place. Comments on "iterator with in-situ update" apply
analogously.
bool to_string(char *& begin, char * end, T value, int base = 10);Comments on "iterator with in-situ update" apply analogously, except that the return value indicates whether overflow occurred.
int to_string(char * begin, char * end, T value, int base = 10);This interface style always returns the number of characters required to output T, regardless of whether sufficient space was provided. That is, an overflow occurred if the return value is larger than
end
-begin
, otherwise the resulting string is
in [begin
, begin + return-value
).
Such an interface style is used for snprintf
, except that
the proposed function never 0-terminates the output. All parameters
are passed by value which helps the optimizer. Overflow is indicated
by a return value strictly larger than the distance between
begin
and end
. Two consecutive outputs
require attention at the caller site to avoid buffer overflow:
int n = 0; n += to_string(begin, end, value1); n += to_string(begin + std::min(n, end-begin), end, value2);
An input function conceptually operates in two steps: First, it
consumes characters from the input string matching a pattern until the
first non-matching character or the end of the string is encountered.
Second, the matched characters are translated into a value of type
T
. There are two failure modes: no characters match, or
the pattern translates to a value that is not in the range
representable by T
.
Conceptually, an input function has three parameters and three
results. The parameters are the begin
and
end
pointers of the string and the desired base. The
results are the updated begin
pointer, a
std::error_code
and the parsed value.
This subsection discusses various specific interface styles for
input. Failure is indicated by std::error_code
with the
appropriate value. In each case, the signature of an integer input
function is shown. Criteria for comparison include impact on compiler
optimizations and composability (as a measure of ease-of-use). For
exposition of the latter, parsing of two consecutive values is shown,
without skipping of any separator.
const char * from_string(const char * begin, const char * end, T& value, std::error_code& ec, int base = 10);This interface style returns the updated
begin
pointer.
That is, the returned pointer points to the first character not
matching the pattern. Such an interface style is used for many
standard library algorithms. All parameters are passed by value which
helps the optimizer. Two consecutive inputs can be performed like this:
T value1, value2; std::error_code ec; p = from_string(p, end, value1, ec); if (ec) /* parse error */; p = from_string(p, end, value2, ec); if (ec) /* parse error */;
void from_string(const char *& begin, const char * end, T& value, std::error_code& ec, int base = 10);This interface style updates the
begin
pointer in place.
Two consecutive inputs can be performed like this:
T value1, value2; std::error_code ec; from_string(p, end, value1, ec); if (ec) /* parse error */; from_string(p, end, value2, ec); if (ec) /* parse error */;
std::error_code from_string(const char *& begin, const char * end, T& value, int base = 10);Returning the error code allows for more compact code at the call site:
T value1, value2; if (std::error_code ec = from_string(p, end, value1)) /* parse error */; if (std::error_code ec = from_string(p, end, value2)) /* parse error */;
std::map
shows that the naming of the
parts (first
and second
) carries no semantic
meaning which would help reading the resulting code. If the result
value moves to the return value, its type T
needs to be
passed explicitly (e.g. as a template parameter). The composition
example would be:
std::pair<T, std::error_code> res; res = from_string<T>(p, end); if (res.second) /* parse error */; T value1 = res.first; res = from_string<T>(p, end); if (res.second) /* parse error */; T value2 = res.second;
to_string
(overloaded from its
existing usage in the standard library) for the output function and
from_string
for the input (parse) function. I'm open for
other suggestions.
An output function is a function with the declaration patternchar * to_string(char * begin, char * end, T value /* possibly more parameters */);for some type T, convertingvalue
into a character string by successively filling the range [begin
,end
) and returning the one-past-the-end pointer where the conversion was complete. [ Note: The resulting string is not null-terminated. ] If the provided space is insufficiently large, returnsend
; the contents of the range [begin
,end
) are unspecified.For each integer type (including extended integer types)
T
exceptchar
andbool
, the following output function is provided:char * to_string(char * begin, char * end, T value, int base = 10);Valid values for
base
are between 2 and 36 (inclusive). Digits in the range 10..36 are represented as lowercase characters a..z. The value invalue
is converted to a string of digits in the givenbase
(with no redundant leading zeroes). If T is a signed integral type, the representation starts with a minus sign ifvalue
is negative.For each floating-point type
T
, the following output function is provided:char * to_string(char * begin, char * end, T value);A finite
value
is converted to a leading minus sign ifvalue
is negative, a string of decimal digits with at most one decimal point, and an optional trailing exponent introduced by a lowercase "e
" followed by an optional minus sign followed by decimal digits. The representation requires a minimal number of characters, yet parsing the representation using thefrom_string
function recoversvalue
exactly. Ifvalue
is an infinity, it is converted toinf
or-inf
; a NaN is converted tonan
or-nan
.For each floating-point type
T
, the following output function is provided:char * to_string(char * begin, char * end, T value, int precision);A finite
value
is converted to a leading minus sign ifvalue
is negative and a string of decimal digits with at most one decimal point and exactlyprecision
digits after the decimal point. Ifprecision
is 0, no decimal point appears, otherwise at least one digit appears before the decimal point. The result is correctly rounded, i.e. it yields the same result as an infinite-precision output rounded to the desired precision.
An input function is a function with the declaration patternconst char * from_string(const char * begin, const char * end, T& value, std::error_code& ec /* possibly more parameters */);for some type T. An input function analyzes the string [begin
,end
) for a pattern. If no characters match the pattern, the function retunsbegin
,value
is unmodified, andec
is set toEINVAL
. Otherwise, the characters matching the pattern are interpreted as a representation of a value of type T. The function returns a pointer that points to the first character not matching the pattern. If the parsed value is not in the range representable by T,value
is unmodified andec
is set toERANGE
. Otherwise,ec
is set such that conversion tobool
yields false.For each integer type (including extended integer types)
T
exceptchar
andbool
, the following input function is provided:const char * from_string(const char * begin, const char * end, T& value, std::error_code& ec, int base = 10);Valid values for base are between 2 and 36 (inclusive). Digits in the range 10..36 are represented as lowercase characters a..z. If
T
is an unsigned integral type, the pattern is a non-empty sequence of digit characters according tobase
. IfT
is a signed integral type, the pattern is an optional minus sign followed by a non-empty sequence of digit characters.For each floating-point type
T
, the following input function is provided:const char * from_string(const char * begin, const char * end, T& value, std::error_code& ec);The pattern is an optional minus sign followed by a non-empty sequence of decimal digit characters with at most one decimal point, and an optional trailing exponent introduced by "e
" or "E
" followed by an optional minus sign followed by a non-empty sequence of decimal digits. The resultingvalue
is one of at most two floating-point values closest to the given pattern. The pattern also matches the representations of infinity and NaN described for output.