ISO/IEC JTC1 SC22 WG21 P0067R5
Group: Library Working Group
Jens Maurer <Jens.Maurer@gmx.net>
2016-11-11

P0067R5: Elementary string conversions, revision 5

Introduction

Following up on N4412 "Shortcomings of iostreams", this paper presents low-level, locale-independent functions for conversions between integers and strings and between floating-point numbers and strings.

Use cases include the increasing number of text-based interchange formats such as JSON or XML that do not require internationalization support, but do require high throughput when produced by a server.

There are a lot of existing functions in C++ to perform such conversions, but none offers a high-performance solution. At a minimum, an implementation by an ordinary user of the language using an elementary textbook algorithm should not be able to outperform a quality standard library implementation. The requirements are thus:

no runtime parsing of format strings
no dynamic memory allcoation inherently required by the interface
no consideration of locales
no indirection through function pointers required
prevention of buffer overruns
when parsing a string, errors are distinguishable from valid numbers
when parsing a string, whitespace or decorations are not silently ignored

For floating-point numbers, there should be a facility to output a floating-point number with a minimum number of decimal digits where input from the digits is guaranteed to reproduce the original floating-point value.

The deliberations in the Kona LEWG sessions resulted in the following comments:

cover all the formatting facilities of printf (choice of base, scientific, precision,hexfloat)
spell out round-trip guarantees for the decimal format
explain cost of runtime parameter for base vs. having a separate overload with fixed base = 10
use the names to_chars and from_chars
possibly use interface struct { char* ptr; bool overflow; } to_chars(char* ptr, char* last, T value);
use interface struct { const char* ptr; error_code ec; } from_chars(const char* ptr, const char* last, T& value);
check use cases to nail down the to_chars interface

Changes

Compared to P0067R4

change output descriptions to use "in the style of printf"
LEWG: replace "bool overflow" with std::error_code
LEWG: replace "bool hex" with enum chars_format
LEWG: add signatures so all formats roundtrip if no precision is given
LWG: editorial fixes
LWG: require a valid range
LWG: fix cross-references to ISO C11
LWG: add example for parsing of hex floating-point (blue text)

Compared to P0067R3

Wording bug fix: parsing considers chars_format, too

Compared to P0067R2

Removed leftover sentence about implementation-defined NaN/inf representation (should be covered by deferring to printf).
Reflected LWG feedback from Oulu
Removed incomplete support for std::tie (i.e. tuple_size, tuple_element, get<>)

Compared to P0067R1

for errors, switched to errc constants (thanks to Zhihao Yuan)
Addressed comments from LEWG deliberations in Jacksonville, in particular:
- Specialize tuple_size and tuple_element for to_chars_result and from_chars_result
- Supply overloads for all signed and unsigned integer types, as well as char.
- Put the new functions into header <utility>.
- Support C11 printf %g for formatting; provide a bitmask type to choose formatting / parsing options for floating-point.
- Parse both uppercase and lowercase
LEWG decision: Do not support generating uppercase (no change).
LEWG decision: Target for C++17.

Compared to P0067R0

Addressed comments from LEWG deliberations in Kona
Added open issues

Existing approaches

C++ already provides at least the facilities in the following table, each with shortcomings highlighted in the second column.

facility	shortcomings
sprintf	format string, locale, buffer overrun
snprintf	format string, locale
sscanf	format string, locale
atol	locale, does not signal errors
strtol	locale, ignores whitespace and 0x prefix
strstream	locale, ignores whitespace
stringstream	locale, ignores whitespace, memory allocation
num_put / num_get facets	locale, virtual function
to_string	locale, memory allocation
stoi etc.	locale, memory allocation, ignores whitespace and 0x prefix, exception on error

As a rough performance comparison, the following simple numeric formatting task was implemented: Output the integer numbers 0 ... 1 million, separated by a single space character, into a contiguous array buffer of 10 MB. This task was executed 10 times. The execution environment was gcc 4.9 on Intel Core i5 M450.

strstream	864 ms	uses `std::strstream` with application-provided buffer
streambuf	540 ms	uses simple custom streambuf with `std::num_put<>` facet
direct	285 ms	open-coded "divide by 10" algorithm, using the interface described below
fixed-point	125 ms	fixed-point algorithm found in an older AMD optimization guide, using the interface described below

There are various approaches for even more efficient algorithms; see, for example, https://gist.github.com/anonymous/7700052 .

Interface discussion

The following discussion assumes that a common interface style should be established that covers (built-in) integer and floating-point types. The type T designates such an arithmetic type. Note that given these restrictions, output of T to a string has a small maximum length in all cases. The styles for input vs. output will differ due to the differing functionality.

The fundamental interface for a string is that it is caller-allocated, contiguous in memory, and not necessarily 0-terminated. That means, it can be represented by a range [first,last) where first and last are of type char *.

Given this framework, the following subsections discuss various specific interface styles for both output and input. In each case, the signature of an integer output or input function is shown. Criteria for comparison include impact on compiler optimizations, indication of output buffer overflow, and composability (as a measure of ease-of-use).

Output

This subsection discusses various specific interface styles for output. In each case, the signature of an integer output function is shown. There is one failure mode for output: overflow of the provided output buffer. Criteria for comparison include impact on compiler optimizations, indication of output buffer overflow, and composability (as a measure of ease-of-use). For exposition of the latter, consecutive output of two numbers is shown, without any separator.

Conceptually, an output function has four parameters and two results. The parameters are the first and last pointers of the buffer, the value, and the desired base. The results are the updated first pointer and an overflow indication.

Feature set of printf / strtol for integers

base 2...36	overload provided
uppercase for base > 10	not supported

Feature set of printf / strtod for floating-point

The following table lists the format specifiers of fprintf relevant to floating-point in C11 and the disposition in the context of the functionality proposed in this paper.

	field width	not supported
	precision (number of digits after the decimal-point)	overload provided
+	mandatory sign	not supported
space	prefix	not supported
#	mandatory decimal point	not supported
0	pad with zeroes	not supported
L	long double argument	overload provided
f	fixed-precision lowercase conversion	overload provided
F	fixed-precision uppercase conversion	not provided
e	scientific lowercase conversion	overload provided
E	scientific uppercase conversion	not provided
g	switch between f and e	overload provided
G	switch between F and E	not provided
a	hexadecimal lowercase conversion	overload provided
A	hexadecimal uppercase conversion	not provided

Iterator

  char * to_chars(char * first, char * last, T value, int base = 10);

This interface style returns the updated first pointer. That is, the resulting string is in [first, return-value) and [return-value, last) is unused space in the string. Such an interface style is used for many standard library algorithms, e.g. find [alg.find]. All parameters are passed by value which helps the optimizer. Overflow is indicated by return-value == last. The situation that the output exactly fits into the provided buffer cannot be distinguished from overflow. Two consecutive outputs can be produced trivially using:

  p = to_chars(p, last, value1);
  p = to_chars(p, last, value2);

Iterator with in-situ update

  void to_chars(char *& first, char * last, T value, int base = 10);

This interface style updates the first pointer in place. That is, the resulting string is in [old-first, first) and [first,last) is unused space in the string. Aliasing rules allow that updates to first change the data where first points. To avoid redundant updates, the implementation can copy first to a local variable. Overflow is indicated by first reaching last. The situation that the output exactly fits into the provided buffer cannot be distinguished from overflow. Two consecutive outputs can be produced trivially using:

  to_chars(p, last, value1);
  to_chars(p, last, value2);

string_view

  void to_chars(std::string_view& s, T value, int base = 10);

This interface style groups the first and last pointers into a string_view which is updated in-place. Comments on "iterator with in-situ update" apply analogously.

Iterator with in-situ update and overflow indication

Adding a boolean return value allows to indicate overflow:

  bool to_chars(char *& first, char * last, T value, int base = 10);

Comments on "iterator with in-situ update" apply analogously, except that the return value indicates whether overflow occurred.

snprintf

  int to_chars(char * first, char * last, T value, int base = 10);

This interface style always returns the number of characters required to output T, regardless of whether sufficient space was provided. That is, an overflow occurred if the return value is larger than last-first, otherwise the resulting string is in [first, first + return-value). Such an interface style is used for snprintf, except that the proposed function never 0-terminates the output. All parameters are passed by value which helps the optimizer. Overflow is indicated by a return value strictly larger than the distance between first and last. Computing the amount of overflow is helpful to allocate a larger buffer, but, in general, requires switching from the fast path, because no further characters may be stored. The elementary functions discussed in this paper all have (statically computable) limited maximum output size, so the benefit of returning the exact size is small. Two consecutive outputs require attention at the caller site to avoid buffer overflow:

  int n = 0;
  n += to_chars(first, last, value1);
  n += to_chars(first + std::min(n, last-first), last, value2);

Kona consensus

  struct to_chars_result {
    char* ptr;
    bool overflow;
    operator tuple<char *, bool>() const;
  };
  char* get<0>(const to_chars_result&);   // for tie()
  bool get<1>(const to_chars_result&);
  to_chars_result to_chars(char* first, char* last, T value, int base = 10);

This interface style returns a named pair with the updated first pointer. All parameters are passed by value which helps the optimizer. Overflow is indicated by a separate overflow indicator in the return value. Two consecutive outputs can be produced easily using:

  to_chars_result result = to_chars(p, last, value1);
  result = to_chars(result.ptr, last, value2);

Input

An input function conceptually operates in two steps: First, it consumes characters from the input string matching a pattern until the first non-matching character or the last of the string is encountered. Second, the matched characters are translated into a value of type T. There are two failure modes: no characters match, or the pattern translates to a value that is not in the range representable by T.

Conceptually, an input function has three parameters and three results. The parameters are the first and last pointers of the string and the desired base. The results are the updated first pointer, a std::error_code and the parsed value.

This subsection discusses various specific interface styles for input. Failure is indicated by std::error_code with the appropriate value. In each case, the signature of an integer input function is shown. Criteria for comparison include impact on compiler optimizations and composability (as a measure of ease-of-use). For exposition of the latter, parsing of two consecutive values is shown, without skipping of any separator.

Iterator

  const char * from_chars(const char * first, const char * last, T& value, std::error_code& ec, int base = 10);

This interface style returns the updated first pointer. That is, the returned pointer points to the first character not matching the pattern. Such an interface style is used for many standard library algorithms. Two consecutive inputs can be performed like this:

  T value1, value2;
  std::error_code ec;
  p = from_chars(p, last, value1, ec);
  if (ec)
    /* parse error */;
  p = from_chars(p, last, value2, ec);
  if (ec)
    /* parse error */;

Iterator with in-situ update

  void from_chars(const char *& first, const char * last, T& value, std::error_code& ec, int base = 10);

This interface style updates the first pointer in place. Two consecutive inputs can be performed like this:

  T value1, value2;
  std::error_code ec;
  from_chars(p, last, value1, ec);
  if (ec)
    /* parse error */;
  from_chars(p, last, value2, ec);
  if (ec)
    /* parse error */;

Iterator with in-situ update and error return

  std::error_code from_chars(const char *& first, const char * last, T& value, int base = 10);

Returning the error code allows for more compact code at the call site:

  T value1, value2;
  if (std::error_code ec = from_chars(p, last, value1))
    /* parse error */;
  if (std::error_code ec = from_chars(p, last, value2))
    /* parse error */;

Return a std::pair or std::tuple

Two of the three results of an input function could be represented by a pair. All three results could be represented by a tuple. However, experience with std::map shows that the naming of the parts (first and second) carries no semantic meaning which would help reading the resulting code. If the result value moves to the return value, its type T needs to be passed explicitly (e.g. as a template parameter). The composition example would be:

  std::pair<T, std::error_code> res;
  res = from_chars<T>(p, last);
  if (res.second)
    /* parse error */;
  T value1 = res.first;
  res = from_chars<T>(p, last);
  if (res.second)
    /* parse error */;
  T value2 = res.second;

Kona consensus

  struct from_chars_result {
    const char* ptr;
    error_code ec;
  };
  const char * get<0>(const from_chars_result&); // for tie()
  error_code get<1>(const from_chars_result&);
  from_chars_result from_chars(const char* first, const char* last, T& value, int base = 10);

This interface style returns the updated first pointer and an error code. All parameters, except for the parsed value, are passed by value, which helps the optimizer. Two consecutive inputs can be performed like this:

  T value1, value2;
  from_chars_result result = from_chars(p, last, value1);
  if (result.ec)
    /* parse error */
  result = from_chars(result.ptr, last, value2);
  if (result.ec)
    /* parse error */

Naming

The LEWG deliberations in Kona expressed the following naming preferences (sorted by number of votes).

to_text	9
to_chars	9
to_digits	7
to_characters	7
to_printable	6
to_ascii	3
to_string	3
to_output	1
[de]serialize	1
[un]marshal	1
[de]stringify	1

Given the tie in the first place and the author's personal preference for to_chars, this paper proposes to_chars for the output function and from_chars for the input (parse) function.

Cost of runtime parameter for base vs. separate overloads

The alternatives are

  to_chars_result to_chars(char* first, char* last, T value, int base = 10);

vs.

  to_chars_result to_chars(char* first, char* last, T value);
  to_chars_result to_chars(char* first, char* last, T value, int base);

The difference is almost a quality-of-implementation issue, except that the standard gives appropriate liberty only for member functions, not for non-member functions (17.6.5.5 [member.functions]). The former can be implemented like this:

  inline to_chars_result to_chars(char* first, char* last, T value, int base = 10)
  {
    if (base == 10)
      return to_chars2(first, last, value);
    else
      return to_chars2(first, last, value, base);
  }

Other than a slightly increased burden on the inlining and constant propagation capabilities of the compiler, the two signatures are thus identical in performance. I have analyzed similar cases in the past and can confirm that the inline function essentially vanishes for optimized compiles. Personally, I would prefer to give an implementation latitude to switch between the two interface styles as it sees fit, but that is a question that should be discussed in a wider context, independent of the present paper. A similar argument applies to the question of overhead for base = 16, where a very efficient implementation using SIMD vector instructions is possible.

Minor concerns

Why do we not throw an exception on parse error? Two reasons: Exceptions come with a cost (in particular, when thrown), and a parse error is not an exceptional situation.

Wording

20.2 Header <utility> synopsis [utility]

Add the following to 20.2 [utility]:

namespace std {
  struct to_chars_result {
    char* ptr;
    error_code ec;
  };

  to_chars_result to_chars(char* first, char* last, see below value, int base = 10);

  enum class chars_format {
    scientific = unspecified,
    fixed = unspecified,
    hex = unspecified,
    general = fixed | scientific
  };

  to_chars_result to_chars(char* first, char* last, float       value);
  to_chars_result to_chars(char* first, char* last, double      value);
  to_chars_result to_chars(char* first, char* last, long double value);

  to_chars_result to_chars(char* first, char* last, float       value, chars_format fmt);
  to_chars_result to_chars(char* first, char* last, double      value, chars_format fmt);
  to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);

  to_chars_result to_chars(char* first, char* last, float       value, chars_format fmt, int precision);
  to_chars_result to_chars(char* first, char* last, double      value, chars_format fmt, int precision);
  to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);


  struct from_chars_result {
    const char* ptr;
    error_code ec;
  };

  from_chars_result from_chars(const char* first, const char* last, see below& value, int base = 10);  

  from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);  
  from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);  
  from_chars_result from_chars(const char* first, const char* last, long double& value,  chars_format fmt = chars_format::general);  
}

The type chars_format is a bitmask type (17.5.2.1.3 [bitmask.types]) with elements scientific, fixed, and hex.

For the to_chars function taking a parameter base, the implementation shall provide overloads for all signed and unsigned integer types and char as the type of the parameter value. For the from_chars function taking a parameter base, the implementation shall provide overloads for all signed and unsigned integer types and char as the referenced type of the parameter value.

20.2.7 Output functions

All functions named to_chars convert value into a character string by successively filling the range [first, last), where [first, last) is required to be a valid range. If the member ec of the return value is such that the value, when converted to bool, is false, the conversion was successful and the member ptr is the one-past-the-end pointer of the characters written. Otherwise, the member ec has the value errc::value_too_large, the member ptr has the value last, and the contents of the range [first, last) are unspecified.

The functions that take a floating-point value but not a precision parameter ensure that the string representation consists of the smallest number of characters such that there is at least one digit before the radix point (if present) and parsing the representation using the corresponding from_chars function recovers value exactly. [ Note: This guarantee applies only if to_chars and from_chars are executed on the same implementation. ]

The functions taking a chars_format parameter determine the conversion specifier for printf as follows: The conversion specifier is f if fmt is chars_format::fixed, e if fmt is chars_format::scientific, a (without leading "0x" in the result) if fmt is chars_format::hex, and g if fmt is chars_format::general.

  to_chars_result to_chars(char* first, char* last, see above value, int base = 10);

Requires: base has a value between 2 and 36 (inclusive).

Effects: The value of value is converted to a string of digits in the given base (with no redundant leading zeroes). Digits in the range 10..35 (inclusive) are represented as lowercase characters a..z. If value is less than zero, the representation starts with a minus sign.

Throws: Nothing.

  to_chars_result to_chars(char* first, char* last, float       value);
  to_chars_result to_chars(char* first, char* last, double      value);
  to_chars_result to_chars(char* first, char* last, long double value);

Effects: value is converted to a string in the style of printf in the "C" locale (see ISO C 7.21.6.1). The conversion specifier is f or e, chosen according to the requirement for a shortest representation (see above); a tie is resolved in favor of f.

Throws: Nothing.

  to_chars_result to_chars(char* first, char* last, float       value, chars_format fmt);
  to_chars_result to_chars(char* first, char* last, double      value, chars_format fmt);
  to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt);

Requires: fmt has the value of one of the enumerators of chars_format.

Effects: value is converted to a string in the style of printf in the "C" locale (see ISO C 7.21.6.1).

Throws: Nothing.

  to_chars_result to_chars(char* first, char* last, float       value, chars_format fmt, int precision);
  to_chars_result to_chars(char* first, char* last, double      value, chars_format fmt, int precision);
  to_chars_result to_chars(char* first, char* last, long double value, chars_format fmt, int precision);

Requires: fmt has the value of one of the enumerators of chars_format.

Effects: value is converted to a string in the style of printf in the "C" locale with the given precision (see ISO C 7.21.6.1).

Throws: Nothing.

20.2.8 Input functions

All functions named from_chars analyze the string [first,last) for a pattern, where [first,last) is required to be a valid range. If no characters match the pattern, value is unmodified, the member ptr of the return value is first and the member ec is equal to errc::invalid_argument. Otherwise, the characters matching the pattern are interpreted as a representation of a value of type T. The member ptr of the return value points to the first character not matching the pattern, or has the value last if all characters match. If the parsed value is not in the range representable by the type of value, value is unmodified and the member ec of the return value is equal to errc::result_out_of_range. Otherwise, value is set to the parsed value and the member ec is set such that the conversion to bool yields false.

  from_chars_result from_chars(const char* first, const char* last, see above& value, int base = 10);

Requires: base has a value between 2 and 36 (inclusive).

Effects: The pattern is the expected form of the subject sequence in the "C" locale for the given non-zero base, as described for strtol in ISO C 7.22.1.4, except that no "0x" or "0X" prefix shall appear if the value of base is 16, and except that a minus sign is the only sign that may appear, and only if value has a signed type.

Throws: Nothing.

  from_chars_result from_chars(const char* first, const char* last, float& value, chars_format fmt = chars_format::general);
  from_chars_result from_chars(const char* first, const char* last, double& value, chars_format fmt = chars_format::general);
  from_chars_result from_chars(const char* first, const char* last, long double& value, chars_format fmt = chars_format::general);

Requires: fmt has the value of one of the enumerators of chars_format.

Effects: The pattern is the expected form of the subject sequence in the "C" locale, as described for strtod in ISO C 7.22.1.3, except that

the only sign that may appear is a minus sign;
if fmt has chars_format::scientific set but not chars_format::fixed, the otherwise optional exponent part shall appear;
if fmt has chars_format::fixed set but not chars_format::scientific, the optional exponent part shall not appear; and
if fmt is chars_format::hex, the prefix "0x" or "0X" is assumed. [ Example: The string 0x123 is parsed to have the value 0 with remaining characters x123. ]

In any case, the resulting value is one of at most two floating-point values closest to the value of the string matching the pattern.

Throws: Nothing.

Open Issues

(currently none)

References

"How to print floating-point numbers accurately" by Guy L. Steele Jr. and Jon L White, Proceedings of the ACM SIGPLAN'90 Conference on Programming Language Design and Implementation; http://www.kurtstephens.com/files/p372-steele.pdf (starting at page 4 of the PDF)
"Printing Floating-Point Numbers Quickly and Accurately with Integers" by Florian Loitsch, PLDI'10 http://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf
"How to Read Floating Point Numbers Accurately" by William D Clinger, University of Oregon http://www.cesura17.net/~will/professional/research/papers/howtoread.pdf