Document #: | P2728R7 |
Date: | 2024-10-06 |
Project: | Programming Language C++ |
Audience: | SG-16 Unicode, SG-9 Ranges, LEWG |
Reply-to: | Eddie Nolan <eddiejnolan@gmail.com> |
[…] to_utfN_views plus utf_view, and three as_charN_t_views
[…] as_charN_t_view is not implemented in terms of transform_view
[…] utf_view always transcodes, even in UTF-N to UTF-N cases
[…] char32_t.
[…] charN_t.
[…] utfN_view to the types of the from-range, instead of the types of the transcoding iterators used to implement the view.
[…] as_utfN() functions with the as_utfN view adaptors that should have been there all along.
[…] transcoding_error_handler concept.
[…] unpack_iterator_and_sentinel into a CPO.
[…] null_sentinel_t to a non-Unicode-specific facility.
[…] utf{8,16,32}_view with a single utf_view.
[…] noexcept where appropriate.
[…] code_unit concept, and added as_charN_t adaptors.
[…] replacement_character.
[…] utf_iterator slightly.
[…] null_sentinel_t back to being Unicode-specific.
[…] unpacking_owning_view with unpacking_view, and use it to do unpacking, rather than sometimes doing the unpacking in the adaptor.
[…] const and non-const overloads for begin and end in all views.
[…] null_sentinel_t to std, remove its base member function, and make it useful for more than just pointers, based on SG-9 guidance.
[…] null_sentinel_t.
[…] ranges::project_view, and implement charN_views in terms of that.
[…] utfN_views to aliases, rather than individual classes.
[…] null_sentinel_t causing it not to satisfy sentinel_for by changing its operator== to return bool.
[…] null_sentinel_t where it did not support non-copyable input iterators by having operator== take input iterators by reference.
[…] as_utfN to to_utfN to emphasize that a conversion is taking place and to contrast with the code unit views, which remain named as_charN_t.
[…] utf_view into an exposition-only utf-view-impl class used as an implementation detail of separate to_utf8_view, to_utf16_view, and to_utf32_view classes, addressing broken deduction guides in the previous revision.
[…] project_view and copy most of its implementation into separate char8_view, char16_view, and char32_view classes, addressing broken deduction guides in the previous revision.
[…] utf_iterator to an exposition-only member class of utf-view-impl.
[…] begin() and end() member functions and losing the ability to implement unpacking for user-defined UTF iterators.
[…] std::uc::format.
[…] transcoding_error_handler mechanism.
[…] transcoding_error enumeration which is returned by a success() member function of the transcoding view’s iterator.
[…] std::format and std::ostream functionality. It doesn’t make sense for this mechanism to be the only way we have to format/output char8_t; we can revisit this functionality when we have already figured out how to support e.g. std::u8string.
Unicode is important to many, many users in everyday software. It is not exotic or weird. Well, it’s weird, but it’s not weird to see it used. C and C++ are the only major production languages with essentially no support for Unicode.
Let’s fix.
To fix, first we start with the most basic representations of strings in Unicode: UTF. You might get a UTF string from anywhere; on Windows you often get them from the OS, in UTF-16. In web-adjacent applications, strings are most commonly in UTF-8. In ASCII-only applications, everything is in UTF-8, by its definition as a superset of ASCII.
Often, an application needs to switch between UTFs: 8 -> 16, 32 -> 16, etc. In SG-16 we’ve taken to calling such UTF-N -> UTF-M operations “transcoding”.
This paper provides interfaces to do UTF transcoding based on the ranges API.
A particular reason for urgency in adding transcoding operations to the standard library is that the standard library has previously contained problematic-to-broken UTF transcoding facilities in the form of std::codecvt facets, which are currently slated for removal without replacement as [P2871R3] and [P2873R2] make their way through the committee. GitHub searches show that these facilities are widely used; the functionality contained in this paper can serve as a proper replacement.
There are multiple encoding types defined in Unicode: UTF-8, UTF-16, and UTF-32.
A code unit is the lowest-level datum-type in your Unicode data. Examples are a char8_t in UTF-8 and a char32_t in UTF-32.
A code point is a 32-bit integral value that represents a single Unicode value. Examples are U+0041 “A” “LATIN CAPITAL LETTER A” and U+0308 “¨” “COMBINING DIAERESIS”.
A code point may consist of multiple code units. For instance, three UTF-8 code units in sequence may encode a particular code point.
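The relationship between code points and code units can be made concrete with a small sketch. The helper names utf8_length and utf16_length below are hypothetical and not part of this proposal; they simply report how many code units each encoding needs for a given code point, under the assumption that surrogates and values above U+10FFFF are rejected by returning 0.

```cpp
#include <cstddef>

// Hypothetical helper (not proposed): number of UTF-8 code units needed to
// encode a Unicode scalar value. Returns 0 for surrogates and for values
// above U+10FFFF, which cannot be encoded in well-formed UTF-8.
constexpr std::size_t utf8_length(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate code points
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp <= 0xFFFF) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                    // beyond the Unicode range
}

// Hypothetical helper (not proposed): number of UTF-16 code units needed.
constexpr std::size_t utf16_length(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate code points
    if (cp <= 0xFFFF) return 1;
    if (cp <= 0x10FFFF) return 2;                // encoded as a surrogate pair
    return 0;
}

static_assert(utf8_length(0x0041) == 1, "U+0041 is one UTF-8 code unit");
static_assert(utf8_length(0x0308) == 2, "U+0308 is two UTF-8 code units");
```

A UTF-32 code unit always holds exactly one code point, which is why no utf32_length is needed.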
Here, we transcode a UTF-8 string literal to a std::u32string:
std::u32string hello_world =
  u8"こんにちは世界" | std::uc::to_utf32 | std::ranges::to<std::u32string>();
Here, we sanitize potentially invalid Unicode C strings by replacing invalid code units with replacement characters according to Unicode’s recommended Substitution of Maximal Subparts:
template <typename CharT>
std::basic_string<CharT> sanitize(CharT const* str) {
  return std::uc::null_term(str) | std::uc::to_utf<CharT> | std::ranges::to<std::basic_string<CharT>>();
}
std::optional<char32_t> last_nonascii(std::ranges::view auto str) {
  for (auto c : str | std::uc::to_utf32 | std::views::reverse
                    | std::views::filter([](char32_t c) { return c > 0x7f; })
                    | std::views::take(1)) {
    return c;
  }
  return std::nullopt;
}
(This example assumes the existence of the enum_to_string function from [P2996R5].)
template <typename FromChar, typename ToChar>
std::basic_string<ToChar> transcode_or_throw(std::basic_string_view<FromChar> input) {
  std::basic_string<ToChar> result;
  auto view = input | to_utf<ToChar>;
  for (auto it = view.begin(), end = view.end(); it != end; ++it) {
    if (it.success()) {
      result.push_back(*it);
    } else {
      throw std::runtime_error("error at position " +
                               std::to_string(it.base() - input.begin()) + ": " +
                               enum_to_string(it.success().error()));
    }
  }
  return result;
}
Let’s say that we want to take code points that we got from ICU, and transcode them to UTF-8. The problem is that ICU’s code point type is int. Since int is not a character type, it’s not deduced by to_utf8 to be UTF-32 data. We can address this by using the std::uc::as_char32_t adaptor to cast the ints to char32_t:
std::vector<int> input = get_icu_code_points();
// This is ill-formed without the as_char32_t adaptation.
auto input_utf8 =
    input | std::uc::as_char32_t | std::uc::to_utf8 | std::ranges::to<std::u8string>();
This proposal depends on the existence of [P2727R4] “std::iterator_interface”.
char and wchar_t
Here are some examples of the differences between having the transcoding views accept ranges of char and wchar_t or reject them. The to_utfN and as_charN adaptors are discussed later in this paper.
The to_utfN adaptors produce to_utfN_views, which do transcoding.
The as_charN_t adaptors produce as_charN_views that are each very similar to a transform_view that casts each element of the adapted range to a charN_t value. An as_charN_view differs from the equivalent transform in that it may be a borrowed range.
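To make the "cast each element" behavior concrete, here is a rough, eager approximation written against the existing standard library. The function name cast_to_char32 is hypothetical, and unlike the proposed as_char32_t adaptor this sketch copies into a new string rather than producing a lazy, possibly borrowed view.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not proposed): eagerly cast each int to char32_t.
// The proposed as_char32_t adaptor performs the same per-element cast, but
// lazily and without allocating.
inline std::u32string cast_to_char32(const std::vector<int>& code_points) {
    std::u32string out;
    out.reserve(code_points.size());
    std::transform(code_points.begin(), code_points.end(),
                   std::back_inserter(out),
                   [](int cp) { return static_cast<char32_t>(cp); });
    return out;
}
```

No validation happens at this stage; the cast only changes the element type so that later transcoding can treat the data as UTF-32.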
Note the use of the shorthand “charN_t” below with std::wstring. That’s there because whether you write as_char16_t or as_char32_t is implementation-dependent.
Rejecting ranges of char and wchar_t | Accepting ranges of char and wchar_t |
---|---|
| |
In short, rejecting char and wchar_t forces you to write “| as_char8_t” everywhere you want to use a std::string with the interfaces proposed in this paper.
SG-16 has previously expressed strong support for rejecting char and wchar_t, as can be observed in the polling history section.
The feeling in SG-16 was that the charN_t types are designed to represent UTF encodings, and char is not. A char const * string could be in any one of dozens (hundreds?) of encodings. The addition of “| as_char8_t” to adapt ranges of char is meant to act as a lexical indicator of user intent.
The authors believe this decision is a mistake. Our argument for accepting ranges of char and wchar_t is as follows.
First, note that none of the charN_t types imposes any invariant that a range of its contents contains valid Unicode. As a result, they cannot enforce preconditions for APIs that require valid Unicode input at the level of the type system.
Therefore, we claim that the main use case of the charN_t types in APIs is to facilitate a coding style that allows APIs to advertise to users whether they expect Unicode-encoded strings (whether with a wide or a narrow contract).
For example, users of this coding style may write an API like the following:
// Expect input to be in Windows-1252
std::size_t word_count(std::string_view);

// Expect input to be in Unicode
std::size_t word_count(std::u8string_view);
If to_utfN rejects ranges of char and wchar_t, it would bring this standard library API into alignment with this style.
However, there are a number of reasons why we consider this approach undesirable for our use case.
First of all, for any large C++ API surface dealing with Unicode that was not designed very recently, there will be APIs that expect UTF-8 in the form of std::string parameters. This means that the semantic value of char8_t is one-sided; in such an ecosystem, while the presence of char8_t certainly indicates that the API expects UTF-8, the absence of char8_t may still indicate a char-based API that also expects UTF-8.
Furthermore, because char8_t is such a recent addition to the standard, and because it’s so poorly supported by other standard library facilities such as <iostream> and std::format, its penetration has been extremely low; a GitHub Code Search showed 15.3M references to std::string and 6.7k references to std::u8string.
Finally, due to the particular history of implementation choices by compiler writers, the proportion of C++ users who have the ability to properly benefit from the use of char8_t is unfortunately smaller than intended.
For the vast majority of users of Unix-like operating systems, both the basic literal encoding and the execution encoding are UTF-8, and so char8_t is mostly redundant, since it has approximately the same meaning as char. This leaves Windows developers as the remaining large pool of users who could potentially take advantage of char8_t.
The issue is that Windows users are divided into two categories: those who use MSVC’s /utf8 compiler flag, and those who do not.
Users of /utf8 are in the future: /utf8 switches the basic literal encoding and execution encoding to UTF-8. These users have less need for char8_t because their chars are UTF-8.
Non-users of /utf8 are dealing with non-Unicode basic literal and execution encodings, so theoretically they’re the target audience for char8_t. But unfortunately, without the /utf8 flag, MSVC breaks compliance with the standard, in that it violates the requirement that u8"" string literals are encoded in Unicode. Attempting to create such a string literal on MSVC without specifying /utf8 results in Windows-1252 code units inside of char8_t bytes. For these users, char8_t is theoretically useful but broken in practice.
Rejecting char and wchar_t for UTF transcoding will therefore have limited benefits. On the other hand, rejecting these types will send users over to Stack Overflow to discover they need to copy boilerplate called | std::uc::as_char8_t for reasons that will seem academic to most of them.
When invalid code units are encountered, the UTF transcoding views replace those code units with U+FFFD replacement characters according to the Unicode standard’s recommended “Substitution of Maximal Subparts” algorithm.
However, users of the transcoding views may want to know when invalid code units have been encountered, and to implement custom behaviors if this is the case. Simply checking whether the transcoded code points contain U+FFFD replacement characters is not sufficient because these characters are an in-band signal that can also appear in valid UTF.
What’s called for is a basis operation with which arbitrary error handling approaches may be implemented.
The UTF transcoding views in this paper provide such a basis operation by adding a success() member function to the iterator of the transcoding view, which informs users whether the current code point is a U+FFFD that was inserted in response to an invalid code unit sequence. The success() member function returns a std::expected<void, std::uc::transcoding_error>, where std::uc::transcoding_error is a new enum class containing enumerators for every category of transcoding error.
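The difference between the in-band U+FFFD signal and the out-of-band success() signal can be sketched for the simplest case, UTF-32 input, where each code unit is judged on its own. Everything here is a stand-in chosen for illustration: utf32_error, decoded, and sanitize_utf32_unit are hypothetical names, and std::optional is used in place of the proposed std::expected<void, std::uc::transcoding_error> only so that the sketch compiles as C++17.

```cpp
#include <optional>

// Stand-in for the proposed std::uc::transcoding_error, reduced to the two
// categories that can occur in UTF-32 input.
enum class utf32_error { encoded_surrogate, out_of_range };

// Stand-in for what a transcoding iterator exposes via operator* and success().
struct decoded {
    char32_t code_point;               // U+FFFD if the input unit was invalid
    std::optional<utf32_error> error;  // empty means success
};

// Hypothetical helper: sanitize one UTF-32 code unit.
inline decoded sanitize_utf32_unit(char32_t unit) {
    constexpr char32_t replacement = 0xFFFD;
    if (unit >= 0xD800 && unit <= 0xDFFF)
        return {replacement, utf32_error::encoded_surrogate};
    if (unit > 0x10FFFF)
        return {replacement, utf32_error::out_of_range};
    return {unit, std::nullopt};
}
```

A U+FFFD that was genuinely present in the input and a U+FFFD that was substituted for 0xD800 produce the same code point, and only the error member tells them apart. That is the property the text above asks of a basis operation.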
Users who choose not to implement error handling will simply sanitize any invalid code unit sequences using U+FFFD replacement. Users who want to implement error handling can implement any of the following approaches, either by wrapping the iterator or by iterating with a traditional for loop:
[…] value_type is std::expected<charN_t, std::uc::transcoding_error>
[…] std::expected<void, E>?
The main alternative to consider here would be to specify that default-constructed std::uc::transcoding_error values represent success, or add a success enumerator whose value is zero. There is precedent for doing this in the standard in the error handling approach of std::from_chars, which returns a std::from_chars_result that contains a std::errc and has an operator bool() that returns true if the std::errc is default-constructed.
However, that design decision was made before std::expected was added to the standard library in C++23. Now that we have this facility, we should take the opportunity to use the type system to structurally separate the error cases from the success cases, instead of lumping them all together in the same type as in the case of std::errc.
iconv()
[…] errno error codes:
EINVAL (the initial subsequence of a valid sequence was at the end of the input sequence)
EILSEQ (any other invalid sequence in the input)
u_strFromUTF8WithSub()
[…] int32_t (documentation recommends U+FFFD) or sets an error code using an out-param
[…] ERROR_NO_UNICODE_TRANSLATION
decode()
Invalid sequences result in exceptions containing verbose descriptions of the offending sequence
Example:
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
List of error messages:
unexpected end of data
invalid start byte
invalid continuation byte
code point in surrogate code point range(0xd800, 0xe000)
truncated data
code point not in range(0x110000)
illegal encoding
illegal UTF-16 surrogate
The claim that to_utfN_view’s success() API is a basis operation is supported by the fact that each of the above APIs can be implemented using it, but not vice versa. See Appendix: Implementing Existing Practice for Error Handling for code examples which demonstrate this.
std::uc::transcoding_error enumerators
truncated_utf8_sequence
[…] 0xE1 0x80.
unpaired_high_surrogate
[…] 0xD800.
unpaired_low_surrogate
[…] 0xDC00.
unexpected_utf8_continuation_byte
[…] 0x80.
overlong
[…] 0xE0 0x80.
encoded_surrogate
[…] 0xED 0xA0, UTF-32 0x0000D800.
out_of_range
[…] 0xF4 if it is followed by a continuation byte greater than 0x8F
[…] 0x10FFFF
[…] 0xF4 0x90, UTF-32 0x110000.
invalid_utf8_leading_byte
[…] 0xC0-0xC1 and 0xF5-0xFF.
[…] 0xC0.
An alternative approach to minimize the number of enumerators could merge truncated_utf8_sequence with unpaired_high_surrogate and merge unexpected_utf8_continuation_byte with unpaired_low_surrogate, but based on feedback, splitting these up seems to be preferred.
The first two rows of each of the following tables are taken directly from the “U+FFFD Substitution of Maximal Subparts” section of the Unicode standard, and augmented to show the associated success() for each resulting code point.
Note that outside of the truncation case, the leading code unit is associated with a more specific error enumerator, and then all the continuation bytes in the invalid sequence are unexpected_utf8_continuation_byte. This is aligned with my interpretation of the underlying logic of Substitution of Maximal Subparts; also, any other approach would require additional lookahead, which would break some of the API’s invariants.
Iterators are constructed from more than one underlying iterator. In order to perform iteration in many text-handling contexts, you need to know the beginning and the end of the range you are iterating over, just to be able to perform iteration correctly. Note that this is not a safety issue, but a correctness one. For example, say we have a string s of UTF-8 code units that we would like to iterate over to produce UTF-32 code points. If the last code unit in s is 0xe0, we should expect two more code units to follow. They are not present, though, because 0xe0 is the last code unit. Now consider how you would implement operator++() for an iterator iter that transcodes from UTF-8 to UTF-32. If you advance far enough to get the next UTF-32 code point in each call to operator++(), you may run off the end of s when you find 0xe0 and try to read two more code units. Note that it does not matter that iter probably comes from a range with an end-iterator or sentinel as its mate; inside iter’s operator++() this is no help. iter must therefore have the end-iterator or sentinel as a data member. The same logic applies to the other end of the range if iter is bidirectional: it must also have the iterator to the start of the underlying range as a data member.
This unfortunate reality comes up over and over in the proposed iterators, not just the ones that are UTF transcoding iterators. This is why iterators in this proposal (and the ones to come) usually consist of three underlying iterators.
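The 0xe0 scenario can be sketched as a single decoding step that takes the end of the range as a parameter. This is a deliberately simplified illustration, not the proposed algorithm: decode_one is a hypothetical name, it requires it != last on entry, and it only detects truncation, stray continuation bytes, and bytes that cannot start a sequence, so overlong forms, encoded surrogates, and out-of-range values are not diagnosed.

```cpp
// Hypothetical helper (not proposed): decode one code point starting at `it`,
// advancing `it` past the code units consumed. Precondition: it != last.
// Simplification: overlong, surrogate, and out-of-range checks are omitted.
template <class It>
char32_t decode_one(It& it, It last) {
    constexpr char32_t replacement = 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // ASCII: one code unit

    int extra = 0;    // number of continuation bytes the lead byte promises
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return replacement;  // stray continuation byte or invalid lead byte

    for (; extra > 0; --extra) {
        // Without `last` there would be no way to stop here, and a trailing
        // 0xE0 would cause a read past the end of the input.
        if (it == last) return replacement;
        const unsigned char next = static_cast<unsigned char>(*it);
        if ((next & 0xC0) != 0x80) return replacement;  // leave it for the next call
        cp = (cp << 6) | (next & 0x3F);
        ++it;
    }
    return cp;
}
```

The bounds check inside the loop is the reason the iterator must carry the end of the underlying range with it; a bidirectional counterpart would need the beginning of the range for the symmetric reason.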
Because of this fact, it’s almost free to specify these iterators so that dereferencing a past-the-end iterator, incrementing a past-the-end iterator, and decrementing an at-the-beginning iterator are all erroneous behavior instead of undefined behavior. The only time an additional branch is required to ensure safety is to check for a before-the-beginning decrement in operator-- (although actually producing diagnostics for the EB requires further branching).
As long as a transcoding view is constructed with proper arguments, all subsequent operations on it and its iterators are memory safe.
In generic contexts, users will create to_utfN_views wrapping iterators of other to_utfN_views. This presents a problem for a naive implementation because when to_utfN_view is wrapping a bidirectional range, the number of iterators in each successive to_utfN_view wrapper increases geometrically unless we use workarounds.
The workaround makes it so that when a to_utfN_view is constructed from another to_utfN_view’s iterators, instead of storing those iterators in the iterators of the outer to_utfN_view, the outer to_utfN_view’s iterators have identical contents to the inner to_utfN_view’s iterators, the only difference being the output encoding. This also allows the outer to_utfN_view’s iterators to reconstruct the inner to_utfN_view iterator when its base() member function is invoked, without actually storing it.
This optimization is only needed when the underlying range is bidirectional (or “better”), because input ranges and forward ranges increase in size linearly rather than geometrically with each successive wrapper, due to the fact that the sentinel is not wrapped by the transcoding iterator.
Although it’s not strictly necessary, we could also apply the optimization when the underlying range is a forward range, preventing the iterator size from growing at all (as opposed to linear growth), but that isn’t done in this paper because we judge the tradeoffs as not being justified. It is not possible to apply the optimization when the underlying range is an input range, because of the fact that the underlying iterator is past-the-end of the current code point.
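The growth rates described above can be checked with a pair of toy structs. These are simplified stand-ins with hypothetical names; real transcoding iterators also carry a small buffer of code units and some bookkeeping, which is left out so that only the stored iterators contribute to the size.

```cpp
#include <cstddef>

// Naive bidirectional case: the wrapper stores three copies of the wrapped
// iterator (first, curr, last).
template <class I>
struct naive_bidirectional_iterator {
    I first;
    I curr;
    I last;
};

// Forward or input case: the wrapper stores one wrapped iterator plus the
// underlying sentinel, which is not itself wrapped.
template <class I, class S>
struct naive_forward_iterator {
    I curr;
    S last;
};

using ptr = const char*;

using bidi1 = naive_bidirectional_iterator<ptr>;    // 3 pointers
using bidi2 = naive_bidirectional_iterator<bidi1>;  // 9 pointers
using fwd1 = naive_forward_iterator<ptr, ptr>;      // 2 pointers
using fwd2 = naive_forward_iterator<fwd1, ptr>;     // 3 pointers
```

Each extra bidirectional layer multiplies the size by three, while each extra forward layer adds one sentinel, which matches the geometric and linear growth described in the text.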
The diagram below represents the outcome of the following process:
[…] char8_ts from 0x100 to 0x300.
[…] to_utf16_view with this underlying range.
[…] begin() iterator until the underlying pointer is at 0x150 and reverse the view’s end() iterator until the underlying pointer is at 0x250.
[…] to_utf32_view.
[…] to_utf32_view’s begin() iterator until the underlying pointer (two levels down) is at 0x175 and similarly reverse end() to 0x225.
The goal is for the optimized implementation to avoid having to store all the iterators that the naive implementation does, while still outwardly appearing to the user as though its API is the same as the naive one.
The iterators of the optimized to_utf32_view can simulate the naive version’s base() by reconstructing a to_utf16_view iterator containing its own first, curr, and last iterators. However, if we added accessors for first() and last() to the iterator, then we wouldn’t be able to return the same results as the naive implementation because we’ve lost information about those iterators, so this optimization can only work properly if we leave those out.
Unlike with other range adaptor objects, base() cannot have any overloads that simply return a reference to the underlying iterator as opposed to a new copy or move-constructed instantiation of it, because of this optimization.
Input iterators cannot benefit from this optimization because they are necessarily past-the-end of the current code point within the range being adapted, whereas other iterator types are at the beginning of the current code point.
There is an unavoidable inconsistency introduced by this optimization that occurs when one of the iterators is in the middle of a code point in a variable length encoding (UTF-8 or UTF-16). Consider what happens when a user attempts to convert the UTF-32 code point U+1F574 🕴 MAN IN BUSINESS SUIT LEVITATING to UTF-8 and then to UTF-16, but increments the iterator of the UTF-8 transcoding view by one code unit first.
In the naive implementation, the result is simply three replacement characters as the UTF-16 transcoder encounters three unexpected UTF-8 continuation bytes:
UTF-32: 0x1F574
UTF-8:  0xF0 0x9F 0x95 0xB4
             ^
UTF-16: 0xFFFD 0xFFFD 0xFFFD
However, in the optimized implementation, when the UTF-16 transcoding view wraps the iterator from the UTF-8 transcoding view, it looks directly at the underlying UTF-32 iterator and forgets the UTF-8 iterator’s position within the code point:
UTF-32: 0x1F574
UTF-16: 0xD83D 0xDD74
Furthermore, when you invoke base() on an iterator of the UTF-16 transcoding view, it’s lost the intra-code-point position, moving it back to the starting code unit:
Original iterator:
UTF-8: 0xF0 0x9F 0x95 0xB4
            ^
Result of base():
UTF-8: 0xF0 0x9F 0x95 0xB4
       ^
These inconsistencies are somewhat unfortunate, but they only apply when the input to the transcoding view starts in the middle of a code point, which is definitionally invalid UTF anyway; and it does not affect the invariant that the output is always valid UTF. This is an acceptable tradeoff for avoiding geometric growth of the iterator sizes.
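The code unit values used for U+1F574 in this section can be reproduced with a short sketch of the standard UTF-8 and UTF-16 encoding arithmetic. The function names encode_utf8 and encode_utf16 are hypothetical and not part of the proposal, and both assume their argument is a valid Unicode scalar value.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not proposed). Precondition: cp is a Unicode scalar
// value (not a surrogate, not above U+10FFFF).
inline std::vector<unsigned char> encode_utf8(char32_t cp) {
    auto byte = [](char32_t v) { return static_cast<unsigned char>(v); };
    if (cp < 0x80) return {byte(cp)};
    if (cp < 0x800) return {byte(0xC0 | (cp >> 6)), byte(0x80 | (cp & 0x3F))};
    if (cp < 0x10000)
        return {byte(0xE0 | (cp >> 12)), byte(0x80 | ((cp >> 6) & 0x3F)),
                byte(0x80 | (cp & 0x3F))};
    return {byte(0xF0 | (cp >> 18)), byte(0x80 | ((cp >> 12) & 0x3F)),
            byte(0x80 | ((cp >> 6) & 0x3F)), byte(0x80 | (cp & 0x3F))};
}

// Hypothetical helper (not proposed). Same precondition as encode_utf8.
inline std::u16string encode_utf16(char32_t cp) {
    if (cp < 0x10000) return std::u16string(1, static_cast<char16_t>(cp));
    const char32_t v = cp - 0x10000;  // 20 bits, split into two 10-bit halves
    std::u16string out;
    out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    return out;
}
```

For U+1F574 these produce the four UTF-8 code units 0xF0 0x9F 0x95 0xB4 and the UTF-16 surrogate pair 0xD83D 0xDD74 shown in the examples above.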
There is one more quirk introduced by this optimization. For ordinary, non-special-cased iterators of transcoding views, dereferencing a past-the-end iterator, incrementing past the end, and decrementing before the beginning are all erroneous behavior. However, because of the information loss associated with this optimization, the EB detection can’t kick in until the user has exceeded the bounds of the deepest underlying range, rather than one of its intermediate layers.
For example, in the scenario in the diagram from before, the naive implementation would detect EB when the to_utf32_view’s begin() iterator was decremented to the point where the underlying range iterator was less than 0x150, but the special-cased implementation would simply continue reading through the underlying range until 0x100. This is perhaps surprising, but still achieves memory safety.
It’s a useful property of this approach that the type system remembers the correct type to use for base() even in the case of transcoding views wrapping other transcoding views. To illustrate, consider this algorithm (not proposed) as an example.
template<input_iterator I, sentinel_for<I> S, output_iterator<char32_t> O>
transcode_result<I, O> transcode_to_utf32(I first, S last, O out);
Such a transcoding algorithm is pretty similar to std::ranges::copy, in that you should return both the output iterator and the final position of the input iterator (transcode_result is an alias for in_out_result). Because we can always provide base(), we have no trouble returning a transcode_result here in every case:
template<input_iterator I, sentinel_for<I> S, output_iterator<char32_t> O>
transcode_result<I, O> transcode_to_utf32(I first, S last, O out) {
  auto r = ranges::subrange(first, last) | uc::to_utf32;
  auto copy_result = ranges::copy(r, out);
  return transcode_result<I, O>{copy_result.in.base(), copy_result.out};
}
None of the proposed interfaces is subject to change in future versions of Unicode; each relates to the guaranteed-stable subset. Just sayin’.
None of the proposed interfaces allocates or throws.
All the transcoding iterators allow you access to the underlying iterator via .base(), following the convention of the iterator adaptors already in the standard.
The transcoding views are lazy, as you’d expect. They also compose with the standard view adaptors, so just transcoding at most 10 UTF-16 code units out of some UTF can be done with foo | std::uc::to_utf16 | std::ranges::views::take(10).
Error handling strategies of the user’s choosing can be implemented by the user due to the suitable basis operation success() provided by the transcoding iterator. This gives control to those who want to do something other than the default. The default, according to Unicode, is to produce a replacement character (0xfffd) in the output when broken UTF encoding is seen in the input. This is what all these interfaces do, unless you make use of the basis operation.
The production of replacement characters as error-handling strategy is good for memory compactness and safety. It allows us to store all our text as UTF-8 (or, less compactly, as UTF-16), and then process code points as transcoding views. If an error occurs, the transcoding views will simply produce a replacement character; there is no danger of UB.
null_sentinel and associated CPO null_term
namespace std {
  template<class I>
    concept default-initializable-and-equality-comparable-iter-value =
      default_initializable<iter_value_t<I>> &&
      equality_comparable_with<iter_reference_t<I>, iter_value_t<I>>; // exposition only

  struct null_sentinel_t {
    template<input_iterator I>
      requires (not forward_iterator<I>) && default-initializable-and-equality-comparable-iter-value<I>
    friend constexpr bool operator==(I const& it, null_sentinel_t) {
      return *it == iter_value_t<I>{};
    }
    template<forward_iterator I>
      requires default-initializable-and-equality-comparable-iter-value<I>
    friend constexpr bool operator==(I it, null_sentinel_t) {
      return *it == iter_value_t<I>{};
    }
  };

  inline constexpr null_sentinel_t null_sentinel;

  inline constexpr unspecified null_term;
}
The sentinel type matches any iterator position it at which *it is equal to a default-constructed object of type iter_value_t<I>. This works for null-terminated strings, but can also serve as the sentinel for any range terminated by a default-constructed value.
Because this type is potentially useful for lots of ranges unrelated to Unicode or text, it is in the std namespace, not std::uc.
The null_sentinel_t’s operator== has a separate overload for input iterators that takes the iterator by reference instead of by value. We want to take input iterators by reference because they are not required to be copyable. However, for forward iterators, we want to take by value because otherwise we incur a double indirection (e.g. int* const& it) that compilers may not optimize.
The name null_term denotes a customization point object ([customization.point.object]). Given a subexpression E, the expression null_term(E) is expression-equivalent to ranges::subrange(move(E), null_sentinel).
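The idea behind the sentinel can be approximated without concepts or ranges. The sketch below is an assumption-laden C++17 stand-in, not the proposed wording: null_sentinel_sketch and distance_to_null are hypothetical names, the comparison is unconstrained, and it always takes the iterator by reference instead of providing the separate by-value overload for forward iterators.

```cpp
#include <cstddef>
#include <type_traits>

// C++17 approximation of the proposed null_sentinel_t (hypothetical name).
// It compares equal to any iterator whose current element equals a
// value-initialized element.
struct null_sentinel_sketch {
    template <class I>
    friend bool operator==(const I& it, null_sentinel_sketch) {
        using value_type = std::decay_t<decltype(*it)>;
        return *it == value_type{};
    }
    template <class I>
    friend bool operator!=(const I& it, null_sentinel_sketch s) {
        return !(it == s);
    }
};

// Hypothetical helper: count elements before the terminating default value.
template <class I>
std::size_t distance_to_null(I first) {
    std::size_t n = 0;
    for (; first != null_sentinel_sketch{}; ++first) ++n;
    return n;
}
```

Because the comparison only relies on a value-initialized element, the same sentinel works for a zero-terminated array of int as well as for a C string.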
namespace std::uc {
template<class T>
constexpr bool is-empty-view = false;
template<class T>
constexpr bool is-empty-view<ranges::empty_view<T>> = true;
template<class T>
concept code-unit-to = same_as<remove_cv_t<T>, char8_t> ||
  same_as<remove_cv_t<T>, char16_t> || same_as<remove_cv_t<T>, char32_t>;

template<class T>
concept code-unit-from =
  same_as<remove_cv_t<T>, char> || same_as<remove_cv_t<T>, wchar_t> || code-unit-to<T>;

template<class T>
concept utf-range =
  ranges::input_range<T> && code-unit-from<ranges::range_value_t<T>>;

template<class I>
consteval auto bidirectional-at-most() { // exposition only
if constexpr (bidirectional_iterator<I>) {
return bidirectional_iterator_tag{};
} else if constexpr (forward_iterator<I>) {
return forward_iterator_tag{};
} else if constexpr (input_iterator<I>) {
return input_iterator_tag{};
}
}
template<class I>
using bidirectional-at-most-t = decltype(bidirectional-at-most<I>()); // exposition only
template<class I>
consteval auto iterator-to-tag() { // exposition only
if constexpr (random_access_iterator<I>) {
return random_access_iterator_tag{};
} else if constexpr (bidirectional_iterator<I>) {
return bidirectional_iterator_tag{};
} else if constexpr (forward_iterator<I>) {
return forward_iterator_tag{};
} else if constexpr (input_iterator<I>) {
return input_iterator_tag{};
}
}
template<class I>
using iterator-to-tag-t = decltype(iterator-to-tag<I>()); // exposition only
}
namespace std::uc {
enum class transcoding_error {
  truncated_utf8_sequence,
  unpaired_high_surrogate,
  unpaired_low_surrogate,
  unexpected_utf8_continuation_byte,
  overlong,
  encoded_surrogate,
  out_of_range,
  invalid_utf8_leading_byte
};
template<class T>
concept to-utf-view-iterator-optimizable = unspecified // exposition only
template<code-unit-to ToType, from-utf-view V>
class to-utf-view-impl : public ranges::view_interface<to-utf-view-impl<ToType, V>> {
public:
template<bool Const>
class utf-iterator : public iterator_interface<bidirectional-at-most-t<ranges::iterator_t<V>>, ToType, ToType> {
private:
using iter = ranges::iterator_t<maybe-const<Const, V>>;
using sent = ranges::sentinel_t<maybe-const<Const, V>>;
template<code-unit-to ToType2, from-utf-view V2>
friend class to-utf-view-impl; // exposition only
template<class I>
struct first-and-curr { // exposition only
  first-and-curr() = default;
  constexpr first-and-curr(I curr) : curr(move(curr)) {}
  I curr;
};
template<bidirectional_iterator I>
struct first-and-curr<I> { // exposition only
  first-and-curr() = default;
  constexpr first-and-curr(I first, I curr) : first(first), curr(curr) {}
  I first;
  I curr;
};
using innermost-iter = unspecified; // exposition only
using from-type = decltype([] {
if constexpr (is_same_v<char, iter_value_t<innermost-iter>>) {
return char8_t{};
} else if constexpr (is_same_v<wchar_t, iter_value_t<innermost-iter>>) {
if constexpr (sizeof(wchar_t) == 2) {
return char16_t{};
} else if constexpr (sizeof(wchar_t) == 4) {
return char32_t{};
}
} else {
return iter_value_t<innermost-iter>{};
}
}()); // exposition only
using innermost-iter = unspecified; // exposition only
using innermost-sent = unspecified; // exposition only
public:
using value_type = ToType;
using reference_type = ToType&;
using difference_type = ptrdiff_t;
using iterator_concept = bidirectional-at-most-t<iter>;
constexpr utf-iterator() requires default_initializable<V> = default;
private:
constexpr utf-iterator(innermost-iter first, innermost-iter it, innermost-sent last) // exposition only
requires bidirectional_iterator<innermost-iter>
: first_and_curr_(first, it), last_(last) {
  if (curr() != last_)
    read();
}
constexpr utf-iterator(innermost-iter it, innermost-sent last) // exposition only
requires (!bidirectional_iterator<innermost-iter>)
: first_and_curr_(move(it)), last_(last) {
  if (curr() != last_)
    read();
}
public:
constexpr utf-iterator() = default;
constexpr utf-iterator(utf-iterator const&) requires copyable<innermost-iter> = default;
constexpr utf-iterator& operator=(utf-iterator const&) requires copyable<innermost-iter> = default;
constexpr utf-iterator(utf-iterator&&) = default;
constexpr utf-iterator& operator=(utf-iterator&&) = default;
constexpr iter base() const requires forward_iterator<innermost-iter>
{
if constexpr (to-utf-view-iterator-optimizable<iter>) {
if constexpr (bidirectional_iterator<innermost-iter>) {
return iter(begin(), curr(), last_);
} else {
return iter(curr(), last_);
}
} else {
return curr();
}
}
constexpr iter base() &&
requires (!forward_iterator<innermost-iter>) { return move(*this).curr(); }
constexpr expected<void, transcoding_error> success() const;
constexpr value_type operator*() const;
constexpr utf-iterator& operator++() {
if constexpr (forward_iterator<innermost-iter>) {
if (buf_index_ + 1 < buf_last_) {
++buf_index_;
} else if (buf_index_ + 1 == buf_last_) {
advance(curr(), to_increment_);
to_increment_ = 0;
if (curr() != last_) {
read();
} else {
buf_index_ = 0;
}
}
} else {
if (buf_index_ + 1 == buf_last_ && curr() != last_) {
read();
} else if (buf_index_ + 1 <= buf_last_) {
++buf_index_;
}
}
return *this;
}
constexpr auto operator++(int) {
if constexpr (is_same_v<iterator_concept, input_iterator_tag>) {
++*this;
} else {
auto retval = *this;
++*this;
return retval;
}
}
constexpr utf-iterator& operator--() requires bidirectional_iterator<innermost-iter>
{
if (!buf_index_)
read_reverse();
else if (buf_index_)
--buf_index_;
return *this;
}
constexpr utf-iterator operator--(int) requires bidirectional_iterator<innermost-iter>
{
auto retval = *this;
--*this;
return retval;
}
friend constexpr bool operator==(utf-iterator const& lhs, utf-iterator const& rhs)
requires forward_iterator<innermost-iter> || requires (innermost-iter i) { i != i; }
{
if constexpr (forward_iterator<innermost-iter>) {
return lhs.curr() == rhs.curr() && lhs.buf_index_ == rhs.buf_index_;
} else {
if (lhs.curr() != rhs.curr())
return false;
if (lhs.buf_index_ == rhs.buf_index_ && lhs.buf_last_ == rhs.buf_last_) {
return true;
}
return lhs.buf_index_ == lhs.buf_last_ && rhs.buf_index_ == rhs.buf_last_;
}
}
friend constexpr bool operator==(utf-iterator const& lhs, innermost-sent rhs) requires copyable<innermost-iter>
{
if constexpr (forward_iterator<innermost-iter>) {
return lhs.curr() == rhs;
} else {
return lhs.curr() == rhs && lhs.buf_index_ == lhs.buf_last_;
}
}
friend constexpr bool operator==(utf-iterator const& lhs, innermost-sent rhs) requires (!copyable<innermost-iter>)
{
return lhs.curr() == rhs && lhs.buf_index_ == lhs.buf_last_;
}
constexpr innermost-iter begin() const // exposition only
requires bidirectional_iterator<innermost-iter>
{
return first_and_curr_.first;
}
constexpr innermost-sent end() const { // exposition only
return last_;
}
constexpr void read(); // exposition only
constexpr void read_reverse(); // exposition only
constexpr innermost-iter& curr() & { return first_and_curr_.curr; } // exposition only
constexpr innermost-iter const& curr() const& { return first_and_curr_.curr; } // exposition only
constexpr innermost-iter curr() && { return move(first_and_curr_.curr); } // exposition only
array<value_type, 4 / sizeof(ToType)> buf_{}; // exposition only
first-and-curr<innermost-iter> first_and_curr_; // exposition only
[[no_unique_address]] innermost-sent last_; // exposition only
uint8_t buf_index_ = 0; // exposition only
uint8_t buf_last_ = 0; // exposition only
uint8_t to_increment_ = 0; // exposition only
};
private:
template<bool Const>
static constexpr auto make_begin(auto first, auto last) { // exposition only
if constexpr (bidirectional_iterator<ranges::iterator_t<V>>) {
if constexpr (to-utf-view-iterator-optimizable<ranges::iterator_t<V>>) {
return utf-iterator<Const>(first.begin(), first.curr(), first.last_);
} else {
return utf-iterator<Const>(first, first, last);
}
} else {
return utf-iterator<Const>(move(first), last);
}
}
template<bool Const>
static constexpr auto make_end(auto first, auto last) { // exposition only
if constexpr (bidirectional_iterator<ranges::sentinel_t<V>>) {
if constexpr (to-utf-view-iterator-optimizable<ranges::sentinel_t<V>>) {
return utf-iterator<Const>(last.begin(), last.curr(), last.last_);
} else {
return utf-iterator<Const>(first, last, last);
}
} else {
return last;
}
}
V base_ = V(); // exposition only
public:
constexpr to-utf-view-impl() requires default_initializable<V> = default;
constexpr to-utf-view-impl(V base) : base_(move(base)) {}
constexpr V base() const& requires copy_constructible<V>
{
return base_;
}
constexpr V base() && { return move(base_); }
constexpr auto begin() requires (!copyable<ranges::iterator_t<V>>)
{
return make_begin<false>(ranges::begin(base_), ranges::end(base_));
}
constexpr auto begin() const requires copyable<ranges::iterator_t<V>>
{
return make_begin<true>(ranges::begin(base_), ranges::end(base_));
}
constexpr auto end() requires (!copyable<ranges::iterator_t<V>>)
{
return make_end<false>(ranges::begin(base_), ranges::end(base_));
}
constexpr auto end() const requires copyable<ranges::iterator_t<V>>
{
return make_end<true>(ranges::begin(base_), ranges::end(base_));
}
constexpr bool empty() const { return ranges::empty(base_); }
};
template<from-utf-view V>
class to_utf8_view {
private:
using iterator = ranges::iterator_t<to-utf-view-impl<char8_t, V>>;
using sentinel = ranges::sentinel_t<to-utf-view-impl<char8_t, V>>;
public:
constexpr to_utf8_view() requires default_initializable<V> = default;
constexpr to_utf8_view(V base) : impl_(move(base)) {}
constexpr V base() const& requires copy_constructible<V>
{
return impl_.base();
}
constexpr V base() && { return move(impl_).base(); }
constexpr auto begin() requires (!copyable<iterator>)
{
return impl_.begin();
}
constexpr auto begin() const requires copyable<iterator>
{
return impl_.begin();
}
constexpr auto end() requires (!copyable<iterator>)
{
return impl_.end();
}
constexpr auto end() const requires copyable<iterator>
{
return impl_.end();
}
constexpr bool empty() const { return impl_.empty(); }
private:
to-utf-view-impl<char8_t, V> impl_;
};
template<class R>
to_utf8_view(R&&) -> to_utf8_view<views::all_t<R>>;
template<from-utf-view V>
class to_utf16_view {
private:
using iterator = ranges::iterator_t<to-utf-view-impl<char16_t, V>>;
using sentinel = ranges::sentinel_t<to-utf-view-impl<char16_t, V>>;
public:
constexpr to_utf16_view() requires default_initializable<V> = default;
constexpr to_utf16_view(V base) : impl_(move(base)) {}
constexpr V base() const& requires copy_constructible<V>
{
return impl_.base();
}
constexpr V base() && { return move(impl_).base(); }
constexpr auto begin() requires (!copyable<iterator>)
{
return impl_.begin();
}
constexpr auto begin() const requires copyable<iterator>
{
return impl_.begin();
}
constexpr auto end() requires (!copyable<iterator>)
{
return impl_.end();
}
constexpr auto end() const requires copyable<iterator>
{
return impl_.end();
}
constexpr bool empty() const { return impl_.empty(); }
private:
to-utf-view-impl<char16_t, V> impl_;
};
template<class R>
to_utf16_view(R&&) -> to_utf16_view<views::all_t<R>>;
template<from-utf-view V>
class to_utf32_view {
private:
using iterator = ranges::iterator_t<to-utf-view-impl<char32_t, V>>;
using sentinel = ranges::sentinel_t<to-utf-view-impl<char32_t, V>>;
public:
constexpr to_utf32_view() requires default_initializable<V> = default;
constexpr to_utf32_view(V base) : impl_(move(base)) {}
constexpr V base() const& requires copy_constructible<V>
{
return impl_.base();
}
constexpr V base() && { return move(impl_).base(); }
constexpr auto begin() requires (!copyable<iterator>)
{
return impl_.begin();
}
constexpr auto begin() const requires copyable<iterator>
{
return impl_.begin();
}
constexpr auto end() requires (!copyable<iterator>)
{
return impl_.end();
}
constexpr auto end() const requires copyable<iterator>
{
return impl_.end();
}
constexpr bool empty() const { return impl_.empty(); }
private:
to-utf-view-impl<char32_t, V> impl_;
};
template<class R>
to_utf32_view(R&&) -> to_utf32_view<views::all_t<R>>;
template<code-unit-to ToType>
inline constexpr unspecified to_utf;
inline constexpr unspecified to_utf8;
inline constexpr unspecified to_utf16;
inline constexpr unspecified to_utf32;
}
namespace std::ranges {
template <class ToType, class V>
inline constexpr bool enable_borrowed_range<
std::uc::to-utf-view-impl<ToType, V>> = enable_borrowed_range<V>;
template<class V>
inline constexpr bool enable_borrowed_range<std::uc::to_utf8_view<V>> = enable_borrowed_range<V>;
template<class V>
inline constexpr bool enable_borrowed_range<std::uc::to_utf16_view<V>> = enable_borrowed_range<V>;
template<class V>
inline constexpr bool enable_borrowed_range<std::uc::to_utf32_view<V>> = enable_borrowed_range<V>;
}
The exposition-only concept to-utf-view-iterator-optimizable is satisfied if its template argument is a specialization of utf-iterator and models std::ranges::bidirectional_iterator.
to-utf-view-impl is an exposition-only class that provides implementation details common to the three transcoding views, to_utf8_view, to_utf16_view, and to_utf32_view, which are themselves described further down.
The iterator type of to-utf-view-impl is utf-iterator. utf-iterator is an iterator that transcodes from UTF-N to UTF-M, where N and M are each one of 8, 16, or 32. N may equal M.
utf-iterator uses a mapping between character types and UTF encodings: char and char8_t correspond to UTF-8, char16_t corresponds to UTF-16, char32_t corresponds to UTF-32, and wchar_t corresponds to UTF-16 if its size is 2 or UTF-32 if its size is 4.
utf-iterator does its work by adapting an underlying range of code units. We use the term “input subsequence” to refer to a potentially ill-formed code unit subsequence which is to be transcoded into a code point c. Each input subsequence is decoded from the UTF encoding corresponding to from-type. If the underlying range contains ill-formed UTF, the code units are divided into input subsequences according to Substitution of Maximal Subparts, and each ill-formed input subsequence is transcoded into a U+FFFD. c is then encoded to ToType’s corresponding encoding, into an internal code unit buffer.
utf-iterator maintains certain invariants; the invariants differ based on whether utf-iterator is an input iterator.
For input iterators, the invariant is: if *this is at the end of the range being adapted, then curr() == last_; otherwise, the position of curr() is always at the end of the input subsequence corresponding to the current code point c, and buf_ contains the code units that comprise c, in the UTF encoding corresponding to ToType.
For forward and bidirectional iterators, the invariant is: if *this is at the end of the range being adapted, then curr() == last_; otherwise, the position of curr() is always at the beginning of the input subsequence corresponding to the current code point c within the underlying range, and buf_ contains the code units that comprise c, in the UTF encoding corresponding to ToType.
The exposition-only member function read decodes the input subsequence starting at position curr() into a code point c, using the UTF encoding corresponding to from-type, and setting c to U+FFFD if the input subsequence is ill-formed. If c is set to U+FFFD as the result of an ill-formed input subsequence, it sets the error as described below. It sets to_increment_ to the number of code units read while decoding c; encodes c into buf_ in the UTF encoding corresponding to ToType; sets buf_index_ to 0; and sets buf_last_ to the number of code units encoded into buf_. If forward_iterator<I> is true, curr() is set to the position it had before read was called.
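As a rough illustration of what read does for UTF-8 input, the sketch below decodes one input subsequence at a time, applying Substitution of Maximal Subparts, and encodes each resulting code point as UTF-16. The names decoded, decode_one, encode_utf16, and transcode are hypothetical and not part of this proposal. The sketch works eagerly on a std::string_view (treating char as a UTF-8 code unit) rather than lazily through utf-iterator, so it should be read as a model of the decoding rules under those assumptions, not as the proposed implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: one decoded input subsequence.
struct decoded {
    char32_t code_point;  // U+FFFD if the input subsequence was ill-formed
    std::size_t length;   // code units consumed (plays the role of to_increment_)
};

// Decodes the input subsequence at the front of `in`, which must be non-empty.
constexpr decoded decode_one(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    auto const b0 = static_cast<unsigned char>(in[0]);
    if (b0 < 0x80) return {b0, 1};
    std::size_t need = 0;                // continuation bytes expected
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the next byte
    char32_t cp = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1; cp = b0 & 0x1F;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;       // smaller values would be overlong
        if (b0 == 0xED) hi = 0x9F;       // larger values would encode a surrogate
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;       // smaller values would be overlong
        if (b0 == 0xF4) hi = 0x8F;       // larger values would exceed U+10FFFF
    } else {
        return {replacement, 1};         // stray continuation or invalid lead byte
    }
    for (std::size_t i = 1; i <= need; ++i) {
        if (i >= in.size()) return {replacement, i};    // truncated sequence
        auto const b = static_cast<unsigned char>(in[i]);
        if (b < lo || b > hi) return {replacement, i};  // maximal subpart ends here
        cp = (cp << 6) | (b & 0x3F);
        lo = 0x80; hi = 0xBF;            // later bytes use the plain range
    }
    return {cp, need + 1};
}

// Encodes a Unicode scalar value as UTF-16; returns the number of code units.
constexpr std::size_t encode_utf16(char32_t cp, char16_t (&buf)[2]) {
    if (cp < 0x10000) { buf[0] = static_cast<char16_t>(cp); return 1; }
    cp -= 0x10000;
    buf[0] = static_cast<char16_t>(0xD800 + (cp >> 10));
    buf[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    return 2;
}

// Transcodes UTF-8 to UTF-16, one input subsequence at a time.
inline std::u16string transcode(std::string_view in) {
    std::u16string out;
    while (!in.empty()) {
        decoded const d = decode_one(in);
        char16_t buf[2]{};
        out.append(buf, encode_utf16(d.code_point, buf));
        in.remove_prefix(d.length);
    }
    return out;
}
```

In this sketch, the truncated sequence E2 82 followed by 'A' yields U+FFFD followed by 'A', while the bytes C0 80 yield two U+FFFD code points, because neither byte can begin a well-formed sequence.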
The exposition-only member function read_reverse decodes the input subsequence ending at position curr() into a code point c, using the UTF encoding corresponding to from-type, and setting c to U+FFFD if the input subsequence is ill-formed. If c is set to U+FFFD as the result of an ill-formed input subsequence, it sets the error as described below. It sets to_increment_ to the number of code units read while decoding c; encodes c into buf_ in the UTF encoding corresponding to ToType; sets buf_last_ to the number of code units encoded into buf_; and sets buf_index_ to buf_last_ - 1.
In the following paragraphs, utf-error(err) refers to the result of calling the exposition-only function utf-error-func(err):
expected<void, transcoding_error> utf-error-func(transcoding_error err) {
return unexpected{err};
}
When the utf-iterator is at the end of the underlying range, success() returns a default-constructed expected<void, transcoding_error>.
When the utf-iterator has a code unit, derived from a code point c, which is itself derived from a particular input subsequence (the “current input subsequence”), the result of the success() member function corresponds to the underlying range’s input subsequences as follows. (All ranges of numerical values of code units below are inclusive.)
If the encoding corresponding to from-type is UTF-8:
success() returns expected<void, transcoding_error>{}.
success() returns utf-error(transcoding_error::unexpected_utf8_continuation_byte).
success() returns utf-error(transcoding_error::invalid_utf8_leading_byte).
success() returns utf-error(transcoding_error::overlong).
success() returns utf-error(transcoding_error::encoded_surrogate).
success() returns utf-error(transcoding_error::out_of_range).
success() returns utf-error(transcoding_error::truncated_utf8_sequence).
If the encoding corresponding to from-type is UTF-16:
success() returns expected<void, transcoding_error>{}.
success() returns utf-error(transcoding_error::unpaired_high_surrogate).
success() returns utf-error(transcoding_error::unpaired_low_surrogate).
If the encoding corresponding to from-type is UTF-32:
success() returns expected<void, transcoding_error>{}.
success() returns utf-error(transcoding_error::encoded_surrogate).
success() returns utf-error(transcoding_error::out_of_range).
utf-iterator’s exposition-only type alias innermost-iter is iter::innermost-iter if iter models to-utf-view-iterator-optimizable, or iter otherwise. The exposition-only type alias innermost-sent is sent::innermost-sent if sent models to-utf-view-iterator-optimizable, or sent otherwise.
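Returning to the UTF-32 cases of success() listed above, the following sketch shows one way the three outcomes could be distinguished for a single code unit. It assumes that encoded_surrogate applies to values in the surrogate range D800 to DFFF and that out_of_range applies to values above 10FFFF, which is what the enumerator names suggest. The enum is a cut-down stand-in for the paper's transcoding_error, check_utf32 is a hypothetical helper, and std::optional stands in for expected<void, transcoding_error> so that the sketch does not depend on C++23.

```cpp
#include <cassert>
#include <optional>

// Stand-in for the paper's transcoding_error; only the two enumerators
// used by the UTF-32 cases are reproduced here.
enum class transcoding_error { encoded_surrogate, out_of_range };

// Hypothetical helper. std::nullopt plays the role of a successful
// expected<void, transcoding_error>.
constexpr std::optional<transcoding_error> check_utf32(char32_t cu) {
    if (cu >= 0xD800 && cu <= 0xDFFF)
        return transcoding_error::encoded_surrogate;  // surrogate code point
    if (cu > 0x10FFFF)
        return transcoding_error::out_of_range;       // beyond the Unicode range
    return std::nullopt;                              // a valid scalar value
}

// Small usage example with assertions.
inline void check_utf32_examples() {
    assert(!check_utf32(U'A').has_value());
    assert(check_utf32(0xDC00) == transcoding_error::encoded_surrogate);
    assert(check_utf32(0x110000) == transcoding_error::out_of_range);
}
```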
If utf-iterator is a bidirectional_iterator, it is defined to be at the beginning of its underlying range if buf_index_ is zero and curr() == begin(). If it is a forward_iterator, it is defined to be at the end of its underlying range if buf_index_ + 1 == buf_last_ and curr() == last_. Otherwise, it is defined to be at the end of its underlying range if buf_index_ == buf_last_ and curr() == last_.
If operator* is invoked while utf-iterator is at the end of its underlying range, the behavior is erroneous and the result is unspecified. Otherwise, operator* returns buf_[buf_index_].
If operator++ is invoked while utf-iterator is at the end of its underlying range, the behavior is erroneous and the iterator’s state does not change. If operator-- is invoked while utf-iterator is at the beginning of its underlying range, the behavior is erroneous and the iterator’s state does not change.
to_utf8_view produces a UTF-8 view of the elements from a utf-range. to_utf16_view produces a UTF-16 view of the elements from a utf-range. to_utf32_view produces a UTF-32 view of the elements from a utf-range.
The names to_utf8, to_utf16, and to_utf32 denote range adaptor objects ([range.adaptor.object]). to_utf denotes a range adaptor object template. to_utf8 produces to_utf8_views, to_utf16 produces to_utf16_views, and to_utf32 produces to_utf32_views. to_utf<ToType> is equivalent to to_utf8 if ToType is char8_t, to_utf16 if ToType is char16_t, and to_utf32 if ToType is char32_t.
Let to_utfN denote any one of to_utf8, to_utf16, and to_utf32, and let V denote the to_utfN_view associated with that object. Let E be an expression and let T be remove_cvref_t<decltype((E))>. If decltype((E)) does not model utf-range, to_utfN(E) is ill-formed. The expression to_utfN(E) is expression-equivalent to:
If T is a specialization of empty_view ([range.empty.view]), then empty_view<ToType>{}.
Otherwise, if T is an array type of known bound, then:
V(std::ranges::subrange(std::ranges::begin(E), --std::ranges::end(E)))
V(std::ranges::subrange(std::ranges::begin(E), std::ranges::end(E)))
Otherwise, V(std::views::all(E)).
utf_view’s implementation of the empty() member function is more efficient than the one provided by view_interface, since view_interface’s implementation will construct utf_view::begin() and utf_view::end() and compare them, whereas we can simply use the underlying range’s empty(), since a utf_view is empty if and only if its underlying range is empty.
namespace std::uc {
template<class I>
consteval auto iterator-to-tag() { // exposition only
if constexpr (random_access_iterator<I>) {
return random_access_iterator_tag{};
} else if constexpr (bidirectional_iterator<I>) {
return bidirectional_iterator_tag{};
} else if constexpr (forward_iterator<I>) {
return forward_iterator_tag{};
} else if constexpr (input_iterator<I>) {
return input_iterator_tag{};
}
}
template<class I>
using iterator-to-tag-t = decltype(iterator-to-tag<I>()); // exposition only
template<typename V, typename ToType>
concept convertible-to-charN-t-view = code-unit-to<ToType> && ranges::view<V> && convertible_to<ranges::range_reference_t<V>, ToType>;
template<convertible-to-charN-t-view<char8_t> V>
class as_char8_t_view : public ranges::view_interface<as_char8_t_view<V>> {
V base_ = V(); // exposition only
template<bool Const>
class iterator; // exposition only
template<bool Const>
class sentinel; // exposition only
public:
constexpr as_char8_t_view() requires default_initializable<V> = default;
constexpr as_char8_t_view(V base) : base_(move(base)) {}
constexpr V& base() & { return base_; }
constexpr const V& base() const& requires copy_constructible<V>
{
return base_;
}
constexpr V base() && { return move(base_); }
constexpr iterator<false> begin() { return iterator<false>{ranges::begin(base_)}; }
constexpr iterator<true> begin() const requires ranges::range<const V>
{
return iterator<true>{ranges::begin(base_)};
}
constexpr sentinel<false> end() { return sentinel<false>{ranges::end(base_)}; }
constexpr iterator<false> end() requires ranges::common_range<V>
{
return iterator<false>{ranges::end(base_)};
}
constexpr sentinel<true> end() const requires ranges::range<const V>
{
return sentinel<true>{ranges::end(base_)};
}
constexpr iterator<true> end() const requires ranges::common_range<const V>
{
return iterator<true>{ranges::end(base_)};
}
constexpr auto size() requires ranges::sized_range<V>
{
return ranges::size(base_);
}
constexpr auto size() const requires ranges::sized_range<const V>
{
return ranges::size(base_);
}
};
template<convertible-to-charN-t-view<char8_t> V>
template<bool Const>
class as_char8_t_view<V>::iterator
: public proxy_iterator_interface<iterator-to-tag-t<ranges::iterator_t<maybe-const<Const, V>>>, char8_t> {
public:
using reference_type = char8_t;
private:
using iterator-type = ranges::iterator_t<maybe-const<Const, V>>; // exposition only
friend access;
constexpr iterator-type& base_reference() noexcept { return it_; } // exposition only
constexpr iterator-type base_reference() const { return it_; } // exposition only
iterator-type it_ = iterator-type(); // exposition only
public:
constexpr iterator() = default;
constexpr iterator(iterator-type it) : it_(move(it)) {}
constexpr reference_type operator*() const { return *it_; }
};
template<convertible-to-charN-t-view<char8_t> V>
template<bool Const>
class as_char8_t_view<V>::sentinel {
using base = maybe-const<Const, V>; // exposition only
using sentinel-type = ranges::sentinel_t<base>; // exposition only
sentinel-type end_ = sentinel-type(); // exposition only
public:
constexpr sentinel() = default;
constexpr explicit sentinel(sentinel-type end) : end_(move(end)) {}
constexpr sentinel(sentinel<!Const> i) requires Const && convertible_to<ranges::sentinel_t<V>, ranges::sentinel_t<base>>;
constexpr sentinel-type base() const { return end_; }
template<bool OtherConst>
requires sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr bool operator==(const iterator<OtherConst>& x, const sentinel& y) {
return x.it_ == y.end_;
}
template<bool OtherConst>
requires sized_sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr ranges::range_difference_t<maybe-const<OtherConst, V>> operator-(const iterator<OtherConst>& x, const sentinel& y) {
return x.it_ - y.end_;
}
template<bool OtherConst>
requires sized_sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr ranges::range_difference_t<maybe-const<OtherConst, V>> operator-(const sentinel& y, const iterator<OtherConst>& x) {
return y.end_ - x.it_;
}
};
template<class R>
as_char8_t_view(R&&) -> as_char8_t_view<views::all_t<R>>;
template<convertible-to-charN-t-view<char16_t> V>
class as_char16_t_view : public ranges::view_interface<as_char16_t_view<V>> {
V base_ = V(); // exposition only
template<bool Const>
class iterator; // exposition only
template<bool Const>
class sentinel; // exposition only
public:
constexpr as_char16_t_view() requires default_initializable<V> = default;
constexpr as_char16_t_view(V base) : base_(move(base)) {}
constexpr V& base() & { return base_; }
constexpr const V& base() const& requires copy_constructible<V>
{
return base_;
}
constexpr V base() && { return move(base_); }
constexpr iterator<false> begin() { return iterator<false>{ranges::begin(base_)}; }
constexpr iterator<true> begin() const requires ranges::range<const V>
{
return iterator<true>{ranges::begin(base_)};
}
constexpr sentinel<false> end() { return sentinel<false>{ranges::end(base_)}; }
constexpr iterator<false> end() requires ranges::common_range<V>
{
return iterator<false>{ranges::end(base_)};
}
constexpr sentinel<true> end() const requires ranges::range<const V>
{
return sentinel<true>{ranges::end(base_)};
}
constexpr iterator<true> end() const requires ranges::common_range<const V>
{
return iterator<true>{ranges::end(base_)};
}
constexpr auto size() requires ranges::sized_range<V>
{
return ranges::size(base_);
}
constexpr auto size() const requires ranges::sized_range<const V>
{
return ranges::size(base_);
}
};
template<convertible-to-charN-t-view<char16_t> V>
template<bool Const>
class as_char16_t_view<V>::iterator
: public proxy_iterator_interface<iterator-to-tag-t<ranges::iterator_t<maybe-const<Const, V>>>, char16_t> {
public:
using reference_type = char16_t;
private:
using iterator-type = ranges::iterator_t<maybe-const<Const, V>>; // exposition only
friend access;
constexpr iterator-type& base_reference() noexcept { return it_; } // exposition only
constexpr iterator-type base_reference() const { return it_; } // exposition only
iterator-type it_ = iterator-type(); // exposition only
public:
constexpr iterator() = default;
constexpr iterator(iterator-type it) : it_(move(it)) {}
constexpr reference_type operator*() const { return *it_; }
};
template<convertible-to-charN-t-view<char16_t> V>
template<bool Const>
class as_char16_t_view<V>::sentinel {
using base = maybe-const<Const, V>; // exposition only
using sentinel-type = ranges::sentinel_t<base>; // exposition only
sentinel-type end_ = sentinel-type(); // exposition only
public:
constexpr sentinel() = default;
constexpr explicit sentinel(sentinel-type end) : end_(move(end)) {}
constexpr sentinel(sentinel<!Const> i) requires Const && convertible_to<ranges::sentinel_t<V>, ranges::sentinel_t<base>>;
constexpr sentinel-type base() const { return end_; }
template<bool OtherConst>
requires sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr bool operator==(const iterator<OtherConst>& x, const sentinel& y) {
return x.it_ == y.end_;
}
template<bool OtherConst>
requires sized_sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr ranges::range_difference_t<maybe-const<OtherConst, V>> operator-(const iterator<OtherConst>& x, const sentinel& y) {
return x.it_ - y.end_;
}
template<bool OtherConst>
requires sized_sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr ranges::range_difference_t<maybe-const<OtherConst, V>> operator-(const sentinel& y, const iterator<OtherConst>& x) {
return y.end_ - x.it_;
}
};
template<class R>
as_char16_t_view(R&&) -> as_char16_t_view<views::all_t<R>>;
template<convertible-to-charN-t-view<char32_t> V>
class as_char32_t_view : public ranges::view_interface<as_char32_t_view<V>> {
V base_ = V(); // exposition only
template<bool Const>
class iterator; // exposition only
template<bool Const>
class sentinel; // exposition only
public:
constexpr as_char32_t_view() requires default_initializable<V> = default;
constexpr as_char32_t_view(V base) : base_(move(base)) {}
constexpr V& base() & { return base_; }
constexpr const V& base() const& requires copy_constructible<V>
{
return base_;
}
constexpr V base() && { return move(base_); }
constexpr iterator<false> begin() { return iterator<false>{ranges::begin(base_)}; }
constexpr iterator<true> begin() const requires ranges::range<const V>
{
return iterator<true>{ranges::begin(base_)};
}
constexpr sentinel<false> end() { return sentinel<false>{ranges::end(base_)}; }
constexpr iterator<false> end() requires ranges::common_range<V>
{
return iterator<false>{ranges::end(base_)};
}
constexpr sentinel<true> end() const requires ranges::range<const V>
{
return sentinel<true>{ranges::end(base_)};
}
constexpr iterator<true> end() const requires ranges::common_range<const V>
{
return iterator<true>{ranges::end(base_)};
}
constexpr auto size() requires ranges::sized_range<V>
{
return ranges::size(base_);
}
constexpr auto size() const requires ranges::sized_range<const V>
{
return ranges::size(base_);
}
};
template<convertible-to-charN-t-view<char32_t> V>
template<bool Const>
class as_char32_t_view<V>::iterator
: public proxy_iterator_interface<iterator-to-tag-t<ranges::iterator_t<maybe-const<Const, V>>>, char32_t> {
public:
using reference_type = char32_t;
private:
using iterator-type = ranges::iterator_t<maybe-const<Const, V>>; // exposition only
friend access;
constexpr iterator-type& base_reference() noexcept { return it_; } // exposition only
constexpr iterator-type base_reference() const { return it_; } // exposition only
iterator-type it_ = iterator-type(); // exposition only
public:
constexpr iterator() = default;
constexpr iterator(iterator-type it) : it_(move(it)) {}
constexpr reference_type operator*() const { return *it_; }
};
template<convertible-to-charN-t-view<char32_t> V>
template<bool Const>
class as_char32_t_view<V>::sentinel {
using base = maybe-const<Const, V>; // exposition only
using sentinel-type = ranges::sentinel_t<base>; // exposition only
sentinel-type end_ = sentinel-type(); // exposition only
public:
constexpr sentinel() = default;
constexpr explicit sentinel(sentinel-type end) : end_(move(end)) {}
constexpr sentinel(sentinel<!Const> i) requires Const && convertible_to<ranges::sentinel_t<V>, ranges::sentinel_t<base>>;
constexpr sentinel-type base() const { return end_; }
template<bool OtherConst>
requires sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr bool operator==(const iterator<OtherConst>& x, const sentinel& y) {
return x.it_ == y.end_;
}
template<bool OtherConst>
requires sized_sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr ranges::range_difference_t<maybe-const<OtherConst, V>> operator-(const iterator<OtherConst>& x, const sentinel& y) {
return x.it_ - y.end_;
}
template<bool OtherConst>
requires sized_sentinel_for<sentinel-type, ranges::iterator_t<maybe-const<OtherConst, V>>>
friend constexpr ranges::range_difference_t<maybe-const<OtherConst, V>> operator-(const sentinel& y, const iterator<OtherConst>& x) {
return y.end_ - x.it_;
}
};
template<class R>
as_char32_t_view(R&&) -> as_char32_t_view<views::all_t<R>>;
inline constexpr unspecified as_char8_t;
inline constexpr unspecified as_char16_t;
inline constexpr unspecified as_char32_t;
}
namespace std::ranges {
template<class V>
inline constexpr bool enable_borrowed_range<std::uc::as_char8_t_view<V>> = enable_borrowed_range<V>;
template<class V>
inline constexpr bool enable_borrowed_range<std::uc::as_char16_t_view<V>> = enable_borrowed_range<V>;
template<class V>
inline constexpr bool enable_borrowed_range<std::uc::as_char32_t_view<V>> = enable_borrowed_range<V>;
}
as_char8_t_view produces a view of char8_t elements from another view. as_char16_t_view produces a view of char16_t elements from another view. as_char32_t_view produces a view of char32_t elements from another view. Let as_charN_t_view denote any one of the views as_char8_t_view, as_char16_t_view, and as_char32_t_view.
The names as_char8_t, as_char16_t, and as_char32_t denote range adaptor objects ([range.adaptor.object]). as_char8_t produces as_char8_t_views, as_char16_t produces as_char16_t_views, and as_char32_t produces as_char32_t_views. Let as_charN_t denote any one of as_char8_t, as_char16_t, and as_char32_t, and let V denote the as_charN_t_view associated with that object. Let E be an expression and let T be remove_cvref_t<decltype((E))>. Let F be the format enumerator associated with as_charN_t. If decltype((E)) does not model utf_pointer<T> and if as_charN_t_view(E) is ill-formed, as_charN_t(E) is ill-formed. The expression as_charN_t(E) is expression-equivalent to:
If T is a specialization of empty_view ([range.empty.view]), then empty_view<format-to-type-t<F>>{}.
Otherwise, if T is an array type of known bound, then:
V(std::ranges::subrange(std::ranges::begin(E), --std::ranges::end(E)))
V(std::ranges::subrange(std::ranges::begin(E), std::ranges::end(E)))
Otherwise, V(std::views::all(E)).
[Example 1:
std::vector<int> path_as_ints = {U'C', U':', U'\x00010000'};
std::filesystem::path path = path_as_ints | as_char32_t | std::ranges::to<std::u32string>();
auto const& native_path = path.native();
if (native_path != std::wstring{L'C', L':', L'\xD800', L'\xDC00'}) {
return false;
}
— end example]
to_utfN_views views plus utf_view, and three as_charN_t_views
The views in std::ranges are constrained to accept only std::ranges::view template parameters. However, they accept std::ranges::viewable_ranges in practice, because they each have a deduction guide that looks like this:
template<class R>
to_utf8_view(R &&) -> to_utf8_view<views::all_t<R>>;
It’s not possible to make this work for any view that’s a class template accepting a template parameter other than the underlying view, because of the all-or-nothing nature of deduction guides. So we need separate to_utfN_views and separate as_charN_t_views instead of having them simply be alias templates for a hypothetical generic to_utf_view<ToType> or as_charN_t_view<ToType>, respectively.
as_charN_t_view is not implemented in terms of transform_view
This is because transform_view cannot be a borrowed_range, whereas as_charN_t_view can.
[P3117R0] attempted to extend transform_view to be conditionally borrowed, but its authors are not pursuing it further following concerns raised by SG9 in Tokyo 2024.
A previous revision of this paper proposed for standardization a project_view<V, F> view that would be like transform_view except that the transformation function would be an NTTP, enabling project_view to be a borrowed_range. However, this was removed because the NTTP prevents us from providing a views::all_t deduction guide as described in the previous section.
utf_view always transcodes, even in UTF-N to UTF-N cases
You might expect that if r in r | to_utfN is already in UTF-N, r | to_utfN might just be r. This is not what the to_utfN adaptors do, though.
The adaptors each produce a view utfv that stores a view of type V. Further, utfv.begin() is always a specialization of utf-iterator. utfv.end() is also a specialization of utf-iterator (if common_range<V>), or otherwise the sentinel value for V.
This gives r | to_utfN some nice, consistent properties. With the exception of empty_view<T>{} | to_utfN, the following are always true:
r | to_utfN produces well-formed UTF. This is true even when the input was already UTF-N. Remember, the input could have been UTF-N but had ill-formed UTF in it.
r | to_utfN has a consistent API. If r | to_utfN were sometimes r, and since r may be a reference to an array, you’d have to use std::ranges::begin(r) and std::ranges::end(r) all the time. However, you’d probably write r.begin() and r.end(), only to one day get bitten by an array-reference r.
Add the feature test macro __cpp_lib_unicode_transcoding.
No polls were taken during this review.
No polls were taken during this review.
POLL: Move null_sentinel_t to std:: namespace
SF | F | N | A | SA |
---|---|---|---|---|
1 | 3 | 1 | 0 | 0 |
# Of Authors: 1
Author’s Position: F
Attendance: 9 (4 abstentions)
Outcome: Consensus in favor
POLL: Remove null_sentinel_t::base member function from the proposal
SF | F | N | A | SA |
---|---|---|---|---|
0 | 4 | 1 | 0 | 0 |
# Of Authors: 1
Author’s Position: F
Attendance: 8 (3 abstentions)
Outcome: Consensus in favor
POLL: utf_iterator should be a separate type and not nested within utf_view
SF | F | N | A | SA |
---|---|---|---|---|
1 | 2 | 1 | 0 | 1 |
Attendance: 8 (3 abstentions)
# of Authors: 1
Author Position: F
Outcome: Weak consensus in favor
SA: Having a separate type complexifies the API
POLL: Separate std::null_sentinel_t from P2728 into a separate paper for SG9 and LEWG; SG16 does not need to see it again.
SF | F | N | A | SA |
---|---|---|---|---|
1 | 1 | 4 | 2 | 1 |
Attendance: 12 (3 abstentions)
Outcome: No consensus; author’s discretion for how to continue.
POLL: SG16 would like to see a version of P2728 without eager algorithms.
SF | F | N | A | SA |
---|---|---|---|---|
4 | 2 | 0 | 1 | 0 |
Attendance: 10 (3 abstentions)
Outcome: Consensus in favor
POLL: UTF transcoding interfaces provided by the C++ standard library should operate on charN_t types, with support for other types provided by adapters, possibly with a special case for char and wchar_t when their associated literal encodings are UTF.
SF | F | N | A | SA |
---|---|---|---|---|
5 | 1 | 0 | 0 | 1 |
Attendance: 9 (2 abstentions)
Outcome: Strong consensus in favor
Author’s note: More commentary on this poll is provided in the section “Discussion of whether transcoding views should accept ranges of char and wchar_t”. But note here that the authors doubt the viability of “a special case for char and wchar_t when their associated literal encodings are UTF”, since making the evaluation of a concept change based on the literal encoding seems like a flaky move; the literal encoding can change TU to TU.
No polls were taken during this review.
POLL: char32_t should be used as the Unicode code point type within the C++ standard library implementations of Unicode algorithms.
SF | F | N | A | SA |
---|---|---|---|---|
6 | 0 | 1 | 0 | 0 |
Attendance: 9 (2 abstentions)
Outcome: Strong consensus in favor
The most recent revision of this paper has a reference implementation called UtfView available on GitHub, which is a fork of Jonathan Wakely’s implementation of P2728R6 as an implementation detail for libstdc++.
Versions of the interfaces provided by previous revisions of this paper have also been implemented, and re-implemented, several times over the last 5 years or so, as part of a proposed (but not yet accepted!) Boost library, Boost.Text. Boost.Text has hundreds of stars on GitHub.
Both libraries have comprehensive tests.
iconv
This function transcodes until it finds an invalid or truncated sequence, erroring out if so and distinguishing those two cases using errno. It uses an out-parameter to point to the beginning of the invalid sequence.
struct iconv_t {};
// For the sake of simplicity, this iconv only converts between UTF-8 and UTF-32.
size_t iconv(iconv_t cd, const char** inbuf, size_t* inbytesleft, char** outbuf,
             size_t* outbytesleft) {
  if (!inbuf) {
    return 0;
  }
  if (inbuf && !*inbuf) {
    return 0;
  }
  assert(inbytesleft);
  assert(outbuf);
  assert(*outbuf);
  assert(outbytesleft);
  auto view = std::ranges::subrange(*inbuf, *inbuf + *inbytesleft) | std::uc::to_utf32;
  for (auto it = std::ranges::begin(view), end = std::ranges::end(view); it != end;) {
    if (it.success()) {
      if (*outbytesleft < sizeof(char32_t)) {
        errno = E2BIG;
        return static_cast<std::size_t>(-1);
      }
      char32_t c = *it;
      (*outbuf)[0] = static_cast<char>((c >> 24) & 0xFF);
      (*outbuf)[1] = static_cast<char>((c >> 16) & 0xFF);
      (*outbuf)[2] = static_cast<char>((c >> 8) & 0xFF);
      (*outbuf)[3] = static_cast<char>(c & 0xFF);
      *outbuf += sizeof(char32_t);
      *outbytesleft -= sizeof(char32_t);
      ++it;
      std::size_t bytes_converted = it.base() - *inbuf;
      *inbytesleft -= bytes_converted;
      *inbuf = it.base();
    } else {
      transcoding_error e = it.success().error();
      switch (e) {
      case transcoding_error::truncated_utf8_sequence: {
        errno = EINVAL;
      } break;
      case transcoding_error::unexpected_utf8_continuation_byte:
      case transcoding_error::overlong:
      case transcoding_error::encoded_surrogate:
      case transcoding_error::out_of_range:
      case transcoding_error::invalid_utf8_leading_byte: {
        errno = EILSEQ;
      } break;
      case transcoding_error::unpaired_high_surrogate:
      case transcoding_error::unpaired_low_surrogate: {
        std::unreachable();
      }
      }
      return static_cast<std::size_t>(-1);
    }
  }
  return 0;
}
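The four shifted stores in the function above write each code point as big-endian UTF-32 bytes. As a minimal sketch of that byte layout only (store_utf32be and load_utf32be are hypothetical helper names, not part of iconv or of this proposal), the following round-trips one code point through a four-byte buffer:

```cpp
#include <cassert>

// Hypothetical helper: store one code point as four big-endian bytes,
// using the same shifts as the iconv example.
inline void store_utf32be(char32_t c, char* out) {
  out[0] = static_cast<char>((c >> 24) & 0xFF);
  out[1] = static_cast<char>((c >> 16) & 0xFF);
  out[2] = static_cast<char>((c >> 8) & 0xFF);
  out[3] = static_cast<char>(c & 0xFF);
}

// Hypothetical inverse: reassemble the code point from four big-endian bytes.
inline char32_t load_utf32be(const char* in) {
  auto byte = [&](int i) {
    return static_cast<char32_t>(static_cast<unsigned char>(in[i]));
  };
  return (byte(0) << 24) | (byte(1) << 16) | (byte(2) << 8) | byte(3);
}
```

A caller reading the output buffer of the example would decode it in the same four-byte steps.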
u_strFromUTF8WithSub
This function transcodes until it finds an invalid sequence and if it does, it supports either erroring out or producing a substitution character of the user’s choice. It also supports pre-flighting to determine the required output buffer size, and relying on null termination if the user doesn’t supply the size of the input buffer.
constexpr char16_t* u_strFromUTF8WithSub(
    char16_t* dest, int32_t destCapacity, int32_t* pDestLength,
    const char* src, int32_t srcLength, char32_t subchar,
    int32_t* pNumSubstitutions, UErrorCode* pErrorCode) {
  if (*pErrorCode != U_ZERO_ERROR) {
    return nullptr;
  }
  // char32_t(-1) plays the role of ICU's U_SENTINEL: fail instead of substituting.
  const bool error_on_invalid = subchar == static_cast<char32_t>(-1);
  if ((src == nullptr && srcLength != 0) || srcLength < -1 || (destCapacity < 0) ||
      (dest == nullptr && destCapacity > 0) ||
      (!error_on_invalid &&
       (subchar > 0x10ffff || (0xD800 <= subchar && subchar <= 0xDFFF)))) {
    *pErrorCode = U_ILLEGAL_ARGUMENT_ERROR;
    return nullptr;
  }
  if (pNumSubstitutions != nullptr) {
    *pNumSubstitutions = 0;
  }
  auto impl =
    [&](auto view) {
      auto end = std::ranges::end(view);
      if (pDestLength) {
        // Pre-flight: count the UTF-16 code units the output needs.
        *pDestLength = 0;
        for (auto it = std::ranges::begin(view); it != end; ++it) {
          *pDestLength += it.success() ? 1 : (subchar > 0xFFFF ? 2 : 1);
        }
      }
      if (destCapacity == 0) {
        return dest;
      }
      char16_t* out_ptr = dest;
      for (auto it = std::ranges::begin(view); it != end; ++it) {
        auto write =
          [&](char16_t c) {
            *out_ptr = c;
            ++out_ptr;
            --destCapacity;
          };
        if (it.success()) {
          if (destCapacity == 0) {
            return dest;
          }
          write(*it);
        } else {
          if (error_on_invalid) {
            *pErrorCode = U_INVALID_CHAR_FOUND;
            return dest;
          } else {
            if (pNumSubstitutions != nullptr) {
              ++*pNumSubstitutions;
            }
            if (subchar > 0xFFFF) {
              std::array<char16_t, 2> subchar_utf16{};
              std::ranges::copy(std::array{subchar} | std::uc::to_utf16, subchar_utf16.data());
              write(subchar_utf16[0]);
              if (destCapacity == 0) {
                return dest;
              }
              write(subchar_utf16[1]);
            } else {
              write(static_cast<char16_t>(subchar));
            }
          }
        }
      }
      if (destCapacity > 0) {
        *out_ptr = char16_t{};
      }
      return dest;
    };
  if (srcLength == -1) {
    return impl(std::null_term(src) | std::uc::to_utf16);
  } else {
    return impl(std::ranges::subrange(src, src + srcLength) | std::uc::to_utf16);
  }
}
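The pre-flight count and the two-unit write path above both depend on a substitution character beyond U+FFFF occupying two UTF-16 code units. The following sketch spells out that arithmetic with the standard surrogate-pair formula, independent of the proposed views; utf16_length and to_surrogate_pair are hypothetical helper names, and both assume their argument is a valid Unicode scalar value.

```cpp
#include <array>
#include <cassert>

// Number of UTF-16 code units needed for a scalar value.
constexpr int utf16_length(char32_t cp) { return cp > 0xFFFF ? 2 : 1; }

// Encode a code point above U+FFFF as a high/low surrogate pair.
constexpr std::array<char16_t, 2> to_surrogate_pair(char32_t cp) {
  const char32_t v = cp - 0x10000;  // 20 significant bits remain
  return {static_cast<char16_t>(0xD800 + (v >> 10)),     // top 10 bits
          static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // bottom 10 bits
}

static_assert(utf16_length(U'\uFFFD') == 1);
static_assert(utf16_length(U'\U0001F600') == 2);
static_assert(to_surrogate_pair(U'\U0001F600')[0] == 0xD83D);
static_assert(to_surrogate_pair(U'\U0001F600')[1] == 0xDE00);
```

In the example function the same pair is obtained by transcoding a one-element array through to_utf16 rather than by hand.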
MultiByteToWideChar
This function transcodes until it finds an invalid sequence. If it does, it will error out if the user provides a flag; if this flag is not provided, the behavior depends on the OS. Before Windows Vista, it simply drops the invalid sequences; afterwards, it substitutes with U+FFFD. It also supports pre-flighting to determine the required output buffer size, and relying on null termination if the user doesn’t supply the size of the input buffer.
constexpr int MultiByteToWideChar(unsigned int CodePage, unsigned long dwFlags,
                                  const char* lpMultiByteStr, int cbMultiByte,
                                  wchar_t* lpWideCharStr, int cchWideChar) {
  (void)CodePage; // For simplicity we only implement CP_UTF8
  auto impl = [&](auto view) {
    auto end = std::ranges::end(view);
    if (cchWideChar == 0) {
#ifdef WINDOWS_XP
      int chars = 0;
      for (auto it = std::ranges::begin(view); it != end; ++it) {
        chars += it.success() ? 1 : 0;
      }
      return chars;
#else
      return static_cast<int>(std::ranges::distance(view));
#endif
    } else {
      wchar_t* out_ptr = lpWideCharStr;
      for (auto it = std::ranges::begin(view); it != end; ++it) {
        auto write =
          [&](auto c) {
            *out_ptr = static_cast<wchar_t>(c);
            ++out_ptr;
            --cchWideChar;
          };
        if (it.success()) {
          if (cchWideChar == 0) {
            SetLastError(ERROR_INSUFFICIENT_BUFFER);
            return 0;
          }
          write(*it);
        } else {
          if (dwFlags == MB_ERR_INVALID_CHARS) {
            SetLastError(ERROR_NO_UNICODE_TRANSLATION);
            return 0;
          }
#ifndef WINDOWS_XP
          if (cchWideChar == 0) {
            SetLastError(ERROR_INSUFFICIENT_BUFFER);
            return 0;
          }
          write(*it);
#endif
        }
      }
      return static_cast<int>(out_ptr - lpWideCharStr);
    }
  };
  if (cbMultiByte == -1) {
    if constexpr (sizeof(wchar_t) == 2) {
      return impl(std::null_term(lpMultiByteStr) | std::uc::to_utf16);
    } else {
      return impl(std::null_term(lpMultiByteStr) | std::uc::to_utf32);
    }
  } else {
    if constexpr (sizeof(wchar_t) == 2) {
      return impl(std::ranges::subrange(lpMultiByteStr, lpMultiByteStr + cbMultiByte) |
                  std::uc::to_utf16);
    } else {
      return impl(std::ranges::subrange(lpMultiByteStr, lpMultiByteStr + cbMultiByte) |
                  std::uc::to_utf32);
    }
  }
}
decode()
This is a C++ analog of Python’s decode function. It accepts a std::basic_string_view, transcodes it from UTF-8, and returns a new transcoded std::basic_string. If it encounters invalid UTF, it throws an exception that explains the problem and provides the position of the offending sequence.
template <typename FromChar, typename ToChar>
std::basic_string<ToChar> decode(std::basic_string_view<FromChar> input) {
  std::basic_string<ToChar> result;
  result.reserve(input.size()); // like what size_hint does
  auto view = input | to_utf<ToChar>;
  for (auto it = std::ranges::begin(view), end = std::ranges::end(view); it != end;
       ++it) {
    if (it.success()) {
      result.push_back(*it);
    } else {
      auto pos_curr = it.base() - input.begin();
      auto it2 = it;
      auto pos_next = (++it2).base() - input.begin();
      std::ostringstream ss;
      ss << "can't decode ";
      if (pos_next > pos_curr + 1) {
        ss << "characters";
      } else {
        ss << "character 0x" << std::hex
           << static_cast<unsigned int>(static_cast<unsigned char>(*it.base()))
           << std::dec;
      }
      ss << " in position " << pos_curr;
      if (pos_next > pos_curr + 1) {
        ss << "-" << pos_next - 1;
      }
      ss << ": ";
      ss << [&] {
        switch (it.success().error()) {
        case transcoding_error::truncated_utf8_sequence:
          return "unexpected end of data";
        case transcoding_error::unpaired_high_surrogate:
        case transcoding_error::unpaired_low_surrogate:
          return "illegal UTF-16 surrogate";
        case transcoding_error::unexpected_utf8_continuation_byte:
        case transcoding_error::invalid_utf8_leading_byte:
          return "invalid start byte";
        case transcoding_error::encoded_surrogate:
          if constexpr (std::same_as<FromChar, char32_t>) {
            return "code point in surrogate code point range(0xd800, 0xe000)";
          }
        case transcoding_error::overlong:
          if constexpr (std::same_as<FromChar, char32_t>) {
            return "code point not in range(0x110000)";
          }
        case transcoding_error::out_of_range:
          return "invalid continuation byte";
        }
        std::unreachable();
      }();
      throw std::runtime_error(std::move(ss).str());
    }
  }
  return result;
}
Zach Laine, for writing revisions one through six of the paper and implementing Boost.Text.
Jonathan Wakely, for implementing P2728R6, and design guidance.
Robert Leahy, for extensive design guidance including suggesting the error handling approach introduced in R7.
Gašper Ažman, for suggesting the use of std::expected<void, E>.