<div dir="ltr"><div dir="ltr"><div>Dear Henri,<br><br></div> Apologies for taking so long to get back to you; thank you so much for the detailed feedback. I'll do my best to answer everything. Thoughts are a bit scattered, so feel free to ask if something doesn't make any sense.<br><br></div><div> Thank you for taking the time to go through everything. This has been very helpful and I have a lot of work to do!<br><br></div><div>Best Wishes,</div><div>JeanHeyd<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Aug 17, 2019 at 3:51 PM Henri Sivonen <<a href="mailto:hsivonen@hsivonen.fi" target="_blank">hsivonen@hsivonen.fi</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Why is transliterating mentioned? It seems serious scope creep<br>
compared to character encoding conversion.<br></blockquote><div><br></div><div>Sorry; I need to go through and use the proper term -- transcoding -- for what is being done here. This paper intends to only concern itself with transcoding and a tiny bit of generalized text transformation (e.g., normalization).<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> 2.1. The Goal<br>
><br>
> int main () {<br>
> using namespace std::literals;<br>
> std::text::u8text my_text = std::text::transcode<std::text::utf8>(“안녕하세요 👋”sv);<br>
> std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable console<br>
<br>
This does not look like a compelling elevator pitch, since with modern<br>
terminal emulators, merely `fwrite` to `stdout` with a u8 string<br>
literal works.<br>
<br>
Here's what I'd like to see sample code for:<br>
<br>...<br></blockquote><div><br></div><div>I certainly need a wide body of examples, but that's not going to fit in the initial proposal. At least, not in that version; the next version (which will probably be published post-Belfast) will have much more implementation experience and projects behind it.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
I think it's fair to characterize the kernel32.dll conversion<br>
functions as "provably fast", but I find it weird to put iconv in that<br>
category. The most prominent implementation is the GNU one based on<br>
the glibc function of the same name, which prioritizes extensibility<br>
by shared object over performance. Based on the benchmarking that I<br>
conducted (<a href="https://hsivonen.fi/encoding_rs/#results" rel="noreferrer" target="_blank">https://hsivonen.fi/encoding_rs/#results</a>), I would not<br>
characterize iconv as "provably fast".<br></blockquote><div><br></div><div>That's fair, but it does cover a wide variety of encodings and is the backbone of many *nix programs and systems, including GCC. I should split that sentence into "provably fast" and "full of features".<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3. Design<br>
<br>
> study of ICU’s interface<br>
<br>
Considering that Firefox was mentioned in the abstract, it would make<br>
sense to study its internal API.<br>
</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
...</blockquote><div> </div><div>Thank you for the links. I have read some of these before, but not all of them. Most of what I have read is CopperSpice's API for encoding, libogonek's documentation and source, text_view's source and examples, Boost.Text's documentation and source, my own work, and many of the proposals that have come before this. I'll make sure to give a good look over the Firefox internals and the Rust transcoder you built.<br></div><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> Consider the usage of the low-level facilities laid out below:<br>
<br>
I think decoding to UTF-32 is not a good example if we want to promote<br>
UTF-8 as the application internal encoding. Considering that Shift_JIS<br>
tends to come up as a reason not to go UTF-8-only in various<br>
situations, I think showing conversion from Shift_JIS (preferably<br>
discovered dynamically at runtime) to UTF-8 would make more sense.<br></blockquote><div><br></div><div>Agreed, more examples are good.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> if (std::empty(result.input)) {<br>
> break;<br>
> }<br>
<br>
How does this take into account the input ending with an incomplete<br>
byte sequence?<br></blockquote><div><br></div><div>Error reporting is done by the error handler, which you commented on below. An incomplete sequence is caught by the encoding error check. The example in the paper does not include much handling: the default text error handler is the replacement text error handler, and an incomplete trailing sequence would blow up the example's assertion with a failure. I should include more examples of this.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> On top of eagerly consuming free functions, there needs to be views that allow a person to walk some view of storage with a specified encoding.<br>
<br>
I doubt that this is really necessary. I think post-decode Unicode<br>
needs to be iterable by Unicode scalar value (i.e. std::u8string_view<br>
and std::u16string_view should by iterable by char32_t), but I very<br>
much doubt that it's worthwhile to provide such iteration directly<br>
over legacy encodings. Providing such iteration competes over<br>
implementor attention with SIMD-accelerated conversion of contiguous<br>
buffers, and I think it's much more important to give attention to the<br>
latter.<br></blockquote><div><br></div><div>The goal is not to provide iteration over legacy encodings. The goal is to separate the algorithm (decoding/encoding text) from the storage (std::basic_string<char8_t>, __gnu_cxx::rope<char>, boost::unencoded_rope, trial::circular_buffer, sg14::ringspan, etc.). The basic algorithm -- if it requires no more than forward iterators and friends -- should work on those classes, iterators, and ranges. This provides greater flexibility of storage options for users and a robust composition story for algorithms. Having spent a small amount of time contributing to one standard library and observing the optimizations and specializations put into many of the already existing standard algorithms, I can assure you that no implementation will stop at only the default, base encoding versions. (And if one does, I have every intention of making sure at least the libraries I have the power to modify -- libstdc++ and libc++ -- are improved.)<br><br></div><div>There were also previous chances at potential optimization with things like wstring_convert, which took either pointers or just a basic_string<CharT> outright. The complaints about these functions were rife and heavy (most of them due to the dependence on std::locale and its heavily virtualized interface, but many of the implementations did not even implement it correctly (<a href="https://github.com/OpenMPT/openmpt/blob/master/common/mptString.cpp#L587" target="_blank">https://github.com/OpenMPT/openmpt/blob/master/common/mptString.cpp#L587</a> | <a href="https://sourceforge.net/p/mingw-w64/bugs/538/" target="_blank">https://sourceforge.net/p/mingw-w64/bugs/538/</a>), let alone with speed in mind).<br><br></div><div>Finally, I certainly agree that we want to focus on contiguous interfaces. 
But providing nothing at all for other storage types leaves a lot of users and use cases out in the cold and would require them to manually chunk, iterate, and poke at their code unit storage facilities. I plan to write a paper to the C Committee about providing at the very least low-level conversion utilities for mbs|w -> u8/16/32 and vice versa. Unicode-to-Unicode will probably remain the user's responsibility, since the C libs only own the narrow and wide locale conversions with their hidden state and thus should be responsible for providing nice conversions in and out of those. wchar_t is slightly problematic for a UTF-16 wchar_t (Windows, IBM) because wchar_t cannot be a multi-width encoding; it must be single-width by the standard, and many of its functions bake that assumption implicitly into their out-parameters and return types.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3.2.1. Error Codes<br>
<br>
> ...<br>
<br>
Is there really a use case for distinguishing between the types of<br>
errors beyond saying that the input was malformed and perhaps<br>
providing identification of which bytes were in error? Historically,<br>
specs have been pretty bad at giving proper error definitions for<br>
character encodings. The WHATWG Encoding Standard defines...<br></blockquote><div><br></div><div>I sought to create something that would be useful, but I realize that since the error handler receives the full state of the encoder/decoder it can likely do its own callouts for specific types of errors. I can agree that overlong_sequence, etc. might be too much, but the rest of them (insufficiently sized output buffer, incomplete code unit sequence, etc.) are all necessary, so a future revision might axe the "informational" error codes but keep the necessary ones.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> State& state;<br>
<br>
I think the paper could use more explanation of why it uses free<br>
functions with the state argument instead of encoder and decoder<br>
objects whose `this` pointer provides the state argument via the<br>
syntactic sugar associated with methods.<br></blockquote><div><br></div><div>This likely needs a lot more explanation and examples. It might also change in the future as I wrestle with dynamic encodings and non-default-constructible state types.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> 3.2.2.2. Implementation Challenge: Ranges are not the Sum of their Parts<br>
<br>
The paper doesn't go into detail into why Ranges are needed instead of<br>
spans.<br></blockquote><div><br></div><div>This part of the paper was cataloguing an implementation issue that has since been transferred to a different paper and likely to be solved soon: <a href="https://wg21.link/p1664" target="_blank">https://wg21.link/p1664</a> | <a href="https://thephd.github.io/reconstructible-ranges" target="_blank">https://thephd.github.io/reconstructible-ranges</a></div><div> <br></div><div>Ranges are used here because working with individual iterators will have consequences for encoding and decoding iterators. libogonek explored stacking such iterators on top of iterators for decoding and normalization: the result was not very register or cache friendly due to the sheer size of the resulting iterators (256+ bytes for a range in some cases). Ranges allow us to fix this with its concept of a "sentinel".<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> class assume_valid_handler;<br>
<br>
Is this kind of UB invitation really necessary?<br></blockquote><div><br></div><div>
I did something similar in my first private implementation and it had its use cases there as well. I've been told and shown that not re-checking invariants on data people already know is clean is useful and provides meaningful performance improvements in their codebases. I think if I write more examples showing where error handlers can be used, it would show that choosing such an error handler is an incredibly conscious decision at the end of a very verbose function call or template parameter: the cognitive cost of asking for UB is extraordinarily high when you want it (as it should be):<br><br></div><div><span style="font-family:monospace">std::u8string i_know_its_fine = std::text::transcode("abc", std::text::latin1{},
std::text::utf8{}, std::text::assume_valid_</span><span style="font-family:monospace">handler{});<br></span></div><div><br></div><div> I can imagine a world where adding "ub" to that name might make it more obvious what you're potentially
opening the door for.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> The interface for an error handler will look as such:<br>
><br>
> namespace std { namespace text {<br>
><br>
> class an_error_handler {<br>
> template <typename Encoding, typename InputRange,<br>
> typename OutputRange, typename State><br>
> constexpr auto operator()(const Encoding& encoding,<br>
> encode_result<InputRange, OutputRange, State> result) const {<br>
> /* morph result or throw error */<br>
> return result;<br>
> }<br>
<br>
I think this part needs a lot more explanation of how the error<br>
handler is allowed to modify the ranges and what happens if the output<br>
doesn't fit.<br></blockquote><div><br></div><div>The implementation does it better, but it's ugly: <a href="https://github.com/ThePhD/phd/blob/master/include/phd/text/error_handler.hpp#L44" target="_blank">https://github.com/ThePhD/phd/blob/master/include/phd/text/error_handler.hpp#L44</a><br><br></div><div>You can roll back the range's consumption if you like, you can insert characters into the stream then return, you can change the returned error code after inserting replacement characters, etc. It's a very flexible interface and it was designed to allow for custom behaviors without loss of (much) information when a person really wanted to dig into what happened. Templates make it look verbose and ugly, but I am working on a "simpler error handler" that just takes one or two callables and does something extremely simple (like returns an optional<code_point>, or lets you throw).<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> template <typename Encoding, typename InputRange,<br>
> typename OutputRange, typename State><br>
> constexpr auto operator()(const Encoding& encoding,<br>
> decode_result<InputRange, OutputRange, State> result) const {<br>
> /* morph result or throw error */<br>
> return result;<br>
> }<br>
<br>
Custom error handlers for decoding seem unnecessary. Are there truly<br>
use cases for behaviors other than replacing malformed sequences with<br>
the REPLACEMENT CHARACTER or stopping conversion upon discovering the<br>
first malformed sequence?<br>
<br>
> Throwing is explicitly not recommended by default by prominent vendors and<br>
> implementers (Mozilla, Apple, the Unicode Consortium, WHATWG, etc.)<br>
<br>
I don't want to advocate throwing, but I'm curious: What Mozilla and<br>
Apple advice is this referring to? Or is this referring to the Gecko<br>
and WebKit code bases prohibiting C++ exceptions in general?<br></blockquote><div><br></div><div>From looking at both private and public codebases, and from speaking to implementers and developers in SG16 telecons and otherwise. But I should provide more direct citations and quotes here rather than just throwing the information out there; apologies, I'll make sure to improve that for r1 so it's more properly sourced and accurate.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> For performance reasons and flexibility, the error callable must have a way<br>
> to ensure that the user and implementation can agree on whether or not we<br>
> invoke Undefined Behavior and assume that the text is valid.<br>
<br>
The ability to opt into UB seems dangerous. Are there truly compelling<br>
use cases for this?<br></blockquote><div> </div><div>I have a handful of experiences where avoiding the checks (and encoding/decoding without those checks) provided a measurable speedup. At the time, we had not optimized our functions using vectorized instructions: maybe doing so would have made such a thing moot. I'll see what the benchmarks say.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3.2.3. The Encoding Object<br>
<br>
> using code_point = char32_t;<br>
<br>
This looks bad. As I've opined previously<br>
(<a href="https://hsivonen.fi/non-unicode-in-cpp/" rel="noreferrer" target="_blank">https://hsivonen.fi/non-unicode-in-cpp/</a>), I think this should not be<br>
a parameter. Instead, all encodings should be considered to be<br>
conceptually decoding to or encoding from Unicode and char32_t should<br>
be the type for a Unicode scalar value.<br></blockquote><div><br></div><div>A lot of people have this comment. I am more okay with having code_point be a parameter, with the explicit acknowledgement that if someone uses not-char32_t (not a Unicode Code Point), then nothing above the encoding level in the standard will work for them (no normalization, no segmentation algorithms, etc.). I have spoken to enough people who want to provide very specific encoding stories for legacy applications where this would help. Even if the encoding facilities work for them, I am very okay with letting them know that -- if they change this fundamental tenet -- they will lock themselves out of the rest of the algorithms, the ability to transcode with much else, and basically the rest of text handling in the Standard.<br><br></div><div>They get to decide whether or not that's a worthwhile trade. A goal of working on all this is to make it so they are extremely squeamish about making that choice, but if they are informed enough, or have a special use case, they can make the trade-off knowingly. Encoding objects are incredibly low-level and once the dust settles will likely only be written by one person or one team in a given org, or in a publicly available library (e.g., a WHATWG-encoding library for C++). They should be able to make the tradeoff if they care, but the standard won't support them: it's a fairly steep punishment.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> using code_unit = char;<br>
<br>
I think it would be better to make the general facility deal with<br>
decode from bytes and encode two bytes only and then to provide<br>
conversion from wchar_t, char16_t, or char32_t to UTF-8 and from<br>
wchar_t and char32_t to UTF-16 as separate non-streaming functions.<br>
<br>
> using state = __ex_state;<br>
<br>
Does this imply that the same state type is used for encode and<br>
decode? That's odd. </blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Also, conceptually, it seems odd that the state is held in an<br>
"encoding" as opposed to "decoder" and "encoder".<br></blockquote><div><br></div><div>
<div>I do need to look into having a clear delineation, or perhaps even separating all encoding objects into encoders and decoders. I haven't had the time to justify a full split, so maybe just separating the state types will be best.</div>
<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> static constexpr size_t max_code_unit_sequence = MB_LEN_MAX;<br>
<br>
Does there exist an encoding that is worthwhile to support and for<br>
which this parameter exceeds 4? Does this value need to be<br>
parameterized instead of being fixed at 4?<br></blockquote><div><br></div><div><span style="font-family:monospace">MB_LEN_MAX</span> on Windows reports 5, but that might be because it includes the null terminator, so maybe there is no implementation where it exceeds 4?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Why aren't there methods for querying for the worst-case output size<br>
given input size and the current conversion state?<br></blockquote><div><br></div><div>This was commented on before, and I need to add it. Almost all encoding functionality today has it (usually by passing nullptr into the function).<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> static constexpr size_t max_code_point_sequence = 1;<br>
<br>
Is this relevant if decode is not supported to UTF-32 and is only<br>
supported to UTF-8 and to UTF-16? A single Big5 byte sequence can<br>
decode into two Unicode scalar values, but it happens that a single<br>
Big5 byte sequence cannot to decode into more than 4 UTF-8 code units<br>
or into more than 2 UTF-16 code units, which are the normal limits for<br>
single Unicode scalar values in these encoding forms.<br>
<br>
> // optional<br>
> using is_encoding_injective = std::false_type;<br>
<br>
Does this have a compelling use case?<br>
<br>
> // optional<br>
> using is_decoding_injective = std::true_type;<br>
<br>
Does this have a compelling use case?<br></blockquote><div><br></div><div>This is part of a system wherein users will get a compile-time error for any lossy transcoding they attempt. As in the example, ASCII is perfectly fine decoding into Unicode Scalar Values. Unicode Scalar Values are NOT fine with being encoded into ASCII. Therefore, the following should loudly yell at you at compile-time, not run-time:<br><br></div><div><span style="font-family:monospace">
auto this_is_not_fine = std::text::transcode(U"☢️☢️", std::text::ascii{});</span></div><div><span style="font-family:monospace">// static assertion failed: blah blah blah</span><br><br></div><div>The escape hatch is to provide the non-default text encoding handler:<br><br><span style="font-family:monospace">
auto still_not_fine_but_whatever = std::text::transcode(U"☢️☢️", std::text::utf32{}, std::text::ascii{}, std::text::replacement_error_handler{});</span></div><div><span style="font-family:monospace">// alright, it's your funeral...</span></div><div><br></div><div>This is powered by the typedefs noted above. is_(decoding/encoding)_injective informs the implementation whether or not an encoding can perfectly encode from code points to code units and vice versa. If it can't, the library will loudly scold you if you use a top-level API that does not explicitly pass an error handler; passing one is your way of saying "I know what I'm doing, shut up stdlib".<br></div><div><br>
</div><div>I have built this into an API before, and it was helpful in stopping people from automatically converting text that was bad. See an old presentation I did on the subject when I first joined SG16 while it was still informal-text-wg: <a href="https://thephd.github.io/presentations/unicode/sg16/2018.03.07 - ThePhD - a rudimentary unicode abstraction.pdf">https://thephd.github.io/presentations/unicode/sg16/2018.03.07 - ThePhD - a rudimentary unicode abstraction.pdf</a>. It was mildly successful in smacking people's hands when they wanted to do e.g. utf8 -> latin1, and made them think twice. The feedback was generally positive. I had a different API back then, using converting constructors. I don't think anyone in the standards community would be happy with converting constructors, but the same principles apply to the encode/decode/transcode functions.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> // optional<br>
> code_point replacement_code_point = '0xFFFD';<br>
<br>
What's the use case for this? ...<br>
<br>
> // optional<br>
> code_unit replacement_code_unit = '?';<br>
<br>... That is, what's the use case for this?<br></blockquote><div><br></div><div>Not everyone uses ? or U+FFFD as their replacement, and that's pretty much the sole reason. Whether or not we want to care about those use cases is another question, and it certainly makes my life easier to toss it out the window.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> // encodes exactly one full code unit sequence<br>
> // into one full code point sequence<br>
> template <typename In, typename Out, typename Handler><br>
> encode_result<In, Out, state> encode(<br>
> In&& in_range,<br>
> Out&& out_range,<br>
> state& current_state,<br>
> Handler&& handler<br>
> );<br>
><br>
> // decodes exactly one full code point sequence<br>
> // into one full code unit sequence<br>
> template <typename In, typename Out, typename Handler><br>
> decode_result<In, Out, state> decode(<br>
> In&& in_range,<br>
> Out&& out_range,<br>
> state& current_state,<br>
> Handler&& handler<br>
> );<br>
<br>
How do these integrate with SIMD acceleration?<br></blockquote><div> </div><div>They don't. The std::text::decode/encode/transcode free functions are where specializations for fast processing are meant to kick in. The member functions are the basic, one-at-a-time operations; this ensures any given storage can be iterated over by a basic encoding object. A bit more information about the different optimization paths can be found in a small presentation I gave the Committee about what this paper is trying to do, among a few other things: <a href="https://thephd.github.io/presentations/unicode/sg16/K%C3%B6ln/ThePhD%20-%20K%C3%B6ln%202019%20Standards%20C++%20Meeting%20-%20Catch%20Up.pdf">https://thephd.github.io/presentations/unicode/sg16/K%C3%B6ln/ThePhD%20-%20K%C3%B6ln%202019%20Standards%20C++%20Meeting%20-%20Catch%20Up.pdf</a><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> static void reset(state&);<br>
<br>
What's the use case for this as opposed to constructing a new object?<br></blockquote><div> </div><div>Not any good use case, really: I should likely change this to just let someone default-construct and copy over the state. I have to reconcile this design with encoding objects and states that should conceivably be non-default-constructible (e.g., ones that hold a string or enumeration value indicating which encoding to use, plus any intermediate conversion state). The minimum API of "state" needs to be fleshed out more.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3.2.3.1. Encodings Provided by the Standard<br>
<br>
> namespace std { namespace text {<br>
><br>
> class ascii;<br>
> class utf8;<br>
> class utf16;<br>
> class utf32;<br>
> class narrow_execution;<br>
> class wide_execution;<br>
<br>
This is rather underwhelming for an application developer wishing to<br>
consume Web content or email.<br></blockquote><div><br></div><div>I agree, but the entirety of what the WHATWG wants should likely be provided by an external library that can keep up with the cadence of changes. The standard moves slower than a ball of molasses going uphill, and people are even slower to port despite C++ gaining significant speed in recent standardization efforts and releases. (The number of people on GCC 4.x and old no-longer-LTS versions of many Linux distributions for "legacy reasons" is eye-popping and staggering.) We pick the encodings that:<br><br></div><div>1) the standard is already responsible for in its entirety (narrow/wide);<br></div><div>2) are old-as-dirt standard and will not change anytime in my lifetime (utf8, utf32, and utf16 if aliens don't show up and overflow the allotted 21 bits with their new languages);</div><div>3) are old-as-dirt standard and provide reasonable speed gains if the standard can optimize for them (ascii).<br><br></div><div>Having only these encodings also means that optimizations are much more feasible for standard library developers once this paper lands, rather than implementing the full suite of WHATWG encodings.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
On the other hand, the emphasis of the design presented in this paper<br>
being compile-time specializable seems weird in connection to<br>
`narrow_execution`, whose implementation needs to be dispatched at<br>
runtime. Presenting a low-level compile-time specializable interface<br>
but then offering unnecessarily runtime-dispatched encoding through it<br>
seems like a layering violation.<br>
<br>
> If an individual knows their text is in purely ASCII ahead of time and they work in UTF8, this information can be used to bit-blast (memcpy) the data from UTF8 to ASCII.<br>
<br>
Does one need this API to live dangerously with memcpy? (As opposed to<br>
living dangerously with memcpy directly.)<br></blockquote><div><br></div><div>The idea is that the implementation can safely memcpy because it has compile-time information indicating that it may. Without that information, it can't make that guarantee; I want to give the implementation that guarantee.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3.2.3.2. UTF Encodings: variants?<br>
<br>
> both CESU-8 and WTF-8 are documented and used internally for legacy reasons<br>
<br>
This applies also to wide_execution, utf16, and utf32. (I wouldn't be<br>
surprised if WTF-8 surpassed UTF-32 in importance in the future.)<br>
<br>
I'm not saying that CESU-8 or WTF-8 should be included, but I think<br>
non-byte-code-unit encodings don't have good justification for being<br>
in the same interface that is used for consuming external data<br></blockquote><div><br></div><div>I will try to think of ways to separate the two APIs. I don't have many ideas for this yet.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> More pressingly, there is a wide body of code that operates with char as the code unit for their UTF8 encodings. This is also subtly wrong, because on a handful of systems char is not unsigned, but signed.<br>
<br>
This is a weird remark. Signed char is a misfeature generally and bad<br>
for a lot of processing besides UTF-8. <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> template <typename CharT, bool encode_null, bool encode_lone_surrogates><br>
<br>
I don't think it's all clear that it's a good idea in terms of<br>
application developer understanding of the issues to enable "UTF-8" to<br>
be customized like this is supposed to having WTF-8 as a separate<br>
thing.<br></blockquote><div><br></div><div>That's a fair assessment. Some individuals in the meeting where I did a presentation about this were very keen to have the ability to customize how UTF-8 is handled without having to rewrite the entirety of it themselves.<br><br>char is -- much to my compiler-aliasing-analysis pain -- still being used. Some people have <span style="font-family:monospace">std::is_unsigned_v<char> == true</span> for their platform, and all they care about is their platform, and they write all their UTF8 code using char, and they were already lining up to throw a hissy fit for their big legacy applications and interoperable codebases if we forced the encoding object to use char8_t exclusively, all the time. I think char is used far too much and should be swiftly burned out of a lot of APIs, but given the sheer magnitude of current-generation code, not giving these individuals an escape hatch is the swiftest way to burn a compatibility bridge.</div><div><br></div><div>But maybe we should burn it...<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> This is a transformative encoding type that takes the source (network) endianness<br>
<br>
Considering that the "network byte order" is IETF speak for "big<br>
endian", I think it's confusing to refer to whatever you get from an<br>
external source in this manner.<br></blockquote><div><br></div><div>I will change the wording to just keep "source".<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> This paper disagrees with that supposition and instead goes the route of providing a wrapping encoding scheme<br>
<br>
FWIW, I believe that conversion from wchar_t, char16_t, or char32_t<br>
into UTF-8 should not be forced into the same API as conversion from<br>
external byte-oriented data sources, and I believe that it's<br>
conceptually harmful to conflate the char16_t-oriented UTF-16 with the<br>
byte-oriented UTF-16BE and UTF-16LE external encodings.<br></blockquote><div><br></div><div> <span style="font-family:monospace">encoding_scheme<utfX></span> will require the input value_type to be std::byte for decoding, and the output value_type will be std::byte for encoding. std::byte has no implicit conversions, so it's a hard error to give it anything that is not an input or output range of exactly std::byte.
(For the default implementation, anyhow.<span> </span>The template allows you to change the <span style="font-family:monospace">Byte </span>type used in encoding_scheme, but at that point you've asked for the lack of strict safety and it's your problem now.)
That alleviated my and other people's safety concerns, and <span style="font-family:monospace">encoding_scheme</span> has already seen successful implementation experience.
</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3.2.4. Stateful Objects, or Stateful Parameters?<br>
<br>
> maintains that encoding objects can be cheap to construct, copy and move;<br>
<br>
Surely objects that have methods (i.e. state is taken via `this`<br>
syntactic sugar) can be cheap to construct, copy, and move.<br>
<br>
> improves the general reusability of encoding objects by allowing state to be massaged into certain configurations by users;<br>
<br>
It seems to me that allowing application developers to "massage" the<br>
state is an anti-feature. What's the use case for this?<br>
<br>
> and, allows users to set the state in a public way without having to prescribe a specific API for all encoders to do that.<br>
<br>
Likewise, what's the use case for this?<br></blockquote><div><br></div><div> The goal here is to handle locale dependency and dynamic encodings. We want to keep the encoding object itself cheap to create and use, while putting any heavy lifting inside the state object, which gets passed around explicitly by reference in the low-level API. Encoding tags, incredibly expensive locale objects, and more can all be placed on the state itself, while the encoding object serves as the cheap handle that allows working with such a state.<br><br></div><div> I would be interested in pursuing the alternate design where the encoding object just holds all the state, all the time. This means spinning up a fresh encoding object any time state needs to change, but I can imagine it still amounting to the same level of work in many cases. I am mildly concerned that doing so would front-load things like locale changes into the Encoding Object's API, or mandate certain constructor forms. This same thing happened to wstring_convert, the codecvt_X facets, and friends, so I am trying my best to avoid that pitfall. It also brings up a very concerning point: if "state" has special members that can't be easily reconstructed or even copied, how do we handle copying an encoding from one text object to another? Separating the state keeps it tractable and controllable; not separating it means all of the copyability, movability, etc. becomes the encoding's concern.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
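</blockquote><div><br></div><div>In rough shape, the split looks like this (names here are sketches of the design, not proposed spellings):<br></div>

```cpp
#include <cassert>

// Sketch of the split being described: the encoding object is an empty,
// freely copyable handle, while anything heavy or mutable (locale handles,
// shift state, mapping tables) lives in a state object the caller owns
// and passes by reference into the low-level encode/decode calls.
struct shift_state {
    bool in_shifted_mode = false; // e.g. ISO-2022-style mode tracking
    // expensive members (locale objects, tables) would also go here
};

struct some_stateful_encoding {
    using state = shift_state;
    // a real encoding would expose something like:
    //   auto encode_one(Input, Output, state&, ErrorHandler);
    // note: state& is an explicit parameter, not a member of *this
};
```

<div>The handle itself stays trivially cheap to construct, copy, and move, no matter how heavy the state gets.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">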
> As a poignant example: consider the case of execution encoding character<br>
> sets today, which often defer to the current locale. Locale is inherently<br>
> expensive to construct and use: if the standard has to have an encoding<br>
> that grabs or creates a codecvt or locale member, we will immediately lose<br>
> a large portion of users over the performance drag during construction of<br>
> higher-level abstractions that rely on the encoding. It is also notable that<br>
> this is the same mistake std::wstring_convert shipped with and is one of<br>
> the largest contributing reasons to its lack of use and subsequent<br>
> deprecation (on top of its poor implementation in several libraries, from<br>
> the VC++ standard library to libc++).<br>
<br>
As noted, trying to provide a compile-time specialized API that<br>
provides access to inherently runtime-discovered encodings seems like<br>
a layering violation. Maybe the design needs to surface the<br>
dynamically dispatched nature of these encodings and to see what that<br>
leads to in terms of the API design.<br></blockquote><div> </div><div> The encoding object API is a concept. The member types and definitions required at compile time are the ones relating to code units and code points, but nothing prevents the user from making code_unit = byte; and code_point = unicode_code_point; -- this is how the desired encoding_scheme<...> type will serialize between an Encoding Object and a byte-based representation suitable for network transmission. Nothing stops the API from being pushed to runtime by making all of the functions virtual on the Encoding Object; in fact, that is exactly how I plan to implement an iconv-based example.</div><div><br></div><div> We'll see how it pans out. :D<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
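</blockquote><div><br></div><div>A minimal sketch of that runtime direction (illustrative names only; this is not the iconv wrapper itself):<br></div>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: the compile-time concept can be modeled by a type whose members
// are virtual, so runtime-discovered encodings (e.g. iconv-backed ones)
// dispatch dynamically while still plugging into the same machinery.
struct runtime_encoding {
    using code_unit  = std::byte; // serialized form is just bytes
    using code_point = char32_t;
    virtual ~runtime_encoding() = default;
    virtual const char* name() const = 0; // e.g. what iconv_open() was given
};

struct ascii_encoding final : runtime_encoding {
    const char* name() const override { return "US-ASCII"; }
};
```

<div>The compile-time member types stay fixed (bytes in, code points out), while the actual conversion logic is free to be chosen at runtime.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">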
> 3.2.4.1. Self-Synchronizing State<br>
<br>
> If an encoding is self-synchronizing, then at no point is there a need to refer to a "potentially correct but need to see more" state: the input is either wholly correct, or it is not.<br>
<br>
Is this trying to say that a UTF-8 decoder wouldn't be responsible for<br>
storing the prefix of buffer-boundary-crossing byte sequences into its<br>
internal state and it would be the responsibility of the caller to<br>
piece the parts together?<br></blockquote><div><br></div><div> The purpose of this is to indicate that a state has no "holdover" between encoding calls. Whatever it encodes or decodes results in a complete sequence, and incomplete sequences are left untouched (or a replacement character is encoded and the sequence is skipped over, depending on the error handler, etc.). This means that function calls end up being "pure" from the point of view of the encoder and decoder. There were some useful bits here for detecting when state can be thrown out the window and created on the fly by the implementation, rather than needing to be preserved. A micro-optimization, at best, and likely something that won't be pursued until most of the other concerns the paper is trying to tackle are polished up.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 3.3.1. Transcoding Compatibility<br>
<br>
What are the use cases for this? I suggest treating generalized<br>
transcoding as a YAGNI matter, and if someone really needs it, letting<br>
them pivot via UTF-8 or UTF-16.<br></blockquote><div><br></div><div> The point of this section is to allow for encodings to clue the implementation in as to whether or not just doing `memcpy` or similar is acceptable. If my understanding is correct, a lot of the GBK encodings are bitwise compatible with GB18030. It would make sense for an implementation to speedily copy this into storage rather than have to roundtrip through transcoding.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
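</blockquote><div><br></div><div>The opt-in might look something like this (trait and tag names are hypothetical, and the GBK/GB18030 compatibility is my understanding rather than a guarantee):<br></div>

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Sketch: a trait an encoding pair can specialize to declare "any valid
// sequence in From is bit-identical in To", letting transcode() collapse
// into a straight copy instead of a decode/encode round trip.
template <typename From, typename To>
struct is_bitwise_transcoding_compatible : std::false_type {};

struct gbk {};
struct gb18030 {};

// one-directional opt-in: GBK text is also valid GB18030 text
template <>
struct is_bitwise_transcoding_compatible<gbk, gb18030> : std::true_type {};

template <typename From, typename To>
std::vector<unsigned char> transcode(const std::vector<unsigned char>& in) {
    if constexpr (is_bitwise_transcoding_compatible<From, To>::value) {
        return in; // fast path: bitwise copy, no per-character work
    } else {
        // a real implementation would decode From and re-encode as To here;
        // this sketch only demonstrates the compile-time dispatch
        return in;
    }
}
```

<div>Note the trait is deliberately one-directional: GB18030 text is not, in general, valid GBK, so the reverse pair stays on the slow path.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">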
<br>
> 3.3.2. Eager, Fast Functions with Customizability<br>
<br>
> Users should be able to write fast transcoding functions that the standard picks up for their own encoding types. From GB1032<br>
<br>
Is 1032 the intended number here?<br></blockquote><div><br></div><div> Nope; this should be GB18030. Thanks.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> WHATWG encodings<br>
<br>
If it is the responsibility of the application developer to supply an<br>
implementation of the WHATWG Encoding Standard, to my knowledge the<br>
fastest and most correct option is to use encoding_rs via C linkage.<br>
<br>
In that case, what's the value proposition for wrapping it in the API<br>
proposed by this paper as opposed to using the API from<br>
<a href="https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h" rel="noreferrer" target="_blank">https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h</a><br>
updated with C++20 types (notably std::span and char8_t) directly?<br></blockquote><div><br></div><div> The purpose of wrapping this API is to make it standard so that everyone doesn't have to keep reimplementing it. It means that everyone can write the code in one way and everyone gets the same optimizations, similarly to how I've demonstrated that by having a single bit_iterator/bit_view range abstraction, the standard library (or the user) can optimize it and everyone else can benefit: <a href="https://thephd.github.io/seize-bits-production-gsoc-2019" target="_blank">https://thephd.github.io/seize-bits-production-gsoc-2019</a><br><br></div><div> The reason we don't just want to have span<T> interfaces is because of the same flexibility iterators have bought us over time. That doesn't mean your encoding object must be templated or deal with non-contiguous storage: I wrote a (mock) encoding object here that only works with vector<T>, basic_string<T>, and other contiguous containers by hardcoding span: <a href="http://www.open-std.org/pipermail/unicode/2019-August/000633.html">http://www.open-std.org/pipermail/unicode/2019-August/000633.html</a></div><div><br></div><div> As noted in that e-mail, hard-coding such things means you can't have deque<T> or rope<T> or gap_buffer<T> or whatever other kind of non-contiguous storage, but if you know your payload is always contiguous then you'll never hit an error. When you do hit a compiler error, you can make the decision about where you want to apply the flexibility. The standard library should serve everyone's needs, but there should be room -- and there will be room -- to slim things down to just what you're interested in.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Regardless of the value proposition, writing the requisite glue code<br>
could be a useful exercise to validate the API proposed in this paper:<br>
If the glue code can't be written, is there a good reason why not?<br>
Does the glue code foil the SIMD acceleration?<br>
<br>
Also, to validate the API proposed here, it would be a good exercise<br>
to encode, "with replacement" in the Encoding Standard sense, a string<br>
consisting of the three Unicode scalar values U+000F, U+2603, and<br>
U+3042 into ISO-2022-JP and to see what it takes API-wise to get the<br>
Encoding Standard-compliant result.<br></blockquote><div> </div><div> I will add an issue to attempt exactly that (after I move the implementation to a more easily-accessible standalone repository).<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> 4. Implementation<br>
<br>
> This paper’s r2 hopes to contain more benchmarks<br>
<br>
I'd be interested in seeing encoding_rs (built with SIMD enabled, both<br>
on x86_64 and aarch64) included in the benchmarks. (You can grab build<br>
code from <a href="https://github.com/hsivonen/recode_cpp/" rel="noreferrer" target="_blank">https://github.com/hsivonen/recode_cpp/</a> , but `cargo<br>
--release` needs to be replaced with `cargo --release --features<br>
simd-accel`, which requires a nightly compiler, to enable SIMD.)<br></blockquote><div> </div><div>I will make sure that's part of the benchmarks when I move the implementation to a standalone repository.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> [WTF8]<br>
> Simon Sapin. The WTF-8 encoding. September 26th, 2019. URL: <a href="https://simonsapin.github.io/wtf-8/" rel="noreferrer" target="_blank">https://simonsapin.github.io/wtf-8/</a><br>
<br>
That date can't be right.<br></blockquote><div><br></div><div> Yep, thanks for catching that. I'll fix it in the latest draft.<br><br></div><div> Hopefully this was informative enough. I've read this over a few times, but I might have dropped a sentence or word or two here and there. I'll do my best to furnish a new paper including all of the feedback, changing the APIs where applicable.<br><br></div><div> Thank you so much for your time and effort in this.<br></div><br></div></div>