<html><head></head><body>I don't think this code_unit_sequence is useful. The focus on endianness is misguided, IMO.<br>
<br>
There's no reason to convert encoding schemes to encoding forms (transcoding notwithstanding). The encoding schemes from the Unicode standard that we need to support are UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32LE, and UTF-32BE. Variants with BOMs are possible too, but they need a different design because a BOM effectively makes the encoding stateful, so let's leave them out for now.<br>
<br>
There's no byte level. The lowest level that is useful is code units. It just happens that some encodings (e.g. UTF-8, UTF-16LE) have bytes as code units. Everything is code units.<br>
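<br>
To make that concrete, here is a rough sketch of what I mean, not a proposed interface. All the names (utf16le, utf16be, decoded, decode_one, decode_all, check_example) are made up for illustration. Each encoding declares its own code unit type and decodes code points straight from those code units, so the byte order lives in the encoding and nothing else needs an endianness parameter.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding tags (not from any proposal). Each one fixes its own
// byte order, so no separate endianness parameter is needed.
struct utf16le {
    using code_unit = std::uint8_t;  // assumption: the code units are bytes
    // Join two code units into one 16-bit value, low byte first.
    static constexpr std::uint16_t word(const code_unit* p) {
        return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    }
};

struct utf16be {
    using code_unit = std::uint8_t;
    // Join two code units into one 16-bit value, high byte first.
    static constexpr std::uint16_t word(const code_unit* p) {
        return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
    }
};

struct decoded {
    char32_t code_point;
    std::size_t units_consumed;
};

constexpr char32_t replacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Decode one code point from the n > 0 code units starting at p.
template <class Encoding>
constexpr decoded decode_one(const typename Encoding::code_unit* p, std::size_t n) {
    if (n < 2) return {replacement, n};  // truncated input
    const std::uint16_t first = Encoding::word(p);
    if (first < 0xD800 || first > 0xDFFF) {
        return {static_cast<char32_t>(first), 2};  // not a surrogate
    }
    if (first <= 0xDBFF && n >= 4) {  // high surrogate with room for a pair
        const std::uint16_t second = Encoding::word(p + 2);
        if (second >= 0xDC00 && second <= 0xDFFF) {
            return {static_cast<char32_t>(0x10000u + ((first - 0xD800u) << 10) +
                                          (second - 0xDC00u)),
                    4};
        }
    }
    return {replacement, 2};  // unpaired surrogate
}

// Decode a whole sequence of code units into code points.
template <class Encoding>
std::vector<char32_t> decode_all(const std::vector<typename Encoding::code_unit>& units) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < units.size();) {
        const decoded d = decode_one<Encoding>(units.data() + i, units.size() - i);
        out.push_back(d.code_point);
        i += d.units_consumed;
    }
    return out;
}

// The same text (U+0041, U+1F600) stored in both schemes decodes identically.
inline void check_example() {
    const std::vector<std::uint8_t> le{0x41, 0x00, 0x3D, 0xD8, 0x00, 0xDE};
    const std::vector<std::uint8_t> be{0x00, 0x41, 0xD8, 0x3D, 0xDE, 0x00};
    assert(decode_all<utf16le>(le) == decode_all<utf16be>(be));
}
```

The caller only ever names the encoding, e.g. decode_all&lt;utf16le&gt;(units), and never mentions machine endianness or byte swapping.<br>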
<br>
This code_unit_sequence *might* be useful as an implementation detail, but not so much as a user interface. All it does is abstract away endianness when that is already abstracted by the encoding schemes themselves, like UTF-16LE.<br><br><div class="gmail_quote">On June 18, 2018 8:41:00 PM GMT+02:00, Lyberta <lyberta@lyberta.net> wrote:<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<pre class="k9mail">Zach Laine:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 1ex 0.8ex; border-left: 1px solid #729fcf; padding-left: 1ex;"> This is certainly the right venue. Do you have an interface in mind?<br> Posting a synopsis could start things moving.<br> <br> Zach<br></blockquote><br>code_unit_sequence works on layer 0 - bytes - and provides iterators to<br>layer 1 - code units. The intended use case is working with UTF-16 and<br>UTF-32 where endianness of stored units is not equal to machine<br>endianness and byte-swapping everything in advance is too slow. You will<br>use template metaprogramming to support all encodings and endiannesses.<br><br>Synopsis would be something like:<br><br>template <TextEncoding TE, std::endian Endianness = std::endian::native,<br>typename Allocator = std::allocator<std::byte>><br>class code_unit_sequence;<br><br>It will have an interface similar to your boost::string from Boost.Text<br>and have random access iterators that would return a proxy type<br>convertible to char8_t for UTF-8, char16_t for UTF-16 and char32_t for<br>UTF-32. Maybe another template parameter for invalid code unit handling.<br><br>Also there will be a concept named CodeUnitSequence that requires an<br>interface similar to std::code_unit_sequence. I think both<br>std::vector<[w]char[8,16,32]_t> and std::basic_string should satisfy<br>that concept.<br><br>std::code_point_sequence works on layer 1 - code units - and provides<br>iterators to layer 2 - code points. It will take a type that satisfies<br>CodeUnitSequence and use it for memory management. It will provide<br>bidirectional iterators that return a proxy type convertible to char32_t.<br>The iterators will be complex because a single code point can consist<br>of a different number of code units, so assignment may lead to reallocation<br>of the underlying buffer and invalidation of some iterators. 
I guess that<br>will break some std algorithms but that's the reality we will have to<br>deal with.<br><br>Synopsis would be something like:<br>template <CodeUnitSequence Container, TextEncoding ET =<br>std::default_encoding_type_t<Container>><br>class code_point_sequence;<br><br>Of course, there will be corresponding view types.<br><br>I have implemented my own version of code_point_sequence and<br>code_point_sequence_view here:<br><a href="https://gitlab.com/ftz/unicode">https://gitlab.com/ftz/unicode</a><br><br>Of course, then we have std::text that would take CodePointSequence and<br>provide grapheme cluster iterators. My free time was not enough to<br>implement grapheme cluster iteration so I'll leave it to other people.<br><br>So I see at least 5 papers:<br>* Fundamental encoding concepts, types and helpers such as TextEncoding,<br>std::utf8, std::default_encoding_type_t, etc<br>* std::code_unit_sequence<br>* std::code_unit_sequence_view<br>* std::code_point_sequence<br>* std::code_point_sequence_view<br><br>It would be fair to standardize them in this order. Views may be<br>standardized before the corresponding containers, but we should see<br>implementations of the containers before deciding on the interface of the views.<br><br></pre></blockquote></div></body></html>