[SG16-Unicode] code_unit_sequence and code_point_sequence

Lyberta lyberta at lyberta.net
Tue Jun 19 17:09:00 CEST 2018


Mark Zeren:
> [mjz] This is one approach. Another is Zach's opinionated "there is only one storage container" approach.

Zach's approach is exactly what I don't want to see in the standard. His
type only supports UTF-8.

As we see with std::chrono, the encoding form should be a template
parameter. Nothing prevents us from standardizing
std::dynamic_encoding_form, where the code unit type is fixed at compile
time while its meaning is determined at runtime.
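
A minimal sketch of how that could look, assuming C++17. Apart from
dynamic_encoding_form and code_unit_sequence, which are named in this
thread, every name here (utf8_encoding_form, utf16_encoding_form,
byte_encoding, max_code_units_per_code_point, max_code_units_for) is
hypothetical and only serves the illustration; none of it is proposed
wording.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical static encoding forms: the code unit type and its meaning
// are both known at compile time.
struct utf8_encoding_form {
    using code_unit_type = unsigned char;
    static constexpr std::size_t max_code_units_per_code_point() { return 4; }
};

struct utf16_encoding_form {
    using code_unit_type = char16_t;
    static constexpr std::size_t max_code_units_per_code_point() { return 2; }
};

// Hypothetical set of byte-oriented encodings selectable at runtime.
enum class byte_encoding { utf8, latin1 };

// Hypothetical dynamic form: the code unit type is still a compile-time
// property, but what the code units mean is a runtime value.
struct dynamic_encoding_form {
    using code_unit_type = unsigned char;
    byte_encoding current = byte_encoding::utf8;

    std::size_t max_code_units_per_code_point() const {
        return current == byte_encoding::utf8 ? 4 : 1;
    }
};

// The encoding form is a template parameter; the container defaults to
// std::vector of the form's code unit type but can be replaced.
template <typename EncodingForm,
          typename Container =
              std::vector<typename EncodingForm::code_unit_type>>
class code_unit_sequence {
public:
    explicit code_unit_sequence(EncodingForm form = EncodingForm{})
        : form_(form) {}

    // Upper bound on the code units needed to store `code_points` code points.
    std::size_t max_code_units_for(std::size_t code_points) const {
        return code_points * form_.max_code_units_per_code_point();
    }

    Container& storage() { return units_; }

private:
    EncodingForm form_;
    Container units_;
};
```

The same template serves both cases: with a static form the query is a
compile-time constant, and with the dynamic form it consults a runtime
value.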

The only advantage of std::basic_string over std::vector is the Small
Buffer Optimization. Perhaps we can work with LEWG to standardize
something like sbo_vector. Then code_unit_sequence could just take it as
a template parameter but require value_type to be std::byte.

The hierarchy would then be, from bottom to top:

* std::sbo_vector<std::byte>
* std::code_unit_sequence
* std::code_point_sequence
* std::text

Each template uses the previous one in its implementation. Of course,
this is just the default hierarchy. A user can manually opt in to:

* std::vector<char16_t> // For UTF-16 case, for example.
* std::code_point_sequence
* std::text

Or:

* std::vector<char32_t>
* std::text

Or even:

* std::vector<char32_t>
* std::code_point_sequence // Basically a no-op at this layer. This case
  is typical for TMP.
* std::text
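
To illustrate the layering, here is a rough sketch of a code point layer
sitting on top of a user-chosen container, assuming C++17. The
code_points() member is hypothetical, the sketch handles only char16_t
and char32_t storage, and it does not validate its input; the std::text
layer is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical code point layer over an arbitrary code unit container.
template <typename Storage>
class code_point_sequence {
public:
    using code_unit_type = typename Storage::value_type;

    explicit code_point_sequence(Storage storage)
        : storage_(std::move(storage)) {}

    // Returns the stored text as code points.
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        if constexpr (std::is_same_v<code_unit_type, char32_t>) {
            // UTF-32 storage: every code unit already is a code point,
            // so this layer does no real work.
            out.assign(storage_.begin(), storage_.end());
        } else {
            static_assert(std::is_same_v<code_unit_type, char16_t>,
                          "sketch handles only UTF-16 and UTF-32 storage");
            for (std::size_t i = 0; i < storage_.size(); ++i) {
                char32_t unit = storage_[i];
                bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
                bool has_low = i + 1 < storage_.size() &&
                               storage_[i + 1] >= 0xDC00 &&
                               storage_[i + 1] <= 0xDFFF;
                if (is_high && has_low) {
                    // Combine a surrogate pair into one code point.
                    char32_t low = storage_[i + 1];
                    out.push_back(0x10000 + ((unit - 0xD800) << 10) +
                                  (low - 0xDC00));
                    ++i;
                } else {
                    // Unpaired surrogates pass through unvalidated here.
                    out.push_back(unit);
                }
            }
        }
        return out;
    }

private:
    Storage storage_;
};
```

With std::vector<char32_t> as storage the layer reduces to a copy, which
matches the "basically no-ops" case above, while std::vector<char16_t>
storage needs the surrogate pair handling.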

I'm a bit baffled by Zach's design. He goes 100% templates above the
code point level, so there was no need to restrict his "string layer" to
UTF-8, especially since implementing code point iteration is much easier
than grapheme cluster iteration and the higher-level ones, which he did
implement.
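
For a sense of scale, code point iteration over UTF-8 fits in a few
dozen lines. The following is a minimal sketch, not Zach's
implementation: decode_utf8 is a hypothetical helper, and it does not
reject overlong forms, encoded surrogates, or malformed continuation
bytes, which a real implementation would have to do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into code points.
// Invalid lead bytes and truncated sequences become U+FFFD.
inline std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
            cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
            cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
            cp = lead & 0x07;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        if (i + extra >= bytes.size()) {
            // Sequence is cut off at the end of the input.
            out.push_back(0xFFFD);
            break;
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Grapheme cluster segmentation, by contrast, needs the Unicode property
tables and the segmentation rules on top of a loop like this one.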
