[SG16-Unicode] code_unit_sequence
Lyberta
lyberta at lyberta.net
Thu Jul 18 01:14:00 CEST 2019
Steve Downey:
> In live code, data is dynamic, and a code_unit, particularly a utf-8
> code unit, doesn't show up in isolation, they show up in sequences, but I
> fail to see why I'd want a sequence of code_units, as I'm immediately going
> to have to interpret them into something useful.
Yes. But the code unit level is still needed, and using std::basic_string
for it seems like a bad idea because it drags in std::char_traits, a bloated
API, a NUL terminator and dumb types. All of that made some sense
in the 1990s but doesn't make much sense now.
Again, there will be "scalar_value_sequence",
"grapheme_cluster_sequence" and "text" on top. code_unit_sequence is a
low-level thing, but one we need in low-level code.
> What are the operations
> on a utf8_code_unit? What interfaces does it show up in as a vocabulary
> type?
utf8_code_unit has the following member functions:
constexpr value_type value() const noexcept;
constexpr bool is_ascii() const noexcept;
constexpr bool is_leading_byte() const noexcept;
constexpr bool is_continuation_byte() const noexcept;
Those are exposed for encoding forms and for people who want to learn more
about Unicode. You can read the full text of the proposal here:
https://github.com/Lyberta/cpp-unicode/blob/master/Fundamental.md
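To give a rough idea of what such a type could look like, here is a minimal sketch. It is not the definition from the proposal: the choice of std::uint8_t as value_type, the explicit constructor and the exact byte ranges used by each predicate are assumptions made for illustration (in particular, whether ASCII bytes count as "leading" is a guess).

```cpp
#include <cstdint>

// Hypothetical sketch only; the proposal's real definition may differ.
class utf8_code_unit {
public:
    using value_type = std::uint8_t;  // assumption: could equally be char8_t

    constexpr explicit utf8_code_unit(value_type v) noexcept : value_(v) {}

    constexpr value_type value() const noexcept { return value_; }

    // 0xxxxxxx: encodes a scalar value on its own.
    constexpr bool is_ascii() const noexcept { return value_ < 0x80; }

    // Assumption: "leading" means the first byte of a multi-byte sequence,
    // which in well-formed UTF-8 is always in the range 0xC2..0xF4.
    constexpr bool is_leading_byte() const noexcept {
        return value_ >= 0xC2 && value_ <= 0xF4;
    }

    // 10xxxxxx: follows a leading byte.
    constexpr bool is_continuation_byte() const noexcept {
        return (value_ & 0xC0) == 0x80;
    }

private:
    value_type value_;
};

// "é" is U+00E9, encoded in UTF-8 as 0xC3 0xA9.
static_assert(utf8_code_unit{0xC3}.is_leading_byte(), "0xC3 starts a sequence");
static_assert(utf8_code_unit{0xA9}.is_continuation_byte(), "0xA9 continues it");
static_assert(utf8_code_unit{0x41}.is_ascii(), "'A' is ASCII");
```

An encoding form can use these predicates to find sequence boundaries without touching raw bit masks itself.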
> What is the overhead on it when used in bulk?
As the type is trivially copyable and relocatable, there shouldn't be
any overhead in release builds.
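The trivially copyable part can be checked at compile time. The sketch below uses a hypothetical stand-in type (code_unit_wrapper, not a name from the proposal) and a hypothetical helper wrap_bytes. There is no standard trait for "relocatable", so only trivial copyability and size are asserted, and the size check is an assumption that holds on mainstream implementations rather than a guarantee of the standard.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a code unit type: a thin wrapper around one byte.
struct code_unit_wrapper {
    std::uint8_t value;
};

static_assert(std::is_trivially_copyable<code_unit_wrapper>::value,
              "wrapper can be copied with memcpy");
static_assert(sizeof(code_unit_wrapper) == sizeof(std::uint8_t),
              "wrapper adds no storage (true on mainstream ABIs)");

// Bulk conversion from raw bytes is then a single memcpy, the same cost as
// copying the raw bytes themselves.
inline std::vector<code_unit_wrapper>
wrap_bytes(const std::vector<std::uint8_t>& bytes) {
    std::vector<code_unit_wrapper> out(bytes.size());
    if (!bytes.empty()) {
        std::memcpy(out.data(), bytes.data(), bytes.size());
    }
    return out;
}
```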
> Single code_unit validity isn't enough to get even well formed utf-8, so a
> significant part of error handling is still going to be present in
> processing.
Yes, but it makes conversion to scalar values much easier because that
check already rejects bytes that can never occur in well-formed UTF-8,
which rules out the overlong two-byte sequences.
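As an illustration of what a per-code-unit check can catch, here is a small sketch. The function name is hypothetical, and it assumes the check rejects exactly the bytes that never appear in well-formed UTF-8 (0xC0, 0xC1 and 0xF5..0xFF); the proposal may define validity differently.

```cpp
#include <cstdint>

// Hypothetical per-code-unit validity check (an assumption, not the
// proposal's definition). 0xC0 and 0xC1 could only begin overlong two-byte
// encodings of U+0000..U+007F; 0xF5..0xFF would encode values above U+10FFFF
// or are not UTF-8 lead bytes at all.
constexpr bool is_valid_utf8_code_unit(std::uint8_t b) noexcept {
    return b != 0xC0 && b != 0xC1 && b < 0xF5;
}

// 0xC0 0x80 is the classic overlong encoding of U+0000; it is rejected
// at the first byte.
static_assert(!is_valid_utf8_code_unit(0xC0), "overlong two-byte lead");

// 0xE0 0x80 0x80 is also overlong, yet every byte passes the per-unit check,
// so the decoder still has to enforce that 0xE0 is followed by 0xA0..0xBF.
static_assert(is_valid_utf8_code_unit(0xE0) && is_valid_utf8_code_unit(0x80),
              "sequence-level rules remain the decoder's job");
```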