Document Number: | P3263R0 |
Date: | 2024/04/30 |
Reply-to: | cpp@kaotic.software |
Authors: | Tiago Freire |
Audience: | LWG, SG16 |
This paper gives a suggestion for the creation of user annotatable character types to allow for uniform support for multiple text encodings.
# | Description |
---|---|
0 | Initial draft |
1. Motivation
2. What properties are desired
3. The proposed solution
4. Suggested properties
5. What this paper doesn't solve
6. Questions for SG16
7. References
It is the status quo that there is no unique way of handling text in software, especially considering the range of possible applications a user might want to write.
We need to acknowledge that encoding is an important consideration when handling and interpreting text, and we should make it easier on the user to do it right.
Using facilities such as the type system to help the user identify and manage text in different encodings is a desirable trait, and indeed distinct fundamental types such as char8_t/char16_t/char32_t have been added in order to provide partial support for Unicode (utf8/utf16/utf32).
But this solution doesn't scale: it is not practical to add another fundamental char type to the core language for every encoding in existence [1].
Doing so would not only require petitioning the committee for each introduction, it would also put a burden on implementers to update secondary facilities associated with character
types, such as std::integral and the specializations of strings and string_views; and it still wouldn't support custom-defined encodings and variations thereof.
The ideal scenario would be to allow users to specify their own text encoding support without involvement of the standards committee.
Before presenting the solution, it is important to talk about what we want; the design should follow, first and foremost, from where we want to be.
Type aliases such as using char_iso2022_t = char8_t are not ideal,
as the compiler would be unable to distinguish char_iso2022_t from char8_t, or objects like std::basic_string_view<char_iso2022_t> from std::basic_string_view<char8_t>.
The compiler would allow one to be used in place of the other without warning, and the type system would effectively not be used to distinguish these encodings.
var1 + var2 mean) without it participating in overload resolution.
enum class to define the character type. This would indeed provide a unique distinct type that cannot be overloaded with any other type.
std::is_enum would be true).
Enum classes also don't come with the desired operations by default, requiring the user to rewrite multiple boilerplate operators per supported encoding, which makes them much less ergonomic to use and not at all user friendly.
Unless, of course, we also provide a preprocessor macro to write all of the boilerplate for the user.
While we were quick to dismiss adding syntax to the language to allow for user-defined types (with the complexity of not only controlling overloads but also optionally defining what operators like + mean),
there is already a feature in the language that allows us to do just that: a "class".
A class is a unique type that cannot be mistaken for another type in overload resolution; you can define what its operators mean, you can define which underlying storage is used, and you can even define annotations (in the form of static constexpr members and "using" declarations) that allow for reflection.
You can also make it a template, allowing the user to define the minimal property set that is needed and have the remainder of the code write itself.
The proposed solution is composed of 2 components and an optional concept.
    //unique encoding identifier
    struct text_encoding_EBCDIC: public std::text_encoding //this inheritance annotates it as a type of encoding
    {
        using char_t = char8_t;
        //static constexpr std::string_view id{"EBCDIC"}; //optional, not required
    };

    using char_EBCDIC_t = std::char_enc_t<text_encoding_EBCDIC>;
And it would allow the user to combine it with other facilities like:
    using string_EBCDIC = std::basic_string<char_EBCDIC_t>;
    using string_view_EBCDIC = std::basic_string_view<char_EBCDIC_t>;
The following is an optional implementation:
    #include <compare>
    #include <concepts>
    #include <cstdint>
    #include <type_traits>

    struct text_encoding
    {
        text_encoding() = delete; //object cannot be instantiated
        text_encoding(text_encoding const&) = delete;
        text_encoding(text_encoding &&) = delete;
    };

    template <typename T>
    concept text_encoding_c =
        std::is_base_of_v<text_encoding, T>
        //&& std::same_as<std::string_view const, decltype(T::id)> //optional
        && std::is_same_v<typename T::char_t, std::remove_cvref_t<typename T::char_t>>
        && std::unsigned_integral<typename T::char_t> && !std::is_same_v<bool, typename T::char_t> //variant 1
        //&& (std::is_same_v<typename T::char_t, char8_t> || std::is_same_v<typename T::char_t, char16_t> || std::is_same_v<typename T::char_t, char32_t>) //variant 2
        ;

    template<text_encoding_c T>
    class char_enc_t final
    {
    public:
        using encoding_t = T;
        using underlying_t = typename T::char_t;

        inline constexpr char_enc_t(char_enc_t const&) = default;
        inline constexpr char_enc_t() = default;
        explicit inline constexpr char_enc_t(underlying_t const p_other) : _val{p_other} {}

        explicit inline operator underlying_t& () { return _val; }
        explicit inline constexpr operator underlying_t () const { return _val; }
        //explicit: an implicit conversion to bool would also allow implicit promotion to int
        explicit inline constexpr operator bool () const { return _val != 0; }

        inline underlying_t& value() { return _val; }
        inline constexpr underlying_t value() const { return _val; }

        inline char_enc_t& operator = (char_enc_t const& p_other) = default;
        inline char_enc_t& operator &= (char_enc_t const p_other) { _val &= p_other._val; return *this; }
        inline char_enc_t& operator |= (char_enc_t const p_other) { _val |= p_other._val; return *this; }
        inline char_enc_t& operator ^= (char_enc_t const p_other) { _val ^= p_other._val; return *this; }
        inline char_enc_t& operator += (char_enc_t const p_other) { _val += p_other._val; return *this; }
        inline char_enc_t& operator -= (char_enc_t const p_other) { _val -= p_other._val; return *this; }
        inline char_enc_t& operator <<= (std::uint8_t const p_offset) { _val <<= p_offset; return *this; }
        inline char_enc_t& operator >>= (std::uint8_t const p_offset) { _val >>= p_offset; return *this; }

        inline constexpr bool operator == (char_enc_t const p_other) const { return _val == p_other._val; }
        inline constexpr bool operator != (char_enc_t const p_other) const { return _val != p_other._val; }
        inline constexpr bool operator < (char_enc_t const p_other) const { return _val < p_other._val; }
        inline constexpr bool operator > (char_enc_t const p_other) const { return _val > p_other._val; }
        inline constexpr bool operator <= (char_enc_t const p_other) const { return _val <= p_other._val; }
        inline constexpr bool operator >= (char_enc_t const p_other) const { return _val >= p_other._val; }
        inline constexpr auto operator <=> (char_enc_t const p_other) const { return _val <=> p_other._val; }

        //the built-in operators promote to int, and the constructor is explicit,
        //so each result is converted back to underlying_t and wrapped explicitly
        inline constexpr char_enc_t operator & (char_enc_t const p_other) const { return char_enc_t{static_cast<underlying_t>(_val & p_other._val)}; }
        inline constexpr char_enc_t operator | (char_enc_t const p_other) const { return char_enc_t{static_cast<underlying_t>(_val | p_other._val)}; }
        inline constexpr char_enc_t operator ^ (char_enc_t const p_other) const { return char_enc_t{static_cast<underlying_t>(_val ^ p_other._val)}; }
        inline constexpr char_enc_t operator + (char_enc_t const p_other) const { return char_enc_t{static_cast<underlying_t>(_val + p_other._val)}; }
        inline constexpr char_enc_t operator - (char_enc_t const p_other) const { return char_enc_t{static_cast<underlying_t>(_val - p_other._val)}; }
        inline constexpr char_enc_t operator << (std::uint8_t const p_offset) const { return char_enc_t{static_cast<underlying_t>(_val << p_offset)}; }
        inline constexpr char_enc_t operator >> (std::uint8_t const p_offset) const { return char_enc_t{static_cast<underlying_t>(_val >> p_offset)}; }

        inline constexpr bool operator ! () const { return !_val; }
        inline constexpr char_enc_t operator ~ () const { return char_enc_t{static_cast<underlying_t>(~_val)}; }

        inline constexpr char_enc_t& operator ++ () { ++_val; return *this; }
        inline constexpr char_enc_t& operator -- () { --_val; return *this; }
        inline constexpr char_enc_t operator ++ (int const) { return char_enc_t{_val++}; }
        inline constexpr char_enc_t operator -- (int const) { return char_enc_t{_val--}; }

    private:
        underlying_t _val; //intentionally not default initialized
    };
Many of the properties listed here are restrictive in nature. This is done to allow the character type to behave as much as possible like what would be expected from a character, while reducing the amount of behavior that isn't "character like" and avoiding unintended consequences. This shouldn't be read as if those properties are set in stone. They are a good start and seem sensible enough; if they turn out to be too restrictive and there is a good reason to change, they are open to modification in the future by dropping constraints. Being too lenient now might mean not being able to change them later due to fears of ABI breaks.
This is an intentional design to allow for future alteration without the fear of ABI breaks. The class has no storage and should only be used as a compile time instrument to help identify the encoding and deduce its properties, such as the code unit width. Having a runtime object of such type should be ill-formed.
A type must inherit from std::text_encoding to define a text encoding.
This does two jobs.
First it signals to the compiler (by means of compile time reflection mechanism) that this type is intentionally a text encoding annotation.
This prevents other types from being accidentally used.
Secondly it enforces the rule of strict compile time type that cannot be instantiated.
The unsigned restriction is to avoid unsavory side effects like those that currently occur with implementations of char that are signed. For example, a test for whether a code unit is below 0x32 would unintentionally match the code unit 0xA3 (i.e. 0xA3 < 0x32 is true), because 0xA3 is then considered a negative number, and negative numbers are smaller than any positive number.
Code units in character encodings are mappings: they map numbers to glyphs. They are not generic arithmetic integers and shouldn't be treated as such; having a "sign" is meaningless.
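However, there are two alternatives that must be decided upon: allowing any unsigned integer type other than bool (variant 1 in the code above), or restricting the underlying type to char8_t, char16_t, and char32_t (variant 2).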
char_enc_t should be marked as final, and it shouldn't be possible to inherit from it.
The same way you cannot inherit from char or char8_t, you shouldn't be able to inherit from a specialization of char_enc_t.
It is not clear what it would mean to have a derived character type that inherits from another base character type. They have the same encoding but are not the same character type despite the fact that they derive from the same character?
The following operators were chosen because they are typically seen in character processing:
The ability to compare characters (==, <, <=, >, >=, !=, <=>).
Being able to perform code unit offsets and basic code unit offset arithmetic (+, -, ++, --).
Bit level manipulation (such as identification by bit masking, bit type transformation, etc), (&, |, ^, ~, <<, >>)
The ability to assign a new value, or modify a value in place using the previously mentioned operations (=, +=, -=, &=, |=, ^=, <<=, >>=)
Test if the code unit is 0 (!, bool).
Create an object from the underlying type, and the ability to cast it back to the underlying type.
Multiplication, division, and remainder (*, /, %, *=, /=, %=) have been intentionally excluded. Character encodings are mappings and their intrinsic value doesn't have much numerical meaning; it is not clear what multiplying (or, conversely, dividing) a character would mean. These are not the kind of operations that I have seen associated with characters, nor would I expect to be able to apply them to a character type.
This paper allows one to specify a character type with an encoding annotation, which can be used with arrays, strings, and string_views, and have them function like most other characters already do. But how does one load these objects with a specific value?
While it is always possible to read and copy byte-data or prepared phrase books from an external source, in most cases developers just want the convenience of writing their strings or characters directly in the source-code using string/character literals.
The problem is that string/character literals are a core language feature, and only "native" and utf8/utf16/utf32 encodings are supported.
Sure, it is possible to create constexpr conversion functions to translate utf8 into the user's custom encoding, and the user can, and would always be expected to, write their own user-defined string literals for their encoding.
But how do you create your own permanent storage from within a constant evaluated context? It is not clear to me how a user would be able to do this effectively.
This paper leaves the problem of user-defined string/character literals in a constant evaluated context open, to be addressed elsewhere. It is a problem that exists, this paper provides no answer to it.
Should the underlying type be restricted to char8_t, char16_t, char32_t, or should we allow for any unsigned integer?
Should std::underlying_type (currently limited to enums) be expanded to provide the underlying type of character of the code unit?