SG16: Unicode meeting summaries 2024-03-13 through 2024-05-08
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
Previously published SG16 meeting summary papers:
March 13th, 2024
Draft agenda:
Attendees:
- Alisdair Meredith
- Braden Ganetsky
- Eddie Nolan
- Elias Kosunen
- Fraser Gordon
- Jens Maurer
- Mark de Wever
- Nathan Owens
- Robin Leroy
- Tom Honermann
- Victor Zverovich
Meeting summary:
- A round of introductions was held for new attendee Braden Ganetsky.
- P1729R4: Text Parsing:
- Elias explained that prior feedback has been addressed and that the
paper is expected to be ready for a forwarding poll.
- Elias reviewed the revision history and the changes requested by
SG9.
- Elias stated that support for stdin will be provided in a future
paper, similar to how std::print() was proposed after
std::format() was adopted.
- Elias proceeded to review each section of the paper.
- Eddie noted that the comment in the example in section 3.2,
"Reading multiple values at once", appears to be missing
values() following operator->.
- [ Editor's note: The comment appears to be intentional in only
referring to operator->, but incorrect in stating,
"will throw if it doesn't contain a value"; a call to
std::expected<T>::operator->() exhibits UB if
has_value() is not true. ]
- Tom asked, while looking at the example in section 3.4,
"Reading multiple values in a loop", if all result values are
definitely assigned.
- Elias explained that the scan result is returned by value and that
there is no way to provide an object that is referenced within the
result object.
- Tom asked, while looking at the example in section 3.6,
"Scanning a user-defined type", if the use of std::expected
is required or whether another std::expected-like type could
be used.
- Elias replied that a concept-like approach is used in the reference
implementation.
- Braden asked if it will be surprising to programmers that
std::scan reports errors via std::expected whereas
std::format uses exceptions.
- Elias responded that a failure to parse input provided at run-time is
expected and therefore a different category of error than what is
expected when formatting.
- Victor agreed with Elias and stated that this is a reasonable
design.
- Mark asked what happens if the scan format string is not valid.
- Elias replied that the format string is constant evaluated and, if
not valid, renders the program ill-formed.
- Mark commented that both throwing an exception and returning a
std::expected value that holds an error type in response to
an invalid format string can suffice to produce a compile-time
error.
- Elias proceeded to review section 4, "Design".
- Robin requested that, in section 4.2, "Format strings", in the
discussion of whitespace, the word "currently" in
"Those code points are currently" be struck since the Unicode
stability policy ensures these won't change.
- Robin observed that the list of whitespace code points appears to be
missing some characters; U+000B LINE TABULATION for example.
- Elias responded that the ASCII line includes a range of code points
that includes that character.
- Tom suggested it would be clearer for the list to include all of
the Pattern_White_Space characters individually.
- Elias continued review in section 4.3.2, "Fill and align", and
explained the behavior for scanning of centered text without an
explicit width; an unambiguous width cannot be inferred based on
surrounding fill characters.
- Tom referenced the rH example that scans "*42**"
with a "{:*^}" format specification, noted that the final
* character is not scanned, and asked for confirmation that
the example won't roundtrip with what std::format() produces
with an explicit field width.
- Elias confirmed.
- Victor suggested double checking how the
Python parse project
handles that situation.
- Elias responded that he had checked at one point, but would need to
do so again.
- [ Editor's note: The "Format Specification" section in the
Python parse project description states:
"Note that the “center” alignment does not test to make sure the
value is centered - it just strips leading and trailing whitespace."
]
- Victor pondered whether it is possible to roundtrip in general without
field width information and suggested the possibility of not
supporting scanning of center aligned text without an explicit field
width.
- Elias agreed that such cases could be disallowed.
- Jens questioned whether it might be a good idea to scale back the
options for scanning.
- Jens noted that there are already some asymmetries and provided an
example; std::format() produces a specific whitespace
sequence while std::scan() will consume arbitrary
whitespace.
- Jens suggested that use of a regular expression to consume fill
characters might provide a more practical approach.
- Elias asked if Jens' suggestion is intended just for handling of
center alignment or for all field widths.
- Jens clarified that the goal would be for the r5 example to
have a format specifier that consumes an arbitrary number of fill
characters.
- Jens stated that perhaps the r7 example would not be covered
by this idea since it has an explicit field width.
- Jens opined that the r5 example and all those that follow it
are a little concerning; particularly with regard to centering.
- Elias responded that section 6.2, "scanf-like
[character set] matching" discusses potential future support
for matching regular expressions and discarding characters.
- Elias stated these future directions would cover Jens' suggested
approach, but acknowledged that a format specifier option would be
convenient.
- Jens stated that full regular expression support would invite
complication.
- Mark asked if dynamic field widths are supported.
- Elias replied that they are explicitly disallowed.
- Elias reported that there was a poll in LEWGI that supported
compatibility with std::format as a guiding principle.
- Elias acknowledged that formatting and scanning are different.
- Jens agreed and stated that compatibility makes sense as long as it
makes sense.
- Victor stated that symmetry with std::format() is not a goal,
but that providing a replacement for scanf() is a goal and
the motivation for many of these use cases.
- Jens replied that he is not aware of features in scanf()
that would allow for skipping over fill characters.
- Victor acknowledged the lack of such general features but noted that
the use cases apply when the fill character is a space character.
- Victor asked if iostreams supports skipping fill characters when
scanning.
- General uncertainty was expressed.
- Jens reported that it appears that example r5 cannot be
parsed with scanf().
- Tom stated that it sounds like there is some homework to be done.
- Jens suggested that homework be done and that review continue at a
future telecon.
- Tom agreed.
- Eddie moved on to section 4.3.3, "Sign, '#', and '0'", and stated
that ignoring '+' and '-' signs or leading '0' characters would not
be desirable by default, but could be useful in conjunction with
the sign and '0' format options.
- Elias responded that, in his experience, it is more important to have
a clean design space than it is to have compatible format strings and
that he preferred to not allow those flags in order to avoid
confusion.
- Victor agreed with Elias.
- Elias explained that there is an additional roundtrip asymmetry when
formatted text exceeds an explicit field width; scanning the text
with an explicit field width won't consume all of the formatted
text.
- Elias noted that section 4.3.5.2,
"Design discussion: Separate flag for thousands separators"
will be removed; it was unintentionally left in.
- P3154R0: Deprecating signed character types in iostreams:
- Elias introduced the paper by explaining that the
signed char and unsigned char inserters and
extractors behavior is surprising because those types are treated as
character types but are often used as the underlying types of
int8_t and uint8_t.
- Alisdair asked how std::format() handles these types.
- Elias responded that they are formatted as integer types.
- Jens suggested updating section 1, "Motivation", to add a
std::format() example for each of the std::cout
examples.
- Alisdair asked about the long term intent and whether these functions
might be defined as deleted or specified to have different behavior
after a deprecation period.
- Alisdair asserted that deprecation should be a transitional state;
features should not stay deprecated indefinitely.
- Elias expressed a preference for defining them as deleted due to
concerns about just switching to new behavior.
- Victor expressed strong support for deprecation and stated that these
functions are a common source of errors.
- Victor noted that the existing behavior will remain available but
will require an explicit cast to a char-based type.
- Jens stated that a plan to deprecate in C++26, to define these
functions as deleted for C++29, and to define them with new behavior
for C++40 or so could make sense.
- Jens expressed strong support for defining these functions as deleted
as either a final or further intermediate step.
- Jens requested gathering some implementation experience by modifying
a C++ standard library to define these functions as deleted and then
compiling some real world projects to see if any latent bugs are
discovered.
- Jens opined that deprecation is a LEWG concern and that SG16 should
offer a recommendation on use of signed char and
unsigned char as character types.
- Alisdair pondered an option to change the behavior to
implementation-defined or unspecified.
- Poll 1: Recommend reserving signed char and
unsigned char for use as integer types, not character
types.
- Attendees: 11 (1 abstention)
- Consensus in favor.
- A: I would like to see the results for the experiment Jens
suggested first.
- Poll 2: Forward P3154R0 with the suggested modifications to the
motivation section to LEWG for C++26.
- Attendees: 11 (3 abstentions)
- Consensus in favor.
- A: The direction is more a matter for LEWG.
- Those who abstained from the second poll reported being uneasy with
the poll because the proposed change to deprecate these features is
not an SG16 concern.
- Tom explained that his intention with forwarding polls is to confirm
that there are no outstanding SG16 concerns that are not either
addressed or discussed in the paper; these polls are not intended to
state a position on matters that do not fall under SG16's
purview.
- Tom reported intent to cancel the scheduled 2024-03-27 SG16 meeting since
the WG21 meeting in Tokyo will have just concluded and we'll all be busy
catching up with our regular lives.
- Jens expressed support for that cancellation.
- Tom reported that he has historically scheduled SG16 meetings for the 2nd
and 4th Wednesday of each month, but that meetings from now through
2024-10-24 were scheduled for every two weeks; whether inadvertently or
intentionally with now forgotten intent remains a mystery.
- Tom indicated an inclination to stick with that schedule for now and
requested that anyone that will encounter attendance difficulties because
of it let him know.
- Tom announced that the next meeting is scheduled for 2024-04-10 and that
there are a number of papers awaiting review.
April 10th, 2024
Draft agenda:
Attendees:
- Barry Revzin
- Corentin Jabot
- Fraser Gordon
- Jens Maurer
- Mark de Wever
- Tom Honermann
- Victor Zverovich
Meeting summary:
- Due to a scheduling conflict, Barry was delayed in joining the meeting
and review of P2758R2 was thus delayed.
The time was filled with informal chat about various items, including but
not limited to:
- Progress on
P2873 (Remove Deprecated Locale Category Facets For Unicode from C++26).
- The need, or lack thereof, for u8streampos,
u16streampos, and u32streampos.
- The Unicode Text Terminal Working Group.
- U+FDFD (ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) and other
characters with very wide display widths.
- The past Tokyo and future St. Louis meetings.
- Locales, std::format(), and char8_t support.
- The Unicode Message Formatting Working Group.
- ICU4X.
- P2758R2: Emitting messages at compile time:
- Barry provided an introduction:
- The goal is to allow programmers to produce more friendly
diagnostics.
- static_assert has limitations and clever hacks only
go so far.
- Producing errors is great, but there is value in being able to
produce informational messages and warnings that can be elevated
to errors.
- std::format() is not declared constexpr, but
probably could be.
- The proposal is minimal and intended to provide infrastructure on
which better interfaces can be built.
- Victor posited that it would be useful to have a portable way to
suppress a warning; a portable version of the
#pragma directives that many implementations support
today.
- Victor stated that the paper needs updates to reflect the adoption of
P2741R3 (user-generated static_assert messages).
- Mark expressed support for the paper and commented that he recently
asked Clang developers about such a feature.
- Victor noted that clang-tidy allows a comment-based annotation to
suppress diagnostics that emanate from specified source code
lines.
- Barry asked if such annotations would be expected to suppress
diagnostics that would be produced from a specific call to one of
these functions.
- Victor replied affirmatively and stated that it would be difficult
for his organization to enable these warnings otherwise without a way
to suppress false positives.
- Jens explained that clang-tidy annotations are written at the line
where the diagnostic is issued from and that the annotation Victor is
interested in would have to work differently.
- Victor agreed and stated this suppression would be more
complicated.
- Tom suggested it would probably have to be an annotation that
suppresses any indicated warnings that emanate from within the
constant evaluation of the annotated source line.
- Corentin opined that this paper doesn't need to address suppression
of a diagnostic.
- Corentin noted that display of a diagnostic is within the purview of
the implementor.
- Corentin asserted that, as long as there is a tag available,
implementors can provide a means to suppress it.
- Tom replied that a tag is specified for
constexpr_warning_str(), but not for the other cases.
- Tom stated that, from an implementation standpoint, he could see
treating errors as discretionary errors that can be demoted to
warnings.
- Barry replied that production of an error is intended to halt
constant evaluation.
- Barry said that there are use cases for both fatal and discretionary
errors, but that he doesn't really agree with the motivation for the
latter.
- Victor expressed opposition to being able to demote an error to a
warning.
- Corentin observed that the wording needs to require that the message
is provided in the ordinary literal encoding.
- Corentin reported that wording examples can be found in the wording
for static_assert.
- [ Editor's note: see
[dcl.pre]p12.
]
- Jens clarified that the elements of the std::string_view
that holds the message will be considered code units of the ordinary
literal encoding.
- Barry reported having located the wording and indicated he can copy
it.
- Jens asked if constexpr_error_str() is equivalent to
static_assert(false, "message").
- Barry replied that it is very similar.
- Corentin explained that the evaluation is performed at a different
time and potentially for a different number of occurrences; a
static_assert will be evaluated once at translation or
template instantiation time whereas constexpr_error_str()
may be evaluated multiple times during constant evaluation.
- Corentin asked what the expectations for a call to
constexpr_error_str() are; for example, whether a diagnostic
with different color highlighting would be produced.
- Corentin asserted that it should be possible to suppress each message
kind; they should all have a tag for this reason.
- Corentin asked if escape sequences may appear in the message
strings.
- Barry asked what static_assert does and was informed it is
implementation-defined.
- [ Editor's note: examples with hilariously predictable
implementation divergence can be seen at
https://godbolt.org/z/xasvnMPre.
]
- Victor agreed with the suggestion to add a tag to
constexpr_print_str().
- Victor asked how ill-formed tags are handled.
- Tom replied that tags should be restricted to the basic literal
character set.
- Corentin stated that implementations should escape non-printable
characters and ill-formed code unit sequences in the diagnostics
they produce.
- Tom asked for confirmation that text in the message that looks like
a universal-character-name would not be treated as such.
- Corentin confirmed.
- Jens observed that the paper proposes a library facility but stated
that he is uncertain that it is one.
- Jens stated that
[intro.compliance.general]
would need an update.
- Jens noted that that section was updated to address the requirement
for the #warning and #error directives to produce a
diagnostic message.
- Jens asked why it would be necessary to state that the program is
ill-formed rather than that the expression is not a core constant
expression.
- Jens explained that ill-formed means a diagnostic must be produced,
but an implementation can do what it wants otherwise.
- Jens asked if specifying these as ill-formed requires an
implementation to refuse to translate the program and noted that this
is currently only required for #error.
- Tom asked Barry if the intent is to match #error.
- Barry expressed uncertainty.
- Jens advised reading
[intro.compliance.general].
- Corentin stated that the set of characters permitted in tags needs to
be clarified; quotes, semicolons, and other characters that have
special meaning in command line shells should be prohibited.
- Tom pondered whether this should really be a core language
facility.
- Tom suggested the tag should be required to be an unevaluated string
to facilitate audits.
- Victor expressed a preference for the tag being a string literal.
- Corentin observed that requiring a string literal would require a
core language feature.
- Barry replied that he would eventually like to expose this
functionality with more std::format() like capabilities but
doing so wouldn't be possible if this is specified as a language
feature; at least not without expression aliases or some other way to
pass a tag through a library interface.
- Tom stated he would like to review the proposal in SG16 again to
review limitations on tags and wording for encoding requirements.
- Jens indicated that CWG will need to review the paper as well and
stated he has a gut feeling that there is something missing.
- Jens noted that erroneous behavior is increasing motivation for
producing something akin to diagnostics at run-time.
- Jens suggested that LEWG might not have a lot of input since the
library interface would just forward calls to a builtin function;
that builtin function will require input from core implementors.
- Tom announced that the next meeting will be on 2024-04-24 and that he
would work with authors to get papers scheduled with more advance notice
this time.
April 24th, 2024
Draft agenda:
Attendees:
- Andrei Alexandrescu
- Braden Ganetsky
- Corentin Jabot
- Dan Katz
- Daveed Vandevoorde
- Eddie Nolan
- Giuseppe D'Angelo
- Jens Maurer
- Mark de Wever
- Nathan Owens
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Wyatt Childers
Meeting summary:
- P1953R0: Unicode Identifiers And Reflection:
- Corentin provided an introduction:
- This is an older paper and reflection has changed in the meantime,
but it is still relevant.
- P1949 (C++ Identifier Syntax using Unicode Standard Annex 31)
clarified the syntax for identifiers to provide better support for
non-English speakers and mathematicians.
- String literals are converted from the source file encoding to an
implementation-defined literal encoding that might not be
Unicode.
- P1854R4 (Making non-encodable string literals ill-formed)
changed string literals to be ill-formed if they specify
characters that are not representable in the associated literal
encoding.
- Characters can always be converted to Unicode encodings without
loss of data in C++.
- Reflection needs to specify the type and encoding used to reflect
an identifier.
- The only solution that works in all cases is to expose identifiers
in a UTF encoding.
- It is not possible to infer the encoding of a string just by
looking at the values of its code units.
- Daveed commented that there are some encodings that have characters
that lack representation in Unicode.
- Corentin acknowledged such limitations and explained that new
characters are regularly invented in some cultures but are not widely
used or encoded.
- Corentin noted that trademarks and various other symbols likewise are
not encoded in Unicode.
- [ Editor's note: The editor's Bluetooth stack regrettably crashed
and a minute or so of Corentin's continued elaborations were not
captured. ]
- Victor stated that having reflection expose names solely in
char8_t would be user hostile since there is little support
for char8_t in the standard library.
- Victor reported that his organization bans use of u8
literals.
- Victor expressed support for the approach described in section 4.4.6,
"name_of, display_name_of,
source_location_of" that limits names to characters in the
basic character set.
- Victor asserted that reflection must provide good support for the
common case where the ordinary literal encoding is UTF-8.
- Daveed asked about implications for EBCDIC based platforms.
- Corentin replied that EBCDIC and UTF-8 encoded data can't be discerned
just by looking at the string contents, so reflecting names in UTF-8
in char-based storage would be problematic for such
platforms.
- Tom replied that there are EBCDIC code pages that are missing
representation for some characters from the basic character set but
that digraphs are available for those characters so we don't really
concern ourselves with them in practice.
- Steve corrected Tom in the chat; EBCDIC code pages provide
representation for all the characters in the basic character set, but
not all such characters are encoded with the same value.
- Jens explained that, for the purpose of this discussion, it is
important to recognize that EBCDIC and ASCII map characters of the
basic character set to different code points and are therefore not
compatible.
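Corentin's point that an encoding cannot be inferred from code unit values can be illustrated with a small sketch. The two decoders below are hypothetical helpers written for this illustration; the UTF-8 decoder handles only one-byte and two-byte sequences and substitutes U+FFFD for anything else.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Interpret each byte as one ISO/IEC 8859-1 (Latin-1) code point.
std::u32string decode_latin1(std::string_view bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Minimal UTF-8 decoder for one-byte and two-byte sequences only.
std::u32string decode_utf8_small(std::string_view bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80) {
            out.push_back(b);
        } else if ((b & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            unsigned char c = static_cast<unsigned char>(bytes[++i]);
            out.push_back(static_cast<char32_t>(((b & 0x1F) << 6) | (c & 0x3F)));
        } else {
            out.push_back(U'\uFFFD');
        }
    }
    return out;
}
```

Given the same two bytes 0xC3 0x85, decode_utf8_small() produces the single code point U+00C5 while decode_latin1() produces the two code points U+00C3 and U+0085. Both results are valid in their own encoding, so nothing in the bytes themselves indicates which reading was intended.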
- P2996R2: Reflection for C++26:
- Daveed presented:
- [ Editor's note: Daveed's presentation slides are available
here.
]
- An overview of the proposed reflection syntax was provided.
- There are three functions that reflect the names of entities at
present, but more might be added.
- There is only one function that consumes names as strings right
now, but more might be added.
- Names provided by some reflection interfaces must be consumable
in the same form by other reflection interfaces.
- The ability to write names to std::cout is required.
- It is ok for the names to not be source-like;
std::meta::display_name_of() can use a descriptive
notation.
- The translation model is Unicode based so names can be provided in
Unicode encodings, but the standard library is missing support for
text in char8_t.
- Proposal sketch #1:
- Provide names in both char and char8_t based
storage and associated encodings.
- Require names to round-trip.
- Proposal sketch #2:
- Provide names only in char8_t based storage and
UTF-8; names naturally round-trip.
- Make std::cout work with UTF-8 text in
char8_t.
- [ Editor's note: Proposal sketch #3 in the linked slides was
added after the meeting as inspired by ensuing discussion.
]
- Jens asked if name_of() is proposed as a consteval
function.
- Daveed confirmed that it is.
- Tom asked for clarification regarding the intended use cases for
name_of(), qualified_name_of(), and
display_name_of().
- Daveed replied that name_of() is intended to return an
identifier or a canonical name such as operator X and that
qualified_name_of() and display_name_of() are
intended to return potentially localized descriptive text.
- Andrei observed that programmers might want to pass a
data_member_options_t object around and that the
optional<string_view> name member is
potentially problematic for lifetime reasons.
- Daveed acknowledged that the data member type might need to be changed
to an owning string type.
- Corentin explained that conversion from an arbitrary encoding to
Unicode might not roundtrip because characters like
Å (U+212B ANGSTROM SIGN) and
Å (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) are distinct in
Unicode, but might not be distinct in the ordinary literal
encoding.
- [ Editor's note: The
Å (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE),
Å (U+212B ANGSTROM SIGN),
A (U+0041 LATIN CAPITAL LETTER A), and
̊ (U+030A COMBINING RING ABOVE)
characters are all individually permitted in Unicode identifiers.
However, since C++ identifiers are required to be in
Unicode normalization form C (NFC), only the first form (U+00C5) is
permitted in a C++ identifier.
The ordinary literal encoding is not restricted to NFC, so this
character could be converted to one of the other forms and therefore
fail to round-trip.
This could result in a requirement for implementations to perform
conversion to NFC when consuming names.
See the "Singleton Exclusions" section of
UAX #15 (Unicode Normalization Forms).
]
- Steve asked if there is a desire or requirement to be able to emit
text containing names at compile-time.
- Daveed responded negatively and stated that the std::cout
requirement is intended as a debugging aid.
- Corentin responded to Victor's earlier statements regarding lack of
support for char8_t in the standard library and asserted that
we should fix that.
- Corentin expressed support for providing names in both char
and char8_t.
- Corentin stated that reflection is an important feature and that we
shouldn't implement hacks just to work around the missing support for
char8_t.
- Corentin insisted that improving support for char8_t is a
tractable problem and that we have some time for improvements in
C++26.
- Victor opined that reflection should not be dependent on
std::string_view.
- Victor stated that it took a long time to properly specify
std::print() and that we shouldn't implement hacks in
iostreams just to make std::cout work with char8_t
in the C++26 timeframe.
- Victor explained that the model we are moving towards is one where
the ordinary literal encoding is UTF-8.
- Victor suggested that an identifier or name type could be provided
instead of a string; this would enable writing formatters for it.
- Eddie asked Corentin if there are round-trip normalization concerns
and whether renormalization is required.
- Corentin replied negatively and stated that there are characters that
are duplicated in Unicode and do not normalize to each other.
- Eddie replied that identifiers are required to be in NFC.
- Tom stated that we are not going to be able to reach a conclusion on
round-tripping and renormalization now and that we'll need to research
and revisit.
- Tom said he is not convinced that normalization is a significant
issue.
- Daveed asked what the deadline is for new library feature proposals
for C++26.
- Jens provided a link to
P1000R5 (C++ IS schedule)
and reported that the Wrocław meeting in November is the last meeting
for core language features that require a response from LEWG and that
the Hagenberg meeting in February is the last meeting to forward
papers to CWG and LWG.
- Tom expressed support for Victor's suggestion of a distinct
formattable type for names and identifiers.
- Tom agreed with Victor regarding optimizing for the case where UTF-8
is the ordinary literal encoding, but disagreed with the suggestion
that char will ever imply UTF everywhere.
- Tom expressed a preference for exposing names in both char
and char8_t based storage.
- Daveed described limitations of constant evaluation that make use of
std::string problematic, but noted that implementations can
provide views backed by data in a string literal pool.
- Corentin noted that the encoding challenges remain the same if a
unique type is used; a solution is still needed to enable printing of
it.
- Corentin acknowledged that an opaque type might confer other
benefits.
- Jens agreed with Tom that, while we might like for UTF-8 to take over
everywhere, environments that rely on EBCDIC are likely to
remain.
- Jens asserted that we must take backward compatibility into
account.
- Jens observed that there are two levels of encoding:
- At compile-time, data might or might not be UTF-8, but the
encoding is known if a name is produced and consumed during
constant evaluation.
- At run-time, the encoding of the environment might be different
and might require transcoding or some form of escaping to not
lose data.
- Jens noted that we explicitly decided not to interfere with the
existing behavior of std::cout and introduced
std::print() as a new interface.
- Jens asked how programmers will produce new names based on reflected
ones given that std::format() is not declared
constexpr.
- Jens expressed uncertainty regarding what locale means during
constant evaluation.
- Jens suggested that returning an opaque type might be useful, but is
also not so different from returning std::string_view and
providing additional library support.
- Daveed stated that the addition of a distinct type creates some
complexity but that it could be associated with statically allocated
memory.
- Daveed noted that the creation of lots of names could produce massive
numbers of string literals if names are backed by string pools and
stated there could be an advantage to the distinct type approach.
- Eddie observed that an opaque type that converts to both
std::string_view and std::u8string_view could result
in ambiguous conversions for formatted printing.
- Dan observed that an opaque type helps to make it clear to the user
that they might want to perform some operations on it before printing
it.
- Corentin responded to Eddie's observation by stating that, as long as
the opaque type doesn't require conversion in order to be printed,
then there are no ambiguous conversion concerns.
- Corentin observed that SG16 talks about EBCDIC a lot, but noted that
Windows is not UTF-8 by default and that Shift-JIS is still the main
encoding used in Japan.
- Corentin agreed with Victor that it would be nice to have
char be synonymous with UTF-8 but stated that isn't the world
we live in.
- Corentin noted that, when writing output to a terminal, we can't
guarantee that an identifier can be accurately displayed due to
encoding limitations, encoding conversion limitations, and fonts.
- Corentin stated that std::format() and std::print()
do a much better job than iostreams and that std::print()
will print Unicode correctly on Windows; that can't be fixed for
iostreams.
- Corentin asked Daveed if non-transient memory allocation is still
being pursued.
- Daveed responded that it probably is not feasible to deliver in
C++26.
- Victor also responded to Eddie's observation by opining that he
doesn't think implicit conversions from an opaque type would be an
issue for std::format() but that he wasn't sure about
iostreams.
- Victor noted that writing char8_t to iostreams will be lossy
or produce mojibake.
- Victor stated that constexpr support for
std::format() is frequently requested and asserted that we
should prioritize that over adding new support for
char8_t.
- Victor reported that proposals for compile-time messages have
expressed interest in constexpr support for
std::format().
- Tom posted the following candidate polls in the chat:
- Candidate poll 1: P2996R2: identifier names should be made
available via char, wchar_t, char8_t,
char16_t, and char32_t consistent with
std::filesystem::path and
[fs.path.native.obs].
- Candidate poll 2: P2996R2: identifier names returned by
name_of() in char-based storage should be
encoded in the ordinary literal encoding with non-representable
characters rendering the call ill-formed.
- Candidate poll 3: P2996R2: identifier names returned by
display_name_of() in char-based storage should
be encoded in the ordinary literal encoding with non-representable
characters escaped as in
[format.string.escaped].
- Candidate poll 4: P2996R2: char-based identifier names
accepted by data_member_spec()
(via data_member_options_t) should be encoded in the
ordinary literal encoding.
- Corentin expressed concern about memory footprint if names are backed
by string literals and made available in multiple encodings.
- Daveed responded that the strings are only generated when you actually
use them; Victor's opaque type would effectively have a handle to an
internal representation backed by static storage.
- Tom pointed out that conversions from the internal representation
could then be performed at run-time.
- Victor expressed curiosity about candidate poll 1.
- Tom explained the thoughts that motivated that poll suggestion;
std::filesystem::path provides a precedent for providing
conversions to various encodings; if this poll has consensus, then
there is no need to poll support for individual encodings; if not, we
can.
- Tom posted the following alternatives to candidate poll 1 in the chat:
- Candidate poll 1.1: P2996R2: identifier names should be made
available in char-based storage.
- Candidate poll 1.2: P2996R2: identifier names should be made
available in char8_t-based storage.
- Candidate poll 1.3: P2996R2: identifier names should be made
available in char16_t-based storage.
- Candidate poll 1.4: P2996R2: identifier names should be made
available in char32_t-based storage.
- Candidate poll 1.5: P2996R2: identifier names should be made
available in wchar_t-based storage.
- Victor expressed support for candidate poll 2, noted that it had not
yet been discussed, and said he likes that it enables support for all
possible identifiers in UTF-8 in char-based interfaces when the
ordinary literal encoding is UTF-8.
- Steve noted that there is the possibility of problems caused by
translation units being compiled with different ordinary literal
encodings.
- Steve suggested that it might be useful to provide a library interface
that can produce strings with UCN-like sequences substituted.
- Steve noted that use of an opaque type would enable use with any of
the range encoding libraries.
- Daveed stated that the P2996 authors would be opposed to support for
all five character types but that they are ok with support for
char and char8_t.
- Tom asked for clarification regarding the opposition to support for
the other character types.
- Daveed responded that common storage can be used to back the same
representation for char and char8_t, but that isn't
the case for the other character types.
- Corentin noted that char16_t and char32_t are also
less efficient to store.
- Corentin stated that he is not opposed to an opaque type as long as it
can be printed as Unicode with good results.
- Corentin asserted that we still need to make char8_t work in
the standard library regardless.
- Corentin expressed opposition to introduction of an escape mechanism
that effectively introduces an additional encoding.
- Corentin suggested that if we want to support wchar_t,
char16_t, and char32_t, that we should provide a
translation interface rather than duplicating interfaces throughout
the standard library.
- Daveed responded with "Amen, brother!"
- Eddie stated that
P2728 (Unicode in the Library, Part 1: UTF Transcoding)
is fully constexpr and would provide support for conversion to UTF-16
in char16_t and UTF-32 in char32_t.
- Tom requested that Daveed make his presentation available for inclusion in
the meeting summary.
- Daveed immediately obliged.
- Tom announced that the next meeting will be held May 8th and that we'll
continue discussion of this paper then.
- Tom apologized to Corentin and lamented that this will once again delay
further review of
P2626 (charN_t incremental adoption: Casting pointers of UTF character types).
May 8th, 2024
Draft agenda:
Attendees:
- Braden Ganetsky
- Corentin Jabot
- Dan Katz
- Eddie Nolan
- Lauri Vasama
- Mark de Wever
- Nathan Owens
- Peter Bindels
- Robin Leroy
- Tom Honermann
- Victor Zverovich
Meeting summary:
- Robin provided a report on UTC #179:
- [ Editor's note: Minutes from the UTC #179 meeting are recorded in
L2/24-061.
]
- The alpha review period closed several weeks before the meeting and
the UTC WGs then had one week to prepare any material responses for
the meeting.
- The agenda for this meeting included reviewing the alpha feedback and
authorizing the beta release with stable specifications.
- The character repertoire is now frozen.
- Two recently added characters were removed at the request of the
Indian government; see
consensus item 179-C43.
- Significant changes were made to the line breaking algorithm, but
these changes don't affect current C++.
- Improvements were made to the handling of quotation marks in
simplified Chinese.
- Lines are no longer broken after hyphens that separate Hebrew and
non-Hebrew text.
- Recommendations from the CJK & Unihan Working Group were accepted
that will impact the wording currently present in
[format.string.std]p13
when the C++ standard is rebased on Unicode 16; the set of code points
included in bullet 13.2 will be subsumed by 13.1 due to acceptance of
L2/24-059 (Proposal to change the East_Asian_Width property of the Yijing symbols).
- Tom stated that we should create an issue to track doing that update
when we rebase on Unicode 16 or later.
- [ Editor's note: Tom created
SG16 issue 81 (Unicode 16: Updates needed for [format.string.std]p13 field widths)
to do so. ]
- Robin wondered why the code points listed in
[format.string.std]p13
bullets 13.3 (U+1F300 - U+1F5FF (Miscellaneous Symbols and Pictographs))
and 13.4 (U+1F900 - U+1F9FF (Supplemental Symbols and Pictographs))
are listed with a field width of 2; these code points aren't wide in text
presentation form, but would be in emoji presentation form.
- Robin shared a
link
listing all of the characters covered by bullet 13.3 and noted that some
of them, 🖗 for instance, are presented in a narrow form in the Windows
terminal for him.
- Corentin explained that testing revealed that these characters were
predominantly displayed as wide characters in existing terminals.
- Eddie reported that relevant discussion occurred during the recent
meeting of the Unicode Text Terminal Working Group (TTWG); the POSIX
wcswidth() function maps a code point to a width, but does not
account for variation selectors.
- Eddie stated that there is supposed to be a default for whether text vs
emoji presentation form is used, but there is implementation
divergence.
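As an illustration of the field width discussion above, the following is a minimal C++ sketch that tests only the explicit code point ranges of [format.string.std]p13 bullets 13.2 through 13.4. The function name `in_explicit_wide_range` is hypothetical. The sketch deliberately omits bullet 13.1 (code points with East_Asian_Width W or F), which requires a Unicode property table, so a `false` result does not mean that a code point has an estimated width of 1.

```cpp
#include <cassert>

// Hypothetical helper: reports whether a code point falls in one of the
// explicitly listed width-2 ranges of [format.string.std]p13.
// Bullet 13.1 (East_Asian_Width W or F) is NOT modeled here.
constexpr bool in_explicit_wide_range(char32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: a valid Unicode code point
    return (cp >= 0x4DC0 && cp <= 0x4DFF)      // 13.2: Yijing Hexagram Symbols
        || (cp >= 0x1F300 && cp <= 0x1F5FF)    // 13.3: Miscellaneous Symbols and Pictographs
        || (cp >= 0x1F900 && cp <= 0x1F9FF);   // 13.4: Supplemental Symbols and Pictographs
}
```

A range test like this cannot distinguish text presentation from emoji presentation, which is the concern Robin raised: every code point in bullet 13.3 is treated as wide regardless of how a given terminal renders it.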
- D3258R0: Formatting of charN_t:
- [ Editor's note: D3258R0 was the active paper under discussion at
the telecon.
The agenda and links used here reference P3258R0 since the links to
the draft paper were ephemeral.
The published document may differ from the reviewed draft revision.
]
- Corentin provided an overview of the paper:
- The motivation for the paper is to enable printing of
char8_t-based UTF-8 text via std::format().
- The intent is for something like
std::format("...", std::meta::name_of(^XX)) to just do
the right thing.
- The goal is for semantics to be consistent.
- The proposal includes support for formatting arguments of type
char8_t, char16_t, and char32_t for
both "" and L"" format strings.
- No support is proposed for use of u8"", u"", or
U"" literals as format strings.
- A replacement character will be substituted for ill-formed code
unit sequences.
- No error mechanism is proposed but one could be added later by
adding format specifier options.
- No support is proposed for formatting arguments of type
char with a L"" format string or for formatting
arguments of type wchar_t with a "" format
string due to potentially ambiguous encoding associations.
- Escaped formatting of characters and strings will work as
expected.
- For a non-UTF encoding, the replacement character will be
?; this matches substitutions currently observable with
the Microsoft compiler on Windows.
- No special behavior is proposed for std::print().
- A prototype implementation was completed for libc++, but libc++
only supports the ordinary literal encoding being UTF-8, so that
doesn't exercise transcoding scenarios.
- The C and C++ standards don't provide transcoding facilities other
than mbrtoc8() and such, but conversions can be done
using iconv, ICU, or other existing converters.
- Some transcoding facilities do not offer flexibility for error
handling.
- Support for formatting single code units of char8_t,
char16_t, and char32_t is proposed; this is
consistent with existing support for char and
wchar_t.
- constexpr implementations of std::format()
already have the ability to perform conversions between the set of
literal encodings.
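The replacement character behavior described in the overview can be sketched in C++ as follows. This is not the paper's implementation or wording. It is a minimal illustration that transcodes UTF-16 in char16_t to UTF-8 in char and substitutes U+FFFD for each unpaired surrogate. The function names `append_utf8` and `utf16_to_utf8_lossy` are hypothetical, and the sketch assumes the destination encoding is UTF-8. For a non-UTF destination the overview states that `?` would be substituted instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: append the UTF-8 encoding of a Unicode scalar value.
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: transcode UTF-16 to UTF-8, substituting U+FFFD
// for each ill-formed code unit (an unpaired surrogate).
inline std::string utf16_to_utf8_lossy(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cu = in[i];
        bool is_high = cu >= 0xD800 && cu <= 0xDBFF;
        bool is_low = cu >= 0xDC00 && cu <= 0xDFFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            char32_t lo = in[++i];
            append_utf8(out, 0x10000 + ((cu - 0xD800) << 10) + (lo - 0xDC00));
        } else if (is_high || is_low) {
            append_utf8(out, 0xFFFD);  // unpaired surrogate
        } else {
            append_utf8(out, cu);
        }
    }
    return out;
}
```

Because substitution is silent, a caller cannot tell whether the input was well-formed. That matches the overview's note that no error mechanism is proposed, though one could be added later through format specifier options.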
- Victor observed that a number of the std::format() examples
in the paper are syntactically incorrect as presented; likely due to
markup issues.
- Victor explained that std::vprint_unicode() and
std::vprint_nonunicode() are not exposition only so that
programmers can provide overloads for their own types with
differentiation for UTF and non-UTF encodings.
- Victor noted that locking variations of these functions are now
specified as well.
- [ Editor's note: Locking variations were recently added via the
adoption of
P3107R5 (Permit an efficient implementation of std::print)
during the Tokyo meeting. ]
- Tom asked for an explanation of the ABI limitations on extending
format specifiers.
- Mark explained that the ABI is restricted by
std::basic_format_arg<Context>::visit();
std::basic_format_arg is effectively a discriminated union
and the number of discernible types is constrained by the type used to
identify them.
- [ Editor's note: See
Mark's follow-up post to the SG16 mailing list.
]
- Victor replied that it would be possible to use the normal formatter
API instead of std::basic_format_arg.
- Victor stated that
{fmt}
already supports constexpr, but that there are no immediate
plans to propose support for std::format() as
constexpr in the standard.
- Victor suggested that the paper simply state that constexpr
support is implementable.
- Tom directed discussion to whether the proposed capabilities would
suffice to meet the minimum requirements for printing of identifiers
as desired for the reflection proposal.
- Dan opined that it does, noted that Daveed would like to have iostream
support, but commented that he doesn't feel as strongly about
that.
- Corentin said that he would like to know if anyone felt very strongly
about support for iostreams.
- Corentin stated he would rather focus on support for
std::format() and std::print(), but that he can
understand why others might want iostream support specifically.
- Corentin explained that he did not propose iostream support because he
didn't feel like he was the right person to do so.
- Victor stated that he views the proposed capabilities as a partial
solution that is not in line with the std::format() design
intent to not mix encoding concerns.
- P2996R2: Reflection for C++26:
- Victor expressed strong opposition to only exposing identifiers in
char8_t and asserted that we need to figure out the story for
support of char.
- Dan interpreted Victor's response as meaning that the proposed
facilities, with support only for char8_t, do not provide a
sufficient solution.
- Dan stated that the idea of using a magic proxy type seems good and
that Daveed has expressed support for it.
- Eddie reported having recently discussed the proxy type with other
attendees of C++Now and that some found it to be an overcomplicated
solution.
- Corentin responded that, if done right, programmers shouldn't be much
affected by it.
- Corentin opined that a proxy type is fine but that a solution that
uses an escape mechanism to effectively create a new encoding is
not.
- Tom pondered whether there is a need to distinguish between names and
identifiers and noted that many functions, like conversion operators
and overloaded operators, don't have associated identifiers but do
have names.
- Corentin indicated that is probably not an SG16 concern.
- Corentin suggested that reflection use cases are best addressed by
performing code injection rather than defining overloaded
operators.
- Corentin agreed that reflection might want to differentiate names and
identifiers in a similar manner to how Clang does internally.
- Dan advised caution regarding exposing a meta type for a name when
already working with a meta type.
- Victor asked if the hypothetical proxy name type might be exposition
only with conversion operators.
- Victor asked what the plan is for exploring the idea of such a
type.
- Corentin asked with a smile whether Victor was volunteering to do
so.
- Dan responded that the P2996 authors can propose a shape for the
type.
- Tom summarized his impression of consensus so far; that it seems that
there is good consensus for supporting both char and
char8_t, but since we can't overload based on return types,
that use of a distinct type is needed to avoid having to specify
distinct names; such a type enables future extension.
- Corentin asked what the motivation would be for an exposition only
type.
- Dan replied with a smile that it would avoid a bike shedding exercise
in LEWG.
- Dan stated there is a need to be able to perform string comparisons
for implementation of enum_to_string.
- Corentin acknowledged that a proxy type makes things easier by adding
a layer of indirection.
- Eddie asked for reasons not to define separate functions for
std::meta::name_of() and related functions.
- Dan replied that doing so creates a combinatorial explosion.
- Eddie asked what would happen in the case of an enumeration that has a
set of enumerators that cannot all be converted losslessly to the
ordinary literal encoding and where transliteration might produce the
same name.
- Dan suggested use of char8_t to avoid such cases.
- Eddie responded that his concern is whether such a scenario should be
possible since it makes it easy to do the wrong thing.
- Corentin noted that the compiler's internal representation is always
able to distinguish such cases.
- Corentin asked Robin if there are duplicated characters that are valid
for use in identifiers and that canonicalize to the same
representation.
- Robin replied that there are reasonable mappings to other character
sets that result in ambiguity.
- Robin provided a reference and some examples in the chat:
- Unicode 15.1, chapter 7, section 7.2, "Greek", paragraph starting with "Greek Letters as Symbols":
For compatibility purposes, a few Greek letters are separately
encoded as symbols in other character blocks. Examples include
U+00B5 µ MICRO SIGN in the Latin-1 Supplement character block and
U+2126 Ω OHM SIGN in the Letterlike Symbols character block. The ohm
sign is canonically equivalent to the capital omega, and
normalization would remove any distinction. Its use is therefore
discouraged in favor of capital omega. The same equivalence does not
exist between micro sign and mu, and use of either character as a
micro sign is common. For Greek text, only the mu should be used.
- μ, µ, 𝛍, 𝜇, 𝝁, 𝝻, and 𝞵 are all compatibility equivalent to μ and
all are valid C++ identifiers.
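The ambiguity concern can be illustrated with a short C++ sketch. It is an assumption-laden example and not anything proposed by P2996. Names are held as UTF-32, the target character set is taken to be ASCII, and the functions `to_ascii_lossy` and `to_ascii_escaped` are hypothetical. The escaped form uses `\u{...}` sequences with lower-case hexadecimal digits in the style of [format.string.escaped].

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical lossy conversion: every non-ASCII code point becomes '?'.
inline std::string to_ascii_lossy(std::u32string_view name) {
    std::string out;
    for (char32_t cp : name) {
        out += cp < 0x80 ? static_cast<char>(cp) : '?';
    }
    return out;
}

// Hypothetical lossless conversion: every non-ASCII code point becomes
// \u{h...} using the shortest lower-case hexadecimal representation.
inline std::string to_ascii_escaped(std::u32string_view name) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    for (char32_t cp : name) {
        assert(cp <= 0x10FFFF);  // precondition: a valid Unicode code point
        if (cp < 0x80) {
            out += static_cast<char>(cp);
            continue;
        }
        out += "\\u{";
        bool started = false;
        for (int shift = 20; shift >= 0; shift -= 4) {
            unsigned digit = (cp >> shift) & 0xF;
            if (digit != 0 || started) {
                out += digits[digit];
                started = true;
            }
        }
        out += '}';
    }
    return out;
}
```

With these helpers, the identifiers μ (U+03BC) and µ (U+00B5) both become `?` under the lossy conversion, so an enum_to_string style lookup could not tell them apart. The escaped conversion yields `\u{3bc}` and `\u{b5}`, which remain distinct. Since identifiers cannot contain a backslash, the escaped form is unambiguous for this use.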
- Corentin asked whether the roundtrip requirement can actually be
satisfied in the presence of arbitrary encodings.
- Victor replied to the overloading concerns by mentioning that Daveed's
original suggestion was for std::meta::name_of() and friends
to be templated on a character type.
- Victor noted that roundtrip support can be facilitated with an escape
mechanism as in Daveed's preferred option.
- Tom stated that roundtrip support cannot tolerate lossy conversions
and that an attempted conversion that would be lossy must result in an
error or substitution of an escape sequence.
- Eddie expressed concern that conversion to char won't work
everywhere, but since it will work in most cases such support can lead
to broken code that isn't caught by testing.
- Eddie suggested that the function template idea seems ok if the
character template type parameter is specified with a default template
argument of char8_t.
- Dan asked what the advantage of a template parameter would be over an
opaque type.
- Eddie replied that it can't be completely hidden away; that it will
appear in error messages, on cppreference.com, etc.
- Corentin stated that programmers don't want to care about this and
that they just want the identifier to be printed; if we can make it
just work, that is a win.
- Tom announced that the next SG16 meeting will be on 2024-05-22 and that
he intends to put
P2626R0 (charN_t incremental adoption: Casting pointers of UTF character types)
on the agenda, perhaps along with some recently created LWG issues.