SG16: Unicode meeting summaries 2022-06-22 through 2022-09-28
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
Previously published SG16 meeting summary papers:
June 22nd, 2022
Draft agenda:
- Continue discussion of survey questions for the 2023 C++ Developer Survey.
Attendees:
- Hubert Tong
- Jens Maurer
- Peter Brett
- Steve Downey
- Tom Honermann
Meeting summary:
- Continue discussion of survey questions for the 2023 C++ Developer Survey:
- [ Editor's note: The active revision at the start of the meeting
can be viewed by selecting File | Version history |
See version history, then selecting the version named
"pre 2022-06-22 meeting", then clicking the rightward facing triangle
next to the version name to "expand detailed versions"; this latter
step is necessary to exclude detailed edits that otherwise interfere
with numbering of the questions. ]
- Tom asked attendees to nominate questions to be removed from
consideration.
- PBrett suggested removing
Q1 (What character encoding(s) do you use for source files?)
since we already have consensus for moving towards UTF-8 encoded
source files.
- PBrett asked how answers to Q1 would affect our decision making.
- Jens concurred and asked hypothetically whether responses would
entice us to, for example, add a translation phase 1 option to
support GB18030 as we are doing for UTF-8 via
P2295 (Support for UTF-8 as a portable source file encoding).
- Jens noted that implementations that support non-UTF-8 source files
will continue to support them and argued that there is nothing to be
done within the standard.
- Hubert suggested an alternative formulation that asks which scripts
programmers are using in their source files and for which they might
be using specific encodings.
- Jens noted that
P2528 (C++ Identifier Security using Unicode Standard Annex 39)
assumes that everyone is using Unicode for their source file encoding
and that encoding does not imply which scripts are being used.
- Jens stated that use of a particular encoding such as ISO8859-1 does
restrict what scripts can be used and that such information could
potentially be used in confusability analysis.
- Jens suggested the question could probe which scripts are used in
conjunction with a non-Unicode encoding.
- PBrett noted the existence of the Big-5 encoding and that it is being
phased out in favor of GB18030 and UTF-8.
- PBrett asked if we are at risk of discussing whether support for
additional encodings should be mandated.
- Hubert responded negatively and stated that the question is intended
to probe the extent to which substantial use of non-Unicode encodings
remains.
- Tom stated that it sounds like we have not identified a use case for
this question.
- Tom struck Q1 from the draft document.
- PBrett expressed uncertainty as to what
Q2 (What character encoding(s) do you use for string literals?)
is intended to ask and stated that it might be interpreted as asking
if L, u8, u, or U prefixed
literals are being used.
- Tom replied that the question is intended to ascertain what encodings
are being used for the encoding of ordinary (non-prefixed) literals
in order to learn about trends occurring in the ecosystem.
- Hubert noted that we now assume that if string literals are UTF-8,
then the locale encoding is as well.
- PBrett expressed a feeling of persistent saltiness over that
assumption.
- Jens stated that only std::format is currently pushing us
towards Unicode in this way.
- Tom stated that we seem to have no use case for this question.
- Tom struck Q2 from the draft document.
- PBrett suggested removing
Q10 (How are the project(s) that you work on organized for Unicode
normalization?)
on the basis that few programmers are aware of Unicode
normalization.
- Tom responded that the question is intended to provide input
regarding whether normalization should be reflected in the type
system.
- Steve stated that it doesn't matter for most programmers, but that
it matters immensely for a few.
- PBrett suggested it is not a good candidate question if we believe it
impacts few programmers.
- Tom struck Q10 from the draft document.
- PBrett opined that
Q13 (Do your project(s) use regular expressions for which the search
pattern is not known at compile-time?)
is important to determine if programmers create regular expressions
using user input.
- PBrett stated that it probes whether
CTRE
is a suitable replacement for std::regex.
- PBrett stated that
Q14 (Which regular expression languages do you use?)
appears to duplicate
Q12 (What libraries do you use for regular expression support?).
- Tom replied that Q14 is intended to ask which regular expression
languages are being used; for example, which of the six languages
supported by std::regex are being used.
- Hubert stated that Q12 could be useful to determine whether collation
support is useful and noted that use of POSIX languages may imply
better locale support needs.
- Jens observed that programmers might use those languages for other
reasons.
- PBrett replied that programmers tend to use whatever language the
regular expression facility they are already using supports.
- Tom struck Q14 from the draft document.
- Jens asserted that
Q15 (Do you use the signed char or unsigned char types for text
processing?)
is not interesting.
- Hubert asked if that concern is motivated by the lack of standard
library support.
- Jens replied that iostream supports signed and unsigned char
types.
- Tom stated that the question is intended to help determine whether
these types should be used exclusively as small integer types as
opposed to character types.
- Jens opined that programmers should use char,
char8_t, etc. for character types.
- PBrett noted that unsigned char is commonly used as a
character type in C.
- Tom stated that this reflects a policy issue regarding whether we
intend to extend the standard library to support use of these types
for text and stated we have no such intent.
- Jens agreed, noted that the aliasing is unfortunate, and expressed
support for not making the situation worse.
- Tom struck Q15 from the draft document.
- Jens expressed support for asking programmers how they support
internationalization and localization.
- PBrett suggested dropping
Q19 (What libraries do you use for collation?).
- Jens countered with a suggestion to merge
Q17 (What libraries or operating system features do you use for
language translation?),
Q18 (What libraries do you use for localization?),
and Q19.
- Tom agreed to do so.
- Tom pondered whether it is worth asking about prohibition of standard
library facilities.
- PBrett responded that we can infer avoidance of the standard library
when programmers state that they use, for example, ICU, but not the
standard library facilities.
- Steve stated that the explicit locale capabilities present in
std::format are representative of what programmers want.
- PBrett asked about adding a free form field for programmers to state
how they support localization.
- Tom responded that it is difficult to extract data from free form
entries.
- Steve stated that it is useful to know that no one uses, for example,
strcoll().
- Tom asked if the "discourage or prohibit" language should be
retained.
- Jens replied negatively and stated that we want to know what they do
use.
- Hubert stated that
Q16 (Do you use the C and C++ locale features?)
is useful to know if, or to what extent, programmers depend on the C
and C++ locale for identification purposes.
- Tom agreed to simplify Q16.
- Tom pondered what we would use the responses to questions about
languages and scripts for.
- PBrett replied that Visual Studio Code has
UAX#9 HL4
features intended to help with display of bidirectional text in
source files; that information could be used for SG15 guidance.
- Jens stated that the standard allows identifiers, literals, and
comments to be written in many kinds of scripts; support for
languages such as Japanese is intentional.
- Jens added that he favors developing guidelines to encourage features
like those that Visual Studio Code offers.
- Tom noted that guidance will be forthcoming from the Unicode Source
Code Ad-Hoc Group.
- PBrett concluded that it sounds like we already know we want to
support these features; the data could help establish urgency.
- Jens agreed, but noted that implementors can decide for themselves
what is and is not urgent.
- Tom struck Q3 and Q4 from the draft document.
- Tom opined that
Q5 (Do you use characters other than the basic character set in
identifiers?)
is probably irrelevant following the adoption of
P1949 (C++ Identifier Syntax using Unicode Standard Annex 31).
- Steve indicated that language specific concerns are best addressed in
a code style guide.
- Tom struck Q5 from the draft document.
- Discussion ensued regarding poll bias and privacy concerns.
- PBrett suggested we could ask which region of the world respondents
are located in.
- Jens replied that such a question might be one that the Standard C++
Foundation is interested in asking anyway; it may not need to be
included within our quota of questions.
- Hubert suggested it would be useful to emphasize culture as opposed
to geographical location.
- PBrett expressed a preference for asking which nation the respondent
is in.
- Tom suggested asking respondents what their native language is.
- Jens replied negatively; there are many languages spoken in
India.
- Tom proposed striking
Q6 (Do the projects you work on limit locale selection in deployment
environments to those that use a specific character encoding?)
on the basis that mainframes aren't going away any time soon.
- Tom struck Q6 from the draft document.
- PBrett suggested merging
Q7 (What libraries do you use for text processing?),
Q8 (How are the project(s) that you work on organized for text
processing?), and
Q9 (If your project(s) convert text to and from an internal encoding,
what encoding(s) are used for the internal encoding?)
based on an expectation that use of framework libraries like Qt
sufficiently answers these questions.
- Jens noted that we already have agreement that we want utilities to
convert to/from UTF-8 and possibly UTF-16.
- Tom asked for clarification that such agreement is relative to locale
dependent encodings.
- Steve replied yes, but also to other specified encodings.
- PBrett asserted that these questions have already been probed by
JeanHeyd.
- Tom explained that
Q7 (What libraries do you use for text processing?)
is really intended to ascertain what features are supported via
non-standard libraries because the standard does not provide adequate
support for them.
- Jens suggested asking that question instead.
- Tom agreed to rephrase Q7 accordingly.
- Jens suggested asking what text processing features people most need;
whether that be transcoding, Unicode algorithms, or something
else.
- Jens noted that regular expression support could be added to that
list differentiated by compile-time vs run-time support.
- Steve asserted that a laundry list would be ok.
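As an illustration of the distinction behind Q13 and Q14, the following is a minimal sketch of a regular expression whose pattern is only known at run time, compiled under different std::regex grammars. The helper names matches_anywhere and regex_grammar_examples are hypothetical and were not discussed at the meeting; only standard <regex> facilities are used. The sketch shows that the same pattern text can mean different things depending on the grammar: '+' is a quantifier in the ECMAScript and POSIX extended grammars but an ordinary character in the POSIX basic grammar.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: compiles a pattern that is only known at run time
// using the requested std::regex grammar, then searches the text for it.
bool matches_anywhere(const std::string& pattern,
                      const std::string& text,
                      std::regex_constants::syntax_option_type grammar) {
    const std::regex re(pattern, grammar);  // throws std::regex_error if invalid
    return std::regex_search(text, re);
}

// Usage: the pattern "ab+c" under three of the six std::regex grammars.
void regex_grammar_examples() {
    // '+' means "one or more" in the ECMAScript and extended grammars.
    assert(matches_anywhere("ab+c", "xabbcx", std::regex::ECMAScript));
    assert(matches_anywhere("ab+c", "xabbcx", std::regex::extended));
    // '+' is an ordinary character in the basic grammar.
    assert(!matches_anywhere("ab+c", "xabbcx", std::regex::basic));
    assert(matches_anywhere("ab+c", "xab+cx", std::regex::basic));
}
```

Because the pattern arrives as a std::string, it could equally come from user input, which is the case that a compile-time facility such as CTRE cannot cover.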
- Tom stated that the next meeting is scheduled for July 13th but that we
need new papers.
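As background for the Q15 discussion above, the following minimal sketch shows what iostream support for signed char and unsigned char means in practice: stream insertion treats both as character types rather than as small integers. The function names are hypothetical, and the expected result assumes an ASCII-compatible execution character set in which the value 65 denotes 'A'.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Inserts the value 65 into a stream as three different types.
std::string insert_65() {
    std::ostringstream os;
    os << static_cast<signed char>(65) << ' '    // inserted as a character
       << static_cast<unsigned char>(65) << ' '  // inserted as a character
       << static_cast<int>(65);                  // inserted as a number
    return os.str();
}

// Usage: assumes an ASCII-compatible execution character set.
void insert_65_example() {
    assert(insert_65() == "A A 65");
}
```

Code that uses these types as small integers therefore has to convert to int before insertion to get numeric output.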
July 27th, 2022
Draft agenda:
Attendees:
- Eskil Steenberg
- Hubert Tong
- Jens Maurer
- Marcus Johnson
- Peter Brett
- Tom Honermann
- Victor Zverovich
Meeting summary:
- WG14 N3016: Unicode Length Modifiers v3:
- PBrett introduced the topic and invited Marcus to present his
paper.
- Marcus discussed the motivation for the paper; the desire to be able
to easily format text in a Unicode encoding.
- Tom provided a summary of the WG14 review of the paper during the
recent WG14 meeting.
- PBrett described how gettext() is used; a string in the
string literal encoding is provided and a string in the current
locale encoding is produced.
- Tom stated that there is effectively a contract that the string
produced by gettext() is encoded in the current locale
encoding.
- PBrett confirmed.
- PBrett asked how printf() would handle formatting a UTF-16
encoded argument.
- Tom replied that the existing practice for wchar_t based
arguments is to convert them to the current locale encoding.
- Tom asked if motivation exists for an alternative behavior.
- Jens asked for an example of alternative behavior.
- Tom replied that the string literal encoding could be used to guide
conversions instead of the current locale and noted that this would
match the behavior chosen for std::format() when the string
literal encoding is a Unicode encoding.
- Tom explained that such behavior would require preserving the string
literal encoding for each translation unit and then somehow passing
that information to printf().
- Jens noted that std::printf() and gettext() have
different encoding expectations; the former expects the formatting
string to be in the current locale encoding while the latter expects
something else.
- [ Editor's note: The
GNU gettext man page
states:
The msgid argument identifies the message to be translated.
By convention, it is the English version of the message, with
non-ASCII characters replaced by ASCII approximations.
]
- PBrett stated that it is rare in his experience for a string literal
to be passed as the format string to printf().
- Victor replied that in the code base he works on, approximately 50%
of printf() calls pass a string literal.
- Tom surmised that Victor's experience may reflect an assumption of
UTF-8 as both the string literal encoding and the locale
encoding.
- Victor replied that third party libraries are more likely to not
assume UTF-8.
- Jens asked if there is motivation to introduce a
u8printf().
- Tom replied that adding such an interface is an option.
- Jens expressed belief that we have consensus that the future is UTF-8
and that transcoding operations should occur at program
boundaries.
- PBrett expressed acceptance of library UB as a result of passing a
format string to printf() that is not encoded in the
expected encoding.
- Jens asked how printf() implementations recognize the '%'
character today.
- Hubert responded that printf() is required to be locale
sensitive and that the code point value of the '%' character may vary
across encodings.
- Eskil professed that implementations simply search for a code unit
that matches the ASCII encoding of '%'.
- Jens argued that is an unlikely implementation choice for an
EBCDIC-based system.
- Hubert explained that the '%' character encoding is non-varying
across EBCDIC code pages so a simple search for a code unit that
matches the EBCDIC encoding works on such systems.
- Jens surmised that, for implementations that support a locale
encoding that is unrelated to the string literal encoding, there must
exist a compile time decision regarding calls to
printf().
- Hubert responded affirmatively and stated that the printf()
family of functions have multiple entry points on z/OS.
- [ Editor's note: The z/OS C run-time library provides
EBCDIC-based implementations and ASCII-based implementations.
The latter exist to support an ASCII environment on z/OS systems.
See IBM's
Enhanced ASCII support documentation.
]
- PBrett reported having seen cases where, if printf() was not
locale sensitive, the results produced would not have matched
expectations.
- Tom agreed that we have established that the format string must match
the locale encoding.
- Eskil stated that, ideally, the string literal and locale encodings
would match.
- Hubert agreed but noted that the locale encoding is controlled by the
program user as opposed to the program author.
- Eskil observed that character conversions are not desirable in all
cases and provided production of a JPEG header as an example.
- Jens noted that there is no current proposal to implicitly convert
the printf() format string to the locale encoding.
- Eskil and others agreed that such a proposal would be
ill-advised.
- PBrett concluded that the current printf() behavior matches
the needs of the paper; it must already be locale encoding aware, so
conversion between UTF encodings and the locale encoding is
reasonable.
- Hubert agreed assuming requisite functionality as proposed in
JeanHeyd's transcoding facilities.
- Hubert stated that it would be necessary to specify how transcoding
errors are handled.
- Tom expressed a belief that the C standard already specifies how such
errors are handled via delegation to functions like
wcrtomb().
- Hubert responded with a belief that the C standard requires that
well-formed multibyte strings and well-formed wide strings always be
interconvertible without loss.
- Tom expressed surprise that such a requirement exists.
- PBrett noted that the wording would need to specify whether the
precision flag applies to code units, code points, or extended
grapheme clusters (EGCs).
- PBrett stated that additional flags could select either code units,
code points, or EGCs.
- PBrett asserted that the grapheme break algorithm is not too onerous
a requirement.
- Tom asserted that the precision flag must specify code units for
consistency with other uses of precision flags and that written code
units should not split code points or EGCs.
- Hubert explained that the number of code units read from the input must
not exceed the specified precision for security reasons.
- Discussion ensued regarding the possibility of buffer overflows and
existing uses of the precision flag.
- Victor asked if the precision flag currently specifies the maximum
number of input characters when performing wide character
conversions.
- Hubert responded affirmatively but suggested verifying.
- PBrett noted that, for existing uses, code units are equivalent to
characters.
- Tom explained his understanding of the precision flag; that if the
precision is X, then up to X code units are read,
but only the complete code unit sequences are written.
- Hubert responded that, if the input string had X code
points, but the number of code units to write differs, then the
number of characters written would not match X.
- PBrett asserted that it is common to use the precision to limit
output.
- Tom checked
https://cppreference.com
and reported that it claims that the %s specifier uses the
precision to limit the maximum number of bytes to write.
- Eskil expressed a preference towards designing for the future and
that legal output always be produced.
- Hubert checked the C standard and reported that the precision
specifies the maximum number of output code units in the target
encoding and that partial characters are not written.
- Victor summarized; the precision is the amount of output to write
and the remainder of what was read is discarded.
- PBrett asserted that programmers expect the precision to express
display width.
- Hubert responded that existing behavior hasn't matched that
expectation for as long as multibyte encodings have existed.
- Hubert pondered whether field width has a meaning in this case.
- PBrett replied that field width fills and that precision
truncates.
- PBrett asserted that what code authors really want is the ability to
specify display width.
- Tom asked if there is agreement that printf() does not
currently have the ability to specify display width.
- PBrett and Eskil responded negatively.
- Discussion ensued regarding EGCs and display width.
- Eskil expressed a preference that the C standard provide base level
functionality and that additional functionality be built as
libraries.
- Eskil asserted that there isn't always a single best solution.
- Hubert noted that, with regard to code points vs EGCs, splitting an
EGC can produce misleading output.
- PBrett noted that virtually all programs need to interact with text
in some capacity.
- Eskil stated that some capabilities are fundamental and provided the
example of formatting a number.
- Eskil stated that, with regard to string types, there are uses for a
size+pointer string type,
a size+buffer string type,
a size+capacity+buffer string type,
a string-with-allocator string type,
and more.
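As background for the discussion of transcoding errors above, the following is a minimal sketch of how an encoding error surfaces when a wide character is converted to the current locale encoding with wcrtomb(). The helper names are hypothetical. The usage example assumes that the "C" locale cannot represent U+20AC (EURO SIGN); this holds for common implementations but is an assumption, not a guarantee of the C standard.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: converts one wide character to the current locale
// encoding. Returns false and leaves `out` empty on an encoding error.
bool to_locale_encoding(wchar_t wc, std::string& out) {
    char buf[MB_LEN_MAX];
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    out.clear();
    const std::size_t n = std::wcrtomb(buf, wc, &state);
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // wcrtomb() stores EILSEQ in errno in this case
    }
    out.assign(buf, n);
    return true;
}

// Usage: assumes the "C" locale cannot encode U+20AC.
void wcrtomb_error_example() {
    std::setlocale(LC_ALL, "C");
    std::string out;
    assert(to_locale_encoding(L'A', out) && out == "A");
    assert(!to_locale_encoding(L'\u20AC', out));
}
```

A formatting function that converts UTF-encoded arguments to the locale encoding would need to specify what happens on the failing path.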
- Tom indicated that the next meeting is scheduled for August 10th and that
the agenda is yet to be determined.
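To make the precision and field width discussion from this meeting concrete, the following minimal sketch exercises the existing %s behavior for narrow strings: field width pads, precision truncates, and the precision counts bytes rather than characters or display columns. The helper names are hypothetical. The UTF-8 bytes are written with hexadecimal escapes so that the example does not depend on the source file or string literal encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a narrow string with an explicit field width
// and precision, both passed as arguments via '*'.
std::string format_s(int width, int precision, const char* s) {
    char buf[64];
    const int n = std::snprintf(buf, sizeof buf, "%*.*s", width, precision, s);
    if (n < 0) {
        return std::string();  // output error
    }
    // snprintf() returns the untruncated length; clamp to what fits in buf.
    std::size_t len = static_cast<std::size_t>(n);
    if (len >= sizeof buf) {
        len = sizeof buf - 1;
    }
    return std::string(buf, len);
}

void precision_examples() {
    // Precision truncates: at most 3 bytes of the argument are written.
    assert(format_s(0, 3, "abcdef") == "abc");
    // Field width fills: the 3 bytes are padded on the left to 5 columns.
    assert(format_s(5, 3, "abcdef") == "  abc");
    // "\xC3\xA9" is U+00E9 in UTF-8; a precision of 1 writes only the lead
    // byte, so the output is not well-formed UTF-8.
    assert(format_s(0, 1, "\xC3\xA9") == "\xC3");
}
```

The last case shows why precision cannot stand in for display width with multibyte encodings, and why the conversions discussed above would need a rule against writing partial characters.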
August 24th, 2022
Draft agenda:
Attendees:
- Corentin Jabot
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- Initial planning for Kona.
- Tom stated that there will likely be NB comments for SG16 to address
and that they are unlikely to be available in a timeframe that would
allow us to discuss them before the Kona meeting begins.
- Tom explained that, if few people will be present in Kona, he is
inclined not to reserve a room, but rather to have both in-person and
remote attendees join a Zoom meeting for discussions.
- PBrett suggested that any such meetings should be planned for early
morning Kona time in order for remote attendees in Europe and the US
east coast to be able to attend.
- Jens explained his current plans and expectations for room setup and
audio capabilities.
- Jens cautioned that the conference wifi may not handle many in-person
attendees using Zoom at the same time.
- P2626R0: charN_t incremental adoption: Casting pointers of UTF character types:
- Corentin presented the paper.
- char8_t, char16_t, and char32_t are
useful for their encoding assurances, but lack support in the
standard library.
- Unfortunately, we can't just assume UTF-8 with char-based
types and avoid use of the UTF variants.
- Some form of interconvertibility between char,
wchar_t, and the UTF character types is needed for the
latter types to be incrementally adopted.
- Copying the content of an array of one character type to an array
of another character type just because existing code needs to
access it by the latter type is expensive.
- None of the current language facilities enable zero cost
interconvertibility.
- The proposed functions are intended to have a narrow
contract.
- The names of the functions are intended to reflect the
partitioning of character types that are always used with UTF data
and other character types.
- The functions are intended to provide interoperability in constant
expressions.
- The basic_string_view and span interfaces are
provided for convenience.
- The alias barrier based conversion operations that ICU uses are
non-conforming, probably don't work reliably, and probably can't
be made to work in the C++ core language.
- [ Editor's note: See
SG16 issue #67
for more background information regarding the ICU alias barriers.
]
- An interoperability solution is needed for the UTF character types
to be adopted in practice.
- Victor asked how the proposed functions would work on a system where,
for example, wchar_t is not the same size as
char16_t.
- Corentin responded that the functions are constrained such that the
source and target types must have the same size and alignment; a call
is ill-formed otherwise.
- Victor requested that the paper be updated to explicitly state early
in the paper what properties of the types must match for the
operations to be well-formed.
- Hubert stated that there are memory model concerns that may make this
feature not worth pursuing; the proposed functions provide a very
sharp feature.
- Tom asked Corentin why he felt SG1 might want to review the
paper.
- Corentin responded that his understanding is that SG1 is generally
consulted regarding the C++ abstract machine, the memory model, and
concurrency concerns.
- Jens explained that the concerns the paper raises have more to do with
the object model than the memory model and that these concerns fall
more under CWG than SG1.
- Jens noted that
P2590 (Explicit lifetime management),
a paper with related concerns, was reviewed by LWG and CWG, but not
by SG1.
- Jens added that
P2590
completed work that began with
P0593 (Implicit creation of objects for low-level object manipulation)
and that paper also targeted LWG and CWG.
- Corentin asked if the paper represents a good direction.
- Hubert stated that the proposed semantics are such that, if these
functions were called to replace a subobject, the enclosing
complete object would be destroyed.
- [ Editor's note: Hubert provided a reference to the relevant
wording in
[basic.life]p1
in a follow up
post to the SG16 mailing list.
]
- Hubert repeated his assertion that the proposed semantics have sharp
edges.
- Hubert noted that there are on-going concerns involving
start_lifetime_as() and base classes.
- Jens commented that the complete object would only be saved from
destruction if there is a provides storage relationship
([intro.object]p3)
between the subobject and the target type.
- Jens suggested that a better approach might be to add
constexpr support to start_lifetime_as_array().
- Jens added that it might be possible for
start_lifetime_as_array() to offer additional guarantees in
cases where an underlying type is shared.
- Tom stated that there is a complicated relationship between the core
language possibilities and how that impacts the library interface
possibilities.
- Tom expressed a preference for specifying an ideal library interface
that then drives the core language needs.
- Hubert expressed uncertainty with regard to how to word restrictions
around usage of an enclosing object following a change of type for a
subobject; use or destruction of the subobject via the enclosing
object would have to be avoided.
- Corentin said he would try to address that.
- Corentin stressed that, once an object's type is changed, the memory
for that object cannot be accessed as though an object of the
previous type is there.
- Hubert reiterated that a change of type for a subobject becomes very
complicated.
- Jens asked if the paper includes examples that are reflective of how
this facility would be used in something like real world code.
- Jens noted that the mailing list discussion indicated that conversion
in one direction must be followed by a conversion back.
- Corentin expressed uncertainty regarding what limitations must be
imposed and voiced an assumption that, since the character types are
trivial, there is more flexibility.
- Jens stated that the core language has moved towards objects of a
trivial type being destroyed at the same point as other types; in the
past objects of a trivial type could be accessed after their point of
destruction until their storage was destroyed.
- Jens noted that there may be wording that states that destruction of
a trivial object where an object of another type is present results
in undefined behavior and provided
[basic.life]p6
as a reference.
- Tom described his understanding of how constant evaluation works in
terms of interpretation of an AST; constant evaluators can
currently rely on the type system; changing the type of an object
could lead to undefined behavior within the evaluator.
- Hubert agreed with Tom's description and stated that multiple
implementors should be consulted.
- Corentin suggested that such problems might be avoided via dependence
on an underlying type relationship.
- PBrett asked why the object type is so problematic and why, if a
region of memory contains bytes that represent UTF-8 encoded text, it
can't simply be accessed as an array of char8_t.
- Tom explained that constant evaluation is based on the C++ object
model and that the concept of memory regions doesn't apply there.
- Corentin further explained that compiler optimizers use
type based alias analysis (TBAA)
to eliminate re-reading memory and
dead stores
(writes to memory that will never be observed according to the
abstract machine) based on the type system.
- PBrett suggested that such alias restrictions could be removed.
- Hubert responded that doing so would impact performance.
- Jens noted that char8_t raised the abstraction level in C++
but not in C since char8_t is a type alias of
unsigned char there.
- PBrett stated that the issue with the object model must be solved in
order to specify a zero cost abstraction.
- Hubert explained that there is a trade off; using both
wchar_t and char16_t increases costs, but the
latter provides encoding and portability guarantees.
- PBrett opined that this suggests that use of the UTF character types
is not zero cost.
- Jens responded that C++ opted to add those types as fundamental types
in order to support overload resolution.
- Hubert explained the competing costs; restricting aliasing improves
performance at the cost of having to work around the type system.
- Jens noted that memcpy() can be used to work around the type
system.
- Tom noted that memcpy() can even be optimized away in some
cases.
- PBrett pondered whether the abstractions adopted for UTF character
types were the right choice and noted that a library facility could
have provided the same encoding guarantees while using char
internally.
- Tom explained that doing so wasn't an option for char8_t
since UTF-8 string literals were already part of the core
language.
- Steve explained that we use the type system to annotate how a block
of memory is used and that char8_t provided the ability to
annotate a block of memory as holding UTF-8 data.
- Steve asserted that making the UTF character types aliasing types
would impose costs like those he has seen with code that loops over
std::byte; the aliasing behavior hurts code generation.
- Steve noted that there are good libraries available that do use
char and translate between code units and code points.
- Corentin stated that the choice to make char8_t a
non-aliasing type was intentional and that any such change would
further harm adoption.
- Corentin asserted that a way to use char8_t with historic
char-based interfaces is needed or it just won't get used,
but we'll still be left with the problems that motivated its
introduction in the first place.
- Corentin opined that strong types are needed to support the
Unicode sandwich model.
- Corentin expressed a belief that this is solvable, implementable,
and therefore should be specified.
- Jens suggested that an alternative UTF-8 design could have been based
on something like std::span<char8_t> over a sequence
of unsigned char.
- Jens opined that code unit types are not particularly interesting
since an individual code unit by itself conveys little meaning.
- Jens noted that the proposed library interfaces have rough edges and
expressed skepticism regarding a need for anything UTF specific since
the underlying functionality is not encoding dependent.
- Steve agreed that the desire expressed in the paper is a special case
of the problem where we want to get objects of one type out of a
region of memory that holds objects of another type.
- Steve also agreed that the underlying storage for a text type is not
interesting; the interface provided is.
- Steve noted that none of the suggested library solutions would have
avoided the string literal concerns.
- Hubert provided a list of what he termed "a few uncomfortable facts":
- Reading object representations is allowed but the existing
wording is not satisfactory and fixing it will be hard.
- Implementations don't always follow the standard; for example,
Clang's support for placement new is non-conforming.
- Implementations sometimes implement behavior that can't be
expressed in the standard.
- Determining that wording is sufficient requires that multiple
implementations are completed based on the wording.
- Corentin, referring to earlier discussion regarding the possibility
of making start_lifetime_as_array constexpr, noted that,
since the memory location is provided by a parameter of type
void*, any original source object type information is not
present.
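As an illustration of the memcpy() workaround mentioned in the discussion above, the following is a minimal sketch that copies code units between two types of identical size and alignment without violating the aliasing rules. The names copy_code_units and copy_code_units_example are hypothetical; this is not the interface proposed in P2626R0, and it performs exactly the copy that the paper seeks to avoid. The static assertions loosely mirror the size and alignment constraints Corentin described for the proposed functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: copies `count` code units into a new buffer of a
// different element type. Well-defined because memcpy() copies the object
// representation instead of accessing the source through a To lvalue.
template <typename To, typename From>
std::vector<To> copy_code_units(const From* first, std::size_t count) {
    static_assert(sizeof(To) == sizeof(From), "code unit sizes must match");
    static_assert(alignof(To) == alignof(From),
                  "code unit alignments must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                      std::is_trivially_copyable<From>::value,
                  "memcpy() requires trivially copyable types");
    std::vector<To> result(count);
    if (count != 0) {
        std::memcpy(result.data(), first, count * sizeof(From));
    }
    return result;
}

void copy_code_units_example() {
    // char16_t has the same size and alignment as std::uint_least16_t.
    const char16_t text[] = u"h\u00E9";
    const std::vector<std::uint_least16_t> units =
        copy_code_units<std::uint_least16_t>(text, 2);
    assert(units.size() == 2);
    assert(units[0] == 0x0068 && units[1] == 0x00E9);
}
```

An optimizer may elide such a copy in some cases, as Tom noted, but that is not guaranteed, and the result is a separate array rather than a view of the original storage.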
- Tom reported that the Unicode Source Code Ad Hoc Group suggested that
SG16 author a paper to discuss the issues that have been reported
following adoption of
P1949
for C++23 as a defect report and the resulting migration from
immutable identifier syntax
to
default identifier syntax.
The paper would assist implementors with migration techniques,
particularly in light of the intent for a future Unicode standard to
introduce to default identifiers some currently excluded characters that
are included in immutable identifiers.
- Jens stated that he would like to understand more about the issues
reported and requested that it be added to the agenda for a future
meeting.
- Hubert expressed an interest in understanding more about the
discussion going on between WG21 and the Unicode Consortium.
- Steve volunteered to add writing such a paper to his todo list.
- Tom said he would file an SG16 issue to track the reported issues
and submission of a paper.
- [ Editor's note: Tom filed
SG16 issue #79.
]
- Tom stated that the next SG16 meeting is scheduled for September 14th
and will likely include further discussion of
P2626R0
and the above requests for more information about the identifier issues
and collaboration with the Unicode Consortium.
September 14th, 2022
Draft agenda:
Attendees:
- Corentin Jabot
- Hubert Tong
- Mark Davis
- Michael Kuperstein
- Peter Bindels
- Robin Leroy
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- A round of introductions was held in honor of new attendees.
- Report on the on-going interactions between WG21 and the Unicode
Consortium:
- Tom provided an introduction and presented prepared slides.
- [ Editor's note: Tom's slides are available at
https://github.com/sg16-unicode/sg16-meetings/blob/master/presentations/2022-09-14-WG21-UC-collab-p1949-presentation.odp.
]
- Unicode Message Format Working Group (MFWG):
- Tom presented his understanding of the group's progress as
previously relayed to him by Peter Brett as Peter was unable to
attend the meeting.
- Progress is on-going.
- A draft specification is available.
- The specification is complicated.
- The features provided subsume those currently available in
ICU.
- Implementations are available in JavaScript and Rust.
- The design might not integrate well with
std::format().
- Mark elaborated on the group's work.
- A tech preview will be available in an upcoming release of
ICU, in Java first with C++ support to come later.
- The current specification (2.0) supersedes previous work.
- The design is intended to minimize dynamic processing.
- In support of higher level processes, the design enables
formatting to a data model that is then formatted to a
string.
- Formatting is sensitive to surrounding characters.
- Robin stated that, with regard to dynamic and static formatting
models, the previous 1.0 specification could be used to produce
a statically checked implementation via code generation.
- Michael noted that most formatting needs involve simple cases and
that the interfaces provided must support difficult cases without
complicating the simple cases.
- Mark replied that making simple things simple is a goal, but that
challenges naturally arise.
- Mark provided an example of such challenges; some languages have
gendered forms of sentences that should be tailored for the
user.
- Mark further emphasized the desire to cater to those cases while
maintaining simplicity.
- Tom noted an implication; that locale is insufficient by itself
for producing a message; information about the recipient is
needed.
- Mark acknowledged, but noted that gender should not be imposed;
formatting should reflect the diversity of recipients.
- Michael reflected on how these concerns are expressed in social
media.
- Mark noted the concerns apply in any case where a particular user
is the target of a message.
- Mark added that western speakers are not often aware of these
concerns.
- Unicode Source Code Ad Hoc Group (SCWG):
- Tom presented the group's progress and on-going activities.
- The group started meeting in late 2021.
- A liaison relationship between ISO SC22 and the Unicode
Consortium might be established.
- Proposed updates to
UAX #9
and
UAX #31
were accepted for Unicode 15.
- On-going work includes:
- Establishing principles for source code as text.
- Considerations for language designers.
- A new UTS.
- A new group will be formed to focus on issues of character
confusability.
- Mark commented that the updates adopted for Unicode 15 were done
to address some fairly obvious deficiencies.
- Robin categorized the updates as non-normative
clarifications.
- Steve stated that
annex E
should be updated to reflect these clarifications.
- Steve noted such an update would only modify non-normative
wording.
- Hubert cautioned that the updates must be consistent with prior
intent and noted there was a desire not to speculate on uncertain
interpretations at the time.
- Hubert stated that we tend to favor normative text when there is
a conflict with non-normative text.
- Mark noted that non-normative text may better explain the intent
of normative wording.
- Robin described in more detail some of the on-going work:
- There will be a new UTS that will be a one-stop shop for
source code.
- Much of the focus concerns display of source code in the
presence of bidirectional text or invisible characters.
- Considerations for language design.
- Considerations for language evolution; for example, migrating
a language from immutable identifiers to default
identifiers.
- Mark explained the intent to define a suite of standard profiles
that language designers can choose from in order to provide a
simple set of options that encompass complicated concerns.
- Corentin noted that most language designers are not qualified to
determine what characters should be used for what purposes and
that it is important to understand the consequences of
changes.
- Corentin expressed a desire for the Unicode Consortium to make
decisions about character use; for example, for what characters
are allowed in an identifier.
- Mark reiterated that the goal is to make choices as easy as
possible.
- Mark noted that language designers have to make choices for
backward compatibility purposes and provided the example of
maintaining use of '_' in identifiers.
- Mark explained that providing well-defined profiles allows
language designers to better understand the implications of
combining profiles.
- Mark stated that some profiles will offer the option of removing
characters that are otherwise in a default included set.
- Robin acknowledged Corentin's concern and agreed with not wanting
language designers to be burdened with having to consider
individual characters.
- Robin stated that characters in these profiles won't be added to
XID_Start and XID_Continue because those
properties are required to be universal.
- Tom noted that this work was partially motivated by the C++
migration from immutable identifiers to default identifiers and
the effort required to appreciate the consequences.
- Mark reflected on the difficulties encountered by backward
incompatible changes made for XML 2.0 relating to C1 control
characters.
- Robin offered assurances that a new UAX #31 revision will make
the consequences of such choices more clear.
- Steve noted limitations imposed by concerns we don't have control
over and provided the examples of separate compilation and
linkers; identifiers might be written in normalization form C
(NFC) but a linker might just interpret it as a sequence of
bytes.
- Mark responded that requiring NFC is a good solution for a lot of
matching cases that also arise outside of programming
languages.
- Robin lamented the problems that occur by burdening users with
NFC requirements and asserted that programmers can help.
- Steve noted that programs can validate NFC quickly.
- Mark agreed and noted that hits to the slow path during NFC
validation are infrequent.
- Tom stated that the Unicode Consortium will form a new group to
address character confusability in order to take that security
burden off the programmer.
- Mark responded that the Unicode Standard provides some data
regarding confusable characters but is limited to cases where
glyphs for a single code point might be confused with a sequence
of multiple code points; maps between code point sequences are
not currently provided.
- Mark noted that confusability is often dependent on the font
being used, that programming languages tend to use a reduced set
of characters, and that programmers tend to use fonts that avoid
some confusability issues.
- Robin explained that major changes to confusability analysis will
be handled by the new group and that smaller issues will likely
follow the existing processes.
- Michael asked if the confusability work will focus more on
usability or security.
- Mark responded that both are important and that improving one
often helps with the other.
- Corentin mentioned that visual markup for confusability can impact
usability and noted that VS Code currently highlights all
non-ASCII characters that might be confused with an ASCII
character.
- [ Editor's note: Following the meeting, Robin Leroy shared an
example of current VS Code highlighting as exhibited by Compiler
Explorer (Compiler Explorer uses VS Code as its editor).
The example code contains Russian text and many of the characters
in that text are highlighted as confusable characters despite the
surrounding context.
The highlighting creates significant distraction that makes the
text difficult to read.
See
https://gcc.godbolt.org/z/zK7GPo9hW.
]
- Mark acknowledged the concern and stated that efforts will be
focused on avoiding markup that isn't helpful.
- Robin commented that he has a note in his working draft that
states "don't do what VS Code does".
- Mark suggested a thought exercise; imagine using an editor that
highlights all Latin characters that look like characters in
other languages.
- Robin explained that mixed script identifier support is important
and provided HTTPЗапрос as an
example in which an identifier is composed of names that
originate from different languages.
- [ Editor's note: HTTPЗапрос can be translated as HTTPRequest.
]
- Michael expressed support for a code library that provides
confusability analysis.
- Mark replied that ICU provides confusability data but noted that
application of that data necessarily requires understanding text
structure.
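The mixed script identifier mentioned above (HTTPЗапрос) can be illustrated with a minimal sketch. This is not how ICU or UTS #39 perform confusability analysis. It assumes, purely for illustration, that scripts can be approximated by two fixed ranges (ASCII letters for Latin, the Cyrillic block U+0400..U+04FF for Cyrillic) rather than by the Unicode Script property. The names `script_summary`, `summarize_scripts`, and `is_mixed_latin_cyrillic` are hypothetical.

```cpp
#include <string>

namespace sketch {

// Records which of the approximated scripts appear in an identifier.
struct script_summary {
    bool latin = false;
    bool cyrillic = false;
    bool other = false;
};

// Assumption: scripts are approximated by fixed ranges (ASCII letters and the
// Cyrillic block U+0400..U+04FF) instead of the Unicode Script property.
// ASCII digits and '_' are treated as script-neutral.
inline script_summary summarize_scripts(const std::u32string& identifier) {
    script_summary result;
    for (char32_t c : identifier) {
        const bool ascii_letter =
            (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
        const bool neutral = (c >= U'0' && c <= U'9') || c == U'_';
        if (ascii_letter) {
            result.latin = true;
        } else if (c >= 0x0400 && c <= 0x04FF) {
            result.cyrillic = true;
        } else if (!neutral) {
            result.other = true;
        }
    }
    return result;
}

// True when the identifier combines Latin and Cyrillic letters.
inline bool is_mixed_latin_cyrillic(const std::u32string& identifier) {
    const script_summary s = summarize_scripts(identifier);
    return s.latin && s.cyrillic;
}

}  // namespace sketch
```

A tool that flagged every identifier for which `is_mixed_latin_cyrillic` returns true would flag HTTPЗапрос, which is the kind of unhelpful markup the discussion above cautions against; a realistic analysis would need the text structure that Mark referred to.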
- Report on the backward compatibility impact of
P1949 (C++ Identifier Syntax using Unicode Standard Annex 31):
- Tom provided an introduction.
- Robin explained that his code that was impacted is in a hobby
project.
- Robin described the survey he conducted and reported that it
identified impacted code in a number of projects.
- Robin reported that the SCWG intends to provide standard profiles
for optional inclusion of select mathematical symbols and emoji in
identifiers.
- Robin noted that the main character difference between immutable and
default identifiers is the selection of allowed mathematical symbols
and emoji characters.
- Corentin expressed concern that, if C++ were to add support for
user-defined operators as Swift did, we don't want to end up in a
situation where characters previously allowed in identifiers become
candidates for use as operators.
- Robin reiterated that there is no intent to add these characters to
XID_Start or XID_Continue; that they are only being
considered for standard profiles.
- Robin reported that the rationale for the proposed mathematical
notation standard profile for default identifiers considers existing
use in languages such as Julia and Swift that support user-defined
operators.
- Robin stated that relevant experts from other members of the Unicode
Consortium are reviewing that rationale.
- Steve expressed sympathy towards use of mathematical symbols in
Mathematica and noted that doing similarly in C++ means using those
symbols in identifiers since algorithms are typically implemented as
functions in C++.
- Steve stated that the subscript and superscript characters are
problematic since many fonts don't support those characters.
- Michael asked what motivates programmers to want their code to look
like mathematical equations.
- Steve responded that, in mathematics heavy fields like physics
simulation, it is desirable for the code to match equations in other
documents.
- Michael expressed uncertainty whether that is reasonable and reported
that his closest experience has involved equations in
Mathematica.
- Michael noted that typesetting languages like TeX are able to render
such characters appropriately but that he wasn't sure about common
programming language editors.
- Steve responded that such concerns may be limited if code is not
widely shared or reused.
- Steve asserted that depending on a finicky environment is
ill-advised.
- Corentin expressed a belief that language designers don't want to
make such decisions and that implementors should not offer such
extensions.
- Tom responded that different recommendations are appropriate for,
for example, general purpose languages vs domain specific ones.
- Corentin agreed.
- Steve stated that defining standard profiles helps to provide
sensible options.
- Steve suggested that profiles also provide a clearly defined feature
for which implementors can be lobbied for an extension that could
then be standardized based on adoption.
- Hubert replied that common extensions are not necessarily good
evidence of widely used or appreciated extensions.
- Steve agreed with not wanting to make decisions on individual
characters; that an appeal to authority is desired.
- Robin agreed with not placing the burden of evaluating individual
characters on language designers.
- Corentin asked about the anticipated timeline for this work.
- Robin responded that a draft is expected in November, that feedback
from the UTC will then be provided, and that the work is targeting
next September's Unicode release.
- P2626R0: charN_t incremental adoption: Casting pointers of UTF character types:
- Tom apologized for the lack of time available to continue discussion
of this paper.
- Tom stated that the next meeting will be held on September 28th and asked
for opinions regarding what to prioritize next.
- Corentin replied that continued discussion of P2626 is not a high
priority right now.
- Corentin stated that there is a need to update the standard to use
and reference the current Unicode version.
- Corentin stated that work is needed to improve estimated field
widths.
- Corentin stated that the escape string format added via
P2286 (Formatting Ranges)
needs additional work to handle combining characters in extended
grapheme clusters.
- Hubert cautioned that concern is warranted regarding debug strings
getting corrupted during copy/paste operations.
- Steve stated that Bloomberg will be filing an NB comment to update
annex E.
- Hubert stated that he will be filing an NB comment about
std::format() debug strings.
- Tom pondered the possibility of requesting that NB comment authors
send copies of relevant NB comments to us when they submit them so
that we can start work on them sooner.
- [ Editor's note: Tom reached out to Herb and he arranged for all
SGs to get early access to NB comments. ]
- Tom reported that the next meeting will focus on LWG issues and that
the following meeting will likely include a presentation from
Michael.
September 28th, 2022
Draft agenda:
Attendees:
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- LWG #3767: codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale:
- Victor provided an introduction.
- There are four std::codecvt facets specified for
std::locale that are not intended to be locale
dependent.
- This appears to be the result of an oversight; when
char16_t and char32_t were added, new
specializations were presumably added to match the existing
char and wchar_t ones but are not actually
locale dependent.
- When char8_t was added, new specializations that
convert between char16_t/char32_t and
char8_t were added and the old specializations were
deprecated.
- The overhead of the unnecessary facets is probably minimal.
- The presence of the unnecessary facets is confusing from a
design perspective.
- The proposed resolution removes the specializations that are not
actually locale dependent from std::locale.
- The proposed resolution also makes the std::codecvt
constructors publicly accessible so that specializations can be
constructed without declaring derived classes.
- PBrett stated that the
email that announced the meeting agenda
noted that it would be helpful to understand what overhead is imposed
by these additional facets in practice and asked if it had been
measured.
- Victor replied that he had not measured and that the design
ramifications were of more concern to him.
- Victor volunteered to perform some measurements and described how
implementations manage the facets; via a dynamically allocated
array.
- Tom responded with his understanding that at least some
implementations statically allocate the facets and just register
pointers.
- Steve asked if the proposed changes would cause existing programs to
break at run-time.
- Victor replied that the presence of the facets can be queried at
run-time.
- Tom stated an expectation that, for some implementations, complete
removal of these specializations might result in link failures.
- Steve expressed appreciation for the desire to remove these facets
based on them not actually being locale dependent.
- Victor suggested that these facets could be deprecated instead.
- PBrett asked if the std::codecvt destructor should be
virtual.
- Victor expressed an expectation that a virtual destructor is
inherited from a base class.
- PBrett asserted the destructor should be declared with
override in that case.
- Hubert opined that these questions are more of a concern for LEWG and
do not fall under SG16's purview.
- Jens suggested an SG16 perspective that these facets are not locale
dependent and therefore should not vary by locale.
- Jens noted that these facets have been present for more than one
standard cycle and removal could result in silent behavior
change.
- Jens asserted that experience should be obtained regarding the
effects of removal before moving forward with a change.
- Jens noted that those removal effects are LEWG concerns.
- Victor agreed regarding SG16 scope for concerns.
- Victor volunteered to investigate what the consequences of removal
would be.
- Poll 1: SG16 agrees that the codecvt facets mentioned in
LWG3767 "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale"
are intended to be invariant with respect to locale.
- Attendance: 7
- Consensus: unanimously in favor.
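The run-time query Victor mentioned, and the derived-class workaround that the proposed resolution aims to make unnecessary, can be sketched as follows. This is an illustration under stated assumptions, not part of the LWG issue. It uses the `std::codecvt<char32_t, char, std::mbstate_t>` specialization (UTF-32 to UTF-8; deprecated as of C++20) so that it compiles without `char8_t`. The names `deletable_facet`, `classic_locale_has_utf32_facet`, and `utf32_to_utf8` are hypothetical.

```cpp
#include <cwchar>
#include <locale>
#include <string>

namespace sketch {

// std::codecvt has a protected destructor, so a derived class is needed to
// create a standalone object outside of a std::locale.
template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;
    ~deletable_facet() {}
};

using utf32_facet = std::codecvt<char32_t, char, std::mbstate_t>;

// Queries, at run-time, whether the classic locale carries the facet.
inline bool classic_locale_has_utf32_facet() {
    return std::has_facet<utf32_facet>(std::locale::classic());
}

// Converts UTF-32 to UTF-8 without consulting any locale. Returns an empty
// string if the conversion does not complete successfully.
inline std::string utf32_to_utf8(const std::u32string& in) {
    deletable_facet<utf32_facet> cvt;
    std::mbstate_t state{};
    std::string out(in.size() * 4, '\0');  // at most 4 code units per code point
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) return {};
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

}  // namespace sketch
```

Since `utf32_to_utf8` never touches a `std::locale` object, the sketch is consistent with the position polled above that these conversions are invariant with respect to locale.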
- LWG #3412: §[format.string.std] references to "Unicode encoding" unclear:
- Hubert explained that the term "Unicode encoding" is used in several
places in the standard, but with no formal definition.
- Tom provided two perspectives:
- "Unicode encoding" refers to only those encodings specified by
the Unicode standard and ISO/IEC 10646; UTF-8, UTF-16, and
UTF-32.
- "Unicode encoding" refers to any encoding that maps the entirety
of the Unicode code space and therefore includes, for example,
UTF-7 and UTF-EBCDIC in addition to UTF-8, UTF-16, and
UTF-32.
- PBrett asked if there is an industry term that describes the latter
perspective.
- Hubert replied that he is not aware of one.
- Tom replied that he had briefly looked for one in the Unicode
standard when drafting the agenda email but did not find one.
- Hubert stated that, for the debug formatting output introduced by
P2286 (Formatting Ranges),
a stateless encoding was assumed.
- Tom expressed support for restricting "Unicode encoding" to just
those encodings that are defined in the Unicode Standard.
- Tom noted that, if motivation arises to support additional encodings
as Unicode encodings, a paper can argue for relaxing the
restrictions.
- Poll 2: SG16 recommends that
LWG3412 "§[format.string.std] references to 'Unicode encoding' unclear"
should be resolved by replacing references to "Unicode encoding"
with "UCS encoding scheme".
- Attendance: 7
- Consensus: unanimously in favor.
- Tom asked Hubert if he would be willing to research other uses of
"Unicode encoding" to see if they should be similarly changed.
- Hubert agreed to do so and to open new LWG issues as appropriate.
- Jens suggested that a proposed resolution can address all such
issues.
- PBrett raised concern about use of GB18030 with
std::print().
- Hubert noted that we don't currently use the "Unicode encoding"
terminology in conjunction with std::print().
- [ Editor's note: Overloads of std::print() for
wchar_t and other character types are not currently provided;
the wording in
[print.fun]p2
currently restricts the enhanced Unicode behavior to UTF-8.
]
- Hubert suggested we proceed with the pragmatic solution for now.
- Tom noted that, for GB18030, the latest version no longer requires
use of the Unicode Private Use Area (PUA), and is therefore more
likely to be considered acceptable as a "Unicode encoding" in the
colloquial sense.
- Tom stated that the issues are likely sufficiently complicated though
that inclusion via a new paper is justified.
- Handling ill-formed Unicode in the library:
- Mark summarized the two issues raised during prior
mailing list discussion:
- One of the examples in
[format.string.escaped]p3
is incorrect; s5 should have a result value of
["\x{c3}("], not ["\x{c3}\x{28}"].
- It is not specified how ill-formed code unit sequences should be
handled for purposes of width estimation and formatting of debug
output.
- Victor responded that, for debug format output, the goal is to avoid
loss of information but that concern doesn't apply to width
estimation.
- Tom stated that the issue with the example is editorial since
examples are non-normative.
- PBrett suggested that the width estimation issue can be addressed via
an NB comment or an LWG issue.
- Tom opined that specifying the behavior for invalid code unit
sequences is reasonable.
- Victor agreed and noted that this is actually a C++20 issue.
- PBrett noted that performance overhead may be potential motivation
for not specifying the behavior of ill-formed input.
- Victor responded that this concern only applies to width estimation;
optimizations can still be employed.
- Jens stated that, for formatting of debug output, it is clear that
the intent is not to lose information.
- Tom agreed that the intent in that case is clear and well-specified;
the remaining issue is width estimation for ill-formed code unit
sequences.
- Jens asked what should be displayed for such ill-formed code unit
sequences.
- Tom replied that such questions depend on replacement character
policy.
- Jens asserted that the width estimate should be derived from the
characters that will actually be displayed.
- Victor suggested that research is needed to determine what happens
in practice.
- Tom noted that the input string has to be processed to calculate the
estimated width, so what terminals and such do with ill-formed code
unit sequences doesn't necessarily matter.
- Victor agreed and asked if the standard specifies a replacement
character.
- Tom responded that he did not think it does.
- Tom suggested that the desired resolution is probably to apply
PR-121
policy 2 with the Unicode replacement character substituted for the
ill-formed sequence.
- Victor replied that substituting a replacement character might not
be easy and might impose overhead.
- Jens suggested that the best answer might be that the estimated
width is unspecified.
- Mark volunteered to file an LWG issue for further follow up.
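The two behaviors discussed above can be contrasted with a minimal sketch. It is neither the wording of [format.string.escaped] nor any implementation's algorithm. It assumes that ill-formed UTF-8 is detected per the well-formed byte sequence ranges of the Unicode Standard, and that PR-121 policy 2 means substituting one U+FFFD per maximal subpart of an ill-formed subsequence. The names `next_subsequence`, `escape_ill_formed`, and `count_with_replacement` are hypothetical; the escaping handles only ill-formed code units and passes everything else through unchanged.

```cpp
#include <cstddef>
#include <string>

namespace sketch {

struct subsequence {
    std::size_t length;  // number of code units consumed
    bool well_formed;    // false for a maximal subpart of ill-formed input
};

// Examines the start of p[0..n) (requires n >= 1) and returns either one
// well-formed UTF-8 sequence or one maximal ill-formed subpart.
inline subsequence next_subsequence(const unsigned char* p, std::size_t n) {
    const unsigned char lead = p[0];
    if (lead < 0x80) return {1, true};
    std::size_t trailing = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the next code unit
    if (lead >= 0xC2 && lead <= 0xDF) {
        trailing = 1;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        trailing = 2;
        if (lead == 0xE0) lo = 0xA0;  // excludes overlong encodings
        if (lead == 0xED) hi = 0x9F;  // excludes surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        trailing = 3;
        if (lead == 0xF0) lo = 0x90;  // excludes overlong encodings
        if (lead == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    } else {
        return {1, false};  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    for (std::size_t i = 1; i <= trailing; ++i) {
        if (i >= n || p[i] < lo || p[i] > hi) return {i, false};
        lo = 0x80;
        hi = 0xBF;
    }
    return {trailing + 1, true};
}

// Debug-output style: every ill-formed code unit becomes \x{hh}, so no
// information is lost.
inline std::string escape_ill_formed(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const subsequence sub = next_subsequence(p + i, s.size() - i);
        for (std::size_t k = 0; k < sub.length; ++k) {
            const unsigned char b = p[i + k];
            if (sub.well_formed) {
                out.push_back(static_cast<char>(b));
            } else {
                out += "\\x{";
                out.push_back(digits[b >> 4]);
                out.push_back(digits[b & 0x0F]);
                out.push_back('}');
            }
        }
        i += sub.length;
    }
    return out;
}

// Width-estimation style: counts code points as if each maximal ill-formed
// subpart had been replaced by a single U+FFFD.
inline std::size_t count_with_replacement(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        i += next_subsequence(p + i, s.size() - i).length;
    }
    return count;
}

}  // namespace sketch
```

For the input discussed above, the code units 0xC3 0x28, `escape_ill_formed` yields `\x{c3}(`, which agrees with the corrected s5 example apart from the surrounding brackets and quotes. `count_with_replacement` reports two code points, but whether a width estimate should be derived this way is the question left open for the LWG issue.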
- Tom stated that the next meeting is scheduled for October 12th and that
the agenda is expected to include a presentation by Michael Kuperstein
unless preempted by a need to start addressing NB comments.