SG16: Unicode meeting summaries 2019-10-09 through 2019-12-11
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
October 9th, 2019
Draft agenda:
- P1880R0 - u8string, u16string, and u32string Don't Guarantee UTF Encoding
- P1879R0 - The u8 string literal prefix does not do what you think it does
- P1844R0: Enhancement of regex
Attendees:
- Corentin Jabot
- David Wendt
- Henri Sivonen
- JeanHeyd Meneide
- Peter Bindels
- Peter Brett
- Tom Honermann
- Zach Laine
Meeting summary:
- P1880R0 - u8string, u16string, and u32string Don't Guarantee UTF Encoding
- https://github.com/tzlaine/small_wg1_papers/blob/master/P1880_uNstring_shall_be_utf_n_encoded.md
- Zach introduced:
- The idea is that interfaces taking these string types expect that
contents of these strings are well-formed UTF-8, UTF-16, UTF-32
respectively; this requirement needs to be reflected in the
standard.
- We should state a blanket requirement for these expectations.
- The paper proposes a 4th bullet to
[res.on.arguments].
- PeterBr asked if the requirement should be for well-formed data.
- Zach replied that it should be. LWG should confirm that.
- Henri asked what happens if an ill-formed code unit sequence is
passed. Is it undefined behavior or as-if the Unicode replacement
character was present?
- Zach replied that the current wording makes it undefined
behavior.
- PeterBr provided an example of why the behavior is undefined.
Consider a string that ends with an incomplete code unit sequence; the
implementation could run off the end of the buffer.
- Zach responded that, for std::basic_string types, the buffer
overrun can be avoided, but in that case, the interface specification
should state that behavior. The proposed blanket wording is for the
weakest interface requirements and can be strengthened by individual
interfaces.
- Henri asked if that is useful as it seems like undefined behavior is
a huge foot cannon; replacement character semantics would provide a
safer interface.
- Zach responded that, if this is a foot gun, then so is
std::vector operator[]. You must meet preconditions.
Implementations can always constrain their handling if they want.
The intent here is to enable the fast path.
- PeterBr added that it would add complexity to implement replacement
character behavior; interfaces would not be able to use SIMD
instructions if ill-formed strings must be handled.
- Zach repeated that the proposal just specifies the default behavior
unless otherwise specified for an interface.
- Corentin opined that this seems almost editorial.
- Henri stated that, for char8_t, there are values that are
never valid in well-formed UTF-8 text and asked what an individual
char8_t means; it must be restricted to ASCII.
- Tom noted that this matches UTF-8 character literals; they can only
specify ASCII values.
- Zach read the existing content in
[res.on.arguments]
in order to demonstrate similarity in existing requirements.
- Henri asked if this represents a requirement that is more difficult
to satisfy than the existing requirements. For example, in UTF-16,
almost all code bases will allow unpaired surrogates. Does this
requirement make the standard library useless for their code
bases?
- Zach stated that interfaces can specify their handling of unpaired
surrogates.
- Henri asked again if this is a practical requirement.
- Tom responded that this is needed for our mantra of not leaving
performance on the floor; we can't both check for ill-formed text and
maximize performance.
- Zach added that ICU already does this for performance. Within
Boost.Text, Zach added interfaces for both unchecked and checked
text.
- PeterBr opined that this paper is great and sorely needed.
- Tom and Corentin agreed.
- Henri asked Zach to expound on his statement that ICU already
exhibits undefined behavior.
- Zach responded that, in ICU normalization code, assumptions are made
when decoding UTF-8. For example, unsafe unpacking of UTF-8 is
performed.
- Henri asked if ICU does likewise for UTF-16 for unpaired
surrogates.
- Zach responded that he thought so, but is not completely sure.
- Corentin expressed support for an NB comment to include this in
C++20.
- Tom opined that it doesn't much matter if this makes C++20 as
implementors will already do the right thing.
- Henri asked if this might introduce a backward compatibility issue in
C++23 if added after C++20.
- Tom responded that the undefined behavior is effectively already
there; this is fixing an underspecification.
- Henri stated it would be a huge task to scrub existing code bases to
avoid this undefined behavior.
- Zach predicted that we'll end up with separate interfaces for
assuming an encoding vs checking the encoding. This isn't hurting
anybody, it is just enabling fast path implementations.
- Henri expressed concern about digging deeper into making default
interfaces unsafe; like std::optional::operator* is. He
would prefer unsafe interfaces be clearly marked as unsafe. This
undefined behavior has the potential to introduce security
issues.
- Zach responded that most standard interfaces are unsafe in some way,
for example every function that accepts arguments of pointer
type.
- Henri countered that the undefined behavior can be avoided in this
case; just like we could for std::optional::operator*.
- Zach suggested that C++ is often used for its performance advantages;
we want the default to be fast. But this proposal isn't really about
that; it is about documenting our default behavior.
- PeterBr stated that std::u8string is
std::basic_string with char8_t.
std::basic_string provides many interfaces that allow
mutating the string in a way that would break otherwise well-formed
UTF-8. Rust doesn't do that. We could specify a UTF-8 string type
that maintains invariants, but it wouldn't be a
std::basic_string any more. Thus, it is up to the
programmer to not violate UTF-8 requirements.
- Corentin agreed that we don't want to change std::u8string;
it is just a container of code units. String mutation should be
managed via some overlying type like std::text. This paper
just reflects existing behavior.
- Henri asked if we really want to enable so much performance that we
risk our users. In Firefox, lots of string checking is done to avoid
security issues even though ill-formed UTF-8 is very rare. The
performance isn't bad.
- PeterBr responded that an implementation can choose to define its
behavior.
- Henri countered that, if it isn't required everywhere, then it can't
be relied on.
- Corentin suggested that, if you want safety, then
std::basic_string is not the type you're looking for.
We're going to need other types on top and, eventually, we'll have
more trusted types.
- Zach added that no interfaces are being specified in this paper, so
there are no ergonomic concerns. Again, this is just proposing
blanket wording that can be strengthened in individual interfaces.
- Tom initiated a discussion about polling during telecons.
- Tom introduced:
- He prefers to avoid polling during telecons in favor of polling
during face to face meetings. This is due to 1) larger numbers
of attendees at face to face meetings, 2) more opportunity for
input from those that do not regularly attend telecons, and 3)
more opportunity for background thinking after a discussion
before having to respond to a poll.
- He also sees the telecons as useful for priming discussion and
identifying non-obvious concerns.
- Tom asked if anyone wanted to argue for a change in practice.
- The group expressed general agreement to continue doing what we've
been doing.
- P1879R0 - The u8 string literal prefix does not do what you think it does
- https://github.com/tzlaine/small_wg1_papers/blob/master/P1879_please_dont_rewrite_my_string_literals.md
- Zach introduced:
- This started from an experience from a while back that we have
previously discussed.
- Tests involving UTF-8 formatted source files failed when compiled
with the Microsoft compiler, but not with other compilers.
- The source files did not have a UTF-8 BOM and Microsoft's
/source-charset:utf-8 option wasn't being used, so the
source files were decoded as Windows-1252.
- String literals therefore did not contain what was expected
because code units were not interpreted as expected.
- The paper proposes prohibiting use of u8, u,
and U literals unless the source file encoding is a
Unicode encoding.
- Corentin suggested relaxing the prohibition to allow use of these
literals so long as the source contents of the literal only use
characters from the basic source character set.
[ Editor's note: presumably this would still allow characters
outside the basic source character set if specified with
universal-character-name escape sequences. ]
- Corentin also stated that the current behavior makes sense according
to the standard, but most programmers aren't aware of source file
encoding vs execution encoding concerns.
- Henri stated that the behavior makes sense if you think of C++ source
code as text rather than bytes and agreed that this isn't what
programmers expect.
- PeterBr expressed support for the paper because it ensures you get
the same abstract characters written in the source file and added
that it would be nice if this paper used the same terminology as
proposed in Steve's recent terminology paper
(P1859R0).
[ Editor's note: this paper will be in the Belfast pre-meeting
mailing. ]
- Zach agreed regarding use of terminology.
- Tom expressed concerns regarding breaking backward compatibility,
particularly for z/OS where source files are EBCDIC and u8
literals are used to produce ASCII strings.
- Zach asked if it would help to only allow characters from ASCII.
- PeterBr stated that, if the compiler is not explicitly told what the
source encoding is, you are in trouble since the compiler can't
always detect an encoding expectation mismatch.
- Henri noted that the translation model matches what is done on the
web where HTML source is transcoded to some internal (Unicode)
encoding. A compiler could preserve meta data about the encoding a
literal came from and, if the transcoded code point is above 0x80,
issue a diagnostic.
- Zach asked for more information regarding concerns for z/OS and
EBCDIC.
- Tom explained the source translation model according to
translation phase 1.
Source files are first transcoded from an implementation defined
encoding to an implementation defined internal encoding. The internal
encoding has to be effectively Unicode (or isomorphic to it) due to
possible use of universal-character-name sequences in the
source code. The internal encoding is then transcoded to the various
execution encodings where needed.
- Tom went on to explain that there are multiple EBCDIC code pages and
that many of the characters available in them are not defined in
ASCII. Restricting UTF literals to just ASCII would prevent use of
those characters.
- Tom restated PeterBr's point from earlier. This problem is always due
to mojibake; the source file being encoded in something other than
what the compiler expects.
- PeterBr agreed that the root cause is the encoding mismatch and opined
that this is a problem worth solving. The question is how best to
solve it. The first place to look is at the translation from source
encoding to internal encoding.
- Henri expressed belief that it makes sense to address the problem
where Zach suggests.
- Zach stated that the right place to detect this is during parsing;
when parsing a UTF literal, it is critical to know what the source
encoding is.
- Tom countered that it is necessary to know the encoding as soon as you
hit a code unit that doesn't represent a member of the basic source character set.
- Henri stated that diagnosing any such code unit is a harder sell than
just diagnosing one in a UTF literal.
- Tom agreed.
- PeterBr noted that it is implementation defined how (or if) characters
outside the basic source character set are represented. The goal of
the paper is effectively to tighten that up. That means
implementations can have extensions to relax diagnostics.
- Henri responded that such arguments apply to any change to the
standard.
- Zach agreed, but noted this is restricted to source files that have
UTF literals with transcoded code points outside of ASCII.
- Henri stated that there is more potential for failures for some
character sets than others. For example, some character sets don't
roundtrip through Unicode. This failure mode already exists, but
there is little value in trying to diagnose this outside of UTF
literals.
- PeterBr stated that a source file with code units representing
characters outside of the basic source character set is ill-formed
subject to implementation defined behavior. When a programmer writes
a UTF literal, that is a request for a specific encoding, but it is
perfectly valid for the source file to be written in Shift-JIS.
- Henri acknowledged that perspective as logically valid, but doesn't
address the problems caused by the Microsoft compiler's default
behavior not matching user expectations. Programmers are using UTF-8
editors these days.
- PeterBr asserted that is a quality of implementation concern and not
an issue with the standard.
- Tom agreed.
- Zach stated that the proposed restrictions can be worked around by
using universal-character-name escapes and stated a
preference for implementing a solution that results in a diagnosis
for the problem he encountered, but that this isn't a critical
issue.
- Corentin brought up static reflection and that, at some point,
reflection will require defining or reflecting the source file
encoding.
- Tom stated that dovetails nicely with Steve's P1859R0 draft that
provides a callable for conversion of string literal encoding.
- Corentin noted that Vcpkg compiles all of its packages with the
Microsoft compiler's /utf-8 option and that Microsoft may
be open to defaulting source encoding to UTF-8 when compiling as
C++20.
- Zach added that the Visual Studio editor, by default, adds a UTF-8
BOM to new source files it creates, though it doesn't implicitly add
a UTF-8 BOM when existing files are added to a project.
- Corentin observed that, because source encoding is not portable,
most programmers just don't use characters outside of ASCII except
in comments; which is why such characters are ignored.
- PeterBr suggested that an evening session in Belfast to discuss this
or other ideas might be an option and that it would be good to talk
directly with implementors.
- Tom confirmed that the next meeting will be on October 23rd and will be
the last meeting before Belfast.
October 23rd, 2019
Draft agenda:
- P1844R0: Enhancement of regex
- P1892R0 - Extended locale-specific presentation specifiers for std::format
- P1859R0 - Standard terminology for execution character set encodings
Attendees:
- David Wendt
- Mark Zeren
- Peter Brett
- Steve Downey
- Tom Honermann
- Yehezkel Bernat
- Zach Laine
Meeting summary:
- Tom initiated a round of introductions for new attendees.
- P1844R0: Enhancement of regex
- https://wg21.link/P1844R0
- Tom introduced the paper on behalf of the author:
- The proposal is an expansion of std::basic_regex
specializations.
- We've discussed issues with std::basic_regex before.
The author has put significant effort into this proposal. It
includes wording. We owe it to the author to set aside any biases
and consider the benefits of this paper.
- An implementation is available though it only implements the
proposed char8_t, char16_t, and
char32_t specializations, not the existing char
or wchar_t specializations.
- The paper does not propose an alternative to
std::basic_regex, but rather attempts to address
shortcomings of it for UTF encodings via specializations.
[ Editor's note: this implies that the proposal doesn't
address issues with support of UTF encodings with the
char and wchar_t specializations. ]
- The paper proposes a new regex syntax option,
ECMAScript2019, to be used to select a regular expression
engine that implements the ECMAScript 2019 specification. This
option would be available for use with all
std::basic_regex specializations.
- The paper proposes a new dotall syntax option that allows
the . character to match any Unicode code point,
including new line characters, when using the
ECMAScript2019 option.
- The new ECMAScript2019 syntax option would be the only
syntax option supported for the char8_t, char16_t, and char32_t specializations.
- The ECMAScript2019 regular expression engine would
NOT exactly match the ECMAScript 2019 specification:
- The \xHH expression is redefined to match code points
rather than code units. However,
- The author would be fine with removing support for the
\xHH expression since support for code points is
provided by the \uHHHH and \u{H...}
expressions.
- The proposal removes locale dependency for the char8_t,
char16_t, and char32_t specializations and
therefore does not propose any new specializations of
std::regex_traits.
- The paper proposes new overloads of std::regex_match and
std::regex_search to allow specifying look behind limits
on ranges.
- The proposed changes to std::regex_iterator are ABI
breaking.
- PeterBr observed that the proposal doesn't deal with language specific
aspects like case folding.
- PeterBr stated he liked the motivation for this paper and the notion
that std::regex can be made to work.
- Zach asked about support for collation and whether anyone was familiar
with the existing collate syntax option.
- PeterBr responded that the paper states that the collate
option is ignored for these specializations.
- Zach stated that the default collation is not useful and that
tailoring is required.
- Tom summarized, so the paper needs to address collation.
- Zach disputed that need since addressing collation could profoundly
impact performance.
- PeterBr suggested that, perhaps, regex for Unicode should operate on
std::text.
- Tom expanded that suggestion to any sequence of code points and
observed that the proposal kind of does that already via the changes
to regex_iterator.
- Zach agreed it would be useful to use as an adapter for code
points.
- Tom asked if a new regex feature for non-compile-time regex support
would be preferred over specializing std::basic_regex as
proposed.
- Zach responded that he doesn't think std::regex is DOA, but
if we're going to support Unicode regex with dynamic patterns, then,
we should pursue some of the design of CTRE.
- Zach added that solving the problem is important and that he wants to
see Unicode regex support but would prefer to take a wait-and-see
approach on this paper while watching how CTRE and
std::format evolve.
- PeterBr acknowledged the benefits of CTRE, but stated that we do need
a solution for dynamic regex.
- Zach reported that he believes that Hana is planning to make CTRE
capable of supporting dynamic pattern strings and that, if that were
to happen, we wouldn't need std::regex any longer.
- Mark lamented the lack of a proposal like this one when C++11 was
being designed since the approach looks good relative to other papers
from the past.
- Mark added that it is an embarrassment that we don't have a solution
for this today, but that he feels kind of neutral on it as well due
to concerns about allocating time for this relative to other things
we could do.
- Mark asked what implementors would think and if they get requests for
Unicode std::regex support.
- Mark asserted that the implicit use of the ECMAScript2019
engine when a different syntax option is specified has to be
changed.
- Zach reiterated that this proposal is definitely an ABI break, that
an ABI break is a serious problem, and that the need for such a break
suggests we need a different family of types.
- Mark added that the paper should make it clear that it does break ABI,
not that it might.
- Tom asked if this proposal solves the std::basic_regex
issues with support for variable length encodings.
- Zach responded that std::regex doesn't handle incomplete or
ill-formed code unit sequences and suggested that perhaps those should
match against \uFFFD.
- Zach reported that std::regex can also match code unit ranges that
straddle code unit sequence boundaries since std::regex effectively
matches bytes.
- Tom asked what guidance we should offer to LEWG.
- Zach suggested:
- We should solve this problem.
- This approach is premature given other things in flight now, but
if this had been proposed three years ago he might have felt
differently about it.
- PeterBr suggested it should be prioritized behind CTRE.
- Tom asked whether support for tailoring is important.
- Zach suggested placing tailoring at the lowest priority and mentioned
that he doesn't think ICU supports it as people don't often want to
do collation aware searching.
- Tom reiterated that we should offer guidance that it be ill-formed to
specify a syntax option other than ECMAScript2019 for the
proposed specializations.
- P1892R0 - Extended locale-specific presentation specifiers for std::format
- PeterBr introduced the paper:
- Looking through the std::format specification he found
that there are useful floating point formats that can not be
produced in locale specific formats.
- Locale specific formats are important in scientific fields.
- The 'n' specifier has a different meaning for integers
than it does for floating point.
- An NB comment was filed to make the 'n' specifier
indicate a locale specific format rather than a type
modifier.
- The proposed change should not affect existing well-formed
std::format calls except for bool which would
now be formatted as locale variants of "true" or "false" instead
of 1 or 0.
- This would make std::format unambiguously the best choice
for localized formatting since locales can be easily specified and
std::format already solves shortfalls of iostreams and
printf such as ordering.
- Without this change, there is still a need to use printf
for locale sensitive formatting.
- Mark noted that this change will break existing users of
{fmt}.
- PeterBr responded that it will for existing uses of bool but
that he isn't concerned about existing users of
{fmt}.
- Tom observed that use of 'l' as the specifier as suggested in
the paper avoids the break and aligns with Victor's
P1868R0 paper to enable locale
specific handling of character encodings.
- Mark stated that the core issue is that there remain some uses of
printf that can't be directly replicated with
std::format and asked how a programmer would print, for
example, the locale specific decimal character but without the locale
specific thousands separator.
- PeterBr responded that the programmer can create a custom locale.
- Zach stated that we can't defer this until C++23 because changing the
meaning of 'n' would break compatibility and asked why we
can't just introduce an 'l' specifier in C++23.
- PeterBr responded that doing so makes things more complicated and
asked whether we would deprecate 'n' if 'l' were to
be adopted. We can postpone addressing this, but we get a cleaner
solution in the long term by addressing it now.
- Zach agreed with the motivation being to avoid a wart that we'll need
to teach but that some opposition will be raised due to perceived risk
at this late stage.
- Zach stated that he likes the change, but that it needs good
motivation.
- PeterBr suggested that 'n' could be removed now and then
restored with desired changes in C++23.
- Zach suggested that if Victor supports the paper, it will probably
pass, but if he disagrees with it, then it is probably DOA.
- Mark stated that the choices need to be clearly presented for
LEWG.
- Zach observed that there are a few options and suggested presenting a
cost/benefit of each so that LEWG is given clear choices.
- Mark suggested socializing the issue on the LEWG mailing list now to
flush out any objections.
- PeterBr stated that any help improving the paper would be
appreciated.
- Mark suggested presenting either slides or a different paper that
presents the options and analysis.
- PeterBr stated he would create a doc that could be collaboratively
edited.
- P1859R0 - Standard terminology for execution character set encodings
- Steve introduced the paper:
- The goal is to not affect implementations, but rather to fix
wording so that we can use modern terminology and understand
each other better.
- We often use terms like "execution encoding" that are not defined
in the standard and are opportunities for confusion.
- We need to admit that wchar_t is not, in practice, able
to hold all code points of the wide execution character set.
- Zach asked what "literal encoding" is for.
- Steve responded that it reflects the encoding for non-UTF
literals.
- Zach asked what difference is intended by "character set" and
"character repertoire".
- Steve responded that the goal is to tighten up the meanings of
existing terminology so as to avoid massive changes to the
standard.
- Mark observed that there seems to be a missing word in the
definition of "Basic execution character set"; that there seems to
be a missing "that".
- PeterBr stated that this should be high priority in C++23 so we can
get everyone on board with terminology.
- Steve agreed and asserted we'll need to socialize these new
terms.
- Tom asked if there are any terms being dropped; it looks like the
paper adds "literal encoding" and "dynamic encoding".
- Steve responded that none are dropped and stated there will be an
additional associated encoding added for character types as well.
- Mark noticed that the paper discusses literal_encoding and
wide_literal_encoding but doesn't define a term for "Wide
literal encoding".
- Tom asked if "source encoding" should be added.
- Tom asked if we should add a statement that the dynamic encoding must
be able to represent all of the characters of the execution character
set.
- Steve responded that we could add that.
- PeterBr observed a potential problem with doing so on Windows where
the dynamic encoding might be UCS-2, but the execution character set
is UTF-16.
- Tom suggested refining the requirement such that characters used in
literals must have a representation in the dynamic encoding.
- Mark suggested it would be helpful to have a cheat sheet with
mathematical notation of which terms denote a subset of other
terms.
- Steve agreed.
- Tom suggested that we also need "wide dynamic encoding".
- Zach asked about the difference between the "encoding" and "character
set" terms.
- Steve responded that the former states how characters are represented
while the latter states what characters must be representable.
- Zach stated it would be useful to have text explaining the
difference.
- Tom asked how ODR violations would be avoided for
literal_encoding since literal encoding can vary by TU.
- Steve responded that the same technique used for
std::source_location can be used; a value is provided.
- Tom confirmed that the next meeting will be November 20th.
November 20th, 2019
Draft agenda:
- Belfast follow up and review.
- Volunteers to draft a library design guidelines paper.
Attendees:
- JeanHeyd Meneide
- Mark Zeren
- Steve Downey
- Tom Honermann
- Yehezkel Bernat
- Zach Laine
Meeting summary:
- P1868 - 🦄 width: clarifying units of width and precision in std::format:
- Tom introduced the topic:
- Concerns were raised in Belfast with regard to the stability of
the proposed code point ranges to be used for display width
estimation. The currently proposed ranges map all extended
grapheme clusters (EGCs) to a display width of one or two despite
there being a number of known cases of EGCs that consume no
display width (e.g., U+200B {ZERO WIDTH SPACE}) or more
than two display width units (e.g., U+FDFD {ARABIC LIGATURE
BISMILLAH AR-RAHMAN AR-RAHEEM}).
- Additionally, the EGC breaking algorithm is dependent on Unicode
version and the proposed wording does not specify which version
of Unicode to implement. Concerns were raised regarding having a
floating reference to the Unicode standard and the potential for
differences in behavior across implementations if the Unicode
version is implementation defined and subject to change across
compiler versions.
- How should we address these concerns?
- Zach commented that the wording review went through LWG ok and that
he had posted a message to the LWG mailing list responding to one
concern that was raised.
- Zach reported that Jonathan Wakely stated that floating references
to other standards are not permitted but that implementors can, as
QoI, offer support for other versions.
- Tom expressed surprise regarding that restriction given that we have
a floating reference to ISO 10646 in the working paper today.
- Zach responded that LWG stated a requirement for a normative reference
and is therefore planning to add a normative reference to Unicode 12
with the intent that we update the normative reference with each
standard release.
- Tom asked that, if we reference a particular version, can
implementations use a later version and remain conforming.
- Zach responded that doing so seems to be acceptable to
implementors.
- Steve remarked that CWG expressed a preference for a floating
reference.
- JeanHeyd confirmed and added that is how the working paper ended up
with the floating reference to ISO 10646.
- Zach said he will follow up about this discrepancy.
- Mark asked if we have a preference for floating vs fixed.
- Zach responded that implementations will do what they need to do for
their users.
- Tom turned the discussion back to concerns raised by Billy regarding
changes to the width estimate algorithm being a breaking change; e.g.,
changing the width estimate for a given EGC. This is a related but
distinct concern from the EGC algorithm changing due to a change in
Unicode version.
- Zach stated that U+FDFD is an example of something we need
to fix that can also be a breaking change.
- Steve repeated that the concern is basically any change in behavior
potentially resulting in a surprising or undesirable change.
- Mark asserted that we're going to continue having difficulties with
dependencies on Unicode data and that the situation is analogous with
respect to the timezone database. Implementors can enable stable
behavior by allowing choice of Unicode version.
- Steve noted that the rate of change of the Unicode standard has skewed
towards stability.
- Mark opined that we should not solve this problem in the
standard.
- Tom agreed and added that we can specify a minimum version, but leave
the actual version implementation defined.
- Mark asked which version of the Unicode standard the proposed code
point ranges were pulled from.
- Tom responded that the Unicode standard doesn't contain character
display width data and that these were extracted from an
implementation of wcswidth().
- Steve stated that he maintained a list of double wide characters for
years and that it was not a significant burden.
- Tom stated that his desire for a floating reference to the Unicode
standard with an implementation defined choice of version is intended
to allow implementors to keep up with new Unicode versions. Unicode
releases happen every year while C++ standards are only released
every three years. Implementors probably can't lag Unicode by three
years.
- Zach acknowledged the goal and stated that will result in some
implementation divergence as some implementors will keep up and some
won't, but that the differences are likely to be minor.
- Tom asked if ISO 10646 annex U constitutes a reference to
UAX#31.
- Steve suggested this is probably a bureaucratic issue and added that
having a normative reference is helpful.
- Zach responded that it could be harmful if we get conflicting
floating and non-floating references for ISO 10646 vs Unicode, but
this should fall to LWG and CWG to decide.
- Tom asked how we should go about fixing the currently proposed width
estimates since the proposed ranges are clearly missing support for
cases of zero width or width greater than two.
- Zach opined that he wasn't sure there is a problem to be fixed since
what is specified matches existing practice.
- Tom asked if we know where this implementation of wcswidth()
came from and how widely deployed it is.
- Zach suggested asking Victor.
- [ Editor's note: According to
P1868R0, the implementation
of wcswidth() is the one at
https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c.
]
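[ Editor's note: As a rough illustration of the kind of width estimation
wcswidth() performs, the following Python sketch derives widths from the
East_Asian_Width and combining-class properties. It is a simplification
of the algorithm in the wcwidth.c implementation referenced above, not a
reproduction of it. ]

```python
import unicodedata

def estimated_width(s: str) -> int:
    """Rough wcswidth()-style display width estimate: combining marks
    occupy zero columns, East Asian Wide/Fullwidth code points occupy
    two columns, and everything else occupies one."""
    width = 0
    for cp in s:
        if unicodedata.combining(cp):
            width += 0   # combining marks take no columns
        elif unicodedata.east_asian_width(cp) in ("W", "F"):
            width += 2   # wide and fullwidth characters
        else:
            width += 1
    return width

print(estimated_width("abc"))        # narrow ASCII: 3 columns
print(estimated_width("コンニチハ"))  # East Asian Wide: 10 columns
print(estimated_width("e\u0301"))    # 'e' + combining acute: 1 column
```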
- Tom asked for opinions regarding writing a short paper that explains
the Unicode stability guarantees and argues for floating references
and implementations.
- Zach suggested waiting for a more motivating reason to do so.
- P1949 - C++ Identifier Syntax using Unicode Standard Annex 31:
- Tom introduced the topic:
- EWG rejected the SG16 guidance offered in response to NB comment
NL029
to deprecate identifiers that do not conform to
UAX#31 with
noted exceptions for the _ character.
- A suggestion was made that a CWG issue be filed to consider the
lack of updates to the allowed identifiers since C++11 as a
defect.
- Tom agreed to file a core issue and started to do some
research.
- According to N3146, the
original identifier allowances appear to have been aggregated
from various sources including
UAX#31 and
XML 2008,
and following guidance in annex A of a draft of
ISO/IEC TR 10176:2003.
- Thank you to Corentin for quickly providing a way to query the
code point ranges that have the XID_Start or
XID_Continue property set.
https://godbolt.org/z/h7ThEh.
These ranges differ substantially from what is in the current
standard.
- What should the proposed resolution for the core issue be?
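[ Editor's note: Python's identifier rules (PEP 3131) are also based on
the UAX #31 XID_Start and XID_Continue properties, so str.isidentifier()
gives a quick way to probe which spellings a UAX #31-style rule accepts. ]

```python
# Python identifiers follow the UAX #31 XID_Start/XID_Continue
# properties (PEP 3131), with '_' permitted as an extension, much like
# what is under discussion for C++.
print("café".isidentifier())    # True: all code points have XID properties
print("_name".isidentifier())   # True: '_' permitted as an extension
print("2fast".isidentifier())   # False: digits lack XID_Start
print("a-b".isidentifier())     # False: '-' lacks XID_Continue
```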
- Steve stated that
UAX#31 permits
extensions, and what was adopted for C++11 effectively whitelisted
a large set of code points.
- Zach asked what EWG's concern was.
- Steve replied that they were nervous about such a late change and
want more time to think it through.
- Zach opined that this seems like something better addressed in
C++23.
- Steve noted that what is done can be backported to prior standards
though, that Clang and gcc support Unicode encoded source code
[ Editor's note: so does MSVC ], and that the longer we wait
to address this, the more code we potentially break.
- Tom stated that, from the DR perspective, we could either figure out
what we want for C++23 and recommend that as the proposed resolution,
or we can do a more targeted fix for C++20 for specific problematic
cases knowing that we'll likely do differently for C++23.
- Steve stated that the only difference C++ needs from
UAX#31 is support
for _, and such an extension is conforming. It would also
be ok to restrict identifiers to a common script to avoid homoglyph
attacks.
- Steve added that there is also the issue of normalization forms and
that gcc will currently warn if identifiers are not in NFC form.
- Mark asked if we should make it ill-formed for identifiers to not be
in NFC form.
- Steve responded that doing so could break existing code.
- Tom suggested normalizing when comparing identifiers is another
approach.
- Steve noted that doing so requires the Unicode normalization
algorithms.
- JeanHeyd mentioned that we'll also have the problem of reflecting
identifiers in the future and that normalization will be relevant
there. Corentin brought this up in SG7. Requiring NFC would be
helpful there.
- Mark expressed support for the idea of requiring NFC.
- Steve suggested that there is always the
universal-character-name escape hatch.
- Mark opined that EWG probably won't like requiring conversion to NFC
in name lookup.
- Tom responded that gcc is at least detecting non-normalized
identifiers today, that doing so must require some level of Unicode
database support, and that performance costs are presumably
reasonable.
- Steve stated that gcc looks for some range of combining code points
and may not be 100% accurate.
- Mark asked if non-NFC text can be detected without having to fully
normalize it.
- Zach responded that he didn't think so.
- Mark asked if normalization was brought up in EWG.
- Steve responded that it wasn't, that we didn't get that far in the
discussion.
- Tom suggested that we have a good amount to think about here and that
he is looking forward to the next revision of Steve's paper.
- Steve took the bait and agreed that the paper will have to provide
good arguments for why this is important.
- Zach suggested that this should be easy for implementors if they
don't have to deal with normalization and that we should just
require NFC for performance reasons.
- Mark asked if we could make non-NFC identifiers ill-formed, no
diagnostic required, so that implementations are not obligated to
diagnose violations.
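[ Editor's note: The Unicode NFC_Quick_Check property often allows
non-normalized text to be detected without performing a full
normalization; Python 3.8+ exposes a check built on it, illustrated
below. ]

```python
import unicodedata

nfc = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE (already NFC)
nfd = "e\u0301"     # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# is_normalized() is built on the Unicode quick-check properties and can
# frequently answer without producing a normalized copy (Python 3.8+).
print(unicodedata.is_normalized("NFC", nfc))   # True
print(unicodedata.is_normalized("NFC", nfd))   # False

# The two spellings compare unequal as code point sequences...
print(nfc == nfd)                              # False
# ...but compare equal once both are normalized to NFC.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```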
- P1097 - Named character escapes:
- Tom introduced the topic:
- EWG narrowly rejected the paper, but expressed good support for
the direction.
- Most concerns had to do with implementation impact and, in
particular, the potential increase in the size of compiler
binaries. Some distributed build systems distribute compilers as
part of the build process, and the additional latency imposed by
increasing the size of compiler binaries adds cost. Numbers
haven't been obtained, but guesses were around 2MB, though that
could probably be reduced to under 600K.
- One prominent EWG member was strongly opposed to the design
because he would prefer a solution that avoids baking Unicode
into the core language. Something like a string interpolation
solution that could call out to constexpr library
functions to do character name lookup.
- Martinho was working on an implementation in Clang at Kona, but
Tom doesn't know the state of it or where to find it. Tom
reached out to Martinho via email, but didn't hear back.
- Anyone have time and interest to experiment and produce some
estimates to address the implementation impact concerns?
- Steve stated that he could probably do some work on it and that the
name DB should compress really well with use of a trie.
- JeanHeyd suggested that the
UAX44-LM2
compression scheme could help to reduce size.
- Tom expressed uncertainty that it would help much over a trie, but
we could experiment and put the results in a paper.
- Zach suggested splitting names that contain "with" in them since the
suffixes that tend to follow "with" are highly repeated.
- Tom noted that the algorithmically generated names could be specially
handled as well.
- Steve added that a tokenization approach could help too.
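[ Editor's note: The following toy measurement, over a hand-picked
sample of five names rather than the real database, illustrates why a
trie compresses the name data well: shared prefixes such as
"LATIN SMALL LETTER " are stored only once. ]

```python
names = [
    "LATIN SMALL LETTER A",
    "LATIN SMALL LETTER B",
    "LATIN SMALL LETTER A WITH ACUTE",
    "LATIN SMALL LETTER A WITH GRAVE",
    "LATIN CAPITAL LETTER A",
]

def trie_node_count(words):
    """Build a character trie and count its nodes; each node stores one
    character, so shared prefixes are counted only once."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    def count(node):
        return sum(1 + count(child) for child in node.values())
    return count(root)

flat = sum(len(n) for n in names)   # characters stored with no sharing
shared = trie_node_count(names)     # characters stored in the trie
print(flat, shared)                 # 124 vs 53 for this tiny sample
```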
- Tom asked if anyone might know of a link to Martinho's
implementation.
- Zach replied that a link was provided at some point, possibly in
Slack.
- [ Editor's note: Tom searched Slack, but failed to find a
reference. ]
- P1880 - uNstring Arguments Shall Be UTF-N Encoded:
- Tom introduced the topic:
- LEWG rejected the SG16 guidance offered in response to NB comment
FR164
to adopt P1880 for C++20.
- What should we do next?
- Zach expressed frustration that he was available when the NB comment
and paper were discussed in LEWG, but that no one notified him that
the discussion was happening.
- Zach stated that, after the SG16 meeting, he went through all
references to std::basic_string and added missing references
to PMR strings and std::basic_string_view. This research
also identified a number of references that are deserving of more
scrutiny.
- Zach opined that this isn't very important for C++20 and that he will
work on a revision for C++23, though not for the Prague meeting.
- Zach stated he was surprised at how many references to these types he
found in function templates.
- Tom asked for volunteers to draft a library design guidelines paper.
- Tom introduced the topic:
- During the
SG16 meeting on July 31st,
we discussed guidelines for when to add function overloads for
each of char, wchar_t, char8_t,
char16_t, and char32_t and he would like to have
a library guideline paper that records our guidance.
- Would anyone be interested and willing to work on this?
- Zach expressed interest in doing so.
- Mark brought up a wording update email Zach sent to LWG with regard to
P1868:
- Mark noted that the wording introduces a new term of art: "estimated
display width units".
- Zach responded that the new term was intentional; we're leaving the
width estimation effectively unspecified for non-Unicode encodings.
Implementors expressed a preference for not having to document their
choices and we didn't want to force embedded compilers to have to be
Unicode aware. So, we needed a non-Unicode term.
- Tom noted that the wording appears to require embedded compilers to
use the proposed Unicode algorithm if their execution character set
is Unicode.
- Zach acknowledged that would be the case.
- Mark suggested that is probably what we want if they are actually
doing Unicode.
- Tom agreed and suggested such implementors could otherwise state that
their execution character set is ASCII.
- Tom communicated that the next meeting will be on December 11th.
December 11th, 2019
Draft agenda:
- Vocabulary type(s) for extended grapheme clusters?
- Per Michael McLaughlin's questions posted to the (old) mailing list
on 11/01.
- P1097: Named character escapes
- Review research on minimizing the name lookup DB and code size.
Attendees:
- Corentin Jabot
- David Wendt
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
Meeting summary:
- P1097: Named character escapes:
- Tom introduced the topic:
- Since our last meeting, Corentin did some outstanding
investigative and evaluation work and blogged about his results:
- Corentin's implementation of his size reduction techniques is
available at:
- The goal for today is to review his results and determine next
steps.
- Corentin opined that the data is still kind of large at approximately
260K.
- Zach noted that Corentin did a good job of estimating a theoretical
lower bound for reducing the data at around 180K, so achieving a
result of 260K is great.
- Steve commented that the code shows the challenges C++ has with
variable-length data. The natural representation would use variants,
but variants don't yield as compact a representation.
- Corentin agreed noting that good performance demands working at the
byte level.
- Zach expressed a similar experience working on
Boost.text; flat arrays
of bytes had to be used to achieve scaling goals.
- Tom stated that we need to draft a revision of this paper and that he
is happy to do so, but would welcome any other volunteers.
- Corentin asked if we know how to get in touch with Martinho.
- Tom responded that he tried, but did not get a response.
- Tom noted that, if we can't get in touch with Martinho, then we'll
need to submit a new paper rather than a new revision.
- Corentin asked if a new paper was really necessary.
- Steve responded that, as a matter of procedure, we need a new paper to
get it on the schedule.
- PeterBi added that we need a place to record the new information.
- Tom stated he would attempt to contact Martinho again.
- [ Editor's note: Tom did reach out again via email, but again did
not get a response. ]
- Tom asked Corentin if he wanted to take this and run with it given the
considerable investment he has already made.
- Corentin responded that he is unfortunately time constrained.
- Corentin mentioned that the new paper should state the need for
matching name aliases and case insensitivity.
- Tom agreed and noted that we have polls on those topics from
presentation to EWGI in San Diego that record a trail of intent for
those cases.
- Zach asked Corentin if dashes are handled properly in his
experiment.
- Corentin replied affirmatively that spaces, dashes, and underscores
can be omitted or swapped as recommended by Unicode in
UAX44.
- Corentin added that the current 260K size includes support for name
aliases.
- Steve observed that there is motivation for allowing spaces, dashes,
and underscores to be interchangeable; that behavior falls out of a
good implementation.
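[ Editor's note: The UAX #44-LM2 loose matching rule referenced above
ignores case, whitespace, underscores, and medial hyphens when
comparing names. The sketch below omits the rule's one exception
(U+1180 HANGUL JUNGSEONG O-E, whose hyphen is significant). ]

```python
import re
import unicodedata

def loose_key(name: str) -> str:
    """UAX #44-LM2-style loose matching key: fold case, drop spaces and
    underscores, and drop medial hyphens. (The real rule keeps the
    hyphen in U+1180 HANGUL JUNGSEONG O-E; omitted here for brevity.)"""
    key = name.upper().replace(" ", "").replace("_", "")
    return re.sub(r"(?<=.)-(?=.)", "", key)  # remove medial hyphens only

# All of these spellings identify the same character:
variants = ["ZERO WIDTH SPACE", "zero-width space", "ZERO_WIDTH_SPACE"]
assert len({loose_key(v) for v in variants}) == 1
assert unicodedata.name("\u200b") == "ZERO WIDTH SPACE"
print(loose_key("zero-width space"))   # ZEROWIDTHSPACE
```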
- Corentin stated that, should a desire arise to be able to map code
points to names, then a different implementation would provide a more
optimized data set that handles mapping both directions.
- Tom asked Corentin for an estimated size for a perfect hash
approach.
- Corentin responded with 300K to 400K.
- Corentin pointed out a potential challenge; that it may be desirable
to support code point to name mapping in the standard library, but
probably not in the compiler. This implies a potential need for the
Unicode character name data to be available to both.
- Steve stated that it seems unfortunate to not expose the compiler data
to the library.
- Corentin suggested the data would probably need to be present in both
the compiler and the library.
- Tom provided a possible way to avoid that; by making it available in
the library, but accessible from the core language. At least one EWG
member strongly advocated for such an approach; a string interpolation
like facility.
- Vocabulary types for extended grapheme clusters:
- Tom introduced the topic:
- Michael McLaughlin had posted some questions to the (old) mailing
list on 2019-11-01:
- These questions are related to representation of extended grapheme
clusters (EGCs), specifically, how a collection or sequence of
them might be stored.
- Should the standard library provide vocabulary types for EGCs?
- Zach explained the choices he made for
Boost.text. There are
two vocabulary types;
grapheme
provides value semantics and stores a small vector optimized sequence
of code units with a maximum size limited according to the
Unicode stream-safe text format described in UAX #15,
and grapheme_ref
provides read-only reference/view semantics over a code point range
denoted by an iterator pair.
- Zach added that he is unsure if anyone is using the value type.
- Corentin acknowledged the uncertainty regarding use cases for a value
type.
- Corentin asked why the reference/view version is not an alias of a
span.
- Zach responded that he wanted to support subranges and non-contiguous
storage. The implementation uses the view_interface CRTP
base from C++20 ranges.
- Steve asked who the anticipated consumers are for use of EGCs.
- PeterBr expressed similar curiosity and provided some background
experience; he previously worked on a product that was text based and
everything was done on graphemes. Support was available for
individual grapheme replacement, but a value type was never needed
because reference/view semantics were always desired. All text
processing was performed in terms of ranges of graphemes.
- Zach offered a couple of examples. Text rendering depends on
knowledge of EGC boundaries. Additionally, an EGC reference is the
value type of an (EGC-based) iterator on a text range.
- Zach observed that breaking algorithms don't always break on EGC
boundaries, though split EGCs still remain EGCs on either side of the
boundary.
- Steve stated that having a named type is very useful. An EGC view is
essentially a subrange, but naming it is useful.
- PeterBr clarified that an EGC is effectively a range of code
points.
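[ Editor's note: The "range of code points" view can be illustrated
with a drastically simplified segmentation that attaches combining
marks to the preceding base code point. Real UAX #29 EGC segmentation
has many more rules (Hangul jamo, ZWJ emoji sequences, CR LF, regional
indicators, ...); this is only a sketch of the concept. ]

```python
import unicodedata

def simple_graphemes(text: str):
    """Yield clusters consisting of a base code point plus any
    following combining marks. This captures only one of the UAX #29
    rules; it is an illustration, not a conforming EGC segmenter."""
    cluster = ""
    for cp in text:
        if cluster and unicodedata.combining(cp):
            cluster += cp        # extend the current cluster
        else:
            if cluster:
                yield cluster
            cluster = cp         # start a new cluster
    if cluster:
        yield cluster

# 'n' + U+0303 COMBINING TILDE is two code points but one cluster.
clusters = list(simple_graphemes("an\u0303o"))
print(len(clusters))             # 3 clusters from 4 code points
```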
- Tom asked if there is a good distinction between an EGC type that
represents a range of code units or code points that constitute
exactly one grapheme vs a type that represents a range of EGCs in
terms of a range of code units or code points.
- Zach replied yes,
Boost.text has a type
that represents the latter case as well;
grapheme_view
is a view that provides an EGC iterator. So, yes, there are three
potentially useful types: an owning EGC, a reference EGC, and an EGC
view.
- Steve asked how breaking algorithms that split EGCs interact with
these types.
- Zach replied that all Unicode algorithms are specified in terms of
code points, not EGCs. So, a split EGC just becomes two EGCs. The
sentence breaking algorithm may cause this to happen.
- Tom recalled prior conversations in which we discovered that the EGC
count of the parts of a text may be greater than the EGC count of the
whole text.
- Steve asked for confirmation that you can still view the split code
point ranges as EGCs.
- Zach confirmed, yes.
- Corentin asked if all of these types aren't effectively
subranges.
- Steve replied yes, but distinct types are useful to avoid subranges
of subranges.
- Corentin countered that, if you have a text_view and you
split it, you get a text_view.
- Zach stated that the idea that the Unicode algorithms produce
sequences of code points but programmers want EGCs is a key idea.
- PeterBr observed that rendering text requires more than just
EGCs.
- Steve returned conversation to the motivation for EGC types and
mentioned the DB field example; there is a known limit on how many
bytes can be stored, and EGCs indicate where text should be truncated
to.
- Tom asked if there is a need to distinguish between an EGC view and a
subrange of EGC view other than an EGC reference; as Corentin
mentioned, a subrange of a text_view is a text_view,
so is a subrange of an EGC view an EGC view?
- Zach stated he didn't see a need for such a distinction. Most
interfaces should operate on EGC views, but for Unicode algorithms,
it is necessary to drop down a level to a code point view.
- Steve summarized; an EGC reference is a view over code points with a
contract that its range represents exactly one EGC.
- PeterBr imagined a scenario in which a range of code points is sliced
to produce multiple EGCs, but when recombined with additional text,
might yield different EGCs.
- [ Editor's note: Some discussion was missed here. ]
- Tom stated a need for consistent terminology. Tom originally proposed
text_view as a sequence of code points, but we now think it
should be EGC based.
- PeterBr expressed concern; most people think they want code points.
LEWG might object to an EGC based design.
- Zach stated that a concern we have is that we're the Unicode experts
and everyone with strong opinions is pretty much on this call; we
need to be aware of echo chamber issues.
- Tom added that echo chamber issues are the thing that keeps him up at
night; how do we ensure we deliver what is truly useful?
- Steve added that he frequently is asked why some simple thing isn't
implemented. The answer is, because it isn't actually simple.
- Corentin stated that he gets quite concerned whenever we discuss going
in a direction that doesn't align with Unicode recommendations; the UTC
(Unicode Technical Committee) doesn't get things wrong very
often.
- Steve noted that, fortunately, we're kind of late to the game, we can
learn from the experience of other languages, and we don't have to
discover all the problems ourselves.
- Tom returned discussion to the subrange of subrange concern; there may
be a need to put subranges back together.
- Corentin replied that there is an ongoing effort to support that, but
it is complicated. JeanHeyd is working on
P1664 and it should be discussed
more in Prague.
- Steve described one of the challenges; when we have an EGC view and
want to get down to the code unit range for efficient IO, reassembly
can get difficult.
- Zach replied that, if you have an EGC view over a code point view over
a sequence of code units, that is easy.
- Tom countered that doing so requires that you know that the underlying
storage is contiguous if you want to operate on it at the code unit
level.
- Steve added that there can't be a missing range in the middle.
- Corentin expressed a belief that this will be solved; maybe not for
C++20, but for C++23.
- Tom stated that our normal meeting cadence would have us meeting again on
December 25th 🎅, but he expected that meeting that day would be
unpopular, so we'll plan to meet next on January 8th.