SG16: Unicode meeting summaries 2018/05/16 - 2018/06/20
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
May 16th, 2018
Draft agenda:
- Review and discuss papers in the Rapperswil pre-meeting mailing.
- Discuss plans and goals for those attending Rapperswil.
Attendees:
- Bob Steagall
- Corentin Jabot
- Dalton Woodward
- Florin Trofin
- JeanHeyd Meneide
- Mark Zeren
- Martinho Fernandes
- Steve Downey
- Tom Honermann
- Zach Laine
Meeting summary:
- It was reported on Slack that Martinho's properly formatted UTF-8
P1041R0 paper was served up by
open-std.org either without a CharSet header or with a Latin1
setting. Tom contacted Hal and Keld. Further discussion yielded a plan
to update
SD-7 to require UTF-8 for .md files and to configure the
open-std.org web server to serve them with a CharSet=UTF-8 header.
- Zach, Bob, and JeanHeyd shared some of their experience at C++Now.
- We then went on to review papers from the pre-Rapperswil mailing.
- P1041R0 - Make char16_t/char32_t string literals be UTF-16/32
- Tom noted a typo in the proposed wording changes for
lex.ccon/4; a use of UTF-8 where UTF-16 was
intended.
- Given the encoding issues and lack of Markdown rendering support built
into browsers, it was suggested that future papers, at least for now,
be submitted in a pre-rendered format.
- Martinho asked about getting the paper scheduled for discussion in
Rapperswil. Tom said he would forward SG16 polls on papers we
discussed to WG chairs to communicate our position and request time
in Rapperswil. Tom will copy paper authors and expected presenters on
this communication.
- It was asked if there was any library impact. Martinho responded no.
Tom noted that he had previously audited occurrences of char16_t,
char32_t, UTF-16, and UTF-32 and could not
think of a case.
- Zach suggested that, when presenting, it be emphasized that no
implementation will need to make changes; that this is just standardizing
existing practice. Emphasize that there are no known implementations
where the encoding used is not already UTF-16/UTF-32 and that a member
of the C committee was consulted.
- Poll: Those in favor of P1041R0?
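The existing practice referred to above can be checked at compile time. The following is a minimal sketch and is not taken from P1041R0; the helper names are illustrative, and U+1F600 is an arbitrary choice of a code point outside the Basic Multilingual Plane.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// UTF-16 has to encode U+1F600 as a surrogate pair (two code units).
static_assert(code_units(u"\U0001F600") == 2, "expected a surrogate pair");
// UTF-32 encodes every code point in exactly one code unit.
static_assert(code_units(U"\U0001F600") == 1, "expected one code unit");

// Run-time view of the same literals (hypothetical helper names).
inline bool char16_literal_is_utf16() {
    const char16_t s[] = u"\U0001F600";
    return s[0] == 0xD83D && s[1] == 0xDE00;  // high and low surrogate
}

inline bool char32_literal_is_utf32() {
    const char32_t s[] = U"\U0001F600";
    return s[0] == 0x1F600;
}
```

If either static_assert fired on some implementation, that implementation would be a counterexample to the "no known implementations" statement above.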
- P1072R0 - Default Initialization for basic_string
- Mark noted that P1072 is dependent
on P1010 which is dependent on
Richard Smith's P0593. This
raised the question of prioritization and a request for SG16 to request
that P0593 and
P1010 get time in Rapperswil so
that progress can be made on P1072
in San Diego. Tom agreed to make such a request; specifically to
request that EWG entertain P0593
and that LEWG look at P1010 (and
P1072 time permitting).
- Mark went on to discuss applicability of
P1072 to SG16. Of particular
concern are the issues caused by requiring null termination. This is
not a problem for vector, and hence not a concern expressed in
P1010.
- Mark pointed out that the design is used in real world code today.
- Zach asked why reserve() doesn't suffice. Mark explained the
examples in the paper; that we currently either have to repeatedly
update the size of the container with each addition, or eagerly resize
the container and pay for an unused initialization. The goal of the
paper is to avoid both costs by enabling writing to excess capacity
independently of updates to the container size.
- Tom asked if option A is viable. The concern is that const member
functions must be thread safe. A call to resize_uninitialized()
makes uninitialized data available to const member functions. Further,
there is no event to indicate when the uninitialized data has been
read and therefore no memory barrier to function as a synchronization
point.
- Mark acknowledged that a two-phase commit approach is necessary to
avoid UB.
- Martinho observed that two-phase commit is not sufficient by itself
because basic_string uses excess capacity to store a null
terminator for the string; this is what allows the data() and
c_str() member functions to be const qualified.
Overwriting the null terminator will cause UB for concurrently
executing threads.
- Mark advised SG16 to consider the consequences of providing implicit
null-termination for string-like containers in the future. An
alternative approach would use string builders that append a
null-terminator when they are collapsed.
- Mark noted that the two-phase commit approach does at least allow the
implementation to re-establish invariants (such as ensuring a
null-terminator is present at the start of excess capacity following
insert_from_capacity()).
- Tom suggested an emplace-like solution might be preferred to enable
preserving invariants.
- Mark acknowledged a call-back/functor based solution would work (though
it still doesn't address the over-written null-terminator issue).
- Dalton asked about making vector/string node-based containers such
that data could be written to a new buffer and then swapped in. This
has the disadvantage of requiring that the current buffer be copied
prior to performing the append.
- Tom asked if any performance numbers were available. What is the
expected gain?
- Mark responded that numbers are not available, but that Google has
measured and claims the improvements make this feature worthwhile.
The estimate is an improvement of a few percent.
- Corentin asked why vector could not be used instead of string.
- Mark responded that string is a vocabulary type.
- Poll: Do we agree P1072R0 addresses a problem worth solving?
- Poll: Do we prefer option A, option B, or some option C?
- Mark clarified that option C, as discussed today, would be one of:
- An emplace-like call with a call-back/functor.
- A node-based swap.
- Discussion moved into allocator interaction with node types.
- Zach stated that swap is broken for PMR allocators.
- Steve agreed and provided an elaboration; that the swap of the
allocators doesn't swap the actual buffers.
- Mark noted that moving a buffer between vector and
string encounters complexities due to null-termination
requirements.
- Martinho asked how a small buffer optimized string is moved into a
node type.
- Dalton responded that you allocate.
- Tom added, or the node type implements the SBO itself.
- Mark expressed concern that an emplace-like call-back/functor approach
may not work for the network use case of wanting to read data off the
network directly into the buffer.
- Zach suggested that, in a string builder approach, vector is
the string builder.
- Corentin expressed a preference for a specific string builder type
rather than vector. Essentially a vocabulary type suited to
the purpose.
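The two costs Mark described for today's basic_string can be sketched as follows. This is an illustration only: produce() is a hypothetical stand-in for an API that fills a caller-supplied buffer (such as a network read), the function names are invented here, and none of the interfaces proposed in P1072R0 are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical producer standing in for an API that fills a caller-supplied
// buffer (for example, a network read). Returns the number of bytes written.
inline std::size_t produce(char* out, std::size_t capacity) {
    static const char payload[] = "hello, sg16";
    const std::size_t available = sizeof(payload) - 1;
    const std::size_t n = available < capacity ? available : capacity;
    std::memcpy(out, payload, n);
    return n;
}

// Approach 1: resize eagerly. Every byte is first set to '\0' by resize()
// and then overwritten by the producer, so the initialization is wasted.
inline std::string fill_via_resize(std::size_t max_len) {
    std::string s;
    s.resize(max_len);
    const std::size_t n = produce(&s[0], s.size());
    s.resize(n);  // shrink to what was actually written
    return s;
}

// Approach 2: reserve, then append. Nothing is initialized twice, but the
// data goes through a temporary buffer and the size is updated per append.
inline std::string fill_via_append(std::size_t max_len) {
    char tmp[64];
    const std::size_t limit = max_len < sizeof(tmp) ? max_len : sizeof(tmp);
    const std::size_t n = produce(tmp, limit);
    std::string s;
    s.reserve(n);
    for (std::size_t i = 0; i < n; ++i) s.push_back(tmp[i]);
    return s;
}
```

Both functions return the same string; the difference is only in the redundant work each performs, which is the overhead the paper aims to remove by allowing writes into excess capacity.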
- P1025R0 - Update The Reference To The Unicode Standard
- Steve briefly introduced the change as similar to what had been
proposed, but not completed, for C++17.
- Tom asked, why update the normative reference to specify each of
Unicode 10, Unicode without a version indicator, and ISO 10646?
- Steve answered, we need ISO 10646 for existing references; for
example, the __STDC_ISO_10646__ macro. We want to reference
the Unicode standard (in addition to ISO 10646) for stability
guarantees and additional features. We want to reference Unicode 10
to establish a minimum requirement, and the unversioned Unicode
standard to enable implementors to adopt a newer version.
- Tom suggested adding a non-normative note that implementors are
allowed to use Unicode 10 or newer; though they must use a
corresponding version of ISO 10646.
- Martinho stated that we need to make it clear that implementors must
choose a specific Unicode release.
- Tom asked if we should require a predefined macro that indicates the
Unicode version.
- Steve and Martinho both answered, maybe, but not yet as we don't
actually depend on anything Unicode version dependent yet.
- Poll: Those in favor of P1025R0:
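For reference, the existing __STDC_ISO_10646__ macro mentioned by Steve can be queried as sketched below. The function name is illustrative; the sketch assumes only that the macro, when defined, expands to an integer of the form yyyymmL. No Unicode version macro is shown because none was proposed in this discussion.

```cpp
#include <cassert>

// Returns the yyyymm value of __STDC_ISO_10646__ when the implementation
// defines it, or 0 when the macro is not defined.
inline long iso_10646_version() {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```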
- Our next meeting will be May 30th; the week before Rapperswil.
- There is a WG21 administrative teleconference May 25th.
- Tom will dial-in to give an update on SG16. Martinho and JeanHeyd
are encouraged to attend as well since they have papers to present.
- Those planning to attend Rapperswil: Martinho, Corentin, Peter, JeanHeyd.
- Following the meeting, Martinho volunteered to present
P1025R0 at Rapperswil since Steve
will not be present. Steve agreed.
May 30th, 2018
Draft agenda:
- Discuss plans and goals for those attending Rapperswil.
- Review and discuss the following papers from the Rapperswil pre-meeting mailing:
- P1030R0: std::filesystem::path_view
- P0540R1: A Proposal to Add split/join of string/string_view to the Standard Library
- P0645R2: Text Formatting
Attendees:
- JeanHeyd Meneide
- Mark Zeren
- Martinho Fernandes
- Peter Bindels
- Sergey Zubkov
- Steve Downey
- Tom Honermann
- Zach Laine
Meeting summary:
- Administrative updates:
- Tom reported that WG chairs were contacted regarding SG16 requests
for paper reviews in Rapperswil. WG chairs are predictably swamped
and prioritizing as best they know how, but we may not get to present
any of our papers.
- Zach observed that Titus is concerned about the amount of time that
LEWG will need for ranges, but that LWG should be more concerned.
- Tom relayed that JF Bastien volunteered to arrange introductions with
Swift and WebKit developers working on Unicode. Tom reached out to
arrange meetings, but hasn't heard back. Apple developers are busy
preparing for WWDC; Tom will reach out again soon.
- Tom brought up the recent news that Microsoft has added beta support
for UTF-8 as a system code page as of the Windows 10 April update.
Tom made some new contacts within Microsoft, but has not yet gotten
any further information about Microsoft's goals or plans with this
change.
- Rapperswil planning:
- Tom asked for volunteers to stand up for SG16 at the Saturday plenary
in Rapperswil and give a brief update. Martinho and JeanHeyd agreed
to do so.
- Tom asked for those who have attended meetings before to offer any
advice they have for first time attendees.
- Zach recommended spending some time in each of the WGs. Each WG has
its own personality.
- It was noted that hanging around in WGs where one has a short paper
in the queue creates opportunities to present earlier than the paper
might otherwise be scheduled. The
P1025 (normative Unicode reference)
and P1041 (char16_t/char32_t are
UTF-16/UTF-32) papers are good candidates.
- Zach also mentioned not to be afraid to ask questions and to try to
read papers ahead of time.
- Tom noted that anyone present in the room is allowed to vote in straw
polls, but that polls in plenary are generally restricted to ISO
members. It was noted that Herb will make it clear when ISO membership
is required to vote.
- P1030R0 - std::filesystem::path_view
- Martinho liked it, especially section 4.1 (Assume UTF-8 for char based
interfaces).
- Tom liked it with the exception of section 4.1.
- Tom expressed a belief that the discussion in section 4.1 of how
existing char based interfaces on Windows handle conversion to
wchar_t for invocation of native filesystem interfaces is
incorrect. Tom's understanding is that char based strings are transcoded
to wchar_t strings using the system code page.
- Zach asked what is meant by ANSI encoding.
- Tom explained that Microsoft has long referred to char based encodings
collectively as ANSI encodings despite these encodings not reflecting
an ANSI standard.
- [Editor's note: Microsoft's glossary of terms on MSDN describes
the origin of the ANSI reference here. It comes from a draft ANSI
specification that was eventually standardized as the ISO-8859 family
of encodings. See the definition of "ANSI" at
https://msdn.microsoft.com/en-us/goglobal/bb964658.aspx#a.
Microsoft now officially refers to these encodings as "Windows code
pages".]
- Zach initiated a discussion on compile-time vs run-time encodings.
Section 4.1 describes a scenario in which file paths are pasted into
source code as string literals, but the existing interpretation of
such strings, when used as paths at run-time, depends on run-time
locale settings.
- Peter mentioned that the Microsoft compiler now supports a
/utf-8 option that purports to define the source and execution
character encodings. However, that option really only affects how
literals from the source code are translated to the execution character
encoding (UTF-8 at compile time, but never UTF-8 at run-time (at least,
not until the newly introduced beta support in Windows 10 that requires
the user to opt in)).
- Tom stated that we can't fix the compile-time vs run-time aspects of
the execution character encoding.
- Martinho countered that char8_t offers a solution for this -
we know the compile-time and run-time encoding of char8_t
characters and strings.
- Tom suggested a response to the author: maintain consistency with
existing code; char means "ANSI" encoding. Use
char8_t for UTF-8 (follow the changes to path
proposed in the char8_t proposal).
- Tom, Zach, and JeanHeyd all noted the presence of #ifdefs
surrounding the wchar_t based interfaces in the proposed
design. We don't use #ifdef as specification for implementation
defined features.
- JeanHeyd noted that path_view should not fight with the
platform; don't propagate implementation defined behavior through
interfaces to the programmer.
- Martinho observed that there is no rationale for providing
wchar_t based interfaces only for Windows; they are perfectly
applicable to other platforms as well.
- Zach stated that path_view should work the same as
path; just as string_view does for string.
path_view should support the same set of constructors that
path has and they should behave the same. If there is a need
for new constructors, they should be added to both path and
path_view.
- Zach noted that path_view should be explicitly constructible
from path, not the other way around. [Editor's note: as
currently specified, path_view is constructible from
path, though the constructor isn't explicit. Note that
string_view's corresponding constructor is also not
explicit.]
- Further discussion regarding memory allocation and the behavior of the
proposed c_str class ensued. [Editor's note: few details
of this discussion were recorded. From what I recall, consensus was
that the memory allocation behavior should be implementation
defined.]
- JeanHeyd asked how we should communicate our feedback to the author.
- Zach replied with a preference for a direct person-to-person
response.
- JeanHeyd volunteered to deliver feedback.
- Poll: Use execution character encoding for char interfaces,
char8_t for UTF-8?
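Martinho's point that the encoding of UTF-8 literals is known at both compile time and run time can be sketched as follows. The helper names are illustrative. The code is written to compile whether u8 literals have type char (before the char8_t proposal) or char8_t (with it); an ordinary literal is deliberately not tested, since its bytes depend on the execution character encoding.

```cpp
#include <cassert>
#include <cstddef>

// Byte value of a code unit, whether u8 literals are made of char or char8_t.
template <typename CharT>
constexpr unsigned char byte_of(CharT c) { return static_cast<unsigned char>(c); }

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is always C3 A9 in a u8 literal,
// independent of locale settings and compiler options.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8 uses two code units");
static_assert(byte_of(u8"\u00E9"[0]) == 0xC3, "lead byte");
static_assert(byte_of(u8"\u00E9"[1]) == 0xA9, "continuation byte");

// Run-time view of the same literal (hypothetical helper name).
inline bool u8_literal_is_utf8() {
    const auto& s = u8"\u00E9";
    return byte_of(s[0]) == 0xC3 && byte_of(s[1]) == 0xA9 && s[2] == 0;
}
```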
- P0882R0 - User-defined Literals for std::filesystem::path
- Tom stated that SG16 concerns are limited to encoding issues; LEWG
should address any other concerns; e.g., naming.
- Peter noted that the paper punts on UTF-8 support pending a solution
from the committee for differentiating ordinary and UTF-8 string
literals. Fortunately, we have a solution for that in the works!
- It was asked why the UDLs are not constexpr; the answer is
because they produce path objects and the path
constructor allocates.
- Mark asked if the UDLs should produce path_view objects ala
P1030 above and was rewarded
with a round of yeses.
- Peter observed that the UDL names are very generic (ha ha) and that
the literal namespace proposed for them differs unnecessarily from
existing precedent (e.g., std::filesystem::literals vs
std::literals::filesystem). [Editor's note: This design
also results in the UDL declarations being visible following
using namespace std::filesystem; this may be
intentional.]
- Poll: Contingent upon adoption of char8_t, add char8_t based overloads?
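The points about constexpr and namespace layout can be illustrated with a user-defined literal of the same general shape. This is a sketch only: the demo namespace, the _path suffix, and example_path() are hypothetical and are not the names proposed in P0882R0. It assumes a C++17 implementation that provides the filesystem header.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <string>

namespace demo {
// Nested inline namespaces, following the layout used by the existing
// standard library literals.
inline namespace literals {
inline namespace path_literals {
// Not constexpr: constructing a path may allocate.
inline std::filesystem::path operator""_path(const char* s, std::size_t n) {
    return std::filesystem::path(s, s + n);
}
}  // namespace path_literals
}  // namespace literals
}  // namespace demo

// Hypothetical usage: build a path from two literals.
inline std::filesystem::path example_path() {
    using namespace demo::path_literals;
    return "usr/include"_path / "vector";
}
```

Because both namespaces are inline, a using-directive for demo, demo::literals, or demo::path_literals each makes the suffix visible, which mirrors the visibility effect described in the editor's note above.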
- P0540R1 - A Proposal to Add split/join of string/string_view to the Standard Library
- P0645R2 - Text Formatting
- Zach requested char8_t overloads. [Editor's note:
Peter has been planning to work on adding char16_t and
char32_t support. There is an existing issue tracking
support for char16_t at
https://github.com/fmtlib/fmt/issues/698. That issue notes
that support for std::numpunct<char16_t> is
missing; that would presumably be an issue for char8_t
support as well.]
- Zach observed that formatting only works for trivial encodings
in which one code unit equals one code point; otherwise, field
alignments won't match up in displayed text.
- Martinho responded that, if a font is missing a glyph for a
combining character, then the combining character will likely be
displayed as a separate glyph. Text layout is required to display
aligned text (e.g., depends on console, curses, etc...).
- Tom asked how such display concerns can be addressed; format
is not a text display tool.
- Zach asked how field size is specified. Code units? Code points?
"Characters"?
- Peter provided a link to an existing github issue concerning field
size and UTF-8:
https://github.com/fmtlib/fmt/issues/628.
- Tom noted that we were out of time; we'll continue discussion next
time and will invite Victor to join us.
- Tom stated our next meeting will be scheduled for three weeks from now
on June 20th. The extra week is to give everyone a break following
Rapperswil.
June 20th, 2018
Draft agenda:
- Rapperswil recap. Progress!
- Continue review of P0645R2 (Text Formatting), hopefully with Victor
present if he can attend.
- Review the draft D1097R0 proposal:
- https://github.com/rmartinho/sg16/blob/master/papers/d1097r0.md
- Discuss what we want to learn from the Swift and WebKit developers.
Attendees:
- Corentin Jabot
- JeanHeyd Meneide
- Keld Simonsen
- Mark Zeren
- Martinho Fernandes
- Peter Bindels
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- First order of business was to ensure that papers requiring updates
following the Rapperswil meeting are submitted in time for the
post-Rapperswil mailing. Tom confirmed that P0482R4 had been submitted
and correspondence with Hal confirmed that P1025R1 (adopted at Rapperswil)
will be included in the mailing. Though not discussed in Rapperswil,
Martinho plans to submit a revision of P1041 for the mailing.
- P0645R2 - Text Formatting
- Victor started us off with a brief introduction of recent changes and
review in Rapperswil.
- Victor reported having read the summary of our previous meeting and
discussion of P0645.
- Discussion resumed regarding what field widths mean for multibyte
encodings and combining characters.
- Victor asked if basing field widths on grapheme clusters would be
appropriate.
- Zach provided an example of family emojis. Consider 4 person code
points separated by zero width joiners. Each person code point
combined with a ZWJ is a distinct grapheme cluster, but a single
glyph may be used to display all four clusters. So, grapheme clusters
are not the right abstraction for field width.
- Tom claimed that format should be used to format code
units.
- Peter suggested assuming one column per code point.
- Keld asked about other libraries; are there any that use abstractions
above code points for field formatting?
- Tom stated that the competition is printf and iostreams.
- Keld asked what ICU does.
- Zach responded that he wasn't sure, but that Python uses code points
for field formatting.
- Discussion then moved on to other topics briefly.
- Zach expressed enthusiasm for format_to_n.
- Tom asked if mixed character encodings are supported. For example:
- format("{}", u"text"); // execution character encoding for format string with UTF-16 argument.
- Victor stated that mixed encodings are not supported and result in
compilation failure.
- Zach observed that, if char8_t overloads were added, that,
internally, format must consume code points.
- Tom responded that this is true for any multibyte encoding, and
therefore true in general for the execution and wide character
encodings.
- Victor agreed, but noted that operations other than fill and field
formatting could be optimized to avoid looking at code points.
- Peter asked if any multibyte encodings allow a NUL byte in trailing
code unit sequences. No such encodings were named.
- Peter observed that, if an encoding library is used, format
can always just read code points.
- Zach offered to provide Victor code using code point iterators from
Boost.Text that could be used to prototype code point based
approaches.
- Discussion briefly turned to portability of wchar_t and
Keld's work to increase the number of C interfaces that do not rely
on global program state; e.g., locale data. Keld wants to improve
support for working with multiple encodings in a single process.
- Tom noted that such improvements are useful for our ideas around use
of compile-time known internal encodings with transcoding to run-time
determined encodings at program borders.
- Tom asked how format handles signed and unsigned char; are
they treated as integral/arithmetic or character types?
- Victor replied that he didn't recall and would have to check.
- Keld asked about reentrancy.
- Victor responded that the only global state references are for locale
data.
- Keld recommended allowing strings to be tagged with encoding data.
- Tom tried to bring discussion back to fill operations and field widths;
are we agreed on use of code points for field fill/alignment?
- Martinho asked how a code point approach works when writing to a fixed
width buffer (of code units).
- Victor mentioned that format_to_n takes a code unit count
constraint.
- Peter observed that a code unit count constraint can result in
truncated code unit sequences.
- Victor suggested that format_to could produce code points
instead.
- Steve asked how to avoid writing broken code; code points produced are
likely going to be written to a code unit buffer anyway.
- Keld stated that programmers like to write both code unit and code point
code; perhaps both should be supported.
- Martinho claimed that truncated code unit sequences are probably not a
large concern; buffers are generally larger than required anyway.
- Discussion again drifted towards encodings that are known at
compile-time vs run-time.
- Keld asked what types are generally used for double byte character
sets; Japanese, Chinese, ...
- Martinho responded that those tend to be variable length encodings
that switch between single byte and multibyte.
- Tom agreed and mentioned ISO-2022 and escape sequences.
- Discussion drifted back to code units vs code points.
- Zach suggested that programmers will expect the output encoding to
match the format string, but that code points are more consistent and
natural. If the n in format_to_n means something
different than for field widths, that will be a problem.
- Victor agreed that programmers will expect to be filling a code unit
based buffer.
- Tom observed that more discussion would be useful, but that we need
to move on.
- Zach recommended trying to support both code unit and code point
based approaches and observe feedback and usage.
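The difference between measuring in code units and in code points, and the truncation concern Peter raised, can be sketched for UTF-8 as follows. This is not how P0645R2 specifies anything; the function names are illustrative and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts code points by counting the bytes that begin a sequence.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if (!is_continuation(c)) ++n;
    }
    return n;
}

// Left-aligns text in a field, measured in code points or in code units.
inline std::string pad_right(const std::string& utf8, std::size_t width,
                             bool by_code_points) {
    const std::size_t used =
        by_code_points ? count_code_points(utf8) : utf8.size();
    std::string out = utf8;
    if (used < width) out.append(width - used, ' ');
    return out;
}

// Truncates to at most max_units code units without splitting a sequence.
inline std::string truncate_on_boundary(const std::string& utf8,
                                        std::size_t max_units) {
    if (utf8.size() <= max_units) return utf8;
    std::size_t end = max_units;
    // Back up while the cut would land inside a multibyte sequence.
    while (end > 0 && is_continuation(utf8[end])) --end;
    return utf8.substr(0, end);
}
```

For "caf\xC3\xA9" (five code units, four code points), a field width of 6 yields one space of padding when measured in code units and two when measured in code points. Neither measure accounts for the family emoji case Zach described, where seven code points (four people joined by three ZWJs) may be displayed as a single glyph.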
- D1097R0 - Named character escapes
- Martinho started by requesting feedback on:
- name matching (currently more limited than described by
UAX44-LM2)
- lack of support for named character sequences.
- Tom recommended adding a small section that summarizes what is
actually proposed. At present, the paper presents a number of options,
but one must read the proposed wording to determine which options are
actually proposed.
- Tom expressed a preference for following the UAX44-LM2 rule
for name matching.
- Martinho responded with a dislike for the
U+1180 HANGUL JUNGSEONG O-E exception and noted that none of
the other languages he surveyed use UAX44-LM2 for matching.
- Keld noted existing APIs that allow specifying precision for matching.
- Martinho clarified that general collation APIs don't apply here
(because of the U+1180 HANGUL JUNGSEONG O-E exception).
- Tom asked if we should propose this for C and everyone responded yes.
- Tom mentioned the paper should address the potential for code breakage.
"\N" has a meaning now (it means "N").
- Tom asked if it is permissible to construct these escapes using macro
concatenation.
- Tom observed that '_' seemed to be missing in the definition of
c-char.
- Martinho stated that is intentional; '_' would be needed for
UAX44-LM2 matching, but that actual character names never use
'_'.
- Zach suggested adding a Tony Table to compare use of \U and
\N{} escapes.
- Tom suggested clarifying that \N{} escapes would not be
permitted in identifiers.
- Tom asked about interaction with raw string literals;
r-char-sequence doesn't seem to include
universal-character-name.
- Martinho responded that universal-character-name escapes are
not recognized in raw string literals; following existing precedent.
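The existing precedent Martinho referred to can be observed with universal-character-names today. The sketch below uses only current syntax; the helper names are illustrative, and the proposed \N{} escapes are not used because they are not yet part of the language.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// In a non-raw u8 literal, \u00E9 is a universal-character-name: one code
// point, encoded as two UTF-8 code units.
static_assert(code_units(u8"\u00E9") == 2, "escape is interpreted");

// In a raw literal, the same six source characters are kept verbatim.
static_assert(code_units(R"(\u00E9)") == 6, "escape is not interpreted");

// Run-time view of the raw literal (hypothetical helper name).
inline bool raw_literal_keeps_backslash() {
    const char raw[] = R"(\u00E9)";
    return raw[0] == '\\' && raw[1] == 'u';
}
```

Under the response recorded above, a \N{} escape written inside a raw string literal would likewise be left as literal characters.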
- Rapperswil recap:
- Tom asked if Rapperswil attendees were able to connect with authors
of previously discussed papers in order to deliver our feedback.
- JeanHeyd reported that connections did not happen. However:
- P1030 was not discussed in Rapperswil.
- P0882 was discussed in LEWG but not well received. No need for
follow up.
- P0540 was discussed but LEWG feedback matched ours. No need to
follow up.
- We ran out of time to discuss what we want to learn from the Swift and
WebKit developers.
- Tom asked about renaming the SG16 mailing list from unicode to
sg16-unicode. Both Tom and Martinho had been annoyed by the
similarity to the unicode.org mailing list by the same name. No
objections were raised; Tom will follow up with Keld.
- Tom noted that our next regularly scheduled meeting would fall on July
4th, a US holiday. The next meeting will be scheduled for July 11th.