SG16: Unicode meeting summaries 2022-01-12 through 2022-06-08
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
- January 12th, 2022
- January 26th, 2022
- February 9th, 2022
- February 23rd, 2022
- March 9th, 2022
- April 13th, 2022
- April 27th, 2022
- May 11th, 2022
- May 25th, 2022
- June 8th, 2022
Previously published SG16 meeting summary papers:
January 12th, 2022
Draft agenda:
Attendees:
- Hubert Tong
- Jens Maurer
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- [ Editor's note: We did not have quorum due to low attendance;
as a result, no polls were taken. ]
- [ Editor's note: Changes made in a draft of
P1885R9
were material to review of the papers on the agenda, so we first
reviewed it. ]
- P1885R9: Naming Text Encodings to Demystify Them
- Jens stated that there is a procedural question related to the
changes made in this revision; the new draft changes the design
after electronic polling.
- Tom replied that the changes were at least partially inspired by
comments received on the proposal by electronic polling participants
and were intended to increase consensus.
- PBrett noted that the polled revision did not have the wording to
mandate CHAR_BIT == 8 due to a procedural error and asked
whether that warranted follow up with LEWG.
- PBrett asked what it means in practice for the wording to mandate
CHAR_BIT == 8.
- Jens replied that it requires a call to a function specified with
that wording to be ill-formed if the requirement is violated.
- Jens explained that implementors can conform with this requirement
by defining such functions as deleted, implementing them as a
function template with an appropriate static_assert, or
similar.
- PBrett asked for more details regarding implementation as a function
template and whether the functions could simply not be declared
at all.
- Tom replied that a function template is ok because taking the address
of a standard library function is not allowed; the standard just
requires a call expression to be well-formed.
- Hubert explained that the name must be declared due to interaction
with name lookup.
- Jens stated an implementation preference to define the function as
deleted.
- PBrett noticed that the revision history simply stated
"Wording fixes" with no details.
- Tom volunteered to review the R8 and R9 revisions and to summarize
the differences in the meeting summary.
- [ Editor's note: Tom did so; the wording differences identified include:
- An additional using enum id declaration was added to
the definition of text_encoding.
- The wide_literal(), wide_environment(), and
wide_environment_is() declarations were removed and
wording that referred to them adjusted accordingly.
- "character encoding" was changed to "character encoding scheme"
in the paragraph following the note regarding how the names of
text_encoding::id enumerators were derived from the
IANA registry.
- A paragraph describing invariants maintained by
text_encoding was added.
- Wording that introduced and used
SUBSTITUTE_UTF_ENCODING() for mapping the bigendian and
littleendian UTF-16 and UTF-32 schemes to their respective
encoding forms was removed; related wording was adjusted
accordingly.
- Wording to mandate CHAR_BIT == 8 was added
throughout.
]
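
The two implementation strategies Jens described for a function that mandates CHAR_BIT == 8 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not wording or code from P1885: the `sketch` namespace and the functions `literal_name()` and `environment_name()` are hypothetical stand-ins, and the returned string is a placeholder.

```cpp
#include <climits>
#include <string_view>

namespace sketch {  // hypothetical namespace, not part of any proposal

// Strategy 1: a function template whose static_assert is checked only when
// a call instantiates it. The condition depends on a template parameter so
// that it is evaluated at instantiation time rather than at definition time.
template <int Bits = CHAR_BIT>
constexpr std::string_view literal_name() {
    static_assert(Bits == 8, "this facility mandates CHAR_BIT == 8");
    return "placeholder-encoding";  // placeholder result for illustration
}

// Strategy 2: declare the name unconditionally, so that name lookup still
// finds it, but define it as deleted when the requirement is not met.
#if CHAR_BIT == 8
constexpr std::string_view environment_name() {
    return "placeholder-encoding";  // placeholder result for illustration
}
#else
std::string_view environment_name() = delete;
#endif

}  // namespace sketch
```

In both cases the name remains declared, which is consistent with Hubert's point about name lookup; on an implementation where CHAR_BIT is not 8, a call would be rejected at compile time.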
- P2491R0: Text encodings follow-up
- Jens summarized the paper.
- Jens stated that the changes in P1885R9 improve consistency with the
IANA intent and P1885 usage.
- Jens added that there is opportunity for specification improvements,
but that they can be addressed during LWG review.
- Tom noted that the wording plan described in the paper indicated
intent to remove the requirement that CHAR_BIT == 8.
- Jens confirmed, but expressed a lack of concern; that restriction can
be removed if motivation to do so arises.
- PBrett expressed a preference for explicit specification for how IANA
octets are mapped to C++ code units.
- Jens agreed, stated that can be done during LWG review, and expressed
a preference for a normative sentence that maps an IANA octet to a
char code unit.
- Jens expressed a belief that relying on IANA for anything other than
octets would be a mistake.
- PBrett asked if Jens had an alternative to suggest.
- Jens replied that he did not, but that such concerns usually arose
during discussion of UTF-16 and UTF-32.
- Hubert stated that mapping to IANA in those cases is feasible, but
noted that there are multiple possible mappings that may be platform
dependent; particularly when the size of char is not 8
bits.
- PBrett agreed, but noted that a particular mapping could be required
for a conforming implementation.
- Tom added that a non-conforming implementation would map to "other"
in that case.
- P2498R0: Forward compatibility of text_encoding with additional encoding registries
- PBrett introduced the proposal:
- Character encoding repositories other than IANA exist.
- The mapping to IANA should be explicit such that a mapping to
another registry could be gracefully introduced in the future if
motivation arises.
- PBrett summarized the proposed changes:
- Rename "id" to "iana_id"
- Rename "mib" to "iana_mib"
- Add recommended practice to discourage implementations from
becoming overly dependent on IANA.
- PBrett asked for opinions on the proposed renames.
- Victor asked if other viable candidate registries exist in practice
and stated that, if not, the proposed renames seem premature.
- PBrett replied that there are NB concerns about IANA being an
unregulated, unaccountable, and unreliable organization.
- PBrett added that examples of other registries can be found in the
WhatWG Encoding Standard,
IBM's Character Data Representation Architecture (CDRA),
and ISO/IEC 2022:1994.
- PBrett noted that the paper does not propose the addition of another
registry; just the possibility to add more in the future in a
seamless manner.
- Jens suggested adding examples of other registries to the paper.
- PBrett responded with concern that doing so might create disruption
for the advancement of P1885 and result in considerable time spent
debating whether the merits of each such repository warrant their
being mentioned.
- Steve asked for confirmation that, assuming an additional registry,
a single text encoding ID is still needed.
- PBrett responded that the IANA ID is an enum class and that, in
principle, multiple such classes are possible.
- Steve stated that renaming mib() to iana_mib()
results in the feature no longer being generic.
- Jens agreed.
- PBrett responded that code written for P1885 today that consults
mib() is necessarily concerned with IANA specifically.
- Steve asked what function a generic library should call then.
- Jens replied that, if the IANA ID is desired, then call
mib().
- Tom noted that, since multiple encodings may map to IANA's "other",
reliance on mib() to uniquely identify an encoding is not
possible.
- Hubert opined that the proposed renames are fine, but that
extension to other registries might require different
terminology.
- Jens offered examples such as "unique ID" and "UUID".
- Jens opined that it doesn't hurt to add "iana" to make the
terminology association explicit.
- Hubert agreed.
- Hubert stated that, for wide encodings, there are some registries
that are somewhat suitable; in CDRA, wide encodings aren't explicitly
represented as they constitute a composition of a character set and
an encoding.
- PBrett acknowledged and noted that the proposal is intended to clear
design space for extension for such cases.
- Tom provided some arguments in favor of support for multiple
registries:
- As previously noted, the IANA specification is governed by an
organization that some have concerns about.
- The IANA registry is poor from a quality of specification
perspective.
- The IANA registry is missing entries that are found in other
registries.
- The same name is sometimes used by different registries to refer
to different encodings.
- Other registries are arguably more suitable for some environments;
e.g., CDRA for IBM environments.
- Tom suggested that the proposal replace the P1885 proposed exposition
only data members with post conditions on the default constructor to
require iana_mib() and name() to return
id::other and an empty string respectively; this is to avoid
encouraging implementations to simply store an IANA ID.
- PBrett noted that much of the current wording is in terms of the
mib_ exposition only data member.
- Tom replied that those can be changed to mib().
- Jens pointed out that the following text from the proposed wording
creates the impression that the specified feature provides a remote
API interface.
- "[text.encoding] describes an interface for accessing the IANA
Character Sets registry".
- Jens stated that text_encoding currently contains a list of
all encodings in the IANA registry and that this proposal makes the
text_encoding class more of a first class entity for which
mapping to other registries is ancillary functionality.
- Jens opined that this direction suggests the need to create our own
encoding registry since the functionality effectively defines
one.
- Jens stated that such a change constitutes a change in direction that
is more significant than the proposed renames suggest.
- PBrett acknowledged that the proposal makes the design more abstract
in a manner similar to how Unicode specifies abstract characters.
- PBrett added that he could imagine a C++ appendix that lists the
encodings, but that would then necessitate defining them.
- Jens stated that a registration service could be established, but did
not advise doing so.
- Jens observed that discussion has not yet reached the bottom of how
encodings would be mapped between encodings registered with different
registries.
- Jens suggested the possibility of defining an
iana_text_encoding class and later adding an
iso_text_encoding class or similar for other registries if
warranted.
- Hubert observed that the P1885 design contains the same mapping
problem since it presents a single domain, but doesn't adhere solely
to that domain.
- Hubert suggested the possibility of a text_encoding class
template parameterized by a registry identifier.
- Tom noted that, if one were to map the encodings present in the
ICU converter explorer,
then parameterization by registry is necessary due to conflicting
use of the same name for different encodings.
- [ Editor's note: For example, "korean" maps to "windows-949-2000"
via the "Windows" provider, but to "ibm-1363_P11B-1998" via the "IANA"
provider. ]
- PBrett reiterated the intent of this proposal: to remove complete
dependency on IANA.
- Tom stated that the proposed change is consistent with the P1885
direction given that comparison is dependent on name when an encoding
maps to IANA's "other".
- Tom opined that there is no need to specify a mapping between
repositories in the standard; the mapping can be
implementation-defined.
- Jens agreed that leaving the mapping implementation-defined is
possible but felt an obligation to specify the mapping.
- Hubert noted that, for implementors, the concern is what is in the
interest of their users.
- PBrett expressed a belief that creation of a character encoding
registry service would be outside the scope of WG21 work, but that he
would be willing to assist with such an effort outside of WG21.
- Jens agreed and noted that such a service would essentially be
attending to graves.
- Hubert suggested that text_encoding may be more appropriate
as a concept.
- Tom stated that distinct classes or class template specializations
for each registry would create friction at interface boundaries.
- Hubert responded that a type erased interface could still be
specified.
- PBrett expressed being open to other suggestions.
- Jens suggested renaming the current text_encoding class to
iana_text_encoding and, if motivation arises for another
registry, then a new class can be added.
- General discussion ensued regarding the ramification of distinct
classes.
- Tom pondered what name would be returned by name() if
multiple registries are supported.
- PBrett responded that his proposal intended to avoid that
question.
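
Hubert's suggestion of a text_encoding class template parameterized by a registry identifier might look something like the sketch below. Everything here is an assumption made for illustration: the tag types, `registry_traits`, and `basic_text_encoding` are hypothetical names, and the only table entries are the two "korean" mappings quoted in the editor's note above.

```cpp
#include <string_view>

namespace sketch {  // hypothetical names throughout

struct iana_registry {};     // tag: names resolved per the "IANA" provider
struct windows_registry {};  // tag: names resolved per the "Windows" provider

// Primary template left undefined; each registry supplies its own lookup.
template <class Registry>
struct registry_traits;

template <>
struct registry_traits<iana_registry> {
    static constexpr std::string_view resolve(std::string_view alias) {
        // Single entry for illustration; a real table would be far larger.
        if (alias == "korean") return std::string_view{"ibm-1363_P11B-1998"};
        return std::string_view{};
    }
};

template <>
struct registry_traits<windows_registry> {
    static constexpr std::string_view resolve(std::string_view alias) {
        if (alias == "korean") return std::string_view{"windows-949-2000"};
        return std::string_view{};
    }
};

// One class template; the registry is part of the type, so the same alias
// can resolve to different encodings without ambiguity.
template <class Registry>
class basic_text_encoding {
public:
    constexpr explicit basic_text_encoding(std::string_view alias)
        : name_(registry_traits<Registry>::resolve(alias)) {}

    constexpr std::string_view name() const { return name_; }

    friend constexpr bool operator==(const basic_text_encoding& a,
                                     const basic_text_encoding& b) {
        return a.name_ == b.name_;
    }

private:
    std::string_view name_;  // empty when the alias is not registered
};

}  // namespace sketch
```

Because `basic_text_encoding<iana_registry>` and `basic_text_encoding<windows_registry>` are distinct types in this sketch, they cannot be compared or passed interchangeably, which illustrates the friction at interface boundaries that Tom raised.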
- Tom stated that the next telecon will be held on 2022-01-26 and that the
agenda will include:
January 26th, 2022
Draft agenda:
Attendees:
- Charlie Barto
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- Tom informed the group of tentative plans for a SSRG and SG16 joint
telecon to discuss the security aspects of
P2528R0 (C++ Identifier Security using Unicode Standard Annex 39)
and asked for feedback or concerns about such a meeting.
- P2286R6: Formatting Ranges:
- Tom informed the group that Barry was unable to be in attendance but
that we are ok to discuss the wording and gather feedback for
him.
- Victor explained that he had assisted with the wording related to
escape handling, but that Barry authored the rest.
- Jens stated that it is generally not advised to discuss a paper
without the author present.
- Tom acknowledged the guidance and reported that he had communicated
with Barry and that Barry was content with Victor being present to
discuss any issues that are raised.
- Jens asked for confirmation that LEWG has already approved the
design.
- Victor responded that the paper is present in the currently active
electronic polling cycle.
- Victor shared the paper and reviewed the revision history.
- Victor began reviewing the wording.
- PBrett asked if the formattable concept has semantic
constraints that cannot be expressed in the concept definition.
- Victor replied that it does not.
- Jens noted that, in section 5.1 of the paper, [format.formattable]
paragraph 2 states the semantic requirements.
- Zach asked if the concept has been implemented as specified.
- Victor expressed uncertainty regarding the concept definition, but
assured the group that the rest of the design has been
implemented.
- Mark asked, with regard to section 5.2, why '?' appears as a
type.
- Victor explained a requirement for mutual exclusivity.
- PBrett asked if there is intent to standardize the use of '?' to
enable a debug representation generally; e.g., for non-standard
types.
- Victor replied that doing so is outside the scope of the standard,
but expressed support for that direction.
- PBrett noted that the paper introduces a set_debug_format()
member function and asked if it would be desirable to add generic
support for invoking it.
- Jens replied that the parser for the formatted type would presumably
have to be responsible for recognizing the '?' character and invoking
the member function.
- Jens noted that calls to set_debug_format() must activate
the debug format regardless of whether a '?' is present in the format
string.
- Jens suggested that adding generic facilities to support the '?'
specifier would be ok, but probably best addressed by a different
paper.
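
The relationship between the '?' specifier and set_debug_format() that Jens described can be sketched as below. To stay self-contained this does not specialize std::formatter; `text_formatter` is a hypothetical stand-in with simplified `parse()` and `format()` signatures, and the debug output merely adds quotes rather than applying the paper's escaping rules.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical stand-in for a formatter specialization

class text_formatter {
public:
    // Activates the debug representation whether or not '?' was parsed.
    constexpr void set_debug_format() { debug_ = true; }

    // The type's own parser recognizes '?' and invokes set_debug_format().
    // Returns the number of format-spec characters consumed.
    constexpr std::size_t parse(std::string_view spec) {
        if (!spec.empty() && spec.front() == '?') {
            set_debug_format();
            return 1;
        }
        return 0;
    }

    // Debug output is only quoted here; real escaping is omitted.
    std::string format(std::string_view value) const {
        std::string out;
        if (debug_) out += '"';
        out.append(value.data(), value.size());
        if (debug_) out += '"';
        return out;
    }

private:
    bool debug_ = false;
};

}  // namespace sketch
```

A wrapping formatter, such as one for a range, could call `set_debug_format()` on the element formatter directly, which is why the member function must take effect without a '?' in the format string.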
- Hubert noted that, in the proposed wording for
[format.string.escaped] subparagraph 2.4.1, the "Other" ("C")
value of General_Category is an abbreviation for a set of
categories and requested that the wording specify the individual
categories.
- [ Editor's note: The values for General_Category are
specified in
table 12 of section 5.7.1 of Unicode 14 UAX#44.
]
- [ Editor's note: The wording also refers to the
General_Category value of "Separator" ("Z") that is likewise
an abbreviation for a set of categories and should presumably be
expanded as well. ]
- Hubert noted that format stability cannot be guaranteed and that
output will change when newer Unicode standards are adopted.
- Tom asked for confirmation that stability will only be lacking for
currently unassigned characters.
- Steve replied that this should be further researched and noted that
the "Unassigned" ("Cn") property is not stable.
- Hubert noted that the wording does not address non-Unicode text and
asked if isprint() and iswprint() should be used to
identify non-printable characters in those cases.
- Jens replied that doing so warrants further discussion.
- Hubert pondered whether it would be preferable to map non-Unicode
characters to Unicode and then proceed with using the Unicode
character properties.
- Tom noted that, for implementations that support user defined
encodings, the implementation may not know how to map to Unicode.
- Hubert noted that such user defined definitions must define
categories to be used for isprint() and other character
classification functions, but acknowledged that a mapping to Unicode
may not be present.
- PBrett stated that [format.string.escaped] paragraph 2 does not state
how to determine if the string to be escaped is in a Unicode
encoding.
- PBrett noted that he believes such wording to be present in the
wording related to field width and suggested it may be desirable to
generalize that and move it to a central location.
- Steve asked if the determination might be locale dependent.
- PBrett noted that previous guidance was that std::format()
may be used for binary data and that any associated encoding is
therefore tenuous.
- Jens suggested that, in those cases, a programmer might be advised
not to use the '?' formatting type.
- Tom suggested it may be reasonable to, again, use the literal
encoding as a proxy for the potentially locale dependent
encoding.
- Victor agreed.
- Hubert stated that, for the non-Unicode case, determining
printability would require either locale dependence or preservation
of the compile-time literal encoding.
- Charlie noted that, historically, the latter would have been
difficult and that, for Microsoft, the active code page (ACP) was
used in the past.
- Hubert noted that, in a cross-compilation scenario, it is possible
that the literal encoding is not a defined encoding for the target
environment.
- Tom suggested that, at some point, we will need to poll whether the
non-Unicode behavior should be locale dependent or not.
- Hubert expressed acceptance of non-locale dependent behavior for the
non-Unicode case so long as there is no requirement to match the
behavior for the Unicode case with regard to emitting hex vs UCN
notation.
- Hubert noted that, if locale dependence is avoided, it will be
necessary to assume an encoding for characters that are not
consistently encoded for all locales; like '\' in EBCDIC
environments.
- Hubert added that doing so might be ok if the choice is determined
by the literal encoding.
- PBrett suggested it may be useful to support opt-in to locale
dependence via the 'L' modifier.
- Victor stated that the 'L' modifier could be reserved for now.
- Mark noted that the 'L' modifier is currently supported for character
type.
- Jens pointed out that the changes to [format.formatter.spec]
paragraph 2 appear to indicate that a declaration of the
set_debug_format() member function in any one of the
specializations will affect all of them.
- Victor agreed that this wording requires more work.
- Mark asked when a formatter object for which
set_debug_format() was called is reset or what happens when
the '?' specifier is not applicable to the type.
- Victor replied that it should be an error to specify mutually
exclusive options or that the last option overrides prior ones but
that further consideration is required.
- Jens stated that the order of the interior bullets in
[format.string.escaped] subparagraph 2.2 needs to be revised to
address two issues:
- The algorithm must process the contents of string S in
order.
- S consists of a sequence of code units, not UCS scalar
values.
- Jens suggested trying to factor out the code unit and UCS scalar
value cases to avoid the exceptions.
- Charlie stated that the handling of invalid code unit sequences needs
to be specified since recovery may not always be possible; failure to
recover could result in output containing a long sequence of values
in hex notation.
- Jens acknowledged the scenario and agreed that the specification must
be made clear about that.
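
The in-order, code-unit-by-code-unit processing discussed above can be illustrated with a deliberately simplified sketch. Several assumptions apply: `escape_ascii` and `append_hex` are hypothetical names, only ASCII is handled, "printable" is approximated as the range 0x20 through 0x7E, and every code unit at or above 0x80 is treated as undecodable and emitted in hex notation, whereas a real implementation would first attempt to decode a UCS scalar value. The `\u{...}` and `\x{...}` spellings are assumed to match the notation used by the paper.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical helper names

// Appends prefix, the value in lowercase hex, and a closing brace.
inline void append_hex(std::string& out, const char* prefix, unsigned value) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%s%x}", prefix, value);
    out += buf;
}

// Walks the code units of s strictly in order and builds a quoted,
// escaped representation. ASCII-only simplification; see the lead-in.
inline std::string escape_ascii(std::string_view s) {
    std::string out = "\"";
    for (char ch : s) {
        const unsigned char cu = static_cast<unsigned char>(ch);
        switch (cu) {
        case '\t': out += "\\t"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        default:
            if (cu >= 0x80) {
                // Treated as an undecodable code unit in this sketch.
                append_hex(out, "\\x{", cu);
            } else if (cu < 0x20 || cu == 0x7F) {
                // Non-printable ASCII control character.
                append_hex(out, "\\u{", cu);
            } else {
                out += ch;
            }
        }
    }
    out += '"';
    return out;
}

}  // namespace sketch
```

With this approach, a long run of undecodable input produces a correspondingly long run of hex escapes, which is the scenario Charlie described.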
- Jens asked what "universal character name escape sequence" is intended
to mean in [format.string.escaped] subparagraph 2.4.2 and noted that a
definition does not exist for "universal character name" though one
does exist for universal-character-name.
- Jens noted that this wording probably should not refer to
universal-character-name since it describes a grammar
nonterminal and suggested replacing
"its universal character name escape sequence" with
"a sequence of scalar values".
- PBrett agreed.
- Steve pointed out that similar uses of grammar nonterminals appear in
the wording.
- Jens agreed and suggested that the various "escape sequence" uses
should be replaced with "a sequence of scalar values in the form
...".
- Jens pointed out a grammar issue in the first line of
[format.string.escaped] paragraph 3;
"... is equivalent to the escaped string representation a string
of C ...".
- Hubert requested that a note be added to [format.string.escaped]
paragraph 4 to indicate that behavior is not locale dependent but
that the encoding may be informed by the literal encoding.
- Jens reported that the expected output noted in the comment for
example s3 in [format.string.escaped] paragraph 4 is
incorrect; there should be two ranges and therefore two sets of
brackets.
- Tom observed that the examples that follow [format.string.escaped]
paragraph 4 depict the expected Unicode behavior, but appear to be
part of paragraph 4 which is specific to non-Unicode behavior.
- Victor agreed there is a presentation issue that needs to be
addressed there.
- Jens advised moving paragraphs 2 through 6 of [format.range] after
the range_formatter class definition.
- Victor noted that "exposition only" should appear in italics.
- Mark asked why, in the last row of the table associated with
[format.range] paragraph 6, "?s" is required as opposed to just
"?".
- PBrett stated that requiring "?s" is inconsistent with the string
case where "?" can be specified by itself.
- Victor explained that "s" indicates that a range is intended to be
formatted as a string; "?s" is therefore needed to format the range
as a string in debug mode.
- PBrett pointed out the "this is equivalent to invoking
set_brackets({}, {})" note in [format.range] paragraph 5 and asked
if a programmer is permitted to make such a call.
- Victor replied that the wording is intended for implementors.
- PBrett asked whether, when implementing a range formatter, it is
required to implement member functions like
set_brackets().
- Victor replied that LEWG had discussed this and that the member
functions are intended to help avoid ABI issues.
- Mark asked if set_separator() and set_brackets()
should be restricted to characters and how characters outside the
basic character set need to be handled in various scenarios such as
applicability to estimated field width determination.
- Victor replied that the functions accept strings as input.
- Tom reviewed the list of previously discussed design points that were
noted in the
email announcing the agenda for the telecon.
- Tom noted the need for wording to address recovery from encoding
errors.
- Charlie stated that recovery should follow the processes described in
the
WHATWG Encoding Standard.
- [ Editor's note: See
section 4.1, "Encoders and decoders".
]
- Tom noted the need to ensure that std::filesystem::path is
rejected as a formattable range.
- Victor replied that it is rejected by the constraints on the
formatter partial specialization specified in the
[format.syn] updates; those constraints reject recursive ranges.
- Victor suggested it might be helpful to add a note that
std::filesystem::path is rejected there.
- Tom announced that the next telecon will be February 9th.
February 9th, 2022
Draft agenda:
Attendees:
- Charlie Barto
- Hubert Tong
- JeanHeyd Meneide
- Jens Maurer
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- P2498R1: Forward compatibility of text_encoding with additional encoding registries
- PBrett presented:
- The R0 revision was reviewed by SG16 a few weeks ago.
- The R1 revision rebases the wording on the latest P1885
revision.
- The wording was reworked to decouple IANA IDs from the
exposition only data members.
- Victor noticed an unnecessary trailing semicolon following the
closing brace of the std namespace declaration in the
proposed updates to [text.encoding].
- Hubert noted that Corentin's recently added response to P2498R1 in
P1885R10
noted unnecessary use of an enum for the proposed internal details of
the text_encoding class and asked if it was necessary for it
to be an enum.
- PBrett responded that it is exposition only.
- Jens pointed out an existing using enum id declaration in
the text_encoding synopsis that should presumably have been
changed to using enum iana_id.
- Jens noted similar renaming updates needed in the postcondition for
the text_encoding(iana_id) constructor where comparisons
against id::unknown and id::other are currently
present.
- Jens observed that the text_encoding(iana_id) constructor
does not state that its argument is stored.
- PBrett explained that such storage is implied by the postcondition
requirements on the result of calls to iana_mib().
- Jens suggested that there should be some wording that relates the
exposition only id type to iana_id.
- PBrett agreed that could be better specified.
- PBrett indicated that some positive guidance is needed before
spending further effort on this proposal.
- Tom suggested polling support for the concerns the paper purports to
address.
- Discussion regarding polls ensued.
- Poll 1A: It should be more explicit in the identifiers used by the
text_encoding interface that the facility is tied to the
IANA character sets database.
- Attendance: 8
- Weak consensus in favor.
- Poll 1B: The text_encoding class design should be modified
to facilitate potential future association with additional encoding
registries without retaining a bias towards the IANA registry.
- Attendance: 8 (one abstention)
- No consensus.
- SA: I think IANA is a reasonable default and others can be
added; we shouldn't slow down progress.
- Poll 1C: The text_encoding class should be renamed to
iana_text_encoding.
- Attendance: 8
- No consensus.
- Poll 1D: Address feedback on wording in P2498R1
"Forward compatibility of text_encoding with additional encoding
registries", and forward the paper as revised to LEWG as a bug
fix for P1885 "Naming text encodings to demystify them" with a
recommended ship vehicle of C++23.
- Attendance: 8 (one abstention)
- Weak consensus in favor.
- Jens requested that the changes regarding exposition only data
members be removed if the intent is just to rename some
identifiers as opposed to changing the original design intent.
- D2513R1: char8_t Compatibility and Portability Fixes
- Tom announced that the next telecon will be February 23rd.
February 23rd, 2022
Draft agenda:
- Discuss objectives and priorities for C++26
Attendees:
- Charlie Barto
- Hubert Tong
- Inbal Levi
- Jens Maurer
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- Tom reminded the group about the upcoming SSRG meeting on Monday,
February 28th, and encouraged the contribution of suggested improvements
for
P2528R0 (C++ Identifier Security using Unicode Standard Annex 39)
to the author.
- Discuss objectives and priorities for C++26:
- Tom asked the group for suggestions on what we should focus on for
C++26.
- Tom suggested alias barriers as a remaining core language feature
improvement.
- [ Editor's note: See
SG16 issue #67.
]
- PBrett suggested improvements for access to command line options and
environment variables.
- [ Editor's note: See
SG16 issue #66.
]
- Jens noted the existing practice for the _wmain() entry
point in Microsoft Visual C++.
- PBrett noted the envp parameter that POSIX adds to
main().
- Tom suggested std::rope for chained text segments.
- Steve suggested transcoding and lamented JeanHeyd's absence.
- PBrett suggested an owning string type that eschews null
termination.
- Charles expressed concern that a string type that doesn't ensure
null termination may not improve security due to programmers
attempting to pass such data to functions that require a null
terminated string anyway.
- PBrett suggested that implicit conversions could be prohibited.
- Steve noted the usability challenges such prohibition creates.
- Victor stated that programmers will misuse such types even if
implicit conversions are not supported; they will assume that, for
example, a data() member function will provide a null
terminated string.
- PBrett replied that such concerns apply to the existing
std::string_view type as well.
- PBrett asserted there is room for a std::string_view like
type that owns the string data.
- Charles noted that having to pass a string held in such a type to a
function that requires null termination would require having to copy
the string contents just to add a null terminator and lamented the
cost of such copies.
- Jens asked PBrett what other features he might like to see in a new
string type.
- PBrett responded that he would mostly like to not have to write yet
more string types that avoid null termination concerns.
- Jens suggested that a string type that helps to avoid allocation and
deallocation costs could be beneficial.
- Jens stated he would prefer not to spend time in SG16 on general
data structures.
- Jens asserted that the elephant in the room is ICU and the
functionality it provides and expressed a desire for a roadmap
towards providing similar functionality.
- Steve suggested that PBrett publish a paper extolling the benefits
of a string type without null termination.
- PBrett responded that benefits appear when working with databases
and graphics.
- Tom stated that, if a new string type were to materialize, he would
like to know for sure what encoding it is intended to be used
with.
- PBrett offered two categories of such encoding assurances:
1) a type intended for text with an assumed encoding, and
2) a type that enforces well-formed encoding.
- Steve offered a third category:
3) a type that maintains invariants like normalization form.
- Steve reported having to investigate problems due to assumption of
UTF-8 leading to errors when data was passed to Python programs
that enforced well-formedness.
- Steve stated that Bloomberg is strongly in favor of a type that
maintains such invariants.
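
The second category PBrett listed, a type that enforces well-formed encoding, could be sketched as follows. This is an illustration only: `utf8_string` and `is_well_formed_utf8` are hypothetical names, the type is not taken from any proposal, and the validation follows the well-formed UTF-8 byte sequence ranges from the Unicode Standard (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

namespace sketch {  // hypothetical names

// Returns true if s consists solely of well-formed UTF-8 sequences.
inline bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }  // single-byte (ASCII) sequence
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // excludes overlong
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }  // excludes surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { len = 3; }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // excludes overlong
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }  // caps at U+10FFFF
        else { return false; }                        // invalid lead byte
        if (s.size() - i < len) return false;         // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi) return false;
            lo = 0x80; hi = 0xBF;  // later trail bytes use the full range
        }
        i += len;
    }
    return true;
}

// Owning string whose invariant is "contents are well-formed UTF-8".
class utf8_string {
public:
    // Validates once at construction; returns nullopt on ill-formed input.
    static std::optional<utf8_string> make(std::string text) {
        if (!is_well_formed_utf8(text)) return std::nullopt;
        return utf8_string(std::move(text));
    }
    std::string_view view() const { return text_; }

private:
    explicit utf8_string(std::string text) : text_(std::move(text)) {}
    std::string text_;
};

}  // namespace sketch
```

Validating at the construction boundary means later consumers, such as the Python programs Steve mentioned, would not be the first place ill-formed data is detected. A type in the third category would additionally check or establish a normalization form in `make()`.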
- Charles expressed desire for a way to pass a filename on the command
line that is preserved.
- Charles explained that this currently isn't possible on Windows due
to transcoding that occurs when preparing a command line to pass to
main().
- Tom asked if it is possible to enter such names on the command line
from Windows shells in the first place, but acknowledged it can be
done from CreateProcess().
- Charles replied that tab completion and similar shell features can
produce such names.
- Tom asked to confirm that this should be considered a language issue
as opposed to a shell issue.
- Charles explained that other languages provide such mechanisms and
noted that Rust has an OsString type that stores WTF-8 on
Windows.
- PBrett added that Rust is a good example of a language that provides
strings that don't guarantee null termination.
- Tom returned to discussion about features to focus on and suggested
Unicode algorithms.
- Hubert asked if ICU provides message formatting features.
- Tom confirmed that it does.
- PBrett noted the work that the Message Format Working Group (MFWG)
is doing.
- [ Editor's note: The MFWG tracks its work
here.
]
- Tom suggested improved support for regular expressions as another
feature to work on.
- Charles replied that such work depends on whether the committee is
willing to standardize a std::regex2 or similar.
- Charles stated that a new regex design would presumably face the
same challenges that std::regex did and may therefore
produce a similarly poor result.
- Charles noted that regular expression support is a feature for which
interoperability across libraries is not often needed.
- PBrett asserted that a standard regex facility is beneficial because
general programmers aren't particularly good at writing their
own.
- Charles suggested that a class intended to be used as a base class
for string buffer management could be beneficial for building
parsers; particularly a type that provides interfaces to peek ahead
and push back.
- PBrett suggested a std::lexy with reference to
Jonathan Müller's work.
- [ Editor's note: Jonathan's Lexy work is published
here. ]
- Tom asked if the work done on std::format provides a
precedent for how to design a std::regex replacement that
would be more isolated from ABI concerns.
- Charles responded that std::format is fairly exposed to ABI
concerns, but acknowledged the possibility of designing a more
isolated type.
- Victor stated that Hana Dusíková's CTRE provides a good example of
how to write parsers.
- [ Editor's note: Hana's CTRE work is published
here.
]
- PBrett asked about ideas for how to be more explicit about
associating an encoding with existing text; e.g., how to have text
in UTF-8 and pass it to something that requires UTF-16.
- Steve asked if anyone recalled a proposal that would allow
specifically binding to a string literal.
- Tom stated that he did not but noted that a user-defined literal
(UDL) can provide similar functionality.
- [ Editor's note: Tom was thinking of something like this:
#include <cstddef>
class string_literal {
private:
  const char *p;
  string_literal(const char *p) : p(p) {}
public:
  const char* data() const { return p; }
  friend string_literal operator ""sl(const char *p, std::size_t);
};
string_literal operator ""sl(const char *p, std::size_t) {
  return string_literal(p);
}
void f(const char*);
void g(string_literal sl) {
  f(sl.data());
}
void h() {
  g("hello"sl);
}
]
- Steve stated that a facility that allows associating a dynamic
encoding would be useful for scenarios in which the text and its
associated encoding are not known at compile time.
- Victor noted that encodings are often properties of interfaces and
suggested that an attribute based approach to specifying encoding
expectations could be helpful; this would allow specifying
expectations that are dependent on locale or Windows Active Code
Page (ACP).
- Charles expressed support for that approach.
- PBrett asked what the advantage of an attribute would be.
- Victor replied that it could be added to existing functions without
ABI impact.
- Charles observed that, if #embed is adopted, then text
that is not in the literal encoding might be imported.
- Hubert responded that the #embed author has been clear that
the proposal is intended to provide binary resources.
- Tom changed topics by asking for suggestions regarding how to solicit
new SG16 attendees and reported that he and Peter had previously
discussed inviting other WG21 members to share what the organizations
they work with do in practice.
- PBrett responded that backward compatibility and existing practice
are motivation for not providing new facilities.
- PBrett suggested that, if someone were to come forward with a
proposal to adopt a suite of string types similar to those provided
by Rust, we might tell them no thank you.
- PBrett stated that we should, of course, discuss any proposals that
come before us.
- Tom replied that seems to argue for the status quo plus what people
actually do.
- Jens expressed belief that there is good opportunity to make progress
with small improvements and provided a UTF-8 decoding iterator as an
example.
- Tom replied that an existing proposal covers such iterators.
- [ Editor's note: See
P0244 (Text_view: A C++ concepts and range based character encoding and code point enumeration library).
]
- Charles responded that he has written a grapheme iterator.
- Charles noted that such iterators are always going to be slower than
bulk conversion routines.
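- [ Editor's note: A minimal sketch, not presented at the meeting, of
the decoding step that a UTF-8 decoding iterator would perform on each
increment. The name decode_utf8 and the error policy (return U+FFFD and
consume one code unit) are assumptions made for illustration; other
policies for ill-formed input are possible. The caller must ensure that
i < s.size().

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: decodes one code point from `s` starting at `i` and
// advances `i` past the consumed code units. For an ill-formed sequence it
// returns U+FFFD and consumes a single code unit.
inline char32_t decode_utf8(std::string_view s, std::size_t& i) {
    const char32_t replacement = 0xFFFD;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;   // expected length of the sequence
    char32_t cp = 0;       // code point accumulated so far
    char32_t min = 0;      // smallest value allowed for this length
    if (b0 < 0x80) { ++i; return b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else { ++i; return replacement; }   // stray continuation or invalid lead
    if (s.size() - i < len) { ++i; return replacement; }   // truncated
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return replacement; }
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++i;
        return replacement;
    }
    i += len;
    return cp;
}
```

An iterator adaptor would call such a function from its increment
operation; the per-element validation is one reason such iterators tend
to be slower than bulk conversion routines. ]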
- PBrett suggested adding locale-independent replacements for character
classification functions like isdigit(); e.g.,
is_ascii_digit().
- Charles expressed support for such locale invariant functions and
noted via chat that glib provides such a variant named
g_ascii_isdigit().
- Tom noted that there is a distinction; locale-independent variants
would operate using an implementation-defined encoding whereas ASCII
variants would operate using ASCII even in an EBCDIC environment.
- Hubert asked if anyone would object to having both variants; e.g., a
locale-independent variant that always operates using the "C" locale
and an ASCII variant.
- No objections were raised.
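- [ Editor's note: A minimal sketch, not presented at the meeting, of the
distinction between the two variants. Both function names are
hypothetical and neither is part of the standard library.

```cpp
// ASCII variant: compares against ASCII code point values, so the result
// does not depend on the literal encoding (it still tests for 0x30..0x39
// when compiled for an EBCDIC target).
constexpr bool is_ascii_digit(char32_t c) {
    return c >= 0x30 && c <= 0x39;
}

// Locale-independent variant: compares against character literals, so it
// follows the ordinary literal encoding. The digits '0' through '9' are
// guaranteed to have consecutive values in that encoding.
constexpr bool is_digit_no_locale(char c) {
    return c >= '0' && c <= '9';
}

static_assert(is_ascii_digit(0x35));
static_assert(!is_ascii_digit(0x41));
static_assert(is_digit_no_locale('7'));
static_assert(!is_digit_no_locale('x'));
```

On an ASCII-based implementation the two functions agree for digit
characters; they would differ only where the literal encoding is not
ASCII compatible. ]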
- PBrett stated that the highest priority item should be transcoding
facilities.
- Jens asked if he meant something like iconv().
- PBrett responded that he meant something more like JeanHeyd Meneide's
ztd::text.
- [ Editor's note: JeanHeyd's ztd::text work is published
here.
]
- Charles stated that, with regard to providing an ICU interface in
the standard, there are performance implications; support for generic
iterators would warrant availability of definitions for inlining
purposes.
- PBrett asked if anyone could comment regarding the utility of
std::messages.
- Steve responded that the main problem with message catalogs is that
they don't tend to work well with real languages.
- PBrett reported having good experience with GNU gettext, but noted
that it required discipline to be used successfully.
- PBrett stated that the other elephant in the room is locale, but
noted that std::format() provides a foundation we can build
on.
- Tom suggested the possibility of enabling thread-specific locale
facilities that avoid reliance on a global locale.
- PBrett suggested a review of all locale dependent interfaces followed
by a proposal that adds variants that accept a locale as an
argument.
- Jens replied that C already does that and that all C++ interfaces use
std::locale.
- Hubert stated that POSIX provides a thread aware locale interface
that enables changing locale for a specific thread as well as
interfaces that accept a locale argument.
- PBrett provided an example; a std::isalpha() overload that
accepts a std::locale argument.
- Hubert responded that POSIX specifies an isalpha_l()
interface that accepts a locale_t argument.
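- [ Editor's note: For reference, the &lt;locale&gt; header already declares
a std::isalpha() function template that accepts a std::locale argument.
A minimal sketch of its use follows; the wrapper name is_alpha_classic
is hypothetical.

```cpp
#include <locale>

// Classifies `c` using the classic "C" locale, so the result does not
// depend on the current global locale.
inline bool is_alpha_classic(char c) {
    return std::isalpha(c, std::locale::classic());
}
```

Passing a different std::locale object selects that locale's ctype
facet instead. ]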
- Tom asked if std::locale is fundamentally problematic or
whether the problems we face with it are isolated to the facets.
- PBrett responded that it has fundamental limitations due to reliance
on inheritance.
- Victor stated that the facet's assumption that characters fit in a
code unit is the most significant problem.
- Victor added that std::locale uses reference counting and
therefore has a performance profile similar to a shared pointer.
- Hubert stated that std::locale strongly ties C++
implementations to the suite of C locales since the interface depends
on the names of C locales; it is therefore questionable how
implementors might provide more generic locale support.
- Steve stated that WG14 recently adopted a proposal that adds
'@', '$', and '`' to the basic source character set.
- Steve added that he intends to bring a matching paper to WG21 in the
near future.
- [ Editor's note: The adopted WG14 proposal is
N2701.
]
- Tom announced that the next meeting will be on March 9th and, unless a
new paper materializes, will likely focus on diving deeper into one of
the topics discussed today.
March 9th, 2022
Draft agenda:
- ICU features to consider for C++26
Attendees:
- Charlie Barto
- Hubert Tong
- Jens Maurer
- Peter Brett
- Steve Downey
- Tom Honermann
Meeting summary:
- Tom introduced the topic:
- Tom was inspired by Jens' assertion during the previous telecon that
"the elephant in the room is ICU and the functionality it
provides".
- Based on that, Tom created a
Google Doc
containing a list of ICU feature categories annotated with Tom's
opinions on whether such features are worthy of consideration for
standardization in C++26 should suitable proposals materialize.
- Tom would like to issue a Call for Papers covering the features we
agree are worth consideration.
- Tom began reviewing the feature categories in the order they appear in
the document.
- The categories enumerated below are those for which there were material
comments.
- "Unicode character properties":
- Tom pondered the benefits of exposing the set of raw UCD properties
as opposed to just those that are needed for other features.
- Charles responded that there are representation choices to be made
based on usage.
- Steve noted that support for segmentation requires many of the UCD
properties.
- Steve stated that it would be useful to expose all of the properties
to enable experimentation.
- Steve noted that, if there are design problems with how the data is
exposed, that would be useful information.
- PBrett observed that the UCD properties are needed to implement
proper internationalization and localization support.
- PBrett pondered whether exposing all UCD properties might create
portability issues.
- Tom asked if it might be reasonable to only expose properties with
stability guarantees.
- Steve replied that doing so probably would not be viable; some
properties required for segmentation are not stable and there is a
need to be able to fix things.
- Tom noted the difference between data being stable as compared to
property shapes being stable.
- Steve acknowledged and noted that property shapes haven't changed
in a while.
- Steve added that the ICU maintainers would presumably push back hard
if Unicode made significant changes to the shape of existing
properties.
- "Sets of Unicode Code Points and Strings":
- Tom commented that the items in this category appear to be simple
utilities.
- PBrett opined that may be a good reason to provide them.
- Tom observed that they appear to be used to provide support for
character classes.
- "Locales":
- PBrett stated that one of the challenges with locale identifiers is
that choice of locale is not fixed at compile-time.
- PBrett added that locale IDs are subject to geopolitical changes.
- Steve reported that there has been much work on client-side
localization through
ECMA 402,
and
ICU4X.
- Steve expressed caution over trying to standardize something like
ECMA 402 and ICU4X given that they are recent developments.
- Steve noted that client side support uses browser based
facilities.
- PBrett wondered if C++ code compiled to WASM should transparently
proxy to the local environment.
- Tom asked whether new locale support could be built on
std::locale, perhaps via new facets.
- Hubert objected to such support built on new std::locale
facets due to std::locale being dependent on C locale
names.
- PBrett replied that it would be useful to be able to get an ISO
locale name from a standard locale.
- Hubert noted that encoding information would be lost in that
case.
- PBrett agreed and asserted that encoding ideally wouldn't be a
locale concern anyway.
- Hubert replied that a particular script could be implied by
std::locale.
- Steve observed that we seem to be in agreement that we don't know
how to do this today.
- Tom stated that the problem is that proper case mapping, case
folding, and collation can't be provided without proper locale
support.
- PBrett agreed and asserted the need for a plan to improve locale
support.
- "Resource Bundles":
- Tom suggested that this feature could be built on top of features
like #embed and support for dynamic libraries; perhaps a
downstream feature then.
- Charles noted that Microsoft style resource management can be
provided on other platforms using linker scripts.
- PBrett opined that resources are a requirement for a good locale
system.
- "Calendars and Time Zones" and "Date and Time Formatting":
- Tom stated that base facilities exist that could be expanded, but
that there are locale dependencies.
- PBrett acknowledged concerns raised by Corentin on the mailing list
that noted that the existing facilities are not at parity with the
ICU features.
- "Message Formatting":
- Tom stated that this is awaiting further developments from the
Message Format Working Group (MFWG).
- PBrett commented that what programmers want and what they need are
not always the same thing.
- Charles expressed a desire for experience outside the standard before
pursuing support in the standard; even more so for this feature than
for other ones.
- Charles added that, ideally, a production quality application would
be built on top of such a facility; one that has users using it from
multiple locales.
- PBrett opined that std::format() may not provide a great
base to build this on.
- Charles replied that it is probably usable, but not sufficient as
is.
- Tom stated there would presumably be a need to pass a message ID in
place of a format string.
- Charles reported that could be done using
std::vformat().
- "Text Transformation":
- Tom suggested that, if there is a specific transliteration that we
have motivation to provide, then we can do so; range adapters
provide general support for this otherwise.
- Steve noted that JeanHeyd's
ztd.text
library can do some of this.
- Hubert asked if this feature category is referring to technical
transliteration or linguistic transliteration.
- Tom replied that he did not know.
- "Bidirectional Algorithm":
- Tom asked for confirmation that this does not have locale
dependencies and is effectively about state management for layout
purposes.
- Steve replied that the algorithm is complicated as it tries to
support all possible scenarios.
- "Collation":
- Jens described a design possibility involving locale independent
base functionality that operates on a set of locale dependent
collation rules.
- Jens stated that the rules engine would have to be sufficient to
perform all tasks required by all locales.
- Jens noted that such a facility could be useful as a building block
for locale support and as a standalone facility.
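- [ Editor's note: A toy sketch, not presented at the meeting, of the
separation Jens described: a comparison routine with no locale knowledge
that operates on a caller-supplied rule set. The type collation_rules
and the function collate() are hypothetical, and a single weight per
code point is far simpler than what real collation requires.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string_view>

// Hypothetical rule set: one primary weight per code point. Listed
// weights are assumed to be less than 0x110000.
struct collation_rules {
    std::map<char32_t, int> weights;
    int weight_of(char32_t c) const {
        const auto it = weights.find(c);
        // Unlisted code points sort after listed ones, by code point.
        return it != weights.end() ? it->second
                                   : 0x110000 + static_cast<int>(c);
    }
};

// Locale-independent engine: returns a negative value, zero, or a
// positive value, consulting only the supplied rules.
inline int collate(std::u32string_view a, std::u32string_view b,
                   const collation_rules& rules) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int wa = rules.weight_of(a[i]);
        const int wb = rules.weight_of(b[i]);
        if (wa != wb) return wa < wb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

Two rule sets can then order the same strings differently; for example,
a rule set that gives U+00E4 the same weight as 'a' places it before
'z', while one that gives it a weight greater than that of 'z' places
it after. ]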
- "String Searching":
- Tom noted the locale dependence.
- PBrett noted collation dependence.
- Jens stated that precise requirements are needed and that basic
element searching is present.
- Hubert suggested this might be provided by a regular expression
facility.
- "Text Boundary Analysis":
- Steve stated there are natural range interfaces for this and that the
interface is probably clear.
- Hubert asked when a C++ programmer would use this.
- Steve replied that he performs such analysis for reflowing text in
outgoing email.
- PBrett added that such support was an integral part of a DSP language
he worked on.
- Steve stated that word wrapping comes up in many scenarios.
- "Regular Expressions":
- Tom suggested we consider this category if a paper arrives.
- Tom noted that a proposed design would be locale dependent and would
presumably have to operate on various string types, potentially
including segmented text data structures.
- PBrett pondered the possibility of a run-time only library with a
narrow interface that could later be overloaded for new
extensions.
- Steve stated that this is not an area of his expertise but that he
would be willing to help author a paper if a good design were to
be proposed.
- Steve noted that there are many SG16 and LEWG related issues to
consider.
- "StringPrep":
- Hubert stated that this feature is related to string normalization
for interchange purposes such as key lookup.
- Jens noted that the
stringprep RFC
specifies exclusion of certain characters and expressed a desire to
better understand what this facility is used for.
- Tom updated the Google Doc to change the consideration column value
from "N/A" to "No", to add a reference to RFC 3454, and to note that
this is a string normalization feature.
- "IDNA":
- Tom expressed considerable ignorance of this topic.
- [ Editor's note: "IDNA" stands for
"Internationalizing Domain Names for Applications" and is covered by
UTS #46. ]
- PBrett suggested such support might be needed for standard
networking.
- "Universal Time Scale":
- Tom suggested this is not an SG16 concern and could be tackled
directly by LEWG if motivated.
- Tom updated the Google Doc to change the consideration column value
from "N/A" to "No; LEWG".
- "Paragraph Layout / Complex Text Layout":
- Review of the ICU playout.h header revealed that this is an
experimental facility.
- Tom updated the Google Doc to change the consideration column value
from "Not yet" to "No" and added a note about the facility being
experimental.
- "ICU I/O":
- Tom stated this is not an SG16 concern.
- Tom stated that the next telecon is scheduled for 3/23; the agenda is
TBD.
April 13th, 2022
Draft agenda:
Attendees:
- Charles Barto
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- P2558R0: Add @, $, and ` to the basic character set
- PBrett asked whether this proposal is an SG16 concern.
- Tom replied that we are the encoding experts and best equipped within
the committee to evaluate whether these changes will have consequences
for source character sets used in practice; our affirmation of the
proposal will hopefully ease progress through other groups.
- Charlie asked for confirmation that SG22 will review the paper as
well.
- Steve replied affirmatively and stated WG14 has already adopted the
proposal for C.
- Someone (apologies, the editor neglected to record who) noted the
existence of
P2342: For a Few Punctuators More
and that it argues that these characters could be used as new
operators.
- Tom responded that P2342 is an example of a paper that is not an SG16
concern; once the characters are available in the source character
set, how they are used is an EWG concern.
- Hubert observed that this introduces a requirement that these
characters be encoded as a single code unit.
- Jens acknowledged and noted that this is required by
[lex.charset]p6
for all characters in the basic literal character set.
- Jens requested that the paper prose discuss this requirement.
- Steve agreed to add such prose.
- Jens stated that he does not know if this requirement would be
problematic for any existing character sets.
- Steve replied that there are infrequently used EBCDIC code pages
that lack them.
- Jens asked if those code pages also lack other characters.
- Steve responded that they probably do.
- Charlie asked if those problematic EBCDIC code pages have unused
code points that could be used for these characters.
- Tom suggested that digraphs could be introduced to support those
code pages if needed.
- Jens replied that digraphs can't be used inside character or string
literals.
- Steve suggested this concern can be addressed if these characters
start getting used within the standard.
- Jens reported that the addition of these characters to the basic
character set makes them ineligible to be specified via a
universal-character-name (UCN) outside of a character or
string literal due to
[lex.charset]p3.
- PBrett suggested that restriction could be lifted.
- Charlie stated that existing uses of '$' are probably limited to
identifiers and symbol renaming via attributes, pragma directives,
or asm labels.
- Hubert reported they may appear in preprocessor stringization.
- PBrett reported they may appear in header names as well.
- Jens noticed that use of a UCN to #include a source file
named with one of these characters would become ill-formed.
- PBrett reiterated support for lifting restrictions on use of UCNs
to name characters from the basic character set with the rationale
that we shouldn't ban things that are only bad ideas.
- Charlie stated that, if such incompatibilities were to cause
problems in practice, then lifting that restriction would be the
obvious solution.
- Steve suggested such concerns can be addressed after the paper is
approved; we know programmers want to use these characters.
- Tom asked Steve if he knows whether the UCN concern was discussed
in WG14.
- Steve replied that it was not.
- Hubert asked when UCNs are translated in identifiers and other
preprocessor tokens.
- Jens replied, during translation phase 3 except in quoted contexts;
so translation occurs as a preprocessing-token is
formed.
- PBrett observed that the new UCN restrictions would be limited to
h-char and q-char sequences since these characters
aren't otherwise usable outside quoted contexts.
- Jens replied that they may also appear in a
preprocessing-token used in stringization.
- Hubert stated that a UCN specifying one of these characters would
become ill-formed during stringization.
- Hubert added that, likewise, use of a UCN to specify '$' in an
identifier would become ill-formed.
- Tom asked whether that matters since use of '$' in an identifier is
already an extension.
- Jens noted that these characters currently match the
"each non-whitespace character that cannot be one of the above"
case of
preprocessing-token.
- Jens requested that this discussion be included in the paper.
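- [ Editor's note: A minimal sketch, not presented at the meeting, of one
way these characters can reach a program outside of a literal today:
as a lone preprocessing token that is then stringized. The macro and
function names are hypothetical, and the sketch assumes an
implementation that accepts '@' as a preprocessing token, as common
compilers are understood to do.

```cpp
#include <cstring>

// Stringizes its argument without further macro expansion.
#define SG16_STRINGIZE(x) #x

// '@' is not otherwise usable in C++ source, but as a macro argument it
// forms a single preprocessing token that the # operator converts to a
// string literal.
inline const char* at_sign() { return SG16_STRINGIZE(@); }
```

Writing the argument as a universal-character-name instead of the
character itself is the case that the discussion above identifies as
becoming ill-formed. ]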
- PBrett asked whether h-char and q-char sequences
remain a backward compatibility concern.
- Jens replied that they do, but that UCNs in such sequences already
have implementation-defined behavior; what UCNs mean in
h-char and q-char sequences isn't really
defined.
- [ Editor's note: Per
[lex.phases]p1.3,
UCNs are not recognized and replaced in h-char-sequences and
q-char-sequences and, per
[lex.header]p1,
these sequences are mapped in an implementation-defined manner.
]
- Tom suggested that it might be worth updating the paper to describe
how existing implementations behave with respect to UCNs in
preprocessor token concatenation, stringization and #include
scenarios.
- Jens asked if there are other ways in which these characters might
plausibly be used today.
- Tom replied that Objective-C uses '@'.
- Jens mentioned the possibility of concerns being raised regarding
the imposition of single code unit encoding for these characters on
the execution character set.
- Tom asked Hubert if he was aware of any EBCDIC related concerns.
- Hubert replied that his colleagues in WG14 did not express such
concerns and that he is confident that they would have if they had
any.
- Hubert noted that locales that don't support these characters would
no longer be strictly conforming but could still be supported as
extensions.
- Tom summarized the discussion; there are requests to Steve to update
the prose for several of the items discussed, but there do not
appear to be any objections to the paper direction.
- D2572R0: std::format() fill character allowances
- Tom provided an overview of the topic and prior discussions.
- Tom asked for additional categories of characters that should be
represented in the introductory table.
- Tom stated that Zero-Width Joiner (ZWJ) and
Zero-Width Non-Joiner (ZWNJ) cases were not added because he thought
they were not interesting.
- Tom noted that lone surrogates should not be possible.
- PBrett suggested the bidirectional override characters.
- [ Editor's note: the bidirectional override characters are
U+202A through U+202E. ]
- Tom agreed to add right-to-left and left-to-right examples.
- Tom stated that extending fill character support to arbitrary
extended grapheme clusters (EGCs) in the future would presumably
impose an ABI break.
- Mark agreed that it would; at least one implementation only stores
a single code unit as the current wording appears to specify.
- Charlie agreed and noted that another implementation stores the fill
character as a code unit sequence with a maximum length of 4 code
units for UTF-8.
- Tom noted that, for the "Estimated display width restrictions"
section, there is no good or right solution.
- Tom asked how, if characters with an estimated width other than one
were to be banned, that could be accomplished without imposing
undesirable overhead.
- PBindels noted that the zero-width space (ZWSP) case is interesting;
if it were given an estimated width of 0, then an infinite amount of
padding would be required.
- Charlie stated that the estimated width is not intended to be
accurate; it is best effort.
- Charlie stated that, in the "Existing practice" section,
std::format_error is now thrown for the cases that
previously produced the stack overflow for MSVC.
- Charlie noted that the diagnostic is somewhat disappointing though
since an invalid type is reported because the intended fill
character does not match the fill-and-align grammar.
- [ Editor's note: The diagnostic produced by gcc with {fmt} is
likewise disappointing. ]
- PBrett suggested that a table that demonstrates the desired output
inline with the proposal would be useful.
- Victor stated that the paper direction makes sense and is consistent
with previous guidance.
- Victor pondered the consequences of diagnosing fill characters with
an estimated width other than one; on one hand, it would be nice,
but on the other hand, it adds overhead and potentially restricts
valid use cases.
- PBrett suggested that fill characters with an estimated width other
than one could be conditionally supported.
- PBrett stated that his preferred approach would be to diagnose at
run-time (e.g., throw an exception) when alignment requirements
could not be achieved due to the estimated width of the fill
character.
- PBindels reviewed the "Proposal" section and stated:
- The first point to restrict to a single UCS scalar value seems
sensible.
- The third and fourth points appear to be consistent with what
implementations currently do.
- The second point is the questionable one.
- PBrett suggested that the number of fill characters inserted could
be unspecified when the fill character has an estimated width other
than one.
- PBindels replied that he considered that as well, but since the
actual width is dependent on font selection, a consistent result may
not be achieved anyway.
- Charlie reminded the group that the estimated widths are not
currently specified by any standard.
- Charlie observed that diagnosing fill characters with an estimated
width other than one will have the potential effect of rendering
existing code invalid if estimated widths are changed in the
future.
- PBrett suggested another possibility: if the fill character has
an estimated width of two, then perform the alignment as if the fill
character and all characters in the format argument have an estimated
width of one; this would enable ideographic characters to be
formatted properly.
- Charlie responded that, if such a feature is desirable, then it would
be preferable to add a flag to opt-in to it rather than inferring it
based on the chosen fill character.
- Tom agreed and noted that ideographic characters are just one special
case.
- Charlie stated that, for table style alignment, it is likely
preferable to emit a field separator character explicitly in the
format string rather than to rely on the fill and align
capabilities.
- PBindels asserted that there is value to having well-defined portable
behavior across implementations, so unspecified and
implementation-defined behaviors should be avoided where
possible.
- Hubert asked if Victor has good examples of this facility being used
with format arguments that have characters with estimated widths
other than one regardless of fill character.
- Victor replied affirmatively, but stated he did not recall specific
examples; such cases involved terminal output.
- Tom reported vague recollections of such examples being present in
Victor's original papers.
- PBrett stated that support for certain languages remains a concern
for him and reported seeing Japanese characters reliably displayed in
terminals in tabular formats.
- Tom asked if he knew how the programmers were facilitating such
output.
- PBrett replied that he did not.
- PBindels asked if it would be feasible to research what features
would be useful for such languages.
- PBrett replied that it isn't feasible for him to do so due to the
number of possible solutions; he would like to allow implementations
to be creative and see what the results bring.
- PBindels again emphasized a desire for portable behavior.
- Charlie agreed and stated that a creative solution in one
implementation might produce an error in another.
- Tom asked Peter Brett if it would suffice for an implementation to
provide a flag to opt-in to an alternate behavior.
- PBrett lamented the absence of attendees that are experts in
non-Latin based languages.
- Hubert questioned the frequency with which a programmer would want
to use a fill character with an estimated width other than one;
the programmer would presumably need to be able to specify a
fill-remainder character or otherwise provide their own padding.
- Victor agreed with the goal of specifying consistent behavior and
suggested standardizing the demonstrated existing behavior in which
a fill character is assumed to have an estimated width of one.
- Charlie noted that a programmer can implement their own fancy
formatter that produces a result that is then embedded using
standard formatters.
- Tom stated that it sounds like we have a few possible extension
mechanisms that can be used for experimentation, workarounds, or
future standard behavior.
- Tom reported that he is leaning towards Victor's suggestion of
specifying the currently demonstrated implementation experience.
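- [ Editor's note: A minimal sketch, not presented at the meeting, of the
padding computation implied by treating the fill character as having an
estimated width of one. The function pad_left() and its parameters are
hypothetical; it is not how any particular std::format() implementation
is written. The fill is passed as a code unit sequence so that it can
hold a single multi-byte UTF-8 scalar value.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: right-aligns `content` in a field of `width`
// columns. `content_width` is the caller-supplied estimated display
// width of `content`. Each copy of `fill` is counted as one column
// regardless of how it actually renders.
inline std::string pad_left(std::string_view content,
                            std::size_t content_width,
                            std::string_view fill, std::size_t width) {
    const std::size_t copies =
        width > content_width ? width - content_width : 0;
    std::string out;
    for (std::size_t k = 0; k < copies; ++k) out += fill;
    out += content;
    return out;
}
```

With this rule the number of fill characters inserted is always
well-defined, including for a zero-width fill, at the cost of
misaligned output when the fill does not render in exactly one
column. ]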
- PBindels indicated he would be content with any of the demonstrated
outputs so long as behavior is consistent across implementations.
- PBrett expressed concern about specifying particular behavior in the
absence of more diverse expertise in SG16.
- PBindels stated that he would reach out within his organization to
try to find people with more diverse experience that would be
interested in attending.
- Tom reported that the next meeting will be on 2022-04-27.
April 27th, 2022
Draft agenda:
Attendees:
- Hubert Tong
- Jens Maurer
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2286R7: Formatting Ranges
- [ Editor's note: D2286R7 was the active paper under discussion
at the telecon. The agenda and links used here reference P2286R7
since the links to the draft paper were not shared publicly. The
published document may differ from the reviewed draft revision.
]
- PBrett provided an introduction.
- Victor explained that LWG reviewed the wording and conditionally
approved the paper subject to SG16 review.
- PBrett recalled concerns raised in the past regarding the use of
Unicode properties.
- Victor replied that implementors were present during the LWG review
and did not express any objections or concerns.
- Victor noted that Hubert provided some wording tweaks during the
LWG review.
- PBrett presented wording updates Tom
proposed on the SG16 mailing list.
- PBrett expressed a preference for Tom's wording relative to the
current wording.
- Victor stated that he is ok with Tom's wording so long as it is
equivalent to the current wording in the paper for the Unicode
case.
- Victor stated that he would prefer not to defer to other format
facilities for the description of how hexadecimal values are
formatted.
- Hubert expressed a desire to document spacing and non-printable
characters by their encoded values as opposed to their character
names or glyphs.
- Hubert explained that doing so would free the library from having
to be aware of the literal encoding selected at compile-time.
- Tom acknowledged the concern; the set of spacing and non-printable
characters, or their encoded values, may differ for one literal
encoding vs another.
- Hubert noted that the concern applies to both EBCDIC and Windows
code pages.
- Hubert stated that the proposed wording could produce strange
results in cases where unnecessary shift states are present.
- PBrett observed that the current wording does not state which
encoding is used in cases where printable characters overlap with
control characters in a related encoding, but the proposed wording
does.
- Jens noticed that the proposed wording states that the literal
encoding is used to construct E, but not to interpret
S.
- Tom acknowledged the omission and stated that it needs to be
corrected.
- PBrett returned to his example where a locale encoding overlays
graphical characters over control characters and noted that the
overlaid characters would be interpreted in the literal
encoding.
- Hubert reported that his implementations are not affected by
overlay concerns and that locale support can be added later if
motivated.
- Zach asked if the method for determining whether S is in a
Unicode encoding matches the method specified for
std::format() in C++20.
- Tom replied that he didn't recall how it was specified.
- [ Editor's note: It does not appear to be specified. The relevant
wording simply states "For a string in a Unicode encoding, ...". See
[format.string.std]p11,
[format.string.std]p12,
and
[format.string.std]p14.
Improvements appear to be warranted. ]
- Jens stated that, with the exception of the use of CE for
string S, this is well-specified so long as it matches the
desired behavior.
- Tom noted that std::format() is intentionally locale
independent.
- Hubert reported that his implementations will likely assume a certain
literal encoding rather than storing the literal encoding actually
used at compile-time; that encoding is likely to be EBCDIC 1047 or,
for ASCII contexts, ISO-8859-1.
- Hubert expressed his understanding of the design intent to be that an
escaped sequence can be interpreted to reproduce the original byte
sequence.
- Hubert suggested that it may be worth adding a note to that
effect.
- Tom acknowledged the intent and noted that his wording fails to
reflect that intent for stateful encodings since state transitions in
S should be reflected as escape sequences rather than
interpreted when constructing E.
- PBrett asked for other concerns.
- Tom noted that there is the issue of handling boundaries of ill-formed
code unit sequences and asked if anyone wanted to argue for addressing
that now.
- PBrett expressed a preference not to address it now.
- Hubert suggested it could be unspecified or
implementation-defined.
- Tom replied that it is more-or-less implied at present.
- PBrett agreed.
- Tom summarized the discussion; we agree that we want revised wording
for this case but that we don't quite have what we want yet.
- Tom said he will inform LWG that we'll continue iterating on the
wording with the intent to have something approved by our next
meeting in two weeks.
- Hubert agreed with the summary.
- Victor gave a thumbs up.
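The design intent Hubert described above, that escaped output can be
interpreted to reproduce the original byte sequence, can be illustrated
with a minimal sketch. The helper names escape_bytes and unescape_bytes
are hypothetical and are not taken from P2286 or the standard library.
The sketch assumes an ASCII-compatible, stateless encoding and escapes
every byte outside printable ASCII as \x{hh}, which is a simplification
of the escaping rules under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: copy printable ASCII bytes through unchanged, double
// the backslash, and write every other byte as \x{hh} (two lowercase hex digits).
std::string escape_bytes(std::string_view s) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : s) {
        if (b == '\\') {
            out += "\\\\";
        } else if (b >= 0x20 && b <= 0x7e) {
            out += static_cast<char>(b);
        } else {
            out += "\\x{";
            out += hex[b >> 4];
            out += hex[b & 0x0f];
            out += '}';
        }
    }
    return out;
}

// Hypothetical inverse: only accepts text produced by escape_bytes above.
std::string unescape_bytes(std::string_view e) {
    auto nibble = [](char c) {
        return (c >= '0' && c <= '9') ? c - '0' : c - 'a' + 10;
    };
    std::string out;
    for (std::size_t i = 0; i < e.size(); ++i) {
        if (e[i] != '\\') { out += e[i]; continue; }
        assert(i + 1 < e.size());   // a backslash always starts an escape
        if (e[i + 1] == '\\') { out += '\\'; ++i; continue; }
        assert(i + 5 < e.size());   // the only other form is \x{hh}
        out += static_cast<char>(nibble(e[i + 3]) * 16 + nibble(e[i + 4]));
        i += 5;                     // skip "x{hh}"; the loop skips the backslash
    }
    return out;
}
```

Because every escaped byte is written out individually, decoding the
escaped text yields the original bytes regardless of how the bytes would
have been grouped into characters. A stateful encoding would need
additional care, as Tom noted.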
- P2558R1: Add @, $, and ` to the basic character set
- [ Editor's note: D2558R1 was the active paper under discussion at
the telecon. The agenda and links used here reference P2558R1 since
the links to the draft paper were ephemeral. The published document
may differ from the reviewed draft revision. ]
- Steve reported that the prose was updated to record the results of
prior discussions in order to better explain the intent; the wording
has not been changed.
- Steve presented the paper and noted that section 3 is new.
- Tom suggested adding a comment in
section 3.1 (Universal Character Name)
to indicate the character corresponding to \u0060.
- PBrett reported a typo in
section 3.4, "sting literal" should be "string literal".
- PBrett noted that, with respect to existing use of these characters,
they are usually used for convenience where another mechanism could
be used.
- Steve agreed and noted that such use generally occurs in contexts
that require some kind of magic and where they can generally be
escaped in some way.
- Tom stated that, with regard to raw string literals, a reason to
exclude the new characters from the delimiter portion is that these
characters might acquire meaning in the future that could become
problematic.
- Tom expressed a preference to exclude ` for now so that we
can preserve it for use as a new type of string literal.
- [ Editor's note: Such exclusion is unnecessary; the raw string
literal delimiter pattern is bounded by a double quote and a
parenthesis. Allowing use of ` in between those poses no
ambiguity for a hypothetical new string literal delimited by
`. ]
- Tom suggested the paper explicitly note that the proposed changes
enable these characters to portably be used in character literals
by virtue of being encoded as a single code unit.
- Steve agreed to update the paper.
- PBrett reported another typo in
section 3.3, "invarient" should be "invariant".
- Jens expressed a continuing interest in the paper showing examples
of behavioral changes.
- Hubert noted that such examples should be added to annex C.
- Steve reported two known compatibility issues:
- Use of a UCN to name one of these characters in
stringification.
- Use of a UCN to name one of these characters as an argument to
a function-like macro that does not use the corresponding
parameter.
- Steve stated that SG22 may want to review these updates.
- Steve suggested it may suffice to forward an updated paper via an
SG16 mailing list review.
- Tom agreed.
- Tom stated that the next meeting will be May 11th and will hopefully
include review of an updated revision of
D2572R0 (std::format() fill character allowances).
May 11th, 2022
Draft agenda:
Attendees:
- Charles Barto
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- P2286R8: Formatting Ranges
- [ Editor's note: D2286R8 was the active paper under discussion at
the telecon. The agenda and links used here reference P2286R8 since
the links to the draft paper were ephemeral. The published document
may differ from the reviewed draft revision. ]
- Victor summarized recent wording changes worked out on the SG16
mailing list.
- Victor asked if "code point" should be preferred over "character" in
the proposed wording for [format.string.escaped]p2.
- Tom replied that he is unaware of any normative use of "code point"
in the standard today.
- Victor responded that it is used in the wording for format field
width estimation.
- Hubert stated that the usage there is in a Unicode specific context
and that "character" is probably most appropriate here.
- Hubert pointed out that, in [format.string.escaped]p2.3, it is odd
that c is defined as a character, but then compared with
UCS scalar values.
- Jens agreed and proposed substituting "character" for
"UCS scalar value" in paragraph 2.3.1 and in the header of the
associated table.
- Jens suggested doing likewise in paragraph 2.3.3.
- Hubert argued that a change is not needed in 2.3.3 due to the use of
"corresponds".
- Tom noted the use in that paragraph is also a Unicode specific
context.
- Tom asked Charles if he continues to have concerns regarding the lack
of specification for determining the boundaries of ill-formed code
unit sequences.
- Charles replied that he does and that he would like to see it
addressed via a reference to the
WHATWG Encoding Standard.
- Tom expressed uncertainty as to whether such a normative reference is
possible given the lack of versioning around that standard.
- Charles suggested that the method specified in the WHATWG standard
could be replicated in the C++ standard; we want the "maximal subpart"
behavior described by policy option 2 in
Unicode PR-121.
- Hubert asked if that policy is defined for all UTF encodings.
- Charles replied that it is.
- PBrett asked what the motivation is for rigorously specifying how the
boundaries of ill-formed code unit sequences are determined.
- Charles replied that the goal is to ensure consistent output, but then
noticed that, in this case, the method used does not appear to be
observable since each code unit of the sequence is written to the
output anyway.
- Tom agreed that it should not matter for self-synchronizing
encodings.
- Charles noted that this will be the first instance of Unicode UCD
properties being normatively required by the C++ standard.
- Charles suggested that, if we're ok with such normative use, we could
revisit the wording for estimated format field widths to make the
uses there normative as well.
- [ Editor's note:
[format.string.std]p11
specifies normative encouragement of behavior that depends on UCD
properties in order to identify extended grapheme cluster boundaries.
]
- [ Editor's note: This would not be the first normative use of the
UCD properties;
[lex.name]p1
requires the XID_Start and XID_Continue properties
to determine identifier boundaries and validity. ]
- Jens requested that such changes not be handled via this paper.
- Steve asked how much data is required for the new uses of the
General_Category and Grapheme_Extend
properties.
- Victor replied that the necessary data fits in ~1K.
- Charles agreed and shared a
link to code
used to implement a grapheme break algorithm that uses the
Grapheme_Cluster_Break and Extended_Pictographic
properties.
- Charles noted that some creative packing is necessary to get the
size that small.
- Tom expressed surprise that Grapheme_Extend is small.
- Victor replied that it is composed of a number of compressible
ranges.
- [ Editor's note: The set of all characters that satisfy the
Grapheme_Extend=yes property can be viewed
here;
that set comprises 2090 code points in Unicode 14. ]
- Poll: Forward D2286R8 to LWG with 2.3.1 and associated table
revised to substitute "character" for "UCS scalar value" as
discussed for inclusion in C++23
- Attendance: 8
- No objection to unanimous consent.
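Victor's remark above that the property data consists of compressible
ranges can be illustrated with a minimal range-table lookup. This is a
sketch only: the table is a small illustrative excerpt of code points
with Grapheme_Extend=Yes, not the complete property, and the names
CodePointRange, grapheme_extend_excerpt, and in_grapheme_extend_excerpt
are hypothetical. It does not reflect the packing used by any actual
implementation mentioned in the discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

struct CodePointRange {
    char32_t first;
    char32_t last;  // inclusive
};

// Illustrative excerpt only, sorted by first code point. The real
// Grapheme_Extend property contains many more ranges than these.
constexpr CodePointRange grapheme_extend_excerpt[] = {
    {0x0300, 0x036F},  // combining diacritical marks
    {0x0483, 0x0489},  // combining Cyrillic marks
    {0x200C, 0x200C},  // ZERO WIDTH NON-JOINER
    {0xFE00, 0xFE0F},  // variation selectors
    {0xFE20, 0xFE2F},  // combining half marks
};

// Binary search: find the first range that starts after cp, then check
// whether the range just before it contains cp.
bool in_grapheme_extend_excerpt(char32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: cp is within the Unicode code space
    auto it = std::upper_bound(
        std::begin(grapheme_extend_excerpt), std::end(grapheme_extend_excerpt), cp,
        [](char32_t value, const CodePointRange& r) { return value < r.first; });
    return it != std::begin(grapheme_extend_excerpt) && cp <= std::prev(it)->last;
}
```

Storing one pair per range rather than one entry per code point is what
keeps such a table small; further packing, as Charles noted, can reduce
the size again.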
- P2558R1: Add @, $, and ` to the basic character set
- [ Editor's note: D2558R1 was the active paper under discussion at
the telecon. The agenda and links used here reference P2558R1 since
the links to the draft paper were ephemeral. The published document
may differ from the reviewed draft revision. ]
- Steve introduced the changes made since the last review; just the
addition of annex C wording.
- Poll: Forward D2558R1 to EWG for inclusion in C++26
- Attendance: 8 (1 abstention)
- Consensus in favor.
- Steve stated that he would follow up with SG22 with regard to issues
found that were not discussed in WG14.
- [ Editor's note: Steve did so via a
post to the C liaison list.
]
- Tom stated that the next meeting is scheduled for May 25th.
May 25th, 2022
Draft agenda:
Attendees:
- Charles Barto
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Robin Leroy
- Steve Downey
- Tom Honermann
Meeting summary:
- In honor of a new attendee, a round of introductions was conducted.
- D2572R0: std::format() fill character allowances
- Tom presented the paper.
- Robin pointed out a spelling error; "IDIOGRAPHIC" -> "IDEOGRAPHIC"
(two occurrences).
- Charlie explained that the ABI mitigation technique discussed in the
"Future considerations and ABI" section relies on persistence of at
least the fill character portion of the format string but such
persistence is not otherwise currently required because the format
string is evaluated at compile-time.
- Charlie stated he could imagine ways of accomplishing the goal
though.
- Tom asked for confirmation that the Microsoft implementation is
already shipping and locked into its current ABI.
- Charlie confirmed that is the case.
- Tom stated that he would add a note to the ABI section stating that
some implementations are already locked in to their current
behavior.
- Steve commented that there are escape hatches and that other
extension means are possible should the need arise.
- Charlie explained why implementing ABI resiliency would likely impose
dynamic memory management costs including possible lifetime
management challenges.
- Jens reported finding it a bit concerning that the estimated width of
a character would be honored in some cases but not in others, but
recognized the trade-offs involved.
- Jens stated that boilerplate wording is needed within the format
section in order for the proposed use of "U+007B LEFT CURLY BRACKET"
and "U+007D RIGHT CURLY BRACKET" to be applicable to the literal
encoding.
- Tom stated that the proposed wording changes to table 64 need work;
in "if that value is negative", it is not clear whether "value"
refers to "n" or to
"the width of the formatting argument".
- Jens requested that "estimated" be inserted before "width" in
"the width of the formatting argument".
- Hubert stated that "formatting argument" doesn't sound like the right
term in this context; it should probably be "formatted argument".
- Tom reported that this term was used for consistency with wording
elsewhere but that he would review and try to improve.
- Jens requested that the note following table 64 be modified to
replace "ignored" with "assumed to be 1".
- Tom agreed.
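The behavior discussed above, in which the estimated width of the
formatted argument is honored while the width of each fill character is
assumed to be 1, can be sketched without std::format. The function
pad_left_aligned and its parameters are hypothetical and exist only for
illustration; the caller supplies the estimated width of the text, and
the fill is a single code point already encoded as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not std::format: left-align text in a field of
// field_width columns. text_width is the caller-supplied estimated width
// of text. Each copy of fill is counted as exactly one column, whatever
// width a terminal would actually give it.
std::string pad_left_aligned(std::string_view text, std::size_t text_width,
                             std::string_view fill, std::size_t field_width) {
    assert(!fill.empty());  // precondition: a fill code point is provided
    std::string out(text);
    for (std::size_t col = text_width; col < field_width; ++col) {
        out += fill;
    }
    return out;
}
```

With "*" as the fill, padding "ab" to a field width of 5 appends three
fill characters. With a fill whose estimated width is two, such as the
clown face emoji, the same rule still appends one fill per missing
column, so the displayed field may be wider than requested. That is the
trade-off Jens described.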
- L2/22-072R: Proposal for amendments to UAX#9 and UAX#31
- Tom provided a brief introduction.
- Hubert asked if it is necessary to address the UAX31-R3 conformance
concern for C++23.
- Tom replied that he did not believe so since the annex is
non-normative.
- [ Editor's note: Zoom crashed for Tom and it took him several
minutes to get reconnected.
Jens assured him that the time missed primarily concerned the
flogging of a dead horse. ]
- Jens asked what version of Unicode is expected to receive the
proposed amendments.
- Robin replied that Unicode 15 is expected to have these updates and
that more significant normative changes are anticipated for
Unicode 16.
- Robin stated that Unicode 15 is expected to be released in
September.
- Jens observed that September would be just in time for adoption in
C++23.
- Steve suggested that the annex could be updated to claim
non-conformance with UAX31-R3.
- Hubert agreed and noted that we may not want to change our dated
UAX31 reference at that late point in the C++23 release cycle.
- Jens proposed that we proceed with an NB comment on the C++23
committee draft to request upgrading the bibliography reference for
UAX31 to Unicode 15.
- Jens explained that, since the new Unicode release won't be available
before then, it won't be possible to act on a core issue and an NB
comment would end up being required anyway.
- Robin directed discussion to allowances for
U+200E LEFT-TO-RIGHT MARK (LRM) and U+200F RIGHT-TO-LEFT MARK (RLM)
to be used in combination with other whitespace.
- Robin stated that example wording can be found in the
Ada 2012 reference manual;
section 2.2 paragraph 7 1/3, "Lexical Elements, Separators, and Delimiters"
states:
"One or more other_format characters are allowed anywhere that a
separator is; any such characters have no effect on the meaning of
an Ada program."
- Jens noted that this would be a new kind of whitespace for C++ since
sequences of these marks by themselves would not constitute
whitespace.
- Jens expressed curiosity regarding "implicit directional marks" as
discussed in L2/22-072R.
- Robin replied that "implicit directional marks" is discussed in
UAX #9 section 2.6, "Implicit Directional Marks".
- Robin explained that, per
UAX #9 section 6.5, "Conversion to Plain Text",
such marks may be implicitly inserted during conversion to plain text
for text subject to protocol
UAX9-HL4
and that Unicode 15 will recommend that protocol for source code
text.
- Hubert asked if it would make sense to prohibit sequences consisting
of more than one of these marks.
- Robin replied that he knew of no motivation for doing so and that the
presence of multiple marks should not have any negative
consequences.
- Jens expressed opposition to these marks constituting whitespace
separation in isolation.
- Tom agreed.
- Jens noted that specification of LRM and RLM in whitespace will have
to target C++26 as C++23 is now closed to new language changes.
- Jens suggested that such a change could be adopted as a DR against
C++23 to encourage recognition of these marks as a conforming
extension in prior language modes.
- Jens stated that a paper will be needed and that it should await the
availability of Unicode 15.
- Tom stated he would file an SG16 GitHub issue to track the request for
such a paper.
- [ Editor's note: Tom filed
SG16 issue 74: Extend whitespace to include NEL, LS, PS, LRM, RLM, and maybe ALM.
]
- Tom noted that this isn't a particularly urgent issue to address.
- Jens countered that it would be helpful to prevent obfuscated display
of source code and that the desire to avoid such confusion has gained
prominence in recent times.
- Jens directed discussion towards future conformance with UAX31-R3 and
noted that there are questions about Pattern_White_Space that need
to be answered.
- Robin asked which Pattern_White_Space characters are not
considered whitespace in C++.
- Hubert listed them; they are all the ones outside the ASCII subset.
- U+0085 NEXT LINE
- U+200E LEFT-TO-RIGHT MARK
- U+200F RIGHT-TO-LEFT MARK
- U+2028 LINE SEPARATOR
- U+2029 PARAGRAPH SEPARATOR
- Hubert noted that the above characters can only appear in comments and
character or string literals currently.
- Jens asked if it is the intent of UAX31-R3 to require that all of the
characters in Pattern_White_Space be supported as
whitespace.
- Robin replied that allowing a subset would render the requirement
vacuous.
- Jens suggested the possibility of updating UAX31-R3 to specify a
minimal subset.
- Robin responded that a sub-requirement like UAX31-R3a could be
introduced; such sub-requirements can be found elsewhere in
UAX #31.
- Steve noted that the current normative text of UAX31-R3 allows
deviation by specifying a profile.
- Hubert asked what motivation exists for not accepting the other
whitespace characters.
- Steve noted existing practice and suggested this be addressed in the
future paper.
- Jens directed discussion to conformance with the
Pattern_Syntax requirement of UAX31-R3.
- Robin expressed a belief that C++ conforms to that.
- Tom expressed curiosity with regard to the presence of . in
Pattern_Syntax and its use within floating point
literals.
- [ Editor's note: Tom's concern stems from the following note in
the description of UAX31-R3. Is the use of . in
floating point literals considered syntax or part of a literal?
"Note: When meeting this requirement, all characters except
those that have the Pattern_White_Space or Pattern_Syntax properties
are available for use as identifiers or literals."
]
- Hubert stated that this kind of confusion is why he is hesitant to
declare conformance to UAX31-R3 prior to improved wording that will
hopefully appear in Unicode 16.
- Jens summarized the three tasks identified so far:
- For C++23, file an NB comment after the July plenary to update
[uaxid.pattern]
in annex E to state that conformance with UAX31-R3 is not claimed.
At the same time, update the UAX references in the bibliography
to refer to Unicode 15.
- For C++26, author a paper to add LRM, RLM, and other
Pattern_White_Space characters to the set of whitespace
characters.
If support for U+061C ARABIC LETTER MARK is also desired, that
will require a profile to conform with UAX31-R3.
- For C++26, update
[uaxid.pattern]
in annex E to claim conformance with UAX31-R3.
At the same time, update the UAX references in the bibliography
to refer to Unicode 16 (or later).
- Robin noted that Unicode 15 is planned for release on September 13th
per
https://www.unicode.org/versions/beta-15.0.0.html.
- Tom recalled Hubert mentioning on the mailing list that
U+000D CARRIAGE RETURN (CR) can now be added to the basic character
set.
- Hubert acknowledged and opined that we can do so as part of
P2348: Whitespaces Wording Revamp.
- Tom stated that CR presumably should have already been present
because of the existence of the \r escape sequence.
- Jens explained that \r creates a requirement for literal
encodings but not for the basic character set nor an allowance for
its use in whitespace.
- Tom noted that a paper will be needed that targets SG15 and discusses
the concerns and options available to implementations with regard to
UAX9-HL4 and presentation of source code that contains right-to-left
characters.
- [ Editor's note: Tom filed
SG16 issue 75: SG15 proposal for implementations that present source code to conform with UAX9-HL4
to track producing such a paper. ]
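The Pattern_White_Space discussion above can be summarized in a short
sketch. The function names are hypothetical. The first predicate lists
the eleven code points that have the Pattern_White_Space property; the
second selects the ones outside the ASCII subset, which are the five
characters Hubert listed as not currently treated as whitespace in C++.

```cpp
#include <cassert>

// Hypothetical predicate: true for the code points that have the Unicode
// Pattern_White_Space property.
bool is_pattern_white_space(char32_t c) {
    assert(c <= 0x10FFFF);  // precondition: c is within the Unicode code space
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEXT LINE
        || c == 0x200E                   // LEFT-TO-RIGHT MARK
        || c == 0x200F                   // RIGHT-TO-LEFT MARK
        || c == 0x2028                   // LINE SEPARATOR
        || c == 0x2029;                  // PARAGRAPH SEPARATOR
}

// Hypothetical predicate: the Pattern_White_Space characters outside ASCII,
// matching the list given in the discussion.
bool is_non_ascii_pattern_white_space(char32_t c) {
    return c > 0x7F && is_pattern_white_space(c);
}
```

U+061C ARABIC LETTER MARK is deliberately absent; as noted in the task
list above, supporting it would require a profile.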
- Tom stated that the next meeting will be in two weeks, on 2022-06-08.
June 8th, 2022
Draft agenda:
Attendees:
- Charlie Barto
- Hubert Tong
- Inbal Levi
- Jens Maurer
- Mark de Wever
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
Meeting summary:
- D2572R0: std::format() fill character allowances
- PBrett lamented the lack of a published revision with a "P"
designation and change history that reflects the evolution of the
design and wording.
- Tom stated, in response to preferences expressed by Victor on the
mailing list, that he will add a drafting note regarding the change
of "specifier" to "option" in the description of the align
grammar production so that LWG will be sure to review.
- Jens requested that the drafting note regarding note renumbering be
removed since the LaTeX machinery will handle that automatically.
- Tom stated that he would add a note following the format examples
that mentions that the clown face emoji has an estimated width of
two.
- PBrett suggested adding an example with a formatting argument that
contains a character with an estimated width other than one;
essentially an example that swaps the fill character and formatting
argument in example s7.
- Jens requested that note X be modified to drop "estimated" and to
replace "width of the fill character" with "width of any
fill character".
- Jens expressed distaste for the existing wording in
[format.string.std]p11
that states "estimated width of ... UCS scalar values";
UCS scalar values are not characters.
- Hubert noted a missing "the" in the same paragraph;
"the sum of the estimated widths".
- Poll 1: Revise D2572R0 "std::format() fill character allowances"
as discussed, and forward the paper so revised to LEWG as the
recommended resolution of LWG3576 and LWG3639.
- Attendance: 8 (2 abstentions)
- Consensus in favor.
- Discuss survey questions to suggest for the 2023 C++ Developer Survey
- PBrett proposed separating the survey questions into two categories:
- The form of the source code; how the source code is written.
- The facilities used to perform text processing.
- PBrett suggested that questions about tools might comprise an
additional category.
- Inbal suggested asking for input on what topics most urgently require
solutions in the standard.
- Tom replied that a free form question could be used for that and that
the current developer survey presents such free form responses as a
word cloud.
- PBrett asserted that the questions should address the topics we most
want to learn about.
- PBrett suggested asking what facilities programmers are using in
place of standard facilities like std::regex and
std::locale that are known to have significant design
issues.
- Hubert noted that the standard notion of locale encompasses both
interface and locale identification; a programmer may use
std::locale or the C locale facilities for locale
identification, but then use alternate facilities for locale dependent
behavior.
- PBrett pondered whether the facilities programmers actively use to
support internationalization and localization are among the topics we
are most ignorant of.
- Hubert responded that there is speculation that programmers avoid the
standard facilities but that we don't have data to confirm that.
- Steve stated that Bloomberg actively avoids the standard locale
related facilities.
- Victor suggested it might be helpful to ask if programmers are
intentionally using the standard locale facilities and noted that many
do so inadvertently.
- Victor noted that the questions need to consider C and C++ locale
facilities as distinct facilities.
- PBrett suggested structuring the questions as:
- "Do you provide internationalization support", and
- "If so, how do you provide internationalization support".
- Victor stated that those questions should include multiple selection
responses that include C, C++, POSIX, etc.
- PBrett suggested adding ICU and other popular packages.
- PBrett asked for additional topics that we might be particularly
ignorant about.
- Steve suggested asking if programmers are still having to work with
multiple character encodings within their main application and, if so:
- whether they transcode at application boundaries and work
exclusively with Unicode internally, or
- whether they work directly with data in whatever character
encoding it is provided in.
- PBrett suggested it would be useful to know how often programmers
use regular expressions where the pattern is not known until
run-time.
- Victor pondered whether Hana coerced Peter to ask that question.
- PBrett responded that she did not, but that the question does concern
whether and to what extent CTRE is a suitable substitute for
std::regex.
- Steve agreed that it would be useful to understand what the
requirements for a replacement are.
- Charlie stated that a replacement would need to be more ABI resilient
but that there is no need to pass compiled regular expression objects
across module boundaries.
- Steve suggested asking which regular expression languages programmers
are using.
- PBrett responded that question requires an "I don't know" response
option.
- Tom asked if it would be helpful to know how many programmers are
using TCHAR in Windows environments.
- Charlie reported suspicion that many programmers are still using
that.
- Tom wondered what we might do with such data if we had it.
- PBrett suggested it would be interesting to know what libraries
programmers are using for Unicode algorithm support and string
classes.
- Tom asserted that that question would need to be multiple choice with
an "other" option.
- Tom recalled one of the questions Peter had suggested on the mailing
list, "what language(s) do you use in identifiers and comments?", and
asked whether the question was intended to probe the languages used or
whether characters outside the basic character set are being
used.
- PBrett replied that the interest is in the language; the goal is to
find out which scripts are being used.
- Tom recalled one of the questions Zach suggested,
"How often do you use multiple Unicode normalization forms in the
same program?" and commented that this is similar to the encoding
question; whether programmers normalize at program boundaries or do
not normalize at all.
- Steve stated this is an important consideration for deciding if
normalization belongs in the type system.
- Tom mentioned that Zach also posed questions about collation
support.
- Steve replied that collation needs depend on context; data in a
database is likely to be ordered in a locale independent manner but
may need to be reordered for presentation in a user's locale.
- Inbal asked if serialization is within the purview of SG16.
- Tom replied that it could be for producing and consuming text
formats.
- Inbal stated that, given a list of keywords, it would be possible to
scrape stackoverflow.com for related questions.
- Tom wondered about asking if programmers place locale constraints on
their users; for example, whether they require use of UTF-8.
- Hubert responded that such a question won't garner a 100% yes answer,
so may not be so helpful.
- Tom replied that it might be useful to help guide where we invest
effort; if lots of products support non-UTF-8 environments, then we
know to focus more broadly.
- Hubert agreed with that perspective.
- PBrett stated that a transcoding facility remains high on our
priority list and asked if the question about internal encodings and
translation at program boundaries was intended to probe actual
need.
- Tom responded that he had not thought about that relationship
concretely.
- Steve suggested that question is more related to how many programmers
need to operate directly on text in multiple encodings and if
facilities are needed to do so.
- Hubert stated that gathering interest in a rope class would be
useful.
- Tom asked if that would be for the case of stringing together blocks
of text that are differently encoded.
- Hubert responded affirmatively.
- PBrett noted such a type is also useful in cases where buffers are in
different places.
- Tom pondered what we would do with data regarding which character
types are in use; for example, whether we would choose not to focus
on char32_t as anything other than a code point type.
- Steve suggested asking if programmers use signed char and
unsigned char for text as opposed to for use as small
integers like int8_t and uint8_t or for other forms
of bit manipulation.
- Hubert responded that unsigned char is likely used for
UTF-8.
- Victor responded that unsigned char is often used for
uint8_t and that this causes formatting confusion.
- Inbal asked if C and C++ compatibility is important.
- Tom replied that it is helpful to decide if we need low level C
utilities that are exposed via C++ wrappers.
- Tom added that JeanHeyd has been pursuing the approach of getting low
level facilities for transcoding through WG14 before continuing work
on transcoding support in WG21.
- PBrett raised the question of which kind of C string interfaces are
important; null terminated vs Pascal strings.
- PBrett mentioned WG14 Annex K and noted that Microsoft can't ship it
due to conflicts with their historical secure function
implementation.
- Hubert returned to the topic of localization and stated it would be
useful to know if programmers customize locale formatting or just
rely on default formatting.
- Tom stated that he will draft a Google doc with an initial set of
questions based on this discussion that we can all comment on and
contribute to.
- Further discussion ensued regarding stability vs the need to evolve
and fix defects.
- Tom stated that the next meeting will be on 2022-06-22 and that, if we
make good progress refining and suggesting survey questions in the
interim, then we'll probably continue this discussion then.
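The formatting confusion Victor mentioned in the survey discussion, where
unsigned char is used as uint8_t, can be shown with a minimal iostreams
sketch. It assumes that std::uint8_t is an alias for unsigned char and
that the execution character set is ASCII-compatible; both hold on
common platforms but neither is guaranteed. The function names are
hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Streams the value as-is. Where uint8_t is unsigned char, this selects
// the character overload of operator<< and writes a character.
std::string stream_directly(std::uint8_t value) {
    std::ostringstream os;
    os << value;
    assert(os.good());
    return os.str();
}

// Unary plus promotes the value to int, so the integer overload is used
// and the decimal digits are written.
std::string stream_as_integer(std::uint8_t value) {
    std::ostringstream os;
    os << +value;
    assert(os.good());
    return os.str();
}
```

Under the stated assumptions, streaming the value 65 directly produces
the letter A, while the promoted form produces the digits 65.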