SG16: Unicode meeting summaries 2021-04-14 through 2021-05-26
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
Previously published SG16 meeting summary papers:
April 14th, 2021
Draft agenda:
Attendees:
- Corentin Jabot
- Hubert Tong
- JeanHeyd Meneide
- Jens Maurer
- Mark Zeren
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
- Zach Laine
Meeting summary:
- PBrett introduced the agenda.
- P2295R2: Correct UTF-8 handling during phase 1 of translation
- Corentin introduced:
- This is a proposal to require that UTF-8 be one of the set of
otherwise implementation-defined source file encodings.
- With regard to ill-formed code unit sequences, there is no such
thing; the source code is either valid UTF-8 or it is not
UTF-8.
- GCC does not validate its presumed UTF-8 input.
- With regard to BOMs, the proposal does not impose any
requirements other than that a BOM present in a UTF-8 source
file be ignored for the purposes of lexing.
- An implementation may use the presence or non-presence of a BOM
as part of its source file encoding determination.
- The proposed wording will require updates for changes that will
presumably be adopted from Jens'
P2314: Character sets and encodings.
- This proposal follows Beman Dawes' earlier proposal,
N3463: Portable Program Source Files.
- At present, the C++ standard has no requirement for a portable
source file.
- Tom stated that gcc will perform UTF-8 validation if both
-finput-charset=utf-8 and -fexec-charset=utf-8
are specified.
- [ Editor's note: Tom was wrong (and since Tom is also the editor,
he can be blunt like that); gcc only validates UTF-8 for string
literals, and then only if -fexec-charset=<encoding>
is specified. ]
- Jens noted a capitalization issue in the wording; the sentence
following the added note in [lex.phases]p1 has a capitalized "The"
following a ";".
- Jens asked why the note added to [lex.phases]p1 is just a note; the
preceding prose provides a definition, but does not impose any
requirements.
- PBrett responded that, if an invalid sequence is present, then there
is no sequence of Unicode scalar values.
- PBrett asked if moving the note after the following sentence would
resolve the concern.
- Jens replied that it would not; that would define a UTF-8 source file
and state that a well-formed UTF-8 source file must be accepted, but
would impose no requirements on an ill-formed UTF-8 source file.
- PBrett acknowledged that further wording work is needed.
- Jens observed, and noted that the paper discusses, that
implementations can accept source files that approximate UTF-8.
- Hubert noted that a normative statement is needed to state that it is
implementation-defined how a requirement for UTF-8 source files is
specified.
- PBindels suggested placing a requirement for well-formed input with
the character set definitions.
- Jens indicated no objection to clarification, but that he would like
to see the ISO 10646 definition of "well-formed".
- Steve observed that the note is stating that invalid UTF-8 sequences
cannot happen in a well-formed UTF-8 source file.
- Jens responded that there is a normative difference between
something that cannot happen and something that is ill-formed; the
latter requires a diagnostic.
- Hubert asserted that the wording needs to establish intent; a
sequence of bytes may happen to be well-formed UTF-8, but the
wording needs to ensure that the bytes were intended to be
interpreted as UTF-8.
- PBindels summarized; we need to state there is an
implementation-defined way to specify that a source file is to be
interpreted as UTF-8.
- Jens agreed.
- JeanHeyd agreed from chat, "Yes, Hubert's definition is correct. You
have to make it so the implementation has a way to mark/identify a
source file as UTF-8, and then you can impose these requirements."
- Corentin stated the intent; that the compiler determine the source
encoding in an implementation-defined way, but that a source file
that does not decode successfully is diagnosed as ill-formed.
- Tom suggested specifying that the file must decode successfully as
opposed to being well-formed.
- PBrett stated that a branch is needed in translation phase 1 to
distinguish the cases where the source file is encoded as UTF-8 vs
some other encoding.
- Zach suggested that a definition for a UTF-8 source file is
unnecessary.
- PBindels expressed concern that there may be a conflict between use
of a BOM and a truly portable source file.
- PBrett responded that the goal is that, if a source file is UTF-8
encoded, there is a way to direct an implementation to process
it as such.
- Jens acknowledged and added that an implementation could require use
of a command line option to opt-in to UTF-8 encoded source files;
that implies that the source file is not automatically portable,
but is the best we can do.
- Tom agreed and stated that the only way we could do better is to
require a BOM everywhere and nobody wants that.
- Zach noted that the only statement made regarding a BOM is that it
can be ignored; presumably after encoding determination is complete
so that the BOM doesn't interfere with translation phase 2.
- Hubert noted that, once the encoding is determined to be UTF-8, a
BOM is portably ignored.
- PBrett encouraged assumption of non-hostile implementations; no
implementation is going to require a BOM in order for a UTF-8
encoded source file to be processed as such.
- Several relevant comments were made from chat:
- Steve: "We want portable source code. If anyone requires a BOM,
then portable source code needs one."
- JeanHeyd: "If you put in a BOM and use -fexec-charset=SHIFT-JIS,
the implementation can ignore the BOM and still read everything
as SHIFT-JIS."
- Hubert: "If you did that, the BOM is not a BOM..."
- Jens suggested that the wording needs to establish when encoding
determination happens; that should be the first step of translation
phase 1.
- Jens added that the wording should be consistent with regard to
encoding vs encoding form vs encoding scheme.
- Tom stated that, for UTF-8, encoding form vs encoding scheme doesn't
matter, but that encoding scheme should be used if the intent is for
the wording to be compatible with UTF-16 or UTF-32.
- Hubert asserted that, since the context is byte oriented files,
encoding scheme should be used.
- Jens reiterated the necessary wording updates; the encoding scheme to
use must first be established, then the source file can be validated
and diagnostics issued if it fails to conform to the encoding
scheme.
- Jens added that the wording needs to prevent the current
implementation-defined mapping to the internal encoding from being
applied to UTF-8 source files.
- PBindels asked if the added sentence in translation phase 2 regarding
the "first codepoint" applies to each source file or just to the
primary source file.
- Tom and Corentin replied that translation phases 1 through 3 are
performed separately for each source file.
- Hubert suggested that translation phase 2 should discard a lead
U+FEFF character regardless of the source file encoding.
- Jens noted that the added translation phase 2 sentence doesn't make
sense without the wording changes proposed in
P2314: Character sets and encodings
due to character translation to universal-character-name in
translation phase 1.
- Tom noted that the wording changes in P2314 allow distinguishing a
source file with a BOM and a source file that starts with a
\uFEFF universal-character-name.
- Jens clarified that, after P2314, a universal-character-name
isn't translated to a UCS scalar value until translation phase 3.
- Hubert stated that it is a design question whether we want to treat a
leading \uFEFF universal-character-name as a BOM.
- PBrett asked PBindels if he is satisfied with the BOM design
following prior discussion.
- PBindels responded that he is, so long as we don't intentionally or
unintentionally create the situation where UTF-8 source files end up
requiring a BOM in practice.
- PBrett asked if we should add normative encouragement not to require
a BOM.
- Hubert noted that, as wording updates are done, care must be taken to
ensure we don't lose the wording that requires an implementation to
accept a UTF-8 encoded source file whether it does, or does not,
contain a BOM.
- Tom asked about handling of differently encoded source files.
- JeanHeyd replied in chat, "I think it's better to leave Encoding
Identification to Tom's Paper on the subject."
- Tom replied in chat, "Assuming I actually deliver on that
threat..."
- Hubert responded that the implementation must provide some means for
standard headers (as opposed to header files) to remain usable when
the implementation is running in UTF-8 mode.
- Steve added in chat, "Which might be 7 bit ascii for those headers.
Which is largely the case today."
- We wish to require implementations to support UTF-8 source files.
- Attendance: 10
- No objections to unanimous consent.
- We wish to require implementations to be capable of accepting UTF-8
source files whether or not they begin with a U+FEFF byte order mark.
- Attendance: 10
- No objections to unanimous consent.
- Hubert reported that Clang allows non-UTF-8 encoded header names in
#include directives in otherwise UTF-8 encoded source
files.
- Steve stated that, since file names are not required to be
representable in UTF-8, requiring strictly well-formed UTF-8 could
have unanticipated consequences.
- JeanHeyd asked in chat, "Does `\xFF` work in header-names as an
escape?"
- Corentin replied in chat, "unspecified".
- Corentin explained his intent in requiring diagnosis of ill-formed
UTF-8 input.
- PBindels asked why it is useful to allow invalid UTF-8 in
comments.
- Corentin replied that Clang source code has comments explaining why
invalid UTF-8 in comments is explicitly allowed and provided a link
to the source code.
- PBrett shared cases of copyright symbols appearing in otherwise ASCII
files.
- Tom noted that non-ASCII characters tend to appear in author, product,
and company names in comments.
- Hubert stated that source files that iconv will reject are
undesirable.
- We wish to require implementations to have a mode in which they diagnose ill-formed UTF-8 source files (regardless of whether the ill-formedness is located in comments, header names or string literals).
- Consensus is strongly in favor.
- SF: As it stands right now, people are already basically rolling
the dice with their source files. This is strictly an improvement
over the status quo, because now there is, at least, one entirely
portable way to write source code.
- Corentin asked about necessary wording to support both source files
and non-files.
- Hubert responded that (standard library) headers are not source
files; source files are those things that are included by
#include directives that do not name standard headers.
- PBrett asked if the wording should be modified to discuss "input"
as opposed to "files".
- Hubert responded that such a change is not necessary.
- Corentin pledged to bring back a revised paper.
- Tom stated the next telecon will be April 28th.
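Illustration (not part of the meeting record): the phase 1 behavior discussed above for P2295R2 could be sketched roughly as follows, assuming the implementation has already determined, by some implementation-defined means, that the source file is UTF-8. The names `bom_length` and `first_ill_formed` are hypothetical; they are not taken from the paper or from any compiler. The byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of a leading UTF-8 BOM (EF BB BF),
// or 0 if the buffer does not start with one.
std::size_t bom_length(const std::string& s) {
    return (s.size() >= 3 &&
            static_cast<unsigned char>(s[0]) == 0xEF &&
            static_cast<unsigned char>(s[1]) == 0xBB &&
            static_cast<unsigned char>(s[2]) == 0xBF) ? 3 : 0;
}

// Hypothetical helper: offset of the first ill-formed UTF-8 sequence at or
// after `pos`, or std::string::npos if the rest of the buffer is well-formed.
std::size_t first_ill_formed(const std::string& s, std::size_t pos = 0) {
    const std::size_t n = s.size();
    while (pos < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[pos]);
        if (b0 <= 0x7F) { ++pos; continue; }   // ASCII

        std::size_t len = 0;                    // expected sequence length
        unsigned char lo = 0x80, hi = 0xBF;     // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;          // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;          // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;          // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;          // reject values above U+10FFFF
        } else {
            return pos;                         // C0, C1, F5..FF, or stray trail byte
        }

        if (pos + len > n) return pos;          // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[pos + 1]);
        if (b1 < lo || b1 > hi) return pos;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[pos + k]);
            if (b < 0x80 || b > 0xBF) return pos;
        }
        pos += len;
    }
    return std::string::npos;
}
```

A front end built along these lines would skip `bom_length(src)` bytes and then issue a diagnostic if `first_ill_formed(src, bom_length(src))` returns anything other than `std::string::npos`, whether the offending bytes sit in a comment, a header name, or a string literal.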
April 28th, 2021
Draft agenda:
Attendees:
- Charlie Barto
- Corentin Jabot
- Hubert Tong
- Jens Maurer
- Mark Zeren
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- Charlie Barto was welcomed with a round of introductions.
- PBrett introduced the agenda.
- LWG3547: Time formatters should not be locale sensitive by default
- PBrett presented:
- Peter's presentation slides are available
here.
- As currently specified, whether a format specifier is locale
dependent is not obvious.
- Floating point values are locale independent by default, but
chrono values are not.
- There is no systematic way to format locale-independent and
locale-dependent chrono values.
- Victor expressed a preference for chrono values being locale
independent by default.
- Victor explained that the current specification derived from existing
specifiers used elsewhere.
- Victor noted that, in some cases, specifiers are not available for
locale independent formatting.
- Victor reported success with a prototype implementation of the
proposed resolution that performs locale independent formatting of
chrono values unless a L specifier is present.
- Charlie stated that changes to the format specifier syntax may have
more implementation impact than just requiring changes to the
implementation behavior.
- [ Editor's note: Discussion regarding the amount of time
available to make changes before implementations of
std::format() are shipped to users ensued. That discussion
is not recorded as it involved discussion of internal company time
lines that have not yet been stated in public. ]
- PBrett noted that there are two related issues:
- 1: The format specification syntax.
- 2: The behavior of the format specifiers.
- PBrett explained that the proposed resolution addresses both concerns
by making the format syntax consistent in requiring a L
specifier to opt-in to locale dependent behavior.
- Charlie noted that std::format() does not currently perform
any transcoding operations; not for format arguments, and not
for text provided by a locale that uses a different character encoding
than the literal encoding.
- Charlie added that std::format() does need to be encoding
aware for the purposes of field width estimation.
- Corentin stated that the intent of the proposed resolution is to
ensure that std::format() use consistent syntax to opt-in to
locale dependent formatting and encouraged trying to address at least
this concern.
- Corentin added that LWG might agree on a resolution in a short time
frame, but that there will not be a plenary poll until June.
- PBrett stated that the resolution may be considered evolutionary.
- Victor agreed and noted that the L specifier could be added
for a future standard.
- Victor asserted that we do need to decide what the default behavior
is now.
- Victor added that we could consider transcoding locale provided text
and potentially detecting mojibake if it would be produced.
- Victor noted that the format string is always a literal.
- [ Editor's note: In C++20, the format string may not be a literal,
but
P2216,
if adopted, will require a literal or other compile-time evaluated
expression. ]
- Zach asked for clarification regarding what is meant by
"default behavior" and noted that the %Ou specifier is
locale dependent, but that %u is not.
- Victor responded that there are cases like %T that do not
have locale independent forms.
- [ Editor's note: %T is locale dependent because the
decimal point character potentially used for sub-second precision is
provided by the locale. ]
- Hubert stated that these concerns will be difficult to resolve
quickly, are clearly evolutionary, and may require balloting.
- Hubert added that there may also be issues with requiring the locale
independent behavior to use English translations.
- Tom noted that the basic source character set already has a bias
toward English.
- Hubert responded that this goes further; we may potentially have to
specify behavior in terms of asctime().
- Charlie commented that the text provided by the locale facet is
currently produced by the operating system; changing that behavior
may not be problematic.
- Charlie added that adding new format specifiers will result in
incompatibilities if code that uses those specifiers is run with an
older library implementation that doesn't support them.
- Charlie noted that, if support for compile-time format string
checking is adopted via
P2216,
then the format string will become part of the function template
specialization; this may help to avoid library compatibility
issues.
- Charlie stated that there are multiple sources of locale information
and that formatting of the chrono types is governed by the Windows
region settings.
- Charlie noted that changes to the Windows region settings require a
reboot.
- Tom asked for confirmation that calls to std::setlocale()
don't affect how chrono values are formatted.
- Charlie confirmed that is correct.
- PBrett asked if std::format() behavior is affected by
changes to the global locale via std::locale::global().
- Charlie responded that the global locale does affect the behavior of
format specifiers that include the L specifier.
- Charlie clarified that the global locale will not affect parsing of
the format string itself.
- Corentin requested review of the proposed resolution.
- Hubert noted that the wording requires that the "C" locale be used
for field formats that do not include the L specifier
regardless of whether a std::locale argument is passed.
- Hubert noted that under the C++20 wording, implementations trying to
accommodate this tentative future direction may be more able to ignore
the global locale than an explicit locale argument. So, a change
that maintains respecting the locale parameter is more compatible
with C++20.
- Tom responded that doing so would not be consistent with the other
standard format specifiers.
- Victor agreed and added that he would be strongly opposed to implicit
use of a std::locale parameter.
- Jens stated that a migration path to better behavior needs to be
established and noted that the current situation is an interesting
mess.
- Jens suggested investigating how to increase consistency with the
existing locale dependent format specifiers; e.g., for decimal
point and digit group separator characters.
- Jens added that there may be cases where it would be useful to be
able to specify use of the "C" locale even when a locale is provided
as an argument.
- Jens observed that use of the "C" locale for the chrono %p
specifier would be consistent with use of the "C" locale for floating
point values.
- Jens noted that the example in the proposed resolution does not match
the proposed grammar; the L specifier should precede the
chrono-specs specifier, not follow it.
- Jens stated that adding support for the L specifier is
backward compatible from a standard evolution perspective.
- Tom stated that a change to use the "C" locale in place of the global
locale or a locale passed as an argument can be done as a non-ABI
breaking change.
- Charlie agreed, but noted that some implementation tricks may be
required to avoid potential conflicts with older libraries.
- Zach stated that mixing different library versions is non-conforming
anyway.
- Corentin stated that the "C" locale is used as a proxy for the
absence of a locale and suggested that a constexpr locale might be
desired in the future.
- Corentin asked Charlie if formatters can be modified without
breaking ABI.
- Charlie replied that they are templates, so modifications can result
in ODR violations. Charlie added that inline namespaces can be
helpful in some cases.
- PBrett asked for confirmation that use of a L specifier
where one is not expected will result in a format exception being
thrown.
- Victor confirmed that is the case.
- PBrett asked if the L specifier could be reserved now such
that a format exception will be thrown if used, and then different
behavior specified later.
- Charlie responded that changing behavior to not throw in cases where
an exception was previously thrown is fine so long as mixed library
version problems are avoided.
- Victor expressed agreement with Jens' prior comments.
- Victor stated that behavior must remain consistent between
std::format() overloads that do and do not accept
std::locale arguments; the presence of the
std::locale argument must not, by itself, affect
behavior.
- PBrett suggested that a paper that explores the alternatives may be
required.
- Corentin asserted that it must be possible to evolve the
std::format format string so as to add new behaviors.
- Corentin expressed distaste for the idea of a "no locale" specifier;
that approach would still result in inconsistencies with number
formatting.
- Charlie agreed.
- Jens conceded that challenging standardization work will be required
if behavior changes from C++20 to C++23.
- Jens asserted that the right to add format specifiers when a new
standard is issued must be reserved, even if doing so causes
implementation challenges.
- Poll 1: LWG3547 raises a valid design defect in [time.format] in C++20.
- Attendance: 11
- Consensus: Strong consensus that this issue represents a
design defect.
- Hubert noted that, with regard to issues of consistency, the proposed
resolution is a departure from existing interfaces such as
strftime().
- Poll 2: The proposed LWG3547 resolution as written should be applied to C++23.
- Attendance: 11
- No consensus.
- SA: Mitigation of behavior changes sensitive to string literal
contents is very difficult and there are options available to
deal with this problem in an additive way; this direction
represents an unnecessary backward compatibility break.
- Mark stated that the proposed resolution would have been great 18
months ago.
- PBrett responded that we need to recognize when we make mistakes
and own correcting them.
- Corentin lamented the current state being another case of a bad
default.
- Tom suggested that the current behavior can be presented as
intentional with the goal to maintain consistency with existing
interfaces; new format specifiers can then be added in C++23.
- PBrett suggested that an SG16 issue be filed and a volunteer found
to work on it.
- Victor responded that the behavior isn't sufficiently broken to
make him want to spend time on it.
- [ Editor's note: Despite that lack of desire, Victor and
Corentin quickly authored an initial draft paper that will become
P2372R0
once published. ]
- PBrett volunteered to work on a paper.
- Tom and PBrett thanked Charlie for joining the telecon and encouraged
him to continue attending.
- Tom stated that Victor had expressed interest in working on a potential
std::locale replacement and asked if there were other
volunteers interested in such work.
- Tom stated that the next SG16 telecon will be held May 12th.
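Illustration (not part of the meeting record): the opt-in model debated above for LWG3547, where output is locale independent unless the caller asks otherwise and the mere presence of a locale argument changes nothing, could be sketched as follows. The sketch uses iostreams instead of std::format() so that it builds with pre-C++20 libraries. The names `comma_decimal` and `format_value` are hypothetical, and the `localized` flag merely stands in for the role the L specifier plays in a format string.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a locale whose decimal point is ','.
struct comma_decimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Formats `value` using the "C" locale unless `localized` is true.
// Passing `loc` does not, by itself, change the output; only the explicit
// opt-in does (analogous to requiring an L specifier).
std::string format_value(double value, bool localized,
                         const std::locale& loc = std::locale()) {
    std::ostringstream out;
    out.imbue(localized ? loc : std::locale::classic());
    out << value;
    return out.str();
}
```

With `std::locale loc(std::locale::classic(), new comma_decimal);`, the call `format_value(1.5, false, loc)` ignores the supplied locale, while `format_value(1.5, true, loc)` takes its decimal point from it. This is the same mechanism that makes a sub-second `%T` field locale dependent: the decimal point character comes from the locale.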
May 12th, 2021
Draft agenda:
Attendees:
- Charlie Barto
- Hubert Tong
- Jens Maurer
- Mark Zeren
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2295R3: Support for UTF-8 as a portable source file encoding
- No discussion as the author was not present.
- P2372R1: Fixing locale handling in chrono formatters
- [ Editor's note: D2372R1 was the active paper under discussion at
the telecon. That paper was later published as P2372R1 without
further modification. The agenda and links used here reference
P2372R1 since the links to the draft paper were ephemeral.
]
- PBrett introduced the topic:
- LEWG reached consensus for the direction proposed by
P2372R0
at its
2021-05-03 telecon
with additional refinement to preserve locale dependent
formatting for iostreams.
- Since SG16 polls conducted at its
2021-04-28 telecon
did not agree with this direction, LEWG requested that SG16
review and confirm or rebut the LEWG consensus.
- Victor presented slides lightly updated from his prior LEWG
presentation.
- Victor's presentation slides are available
here.
- Poll 1: Forward D2372R1 to LEWG for inclusion in C++23 and with
the intent that it be applied retroactively to C++20.
- Attendance: 8
- Consensus: Strong consensus in favor.
- [Editor's note: D2372R1 contains the LEWG requested update to
preserve locale dependent formatting for ostreams. ]
- [Editor's note: The chair's perception is that SG16's change in
consensus is attributable to two factors:
- New information that arrived after the initial poll.
- SG16's original poll targeted C++23 while LEWG's poll targets
C++23 and C++20 as a DR; some concerns had been expressed
regarding backward compatibility and migration.
]
- P2093R6: Formatted output
- Victor presented:
- std::print() integrates std::format() with
I/O.
- R6 addresses recent LEWG feedback:
- The proposed std::print() header was changed from
<io> to <print>.
- Additional rationale and clarifications were added regarding:
- Substitution of replacement characters.
- The choice to base behavior on the compile-time literal
encoding.
- ANSI escape sequences do not constitute a native device
API.
- Existing practice in Rust.
- PBrett asked how substitutions would be performed for different kinds
of ill-formed scenarios.
- Zach stated that the Unicode standard documents recommended practice
for substitution of replacement characters.
- [ Editor's note:
Unicode 13
discusses substitution of replacement characters in section
"U+FFFD Substitution of Maximal Subparts" of
chapter 3.9, "Unicode Encoding Forms" and in
chapter 5.22, "U+FFFD Substitution in Conversion". ]
- Zach expressed a preference for implementations to be consistent in
how replacement characters are substituted.
- Hubert stated that an example should be added to the paper.
- Hubert expressed a preference for vprint_unicode() to
substitute replacement characters even when the output device is not
Unicode.
- Victor asked if that could be done as implementation-defined
behavior.
- Hubert responded, no; the goal is for the substitution behavior to be
deterministic for vprint_unicode() regardless of the output
device.
- Victor replied that he would prefer that behavior to be optional.
- Hubert replied that he would like to ensure that ill-formed inputs are
not presented with no indication that something went wrong.
- PBrett stated that, when writing to a Unicode device, a
U+FFFD replacement character should be substituted and the
device should then handle it as its designers intended.
- Victor agreed with the substitution rationale for the device case
since transcoding may be necessary, but disagreed for files due to a
desire to avoid the validation overhead.
- Hubert expressed a preference for the behavior of
vprint_unicode() to be consistent across files and
devices.
- PBrett suggested that what Hubert desires is some kind of noisy
failure, like a trap.
- Hubert agreed and restated the goal as some kind of signal that
encoding issues were encountered.
- Steve stated that C++ programs do not typically interact directly
with a device and that it is difficult to diagnose problems where the
data can't be inspected en route.
- PBrett asked if Steve had a suggestion.
- Steve responded with a preference for a programmatic error handling
facility.
- Zach stated that, in the case where UTF-8 source is copied to a UTF-8
sink, introduction of replacement characters could be surprising, but
when transcoding is required, e.g., when the sink is UTF-16, then
replacement characters are expected.
- Zach suggested decomposing the problem; validate and handle errors
first, then convert.
- Charlie explained that, on Windows, the only ways to write Unicode to
the console are to change the console encoding and write using the
ANSI APIs, or to convert to UTF-16 and write using the wide APIs.
- Charlie noted that, since the console encoding is a global property
of the process, changing it within std::print() would require
synchronization.
- Zach suggested that it is reasonable to get mojibake in the ANSI case
if the console encoding hasn't been correctly set.
- Hubert responded that the global console encoding condition seems to
be particular to Windows and worth addressing.
- Charlie pondered the ramifications of writing to a stream opened in
text mode.
- Victor reiterated his stance on not wanting to pay validation costs
except in cases where transcoding is necessitated.
- Poll 2: When <print> facilities must transcode formatting
results for display on a device and, during that process,
invalidly-encoded text is encountered, std::print() should
replace the erroneously-encoded code units with
U+FFFD REPLACEMENT CHARACTER.
- Attendance: 9
- Consensus is in favor.
- A: Not convinced that silently substituting replacement
characters is always the right policy; an exception could be
appropriate. There are parallels with integer overflow.
- A: Testing is difficult if substitution is device
sensitive.
- Charlie expressed support for a direction that would allow explicitly
inhibiting use of the native device API but noted that, on Windows,
that would mean the console encoding would have to be correctly set
and the application would have to take care of buffering
concerns.
- Poll 3: When <print> facilities need not transcode their
formatting results for display on a device and invalidly-encoded text
is encountered, std::print() should nevertheless replace the
erroneously-encoded code units with U+FFFD REPLACEMENT CHARACTER.
- Attendance: 9
- N: Undecided due to uncertainty; more consideration is
needed.
- A: Would prefer a UB approach that would enable sanitizers to
diagnose these cases and remain conforming.
- SA: There is lack of implementation experience for this
direction, it imposes overhead, and there are terminals that
accept bytes.
- SA: A wide contract with validation does not make sense for
high-performance I/O.
- PBrett stated that there appear to be different audiences for
std::print() and these audiences have different ideas of
what is "obviously" correct:
- For some, std::print() is a simple tool that enables a
better Hello World.
- For others, it is a high-performance I/O facility.
- For yet others, it is a way to format bytes.
- Tom suggested that an error handling facility might move us
towards more consensus.
- PBrett noted that something like JeanHeyd's transcoding facilities
could provide that.
- Charlie agreed that integration of a familiar transcoding facility
could work.
- Tom stated that the next telecon will be May 26th and that the agenda
will again include
P2295R3
and
P2093R6.
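Illustration (not part of the meeting record): the "validate and handle errors first, then convert" decomposition suggested during the P2093R6 discussion could begin with a pass like the one below, which replaces each maximal subpart of an ill-formed UTF-8 sequence with U+FFFD, following the recommended practice in the Unicode Standard. The name `replace_ill_formed` is hypothetical and is not taken from the paper; a real implementation of std::print() may fold this step into its transcoding to the device encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `in` in which every maximal subpart
// of an ill-formed UTF-8 sequence is replaced by U+FFFD (EF BF BD).
std::string replace_ill_formed(const std::string& in) {
    static const char kReplacement[] = "\xEF\xBF\xBD";
    std::string out;
    const std::size_t n = in.size();
    std::size_t pos = 0;
    while (pos < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[pos]);
        std::size_t len = 0;                  // 0 means "not a valid lead byte"
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range for the next byte
        if (b0 <= 0x7F) {
            len = 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;        // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;        // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;        // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;        // reject values above U+10FFFF
        }
        if (len == 0) {                       // invalid lead or stray trail byte
            out += kReplacement;
            ++pos;
            continue;
        }
        // Count how many bytes form a valid prefix of a well-formed sequence.
        std::size_t good = 1;
        while (good < len && pos + good < n) {
            const unsigned char b = static_cast<unsigned char>(in[pos + good]);
            if (b < lo || b > hi) break;
            lo = 0x80;                        // later trail bytes use the
            hi = 0xBF;                        // ordinary 80..BF range
            ++good;
        }
        if (good == len) {
            out.append(in, pos, len);         // complete, well-formed sequence
        } else {
            out += kReplacement;              // one U+FFFD per maximal subpart
        }
        pos += good;
    }
    return out;
}
```

Because well-formed input passes through unchanged, a pass like this makes the substitution result the same regardless of which sink the text is later written to; the cost is the validation overhead that was a concern for the file case.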
May 26th, 2021
Draft agenda:
Attendees:
- Corentin Jabot
- Hubert Tong
- Jens Maurer
- Mark Zeren
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2295R4: Support for UTF-8 as a portable source file encoding
- [ Editor's note: D2295R4 was the active paper under discussion at
the telecon. The agenda and links used here reference
P2295R4 since the links to the draft paper were ephemeral. The
published document may differ from the reviewed draft revision.
]
- PBrett provided an introduction.
- Corentin presented and described the changes from R3 to the
draft R4.
- PBrett observed that the wording updates removed the prior
definition for a UTF-8 file and added a new definition for
a UTF-8 source file.
- Tom recalled prior discussion that suggested there was no need to
provide such a definition at all.
- Jens confirmed and explained that the prior suggestion was to
specify translation phase 1 in terms of a sequence of
characters instead.
- Jens noted that there will be merge conflicts with
P2314.
- Corentin asked if the merge conflicts can be dealt with after CWG
reviews P2314.
- Jens confirmed that they can be.
- PBrett asked if progress can be made before P2314 is adopted into
the working paper.
- Jens confirmed that progress can be made.
- PBrett asked Jens if he would like to see additional wording changes
reviewed in SG16.
- Jens replied that he would and noted that he had not received a
response to all of the suggestions previously provided in his message
to the mailing list available at
https://lists.isocpp.org/sg16/2021/04/2353.php.
- Jens observed that the proposed wording results in existing wording
no longer applying to all source files. For example, "Any source
file character not in the basic source character set is replaced by
the universal-character-name that designates that character"
now appears in a paragraph that doesn't apply to UTF-8 source
files.
- Corentin responded that this paper doesn't make sense without the
changes from P2314.
- Tom asked if the wording could be rebased on P2314 with a noted
dependency on P2314.
- Jens replied that it could be.
- Hubert noted that the definition of a UTF-8 source file is problematic
since the definition could apply to a file that just so happens to
decode as UTF-8, but is not intended as a UTF-8 file.
- PBrett responded that the following sentence specifies that encoding
determination is implementation-defined.
- Hubert acknowledged and suggested it might be helpful to reorder the
sentences.
- Hubert added that wording is still required to reflect intent that a
file be interpreted as UTF-8.
- PBrett agreed by way of an example; an implementation invoked without
such intent may analyze a file, determine that it does not decode
successfully as UTF-8, and then interpret it as, for example,
Windows-1252, and do so without issuing a diagnostic.
- Jens observed that the wording states that, "An implementation shall
support UTF-8 source files", but there is no wording to require
diagnosis of ill-formed UTF-8 source files.
- Corentin responded that there is no such thing as an invalid UTF-8
file; either a file is valid UTF-8 or it is not UTF-8.
- Mark responded that there is a desire to have implementations produce
a diagnostic if source files that are purported to be encoded as
UTF-8 are not, in fact, valid UTF-8.
- PBrett stated that there are three distinct requirements:
- A requirement to support UTF-8 encoded source files.
- A requirement for a means to inform the implementation that all
source files are to be assumed to be UTF-8 encoded.
- A requirement that the implementation diagnose files that were
assumed to be UTF-8 encoded but that contain (some) non-UTF-8
content.
- Hubert offered some suggested wording in chat:
- "An implementation shall provide for processing physical source
files as having a UTF-8 encoding scheme without restriction,
other than resource limits ([implimits]), upon the content of
the physical source file."
- Jens pasted previously suggested wording from the mailing list in
chat:
- "The encoding scheme of a physical source file is determined in
an implementation-defined manner. An implementation shall
support (possibly among others) the UTF-8 encoding scheme."
- "If the encoding scheme of a physical source file is determined
to be UTF-8, the physical source file shall consist of a
well-formed sequence of UTF-8 code units as specified by ISO/IEC
10646."
- Hubert expressed support for that wording but thought some additional
updates would still be required to ensure diagnostics.
- Corentin disagreed with removal of wording that requires that the
scalar value of source file characters be preserved.
- Jens responded that the scalar value preservation wording isn't
required because the mapping to the translation character set already
preserves characters.
- Steve noted the existence of wording that uses the phrase "known to
the implementation" and asked if that could be used to specify how
source file encoding is determined.
- Tom suggested that implementation-defined is preferred since that
reflects a documentation requirement.
- Hubert added that the "known to the implementation" wording is not
intended to reflect that implementations can be wrong.
- PBrett observed that Jens and Hubert would presumably like to see
updated wording.
- Hubert expressed a belief that the required wording has been
identified and that he is onboard with the goal of preserving scalar
value sequences from UTF-8 source files.
- Corentin responded that he will bring back a revised paper with the
suggested wording.
- Steve informed the group that the EWG chair is considering dedicating
a telecon to SG16 papers in the next month or so.
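For illustration only, the well-formedness check that the suggested
wording would require ("a well-formed sequence of UTF-8 code units")
can be sketched as below. This is not part of the paper or of any
compiler; the function name first_ill_formed_utf8 is hypothetical, and
the byte ranges are assumed to follow the Unicode Standard's table of
well-formed UTF-8 byte sequences. An implementation that has been told
to treat a file as UTF-8 could issue a diagnostic at the returned
offset instead of silently falling back to another encoding such as
Windows-1252.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the byte offset of the first ill-formed
// UTF-8 sequence, or std::nullopt if the whole input is well-formed.
std::optional<std::size_t> first_ill_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 <= 0x7F) {  // single-byte (ASCII) sequence
            ++i;
            continue;
        }
        std::size_t len = 0;
        // Allowed range for the second byte; narrowed for some lead bytes
        // to reject overlong forms, surrogates, and values above U+10FFFF.
        unsigned char lo = 0x80, hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }
        else if (b0 >= 0xEE && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }
        else return i;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (n - i < len) return i;  // truncated sequence at end of input
        const auto b1 = static_cast<unsigned char>(bytes[i + 1]);
        if (b1 < lo || b1 > hi) return i;
        for (std::size_t k = 2; k < len; ++k) {
            const auto bk = static_cast<unsigned char>(bytes[i + k]);
            if (bk < 0x80 || bk > 0xBF) return i;
        }
        i += len;
    }
    return std::nullopt;
}
```

Under these assumptions, a file containing the single Windows-1252
byte 0xE9 between ASCII characters is reported as ill-formed at the
offset of that byte, while a leading UTF-8 BOM (0xEF 0xBB 0xBF) is
accepted as well-formed.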
- P2093R6: Formatted output
- PBrett reported a previous conversation with Victor in which Victor
expressed that he felt he has the guidance he needs regarding
handling of substitution characters and locale.
- Victor presented slides:
- The next question to be answered is whether it is ok to base
behavior on the literal encoding.
- Use of the literal encoding avoids race conditions with locale
settings.
- Discussion ensued regarding current dependencies on the choice of
literal encoding and it was observed that, though the wording
provided by
P1868
to specify estimated format field widths is not based on the literal
encoding, at least one implementation is planning to only use the
specified estimated widths when the literal encoding is UTF-8.
- Hubert observed that field width estimation can apply to content
that does not originate from string literals.
- PBrett provided an example; when gettext() is used, a
literal is used for the message catalog lookup, but the result is
not a string literal.
- Hubert acknowledged the provided rationale, but noted that it does
not address concerns raised and that he has seen many cases where
use of locales works fine on UNIX systems.
- Hubert added that this has the potential to bite existing users since
code may appear to work correctly until it suddenly doesn't.
- Victor replied that his goal is to make UTF-8 cases work as expected
and that he is willing to accept some surprises in other
scenarios.
- Victor stressed that the intention is that, on UNIX systems, bytes
are simply passed through.
- Tom directed discussion towards the example code from the
telecon announcement.
- Victor stated that he will request a LWG issue or author a paper to
address handling of locale provided text.
- [ Editor's note: Victor requested an LWG issue that is now
tracked as
LWG issue 3565. ]
- Corentin stated that he is content with undefined behavior for cases
where UTF-8 input is expected, but the input is not actually
UTF-8 encoded.
- Hubert responded that the format locale situation is rather urgent
for EBCDIC environments.
- PBrett stated that he is ok with the proposal because it won't make
anything more broken than it already is.
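As a hedged illustration of what basing behavior on the literal
encoding can look like in code, the sketch below probes the ordinary
literal encoding at compile time. It is not taken from P2093 and is
not claimed to be how any particular implementation does it; the
function name literal_encoding_is_utf8 is hypothetical. It assumes the
ordinary literal encoding can represent U+00A7 SECTION SIGN, which is
encoded in UTF-8 as the two bytes 0xC2 0xA7.

```cpp
// Hypothetical compile-time probe: true if the ordinary string literal
// "\u00A7" is stored as the UTF-8 bytes 0xC2 0xA7 plus the terminator.
constexpr bool literal_encoding_is_utf8() {
    constexpr char probe[] = "\u00A7";
    return sizeof(probe) == 3 &&
           static_cast<unsigned char>(probe[0]) == 0xC2 &&
           static_cast<unsigned char>(probe[1]) == 0xA7;
}

// The result is fixed when the program is translated, so it does not
// depend on locale settings that may change while the program runs.
constexpr bool use_unicode_behavior = literal_encoding_is_utf8();
```

Because the answer is a constant, it avoids the race conditions with
locale settings mentioned above, but it says nothing about the
encoding of text that arrives at run time, for example from a message
catalog.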
- Tom stated that the next telecon will be held on June 9th.