SG16: Unicode meeting summaries 2023-05-24 through 2023-09-27
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings. This paper contains a
snapshot of select meeting summaries from that repository.
Previously published SG16 meeting summary papers:
May 24th, 2023
Draft agenda:
Attendees:
- Alisdair Meredith
- Charlie Barto
- Corentin Jabot
- Eddie Nolan
- Fraser Gordon
- Giuseppe D'Angelo
- Jens Maurer
- Mark de Wever
- Mark Zeren
- Peter Bindels
- Peter Brett
- Robin Leroy
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2779R0: Make basic_string_view’s range construction conditionally explicit:
- Giuseppe presented an overview of the paper including relevant
history:
- P1989R2 (Range constructor for std::string_view 2: Constrain Harder)
added an implicit std::string_view constructor that
enables implicit conversion from any type that satisfies a set of
constraints, one of which includes having a member type alias
named traits_type that matches the
std::string_view member of the same name.
- P2499R0 (string_view range constructor should be explicit)
changed the new constructor to be declared explicit due
to concerns involving ranges that do or do not contain an
embedded null character; this broke the ability for string types
to implicitly convert to std::string_view.
- LWG 3857
removed the constraint requiring a matching traits_type
member type alias based on the rationale that such a safety
precaution is no longer necessary since conversions are now
explicit.
- The proposed paper seeks to conditionally restore implicit
conversions for string-like types without requiring modifications
to those types to add conversion operators.
- Two options are proposed:
- Option 1 adds an opt-in trait and makes the constructor
conditionally explicit based on the presence of a matching
member traits_type type alias.
- Option 2 makes the constructor conditionally explicit based on
the presence of a matching member traits_type type
alias without requiring an opt-in trait.
- Qt has provided a QStringView class with an
implicit constructor that accepts a range
that has worked well in practice for a decade.
- PBrett asked what the essential nature of a string-like type is.
- Giuseppe responded that it is a contiguous sequence of characters
and associated character classification traits.
- PBrett argued for substitution of "code units" for "characters".
- Zach noted that the traits_type name might be used by types
that are not string-like types, stated that he does not typically add
a traits_type to his own string-like types, and asked what is
commonly done in practice.
- Giuseppe responded that the paper lists the results of a survey of
various projects for occurrences of the traits_type name and
found that it is strongly correlated with string-like types but that
there are string-like types that don't have such a member.
- Giuseppe acknowledged that the traits_type name is quite
generic.
- Victor expressed opposition to option 2 since it relies on what he
considers to be a legacy feature and since traits_type is, in
practice, always std::char_traits.
- Victor asserted that implicit conversions and implicit interoperation
with the standard library are not desired for Folly's
fbstring.
- Victor stated that he is ok-ish with option 1.
- Tom asked Victor to further explain his concerns and the damage he
fears the implicit conversions would cause.
- Victor replied that use of fbstring is no longer encouraged
and the proposed change would facilitate continued usage.
- Victor noted that the proposed changes could also impact overload
resolution in generic code and potentially introduce overload
resolution failures due to ambiguity.
- Corentin lamented the ability for programmers to specialize
std::char_traits for their own user-defined types and stated
he plans to propose deprecating or removing that allowance.
- Corentin explained that the interface that std::char_traits
provides is not a good match for how text processing works in
practice.
- Corentin asserted that increased use of std::char_traits
should be discouraged.
- Corentin opined that option 1 is fine but that option 2 is
problematic in the long run.
- Giuseppe acknowledged Corentin's position.
- Corentin clarified that programmers should not be encouraged to use
a different type than std::char_traits but rather that they
should be encouraged not to use a char-traits-like type at all.
- Tom summarized his understanding of the concerns; the proposed change
could encourage programmers to add a traits_type member type
alias of std::char_traits to classes that otherwise wouldn't
define the type alias solely to enable implicit conversions to
std::string_view.
- Zach argued for not enabling such implicit conversions at all on the
basis that std::string_view is intended to be implicitly
convertible from other standard library types and that explicit
conversions are appropriate elsewhere.
- Alisdair opined that the right approach would be for types to opt
themselves in to an implicit conversion.
- Alisdair asserted that std::char_traits is not legacy and
that it cannot be removed without significant ABI impact.
- Alisdair stated that the matching traits_type constraint is
a good heuristic and that the opt-in trait in option 1 is so specific
that he would have a hard time supporting it.
- Jens noted that the proposed wording for option 1 requires both the
opt-in string-like-type trait and the matching traits_type
constraint to enable implicit conversions.
- Jens expressed a preference for an option that proposed only the
string-like-type trait.
- Jens stated that the wording needs to be rebased on the current
working paper since the struck wording has already been removed.
- Jens suggested is_string_view_like might not be the best
choice of name for the opt-in trait and suggested enable_view
as an example name for similar opt-in traits.
- Giuseppe acknowledged the suggestion and stated that the name can be
changed.
- Jens noted that it doesn't matter how string-view-like the source type
is as long as it provides contiguous storage and opts itself in.
- Jens agreed with not wanting to encourage the addition of an otherwise
unused traits_type member.
- Jens observed that is_string_view_like is false by
default.
- Jens suggested that, if it is desirable to provide a safety check on a
matching traits_type member, the
is_string_view_like trait can support a mechanism to enable
that.
- Jens expressed a preference for postponing a poll to forward the paper
until it has been rebased on the current working paper.
- Various poll options were discussed but it was decided that polling be
postponed pending an updated paper revision with wording rebased on
the current working paper and an additional option to enable implicit
conversions based solely on the opt-in trait.
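The design space discussed for P2779R0 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's wording: sketch::text_view, sketch::enable_implicit_view, sketch::has_matching_traits, sketch::is_char_source, and the two source types are all hypothetical names, and the C++17 two-overload idiom stands in for a conditionally explicit constructor. The constructor is implicit only when a type both opts in and has a matching traits_type, which is how Jens read option 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical opt-in trait: false unless specialized for a type.
template <class T>
inline constexpr bool enable_implicit_view = false;

// True when T has a member traits_type naming std::char_traits<char>.
template <class T, class = void>
inline constexpr bool has_matching_traits = false;
template <class T>
inline constexpr bool has_matching_traits<T, std::void_t<typename T::traits_type>> =
    std::is_same_v<typename T::traits_type, std::char_traits<char>>;

// True when T exposes contiguous char storage through data() and size().
template <class T, class = void>
inline constexpr bool is_char_source = false;
template <class T>
inline constexpr bool is_char_source<
    T, std::void_t<decltype(std::declval<const T&>().data()),
                   decltype(std::declval<const T&>().size())>> =
    std::is_convertible_v<decltype(std::declval<const T&>().data()), const char*>;

template <class T>
inline constexpr bool converts_implicitly =
    is_char_source<T> && enable_implicit_view<T> && has_matching_traits<T>;

class text_view {
public:
    // Implicit conversion: the source type opted in and its traits match.
    template <class R, std::enable_if_t<converts_implicitly<R>, int> = 0>
    text_view(const R& r) : sv_(r.data(), r.size()) {}

    // Every other char source still converts, but only explicitly.
    template <class R,
              std::enable_if_t<is_char_source<R> && !converts_implicitly<R>, int> = 0>
    explicit text_view(const R& r) : sv_(r.data(), r.size()) {}

    std::string_view get() const { return sv_; }

private:
    std::string_view sv_;
};

// A string-like type whose author opted in.
struct opted_in_string {
    using traits_type = std::char_traits<char>;
    std::string storage;
    const char* data() const { return storage.data(); }
    std::size_t size() const { return storage.size(); }
};
template <>
inline constexpr bool enable_implicit_view<opted_in_string> = true;

// A contiguous buffer that did not opt in.
struct plain_buffer {
    std::string storage;
    const char* data() const { return storage.data(); }
    std::size_t size() const { return storage.size(); }
};

}  // namespace sketch

static_assert(std::is_convertible_v<const sketch::opted_in_string&, sketch::text_view>);
static_assert(!std::is_convertible_v<const sketch::plain_buffer&, sketch::text_view>);
static_assert(std::is_constructible_v<sketch::text_view, const sketch::plain_buffer&>);
```

Dropping the has_matching_traits term from converts_implicitly gives the variant requested for the next revision, in which the opt-in trait alone enables the implicit conversion.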
- P2863R0: Review Annex D for C++26:
- Alisdair introduced this and the following papers.
- Tom explained his understanding of the ramifications for removal of
standard library features; that an implementor may choose not to
provide the removed features or may choose to provide them since the
removed names are reserved as "zombie" names.
- Alisdair acknowledged the intent, but noted that the standard
currently lacks wording to support zombification of explicit template
specializations.
- Alisdair explained that there are four deprecated subclauses that are
relevant to SG16;
D.26 ([depr.locale.stdcvt]),
D.27 ([depr.conversions]),
D.28 ([depr.locale.category]),
and
D.29 ([depr.fs.path.factory]).
- PBindels stated that
D.15 ([depr.str.strstreams])
and
D.25 ([depr.string.capacity])
have to do with text facilities but that he reviewed them and
concluded that the functionality is not strongly relevant for
SG16.
- Alisdair stated that, for std::filesystem::u8path, per
LWG 3840,
there have been recent comments that removal would be
problematic.
- Tom stated that the LWG issue was recently discussed in LEWG but that
the LWG issue does not appear to have been updated to reflect that
discussion.
- [ Editor's note: LEWG discussed the LWG issue during its
2023-01-10 telecon.
]
- Alisdair stated that deprecated features should either be undeprecated
or removed and noted that this feature has been deprecated since
C++20.
- Jens expressed concern regarding Billy O'Neal's comment in the LWG
issue that deprecation of u8path was one of the reasons that
vcpkg discontinued use of std::filesystem.
- Jens stated that SG16 should offer an opinion.
- Corentin replied that there was a poll in LEWG in January and that
there was no consensus to undeprecate u8path.
- Corentin stated that a mechanism to access a sequence of char
that holds UTF-8 code units as-if it were a sequence of
char8_t is a feature that we should have; we're missing a way
to pass such a sequence to the std::filesystem::path()
constructor such that it is interpreted as UTF-8.
- Tom noted that Corentin has a paper on that topic.
- [ Editor's note: See
P2626 (charN_t incremental adoption: Casting pointers of UTF character types).
]
- Alisdair noted that, if removed, u8path would be added to the
list of zombie names, so implementors that wish to continue providing
it may do so.
- PBindels opined that u8path provides a solution to work
around legacy issues but that Corentin's P2626 provides a proper
solution.
- PBindels suggested that we should neither undeprecate nor remove
u8path until a proper solution is in place.
- Alisdair stated that he can update the paper to reflect that guidance
and to note further action as dependent on P2626.
- Charlie agreed with not removing u8path without a proper
alternative.
- Charlie noted that, if u8path is zombified, implementors
can continue to provide it, but portability is lost.
- Charlie stated that he didn't see a reason to remove u8path;
that it isn't harmful.
- Alisdair acknowledged that a migration path is needed.
- Tom explained that the original motivation for deprecation was to
dissuade continuing to provide standard library functions that require
UTF-8 data in char-based storage.
- Tom noted that u8path and the deprecated
std::codecvt facets were the only standard library features
that did so.
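As a rough illustration of the gap Corentin described: UTF-8 code units held in char storage currently reach std::filesystem::path as UTF-8 either through u8path or by first copying them into char8_t storage. The helper name path_from_utf8_bytes is hypothetical, and the feature-test branch is only there so the sketch builds both with and without char8_t support.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: build a path from UTF-8 code units stored as char.
inline std::filesystem::path path_from_utf8_bytes(const std::string& utf8) {
#if defined(__cpp_lib_char8_t)
    // With char8_t available: copy into char8_t storage so that the path
    // constructor interprets the code units as UTF-8.
    std::u8string copy(utf8.begin(), utf8.end());
    return std::filesystem::path(copy);
#else
    // Without char8_t (C++17): u8path is the route the library offers.
    return std::filesystem::u8path(utf8);
#endif
}
```

The copy in the first branch is the price of not having a way to view char storage as char8_t.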
- P2871R0: Remove Deprecated Unicode Conversion Facets From C++26:
- Alisdair presented the paper:
- These facets were deprecated because they did not provide error
handling capabilities and could not reasonably be extended.
- There are some implementations that do not issue deprecation
warnings.
- Corentin noted the work in progress and general plan to provide
replacements for C++26 and suggested waiting to remove them pending
that work.
- Jens agreed and stated that removal without replacements is
ill-advised unless these are actively causing harm.
- Tom noted that conversions are possible through the mbrtoc*
and c*rtomb family of functions though those have their own
issues.
- Victor stated that the codecvt facets are so challenging to
use that not having a replacement isn't really a problem.
- Alisdair noted that implementors can continue to provide them thanks
to zombification.
- Alisdair reported that, per the paper, LEWG and SG16 previously
recommended removal during the C++23 cycle, but that action wasn't
completed.
- Alisdair reminded the group that codecvt_utf8 and
codecvt_utf16 convert to and from UCS-2 or UTF-32 depending
on the size of the first template parameter.
- PBrett asked for any objections to removal.
- No objections were reported.
- Alisdair stated he will take that feedback back to LEWG.
- P2873R0: Remove Deprecated Locale Category Facets For Unicode from C++26:
- Tom explained that these facets were deprecated because they convert
to and from UTF-8 in char-based storage rather than to and from
the multibyte encoding as the non-deprecated facets do.
- Tom reported that char8_t-based facets were added as
replacements, but those were a mistake because they won't be used by
char-based streams anyway.
- [ Editor's note:
LWG 3767
tracks deprecating the char8_t-based facets. ]
- PBrett asked for any objections to removal.
- No objections were reported.
- Corentin spoke in favor of removal.
- P2872R0: Remove wstring_convert From C++26:
- Giuseppe asked if the paper includes removal of
std::wbuffer_convert.
- Alisdair confirmed that it does.
- Alisdair explained that these were deprecated because the example for
std::wstring_convert used another deprecated feature,
std::codecvt_utf8, and, due to other underspecification
concerns, no one was motivated to fix them.
- Alisdair asked if SG16 is the right group to address this.
- PBrett responded affirmatively and stated that SG16 is the group that
misunderstands wchar_t the least.
- Alisdair noticed some issues with the paper and concluded that updates
are required before the paper is ready for any action to be taken on
it.
- Tom stated that the next meeting is tentatively scheduled for 2023-06-07
and will likely continue review of
P2779 (Make basic_string_view’s range construction conditionally explicit)
and
P2872 (Remove wstring_convert From C++26)
if updated revisions are available followed by an initial review of
P2845 (Formatting of std::filesystem::path).
- Zach reported that he expects to have a new revision of
P2728 (Unicode in the Library, Part 1: UTF Transcoding)
available soon after the Varna meeting.
June 7th, 2023
Draft agenda:
Attendees:
- Alisdair Meredith
- Charlie Barto
- Corentin Jabot
- Fraser Gordon
- Giuseppe D'Angelo
- Jens Maurer
- Mark de Wever
- Mark Zeren
- Peter Brett
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2779R1: Make basic_string_view’s range construction conditionally explicit.
- [ Editor's note: D2779R1 was the active paper under discussion at
the telecon.
The agenda and links used here reference P2779R1 since the links to
the draft paper were ephemeral.
The published document may differ from the reviewed draft revision.
]
- Giuseppe summarized the paper and changes since the last revision:
- The paper endeavors to identify a compromise position for the
issues that have resulted in multiple changes to how the
std::basic_string_view range constructor is
specified.
- Option 2 from the previous revision is still present though there
was not much support for this option in the last discussion.
- Option 1 follows existing precedent for type traits that enable
some functionality; this option has been divided into two
sub-options.
- Option 1-A provides a type trait that enables conversion without
regard to the traits_type member.
- Option 1-B provides the type trait from option 1-A as well as an
additional type trait that can be used to enable conversion that
is sensitive to the traits_type member.
- Tom asked if the intent is for the trait to be used only for
conversion to std::string_view or for conversion to any
string_view-like type.
- Giuseppe responded that it is intended to be used for conversion to
any string_view-like type.
- Jens suggested in chat: "You can also define
enable_string_view_conversion in a way so that the user specialization
can compare char_traits, if so desired (or not)."
- Jens' suggestion received several positive responses.
- Alisdair, following up on Jens' suggestion in chat, asked if the
traits in option 1-B could be merged.
- Giuseppe confirmed that they could be.
- Alisdair indicated that would be his preference.
- Alisdair stated that the conversion could be enabled based on a class
member similar to how transparent key comparison for associative
containers is enabled via the is_transparent member of the
compare class.
- Giuseppe acknowledged that approach would work as well.
- Tom noted that approach would require modifying the class.
- Alisdair responded that the trait could still be specialized but could
be defaulted based on the presence of a member.
- Jens stated that the most convenient option would be to define a
conversion operator with the trait available as a fallback.
- Jens expressed a preference for a single trait with template
parameters such that a specialization can be written to explicitly
match traits_type or std::char_traits as
desired.
- Jens noted that enable_string_view_conversion_with_traits
still requires comparison with std::char_traits or a
traits_type member.
- Jens suggested that third party string_view-like classes can provide
their own trait to enable implicit conversions.
- Giuseppe responded that the goal is to enable interconvertibility
between different string types.
- Giuseppe noted that the proposal doesn't require comparisons with
specific type or member names.
- Zach stated that he doesn't find the problem that the paper intends
to address compelling and noted that std::string_view is
available as a vocabulary type.
- Zach noted that working around the lack of an implicit conversion
just requires slightly more code; explicit construction of a
std::string_view object.
- Victor requested that the two traits in option 1-B be merged.
- Victor agreed with Alisdair's suggestion to default the trait to
enable based on the presence of a class member.
- Victor asserted that only the author of a class should opt a class
into the proposed behavior; not users of the class.
- Victor repeated his opposition to enabling implicit third party
interoperation.
- Corentin stated that most of the proposed behavior should be being
discussed in LEWG rather than in SG16 and that SG16 just needs to
provide a recommendation whether use of std::char_traits
is a good heuristic.
- PBrett responded that there is an SG16 question concerning which
types are sufficiently text-like.
- PBrett asked for poll suggestions.
- Tom noted that discussion revealed other options that should be
explored.
- Tom suggested polling the desire to enable interconvertibility
across any/all string-like types in the ecosystem.
- Poll wordsmithing ensued.
- Poll 1.1: Any opt-in to implicit range construction of
std::string_view should be explicit on a per-type basis.
- Attendees: 12 (1 abstention)
- Strong consensus.
- A: If types have character traits, we should be making use of
them to determine compatibility.
- Jens responded to the against rationale by stating that use of
character traits is not excluded; per-type enablement could be
conditional on matching traits.
- Poll 1.2: The standard library should provide a general-purpose
facility for enablement of implicit interconvertibility between
string and string_view-like types (including UDTs).
- Attendance: 12 (2 abstentions)
- No consensus.
- Poll 1.3: A solution to the problem stated in P2779 needs to be
included in the C++ standard library.
- Attendance: 12 (1 abstention)
- No consensus.
- Tom stated that he will record the poll results in the paper tracker
and that it will be up to the LEWG chair to decide what to do
next.
- PBrett suggested that more examples of how this proposal could
alleviate programming challenges might help to increase
motivation.
- Tom agreed and noted that the large proportion of N votes presumably
reflects insufficient motivation.
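Alisdair's suggestion, a trait that defaults to the presence of a class member but can still be specialized, might look roughly like the following. This is only a sketch: the trait name is borrowed from Jens' chat message, while the marker member enable_string_view_conversion_tag and the three example classes are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

namespace sketch {

// Primary template: conversion is not enabled.
template <class T, class = void>
struct enable_string_view_conversion : std::false_type {};

// Default opt-in: the class declares a (hypothetical) marker member type,
// in the spirit of is_transparent on comparison objects.
template <class T>
struct enable_string_view_conversion<
    T, std::void_t<typename T::enable_string_view_conversion_tag>>
    : std::true_type {};

// Opts in from the inside by declaring the marker.
struct own_string {
    using enable_string_view_conversion_tag = void;
    std::string storage;
};

// A class that cannot be modified, opted in by specializing the trait.
struct third_party_string {
    std::string storage;
};
template <>
struct enable_string_view_conversion<third_party_string> : std::true_type {};

// Never opted in.
struct unrelated {
    std::string storage;
};

}  // namespace sketch

static_assert(sketch::enable_string_view_conversion<sketch::own_string>::value);
static_assert(sketch::enable_string_view_conversion<sketch::third_party_string>::value);
static_assert(!sketch::enable_string_view_conversion<sketch::unrelated>::value);
```

The second route, specialization from outside the class, is what keeps the trait usable for unmodifiable types; whether anyone other than the class author should be allowed to use it is the objection Victor raised.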
- P2872R1: Remove wstring_convert From C++26.
- [ Editor's note: D2872R1 was the active paper under discussion at
the telecon.
The agenda and links used here reference P2872R1 since the links to
the draft paper were ephemeral.
The published document may differ from the reviewed draft revision.
]
- Alisdair stated that, if feedback is light, he will incorporate
it and publish the paper as P2872R1; otherwise, he will publish
P2872R1 as-is and incorporate the feedback in a newer revision.
- Alisdair explained that wbuffer_convert and
wstring_convert have been deprecated for three standard
releases now.
- Alisdair noted that removal permits implementors to continue to
provide the functionality thanks to the additions to zombie
names.
- Alisdair indicated that wording updates might be needed, but that LWG
will handle that.
- Alisdair explained that the deprecation was motivated by
underspecification and dependence on other deprecated features like
std::codecvt_utf8.
- Alisdair reported that there are currently four related open LWG
issues and that reviving the feature would require more.
- Corentin stated that, without std::codecvt_utf8, the
standard no longer provides features needed to use these types.
- Alisdair agreed and explained that programmers would have to provide
their own std::codecvt facet.
- Corentin acknowledged the requirement, but observed that programmers
could more easily just implement the needed conversion.
- Victor opined that these types provide little value since they are
just light wrappers anyway.
- Victor reported that a search of the projects he works on found a few
uses, but that those uses should be replaced anyway.
- PBrett asked if anyone had an objection to removing these
features.
- No objections were raised.
- MarkZ reported that a GitHub search identified few uses.
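To give a sense of scale for Corentin's remark that programmers could simply implement the needed conversion, here is a minimal sketch of a UTF-8 to UTF-32 decoder with explicit error handling. The function name utf8_to_utf32 and its error policy are assumptions made for illustration: each rejected byte yields one U+FFFD, which is simpler than the substitution practice the Unicode Standard recommends.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decodes UTF-8 held in char storage into UTF-32. Ill-formed input is not
// fatal: every rejected byte is replaced by U+FFFD and decoding resumes at
// the next byte.
inline std::u32string utf8_to_utf32(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // total bytes in the sequence, 0 if b0 cannot lead
        char32_t cp = 0;      // code point accumulated so far
        char32_t min = 0;     // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        // The sequence must fit in the input and use continuation bytes only.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(U'\uFFFD');
            i += 1;
        }
    }
    return out;
}
```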
- P2845R0: Formatting of std::filesystem::path.
- Victor introduced the paper:
- P1636 (Formatters for library types)
previously proposed formatting for std::filesystem::path
but was specified to use the native() member function
which might require transcoding and had no provisions for
handling of non-printable characters.
- This paper proposes a formatter that performs proper transcoding
and substitutes escape sequences for non-printable characters and
ill-formed code units.
- Victor noticed a missing double quote character in the first source
code example in section 2, "Problems".
- Victor reported that some minor issues have been fixed in a draft R1
revision.
- Corentin asked if backslash path delimiters on Windows would be
formatted with escape sequences.
- Victor confirmed that they would be, that such substitution might be
surprising, but is consistent with std::quoted().
- Victor noted that an additional format specifier could be provided
to choose an alternate behavior.
- Corentin asked about use of the debug specifier, "{:?}".
- Victor replied that the escaped format is proposed as the default
behavior.
- Charlie asserted that some latitude is needed to choose an alternate
escape character since backslash in paths has an important meaning
on Windows.
- Charlie noted that an alternate escape character could be surprising
and would create an inconsistency across platforms.
- PBrett asked about adding a specifier that enables specifying a
different escape character.
- Victor responded that such a specifier would be cumbersome and that
there are other options such as performing a transformation.
- Victor stated that there are use cases for both an escaped and a
non-escaped variant.
- Tom presented a few use cases including formatting for generic text,
byte preserved for filesystem access, punycode for URLs, and quoted
for shell scripts.
- Tom suggested that most transformations should be done outside of
formatting.
- Corentin stated that the default behavior should just escape
ill-formed code units and that the debug format specifier could be
used to escape problematic characters.
- Victor replied that quoting is useful but not always needed.
- Tom suggested that a specifier could be added to opt in to
quoting.
- PBrett expressed two high level use cases:
- The need to format the path precisely such that it can be used
to open a file.
- The need to format the path for textual display in a format
friendly to humans.
- PBrett opined that the paper does not clearly define the problem it
intends to solve.
- PBrett noted that, in
GLib,
functions are provided to request a file name suitable for display
as valid UTF-8 or as a byte array.
- Victor replied that the goal of the paper is to address the issues
discovered from prior review of
P1636 (Formatters for library types).
- Victor stated that additional use cases can be addressed as
needed.
- Zach reported that Python provides the functionality this paper is
proposing and noted that its formatters will double Windows path
separators.
- Zach stated that Python allows printing unformatted paths by treating
paths as a string and that C++ can do so as well.
- Zach agreed that some kind of escaping and quoting is needed.
- [ Editor's note: Corentin later
posted a message to the SG16 mailing list
that demonstrates Python's behavior with a
Compiler Explorer link.
]
- Jens asserted that, due to various quirks with
std::filesystem::path, the paper should cover the
motivation and design space and not solely focus on addressing the
issues found from review of P1636.
- Jens stated that the paper should discuss, for example, the
implication of using backslashes in the syntax of character escapes
in formatted paths.
- PBrett agreed.
- PBrett noted that we were out of time and that additional review will
be needed to discuss encoding issues.
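The std::quoted() behavior that Victor cited as precedent is easy to observe on a plain string. The sketch below uses a std::string rather than a std::filesystem::path so that the result does not depend on the platform's native path format; quoted_form is a hypothetical helper name.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Returns the text as std::quoted writes it with its default delimiter (")
// and escape character (\): wrapped in quotes, with every embedded quote or
// backslash preceded by a backslash.
inline std::string quoted_form(const std::string& text) {
    std::ostringstream os;
    os << std::quoted(text);
    return os.str();
}
```

For the input C:\temp\a.txt this produces "C:\\temp\\a.txt", quotes included, which is the doubling of Windows path separators that Corentin asked about.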
- Tom stated that the next meeting is scheduled for 2023-06-28, that there
are several LWG issues awaiting review, and that Zach is working on a
revision of
P2728 (Unicode in the Library, Part 1: UTF Transcoding).
- [ Editor's note: The following meeting was canceled due to summer
vacations. ]
- Zach stated an expectation to have a new revision available in the next
two weeks.
July 12th, 2023
Draft agenda:
Attendees:
- Charlie Barto
- Fraser Gordon
- Hubert Tong
- Jens Maurer
- Mark de Wever
- Nathan Owen
- Niall Douglas
- Peter Brett
- Robin Leroy
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P1030R6: std::filesystem::path_view:
- Niall stated that, during LEWG discussion in Varna, LEWG approved
removal of std::locale function overloads that were added
for compatibility with std::filesystem::path.
- Niall noted that, for each overload set that has an overload with a
std::locale parameter, there is an overload that does
not.
- PBrett asked for an explanation of the concerns with the overloads
that work with std::locale.
- Niall responded that locale support generally delegates conversions to
the OS, where they are handled efficiently, but conversions performed
via std::locale impose considerable performance overhead;
possibly including multiple conversions on some platforms.
- [ Editor's note: conversions controlled by std::locale
require use of the std::codecvt facet which, per
[fs.path.construct]p6,
may require multiple conversions. ]
- Niall stated that a replacement for std::locale would be
welcome.
- PBrett opined that, in his experience, treating paths as having an
encoding leads to sadness.
- PBrett stated that a lossy conversion to a definitive encoding can
be used to display paths.
- Niall noted that the proposed path_view supports a raw byte
encoding and provides rendering operations.
- PBrett asked if the facility provides features to produce a path
suitable for display purposes.
- Niall replied that such formatting falls more in the domain of
P2845 (Formatting of std::filesystem::path)
and that he has been in discussion with Victor.
- PBrett asked if there is a plan to provide a formatter for
path_view.
- Niall suggested that such a formatter behave the same as for
std::filesystem::path.
- Victor summarized observations made during the LEWG discussion:
- std::locale was present in constexpr overloads;
that issue is easily solved by removing the constexpr
specifier from those declarations.
- the std::locale parameter is only present to support
encoding conversions, but those conversions are better handled by
an interface designed for such conversions.
- Victor noted that std::codecvt is not an efficient method
for transcoding.
- Victor opined that the overloads with a std::locale
parameter are not known to be needed and can be added back later,
perhaps in a more restrictive form, if desired.
- Niall asked Victor if he is suggesting that the existing
std::filesystem::path overloads with a std::locale
parameter should be deprecated.
- Victor replied that he would be happy to write such a paper at some
future point.
- Tom asked why there is a compare() overload with a
std::locale parameter.
- Niall responded that comparisons are shallow by default and
compare() is provided to allow for more comprehensive
equivalence comparisons.
- Niall explained that the std::locale parameter is used to
convert each path to a common form that is then compared.
- PBrett expressed an assumption that the std::locale
parameter would be used for collation purposes using the
std::collate facet.
- Hubert asked why collation would be relevant for equality.
- PBrett asked whether, given a set of path_view objects,
the compare() operation could be used to order them.
- Zach responded that such collation might be better performed using
features outside of the std::filesystem library.
- Jens stated that the wording in the paper is suggestive that only
the encoding is intended to be consumed from the locale object.
- Jens observed that removal of the std::locale parameter
results in a loss of transcoding facilities, but since what was
provided was so thin, it isn't much of a loss.
- Victor stated that the equivalent facility in path_view of
the std::locale based std::filesystem::path
construction is the locale dependent render() member
function.
- Niall explained that the reference implementation of the locale
dependent render() member uses the std::locale
object to convert a path to UTF-8 and then compares it.
- Tom expressed confusion, stated that std::locale doesn't
support conversion to UTF-8, and then realized the reference
implementation is probably using the char8_t codecvt
facets that don't actually convert to or from the locale encoding.
- Niall responded that he is not aware of anyone that uses
std::locale with the filesystem.
- Victor pondered interaction with std::format and
std::print and whether it would make sense for
path_view to also rely on the literal encoding to detect
UTF-8 encoding; that would enable construction with
char-based data to be saved as char8_t.
- Tom expressed some reservations; programmers might compile with a
/utf-8 or equivalent option, but file names produced or
provided at run-time might be differently encoded.
- Hubert expressed concerns regarding implementation experience
obtained so far regarding preservation of the literal encoding for
use by the standard library.
- Poll 1: Modify P1030R6 "std::filesystem::path_view" to restore
function overloads with locale parameters.
- Attendees: 12 (4 abstentions)
- Consensus against.
- P2845R0: Formatting of std::filesystem::path:
- Tom apologized for his delinquency in producing a meeting summary for
the previous discussion on this paper that took place at the prior
SG16 meeting.
- Victor summarized his understanding of the direction from the prior
meeting; to explore more options for quoting and escaping.
- PBrett explained a desired ability to obtain a close approximation of
a path validly encoded for display purposes and stated that the paper
does not currently provide sufficient detail.
- Victor asked for confirmation that Peter wants the path formatted
without any transformation, no loss of information, no quoting, and
perhaps just escaping for invalid code unit sequences.
- PBrett explained that he wants three versions:
- one that provides the raw bytes; path_view provides that,
but std::filesystem::path does not.
- one that understands encoding and provides the path unmodified
with the exception of substitution characters for invalid code
unit sequences.
- one with quotes and escape sequences for problematic
characters.
- Niall stated that, for both std::filesystem::path and
path_view, it is possible to obtain the path as a string or
to visit the components with a lambda.
- Jens asked for confirmation that std::format includes a
debug specifier that enables a string to be printed with escape
sequences for problematic characters.
- Victor confirmed that is the case and stated that it could be used
for paths such that the default formatting provides the second option
PBrett listed.
- Jens asked what the output would be for the Belarusian example in the
paper for arbitrary code pages used in practice.
- Victor replied that, in either case, the same substitutions would be
performed.
- Jens expressed approval and noted that behavior would be consistent
with choices previously made.
- Mark observed that the options discussed so far, with an exception
for the debug specifier, would retain newline characters.
- PBrett acknowledged the behavior and noted that additional
translations can be applied on the formatted result as needed;
e.g., to substitute a space for the newline character.
- Niall expressed frustration regarding rendering paths in quotes since
quote characters are also valid path characters.
- Tom acknowledged feeling similarly frustrated by that.
- PBrett stated that quotes would only be present when the debug
specifier is used.
- Niall pondered whether an additional format specifier to format the
path with escape sequences but without quotes is warranted.
- Tom responded that additional such options could be recognized by the
formatter specialization.
- Zach asked how control characters like RTL isolates should be handled;
whether they should be ignored when formatting for display but
preserved by the debug format.
- PBrett replied that he doesn't have experience with those in path
names but that he would expect them to be handled as a custom
translation.
- Zach suggested such characters should probably be passed through when
formatting for display.
- PBrett asked if the paper should be updated to address the
path_view proposal.
- Victor replied that path_view should be handled separately
since there are additional complications for the byte case.
- Tom stated that the consensus direction seems pretty clear for a
paper revision.
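Illustration (not part of the meeting record): the difference between the unquoted display form and the quoted, escaped form discussed above can be sketched with a small hand-written helper. The function name `quote_and_escape` is hypothetical, and the helper only handles tab, newline, carriage return, double quote, and backslash; it is a simplified stand-in for the debug format, not the wording of [format.string.escaped] and not the behavior proposed in P2845.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: wraps the text in double quotes and escapes a small
// set of problematic characters. Everything else is passed through as-is.
std::string quote_and_escape(std::string_view text) {
    std::string out = "\"";
    for (char c : text) {
        switch (c) {
            case '\t': out += "\\t";  break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '"':  out += "\\\""; break;  // a quote inside the path must be escaped
            case '\\': out += "\\\\"; break;
            default:   out += c;      break;
        }
    }
    out += '"';
    return out;
}
```

Under this sketch, a path that contains a newline keeps the raw newline in the display form but shows it as a two-character escape in the quoted form, and a quote character that is part of the path name is distinguishable from the surrounding quotes because it is escaped.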
- LWG 3944: Formatters converting sequences of char to sequences of wchar_t:
- Mark summarized the issue:
- In C++20, it was an intentional design decision to not support
formatting of char-based string arguments when
formatting for wchar_t.
- In C++23, such formatting was inadvertently added via support for
range formatting since a range might have a char element
type.
- PBrett asked Mark what his preferred resolution is.
- Mark replied with a preference to preserve formatting of individual
characters of type char in general but to disable formatting
of ranges with a char element type.
- Mark noted that such range formatting probably wouldn't produce the
intended result when the characters are, for example, individual
UTF-8 code units.
- PBrett expressed skepticism that the reported formatting was
intentional.
- Tom asked why a different conclusion is reached for formatting of an
individual character vs an individual character in a range.
- Hubert replied that a range of individual code units is more
string-like.
- Niall stated that, in principle, the range could be iterated to
decode characters.
- PBrett agreed but noted that doing so would require encoding
information.
- Niall acknowledged the requirement and noted it could be inferred for
the charN_t types, but not for char.
- Tom expressed a belief that support for the charN_t types is
disabled.
- Victor confirmed that is the case.
- Hubert indicated that such conversions could be enabled, but that
necessary facilities are not currently available at run-time;
something like ICU or iconv would be needed.
- PBrett suggested that an escape translation could be produced.
- Hubert replied that stateful encodings would require representing
state.
- Tom asked what the downside is of disabling support for ranges that
have a mismatched character type as the element type.
- PBrett replied that, ideally, it should be possible to format
everything.
- Victor agreed with PBrett and stated that formatters for string-like
types that have a mismatched character element type could be disabled
and that a specifier to format a range as a string could be
provided.
- Hubert expressed support for a protocol to opt-in to support of
string-like types.
- Zach asked if std::vector would be considered a string-like
type.
- Zach expressed support for disabling formatting of ranges with a
mismatched character element type.
- Victor observed that disabling formatters for mismatched
std::string and std::string_view would suffice to
automatically disable types that derive from them.
- Victor expressed support for distinguishing between string-like and
non-string-like types.
- Mark noted that support can always be added later for a disabled
formatter and that disabling these formatters would be an improvement
over the status quo.
- PBrett agreed and asked Mark if he is willing to author a proposed
resolution.
- Mark agreed to do so.
- [ Editor's note: Mark offered a proposed resolution that is now
reflected in the LWG issue. ]
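Illustration (not part of the meeting record): Mark's observation about UTF-8 code units can be modeled without any formatting library. The helper below, with the hypothetical name `widen_each_code_unit`, converts each `char` to `wchar_t` independently, which is a hand-written model of element-wise conversion rather than the library's specified behavior. It assumes the input holds UTF-8 code units.

```cpp
#include <string>
#include <string_view>

// Hypothetical model of element-wise conversion: every char becomes one
// wchar_t with the same numeric value, with no decoding step in between.
std::wstring widen_each_code_unit(std::string_view text) {
    std::wstring out;
    for (char c : text) {
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

For the UTF-8 encoding of U+00E9 (the two code units 0xC3 0xA9), this produces the two wide characters U+00C3 and U+00A9 instead of the single character U+00E9, which is the kind of unintended result the discussion refers to.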
- Tom announced that the next meeting will be 2023-07-26 and that the
agenda will cover allowances for $ in identifiers, encoding for
the proposed std::contracts::contract_violation::comment() member
function, and continued review of Zach's UTF transcoding paper if a
new revision becomes available.
July 26th, 2023
Draft agenda:
Attendees:
- Corentin Jabot
- Eddie Nolan
- Hubert Tong
- Jens Maurer
- Joshua Berne
- Mark de Wever
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Ville Voutilainen
- Zach Laine
Meeting summary:
- WG14 N3145: $ in Identifiers v2:
- Hubert introduced the topic.
- C23 explicitly blessed $ as an allowed character in
identifiers as an implementation-defined extension.
- C has traditionally allowed this extension and support for it is
widely implemented.
- P2342 (For a Few Punctuators More)
contains additional analysis.
- Up to and including C++20, this has been a conforming extension
in C++ since $ in an identifier would be ill-formed.
- In C++20, $ is a UCN and combines with adjacent
identifier characters to produce an ill-formed identifier.
- In C++23, $ is no longer a UCN and adjacency with
identifier characters now yields two pp-tokens, the second of
which renders the program ill-formed.
- In C++26, $ is a member of the basic character set,
adjacency with identifier characters continues to yield two
pp-tokens, but the $ token may be discarded such that
it is never processed during translation phase 7.
- PBrett asked for clarification of what constitutes a conforming
extension.
- Corentin observed that this extension requires the production of a
single pp-token when $ is adjacent to an identifier
character.
- Corentin stated that sanctioning this allowance in the standard
would restrict evolution of the language since it would prevent
use of $ as an operator.
- Steve noted that the status quo is that all implementations allow
$ in identifiers by default, $ is widely used in
identifiers, and $ appears in mangled names.
- Steve stated that compilers are free to issue a diagnostic and
produce a working executable for source code that is ill-formed
according to the standard.
- Hubert replied that the concern is with preprocessing; if $
is not explicitly allowed in an identifier by the preprocessor,
then it is handled as a separate token and the difference is
observable.
- Hubert stated that issuing a diagnostic only during translation
phase 7 would be difficult.
- Hubert asserted that wording changes are in order to continue to
permit existing practice with $ in identifiers.
- Hubert acknowledged concerns regarding how to word an allowance so
that new uses of $ are not restricted.
- Hubert noted that new uses are only problematic if they are not
surrounded by whitespace.
- Jens suggested the possibility of reverting the adoption of
P2558R2 (Add @, $, and ` to the basic character set)
for C++26.
- PBrett expressed opposition to doing so since that would contradict
the direction established in WG14 and codified in C23.
- PBrett stated that this discussion is a good start regarding how to
move forward.
- Jens opined that the WG14 rationale is not motivating and that he is
therefore not motivated to follow the same direction in C++.
- Tom noted that there are backward compatibility concerns for some
platforms due to use of $ in identifiers in system
headers.
- Corentin stated that the WG14 direction was to explicitly state that
it is implementation-defined whether $ is allowed in an
identifier.
- Poll 1: Whether DOLLAR SIGN is accepted as an identifier start
and/or identifier continuation character should be explicitly
implementation-defined.
- Attendees: 12 (4 abstentions)
- No consensus.
- SA: I don't think an identifier should be
implementation-defined.
- PBrett stated that the next step would be a proposal to EWG
acknowledging the guidance here.
- Tom asked for opinions regarding the default modes of current
compilers being non-conforming.
- Zach replied that all implementations offer an option to disable the
extension.
- PBrett stated that every implementation is non-conforming in their
default modes in practice.
- Corentin asserted that implementations should issue warnings for use
of the extension.
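Illustration (not part of the meeting record): Hubert's point that the difference is observable during preprocessing can be sketched with stringizing. All macro and variable names here (`SG16_STR`, `SG16_XSTR`, `SG16_DEMO`, `dollar_probe`) are hypothetical. The result depends on the implementation and its options, so the sketch reports what it observes rather than assuming one outcome; a compiler run in a strict mode may also diagnose the `$`.

```cpp
#include <string_view>

#define SG16_STR(x) #x
#define SG16_XSTR(x) SG16_STR(x)  // expands its argument before stringizing
#define SG16_DEMO 1

// If $ is accepted as an identifier character, SG16_DEMO$x is one identifier
// that names no macro, so the text is "SG16_DEMO$x".
// If $ is a separate pp-token, SG16_DEMO is a macro name on its own and is
// replaced, so the text begins with "1".
constexpr std::string_view dollar_probe = SG16_XSTR(SG16_DEMO$x);

constexpr bool dollar_is_identifier_character() {
    return dollar_probe == "SG16_DEMO$x";
}
```

Because the two tokenizations give different strings before translation phase 7 is reached, a diagnostic issued only in phase 7 would come too late to hide the difference.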
- P2811R7: Contract-Violation Handlers:
- Joshua introduced the topic:
- SG21 is working on a specification for a contract violation
handler.
- The proposed comment() member function of
std::contracts::contract_violation is intended to return
a string containing the source code of the violated contract
predicate.
- The proposed encoding for the returned string is the ordinary
literal encoding.
- Tom expressed support for use of the ordinary literal encoding.
- Tom asked if anything should be specified regarding handling of
characters that are not encodeable in the ordinary literal
encoding.
- Corentin agreed with use of the ordinary literal encoding on the
basis that the text will be used at run-time.
- Steve asked for confirmation that the feature effectively converts a
source code snippet to text.
- Joshua confirmed.
- Steve suggested that a hand-wavy approach similar to that taken for
static_assert is likely necessary except that the string has
to survive until run-time and we lack a mechanism to communicate the
encoding.
- Steve stated that the compiler should perform a best effort rendering
in the target encoding with the understanding that, for example, an
identifier might not be representable in Latin1.
- Jens observed that is a different operation than stringizing.
- Steve agreed.
- Corentin asked what the anticipated use cases are for the
comment() function.
- Joshua replied that the primary use case is for logging; other use
cases might involve using the result as a key for a map.
- Joshua asserted that it is not intended to provide source code that a
programmer might expect to parse.
- Joshua stated that the output is only intended to be sufficient for a
human to be able to correlate it with the original source code.
- Zach ruminated on the interaction of source encoding and literal
encoding and how preprocessor stringifying works.
- Jens noted that the assert macro is similarly expected to
embed source code in the output it produces.
- Jens stated that the wording for assert does not capture the
fact that producing the output involves multiple transcoding
steps.
- [ Editor's note: the transcoding steps are the conversion from the
encoding of the input file
([lex.phases]p1)
to the translation character set
([lex.charset]p1)
then to the ordinary literal encoding
([lex.charset]p8)
and then finally, if necessary, to the implementation-defined encoding
used to write text to the standard error stream
([cassert.syn]
via reference to the C standard). ]
- Jens observed that, for comment(), there is a possibility to
differentiate these steps; the compiler performs the conversion to the
ordinary literal encoding and the violation handler can then perform
additional transcoding as necessary.
- Jens asserted that these are not novel problems.
- Jens observed that non-encodeable characters in string literals are
ill-formed and that a preprocessor stringize operation that produces
such a string would likewise be ill-formed.
- Jens posited doing similarly for contracts.
- Corentin stated that doing so makes sense and then described some
additional encoding options:
- UTF-8 in char8_t, though that doesn't improve
usability.
- implementation-defined.
- ordinary literal encoding with an escaping mechanism for
non-encodeable characters.
- Corentin suggested it is likely best to just let implementors do what
they think is best.
- PBrett stated that SG21 had strong consensus for the text returned by
comment() being implementation-defined.
- PBrett noted that, since it is implementation-defined, there is no
need to specify whether the content includes macro expanded text.
- PBrett asserted that it is essential that the encoding be specified
and expressed support for the current paper direction.
- PBrett agreed that UTF-8 in char8_t is an option, but that
the standard provides few facilities to consume it.
- Hubert noted that, since C does not prohibit non-encodeable characters
in string literals, the stringize operation suffices for
assert in C.
- Steve stated that it would be very surprising if a char-based
string with an encoding other than the ordinary literal encoding was
returned; a char8_t-based string should be used if a UTF-8
encoded string is always returned.
- Poll 2: The value of std::contract_violation::comment should be a
null-terminated multi-byte string (NTMBS) in the string literal
encoding.
- Attendees: 12 (1 abstention)
- Unanimous consensus.
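Illustration (not part of the meeting record): the comparison with assert and preprocessor stringizing can be sketched as follows. The macro `SG16_PREDICATE_TEXT` and the variable names are hypothetical, and this only shows what stringizing produces; it is not how an implementation is required to compute comment(), whose text SG21 left implementation-defined.

```cpp
#include <string_view>

// Hypothetical macro: turns the spelling of an expression into an ordinary
// string literal, as the assert macro is expected to do for its argument.
#define SG16_PREDICATE_TEXT(expr) #expr

// The resulting literal is a char-based string in the ordinary literal encoding.
constexpr std::string_view predicate_text = SG16_PREDICATE_TEXT(x > 0 && y != 2);

// Runs of whitespace between tokens collapse to a single space.
constexpr std::string_view normalized_text = SG16_PREDICATE_TEXT(x   >   0);
```

The text is sufficient for a human to correlate with the source, which matches the use case Joshua described, but it is a rendering of the tokens rather than a byte-for-byte copy of the source file.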
- LWG 3944: Formatters converting sequences of char to sequences of wchar_t:
- PBrett explained that the goal of discussing this issue is to
determine if we agree with the proposed resolution.
- Victor expressed support for it and stated that it is consistent
with previous discussions.
- Victor noted a minor markup issue in the proposed wording; the extent
of the struck text should include the trailing >
character.
- Poll 3: Recommend the proposed resolution to LWG3944 "Formatters
converting sequences of char to sequences of wchar_t" to LWG, after
fixing the typo.
- Attendees: 12
- No objection to unanimous consent.
- Mark asked what the next step is for this issue.
- Tom advised sending the proposed resolution to the LWG chair and
stated that he would work with the LWG chair to get a github issue
filed to record the SG16 poll.
- Tom stated that the next meeting is scheduled for 2023-08-09.
- Zach indicated that he could have a revision of
P2728 (Unicode in the Library, Part 1: UTF Transcoding)
available by then.
- Victor reported that he has a new revision of
P2845: Formatting of std::filesystem::path
available.
August 23rd, 2023
Draft agenda:
Attendees:
- Fraser Gordon
- Hubert Tong
- Mark de Wever
- Peter Brett
- Robin Leroy
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2909R0: Dude, where’s my char‽:
- Much appreciation was expressed for the clever paper title.
- [ Editor's note: in later revisions, the R0 title was demoted
to a sub-title and a new title introduced; "Fix formatting of code
units as integers". ]
- Victor introduced the paper:
- When std::format() was introduced, non-portable behavior
due to the implementation-defined signedness of char was
not intended.
- It is possible that some users expect the signedness to be
reflected in the output, but most users that are formatting
character types as integers are intending to expose bit
patterns.
- This is technically a breaking change.
- This is more LEWG territory, but since it is text related, it
seemed prudent to collect input from SG16.
- PBrett requested that section 2, "Proposal", be expanded to illustrate
the before/after effects for each of the type options.
- Victor agreed to do so.
- Victor explained that the change increases compatibility with
std::printf() for the impacted type options other than "d";
the "%d" std::printf() conversion specifier always treats its
argument as a signed type, but the proposed change for the "d" type
option will always treat char as an unsigned type regardless
of whether it is signed.
- Zach expressed appreciation for symmetry and that the change improves
support for portable roundtripping behavior.
- Mark acknowledged that the change is a breaking change and asked if
the intent is to handle this as a DR.
- Victor replied that LEWG will decide that and that he would recommend
handling this as a DR.
- Mark observed the lack of a feature test macro.
- Victor stated that he could add one.
- Hubert requested that a more descriptive title be used for the
paper.
- Hubert noted that it is implementation-defined whether
wchar_t is a signed type as well.
- Victor replied that it would be reasonable to treat all charT
types as being unsigned.
- PBrett requested that the paper be updated to explicitly mention
wchar_t as well.
- Hubert expressed some concerns over the proposed change; char
and wchar_t do have a signedness and it isn't good for
programmers to ignore that.
- Victor replied that, for wchar_t at least, the concern is not
as strong since programmers don't tend to use wchar_t as an
integer type as is done with char.
- Hubert suggested it might make sense for the "d" type option to
maintain signedness.
- Victor stated a preference for the signedness handling being
consistent across the type options.
- Tom noted that int8_t could be implemented in terms of
char.
- Hubert noted that most of the changes increase consistency with
std::printf() and stated the improved consistency should be
extended to all of the integer types.
- PBrett reminded the group that char is a distinct type from
signed char and unsigned char.
- Zach asserted that it is surprising to get a negative value for a
char type and stated that negative char values are
a wart in the language.
- Hubert noted that
[basic.fundamental]p11
specifies that char is an integer type.
- PBrett asked if an LWG issue should be raised regarding whether
int8_t can use char as its designated type.
- Fraser responded that cv-qualified types are also integer types and
might therefore possibly be used as the designated type unless the
int8_t wording excludes them.
- Hubert noted that cv-qualified types being integer types was a recent
CWG change.
- PBrett reported that
[cstdint.syn]
specifies that int8_t must designate a signed integer type
and that
[basic.fundamental]p1
doesn't include char in its definition of
signed integer types.
- PBrett stated that we will file a LWG issue to clarify this.
- Tom asked for confirmation of the behavior for integer types other than
char when used with the "o", "x", and "X" type options.
- Victor replied that negative values may be produced.
- Hubert stated that includes wchar_t when it is a signed
type.
- Tom noted that is consistent with the status quo wording.
- Hubert noted that the wording is applicable to charT, but not
to mixed character types.
- Poll 1: Modify P2909R0 "Dude, where's my char‽" to maintain
semi-consistency with printf such that the 'b', 'B', 'o', 'x', and 'X'
conversions convert all integer types as unsigned.
- Attendees: 8 (1 abstention)
- No consensus.
- SA: I'm not opposed to that direction in principle, but it is
a deeper change and needs more research.
- A: I'm concerned about the lack of implementation
experience.
- Poll 2: Modify P2909R0 "Dude, where's my char‽" to remove the
change to handling of the 'd' specifier.
- Attendees: 8 (1 abstention)
- No consensus.
- SA: That would add a corner case to a corner case; this is
more LEWG territory and will get discussed there.
- Poll 3: Forward P2909R0 "Dude, where's my char‽", amended with a
descriptive title, an expanded before/after table, and fixed CharT
wording, to LEWG with the recommendation to adopt it as a Defect
Report.
- Attendees: 8 (1 abstention)
- Weak consensus.
- Tom asked if there are any concerns beyond the std::printf()
inconsistencies that would motivate the N and A voters towards
F/SF.
- No other concerns were raised.
- Hubert expressed unhappiness with the "d" type option direction since
it won't provide help to those debugging issues related to
char being a signed type.
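Illustration (not part of the meeting record): the portability concern behind the paper can be sketched without std::format. The helper names `code_unit_hex` and `code_unit_as_int` are hypothetical, and the sketch assumes 8-bit bytes. The first helper exposes the bit pattern by converting through `unsigned char`; the second converts directly to `int`, so its result depends on whether `char` is signed on the implementation.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a code unit as two lowercase hex digits.
// Converting through unsigned char makes the result independent of the
// implementation-defined signedness of char.
std::string code_unit_hex(char c) {
    char buf[3];
    std::snprintf(buf, sizeof buf, "%02x",
                  static_cast<unsigned>(static_cast<unsigned char>(c)));
    return buf;
}

// Hypothetical helper: plain integral conversion. For a code unit with the
// high bit set this is negative when char is signed and positive otherwise.
int code_unit_as_int(char c) {
    return c;
}
```

For the code unit 0xE9, the first helper yields "e9" everywhere, while the second yields -23 where `char` is signed and 233 where it is unsigned.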
- P2728R6: Unicode in the Library, Part 1: UTF Transcoding:
- Zach introduced the changes made in recent revisions:
- The type unpacking mechanism was reworked.
- The null_sentinel_t type was moved to the std
namespace.
- A std::ranges::project_view was introduced based on SG9
(Ranges) feedback though this view is likely to be replaced in a
future revision with a conditionally borrowed
transform_view.
- The utfN_views are now just aliases of a
utf_view class template specialization.
- PBrett asked if anyone has new SG16 concerns inspired by the changes
since R3.
- [ Editor's note: SG16's last review of this paper was
P2728R3
during the
2023-05-10 SG16 meeting.
]
- No new concerns were raised.
- Tom asked for specific ideas on how to improve presentation in the
motivation section of the paper to address any lingering concerns from
reviews of previous revisions.
- PBrett stated that the paper has improved significantly from previous
revisions.
- PBrett volunteered to meet with Zach offline to more thoroughly
review that section.
- Fraser asked whether support for the approximately_sized_range
concept proposed by
P2846 (size_hint: Eagerly reserving memory for not-quite-sized lazy ranges)
has been considered.
- Zach replied that he has to some extent and noted that there are range
limits that could be imposed and that might work with that
feature.
- Fraser asked if the proposal could be retrofitted to support that
feature as it progresses through the committee.
- Zach replied affirmatively and explained that the size_hint()
member could be conditionally enabled when size information is
available.
- Hubert requested clarification regarding the request for improvements
to the motivation section.
- Tom explained that he had received input from multiple people that
they felt the motivation section was lacking.
- PBrett explained that one of the perceived issues was the lack of
rationale for the design decisions made and an analysis of
alternatives considered; for example, during previous SG16
discussions, vague comments were sometimes made regarding the design
being motivated by performance concerns, but the performance goals
and concerns are not reflected in the paper.
- PBrett repeated his earlier claim that recent revisions and the
refined scope have improved the situation.
- Hubert stated that it sounds like the motivation question might be
resolved then.
- Hubert suggested that a scope section could be added.
- PBrett reported that, in a recent UK body discussion concerning the
failure for some papers to attain consensus, observations were made
that lack of a common understanding of the problem to be solved
likely contributed to the failure.
- PBrett opined that discussion of earlier revisions of the paper
exhibited some confusion regarding which problems this paper is
intended to address.
- Zach stated that he has been working on prototypes that led to this
paper for about seven years now and that some of the design
motivation is influenced by things he learned along the way, but that
it would require some reflection to recall.
- Zach suggested that discussion move towards error handling as
discussion of that topic was requested in the meeting agenda.
- [ Editor's note: Zach was referring to requests made on the SG16
mailing list. See
https://lists.isocpp.org/sg16/2023/08/3930.php.
]
- Robin added some background for the linked
PR-121 (Recommended Practice for Replacement Characters)
policies. That policy paper was used to inform the recommendation made
by the UTC during the
UTC 116 / L2 213 Joint Meeting held in Redmond, WA from August 11-15, 2008
in which a
consensus to prefer policy 2
was established.
- Zach reported having been unaware of PR-121 and that his design
decisions were guided by what appears in the Unicode Standard.
- Zach summarized the error handling options described by the Unicode
Standard as:
- terminate
- report an error
- substitute a replacement character.
- [ Editor's note: the Unicode 15 chapters that discuss handling of
ill-formed code unit sequences are:
- 3.9, Unicode Encoding Forms, U+FFFD Substitution of Maximal
Subparts.
- 5.22, U+FFFD Substitution in Conversion.
]
- Zach stated that an option to just drop ill-formed code unit sequences
seems misguided.
- Robin agreed and stated that doing so can lead to security
issues.
- Zach stated that there are other options to identify encoding errors
and that he does not want this feature to be made complicated.
- PBrett asserted a need for a feature to just validate that a given
string can be successfully decoded.
- Zach responded that such a feature was in a previous revision of the
paper, but that it was removed as part of reducing scope.
- PBrett stated that he actually wants that feature more than he wants
transcoding support so that input could be proactively rejected.
- Tom expressed sympathy for Zach's perspective but stated a preference
towards not providing an error handler at all over providing one that
is unable to handle arbitrary complexity.
- Zach replied that he really only cared to support terminate, throw,
and substitute as recommended by the Unicode Standard.
- Tom described the error handling approach that JeanHeyd developed for
his work on
P1629 (Standard Text Encoding);
it allows for the current iterator to be moved to an error handler
that manipulates it as necessary and then moves it back; this provides
the error handler full autonomy.
- Zach replied that such an approach doesn't work for a transcoding
iterator since exactly one output code unit must be produced; or would
otherwise require a buffer to be persisted and referenced for later
outputs.
- Tom expressed gratitude for that response and reported that he had not
considered the limitations of lazily transcoding within iterator
operations.
- PBrett provided a brief introduction to the
ztd.text
error handlers.
- [ Editor's note: see the error handlers in the header files
included by
https://github.com/soasis/text/blob/main/include/ztd/text/error_handler.hpp.
]
- Zach noted that each iterator dereference has to produce the next code
unit value and that makes it expensive to support anything other than
substitution of a single code point.
- PBrett asked if more design space options are opened by considering
views rather than iterators.
- Zach replied that the iterators are stateful in either case.
- Zach stated that he would be ok with dropping the error handler in
favor of only doing substitution and noted that the error handler can
only be specified when the iterators are used directly; the views
don't support providing an error handler.
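Illustration (not part of the meeting record): the substitution option can be sketched as an eager UTF-8 decoder that emits U+FFFD for each ill-formed subsequence. The function name `decode_utf8_lossy` is hypothetical, and this is a standalone sketch written for this summary; it is not the interface or implementation proposed in P2728. The byte ranges follow the well-formed UTF-8 table in the Unicode Standard, and each maximal ill-formed subpart yields one replacement character.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 into code points, substituting U+FFFD
// for ill-formed input instead of dropping it or reporting an error.
std::u32string decode_utf8_lossy(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) { out.push_back(b0); continue; }

        int need = 0;                          // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;    // allowed range of the next byte
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;         // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;         // excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;         // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;         // excludes values above U+10FFFF
        } else {
            out.push_back(U'\uFFFD');          // invalid lead or stray continuation
            continue;
        }

        bool ok = true;
        for (; need > 0; --need) {
            if (i >= in.size()) { ok = false; break; }
            const auto b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; }  // leave b for the next round
            cp = static_cast<char32_t>((cp << 6) | (b & 0x3F));
            lo = 0x80; hi = 0xBF;
            ++i;
        }
        out.push_back(ok ? cp : U'\uFFFD');
    }
    return out;
}
```

Because this version produces its whole output eagerly, it avoids the constraint Zach described for a lazy transcoding iterator, where each dereference must produce exactly one output code unit. It also never silently drops input, which is the behavior Robin flagged as a security risk.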
- Tom asked Hubert if his previously expressed interest in exposing the
type unpacking behavior has been satisfied.
- Hubert did not recall his previous interest.
- Tom explained his recollection; that Hubert wanted to be able to take
advantage of the unpacking behavior when writing adapters to be used
in range pipelines.
- Zach stated that the concepts in the paper might need to be refined a
bit but that he has a test that does that.
- Tom requested that an example be added to the paper.
- Hubert suggested that the motivation section be updated to explain
that functionality as well.
- Tom reported that the next meeting will be 2023-09-13 and that likely
agenda items include continued review of P2728R6 and initial review of
P1729R2 (Text Parsing).
September 13th, 2023
Draft agenda:
Attendees:
- Corentin Jabot
- Eddie Nolan
- Fraser Gordon
- Hubert Tong
- Jens Maurer
- Nathan Owen
- Robin Leroy
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
- P2845R3: Formatting of std::filesystem::path:
- [ Editor's note: D2845R3 was the active paper under discussion at the telecon.
The agenda and links used here reference P2845R3 since the links to the draft paper were ephemeral.
The published document may differ from the reviewed draft revision.
]
- Tom noted that SG16 previously reviewed P2845R0 and that the current
revision addresses prior review feedback.
- Victor provided an introduction:
- The recent revisions correct some minor mistakes.
- The proposed default format now produces a non-quoted non-escaped
representation.
- If the format specifier includes the ? option, then a
quoted escaped representation is produced.
- The
{fmt} library
had previously implemented the behavior proposed in P2845R0 but
was recently changed to implement the behavior introduced in
P2845R2; this was a breaking change that impacted a few
users.
- Eddie pointed out an incorrect word choice in section 6, Proposal;
"loose" is used where "lose" is intended.
- Victor stated that the output shown for the first lone surrogate
example in section 6, Proposal, might be incorrect and that he needs
to check if a \x escape should be produced instead of the
\u escape currently presented; the intent is for the
behavior to match what is specified in
[format.string.escaped].
- Poll 1: Forward P2845R2, Formatting of std::filesystem::path, to LEWG with a recommended target of C++26.
- Attendees: 9 (1 abstention)
- Strong consensus.
- P2728R6: Unicode in the Library, Part 1: UTF Transcoding:
- Tom stated that the next meeting is scheduled for 2023-09-27 and that the
anticipated agenda will include an initial review of
P1729R2 (Text Parsing).
- Tom asked for opinions regarding what else should be discussed before
we're ready to poll forwarding this paper.
- Corentin replied that we could work on improving the presentation in the
paper for LEWG.
- Corentin noted that LEWG is likely to return the paper to SG16 for any
Unicode questions not answered in the paper.
- Hubert suggested that LEWG review the design before substantial effort is
expended on formal wording.
September 27th, 2023
Draft agenda:
Attendees:
- Eddie Nolan
- Elias Kosunen
- Fraser Gordon
- Nathan Owen
- Peter Brett
- Robin Leroy
- Tom Honermann
- Victor Zverovich
Meeting summary:
- Tom announced that Steve Downey has agreed to take on a SG16 co-chair
role and will likely participate in a SG16 meeting chair rotation.
- PBrett stated that he might have less time available for SG16 meetings
in the near future due to commute changes.
- Tom reported that an in-person meeting for SG16 is not planned for
Kona.
- PBrett noted that SG21 (contracts) is likely to reduce both his and
Tom's availability during the Kona meeting.
- Fraser asked if SG16 is impacted by the recent decision to disallow ISO
and INCITS from hosting joint meetings.
- PBrett responded that it is not.
- P1729R2: Text Parsing:
- Elias offered an introduction:
- SG16 reviewed P1729R0 during the in-person meeting in
Cologne.
- P1729R1 was reviewed by LEWG-I soon after, but activity then
stalled until recently.
- There have been a lot of changes.
- std::scan is the parsing analog to
std::format.
- Elias proceeded with presenting the paper.
- PBrett pointed out an error in the comments in the example code in
section 3.1, "Basic example";
result.begin() should be result->begin().
- Elias reported that that error has already been corrected in an R3
draft.
- PBrett asked whether begin() reflects the start of the
parsed range or the start of the remaining text.
- Elias replied that it reflects the unparsed remainder.
- Fraser asked why the text to be parsed is passed before the format
string.
- Elias replied that the order is consistent with scanf().
- Tom observed that the dereference of result in the example
code in section 3.5, "Alternative error handling", is unconditional
and would presumably result in some kind of bad behavior if the scan
was not successful.
- Elias confirmed that would result in undefined behavior, noted that
an exception would be thrown if result.value() was used
instead, and stated that the null-coalescing example avoids the bad
behavior.
- Elias stated that the example should probably be updated.
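Illustration (not part of the meeting record): the three access styles mentioned above can be sketched with `std::optional` standing in for the result type of the proposed scan facility. The function `parse_int` is hypothetical and built on `std::from_chars`; it is not the std::scan interface from P1729.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical stand-in for a scan call: yields a value on success and an
// empty optional on failure.
std::optional<int> parse_int(std::string_view text) {
    int value = 0;
    const auto result =
        std::from_chars(text.data(), text.data() + text.size(), value);
    if (result.ec != std::errc{}) {
        return std::nullopt;
    }
    return value;
}

// Checked access: tests for success before dereferencing.
int parse_or_default(std::string_view text, int fallback) {
    const std::optional<int> parsed = parse_int(text);
    return parsed ? *parsed : fallback;   // same effect as parsed.value_or(fallback)
}
```

Dereferencing an empty optional with `*` is undefined behavior, calling `value()` throws `std::bad_optional_access`, and `value_or()` plays the role of the null-coalescing form, so only the first of the three is unsafe when the parse fails.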
- Elias presented 3.6, "Scanning an user-defined type" and stated that,
with regard to encoding, his implementation assumes UTF-8, but that
doing so is probably not acceptable for the standard.
- Victor replied that most of the matching against the format string is
encoding agnostic and just needs to match code units.
- Victor noted that encoding issues are relevant for cases that involve
locale.
- Victor asserted that the same encoding approach should be used as for
std::format; use the ordinary literal encoding.
- PBrett reported that other encodings, such as EUC-KR, are still widely
used in some regions.
- Elias presented 6.2, "scanf-like [character set]
matching", a future extension to support matching a range of
characters and asked about the potential to unintentionally close off
the possibility of future support for such extensions if they are not
provided up front.
- Tom replied that he didn't have any such concerns.
- Tom stated that this is regex-like behavior that could be layered on
in a different manner.
- PBrett expressed a desire for such a feature in order to ease
migration from scanf().
- PBrett noted that matching a range could be locale dependent.
- Victor asserted that feature additions should be motivated by usage
and noted that std::format() does not provide replacements
for all features in printf().
- Eddie asked if a range of characters might be sensitive to encoding;
for example, the meaning of the range [A-Z] could potentially
be different for EBCDIC vs ASCII.
- Tom suggested revisiting such concerns with locale in mind at a later
time.
- Tom noted that character set ranges are locale sensitive in utilities
like grep.
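
Eddie's question about encoding sensitivity can be illustrated with a naive range check on code units. This is a sketch under stated assumptions: the EBCDIC values below are hard-coded from code page 037 so that the example does not depend on the host encoding, and `in_code_unit_range` is a hypothetical helper, not something proposed in the paper.

```cpp
#include <cassert>

// Code unit values for a few characters, hard-coded for illustration.
constexpr unsigned char ascii_A = 0x41;
constexpr unsigned char ascii_Z = 0x5A;
constexpr unsigned char ascii_right_brace = 0x7D;   // '}' in ASCII
constexpr unsigned char ebcdic_A = 0xC1;            // 'A' in code page 037
constexpr unsigned char ebcdic_Z = 0xE9;            // 'Z' in code page 037
constexpr unsigned char ebcdic_right_brace = 0xD0;  // '}' in code page 037

// Naive check of the kind a [A-Z] specifier might perform if it simply
// compared code unit values.
constexpr bool in_code_unit_range(unsigned char unit,
                                  unsigned char first,
                                  unsigned char last) {
    return unit >= first && unit <= last;
}

// In ASCII the letters A-Z are contiguous, so '}' is outside the range.
static_assert(!in_code_unit_range(ascii_right_brace, ascii_A, ascii_Z),
              "'}' is not in [A-Z] for ASCII");

// In EBCDIC the letters are split into three runs (A-I, J-R, S-Z), so the
// code unit range from 'A' to 'Z' also covers non-letters such as '}'.
static_assert(in_code_unit_range(ebcdic_right_brace, ebcdic_A, ebcdic_Z),
              "'}' falls inside the code unit range for EBCDIC");
```

The EBCDIC range spans 41 code unit values for only 26 letters, which is why a range specifier would need a definition that is independent of code unit order.
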
- Tom directed discussion back to section 4.2, "Format strings", and
stated that the use of std::isspace() to scan whitespace is
problematic because it is only able to recognize whitespace characters
that are encoded as a single code unit.
- Tom asked Victor to confirm that a definition of whitespace characters
was not required for std::format().
- Victor confirmed.
- Tom suggested that, if the associated literal encoding is a standard
Unicode encoding form, then the set of whitespace characters should be
defined to match one of the Unicode whitespace definitions and that
the set is otherwise implementation-defined.
- Elias noted that Unicode specifies a lot of whitespace characters and
wondered how surprising recognizing them might be.
- Robin stated that Unicode provides multiple whitespace character
definitions for use in various contexts via the White_Space
and Pattern_White_Space character properties.
- Robin noted that Pattern_White_Space has the advantage of
being immutable and that it includes some invisible characters that
should be ignored.
- Robin explained that Pattern_White_Space is intended to be
used for the specification of programming languages and that Unicode
offers a recommendation in
UAX #31 of Unicode 15.1.
- [ Editor's note: See
section 4.1, "Whitespace"
and, in particular,
conformance requirement UAX31-R3a.
]
- Tom provided a Unicode utilities link that lists all the characters
with Pattern_White_Space=Yes.
- PBrett lamented the "null" names for the C0 and C1 control
characters.
- Robin filed an issue to correct the display names.
- [ Editor's note: Following the meeting, Robin shared additional
details regarding the categorization of whitespace in the Unicode
Standard. Even this is an incomplete list.
]
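
The whitespace discussion above can be made concrete with a small sketch. It assumes the input has already been decoded to code points; `is_pattern_white_space` and `any_code_unit_is_space` are hypothetical helpers, and which whitespace definition std::scan should use was not decided in the meeting.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Code points with Pattern_White_Space=Yes. The property is immutable,
// so this list does not change between Unicode versions.
constexpr bool is_pattern_white_space(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D)  // TAB, LF, VT, FF, CR
        || cp == 0x0020                    // SPACE
        || cp == 0x0085                    // NEXT LINE
        || cp == 0x200E || cp == 0x200F    // LEFT-TO-RIGHT / RIGHT-TO-LEFT MARK
        || cp == 0x2028 || cp == 0x2029;   // LINE / PARAGRAPH SEPARATOR
}

// Returns true if std::isspace recognizes any individual code unit.
// std::isspace inspects one code unit at a time, so it cannot recognize a
// whitespace character that is encoded as several code units.
bool any_code_unit_is_space(const unsigned char* units, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (std::isspace(units[i])) {
            return true;
        }
    }
    return false;
}
```

For example, U+2028 LINE SEPARATOR is encoded in UTF-8 as the three code units E2 80 A8. In the default "C" locale none of them is whitespace according to `std::isspace`, although the code point has Pattern_White_Space=Yes. U+00A0 NO-BREAK SPACE shows that the two Unicode definitions differ: it has White_Space=Yes but Pattern_White_Space=No.
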
- Tom asked for a clarification in the example in section 4.3.2,
"Fill and align"; if the format string for rK was changed
to "**42*" would that result in an error like in the
rI example?
- Elias replied affirmatively.
- Tom suggested it might be worth adding an additional example to make
that explicit.
- Elias asked what the behavior should be for input that is invalidly
encoded and stated that he had planned to propose substitution with
a replacement character, but that approach is problematic for output
types with reference semantics like std::string_view.
- Tom asked facetiously why the input wasn't sanitized before being
passed to std::scan.
- Tom noted that this is a basic error handling question.
- Victor stated that substituting a replacement character doesn't work
well in general since it can't be matched by most of the type
specific scanners.
- Victor suggested treating invalidly encoded input as an error such
that an unsuccessful scan result is returned.
- PBrett noted that ill-formed sequences could just be passed through
when scanning for string input.
- Tom asked how the scanner would know when to stop scanning.
- PBrett replied that the replacement character could be handled as a
character that doesn't match any other character.
- Tom quipped, so it's a NaN.
- PBrett agreed.
- Robin suggested treating such substituted characters as the unknown
character, U+FFFF, with regard to Unicode properties and shared a
link that lists those properties.
- Elias expressed concern about the overhead imposed by always
validating the encoding of the input, but noted that passing
ill-formed sequences through means that errors are sometimes caught
and sometimes not.
- Eddie expressed a desire for input to be sanitized to avoid having to
worry about the consequences otherwise.
- PBrett observed that, historically, we have required input to be
sanitized and noted that validating the input is helpful to avoid
undefined behavior.
- PBrett summarized the options available; make the behavior
well-defined, make it undefined, or, perhaps, categorize it as
erroneous though we don't yet know how to apply that concept to the
standard library.
- [ Editor's note: "Erroneous behavior" is a concept recently
discussed within WG21 in the context of
P2795 (Erroneous behaviour for uninitialized reads).
]
- Elias noted that scanners written for user-defined types are unlikely
to perform encoding validation, but that we could encourage authors to
delegate scanning back to std::scan as demonstrated in
section 3.6, "Scanning an user-defined type".
- Robin reported that Unicode does have a conformance requirement that
ill-formed input not be treated as a character and directed the group
to
conformance clause C10 in chapter 3 of Unicode 15:
"When a process interprets a code unit sequence which purports to be
in a Unicode character encoding form, it shall treat ill-formed code
unit sequences as an error condition and shall not interpret such
sequences as characters."
- PBrett interpreted the clause as motivation for erroneous
behavior.
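
One way to treat ill-formed input as an error condition, as Victor suggested earlier, is to validate the input up front and report where the first ill-formed sequence starts. The following is a minimal sketch that assumes UTF-8 input; `first_ill_formed_utf8` is a hypothetical helper and the paper does not specify this approach. The accepted byte ranges follow the well-formed UTF-8 byte sequences table in chapter 3 of the Unicode Standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the offset of the first ill-formed UTF-8 sequence in `text`, or
// std::string_view::npos if the whole input is well-formed.
std::size_t first_ill_formed_utf8(std::string_view text) {
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 0;
        // Allowed range for the second code unit; narrowed for some leads.
        unsigned char second_low = 0x80;
        unsigned char second_high = 0xBF;
        if (lead <= 0x7F) {
            len = 1;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            if (lead == 0xE0) second_low = 0xA0;   // rejects overlong forms
            if (lead == 0xED) second_high = 0x9F;  // rejects surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            if (lead == 0xF0) second_low = 0x90;   // rejects overlong forms
            if (lead == 0xF4) second_high = 0x8F;  // rejects > U+10FFFF
        } else {
            return i;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }
        if (n - i < len) {
            return i;  // truncated sequence at the end of the input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(text[i + k]);
            const unsigned char low = (k == 1) ? second_low : 0x80;
            const unsigned char high = (k == 1) ? second_high : 0xBF;
            if (c < low || c > high) {
                return i;
            }
        }
        i += len;
    }
    return std::string_view::npos;
}
```

A scanner could call such a check once and return an unsuccessful result when it does not yield `npos`. This costs a full pass over the input, which is the validation overhead that Elias raised.
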
- Robin asked whether erroneous behavior is similar to Ada's concept of
bounded errors and provided a link to
section 1.1.5, "Classification of Errors" in the Ada 2022 standard.
- [ Editor's note: "erroneous behavior" as recently used in WG21
does appear to correlate well with Ada's "bounded errors". Note that
Ada's "erroneous execution" corresponds to the C and C++ notion of
"undefined behavior". ]
- Tom provided a brief overview of the recent history of erroneous
behavior and its proposed use for reads of uninitialized
variables.
- [ Editor's note: Robin shared a link to
section 13.9.1, "Data Validity", paragraph 9 in the Ada 2022 standard
and its discussion of handling of objects with invalid
representations; such cases might arise due to lack of
initialization. ]
- PBrett asked if there is a way to scan a single code point.
- Elias replied that there is in his reference implementation but that
it is provided via a distinct scanner specialization for a
code_point type.
- Elias suggested that it might be desirable to support transcoding of
charN_t-based types in the future.
- PBrett noted that scanning of charN_t-based types does not
involve ambiguous encoding.
- Victor expressed uncertainty regarding how to handle such transcoding
when the corresponding literal encoding isn't a standard Unicode
encoding form.
- Elias stated that a programmer can handle such concerns on their own,
but only for a user-defined type since they would not be permitted
to specialize std::scanner for the charN_t
types.
- Elias explained that locale support is opt-in, as it is for
std::format, and that the classic locale is used by
default.
- Tom pondered whether it would be desirable to recognize input using
both the specified locale and the classic locale.
- PBrett expressed strong opposition.
- Tom realized use of multiple locales doesn't work at all because
recognition of characters used for, e.g., thousands separators and
decimal points would be ambiguous.
- PBrett requested the addition of an example to the paper to
demonstrate explicit use of a locale.
- PBrett asserted that it is a programmer requirement to ensure that
input is correctly encoded for the specified locale.
- Elias directed attention to section 4.3.5, "Localized (L)", and asked
whether a misplaced grouping separator should result in an error.
- Elias indicated that the proposed behavior is consistent with
iostreams.
- Victor cautioned against being innovative and opined that existing
practice should be followed.
- Victor noted that implementation of a relaxed scanner should not be
difficult when needed.
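
The existing iostreams behavior referred to above can be observed with a custom `std::numpunct` facet. This is a sketch for illustration only: `dot_grouping`, `parse_outcome`, `parse_double`, and `grouped_locale` are hypothetical names, and a custom facet is used in place of a named locale because named locales are not available on every system.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: '.' separates groups of three digits and ',' is the
// decimal point.
struct dot_grouping : std::numpunct<char> {
    char do_thousands_sep() const override { return '.'; }
    char do_decimal_point() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

struct parse_outcome {
    bool ok;       // false if the stream reported failure
    double value;  // only meaningful when ok is true
};

// Parses a double from `text` using the numeric facets of `loc`.
parse_outcome parse_double(const std::string& text, const std::locale& loc) {
    std::istringstream in(text);
    in.imbue(loc);
    double value = 0.0;
    in >> value;
    return {!in.fail(), value};
}

// Classic locale with the numpunct facet replaced by dot_grouping.
// The locale takes ownership of the facet.
std::locale grouped_locale() {
    return std::locale(std::locale::classic(), new dot_grouping);
}
```

The same text `"1.500"` parses as 1.5 in the classic locale and as 1500 with the grouping facet, which is the ambiguity that rules out recognizing both locales at once. With the grouping facet, `"15.00"` has a misplaced separator and the extraction sets `failbit`.
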
- Tom noted the discussion of alternative options for interpretation of
field width units in section 4.3.4, "Width and precision" and asked
for motivating reasons to consider options that differ from
std::format.
- Elias replied that the proposed behavior differs from
std::scanf().
- Tom asked whether the message strings described in section 4.6,
"Error handling" require translation or localization.
- Elias replied that they are intended for use with
std::exception and therefore target programmers rather than
end users.
- Tom stated that the next meeting is scheduled for October 11th and that
an agenda is still to be determined.