<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Aug 2, 2018, at 10:26 PM, Tom Honermann <<a href="mailto:tom@honermann.net" class="">tom@honermann.net</a>> wrote:</div><br class="Apple-interchange-newline"><div class="">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<div class="moz-cite-prefix">Thank you Michael and Dave! I
appreciate the time and detail. All of your answers look to
confirm our expectations, so I interpret this as a good sign we're
thinking about the right things.<br class="">
<br class="">
I added a few inline comments/clarifications below.<br class="">
<br class="">
We had tentatively planned to meet Wednesday of next week, but it
turns out that two of our core SG16 members are going to be on
vacation so, at a minimum, I'd like to postpone. I'm also feeling
pretty content with the responses that we got from you and I think
it would suffice for us to just follow up with any remaining
thoughts via email. While I'd love for any of you to attend one
(or more) of our meetings (any time), I want to be sensitive to
productive use of your time. So, how about we play it by ear for
now?<br class=""></div></div></div></blockquote><div><br class=""></div>Works for me</div><div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class=""><div class="moz-cite-prefix">
<br class="">
On 08/02/2018 05:18 PM, Dave Abrahams wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 1, 2018, at 12:04 PM, Michael Ilseman
<<a href="mailto:milseman@apple.com" class="" moz-do-not-send="true">milseman@apple.com</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8" class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">Hello, I am the current maintainer of
Swift’s String, and can speak to my thoughts on the
status quo and future directions. Dave, who is on this
thread, is much more familiar with the history behind
this and can likely provide deeper insight into the
reasoning.</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Michael has done very well here; I only have a few things to
add.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">On Jul 23, 2018, at 7:39 PM, Tom
Honermann <<a href="mailto:tom@honermann.net" class="" moz-do-not-send="true">tom@honermann.net</a>>
wrote:<br class="">
<font class="" color="#00c8fa"><br class="">
</font>SG16 is seeking input from Swift and
WebKit representatives to help inform our work
towards enhancing support for Unicode in the
C++ standard. In particular, we recognize the
significant amount of effort that went into
the design of the Swift String type and would
like to better understand the motivations that
contributed to its current design and any
pressures that might encourage further
evolution or refinement; especially for any
concerns that would be deemed significant
enough to warrant backward incompatible
changes.<br class="">
Though most of these questions specifically
mention Swift, that is an artifact of our
being more familiar with Swift than the
internal workings of WebKit. Many of these
questions would be applicable to any string
type designed to support Unicode. We are
therefore also interested in hearing about the
string types used by WebKit, the motivations
that guided their design, and the trade-offs
that have been made. Of particular interest
would be the results of design decisions that
contrast with the design of Swift's String
type.<br class="">
Thank you in advance for any time and
expertise you are willing and able to share
with us.<br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">The Swift string manifesto is
about 1 1/2 years old. What have you
learned since writing it? What would
you change? What have you changed?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>We haven’t really diverged from that
manifesto. Some things are still in progress, minor
details were tweaked, but the core arguments are
still relevant.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
Swift strings are extended grapheme
cluster (EGC) based. What have been the
best and worst consequences of this
choice?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>I’ll use “grapheme” casually to mean EGC.
Swift’s Character type represents a grapheme
cluster, Unicode.Scalar represents a Unicode scalar
value (non-surrogate code point).<br class="">
<font class="" color="#8886ff"><br class="">
</font>Cocoa APIs are UTF-16 code unit oriented, and
thus there’s always caution (via documentation)
about making sure such indices align to grapheme
boundaries. This is a frequent source of bugs,
especially as part of internationalization. By
making Swift strings be grapheme-based by default,
developers first reach for the correct APIs.<br class="">
<font class="" color="#8886ff"><br class="">
</font>Another good consequence is that people
picking up Swift and playing with strings, e.g. in a
REPL or Playground, see Swift’s notion of characters
align with what is displayed. This includes complex
multi-component emoji such as family emoji
(👨👨👧👧), which is a single Character composed
of 7 Unicode.Scalars.<br class="">
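<br class="">
A minimal sketch of that family-emoji example, assuming a Swift 5 or later toolchain. The emoji is spelled with explicit scalar escapes so the zero-width joiners are not lost in transit; the variable name is purely illustrative. The four views should report 1 grapheme, 7 scalars, 11 UTF-16 code units and 25 UTF-8 code units.<br class="">

```swift
// Family emoji written with explicit escapes so the zero-width joiners
// (U+200D) survive copy/paste: man, ZWJ, man, ZWJ, girl, ZWJ, girl.
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

print(family.count)                 // graphemes (Character values)
print(family.unicodeScalars.count)  // Unicode scalar values
print(family.utf16.count)           // UTF-16 code units
print(family.utf8.count)            // UTF-8 code units
```

The grapheme count depends on the Unicode version the runtime implements, which is the run-time concern described next.<br class="">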
<font class="" color="#8886ff"><br class="">
</font>This does have downsides. What is and is not
a grapheme cluster changes with each version of
Unicode, and thus grapheme breaking is inherently a
run-time concern and can’t be checked at compile
time. Another is that while code units can be
accessed randomly, graphemes cannot, which is confusing
to developers used to UTF-16 code unit access mostly
working (until their users use non-BMP scalars or
emoji, that is). </div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">I'd say the biggest downside is that there are users who
simply refuse to accept what we consider to be the fundamental
non-random-access character of any efficient string
representation. They are upset that they can't index a string
directly with an integer, and can't be talked out of it. I
still think we made the right decision in this regard; you'd
have the same problem if your strings were
unicode-scalar-based.</div>
</div>
</blockquote>
<br class="">
Are there common scenarios where programmers tend to be frustrated
by lack of random access? Perhaps most often when they are working
with inputs known to be ASCII only? </div></div></blockquote><div><br class=""></div>Those people can just use the UTF-16 or UTF-8 views and be done.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">Or is this mostly an education
issue and these programmers are having a difficult time accepting
that they've spent most of their career thus far writing bugs? :)<br class=""></div></div></blockquote><div><br class=""></div>IMO it's a combination of the latter and the fact that we don't yet have good APIs for the higher-level operations they really mean when they want to write code that involves (usually constant) integer indices, which is usually pattern matching/parsing code.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class="">Furthermore, few existing specifications
are phrased in terms of grapheme clusters, so something
like a validator wouldn’t want to run on
grapheme-segmented text, but at a lower abstraction
level.<br class="">
<font class="" color="#8886ff"><br class="">
</font>Also, graphemes can be funky. A string
containing only U+0301 (COMBINING ACUTE ACCENT) has
one grapheme, but that grapheme modifies the prior grapheme upon
concatenation. Such degenerate graphemes violate
algebraic reasoning in these corner cases. </div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">We are not aware of generic algorithms that rely on
concatenation of collections conserving element counts, so we
decided to simply document this quirk rather than saying that
string is-not-a collection.</div>
</div>
</blockquote>
<br class="">
SG16 has previously discussed cases like this and I'm happy to hear
you haven't had to do anything special for it. This is a good
example of why we asked about inappropriate use of the String count
property: programmers assuming s1.count + s2.count ==
(s1 + s2).count.<br class="">
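<br class="">
A small sketch of the corner case under discussion, assuming Swift 5 or later; the names s1, s2 and joined are illustrative only. Each operand has a count of 1, yet the concatenation also has a count of 1, so counts are not additive here.<br class="">

```swift
let s1 = "e"
let s2 = "\u{0301}"   // COMBINING ACUTE ACCENT on its own: one degenerate grapheme
let joined = s1 + s2  // the accent attaches to the preceding "e"

print(s1.count, s2.count, joined.count)
```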
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class="">Unicode defines properties and most
operations on scalars or code points, and very
little on top of graphemes.<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">When porting code unit or code
point based code to Swift strings (e.g.,
when rewriting Objective-C code, or
rewriting Swift code to use String
instead of NSString), has profiling
revealed performance regressions due to
the switch to EGC based processing? If
so, what action was taken to correct it?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>We have many fast-paths in grapheme-breaking
to identify common situations surrounding
single-scalar graphemes. If a developer wants to
work with Unicode at a lower level, String provides
a UTF8View, a UTF16View, and a UnicodeScalarView.
Those views lazily transcode/decode upon access.<br class="">
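<br class="">
One way to drop down to a lower level might look like the sketch below, assuming Swift 5 or later. The input string and variable names are made up for illustration. Copying the UTF-8 view into an Array is just one option for getting integer indexing; it costs a linear copy.<br class="">

```swift
let input = "GET /index.html"

// Check the "known ASCII" assumption explicitly rather than trusting it.
let isASCII = input.unicodeScalars.allSatisfy { $0.isASCII }

// Option 1: copy the code units once, then index with plain integers.
let bytes = Array(input.utf8)
let codeUnit = bytes[4]

// Option 2: stay in the view and advance an index by an offset.
let idx = input.utf8.index(input.utf8.startIndex, offsetBy: 4)
let sameUnit = input.utf8[idx]

print(isASCII, Character(Unicode.Scalar(codeUnit)), codeUnit == sameUnit)
```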
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Cool, it sounds like the answer to any such regressions was 1)
optimization in terms of fast-paths, and 2) fall back to code
unit/point processing otherwise.<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>There are also performance concerns and
annoyances when working with ICU, but this is an
implementation detail. If you’re interested in using
ICU, we can discuss further what has worked best for
us.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
I think you're interested in (at least optionally) using ICU
unless you have evidence of major investment in another
open-source implementation of Unicode algorithms and tables.
Otherwise, C++ implementors could not afford to develop
standard libraries.</div>
</blockquote>
<br class="">
Yes, definitely. For the foreseeable future, I think we need to
ensure that any interfaces we propose can be reasonably implemented
using ICU. However, Zach Laine has made impressive progress
implementing many of the Unicode algorithms without use of ICU in
his proposed Boost.Text library. See
<a class="moz-txt-link-freetext" href="https://github.com/tzlaine/text">https://github.com/tzlaine/text</a> and
<a class="moz-txt-link-freetext" href="https://tzlaine.github.io/text/doc/html/index.html">https://tzlaine.github.io/text/doc/html/index.html</a>.<br class=""></div></div></blockquote><div><br class=""></div>W00t! Go Zach!<br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
Swift strings do not enforce storage in
any particular Unicode normalization
form. Was consideration given to
forcing storage in a particular form
such as FCC or NFC?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>Swift strings now sort with NFC (currently
UTF-16 code unit order, but likely to change to
Unicode scalar value order). We didn’t find FCC
significantly more compelling in practice. Since NFC
is far more frequent in the wild (why waste space if
you don’t have to), strings are likely to already be
in NFC. We have fast-paths to detect, on the fly,
already-normalized sections of strings (e.g. all ASCII, all &lt;
U+0300, NFC_QC=yes, etc.). We lazily normalize
portions of a string during comparison when needed.<br class="">
<font class="" color="#8886ff"><br class="">
</font>As far as enforcing on creation, no. We do
want to add an option to perform a linear scan to
set a performance flag, perhaps at creation, so that
comparison can take the memcmp-like fast-path.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
OK, my takeaway from this is that fast-pathing has been sufficient
for lazy normalization (when needed) to not be (much of) a
performance concern. At least, not enough to want to take the
normalization cost on every string construction up front.<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Swift strings support
comparison via normalization. Has use
of canonical string equality been a
performance issue? Or been a source of
surprise to programmers?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>This was a big performance issue on Linux,
where we used to do UCA+DUCET-based comparisons. We
switched to lexicographical order of NFC-normalized
UTF-16 code units (future: scalar values), and saw a
very significant speedup there. The remaining
performance work revolves around checking and
tracking whether a string is known to already be in
a normal form, so we can just memcmp.<br class="">
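<br class="">
To make the comparison semantics concrete, here is a short sketch, assuming Swift 5 or later, with illustrative variable names. String equality treats canonically equivalent spellings as equal, while a scalar-by-scalar comparison of the same two values does not.<br class="">

```swift
let precomposed = "caf\u{E9}"      // é as a single scalar, U+00E9
let decomposed  = "cafe\u{0301}"   // e followed by U+0301 COMBINING ACUTE ACCENT

// String comparison uses canonical equivalence.
print(precomposed == decomposed)

// Comparing the underlying scalars sees two different sequences.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))
```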
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
This is very helpful, thank you. We've suspected that full
collation (with or without tailoring) would be too expensive for use
as a default comparison operator, so it is good to hear that
confirmed.<br class=""></div></div></blockquote><div><br class=""></div>More importantly, such collation is not actually useful without a locale. Strings being used for machine processing don't need to be ordered according to "human rules" and once human rules do come into play you want to account for language/region. We think it <i class="">is</i> important that the machine doesn't distinguish between the different ways of writing "é", if nothing else to prevent invisible distinctions in literals in source code, which is why we normalize.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
<br class="">
I'm curious why this was a larger performance issue for Linux than
for (presumably) macOS and/or iOS.<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Swift strings are not locale
sensitive. Was any consideration given
to creation of a distinct locale
sensitive string type?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>This is still up for debate and hasn’t been
settled yet, but we think it makes a lot of sense.
If an array of strings is sorted, we certainly don’t
want a locale-change to violate programmer
invariants. A distinct type from string could avoid
a lot of common errors here, including forgetting to
localize before presenting to a user as part of a
UI.<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Swift strings provide a count
property as required to satisfy the
Collection protocol. How often do
programmers use count (the number of
EGCs in the string) inappropriately?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>I’m not sure what would constitute
inappropriate usage here. We do not currently
provide access to the underlying stored code units,
though this is a frequent request and we likely will
in the future. I haven’t seen anyone baking in the
assumption that count is the same for String and
across all of String’s views (UTF-8, UTF-16,
Unicode scalars).<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
</div>
<div class="">One thing to consider is that as long as String is not
random-access, count will be a worst-case O(N) operation. An
inappropriate usage might involve computing the length once per
loop iteration.</div>
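<br class="">
A sketch of how that cost can be avoided, assuming Swift 5 or later; the sample text and names are illustrative. The loop walks the string's indices in a single linear pass, and the length, if needed at all, is computed once outside the loop instead of in the loop condition.<br class="">

```swift
let text = String(repeating: "e\u{0301}", count: 1_000)

// Walk indices directly: one linear pass, no integer offsets
// and no repeated calls to count.
var visited = 0
for index in text.indices {
    _ = text[index]
    visited += 1
}

// If the length is needed, compute it once outside the loop.
let length = text.count
print(visited, length)
```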
</blockquote>
<br class="">
In addition to the above and prior mention of algebraic concerns,
other potential abuses we had in mind were using it to determine
field widths for display or code unit/point based storage.<br class="">
<br class="">
C++ container requirements specify that .size() be O(1). For us to
meet container requirements would require computing and caching the
count during construction and mutation operations. </div></div></blockquote><div><br class=""></div>You could also just not supply .size(). I don't know if .size() is required by container these days, but unless things have changed since I was watching (and I'm sure they have) the container concepts were not actually useful for generic programming.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">We could
potentially get by just meeting range requirements though.<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class="">I mentioned degenerate graphemes
breaking algebraic properties of the Collection
protocol, but this hasn’t been a huge issue in
practice so far.<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
Swift strings support several memory
unsafe initializers and methods. How
frequently are these used incorrectly?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>Many of these initializers come from NSString
originally, and developers migrating correct code to
Swift maintain that correctness. Rust has a similar
situation, though they do validation at
creation-time and from_utf8_unchecked() voids
memory-safety if the contents are invalid.<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">The Swift manifesto discussed
three approaches to handling substrings
and Swift 4 changed from "same type,
shared storage" to "different type,
shared storage". Any regrets?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>Having two types can be a bit of a pain, but
we still think it was the right thing to do. This is
consistent with Swift treating slices as a distinct
type from the base collection.<br class="">
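<br class="">
For readers unfamiliar with the two-type design, a brief sketch assuming Swift 5 or later, with an invented header line as input. Slicing produces a Substring, and converting back to String is an explicit step.<br class="">

```swift
let line = "Content-Type: text/html"

// Slicing yields a Substring that shares storage with `line`.
let name = line.prefix(12)

// An explicit conversion copies the slice into an independent String.
let owned = String(name)

print(type(of: name), owned)
```

The explicit conversion is the "bit of a pain" mentioned above. In exchange, a small slice cannot silently keep a large parent string alive for long.<br class="">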
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
How often do you find programmers doing
work at the EGC level that would be
better performed at the code unit or
code point level?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>Often, if a developer has strict
requirements, they know enough about what they’re doing to
operate at one of those lower levels.<br class="">
<font class="" color="#8886ff"><br class="">
</font>Not being able to random-access graphemes in
a string is a common source of frustration and
confusion amongst new users.<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Likewise, how often do you find
programmers working with unicodeScalars,
utf8, or utf16 views to do work better
performed at the EGC level? For what
reasons does this occur? Perhaps to
work around differences in EGC
boundaries across Unicode versions or
the underlying version of ICU in use?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>This was very prevalent in Swift’s early
days. String wasn’t a collection of graphemes by
default prior to Swift 4,</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Well, it was. And then in Swift 2 or 3 it wasn't, due to the
algebraic reasoning issue. Now it is again.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""> so without guidance many developers
wrote code against the unicode scalars view. We also
didn’t have any fast-paths for common-case
situations back then, which further encouraged them
to use one of the other views.<br class="">
<font class="" color="#8886ff"><br class="">
</font>This is still done sometimes for
performance-sensitive usage, or by someone wanting to
handle Unicode themselves. However, as mentioned
previously, we don’t (yet) provide direct access to
the actual storage.<br class="">
<font class="" color="#8886ff"><br class="">
</font>We haven’t seen much desire for reconciling
behavior across Unicode versions. This may be due to
Swift being primarily an applications level
programming language for devices which only have one
version of Unicode that’s relevant (the current
one).<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Has consideration been given to
exposing Unicode character database
properties? CharacterSet exposes some of
these properties, but have more been
requested?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br class="">
</font>Yes, this was recently added to the
language: <a href="https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md" class="" moz-do-not-send="true">https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md</a>.
We surface much of the UCD via ICU.<br class="">
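<br class="">
A small sketch of querying a few of those properties, assuming a Swift 5 or later toolchain where the proposal's API is available. The scalar chosen is arbitrary.<br class="">

```swift
let scalar: Unicode.Scalar = "\u{E9}"
let props = scalar.properties

print(props.isAlphabetic)        // Alphabetic property
print(props.generalCategory)     // Unicode.GeneralCategory value
print(props.name ?? "unnamed")   // Name property, if the scalar has one
```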
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Ah, nice. All kinds of fun to be had with that :)<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">How firmly is the Swift string
implementation tied to ICU? If the C++
standard library were to add suitable
Unicode support, what would motivate
reimplementing Swift strings on top of
it?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Swift’s tie to ICU is less firm than it used to be.
We use ICU for the following:<br class="">
<font class="" color="#8886ff"><br class="">
</font>1. Grapheme breaking<br class="">
2. Normalization<br class="">
3. Accessing UCD properties<br class="">
4. Case conversion<br class="">
<font class="" color="#8886ff"><br class="">
</font>Each of these are not too tightly entwined
with string; they’re cordoned-off as a couple of
shims called on fallback slow-paths.<br class="">
<font class="" color="#8886ff"><br class="">
</font>If the C++ standard library provided these
operations, sufficiently up-to-date with the Unicode
version and comparable to or better than ICU in
performance, we would be willing to switch. A big
pain in interacting with ICU is its limited
support for UTF-8. Some users would like to use
a “lighter-weight” Swift and are unhappy at having
to link against ICU, as it’s fairly large and
can complicate security audits.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Got it. Increasing the size of the C++ standard library is a
definite concern for us as well. We imagine some C++ users would be
similarly unhappy if their standard library suddenly required
linking against ICU.<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Do Swift programmers tend to
prefer string interpolation or string
formatting functions?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Users tend to prefer string interpolation. However,
Swift currently does not have much in the way of
formatting control in interpolations, and this is
something we’re currently working on.<br class="">
<font class="" color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">What enhancements would you
most like to see in C++ to improve
Unicode support?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Swift’s string is perhaps geared as a higher-level
construct than what you may want for C++, and Swift
has Cocoa-interoperability concerns where everything
is UTF-16. Rust might provide a closer model to what
you’re looking for:<br class="">
</div>
</div>
<div class=""><br class="">
</div>
<div class="">
<ul class="MailOutline">
<li class="">Strings are a sequence of (valid) UTF-8
code units</li>
<ul class="">
<li class="">Validation is done on creation</li>
<li class="">Invalid contents (e.g. Windows file
paths) can be handled via something like WTF-8,
which is not intended for interchange</li>
</ul>
</ul>
</div>
<div class="">
<ul class="MailOutline">
<li class="">String provides bidirectional iterators
for:</li>
<ul class="">
<li class="">Transcoded and/or normalized code
units</li>
<li class="">Unicode scalar values (their
“character” type)</li>
<li class="">Grapheme clusters</li>
</ul>
</ul>
</div>
</div>
</div>
</div>
</blockquote>
<br class="">
</div>
<div class="">Michael, I think you're not answering the question asked.
They are asking what Swift would want from C++, e.g., to allow
us to decouple from ICU. Wouldn't we like to be able to do
that?</div>
</blockquote>
<br class="">
This question was intended to ask you, as expert C++ programmers
independently from Swift, what additions to C++ you think would be
most helpful to improve our (very lacking) Unicode support. So,
Michael's response is on point (thank you; we'll take a closer look
at Rust), as are any comments regarding what would benefit Swift
specifically. Michael's earlier comments regarding what Swift
currently uses ICU for are suggestive of what Swift might want from
C++. But I imagine the form in which those features are provided
would matter greatly; devils and details.<br class=""></div></div></blockquote><div><br class=""></div>OK, sorry for the misunderstanding!</div><div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
<br class="">
Tom.<br class="">
<br class="">
<blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
<div class=""><br class="">
</div>
<div class="">-Dave</div>
<div class=""><br class="">
</div>
<br class="">
</blockquote><p class=""><br class="">
</p>
</div>
</div></blockquote></div><br class=""></body></html>