<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 08/03/2018 02:00 PM, Dave Abrahams
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:2D0C499E-0196-415D-AB68-D48578D53057@apple.com">
<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 2, 2018, at 10:26 PM, Tom Honermann &lt;<a
href="mailto:tom@honermann.net" class=""
moz-do-not-send="true">tom@honermann.net</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<div class="moz-cite-prefix">Thank you Michael and Dave!
I appreciate the time and detail. All of your answers
look to confirm our expectations, so I interpret this as
a good sign we're thinking about the right things.<br
class="">
<br class="">
I added a few inline comments/clarifications below.<br
class="">
<br class="">
We had tentatively planned to meet Wednesday of next
week, but it turns out that two of our core SG16 members
are going to be on vacation so, at a minimum, I'd like
to postpone. I'm also feeling pretty content with the
responses that we got from you and I think it would
suffice for us to just follow up with any remaining
thoughts via email. While I'd love for any of you to
attend one (or more) of our meetings (any time), I want
to be sensitive to productive use of your time. So, how
about we play it by ear for now?<br class="">
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
Works for me</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<div class="moz-cite-prefix"> <br class="">
On 08/02/2018 05:18 PM, Dave Abrahams wrote:<br class="">
</div>
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 1, 2018, at 12:04 PM, Michael
Ilseman &lt;<a href="mailto:milseman@apple.com"
class="" moz-do-not-send="true">milseman@apple.com</a>&gt;
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">Hello, I am the current maintainer
of Swift’s String, and can speak to my
thoughts on the status quo and future
directions. Dave, who is on this thread, is
much more familiar with the history behind
this and can likely provide deeper insight
into the reasoning.</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Michael has done very well here; I only have a few
things to add.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">On Jul 23, 2018, at 7:39
PM, Tom Honermann &lt;<a
href="mailto:tom@honermann.net"
class="" moz-do-not-send="true">tom@honermann.net</a>&gt;
wrote:<br class="">
<font class="" color="#00c8fa"><br
class="">
</font>SG16 is seeking input from
Swift and WebKit representatives to
help inform our work towards
enhancing support for Unicode in the
C++ standard. In particular, we
recognize the significant amount of
effort that went into the design of
the Swift String type and would like
to better understand the motivations
that contributed to its current
design and any pressures that might
encourage further evolution or
refinement; especially for any
concerns that would be deemed
significant enough to warrant
backward incompatible changes.<br
class="">
Though most of these questions
specifically mention Swift, that is
an artifact of our being more
familiar with Swift than the
internal workings of WebKit. Many
of these questions would be
applicable to any string type
designed to support Unicode. We are
therefore also interested in hearing
about the string types used by
WebKit, the motivations that guided
their design, and the trade-offs
that have been made. Of particular
interest would be the results of
design decisions that contrast
with the design of Swift's String
type.<br class="">
Thank you in advance for any time
and expertise you are willing and
able to share with us.<br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">The
Swift string manifesto is
about 1 1/2 years old. What
have you learned since writing
it? What would you change?
What have you changed?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>We haven’t really diverged from
that manifesto. Some things are still in
progress, minor details were tweaked, but
the core arguments are still relevant.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
Swift strings are extended
grapheme cluster (EGC) based.
What have been the best and
worst consequences of this
choice?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>I’ll use “grapheme” casually to
mean EGC. Swift’s Character type
represents a grapheme cluster,
Unicode.Scalar represents a Unicode scalar
value (non-surrogate code point).<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>Cocoa APIs are UTF-16 code unit
oriented, and thus there’s always caution
(via documentation) about making sure such
indices align to grapheme boundaries. This
is a frequent source of bugs, especially
as part of internationalization. By making
Swift strings be grapheme-based by
default, developers first reach for the
correct APIs.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>Another good consequence is that
people picking up Swift and playing with
strings, e.g. in a REPL or Playground, see
Swift’s notion of characters align with
what is displayed. This includes complex
multi-component emoji such as the family emoji
(👨‍👨‍👧‍👧), which is a single Character
composed of 7 Unicode.Scalars.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>This does have downsides. What is
and is not a grapheme cluster changes with
each version of Unicode, and thus grapheme
breaking is inherently a run-time concern
and can’t be checked at compile time.
Another is that while code units can be
random-access, graphemes cannot, which is
confusing to developers used to UTF-16
code unit access mostly working (until
their users use non-BMP scalars or emoji,
that is). </div>
</div>
</div>
</div>
</div>
</blockquote>
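<div class="">As a rough illustration of the abstraction levels described above (a sketch added for clarity, not code from the thread), the following counts the same family emoji through each of String's views. The scalar and code unit counts follow directly from the encodings. The Character count depends on the grapheme-breaking rules of the Unicode version the toolchain uses, so it is printed without an expected value.</div>
<pre>
```swift
// Family emoji: four emoji scalars joined by three ZERO WIDTH JOINERs (U+200D).
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

// The lower-level views expose scalars and code units.
print(family.unicodeScalars.count)  // 7 Unicode scalars
print(family.utf16.count)           // 11 UTF-16 code units (4 surrogate pairs + 3 joiners)
print(family.utf8.count)            // 25 UTF-8 code units (4 * 4 bytes + 3 * 3 bytes)

// The default view is grapheme-based; the result depends on the Unicode tables in use.
print(family.count)
```
</pre>
<div class="">The spread between 7, 11, and 25 is why an integer offset means nothing until the view it applies to has been named.</div>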
<div class=""><br class="">
</div>
<div class="">I'd say the biggest downside is that
there are users who simply refuse to accept what we
consider to be the fundamental non-random-access
character of any efficient string representation.
They are upset that they can't index a string
directly with an integer, and can't be talked out of
it. I still think we made the right decision in
this regard; you'd have the same problem if your
strings were unicode-scalar-based.</div>
</div>
</blockquote>
<br class="">
Are there common scenarios where programmers tend to be
frustrated by lack of random access? Perhaps most often
when they are working with inputs known to be ASCII only?
</div>
</div>
</blockquote>
<div><br class="">
</div>
Those people can just use the UTF-16 or UTF-8 views and be done.</div>
</blockquote>
<br>
I think I may have misunderstood Michael's initial response. The
concern is less about (O(1)) random access and more about the
ability to index with an integer rather than having to use
String.Index. Though, that is the case for String.UTF8View and
String.UTF16View as well, isn't it?<br>
<br>
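<div>To make the indexing point concrete, here is a minimal sketch (the variable names are illustrative only). It assumes a Swift 4 or later toolchain and shows that both String and its UTF-8 view are subscripted with String.Index, so integer indexing requires either offsetting an index or copying the code units into an Array.</div>
<pre>
```swift
let s = "h\u{E9}llo"  // "héllo", with a precomposed é (U+00E9)

// Graphemes: there is no integer subscript; an index is derived by walking the string.
let second = s.index(s.startIndex, offsetBy: 1)
print(s[second])  // é

// The UTF-8 view is indexed by String.Index as well, not by Int.
let bytes = s.utf8
let secondByte = bytes.index(bytes.startIndex, offsetBy: 1)
print(bytes[secondByte])  // 195 (0xC3, the first byte of é)

// Copying into an Array is one way to get integer indexing, at O(n) cost.
let byteArray = Array(s.utf8)
print(byteArray[1])  // 195
```
</pre>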
<blockquote type="cite"
cite="mid:2D0C499E-0196-415D-AB68-D48578D53057@apple.com">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">Or is this
mostly an education issue and these programmers are having
a difficult time accepting that they've spent most of
their career thus far writing bugs? :)<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
IMO it's a combination of the latter and the fact that we don't
yet have good APIs for the higher-level operations they really
mean when they want to write code that involves (usually
constant) integer indices, which is usually pattern
matching/parsing code.</div>
</blockquote>
<br>
Ok, that makes sense and I think aligns with my new understanding
above.<br>
<br>
<blockquote type="cite"
cite="mid:2D0C499E-0196-415D-AB68-D48578D53057@apple.com">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""> <br
class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class="">Furthermore, few existing
specifications are phrased in terms of
grapheme clusters, so something like a
validator wouldn’t want to run on
grapheme-segmented text, but at a lower
abstraction level.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>Also, graphemes can be funky. A
string containing only U+0301 (COMBINING
ACUTE ACCENT) has one grapheme, but
modifies the prior grapheme upon
concatenation. Such degenerate graphemes
violate algebraic reasoning in these
corner cases. </div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">We are not aware of generic algorithms
that rely on concatenation of collections conserving
element counts, so we decided to simply document
this quirk rather than saying that string is-not-a
collection.</div>
</div>
</blockquote>
<br class="">
SG16 has previously discussed cases like this and I'm
happy to hear you haven't had to do anything special for
it. This is a good example of why we asked about
inappropriate use of the String count property:
programmers assuming s1.count + s2.count ==
(s1 + s2).count.<br class="">
<br class="">
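<div class="">The degenerate-grapheme case can be reproduced in a few lines. This is an illustrative sketch rather than code from the thread: it concatenates a base letter with a lone combining accent and shows that grapheme counts are not additive, while scalar counts are.</div>
<pre>
```swift
let base = "e"
let accent = "\u{301}"  // a lone COMBINING ACUTE ACCENT: one degenerate grapheme

let joined = base + accent
print(base.count)    // 1
print(accent.count)  // 1
print(joined.count)  // 1, not 2: the accent merged into the preceding grapheme

// Scalar counts, by contrast, do add up under concatenation.
let scalarSum = base.unicodeScalars.count + accent.unicodeScalars.count
print(joined.unicodeScalars.count == scalarSum)  // true
```
</pre>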
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class="">Unicode defines properties and
most operations on scalars or code points,
and very little on top of graphemes.<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">When
porting code unit or code
point based code to Swift
strings (e.g., when rewriting
Objective-C code, or rewriting
Swift code to use String
instead of NSString), has
profiling revealed performance
regressions due to the switch
to EGC based processing? If
so, what action was taken to
correct it?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>We have many fast-paths in
grapheme-breaking to identify common
situations surrounding single-scalar
graphemes. If a developer wants to work
with Unicode at a lower level, String
provides a UTF8View, a UTF16View, and a
UnicodeScalarView. Those views lazily
transcode/decode upon access.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Cool, it sounds like the answer to any such regressions
was 1) optimization in terms of fast-paths, and 2) fall
back to code unit/point processing otherwise.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>There are also performance concerns
and annoyances when working with ICU, but
this is an implementation detail. If
you’re interested in using ICU, we can
discuss further what has worked best for
us.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
I think you're interested in (at least optionally)
using ICU unless you have evidence of major investment
in another open-source implementation of Unicode
algorithms and tables. Otherwise, C++ implementors
could not afford to develop standard libraries.</div>
</blockquote>
<br class="">
Yes, definitely. For the foreseeable future, I think we
need to ensure that any interfaces we propose can be
reasonably implemented using ICU. However, Zach Laine has
made impressive progress implementing many of the Unicode
algorithms without use of ICU in his proposed Boost.Text
library. See <a class="moz-txt-link-freetext"
href="https://github.com/tzlaine/text"
moz-do-not-send="true">https://github.com/tzlaine/text</a>
and <a class="moz-txt-link-freetext"
href="https://tzlaine.github.io/text/doc/html/index.html"
moz-do-not-send="true">https://tzlaine.github.io/text/doc/html/index.html</a>.<br
class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
W00t! Go Zach!<br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
Swift strings do not enforce
storage in any particular
Unicode normalization form.
Was consideration given to
forcing storage in a
particular form such as FCC or
NFC?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Swift strings now sort by their NFC
form (currently in UTF-16 code unit order,
though this is likely to change to Unicode
scalar value order). We didn’t find FCC significantly
more compelling in practice. Since NFC is
far more frequent in the wild (why waste
space if you don’t have to), strings are
likely to already be in NFC. We have
fast-paths to detect, on the fly, sections
of strings that are already normalized (e.g. all ASCII, all
&lt; U+0300, NFC_QC=yes, etc.). We lazily
normalize portions of a string during
comparison when needed.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>As far as enforcing on creation,
no. We do want to add an option to perform
a linear scan to set a performance flag,
perhaps at creation, so that comparison
can take the memcmp-like fast-path.<br
class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Ok, my take away from this is that fast-pathing has been
sufficient for lazy normalization (when needed) to not be
(much of) a performance concern. At least, not enough to
want to take the normalization cost on every string
construction up front.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Swift
strings support comparison via
normalization. Has use of
canonical string equality been
a performance issue? Or been
a source of surprise to
programmers?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>This was a big performance issue on
Linux, where we used to do UCA+DUCET based
comparisons. We switched to lexicographical
order of NFC-normalized UTF-16 code units
(future: scalar values), and saw a very
significant speed-up there. The remaining
performance work revolves around checking
and tracking whether a string is known to
already be in a normal form, so we can
just memcmp.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
This is very helpful, thank you. We've suspected that
full collation (with or without tailoring) would be too
expensive for use as a default comparison operator, so it
is good to hear that confirmed.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
More importantly, such collation is not actually useful without
a locale. Strings being used for machine processing don't need
to be ordered according to "human rules" and once human rules do
come into play you want to account for language/region. We
think it <i class="">is</i> important that the machine doesn't
distinguish between the different ways of writing "é", if
nothing else to prevent invisible distinctions in literals in
source code, which is why we normalize.</div>
</blockquote>
<br>
That makes perfect sense.<br>
<br>
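<div>A small sketch of what normalizing comparison means in practice (added for illustration; the variable names are hypothetical). It compares the two canonical spellings of "é" with ==, then compares their scalars directly, and finally shows that hashing agrees with equality.</div>
<pre>
```swift
let precomposed = "\u{E9}"   // é as a single scalar
let decomposed = "e\u{301}"  // e followed by COMBINING ACUTE ACCENT

// String equality is defined in terms of canonical equivalence.
print(precomposed == decomposed)  // true

// The underlying scalars still differ.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))  // false

// Hashing is consistent with ==, so a Set keeps only one of the two spellings.
let names = Set([precomposed, decomposed])
print(names.count)  // 1
```
</pre>
<div>This is the "invisible distinction" case: two literals that look identical on screen would otherwise behave as different dictionary keys.</div>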
<blockquote type="cite"
cite="mid:2D0C499E-0196-415D-AB68-D48578D53057@apple.com">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""> <br
class="">
I'm curious why this was a larger performance issue for
Linux than for (presumably) macOS and/or iOS.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Swift
strings are not locale
sensitive. Was any
consideration given to
creation of a distinct locale
sensitive string type?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>This is still up for debate and
hasn’t been settled yet, but we think it
makes a lot of sense. If an array of
strings is sorted, we certainly don’t want
a locale-change to violate programmer
invariants. A distinct type from string
could avoid a lot of common errors here,
including forgetting to localize before
presenting to a user as part of a UI.<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Swift
strings provide a count
property as required to
satisfy the Collection
protocol. How often do
programmers use count (the
number of EGCs in the string)
inappropriately?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>I’m not sure what would constitute
inappropriate usage here. We do not
currently provide access to the underlying
stored code units, though this is a
frequent request and we likely will in the
future. I haven’t seen anyone baking in
the assumption that count is the same for
String and across all of String’s views
(UTF-8, UTF-16, Unicode scalars).<br
class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
</div>
<div class="">One thing to consider is that as long as
String is not random-access, count will be a
worst-case O(N) operation. An inappropriate usage
might involve computing the length once per loop
iteration.</div>
</blockquote>
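<div class="">The loop concern above can be sketched as follows. This is an illustrative example, not code from the thread, and it assumes count is recomputed on each call, as the O(N) remark implies. It contrasts re-evaluating count in a loop condition with computing it once and with iterating the characters directly.</div>
<pre>
```swift
let text = String(repeating: "e\u{301}", count: 1_000)

// Anti-pattern: text.count walks the whole string on every evaluation of the
// loop condition, making the loop quadratic overall.
var steps = 0
while steps != text.count {
    steps += 1
}

// Computing the count once keeps the loop linear.
let total = text.count
var hoistedSteps = 0
while hoistedSteps != total {
    hoistedSteps += 1
}

// Iterating the characters directly avoids counting at all.
var visited = 0
for _ in text {
    visited += 1
}
print(steps, hoistedSteps, visited)  // 1000 1000 1000
```
</pre>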
<br class="">
In addition to the above and prior mention of algebraic
concerns, other potential abuses we had in mind were using
it to determine field widths for display or code
unit/point based storage.<br class="">
<br class="">
C++ container requirements specify that .size() be O(1).
For us to meet container requirements would require
computing and caching the count during construction and
mutation operations. </div>
</div>
</blockquote>
<div><br class="">
</div>
You could also just not supply .size(). I don't know if .size()
is required by container these days, but unless things have
changed since I was watching (and I'm sure they have) the
container concepts were not actually useful for generic
programming.</div>
</blockquote>
<br>
.size() is required for containers, but is not required for ranges.
The ranges TS provides concepts for both sized and non-sized ranges.<br>
<br>
<blockquote type="cite"
cite="mid:2D0C499E-0196-415D-AB68-D48578D53057@apple.com">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">We could
potentially get by just meeting range requirements though.<br
class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class="">I mentioned degenerate
graphemes breaking algebraic properties of
the Collection protocol, but this hasn’t
been a huge issue in practice so far.<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
Swift strings support several
memory unsafe initializers and
methods. How frequently are
these used incorrectly?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Many of these initializers come
from NSString originally, and developers
migrating correct code to Swift maintain
that correctness. Rust has a similar
situation, though they do validation at
creation-time and from_utf8_unchecked()
voids memory-safety if the contents are
invalid.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">The
Swift manifesto discussed
three approaches to handling
substrings and Swift 4 changed
from "same type, shared
storage" to "different type,
shared storage". Any regrets?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Having two types can be a bit of a
pain, but we still think it was the right
thing to do. This is consistent with Swift
treating slices as a distinct type from
the base collection.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
How often do you find
programmers doing work at the
EGC level that would be better
performed at the code unit or
code point level?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Often, if a developer has strict
requirements, they know what they’re doing
enough to operate at one of those lower
levels.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>Not being able to random-access
graphemes in a string is a common source
of frustration and confusion amongst new
users.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Likewise,
how often do you find
programmers working with
unicodeScalars, utf8, or utf16
views to do work better
performed at the EGC level?
For what reasons does this
occur? Perhaps to work around
differences in EGC boundaries
across Unicode versions or the
underlying version of ICU in
use?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>This was very prevalent in Swift’s
early days. String wasn’t a collection of
graphemes by default prior to Swift 4,</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Well, it was. And then in Swift 2 or 3 it wasn't, due
to the algebraic reasoning issue. Now it is again.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""> so without guidance many
developers wrote code against the unicode
scalars view. We also didn’t have any
fast-paths for common-case situations back
then, which further encouraged them to use
one of the other views.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>This is still done sometimes for
performance-sensitive usage, or someone
wanting to handle Unicode themselves.
However, as mentioned previously, we don’t
(yet) provide direct access to the actual
storage.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>We haven’t seen much desire for
reconciling behavior across Unicode
versions. This may be due to Swift being
primarily an applications level
programming language for devices which
only have one version of Unicode that’s
relevant (the current one).<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Has
consideration been given to
exposing Unicode character
database properties?
CharacterSet exposes some of
these properties, but have
more been requested?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Yes, this was recently added to the
language: <a
href="https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md"
class="" moz-do-not-send="true">https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md</a>.
We surface much of the UCD via ICU.<br
class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Ah, nice. All kinds of fun to be had with that :)<br
class="">
<br class="">
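<div class="">For a sense of what the linked proposal exposes, here is a brief sketch. It assumes a toolchain that ships the Unicode.Scalar.properties API from that proposal (Swift 5 or later); the particular scalars queried are chosen only for illustration.</div>
<pre>
```swift
let accent: Unicode.Scalar = "\u{301}"
let props = accent.properties

// The character name from the Unicode Character Database (optional, as some scalars have none).
print(props.name ?? "(unnamed)")                 // COMBINING ACUTE ACCENT

// General category: U+0301 is a nonspacing mark (Mn).
print(props.generalCategory == .nonspacingMark)  // true

// Binary properties are exposed as Booleans.
let letter: Unicode.Scalar = "a"
print(letter.properties.isAlphabetic)            // true
```
</pre>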
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">How
firmly is the Swift string
implementation tied to ICU?
If the C++ standard library
were to add suitable Unicode
support, what would motivate
reimplementing Swift strings
on top of it?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Swift’s tie to ICU is less firm than it
used to be. We use ICU for the following:<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>1. Grapheme breaking<br class="">
2. Normalization<br class="">
3. Accessing UCD properties<br class="">
4. Case conversion<br class="">
<font class="" color="#8886ff"><br
class="">
</font>None of these is too tightly
entwined with string; they’re cordoned off
as a couple of shims called on fallback
slow-paths.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>If the C++ standard library
provided these operations, sufficiently
up-to-date with the Unicode version and
comparable to or better than ICU in
performance, we would be willing to
switch. A big pain in interacting with ICU
is its limited support for UTF-8. Some
users would like to use a
“lighter-weight” Swift and are unhappy at
having to link against ICU, as it’s fairly
large, and it can complicate security
audits.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Got it. Increasing the size of the C++ standard library
is a definite concern for us as well. We imagine some C++
users would be similarly unhappy if their standard library
suddenly required linking against ICU.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Do
Swift programmers tend to
prefer string interpolation or
string formatting functions?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Users tend to prefer string interpolation.
However, Swift currently does not have
much in the way of formatting control in
interpolations, and this is something
we’re currently working on.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode: space;
line-break: after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">What
enhancements would you most
like to see in C++ to improve
Unicode support?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Swift’s string is perhaps geared as a
higher-level construct than what you may
want for C++, and Swift has
Cocoa-interoperability concerns where
everything is UTF-16. Rust might provide a
closer model to what you’re looking for:<br
class="">
</div>
</div>
<div class=""><br class="">
</div>
<div class="">
<ul class="MailOutline">
<li class="">Strings are a sequence of
(valid) UTF-8 code units</li>
<ul class="">
<li class="">Validation is done on
creation</li>
<li class="">Invalid contents (e.g.
Windows file paths) can be handled via
something like WTF-8, which is not
intended for interchange</li>
</ul>
</ul>
</div>
<div class="">
<ul class="MailOutline">
<li class="">String provides bidirectional
iterators for:</li>
<ul class="">
<li class="">Transcoded and/or
normalized code units</li>
<li class="">Unicode scalar values
(their “character” type)</li>
<li class="">Grapheme clusters</li>
</ul>
</ul>
</div>
</div>
</div>
</div>
</blockquote>
<br class="">
</div>
<div class="">Michael, I think you're not answering the
question asked. They are asking what Swift would want
from C++, e.g., to allow us to decouple from ICU.
Wouldn't we like to be able to do that?</div>
</blockquote>
<br class="">
This question was intended to ask you, as expert C++
programmers independently from Swift, what additions to
C++ you think would be most helpful to improve our (very
lacking) Unicode support. So, Michael's response is on
point (thank you; we'll take a closer look at Rust), as
are any comments regarding what would benefit Swift
specifically. Michael's earlier comments regarding what
Swift currently uses ICU for are suggestive of what Swift
might want from C++. But I imagine the form in which
those features are provided would matter greatly; devils
and details.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
OK, sorry for the misunderstanding!</div>
</blockquote>
<br>
Not a misunderstanding, the question was just (intentionally, but
clearly overly) vague :)<br>
<br>
Tom.<br>
<br>
<blockquote type="cite"
cite="mid:2D0C499E-0196-415D-AB68-D48578D53057@apple.com">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""> <br
class="">
Tom.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
</div>
<div class="">-Dave</div>
<div class=""><br class="">
</div>
<br class="">
</blockquote>
<p class=""><br class="">
</p>
</div>
</div>
</blockquote>
</div>
<br class="">
</blockquote>
<p><br>
</p>
</body>
</html>