<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 08/03/2018 12:53 PM, Michael Ilseman
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class=""><br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 2, 2018, at 10:26 PM, Tom Honermann
<<a href="mailto:tom@honermann.net" class=""
moz-do-not-send="true">tom@honermann.net</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8" class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<div class="moz-cite-prefix">Thank you Michael and
Dave! I appreciate the time and detail. All of your
answers look to confirm our expectations, so I
interpret this as a good sign we're thinking about the
right things.<br class="">
<br class="">
I added a few inline comments/clarifications below.<br
class="">
<br class="">
We had tentatively planned to meet Wednesday of next
week, but it turns out that two of our core SG16
members are going to be on vacation so, at a minimum,
I'd like to postpone. I'm also feeling pretty content
with the responses that we got from you and I think it
would suffice for us to just follow up with any
remaining thoughts via email. While I'd love for any
of you to attend one (or more) of our meetings (any
time), I want to be sensitive to productive use of
your time. So, how about we play it by ear for now?<br
class="">
<br class="">
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>I’d be happy to meet up sometime. JF mentioned an
in-person meeting sometime this fall. Feel free to grab me
whenever you think I can add value.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<div class="moz-cite-prefix"> On 08/02/2018 05:18 PM,
Dave Abrahams wrote:<br class="">
</div>
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8" class="">
<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 1, 2018, at 12:04 PM, Michael
Ilseman <<a href="mailto:milseman@apple.com"
class="" moz-do-not-send="true">milseman@apple.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8" class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">Hello, I am the current
maintainer of Swift’s String, and can speak
to my thoughts on the status quo and future
directions. Dave, who is on this thread, is
much more familiar with the history behind
this and can likely provide deeper insight
into the reasoning.</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Michael has done very well here; I only have a few
things to add.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">On Jul 23, 2018, at
7:39 PM, Tom Honermann <<a
href="mailto:tom@honermann.net"
class="" moz-do-not-send="true">tom@honermann.net</a>>
wrote:<br class="">
<font class="" color="#00c8fa"><br
class="">
</font>SG16 is seeking input from
Swift and WebKit representatives
to help inform our work towards
enhancing support for Unicode in
the C++ standard. In particular,
we recognize the significant
amount of effort that went into
the design of the Swift String
type and would like to better
understand the motivations that
contributed to its current design
and any pressures that might
encourage further evolution or
refinement; especially for any
concerns that would be deemed
significant enough to warrant
backward incompatible changes.<br
class="">
Though most of these questions
specifically mention Swift, that
is an artifact of our being more
familiar with Swift than the
internal workings of WebKit. Many
of these questions would be
applicable to any string type
designed to support Unicode. We
are therefore also interested in
hearing about the string types
used by WebKit, the motivations
that guided their design, and the
trade-offs that have been made.
Of particular interest would be
the results of design decisions
that contrast with the design
of Swift's String type.<br
class="">
Thank you in advance for any time
and expertise you are willing and
able to share with us.<br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">The
Swift string manifesto is
about 1 1/2 years old. What
have you learned since
writing it? What would you
change? What have you
changed?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>We haven’t really diverged from
that manifesto. Some things are still in
progress, minor details were tweaked,
but the core arguments are still
relevant.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
Swift strings are extended
grapheme cluster (EGC)
based. What have been the
best and worst consequences
of this choice?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>I’ll use “grapheme” casually to
mean EGC. Swift’s Character type
represents a grapheme cluster, and
Unicode.Scalar represents a Unicode
scalar value (non-surrogate code point).<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>Cocoa APIs are UTF-16 code unit
oriented, and thus there’s always
caution (via documentation) about making
sure such indices align to grapheme
boundaries. This is a frequent source of
bugs, especially as part of
internationalization. By making Swift
strings be grapheme-based by default,
developers first reach for the correct
APIs.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>Another good consequence is that
people picking up Swift and playing with
strings, e.g. in a REPL or Playground,
see Swift’s notion of characters align
with what is displayed. This includes
complex multi-component emoji such as
family emoji (👨👨👧👧), which is a
single Character composed of 7
Unicode.Scalars.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>This does have downsides. What is
and is not a grapheme cluster changes
with each version of Unicode, and thus
grapheme breaking is inherently a
run-time concern and can’t be checked at
compile time. Another is that while code
units can be random-access, graphemes
cannot, which is confusing to developers
used to UTF-16 code unit access mostly
working (until their users use non-BMP
scalars or emoji, that is). </div>
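<div class="">A minimal Swift sketch of the family emoji example above, assuming Swift 4 or later and a runtime with current Unicode tables. The variable name is illustrative and the emoji is written with explicit scalar escapes so the joiners are visible.</div>
<pre class="">
```swift
// Family emoji: four person scalars joined by three ZERO WIDTH JOINER (U+200D) scalars.
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

// Character view: one user-perceived character (grapheme cluster).
print(family.count)

// Unicode scalar view: 4 people + 3 joiners = 7 scalars.
print(family.unicodeScalars.count)   // 7

// UTF-16 view: each person scalar lies outside the BMP and needs a surrogate pair,
// so 4 * 2 + 3 = 11 code units.
print(family.utf16.count)            // 11
```
</pre>
<div class="">Integer offsets into the UTF-16 view can land inside a surrogate pair or between a joiner and its neighbor, which is the class of bug described above.</div>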
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">I'd say the biggest downside is that
there are users who simply refuse to accept what
we consider to be the fundamental
non-random-access character of any efficient
string representation. They are upset that they
can't index a string directly with an integer, and
can't be talked out of it. I still think we made
the right decision in this regard; you'd have the
same problem if your strings were
unicode-scalar-based.</div>
</div>
</blockquote>
<br class="">
Are there common scenarios where programmers tend to be
frustrated by lack of random access? Perhaps most often
when they are working with inputs known to be ASCII
only? Or is this mostly an education issue and these
programmers are having a difficult time accepting that
they've spent most of their career thus far writing
bugs? :)<br class="">
<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>A lot of it is shaped by expectations coming from other
languages, whose programming models do not prioritize
operating on Unicode scalar values, let alone grapheme
clusters. Objective-C’s default interface with Strings is
random-access to UTF-16 code units, which “works” right up
until you encounter an emoji or other scalar not on the BMP.
It also “works” for graphemes right up until you encounter
emoji or a language you didn’t test or non-NFC-normalized
contents in a language you did test.</div>
<div><br class="">
</div>
<div>This gets compounded by the prevalence of strings in
teaching, interviews, programming puzzles, etc., where a
string is treated like an array with a more visual
representation.</div>
<div><br class="">
</div>
<div>Also note that even for fully ASCII strings we cannot
provide random access to grapheme clusters, as “\r\n” is a
single grapheme cluster. For pretty much every
Unicode-correct operation we provide fast-paths for, there
are nasty corner cases that complicate the model.</div>
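<div>A short Swift sketch of the “\r\n” case, assuming Swift 4 or later; the names are illustrative only.</div>
<pre>
```swift
let crlf = "\r\n"
print(crlf.count)                  // 1: CR LF is a single grapheme cluster
print(crlf.unicodeScalars.count)   // 2: but two Unicode scalars

// A fully ASCII string whose Character count differs from its byte count.
let ascii = "a\r\nb"
print(ascii.count)                 // 3
print(ascii.utf8.count)            // 4

// There is no integer subscript; positions are reached by stepping a String.Index.
let second = ascii[ascii.index(after: ascii.startIndex)]
print(second == "\r\n")            // true
```
</pre>
<div>Because the Character count and the byte count differ even here, an ASCII-only fast path still has to check for CR LF before it can map an integer offset to a Character.</div>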
</div>
</div>
</blockquote>
<br>
Thanks, I had not considered the "\r\n" case. Alas, there are no
easy cases.<br>
<br>
<blockquote type="cite"
cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class="">Furthermore, few existing
specifications are phrased in terms of
grapheme clusters, so something like a
validator wouldn’t want to run on
grapheme-segmented text, but at a lower
abstraction level.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>Also, graphemes can be funky. A
string containing only U+0301
(COMBINING ACUTE ACCENT) has one
grapheme, but modifies the prior
grapheme upon concatenation. Such
degenerate graphemes violate algebraic
reasoning in these corner cases. </div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">We are not aware of generic algorithms
that rely on concatenation of collections
conserving element counts, so we decided to simply
document this quirk rather than saying that string
is-not-a collection.</div>
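<div class="">The quirk can be sketched in a few lines of Swift (Swift 4 or later assumed; names are illustrative).</div>
<pre class="">
```swift
let base = "e"
let accent = "\u{301}"          // COMBINING ACUTE ACCENT on its own
let combined = base + accent    // the accent attaches to the preceding "e"

print(base.count)       // 1
print(accent.count)     // 1: a degenerate grapheme
print(combined.count)   // 1, not 2: concatenation did not conserve the element count
```
</pre>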
</div>
</blockquote>
<br class="">
SG16 has previously discussed cases like this and I'm
happy to hear you haven't had to do anything special for
it. This is a good example of why we asked about
inappropriate use of the String count property:
programmers assuming s1.count + s2.count ==
(s1 + s2).count.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class="">Unicode defines properties
and most operations on scalars or code
points, and very little on top of
graphemes.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">When
porting code unit or code
point based code to Swift
strings (e.g., when
rewriting Objective-C code,
or rewriting Swift code to
use String instead of
NSString), has profiling
revealed performance
regressions due to the
switch to EGC based
processing? If so, what
action was taken to correct
it?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>We have many fast-paths in
grapheme-breaking to identify common
situations surrounding single-scalar
graphemes. If a developer wants to work
with Unicode at a lower level, String
provides a UTF8View, a UTF16View, and a
UnicodeScalarView. Those views lazily
transcode/decode upon access.<br
class="">
</div>
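<div class="">As an illustration of the lower-level views, here is a small Swift sketch (Swift 4 or later assumed; the sample string is made up for the example).</div>
<pre class="">
```swift
let word = "cafe\u{301}"   // "e" followed by COMBINING ACUTE ACCENT

print(word.count)                             // 4 Characters
print(word.unicodeScalars.map { $0.value })   // [99, 97, 102, 101, 769]
print(Array(word.utf16))                      // [99, 97, 102, 101, 769]
print(Array(word.utf8))                       // [99, 97, 102, 101, 204, 129]
```
</pre>
<div class="">Each view is a collection in its own right, so code that needs code units or scalars can iterate the matching view instead of the Character view.</div>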
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Cool, it sounds like the answer to any such regressions
was 1) optimization in terms of fast-paths, and 2) fall
back to code unit/point processing otherwise.<br
class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>There are also performance
concerns and annoyances when working
with ICU, but this is an implementation
detail. If you’re interested in using
ICU, we can discuss further what has
worked best for us.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
I think you're interested in (at least optionally)
using ICU unless you have evidence of major
investment in another open-source implementation of
Unicode algorithms and tables. Otherwise, C++
implementors could not afford to develop standard
libraries.</div>
</blockquote>
<br class="">
Yes, definitely. For the foreseeable future, I think we
need to ensure that any interfaces we propose can be
reasonably implemented using ICU. However, Zach Laine
has made impressive progress implementing many of the
Unicode algorithms without use of ICU in his proposed
Boost.Text library. See <a
class="moz-txt-link-freetext"
href="https://github.com/tzlaine/text"
moz-do-not-send="true">https://github.com/tzlaine/text</a>
and <a class="moz-txt-link-freetext"
href="https://tzlaine.github.io/text/doc/html/index.html"
moz-do-not-send="true">https://tzlaine.github.io/text/doc/html/index.html</a>.<br
class="">
<br class="">
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
Swift strings do not enforce
storage in any particular
Unicode normalization form.
Was consideration given to
forcing storage in a
particular form such as FCC
or NFC?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Swift strings now sort with NFC
(currently UTF-16 code unit order, but
likely to change to Unicode scalar value
order). We didn’t find FCC significantly
more compelling in practice. Since NFC
is far more frequent in the wild (why
waste space if you don’t have to),
strings are likely to already be in NFC.
We have fast-paths to detect on-the-fly
normal sections of strings (e.g. all
ASCII, all &lt; U+0300, NFC_QC=yes,
etc.). We lazily normalize portions of
string during comparison when needed.<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>As far as enforcing on creation,
no. We do want to add an option to
perform a linear scan to set a
performance flag, perhaps at creation,
so that comparison can take the
memcmp-like fast-path.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
OK, my takeaway from this is that fast-pathing has been
sufficient for lazy normalization (when needed) to not
be (much of) a performance concern. At least, not
enough to want to take the normalization cost on every
string construction up front.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Swift
strings support comparison
via normalization. Has use
of canonical string equality
been a performance issue?
Or been a source of surprise
to programmers?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>This was a big performance issue
on Linux, where we used to do UCA+DUCET
based comparisons. We switched to
lexicographical order of NFC-normalized
UTF-16 code units (future: scalar
values), and saw a very significant
speed up there. The remaining
performance work revolves around
checking and tracking whether a string
is known to already be in a normal form,
so we can just memcmp.<br class="">
</div>
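<div class="">A brief Swift sketch of what canonical equality means for the default comparison operator (Swift 4 or later assumed; names are illustrative).</div>
<pre class="">
```swift
let precomposed = "\u{E9}"     // LATIN SMALL LETTER E WITH ACUTE as one scalar
let decomposed = "e\u{301}"    // "e" followed by COMBINING ACUTE ACCENT

// String equality compares canonically equivalent text as equal.
print(precomposed == decomposed)   // true

// The scalar sequences themselves differ.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))   // false
```
</pre>
<div class="">When both operands are already known to be normalized, the same answer can come from a plain byte comparison, which is the fast path described above.</div>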
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
This is very helpful, thank you. We've suspected that
full collation (with or without tailoring) would be too
expensive for use as a default comparison operator, so
it is good to hear that confirmed.<br class="">
<br class="">
I'm curious why this was a larger performance issue for
Linux than for (presumably) macOS and/or iOS.<br
class="">
<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>There were two main factors. The first is that on Darwin
platforms, CFString had an implementation that we used
instead of UCA+DUCET which was faster. The second is that
Darwin platforms are typically up-to-date and have very
recent versions of ICU. On Linux, we still support Ubuntu
LTS 14.04 which has a version of ICU which predates Swift
and didn’t have any fast-paths for ASCII or mostly-ASCII
text.</div>
<div><br class="">
</div>
<div>Switching to our own implementation based on NFC gave us a
many-X improvement over CFString, which in turn was many X
faster than UCA+DUCET (especially on older versions of ICU).</div>
</div>
</div>
</blockquote>
<br>
Thanks. My takeaway is that implementation quality matters; those
fast paths are important.<br>
<br>
<blockquote type="cite"
cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Swift
strings are not locale
sensitive. Was any
consideration given to
creation of a distinct
locale sensitive string
type?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>This is still up for debate and
hasn’t been settled yet, but we think it
makes a lot of sense. If an array of
strings is sorted, we certainly don’t
want a locale-change to violate
programmer invariants. A distinct type
from string could avoid a lot of common
errors here, including forgetting to
localize before presenting to a user as
part of a UI.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Swift
strings provide a count
property as required to
satisfy the Collection
protocol. How often do
programmers use count (the
number of EGCs in the
string) inappropriately?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>I’m not sure what would
constitute inappropriate usage here. We
do not currently provide access to the
underlying stored code units, though
this is a frequent request and we likely
will in the future. I haven’t seen
anyone baking in the assumption that
count is the same for String and across
all of Strings’s views (UTF-8, UTF-16,
Unicode scalars).<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
</div>
<div class="">One thing to consider is that as long as
String is not random-access, count will be a
worst-case O(N) operation. An inappropriate usage
might involve computing the length once per loop
iteration.</div>
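<div class="">A small Swift sketch of the pattern to prefer, assuming Swift 4 or later; the sample text and names are made up for illustration.</div>
<pre class="">
```swift
let text = "naive cafe"

// Hoist the count out of the loop: each call may walk the whole string.
let total = text.count

var position = 0
for character in text {
    position += 1
    // Comparing against the hoisted value keeps the loop linear overall.
    if position == total {
        print("last character:", character)
    }
}
```
</pre>
<div class="">Calling text.count inside the loop condition or body instead would repeat the walk on every iteration and make the loop quadratic in the worst case.</div>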
</blockquote>
<br class="">
In addition to the above and prior mention of algebraic
concerns, other potential abuses we had in mind were
using it to determine field widths for display or code
unit/point based storage.<br class="">
<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Display width is a whole other concern accounting for
rendering environment, font, etc. I don’t have expertise
here.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class=""> C++
container requirements specify that .size() be O(1).
For us to meet container requirements would require
computing and caching the count during construction and
mutation operations. We could potentially get by just
meeting range requirements though.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class="">I mentioned degenerate
graphemes breaking algebraic properties
of the Collection protocol, but this
hasn’t been a huge issue in practice so
far.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
Swift strings support
several memory unsafe
initializers and methods.
How frequently are these
used incorrectly?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Many of these initializers come
from NSString originally, and developers
migrating correct code to Swift maintain
that correctness. Rust has a similar
situation, though they do validation at
creation-time and from_utf8_unchecked()
voids memory-safety if the contents are
invalid.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">The
Swift manifesto discussed
three approaches to handling
substrings and Swift 4
changed from "same type,
shared storage" to
"different type, shared
storage". Any regrets?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Having two types can be a bit of
a pain, but we still think it was the
right thing to do. This is consistent
with Swift treating slices as a distinct
type from the base collection.<br
class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class=""><br
class="">
How often do you find
programmers doing work at
the EGC level that would be
better performed at the code
unit or code point level?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Often, if a developer has strict
requirements, they know what they’re
doing enough to operate at one of those
lower levels.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>Not being able to random-access
graphemes in a string is a common source
of frustration and confusion amongst new
users.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Likewise,
how often do you find
programmers working with
unicodeScalars, utf8, or
utf16 views to do work
better performed at the EGC
level? For what reasons
does this occur? Perhaps to
work around differences in
EGC boundaries across
Unicode versions or the
underlying version of ICU in
use?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>This was very prevalent in
Swift’s early days. String wasn’t a
collection of graphemes by default prior
to Swift 4,</div>
</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Well, it was. And then in Swift 2 or 3 it wasn't,
due to the algebraic reasoning issue. Now it is
again.</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""> so without guidance many
developers wrote code against the
unicode scalars view. We also didn’t
have any fast-paths for common-case
situations back then, which further
encouraged them to use one of the other
views.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>This is still done sometimes for
performance-sensitive usage, or someone
wanting to handle Unicode themselves.
However, as mentioned previously, we
don’t (yet) provide direct access to the
actual storage.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>We haven’t seen much desire for
reconciling behavior across Unicode
versions. This may be due to Swift being
primarily an applications level
programming language for devices which
only have one version of Unicode that’s
relevant (the current one).<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Has
consideration been given to
exposing Unicode character
database properties?
CharacterSet exposes some of
these properties, but have
more been requested?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<font class="" color="#8886ff"><br
class="">
</font>Yes, this was recently added to
the language: <a
href="https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md"
class="" moz-do-not-send="true">https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md</a>.
We surface much of the UCD via ICU.<br
class="">
</div>
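<div class="">A minimal Swift sketch of the scalar properties API from the linked proposal, assuming a toolchain that ships it (Swift 5 or later); only a few of the available properties are shown.</div>
<pre class="">
```swift
let scalar: Unicode.Scalar = "A"
let properties = scalar.properties

print(properties.isAlphabetic)                          // true
print(properties.generalCategory == .uppercaseLetter)   // true
print(properties.name ?? "unnamed")                     // LATIN CAPITAL LETTER A
```
</pre>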
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Ah, nice. All kinds of fun to be had with that :)<br
class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">How
firmly is the Swift string
implementation tied to ICU?
If the C++ standard library
were to add suitable Unicode
support, what would motivate
reimplementing Swift strings
on top of it?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Swift’s tie to ICU is less firm than it
used to be. We use ICU for the
following:<br class="">
<font class="" color="#8886ff"><br
class="">
</font>1. Grapheme breaking<br class="">
2. Normalization<br class="">
3. Accessing UCD properties<br class="">
4. Case conversion<br class="">
<font class="" color="#8886ff"><br
class="">
</font>None of these is too tightly
entwined with string; they’re
cordoned-off as a couple of shims called
on fallback slow-paths.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>If the C++ standard library
provided these operations, sufficiently
up-to-date with Unicode version and
comparable to or better than ICU in
performance, we would be willing to
switch. A big pain in interacting with
ICU is their limited support for UTF-8.
Some users would like to use a
“lighter-weight” Swift and are unhappy
at having to link against ICU, as it’s
fairly large and can complicate
security audits.<br class="">
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br class="">
Got it. Increasing the size of the C++ standard library
is a definite concern for us as well. We imagine some
C++ users would be similarly unhappy if their standard
library suddenly required linking against ICU.<br
class="">
<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>If you go the route of implementing Unicode operations
without ICU, would it be possible to separately link against
Unicode support without also pulling in all of libc++? If
your implementation is lighter-weight, yet current, it would
be very appealing for Swift to consider switching over.</div>
</div>
</div>
</blockquote>
<br>
It would be up to the implementation to determine how it is
packaged, but I suspect there will be sufficient motivation for
separating out the heavier parts. Whether those heavier parts could
then be used separately from the rest of the library I can't say. I
think this is something for us to keep in mind as a design point
though.<br>
<br>
Tom.<br>
<br>
<blockquote type="cite"
cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word;
-webkit-nbsp-mode: space; line-break:
after-white-space;" class="">
<div class="">
<div class="">
<div class=""><font class=""
color="#8886ff"><br class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">Do
Swift programmers tend to
prefer string interpolation
or string formatting
functions?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Users tend to prefer string
interpolation. However, Swift currently
does not have much in the way of
formatting control in interpolations,
and this is something we’re currently
working on.<br class="">
<font class="" color="#8886ff"><br
class="">
</font>
<blockquote type="cite" class="">
<div class="" style="word-wrap:
break-word; -webkit-nbsp-mode:
space; line-break:
after-white-space;">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div text="#000000"
bgcolor="#FFFFFF" class="">What
enhancements would you most
like to see in C++ to
improve Unicode support?</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Swift’s string is perhaps geared as a
higher-level construct than what you may
want for C++, and Swift has
Cocoa-interoperability concerns where
everything is UTF-16. Rust might provide
a closer model to what you’re looking
for:<br class="">
</div>
</div>
<div class=""><br class="">
</div>
<div class="">
<ul class="MailOutline">
<li class="">Strings are a sequence of
(valid) UTF-8 code units</li>
<ul class="">
<li class="">Validation is done on
creation</li>
<li class="">Invalid contents (e.g.
Windows file paths) can be handled
via something like WTF-8, which is
not intended for interchange</li>
</ul>
</ul>
</div>
<div class="">
<ul class="MailOutline">
<li class="">String provides
bidirectional iterators for:</li>
<ul class="">
<li class="">Transcoded and/or
normalized code units</li>
<li class="">Unicode scalar values
(their “character” type)</li>
<li class="">Grapheme clusters</li>
</ul>
</ul>
</div>
</div>
</div>
</div>
</blockquote>
<br class="">
</div>
<div class="">Michael, I think you're not answering
the question asked. They are asking what Swift
would want from C++, e.g., to allow us to decouple
from ICU. Wouldn't we like to be able to do that?</div>
</blockquote>
<br class="">
This question was intended to ask you, as expert C++
programmers independent of Swift, what additions to
C++ you think would be most helpful to improve our (very
lacking) Unicode support. So, Michael's response is on
point (thank you; we'll take a closer look at Rust), as
are any comments regarding what would benefit Swift
specifically. Michael's earlier comments regarding what
Swift currently uses ICU for are suggestive of what
Swift might want from C++. But I imagine the form in
which those features are provided would matter greatly;
devils and details.<br class="">
<br class="">
Tom.<br class="">
<br class="">
<blockquote type="cite"
cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
class="">
<div class=""><br class="">
</div>
<div class="">-Dave</div>
<div class=""><br class="">
</div>
<br class="">
</blockquote>
<p class=""><br class="">
</p>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
<p><br>
</p>
</body>
</html>