<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Aug 2, 2018, at 10:26 PM, Tom Honermann &lt;<a href="mailto:tom@honermann.net" class="">tom@honermann.net</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class="">
  
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
  
  <div text="#000000" bgcolor="#FFFFFF" class="">
    <div class="moz-cite-prefix">Thank you Michael and Dave!&nbsp; I
      appreciate the time and detail.&nbsp; All of your answers look to
      confirm our expectations, so I interpret this as a good sign we're
      thinking about the right things.<br class="">
      <br class="">
      I added a few inline comments/clarifications below.<br class="">
      <br class="">
      We had tentatively planned to meet Wednesday of next week, but it
      turns out that two of our core SG16 members are going to be on
      vacation so, at a minimum, I'd like to postpone.&nbsp; I'm also feeling
      pretty content with the responses that we got from you and I think
      it would suffice for us to just follow up with any remaining
      thoughts via email.&nbsp; While I'd love for any of you to attend one
      (or more) of our meetings (any time), I want to be sensitive to
      productive use of your time.&nbsp; So, how about we play it by ear for
      now?<br class="">
      <br class=""></div></div></div></blockquote><div><br class=""></div><div>I’d be happy to meet up sometime. JF mentioned an in-person meeting sometime this fall. Feel free to grab me whenever you think I can add value.</div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class=""><div class="moz-cite-prefix">
      On 08/02/2018 05:18 PM, Dave Abrahams wrote:<br class="">
    </div>
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
      <br class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">On Aug 1, 2018, at 12:04 PM, Michael Ilseman
            &lt;<a href="mailto:milseman@apple.com" class="" moz-do-not-send="true">milseman@apple.com</a>&gt; wrote:</div>
          <br class="Apple-interchange-newline">
          <div class="">
            <meta http-equiv="Content-Type" content="text/html;
              charset=utf-8" class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">Hello, I am the current maintainer of
                Swift’s String, and can speak to my thoughts on the
                status quo and future directions. Dave, who is on this
                thread, is much more familiar with the history behind
                this and can likely provide deeper insight into the
                reasoning.</div>
            </div>
          </div>
        </blockquote>
        <div class=""><br class="">
        </div>
        Michael has done very well here; I only have a few things to
        add.</div>
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">On Jul 23, 2018, at 7:39 PM, Tom
                          Honermann &lt;<a href="mailto:tom@honermann.net" class="" moz-do-not-send="true">tom@honermann.net</a>&gt;
                          wrote:<br class="">
                          <font class="" color="#00c8fa"><br class="">
                          </font>SG16 is seeking input from Swift and
                          WebKit representatives to help inform our work
                          towards enhancing support for Unicode in the
                          C++ standard.&nbsp; In particular, we recognize the
                          significant amount of effort that went into
                          the design of the Swift String type and would
                          like to better understand the motivations that
                          contributed to its current design and any
                          pressures that might encourage further
                          evolution or refinement; especially for any
                          concerns that would be deemed significant
                          enough to warrant backward incompatible
                          changes.<br class="">
                          Though most of these questions specifically
                          mention Swift, that is an artifact of our
                          being more familiar with Swift than the
                          internal workings of WebKit.&nbsp; Many of these
                          questions would be applicable to any string
                          type designed to support Unicode.&nbsp; We are
                          therefore also interested in hearing about the
                          string types used by WebKit, the motivations
                          that guided their design, and the trade-offs
                          that have been made.&nbsp; Of particular interest
                          would be the results of design decisions that
                          contrast with the design of Swift's String
                          type.<br class="">
                          Thank you in advance for any time and
                          expertise you are willing and able to share
                          with us.<br class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">The Swift string manifesto is
                                about 1 1/2 years old. What have you
                                learned since writing it?&nbsp; What would
                                you change?&nbsp; What have you changed?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>We haven’t really diverged from that
                    manifesto. Some things are still in progress, minor
                    details were tweaked, but the core arguments are
                    still relevant.</div>
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
                                Swift strings are extended grapheme
                                cluster (EGC) based.&nbsp; What have been the
                                best and worst consequences of this
                                choice?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>I’ll use “grapheme” casually to mean EGC.
                    Swift’s Character type represents a grapheme
                    cluster; Unicode.Scalar represents a Unicode scalar
                    value (non-surrogate code point).<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>Cocoa APIs are UTF-16 code unit oriented, and
                    thus there’s always caution (via documentation)
                    about making sure such indices align to grapheme
                    boundaries. This is a frequent source of bugs,
                    especially as part of internationalization. By
                    making Swift strings be grapheme-based by default,
                    developers first reach for the correct APIs.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>Another good consequence is that people
                    picking up Swift and playing with strings, e.g. in a
                    REPL or Playground, see Swift’s notion of characters
                    align with what is displayed. This includes complex
                    multi-component emoji such as family emoji
                    (👨‍👨‍👧‍👧), which is a single Character composed
                    of 7 Unicode.Scalars.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>This does have downsides. What is and is not
                    a grapheme cluster changes with each version of
                    Unicode, and thus grapheme breaking is inherently a
                    run-time concern and can’t be checked at compile
                    time. Another is that while code units can be
                    random-access, graphemes cannot, which is confusing
                    to developers used to UTF-16 code unit access mostly
                    working (until their users use non-BMP scalars or
                    emoji, that is). </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
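<div class="">As a rough illustration of the counts described above (a sketch, assuming a Swift 5 toolchain whose grapheme-breaking data covers emoji ZWJ sequences; the constant names are made up for the example), the family emoji can be spelled out scalar by scalar and measured through each view:</div>
<pre class="">
```swift
// The family emoji from the message above: four person scalars joined by
// three ZERO WIDTH JOINERs (U+200D).
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

let graphemeCount = family.count               // Characters (EGCs)
let scalarCount = family.unicodeScalars.count  // Unicode scalar values
let utf16Count = family.utf16.count            // UTF-16 code units
let utf8Count = family.utf8.count              // UTF-8 code units
print(graphemeCount, scalarCount, utf16Count, utf8Count)

// String has no integer subscript. Reaching the n-th Character means
// walking forward from a known index, which is linear in n.
let text = "na\u{EF}ve caf\u{E9}"
let third = text[text.index(text.startIndex, offsetBy: 2)]
print(third)
```
</pre>
<div class="">Only the Character count depends on the Unicode version in use; the scalar and code unit counts are fixed by the encoding.</div>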
        <div class=""><br class="">
        </div>
        <div class="">I'd say the biggest downside is that there are users who
          simply refuse to accept what we consider to be the fundamental
          non-random-access character of any efficient string
          representation. &nbsp;They are upset that they can't index a string
          directly with an integer, and can't be talked out of it. &nbsp;I
          still think we made the right decision in this regard; you'd
          have the same problem if your strings were
          unicode-scalar-based.</div>
      </div>
    </blockquote>
    <br class="">
    Are there common scenarios where programmers tend to be frustrated
    by lack of random access?&nbsp; Perhaps most often when they are working
    with inputs known to be ASCII only?&nbsp; Or is this mostly an education
    issue and these programmers are having a difficult time accepting
    that they've spent most of their career thus far writing bugs? :)<br class="">
    <br class=""></div></div></blockquote><div><br class=""></div><div>A lot of it is shaped by expectations coming from other languages, whose programming models do not prioritize operating on Unicode scalar values, let alone grapheme clusters. Objective-C’s default interface to strings is random access to UTF-16 code units, which “works” right up until you encounter an emoji or other scalar not in the BMP. It also “works” for graphemes right up until you encounter emoji, or a language you didn’t test, or non-NFC-normalized contents in a language you did test.</div><div><br class=""></div><div>This gets compounded by the prevalence of strings in teaching, interviews, programming puzzles, etc., where a string is treated like an array with a more visual representation.</div><div><br class=""></div><div>Also note that even for fully ASCII strings we cannot provide random access to grapheme clusters, as “\r\n” is a single grapheme cluster. For pretty much every Unicode-correct operation we provide fast-paths for, there are nasty corner cases that complicate the model.</div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
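<div class="">The CR-LF corner case is easy to check directly. A minimal sketch, assuming a Swift 5 toolchain (the constant names are illustrative only):</div>
<pre class="">
```swift
let crlf = "\r\n"
print(crlf.count)                 // number of Characters
print(crlf.unicodeScalars.count)  // number of Unicode scalars

// Every scalar below is ASCII, yet the Character count and the
// UTF-8 code unit count still disagree.
let lines = "a\r\nb"
let allASCII = lines.unicodeScalars.allSatisfy { $0.isASCII }
print(allASCII, lines.count, lines.utf8.count)
```
</pre>
<div class="">So an "ASCII only" flag by itself is not enough to turn a code unit offset into a Character offset.</div>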
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class="">Furthermore, few existing specifications
                    are phrased in terms of grapheme clusters, so something
                    like a validator wouldn’t want to run on
                    grapheme-segmented text, but at a lower abstraction
                    level.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>Also, graphemes can be funky. A string
                    containing only U+0301 (COMBINING ACUTE ACCENT) has
                    one grapheme, but modifies the prior grapheme upon
                    concatenation. Such degenerate graphemes violate
                    algebraic reasoning in these corner cases. </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
        <div class=""><br class="">
        </div>
        <div class="">We are not aware of generic algorithms that rely on
          concatenation of collections conserving element counts, so we
          decided to simply document this quirk rather than saying that
          string is-not-a collection.</div>
      </div>
    </blockquote>
    <br class="">
    SG16 has previously discussed cases like this and I'm happy to hear
    you haven't had to do anything special for it.&nbsp; This is a good
    example of why we asked about inappropriate use of the String count
    property: programmers assuming s1.count + s2.count ==
    (s1 + s2).count.<br class="">
    <br class="">
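<div class="">The non-additive count can be reproduced in a few lines. This is a sketch assuming a Swift 5 toolchain; it uses the + operator because String's append(_:) mutates in place and returns nothing:</div>
<pre class="">
```swift
let s1 = "e"
let s2 = "\u{301}"   // COMBINING ACUTE ACCENT on its own: a degenerate grapheme
let joined = s1 + s2

// Character counts are not conserved by concatenation...
print(s1.count, s2.count, joined.count)

// ...while Unicode scalar counts are.
let scalarsConserved =
    s1.unicodeScalars.count + s2.unicodeScalars.count == joined.unicodeScalars.count
print(scalarsConserved)
```
</pre>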
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class="">Unicode defines properties and most
                    operations on scalars or code points, and very
                    little on top of graphemes.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">When porting code unit or code
                                point based code to Swift strings (e.g.,
                                when rewriting Objective-C code, or
                                rewriting Swift code to use String
                                instead of NSString), has profiling
                                revealed performance regressions due to
                                the switch to EGC based processing?&nbsp; If
                                so, what action was taken to correct it?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>We have many fast-paths in grapheme-breaking
                    to identify common situations surrounding
                    single-scalar graphemes. If a developer wants to
                    work with Unicode at a lower level, String provides
                    a UTF8View, a UTF16View, and a UnicodeScalarView.
                    Those views lazily transcode/decode upon access.<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
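<div class="">For readers unfamiliar with the views mentioned above, here is a small sketch (assuming a Swift 5 toolchain; the constant names are invented for the example) that looks at one string at each level of abstraction:</div>
<pre class="">
```swift
let word = "cafe\u{301}"   // "café" with a decomposed final letter

let characters = Array(word)                        // [Character]
let scalars = word.unicodeScalars.map { $0.value }  // [UInt32]
let utf16Units = Array(word.utf16)                  // [UInt16]
let utf8Units = Array(word.utf8)                    // [UInt8]

print(characters.count, scalars.count, utf16Units.count, utf8Units.count)
```
</pre>
<div class="">The combining accent is one scalar and one UTF-16 code unit but two UTF-8 code units (0xCC 0x81), and it contributes no Character of its own.</div>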
    <br class="">
    Cool, it sounds like the answer to any such regressions was 1)
    optimization in terms of fast-paths, and 2) fall back to code
    unit/point processing otherwise.<br class="">
    <br class="">
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>There are also performance concerns and
                    annoyances when working with ICU, but this is an
                    implementation detail. If you’re interested in using
                    ICU, we can discuss further what has worked best for
                    us.<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
        <div class=""><br class="">
        </div>
        I think you're interested in (at least optionally) using ICU
        unless you have evidence of major investment in another
        open-source implementation of Unicode algorithms and tables.
        &nbsp;Otherwise, C++ implementors could not afford to develop
        standard libraries.</div>
    </blockquote>
    <br class="">
    Yes, definitely.&nbsp; For the foreseeable future, I think we need to
    ensure that any interfaces we propose can be reasonably implemented
    using ICU.&nbsp; However, Zach Laine has made impressive progress
    implementing many of the Unicode algorithms without use of ICU in
    his proposed Boost.Text library.&nbsp; See
    <a class="moz-txt-link-freetext" href="https://github.com/tzlaine/text">https://github.com/tzlaine/text</a> and
    <a class="moz-txt-link-freetext" href="https://tzlaine.github.io/text/doc/html/index.html">https://tzlaine.github.io/text/doc/html/index.html</a>.<br class="">
    <br class=""></div></div></blockquote><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class=""><blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class=""><div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
                                Swift strings do not enforce storage in
                                any particular Unicode normalization
                                form.&nbsp; Was consideration given to
                                forcing storage in a particular form
                                such as FCC or NFC?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>Swift strings now sort with NFC (currently
                    UTF-16 code unit order, but likely to change to
                    Unicode scalar value order). We didn’t find FCC
                    significantly more compelling in practice. Since NFC
                    is far more frequent in the wild (why waste space if
                    you don’t have to), strings are likely to already be
                    in NFC. We have fast-paths to detect on-the-fly
                    normal sections of strings (e.g. all ASCII, all &lt;
                    U+0300, NFC_QC=yes, etc.). We lazily normalize
                    portions of the string during comparison when needed.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>As far as enforcing on creation, no. We do
                    want to add an option to perform a linear scan to
                    set a performance flag, perhaps at creation, so that
                    comparison can take the memcmp-like fast-path.<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
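<div class="">One of the fast-paths listed above, "all scalars below U+0300", can be sketched as follows. Both functions are hypothetical helpers written for this illustration; they are not the standard library's implementation, and the slow path here simply defers to String's own ==:</div>
<pre class="">
```swift
// Hypothetical helper: true when every scalar is below U+0300, one of the
// cheap "already normal" checks mentioned in the message above.
func isTriviallyNFC(_ s: String) -> Bool {
    return s.unicodeScalars.allSatisfy { $0.value < 0x300 }
}

// Hypothetical comparison: plain code unit equality when both sides pass
// the cheap check, otherwise String's canonical-equivalence comparison.
func sketchEqual(_ a: String, _ b: String) -> Bool {
    if isTriviallyNFC(a), isTriviallyNFC(b) {
        return a.utf8.elementsEqual(b.utf8)
    }
    return a == b
}

print(sketchEqual("caf\u{E9}", "cafe\u{301}"))
```
</pre>
<div class="">A stored flag recording the result of the linear scan, as suggested above, would let the check be skipped on later comparisons.</div>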
    <br class="">
    Ok, my takeaway from this is that fast-pathing has been sufficient
    for lazy normalization (when needed) to not be (much of) a
    performance concern.&nbsp; At least, not enough to want to take the
    normalization cost on every string construction up front.<br class="">
    <br class="">
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">Swift strings support
                                comparison via normalization.&nbsp; Has use
                                of canonical string equality been a
                                performance issue?&nbsp; Or been a source of
                                surprise to programmers?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>This was a big performance issue on Linux,
                    where we used to do UCA+DUCET based comparisons. We
                    switched to lexicographical order of NFC-normalized
                    UTF-16 code units (future: scalar values), and saw a
                    very significant speedup there. The remaining
                    performance work revolves around checking and
                    tracking whether a string is known to already be in
                    a normal form, so we can just memcmp.<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
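<div class="">To make the comparison semantics concrete, here is a sketch assuming a Swift 5 toolchain (names are illustrative). It shows canonical equivalence under ==, and an ordering that follows normalized code units rather than a collation table:</div>
<pre class="">
```swift
let precomposed = "caf\u{E9}"    // U+00E9
let decomposed = "cafe\u{301}"   // U+0065 followed by U+0301

// Equal as Strings, although the stored scalars differ.
let equalAsStrings = precomposed == decomposed
let equalAsScalars =
    precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
print(equalAsStrings, equalAsScalars)

// Hashing agrees with ==, so both spellings collapse to one Set element.
let distinct = Set([precomposed, decomposed]).count
print(distinct)

// Not a collation: U+00E9 sorts after "z" because its value is larger.
let sorted = ["z", "\u{E9}", "a"].sorted()
print(sorted)
```
</pre>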
    <br class="">
    This is very helpful, thank you.&nbsp; We've suspected that full
    collation (with or without tailoring) would be too expensive for use
    as a default comparison operator, so it is good to hear that
    confirmed.<br class="">
    <br class="">
    I'm curious why this was a larger performance issue for Linux than
    for (presumably) macOS and/or iOS.<br class="">
    <br class=""></div></div></blockquote><div><br class=""></div><div>There were two main factors. The first is that on Darwin platforms, CFString had a faster implementation that we used instead of UCA+DUCET. The second is that Darwin platforms are typically up-to-date and have very recent versions of ICU. On Linux, we still support Ubuntu 14.04 LTS, whose version of ICU predates Swift and has no fast-paths for ASCII or mostly-ASCII text.</div><div><br class=""></div><div>Switching to our own implementation based on NFC gave us an improvement of many times over CFString, which in turn was many times faster than UCA+DUCET (especially on older versions of ICU).</div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">Swift strings are not locale
                                sensitive.&nbsp; Was any consideration given
                                to creation of a distinct locale
                                sensitive string type?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>This is still up for debate and hasn’t been
                    settled yet, but we think it makes a lot of sense.
                    If an array of strings is sorted, we certainly don’t
                    want a locale-change to violate programmer
                    invariants. A distinct type from string could avoid
                    a lot of common errors here, including forgetting to
                    localize before presenting to a user as part of a
                    UI.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">Swift strings provide a count
                                property as required to satisfy the
                                Collection protocol.&nbsp; How often do
                                programmers use count (the number of
                                EGCs in the string) inappropriately?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>I’m not sure what would constitute
                    inappropriate usage here. We do not currently
                    provide access to the underlying stored code units,
                    though this is a frequent request and we likely will
                    in the future. I haven’t seen anyone baking in the
                    assumption that count is the same for String and
                    across all of String’s views (UTF-8, UTF-16,
                    Unicode scalars).<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
        <div class=""><br class="">
        </div>
      </div>
      <div class="">One thing to consider is that as long as String is not
        random-access, count will be a worst-case O(N) operation. &nbsp;An
        inappropriate usage might involve computing the length once per
        loop iteration.</div>
    </blockquote>
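<div class="">A sketch of the pattern described above, assuming a Swift 5 toolchain (variable names are made up). The first loop re-evaluates count and re-walks the string on every iteration; the second visits the same Characters in a single pass:</div>
<pre class="">
```swift
let message = "re\u{301}sume\u{301}"

// Wasteful: message.count may walk the whole string each time the loop
// condition is evaluated, and each offset lookup walks it again.
var i = 0
var slow = [Character]()
while i < message.count {
    slow.append(message[message.index(message.startIndex, offsetBy: i)])
    i += 1
}

// Preferred: iterate the Characters directly, with no count and no offsets.
var fast = [Character]()
for character in message {
    fast.append(character)
}

print(slow == fast, fast.count)
```
</pre>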
    <br class="">
    In addition to the above and prior mention of algebraic concerns,
    other potential abuses we had in mind were using it to determine
    field widths for display or code unit/point based storage.<br class="">
    <br class=""></div></div></blockquote><div><br class=""></div><div>Display width is a whole other concern that depends on the rendering environment, font, etc. I don’t have expertise here.</div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
    C++ container requirements specify that .size() be O(1).&nbsp; For us to
    meet container requirements would require computing and caching the
    count during construction and mutation operations.&nbsp; We could
    potentially get by just meeting range requirements though.<br class="">
    <br class="">
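<div class="">The idea of caching the count on mutation could look roughly like this. CountedString is a hypothetical wrapper invented for the sketch, written in Swift only to keep the examples in one language; it is not an existing type in Swift or a proposed C++ interface:</div>
<pre class="">
```swift
// Hypothetical wrapper that keeps the Character count alongside the text
// so that reading count is constant time.
struct CountedString {
    private(set) var text: String
    private(set) var count: Int   // cached number of Characters

    init(_ text: String) {
        self.text = text
        self.count = text.count
    }

    mutating func append(_ other: String) {
        text += other
        // Recount instead of adding other.count: a leading combining mark
        // in `other` can merge with the last Character of `text`.
        count = text.count
    }
}

var counted = CountedString("e")
counted.append("\u{301}")
print(counted.count)
```
</pre>
<div class="">This naive version pays a full recount on every mutation. A real implementation would presumably recount only around the seam, which is an assumption here rather than something stated in the thread.</div>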
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class="">I mentioned degenerate graphemes
                    breaking algebraic properties of the Collection
                    protocol, but this hasn’t been a huge issue in
                    practice so far.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
                                Swift strings support several memory
                                unsafe initializers and methods.&nbsp; How
                                frequently are these used incorrectly?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>Many of these initializers come from NSString
                    originally, and developers migrating correct code to
                    Swift maintain that correctness. Rust has a similar
                    situation, though they do validation at
                    creation-time and from_utf8_unchecked() voids
                    memory-safety if the contents are invalid.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">The Swift manifesto discussed
                                three approaches to handling substrings
                                and Swift 4 changed from "same type,
                                shared storage" to "different type,
                                shared storage".&nbsp; Any regrets?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>Having two types can be a bit of a pain, but
                    we still think it was the right thing to do. This is
                    consistent with Swift treating slices as a distinct
                    type from the base collection.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class=""><br class="">
                                How often do you find programmers doing
                                work at the EGC level that would be
                                better performed at the code unit or
                                code point level?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>Often, if a developer has strict
                    requirements, they know enough about what they’re
                    doing to operate at one of those lower levels.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>Not being able to random-access graphemes in
                    a string is a common source of frustration and
                    confusion amongst new users.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">Likewise, how often do you find
                                programmers working with unicodeScalars,
                                utf8, or utf16 views to do work better
                                performed at the EGC level?&nbsp; For what
                                reasons does this occur?&nbsp; Perhaps to
                                work around differences in EGC
                                boundaries across Unicode versions or
                                the underlying version of ICU in use?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>This was very prevalent in Swift’s early
                    days. String wasn’t a collection of graphemes by
                    default prior to Swift 4,</div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
        <div class=""><br class="">
        </div>
        Well, it was. &nbsp;And then in Swift 2 or 3 it wasn't, due to the
        algebraic reasoning issue. &nbsp;Now it is again.</div>
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""> so without guidance many developers
                    wrote code against the unicode scalars view. We also
                    didn’t have any fast-paths for common-case
                    situations back then, which further encouraged them
                    to use one of the other views.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>This is still done sometimes for
                    performance-sensitive usage, or someone wanting to
                    handle Unicode themselves. However, as mentioned
                    previously, we don’t (yet) provide direct access to
                    the actual storage.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>We haven’t seen much desire for reconciling
                    behavior across Unicode versions. This may be because
                    Swift is primarily an application-level
                    programming language for devices on which only one
                    version of Unicode is relevant (the current
                    one).<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">Has consideration been given to
                                exposing Unicode character database
                                properties? CharacterSet exposes some of
                                these properties, but have more been
                                requested?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <font class="" color="#8886ff"><br class="">
                    </font>Yes, this was recently added to the
                    language:&nbsp;<a href="https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md" class="" moz-do-not-send="true">https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md</a>.
                    We surface much of the UCD via ICU.<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <br class="">
    Ah, nice.&nbsp; All kinds of fun to be had with that :)<br class="">
    <br class="">
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">How firmly is the Swift string
                                implementation tied to ICU?&nbsp; If the C++
                                standard library were to add suitable
                                Unicode support, what would motivate
                                reimplementing Swift strings on top of
                                it?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    Swift’s tie to ICU is less firm than it used to be.
                    We use ICU for the following:<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>1. Grapheme breaking<br class="">
                    2. Normalization<br class="">
                    3. Accessing UCD properties<br class="">
                    4. Case conversion<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>None of these is too tightly entwined
                    with string; they’re cordoned off as a couple of
                    shims called on fallback slow-paths.<br class="">
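                    As a rough illustration of that fast-path/slow-path split, here is a C++17 sketch for grapheme counting. The slow_path_fn type and count_graphemes function are hypothetical names, not Swift's actual shims; the slow path stands in for whatever full segmentation is available (ICU or otherwise).<br class="">

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical shim signature: a full Unicode grapheme counter supplied by
// the caller (for example, one backed by ICU).
using slow_path_fn = std::function<std::size_t(std::string_view)>;

// Counts grapheme clusters in UTF-8 text.
// Fast path: in pure-ASCII text every byte is its own cluster, except that
// CR followed by LF forms a single cluster.
std::size_t count_graphemes(std::string_view utf8, const slow_path_fn& slow_path) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c >= 0x80) {
            // Non-ASCII byte: abandon the fast path and hand the whole
            // string to the shim.
            assert(slow_path);
            return slow_path(utf8);
        }
        // CR LF is one cluster: step over the LF so it is not counted twice.
        if (c == '\r' && i + 1 < utf8.size() && utf8[i + 1] == '\n') ++i;
        ++count;
    }
    return count;
}
```

                    The point of the shape is that the Unicode library is only reached when non-ASCII content is present, so swapping the shim for another implementation would not touch the common case.<br class="">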
                    <font class="" color="#8886ff"><br class="">
                    </font>If the C++ standard library provided these
                    operations, sufficiently up-to-date with the Unicode
                    version and comparable to or better than ICU in
                    performance, we would be willing to switch. A big
                    pain in interacting with ICU is its limited
                    support for UTF-8. Some users would like to use
                    a “lighter-weight” Swift and are unhappy at having
                    to link against ICU, as it’s fairly large and
                    can complicate security audits.<br class="">
                  </div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <br class="">
    Got it.&nbsp; Increasing the size of the C++ standard library is a
    definite concern for us as well.&nbsp; We imagine some C++ users would be
    similarly unhappy if their standard library suddenly required
    linking against ICU.<br class="">
    <br class=""></div></div></blockquote><div><br class=""></div><div>If you go the route of implementing Unicode operations without ICU, would it be possible to separately link against Unicode support without also pulling in all of libc++? If your implementation is lighter-weight, yet current, it would be very appealing for Swift to consider switching over.</div><br class=""><blockquote type="cite" class=""><div class=""><div text="#000000" bgcolor="#FFFFFF" class="">
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class="">
        <blockquote type="cite" class="">
          <div class="">
            <div style="word-wrap: break-word; -webkit-nbsp-mode: space;
              line-break: after-white-space;" class="">
              <div class="">
                <div class="">
                  <div class=""><font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">Do Swift programmers tend to
                                prefer string interpolation or string
                                formatting functions?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    Users tend to prefer string interpolation. However,
                    Swift does not yet offer much formatting control
                    in interpolations; this is
                    something we’re working on.<br class="">
                    <font class="" color="#8886ff"><br class="">
                    </font>
                    <blockquote type="cite" class="">
                      <div class="" style="word-wrap: break-word;
                        -webkit-nbsp-mode: space; line-break:
                        after-white-space;">
                        <div class="">
                          <blockquote type="cite" class="">
                            <div class="">
                              <div text="#000000" bgcolor="#FFFFFF" class="">What enhancements would you
                                most like to see in C++ to improve
                                Unicode support?</div>
                            </div>
                          </blockquote>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    Swift’s string is perhaps geared as a higher-level
                    construct than what you may want for C++, and Swift
                    has Cocoa-interoperability concerns where everything
                    is UTF-16. Rust might provide a closer model to what
                    you’re looking for:<br class="">
                  </div>
                </div>
                <div class=""><br class="">
                </div>
                <div class="">
                  <ul class="MailOutline">
                    <li class="">Strings are a sequence of (valid) UTF-8
                      code units</li>
                    <ul class="">
                      <li class="">Validation is done on creation</li>
                      <li class="">Invalid contents (e.g. Windows file
                        paths) can be handled via something like WTF-8,
                        which is not intended for interchange</li>
                    </ul>
                  </ul>
                </div>
                <div class="">
                  <ul class="MailOutline">
                    <li class="">String provides bidirectional iterators
                      for:</li>
                    <ul class="">
                      <li class="">Transcoded and/or normalized code
                        units</li>
                      <li class="">Unicode scalar values (their
                        “character” type)</li>
                      <li class="">Grapheme clusters</li>
                    </ul>
                  </ul>
                </div>
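                <div class="">To make the "validate on creation, then iterate" model above concrete, here is a C++17 sketch. The utf8_text and decode_one names are hypothetical and this is not Rust's implementation; it only does a forward walk over scalar values, whereas a real design would expose bidirectional iterators and a grapheme view.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Decodes one scalar value starting at s[i] and advances i on success.
// Returns nullopt for malformed input: invalid lead bytes, truncated or bad
// continuation bytes, overlong forms, surrogates, values above U+10FFFF.
inline std::optional<char32_t> decode_one(std::string_view s, std::size_t& i) {
    if (i >= s.size()) return std::nullopt;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t cp, lowest;  // lowest = smallest value allowed for this length
    if (b0 < 0x80)                { len = 1; cp = b0;        lowest = 0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
    else return std::nullopt;                      // stray continuation or bad lead byte
    if (len > s.size() - i) return std::nullopt;   // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return std::nullopt;
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < lowest) return std::nullopt;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
    if (cp > 0x10FFFF) return std::nullopt;                 // out of range
    i += len;
    return cp;
}

class utf8_text {
public:
    // Validation happens once, here; other members may assume valid contents.
    static std::optional<utf8_text> make(std::string bytes) {
        for (std::size_t i = 0; i < bytes.size();) {
            if (!decode_one(bytes, i)) return std::nullopt;
        }
        return utf8_text(std::move(bytes));
    }

    // Forward walk over the scalar values, collected into a vector for brevity.
    std::vector<char32_t> scalars() const {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < bytes_.size();) {
            const std::optional<char32_t> cp = decode_one(bytes_, i);
            assert(cp.has_value());  // cannot fail: validated in make()
            out.push_back(*cp);
        }
        return out;
    }

    const std::string& code_units() const noexcept { return bytes_; }

private:
    explicit utf8_text(std::string bytes) : bytes_(std::move(bytes)) {}
    std::string bytes_;
};
```

                <div class="">Because the only way to obtain a utf8_text is through make(), the views never need to re-check well-formedness. An unchecked constructor would trade that guarantee for speed, which is the from_utf8_unchecked() situation mentioned earlier.</div>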
              </div>
            </div>
          </div>
        </blockquote>
        <br class="">
      </div>
      <div class="">Michael, I think you're not answering the question asked.
        &nbsp;They are asking what Swift would want from C++, e.g., to allow
        us to decouple from ICU. &nbsp;Wouldn't we like to be able to do
        that?</div>
    </blockquote>
    <br class="">
    This question was intended to ask you, as expert C++ programmers
    independent of Swift, what additions to C++ you think would be
    most helpful to improve our (very lacking) Unicode support.&nbsp; So,
    Michael's response is on point (thank you; we'll take a closer look
    at Rust), as are any comments regarding what would benefit Swift
    specifically.&nbsp; Michael's earlier comments regarding what Swift
    currently uses ICU for are suggestive of what Swift might want from
    C++.&nbsp; But I imagine the form in which those features are provided
    would matter greatly; devils and details.<br class="">
    <br class="">
    Tom.<br class="">
    <br class="">
    <blockquote type="cite" cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com" class="">
      <div class=""><br class="">
      </div>
      <div class="">-Dave</div>
      <div class=""><br class="">
      </div>
      <br class="">
    </blockquote><p class=""><br class="">
    </p>
  </div>

</div></blockquote></div><br class=""></div></body></html>