<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 08/03/2018 12:53 PM, Michael Ilseman
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
      <div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
        space; line-break: after-white-space;" class=""><br class="">
        <div><br class="">
          <blockquote type="cite" class="">
            <div class="">On Aug 2, 2018, at 10:26 PM, Tom Honermann
              &lt;<a href="mailto:tom@honermann.net" class=""
                moz-do-not-send="true">tom@honermann.net</a>&gt; wrote:</div>
            <br class="Apple-interchange-newline">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class="">
                <div class="moz-cite-prefix">Thank you Michael and
                  Dave!  I appreciate the time and detail.  All of your
                  answers look to confirm our expectations, so I
                  interpret this as a good sign we're thinking about the
                  right things.<br class="">
                  <br class="">
                  I added a few inline comments/clarifications below.<br
                    class="">
                  <br class="">
                  We had tentatively planned to meet Wednesday of next
                  week, but it turns out that two of our core SG16
                  members are going to be on vacation so, at a minimum,
                  I'd like to postpone.  I'm also feeling pretty content
                  with the responses that we got from you and I think it
                  would suffice for us to just follow up with any
                  remaining thoughts via email.  While I'd love for any
                  of you to attend one (or more) of our meetings (any
                  time), I want to be sensitive to productive use of
                  your time.  So, how about we play it by ear for now?<br
                    class="">
                  <br class="">
                </div>
              </div>
            </div>
          </blockquote>
          <div><br class="">
          </div>
          <div>I’d be happy to meet up sometime. JF mentioned an
            in-person meeting sometime this fall. Feel free to grab me
            whenever you think I can add value.</div>
          <br class="">
          <blockquote type="cite" class="">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class="">
                <div class="moz-cite-prefix"> On 08/02/2018 05:18 PM,
                  Dave Abrahams wrote:<br class="">
                </div>
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <br class="">
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">On Aug 1, 2018, at 12:04 PM, Michael
                        Ilseman &lt;<a href="mailto:milseman@apple.com"
                          class="" moz-do-not-send="true">milseman@apple.com</a>&gt;
                        wrote:</div>
                      <br class="Apple-interchange-newline">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">Hello, I am the current
                            maintainer of Swift’s String, and can speak
                            to my thoughts on the status quo and future
                            directions. Dave, who is on this thread, is
                            much more familiar with the history behind
                            this and can likely provide deeper insight
                            into the reasoning.</div>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    Michael has done very well here; I only have a few
                    things to add.</div>
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">On Jul 23, 2018, at
                                      7:39 PM, Tom Honermann &lt;<a
                                        href="mailto:tom@honermann.net"
                                        class="" moz-do-not-send="true">tom@honermann.net</a>&gt;
                                      wrote:<br class="">
                                      <font class="" color="#00c8fa"><br
                                          class="">
                                      </font>SG16 is seeking input from
                                      Swift and WebKit representatives
                                      to help inform our work towards
                                      enhancing support for Unicode in
                                      the C++ standard.  In particular,
                                      we recognize the significant
                                      amount of effort that went into
                                      the design of the Swift String
                                      type and would like to better
                                      understand the motivations that
                                      contributed to its current design
                                      and any pressures that might
                                      encourage further evolution or
                                      refinement; especially for any
                                      concerns that would be deemed
                                      significant enough to warrant
                                      backward incompatible changes.<br
                                        class="">
                                      Though most of these questions
                                      specifically mention Swift, that
                                      is an artifact of our being more
                                      familiar with Swift than the
                                      internal workings of WebKit.  Many
                                      of these questions would be
                                      applicable to any string type
                                      designed to support Unicode.  We
                                      are therefore also interested in
                                      hearing about the string types
                                      used by WebKit, the motivations
                                      that guided their design, and the
                                      trade-offs that have been made.
                                      Of particular interest would be
                                      the results of design decisions
                                      that contrast with the design
                                      of Swift's String type.<br
                                        class="">
                                      Thank you in advance for any time
                                      and expertise you are willing and
                                      able to share with us.<br class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">The
                                            Swift string manifesto is
                                            about 1 1/2 years old. What
                                            have you learned since
                                            writing it?  What would you
                                            change?  What have you
                                            changed?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>We haven’t really diverged from
                                that manifesto. Some things are still in
                                progress, minor details were tweaked,
                                but the core arguments are still
                                relevant.</div>
                              <div class=""><br class="">
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class=""><br
                                              class="">
                                            Swift strings are extended
                                            grapheme cluster (EGC)
                                            based.  What have been the
                                            best and worst consequences
                                            of this choice?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>I’ll use “grapheme” casually to
                                mean EGC. Swift’s Character type
                                represents a grapheme cluster,
                                Unicode.Scalar represents a Unicode
                                scalar value (non-surrogate code point).<br
                                  class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Cocoa APIs are UTF-16 code unit
                                oriented, and thus there’s always
                                caution (via documentation) about making
                                sure such indices align to grapheme
                                boundaries. This is a frequent source of
                                bugs, especially as part of
                                internationalization. By making Swift
                                strings be grapheme-based by default,
                                developers first reach for the correct
                                APIs.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Another good consequence is that
                                people picking up Swift and playing with
                                strings, e.g. in a REPL or Playground,
                                see Swift’s notion of characters align
                                with what is displayed. This includes
                                complex multi-component emoji such as
                                family emoji (👨‍👨‍👧‍👧), which is a
                                single Character composed of 7
                                Unicode.Scalars.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>This does have downsides. What is
                                and is not a grapheme cluster changes
                                with each version of Unicode, and thus
                                grapheme breaking is inherently a
                                run-time concern and can’t be checked at
                                compile time. Another is that while code
                                units can be random-access, graphemes
                                cannot, which is confusing to developers
                                used to UTF-16 code unit access mostly
                                working (until their users use non-BMP
                                scalars or emoji, that is).</div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    <div class="">I'd say the biggest downside is that
                      there are users who simply refuse to accept what
                      we consider to be the fundamental
                      non-random-access character of any efficient
                      string representation.  They are upset that they
                      can't index a string directly with an integer, and
                      can't be talked out of it.  I still think we made
                      the right decision in this regard; you'd have the
                      same problem if your strings were
                      unicode-scalar-based.</div>
                  </div>
                </blockquote>
                <br class="">
                Are there common scenarios where programmers tend to be
                frustrated by lack of random access?  Perhaps most often
                when they are working with inputs known to be ASCII
                only?  Or is this mostly an education issue and these
                programmers are having a difficult time accepting that
                they've spent most of their career thus far writing
                bugs? :)<br class="">
                <br class="">
              </div>
            </div>
          </blockquote>
          <div><br class="">
          </div>
          <div>A lot of it is shaped by expectations coming from other
            languages, whose programming models do not prioritize
            operating on Unicode scalar values, let alone grapheme
            clusters. Objective-C’s default interface with Strings is
            random-access to UTF-16 code units, which “works” right up
            until you encounter an emoji or other scalar not on the BMP.
            It also “works” for graphemes right up until you encounter
            emoji, a language you didn’t test, or non-NFC-normalized
            content in a language you did test.</div>
          <div><br class="">
          </div>
          <div>This gets compounded by the prevalence of strings in
            teaching, interviews, programming puzzles, etc., where a
            string is treated like an array with a more visual
            representation.</div>
          <div><br class="">
          </div>
          <div>Also note that even for fully ASCII strings we cannot
            provide random access to grapheme clusters, as “\r\n” is a
            single grapheme cluster. For pretty much every
            Unicode-correct operation we provide fast-paths for, there
            are nasty corner cases that complicate the model.</div>
        </div>
      </div>
    </blockquote>
    <br>
    Thanks, I had not considered the "\r\n" case.  Alas, there are no
    easy cases.<br>
    <br>
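    For reference, a minimal Swift sketch of that case and of the family
    emoji mentioned earlier. The variable names are illustrative, and the
    grapheme counts assume a recent toolchain, since grapheme breaking
    follows the Unicode version in use:<br>
    <pre>let crlf = "\r\n"
print(crlf.count)                  // 1: CR LF is a single Character
print(crlf.unicodeScalars.count)   // 2

// MAN, ZWJ, MAN, ZWJ, GIRL, ZWJ, GIRL
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"
print(family.count)                // 1 on recent toolchains
print(family.unicodeScalars.count) // 7
</pre>
    <br>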
    <blockquote type="cite"
      cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
      <div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
        space; line-break: after-white-space;" class="">
        <div><br class="">
          <blockquote type="cite" class="">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class="">Furthermore, few existing
                                specifications are phrased in terms
                                grapheme-clusters, so something like a
                                validator wouldn’t want to run on
                                grapheme-segmented text, but a lower
                                abstraction level.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Also, graphemes can be funky. A
                                string containing only U+0301
                                (COMBINING ACUTE ACCENT) has one
                                grapheme, but modifies the prior
                                grapheme upon concatenation. Such
                                degenerate graphemes violate algebraic
                                reasoning in these corner cases. </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    <div class="">We are not aware of generic algorithms
                      that rely on concatenation of collections
                      conserving element counts, so we decided to simply
                      document this quirk rather than saying that string
                      is-not-a collection.</div>
                  </div>
                </blockquote>
                <br class="">
                SG16 has previously discussed cases like this and I'm
                happy to hear you haven't had to do anything special for
                it.  This is a good example of why we asked about
                inappropriate use of the String count property:
                programmers assuming s1.count + s2.count ==
                (s1 + s2).count.<br class="">
                <br class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class="">Unicode defines properties
                                and most operations on scalars or code
                                points, and very little on top of
                                graphemes.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">When
                                            porting code unit or code
                                            point based code to Swift
                                            strings (e.g., when
                                            rewriting Objective-C code,
                                            or rewriting Swift code to
                                            use String instead of
                                            NSString), has profiling
                                            revealed performance
                                            regressions due to the
                                            switch to EGC based
                                            processing?  If so, what
                                            action was taken to correct
                                            it?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>We have many fast-paths in
                                grapheme-breaking to identify common
                                situations surrounding single-scalar
                                graphemes. If a developer wants to work
                                with Unicode at a lower level, String
                                provides a UTF8View, a UTF16View, and a
                                UnicodeScalarView. Those views lazily
                                transcode/decode upon access.<br
                                  class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
                <br class="">
                Cool, it sounds like the answer to any such regressions
                was 1) optimization in terms of fast-paths, and 2) fall
                back to code unit/point processing otherwise.<br
                  class="">
                <br class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>There are also performance
                                concerns and annoyances when working
                                with ICU, but this is an implementation
                                detail. If you’re interested in using
                                ICU, we can discuss further what has
                                worked best for us.<br class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    I think you're interested in (at least optionally)
                    using ICU unless you have evidence of major
                    investment in another open-source implementation of
                    Unicode algorithms and tables.  Otherwise, C++
                    implementors could not afford to develop standard
                    libraries.</div>
                </blockquote>
                <br class="">
                Yes, definitely.  For the foreseeable future, I think we
                need to ensure that any interfaces we propose can be
                reasonably implemented using ICU.  However, Zach Laine
                has made impressive progress implementing many of the
                Unicode algorithms without use of ICU in his proposed
                Boost.Text library.  See <a
                  class="moz-txt-link-freetext"
                  href="https://github.com/tzlaine/text"
                  moz-do-not-send="true">https://github.com/tzlaine/text</a>
                and <a class="moz-txt-link-freetext"
                  href="https://tzlaine.github.io/text/doc/html/index.html"
                  moz-do-not-send="true">https://tzlaine.github.io/text/doc/html/index.html</a>.<br
                  class="">
                <br class="">
              </div>
            </div>
          </blockquote>
          <blockquote type="cite" class="">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class=""><br
                                              class="">
                                            Swift strings do not enforce
                                            storage in any particular
                                            Unicode normalization form. 
                                            Was consideration given to
                                            forcing storage in a
                                            particular form such as FCC
                                            or NFC?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Swift strings now sort with NFC
                                (currently UTF-16 code unit order, but
                                likely to change to Unicode scalar value
                                order). We didn’t find FCC significantly
                                more compelling in practice. Since NFC
                                is far more frequent in the wild (why
                                waste space if you don’t have to),
                                strings are likely to already be in NFC.
                                We have fast-paths to detect, on the fly,
                                already-normal sections of strings (e.g. all
                                ASCII, all &lt; U+0300, NFC_QC=yes,
                                etc.). We lazily normalize portions of
                                strings during comparison when needed.<br
                                  class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>As far as enforcing on creation,
                                no. We do want to add an option to
                                perform a linear scan to set a
                                performance flag, perhaps at creation,
                                so that comparison can take the
                                memcmp-like fast-path.<br class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
                <br class="">
                Ok, my takeaway from this is that fast-pathing has been
                sufficient for lazy normalization (when needed) to not
                be (much of) a performance concern.  At least, not
                enough to want to take the normalization cost on every
                string construction up front.<br class="">
                <br class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">Swift
                                            strings support comparison
                                            via normalization.  Has use
                                            of canonical string equality
                                            been a performance issue? 
                                            Or been a source of surprise
                                            to programmers?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>This was a big performance issue
                                on Linux, where we used to do UCA+DUCET
                                based comparisons. We switched to
                                lexicographical order of NFC-normalized
                                UTF-16 code units (future: scalar
                                values), and saw a very significant
                                speed up there. The remaining
                                performance work revolves around
                                checking and tracking whether a string
                                is known to already be in a normal form,
                                so we can just memcmp.<br class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
                <br class="">
                This is very helpful, thank you.  We've suspected that
                full collation (with or without tailoring) would be too
                expensive for use as a default comparison operator, so
                it is good to hear that confirmed.<br class="">
                <br class="">
                I'm curious why this was a larger performance issue for
                Linux than for (presumably) macOS and/or iOS.<br
                  class="">
                <br class="">
              </div>
            </div>
          </blockquote>
          <div><br class="">
          </div>
          <div>There were two main factors. The first is that on Darwin
            platforms, CFString had an implementation that we used
            instead of UCA+DUCET which was faster. The second is that
            Darwin platforms are typically up-to-date and have very
            recent versions of ICU. On Linux, we still support Ubuntu
            LTS 14.04, which has a version of ICU that predates Swift
            and didn’t have any fast-paths for ASCII or mostly-ASCII
            text.</div>
          <div><br class="">
          </div>
          <div>Switching to our own implementation based on NFC gave us
            a many-fold improvement over CFString, which in turn was many
            times faster than UCA+DUCET (especially on older versions of ICU).</div>
        </div>
      </div>
    </blockquote>
    <br>
    Thanks.  My takeaway is that implementation quality matters; those
    fast paths are important.<br>
    <br>
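    As a rough illustration of the kind of fast path discussed above,
    here is a minimal C++ sketch, not Swift's actual implementation.  The
    names is_all_ascii, SlowCompare, and canonically_equal are
    hypothetical, and the normalizing slow path is left to a
    caller-supplied function because a real one needs Unicode data
    tables.  It relies only on the fact that ASCII-only text is already
    in NFC.<br>
    <br>

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string_view>

// Hypothetical helper: true when every byte is ASCII. ASCII-only text is
// already in NFC, so comparing code units gives the canonical answer.
bool is_all_ascii(std::string_view s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return c < 0x80; });
}

// Hypothetical slow path: normalizes both inputs and compares them.
// Supplied by the caller; this sketch does not implement normalization.
using SlowCompare = std::function<bool(std::string_view, std::string_view)>;

bool canonically_equal(std::string_view a, std::string_view b,
                       const SlowCompare& normalize_and_compare) {
    assert(normalize_and_compare && "a slow-path comparator is required");
    if (is_all_ascii(a) && is_all_ascii(b)) {
        return a == b;  // fast path: memcmp-like code unit comparison
    }
    // If only one side is ASCII we still cannot answer "false" early:
    // U+212A KELVIN SIGN, for example, is canonically equivalent to 'K'.
    return normalize_and_compare(a, b);
}
```

    The linear ASCII scan could instead be done once, for example at
    construction, and remembered as a flag so that repeated comparisons
    skip it; that is the "performance flag" idea mentioned earlier in
    the thread.<br>
    <br>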
    <blockquote type="cite"
      cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
      <div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
        space; line-break: after-white-space;" class="">
        <div><br class="">
          <blockquote type="cite" class="">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">Swift
                                            strings are not locale
                                            sensitive.  Was any
                                            consideration given to
                                            creation of a distinct
                                            locale sensitive string
                                            type?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>This is still up for debate and
                                hasn’t been settled yet, but we think it
                                makes a lot of sense. If an array of
                                strings is sorted, we certainly don’t
                                want a locale-change to violate
                                programmer invariants. A distinct type
                                from string could avoid a lot of common
                                errors here, including forgetting to
                                localize before presenting to a user as
                                part of a UI.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">Swift
                                            strings provide a count
                                            property as required to
                                            satisfy the Collection
                                            protocol.  How often do
                                            programmers use count (the
                                            number of EGCs in the
                                            string) inappropriately?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>I’m not sure what would
                                constitute inappropriate usage here. We
                                do not currently provide access to the
                                underlying stored code units, though
                                this is a frequent request and we likely
                                will in the future. I haven’t seen
                                anyone baking in the assumption that
                                count is the same for String and across
                                all of String’s views (UTF-8, UTF-16,
                                Unicode scalars).<br class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                  </div>
                  <div class="">One thing to consider is that as long as
                    String is not random-access, count will be a
                    worst-case O(N) operation.  An inappropriate usage
                    might involve computing the length once per loop
                    iteration.</div>
                </blockquote>
                <br class="">
                In addition to the above and prior mention of algebraic
                concerns, other potential abuses we had in mind were
                using it to determine field widths for display or code
                unit/point based storage.<br class="">
                <br class="">
              </div>
            </div>
          </blockquote>
          <div><br class="">
          </div>
          <div>Display width is a whole other concern accounting for
            rendering environment, font, etc. I don’t have expertise
            here.</div>
          <br class="">
          <blockquote type="cite" class="">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class=""> C++
                container requirements specify that .size() be O(1). 
                For us to meet container requirements would require
                computing and caching the count during construction and
                mutation operations.  We could potentially get by just
                meeting range requirements though.<br class="">
                <br class="">
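                To make the caching idea concrete, here is a minimal
                C++ sketch under stated assumptions.  CountedText and
                count_code_points are hypothetical names, the input is
                assumed to be well-formed UTF-8, and the sketch counts
                code points rather than EGCs because grapheme
                segmentation needs Unicode data tables.<br class="">
                <br class="">

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical wrapper that keeps the count up to date so size() is O(1).
class CountedText {
public:
    CountedText() = default;
    explicit CountedText(std::string utf8)
        : bytes_(std::move(utf8)), count_(count_code_points(bytes_)) {}

    // Pays O(length of the appended piece) at mutation time.
    void append(std::string_view utf8) {
        bytes_.append(utf8);
        count_ += count_code_points(utf8);
        // Debug-only invariant check: the cache matches a full recount.
        assert(count_ == count_code_points(bytes_));
    }

    std::size_t size() const noexcept { return count_; }  // O(1)
    const std::string& bytes() const noexcept { return bytes_; }

private:
    std::string bytes_;
    std::size_t count_ = 0;
};
```

                Code point counts are additive under concatenation,
                which keeps append simple here.  EGC counts are not: a
                combining mark at the start of the appended text joins
                the preceding cluster, so a real implementation would
                have to re-examine the boundary.<br class="">
                <br class="">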
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class="">I mentioned degenerate
                                graphemes breaking algebraic properties
                                of the Collection protocol, but this
                                hasn’t been a huge issue in practice so
                                far.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class=""><br
                                              class="">
                                            Swift strings support
                                            several memory unsafe
                                            initializers and methods. 
                                            How frequently are these
                                            used incorrectly?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Many of these initializers come
                                from NSString originally, and developers
                                migrating correct code to Swift maintain
                                that correctness. Rust has a similar
                                situation, though they do validation at
                                creation-time and from_utf8_unchecked()
                                voids memory-safety if the contents are
                                invalid.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">The
                                            Swift manifesto discussed
                                            three approaches to handling
                                            substrings and Swift 4
                                            changed from "same type,
                                            shared storage" to
                                            "different type, shared
                                            storage".  Any regrets?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Having two types can be a bit of
                                a pain, but we still think it was the
                                right thing to do. This is consistent
                                with Swift treating slices as a distinct
                                type from the base collection.<br
                                  class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class=""><br
                                              class="">
                                            How often do you find
                                            programmers doing work at
                                            the EGC level that would be
                                            better performed at the code
                                            unit or code point level?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Often, if a developer has strict
                                requirements, they know what they’re
                                doing enough to operate at one of those
                                lower levels.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Not being able to random-access
                                graphemes in a string is a common source
                                of frustration and confusion amongst new
                                users.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">Likewise,
                                            how often do you find
                                            programmers working with
                                            unicodeScalars, utf8, or
                                            utf16 views to do work
                                            better performed at the EGC
                                            level?  For what reasons
                                            does this occur?  Perhaps to
                                            work around differences in
                                            EGC boundaries across
                                            Unicode versions or the
                                            underlying version of ICU in
                                            use?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>This was very prevalent in
                                Swift’s early days. String wasn’t a
                                collection of graphemes by default prior
                                to Swift 4,</div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <div class=""><br class="">
                    </div>
                    Well, it was.  And then in Swift 2 or 3 it wasn't,
                    due to the algebraic reasoning issue.  Now it is
                    again.</div>
                  <div class=""><br class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""> so without guidance many
                                developers wrote code against the
                                unicode scalars view. We also didn’t
                                have any fast-paths for common-case
                                situations back then, which further
                                encouraged them to use one of the other
                                views.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>This is still done sometimes for
                                performance-sensitive usage, or someone
                                wanting to handle Unicode themselves.
                                However, as mentioned previously, we
                                don’t (yet) provide direct access to the
                                actual storage.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>We haven’t seen much desire for
                                reconciling behavior across Unicode
                                versions. This may be due to Swift being
                                primarily an application-level
                                programming language for devices which
                                only have one version of Unicode that’s
                                relevant (the current one).<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">Has
                                            consideration been given to
                                            exposing Unicode character
                                            database properties?
                                            CharacterSet exposes some of
                                            these properties, but have
                                            more been requested?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>Yes, this was recently added to
                                the language: <a
href="https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md"
                                  class="" moz-do-not-send="true">https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md</a>.
                                We surface much of the UCD via ICU.<br
                                  class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
                <br class="">
                Ah, nice.  All kinds of fun to be had with that :)<br
                  class="">
                <br class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">How
                                            firmly is the Swift string
                                            implementation tied to ICU? 
                                            If the C++ standard library
                                            were to add suitable Unicode
                                            support, what would motivate
                                            reimplementing Swift strings
                                            on top of it?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <div class=""><br class="">
                                </div>
                                Swift’s tie to ICU is less firm than it
                                used to be. We use ICU for the
                                following:<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>1. Grapheme breaking<br class="">
                                2. Normalization<br class="">
                                3. Accessing UCD properties<br class="">
                                4. Case conversion<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>None of these is too tightly
                                entwined with string; they’re
                                cordoned off as a couple of shims called
                                on fallback slow-paths.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>If the C++ standard library
                                provided these operations, sufficiently
                                up-to-date with Unicode version and
                                comparable to or better than ICU in
                                performance, we would be willing to
                                switch. A big pain in interacting with
                                ICU is its limited support for UTF-8.
                                Some users would like to use a
                                “lighter-weight” Swift and are unhappy
                                at having to link against ICU, as it’s
                                fairly large and can complicate
                                security audits.<br class="">
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
                <br class="">
                Got it.  Increasing the size of the C++ standard library
                is a definite concern for us as well.  We imagine some
                C++ users would be similarly unhappy if their standard
                library suddenly required linking against ICU.<br
                  class="">
                <br class="">
              </div>
            </div>
          </blockquote>
          <div><br class="">
          </div>
          <div>If you go the route of implementing Unicode operations
            without ICU, would it be possible to separately link against
            Unicode support without also pulling in all of libc++? If
            your implementation is lighter-weight, yet current, it would
            be very appealing for Swift to consider switching over.</div>
        </div>
      </div>
    </blockquote>
    <br>
    It would be up to the implementation to determine how it is
    packaged, but I suspect there will be sufficient motivation for
    separating out the heavier parts.  Whether those heavier parts could
    then be used separately from the rest of the library I can't say.  I
    think this is something for us to keep in mind as a design point
    though.<br>
    <br>
    Tom.<br>
    <br>
    <blockquote type="cite"
      cite="mid:DF57361A-F68C-44B0-87E9-FDA5F7D0484E@apple.com">
      <div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
        space; line-break: after-white-space;" class="">
        <div><br class="">
          <blockquote type="cite" class="">
            <div class="">
              <div text="#000000" bgcolor="#FFFFFF" class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class="">
                    <blockquote type="cite" class="">
                      <div class="">
                        <div style="word-wrap: break-word;
                          -webkit-nbsp-mode: space; line-break:
                          after-white-space;" class="">
                          <div class="">
                            <div class="">
                              <div class=""><font class=""
                                  color="#8886ff"><br class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">Do
                                            Swift programmers tend to
                                            prefer string interpolation
                                            or string formatting
                                            functions?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <div class=""><br class="">
                                </div>
                                Users tend to prefer string
                                interpolation. However, Swift does not
                                yet have much in the way of formatting
                                control in interpolations; this is
                                something we’re currently working
                                on.<br class="">
                                <font class="" color="#8886ff"><br
                                    class="">
                                </font>
                                <blockquote type="cite" class="">
                                  <div class="" style="word-wrap:
                                    break-word; -webkit-nbsp-mode:
                                    space; line-break:
                                    after-white-space;">
                                    <div class="">
                                      <blockquote type="cite" class="">
                                        <div class="">
                                          <div text="#000000"
                                            bgcolor="#FFFFFF" class="">What
                                            enhancements would you most
                                            like to see in C++ to
                                            improve Unicode support?</div>
                                        </div>
                                      </blockquote>
                                    </div>
                                  </div>
                                </blockquote>
                                <div class=""><br class="">
                                </div>
                                Swift’s string is perhaps geared toward
                                being a higher-level construct than what
                                you may want for C++, and Swift has
                                Cocoa-interoperability concerns where
                                everything is UTF-16. Rust might provide
                                a model closer to what you’re looking
                                for:<br class="">
                              </div>
                            </div>
                            <div class=""><br class="">
                            </div>
                            <div class="">
                              <ul class="MailOutline">
                                <li class="">Strings are a sequence of
                                  (valid) UTF-8 code units</li>
                                <ul class="">
                                  <li class="">Validation is done on
                                    creation</li>
                                  <li class="">Invalid contents (e.g.
                                    Windows file paths) can be handled
                                    via something like WTF-8, which is
                                    not intended for interchange</li>
                                </ul>
                              </ul>
                            </div>
                            <div class="">
                              <ul class="MailOutline">
                                <li class="">String provides
                                  bidirectional iterators for:</li>
                                <ul class="">
                                  <li class="">Transcoded and/or
                                    normalized code units</li>
                                  <li class="">Unicode scalar values
                                    (their “character” type)</li>
                                  <li class="">Grapheme clusters</li>
                                </ul>
                              </ul>
                            </div>
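                            <div class=""><br class="">
                            </div>
                            <div class="">As a rough illustration of the
                              points listed above, here is a minimal
                              sketch that uses only Rust's standard
                              library. The helper names (is_valid_utf8,
                              utf8_units, utf16_units, scalars_reversed)
                              are hypothetical and exist only for this
                              example. Grapheme cluster iteration and
                              normalization are not part of Rust's
                              standard library (third-party crates such
                              as unicode-segmentation are typically used
                              for those), so the sketch covers only
                              validation on creation, transcoded code
                              units, and bidirectional iteration over
                              Unicode scalar values.</div>
<pre class="">
```rust
// Sketch only: helper names are illustrative, not from the thread.

// Validation is done on creation: String::from_utf8 returns an error
// for ill-formed input instead of producing an invalid string.
fn is_valid_utf8(bytes: [u8; 2]) -> bool {
    String::from_utf8(bytes.to_vec()).is_ok()
}

// Number of UTF-8 code units (bytes) in the string's storage.
fn utf8_units(s: String) -> usize {
    s.len()
}

// Number of UTF-16 code units produced by transcoding on the fly.
fn utf16_units(s: String) -> usize {
    s.encode_utf16().count()
}

// chars() yields Unicode scalar values and can be walked backwards.
fn scalars_reversed(s: String) -> String {
    s.chars().rev().collect()
}

fn main() {
    // 0xC3 starts a two-byte sequence; 0x28 is not a continuation byte.
    println!("C3 28 valid: {}", is_valid_utf8([0xC3, 0x28]));
    println!("C3 A9 valid: {}", is_valid_utf8([0xC3, 0xA9]));

    // 'e', U+0301 COMBINING ACUTE ACCENT, U+1F600 GRINNING FACE.
    let text = String::from("e\u{301}\u{1F600}");
    println!("UTF-8 code units:  {}", utf8_units(text.clone()));
    println!("UTF-16 code units: {}", utf16_units(text.clone()));
    println!("scalar values:     {}", text.chars().count());
    println!("reversed scalars:  {:?}", scalars_reversed(text));
}
```
</pre>
                            <div class="">Note that the first two scalar
                              values in the sample text form a single
                              grapheme cluster, which is why scalar
                              iteration alone does not match
                              user-perceived characters.</div>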
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <br class="">
                  </div>
                  <div class="">Michael, I think you're not answering
                    the question asked.  They are asking what Swift
                    would want from C++, e.g., to allow us to decouple
                    from ICU.  Wouldn't we like to be able to do that?</div>
                </blockquote>
                <br class="">
                This question was intended to ask you, as expert C++
                programmers independent of Swift, what additions to
                C++ you think would be most helpful to improve our (very
                lacking) Unicode support.  So, Michael's response is on
                point (thank you; we'll take a closer look at Rust), as
                are any comments regarding what would benefit Swift
                specifically.  Michael's earlier comments regarding what
                Swift currently uses ICU for are suggestive of what
                Swift might want from C++.  But I imagine the form in
                which those features are provided would matter greatly;
                devils and details.<br class="">
                <br class="">
                Tom.<br class="">
                <br class="">
                <blockquote type="cite"
                  cite="mid:A9CC2CEA-2102-4473-93A3-455C4AF66365@apple.com"
                  class="">
                  <div class=""><br class="">
                  </div>
                  <div class="">-Dave</div>
                  <div class=""><br class="">
                  </div>
                  <br class="">
                </blockquote>
                <p class=""><br class="">
                </p>
              </div>
            </div>
          </blockquote>
        </div>
        <br class="">
      </div>
    </blockquote>
    <p><br>
    </p>
  </body>
</html>