[SG16-Unicode] SG16 Unicode related questions for Swift and WebKit representatives

Tom Honermann tom at honermann.net
Tue Aug 7 04:14:29 CEST 2018


On 08/03/2018 12:53 PM, Michael Ilseman wrote:
>
>
>> On Aug 2, 2018, at 10:26 PM, Tom Honermann <tom at honermann.net> wrote:
>>
>> Thank you Michael and Dave!  I appreciate the time and detail.  All 
>> of your answers look to confirm our expectations, so I interpret this 
>> as a good sign we're thinking about the right things.
>>
>> I added a few inline comments/clarifications below.
>>
>> We had tentatively planned to meet Wednesday of next week, but it 
>> turns out that two of our core SG16 members are going to be on 
>> vacation so, at a minimum, I'd like to postpone.  I'm also feeling 
>> pretty content with the responses that we got from you and I think it 
>> would suffice for us to just follow up with any remaining thoughts 
>> via email.  While I'd love for any of you to attend one (or more) of 
>> our meetings (any time), I want to be sensitive to productive use of 
>> your time.  So, how about we play it by ear for now?
>>
>
> I’d be happy to meet up sometime. JF mentioned an in-person meeting 
> sometime this fall. Feel free to grab me whenever you think I can add 
> value.
>
>> On 08/02/2018 05:18 PM, Dave Abrahams wrote:
>>>
>>>
>>>> On Aug 1, 2018, at 12:04 PM, Michael Ilseman <milseman at apple.com> wrote:
>>>>
>>>> Hello, I am the current maintainer of Swift’s String, and can speak 
>>>> to my thoughts on the status quo and future directions. Dave, who 
>>>> is on this thread, is much more familiar with the history behind 
>>>> this and can likely provide deeper insight into the reasoning.
>>>
>>> Michael has done very well here; I only have a few things to add.
>>>
>>>>
>>>>> On Jul 23, 2018, at 7:39 PM, Tom Honermann <tom at honermann.net> wrote:
>>>>>
>>>>> SG16 is seeking input from Swift and WebKit representatives to 
>>>>> help inform our work towards enhancing support for Unicode in the 
>>>>> C++ standard.  In particular, we recognize the significant amount 
>>>>> of effort that went into the design of the Swift String type and 
>>>>> would like to better understand the motivations that contributed 
>>>>> to its current design and any pressures that might encourage 
>>>>> further evolution or refinement; especially for any concerns that 
>>>>> would be deemed significant enough to warrant backward 
>>>>> incompatible changes.
>>>>> Though most of these questions specifically mention Swift, that is 
>>>>> an artifact of our being more familiar with Swift than the 
>>>>> internal workings of WebKit.  Many of these questions would be 
>>>>> applicable to any string type designed to support Unicode.  We are 
>>>>> therefore also interested in hearing about the string types used 
>>>>> by WebKit, the motivations that guided their design, and the 
>>>>> trade-offs that have been made. Of particular interest would be the 
>>>>> results of design decisions that contrast with the design of 
>>>>> Swift's String type.
>>>>> Thank you in advance for any time and expertise you are willing 
>>>>> and able to share with us.
>>>>>> The Swift string manifesto is about 1 1/2 years old. What have 
>>>>>> you learned since writing it?  What would you change?  What have 
>>>>>> you changed?
>>>>
>>>> We haven’t really diverged from that manifesto. Some things are 
>>>> still in progress, minor details were tweaked, but the core 
>>>> arguments are still relevant.
>>>>
>>>>>>
>>>>>> Swift strings are extended grapheme cluster (EGC) based.  What 
>>>>>> have been the best and worst consequences of this choice?
>>>>
>>>> I’ll use “grapheme” casually to mean EGC. Swift’s Character type 
>>>> represents a grapheme cluster, Unicode.Scalar represents a Unicode 
>>>> scalar value (non-surrogate code point).
>>>>
>>>> Cocoa APIs are UTF-16 code unit oriented, and thus there’s always 
>>>> caution (via documentation) about making sure such indices align to 
>>>> grapheme boundaries. This is a frequent source of bugs, especially 
>>>> as part of internationalization. By making Swift strings be 
>>>> grapheme-based by default, developers first reach for the correct APIs.
>>>>
>>>> Another good consequence is that people picking up Swift and 
>>>> playing with strings, e.g. in a REPL or Playground, see Swift’s 
>>>> notion of characters align with what is displayed. This includes 
>>>> complex multi-component emoji such as family emoji (👨‍👨‍👧‍👧), 
>>>> which is a single Character composed of 7 Unicode.Scalars.
>>>>
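A minimal sketch of the counts described above, assuming Swift 5 on a runtime with current Unicode data. The emoji is spelled out with escapes so the scalars are visible; the variable names are illustrative only.

```swift
// The family emoji from the text: four emoji scalars joined by three
// ZERO WIDTH JOINER (U+200D) scalars.
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

let graphemes = family.count                // Characters (extended grapheme clusters)
let scalars = family.unicodeScalars.count   // Unicode scalar values
let utf16Units = family.utf16.count         // UTF-16 code units
let utf8Units = family.utf8.count           // UTF-8 code units

// Scalars: 7, UTF-16 code units: 11, UTF-8 code units: 25.
// The grapheme count is 1 when the runtime's segmentation rules know
// about emoji ZWJ sequences.
print(graphemes, scalars, utf16Units, utf8Units)
```

The same string therefore has four different lengths depending on the view, which is why code that moves between views has to convert indices rather than reuse integer offsets.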
>>>> This does have downsides. What is and is not a grapheme cluster 
>>>> changes with each version of Unicode, and thus grapheme breaking is 
>>>> inherently a run-time concern and can’t be checked at compile time. 
>>>> Another is that while code units can be random-access, graphemes 
>>>> cannot, which is confusing to developers used to UTF-16 code unit 
>>>> access mostly working (until their users use non-BMP scalars or 
>>>> emoji, that is).
>>>
>>> I'd say the biggest downside is that there are users who simply 
>>> refuse to accept what we consider to be the fundamental 
>>> non-random-access character of any efficient string representation. 
>>>  They are upset that they can't index a string directly with an 
>>> integer, and can't be talked out of it.  I still think we made the 
>>> right decision in this regard; you'd have the same problem if your 
>>> strings were unicode-scalar-based.
>>
>> Are there common scenarios where programmers tend to be frustrated by 
>> lack of random access?  Perhaps most often when they are working with 
>> inputs known to be ASCII only?  Or is this mostly an education issue 
>> and these programmers are having a difficult time accepting that 
>> they've spent most of their career thus far writing bugs? :)
>>
>
> A lot of it is shaped by expectations coming from other languages, 
> whose programming models do not prioritize operating on Unicode scalar 
> values, let alone grapheme clusters. Objective-C’s default interface 
> with Strings is random-access to UTF-16 code units, which “works” 
> right up until you encounter an emoji or other scalar not on the BMP. 
> It also “works” for graphemes right up until you encounter emoji, a 
> language you didn’t test, or non-NFC-normalized contents in a 
> language you did test.
>
> This gets compounded by the prevalence of strings in teaching, 
> interviews, programming puzzles, etc., where a string is treated like 
> an array with a more visual representation.
>
> Also note that even for fully ASCII strings we cannot provide random 
> access to grapheme clusters, as “\r\n” is a single grapheme cluster. 
> For pretty much every Unicode-correct operation we provide fast-paths 
> for, there are nasty corner cases that complicate the model.

Thanks, I had not considered the "\r\n" case.  Alas, there are no easy 
cases.
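
As a small illustration of the "\r\n" point, here is a sketch assuming Swift 5. It shows that an all-ASCII string can still have fewer Characters than code units, and that a position is reached by walking from an index rather than by integer subscript.

```swift
let line = "a\r\nb"

let characterCount = line.count      // CR LF counts as one Character
let codeUnitCount = line.utf8.count  // four ASCII code units

// There is no integer subscript on String; advance from startIndex instead.
let secondIndex = line.index(line.startIndex, offsetBy: 1)
let second = line[secondIndex]

print(characterCount, codeUnitCount) // prints "3 4"
print(second == "\r\n")              // prints "true"
```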

>
>>>
>>>> Furthermore, few existing specifications are phrased in terms of 
>>>> grapheme clusters, so something like a validator wouldn’t want to 
>>>> run on grapheme-segmented text, but a lower abstraction level.
>>>>
>>>> Also, graphemes can be funky. A string containing only U+0301 
>>>> (COMBINING ACUTE ACCENT) has one grapheme, but modifies the prior 
>>>> grapheme upon concatenation. Such degenerate graphemes violate 
>>>> algebraic reasoning in these corner cases.
>>>
>>> We are not aware of generic algorithms that rely on concatenation of 
>>> collections conserving element counts, so we decided to simply 
>>> document this quirk rather than saying that string is-not-a collection.
>>
>> SG16 has previously discussed cases like this and I'm happy to hear 
>> you haven't had to do anything special for it.  This is a good 
>> example of why we asked about inappropriate use of the String count 
>> property: programmers assuming s1.count + s2.count == 
>> (s1 + s2).count.
>>
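The degenerate-grapheme case can be reproduced in a few lines. This is a sketch assuming Swift 5; the names are illustrative.

```swift
let base = "e"
let accent = "\u{301}"           // COMBINING ACUTE ACCENT on its own
let combined = base + accent     // the accent attaches to the preceding "e"

// Each operand is one Character, yet the concatenation is also one Character.
print(base.count, accent.count, combined.count)  // prints "1 1 1"

// The result is canonically equivalent to the precomposed U+00E9.
print(combined == "\u{E9}")                       // prints "true"
```

So the sum of the operand counts (2) differs from the count of the result (1), which is exactly the assumption that breaks.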
>>>
>>>> Unicode defines properties and most operations on scalars or code 
>>>> points, and very little on top of graphemes.
>>>>
>>>>>> When porting code unit or code point based code to Swift strings 
>>>>>> (e.g., when rewriting Objective-C code, or rewriting Swift code 
>>>>>> to use String instead of NSString), has profiling revealed 
>>>>>> performance regressions due to the switch to EGC based 
>>>>>> processing?  If so, what action was taken to correct it?
>>>>
>>>> We have many fast-paths in grapheme-breaking to identify common 
>>>> situations surrounding single-scalar graphemes. If a developer 
>>>> wants to work with Unicode at a lower level, String provides a 
>>>> UTF8View, a UTF16View, and a UnicodeScalarView. Those views lazily 
>>>> transcode/decode upon access.
>>
>> Cool, it sounds like the answer to any such regressions was 1) 
>> optimization in terms of fast-paths, and 2) fall back to code 
>> unit/point processing otherwise.
>>
>>>>
>>>> There are also performance concerns and annoyances when working 
>>>> with ICU, but this is an implementation detail. If you’re 
>>>> interested in using ICU, we can discuss further what has worked 
>>>> best for us.
>>>
>>> I think you're interested in (at least optionally) using ICU unless 
>>> you have evidence of major investment in another open-source 
>>> implementation of Unicode algorithms and tables.  Otherwise, C++ 
>>> implementors could not afford to develop standard libraries.
>>
>> Yes, definitely.  For the foreseeable future, I think we need to 
>> ensure that any interfaces we propose can be reasonably implemented 
>> using ICU.  However, Zach Laine has made impressive progress 
>> implementing many of the Unicode algorithms without use of ICU in his 
>> proposed Boost.Text library.  See https://github.com/tzlaine/text and 
>> https://tzlaine.github.io/text/doc/html/index.html.
>>
>>>
>>>>
>>>>>>
>>>>>> Swift strings do not enforce storage in any particular Unicode 
>>>>>> normalization form. Was consideration given to forcing storage in 
>>>>>> a particular form such as FCC or NFC?
>>>>
>>>> Swift strings now sort with NFC (currently UTF-16 code unit order, 
>>>> but likely changed to Unicode scalar value order). We didn’t find 
>>>> FCC significantly more compelling in practice. Since NFC is far 
>>>> more frequent in the wild (why waste space if you don’t have to), 
>>>> strings are likely to already be in NFC. We have fast-paths to 
>>>> detect on-the-fly normal sections of strings (e.g. all ASCII, all < 
>>>> U+0300, NFC_QC=yes, etc.). We lazily normalize portions of string 
>>>> during comparison when needed.
>>>>
>>>> As far as enforcing on creation, no. We do want to add an option to 
>>>> perform a linear scan to set a performance flag, perhaps at 
>>>> creation, so that comparison can take the memcmp-like fast-path.
>>
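The fast-path idea can be sketched as follows. This is not the standard library's implementation; `isTriviallyNormal` and `sketchEqual` are hypothetical helpers, and the only check used is the "all scalars below U+0300" test named above. The slow path relies on the fact that Swift's `==` on String compares by canonical equivalence.

```swift
// Hypothetical helper: true when every scalar is below U+0300, in which
// case the string is already in NFC.
func isTriviallyNormal(_ s: String) -> Bool {
    return s.unicodeScalars.allSatisfy { $0.value < 0x300 }
}

// Hypothetical comparison with a fast path in front of the general case.
func sketchEqual(_ a: String, _ b: String) -> Bool {
    if isTriviallyNormal(a) && isTriviallyNormal(b) {
        // Both sides are known to be normalized: compare code units directly.
        return a.utf8.elementsEqual(b.utf8)
    }
    // Slow path: String's == handles normalization as needed.
    return a == b
}

let precomposed = "caf\u{E9}"    // U+00E9 as a single scalar
let decomposed = "cafe\u{301}"   // "e" followed by COMBINING ACUTE ACCENT

print(sketchEqual("abc", "abc"))             // fast path
print(sketchEqual(precomposed, decomposed))  // slow path
```

A cached "known normal" flag, as described above, would let the first check be skipped on repeated comparisons.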
>> Ok, my takeaway from this is that fast-pathing has been sufficient 
>> for lazy normalization (when needed) to not be (much of) a 
>> performance concern.  At least, not enough to want to take the 
>> normalization cost on every string construction up front.
>>
>>>>
>>>>>> Swift strings support comparison via normalization.  Has use of 
>>>>>> canonical string equality been a performance issue? Or been a 
>>>>>> source of surprise to programmers?
>>>>
>>>> This was a big performance issue on Linux, where we used to do 
>>>> UCA+DUCET based comparisons. We switched to lexicographical order of 
>>>> NFC-normalized UTF-16 code units (future: scalar values), and saw a 
>>>> very significant speed up there. The remaining performance work 
>>>> revolves around checking and tracking whether a string is known to 
>>>> already be in a normal form, so we can just memcmp.
>>
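The difference between UTF-16 code unit order and Unicode scalar value order only shows up when a supplementary-plane scalar is compared with a BMP scalar above the surrogate range. A sketch, assuming Swift 5 and two scalars chosen purely for illustration (both are unchanged by NFC):

```swift
let bmp = "\u{FFFD}"             // REPLACEMENT CHARACTER, one UTF-16 code unit
let supplementary = "\u{1F600}"  // encoded in UTF-16 as the pair D83D DE00

// UTF-16 code unit order: 0xFFFD is greater than the lead surrogate 0xD83D.
let byUTF16 = bmp.utf16.lexicographicallyPrecedes(supplementary.utf16)

// Scalar value order: 0xFFFD is less than 0x1F600.
let byScalar = bmp.unicodeScalars.lexicographicallyPrecedes(supplementary.unicodeScalars)

print(byUTF16, byScalar)         // prints "false true"
```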
>> This is very helpful, thank you.  We've suspected that full collation 
>> (with or without tailoring) would be too expensive for use as a 
>> default comparison operator, so it is good to hear that confirmed.
>>
>> I'm curious why this was a larger performance issue for Linux than 
>> for (presumably) macOS and/or iOS.
>>
>
> There were two main factors. The first is that on Darwin platforms, 
> CFString had an implementation that we used instead of UCA+DUCET which 
> was faster. The second is that Darwin platforms are typically 
> up-to-date and have very recent versions of ICU. On Linux, we still 
> support Ubuntu LTS 14.04 which has a version of ICU which predates 
> Swift and didn’t have any fast-paths for ASCII or mostly-ASCII text.
>
> Switching to our own implementation based on NFC gave us many X 
> improvement over CFString, which in turn was many X faster than 
> UCA+DUCET (especially on older versions of ICU).

Thanks.  My takeaway is that implementation quality matters; those fast 
paths are important.

>
>>>>
>>>>>> Swift strings are not locale sensitive.  Was any consideration 
>>>>>> given to creation of a distinct locale sensitive string type?
>>>>
>>>> This is still up for debate and hasn’t been settled yet, but we 
>>>> think it makes a lot of sense. If an array of strings is sorted, we 
>>>> certainly don’t want a locale-change to violate programmer 
>>>> invariants. A distinct type from string could avoid a lot of common 
>>>> errors here, including forgetting to localize before presenting to 
>>>> a user as part of a UI.
>>>>
>>>>>> Swift strings provide a count property as required to satisfy the 
>>>>>> Collection protocol.  How often do programmers use count (the 
>>>>>> number of EGCs in the string) inappropriately?
>>>>
>>>> I’m not sure what would constitute inappropriate usage here. We do 
>>>> not currently provide access to the underlying stored code units, 
>>>> though this is a frequent request and we likely will in the future. 
>>>> I haven’t seen anyone baking in the assumption that count is the 
>>>> same for String and across all of String’s views (UTF-8, UTF-16, 
>>>> Unicode scalars).
>>>
>>> One thing to consider is that as long as String is not 
>>> random-access, count will be a worst-case O(N) operation.  An 
>>> inappropriate usage might involve computing the length once per loop 
>>> iteration.
>>
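Because count has to walk the string, a loop is better written against indices, with any needed length computed once beforehand. A sketch assuming Swift 5, with illustrative names:

```swift
let text = "na\u{EF}ve caf\u{E9}"

// Computed once, outside the loop, rather than in the loop condition.
let length = text.count

// Walk the string with indices instead of integer offsets.
var steps = 0
var index = text.startIndex
while index != text.endIndex {
    steps += 1
    index = text.index(after: index)
}

print(length, steps)             // prints "10 10"
```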
>> In addition to the above and prior mention of algebraic concerns, 
>> other potential abuses we had in mind were using it to determine 
>> field widths for display or code unit/point based storage.
>>
>
> Display width is a whole other concern accounting for rendering 
> environment, font, etc. I don’t have expertise here.
>
>> C++ container requirements specify that .size() be O(1). For us to 
>> meet container requirements would require computing and caching the 
>> count during construction and mutation operations.  We could 
>> potentially get by just meeting range requirements though.
>>
>>>
>>>> I mentioned degenerate graphemes breaking algebraic properties of 
>>>> the Collection protocol, but this hasn’t been a huge issue in 
>>>> practice so far.
>>>>
>>>>>>
>>>>>> Swift strings support several memory unsafe initializers and 
>>>>>> methods. How frequently are these used incorrectly?
>>>>
>>>> Many of these initializers come from NSString originally, and 
>>>> developers migrating correct code to Swift maintain that 
>>>> correctness. Rust has a similar situation, though they do 
>>>> validation at creation-time and from_utf8_unchecked() voids 
>>>> memory-safety if the contents are invalid.
>>>>
>>>>>> The Swift manifesto discussed three approaches to handling 
>>>>>> substrings and Swift 4 changed from "same type, shared storage" 
>>>>>> to "different type, shared storage".  Any regrets?
>>>>
>>>> Having two types can be a bit of a pain, but we still think it was 
>>>> the right thing to do. This is consistent with Swift treating 
>>>> slices as a distinct type from the base collection.
>>>>
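The "different type, shared storage" model looks like this in practice. A sketch assuming Swift 4.2 or later; the names are illustrative.

```swift
let greeting = "Hello, world"

// Slicing yields a Substring that refers to the original string's storage.
let comma = greeting.firstIndex(of: ",")!
let head = greeting[..<comma]

// Converting to String copies the contents into independent storage.
let owned = String(head)

print(type(of: head))            // prints "Substring"
print(owned)                     // prints "Hello"
```

The explicit conversion is the "bit of a pain" mentioned above, but it makes the point at which a slice stops keeping the whole original alive visible in the code.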
>>>>>>
>>>>>> How often do you find programmers doing work at the EGC level 
>>>>>> that would be better performed at the code unit or code point level?
>>>>
>>>> Often, if a developer has strict requirements, they know what 
>>>> they’re doing enough to operate at one of those lower levels.
>>>>
>>>> Not being able to random-access graphemes in a string is a common 
>>>> source of frustration and confusion amongst new users.
>>>>
>>>>>> Likewise, how often do you find programmers working with 
>>>>>> unicodeScalars, utf8, or utf16 views to do work better performed 
>>>>>> at the EGC level?  For what reasons does this occur?  Perhaps to 
>>>>>> work around differences in EGC boundaries across Unicode versions 
>>>>>> or the underlying version of ICU in use?
>>>>
>>>> This was very prevalent in Swift’s early days. String wasn’t a 
>>>> collection of graphemes by default prior to Swift 4,
>>>
>>> Well, it was.  And then in Swift 2 or 3 it wasn't, due to the 
>>> algebraic reasoning issue.  Now it is again.
>>>
>>>> so without guidance many developers wrote code against the unicode 
>>>> scalars view. We also didn’t have any fast-paths for common-case 
>>>> situations back then, which further encouraged them to use one of 
>>>> the other views.
>>>>
>>>> This is still done sometimes for performance-sensitive usage, or 
>>>> someone wanting to handle Unicode themselves. However, as mentioned 
>>>> previously, we don’t (yet) provide direct access to the actual storage.
>>>>
>>>> We haven’t seen much desire for reconciling behavior across Unicode 
>>>> versions. This may be due to Swift being primarily an applications 
>>>> level programming language for devices which only have one version 
>>>> of Unicode that’s relevant (the current one).
>>>>
>>>>>> Has consideration been given to exposing Unicode character 
>>>>>> database properties? CharacterSet exposes some of these 
>>>>>> properties, but have more been requested?
>>>>
>>>> Yes, this was recently added to the language: 
>>>> https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md. 
>>>> We surface much of the UCD via ICU.
>>
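A short sketch of the scalar properties API from the linked proposal, assuming Swift 5 or later where `Unicode.Scalar.properties` is available:

```swift
let accent: Unicode.Scalar = "\u{301}"
let props = accent.properties

// Name, general category and canonical combining class from the UCD.
print(props.name ?? "unnamed")                      // prints "COMBINING ACUTE ACCENT"
print(props.generalCategory == .nonspacingMark)     // prints "true"
print(props.canonicalCombiningClass.rawValue)       // prints "230"

let letter: Unicode.Scalar = "a"
print(letter.properties.isAlphabetic)               // prints "true"
```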
>> Ah, nice.  All kinds of fun to be had with that :)
>>
>>>>
>>>>>> How firmly is the Swift string implementation tied to ICU? If the 
>>>>>> C++ standard library were to add suitable Unicode support, what 
>>>>>> would motivate reimplementing Swift strings on top of it?
>>>>
>>>> Swift’s tie to ICU is less firm than it used to be. We use ICU for 
>>>> the following:
>>>>
>>>> 1. Grapheme breaking
>>>> 2. Normalization
>>>> 3. Accessing UCD properties
>>>> 4. Case conversion
>>>>
>>>> Each of these is not too tightly entwined with string; they’re 
>>>> cordoned-off as a couple of shims called on fallback slow-paths.
>>>>
>>>> If the C++ standard library provided these operations, sufficiently 
>>>> up-to-date with Unicode version and comparable to or better than ICU 
>>>> in performance, we would be willing to switch. A big pain in 
>>>> interacting with ICU is their limited support for UTF-8. Some users 
>>>> would like to use a “lighter-weight” Swift and are unhappy at 
>>>> having to link against ICU, as it’s fairly large and can 
>>>> complicate security audits.
>>
>> Got it.  Increasing the size of the C++ standard library is a 
>> definite concern for us as well.  We imagine some C++ users would be 
>> similarly unhappy if their standard library suddenly required linking 
>> against ICU.
>>
>
> If you go the route of implementing Unicode operations without ICU, 
> would it be possible to separately link against Unicode support 
> without also pulling in all of libc++? If your implementation is 
> lighter-weight, yet current, it would be very appealing for Swift to 
> consider switching over.

It would be up to the implementation to determine how it is packaged, 
but I suspect there will be sufficient motivation for separating out the 
heavier parts.  Whether those heavier parts could then be used 
separately from the rest of the library I can't say.  I think this is 
something for us to keep in mind as a design point though.

Tom.

>
>>>>
>>>>>> Do Swift programmers tend to prefer string interpolation or 
>>>>>> string formatting functions?
>>>>
>>>> Users tend to prefer string interpolation. However, Swift currently 
>>>> does not have much in the way of formatting control in 
>>>> interpolations, and this is something we’re currently working on.
>>>>
>>>>>> What enhancements would you most like to see in C++ to improve 
>>>>>> Unicode support?
>>>>
>>>> Swift’s string is perhaps geared as a higher-level construct than 
>>>> what you may want for C++, and Swift has Cocoa-interoperability 
>>>> concerns where everything is UTF-16. Rust might provide a closer 
>>>> model to what you’re looking for:
>>>>
>>>>   * Strings are a sequence of (valid) UTF-8 code units
>>>>       o Validation is done on creation
>>>>       o Invalid contents (e.g. Windows file paths) can be handled
>>>>         via something like WTF-8, which is not intended for interchange
>>>>
>>>>   * String provides bidirectional iterators for:
>>>>       o Transcoded and/or normalized code units
>>>>       o Unicode scalar values (their “character” type)
>>>>       o Grapheme clusters
>>>>
>>>
>>> Michael, I think you're not answering the question asked.  They are 
>>> asking what Swift would want from C++, e.g., to allow us to decouple 
>>> from ICU.  Wouldn't we like to be able to do that?
>>
>> This question was intended to ask you, as expert C++ programmers 
>> independently from Swift, what additions to C++ you think would be 
>> most helpful to improve our (very lacking) Unicode support.  So, 
>> Michael's response is on point (thank you; we'll take a closer look 
>> at Rust), as are any comments regarding what would benefit Swift 
>> specifically.  Michael's earlier comments regarding what Swift 
>> currently uses ICU for are suggestive of what Swift might want from 
>> C++.  But I imagine the form in which those features are provided 
>> would matter greatly; devils and details.
>>
>> Tom.
>>
>>>
>>> -Dave
>>>
>>>
>>
>
