[SG16-Unicode] [isocpp-lib] [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
Billy O'Neal (VC LIBS)
bion at microsoft.com
Wed Nov 13 20:28:26 CET 2019
>Will you be hesitant to update the reference to the grapheme breaking algorithm if it changes in future Unicode standards as well?
Yes. There's a reason that Java, for example, doesn't follow Unicode's rules in its regex implementation: doing so would be a breaking change.
>It is important to remember that width estimation is orthogonal to memory safety; format_to_n() is there to give you the memory safety part, and that will never be impacted by the width estimation piece.
I agree, but the same is true of sprintf vs. snprintf.
Billy3
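For readers who want the sprintf/snprintf parallel spelled out, here is a minimal sketch using only C stdio rather than std::format. The helper names write_bounded and format_exact are made up for illustration. The first shows that a bounded write stays within the buffer no matter how large the full output would have been (it truncates instead); the second shows the measure-then-allocate pattern, which is roughly the role std::formatted_size plays elsewhere in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: bounded write. snprintf stores at most `capacity`
// bytes (including the terminating NUL), so the buffer cannot be overrun.
// Returns true if the output had to be truncated.
inline bool write_bounded(char* buffer, std::size_t capacity, const char* text) {
    assert(buffer != nullptr && capacity > 0);
    const int full = std::snprintf(buffer, capacity, "%s", text);
    return full < 0 || static_cast<std::size_t>(full) >= capacity;
}

// Hypothetical helper: measure first, then allocate exactly what is needed.
// Calling snprintf with a null buffer and size 0 only reports the length.
inline std::string format_exact(const char* text) {
    const int n = std::snprintf(nullptr, 0, "[%s]", text);
    if (n < 0) return std::string();
    std::string out(static_cast<std::size_t>(n) + 1, '\0');  // room for the NUL
    std::snprintf(&out[0], out.size(), "[%s]", text);
    out.resize(static_cast<std::size_t>(n));                  // drop the NUL
    return out;
}
```

With an 8-byte buffer, write_bounded(buf, 8, "hello, world") reports truncation and leaves "hello, " in the buffer. The write is safe, but the result may not be what the caller expected, which is the distinction being drawn above.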
________________________________
From: Zach Laine <whatwasthataddress at gmail.com>
Sent: Wednesday, November 13, 2019 10:41 AM
To: Library Working Group <lib at lists.isocpp.org>
Cc: Kirk Shoop <kirkshoop at fb.com>; lib-ext at lists.isocpp.org <lib-ext at lists.isocpp.org>; Titus Winters <titus at google.com>; Billy O'Neal (VC LIBS) <bion at microsoft.com>; Victor Zverovich <victor.zverovich at gmail.com>; Corentin <corentin.jabot at gmail.com>; Tom Honermann <tom at honermann.net>; SG16 <unicode at open-std.org>
Subject: Re: [isocpp-lib] [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
Will you be hesitant to update the reference to the grapheme breaking algorithm if it changes in future Unicode standards as well? I ask this because it seems like the same thing to me.
I think users would be better served by having a consistent result across all implementations, for their particular version of C++ (all implementations of 20, or 23, etc.), rather than stable for all time. It is inherent to trying to accommodate all possible natural languages (aka Unicode) to need to vary algorithms across releases. Since we picked a Unicode approach to width estimation, that kind of slight release-to-release variation just comes with the territory.
It is important to remember that width estimation is orthogonal to memory safety; format_to_n() is there to give you the memory safety part, and that will never be impacted by the width estimation piece.
Zach
On Wed, Nov 13, 2019 at 12:29 PM Billy O'Neal (VC LIBS) via Lib <lib at lists.isocpp.org> wrote:
> How is managing this database different than the time-zone database?
1. Changes to the time zone database (if I understand correctly) don't change buffer management behavior.
2. The time zone database is an explicit opt-in to query an external database in source code, with text that describes it as reading a database from disk. That's different input to a program producing different output, not identical input to a program producing different output.
Maybe we need a term of art for databases like this to put into SD-8?
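To make the buffer-management point in item 1 concrete, here is a small model of width-based padding. It is a sketch under stated assumptions, not how any implementation is written: padded_size is a hypothetical function, the fill character is assumed to be one byte, and the two width estimates stand for two hypothetical versions of the width table.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: bytes produced when text occupying `text_bytes` bytes
// is padded with one-byte fill characters up to `field_width` columns,
// given the library's estimate of the text's display width.
inline std::size_t padded_size(std::size_t text_bytes,
                               std::size_t estimated_width,
                               std::size_t field_width) {
    const std::size_t fill =
        field_width > estimated_width ? field_width - estimated_width : 0;
    return text_bytes + fill;
}

// U+FDFD occupies three bytes in UTF-8. In a 10-column field:
inline void padding_example() {
    assert(padded_size(3, 1, 10) == 12);  // estimate 1: nine fill bytes
    assert(padded_size(3, 2, 10) == 11);  // estimate 2: eight fill bytes
}
```

Identical source and identical input produce a different number of output bytes once the estimate changes, which is the behavior being contrasted with the time zone database.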
________________________________
From: Kirk Shoop <kirkshoop at fb.com>
Sent: Tuesday, November 12, 2019 11:05 PM
To: lib-ext at lists.isocpp.org <lib-ext at lists.isocpp.org>; Titus Winters <titus at google.com>
Cc: Corentin <corentin.jabot at gmail.com>; Billy O'Neal (VC LIBS) <bion at microsoft.com>; Victor Zverovich <victor.zverovich at gmail.com>; Tom Honermann <tom at honermann.net>; lib at lists.isocpp.org <lib at lists.isocpp.org>; SG16 <unicode at open-std.org>
Subject: Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
How is managing this database different than the time-zone database?
Why specify values when you can specify functions that query the database?
Why not specify that the database is updatable within a particular standard release and thus its results are not fixed across time?
Kirk
From: Lib-Ext <lib-ext-bounces at lists.isocpp.org> on behalf of Corentin via Lib-Ext <lib-ext at lists.isocpp.org>
Reply-To: "lib-ext at lists.isocpp.org" <lib-ext at lists.isocpp.org>
Date: Wednesday, November 13, 2019 at 6:53 AM
To: Titus Winters <titus at google.com>
Cc: Corentin <corentin.jabot at gmail.com>, "Billy O'Neal (VC LIBS)" <bion at microsoft.com>, "lib-ext at lists.isocpp.org" <lib-ext at lists.isocpp.org>, Victor Zverovich <victor.zverovich at gmail.com>, Tom Honermann <tom at honermann.net>, "lib at lists.isocpp.org" <lib at lists.isocpp.org>, SG16 <unicode at open-std.org>
Subject: Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
We should say _something_ somewhere.
In many areas Unicode is purposefully not making any commitment to stability (it turns out that organizing the world's cultures is hard), and that particular proposal is harder still.
Notably, the width of an emoji sequence depends on vendors and Unicode version, and some clustering depends on locale, although by default format should not do tailoring.
Anyway, promising anything but a best effort (with the expectation that both the standard and implementations will improve/evolve) backs us into a corner that I don't think anyone in SG-16 wants to be in.
This issue will arise for many Unicode/locale-related proposals.
On Wed, 13 Nov 2019 at 06:56, Titus Winters <titus at google.com> wrote:
SD-8 is *appropriate* if we want to tell the public "The committee probably won't consider anything like X a breaking change, if your code gets in the way of that you may have a difficult time upgrading."
It's never *necessary*, nor does it *limit* us - we might still decide to do things that are outside of that scope. It's just trying to set general expectations.
(This doesn't sound like a case that falls into that category.)
On Tue, Nov 12, 2019 at 10:15 PM Billy O'Neal (VC LIBS) <bion at microsoft.com> wrote:
Sorry, I added Titus to ask if we need to talk about this in SD-8 somehow.
Billy3
From: Billy O'Neal (VC LIBS) via Lib-Ext <lib-ext at lists.isocpp.org>
Sent: Tuesday, November 12, 2019 1:14 PM
To: Tom Honermann <tom at honermann.net>; lib-ext at lists.isocpp.org; Corentin <corentin.jabot at gmail.com>; Titus Winters <titus at google.com>
Cc: Billy O'Neal (VC LIBS) <bion at microsoft.com>; Victor Zverovich <victor.zverovich at gmail.com>; lib at lists.isocpp.org; SG16 <unicode at open-std.org>
Subject: Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
I haven’t seen how customers will use this API enough to go so far as to make the statement “implementers aren’t going to be willing to change […]” at this time. It is certainly a possibility. Changes to that table are breaking changes. Whether we’re going to be willing to make such changes is a value judgement on potential breaks vs. whatever benefit might be attained from those breaks.
> I take it your concern is regarding code that calls std::format_to with an assumption that the provided output buffer is large enough?
More or less, yes. Certainly we see people do that with sprintf today.
Billy3
From: Tom Honermann <tom at honermann.net>
Sent: Tuesday, November 12, 2019 1:09 PM
To: Billy O'Neal (VC LIBS) <bion at microsoft.com>; lib-ext at lists.isocpp.org; Corentin <corentin.jabot at gmail.com>
Cc: lib at lists.isocpp.org; SG16 <unicode at open-std.org>; Victor Zverovich <victor.zverovich at gmail.com>
Subject: Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
If implementors aren't going to be willing to change these tables once we ship, then I think we have a fairly serious issue.
Some have adamantly stated that these widths are estimates only and should not be counted on to remain stable. Code that is sensitive to the formatted size of the output should be calling std::formatted_size and allocating appropriately. I take it your concern is regarding code that calls std::format_to with an assumption that the provided output buffer is large enough? (or, code that calls std::format and assumes the size of the resulting std::string).
Tom.
On 11/12/19 8:58 PM, Billy O'Neal (VC LIBS) wrote:
My only point was that the specified behavior gives grapheme clusters a width of 1 or 2, but there exist characters like U+FDFD that are wider than 2 (and many that have a width of 0). I would be very nervous about changing the constants used after std::format ships because that could introduce unexpected buffer overruns or underruns in user programs. This is the kind of thing that becomes contractual very quickly (which is one of the reasons I was weakly against trying to open this can of worms).
Billy3
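As an illustration of the 1-or-2 rule described above, here is a sketch of a range-based width estimate. The table is an abbreviated, illustrative assumption loosely modeled on the kind of code point ranges the wording lists; it is not a transcription of the proposed wording, and the names Range, kWideRanges, and estimated_width are hypothetical.

```cpp
#include <cassert>

struct Range {
    char32_t first;
    char32_t last;
};

// Illustrative subset of ranges treated as width 2 (assumption, not the
// actual wording): Hangul Jamo, CJK and related blocks, Hangul syllables,
// fullwidth forms, and supplementary ideographs.
static const Range kWideRanges[] = {
    {0x1100, 0x115F}, {0x3040, 0xA4CF}, {0xAC00, 0xD7A3},
    {0xFF00, 0xFF60}, {0x20000, 0x2FFFD},
};

// Returns 2 for code points inside a listed range and 1 for everything else.
// No other answer is possible under this scheme.
inline int estimated_width(char32_t cp) {
    assert(cp <= 0x10FFFF);  // must be within the Unicode code space
    for (const Range& r : kWideRanges) {
        if (cp >= r.first && cp <= r.last) return 2;
    }
    return 1;
}
```

With this table, estimated_width(0x4E2D) is 2 and estimated_width(0xFDFD) is 1. A scheme of this shape cannot report that U+FDFD renders much wider than two columns, nor that a combining mark takes none.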
From: Tom Honermann <tom at honermann.net>
Sent: Tuesday, November 12, 2019 12:53 PM
To: lib-ext at lists.isocpp.org; Corentin <corentin.jabot at gmail.com>
Cc: Billy O'Neal (VC LIBS) <bion at microsoft.com>; lib at lists.isocpp.org; SG16 <unicode at open-std.org>; Victor Zverovich <victor.zverovich at gmail.com>
Subject: Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
On 11/12/19 6:11 PM, Billy O'Neal (VC LIBS) via Lib-Ext wrote:
It came up in the context of that width thing in format and I was asking if I had permission to make wider-than-2 characters format properly, and the forwarded text doesn’t seem to allow that (which is OK, I just wanted to understand at the time); I was thinking of U+FDFD (﷽).
Can you elaborate? My understanding of the forwarded wording is that the assumed encoding for the input text is implementation defined (though not locale sensitive) and that implementors are encouraged to use the Unicode code point ranges indicated in the wording, but are not required to (that is my interpretation of the use of the word "should" in the proposed wording).
It does look like the provided code point ranges don't handle U+FDFD correctly.
I don't know how much confidence should be placed on the listed code point ranges. But I think it is important that we consider them amenable to change. I suspect that U+FDFD is not the last code point we'll find that is not correctly handled.
Tom.
Billy3
From: Corentin <corentin.jabot at gmail.com>
Sent: Tuesday, November 12, 2019 8:42 AM
To: C++ Library Evolution Working Group <lib-ext at lists.isocpp.org>
Cc: lib at lists.isocpp.org; Billy O'Neal (VC LIBS) <bion at microsoft.com>; SG16 <unicode at open-std.org>
Subject: Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post
On Tue, 12 Nov 2019 at 16:58, Billy O'Neal (VC LIBS) via Lib-Ext <lib-ext at lists.isocpp.org> wrote:
During review of some Unicode stuff in LWG we had a mini discussion among some folks about grapheme clusters, and I mentioned that everyone who touches this stuff might understand the complexities better if they read this:
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
+1
FYI SG-16 is aware of that blog post and I think there is pretty strong agreement with it.
Codepoints have some use (notably the Unicode Character Database is really the Unicode Codepoint Database, and most Unicode algorithms work on codepoints), but any kind of user-facing UX should deal with EGCs (extended grapheme clusters).
It is not always what applications choose to do, for a variety of reasons. Notably, Twitter's character count deals in codepoints, web browsers' search functions use codepoints so as to ignore diacritics, and comparisons can be done on (normalized) codepoint sequences.
There is also not always a 1-1 mapping between what people understand as a "character", grapheme clusters, and glyphs.
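A small sketch of why the unit of counting matters. It assumes well-formed UTF-8 input, and count_code_points is a hypothetical helper rather than anything proposed in this thread. It counts code points by skipping continuation bytes, which is still not the same as counting what a user would call characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (continuation bytes have the form 10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

inline void counting_example() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single
    // accented letter, i.e. one extended grapheme cluster.
    const std::string e_acute = "e\xCC\x81";
    assert(e_acute.size() == 3);              // three bytes
    assert(count_code_points(e_acute) == 2);  // two code points
}
```

Three different answers (3 bytes, 2 code points, 1 grapheme cluster) for one user-perceived character. Counting grapheme clusters would additionally need the Unicode segmentation rules and their data tables, which this sketch deliberately leaves out.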
Billy3
_______________________________________________
Lib-Ext mailing list
Lib-Ext at lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
Link to this post: http://lists.isocpp.org/lib-ext/2019/11/13606.php
_______________________________________________
Lib-Ext mailing list
Lib-Ext at lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
Link to this post: http://lists.isocpp.org/lib-ext/2019/11/13609.php
_______________________________________________
Lib mailing list
Lib at lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
Link to this post: http://lists.isocpp.org/lib/2019/11/14227.php