[SG16-Unicode] Draft string literal issues

Tom Honermann tom at honermann.net
Sun Mar 17 05:42:11 CET 2019


On 3/16/19 1:46 PM, Steve Downey wrote:
> Study Group 16 has recently noticed two issues with string literals in 
> the current WP.
>
> The first is in lex.phases#1.5 ( http://eel.is/c++draft/lex.phases#1.5 
> ), where characters in all string literals are converted into the 
> execution character set; this should be true only for un-prefixed 
> literals. U, u, and u8 string literals should be converted to UTF-32, 
> UTF-16, and UTF-8, respectively, and wide literals into the wide encoding.
Sounds good.  I'm struggling with the standard talking only about 
character sets here and not encodings, but that is a different 
pre-existing problem.
>
> The second, however, follows from the first: string literals are 
> concatenated after being translated.  lex.string#12 ( 
> http://eel.is/c++draft/lex.string#12.note-1 ) states that "If one 
> string-literal has no encoding-prefix, it is treated as a 
> string-literal of the same encoding-prefix as the other operand." 
> However, since concatenation happens _after_ encoding, there is no 
> sensible way to achieve this. The execution encoding will not, in 
> general, be a valid Unicode encoding, and even if it happens to be, it 
> will not encode the same source characters. The conversion from 
> universal-character-name to execution encoding will also, in general, 
> be lossy, leading to replacement characters, like '?', in the strings.
I think it would be helpful to explicitly mention translation phases 5 
and 6 here.
>
> SG16 has not reached consensus on how the issue should be resolved, 
> except that creating mojibake, as happens in practice now, is undesirable.
I don't think it is necessary to mention SG16 when reporting the issue 
since we don't (yet) have a proposed resolution to offer.
>
> MSVC 19.16, for example, when processing `char16_t text1[] = u"" 
> "\u0102";` with the /utf-8 option, encodes the string literal as {0xC4 
> 0x82}, then treats that pair of bytes as Windows-1252, the normal 
> execution encoding, before re-encoding as UTF-16, {0x00C4 0x201A}, 
> where the first character is U+00C4 LATIN CAPITAL LETTER A WITH 
> DIAERESIS and the second is U+201A SINGLE LOW-9 QUOTATION MARK, 
> which corresponds to 0x82 in Windows-1252.

I think it would be useful to present the current implementation 
divergence between gcc/clang and MSVC (without use of the /utf-8 option, 
since that mode is clearly buggy).  This would demonstrate that 
gcc/clang does not differ in behavior between `u"" "x"` and `u"" u"x"`, 
whereas MSVC does.

Richard offered a potential resolution on the std-discussion list; it 
may be worth submitting his suggestion (with appropriate attribution, of 
course) along with the issue.

https://groups.google.com/a/isocpp.org/d/msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ

Tom.


