[Tooling] [isocpp-modules] Dependency information for module-aware build tools

Tom Honermann tom at honermann.net
Thu Mar 7 16:59:20 CET 2019


On 3/7/19 10:30 AM, Ben Boeckel wrote:
> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
>> I don't know of any that use 32-bit code units for file names.
>>
>> I find myself thinking (as I so often do these days much to the surprise
>> of my past self), how does EBCDIC and z/OS fit in here? If we stick to
>> JSON and require the dependency file to be UTF-8 encoded, would all file
>> names in these files be raw8 encoded and effectively unreadable (by
>> humans) on z/OS?  Perhaps we could allow more flexibility, but doing so
>> necessarily invites locales into the discussion (for those that are
>> unaware, EBCDIC has code pages too).  For example, we could require that
>> the selected locale match between the producers and consumers of the
>> file (UB if they don't) and permit use of the string representation by
>> transcoding from the locale interpreted physical file name to UTF-8, but
>> only if reverse-transcoding produces the same physical file name,
>> otherwise the appropriate raw format must be used.
> I first tried saying "treat these strings as if they were byte arrays"
> with allowances for escaping `"` and `\`, but there was pushback on the
> previous thread about it. This basically makes a new dialect of JSON
> which is (usually) an error in existing implementations. It would mean
> that tools are implementing their own JSON parsers (or even writers)…

This isn't what I was suggesting.  Rather, I was suggesting that 
standard UTF-8 encoded JSON be used.  On platforms where the 
interpretation of a file name may differ based on locale settings, if 
the file name can be losslessly round-tripped to UTF-8 and back, then 
its UTF-8 encoding (transcoded from the locale) would be used in the 
JSON file as a (well-formed) UTF-8/JSON string, even though that name 
wouldn't reflect the exact code units of the file name.

For example, consider a file name consisting of the octets { 0x86, 0x89, 
0x93, 0x85, 0x59 }.  In EBCDIC code page 37, this denotes the file name 
"fileß", but in EBCDIC code page 273 it denotes "file~".  The suggestion 
then is that, if the current locale setting is CP37 when the JSON file 
is generated, the UTF-8 encoded name "fileß" be used as a normal JSON 
string.  Tools consuming the file would then have to transcode the 
UTF-8 provided name back to the original locale to open the file.
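
To make that concrete, here is a rough sketch in Python.  The helper 
names and the "name" key are made up for illustration (they are not part 
of any proposed schema), and Python's cp037/cp273 codecs stand in for 
whatever locale machinery the platform actually provides.

```python
import json

# The file name octets from the example above.
RAW_NAME = bytes([0x86, 0x89, 0x93, 0x85, 0x59])


def name_for_json(raw, locale_encoding):
    """Hypothetical producer helper.

    Returns the name as text if it round-trips losslessly through
    locale_encoding, or None if a raw format would be needed instead.
    """
    try:
        text = raw.decode(locale_encoding)
        # Reverse transcoding must restore the exact physical octets.
        if text.encode(locale_encoding) != raw:
            return None
    except UnicodeError:
        return None
    return text


def name_from_json(text, locale_encoding):
    """Hypothetical consumer helper: transcode back to the physical octets."""
    return text.encode(locale_encoding)


cp37_name = name_for_json(RAW_NAME, "cp037")   # "fileß"
cp273_name = name_for_json(RAW_NAME, "cp273")  # "file~"

# Producer running under CP37: the file itself is plain UTF-8 JSON.
utf8_file = json.dumps({"name": cp37_name}, ensure_ascii=False).encode("utf-8")

# Consumer running under the same locale recovers the original octets.
parsed = json.loads(utf8_file.decode("utf-8"))
restored = name_from_json(parsed["name"], "cp037")

# A consumer running under CP273 would get different octets, which is
# why the producer and consumer have to agree on the encoding.
mismatched = name_from_json(parsed["name"], "cp273")
```

With matching locales, restored equals the original octets; with CP273 
on the consuming side, mismatched does not, and opening the file would 
fail.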

Previously, I had suggested that the locales must match for the producer 
and consumer and that it be UB otherwise (effectively leading to file 
not found errors).  However, I think it would be better to store the 
encoding used to interpret the file name at generation time (if it isn't 
UTF-8) in the file to allow tools to accurately reverse the UTF-8 
encoding.  The supported encodings and the spelling of their names 
would, of course, be implementation/platform defined.
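
A minimal sketch of what recording the encoding might look like, in 
Python.  The "name" and "encoding" keys and both helper functions are 
hypothetical, as is the use of Python codec names like "cp037" as the 
spelling of the encoding; a real format would define its own.

```python
import json


def make_entry(raw, encoding):
    """Hypothetical producer: record the name and how it was interpreted."""
    text = raw.decode(encoding)
    if text.encode(encoding) != raw:
        raise ValueError("name does not round-trip; a raw format is needed")
    entry = {"name": text}
    # Only store the encoding when it isn't UTF-8.
    if encoding.lower() not in ("utf-8", "utf8"):
        entry["encoding"] = encoding
    return entry


def physical_name(entry):
    """Hypothetical consumer: reverse the transcoding using the recorded
    encoding, independent of the consumer's own locale."""
    return entry["name"].encode(entry.get("encoding", "utf-8"))


raw = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

# Producer interpreted the octets under CP37.
entry = make_entry(raw, "cp037")
utf8_file = json.dumps(entry, ensure_ascii=False).encode("utf-8")

# Consumer recovers the octets without needing to know the producer's locale.
recovered = physical_name(json.loads(utf8_file.decode("utf-8")))
```

The point of the extra key is that the consumer no longer depends on 
its own locale setting matching the producer's.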

>
> Note that if you'd like to have a readable filename, adding it as a
> `_readable` key with a human-readable utf-8 transcoding to the filename
> would be supported (see my message with the JSON schema bits from
> yesterday).

That seems reasonable to me for file names that really can't be 
represented as UTF-8, but it seems like noise otherwise.  In other 
words, I think we should try to minimize use of raw8, raw16, etc. where 
possible.

Tom.

>
> --Ben



