[Tooling] [isocpp-modules] Dependency information for module-aware build tools

Tom Honermann tom at honermann.net
Fri Mar 8 06:00:56 CET 2019


On 3/7/19 11:13 AM, Ben Boeckel via Modules wrote:
> On Thu, Mar 07, 2019 at 10:59:20 -0500, Tom Honermann wrote:
>> For example, consider a file name consisting of the octets { 0x86, 0x89,
>> 0x93, 0x85, 0x59 }.  In EBCDIC code page 37, this denotes a file name
>> "fileß", but in EBCDIC code page 273 denotes "file~".  The suggestion
>> then is, when generating the JSON file, if the current locale setting is
>> CP37, to use the UTF-8 encoded name "fileß" as a normal JSON string.
>> Tools consuming the file would then have to transcode the UTF-8 provided
>> name back to the original locale to open the file.
> This would require build tools to do more than "just" sling strings
> around. iconv is not a light dependency… It also means that compilers
> can't do the (trivial) `is_valid_utf8` check and instead have to also do
> transcoding. And know the name of the encoding used. On Linux, you don't
> have that information at all. For example, my locale is all
> `en_US.UTF-8`, but nothing stops me from having a Shift-JIS filename
> anywhere (and I do have a few in archives of mid-2000-era software). How
> is a compiler supposed to know what the encoding of `readdir->d_name` is
> here?

A tool can't know the encoding of `readdir->d_name`.  This problem 
occurs with any tool that intends to display a file name, even tools 
like 'ls'.  For example, on Linux, in a directory with a file name "fileß":

# With default locale settings (UTF-8):
$ ls -1
fileß

# With "C" locale:
$ LANG=C ls -1
'file'$'\303\237'

# With Czech locale:
$ LANG=cs_CZ.iso88592 ls -1
'file�'$'\237'

Essentially, interpretation of a file name is always subject to locale 
settings.
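
The same effect can be reproduced without 'ls'. Below is a minimal Python sketch, assuming the codecs shipped with Python's standard library ("iso8859_2", "cp037", "cp273", "shift_jis"). The helper names `interpret` and `is_valid_utf8` are hypothetical, the latter named after the check Ben mentions. It decodes one fixed byte sequence under several encodings and shows that validating UTF-8 is cheap but says nothing about which encoding was actually intended.

```python
# Raw bytes of the file name as stored on disk (UTF-8 encoding of "fileß").
raw_name = "fileß".encode("utf-8")  # b'file\xc3\x9f'


def interpret(name: bytes, encoding: str) -> str:
    """Decode a raw file name; undecodable bytes become U+FFFD."""
    return name.decode(encoding, errors="replace")


def is_valid_utf8(name: bytes) -> bool:
    """Cheap validity check: no transcoding tables or locale knowledge needed."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# One byte sequence, three interpretations, mirroring the 'ls' runs above.
for encoding in ("utf-8", "ascii", "iso8859_2"):
    print(f"{encoding:10} {interpret(raw_name, encoding)!r}")

# The EBCDIC octets from earlier in the thread, read under two code pages.
ebcdic_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
for encoding in ("cp037", "cp273"):
    print(f"{encoding:10} {interpret(ebcdic_name, encoding)!r}")

# A Shift-JIS name in a UTF-8 locale fails validation, but the check
# cannot tell us that Shift-JIS is what the bytes really are.
print(is_valid_utf8("ファイル".encode("shift_jis")))
```

Nothing in the bytes themselves identifies the encoding; only the caller's choice of codec changes what is displayed.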

>
>> Previously, I had suggested that the locales must match for the producer
>> and consumer and that it be UB otherwise (effectively leading to file
>> not found errors).  However, I think it would be better to store the
>> encoding used to interpret the file name at generation time (if it isn't
>> UTF-8) in the file to allow tools to accurately reverse the UTF-8
>> encoding.  The supported encodings and the spelling of their names
>> would, of course, be implementation/platform defined.
> Build tools already have enough things to worry about. Transcoding and
> code pages is not something I want a /dependency file format/ to require
> them to handle.
I can appreciate not wanting additional requirements :)
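
For concreteness, the round trip I have in mind could look roughly like the following Python sketch. The keys "name" and "encoding" and the functions `make_entry` and `recover_raw_name` are hypothetical, invented for illustration; they are not taken from the proposed schema, and the encoding names are simply Python's codec spellings standing in for whatever an implementation would define.

```python
import json


def make_entry(raw_name: bytes, locale_encoding: str) -> dict:
    """Producer side: store the name as UTF-8 text, plus the source encoding."""
    text = raw_name.decode(locale_encoding)  # raises if the bytes do not decode
    entry = {"name": text}
    if locale_encoding.lower().replace("_", "-") != "utf-8":
        # Hypothetical key: only recorded when the locale was not UTF-8.
        entry["encoding"] = locale_encoding
    return entry


def recover_raw_name(entry: dict) -> bytes:
    """Consumer side: transcode back to the bytes needed to open the file."""
    return entry["name"].encode(entry.get("encoding", "utf-8"))


# The EBCDIC octets from the example, produced under code page 37.
raw = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
serialized = json.dumps(make_entry(raw, "cp037"), ensure_ascii=False)
print(serialized)

# With the stored encoding, the consumer recovers the original octets.
restored = recover_raw_name(json.loads(serialized))
print(restored == raw)

# A consumer that ignored the stored encoding and used its own locale
# (code page 273 here) would get different octets, hence "file not found".
mismatched = json.loads(serialized)["name"].encode("cp273")
print(mismatched == raw)
```

The cost Ben points out is visible here: both sides need the transcoding tables and have to agree on the encoding names.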
>
>> On 3/7/19 10:30 AM, Ben Boeckel wrote:
>>> Note that if you'd like to have a readable filename, adding it as a
>>> `_readable` key with a human-readable utf-8 transcoding to the filename
>>> would be supported (see my message with the JSON schema bits from
>>> yesterday).
>> That seems reasonable to me for file names that really can't be
>> represented as UTF-8, but seems like noise otherwise.  In other words, I
>> think we should try to minimize use of raw8, raw16, etc... where possible.
> Then we should probably look for a core format that doesn't require
> UTF-8 and instead supports byte arrays natively (effectively making it a
> binary format as far as text editors are concerned).

That was my initial inclination before the JSON approach was suggested.  
I think either approach is workable.

Tom.



