[SG16-Unicode] P1689: Encoding of filenames for interchange

Tom Honermann tom at honermann.net
Thu Sep 5 17:20:58 CEST 2019


Thank you for writing this up, Thiago!

On 9/5/19 12:12 AM, Thiago Macieira wrote:
> == Transport ==
> P1689 suggests using JSON. I'm comparing that in the context of the three
> options with a binary format (CBOR).
>
> One thing SG16 is completely in agreement of is that if you go with JSON, you
> must obey RFC 8259: there must not be a BOM and the file must be encoded in
> UTF-8.

We haven't polled anything, so saying we're all in agreement is 
premature.  Additionally, we discussed this further in the SG16 meeting 
yesterday and I think we determined that a BOM *may* be present.

RFC 8259 section 8.1 states: (emphasis mine)

    JSON text exchanged between systems *that are not part of a closed
    ecosystem* MUST be encoded using UTF-8 [RFC3629].

    Previous specifications of JSON have not required the use of UTF-8
    when transmitting JSON text.  However, the vast majority of
    JSON-based software implementations have chosen to use the UTF-8
    encoding, to the extent that it is the only encoding that achieves
    interoperability.

    Implementations MUST NOT add a byte order mark (U+FEFF) to the
    beginning of a *networked-transmitted JSON text*.  In the interests
    of interoperability, implementations that parse JSON texts *MAY
    ignore the presence of a byte order mark* rather than treating it as
    an error.

My reading of this is that RFC 8259 permits use of non-UTF-8 encodings 
in some situations.  Whether the situation that P1689 is defined for 
qualifies is something that could be debated.  If we consider the build 
system and compiler invocations to form a closed ecosystem, then the 
dependency file could be, for example, EBCDIC encoded JSON and still 
conform to RFC 8259.  I'm not arguing for or against such a position at 
this time; but rather noting that, if SG15 requires UTF-8 encoded JSON, 
that requirement is arguably more restrictive than what RFC 8259 requires.
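
To make the EBCDIC case concrete, here is a minimal Python sketch. The 
document content and the helper names are hypothetical, and cp037 is 
just one EBCDIC code page picked for illustration.  It only shows that 
a strict UTF-8 consumer cannot read such a file, while a consumer inside 
the same closed ecosystem that knows the encoding can:

```python
import json

# Hypothetical dependency document; the key name is for illustration only.
EXAMPLE_DOC = {"source": "a.cpp"}


def encode_as_ebcdic(doc) -> bytes:
    """Serialize to JSON text, then encode with the cp037 EBCDIC code page."""
    return json.dumps(doc).encode("cp037")


def try_parse_as_utf8(raw: bytes):
    """Return the parsed document, or None if the bytes are not valid UTF-8."""
    try:
        return json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError:
        return None


def parse_as_ebcdic(raw: bytes):
    """Parse bytes that are known, out of band, to be cp037 encoded."""
    return json.loads(raw.decode("cp037"))
```

In cp037 the opening brace is the byte 0xC0, which can never appear in 
well-formed UTF-8, so the strict reader rejects the file at its first byte.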

My reading of the BOM requirements is that they only apply to UTF-8 data 
sent over the network and that use of a BOM in file contents is permitted.
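
As a rough sketch of what "MAY ignore the presence of a byte order 
mark" could look like for a consumer of UTF-8 encoded files, assuming 
Python and hypothetical function names (this is not something P1689 
specifies):

```python
import codecs
import json


def load_json_tolerating_bom(raw: bytes):
    """Parse UTF-8 JSON bytes, ignoring an optional leading BOM."""
    # The 'utf-8-sig' codec drops a leading EF BB BF if present and
    # otherwise behaves like plain 'utf-8'.
    return json.loads(raw.decode("utf-8-sig"))


def dump_json(doc, with_bom: bool = False) -> bytes:
    """Serialize to UTF-8 JSON bytes, optionally prefixed with a BOM."""
    payload = json.dumps(doc, ensure_ascii=False).encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload
```

A producer would leave with_bom at its default for anything sent over a 
network; the reader accepts both forms.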

ECMA-404 does not specify any requirements on the encoding of the JSON 
content, nor on the presence or absence of a BOM.

My conclusions are that, if we choose to adopt either RFC 8259 or 
ECMA-404 as the JSON specification we defer to, and if we don't add 
additional restrictions:

 1. Implementations could choose whatever encoding they like for the
    JSON file.
 2. Implementations could choose whether to produce and consume a BOM.

Tom.
