Submission Date: 10 Dec 92
Submittor: WG14
Source: X3J11/90-066 (Yasushi Nakahara)
Question 1
It is unclear how the fscanf function shall behave when
executing directives that include ``ordinary multibyte characters,'' especially
in the case of shift-encoded ordinary multibyte characters.
Subclause 7.9.6.2, The fscanf function, of the current standard contains
the following statement:
A directive that is an ordinary multibyte character is executed by reading
the next characters of the stream. If one of the characters differs from
one comprising the directive, the directive fails, and the differing and
subsequent characters remain unread.
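The quoted rule can be observed in the simple single-byte case. The following is a minimal sketch, not part of the standard text: the helper name first_unread_after_abc and the literal format "abc" are chosen here purely for illustration, and the use of tmpfile as a scratch stream is an assumption about the environment.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `input` to a temporary stream, runs the
 * ordinary-character directives "abc" against it, and returns the first
 * character that fscanf left unread (or EOF if nothing is left or the
 * temporary stream could not be created). */
int first_unread_after_abc(const char *input)
{
    assert(input != NULL);

    FILE *fp = tmpfile();
    if (fp == NULL)
        return EOF;

    fputs(input, fp);
    rewind(fp);

    /* No conversion specifiers: each of 'a', 'b', 'c' is a directive that
     * must match the next character of the stream. */
    int rc = fscanf(fp, "abc");
    (void)rc; /* the return value is not needed for this illustration */

    int next = fgetc(fp); /* the differing character, if any, is still here */
    fclose(fp);
    return next;
}
```

With the input "abX", the directive c fails on X, so X is the next character returned by fgetc. The question below is how this rule extends to directives whose encoding contains shift sequences.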
Assume a typical shift-encoded directive: Ä in a 7-bit
representation. Consider two different encoding systems, Latin Alphabet
No. 1 (8859/1) and German Standard DIN 66 003. The codes are, for example,
Ä in 8859/1: SO 4/4 SI
Ä in DIN 66 003: ESC 2/8 4/11 5/11 ESC 2/8 4/2
where SO is a Shift-Out code (0/14 = 0x0E) and SI corresponds
to a Shift-In code (0/15 = 0x0F). ``ESC 2/8 4/11'' is an escape
sequence for the German Standard DIN 66 003, and ``ESC 2/8 4/2''
is for ISO 646 USA Version (ASCII).
Assuming that a subject sequence includes Ä,
Ö, and Ü with the following
7-bit representations,
in 8859/1: SO 4/4 5/6 5/12 SI
in DIN 66 003: ESC 2/8 4/11 5/11 5/12 5/13 ESC 2/8 4/2
does the ``Ä'' directive in the fscanf
format string match the beginning part of the ``ÄÖÜ''
sequence?
At what position of the target sequence shall the ``Ä''
directive fail?
One interpretation is that, because the current standard defines
the behavior of the directive in the fscanf format in terms of
the word ``character'' (byte) rather than the term ``multibyte character,''
the comparison shall be done on a byte-by-byte basis. One may then conclude
that the ``Ä'' directive never matches the
``ÄÖÜ''
sequence in this case.
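The byte-by-byte reading can be sketched as a plain prefix comparison over the byte sequences listed above. This is an illustration only: the function name matched_prefix_bytes and the array names are hypothetical, and the code values assume SO = 0x0E, SI = 0x0F, and ESC = 0x1B as in ISO 646.

```c
#include <assert.h>
#include <stddef.h>

enum { SO = 0x0E, SI = 0x0F, ESC = 0x1B };

/* 8859/1 in 7-bit form: the directive (A-umlaut) and the subject sequence. */
static const unsigned char dir_8859[] = { SO, 0x44, SI };
static const unsigned char in_8859[]  = { SO, 0x44, 0x56, 0x5C, SI };

/* DIN 66 003: the directive and the subject sequence. */
static const unsigned char dir_din[] =
    { ESC, 0x28, 0x4B, 0x5B, ESC, 0x28, 0x42 };
static const unsigned char in_din[] =
    { ESC, 0x28, 0x4B, 0x5B, 0x5C, 0x5D, ESC, 0x28, 0x42 };

/* Returns how many leading bytes of the directive equal the corresponding
 * bytes of the input. The directive matches as a whole only if the result
 * equals dir_len. */
size_t matched_prefix_bytes(const unsigned char *dir, size_t dir_len,
                            const unsigned char *in, size_t in_len)
{
    assert(dir != NULL && in != NULL);

    size_t i = 0;
    while (i < dir_len && i < in_len && dir[i] == in[i])
        i++;
    return i;
}
```

Under this reading the 8859/1 directive fails at its third byte (SI in the directive against 5/6 in the input), and the DIN 66 003 directive fails at its fifth byte (ESC against 5/12), so neither directive matches in full.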
Another interpretation may lead to the opposite conclusion: that
the current standard's statements quoted above do not necessarily mean
that such comparison shall be done on a byte-by-byte basis. Instead, they
can be read as requiring that the matching be done on a ``multibyte character by multibyte
character basis'' or rather a ``wide character by wide character basis.''
In particular, a ``ghost'' sequence like ``ESC ...'' and the SI/SO
characters should not be regarded as independent ordinary multibyte characters
in this case.
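The character-by-character reading can be sketched for the SO/SI case by first removing the shift codes and then comparing the decoded characters. This is a rough illustration under stated assumptions, not an implementation of any library: the decoder decode_so_si is hypothetical, handles only SO and SI (not escape sequences), assumes that a shifted-out byte maps to the 8859/1 code with the high bit set, and silently truncates at a fixed 64-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SO = 0x0E, SI = 0x0F };

/* 8859/1 in 7-bit form: the directive (A-umlaut) and the subject sequence. */
static const unsigned char dir_8859[] = { SO, 0x44, SI };
static const unsigned char in_8859[]  = { SO, 0x44, 0x56, 0x5C, SI };

/* Hypothetical decoder: drops SO/SI and sets the high bit on bytes that
 * appear while shifted out. Returns the number of decoded characters
 * stored in `out` (at most out_cap). */
static size_t decode_so_si(const unsigned char *in, size_t in_len,
                           unsigned char *out, size_t out_cap)
{
    int shifted = 0;
    size_t n = 0;

    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == SO) { shifted = 1; continue; }
        if (in[i] == SI) { shifted = 0; continue; }
        if (n < out_cap)
            out[n++] = shifted ? (unsigned char)(in[i] | 0x80) : in[i];
    }
    return n;
}

/* Returns 1 if the decoded directive is a prefix of the decoded input,
 * that is, if the directive matches character by character. */
int directive_matches_by_character(const unsigned char *dir, size_t dir_len,
                                   const unsigned char *in, size_t in_len)
{
    assert(dir != NULL && in != NULL);

    unsigned char d[64], s[64];
    size_t nd = decode_so_si(dir, dir_len, d, sizeof d);
    size_t ns = decode_so_si(in, in_len, s, sizeof s);

    if (nd > ns)
        return 0;
    return memcmp(d, s, nd) == 0;
}
```

Here the directive decodes to the single code 0xC4 and the subject sequence decodes to 0xC4 0xD6 0xDC, so under this reading the directive matches the first character of the sequence, the opposite of the byte-by-byte result.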
Which is a correct interpretation of the current standard?
These different interpretations are caused by the ambiguity of the descriptions
in the current standard. It should also be pointed out that the major
problem here is the usage of the word ``character.'' The generic word ``character''
and the specific word ``character(=byte)'' should be properly distinguished
in the standard.
Response
Subclause 7.9.6.2 says, ``A directive that is an ordinary multibyte character
is executed by reading the next characters ...'' [emphasis added].
Consistently throughout the standard, plain ``characters'' refers to one-byte
characters. (See subclause 3.5 for the definition of ``character.'')