[SG16-Unicode] Fwd: Shift-JIS NEC/IBM discussion

Tom Honermann tom at honermann.net
Sun Apr 15 16:08:15 CEST 2018


Preserving from the old std-text-wg at googlegroups.com mailing list.

-------- Forwarded Message --------
Subject: 	Shift-JIS NEC/IBM discussion
Date: 	Mon, 26 Feb 2018 18:18:25 +0000
From: 	Mark Zeren <mzeren at vmware.com>
To: 	std-text-wg <std-text-wg at googlegroups.com>



Copied from Slack for safekeeping:

sdowney [1 hour ago]

Shift-JIS has a few hundred distinct character pairs that were unified 
into the same Unicode code points?

rmf [24 minutes ago]

That's only half correct. There are several problematic characters in 
the Japanese encoding standards, but this isn't an issue with Han 
unification.

rmf [22 minutes ago]

Those pairs are pairs *of the same character*, which happens to exist 
*twice* in common Shift-JIS codepages, like Microsoft's cp932.

rmf [20 minutes ago]

Those code pages encode the same character twice because of the way 
Shift-JIS extensions occurred. Almost all of the problematic characters 
were added by NEC and by IBM at separate Shift-JIS code points.

rmf [19 minutes ago]

Because the NEC and IBM code points don't overlap, Microsoft's code page 
doubles as an IBM-compatible Shift-JIS and as an NEC-compatible Shift-JIS 
by mapping both.
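
A rough way to see the doubled characters is to decode every candidate 
two-byte sequence and group the results by character. The sketch below 
assumes Python's built-in "cp932" codec is an acceptable stand-in for 
Microsoft's code page; the function name find_duplicate_encodings is 
hypothetical and only for illustration.

```python
from collections import defaultdict


def find_duplicate_encodings(codec="cp932"):
    """Return {character: [byte sequences]} for characters that have
    more than one two-byte encoding in the given codec (assumed cp932)."""
    by_char = defaultdict(list)
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this code page
            # Single-byte lead values (e.g. half-width katakana) decode to
            # two characters; keep only genuine double-byte characters.
            if len(text) == 1:
                by_char[text].append(raw)
    return {ch: seqs for ch, seqs in by_char.items() if len(seqs) > 1}


duplicates = find_duplicate_encodings()
for ch, seqs in sorted(duplicates.items())[:5]:
    print(f"U+{ord(ch):04X}", [s.hex() for s in seqs])
```

Every entry in the result is one character reachable from two or more 
Shift-JIS positions, which is the source of the "pairs" in the question.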

rmf [11 minutes ago]

So yes, some Shift-JIS codepages, like cp932, don't roundtrip with naive 
processes, but:

rmf [10 minutes ago]

1. the problem is specific to the code pages and unrelated to Han 
unification

rmf [10 minutes ago]

2. the problem is actually irrelevant unless you're interacting with 
e.g. NEC-only or IBM-only systems

rmf [10 minutes ago]

And 3. Unicode has mechanisms to actually roundtrip this properly if you 
need it.

rmf [7 minutes ago]

(If you want an example: NEC encoded 纊 at Shift-JIS position 0xED40; IBM 
encoded it at Shift-JIS position 0xFA5C; Unicode has it at U+7E8A)
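
The example above can be checked with a short sketch, again assuming 
Python's built-in "cp932" codec matches Microsoft's code page for these 
positions. Both byte sequences decode to U+7E8A, but encoding has to pick 
a single byte sequence, so a naive decode-then-encode pass cannot 
preserve both originals. Which one the encoder picks depends on the 
codec's table.

```python
NEC_BYTES = bytes([0xED, 0x40])  # NEC position from the example above
IBM_BYTES = bytes([0xFA, 0x5C])  # IBM position from the example above

nec_char = NEC_BYTES.decode("cp932")
ibm_char = IBM_BYTES.decode("cp932")

# Both positions map to the same Unicode code point.
print(f"NEC -> U+{ord(nec_char):04X}, IBM -> U+{ord(ibm_char):04X}")

# Encoding yields exactly one byte sequence, so one of the two inputs
# comes back changed after a decode/encode round trip.
round_tripped = nec_char.encode("cp932")
print("re-encoded as:", round_tripped.hex())
print("NEC survives round trip:", round_tripped == NEC_BYTES)
print("IBM survives round trip:", round_tripped == IBM_BYTES)
```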
