<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Preserving from the old <a class="moz-txt-link-abbreviated" href="mailto:std-text-wg@googlegroups.com">std-text-wg@googlegroups.com</a> mailing
list.<br>
</p>
<div class="moz-forward-container">-------- Forwarded Message
--------
<table class="moz-email-headers-table" border="0" cellspacing="0"
cellpadding="0">
<tbody>
<tr>
<th nowrap="nowrap" valign="BASELINE" align="RIGHT">Subject:
</th>
<td>Shift-JIS NEC/IBM discussion</td>
</tr>
<tr>
<th nowrap="nowrap" valign="BASELINE" align="RIGHT">Date: </th>
<td>Mon, 26 Feb 2018 18:18:25 +0000</td>
</tr>
<tr>
<th nowrap="nowrap" valign="BASELINE" align="RIGHT">From: </th>
<td>Mark Zeren <a class="moz-txt-link-rfc2396E" href="mailto:mzeren@vmware.com">&lt;mzeren@vmware.com&gt;</a></td>
</tr>
<tr>
<th nowrap="nowrap" valign="BASELINE" align="RIGHT">To: </th>
<td>std-text-wg <a class="moz-txt-link-rfc2396E" href="mailto:std-text-wg@googlegroups.com">&lt;std-text-wg@googlegroups.com&gt;</a></td>
</tr>
</tbody>
</table>
<br>
<br>
<style><!--
/* Font Definitions */
@font-face
        {font-family:"MS Gothic";
        panose-1:2 11 6 9 7 2 5 8 2 4;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"\@MS Gothic";
        panose-1:2 11 6 9 7 2 5 8 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">copied from
Slack for safe keeping:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">sdowney [1
hour ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Shift-JIS
has a few hundred distinct character pairs that were unified
into the same Unicode code points?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [24
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">That's only
half correct. There are several problematic characters in
the Japanese encoding standards, but this isn't an issue
with Han unification.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [22
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Those pairs
are pairs *of the same character*, which happens to exist
*twice* in common Shift-JIS codepages, like Microsoft's
cp932.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [20
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Those code
pages encode the same character twice because of how the
Shift-JIS extensions came about: almost all of the
problematic characters were added independently by NEC and by
IBM, at separate Shift-JIS code points.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [19
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Because
these pairs don't overlap, Microsoft's code page doubles as
an IBM-compatible Shift-JIS and as a NEC-compatible
Shift-JIS by mapping both.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [11
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">So yes, some
Shift-JIS code pages, like cp932, don't round-trip through
Unicode with a naive decode and re-encode, but:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [10
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">1. the
problem is specific to the code pages and unrelated to Han
unification<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [10
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">2. the
problem is actually irrelevant unless you're interacting
with e.g. NEC-only or IBM-only systems<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [10
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">And 3.
Unicode has mechanisms to actually roundtrip this properly
if you need it.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">rmf [7
minutes ago]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">(If you want
an example: NEC encoded
</span><span style="font-size:11.0pt;font-family:&quot;MS
Gothic&quot;">纊</span><span style="font-size:11.0pt"> at
Shift-JIS position 0xED40; IBM encoded it at Shift-JIS
position 0xFA5C; Unicode has it at U+7E8A)<o:p></o:p></span></p>
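<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">The example
above can be checked with a short sketch. It is not part of the
original chat and assumes Python 3 with its built-in "cp932" codec
standing in for Microsoft's code page. The names NEC_BYTES,
IBM_BYTES and roundtrips are hypothetical helpers introduced here
for illustration. The sketch decodes both byte sequences and
reports which of them survives a naive decode and re-encode; which
one wins depends on the codec's own mapping table, so the code
prints the result rather than assuming it.<o:p></o:p></span></p>
<pre>
```python
# Sketch: how Python's built-in "cp932" codec treats the NEC and IBM
# code points for the same character (U+7E8A) quoted in the chat above.
# Assumption: Python's "cp932" codec stands in for Microsoft's code page.

NEC_BYTES = bytes([0xED, 0x40])  # NEC position quoted in the chat
IBM_BYTES = bytes([0xFA, 0x5C])  # IBM position quoted in the chat


def roundtrips(raw, codec="cp932"):
    """Return True if decoding then re-encoding reproduces the same bytes."""
    return raw.decode(codec).encode(codec) == raw


nec_char = NEC_BYTES.decode("cp932")
ibm_char = IBM_BYTES.decode("cp932")

# Both byte sequences decode to the same Unicode character.
print(f"NEC {NEC_BYTES.hex()} decodes to U+{ord(nec_char):04X}")
print(f"IBM {IBM_BYTES.hex()} decodes to U+{ord(ibm_char):04X}")

# Encoding yields a single byte sequence, so at most one of the two
# originals can survive a naive decode/encode round trip.
print("NEC round-trips:", roundtrips(NEC_BYTES))
print("IBM round-trips:", roundtrips(IBM_BYTES))
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt">Because both
inputs decode to one character, the encoder has to pick a single
byte sequence on the way back, which is why the round trip is
lossy only for data that came from the other vendor's
positions.<o:p></o:p></span></p>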
</div>
</div>
</body>
</html>