Why is cp932 needed?
In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X 0201 and JIS X 0208 characters. (See http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a descriptive term has become very vague, and it often includes the extensions to Shift_JIS that are defined by various vendors.
For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted using Unicode (ucs2).
The sjis character set does not support the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932 character set is designed to solve these problems.
Because MySQL supports character set conversion, and IANA Shift_JIS and cp932 provide different conversion rules, it is important to treat them as two separate character sets.
How does cp932 differ from sjis?
The cp932 character set differs from sjis in the following ways:
cp932 supports NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.
Some cp932 characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to cp932, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected—IBM extended characters, use the code point of IBM selected characters.
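The three rules above form a simple priority order, which can be sketched in Python as follows. This is a minimal illustration, not MySQL's implementation: the names Group and pick_round_trip_code are hypothetical, the third rule is read as preferring the IBM selected code point (lead bytes 0xFA to 0xFC), and the sample code points are illustrative values that should be checked against the Microsoft table.

```python
from enum import IntEnum


class Group(IntEnum):
    """Character groups named in the rules; a lower value wins."""
    JIS_X_0208 = 0
    NEC_SPECIAL = 1
    IBM_SELECTED = 2               # assumed reading of the third rule
    NEC_SELECTED_IBM_EXTENDED = 3


def pick_round_trip_code(candidates):
    """Return the cp932 code point to use for one Unicode character.

    `candidates` maps each Group the character appears in to the
    cp932 code point it has in that group.
    """
    if not candidates:
        raise ValueError("character has no cp932 code point")
    # min() over the keys picks the highest-priority group present.
    return candidates[min(candidates)]


# Illustrative sample values (assumptions, verify against the table).
approx_equal = {Group.NEC_SPECIAL: 0x8790, Group.JIS_X_0208: 0x81E0}
small_roman_one = {
    Group.NEC_SELECTED_IBM_EXTENDED: 0xEEEF,
    Group.IBM_SELECTED: 0xFA40,
}

print(hex(pick_round_trip_code(approx_equal)))     # 0x81e0
print(hex(pick_round_trip_code(small_roman_one)))  # 0xfa40
```

Encoding the groups as an ordered enum keeps the rule in one place: adding or reordering a group changes the outcome without touching the selection function.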
The table shown at https://msdn.microsoft.com/en-us/goglobal/cc305152.aspx provides information about the Unicode values of cp932 characters. For cp932 table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2) encoding. For table entries in which an underlined two-digit value appears, there is a range of cp932 character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932 characters that begin with those digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters (lead byte 0x87): https://msdn.microsoft.com/en-us/goglobal/gg674964
NEC selected—IBM extended characters (lead byte 0xED and 0xEE): https://msdn.microsoft.com/en-us/goglobal/gg671837 https://msdn.microsoft.com/en-us/goglobal/gg671838
IBM selected characters (lead byte 0xFA, 0xFB, 0xFC): https://msdn.microsoft.com/en-us/goglobal/gg671839 https://msdn.microsoft.com/en-us/goglobal/gg671840 https://msdn.microsoft.com/en-us/goglobal/gg671841
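The lead bytes listed above are enough to make a rough guess at which extension group a two-byte cp932 value belongs to. The following Python sketch shows the idea; the function name extension_group and the short group labels are hypothetical, and the check looks only at the lead byte, so it does not verify that the trail byte is actually assigned in the Microsoft table.

```python
from typing import Optional

# Lead byte -> extension group, as listed above.
EXTENSION_LEAD_BYTES = {
    0x87: "NEC special",
    0xED: "NEC selected IBM extended",
    0xEE: "NEC selected IBM extended",
    0xFA: "IBM selected",
    0xFB: "IBM selected",
    0xFC: "IBM selected",
}


def extension_group(code: int) -> Optional[str]:
    """Return the extension group for a cp932 value such as 0x8790.

    Returns None for single-byte values and for two-byte values whose
    lead byte is not one of the extension lead bytes.
    """
    if not 0x100 <= code <= 0xFFFF:
        return None  # single-byte value or out of range
    lead_byte = code >> 8
    return EXTENSION_LEAD_BYTES.get(lead_byte)


for value in (0x8790, 0xEEEF, 0xFA40, 0x815F, 0x5C):
    print(hex(value), extension_group(value))
```

A check like this can help identify which stored values are likely to be affected when data is moved between sjis and cp932 columns.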
cp932 supports conversion of user-defined characters in combination with eucjpms, and solves the problems with sjis/ujis conversion. For details, please refer to http://www.sljfaq.org/afaq/encodings.html.
For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.
Conversion to ucs2:
sjis/cp932 Value | sjis -> ucs2 Conversion | cp932 -> ucs2 Conversion |
---|---|---|
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2:
ucs2 Value | ucs2 -> sjis Conversion | ucs2 -> cp932 Conversion |
---|---|---|
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |
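To see what the two tables above imply in practice, the Python sketch below transcribes them into dictionaries and follows a value stored as sjis through ucs2 and on to cp932. It is only a table lookup built from the values shown here, not a call into MySQL; the names SJIS_TO_UCS2, CP932_TO_UCS2, UCS2_TO_CP932, and sjis_to_cp932 are hypothetical.

```python
# Transcribed from the "Conversion to ucs2" table.
SJIS_TO_UCS2 = {
    0x5C: 0x005C, 0x7E: 0x007E, 0x815C: 0x2015, 0x815F: 0x005C,
    0x8160: 0x301C, 0x8161: 0x2016, 0x817C: 0x2212, 0x8191: 0x00A2,
    0x8192: 0x00A3, 0x81CA: 0x00AC,
}
CP932_TO_UCS2 = {
    0x5C: 0x005C, 0x7E: 0x007E, 0x815C: 0x2015, 0x815F: 0xFF3C,
    0x8160: 0xFF5E, 0x8161: 0x2225, 0x817C: 0xFF0D, 0x8191: 0xFFE0,
    0x8192: 0xFFE1, 0x81CA: 0xFFE2,
}
# Transcribed from the "Conversion from ucs2" table (cp932 column).
# 0x3F is '?', the replacement for characters that cannot be converted.
UCS2_TO_CP932 = {
    0x005C: 0x5C, 0x007E: 0x7E, 0x00A2: 0x3F, 0x00A3: 0x3F,
    0x00AC: 0x3F, 0x2015: 0x815C, 0x2016: 0x3F, 0x2212: 0x3F,
    0x2225: 0x8161, 0x301C: 0x3F, 0xFF0D: 0x817C, 0xFF3C: 0x815F,
    0xFF5E: 0x8160, 0xFFE0: 0x8191, 0xFFE1: 0x8192, 0xFFE2: 0x81CA,
}


def sjis_to_cp932(value):
    """Convert one sjis value to cp932 by way of ucs2."""
    return UCS2_TO_CP932[SJIS_TO_UCS2[value]]


# Values whose ucs2 form depends on the character set used to read them.
differing = sorted(v for v in SJIS_TO_UCS2
                   if SJIS_TO_UCS2[v] != CP932_TO_UCS2[v])
# Values that do not survive an sjis -> ucs2 -> cp932 conversion.
changed = {v: sjis_to_cp932(v) for v in SJIS_TO_UCS2
           if sjis_to_cp932(v) != v}

print([hex(v) for v in differing])
for before, after in sorted(changed.items()):
    print(hex(before), "->", hex(after))
```

In this transcription, seven of the ten listed values change when converted from sjis to cp932: 815F becomes 5C, and the other six become 3F.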
Users of any Japanese character sets should be aware that using --character-set-client-handshake (or --skip-character-set-client-handshake) has an important effect. See Section 5.1.7, “Server Command Options”.