8.5 Unicode Support
There are two new character sets for storing Unicode data: ucs2 (the
UCS-2 Unicode character set) and utf8 (the UTF-8 encoding of the
Unicode character set).
- In UCS-2 (binary Unicode representation) every character is
represented by a two-byte Unicode code with the most significant
byte first. For example, "LATIN CAPITAL LETTER A" has the code
0x0041 and is stored as the two-byte sequence 0x00 0x41. "CYRILLIC
SMALL LETTER YERU" (Unicode 0x044B) is stored as the two-byte
sequence 0x04 0x4B. For Unicode characters and their codes, please
refer to the Unicode Home Page.
Temporary restriction: UCS-2 can't (yet) be used as a client
character set. That means that SET NAMES ucs2 will not work.
- The UTF8 character set (transform Unicode representation) is an
alternative way to store Unicode data. It is implemented according
to RFC 2279. The idea of the UTF8 character set is that different
Unicode characters are encoded as byte sequences of different
lengths:
- Basic Latin letters, digits, and punctuation signs use one
byte.
- Most European and Middle Eastern script letters fit into a two-byte
sequence: extended Latin letters (with tilde, macron, acute,
grave and other accents), Cyrillic, Greek, Armenian, Hebrew,
Arabic, Syriac, and others.
- Korean, Chinese and Japanese ideographs use three-byte
sequences.
- Currently, MySQL UTF8 support does not include four-byte sequences.
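The byte layouts described above can be checked outside the server. The following is a minimal Python sketch, not MySQL code: the helper names ucs2_bytes and utf8_length are hypothetical, and it assumes Python's "utf-16-be" codec as a stand-in for UCS-2, which gives the same bytes for characters in the range 0x0000 to 0xFFFF.

```python
def ucs2_bytes(ch: str) -> bytes:
    """Return the two-byte, most-significant-byte-first code of one character.

    Assumption: "utf-16-be" is used as a stand-in for UCS-2; the two agree
    for every character whose code is at most 0xFFFF.
    """
    if len(ch) != 1 or ord(ch) > 0xFFFF:
        raise ValueError("UCS-2 covers only single characters up to 0xFFFF")
    return ch.encode("utf-16-be")


def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character needs."""
    return len(ch.encode("utf-8"))


samples = [
    ("LATIN CAPITAL LETTER A", "\u0041"),
    ("CYRILLIC SMALL LETTER YERU", "\u044b"),
    ("CJK ideograph U+4E2D", "\u4e2d"),
]

for name, ch in samples:
    # Show the code, the UCS-2 bytes, and the UTF-8 length side by side.
    print(f"{name}: code=0x{ord(ch):04X} "
          f"ucs2={ucs2_bytes(ch).hex()} utf8_bytes={utf8_length(ch)}")

# A character outside the 0x0000-0xFFFF range needs four bytes in UTF-8,
# which is the case the utf8 character set described here does not cover.
print(utf8_length("\U0001F600"))
```

The three samples need one, two, and three bytes in UTF-8 respectively, matching the groups listed above.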
Tip: To save space with UTF8, use VARCHAR instead of CHAR. Otherwise,
MySQL has to reserve 30 bytes for a CHAR(10) CHARACTER SET utf8
column, because that's the maximum possible length.
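The arithmetic behind this tip can be sketched in Python. This is an illustration only, under stated assumptions: the helper names are hypothetical, the three-byte maximum per character is taken from the description above, and the VARCHAR figure counts only the encoded character data, ignoring any per-value length bookkeeping the server adds.

```python
MAX_BYTES_PER_CHAR = 3  # utf8 as described above: no four-byte sequences


def char_reserved_bytes(n_chars: int) -> int:
    """Bytes reserved for a fixed-length CHAR(n_chars) utf8 column."""
    return n_chars * MAX_BYTES_PER_CHAR


def varchar_data_bytes(value: str) -> int:
    """Bytes of encoded character data for one value (assumption: any
    length bookkeeping stored alongside the value is not counted)."""
    return len(value.encode("utf-8"))


print(char_reserved_bytes(10))      # worst case reserved for CHAR(10)
print(varchar_data_bytes("hello"))  # data bytes for a short Latin value
```

For a ten-character column the fixed-length reservation is 30 bytes regardless of content, while a five-letter Basic Latin value needs only 5 bytes of character data.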