Character conversion
A string is a sequence of bytes that may represent
characters. Within a string, all the characters are represented by a common
coding representation. In some cases, it might be necessary to convert these
characters to a different coding representation. The process of conversion
is known as character conversion.6
Character conversion can occur when an SQL statement is executed remotely.
Consider, for example, these two cases:
- The values of variables sent from the application requester to the current
server.
- The values of result columns sent from the current server to the application
requester.
In either case, the string could have a different representation at
the sending and receiving systems. Conversion can also occur during string
operations on the same system.
Note that SQL statements are strings and are therefore subject to character
conversion.
The following list defines some of the terms used when discussing character
conversion.
- character set
- A defined set of characters. For example, the following character set
appears in several code pages:
- 26 non-accented letters A through Z
- 26 non-accented letters a through z
- digits 0 through 9
- . , : ; ? ( ) ' " / - _ & + % * = < >
- code page
- A set of assignments of characters to code points. In EBCDIC, for example, "A" is assigned code point X'C1' and "B" is assigned
code point X'C2'. Within a code page, each code point has only one
specific meaning.
- code point
- A unique bit pattern that represents a character within a code page.
- coded character set
- A set of unambiguous rules that establish a character set and the one-to-one
relationships between the characters of the set and their coded representations.
- encoding scheme
- A set of rules used to represent character data. For example:
- Single-byte EBCDIC
- Single-byte ASCII
- Double-byte EBCDIC
- Mixed single- and double-byte ASCII7
- Unicode (UTF-8, UCS-2, and UTF-16 universal coded character sets).
- substitution character
- A unique character that is substituted during character conversion for
any characters in the source coding representation that do not have a match
in the target coding representation.
- Unicode
- A universal encoding scheme for written characters and text that enables
the exchange of data internationally. It provides a character set standard
that can be used all over the world. It uses a 16-bit encoding form that provides
code points for more than 65,000 characters and an extension called UTF-16
that allows for encoding as many as a million more characters. It provides
the ability to encode all characters used for the written languages of the
world and treats alphabetic characters, ideographic characters, and symbols
equivalently because it specifies a numeric value and a name for each of its
characters. It includes punctuation marks, mathematical symbols, technical
symbols, geometric shapes, and dingbats. Three encoding forms are supported:
- UTF-8: Unicode Transformation Format, a 8-bit encoding form designed for
ease of use with existing ASCII-based systems. UTF-8 data is stored in character
data types. The CCSID value for data in UTF-8 format is 1208.
A UTF-8 character
can be 1,2,3 or 4 bytes in length. A UTF-8 data string can contain any combination
of SBCS and DBCS data, including surrogates and combining characters.
- UCS-2: Universal Character Set coded in 2 octets, which means that characters
are represented in 16-bits per character. UCS-2 data is stored in graphic
data types. The CCSID value for data in UCS-2 format is 13488.
UCS-2 is
a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports
combining characters and surrogates. Since UCS-2 is a simpler form of UTF-16,
UCS-2 data will typically perform better than UTF-16.8
- UTF-16: Unicode Transformation Format, a 16-bit encoding form designed
to provide code values for over a million characters. UTF-16 data is stored
in graphic data types. The CCSID value for data in UTF-16 format is 1200.
Both UTF-8 and UTF-16 data can contain combining characters. Combining character support allows a resulting
character to be comprised of more than one character. After the first character,
hundreds of different non-spacing accent characters (umlauts, accents, etc.)
can follow in the data string. The resulting character may already be defined
in the character set. In this case, there are multiple representations for
the same character. For example, in UTF-16, an é can
be represented either by X'00E9' (the normalized representation) or X'00650301'
(the non-normalized combining character representation).
Since multiple
representations of the same character will not compare equal, it is usually
not a good idea to store both forms of the characters in the database. Normalization is a process that replaces the string of combining characters
with equivalent characters that do not include combining characters. After
normalization has occurred, only one representation of any specific character
will exist in the data. For example, in UTF-16, any instances of X'00650301'
(the non-normalized combining character representation of é ) will be converted to X'00E9' (the normalized representation of é ).9
Both UTF-8 and UTF-16 can contain 4 byte characters called surrogates. Surrogates are 4 byte sequences that can address one million more characters
than would be available in a 2 byte character set.
Character conversion,
when required, is automatic and is transparent to the application when it
is successful. A knowledge of conversion is, therefore, unnecessary when all
the strings involved in a statement's execution are represented in the
same way. Thus, for many readers, character conversion may be irrelevant.
UTF-8 unicode data is also mixed
data. In this book, however, mixed data refer to mixed single- and double-byte
data.
UCS-2 can contain
surrogates and combining characters, however, they are not recognized as such.
Each 16-bits is considered to be a character.
Since normalization can significantly affect
performance (from 2.5 to 25 percent extra CPU), the default in column definitions
is NOT NORMALIZED.
(C) Copyright IBM Corporation 1992, 2006. All Rights Reserved.