Administration Guide

Unicode/UCS-2 and UTF-8 Support in DB2 UDB

These two standards are documented here.

Introduction

The Unicode character encoding standard is a fixed-length, character encoding scheme that includes characters from almost all of the living languages of the world. Unicode characters are usually shown as "U+xxxx", where xxxx is the hexadecimal code of the character.

Each character is 16 bits (2 bytes) wide, regardless of the language. While the resulting 65000 code elements are sufficient for encoding most of the characters of the major languages of the world, the Unicode standard also provides an extension mechanism that allows the encoding of as many as one million more characters. This extension reserves a range of code values (U+D800 to U+D8FF, known as "surrogates") for encoding some 32-bit characters as two successive code elements.

The International Standards Organization (ISO) and the International Electrotechnical Commission (IEC) standard 10646 (ISO/IEC 10646) specifies the Universal Multiple-Octet Coded Character Set (UCS) that has a 2-byte version (UCS-2) and a 4-byte version (UCS-4). The 2-byte version of this ISO standard is identical to Unicode without surrogates. ISO 10646 also defines an extension technique for encoding some UCS-4 codes in a UCS-2 encoded string. This extension, called UTF-16, is identical to Unicode with surrogates.

DB2 UDB supports UCS-2; that is, Unicode without surrogates.

Connection of a UTF-8 (code page 1208) client to a non-Unicode database is not supported.

UTF-8

With UCS-2 or Unicode encoding, ASCII and control characters are also two bytes long, and the lead byte is zero. For example, NULL is U+0000, and the uppercase "A" is represented by U+0041. This could be a major problem for ASCII-based applications and ASCII file systems, because in a UCS-2 string, extraneous NULLs can appear anywhere in the string. A transformation algorithm, known as UTF-8, can be used to circumvent this problem for programs that rely on ASCII code being invariant.

UTF-8 (UCS Transformation Format 8) is an algorithmic transformation that transforms fixed-length UCS-4 characters into variable-length byte strings. In UTF-8, ASCII characters are represented by their usual single-byte codes, but non-ASCII characters in UCS-2 become two or three bytes long. In other words, UTF-8 transforms UCS-2 characters into a multi-byte code set, for which ASCII is invariant. The number of bytes for each UCS-2 character in UTF-8 format can be determined from the following table:

    UCS-2 (hex)     UTF-8 (binary)                 Description
    ------------    --------------------------     ----------------
    0000 to 007F    0xxxxxxx                       ASCII
    0080 to 07FF    110xxxxx 10xxxxxx              up to U+07FF
    0800 to FFFF    1110xxxx 10xxxxxx 10xxxxxx     other UCS-2
 
    NOTE: The range D800 to DFFF is to be excluded from treatment
          by the third row of this table which governs the UCS-4 range
          0000 0800 to 0000 FFFF.

In each of the above, a series of x's is the UCS bit representation of the character. For example, U0080 transforms into 11000010 10000000.

UCS-2/UTF-8 Implementation in DB2 UDB

Code Page/CCSID Numbers

Within IBM, the UCS-2 code page has been registered as code page 1200. All code pages are defined with growing character sets; that is, when new characters are added to a code page, the code page number does not change. Code page 1200 always refers to the current version of Unicode/UCS-2, and has been used for UCS-2 support in DB2 UDB.

A specific version of the UCS standard, as defined by Unicode 2.0 and ISO/IEC 10646-1, has also been registered within IBM as CCSID 13488. This CCSID has been used internally by DB2 UDB for storing graphic string data in euc-Japan and euc-Taiwan databases. CCSID 13488 and code page 1200 both refer to UCS-2, and are handled the same way, except for the value of their "double-byte" (DBCS) space:

     CP/CCSID        Single-byte (SBCS) space      Double-byte (DBCS) space
    ---------        ------------------------      ------------------------
      1200                   N/A                           U+0020
      13488                  N/A                           U+3000
 
    NOTE: In a UCS-2 database, U+3000 has no special meaning.

Regarding the conversion tables, since code page 1200 is a superset of CCSID 13488, the same (superset) tables are used for both.

Within IBM, UTF-8 has been registered as CCSID 1208 with growing character set (sometimes also referred to as code page 1208). As new characters are added to the standard, this number (1208) will not change. The number 1208 is used as the multi-byte code page number for DB2's UCS-2/UTF-8 support.

DB2 UDB supports UCS-2 as a new multi-byte code page. The MBCS code page number is 1208, which is the database code page number, and the code page of character string data within the database. The double-byte code page number for UCS-2 is 1200, which is the code page of graphic string data within the database. When a database is created in UCS-2/UTF-8, CHAR, VARCHAR, LONG VARCHAR, and CLOB data are stored in UTF-8, and GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, and DBCLOB data are stored in UCS-2. We will simply refer to this as a UCS-2 database.

Creating a UCS-2 Database

By default, databases are created in the code page of the application creating them. Therefore, if you create your database from a UTF-8 client (for example, the UNIVERSAL locale of AIX), or if the DB2CODEPAGE registry variable on the client is set to 1208, your database will be created as a UCS-2 database. Alternatively, you can explicitly specify "UTF-8" as the CODESET name, and use any valid two letter TERRITORY code supported by DB2 UDB.

For example, to create a UCS-2 database with the territory code for the United States, issue:

   DB2 CREATE DATABASE dbname USING CODESET UTF-8 TERRITORY US

To create a UCS-2 database using the sqlecrea API, you should set the values in sqledbcountryinfo accordingly. For example, set SQLDBCODESET to UTF-8, and SQLDBLOCALE to any valid territory code (for example, US).

The default collating sequence for a UCS-2 database is IDENTITY, which provides UCS-2 code point order. Therefore, by default, all UCS-2/UTF-8 characters are ordered and compared according to their UCS-2 code point sequence.

All culturally-sensitive parameters, such as date or time format, decimal separator, and others, are based on the current territory of the client.

A UCS-2 database allows connection from every single-byte and multi-byte code page supported by DB2 UDB. Code page character conversions between the client's code page and UTF-8 are automatically performed by the database manager. Data in graphic string types is always in UCS-2, and does not go through code page conversions. The command line processor (CLP) environment is an exception. If you select graphic string (UCS-2) data from the CLP, the returned graphic string data is converted (by the CLP) from UCS-2 to the code page of your client environment.

Every client is limited by the character repertoire, the input method, and the fonts supported by its environment, but the UCS-2 database itself accepts and stores all UCS-2 characters. Therefore, every client usually works with a subset of UCS-2 characters, but the database manager allows the entire repertoire of UCS-2 characters.

When characters are converted from a local code page to UTF-8, there may be expansion in the number of bytes. There is no expansion for ASCII characters, but other UCS-2 characters expand by a factor of two or three. The number of bytes of each UCS-2 character in UTF-8 format can be determined from the table in UTF-8.

Data Types

All data types supported by DB2 UDB are also supported in a UCS-2 database. In particular, graphic string data is supported for a UCS-2 database, and is stored in UCS-2/Unicode. Every client, including SBCS clients, can work with graphic string data types in UCS-2/Unicode when connected to a UCS-2 database.

A UCS-2 database is like any MBCS database where character string data is measured in number of bytes. When working with character string data in UTF-8, one should not assume that each character is one byte. In multi-byte UTF-8 encoding, each ASCII character is one byte, but non-ASCII characters take two or three bytes each. This should be taken into account when defining CHAR fields. Depending on the ratio of ASCII to non-ASCII characters, a CHAR field of size n bytes can contain anywhere from n/3 to n characters.

Using character string UTF-8 encoding versus the graphic string UCS-2 data type also has an impact on the total storage requirements. In a situation where the majority of characters are ASCII, with some non-ASCII characters in between, storing UTF-8 data may be a better alternative, because the storage requirements are closer to one byte per character. On the other hand, in situations where the majority of characters are non-ASCII characters that expand to three-byte UTF-8 sequences (for example ideographic characters), the UCS-2 graphic-string format may be a better alternative, because every UCS-2 character requires exactly two bytes, rather than three bytes, for each corresponding character in the UTF-8 format.

In MBCS environments, SQL scalar functions that operate on character strings, such as LENGTH, SUBSTR, POSSTR, MAX, MIN, and the like, operate on the number of "bytes" rather than number of "characters". The behavior is the same in a UCS-2 database, but you should take extra care when specifying offsets and lengths for a USC-2 database, because these values are always defined in the context of the database code page. That is, in the case of a UCS-2 database, these offsets should be defined in UTF-8. Since some single-byte characters require more than one byte in UTF-8, SUBSTR indexes that are valid for a single-byte database may not be valid for a UCS-2 database. If you specify incorrect indexes, SQLCODE -191 (SQLSTATE 22504) is returned. For a description of the behavior of these functions, refer to the SQL Reference.

SQL CHAR data types are supported (in the C language) by the char data type in user programs. SQL GRAPHIC data types are supported by sqldbchar in user programs. Note that, for a UCS-2 database, sqldbchar data is always in big-endian (high byte first) format. When an application program is connected to a UCS-2 database, character string data is converted between the application code page and UTF-8 by DB2 UDB, but graphic string data is always in UCS-2.

Identifiers

In a UCS-2 database, all identifiers are in multi-byte UTF-8. Therefore, it is possible to use any UCS-2 character in identifiers where the use of a character in the extended character set (for example, an accented character, or a multi-byte character) is allowed by DB2 UDB. For details about which identifiers allow the use of extended characters, see Appendix A, Naming Rules.

Clients can enter any character that is supported by their SBCS or MBCS environment, and all the characters in the identifiers will be converted to UTF-8 by the database manager. Two points must be taken into account when specifying national language characters in identifiers for a UCS-2 database:

Each non-ASCII character requires two or three bytes. Therefore, an n-byte identifier can only hold somewhere between n/3 and n characters, depending on the ratio of ASCII to non-ASCII characters. If you have only one or two non-ASCII (for example, accented) characters, the limit is closer to n characters, while for an identifier that is completely non-ASCII (for example, in Japanese), only n/3 characters can be used.
If identifiers are to be entered from different client environments, they should be defined using the common subset of characters available to those clients. For example, if a UCS-2 database is to be accessed from Latin-1, Arabic, and Japanese environments, all identifiers should realistically be limited to ASCII.

UCS-2 Literals

UCS-2 literals can be specified in two ways:

As a graphic string constant, using the G'...' or N'....' format described in the "Graphic String Constants" section of the "Language Elements" chapter in the SQL Reference. Any literal specified in this way will be converted by the database manager from the application code page to UCS-2.
As a UCS-2 hexadecimal string, using the UX'....' or GX'....' format. The constant specified between the quotation marks after UX or GX must be a multiple of 4 hexadecimal digits. Each four-digit group represents one UCS-2 code point.

When using the command line processor (CLP), the first method is easier if the UCS-2 character exists in the local application code page (for example, when entering any code page 850 character from a terminal that is using code page 850). The second method should be used for characters that are outside of the application code page repertoire (for example, when specifying Japanese characters from a terminal that is using code page 850).

Pattern Matching in a UCS-2 Database

Pattern matching is one area where the behavior of existing MBCS databases is slightly different from the behavior of a UCS-2 database.

For MBCS databases in DB2 UDB, the current behavior is as follows: If the match-expression contains MBCS data, the pattern can include both SBCS and MBCS characters. The special characters in the pattern are interpreted as follows:

An SBCS underscore refers to one SBCS character.
A DBCS underscore refers to one MBCS character.
A percent (either SBCS or DBCS) refers to a string of zero or more SBCS or MBCS characters.

If the match-expression contains graphic string DBCS data, the expressions contain only DBCS characters. The special characters in the pattern are interpreted as follows:

A DBCS underscore refers to one DBCS character.
A DBCS percent sign refers to a string of zero or more DBCS characters.

In a UCS-2 database, there is really no distinction between "single-byte" and "double-byte" characters; every UCS-2 character occupies two bytes. Although the UTF-8 format is a "mixed-byte" encoding of UCS-2 characters, there is no real distinction between SBCS and MBCS characters in UTF-8. Every character is a UCS-2 character, regardless of the number of its bytes that are in UTF-8 format. When specifying a character string, or a graphic string expression, an underscore refers to one UCS-2 character, and a percent sign refers to a string of zero or more UCS-2 characters.

On the client side, the character string expressions are in the code page of the client, and will be converted to UTF-8 by the database manager. SBCS client code pages do not have a DBCS percent sign or a DBCS underscore, but every supported code page contains a single-byte percent sign (corresponding to U+0025) and a single-byte underscore (corresponding to U+005F). The interpretation of special characters for a UCS-2 database is as follows:

An SBCS underscore (corresponding to U+0025) refers to one UCS-2 character in a graphic string expression, or to one UTF-8 character in a character string expression.
An SBCS percent sign (corresponding to U+005F) refers to a string of zero or more UCS-2 characters in a graphic string expression, or to a string of zero or more UTF-8 characters in a character string expression.

DBCS code pages also support a DBCS percent sign (corresponding to U+FF05) and a DBCS underscore (corresponding to U+FF3F). These characters have no special meaning for a UCS-2 database.

For the optional "escape expression", which specifies a character to be used to modify the special meaning of the underscore and percent sign characters, only ASCII characters, or characters that expand into a two-byte UTF-8 sequence, are supported. If you specify an escape character that expands to a three-byte UTF-8 value, an error message (error SQL0130N, SQLSTATE 22019) is returned.

Import/Export/Load Considerations

The DEL, ASC, and PC/IXF file formats are supported for a UCS-2 database, as described in this section. The WSF format is not supported.

When exporting from a UCS-2 database to an ASCII delimited (DEL) file, all character data is converted to the application code page. Both character string and graphic string data are converted to the same SBCS or MBCS code page of the client. This is expected behavior for the export of any database, and cannot be changed, because the entire delimited ASCII file can have only one code page. Therefore, if you export to a delimited ASCII file, only those UCS-2 characters that exist in your application code page will be saved. Other characters are replaced with the default substitution character for the application code page. For UTF-8 clients (code page 1208), there is no data loss, because all UCS-2 characters are supported by UTF-8 clients.

When importing from an ASCII file (DEL or ASC) to a UCS-2 database, character string data is converted from the application code page to UTF-8, and graphic string data is converted from the application code page to UCS-2. There is no data loss. If you want to import ASCII data that has been saved under a different code page, you should change the data file code page before issuing the IMPORT command. One way to accomplish this is to set DB2CODEPAGE to the code page of the ASCII data file.

The range of valid ASCII delimiters for SBCS and MBCS clients is identical to what is currently supported by DB2 UDB for those clients. The range of valid delimiters for UTF-8 clients is 0x01 to 0x7F, with the usual restrictions. For a complete list of these restrictions, refer to the "Export/Import/Load Utility File Formats" appendix in the Data Movement Utilities Guide and Reference.

When exporting from a UCS-2 database to a PC/IXF file, character string data is converted to the SBCS/MBCS code page of the client. graphic string data is not converted, and is stored in UCS-2 (code page 1200). There is no data loss.

When importing from a PC/IXF file to a UCS-2 database, character string data is assumed to be in the SBCS/MBCS code page stored in the PC/IXF header, and graphic string data is assumed to be in the DBCS code page stored in the PC/IXF header. Character string data is converted by the import utility from the code page specified in the PC/IXF header to the code page of the client, and then from the client code page to UTF-8 (by the INSERT statement). graphic string data is converted by the import utility from the DBCS code page specified in the PC/IXF header directly to UCS-2 (code page 1200).

The load utility places the data directly into the database and, by default, assumes data in ASC or DEL files to be in the code page of the database. Therefore, by default, no code page conversion takes place for ASCII files. When the code page for the data file has been explicitly specified (using the codepage modifier), the load utility uses this information to convert from the specified code page to the database code page before loading the data. For PC/IXF files, the load utility always converts from the code pages specified in the IXF header to the database code page (1208 for CHAR, and 1200 for GRAPHIC).

The code page for DBCLOB files is always 1200 for UCS-2. The code page for CLOB files is the same as the code page for the data files being imported, loaded or exported. For example, when loading or importing data using the PC/IXF format, the CLOB file is assumed to be in the code page specified by the PC/IXF header. If the DBCLOB file is in ASC or DEL format, the load utility assumes that CLOB data is in the code page of the database (unless explicitly specified otherwise using the codepage modifier), while the import utility assumes it to be in the code page of the client application.

The nochecklengths modifier is always specified for a UCS-2 database, because:

Any SBCS can be connected to a database for which there is no DBCS code page
Character strings in UTF-8 format usually have different lengths than those in client code pages.

For more information about the load, import, and export utilities, refer to the Data Movement Utilities Guide and Reference.

Incompatibilities

For applications connected to a UCS-2 database, graphic string data is always in UCS-2 (code page 1200). For applications connected to non-UCS-2 databases, the graphic string data is in the DBCS code page of the application, or not allowed if the application code page is SBCS. For example, when a 932 client is connected to a Japanese non-UCS-2 database, the graphic string data is in code page 301. For the 932 client applications connected to a UCS-2 database, the graphic string data is in UCS-2.

[ Top of Page | Previous Page | Next Page ]