IBM Books

Administration Guide


Unicode/UCS-2 and UTF-8 Support in DB2 UDB

These two standards are documented here.

Introduction

The Unicode character encoding standard is a fixed-length, character encoding scheme that includes characters from almost all the living languages of the world. Unicode characters are usually shown as "U+xxxx" where xxxx is the hexadecimal code of the character.

Each character is 16 bits (2 bytes) wide, regardless of the language. While the resulting 65 000 code elements are sufficient for encoding most of the characters of the major languages of the world, the Unicode standard also provides an extension mechanism, that allows for encoding as many as a million more characters. This extension reserves a range of code values (U+D800 to U+D8FF, known as "surrogates") for encoding some 32-bit characters as two successive code elements.

The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) 10646 standard (ISO/IEC 10646) specifies the Universal Multiple-Octet Coded Character Set (UCS) that has a 2-byte version (UCS-2) and a 4-byte version (UCS-4). The 2-byte version of this ISO standard UCS-2 is identical to Unicode without surrogates. ISO 10646 also defines an extension technique, for encoding some UCS-4 codes in a UCS-2 encoded string. This extension called UTF-16, is identical to Unicode with surrogates.

DB2 UDB supports UCS-2, that is, Unicode without surrogates.

UTF-8

With UCS-2 or Unicode encoding, ASCII and control characters are also two bytes long, and the lead byte is zero. For example, NULL is U+0000 and CAPITAL LETTER A is represented by U+0041. This could be a major problem for ASCII-based applications and ASCII file systems, because in a UCS-2 strings, extraneous NULLs may appear anywhere in the string. A transformation algorithm, known as UTF-8, can be used to circumvent this problem for programs that rely on ASCII code being invariant.

UTF-8 (UCS Transformation Format 8), is an algorithmic transformation which transforms fixed-length UCS-4 characters into variable-length byte strings. In UTF-8, ASCII characters are represented by their usual single-byte codes, but non-ASCII characters in UCS-2 become two or three bytes long. In other words, UTF-8 transforms UCS-2 characters to a multi-byte codeset, for which ASCII is invariant. The number of bytes for each UCS-2 character in UTF-8 format can be determined from the following table:

    UCS-2 (hex)     UTF-8 (binary)                 Description
    ------------    --------------------------     ----------------
    0000 to 007F    0xxxxxxx                       ASCII
    0080 to 07FF    110xxxxx 10xxxxxx              up to U+07FF
    0800 to FFFF    1110xxxx 10xxxxxx 10xxxxxx     other UCS-2
 
    NOTE: The range D800 to DFFF is to be excluded from treatment
          by the third row of this table which governs the UCS-4 range
          0000 0800 to 0000 FFFF.

In all the above, a series of x's indicate the UCS bit representation of the character. For example, U0080 transforms into 11000010 10000000.

UCS-2/UTF-8 Implementation in DB2 UDB

Code Page/CCSID Numbers

Within IBM, the UCS-2 code page has been registered as code page 1200. All code pages are defined with growing character sets, that is, when new characters are added to a code page, the code page number does not change. Code page 1200 always refers to the current version of Unicode/UCS-2, and has been used for UCS-2 support in DB2 UDB.

A specific repertoire of the UCS standard, as defined by Unicode 2.0 and ISO/IEC 10646-1, has also been registered within IBM as CCSID 13488. This CCSID (13488) has been used internally by DB2 UDB for storing graphic string data in euc-Japan and euc-Taiwan databases. CCSID 13488 and code page 1200 both refer to UCS-2, and are handled the same way except for the value of their "double-byte" (DBCS) space:

     CP/CCSID        Single Byte (SBCS) space      Double Byte (DBCS) space
    ---------        ------------------------      ------------------------
      1200                   N/A                           U+0020
      13488                  N/A                           U+3000
 
    NOTE: In a UCS-2 database, U+3000 has no special meaning.

Regarding the conversion tables, since code page 1200 is a superset of CCSID 13488, the exact same (superset) tables are used for both.

In IBM, UTF-8 has been registered as CCSID 1208 with growing character set (sometimes also referred to as code page 1208). As new characters are added to the standard, this number (1208) will not change either. 1208 is used as the multi-byte code page number for DB2's UCS-2/UTF-8 support.

DB2 UDB supports UCS-2 as a new multi-byte code page. The MBCS code page number is 1208, which is the database code page number, and the code page of character string data within the database. The double-byte code page number (for UCS-2) is 1200 which is the code page of graphic string data within the database. When a database is created in UCS-2/UTF-8, CHAR, VARCHAR, LONG VARCHAR, and CLOB data, are stored in UTF-8, and GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, and DBCLOB data, are stored in UCS-2. We will simply refer to this as a UCS-2 database.

Creating a UCS-2 Database

By default, databases are created in the code page of the application creating them. Therefore, if you create your database from a UTF-8 client (for example, the UNIVERSAL locale of AIX), or if DB2CODEPAGE environment variable on the client is set to 1208, your database will be created as a UCS-2 database. Alternatively, you can explicitly specify "UTF-8" as the CODESET name, and use any valid two letter TERRITORY code supported by DB2 UDB.

For example, to create a UCS-2 database from the CLP, with the territory code for United States, issue:

  DB2 CREATE DATABASE dbname USING CODESET UTF-8 TERRITORY US

To create a UCS-2 database using the sqlecrea API, you should set the values in sqledbcountryinfo accordingly. For example, set sqledbcountryinfo to "UTF-8", and sqldblocale to any valid territory code (for example, "US").

The default collation sequence for a UCS-2 database is IDENTITY, which provides UCS-2 code point order. Therefore, by default, all UCS-2/UTF-8 characters are ordered and compared according to their UCS-2 code point sequence.

All cultural-sensitive parameters such as date/time format, decimal separator, and others, are based on the current territory of the client.

A UCS-2 database allows connection from every single-byte and multi-byte code page supported by DB2 UDB. Code page character conversions between client's code page and UTF-8 are automatically performed by the database manager. Data in graphic string types, is always in UCS-2 and does not go through code page conversions. The Command Line Processor (CLP) environment is an exception. If you SELECT graphic string (UCS-2) data from the CLP, the returned graphic string data is converted (by the CLP) from UCS-2 to the code page of your client environment.

Every client is limited by the character repertoire, the input methods, and the fonts supported by its environment, but the UCS-2 database itself accepts and stores all UCS-2 characters. Therefore, every client usually works with a subset of UCS-2 characters, but the database manager allows the entire repertoire of UCS-2 characters.

When characters are converted from a local code page to UTF-8, there is a possibility of expansion in the number of bytes. There is no expansion for ASCII characters, but other UCS-2 characters expand by a factor of two or three. The number of bytes of each UCS-2 character in UTF-8 format can be determined from the above table (section about UTF-8).

Data Types

All data types supported by DB2 UDB, are also supported in a UCS-2 database. In particular, graphic string data, is supported for UCS-2 database and is stored in UCS-2/Unicode. Every client, including SBCS clients, can work with graphic string data types in UCS-2/Unicode when connected to a UCS-2 database.

A UCS-2 database is like any MBCS database where character string data is measured in number of bytes. When working with character string data in UTF-8, one should not assume that each character is one byte. In multi-byte UTF-8 encoding, each ASCII character is one byte, but non-ASCII characters take two or three bytes each. This should be taken into account when defining CHAR fields. Depending on the ratio of ASCII to non-ASCII characters, a CHAR field of size n bytes, can contain anywhere from n/3 to n characters.

Using character string UTF-8 encoding versus graphic string UCS-2 data type also has an impact on the total storage requirements. For a situation where the majority of characters are ASCII, with some non-ASCII characters in between, storing UTF-8 data may be a better alternative because the storage requirements are closer to one byte per character. On the other hand, for situations where the majority of characters are non-ASCII characters that expand to three-byte UTF-8 sequences (for example ideographic characters), the UCS-2 graphic-string format may be a better alternative because every UCS-2 character requires exactly two bytes, rather than three bytes for each corresponding character in UTF-8 format.

SQL scalar functions that operate on character strings, such as LENGTH, SUBSTR, POSSTR, MAX, MIN, and the like, in MBCS environments operate on number of "bytes" rather than number of "characters". The behavior is the same in a UCS-2 database but you should take extra care when specifying offsets and lengths for a USC-2 database because these values are always defined in the context of the database code page. That is, in the case of UCS-2 database should be defined in UTF-8. Since some single-byte characters require more than one byte in UTF-8, SUBSTR indexes that are valid for a single-byte database may not be valid for a UCS-2 database. If you specify incorrect indexes, you will get SQLCODE -191, SQLSTATE 22504. Refer to the SQL Reference for a description of the behavior of these functions.

SQL CHAR data types are supported by C language's char data type in user programs. SQL GRAPHIC data types are supported by sqldbchar in user C programs. Note that, for a UCS-2 database, sqldbchar data is always in big-endian (high byte first) format. When an application program is connected to a UCS-2 database, character string data is converted between the application code page and UTF-8 by DB2 UDB, but graphic string data is always in UCS-2.

Identifiers

In a UCS-2 database, all identifiers are in multi-byte UTF-8. Therefore, it is possible to use any UCS-2 character in identifiers where the use of a character in the extended character set (for example, an accented character, or a multi-byte character) is allowed by DB2 UDB. Please refer to the appendix "Naming Rules" in the Administration Guide for details of which identifiers allow use of extended characters.

Clients can enter any character which is supported by their SBCS/MBCS environment, and all the characters in the identifiers will be converted to UTF-8 by the database manager. Two points need to be taken into account when specifying National Language characters in identifiers in a UCS-2 database:

  1. Each non-ASCII character takes two or three bytes. Therefore, an n-byte identifier, can only hold somewhere between n/3 and n characters, depending the ratio of ASCII to non-ASCII characters. If you have only one or two non-ASCII (for example, accented) characters, the limit is closer to n characters, while for an identifier which is completely non-ASCII (for example, in Japanese), only n/3 characters can be used.

  2. If identifiers are to be entered from different client environments, they should be defined using the common subset of characters available to those clients. For example, if a UCS-2 database is to be accessed from Latin-1, Arabic, and Japanese environments, all identifiers should realistically be limited to ASCII.

UCS-2 Literals

UCS-2 literals can be specified in two ways:

  1. As a GRAPHIC string constant using the G'...', or N'....' format as described in the SQL Reference, Chapter 3 "Language Elements", the section "Constants", the subsection "Graphic String Constants." Any literal specified in this way will be converted by the database manager from the application code page to UCS-2.

  2. As a UCS-2 hexadecimal string, using the UX'....' or GX'....' format. The constant specified between quotes after UX or GX must be a multiple of 4 hexadecimal digits. Each four digits represent one UCS-2 code point.

When using the Command Line Processor (CLP), the first method is easier if the UCS-2 character exists in the local application code page (for example, for entering any code page 850 character from a terminal that is using code page 850). The second method should be used for characters which are outside the application code page repertoire (for example, for specifying Japanese characters from a terminal that is using code page 850).

Pattern Matching in a UCS-2 Database

Pattern matching is one area where the behavior of existing MBCS databases is slightly different from the behavior of a UCS-2 database.

For MBCS databases in DB2 UDB, the current behavior is as follows: If the match-expression contains MBCS data, the pattern can include both SBCS and MBCS characters. The special characters in the pattern are interpreted as follows:

If the match-expression contains graphic string DBCS data, the expressions contain only DBCS characters. The special characters in the pattern are interpreted as follows:

In a UCS-2 database, there is really no distinction between "single-byte" and "double-byte" characters; every UCS-2 character occupies two bytes. Although the UTF-8 format is a "mixed-byte" encoding of UCS-2 characters, there is no real distinction between SBCS and MBCS characters in UTF-8. Every character is a UCS-2 character, irrespective of its number of bytes in UTF-8 format. When specifying a character string, or a graphic string expression, an underscore refers to one UCS-2 characters and a percent refers to a string of zero or more UCS-2 characters.

On the client side, the character string expressions are in the code page of the client, and will be converted to UTF-8 by the database manager. SBCS client code pages, do not have any DBCS percent or DBCS underscore, but every supported code page contains a single-byte percent (corresponding to U+0025) and a single-byte underscore (corresponding to U+005F). The interpretation of special characters for a UCS-2 database is as follows:

DBCS code pages, additionally support a DBCS percent sign (corresponding to U+FF05) and a DBCS underscore (corresponding to U+FF3F). These characters have no special meaning for a UCS-2 database.

For the optional "escape-expression" which specifies a character to be used to modify the special meaning of the underscore and percent characters, only ASCII characters, or characters that expand into a two-byte UTF-8 sequence, are supported. If you specify an ESCAPE character which expands to a three-byte UTF-8 value, you will get an error message (SQL0130N error, SQLSTATE 22019).

IMPORT/EXPORT/LOAD Considerations

The DEL, ASC, and PC/IXF file formats are supported for a UCS-2 database as described in this section. The WSF format is not supported.

When exporting from a UCS-2 database to an ASCII delimited (DEL) file, all character data is converted to the application code page. Both character string and GRAPHIC string data are converted to the same SBCS or MBCS code page of the client. This is the existing behavior for the export of any database, and cannot be changed because the entire ASCII delimited file can have only one code page. Therefore, if you export to an ASCII delimited file, only those UCS-2 characters that exist in your application code page would be saved. Other characters are replaced with the default substitution character for the application code page. For UTF-8 clients (code page 1208) there is no data loss because all UCS-2 characters are supported by UTF-8 clients.

When importing from an ASCII file (DEL or ASC) to a UCS-2 database, character string data is converted from the application code page to UTF-8, and GRAPHIC string data is converted from the application code page to UCS-2. There is no data loss. If you want to import ASCII data that has been saved under a different code page, you should switch the data file code page before issuing the import command. One way to accomplish this, is to set DB2CODEPAGE to the code page of the ASCII data file.

The range of valid ASCII delimiters, for SBCS and MBCS clients is identical to what is currently supported by DB2 UDB for these clients. The range of valid delimiters for UTF-8 clients is 0x01 to 0x7F, with the usual restrictions. Refer to the "IMPORT/EXPORT/LOAD Utility File Formats" appendix in the Command Reference for a complete list of these restrictions.

When exporting from a UCS-2 database to a PC/IXF file, character string data is converted to the SBCS/MBCS code page of the client. GRAPHIC string data is not converted and is stored in UCS-2 (code page 1200). There is no data loss.

When importing from an PC/IXF file to a UCS-2 database, character string data is assumed to be in the SBCS/MBCS code page stored in the PC/IXF header and GRAPHIC string data is assumed to be in the DBCS code page stored in the PC/IXF header. Character string data is converted by the IMPORT utility from the code page specified in the PC/IXF header to the code page of the client, and then from the client code page to UTF-8 (by the INSERT statement). GRAPHIC string data is converted by the IMPORT utility from the DBCS code page specified in the PC/IXF header directly to UCS-2 (code page 1200).

LOAD directly places the data into the database and by default, assumes data in ASC or DEL files is in the code page of the database. Therefore, by default no code page conversion takes place for ASCII files. When the code page of the data file has been explicitly specified (using the MODIFIED BY codepage=x command parameter), LOAD uses this information to convert from the specified code page x into the database code page before loading the data. For PC/IXF files, LOAD always converts from the code pages specified in the IXF header to the database code page (1208 for CHAR, 1200 for GRAPHIC).

The code page of DBCLOB files (as specified using the MODIFIED BY lobsinfile command parameter) is always 1200 for UCS-2. The code page of the CLOB files is the same as the code page of the data files being imported, loaded or exported. For example, for load or import using PC/IXF format, the CLOB file is assumed to be in the code page specified by the PC/IXF header. If the DBCLOB file is in ASC or DEL format, for LOAD the CLOB data is assumed to be in the code page of the database (unless explicitly specified otherwise using the MODIFIED BY codepage=x command parameter), and for IMPORT it is assumed to be in the client application code page.

The NOCHECKLENGTHS option is always set to TRUE for a UCS-2 database because in a UCS-2 database, any SBCS can be connected to the database for which there is no DBCS code pages; and also because character strings in UTF-8 format usually have different length than corresponding length in client code pages.

Refer to Data Movement Utilities Guide and Reference for more information on the LOAD, EXPORT, and IMPORT utilities.

Incompatibilities

An application connected to a UCS-2 database, the graphic string data is always in UCS-2 (code page 1200). For applications connected to non UCS-2 databases, the graphic string data is in the applications DBCS code page; or, not allowed if the application code page is SBCS. For example, when a 932 client is connected to a Japanese non UCS-2 database, then the graphic string data is in code page 301. But for the 932 client applications connected to a UCS-2 database, the graphic string data is in UCS-2.


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]