Application Development Guide

National Language Support Considerations

This section describes National Language Support (NLS) support issues that you must consider for your applications. The major topics discussed are:

"Collating Sequences"
"Deriving Code Page Values"
"Deriving Locales in Application Programs"
"Programming Considerations"
"Conversion Between Different Code Pages"

Collating Sequences

The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.
Note: Character string data defined with the FOR BIT DATA attribute, or BLOB data, is sorted using the binary sort sequence.

For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.

The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.

Overview

In a database, each single-byte character is represented internally as a unique number between 0 and 255, (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character; the assignments of numbers to characters in a set are collectively called a code page. A collating sequence is a mapping between the code point and the desired position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. The simplest collating sequence is one where the weights are identical to the code points. This is called the identity sequence.

For example, consider the characters B (X'42'), and b (X'62'). If, according to the collating sequence table, they both have a sort weight of X'42' (B), then they collate the same. If the sort weight for B is X'9E' and the sort weight for b is X'9D', then b will be sorted before B. Actual weights depend on the collating sequence table used which depends on the code set and locale. Note that a collating sequence table is not the same as a code page table, which defines code points.

Consider the following example. In ASCII, the characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence where these are sorted in order, and consecutively (no intervening characters), you can write X'41', X'42', ...X'59', X'5A'.

For multi-byte characters, the hexadecimal value of the multi-byte character is also used as the weight. For example, X'8260', X'8261' are the code points for double byte character A and B. In this case, the collation weights for X'82', X'60', and X'61' are used to sort these two characters according to their code points.

The values of the weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.

Specifying the collating sequence can be simplified if a collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character. This is the method used to specify a collating sequence for the database manager: a string of 256 bytes, where the nth byte (starting with 0) contains the weight of code point n.

In all cases, DB2 uses the collation table which was specified at database creation time. If you require the multi-byte characters to sort the way they appear in their code point table, you must specify IDENTITY as your collation sequence when you create the database.
Note: For DBCS characters in GRAPHIC fields, the sort sequence is always IDENTITY without regard to the collation sequence specified at database creation time.

Character Comparisons:

Once a collating sequence is established, character comparison is performed by comparing the weights of two characters, instead of directly comparing their code point values.

If weights that are not unique are used, characters that are not identical may compare equally. Because of this, string comparison must be a two-phase process:

Compare the characters of each string based on their weights.
If step 1 yielded equality, compare the characters of each string based on their code point values.

If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence only the second step is performed. In either case, there is a performance benefit.

For more information on character comparisons, see the SQL Reference.

Case Independent Comparisons:

To perform character comparisons that are independent of whether they are upper or lower case, you can use the TRANSLATE function to select and compare mixed case column data by translating it to upper case, but only for the purposes of comparison. Consider the following data:

  Abel
  abels
  ABEL
  abel
  ab
  Ab

For the following select statement:

SELECT c1 FROM T1 WHERE TRANSLATE(c1) LIKE 'AB%'

you would receive the following results:

  ab
  Ab
  abel
  Abel
  ABEL
  abels

Note:

You could also set the select as in the following view v1, and then make all your comparisons against the view (in upper case) and your inserts into the table in mixed case:

  CREATE VIEW v1 AS SELECT TRANSLATE(c1) FROM t1

At the database level, you can set the collating sequence as part of the CREATE DATABASE API. This allows you to decide if 'a' is processed before 'A', or if 'A' is processed after 'a', or if they are processed with equal weighting. This will make them equal when collating or sorting using the ORDER BY clause. If you have two values of 'a' and 'A', 'A' will always come before 'a', because in all senses they are equal, so the only difference upon which to sort is the hexadecimal value.

Thus if you issue SELECT c1 FROM t1 WHERE c1 LIKE 'ab%', you receive the following output:

  ab
  abel
  abels

If you issue SELECT c1 FROM t1 WHERE c1 LIKE 'A%', you receive the following output:

  Abel
  Ab
  ABEL

If you issue SELECT c1 FROM t1 ORDER BY c1, you receive the following:

  ab
  Ab
  abel
  Abel
  ABEL
  abels

Thus, you may want to consider using the scalar function TRANSLATE(), as well as the CREATE DATABASE API. Note that you can only specify a collating sequence using the CREATE DATABASE API. You cannot specify a collating sequence from the Command Line Processor. For information on the TRANSLATE() function, see the SQL Reference. For information on the CREATE DATABASE API see the Administrative API Reference.

You can also use the UCASE function as follows, but note that DB2 performs a table scan instead of using an index for the select:

    SELECT * FROM EMP WHERE UCASE(JOB) = 'NURSE'

Collating Sequence Sort Order: EBCDIC and ASCII Example

The order in which data in a database is sorted depends on the collating sequence defined for the database. For example, suppose that database A uses the EBCDIC code page's default collating sequence and that database B uses the ASCII code page's default collating sequence. Sort orders at these two databases would differ, as shown in Figure 7.

Figure 7. Example of How a Sort Order in an EBCDIC-Based Sequence Differs from a Sort Order in an ASCII-Based Sequence

SELECT..... ORDER BY COL2 EBCDIC-Based Sort ASCII-Based Sort COL2 COL2 ---- ---- V1G 7AB Y2W V1G 7AB Y2W

Similarly, character comparisons in a database depend on the collating sequence defined for that database. So if database A uses the EBCDIC code page's default collating sequence and database B uses the ASCII code page's default collating sequence, the results of character comparisons at the two databases would differ. Figure 8 illustrates the difference.

Figure 8. Example of How a Comparison of Characters in an EBCDIC-Based Sequence Differs from a Comparison of Characters in an ASCII-Based Sequence

SELECT..... WHERE COL2 > 'TT3' EBCDIC-Based Results ASCII-Based Results COL2 COL2 ---- ---- TW4 TW4 X72 X72 39G

If you are creating a federated database, consider specifying that your collating sequence matches the collating sequence at a data source. This approach will maximize "push-down" opportunities and possibly increase query performance. See the Administration Guide, Performance for more information on the relationship between pushdown analysis, collating sequences, and query performance.

Specifying a Collating Sequence

The collating sequence for a database is specified at database creation time. Once the database has been created, the collating sequence cannot be changed.

The CREATE DATABASE API accepts a data structure called the Database Descriptor Block (SQLEDBDESC). You can define your own collating sequence within this structure.

To specify a collating sequence for a database:

Pass the desired SQLEDBDESC structure, or
Pass a NULL pointer. The collating sequence of the operating system (based on current country code and code page) is used. This is the same as specifying SQLDBCSS equal to SQL_CS_SYSTEM (0).

The SQLEDBDESC structure contains:

SQLDBCSS

A 4-byte integer indicating the source of the database collating sequence. Valid values are:

SQL_CS_SYSTEM: The collating sequence of the operating system (based on current country code and code page) is used.
SQL_CS_USER: The collating sequence is specified by the value in the SQLDBUDC field.
SQL_CS_NONE: The collating sequence is the identity sequence. Strings are compared byte for byte, starting with the first byte, using a simple code point comparison.

Note:

These constants are defined in the SQLENV include file.

SQLDBUDC

A 256-byte field. The nth byte contains the sort weight of the nth character in the code page of the database. If SQLDBCSS is not equal to SQL_CS_USER, this field is ignored.

Sample Collating Sequences:

Several sample collating sequences are provided (as include files) to facilitate database creation using the EBCDIC collating sequences instead of the default workstation collating sequence.

The collating sequences in these include files can be specified in the SQLDBUDC field of the SQLEDBDESC structure. They can also be used as models for the construction of other collating sequences.

For information on the include files that contain collating sequences, see the following sections:

For C/C++, Include Files
For COBOL, Include Files
For FORTRAN, Include Files.

General Concerns:

Once a collating sequence is defined, all future character comparisons for that database will be performed with that collating sequence. Except for character data defined as FOR BIT DATA or BLOB data, the collating sequence will be used for all SQL comparisons and ORDER BY clauses, and also in setting up indexes and statistics. For more information on how the database collating sequence is used, see the section on String Comparisons in the SQL Reference, N/A.

Potential problems may occur in the following cases:

An application merges sorted data from a database with application data that was sorted using a different collating sequence.
An application merges sorted data from one database with sorted data from another, but the databases have different collating sequences.
An application makes assumptions about sorted data that are not true for the relevant collating sequence. For example, numbers collating lower than alphabetics might or might not be true for a particular collating sequence.

A final point to remember is that the results of any sort based on a direct comparison of code points of characters will only match the results of a query ordered using an identity collating sequence.

Federated Database Concerns:

Your choice of database collating sequence can affect federated system performance. If a data source uses the same collating sequence as the DB2 federated database, DB2 can push-down order-dependent processing involving character data to the data source. If a data source collating sequence does not match DB2's, data is retrieved and all order-dependent processing on character data is done locally (which can slow performance).

To determine if a data source and DB2 have the same collating sequence, consider the following factors:

National language support
The collating sequence is related to the language supported on a server. Compare DB2 NLS information to data source NLS information.
Data source characteristics
Some data sources are created using case-insensitive collating sequences, which can yield different results from DB2 in order-dependent operations.
Customization
Some data sources provide multiple options for collating sequences or allow the collating sequence to be customized.

Choose the collating sequence for a DB2 federated database based on the mix of data sources that will be accessed from that database. For example:

If a DB2 database will access mostly Oracle databases with the same code page (NLS) as DB2, specify SQL_CS_NONE for the SQLDBCSS structure (Oracle databases use an equivalent collating sequence).
If a DB2 database will access only DB2 UDB databases, ensure that you match SQLDBCSS values.

For information on how to set up a MVS collating sequence, you should refer to theAdministrative API Reference for samples under the "sqlecrea" topic. These samples contain collation tables for EBCIDIC 500, 37, and 5026/5035 code pages.

After you set the collating sequence for the DB2 database, ensure that you set the collating_sequence server option for each data source server. The collating_sequence option indicates if the collating sequence of a given data source server matches the collating sequence of the DB2 database.

Set the collating_sequence option to "Y" if the collating sequences match. This setting allows the DB2 optimizer to consider order-dependent processing at a data source, which can improve performance. However, if the data source collating sequence is not the same as the DB2 database collating sequence, you can receive incorrect results. For example, if your plan uses merge joins, the DB2 optimizer will push-down ordering operations to the data sources as much as possible. If the data source collating sequence is not the same, the join results may not have a correct result set.

Set the collating_sequence option to "N" if the collating sequences do not match. Use this value when data source collating sequences differ from DB2 or when the data source collating operations might be case insensitive. For example, in a case insensitive data source with an English code page, TOLLESON, ToLLeSoN, and tolleson would all be considered equal. Set the collating_sequence option to "N" if you are not sure that the collating sequence at the data source is identical to the DB2 collating sequence.

Deriving Code Page Values

The application code page is derived from the active environment when the database connection is made. If the DB2CODEPAGE registry variable is set, its value is taken as the application code page. However, it is not necessary to set the DB2CODEPAGE registry variable because DB2 will determine the appropriate code page value from the operating system. Setting the DB2CODEPAGE registry variable to incorrect values may cause unpredictable results.

The database code page is derived from the value specified (explicitly or by default) at the time the database is created. The following defines how the active environment is determined in different operating environments, for example:

UNIX: In UNIX-based environments, the active environment is determined from the locale setting, which includes information about language, territory and code set.
OS/2: In OS/2, primary and secondary code pages are specified in the CONFIG.SYS file. You can use the chcp command to display and dynamically change code pages within a given session.
Windows 32-bit operating systems: For all Windows 32-bit operating systems, if the DB2CODEPAGE environment variable is not set, the code page is derived from the ANSI code page setting in the Registry.

For a complete list of environment mappings for code page values, see Table 44.

Deriving Locales in Application Programs

Locales are implemented one way in Windows and another in UNIX-based systems. In UNIX-based systems there are two locales:

The environment locale allows you to specify the language, currency symbol, and so on, that you want to use.
The program locale contains the current language, currency symbol, and so on, of a program that is executing.

In Windows, the cultural preferences can be set through Regional Settings of the Control Panel. However, there is no environmental locale like the one on UNIX systems.

When your program is started, it gets a default C locale. It does not get a copy of the environment locale. If you set the program locale to any locale other than "C", DB2 Universal Database uses your current program locale to determine the code page and territory settings for your application environment. Otherwise, these values are obtained from the operating system environment. You should note that setlocale() is not thread-safe, and if you issue setlocale() from within your application, the new locale is set for the entire process.

How DB2 Derives Locales

With UNIX, the active locale used by DB2 is determined from the LC_CTYPE portion of the locale. For details, see the NLS documentation for your operating system.

If LC_CTYPE of the program locale has a value other than that of 'C', DB2 will use this value to determine the application code page by mapping it to its corresponding code page.
If LC_CTYPE has the value of 'C' (the 'C' locale), DB2 will set the program locale according to the environment locale using the setlocale() function.
If LC_CTYPE still has a value of 'C', DB2 will assume the default of the US English environment, and code page 819 (ISO 8859-1).
If LC_CTYPE's value is no longer 'C', its new value will be used to map to a corresponding code page. For information on the default locale for a particular platform, see Table 44. Additional information may be found in the Application Building Guide.

Programming Considerations

Constant character strings in static SQL statements are converted at bind time, from the application code page to the database code page, and will be used at execution time in this database code page representation. To avoid such conversions if they are not desired, you can use host variables in place of string constants.

If your program contains constant character strings, it is strongly recommended that application be precompiled, bound, compiled, and executed using the same code page. For a Unicode database, you should use host variables instead of using string constants. This is because data conversions by the server can occur in both the bind and the execution phases. This could be a concern if constant character strings are used within the program. These embedded strings are converted at bind time based on the code page which is in effect during the bind phase. Seven-bit ASCII characters are common to all the code pages supported by DB2 Universal Database and will not cause a problem. For non-ASCII characters, users should ensure that the same conversion tables are used by binding and executing with the same active code page. For a discussion of how applications determine the active code page, see Deriving Code Page Values.

Any external data obtained by the application will be assumed to be in the application code page. This includes data obtained from a file or from user input. Make sure that data from sources outside the application uses the same code page as the application.

If you use host variables that use graphic data in your C or C++ applications, there are special precompiler, application performance, and application design issues you need to consider. For a detailed discussion of these considerations, see Handling Graphic Host Variables. If you deal with EUC code sets in your applications, refer to Japanese and Traditional Chinese EUC and UCS-2 Code Set Considerations for guidelines that you should consider.

Coding SQL Statements

The coding of SQL statements is not language dependent. The SQL keywords must be typed as shown in this book, although they may be typed in uppercase, lowercase, or mixed case. The names of database objects, host variables and program labels that occur in an SQL statement cannot contain characters outside the extended character set supported by your code page. See the SQL Reference for more information about extended character sets.

The server does not convert file names. To code a file name, either use the ASCII invariant set, or provide the path in the hexadecimal values that are physically stored in the file system.

In a multi-byte environment, there are four characters which are considered special that do not belong to the invariant character set. These characters are:

The double-byte percentage and double-byte underscore characters used in LIKE processing. For further details concerning LIKE, refer to the SQL Reference.
The double-byte space character, used for, among other things, blank padding in graphic strings.
The double-byte substitution character, used as a replacement during character conversion when no mapping exists between a source code page and a target code page.

The code points for each of these characters, by code page is as follows:

Table 13. Code Points for Special Double-byte Characters

Code Page Double-Byte Percentage Double-Byte Underscore Double-byte Space Double-Byte Substitution Character
932 X'8193' X'8151' X'8140' X'FCFC'
938 X'8193' X'8151' X'8140' X'FCFC'
942 X'8193' X'8151' X'8140' X'FCFC'
943 X'8193' X'8151' X'8140' X'FCFC'
948 X'8193' X'8151' X'8140' X'FCFC'
949 X'A3A5' X'A3DF' X'A1A1' X'AFFE'
950 X'A248' X'A1C4' X'A140' X'C8FE'
954 X'A1F3' X'A1B2' X'A1A1' X'F4FE'
964 X'A2E8' X'A2A5' X'A1A1' X'FDFE'
970 X'A3A5' X'A3DF' X'A1A1' X'AFFE'
1381 X'A3A5' X'A3DF' X'A1A1' X'FEFE'
1383 X'A3A5' X'A3DF' X'A1A1' X'A1A1'
13488 X'FF05' X'FF3F' X'3000' X'FFFD'

Code Page	Double-Byte Percentage	Double-Byte Underscore	Double-byte Space	Double-Byte Substitution Character
932	X'8193'	X'8151'	X'8140'	X'FCFC'
938	X'8193'	X'8151'	X'8140'	X'FCFC'
942	X'8193'	X'8151'	X'8140'	X'FCFC'
943	X'8193'	X'8151'	X'8140'	X'FCFC'
948	X'8193'	X'8151'	X'8140'	X'FCFC'
949	X'A3A5'	X'A3DF'	X'A1A1'	X'AFFE'
950	X'A248'	X'A1C4'	X'A140'	X'C8FE'
954	X'A1F3'	X'A1B2'	X'A1A1'	X'F4FE'
964	X'A2E8'	X'A2A5'	X'A1A1'	X'FDFE'
970	X'A3A5'	X'A3DF'	X'A1A1'	X'AFFE'
1381	X'A3A5'	X'A3DF'	X'A1A1'	X'FEFE'
1383	X'A3A5'	X'A3DF'	X'A1A1'	X'A1A1'
13488	X'FF05'	X'FF3F'	X'3000'	X'FFFD'

UCS-2 Considerations:

For a UCS-2 database, the GRAPHIC space is X'0020' which is different from the "Double-byte Space" of X'3000' used for CCSID 13488. This difference should be taken into consideration when comparing data from a EUC database to data from a UCS-2 database. Note that in a UCS-2 database, Unicode representations of ASCII percent and an ASCII underscore are used for pattern matching. DBCS percent and DBCS underscore have no special meaning for a UCS-2 database. The DBCS substitution character is used to replace any EUC non-SBCS character as required. There is no concept of a three or four byte substitution character.

Coding Remote Stored Procedures and UDFs

When coding stored procedures that will be running remotely, the following considerations apply:

Data in a stored procedure must be in the database code page.
Data passed to or from a stored procedure using an SQLDA with a character data type must really contain character data. Numeric data and data structures must never be passed with a character type if the client application code page is different from the database code page. This is because the server will convert all character data in an SQLDA. To avoid character conversion, you can pass data by defining it in binary string format by using a data type of BLOB or by defining the character data as FOR BIT DATA.

By default, when you invoke DB2 DARI stored procedures and UDFs, they run under a default national language environment which may not match the database's national language environment. Consequently, using country or code page specific operations, such as the C wchar_t graphic host variables and functions, may not work as you expect. You need to ensure that, if applicable, the correct environment is initialized when you invoke the stored procedure or UDF.

Package Name Considerations in Mixed Code Page Environments

Package names are determined when you invoke the PRECOMPILE PROGRAM command or API. By default, they are generated based on the first 8-bytes of the application program source file (without the file extension) and are folded to upper case. Optionally, a name can be explicitly defined. Regardless of the origin of a package name, if you are running in an unequal code page environment, the characters for your package names should be in the invariant character set. Otherwise you may experience problems related to the modification of your package name. The database manager will not be able to find the package for the application or a client-side tool will not display the right name for your package.

A package name modification due to character conversion will occur if any of the characters in the package name, are not directly mapped to a valid character in the database code page. In such cases, a substitution character replaces the character that is not converted. After such a modification, the package name, when converted back to the application code page, may not match the original package name. An example of a case where this behavior is undesirable is when you use the DB2 Database Director to list and work with packages. Package names displayed may not match the expected names.

To avoid conversion problems with package names, ensure that only characters are used which are valid under both the application and database code pages.

Precompiling and Binding

At precompile/bind time, the precompiler is the executing application. The active code page when the database connection was made prior to the precompile request is used for precompiled statements, and any character data returned in the SQLCA.

Executing an Application

At execution time, the active code page of the user application when a database connection is made is in effect for the duration of the connection. All data is interpreted based on this code page; this includes dynamic SQL statements, user input data, user output data, and character fields in the SQLCA.

A Note of Caution

Failure to follow these guidelines may produce unpredictable results. These conditions cannot be detected by the database manager, so no error or warning message will result. For example, a C application contains the following SQL statements operating against a table T1 with one column defined as C1 CHAR(20):

     (0)  EXEC SQL CONNECT TO GLOBALDB;
     (1)  EXEC SQL INSERT INTO T1 VALUES ('a-constant');
          strcpy(sqlstmt, "SELECT C1 FROM T1 WHERE C1='a-constant');
     (2)  EXEC SQL PREPARE S1 FROM :sqlstmt;
Where:
    application code page at bind time = x
    application code page at execution time = y
    database code page = z

At bind time, 'a-constant' in statement (1) is converted from code page x to code page z. This conversion can be noted as (x>z).

At execution time, 'a-constant' (x>z) is inserted into the table when statement (1) is executed. However, the WHERE clause of statement (2) will be executed with 'a-constant' (y>z). If the code points in the constant are such that the two conversions (x>z and y>z) yield different results, the SELECT in statement (2) will fail to retrieve the data inserted by statement (1).

Conversion Between Different Code Pages

Ideally, for optimal performance, your applications should always use the same code page as your database. However, this is not always practical or possible. The DB2 products provide support for character conversion that allows your application and database to use different code pages. Characters from one code page must be mapped to the other code page in order to maintain meaning of the data.

When Does Character Conversion Occur?:

Character conversion can occur in the following situations:

When a client or application accessing a database is running in a code page that is different from the code page of the database.

This database conversion will occur on the database server machine for both conversions from the application code page to the database code page and from the database code page to the application code page.

You can minimize or eliminate client/server character conversion in some situations. For example, you could:

Create a Windows NT database using code page 850 to match an OS/2 and Windows client application environment that predominately uses code page 850,
If a Windows ODBC application is used with the IBM DB2 ODBC driver in Windows database client, this problem may be alleviated by the use of the TRANSLATEDLL and TRANSLATEOPTION keywords in the odbc.ini or db2cli.ini file.
Create a DB2 for AIX database using code page 850 to match an OS/2 and DOS client application environment that predominately uses code page 850.

Note:

The DB2 for OS/2 Version 1.0 or Version 1.2 database server does not support character conversion between different code pages. Ensure that the code pages on server and client are compatible. See Table 44 for the code page conversions that are supported.

When a client or application importing a PC/IXF file runs in a code page that is different from the file being imported.
This data conversion will occur on the database client machine before the client accesses the database server. Additional data conversion may take place if the application is running in a code page that is different from the code page of the database (as stated in the previous point).
Data conversion, if any, also depends on how the import utility was called. See the Administration Guide for more information.
When DB2 Connect is used to access data on a host or AS/400 server. In this case the data receiver converts the character data. For example, data that is sent to DB2 for MVS/ESA is converted to the appropriate MVS coded character set identifier (CCSID) by DB2 for MVS/ESA. The data sent back to the DB2 Connect machine from DB2 for MVS/ESA is converted by DB2 Connect. For more information, see the DB2 Connect User's Guide.

Character conversion will not occur for:

File names. You should either use the ASCII invariant set for file names or provide the file name in the hexadecimal values that are physically stored in the file system. Note that if you include a file name as part of a SQL statement, it gets converted as part of the statement conversion.
Data that is targeted for or comes from a column assigned the FOR BIT DATA attribute, or data used in an SQL operation whose result is FOR BIT or BLOB data. In these cases, the data is treated as a byte stream and no conversion occurs. ¹ See the SQL Reference for unequal code page rules for assigning, comparing, and combining strings.
A DB2 product or platform that does not support, or that does not have support installed, for the desired combination of code pages. In this case, an SQLCODE -332 (SQLSTATE 57017) is returned when you try to run your application.

Character Substitutions During Conversions: When your application converts from one code page to another, it is possible that one or more characters are not represented in the target code page. If this occurs, DB2 inserts a substitution character into the target string in place of the character that has no representation. The replacement character is then considered a valid part of the string. In situations where a substitution occurs, the SQLWARN10 indicator in the SQLCA is set to 'W'.
Note: Any character conversions resulting from using the WCHARTYPE CONVERT precompiler option will not flag a warning if any substitutions take place.

Supported Character Conversions:

When data conversion occurs, conversion will take place from a source code page to a target code page.

The source code page is determined from the source of the data; data from the application has a source code page equal to the application code page, and data from the database has a source code page equal to the database code page.

The determination of target code page is more involved; where the data is to be placed, including rules for intermediate operations, is considered:

If the data is moved directly from an application into a database, with no intervening operations, the target code page is the database code page.
If the data is being imported into a database from a PC/IXF file, there are two character conversion steps:
1. From the PC/IXF file code page (source code page) to the application code page (target code page)
2. From the application code page (source code page) to the database code page (target code page).
Exercise caution in situations where two conversion steps might occur. Make sure you follow the supported character conversions listed in Table 44 to avoid a possible loss of character data. Additionally, within each group, only characters which exist in both the source and target code page have meaningful conversions. Other characters are used as "substitutions" and are only useful for converting from the target code page back to the source code page (and may not necessarily provide meaningless conversions in the two-step conversion process mentioned above). Such problems are avoided if the application code page is the same as the database code page.
If the data is derived from operations performed on character data, where the source may be any of the application code page, the database code page, FOR BIT DATA, or for BLOB data, data conversion is based on a set of rules. Some or all of the data items may have to be converted to an intermediate result, before the final target code page can be determined. See the SQL Reference for a summary of these rules and for specific application with individual operators and predicates.

Table 44 shows the code pages that are supported by DB2 Universal Database. The values under the heading "Group" can be used to determine where conversions are supported. Any code page can be converted to any other code page that is listed in the same IBM-defined language group. For example, code page 437 can be converted to 37, 819, 850, 1051, 1252, or 1275.
Note: Character string conversions between multi-byte code pages, for example DBCS and EUC, may result in either an increase or a decrease in the length of the string.

Character Conversion Expansion Factor:

When your application successfully completes an attempt to connect to a DB2 database server, you should consider the following fields in the returned SQLCA:

The second token in the SQLERRMC field (tokens are separated by X'FF') indicates the code page of the database. The ninth token in the SQLERRMC field indicates the code page of the application. Querying the application's code page and comparing it to the database's code page informs the application whether it has established a connection which will undergo character conversions.
The first and second entries in the SQLERRD array. SQLERRD(1) contains an integer value equal to the maximum expected expansion or contraction factor for the length of mixed character data (CHAR data types) when converted to the database code page from the application code page. SQLERRD(2) contains an integer value equal to the maximum expected expansion or contraction factor for the length of mixed character data (CHAR data types) when converted to the application code page from the database code page. A value of 0 or 1 indicates no expansion; a value greater than 1 indicates a possible expansion in length; a negative value indicates a possible contraction. Refer to the SQL Reference for details on using the CONNECT statement.

The considerations for graphic string data should not be a factor in unequal code page situations. Each string always has the same number of characters, regardless of whether the data is in the application or the database code page.

See Unequal Code Page Situations for information on dealing with unequal code page situations.

Unicode/UCS-2 and UTF-8 Support in DB2 UDB

These two standards are documented here.

Introduction

The Unicode character encoding standard is a fixed-length, character encoding scheme that includes characters from almost all the living languages of the world. Unicode characters are usually shown as "U+xxxx" where xxxx is the hexadecimal code of the character.

Each character is 16 bits (2 bytes) wide, regardless of the language. While the resulting 65 000 code elements are sufficient for encoding most of the characters of the major languages of the world, the Unicode standard also provides an extension mechanism, that allows for encoding as many as a million more characters. This extension reserves a range of code values (U+D800 to U+D8FF, known as "surrogates") for encoding some 32-bit characters as two successive code elements.

The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) 10646 standard (ISO/IEC 10646) specifies the Universal Multiple-Octet Coded Character Set (UCS) that has a 2-byte version (UCS-2) and a 4-byte version (UCS-4). The 2-byte version of this ISO standard UCS-2 is identical to Unicode without surrogates. ISO 10646 also defines an extension technique, for encoding some UCS-4 codes in a UCS-2 encoded string. This extension called UTF-16, is identical to Unicode with surrogates.

DB2 UDB supports UCS-2, that is, Unicode without surrogates.

UTF-8: With UCS-2 or Unicode encoding, ASCII and control characters are also two bytes long, and the lead byte is zero. For example, NULL is U+0000 and CAPITAL LETTER A is represented by U+0041. This could be a major problem for ASCII-based applications and ASCII file systems, because in a UCS-2 strings, extraneous NULLs may appear anywhere in the string. A transformation algorithm, known as UTF-8, can be used to circumvent this problem for programs that rely on ASCII code being invariant.

UTF-8 (UCS Transformation Format 8), is an algorithmic transformation which transforms fixed-length UCS-4 characters into variable-length byte strings. In UTF-8, ASCII characters are represented by their usual single-byte codes, but non-ASCII characters in UCS-2 become two or three bytes long. In other words, UTF-8 transforms UCS-2 characters to a multi-byte codeset, for which ASCII is invariant. The number of bytes for each UCS-2 character in UTF-8 format can be determined from the following table:

    UCS-2 (hex)     UTF-8 (binary)                 Description
    ------------    --------------------------     ----------------
    0000 to 007F    0xxxxxxx                       ASCII
    0080 to 07FF    110xxxxx 10xxxxxx              up to U+07FF
    0800 to FFFF    1110xxxx 10xxxxxx 10xxxxxx     other UCS-2
 
    NOTE: The range D800 to DFFF is to be excluded from treatment
          by the third row of this table which governs the UCS-4 range
          0000 0800 to 0000 FFFF.

In all the above, a series of x's indicate the UCS bit representation of the character. For example, U0080 transforms into 11000010 10000000.

UCS-2/UTF-8 Implementation in DB2 UDB

Code Page/CCSID Numbers: Within IBM, the UCS-2 code page has been registered as code page 1200. All code pages are defined with growing character sets, that is, when new characters are added to a code page, the code page number does not change. Code page 1200 always refers to the current version of Unicode/UCS-2, and has been used for UCS-2 support in DB2 UDB.

A specific repertoire of the UCS standard, as defined by Unicode 2.0 and ISO/IEC 10646-1, has also been registered within IBM as CCSID 13488. This CCSID (13488) has been used internally by DB2 UDB for storing graphic string data in euc-Japan and euc-Taiwan databases. CCSID 13488 and code page 1200 both refer to UCS-2, and are handled the same way except for the value of their "double-byte" (DBCS) space:

     CP/CCSID        Single Byte (SBCS) space      Double Byte (DBCS) space
    ---------        ------------------------      ------------------------
      1200                   N/A                           U+0020
      13488                  N/A                           U+3000
 
    NOTE: In a UCS-2 database, U+3000 has no special meaning.

Regarding the conversion tables, since code page 1200 is a superset of CCSID 13488, the exact same (superset) tables are used for both.

In IBM, UTF-8 has been registered as CCSID 1208 with growing character set (sometimes also referred to as code page 1208). As new characters are added to the standard, this number (1208) will not change either. 1208 is used as the multi-byte code page number for DB2's UCS-2/UTF-8 support.

DB2 UDB supports UCS-2 as a new multi-byte code page. The MBCS code page number is 1208, which is the database code page number, and the code page of character string data within the database. The double-byte code page number (for UCS-2) is 1200 which is the code page of graphic string data within the database. When a database is created in UCS-2/UTF-8, CHAR, VARCHAR, LONG VARCHAR, and CLOB data, are stored in UTF-8, and GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, and DBCLOB data, are stored in UCS-2. We will simply refer to this as a UCS-2 database.

Creating a UCS-2 Database: By default, databases are created in the code page of the application creating them. Therefore, if you create your database from a UTF-8 client (for example, the UNIVERSAL locale of AIX), or if DB2CODEPAGE environment variable on the client is set to 1208, your database will be created as a UCS-2 database. Alternatively, you can explicitly specify "UTF-8" as the CODESET name, and use any valid two letter TERRITORY code supported by DB2 UDB. Refer to the Administration Guide, Design and Implementation, and the appendix on "NLS Support" for a complete list of territory codes supported by DB2 UDB.

For example, to create a UCS-2 database from the CLP, with the territory code for United States, issue:

  DB2 CREATE DATABASE dbname USING CODESET UTF-8 TERRITORY US

To create a UCS-2 database using the sqlecrea API, you should set the values in sqledbcountryinfo accordingly. For example, set sqledbcountryinfo to "UTF-8", and sqldblocale to any valid territory code (for example, "US").

The default collation sequence for a UCS-2 database is IDENTITY, which provides UCS-2 code point order. Therefore, by default, all UCS-2/UTF-8 characters are ordered and compared according to their UCS-2 code point sequence.

All cultural-sensitive parameters such as date/time format, decimal separator, and others, are based on the current territory of the client.

A UCS-2 database allows connection from every single-byte and multi-byte code page supported by DB2 UDB. Code page character conversions between client's code page and UTF-8 are automatically performed by the database manager. Data in graphic string types, is always in UCS-2 and does not go through code page conversions. The Command Line Processor (CLP) environment is an exception. If you SELECT graphic string (UCS-2) data from the CLP, the returned graphic string data is converted (by the CLP) from UCS-2 to the code page of your client environment.

Every client is limited by the character repertoire, the input methods, and the fonts supported by its environment, but the UCS-2 database itself accepts and stores all UCS-2 characters. Therefore, every client usually works with a subset of UCS-2 characters, but the database manager allows the entire repertoire of UCS-2 characters.

When characters are converted from a local code page to UTF-8, there is a possibility of expansion in the number of bytes. There is no expansion for ASCII characters, but other UCS-2 characters expand by a factor of two or three. The number of bytes of each UCS-2 character in UTF-8 format can be determined from the above table (section about UTF-8).

Data Types: All data types supported by DB2 UDB, are also supported in a UCS-2 database. In particular, graphic string data, is supported for UCS-2 database and is stored in UCS-2/Unicode. Every client, including SBCS clients, can work with graphic string data types in UCS-2/Unicode when connected to a UCS-2 database.

A UCS-2 database is like any MBCS database where character string data is measured in number of bytes. When working with character string data in UTF-8, one should not assume that each character is one byte. In multi-byte UTF-8 encoding, each ASCII character is one byte, but non-ASCII characters take two or three bytes each. This should be taken into account when defining CHAR fields. Depending on the ratio of ASCII to non-ASCII characters, a CHAR field of size n bytes, can contain anywhere from n/3 to n characters.

Using character string UTF-8 encoding versus graphic string UCS-2 data type also has an impact on the total storage requirements. For a situation where the majority of characters are ASCII, with some non-ASCII characters in between, storing UTF-8 data may be a better alternative because the storage requirements are closer to one byte per character. On the other hand, for situations where the majority of characters are non-ASCII characters that expand to three-byte UTF-8 sequences (for example ideographic characters), the UCS-2 graphic-string format may be a better alternative because every UCS-2 character requires exactly two bytes, rather than three bytes for each corresponding character in UTF-8 format.

SQL scalar functions that operate on character strings, such as LENGTH, SUBSTR, POSSTR, MAX, MIN, and the like, in MBCS environments operate on number of "bytes" rather than number of "characters". The behavior is the same in a UCS-2 database but you should take extra care when specifying offsets and lengths for a USC-2 database because these values are always defined in the context of the database code page. That is, in the case of UCS-2 database should be defined in UTF-8. Since some single-byte characters require more than one byte in UTF-8, SUBSTR indexes that are valid for a single-byte database may not be valid for a UCS-2 database. If you specify incorrect indexes, you will get SQLCODE -191, SQLSTATE 22504. Refer to the SQL Reference for a description of the behavior of these functions.

SQL CHAR data types are supported by C language's char data type in user programs. SQL GRAPHIC data types are supported by sqldbchar in user C programs. Note that, for a UCS-2 database, sqldbchar data is always in big-endian (high byte first) format. When an application program is connected to a UCS-2 database, character string data is converted between the application code page and UTF-8 by DB2 UDB, but graphic string data is always in UCS-2. However, you can set the WCHARTYPE pre-compiler option to CONVERT, if you want to convert UCS-2 data to you program's wchar_t. See Handling Graphic Host Variables for more information.

Identifiers: In a UCS-2 database, all identifiers are in multi-byte UTF-8. Therefore, it is possible to use any UCS-2 character in identifiers where the use of a character in the extended character set (for example, an accented character, or a multi-byte character) is allowed by DB2 UDB. Please refer to the appendix "Naming Rules" in the Administration Guide for details of which identifiers allow use of extended characters.

Clients can enter any character which is supported by their SBCS/MBCS environment, and all the characters in the identifiers will be converted to UTF-8 by the database manager. Two points need to be taken into account when specifying National Language characters in identifiers in a UCS-2 database:

Each non-ASCII character takes two or three bytes. Therefore, an n-byte identifier, can only hold somewhere between n/3 and n characters, depending the ratio of ASCII to non-ASCII characters. If you have only one or two non-ASCII (for example, accented) characters, the limit is closer to n characters, while for an identifier which is completely non-ASCII (for example, in Japanese), only n/3 characters can be used.
If identifiers are to be entered from different client environments, they should be defined using the common subset of characters available to those clients. For example, if a UCS-2 database is to be accessed from Latin-1, Arabic, and Japanese environments, all identifiers should realistically be limited to ASCII.

UCS-2 Literals: UCS-2 literals can be specified in two ways:

As a GRAPHIC string constant using the G'...', or N'....' format as described in the SQL Reference, Chapter 3 "Language Elements", the section "Constants", the subsection "Graphic String Constants." Any literal specified in this way will be converted by the database manager from the application code page to UCS-2.
As a UCS-2 hexadecimal string, using the UX'....' or GX'....' format. The constant specified between quotes after UX or GX must be a multiple of 4 hexadecimal digits. Each four digits represent one UCS-2 code point.

When using the Command Line Processor (CLP), the first method is easier if the UCS-2 character exists in the local application code page (for example, for entering any code page 850 character from a terminal that is using code page 850). The second method should be used for characters which are outside the application code page repertoire (for example, for specifying Japanese characters from a terminal that is using code page 850).

Pattern Matching in a UCS-2 Database: Pattern matching is one area where the behavior of existing MBCS databases is slightly different from the behavior of a UCS-2 database.

For MBCS databases in DB2 UDB, the current behavior is as follows: If the match-expression contains MBCS data, the pattern can include both SBCS and MBCS characters. The special characters in the pattern are interpreted as follows:

An SBCS underscore refers to one SBCS character.
A DBCS underscore refers to one MBCS character.
A percent (either SBCS or DBCS) refers to a string of zero or more SBCS or MBCS characters.

If the match-expression contains graphic string DBCS data, the expressions contain only DBCS characters. The special characters in the pattern are interpreted as follows:

A DBCS underscore refers to one DBCS character.
A DBCS percent refers to a string of zero or more DBCS characters.

In a UCS-2 database, there is really no distinction between "single-byte" and "double-byte" characters; every UCS-2 character occupies two bytes. Although the UTF-8 format is a "mixed-byte" encoding of UCS-2 characters, there is no real distinction between SBCS and MBCS characters in UTF-8. Every character is a UCS-2 character, irrespective of its number of bytes in UTF-8 format. When specifying a character string, or a graphic string expression, an underscore refers to one UCS-2 characters and a percent refers to a string of zero or more UCS-2 characters.

On the client side, the character string expressions are in the code page of the client, and will be converted to UTF-8 by the database manager. SBCS client code pages, do not have any DBCS percent or DBCS underscore, but every supported code page contains a single-byte percent (corresponding to U+0025) and a single-byte underscore (corresponding to U+005F). The interpretation of special characters for a UCS-2 database is as follows:

An SBCS underscore (corresponding to U+0025) refers to one UCS-2 character in a graphic string expression, or to one UTF-8 character in a character string expression.
An SBCS percent (corresponding to U+005F) refers to a string of zero or more UCS-2 characters in a graphic string expression, or to a string of zero or more UTF-8 characters in a character string expression.

DBCS code pages, additionally support a DBCS percent sign (corresponding to U+FF05) and a DBCS underscore (corresponding to U+FF3F). These characters have no special meaning for a UCS-2 database.

For the optional "escape-expression" which specifies a character to be used to modify the special meaning of the underscore and percent characters, only ASCII characters, or characters that expand into a two-byte UTF-8 sequence, are supported. If you specify an ESCAPE character which expands to a three-byte UTF-8 value, you will get an error message (SQL0130N error, SQLSTATE 22019).

IMPORT/EXPORT/LOAD Considerations: The DEL, ASC, and PC/IXF file formats are supported for a UCS-2 database as described in this section. The WSF format is not supported.

When exporting from a UCS-2 database to an ASCII delimited (DEL) file, all character data is converted to the application code page. Both character string and GRAPHIC string data are converted to the same SBCS or MBCS code page of the client. This is the existing behavior for the export of any database, and cannot be changed because the entire ASCII delimited file can have only one code page. Therefore, if you export to an ASCII delimited file, only those UCS-2 characters that exist in your application code page would be saved. Other characters are replaced with the default substitution character for the application code page. For UTF-8 clients (code page 1208) there is no data loss because all UCS-2 characters are supported by UTF-8 clients.

When importing from an ASCII file (DEL or ASC) to a UCS-2 database, character string data is converted from the application code page to UTF-8, and GRAPHIC string data is converted from the application code page to UCS-2. There is no data loss. If you want to import ASCII data that has been saved under a different code page, you should switch the data file code page before issuing the import command. One way to accomplish this, is to set DB2CODEPAGE to the code page of the ASCII data file.

The range of valid ASCII delimiters, for SBCS and MBCS clients is identical to what is currently supported by DB2 UDB for these clients. The range of valid delimiters for UTF-8 clients is 0x01 to 0x7F, with the usual restrictions. Refer to the "IMPORT/EXPORT/LOAD Utility File Formats" appendix in the Command Reference for a complete list of these restrictions.

When exporting from a UCS-2 database to a PC/IXF file, character string data is converted to the SBCS/MBCS code page of the client. GRAPHIC string data is not converted and is stored in UCS-2 (code page 1200). There is no data loss.

When importing from an PC/IXF file to a UCS-2 database, character string data is assumed to be in the SBCS/MBCS code page stored in the PC/IXF header and GRAPHIC string data is assumed to be in the DBCS code page stored in the PC/IXF header. Character string data is converted by the IMPORT utility from the code page specified in the PC/IXF header to the code page of the client, and then from the client code page to UTF-8 (by the INSERT statement). GRAPHIC string data is converted by the IMPORT utility from the DBCS code page specified in the PC/IXF header directly to UCS-2 (code page 1200).

LOAD directly places the data into the database and by default, assumes data in ASC or DEL files is in the code page of the database. Therefore, by default no code page conversion takes place for ASCII files. When the code page of the data file has been explicitly specified (using the MODIFIED BY codepage=x command parameter), LOAD uses this information to convert from the specified code page x into the database code page before loading the data. For PC/IXF files, LOAD always converts from the code pages specified in the IXF header to the database code page (1208 for CHAR, 1200 for GRAPHIC).

The code page of DBCLOB files (as specified using the MODIFIED BY lobsinfile command parameter) is always 1200 for UCS-2. The code page of the CLOB files is the same as the code page of the data files being imported, loaded or exported. For example, for load or import using PC/IXF format, the CLOB file is assumed to be in the code page specified by the PC/IXF header. If the DBCLOB file is in ASC or DEL format, for LOAD the CLOB data is assumed to be in the code page of the database (unless explicitly specified otherwise using the MODIFIED BY codepage=x command parameter), and for IMPORT it is assumed to be in the client application code page.

The NOCHECKLENGTHS option is always set to TRUE for a UCS-2 database because in a UCS-2 database, any SBCS can be connected to the database for which there is no DBCS code pages; and also because character strings in UTF-8 format usually have different length than corresponding length in client code pages.

Refer to Data Movement Utilities Guide and Reference for more information on the LOAD, EXPORT, and IMPORT utilities.

Incompatibilities:

An application connected to a UCS-2 database, the graphic string data is always in UCS-2 (code page 1200). For applications connected to non UCS-2 databases, the graphic string data is in the applications DBCS code page; or, not allowed if the application code page is SBCS. For example, when a 932 client is connected to a Japanese non UCS-2 database, then the graphic string data is in code page 301. But for the 932 client applications connected to a UCS-2 database, the graphic string data is in UCS-2. However, you can set the WCHARTYPE pre-compiler option to convert graphic string data to wchar_t of your applications in both cases. See Handling Graphic Host Variables for more information.

Running CLI/ODBC/JDBC/SQLJ Programs in a DBCS Environment

For details on running Java programs that access DB2 Universal Database in a double-byte character set (DBCS) environment, see DB2 Java - DBCS Support

. This web page currently contains the following information:

JDBC and SQLJ programs access DB2 using the DB2 CLI/ODBC driver and therefore use the same configuration file (db2cli.ini). The following entries must be added to this configuration file if you run Java programs that access DB2 Universal Database in a DBCS environment:

PATCH1 = 65536: This forces the driver to manually insert a "G" in front of character literals which are in fact graphic literals. This PATCH1 value should always be set when working in a double byte environment.
PATCH1 = 64: This forces the driver to NULL terminate graphic output strings. This is needed by Microsoft Access in a double byte environment. If you need to use this PATCH1 value as well then you would add the two values together (64+65536 = 65600) and set PATCH1=65600. See Note #2 below for more information about specifying multiple PATCH1 values.
PATCH2 = 7: This forces the driver to map all graphic column data types to char column data type. This is needed in a double byte environment.
PATCH2 = 10: This setting should only be used in an EUC (Extended Unix Code) environment. It ensures that the CLI driver provides data for character variables (CHAR, VARCHAR, etc...) in the proper format for the JDBC driver. The data in these character types will not be usable in JDBC without this setting.

Note:

Each of these keywords is set in each database specific stanza of the db2cli.ini file. If you want to set them for multiple databases then you need to repeat them for each database stanza in db2cli.ini.
To set multiple PATCH1 values you add the individual values and use the sum. To set PATCH1 to both 64 and 65536 you would set PATCH1=65600 (64+65536). If you already have other PATCH1 values set then replace the existing number with the sum of the existing number and the new PATCH1 values you want to add.
To set multiple PATCH2 values you specify them in a comma delimited string (unlike the PATCH1 option). To set PATCH2 values 1 and 7 you would set PATCH2="1,7"

For more information about setting these keywords refer to the Installation and Configuration Supplement.

Footnotes:

¹: However, a literal inserted into a column defined as FOR BIT DATA could be converted if that literal was part of an SQL statement which was converted.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]