Character encoding units
A character encoding unit (or encoding unit) is the unit of data that COBOL treats as a single character at run time. In this documentation, the terms character and character position refer to a single encoding unit.
The size of an encoding unit for data items and literals depends on the USAGE clause of the data item or the category of the literal as follows:
- For data items described with USAGE DISPLAY and for alphanumeric literals, an encoding unit is 1 byte, regardless of the code page used and regardless of the number of bytes used to represent a given graphic character.
- For data items described with USAGE DISPLAY-1 (DBCS data items) and for DBCS literals, an encoding unit is 2 bytes.
- For data items described with USAGE NATIONAL and for national literals, an encoding unit is 2 bytes.
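The sizes listed above can be illustrated outside COBOL with a minimal Python sketch. It assumes that the Python codec cp037 stands in for a single-byte EBCDIC runtime code page and that national data is big-endian UTF-16 (utf-16-be). The helper name encoding_units is hypothetical and is not part of COBOL.

```python
# Illustration only: Python codecs stand in for COBOL runtime code pages.
# Assumptions: cp037 represents a single-byte EBCDIC code page and
# utf-16-be represents the UTF-16 form used for national data.

def encoding_units(text: str, codec: str, unit_size: int) -> int:
    """Return how many encoding units of unit_size bytes the encoded text occupies."""
    encoded = text.encode(codec)
    if len(encoded) % unit_size != 0:
        raise ValueError("encoded length is not a whole number of encoding units")
    return len(encoded) // unit_size

display_bytes = len("ABC".encode("cp037"))        # USAGE DISPLAY analogue
national_bytes = len("ABC".encode("utf-16-be"))   # USAGE NATIONAL analogue

display_units = encoding_units("ABC", "cp037", 1)
national_units = encoding_units("ABC", "utf-16-be", 2)

print(display_bytes, display_units)    # 3 bytes, 3 encoding units
print(national_bytes, national_units)  # 6 bytes, 3 encoding units
```

Both forms hold three character positions. Only the storage per position differs.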
The relationship between a graphic character and an encoding unit depends on the type of code page used for the data item or literal. The runtime code pages are of the following types:
- Single-byte EBCDIC
- EBCDIC DBCS
- Unicode UTF-16
See the following sections for the details of each type of code page.
Also see the section Specifying the encoding in the Enterprise COBOL Programming Guide.
Single-byte code pages
You can use a single-byte EBCDIC code page in data items described with USAGE DISPLAY and in literals of category alphanumeric. An encoding unit is 1 byte and each graphic character is represented in 1 byte. For these data items and literals, you need not be concerned with encoding units.
EBCDIC DBCS code pages
USAGE DISPLAY
You can use a mixture of single-byte and double-byte EBCDIC characters in data items described with USAGE DISPLAY and in literals of category alphanumeric. Double-byte characters must be delimited by shift-out and shift-in characters. An encoding unit is 1 byte and the size of a graphic character is 1 byte or 2 bytes.
When alphanumeric data items or literals contain DBCS data, programmers are responsible for ensuring that operations do not unintentionally separate the multiple encoding units that form a graphic character. Care should be taken with reference modification, and truncation during moves should be avoided. The COBOL runtime system does not check for a split between the encoding units that form a graphic character or for the loss of shift-out or shift-in codes.
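Because the runtime system performs no such check, a program might validate mixed data itself. The following Python sketch shows the kind of check involved. It assumes the EBCDIC shift-out and shift-in codes X'0E' and X'0F'. The helper is_well_formed_mixed is hypothetical, and the double-byte values in the sample are placeholders. Slicing the byte string plays the role of reference modification or truncation.

```python
# Illustration only: a hypothetical validity check for mixed SBCS/DBCS bytes.
SHIFT_OUT = 0x0E  # EBCDIC shift-out: double-byte characters follow
SHIFT_IN = 0x0F   # EBCDIC shift-in: single-byte characters follow

def is_well_formed_mixed(data: bytes) -> bool:
    """Return True if every DBCS run is delimited and holds whole 2-byte characters."""
    in_dbcs = False
    dbcs_len = 0
    for byte in data:
        if not in_dbcs:
            if byte == SHIFT_IN:
                return False          # shift-in without a preceding shift-out
            if byte == SHIFT_OUT:
                in_dbcs = True
                dbcs_len = 0
        elif byte == SHIFT_OUT:
            return False              # nested shift-out
        elif byte == SHIFT_IN:
            if dbcs_len % 2:
                return False          # a double-byte character was cut in half
            in_dbcs = False
        else:
            dbcs_len += 1
    return not in_dbcs                # False if the closing shift-in was lost

# Two SBCS bytes, shift-out, two placeholder DBCS characters, shift-in, one SBCS byte.
mixed = bytes([0xC1, 0xC2, 0x0E, 0x42, 0xC1, 0x42, 0xC2, 0x0F, 0xC3])

whole_ok = is_well_formed_mixed(mixed)           # intact item
cut_in_char = is_well_formed_mixed(mixed[:4])    # cut inside a double-byte character
lost_shift_in = is_well_formed_mixed(mixed[:7])  # truncation dropped the shift-in
lost_shift_out = is_well_formed_mixed(mixed[3:]) # slice starts after the shift-out
```

Only the intact item passes. Each of the three slices damages the data in one of the ways described above.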
To avoid problems, you can convert alphanumeric literals and data items described with usage DISPLAY to national data (UTF-16) by moving the data items or literals to data items described with usage NATIONAL or by using the NATIONAL-OF intrinsic function. You can then perform operations on the national data with less concern for splitting graphic characters. You can convert the data back to USAGE DISPLAY by using the DISPLAY-OF intrinsic function.
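The effect of converting mixed data to UTF-16 can be sketched in Python as follows. This is a loose analogue of what a conversion such as NATIONAL-OF achieves, not its implementation. It assumes cp037 for the single-byte characters and big-endian UTF-16 for the result. The function to_national and the two-entry TOY_DBCS table are hypothetical placeholders and are not taken from a real code page table.

```python
# Illustration only: a toy conversion from mixed SBCS/DBCS bytes to UTF-16.
SHIFT_OUT, SHIFT_IN = 0x0E, 0x0F

# Placeholder mapping assumed for this sketch; a real conversion uses the
# full table of the DBCS code page in effect.
TOY_DBCS = {b"\x42\xC1": "\uFF21", b"\x42\xC2": "\uFF22"}

def to_national(mixed: bytes) -> bytes:
    """Convert well-formed mixed bytes to big-endian UTF-16 bytes."""
    chars = []
    i = 0
    in_dbcs = False
    while i < len(mixed):
        byte = mixed[i]
        if byte == SHIFT_OUT:
            in_dbcs = True
            i += 1
        elif byte == SHIFT_IN:
            in_dbcs = False
            i += 1
        elif in_dbcs:
            chars.append(TOY_DBCS[mixed[i:i + 2]])  # one 2-byte DBCS character
            i += 2
        else:
            chars.append(bytes([byte]).decode("cp037"))  # one SBCS character
            i += 1
    return "".join(chars).encode("utf-16-be")

# One SBCS byte, shift-out, two DBCS characters, shift-in, one SBCS byte.
mixed = bytes([0xC1, 0x0E, 0x42, 0xC1, 0x42, 0xC2, 0x0F, 0xC2])
national = to_national(mixed)

# After conversion every graphic character in this sample is exactly one
# 2-byte encoding unit, so taking the first two units cannot split a character.
first_two = national[:4].decode("utf-16-be")
```

The shift codes disappear in the converted form, which is why operations on character positions become less error-prone.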
USAGE DISPLAY-1
You can use double-byte characters of an EBCDIC DBCS code page in data items described with USAGE DISPLAY-1 and in literals of category DBCS. An encoding unit is 2 bytes and each graphic character is represented in a single 2-byte encoding unit. For these data items and literals, you need not be concerned with encoding units.
Unicode UTF-16
You can use UTF-16 in data items described with USAGE NATIONAL. National literals are stored as UTF-16 characters regardless of the code page used for the source program. An encoding unit for data items of usage NATIONAL and national literals is 2 bytes.
For most of the characters in UTF-16, a graphic character is one encoding unit. Characters converted to UTF-16 from an EBCDIC, ASCII, or EUC code page are represented in one UTF-16 encoding unit. Some of the other graphic characters in UTF-16 are represented by a surrogate pair or a combining character sequence. A surrogate pair consists of two encoding units (4 bytes). A combining character sequence consists of a base character and one or more combining marks or a sequence of one or more combining marks (4 bytes or more, in 2-byte increments). In data items of usage NATIONAL, each 2-byte encoding unit is treated as a character.
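The difference between graphic characters and encoding units can be checked with a short Python sketch. It assumes big-endian UTF-16 and uses a hypothetical helper, utf16_units. The sample characters (U+1D11E, and the letter e followed by the combining acute accent U+0301) are chosen only for illustration.

```python
# Illustration only: count 2-byte UTF-16 encoding units per graphic character.

def utf16_units(text: str) -> int:
    """Return the number of 2-byte UTF-16 encoding units needed for text."""
    return len(text.encode("utf-16-be")) // 2

plain = utf16_units("A")               # one encoding unit (2 bytes)
surrogate = utf16_units("\U0001D11E")  # surrogate pair: two encoding units (4 bytes)
combining = utf16_units("e\u0301")     # base character plus one combining mark

# The surrogate pair for U+1D11E is the two units X'D834' and X'DD1E'.
pair_bytes = "\U0001D11E".encode("utf-16-be")
```

In a national data item, each of the two-unit cases above would occupy two character positions even though it displays as one graphic character.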
When national data contains surrogate pairs or combining character sequences, programmers are responsible for ensuring that operations on national characters do not unintentionally separate the multiple encoding units that form a graphic character. Care should be taken with reference modification, and truncation during moves should be avoided. The COBOL runtime system does not check for a split between the encoding units that form a graphic character.
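One way a program might guard against such a split is to test the last encoding unit that would be kept before truncating. The Python sketch below assumes big-endian UTF-16. The helper truncate_units is hypothetical. It handles surrogate pairs only and makes no attempt to keep combining character sequences together.

```python
# Illustration only: truncate UTF-16 data without leaving half a surrogate pair.

def truncate_units(national: bytes, max_units: int) -> bytes:
    """Keep at most max_units 2-byte units and never end on a high surrogate."""
    total = len(national) // 2
    units = max(0, min(max_units, total))
    if 0 < units < total:
        last = int.from_bytes(national[2 * units - 2:2 * units], "big")
        if 0xD800 <= last <= 0xDBFF:  # high surrogate whose partner would be cut off
            units -= 1
    return national[:2 * units]

# Four encoding units: "A", a surrogate pair (two units), and "B".
data = "A\U0001D11EB".encode("utf-16-be")

two = truncate_units(data, 2).decode("utf-16-be")    # pair would be split: keep "A" only
three = truncate_units(data, 3).decode("utf-16-be")  # pair fits completely
four = truncate_units(data, 4).decode("utf-16-be")   # nothing to truncate
```

Backing off by one unit shortens the result but keeps every remaining graphic character intact.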