Migration from Standard C and POSIX APIs

The ISO C and POSIX standards define a number of APIs for string handling and internationalization in C. These APIs do not support Unicode well because they were designed before Unicode and ISO 10646 were developed. The POSIX APIs are also problematic for other aspects of internationalization.

This chapter discusses C/POSIX APIs with their problems, and shows which ICU APIs to use instead.

Note: We use the term "POSIX" to mean the POSIX.1 standard (IEEE Std 1003.1), which defines system interfaces and headers with relevance for string handling and internationalization. The XPG3, XPG4, Single Unix Specification (SUS) and other standards include POSIX.1 as a subset, adding other specifications that are irrelevant for this topic.

This chapter is not complete yet – more POSIX APIs are expected to be discussed in the future.

Strings and Characters

Character Sets and Encodings

ISO C

The ISO C standard provides two basic character types (char and wchar_t) and defines strings as arrays of units of these types. The standard allows nearly arbitrary character and string character sets and encodings, which was necessary when there was no single character set that worked everywhere.

For portable C programs, characters and strings are opaque, i.e., a program cannot assume that any particular character is represented by any particular code or sequence of codes. Programs use standard library functions to handle characters and strings. Only a small set of characters — usually the set of graphic characters available in US-ASCII — can be reliably accessed via character and string literals.

Problems

ICU

ICU always processes Unicode text. Unicode covers all languages and allows safe hardcoding of character codes, in addition to providing many standard or recommended algorithms and a lot of useful character property data. See the chapters about Unicode Basics and Strings and others.

ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it fully interoperable with most Unicode-aware software. (See UTF-16 for Processing.) For ICU4J, this follows naturally because the Java language and the JDK use UTF-16.
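As a rough illustration of what a "16-bit encoding form" means in practice, the following sketch encodes a single code point as one or two UTF-16 code units. The helper name encode_utf16 is hypothetical and is not part of ICU; ICU's utf16.h header provides macros such as U16_APPEND for this purpose.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out[] and returns that count,
 * or 0 if cp is not a valid Unicode scalar value.
 * (Hypothetical helper for illustration; not an ICU function.)
 */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                       /* out of range or a lone surrogate */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* lead surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* trail surrogate */
    return 2;
}
```

Code points in the Basic Multilingual Plane occupy one code unit, and all others occupy a surrogate pair, so code that walks UTF-16 text has to be prepared for both cases.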

ICU uses and/or provides direct access to all of the Unicode properties, which provide a much finer-grained classification of characters than C/POSIX character classes.

In character and string literals in C/C++ source code, ICU uses only "invariant" characters. They are the subset of graphic ASCII characters that are almost always encoded with the same byte values on all systems: one set of byte values for ASCII-based systems, and another set for EBCDIC systems. See utypes.h for the set of "invariant" characters.

With the use of Unicode, the implementation of many of the Unicode standard algorithms, and its cross-platform availability, ICU provides for consistent, portable, and reliable text processing.

Case Mappings

ISO C

The standard C functions tolower(), towupper(), etc. take and return one character code each.

Problems

ICU

Case mappings are operations taking and returning strings, to support length changes and context dependencies. Unicode provides algorithms and data for proper case mappings, and ICU provides APIs for them. (See the API references for various string functions and for Transforms/Transliteration.)
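The following toy sketch shows why a case mapping needs a string-in, string-out interface. It is not ICU code and covers only ASCII letters plus U+00DF (LATIN SMALL LETTER SHARP S), whose full uppercase mapping in Unicode is the two-character string "SS". The function name toy_to_upper and its return convention are assumptions made for this illustration; for real text, ICU's u_strToUpper() in ustring.h is the function to look at.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy full uppercase mapping over UTF-16 text (illustration only, not ICU).
 * Handles ASCII a-z and U+00DF (sharp s), which uppercases to "SS".
 * Writes at most cap units to dst and returns the length the full
 * result needs, so a caller can detect truncation and retry.
 */
static size_t toy_to_upper(const uint16_t *src, size_t len,
                           uint16_t *dst, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t c = src[i];
        if (c == 0x00DF) {                  /* one unit in, two units out */
            if (n < cap) dst[n] = 0x0053;   /* 'S' */
            n++;
            if (n < cap) dst[n] = 0x0053;   /* 'S' */
            n++;
        } else {
            if (c >= 0x0061 && c <= 0x007A) {   /* a-z */
                c = (uint16_t)(c - 0x0020);     /* to A-Z */
            }
            if (n < cap) dst[n] = c;
            n++;
        }
    }
    return n;
}
```

Given the six code units of "straße", the function reports a result length of seven ("STRASSE"). A function with the shape of toupper(), which returns exactly one character code per input, cannot express this mapping.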

Character Classes

ISO C

The standard C functions isalpha(), isdigit(), etc. take a character code each and return boolean values for whether the character belongs to the current locale's respective character class.

Problems

ICU

ICU provides locale-independent access to all Unicode properties (except Unihan.txt properties) via functions defined in uchar.h, and in ICU4J's UCharacter class (see API references). The Unicode Character Database (UCD) defines more than 70 character properties. Their values are designed for the large character set as well as for real text processing, and they are updated with each version of Unicode. The UCD is available online, facilitating industry-wide consistency in the implementation of Unicode properties.

Formatting and Parsing

Currency Formatting

POSIX

The strfmon() function is used to format monetary values. The default format and the currency display symbol or display name are selected by the LC_MONETARY locale ID. The number formatting can also be controlled with a formatting string resembling what printf() uses.
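A minimal sketch of this usage follows, assuming a POSIX system that provides <monetary.h>. The wrapper name format_money is hypothetical. Note how a single locale ID determines both the number format and the currency.

```c
#include <locale.h>
#include <monetary.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Format amount with strfmon() under the named LC_MONETARY locale.
 * Returns the number of bytes written to buf, or -1 on failure
 * (unknown locale or buffer too small). Changes the process-wide
 * locale as a side effect.
 */
static ssize_t format_money(const char *locale_id, double amount,
                            char *buf, size_t cap)
{
    if (setlocale(LC_MONETARY, locale_id) == NULL) {
        return -1;
    }
    /* "%.2n": national currency format, two fraction digits */
    return strfmon(buf, cap, "%.2n", amount);
}
```

The exact output depends on the locale data installed on the system, so the same call can produce different strings on different platforms.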

Problems

ICU

ICU number formatting APIs have separate, orthogonal settings for the number format, which can be selected with a locale ID, and the currency, which is specified with an ISO code. See the Formatting Numbers chapter for details.



Copyright (c) 2000 - 2005 IBM and Others - Feedback: http://icu.sourceforge.net/contacts.html

User Guide for ICU v3.4 Generated 2005-07-27.