
Collation Introduction
Overview
Traditionally, information is displayed in sorted order so that users can easily find the items they are looking for. However, users of different languages might have very different expectations of what a "sorted" list should look like. Not only does alphabetical order vary from one language to another, but it can also vary from document to document within the same language. For example, phonebook ordering might differ from dictionary ordering. String comparison is one of the basic functions most applications require, and yet implementations often do not match local conventions. The ICU Collation Service provides string comparison with support for the appropriate sort ordering for each locale you need. If you have a very unusual requirement, the service also provides facilities to customize orderings.
Starting in release 1.8, the ICU Collation Service is fully compliant with the Unicode Collation Algorithm (UCA) (http://www.unicode.org/unicode/reports/tr10/) and conforms to ISO 14651. There are several benefits to using the collation algorithms defined in these standards. Some of the more significant benefits include:
Unicode contains a large set of characters. This can make it difficult for collation to be a fast operation or require collation to use significant memory or disk resources. The ICU collation implementation is designed to be fast, have a small memory footprint and be highly customizable.
The algorithms have been designed and reviewed by experts in multilingual collation, and therefore are robust and comprehensive.
Applications that share sorted data but do not agree on how the data should be ordered fail to perform correctly. By conforming to the UCA and ISO 14651 standards for collation, independently developed applications, such as those used for e-business, sort data identically and perform properly.
The ICU Collation Service also contains several enhancements that are not available in UCA. For example:
Additional case handling: ICU allows case differences to be ignored or flipped. Uppercase letters can be sorted before lowercase letters, or vice versa.
Easy customization: Services can be easily tailored to address a wide range of collation requirements.
Flexibility: ICU offers both sort key generation and fast incremental string comparison. It also provides low-level access to collation data through the collation element iterator.
There are many challenges when accommodating the world's languages and writing systems and the different orderings that are used. However, the ICU Collation Service provides an excellent means for comparing strings in a locale-sensitive fashion.
For example, here are some of the ways languages vary in ordering strings:
The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e".
Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts after "Z", at the end of the alphabet.
Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.
A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".
Thai requires that the order of certain letters be reversed: a vowel that is written before a consonant is sorted after that consonant.
French requires that accent differences near the end of a string take precedence over accent differences near the beginning. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".
Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. In Latvian, the order is the exact opposite.
Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "of" comes before "öf" because "ö" is treated as a variant of "o". In phone books, "ö" is compared as if it were "oe", so the order is the exact opposite.
Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.
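Two of the variations above (case ordering and accents as minor variants) can be observed with a few lines of code. The following is a minimal sketch, not ICU code: it assumes the JDK's java.text.Collator as a stand-in, whose compare and setStrength methods resemble those of ICU4J's com.ibm.icu.text.Collator. It uses the root locale so that the result does not depend on language-specific data. The class name OrderingDemo and its helper methods are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; the JDK collator stands in for the ICU one.
class OrderingDemo {

    // Sorts a copy of the input by raw UTF-16 code unit order (String.compareTo).
    static String[] sortByCodeUnit(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the input with a root-locale collator.
    static String[] sortByCollator(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.ROOT));
        return copy;
    }

    // Compares "e" with "é" (U+00E9) at the given collation strength.
    static int compareAccent(int strength) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // Cautious choice: decompose accented characters before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.compare("e", "\u00e9");
    }

    public static void main(String[] args) {
        String[] words = {"apple", "Banana", "cherry"};
        System.out.println(Arrays.toString(sortByCodeUnit(words)));  // [Banana, apple, cherry]
        System.out.println(Arrays.toString(sortByCollator(words)));  // [apple, Banana, cherry]
        System.out.println(compareAccent(Collator.PRIMARY));         // 0: accent ignored
        System.out.println(compareAccent(Collator.TERTIARY) < 0);    // true: "e" before "é"
    }
}
```

At primary strength only base letters are compared, so the accent is a "minor variant" that does not affect the result. At tertiary strength the accent breaks the tie.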
To accommodate the many languages and differing requirements, ICU collation supports customizing sort orderings - also known as tailoring. More details regarding tailoring are discussed in a later chapter.
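As an illustration of tailoring, the sketch below adds one rule to an existing rule set so that "ch" behaves as a single letter between "c" and "d", as in traditional Spanish. It is an assumption-laden example rather than ICU code: it uses the JDK's java.text.RuleBasedCollator, whose rule syntax is similar in spirit to ICU's but not identical, and it assumes that the collator returned for Locale.US exposes its rules. The class name TailoringDemo and the method name withChAsLetter are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical demo class; the JDK collator stands in for the ICU one.
class TailoringDemo {

    // Returns a collator whose rules extend the en_US rules so that
    // "ch" sorts as one letter after "c" (and therefore before "d").
    static RuleBasedCollator withChAsLetter() {
        Collator base = Collator.getInstance(Locale.US);
        if (!(base instanceof RuleBasedCollator)) {
            throw new IllegalStateException("base collator does not expose rules");
        }
        String baseRules = ((RuleBasedCollator) base).getRules();
        // Tailoring: reset at "C", then place "ch" and its case variants after it.
        String addRules = "& C < ch, cH, Ch, CH";
        try {
            return new RuleBasedCollator(baseRules + addRules);
        } catch (ParseException e) {
            throw new IllegalStateException("tailoring rules failed to parse", e);
        }
    }

    public static void main(String[] args) {
        Collator untailored = Collator.getInstance(Locale.US);
        Collator tailored = withChAsLetter();
        // Untailored: "chico" precedes "cuatro" because "h" precedes "u".
        System.out.println(untailored.compare("chico", "cuatro") < 0);
        // Tailored: "ch" is a letter after "c", so "chico" follows "cuatro".
        System.out.println(tailored.compare("chico", "cuatro") > 0);
        // Tailored: "ch" still precedes "d".
        System.out.println(tailored.compare("chico", "dama") < 0);
    }
}
```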
The basic ICU Collation Service is provided by two main categories of APIs:
String comparison - used when two strings are to be compared once: the APIs return the result of the comparison (less than, equal to, or greater than). One example use is string search.
Sort key generation - used when a set of strings is compared repeatedly: the APIs return a zero-terminated array of bytes per string, known as a sort key. The keys can be compared directly using the strcmp or memcmp standard library functions, which avoids recomputing each string's relative weights on every comparison. Typically, database applications use sort keys to index strings that are compared multiple times.
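The difference between the two categories can be sketched as follows. This example is an assumption, not ICU code: it uses the JDK's java.text.Collator and java.text.CollationKey, which play roles analogous to ICU's string comparison and sort key APIs. The class SortKeyDemo and its methods are hypothetical names, and compareKeyBytes is a hand-written unsigned byte comparison standing in for memcmp.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; the JDK classes stand in for the ICU ones.
class SortKeyDemo {

    // String comparison: suitable when two strings are compared once.
    static int compareOnce(Collator collator, String a, String b) {
        return collator.compare(a, b);
    }

    // Sort key generation: each string is transformed once, then only
    // the precomputed keys are compared during the sort.
    static String[] sortWithKeys(Collator collator, String[] words) {
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }
        Arrays.sort(keys); // CollationKey is Comparable
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    // Unsigned byte-by-byte comparison of two keys, analogous to memcmp.
    static int compareKeyBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        String[] words = {"cherry", "Banana", "apple"};
        System.out.println(Arrays.toString(sortWithKeys(collator, words)));
        byte[] ka = collator.getCollationKey("apple").toByteArray();
        byte[] kb = collator.getCollationKey("Banana").toByteArray();
        // The sign of the byte comparison matches the direct comparison.
        System.out.println(Integer.signum(compareKeyBytes(ka, kb))
                == Integer.signum(compareOnce(collator, "apple", "Banana")));
    }
}
```

Generating a key costs more than a single comparison, so keys pay off only when each string takes part in many comparisons, as in sorting or indexing.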
Programming Examples
Here are some API usage conventions for the ICU Collation Service APIs.
Copyright (c) 2000 - 2005 IBM and Others - PDF Version - Feedback: http://icu.sourceforge.net/contacts.html
User Guide for ICU v3.4 Generated 2005-07-27.