Encoding

All XML data is represented as plain text. A small number of characters have a special meaning in XML ("<", ">", "'", '"', "&"), and if these occur in your data they are automatically converted to their corresponding XML character entities to avoid problems. However, if you use characters outside the normal US-ASCII range (characters 0-127), even plain text becomes ambiguous. For example, in Western Europe, you might typically store your data using the ISO-8859-1 character set, also known as "Latin 1". In this character set, the character "ë" (e-umlaut) is character number 235. However, if you sent this XML data to a person in Greece, who would typically use the ISO-8859-7 (Greek) character set, the same character 235 would appear as the lower-case Greek letter lambda.
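The ambiguity described above can be reproduced with standard Java alone. The following is a minimal sketch, not part of the Cúram API: the class and method names are made up for illustration, and it assumes the JDK in use ships the ISO-8859-7 charset (most do).

```java
import java.nio.charset.Charset;

final class AmbiguousByteDemo {

    // Decode a single byte value using the named character set.
    static String decode(int byteValue, String charsetName) {
        byte[] raw = { (byte) byteValue };
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String latin1 = decode(235, "ISO-8859-1");
        String greek = decode(235, "ISO-8859-7");

        // The same byte yields two different Unicode characters.
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0)); // U+00EB (e-umlaut)
        System.out.printf("ISO-8859-7: U+%04X%n", (int) greek.charAt(0));  // U+03BB (lambda)
    }
}
```

The byte itself never changes; only the character set used to interpret it does.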

To avoid this problem, XML allows the character encoding used for a document to be stated in the XML declaration at the top of the document. When you create your document, you can explicitly state that you want to use ISO-8859-1 for your data because that is the form in which it is stored in your database. When you send the file to Greece, the recipient knows to interpret the data using ISO-8859-1 rather than ISO-8859-7. In general, this is handled by their XML parsing software, which reads the encoding information from the document.
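As an illustration of a parser honouring the declared encoding, the sketch below uses the standard JAXP parser bundled with Java rather than any Cúram class. The class name, method names, and the sample `<name>` element are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class DeclaredEncodingDemo {

    // Build a small document whose bytes are ISO-8859-1 and whose
    // XML declaration says so.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>Zo\u00EB</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parse raw bytes; the parser reads the encoding from the declaration.
    static String parseName(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String name = parseName(latin1Document());
        // The single byte 235 is decoded as e-umlaut because of the declaration.
        System.out.println(name.equals("Zo\u00EB"));
    }
}
```

Because the parser is given bytes rather than an already decoded string, the declaration is what tells it how to turn byte 235 back into "ë".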

By default, XML uses an encoding scheme known as UTF-8. This Unicode encoding uses a single byte for each US-ASCII character and a sequence of two to four bytes for each character greater than 127. However, you will need to set the encoding explicitly if the data stored in your database uses a different encoding scheme.
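The variable width of UTF-8 is easy to check with standard Java. This is a small sketch with a hypothetical class and method name; the sample characters are chosen only to show each byte length.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {

    // Number of bytes needed to store the given text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (US-ASCII)
        System.out.println(utf8Length("\u00EB"));       // 2 (e-umlaut)
        System.out.println(utf8Length("\u20AC"));       // 3 (euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (an emoji outside the BMP)
    }
}
```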

IBM Cúram Social Program Management XML provides a range of constants for the common encoding schemes. The available constants are shown in Table 1 below.

Table 1. XML Character Encoding Constants

Constant              Alternative Constant   Encoding Scheme
--------------------  ---------------------  ----------------
kEncodeUTF8           (none)                 UTF-8
kEncodeISO10646UCS2   (none)                 ISO-10646-UCS-2
kEncodeISO10646UCS4   (none)                 ISO-10646-UCS-4
kEncodeISO88591       kEncodeISOLATIN1       ISO-8859-1
kEncodeISO88592       kEncodeISOLATIN2       ISO-8859-2
kEncodeISO88593       kEncodeISOLATIN3       ISO-8859-3
kEncodeISO88594       kEncodeISOLATIN4       ISO-8859-4
kEncodeISO88595       kEncodeISOCYRILLIC     ISO-8859-5
kEncodeISO88596       kEncodeISOARABIC       ISO-8859-6
kEncodeISO88597       kEncodeISOGREEK        ISO-8859-7
kEncodeISO88598       kEncodeISOHEBREW       ISO-8859-8
kEncodeISO88599       kEncodeISOLATIN5       ISO-8859-9
kEncodeISO885910      kEncodeISOLATIN6       ISO-8859-10
kEncodeISO885913      kEncodeISOLATIN7       ISO-8859-13
kEncodeISO885914      kEncodeISOLATIN8       ISO-8859-14
kEncodeISO885915      kEncodeISOLATIN9       ISO-8859-15
kEncodeISO2022JP      (none)                 ISO-2022-JP
kEncodeSHIFTJIS       (none)                 Shift_JIS
kEncodeEUCJP          (none)                 EUC-JP

Specify the relevant constant when constructing a new XMLDocument to set the encoding scheme appropriate for the XML document. This encoding is used both for the XML declaration and for the content of the XML document itself. If you load an XML document from the database, the encoding of that document should match the encoding used to construct the XMLDocument object. If you supply no value, no encoding scheme is specified in the XML, and XML parsers will therefore assume UTF-8 in accordance with the XML standard. If the encoding scheme you wish to use is not among those listed, you may supply a string containing the encoding name instead.

All of the encoding constants are defined in the XMLEncodingConstants interface. For example, to use the Latin 1 set, you would use XMLEncodingConstants.kEncodeISOLATIN1 or XMLEncodingConstants.kEncodeISO88591.
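The effect of choosing an encoding when a document is created can be illustrated with the standard Java XML APIs. The sketch below deliberately does not use XMLDocument or XMLEncodingConstants, because their constructor signatures are not shown in this section; the class and method names are hypothetical, and the plain string "ISO-8859-1" stands in for the value that kEncodeISO88591 is listed against in Table 1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class WriteWithEncodingDemo {

    // Serialize <name>text</name> using the requested encoding for both
    // the XML declaration and the output bytes.
    static byte[] write(String text, String encoding) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement("name");
            root.setTextContent(text);
            doc.appendChild(root);

            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Could not write document", e);
        }
    }

    // Parse the bytes back, letting the parser use the declared encoding.
    static String readBack(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = write("Zo\u00EB", "ISO-8859-1");
        // Show the declaration that was written and confirm the round trip.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(readBack(bytes).equals("Zo\u00EB"));
    }
}
```

The point of the round trip is that the writer and the reader agree on the encoding only because it is recorded in the document itself, which is the same reason the encoding used to load a stored document should match the one it was written with.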