(C) IBM Corp. 1996, 1999

Text Extender: Administration and Programming


Information about text documents

Each text document that you intend to search has three characteristics that are significant to Text Extender:

Format

Language

Coded Character Set Identifier (CCSID)

Formats

Text Extender needs to know the format (or type) of the text documents that you intend to search, such as WordPerfect or ASCII. It uses this information when it indexes the documents.

The text document types supported are:

HTML
Hypertext Markup Language

XML
Extensible Markup Language

ASCII_SECTIONS
Structured ASCII containing sections

TDS
Flat ASCII

AMI
AmiPro Architecture Version 4

FFT
IBM Final Form Text: Document Content Architecture

MSWORD
Microsoft Word, Versions 5.0 and 5.5

RFT
IBM Revisable Form Text: Document Content Architecture

RTF
Microsoft Rich Text Format (RTF), Version 1

WP5
WordPerfect (OS/2 and Windows), Versions 5.0, 5.1, and 5.2

For nonsupported document types, specify a numeric ID from 1 to 100. This value is passed as the source format to the user exit that converts the original format to TDS.

If, during indexing, Text Extender encounters a document that is not one of the supported types, it writes the document to disk and calls a user exit, a program that you provide, to extract the text into one of the supported formats.

To enable the user exit, edit the following ASCII files:

Windows NT:
%DMBMMPATH%\instance\%DB2INSTANCE%\db2tx\descl.ini
%DMBMMPATH%\instance\%DB2INSTANCE%\db2tx\txinsnnn\dessrv.ini
 
UNIX:
$DB2TX_INSTOWNERHOMEDIR/db2tx/descl.ini
$DB2TX_INSTOWNERHOMEDIR/db2tx/txinsnnn/dessrv.ini

by adding the following statements:

[DOCUMENTFORMAT]
USEREXIT=name_of_executable

where name_of_executable is the name of the user exit program. You can specify a fully qualified file name or, if the user exit is stored in a directory that is in the PATH statement, only the file name.
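The edit described above can also be scripted. The following Python sketch is illustrative only: the function name with_user_exit and the path /opt/exits/pdf2tds are hypothetical, and the code assumes that the .ini files use a plain [SECTION] and KEY=value layout with no spaces around the equal sign. Back up the files before changing them.

```python
def with_user_exit(ini_text, executable):
    """Return ini_text with USEREXIT set in the [DOCUMENTFORMAT] section.

    Hypothetical helper: it assumes a simple [SECTION] / KEY=value layout.
    """
    statement = f"USEREXIT={executable}"
    out = []
    in_section = False  # currently inside [DOCUMENTFORMAT]?
    done = False        # has the USEREXIT statement been written?
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [DOCUMENTFORMAT] without a USEREXIT line: add it now.
            if in_section and not done:
                out.append(statement)
                done = True
            in_section = stripped.upper() == "[DOCUMENTFORMAT]"
        elif in_section and stripped.upper().startswith("USEREXIT="):
            if done:
                continue  # drop a duplicate statement
            line = statement  # replace the existing statement
            done = True
        out.append(line)
    if not done:
        # Section missing, or it was the last section and had no USEREXIT.
        if not in_section:
            out.append("[DOCUMENTFORMAT]")
        out.append(statement)
    return "\n".join(out) + "\n"


# Example with an in-memory string instead of a real descl.ini:
print(with_user_exit("[OTHER]\nKEY=1\n", "/opt/exits/pdf2tds"))
```

The function works on text rather than on a file so that you can inspect the result before writing it back to descl.ini and dessrv.ini.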

The parameters of the user exit must be as follows:

<name_of_user_exit>  -sourcefile   <sourcefilename>
                     -targetfile   <targetfilename>
                     -sourceccsid  <sourceccsid>
                     -targetccsid  <targetccsid>
                     -sourceformat <sourceformat>
                     -targetformat <targetformat>

The user exit must read the document from <sourcefilename> and write the converted document to <targetfilename>. Both file names must be fully qualified. The content of the target file must be in the CCSID given by <targetccsid> and the format given by <targetformat>. The target format must be TDS. The target CCSID must be 850.

When you enable a text column or external documents, you must specify a format other than TDS (flat ASCII) to force the user exit to be called.
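The user exit interface described above can be sketched in Python as follows. This is a minimal illustration, not a supplied program. The CCSID_TO_CODEC table is a partial mapping chosen for this sketch, extract_text is a hypothetical placeholder for the real format conversion, and the use of a nonzero exit status to signal failure is an assumption, because the text does not say how Text Extender interprets the return code.

```python
import argparse
import sys

# Partial, illustrative mapping from CCSIDs to Python codec names
# (an assumption of this sketch, not a Text Extender table).
CCSID_TO_CODEC = {
    437: "cp437",
    819: "latin_1",
    850: "cp850",
    1250: "cp1250",
    1251: "cp1251",
    1252: "cp1252",
}


def extract_text(raw, codec, source_format):
    """Hypothetical placeholder: a real exit would parse the document
    format identified by source_format. Here the bytes are only decoded."""
    return raw.decode(codec, errors="replace")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a user exit")
    for name in ("sourcefile", "targetfile", "sourceformat", "targetformat"):
        parser.add_argument(f"-{name}", required=True)
    for name in ("sourceccsid", "targetccsid"):
        parser.add_argument(f"-{name}", required=True, type=int)
    args = parser.parse_args(argv)

    source_codec = CCSID_TO_CODEC.get(args.sourceccsid)
    target_codec = CCSID_TO_CODEC.get(args.targetccsid)
    if source_codec is None or target_codec is None:
        print("CCSID not handled by this sketch", file=sys.stderr)
        return 1  # assumed failure convention

    with open(args.sourcefile, "rb") as source:
        raw = source.read()
    text = extract_text(raw, source_codec, args.sourceformat)

    # Write flat text in the target CCSID (850 according to the text above).
    with open(args.targetfile, "wb") as target:
        target.write(text.encode(target_codec, errors="replace"))
    return 0


# Run the conversion only when arguments are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

Characters that have no equivalent in the target code page are replaced with a question mark rather than causing the conversion to fail. Whether that is acceptable depends on your documents.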

Languages

Text Extender also needs to know in which language a document is written so that the correct dictionary can be used for the linguistic processing that occurs. Here is a list of the language parameters that you can specify when you enable a text column or external documents:

Brazilian Portuguese
BRAZILIAN

Canadian French
CAN_FRENCH

Catalan
CATALAN

Chinese, simplified
S_CHINESE

Chinese, traditional
T_CHINESE

Danish
DANISH

Dutch
DUTCH

Finnish
FINNISH

French
FRENCH

German
GERMAN

Icelandic
ICELANDIC

Italian
ITALIAN

Japanese
JAPANESE

Korean
KOREAN

Norwegian, Bokmal
BM_NORWEGIAN

Norwegian, Nynorsk
NN_NORWEGIAN

Norwegian, Bokmal and Nynorsk
BMNN_NORWEGIAN

Portuguese
PORTUGUESE

Spanish
SPANISH

Swedish
SWEDISH

Swiss German
SWISS_GERMAN

UK English
UK_ENGLISH

US English
US_ENGLISH

CCSIDs

Each DB2 database uses a particular code page for storing character data. Text Extender, as an application working with DB2, runs using the same code page as the database.

Documents can be indexed if they are in one of the CCSIDs listed below. During a search, the search string is interpreted using the CCSID of the database.

Data stored in DB2 UDB character data types, such as VARCHAR or CLOB, is converted by DB2 UDB into the CCSID of the database. So, when you enable such a text column for search, use the CCSID of the database as the CCSID parameter. To avoid data conversion by DB2, store the documents in a BLOB or other binary data type and specify the actual CCSID of the documents when you enable the column.

Note:

CCSIDs 861, 865, and 4946 are not supported by DB2 UDB. To index documents having these CCSIDs, store the documents in a column with a binary data type (BLOB or FOR BIT DATA).
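The rule for choosing the CCSID parameter can be summarized in a short Python sketch. The function names and the way the column type is written as a string are assumptions made for this illustration. They are not part of Text Extender.

```python
# CCSIDs that, according to the note above, DB2 UDB does not support.
UNSUPPORTED_BY_DB2 = {861, 865, 4946}


def is_binary_column(column_type):
    """True for BLOB and FOR BIT DATA column types (simplified check)."""
    normalized = " ".join(column_type.upper().split())
    return normalized.startswith("BLOB") or normalized.endswith("FOR BIT DATA")


def ccsid_for_enabling(column_type, database_ccsid, document_ccsid):
    """Return the CCSID to specify when enabling the text column."""
    if is_binary_column(column_type):
        # DB2 does not convert binary data, so the documents keep their CCSID.
        return document_ccsid
    if document_ccsid in UNSUPPORTED_BY_DB2:
        raise ValueError(
            f"CCSID {document_ccsid} requires a binary column "
            "(BLOB or FOR BIT DATA)"
        )
    # Character data is converted by DB2 into the CCSID of the database.
    return database_ccsid


print(ccsid_for_enabling("VARCHAR(2000)", 850, 1252))
print(ccsid_for_enabling("BLOB(1M)", 850, 1252))
```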

EBCDIC

37
US, Canadian English

273
Austrian, German

277
Danish, Norwegian

278
Finnish, Swedish

280
Italian

284
Spanish, Latin American

285
UK English

297
French

420
Arabic

424
Hebrew

437
US English

500
International Latin-1

871
Icelandic

1025
Russian

ASCII

819 AIX, HP, SUN
Latin-1

850 AIX, OS/2
Latin-1

860 OS/2
Portuguese

861 See note
Icelandic

862 OS/2
Hebrew

864 OS/2
Arabic

863 OS/2
Canadian

865 See note
Danish, Norwegian

866 OS/2
Russian

915 AIX, OS/2, HP
Russian

916 AIX
Hebrew

1064 AIX
Arabic

1089 AIX, HP
Arabic

1250 WIN
Croatian

1251 WIN
Russian

1252 WIN
Latin-1

1255 WIN
Hebrew

1256 WIN
Arabic

DBCS

932 AIX, OS/2
Japanese, combined SBCS/DBCS

942 OS/2
Japanese, combined SBCS/DBCS

943 OS/2, WIN
Japanese, combined SBCS/DBCS

5039 HP
Japanese, combined SBCS/DBCS

954 AIX, HP, SUN
Japanese

949 OS/2
Korean

970 AIX, HP, SUN
Korean

1363 WIN
Korean

948 OS/2
Chinese (traditional), combined SBCS/DBCS

950 AIX, HP, OS/2, SUN, WIN
Chinese (traditional), combined SBCS/DBCS

964 AIX, HP, SUN
Chinese (traditional), combined SBCS/DBCS

1381 OS/2, WIN
Chinese (simplified), combined SBCS/DBCS

1383 AIX, HP, SUN
Chinese (simplified), combined SBCS/DBCS

1386 AIX, OS/2, WIN
Chinese (simplified), combined SBCS/DBCS

4946 See note
Latin-1 (CP850)

5039 HP
Japanese

UNICODE

UTF8

UCS2

