(C) IBM Corp. 1996, 1999

Text Extender: Administration and Programming


Types of index

You can assign one of these index types to a column containing text to be searched: linguistic, precise, dual, or Ngram. You must decide which index type to create before you prepare any such columns for use by Text Extender. For a more detailed description of how each type of index affects linguistic processing, read Chapter 3, Linguistic processing.

Tip

Text Extender offers a wide variety of search options, though not all are available for all index types. See Table 5 and Table 6 before making your decision about which index type to use.

Linguistic index

For a linguistic index, linguistic processing is applied while analyzing each document's text for indexing. This means that words are reduced to their base form before being stored in an index; the term "mice", for example, is stored in the index as mouse.

For a query against a linguistic index, the same linguistic processing is applied to the search terms before searching in the text index. So, if you search for "mice", it is reduced to its base form mouse before the search begins. Table 1 summarizes how terms are extracted for indexing when you use a linguistic index.

The advantage of this type of index is that any variation of a search term matches any other variation occurring in one of the indexed text documents. The search term mouse matches the document terms "mouse", "mice", "MICE" (capital letters), and so on. Similarly, the search term Mice matches the same document terms.

This index type requires the least amount of disk space. However, indexing and searching can take longer than for a precise index.
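
The behavior described above can be sketched in a few lines of Python. This is a toy model, not Text Extender code: the BASE_FORMS table, the function names, and the sample documents are all assumptions made for illustration, and a real linguistic index relies on language dictionaries rather than a hand-written table.

```python
# Toy base-form table (hypothetical); stands in for dictionary-based reduction.
BASE_FORMS = {"mice": "mouse"}


def base_form(term: str) -> str:
    """Lower-case the term, then map it to its base form if one is known."""
    lowered = term.lower()
    return BASE_FORMS.get(lowered, lowered)


def build_linguistic_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each base form to the ids of the documents that contain it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(base_form(word), set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], term: str) -> set[str]:
    """Apply the same reduction to the search term before the lookup."""
    return index.get(base_form(term), set())


docs = {"d1": "mouse", "d2": "mice", "d3": "MICE", "d4": "cat"}
index = build_linguistic_index(docs)
```

In this sketch, searching for mouse and searching for Mice both return the same three documents, because document terms and search terms pass through the same reduction.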

The types of linguistic processing available depend on the document's language. See The supported languages for details.

Precise index

In a precise index, the terms in the text documents are indexed exactly as they occur in the document. For example, the search term mouse can find "mouse" but not "mice" and not "Mouse"; the search in a precise index is case-sensitive.

In a query, the same processing is applied to the query terms, which are then compared with the terms found in the index. This means that the terms found are exactly the same as the search term. You can use masking characters to broaden the search; for example, the search term experiment* can find "experimental", "experimented", and so on.

Table 2 gives some examples of how terms are extracted from document text for indexing when you use a precise index.

The advantage of this type of index is that the search is more precise, and indexing and retrieval are faster. Because each different form and spelling of every term is indexed, more disk space is needed than for a linguistic index.

The linguistic processes used to index text documents for a precise index are:

Word and sentence separation

Stop-word filtering
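
As a rough illustration of precise matching and masking, the following Python sketch stores terms exactly as they occur and treats * in a search term as "any sequence of characters". It is a simplified model under stated assumptions: the function names and sample documents are hypothetical, stop-word filtering is not modeled, and the handling of * here is only an approximation of Text Extender's masking characters.

```python
import re


def build_precise_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Index every term exactly as it occurs, preserving case."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], term: str) -> set[str]:
    """Case-sensitive lookup; '*' in the term matches any run of characters."""
    # Escape the term, then turn the escaped '*' back into a wildcard.
    pattern = re.escape(term).replace(r"\*", ".*")
    hits: set[str] = set()
    for indexed_term, doc_ids in index.items():
        if re.fullmatch(pattern, indexed_term):
            hits |= doc_ids
    return hits


docs = {
    "d1": "mouse",
    "d2": "mice",
    "d3": "Mouse",
    "d4": "experimental",
    "d5": "experimented",
}
index = build_precise_index(docs)
```

Here a search for mouse returns only the document containing "mouse" in exactly that form, while experiment* returns the documents containing "experimental" and "experimented".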

Dual index

A dual index is a combination of precise and linguistic indexes. It contains the normalized form (standard form in all lower-case letters, and without accents), the base form, such as the infinitives of verbs, and the precise form of each term.

This index type allows the user to decide for each search term whether to search linguistically or precisely.

In a query, you can choose the processing that is applied to the query terms.

This index type needs the most disk space. Indexing and searching are slower than for a linguistic index. It is not recommended for a large number of text documents.

Tip

In a dual index, word fragments are always looked for as in a precise index; the result is that matching for word fragments is case-sensitive.
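
To make the three stored forms concrete, here is a small Python sketch of a dual index that keeps the precise, normalized, and base form of each term and lets the caller pick the matching mode per search term. Everything in it is an assumption for illustration: the mode names, the BASE_FORMS table, and the accent-stripping approach are not taken from Text Extender.

```python
import unicodedata

# Toy base-form table (hypothetical); stands in for dictionary-based reduction.
BASE_FORMS = {"mice": "mouse"}


def normalize(term: str) -> str:
    """Lower-case the term and strip accents."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def base_form(term: str) -> str:
    """Reduce the normalized term to its base form if one is known."""
    normalized = normalize(term)
    return BASE_FORMS.get(normalized, normalized)


# One transformation per stored form.
FORMS = {"precise": lambda t: t, "normalized": normalize, "base": base_form}


def build_dual_index(docs: dict[str, str]) -> dict[str, dict[str, set[str]]]:
    """Store every term under its precise, normalized, and base form."""
    index: dict[str, dict[str, set[str]]] = {mode: {} for mode in FORMS}
    for doc_id, text in docs.items():
        for word in text.split():
            for mode, transform in FORMS.items():
                index[mode].setdefault(transform(word), set()).add(doc_id)
    return index


def search(index, term: str, mode: str) -> set[str]:
    """Look the term up using the form chosen for this search term."""
    return index[mode].get(FORMS[mode](term), set())


# "Caf\u00e9" is "Cafe" with an acute accent on the final letter.
docs = {"d1": "Mice", "d2": "mouse", "d3": "Caf\u00e9"}
index = build_dual_index(docs)
```

Storing three forms per term is also why, in this model as in the description above, the dual index is the largest of the index types.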

Ngram index

An Ngram index analyzes text by parsing sets of characters. This analysis is not based on a dictionary.

If your text contains DBCS characters, you must use an Ngram index. No other index type supports DBCS characters.

This index type supports "fuzzy" search, meaning that you can find character strings that are similar to the specified search term. For example, a search for Extender finds the mistyped word Extendrrs. You can also specify a required degree of similarity.

Note: Even if you use fuzzy search, the first three characters must match.
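
One way to picture fuzzy matching is to compare the sets of character trigrams of two strings. The Python sketch below does this with a Jaccard ratio and also enforces the rule that the first three characters must match. It is only an illustration: the similarity measure, the default threshold of 0.4, and the function names are assumptions, not the algorithm that Text Extender uses.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of the text, ignoring case."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard ratio of the two n-gram sets (shared n-grams / all n-grams)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_match(term: str, candidate: str, min_similarity: float = 0.4) -> bool:
    """Match only if the first three characters agree and the
    n-gram similarity reaches the required degree."""
    if term[:3].lower() != candidate[:3].lower():
        return False
    return similarity(term, candidate) >= min_similarity
```

With these assumptions, fuzzy_match("Extender", "Extendrrs") succeeds because the two strings share four of their nine distinct trigrams, while a candidate such as "Pretender" is rejected outright because its first three characters differ. Raising min_similarity plays the role of requiring a higher degree of similarity.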

To make a case-sensitive search in an Ngram index, it is not enough to specify the PRECISE FORM OF keyword in the query. This is because an Ngram index normally does not distinguish between the case of the characters indexed. You can make an Ngram index case-sensitive, however, by specifying the CASE_ENABLED option when the index is created. Then, in your query, specify the PRECISE FORM OF keyword.

When the CASE_ENABLED option is used, the index needs more space, and searches can take longer.

The SBCS CCSIDs supported by Ngram indexes are 819, 850, and 1252. The DBCS CCSIDs supported by Ngram indexes are 932, 942, 943, 948, 949, 950, 954, 964, 970, 1363, 1381, 1383, 1386, 4946, and 5039.

Although the Ngram index type was designed to be used for indexing DBCS documents, it can also be used for SBCS documents. However, it supports only TDS documents.

Note also that not all of the search syntax options are supported. See the summary of rules and restrictions in Chapter 10, Syntax of search arguments.

Changing the text index type

If you decide that the index type you are using is not suitable, you can change it. When you do so, however, the existing index is deleted, an empty index is created, and entries for all the text documents in the column are added to the log table for reindexing. The command that lets you do this is CHANGE INDEX SETTINGS.

You can choose to have the documents reindexed immediately or the next time periodic indexing occurs.
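
The sequence of events when the index type changes can be modeled as follows. This Python sketch is purely conceptual: the TextColumn class, its fields, and the reindex_now flag are hypothetical names invented for illustration, and it does not show the syntax of the CHANGE INDEX SETTINGS command.

```python
from dataclasses import dataclass, field


@dataclass
class TextColumn:
    """Toy model of a text column with its index and log table."""
    documents: dict[str, str]
    index_type: str = "linguistic"
    index: dict[str, set[str]] = field(default_factory=dict)
    log_table: list[str] = field(default_factory=list)

    def process_log(self) -> None:
        """Index every document queued in the log table, then clear it."""
        for doc_id in self.log_table:
            for word in self.documents[doc_id].split():
                # Simplification: only the precise type preserves case here.
                key = word if self.index_type == "precise" else word.lower()
                self.index.setdefault(key, set()).add(doc_id)
        self.log_table.clear()

    def change_index_type(self, new_type: str, reindex_now: bool = False) -> None:
        """Delete the index, create an empty one, and queue all documents."""
        self.index = {}
        self.index_type = new_type
        self.log_table = list(self.documents)
        if reindex_now:
            self.process_log()  # otherwise the next indexing run picks them up
```

Until the queued entries are processed, the new index in this model is empty, which mirrors the point that searches depend on reindexing having taken place.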

