Documentation
(C) IBM Corp. 1996, 1999

Text Extender: Administration and Programming


Linguistic processing when indexing

When Text Extender indexes and retrieves documents, it makes a linguistic analysis of the text. As you can see from the following table, the amount of linguistic processing depends on the index type. For Ngram indexes, no linguistic processing is applied.

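The linguistic processing used for indexing documents consists of:

Basic text analysis: recognizing terms that contain nonalphanumeric characters, normalizing terms to a standard form, and recognizing sentences

Reducing terms to their base form

Stop-word filtering

Decomposition (splitting compound terms)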

Table 1 shows a summary of how terms are indexed when the index type is linguistic and no additional index properties have been requested.

Table 1. Term extraction for a linguistic index

Document text          Term in index                    Linguistic processing
---------------------  -------------------------------  -------------------------------------------
Mouse                  mouse                            Basic text analysis (normalization)
Käfer                  kaefer

mice                   mouse                            Reduction to base form
swum                   swim

system-based           system-based, system, base       Decomposition
Wetterbericht          wetterbericht, wetter, bericht

a report on animals    report, animal                   Stop-word filtering. Stop words are: a, on

By comparison, Table 2 shows a summary of how terms are indexed when the index type is precise.

Table 2. Term extraction for a precise index

Document text          Term in index                    Linguistic processing
---------------------  -------------------------------  -------------------------------------------
Mouse                  Mouse                            No normalization
Käfer                  Käfer

mice                   mice                             No reduction to base form
swum                   swum

a report on animals    report, animals                  Stop-word filtering. Stop words are: a, on

system-based           system-based                     No decomposition
Wetterbericht          Wetterbericht

Basic text analysis

Text Extender performs basic text analysis without using an electronic dictionary.

Recognizing terms that contain nonalphanumeric characters

When documents are indexed, terms are recognized even when they contain nonalphanumeric characters, for example: "$14,225.23", "mother-in-law", and "10/22/90".

The following are regarded as part of a term:

Accents

Currency signs

Number separator characters (like "/" or ".")

The "@" character in e-mail addresses (English only)

The "+" sign.
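
The term recognition described above can be approximated with a regular expression. This is only a rough sketch, not Text Extender's actual rules: the pattern, the set of currency signs, and the function name recognize_terms are assumptions made for illustration.

```python
import re

# Assumed pattern: an optional leading currency sign, then runs of letters or
# digits joined by single internal separators, plus an optional trailing "+".
TERM = re.compile(r"[$£€¥]?\w+(?:[-/.,@+]\w+)*\+?")

def recognize_terms(text):
    """Return the terms found in text, keeping internal punctuation."""
    return TERM.findall(text)

print(recognize_terms("She paid $14,225.23 to her mother-in-law on 10/22/90."))
# ['She', 'paid', '$14,225.23', 'to', 'her', 'mother-in-law', 'on', '10/22/90']
```

A separator is kept only when another letter or digit follows it, which is why the period that ends the sentence is not attached to "10/22/90".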

Language-specific rules are also used to recognize terms containing:

Normalizing terms to a standard form

Normalizing reduces mixed-case terms, and terms containing accented or special characters, to a standard form. This is done by default when the index type is linguistic, or when a dual index is used with the search parameter STEMMED FORM OF. In a precise index, the case of letters is left unchanged, so searches are case-sensitive.

For example, the term Computer is indexed as computer; the uppercase letter is changed to lowercase. A search for the term computer finds occurrences not only of computer, but also of Computer. The effect of normalization during indexing is that terms are indexed in the same way, regardless of how they are capitalized in the document.

Normalization is applied not only during indexing, but also during retrieval. Uppercase characters in a search term are changed to lowercase before the search is made. When your search term is, for example, Computer, the term used in the search is computer.

Accented and special characters are normalized in a similar way. Any variation of école, such as École, finds école, Ecole, and so on. Similarly, Bürger finds buerger, and Maße finds masse.

If the search term includes masking (wildcard) characters, normalization is done before the masking characters are processed. Example: Bür_er becomes buer_er.
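
The normalization behavior described in this section can be sketched in a few lines of Python. The expansion table is inferred from the examples in the text (Käfer to kaefer, Maße to masse); the function name normalize and the use of Unicode decomposition to strip other accents are assumptions, not the product's implementation.

```python
import unicodedata

# Assumed expansions, taken from the examples in the text.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def normalize(term):
    """Lowercase a term, expand umlauts and sharp s, and strip other accents."""
    # Compose first so that "a" + combining diaeresis is seen as "ä".
    lowered = unicodedata.normalize("NFC", term).lower()
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in lowered)
    # Decompose the remaining accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("Käfer"))   # kaefer
print(normalize("École"))   # ecole
print(normalize("Bür_er"))  # buer_er
```

Because the same function is applied to indexed terms and to search terms, masking characters such as "_" pass through unchanged and can be processed afterward.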

Recognizing sentences

You can search for terms that occur in the same sentence. To make this possible, each document is analyzed during indexing to find out where each sentence begins and ends. The end of a sentence is indicated by a period, an exclamation mark, or a question mark, followed by a blank character. Periods that end many common abbreviations are not treated as sentence ends.
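
A minimal sketch of this boundary rule follows. It assumes that a period ending a known abbreviation does not end a sentence. The abbreviation list is hypothetical, since the text does not say which abbreviations are recognized, and split_sentences is an illustrative name.

```python
import re

# Hypothetical abbreviation list; the real list is not given in the text.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text at '.', '!' or '?' followed by a blank, skipping abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?= )", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith keeps mice. Do they swim? Yes!"))
# ['Dr. Smith keeps mice.', 'Do they swim?', 'Yes!']
```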

Reducing terms to their base form (lemmatization)

In a linguistic or a dual index, you can search for mouse, for example, and find mice. Terms are reduced to their base form for indexing; the term mice is indexed as mouse. Later, when you use the search term mouse, the document is found. The document is found also if you search for mice.

The effect is that you find documents containing information about mice, regardless of which variation of the term mouse occurs in the document, or is used as a search term.

In the same way, conjugated verbs are reduced to their infinitive; bought, for example, becomes buy.
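
The effect of applying base-form reduction both at indexing time and at search time can be illustrated with a toy index. The dictionary below is a hypothetical stand-in for the electronic dictionary, and all function names are assumptions.

```python
# Hypothetical miniature dictionary of inflected form -> base form.
BASE_FORMS = {"mice": "mouse", "swum": "swim", "bought": "buy"}

def base_form(term):
    """Reduce a term to its base form, or return it unchanged if unknown."""
    term = term.lower()
    return BASE_FORMS.get(term, term)

def build_index(documents):
    """Map each base form to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(base_form(word), set()).add(doc_id)
    return index

def search(index, term):
    """Reduce the search term in the same way before looking it up."""
    return index.get(base_form(term), set())

index = build_index({1: "mice swum", 2: "mouse bought cheese"})
print(sorted(search(index, "mouse")))  # [1, 2]
print(sorted(search(index, "mice")))   # [1, 2]
```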

Stop-word filtering

Stop words are words such as prepositions and pronouns that occur very frequently in documents, and are therefore not suitable as search terms. Such words are in a stop-word list associated with each dictionary, and are excluded from the indexing process.

Stop-word processing is case-insensitive, so the stop word about also excludes About when it is the first word in a sentence. This is normally not true for an Ngram index, which is case-insensitive unless it is created using an option that makes it case-sensitive. The stop-word lists supplied in various languages can be modified.
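
Case-insensitive stop-word filtering can be sketched as follows. The stop-word set here is illustrative only and does not reproduce any supplied list.

```python
# Illustrative stop-word list; the supplied lists are language-specific.
STOP_WORDS = {"a", "on", "about"}

def filter_stop_words(terms):
    """Drop terms whose lowercase form is in the stop-word list."""
    return [term for term in terms if term.lower() not in STOP_WORDS]

print(filter_stop_words("About a report on animals".split()))
# ['report', 'animals']
```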

Decomposition (splitting compound terms)

Germanic languages, such as German or Dutch, are rich in compound terms, like Versandetiketten, which means mail (Versand) labels (Etiketten). Such compound terms can be split into their components.

For a precise index, compound terms are indexed unchanged as one word. For a linguistic or dual index, compound terms are split during indexing. When you search, compound terms are split if you have a linguistic index, or if you use the STEMMED FORM OF option with a dual index.

The components are found if they occur in any sequence in a document as long as they are contained within one sentence. For example, when searching for the German word Wetterbericht (weather report), a document containing the phrase Bericht über das Wetter (report about the weather) would also be found.
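
The following sketch shows one way the splitting and the any-order match within a sentence could work. It assumes a small hypothetical lexicon of known components and omits the reduction of components to their base form; the function names are illustrative, not part of Text Extender.

```python
# Hypothetical lexicon of known component words.
LEXICON = {"wetter", "bericht", "versand"}

def split_compound(word):
    """Return lexicon words that concatenate to word, or None if no split exists."""
    if word in LEXICON:
        return [word]
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in LEXICON:
            rest = split_compound(tail)
            if rest:
                return [head] + rest
    return None

def index_terms(term):
    """Return the normalized term, followed by its components if it splits."""
    term = term.lower()
    parts = split_compound(term) or [term]
    return [term] + parts if len(parts) > 1 else [term]

def sentence_matches(query, sentence):
    """True if every component of the query occurs somewhere in the sentence."""
    wanted = set(index_terms(query)[1:]) or {query.lower()}
    present = set()
    for word in sentence.lower().split():
        present.update(index_terms(word))
    return wanted <= present

print(index_terms("Wetterbericht"))  # ['wetterbericht', 'wetter', 'bericht']
print(sentence_matches("Wetterbericht", "Bericht über das Wetter"))  # True
```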

An attempt is made to split a term if:

If a split is found to be possible, the term's component parts are then reduced to their base form. Here are some examples from Danish, German, and Dutch:

Compound term        Component parts
-------------------  --------------------------------------------
børsmæglerselskab    børsmæglerselskab, børs, mægler, selskab
Kindersprachen       kindersprache, kind, sprache
probleemkinderen     probleemkinderen, probleemkind, kind, probleem

