When Text Extender indexes and retrieves documents, it performs a linguistic analysis of the text. As the following tables show, the amount of linguistic processing depends on the index type. For Ngram indexes, no linguistic processing is applied.
The linguistic processing used for indexing documents consists of:
Recognizing terms (tokenization)
Normalizing terms to a standard form
Recognizing sentences
Table 1 summarizes how terms are indexed when the index type is linguistic and no additional index properties have been requested.
Table 1. Term extraction for a linguistic index
Document text | Term in index | Linguistic processing |
---|---|---|
Mouse Käfer | mouse kaefer | Basic text analysis (normalization) |
mice swum | mouse swim | Reduction to base form |
system-based Wetterbericht | system-based, system base wetterbericht, wetter bericht | Decomposition |
a report on animals | report animal | Stop-word filtering. Stop words are: a, on |
By comparison, Table 2 summarizes how terms are indexed when the index type is precise.
Table 2. Term extraction for a precise index
Document text | Term in index | Linguistic processing |
---|---|---|
Mouse Käfer | Mouse Käfer | No normalization |
mice swum | mice swum | No reduction to base form |
a report on animals | report animals | Stop-word filtering. Stop words are: a, on |
system-based Wetterbericht | system-based Wetterbericht | No decomposition |
Text Extender processes basic text analysis without using an electronic dictionary.
When documents are indexed, terms are recognized even when they contain nonalphanumeric characters, for example: "$14,225.23", "mother-in-law", and "10/22/90".
The following are regarded as part of a term:
Accents
Currency signs
Number separator characters (like "/" or ".")
The "@" character in e-mail addresses (English only)
The "+" sign.
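As an illustration of how terms containing nonalphanumeric characters can be kept whole, here is a minimal Python sketch. The regular expressions and the `tokenize` function are assumptions made for this example, not Text Extender's actual rules, and they cover only some of the cases listed above (accents, currency signs, number separators, and the "@" character); the "+" sign is not handled.

```python
import re

# Hypothetical patterns for illustration only; not Text Extender's rules.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"      # keeps "@" inside an address
NUMBER = r"[$€£¥]?\d+(?:[.,/]\d+)*"          # currency sign and separators
WORD = r"[^\W\d_]+(?:-[^\W\d_]+)*"           # letters (incl. accented), hyphens

# Alternatives are tried left to right at each position.
TOKEN = re.compile("|".join([EMAIL, NUMBER, WORD]))


def tokenize(text):
    """Return the terms found in text, in document order."""
    return TOKEN.findall(text)


print(tokenize("She paid $14,225.23 to her mother-in-law on 10/22/90."))
```

In this sketch a trailing period is dropped because a separator character is only accepted when digits follow it.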
Language-specific rules are also used to recognize terms containing:
Normalizing reduces mixed-case terms, and terms containing accented or special characters, to a standard form. This is done by default when the index type is linguistic, or when a dual index is used with the search parameter STEMMED FORM OF. (In a precise index, the case of letters is left unchanged, so searches are case-sensitive.)
For example, the term Computer is indexed as computer; the uppercase letter is changed to lowercase. A search for the term computer finds occurrences not only of computer, but also of Computer. The effect of normalization during indexing is that terms are indexed in the same way, regardless of how they are capitalized in the document.
Normalization is applied not only during indexing, but also during retrieval. Uppercase characters in a search term are changed to lowercase before the search is made. When your search term is, for example, Computer, the term used in the search is computer.
Accented and special characters are normalized in a similar way. Any variation of école, such as École, finds école, Ecole, and so on. Similarly, Bürger finds buerger, and Maße finds masse.
If the search term includes masking (wildcard) characters, normalization is done before the masking characters are processed. Example: Bür_er becomes buer_er.
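The normalization behavior described above can be approximated with a short Python sketch. The `SPECIAL` mapping and the `normalize` function are hypothetical; the sketch assumes that umlauts and ß are expanded (as in the Käfer, Bürger, and Maße examples) and that other accents are simply removed. The real product may apply different or additional language-specific rules.

```python
import unicodedata

# Assumed expansions, taken from the examples in the text (ü -> ue, ß -> ss).
SPECIAL = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalize(term):
    """Lowercase a term, expand umlauts and ß, and strip other accents."""
    result = []
    # NFC first, so a decomposed "u" + diaeresis is seen as a single "ü".
    for ch in unicodedata.normalize("NFC", term).lower():
        if ch in SPECIAL:
            result.append(SPECIAL[ch])
        else:
            # Decompose the character and drop combining accent marks.
            decomposed = unicodedata.normalize("NFD", ch)
            result.append(
                "".join(c for c in decomposed if not unicodedata.combining(c))
            )
    return "".join(result)


for term in ["Computer", "Käfer", "École", "Maße", "Bür_er"]:
    print(term, "->", normalize(term))
```

Because the same function would be applied to both indexed terms and search terms, a masking character such as the underscore passes through unchanged and can be processed afterwards.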
You can search for terms that occur in the same sentence. To make this possible, each document is analyzed during indexing to find out where each sentence begins and ends. The end of a sentence is indicated by a period, an exclamation mark, or a question mark, followed by a blank character. A period that ends one of the many recognized abbreviations is not treated as the end of a sentence.
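The sentence rule can be sketched in Python as follows. The `ABBREVIATIONS` set and the `split_sentences` function are assumptions for illustration; the list of abbreviations the product actually recognizes is not given here. The sketch also treats the end of the text as a sentence end, which is an added assumption.

```python
# Hypothetical abbreviation list; the product's real list is not shown here.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text at '.', '!' or '?' when followed by a blank."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        at_end = i + 1 == len(text)
        if ch in ".!?" and (at_end or text[i + 1] == " "):
            # The word that ends at this punctuation mark.
            last_word = text[start:i + 1].split()[-1].lower()
            if last_word in ABBREVIATIONS:
                continue  # abbreviation period: not a sentence boundary
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith saw a mouse. It ran away! Did it return?"))
```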
In a linguistic or a dual index, you can search for mouse, for example, and find mice. Terms are reduced to their base form for indexing; the term mice is indexed as mouse. Later, when you use the search term mouse, the document is found. The document is found also if you search for mice.
The effect is that you find documents containing information about mice, regardless of which variation of the term mouse occurs in the document, or is used as a search term.
In the same way, conjugated verbs are reduced to their infinitive; bought, for example, becomes buy.
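To make the effect of base-form reduction concrete, here is a minimal Python sketch that applies the same reduction at indexing time and at search time. The `BASE_FORMS` lookup table stands in for the electronic dictionary and is purely hypothetical, as are the function names.

```python
# Toy stand-in for a dictionary; contains only the examples from the text.
BASE_FORMS = {"mice": "mouse", "swum": "swim", "bought": "buy"}


def base_form(term):
    """Return the base form of a term, or the lowercased term if unknown."""
    term = term.lower()
    return BASE_FORMS.get(term, term)


def build_index(documents):
    """Map each base form to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.split():
            index.setdefault(base_form(term), set()).add(doc_id)
    return index


def search(index, term):
    """Reduce the search term in the same way, then look it up."""
    return index.get(base_form(term), set())


index = build_index({1: "mice swum", 2: "a mouse bought cheese"})
print(search(index, "mouse"))
print(search(index, "mice"))
```

Both searches return the same documents, because the document terms and the search term are reduced to the same base form before they are compared.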
Stop words are words such as prepositions and pronouns that occur very frequently in documents, and are therefore not suitable as search terms. Such words are in a stop-word list associated with each dictionary, and are excluded from the indexing process.
Stop-word processing is case-insensitive. For example, the stop word about also excludes About when it is the first word in a sentence. This is normally not true for an Ngram index, which is case-insensitive unless it is created with an option that makes it case-sensitive. The stop-word lists supplied in various languages can be modified.
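Case-insensitive stop-word filtering can be sketched in a few lines of Python. The `STOP_WORDS` set below is a hypothetical three-word list used only for this example; the supplied lists are language-specific and much longer.

```python
# Hypothetical stop-word list for illustration.
STOP_WORDS = {"a", "on", "about"}


def filter_stop_words(terms):
    """Drop stop words, comparing in lowercase so 'About' matches 'about'."""
    return [term for term in terms if term.lower() not in STOP_WORDS]


print(filter_stop_words("a report on animals".split()))
print(filter_stop_words("About mice".split()))
```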
Germanic languages, such as German or Dutch, are rich in compound terms, like Versandetiketten, which means mail (Versand) labels (Etiketten). Such compound terms can be split into their components.
For a precise index, compound terms are indexed unchanged as one word. For a linguistic or dual index, compound terms are split during indexing. When you search, compound terms are split if you have a linguistic index, or if you use the STEMMED FORM OF option with a dual index.
The components are found if they occur in any sequence in a document as long as they are contained within one sentence. For example, when searching for the German word Wetterbericht (weather report), a document containing the phrase Bericht über das Wetter (report about the weather) would also be found.
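The matching rule for split compounds can be illustrated with a small Python sketch. The `COMPONENTS` table and both functions are assumptions for this example; it presumes that terms are already normalized and that the split of each compound is known.

```python
# Hypothetical decomposition table; a real system derives this from a dictionary.
COMPONENTS = {"wetterbericht": ["wetter", "bericht"]}


def expand(term):
    """Return the term itself plus its components, if it is a known compound."""
    term = term.lower()
    return {term, *COMPONENTS.get(term, [])}


def sentence_matches(search_term, sentence_terms):
    """True if every component of the search term occurs in the sentence."""
    key = search_term.lower()
    wanted = set(COMPONENTS.get(key, [key]))
    present = set()
    for term in sentence_terms:
        present |= expand(term)
    return wanted <= present  # the order of the components does not matter


print(sentence_matches("Wetterbericht", "Bericht über das Wetter".split()))
```

Because the comparison uses sets, the components can appear in any sequence, provided that all of them are in the same sentence.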
An attempt is made to split a term if:
If a split is found to be possible, the term's component parts are then reduced to their base form. Here are some examples from Danish, German, and Dutch:
Compound term | Component parts |
---|---|
børsmæglerselskab | børsmæglerselskab børs mægler selskab |
Kindersprachen | kindersprache kind sprache |
probleemkinderen | probleemkinderen probleemkind kind probleem |
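As a rough illustration of splitting, the following Python sketch breaks a compound into known words by trying the longest prefix first. The `split_compound` function and the lexicon are hypothetical. The sketch handles only plain concatenation, as in the Danish example; it does not model linking elements or the reduction of components to their base form that the German and Dutch examples require.

```python
def split_compound(term, lexicon):
    """Return lexicon words that concatenate to term, or None if impossible."""
    if not term:
        return []
    # Try the longest candidate prefix first, then shorter ones.
    for end in range(len(term), 0, -1):
        head = term[:end]
        if head in lexicon:
            rest = split_compound(term[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


lexicon = {"børs", "mægler", "selskab"}  # toy lexicon for this example
print(split_compound("børsmæglerselskab", lexicon))
```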