Documentation
(C) IBM Corp. 1996, 1999

Text Extender: Administration and Programming


Specifying search arguments

Search arguments are used in CONTAINS, NO_OF_MATCHES, RANK, and HANDLE_LIST. This section uses the CONTAINS function to show different examples of search arguments in UDFs.

Searching for several terms

You can have more than one term in a search argument. One way to combine several search terms is to connect them together using commas, like this:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
         '("compress", "compiler", "pack", "zip", "compact")') = 1

This form of search argument finds text that contains any of the search terms. In logical terms, the search terms are connected by an OR operator.

Searching with the Boolean operators AND and OR

(See also Searching with the Boolean operator NOT.)

Search terms can be combined with other search terms using the Boolean operators "&" (AND) and "|" (OR). For example:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             '"compress" | "compiler"') = 1

You can combine several terms using Boolean operators:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             '"compress" | "compiler" & "DB2"') = 1

If you use more than one Boolean operator, Text Extender evaluates them from left to right, but the logical AND operator (&) binds stronger than the logical OR operator (|). For example, if you do not include parentheses,

     "DB2" & "compiler" | "support" & "compress"

is evaluated as:

     ("DB2" & "compiler") | ("support" & "compress")

So in the following example you must include the parentheses:

     "DB2" & ("compiler" | "support") & "compress"

If you combine Boolean operators with search terms chained together using the comma separator, like this:

     ("compress", "compiler") & "DB2"

the comma is interpreted as a Boolean OR operator, like this:

     ("compress" | "compiler") & "DB2"

Searching for variations of a term

If you are using a precise index, Text Extender searches for the terms exactly as you type them. For example, the term media finds only text that contains "media". Text that contains the singular "medium" is not found.

If you are using a linguistic index, Text Extender searches also for variations of the terms, such as the plural of a noun, or a different tense of a verb.

For example, the term drive finds text that contains "drive", "drives", "driving", "drove", and "driven.".

If you are using a dual index, you can choose to search for word variations or not. For example, the following query finds only occurrences of "utility":

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE, 'PRECISE FORM OF "utility"') = 1

By contrast, this example finds occurrences of "utility" and "utilities":

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE, 'STEMMED FORM OF "utility"') = 1

Searching for parts of a term (character masking)

Masking characters, otherwise known as "wildcard" characters, offer a way to make a search more flexible. They represent optional characters at the front, middle, or end of a search term. They increase the number of text documents found by a search.
Tip

If you use masking characters, you cannot use the SYNONYM FORM OF keyword. If you use a dual index type, the masked search is case-sensitive.

Masking characters are particularly useful for finding variations of terms if you have a precise index. If you have a linguistic index, many of the variations found by using masking characters would be found anyway.

Note that word fragments (words masked by wildcard characters) cannot be reduced to a base form. So, if you search for passe%, you will not find the words "passes" or "passed", because they are reduced to their base form "pass" in the index. To find them, you must search for pass%.

Text Extender uses two masking characters: underscore (_) and percent (%):

Searching for terms that already contain a masking character

If you want to search for a term that contains the "%" character or the "_" character, you must precede the character by a so-called escape character, and then identify the escape character using the ESCAPE keyword.

For example, to search for "10% interest":

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             '"10!% interest" ESCAPE "!"') = 1

The escape character in this example is "!".

Searching for terms in any sequence

If you search for "hard disk" as shown in the following example, you find the two terms only if they are adjacent and occur in the sequence shown, regardless of the index type you are using.

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE, '"hard disk"') = 1

To search for terms in any sequence, as in "data disks and hard drives", for example, use a comma to separate the terms:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE, '("hard", "disk")') = 1

Searching for terms in the same sentence or paragraph

Here is an example of a search argument that finds text documents in which the search terms occur in the same sentence:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                       '"compress" IN SAME SENTENCE AS "decompress"') = 1

You can also search for more than two words occurring together. In the next example, a search is made for several words occurring in the same paragraph:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             '"compress" IN SAME PARAGRAPH AS "decompress"
                                                     AND "encryption"') = 1

Searching for terms in sections of structured documents

Here is an example of a search argument that finds text documents in which the search term Williams occurs in the subsection author in section play of structured documents. The document structure is specified by model play which is described in a document model file. See Working with structured documents for more information.

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                       'MODEL play SECTIONS (play/author) "williams"') = 1

Searching for synonyms of terms

For a linguistic or a dual index, you can make your searches more flexible by looking not only for the search terms you specify, but also for words having a similar meaning. For example, when you search for the word "book", it can be useful to search also for its synonyms. To do this, specify:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE, 'SYNONYM FORM OF "book"') = 1

When you use SYNONYM FORM OF, it is assumed that the synonyms of the term are connected by a logical OR operator, that is, the search argument is interpreted as:

"book" | "article" | "volume" | "manual"

The synonyms are in a dictionary that is provided with Text Extender. The default dictionary used for synonyms is always US_ENGLISH, not the language specified in the text configuration settings.

You can change the dictionary for a particular query by specifying a different language. Here is an example:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             'SYNONYM FORM OF UK_ENGLISH "programme"') = 1
Tip

You cannot use the SYNONYM keyword if there are masking characters in a search term, or if NOT is used with the search argument.

Making a linguistic search

Text Extender offers powerful linguistic processing for making a search based on the search terms that you provide. The linguistic functions are applied when the index is linguistic or when a dual index is used with STEMMED FORM OF parameter. The linguistic functions are described in Chapter 3, Linguistic processing.

An example of this is searching for a plural form, such as "utilities", and finding "utility". The plural is reduced to its base form utility, using an English dictionary, before the search begins.

The English dictionary, however, does not have the information for reducing variations of terms in other languages to their base form. To search for the plural of a term in a different language you must use the dictionary for that language.

If you specify GERMAN, for example, you can search for "geflogen" (flown) and find all variations of its base form "fliegen" (fly)--not only "geflogen", but also "fliege", "fliegt", and so on.

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             'STEMMED FORM OF GERMAN "geflogen"') = 1
Tip

When searching in documents that are not in U.S. English, specify the language in the search argument regardless of the default language.

If you always specify the base form of a search term, rather than a variation of it, you do not need to specify a language.

To understand why, consider what happens when the text in your database is indexed. If you are using a linguistic or a dual index, all variations of a term are reduced to their base form before the terms are stored in the index. This means that, in the DB2TX.SAMPLE table, although the term "decompress" occurs in the first entry in the COMMENT column, "decompression" occurs in the second entry, the index contains only the base form "decompress" and identifies this term (or its variations) as being in both entries.

Later, if you search for the base form "decompress", you find all the variations. If, however, you search for a variation like "decompression", you cannot find it directly. You must specify an appropriate dictionary for the search, so that the variation can first be converted to its base form.

Searching with the Boolean operator NOT

You can use the Boolean operator NOT to exclude particular text documents from the search. For example:

("compress", "compiler") & NOT "DB2"

Any text documents containing the term "DB2" are excluded from the search for "compress" or "compiler".

You cannot use the NOT operator in combination with IN SAME SENTENCE AS or IN SAME PARAGRAPH AS described in Searching for terms in the same sentence or paragraph, neither can you use it with SYNONYM FORM OF described in Searching for synonyms of terms.

You can use the NOT operator only with a search-primary, that is, you cannot freely combine the &, |, and NOT operators (see Search argument syntax).

Example of the use of NOT that is not allowed:

     NOT("compress" & "compiler")

Allowed is:

     NOT("compress" , "compiler")

Fuzzy search

"Fuzzy" search searches for words that are spelled in a similar way to the search term. It is available for Ngram indexes.

For example:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             'FUZZY FORM OF 2 "compress"') = 1

This search could find an occurrence of the misspelled word conpress.

The match level, in the example "2", specifies the degree of accuracy. Five levels are supported, where level 1 gives the loosest matching of about 20 percent, and level 5 gives the tightest matching of about 90 percent. Use a fuzzy search when the misspellings are possible in the document, as is often the case when the document was created using an Optical Character Recognition device, or phonetic input.

Respecting word-phrase boundaries

"Bound" search has been developed for the Korean language. It ensures that Text Extender respects word boundaries during the search. For example:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             'BOUND "korean-expression"') = 1

Searching for similar-sounding words

"Sound" search finds words that sound like the search argument. This is useful when documents can contain words that sound alike, but are spelled differently. The German name that is pronounced my-er, for example, has several spellings.

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                             'SOUNDS LIKE "Meyer"') = 1

This search could find occurrences of "Meyer", "Mayer", and "Maier".

Thesaurus search

Thesaurus search is another of Text Extender's powerful search-term expansion functions. The additional terms searched for are taken from a thesaurus that you build yourself, so you have direct control over them. You search for "database", for example, and could find terms like "repository" and "DB2".

This type of search is intended for specific areas of interest in which you make frequent searches; an area in which it is worth the investment in time to build a thesaurus in order to produce significantly more effective search results.

See Thesaurus concepts for more information and a description of how to build a thesaurus. The example in Figure 5 is a small extract from a thesaurus on the subject of databases. It is used in the following examples that demonstrate the syntax for using thesaurus expansion.

This example takes the term "object relational database management system" and expands it, adding all instances of this term found in the thesaurus "myterms". Here, "DB2" is added to the search.

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                   'THESAURUS "myterms"
                   EXPAND "INST"
                   TERM OF "object relational database management system"
                   ') = 1

The next example takes the term "document management system" and expands it, adding all its synonyms. There is one synonym - "library".

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
                   'THESAURUS "myterms"
                   EXPAND "SYN"
                   TERM OF "document management system"
                   ') = 1

Free-text and hybrid search

"Free-text search" is a search in which the search term is expressed as free-form text. A phrase or a sentence describes in natural language the subject to be searched for. The sequence of words in a free-text query are not relevant. Furthermore, so-called lexical affinities are supported. In retrieval, these are certain pairs of words occurring in a free-text query term, and occurring in the document collection, with a certain minimal frequency and a certain minimal distance. The distance for English documents is five words.

Note that the masking of characters or words is not supported for search strings in a free-text argument.

For example:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
             'IS ABOUT "everything related to AIX installation"') = 1

Hybrid search is a combination of Boolean search and free-text search. For example:

     SELECT DATE, SUBJECT
       FROM DB2TX.SAMPLE
       WHERE DB2TX.CONTAINS (COMMENTHANDLE,
             '"DB2" & IS ABOUT "everything related to AIX installation"') = 1


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]