Operators for Searching Full Text


This section describes operators used for performing full text searches. The following three tables summarize the three "families" of text search operators. The operators and examples of their use are listed in alphabetical order after the tables.

Evidence Operators

Operator    Modifiers          Automatically Relevance-ranked
SOUNDEX     MANY, NOT          No
STEM        MANY, NOT          No
THESAURUS   MANY, NOT          No
TYPO/N      NOT
WILDCARD    CASE, MANY, NOT    No
WORD        CASE, MANY, NOT    No

Proximity Operators

Operator    Modifiers          Automatically Relevance-ranked
IN          NOT, WHEN
NEAR        NOT                Yes
NEAR/N      NOT, ORDER         Yes
PARAGRAPH   MANY, NOT, ORDER   Yes
PHRASE      MANY, NOT          Yes
SENTENCE    MANY, NOT, ORDER   Yes

Concept Operators

Operator    Modifiers     Automatically Relevance-ranked
ACCRUE      NOT           Yes
ALL         NOT, ORDER    No
AND         NOT           Yes
ANY         NOT           No
OR          NOT           Yes

ACCRUE

Selects documents that include at least one of the search elements you specify. Valid search elements are two or more words or phrases. Retrieved documents are relevance-ranked.

The ACCRUE operator scores retrieved documents according to the presence of each search element in the document, using a "the more, the better" approach: the more search elements found in the document, the better the document's score.

For example, to select documents containing stemmed variations of the words "computers" and "laptops," enter any of the following:

computers <ACCRUE> laptops

computers, laptops

<ACCRUE> (computers, laptops)
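
The "the more, the better" idea can be sketched in a few lines of Python. This is only an illustration: the source does not give the actual scoring formula, so the fraction-of-elements-found score below is an assumption, as are the function name accrue_score and the use of exact single-word matching with no stemming.

```python
def accrue_score(document, elements):
    """Hypothetical ACCRUE-style score: fraction of search elements found.

    Assumption: exact, case-insensitive word matching; no stemming.
    """
    if len(elements) < 2:
        raise ValueError("ACCRUE expects two or more search elements")
    words = set(document.lower().split())
    found = sum(1 for element in elements if element.lower() in words)
    return found / len(elements)


print(accrue_score("cheap laptops and computers", ["computers", "laptops"]))
print(accrue_score("cheap laptops", ["computers", "laptops"]))
```

A document that contains both elements scores higher than one that contains only one, and a document with neither scores zero and would not be selected.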

ALL

Selects documents that contain all of your search elements. Retrieved documents are not relevance-ranked. Scores cannot be assigned to this operator.

For example, to select documents which contain stemmed variations of the phrase "pharmaceutical companies" and stemmed variations of the word "stock," enter the following:

pharmaceutical companies <ALL> stock

Only those documents that contain both search elements, or stemmed variations of them (for example, "pharmaceutical company," "stocks," etc.), are retrieved. Each retrieved document is assigned a score of 1.00.

AND

Selects documents that contain all of your search elements. Documents retrieved using the AND operator are relevance-ranked.

For example, to select documents which contain stemmed variations of the phrase "pharmaceutical companies" and stemmed variations of the word "stock," enter the following:

pharmaceutical companies AND stock

Only those documents that contain both search elements, or stemmed variations of them (for example, "pharmaceutical company," "stocks," etc.), are retrieved. A calculated score is assigned to each retrieved document.

ANY

Selects documents that contain at least one of your search elements. Retrieved documents are not relevance-ranked. Scores cannot be assigned to this operator.

For example, to select documents that contain stemmed variations of the word "election" or the phrases "national elections" or "senatorial race", enter the following:

election <ANY> national elections <ANY> senatorial race

Only those documents that contain at least one of the search elements, or a stemmed variation of at least one of them, are retrieved. Each retrieved document is assigned a score of 1.00.
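
The difference between the unranked operators (ALL, ANY) and their ranked counterparts (AND, OR) is only in scoring: ALL and ANY give every retrieved document the same score of 1.00. A minimal Python sketch of that behavior follows. The function names are hypothetical, and exact single-word matching without stemming is assumed for brevity.

```python
def all_score(document, elements):
    """ALL-style selection: every element must be present; fixed score."""
    words = set(document.lower().split())
    return 1.00 if all(e.lower() in words for e in elements) else 0.0


def any_score(document, elements):
    """ANY-style selection: one element is enough; fixed score."""
    words = set(document.lower().split())
    return 1.00 if any(e.lower() in words for e in elements) else 0.0


doc = "the election results were announced"
print(all_score(doc, ["election", "race"]))  # not selected
print(any_score(doc, ["election", "race"]))  # selected with score 1.00
```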

IN

Selects documents that contain specified values in one or more document zones. A document zone represents a region of a document, such as the document's summary, date, or body text. The IN operator works only if document zones have been defined in your collections. If you use the IN operator to search collections without defined zones, no documents will be selected. Also, the zone name you specify must match the zone names defined in your collections. Consult your collection administrator to determine which zones have been defined for specific collections.

The IN operator can be qualified with the WHEN operator to search for a term only within zones that satisfy certain conditions. Use of the WHEN operator is described below.

The following query expression searches document zones named "summary" for the word "safety."

"safety" <IN> summary

To search with multiple words, phrases, or topics, enclose them in parentheses. The following query expression searches document zones named "summary" for the word "safety" and stemmed variations of the word "warning."

("safety", warning) <IN> summary

To search multiple zones, separate them with commas and enclose them in parentheses. The following query expression searches both the "summary" zone and the "title" zone for the word "safety" and stemmed variations of the word "warning."

("safety", warning) <IN> (summary, title)

You must enclose query expressions containing commas in parentheses. The following example searches the "summary" zone for the word "safety" and stemmed variations of the phrase "environmental regulation."

("safety", environmental regulation) <IN> summary

The following query expression searches both the "summary" zone and the "title" zone for the word "safety" and stemmed variations of the phrase "environmental regulation."

("safety", environmental regulation) <IN> (summary, title)
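
The parenthesization rules above can be captured in a small query builder. The following Python sketch is illustrative only; build_in_query is a hypothetical helper, not part of any Verity API, and it simply assembles the query text shown in the examples.

```python
def build_in_query(terms, zones):
    """Assemble an <IN> query string from search terms and zone names.

    Multiple terms or zones are comma-separated and wrapped in
    parentheses; a single term or zone is left as-is.
    """
    def group(items):
        if not items:
            raise ValueError("at least one item is required")
        if len(items) == 1:
            return items[0]
        return "(" + ", ".join(items) + ")"

    return group(terms) + " <IN> " + group(zones)


print(build_in_query(['"safety"'], ["summary"]))
print(build_in_query(['"safety"', "environmental regulation"],
                     ["summary", "title"]))
```

The second call produces the same text as the last example above.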

WHEN

Selects documents that contain specified values in one or more document zones that satisfy conditions you specify. The following examples illustrate searching for terms within such a zone.

Suppose you want to search for the word "here" in a zone named "A" whose HREF attribute contains the string "verity," and the text looks like this:

Our site is <A HREF = "www.verity.com">here</A>.

To search for the word "here" in the zone "A" when the HREF contains the string "verity," you can write this query:

"here" <IN> A <WHEN> (HREF <CONTAINS> "verity")

A query condition for the WHEN operator must be enclosed in parentheses, as shown above. A query condition can include one or more Verity operators; it takes the form:

"attribute_name" <attribute_test_operator> "test_value"

where attribute_test_operator is one of the following operators: <STARTS>, <ENDS>, <CONTAINS>, =, or <MATCHES>. Except for =, all of these operators must be surrounded by angle brackets.

Attribute test operators can be combined with the combination operators <AND> or <OR>. For example, you can search for the string "IBM" in a zone named "Company" when the attribute named "reference" is equal to either "major" or "significant" by using the following query:

"IBM" <IN> "Company" <WHEN> ("reference" = "major" <OR>
"reference" = "significant")

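To make the attribute tests concrete, here is a rough Python sketch of how a single condition might be evaluated against an attribute value. It is not the engine's implementation: the function name attribute_test is hypothetical, case-insensitive comparison is an assumption, and <MATCHES> is left out because its semantics are not described here.

```python
def attribute_test(value, operator, test_value):
    """Evaluate one WHEN-style attribute condition (illustrative only).

    Assumption: comparisons ignore case.
    """
    value, test_value = value.lower(), test_value.lower()
    if operator == "<STARTS>":
        return value.startswith(test_value)
    if operator == "<ENDS>":
        return value.endswith(test_value)
    if operator == "<CONTAINS>":
        return test_value in value
    if operator == "=":
        return value == test_value
    raise ValueError("unsupported operator in this sketch: " + operator)


# HREF <CONTAINS> "verity"
print(attribute_test("www.verity.com", "<CONTAINS>", "verity"))

# "reference" = "major" <OR> "reference" = "significant"
reference = "significant"
print(attribute_test(reference, "=", "major")
      or attribute_test(reference, "=", "significant"))
```
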
NEAR

Selects documents containing specified search terms within close proximity to each other. Document scores are calculated based on the relative number of words between search terms.

For example, if the search expression includes two words, and those words occur next to each other in a document (so that the region size is two words long), then the score assigned to that document is 1.0. Thus, the document with the smallest possible region containing all search terms always receives the highest score. As search terms appear further apart, the score drops toward zero. A document receives a zero score only if it does not contain all search terms.

The NEAR operator is similar to the other proximity operators in the sense that the search words you enter must be found within close proximity of one another. However, unlike other proximity operators, the NEAR operator calculates relative proximity and assigns scores based on its calculations.

To retrieve relevance-ranked documents that contain stemmed variations of the words "war" and "peace" within close proximity to each other, enter the following:

war <NEAR> peace
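
The relative-proximity scoring described above can be approximated as follows. The exact formula used by the engine is not given, so this Python sketch assumes one simple possibility: the number of search terms divided by the size of the smallest region that contains them all. That choice reproduces the stated behavior (adjacent terms score 1.0, the score drops toward zero as terms move apart, and a missing term gives zero), but it should be read as an assumption, as should the name near_score and the exact-word matching.

```python
def near_score(text, terms):
    """Hypothetical NEAR-style score based on the smallest matching region."""
    words = text.lower().split()
    wanted = {term.lower() for term in terms}
    best = None  # size in words of the smallest region seen so far
    for start, word in enumerate(words):
        if word not in wanted:
            continue
        seen = set()
        for end in range(start, len(words)):
            if words[end] in wanted:
                seen.add(words[end])
                if seen == wanted:
                    size = end - start + 1
                    if best is None or size < best:
                        best = size
                    break
    if best is None:
        return 0.0  # at least one search term is missing
    return len(wanted) / best


print(near_score("war peace", ["war", "peace"]))
print(near_score("war and then peace", ["war", "peace"]))
print(near_score("war only", ["war", "peace"]))
```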

NEAR/N

Selects documents containing two or more words within N words of each other, where N is an integer. Document scores are calculated based on the relative distance of the specified words when they are separated by N words or fewer.

For example, if the search expression NEAR/5 is used to find two words within five words of each other, a document that has the specified words within three words of each other is scored higher than a document that has the specified words within five words of each other.

The N variable can be an integer between 1 and 1,024, where NEAR/1 searches for two words that are next to each other. If N is 1,000 or above, you must specify its value without commas, as in NEAR/1000. You can specify multiple search terms using multiple instances of NEAR/N, as long as the value of N is the same.

For example, to retrieve relevance-ranked documents that contain stemmed variations of the words "commute," "bicycle," "train," and "bus" within 10 words of each other, enter the following:

commute <NEAR/10> bicycle <NEAR/10> train <NEAR/10> bus
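
The rules for N (an integer from 1 to 1,024, written without commas, and the same value between every pair of terms) lend themselves to a small validation helper. The Python sketch below is illustrative; build_near_query is a hypothetical name, and the function only assembles query text.

```python
def build_near_query(terms, n):
    """Join search terms with <NEAR/N>, using one N throughout."""
    if not isinstance(n, int) or not 1 <= n <= 1024:
        raise ValueError("N must be an integer between 1 and 1024")
    if len(terms) < 2:
        raise ValueError("NEAR/N needs two or more search terms")
    # Formatting the integer directly guarantees no thousands separator.
    return f" <NEAR/{n}> ".join(terms)


print(build_near_query(["commute", "bicycle", "train", "bus"], 10))
print(build_near_query(["war", "peace"], 1000))
```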

You can use the NEAR/N operator with the ORDER modifier to perform ordered proximity searches. For more information about the ORDER modifier, see "ORDER" in Chapter 4.

OR

Selects documents that show evidence of at least one of your search elements. Documents selected using the OR operator are relevance-ranked.

For example, to select documents that contain stemmed variations of the word "election" or the phrases "national elections" or "senatorial race", enter the following:

election OR national elections OR senatorial race

Only those documents that contain at least one of the search elements, or a stemmed variation of at least one of them, are retrieved. A calculated score is assigned to each retrieved document.

PARAGRAPH

Selects documents that include all of the search elements you specify within a paragraph. Valid search elements are two or more words or phrases. You can specify search elements in a sequential or a random order. Documents are retrieved as long as search elements appear in the same paragraph.

To retrieve relevance-ranked documents that contain stemmed variations of the word "drug" and the phrase "cancer treating" in the same paragraph, enter the following:

drug <PARAGRAPH> cancer treating

To search for three or more words or phrases, you must use the PARAGRAPH operator between each word or phrase.

You can use the PARAGRAPH operator with the ORDER modifier to perform ordered proximity searches. For more information about the ORDER modifier, see "ORDER" in Chapter 4.

PHRASE

Selects documents that include a phrase you specify. A phrase is a grouping of two or more words that occur next to each other in a specific order.

By default, two or more words separated by a space are considered to be a phrase in simple syntax. Two or more words enclosed in double quotes are also considered to be a phrase. To retrieve relevance-ranked documents that contain the phrase "mission oak," enter any of the following:

mission oak

"mission oak"

mission <PHRASE> oak

<PHRASE> (mission, oak)

SENTENCE

Selects documents that include all of the words you specify within a sentence. You can specify search elements in a sequential or a random order. Documents are retrieved as long as search elements appear in the same sentence.

To retrieve relevance-ranked documents that contain stemmed variations of the words "American" and "innovation" within the same sentence, enter either of the following:

american <SENTENCE> innovation

<SENTENCE> (american, innovation)

You can use the SENTENCE operator with the ORDER modifier to perform ordered proximity searches. For more information about the ORDER modifier, see "ORDER" in Chapter 4.

SOUNDEX

Selects documents that include one or more words that "sound like," or whose letter pattern is similar to, the word specified. Words must start with the same letter as the word you specify to be selected.

Your collection administrator must have configured collections to support the SOUNDEX operator. See your collection administrator for information.

For example, to retrieve documents containing a word that is close in structure to the word "sale," enter the following:

<SOUNDEX> sale

The documents retrieved will include words such as "sale," "sell," "seal," "shell," "soul," and "scale." Documents are not relevance-ranked unless the MANY modifier is used, as in:

<MANY><SOUNDEX> sale
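
The algorithm behind the SOUNDEX operator is not specified here. As a point of comparison, the classic American Soundex code is sketched below in Python; treating it as a stand-in for the operator is an assumption. Under the classic rules, all six example words above happen to share one code, and the rule that the first letter is always kept matches the requirement that words start with the same letter.

```python
# Classic American Soundex digit groups.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character classic Soundex code for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(letter, "")  # vowels and y have no code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


for word in ("sale", "sell", "seal", "shell", "soul", "scale"):
    print(word, soundex(word))  # each prints S400
```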

STEM

Selects documents that include one or more variations of the search word you specify.

For example, to retrieve documents containing a variation of the word "film," enter the following:

<STEM> film

The documents retrieved will include words such as "films," "filmed," and "filming." Documents are not relevance-ranked unless the MANY modifier is used, as in:

<MANY><STEM> film

THESAURUS

Selects documents that contain one or more synonyms of the word you specify.

For example, to retrieve documents containing synonyms of the word "altitude" enter the following:

<THESAURUS> altitude

The documents retrieved will include words such as "height" or "elevation." Documents are not relevance-ranked unless the MANY modifier is used, as in:

<MANY><THESAURUS> altitude

TYPO/N

Selects documents that contain the word you specify plus words that are similar to the query term. The TYPO/N operator performs "approximate pattern matching" to identify similar words. This makes it ideal for use in an environment where documents have been scanned using an Optical Character Reader (OCR).

The optional N variable in the operator name expresses the maximum number of errors between the query term and a matched term, a value called the error distance. If N is not specified, an error distance of 2 is used.

The error distance between two words is based on the calculation of errors, where an error is defined to be a character insertion, deletion, or transposition. For example, for these sets of words, the second word matches the first within an error distance of 1:

mouse, house (m -> h)
agreed, greed (a is deleted)
cat, coat (o is inserted)

For the query below, documents with the words "sweeping" and "swimming" will match, since there are 3 transpositions in the word (e -> i, e -> m, p -> m).

<TYPO/3> sweeping

Both of the queries below return the same results, because the default error distance is 2. Documents containing the words "swept" and "kept" match, since "kept" differs from "swept" by 1 transposition and 1 deletion.

<TYPO/2> swept

<TYPO> swept
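
The error distance described above behaves like the standard edit (Levenshtein) distance, if a "transposition" is read as replacing one character with another, which is how the examples use the term. That reading is an assumption, and the Python sketch below is not the engine's implementation; the names error_distance and typo_matches are hypothetical.

```python
def error_distance(first, second):
    """Edit distance counting insertions, deletions, and replacements."""
    previous = list(range(len(second) + 1))
    for i, a in enumerate(first, start=1):
        current = [i]
        for j, b in enumerate(second, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # delete a
                               current[j - 1] + 1,       # insert b
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def typo_matches(query_term, candidate, n=2):
    """True if candidate is within error distance n (default 2)."""
    return error_distance(query_term, candidate) <= n


print(error_distance("mouse", "house"))        # 1
print(error_distance("sweeping", "swimming"))  # 3
print(typo_matches("swept", "kept"))           # True
```

Each candidate word must be compared against the query term, which is consistent with the cost of scanning the word list discussed below.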

The TYPO/N operator must scan the collection's word list to find candidate matching words. This makes it impractical for use in large collections (more than 100,000 documents), unless a current spanning word list is available, and in performance-sensitive environments. Performance can be improved by generating a spanning word list for the collections to be used.

NOTE: A query term specified with TYPO/N can have a maximum length of 32 characters. Also, TYPO/N is not supported with multi-byte character sets.

WILDCARD

Selects documents that contain matches to a wildcard character string. The WILDCARD operator lets you define a wildcard string, which can be used to locate related word matches in documents. A wildcard string consists of special characters.

For example, to retrieve documents that contain words such as "pharmaceutical," "pharmacology," and "pharmacodynamics," enter the following:

pharmac*

Documents are not relevance-ranked unless the MANY modifier is used, as in:

<MANY> pharmac*

The wildcard characters "*" and "?" automatically enable wildcard searching. To use other constructs, use the WILDCARD operator explicitly with any of the characters in the following table.

Character   Function
?           Specifies any single alphanumeric character, as in ?an, which locates "ran," "pan," "can," and "ban." It is not necessary to specify the WILDCARD operator when you use the question mark. The question mark is ignored in a set ([ ]) or in an alternative pattern ({ }).
*           Specifies zero or more of any alphanumeric character, as in corp*, which locates "corporate," "corporation," "corporal," and "corpulent." It is not necessary to specify the WILDCARD operator when you use the asterisk; do not use the asterisk to specify the first character of a wildcard string. The asterisk is ignored in a set ([ ]) or in an alternative pattern ({ }).
[ ]         Specifies any one character in a set, as in <WILDCARD> `c[auo]t`, which locates "cat," "cut," and "cot." You must enclose the word that includes a set in backquotes (`), and a set cannot contain spaces.
{ }         Specifies one of each pattern separated by commas, as in <WILDCARD> `bank{s,er,ing}`, which locates "banks," "banker," and "banking." You must enclose the word that includes a pattern in backquotes (`), and a pattern cannot contain spaces.
^           Specifies any one character not in the set, as in <WILDCARD> `st[^oa]ck`, which excludes "stock" and "stack" but locates "stick" and "stuck." The caret (^) must be the first character after the left bracket ([) that introduces a set.
-           Specifies a range of characters in a set, as in <WILDCARD> `c[a-r]t`, which locates every three-letter word from "cat" to "crt."
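
One way to get a feel for these constructs is to translate a wildcard string into a regular expression. The Python sketch below does this for the characters in the table. It is an approximation built on several assumptions: "alphanumeric" is taken to mean ASCII letters and digits, matching ignores case, and backslash escapes and backquote handling are left out. The name wildcard_to_regex is hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard string into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "?":
            parts.append("[A-Za-z0-9]")       # exactly one alphanumeric
        elif ch == "*":
            parts.append("[A-Za-z0-9]*")      # zero or more alphanumerics
        elif ch == "[":
            end = pattern.index("]", i)
            parts.append(pattern[i:end + 1])  # sets, ^ and - carry over
            i = end
        elif ch == "{":
            end = pattern.index("}", i)
            choices = pattern[i + 1:end].split(",")
            parts.append("(?:" + "|".join(re.escape(c) for c in choices) + ")")
            i = end
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)


words = ["banks", "banker", "banking", "bankrupt"]
matcher = wildcard_to_regex("bank{s,er,ing}")
print([w for w in words if matcher.fullmatch(w)])
```

Using fullmatch means the whole word must fit the wildcard string, so "bankrupt" is not selected by the pattern above.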

Searching for Nonalphanumeric Characters

Remember that you can search for nonalphanumeric characters only if the style.lex file used to create the collections you are searching is configured to recognize the characters you want to locate. Consult your collection administrator for more information.

Searching for Wildcard Characters as Literals

Provided the style.lex file is configured for the collections to be searched, you can search for a word containing a wildcard character such as "/" or "*" by preceding the wildcard character with a backslash.

For example, when you enter the following search string:

abc\*d

the engine finds five-character words matching the "abc*d" string.

When you want to match a literal backslash, you must enter two backslashes.
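
When search strings come from user input, the escaping rule can be applied mechanically. The following Python sketch is a hypothetical helper (escape_wildcards is not a Verity function) that puts a backslash before each asterisk, question mark, and backslash so that they are read as literals.

```python
def escape_wildcards(text):
    """Prefix *, ?, and backslash with a backslash."""
    return "".join("\\" + ch if ch in "*?\\" else ch for ch in text)


print(escape_wildcards("abc*d"))  # abc\*d
print(escape_wildcards("a\\b"))   # a\\b
```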

Searching for Special Characters as Literals

The following nonalphanumeric characters perform special, internal search engine functions, and, by default, are not treated as literals in a wildcard string:

To interpret special characters as literals, you must surround the whole wildcard string in backquotes (`). For example, to search for the wildcard string "a{b", you surround the string with backquotes, as follows:

<WILDCARD> `a{b`

To search for a wildcard string that includes the literal backquote character (`), you must use two backquotes together and surround the whole wildcard string in backquotes (`). For example, to search for the wildcard string "*n`t", you can enter the following query:

<WILDCARD> `*n``t`
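
The backquote rules can also be applied programmatically. The short Python sketch below uses hypothetical helper names and only builds the query text; it doubles any backquote inside the string and then wraps the whole string in backquotes.

```python
def backquote_literal(wildcard_string):
    """Double embedded backquotes, then surround the string with backquotes."""
    return "`" + wildcard_string.replace("`", "``") + "`"


def wildcard_query(wildcard_string):
    """Build an explicit <WILDCARD> query for a backquoted string."""
    return "<WILDCARD> " + backquote_literal(wildcard_string)


print(wildcard_query("a{b"))
print(wildcard_query("*n`t"))
```

The two calls print the same queries as the examples above.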

You can search on backquotes only if the style.lex file used to create the collections you are searching is configured to recognize the backquote character. Consult your collection administrator for information.

WORD

Selects documents that include one or more instances of only the word you specify without locating stemmed variations of that word.

For example, to search for documents that contain the word "rhetoric," without also considering the words "rhetorical" and "rhetorician," enter the following:

<WORD> rhetoric

Documents are not relevance-ranked unless the MANY modifier is used, as in:

<MANY><WORD> rhetoric





Copyright © 2001, Verity, Inc. All rights reserved.