IBM FileNet P8, Version 5.2.1

Token searches: Language-aware versus exact-match

For indexing purposes, IBM® Content Search Services parses document text into tokens. In general, a token is either a word or a character sequence.

For example, abc123 is a character sequence as opposed to a word. Character sequences are delimited in the document by white space or punctuation. If a character sequence is longer than 4096 characters, it is not indexed.
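The delimiting and length rules described above can be illustrated with a small sketch. This is not the parser that IBM Content Search Services uses; the function name tokenize and the regular expression are assumptions made for illustration, and the sketch treats every character other than a letter, digit, or underscore as a delimiter.

```python
import re

# Length limit stated in the text above.
MAX_TOKEN_LENGTH = 4096

def tokenize(text):
    """Split text into tokens and drop tokens that are too long to index.

    Hypothetical helper: any run of letters, digits, or underscores is
    treated as one token; white space and punctuation act as delimiters.
    """
    candidates = re.findall(r"\w+", text)
    # A token longer than 4096 characters is not indexed.
    return [token for token in candidates if len(token) <= MAX_TOKEN_LENGTH]
```

Under these assumptions, tokenize("abc123 testing, done.") yields the tokens abc123, testing, and done, and a 4097-character run is dropped.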

An example of a word is testing; both the word and the stem of the word, test, might be indexed. This type of language-aware processing also occurs for the search expression. For example, if a search expression contains the word tested, the word stem test is implicitly added to the expression. In this way, a language-aware search enables a search for tested to find a document with the word testing. In general, a search for a term finds that term and all lexical variants of the term. For example, a search for was also finds instances of is.
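The following sketch shows one way to picture language-aware matching: both the indexed words and the search term are expanded with their stems, and a match occurs when the two sets overlap. The stem table, the function names, and the choice of be as the shared base form of was and is are all hypothetical; the actual product relies on its own linguistic processing, which is not shown here.

```python
# Hypothetical stem table for illustration only; a real engine uses
# language dictionaries rather than a hand-written mapping.
TOY_STEMS = {
    "testing": "test",
    "tested": "test",
    "was": "be",
    "is": "be",
}

def variants(word):
    """Return the word plus its stem, if the toy table knows one."""
    forms = {word}
    stem = TOY_STEMS.get(word)
    if stem is not None:
        forms.add(stem)
    return forms

def build_index(document_words):
    """Index every word together with its stem."""
    indexed = set()
    for word in document_words:
        indexed |= variants(word)
    return indexed

def language_aware_match(search_term, indexed):
    """True if the search term or its stem appears among the indexed forms."""
    return bool(variants(search_term) & indexed)
```

With this table, a search for tested matches a document that contains testing, because both share the stem test.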

The alternative to a language-aware search is an exact-match search. You can specify an exact-match search in the following ways:

Search terms in quotation marks: For the search term "testing" (in quotation marks), no search occurs for the word stem test.
Search terms with no lowercase letters: For the search term TESTING, no search occurs for the word stem test.
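The two rules above can be expressed as a simple check. This is a minimal sketch of the rules as literally stated, not the product's query parser; the function name is_exact_match is hypothetical, and the sketch assumes straight double quotation marks around the whole term.

```python
def is_exact_match(search_term):
    """Decide whether a term would be treated as an exact-match search.

    Hypothetical helper that applies the two rules literally:
    the term is enclosed in quotation marks, or it contains no
    lowercase letters.
    """
    quoted = (
        len(search_term) >= 2
        and search_term.startswith('"')
        and search_term.endswith('"')
    )
    no_lowercase = not any(ch.islower() for ch in search_term)
    return quoted or no_lowercase
```

Under these assumptions, "testing" (in quotation marks) and TESTING are exact-match terms, while testing and Testing remain language-aware.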


Last updated: March 2016

© Copyright IBM Corporation 2016.