For indexing purposes, IBM® Content Search Services parses document text into tokens. In general, a token is either a word or a character sequence.
For example, abc123 is a character sequence as opposed to a word. Character sequences are delimited in the document by white space or punctuation. If a character sequence is longer than 4096 characters, it is not indexed.
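The delimiting and length rules above can be illustrated with a minimal sketch. It is not the product's tokenizer. The function name `tokenize` and the constant `MAX_TOKEN_LENGTH` are hypothetical, and the sketch assumes that any character other than a letter, digit, or underscore acts as a delimiter.

```python
import re

# Length limit stated in the text; the constant name is an assumption.
MAX_TOKEN_LENGTH = 4096


def tokenize(text):
    """Split text on white space or punctuation and drop overlong sequences."""
    # \w+ matches runs of letters, digits, and underscores; every other
    # character (white space, punctuation) is treated as a delimiter.
    candidates = re.findall(r"\w+", text)
    # Sequences longer than the limit are skipped rather than truncated.
    return [t for t in candidates if len(t) <= MAX_TOKEN_LENGTH]


if __name__ == "__main__":
    sample = "abc123 is indexed, " + "x" * 5000 + " is not."
    print(tokenize(sample))
```

In this sketch, a sequence of exactly 4096 characters is kept, because the text excludes only sequences longer than 4096 characters.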
An example of a word is testing; both the word and the stem of the word, test, might be indexed. This type of language-aware processing also occurs for the search expression. For example, if a search expression contains the word tested, the word stem test is implicitly added to the expression. In this way, a language-aware search enables a search for tested to find a document with the word testing. In general, a search for a term finds that term and all lexical variants of the term. For example, a search for was also finds instances of is, because both are forms of the verb be.
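The following sketch shows the idea of stem expansion at both index time and search time. It is a toy model, not the product's linguistic processing: the `LEMMAS` table and the functions `index_terms` and `matches` are hypothetical, and a real engine would rely on a language dictionary instead of a hand-written lookup.

```python
# Hypothetical lookup from a word to its stem; illustration only.
LEMMAS = {
    "tested": "test",
    "testing": "test",
    "tests": "test",
    "was": "be",
    "is": "be",
}


def index_terms(word):
    """Return the terms kept for a word: the word itself plus its stem, if known."""
    word = word.lower()
    return {word, LEMMAS.get(word, word)}


def matches(query_word, document_words):
    """Return True when the expanded query shares a term with the indexed document."""
    indexed = set()
    for w in document_words:
        indexed |= index_terms(w)
    # The query word is expanded the same way as the document words.
    return bool(index_terms(query_word) & indexed)


if __name__ == "__main__":
    print(matches("tested", ["unit", "testing"]))  # True: both share the stem test
    print(matches("was", ["it", "is", "here"]))    # True: both share the stem be
    print(matches("tested", ["toast"]))            # False: no shared term
```

Because the same expansion is applied to the document and to the search expression, the two sides meet on the shared stem even when the surface forms differ.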
The alternative to a language-aware search is an exact-match search. You can specify an exact-match search in the following ways:
Method | Example |
---|---|
Search terms in quotation marks | For the search term "testing" (in quotation marks), no search occurs for the word stem test |
Search terms with no lowercase letters | For the search term TESTING, no search occurs for the word stem test |
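The two rules in the table can be expressed as a simple check. This sketch only classifies a search term. The function name `is_exact_match` is hypothetical, and the sketch assumes straight double quotation marks and applies the second rule literally to any term that contains no lowercase letters.

```python
def is_exact_match(term):
    """Return True when a search term would be treated as exact-match."""
    # Rule 1: the term is enclosed in double quotation marks.
    quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    # Rule 2: the term contains no lowercase letters.
    no_lowercase = not any(ch.islower() for ch in term)
    return quoted or no_lowercase


if __name__ == "__main__":
    for term in ['"testing"', "TESTING", "testing", "Testing"]:
        kind = "exact-match" if is_exact_match(term) else "language-aware"
        print(f"{term}: {kind}")
```

Under these assumptions, a mixed-case term such as Testing still receives language-aware processing, because it contains lowercase letters and is not quoted.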