Object indexing process overview

The process of indexing object text begins with the submission of an index request and ends with a change to the index information for the object: the index entry for the object is created or updated. In this topic, the term indexing covers only the creation or update of an index entry, which is a subset of what reindexing covers. Reindexing an object can also delete the index entry for the object if the object no longer belongs to a CBR-enabled class.
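
The distinction between indexing and reindexing can be sketched as a small decision function. This is an illustrative sketch only: the function name `reindex_outcome` and its two boolean parameters are hypothetical and are not part of the Content Engine API.

```python
from typing import Optional


def reindex_outcome(class_is_cbr_enabled: bool, has_index_entry: bool) -> Optional[str]:
    """Return the change made to the index entry, or None if nothing changes.

    Hypothetical helper for illustration; not a Content Engine API call.
    """
    if class_is_cbr_enabled:
        # Indexing in the "positive" sense: the entry is created or updated.
        return "update" if has_index_entry else "create"
    # Reindexing only: the object left a CBR-enabled class, so any
    # existing entry is deleted.
    return "delete" if has_index_entry else None


print(reindex_outcome(True, False))
print(reindex_outcome(False, True))
```

The first two outcomes are what this topic calls indexing; the deletion outcome applies only to reindexing.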

The following components might be involved in indexing object text:

The following table shows the high-level steps involved in indexing the text for an object:

Step Related information
Index request submission
An index request is generated for an object when Content Engine or an application creates, updates, or deletes an object that belongs to a CBR-enabled class. The application might be Enterprise Manager or a custom application that calls Content Engine API methods.

A batch of index requests can also be generated at one time as part of an index job. An index job is automatically created when you choose a CBR-enabled class or object to be indexed in Enterprise Manager. An index job is also created when you manually choose to reindex a selected index in an index area.

Related tasks
For information about creating an index job using Enterprise Manager, see Creating an index job.
Index determination
Content Engine determines the index area and the target index file to update for the index request.
Related concepts
For information about how Content Engine determines the index for an object, see Index areas and indexes.

Index request batching
The Content Engine subsystem dispatcher groups the index request with other index requests to form an index batch. The target index is the same for all index requests in a batch.

Related reference
For information about the dispatcher Max Batch Size property, see IBM Search Configuration tab.
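
The batching rule described in this step (one target index per batch) can be sketched as follows. This is a minimal sketch, not the dispatcher's actual implementation: the function `form_batches`, the tuple layout of a request, and the assumption that Max Batch Size caps the number of requests per batch are all illustrative.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical request shape for illustration: (object_id, target_index).
Request = Tuple[str, str]


def form_batches(requests: Iterable[Request], max_batch_size: int) -> List[List[Request]]:
    """Group requests so that every batch has a single target index."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    open_batches: Dict[str, List[Request]] = {}
    batches: List[List[Request]] = []
    for request in requests:
        target_index = request[1]
        batch = open_batches.setdefault(target_index, [])
        batch.append(request)
        if len(batch) == max_batch_size:
            # Assumed behavior: a full batch is closed and a new one is started.
            batches.append(batch)
            del open_batches[target_index]
    # Flush the batches that never reached the size limit.
    batches.extend(open_batches.values())
    return batches


sample = [("a", "idx1"), ("b", "idx2"), ("c", "idx1"), ("d", "idx1")]
for batch in form_batches(sample, max_batch_size=2):
    print(batch)
```
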
Text filtering
A worker thread for the subsystem dispatcher converts any binary documents to text documents. An example of a binary document is a Microsoft Word document. The object that an index request refers to can have zero or more content elements that are binary documents.
Related reference
For information about the Max Text Filters Per Batch property that controls the maximum number of worker threads for text filtering, see IBM Search Configuration tab.

Related concepts
For more information about text filtering, see Indexable document types.

Index batch submission
A worker thread for the subsystem dispatcher submits the text document as part of an index batch to an index server. The phrase text document refers to the text that is indexed for a Content Engine object.
Related reference
For information about the Lease expiry time property that worker threads use to control server access to indexes, see Index tab. For information about the Maximum Indexing Threads property that controls the number of worker threads for indexing, see IBM Search Configuration tab.

Document construction: XML filtering (text preprocessing)
The XML filtering utility removes surplus XML elements from XML content.

Related tasks
For information about defining surplus XML elements, see Setting XML elements as non-searchable.
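
The effect of XML filtering can be illustrated with a short sketch that uses the Python standard library as a stand-in for the product's XML filtering utility. The function `strip_elements` and the element names in the sample are hypothetical; the actual utility and its configuration are described in the related task.

```python
import xml.etree.ElementTree as ET
from typing import Set


def strip_elements(xml_text: str, surplus_tags: Set[str]) -> str:
    """Remove every element whose tag is in surplus_tags, with its subtree."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so that removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in surplus_tags:
                # Note: any text that directly follows the removed element
                # (its "tail") is dropped along with it in this sketch.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = "<doc><title>Report</title><meta>internal</meta><body>Quarterly results</body></doc>"
print(strip_elements(sample, {"meta"}))
```

In this sketch, only the text that remains after the removal would be passed on to the later preprocessing steps.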

Automatic language identification (text preprocessing)
The index server identifies the text language for the text document if language identification is configured to be automatic.

Related tasks
For information about identifying the text language or setting automatic language identification, see Identifying the text language.
Tokenization (text preprocessing)
The index server creates tokens for the text document based on a language-aware analysis of the text. Word stems and other language constructs are identified.

Related concepts
For information about word stems, see Word stems.
Token indexing
The index entry for the object in the target index is updated with the tokens.
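
The last two steps can be sketched with a toy inverted index. This is a deliberately simplified illustration: the tokenizer below only lowercases and splits on word characters, whereas the index server performs language-aware analysis (including word stems), and the names `tokenize` and `update_index_entry` are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase word characters only, no stemming."""
    return re.findall(r"\w+", text.lower())


def update_index_entry(index: Dict[str, Set[str]], object_id: str, text: str) -> None:
    """Replace the tokens recorded for object_id with the tokens of text."""
    # Drop the object's previous tokens so that the entry is updated, not merged.
    for token in list(index):
        index[token].discard(object_id)
        if not index[token]:
            del index[token]
    for token in tokenize(text):
        index.setdefault(token, set()).add(object_id)


index: Dict[str, Set[str]] = {}
update_index_entry(index, "doc1", "Quarterly results")
update_index_entry(index, "doc1", "Annual results")
print(sorted(index))
```

After the second call, the index holds only the tokens of the latest text for the object, which mirrors the create-or-update outcome described at the start of this topic.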

Related concepts
For information about servers that perform the text preprocessing steps exclusively, see Server overview. For information about the steps involved in searching for object text, see Object searching process overview. For information about reindexing, see Managing index jobs.