P8CSE50 style set reference

The P8CSE50 style set has many indexing features turned off by default compared to the FileNet_FileSystem_PushAPI style set. These features often affect the functionality of the Verity Query Language operators that you use in a full-text search expression. Enabling indexing features can increase the size of your collections and decrease system performance for indexing and searching objects. The effect on performance can vary significantly depending on the text that is indexed. The focus here is on feature functionality, although the possible impact of a feature on performance is mentioned when appropriate.

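For information about style set files and syntax, consult the Collection Reference in your Autonomy documentation. (For information about accessing the Autonomy documentation, see Accessing the Autonomy K2 documentation.) This topic focuses on the P8CSE50 style set and briefly explains how to configure the following features in your copy of the style set: case-sensitive searches, fields and zones, maximum index partition size, search accuracy for term grouping operators, stop words, suppression of indexing for binary data in IBM Lotus Domino XML content, and word stem searches.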

After completing your style set modifications, you must apply your modifications to the object store. For more information, see Change a style set for an object store.

Case-sensitive searches (style.prm)

The WORD-IDXOPTS parameter in the style.prm file determines how words are indexed to permit different types of searches. Multiple parameter settings are possible. Include the Casedex setting to make your searches case-sensitive.

IMPORTANT One of the default settings for the WORD-IDXOPTS parameter is StopLangdex. Do not remove this setting.

Fields and zones (style.sfl, style.ufl, style.xml, and style.uni)

Fields and zones are alternative ways of specifying portions of object text that can be searched separately from other portions of object text. IBM Legacy Content Search Engine implements the CBR-enabled properties for a class as zones. For example, the CONTAINS(d.DocumentTitle, 'lion') function call in a CBR query is equivalent to the following search expression in Verity Query Language (VQL): lion <IN> DocumentTitle. You can configure other zones or fields based on markup tags in object content. Zone and field configuration is necessary only if you intend to explicitly search the zones or fields by using VQL operators such as <IN>.
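
The equivalence between a CBR CONTAINS call and a VQL zone search can be sketched in code. The following Python function is an illustration only and is not part of any IBM or Autonomy API: the name contains_to_vql and its regular expression are hypothetical, and the sketch handles only the simple single-term form used in the example above.

```python
import re

# Hypothetical pattern: matches only the simple form CONTAINS(alias.Property, 'term').
_CONTAINS = re.compile(r"CONTAINS\(\s*\w+\.(\w+)\s*,\s*'([^']*)'\s*\)")


def contains_to_vql(cbr_call: str) -> str:
    """Return the VQL zone search that corresponds to a simple CONTAINS call."""
    match = _CONTAINS.fullmatch(cbr_call.strip())
    if match is None:
        raise ValueError(f"unsupported CONTAINS form: {cbr_call!r}")
    prop, term = match.groups()
    # The CBR-enabled property name becomes the zone name for the <IN> operator.
    return f"{term} <IN> {prop}"


print(contains_to_vql("CONTAINS(d.DocumentTitle, 'lion')"))
```

For the example call, the function returns the string lion <IN> DocumentTitle. Any other form of CONTAINS raises ValueError rather than guessing at a translation.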

Compared to fields, zones permit smaller collection sizes and consequently provide better system performance for object indexing and searching. Zones can be defined in the following ways:

Avoid defining zones for document types that tend to have a large amount of zone data, such as PDF, Microsoft Word, Microsoft Excel, and Microsoft PowerPoint documents. Such documents can cause collection sizes to increase and indexing performance to decrease.

To define fields, modify the following files:

Maximum index partition size (style.plc)

The merge_maxdocs parameter in the style.plc file sets the maximum number of indexed objects for a collection partition. (The word partition as used here does not refer to date partitioning.) A partition is created for each indexing batch that Content Engine sends to the Autonomy index server. The partitions for a collection are stored in the parts directory for the collection and have a .did file extension. Partitions are merged into larger partitions when the Autonomy K2 index server periodically optimizes collection storage.

If the size of a partition exceeds 1 GB, reduce the value for this parameter from the default of 65520 to 32760. Partitions that are 1 GB or more in size can cause long collection optimization times. Also, if the index server attempts to create a merged partition greater than 2 GB, many merge errors might be reported and the optimization attempt might fail to complete.
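
To decide whether the parameter needs to be reduced, you can check the sizes of the partition files in the parts directory. The following Python sketch shows one way to do that. It is not an Autonomy tool: the function names are hypothetical, the .did suffix and the 1 GB threshold are taken from the description above, and the path that you pass in depends on your own collection layout.

```python
from pathlib import Path

ONE_GB = 1024 ** 3  # the 1 GB threshold from the text, expressed in bytes


def oversized_partitions(parts_dir, limit_bytes=ONE_GB, suffix=".did"):
    """Return (file name, size in bytes) pairs for partitions larger than limit_bytes."""
    found = []
    for path in sorted(Path(parts_dir).glob(f"*{suffix}")):
        size = path.stat().st_size
        if size > limit_bytes:
            found.append((path.name, size))
    return found


def recommended_merge_maxdocs(parts_dir, limit_bytes=ONE_GB):
    """Suggest 32760 if any partition exceeds the limit; otherwise keep the default 65520."""
    return 32760 if oversized_partitions(parts_dir, limit_bytes) else 65520
```

The limit is a parameter so that the check can be tried against a small test directory before it is pointed at a production collection.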

As an alternative or in addition to reducing the maximum partition size, configure the style set to avoid indexing documents with the following characteristics:

Search accuracy for term grouping operators (style.prm)

The IDX-CONFIG parameter in the style.prm file affects the accuracy of the <NEAR> and <MANY> operators. (You use these operators when searching for terms based on the grouping of the terms within the text.) Set the parameter to improve search result accuracy for the following operators:

For example, to improve search result accuracy for the <NEAR> operator, set the parameter value to PSW.

The style.prm file might have several lines that set this parameter to different values. Uncomment only one of these lines. (The parameter can be set to one value only.) The default setting is WCT. Changing the default setting can cause collections to become larger and consequently decrease system performance for object searching.

Stop words (style.stp)

To define stop words, copy and rename the style.stp.disabled file to style.stp and modify style.stp as needed. Each stop word must be left-justified with one stop word per line. A stop word cannot be a phrase.
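
The formatting rules for the stop word file can be checked before you apply the style set. The following Python sketch is a hypothetical helper, not an Autonomy utility. It assumes that comment lines begin with a # character, which is an assumption that you should verify against the examples in your own style.stp.disabled file.

```python
def check_stop_words(lines, comment_prefix="#"):
    """Return (line number, problem) tuples for entries that break the format rules."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith(comment_prefix):
            continue  # skip blank lines and (assumed) comment lines
        if line[0].isspace():
            problems.append((number, "not left-justified"))
        elif any(ch.isspace() for ch in line.rstrip()):
            # Whitespace inside an entry means a phrase or several words on one line.
            problems.append((number, "more than one word on the line"))
    return problems


sample = ["the\n", " and\n", "of the\n", "# a comment\n", "\n"]
for number, problem in check_stop_words(sample):
    print(f"line {number}: {problem}")
```

In the sample, the second entry is flagged because it is indented and the third entry is flagged because it is a phrase.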

If your indexes are case-sensitive, you must add all case variations for the stop word. For example, to filter out the word the, you must include entries for both the and The. For information about setting the case-sensitivity of searches, see Case-sensitive searches.
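
Listing case variations by hand is error-prone, so you might generate them. The following Python sketch (the function name case_variations is hypothetical) produces every upper-case and lower-case spelling of a word. The number of variations doubles with each letter, so the output is practical only for short stop words.

```python
from itertools import product


def case_variations(word):
    """Return every upper/lower-case spelling of word, sorted."""
    # For each character, collect its distinct lower-case and upper-case forms.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in word]
    return sorted("".join(combo) for combo in product(*choices))


for entry in case_variations("the"):
    print(entry)  # one stop word per line, as the file format requires
```

For the word the, the result contains eight entries, including the and The.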

You can specify a regular expression as a stop word. In particular, you can use a regular expression to limit the length of indexed words and the length of indexed numerical values. These kinds of limits help ensure acceptable system performance for object indexing and searching. Without these limits, collections can grow rapidly to accommodate non-word strings such as MIME text and other text-encoded data.

The regular expression syntax is like the regular expression syntax for UNIX but also includes extensions for substring matching. The stop word file includes several examples of regular expressions that are commented out. To limit the length of indexed words, use a regular expression such as the following one:

      ...................+

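In this expression, each period matches a single character and the trailing plus sign allows the final period to repeat, so the expression matches any word that is at least as long as the run of periods (about 20 characters). Words that match are treated as stop words and are not indexed; shorter words are indexed as usual.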
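
The behavior of a length-limiting expression can be explored outside the index server. The following sketch uses the Python re module as a stand-in for the Autonomy regular expression engine, which is an assumption: the two engines might differ in detail, and the sketch assumes that an entry must match the whole word. The function names are hypothetical.

```python
import re


def length_limit_pattern(min_length):
    """Build a pattern of min_length periods followed by a plus sign."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    return "." * min_length + "+"


def is_stopped(word, min_length=20):
    """Return True if the whole word matches the length-limit pattern."""
    return re.fullmatch(length_limit_pattern(min_length), word) is not None


print(is_stopped("indexing"))   # short word
print(is_stopped("x" * 64))     # long run of text-encoded data
```

With a pattern of 20 periods, a word of 19 characters does not match and a word of 20 or more characters does, so only the longer word would be excluded from the index under these assumptions.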

Suppression of indexing for binary data: IBM Lotus Domino XML content (style.xml)

IBM Lotus Domino XML files include binary data that is encoded as text. To prevent this binary data from being indexed, copy the suppress elements from the style.xml.dxl file to style.xml. Alternatively, if you do not intend to define zones, copy and rename style.xml.dxl to replace style.xml.

Word stem searches (style.prm)

The WORD-IDXOPTS parameter in the style.prm file determines how words are indexed to permit different types of searches. Multiple parameter settings are possible. Include the Stemdex setting to make your searches use word stems automatically. With this setting, the index server indexes the stem of each word in object text in addition to the word itself.

IMPORTANT One of the default settings for the WORD-IDXOPTS parameter is StopLangdex. Do not remove this setting.