The P8CSE50 style set has many indexing features turned off by default compared to the FileNet_FileSystem_PushAPI style set. These features often affect the functionality of the Verity Query Language operators that you use in a full-text search expression. Enabling indexing features can increase the size of your collections and decrease system performance for indexing and searching objects. How much performance is affected can vary significantly depending on the text that is indexed. The focus here is on feature functionality, although the possible impact of a feature on performance is mentioned when appropriate.
For information about style set files and syntax, consult the Collection Reference in your Autonomy documentation. (For information about accessing the Autonomy documentation, see Accessing the Autonomy K2 documentation.) This topic focuses on the P8CSE50 style set and briefly explains how to configure the features described in the sections that follow in your copy of the style set.
After completing your style set modifications, you must apply your modifications to the object store. For more information, see Change a style set for an object store.
Case-sensitive searches (style.prm)
The WORD-IDXOPTS parameter in the style.prm file determines how words are indexed to permit different types of searches. Multiple parameter settings are possible. Include the Casedex setting to make your searches case-sensitive.
IMPORTANT: One of the default settings for the WORD-IDXOPTS parameter is StopLangdex. Do not remove this setting.
(style.sfl, style.ufl, style.xml, and style.uni)
Fields and zones are alternative ways of specifying portions of object text that can be searched separately from other portions of object text. IBM Legacy Content Search Engine implements the CBR-enabled properties for a class as zones. For example, the CONTAINS(d.DocumentTitle, 'lion') function call in a CBR query is equivalent to the following search expression in Verity Query Language (VQL): lion <IN> DocumentTitle.
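The equivalence above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name contains_to_vql is hypothetical (it is not part of any IBM or Autonomy API), it covers only the single-term case from the example, and it assumes the property reference has the form alias.PropertyName.

```python
def contains_to_vql(property_ref: str, term: str) -> str:
    """Return the VQL zone search equivalent to CONTAINS(property_ref, term).

    Hypothetical helper for illustration only. It handles the simple
    single-term case and performs no escaping or validation.
    """
    # Drop the class alias ("d." in "d.DocumentTitle"); the zone is named
    # after the property itself.
    zone = property_ref.rsplit(".", 1)[-1]
    return f"{term} <IN> {zone}"


print(contains_to_vql("d.DocumentTitle", "lion"))  # lion <IN> DocumentTitle
```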
You can configure other zones or fields based on markup tags in object content. Zone and field configuration is necessary only if you intend to explicitly search the zones or fields by using VQL operators such as <IN>.
Compared to fields, zones permit smaller collection sizes and consequently provide better system performance for object indexing and searching. Zones can be defined in the following ways:
- Copy and rename the style.xml.zones file to replace style.xml. Modify style.xml to configure zones as needed for your purposes. For more information, see the commented-out examples in style.xml.zones.
- Modify the style.uni file. Add zone as one of the /content-filter parameters as shown in the following example for HTML documents:
    type: "text/html"
      /charset = guess
      /def-charset = 1252
      /content-filter = "zone -html -nocharmap" # HTML-specific filter
      /content-filter = "flt_meta" # meta tag filter
- Modify the style.uni file. Add -zoned as one of the /format-filter parameters as shown in the following example:
    /format-filter = "flt_kv -zoned"
Avoid defining zones for documents that tend to have a large amount of zone data, such as PDF, Microsoft Word, Microsoft Excel, and Microsoft PowerPoint documents. Such documents can cause collection sizes to increase and indexing performance to decrease.
To define fields, modify the following files:
- style.ufl
- style.sfl
(style.plc)
The merge_maxdocs parameter in the style.plc file sets the maximum number of indexed objects for a collection partition. (The word partition as used here does not refer to date partitioning.) A partition is created for each indexing batch that Content Engine sends to the Autonomy index server. The partitions for a collection are stored in the parts directory for the collection and have a .did file extension. Partitions are merged into larger partitions when the Autonomy K2 index server periodically optimizes collection storage.
If the size of a partition exceeds 1 GB, reduce the value for this parameter from the default of 65520 to 32760. Large partitions that are 1 GB or more in size can cause long collection optimization times. Also, if the index server attempts to create a merged partition greater than 2 GB, many merge errors might be reported. The optimization attempt might fail (or might fail to complete).
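The size check described above could be automated along the following lines. This is a rough sketch, not a supported tool: the function names are hypothetical, 1 GB is interpreted as 2**30 bytes, and each partition is assumed to be a single .did file located directly in the parts directory.

```python
from pathlib import Path

GIB = 1024 ** 3           # 1 GB, interpreted here as 2**30 bytes (assumption)
DEFAULT_MAXDOCS = 65520   # default merge_maxdocs value quoted in the text
REDUCED_MAXDOCS = 32760   # reduced merge_maxdocs value quoted in the text


def oversized_partitions(parts_dir, limit_bytes=GIB):
    """Return the .did files directly under parts_dir larger than limit_bytes."""
    return sorted(
        p for p in Path(parts_dir).glob("*.did")
        if p.is_file() and p.stat().st_size > limit_bytes
    )


def suggested_merge_maxdocs(parts_dir, limit_bytes=GIB):
    """Suggest a merge_maxdocs value: reduced if any partition is oversized."""
    if oversized_partitions(parts_dir, limit_bytes):
        return REDUCED_MAXDOCS
    return DEFAULT_MAXDOCS
```

The helper only reports a suggested value; the parameter itself still has to be changed in style.plc and the style set applied to the object store.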
As an alternative or in addition to reducing the maximum partition size, configure the style set to avoid indexing documents with the following characteristics:
(style.prm)
The IDX-CONFIG parameter in the style.prm file affects the accuracy of the <NEAR> and <MANY> operators. (You use these operators when searching for terms based on the grouping of the terms within the text.) Set the parameter to improve search result accuracy for the following operators:
- <NEAR> operator: PSW
- <MANY> operator: WCT Many
- <NEAR> and <MANY> operators: PSW Many
For example, to improve search result accuracy for the <NEAR> operator, set the parameter value to PSW.
The style.prm file might have several lines that set this parameter to different values. Uncomment only one of these lines. (The parameter can be set to one value only.) The default setting is WCT. Changing the default setting can cause collections to become larger and consequently decrease system performance for object searching.
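The mapping between operators and IDX-CONFIG values can be summarized as a small lookup. The function below is a hypothetical sketch that only restates the values listed above; it does not read or write style.prm.

```python
def idx_config_value(use_near: bool, use_many: bool) -> str:
    """Return the IDX-CONFIG value suited to the operators you plan to use.

    Hypothetical helper; the returned strings are the values listed in the
    text, and only one value can be active in style.prm at a time.
    """
    if use_near and use_many:
        return "PSW Many"
    if use_near:
        return "PSW"
    if use_many:
        return "WCT Many"
    return "WCT"  # default setting


print(idx_config_value(use_near=True, use_many=False))  # PSW
```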
(style.stp)
To define stop words, copy and rename the style.stp.disabled file to style.stp and modify style.stp as needed. Each stop word must be left-justified with one stop word per line. A stop word cannot be a phrase.
If your indexes are case-sensitive, you must add all case variations for the stop word. For example, to filter out the word "the", you must include entries for both "the" and "The". For information about setting the case-sensitivity of searches, see Case-sensitive searches.
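Generating the case variations by hand is error-prone, so a short script can help. The following is a minimal sketch with hypothetical function names. It produces the lowercase, capitalized, and all-uppercase forms of each word; including the all-uppercase form is an assumption beyond the example in the text, and other mixed-case spellings are not generated. Phrases are rejected because a stop word cannot be a phrase.

```python
def case_variants(word):
    """Return lowercase, Capitalized, and UPPERCASE forms without duplicates."""
    variants = []
    for form in (word.lower(), word.capitalize(), word.upper()):
        if form not in variants:
            variants.append(form)
    return variants


def stop_word_lines(words, case_sensitive=True):
    """Return stop-word entries, one per line, left-justified."""
    lines = []
    for raw in words:
        word = raw.strip()
        # A stop word cannot be empty or a phrase.
        if not word or any(ch.isspace() for ch in word):
            raise ValueError(f"not a single word: {raw!r}")
        entries = case_variants(word) if case_sensitive else [word]
        for entry in entries:
            if entry not in lines:
                lines.append(entry)
    return lines


print("\n".join(stop_word_lines(["the", "of"])))
```

The printed lines can be pasted into style.stp; review them before use, because the set of variations you need depends on your content.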
You can specify a regular expression as a stop word. In particular, you can use a regular expression to limit the length of indexed words and the length of indexed numerical values. These kinds of limits help ensure acceptable system performance for object indexing and searching. Without these limits, collections can grow rapidly to accommodate non-word strings such as MIME text and other text-encoded data.
The regular expression syntax is like the regular expression syntax for UNIX but also includes extensions for substring matching. The stop word file includes several examples of regular expressions that are commented out. To limit the length of indexed words, use a regular expression such as the following one:
...................+
This example allows only the first 20 characters of a word to be indexed.
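The behavior of a pattern of this shape can be explored with Python's re module. This is an approximation only: Python's regular expression engine is not the one used by the index server, the function names are hypothetical, and the sketch assumes that a stop-word pattern must match an entire word. It shows only which words such a pattern matches, not how the index server then treats them.

```python
import re


def min_length_pattern(n: int) -> str:
    """Build a pattern of n dots followed by '+'.

    The final dot is repeated one or more times, so the pattern matches
    any string of at least n characters.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return "." * n + "+"


def matches_stop_pattern(word: str, n: int) -> bool:
    """Return True if the whole word matches the n-dot pattern."""
    return re.fullmatch(min_length_pattern(n), word) is not None


for word in ("ab", "abc", "abcdef"):
    print(word, matches_stop_pattern(word, 3))
```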
(style.xml)
The IBM Lotus Domino XML files include binary data that is encoded as text. To prevent this binary data from being indexed, copy the suppress elements from the style.xml.dxl file to style.xml. Alternatively, if you do not intend to define zones, copy and rename style.xml.dxl to replace style.xml.
(style.prm)
The WORD-IDXOPTS parameter in the style.prm file determines how words are indexed to permit different types of searches. Multiple parameter settings are possible. Include the Stemdex setting to make your searches use word stems automatically. The index server indexes the stem of each word in object text in addition to the text.
IMPORTANT: One of the default settings for the WORD-IDXOPTS parameter is StopLangdex. Do not remove this setting.