Using the style.prm File


Using the style.prm file, you can specify additional data that you want included in the collection indexes. Additional data in collection indexes support some search features, like clustering, query-by-example, and the SOUNDEX operator. The style.prm file can be used to specify case-insensitive word indexes, supporting case-insensitive searching.

Feature Vectors

To use clustering (and also fast query-by-example), you must enable the generation of document feature vectors at indexing time. To do this, include a $define directive with the DOC-FEATURES parameter in the style.prm file, as follows:

$define DOC-FEATURES "TF"

The string TF specifies the type of feature vector and must be present.

TF has an optional argument, MaxFtrs n, that is only rarely used, where n specifies the number of features to store per document in the collection. The complete syntax is:

$define DOC-FEATURES "TF MaxFtrs n"

If DOC-FEATURES is defined, the VdkFeatures field is automatically included in the internal documents table schema to contain the generated feature vectors.
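
As a concrete illustration of the full syntax, the following sketch caps the stored feature vector at 50 features per document. The value 50 is an arbitrary number chosen for this example, not a documented default or a recommendation.

```
# Store term-frequency feature vectors, keeping at most 50 features
# per document (50 is a hypothetical value for illustration).
$define DOC-FEATURES "TF MaxFtrs 50"
```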

Stored Summaries

To automatically generate and store document summaries at indexing time, include the $define directive with the DOC-SUMMARIES parameter in the style.prm file, as follows:

$define DOC-SUMMARIES "type [Zone]"

where type specifies the type of summary to generate: XS, LS or LB. The type is required and must appear first in the string ahead of any of the optional parameters.

Summary Types

Type    Description
XS      Extract the "best" sentences from the document
LS      Use the first sentences from the document
LB      Use the first bytes of text from the document (with white space compressed)

Summary Type Parameters

Along with the summary type, you can also specify optional parameters as described in the following table.

Parameter      Description
MaxBytes n     n specifies the maximum size of a summary in bytes. Summaries longer than n are truncated with an ellipsis (...). This parameter is supported by all three summary types. The default value is 400.
MaxSents n     n specifies the maximum number of sentences in a summary. This parameter is supported by the XS and LS summary types. The default value is 2.
TruncSent n    n specifies the maximum length of any sentence. Sentences longer than n are truncated with an ellipsis (...). This parameter is supported by the XS and LS summary types. The default value is 400.

Zone Argument

The optional Zone argument adds the summary to the end of a document as a zone, allowing you to perform zone searches on the summary.

Examples of Stored Summaries

Generate summaries consisting of the best 3 sentences from each document, up to a maximum of 500 bytes:

$define DOC-SUMMARIES "XS MaxBytes 500 MaxSents 3"

Generate summaries consisting of the first 400 bytes of each document, with white space compressed, and store each summary as a zone:

$define DOC-SUMMARIES "LB MaxBytes 400 Zone"
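
The LS type and the TruncSent parameter can presumably be combined in the same way. The following sketch assumes that the optional parameters may appear in any order after the type (this chapter shows MaxBytes and MaxSents in both orders). The limits 2 and 200 are arbitrary values chosen for illustration.

```
# Use the first two sentences of each document, truncating any
# sentence whose length exceeds 200 (hypothetical limits).
$define DOC-SUMMARIES "LS MaxSents 2 TruncSent 200"
```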

If a $define directive with the DOC-SUMMARIES parameter is defined, the VdkSummary field is automatically included in the internal documents table schema to contain the generated summaries.

Case-Insensitive Word Indexes

Case-sensitive word indexes are built by default so case-sensitive searching occurs automatically. With case-sensitive word indexes, if a user enters a mixed-case query, the engine finds case-sensitive matches only.

If you want to support case-insensitive searches, you need to build case-insensitive word indexes by making an edit in the default style.prm file. To do this, remove the "Casedex" value in this $define directive:

$define WORD-IDXOPTS "Stemdex Casedex"

so the edited statement looks like:

$define WORD-IDXOPTS "Stemdex"

Instance Vector Encodings (PSW and WCT)

Paragraph and sentence boundaries are detected at index time by whatever lexer/tokenizer is used. The boundary determination can be based on punctuation, indentation, or anything else the lexer implements. The built-in lexer supplied with Verity products is punctuation based.

If you modify the style.prm file, you can configure the internal, punctuation-based lexer for all document types, except PDF. If you use a Verity Locale, the behavior of the tokenizer supplied with the locale cannot be modified.

PSW and WCT are two different instance vector encodings. As words are put into the word index, their positions are encoded using one of the two encodings. PSW stands for Paragraph-Sentence-Word encoding. WCT stands for Word-Count encoding. WCT encoding is implemented by default; the style.prm file included with Verity products in the default and sample style directories includes this entry:

$define IDX-CONFIG "WCT"

PSW Encoding

When PSW is used, the explicit paragraph and sentence position of each word instance is stored in the word index. The paragraph and sentence counters are incremented whenever the indexer receives a Paragraph or Sentence token from the tokenizer. The PSW encoding allows only 255 words in a sentence and only 255 sentences in a paragraph. If you index a document that has no sentence or paragraph boundaries, the indexer creates sentence and paragraph boundaries at these limits. When a limit is reached, the indexer produces a message like this:


Warn E2-0526 (Document Index): Document 1 (report.pdf):
Sentence 1 in paragraph 0 has more than 255 words - splitting
sentence.

NOTE: The PDF tokenizer does not detect a paragraph or sentence boundary. For this reason, the indexer creates sentence and paragraph boundaries when PSW encoding is selected.

PSW encoding is activated by un-commenting the following line in the style.prm file:

$define IDX-CONFIG "PSW Many"

If PSW encoding is used, NEAR and NEAR/N queries will not cross sentence or paragraph boundaries. In other words, a NEAR query returns documents in which search terms appear within N words in the same sentence and paragraph.
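
Because the Many flag is listed as optional in the style.prm syntax, PSW encoding can presumably also be selected on its own. A minimal sketch, assuming that only one IDX-CONFIG directive should be active at a time, so the default WCT line is commented out:

```
# Default word-count encoding, disabled:
#$define IDX-CONFIG "WCT"
# Paragraph-sentence-word encoding without the Many flag:
$define IDX-CONFIG "PSW"
```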

WCT Encoding

When WCT encoding is used, explicit sentence and paragraph position information is not stored in the word index. However, word count is incremented by one any time the indexer sees a Paragraph or Sentence token coming from the tokenizer. This behavior prevents phrases from spanning a sentence and/or paragraph boundary. If no sentence and/or paragraph boundaries are detected at indexing time (as with PDF documents), then phrase searches can appear to span sentence boundaries. For non-PDF documents, the Verity lexer determines sentence and paragraph boundaries, so phrase searches do not appear to span sentence boundaries.

Instance Vector Encoding and Searching

The instance vector encoding (WCT or PSW) determines how phrase searches are processed by the Verity engine. For PDF documents indexed using PSW encoding, the paragraph and sentence breaks are the artificial ones described above, so sentence breaks do not correspond to the apparent punctuation in the documents. For PDF documents indexed using WCT encoding, there are no sentence breaks, so phrases can span any and all apparent sentences in the documents.

SENTENCE and PARAGRAPH Operators

When used in queries, the semantics of the SENTENCE and PARAGRAPH operators are the same regardless of whether the collection was built with PSW or WCT encoding.

For PSW collections, the operators use stored position information.

For WCT collections, the sentence and paragraph boundaries are approximated using 15-word and 100-word rules. These word windows are applied dynamically at search time. For example, as long as the children of a SENTENCE operator are within 15 words of each other, the SENTENCE operator succeeds. This means that the SENTENCE and PARAGRAPH operators match documents if the search terms occur within a certain distance of each other, whether or not the terms occur in the same paragraph or sentence.

NOTE: If you need the SENTENCE operator to be accurate, meaning that results contain only documents in which the search terms occur in the same sentence, use PSW encoding. With WCT encoding, results might include documents in which the terms are not in the same sentence.

Soundex Data

By default, a Soundex index is not built for a collection, and thus the SOUNDEX operator will not work. To specify a Soundex index to be built, you need to include the SOUNDEX value with the WORD-IDXOPTS parameter. This syntax adds the Stemdex, Casedex and Soundex values to the default WORD-IDXOPTS parameter:

$define WORD-IDXOPTS "Stemdex Casedex Soundex"
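
If case-insensitive word indexes are also wanted, the approach described in "Case-Insensitive Word Indexes" would apply here too: leave out Casedex. This is a sketch that assumes Soundex does not depend on Casedex, which this chapter does not state explicitly.

```
# Build stem and phonetic (Soundex) indexes, without case variants.
$define WORD-IDXOPTS "Stemdex Soundex"
```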

Highlight Location Data

Application developers may want to build auxiliary highlight data, which stores information used to highlight words in retrieved documents, such as a page number or a byte offset into the original file. To build the auxiliary highlight data into the collection index, use the WORD-IDXOPTS parameter.

Building auxiliary highlight data is considered an advanced feature for application developers using the Verity Developer Kit.

NOTE: The default value stored for the highlight location data (when it is enabled) is the byte offset into the indexing stream to the document.
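
A sketch of what enabling highlight location data might look like, modeled on the commented-out example in the default style.prm file, which writes the byte count directly after the option name. Combining Location with the default Stemdex and Casedex values is an assumption made for this example.

```
# Keep the default word index options and add 4 bytes of
# application-specific location data per word instance.
$define WORD-IDXOPTS "Stemdex Casedex Location4"
```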

Qualify Instance Data

Application developers may want to build qualify instance data to store auxiliary information about words stored in each collection's full word index. To build qualify instance data into the collection index, use the WORD-IDXOPTS parameter.

$define WORD-IDXOPTS "Qualify 4"

Enable Indexing on Nouns and Noun Phrases

For Thematic Mapping, you will need to obtain a list of all the nouns and noun phrases in a collection and their term frequencies in each document. This is configurable through the style.prm parameter file.

To enable indexing on nouns and noun phrases, you should un-comment the following $define statements in the style.prm file:


#$define NOUN-IDXOPTS ""
#$define NPHR-IDXOPTS ""

The uncommented lines should appear as:


$define NOUN-IDXOPTS ""
$define NPHR-IDXOPTS ""

The statements that follow these $define directives ensure that, if these options are active and Casedex is not active (the default), nouns and noun phrases are indexed in upper case. If these options are active and Casedex is active, nouns and noun phrases are indexed case-sensitively.

See "Default style.prm File" in this chapter for a sample file.

style.prm File Syntax

The style.prm file consists of $define directives, several of which are commented out. You may uncomment $define directives to control collection index content. In the style.prm file, $define directives can appear in any order. Also, you can edit or remove $define directives, as appropriate.

NOTE: If you are changing the style.prm file for an existing collection, be sure to re-index the collection.

Blank lines are ignored, and comments are introduced with the # character.

The $define directive has this structure:

$define parameter "parameter string"
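
For instance, a short fragment that follows these rules might look like the sketch below. The directives are taken from the default style.prm file; the comment text is illustrative only.

```
# Comment lines begin with the # character.

# Active directive: word-count position encoding.
$define IDX-CONFIG "WCT"

# Directive disabled by commenting it out.
#$define DOC-FEATURES "TF"
```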

In the following table, a vertical bar means "or" (that is, choose one of the constructs), while brackets ([]) surround optional items. Slanted text is used for user-replaceable constructs.

Parameter: IDX-CONFIG
Possible Values: WCT|PSW [Many]
Meaning: Specifies the position-recording mode for the indexes, either WCT (word count) or PSW (paragraph, sentence, word). The optional Many flag can be used with WCT or PSW to specify that document length normalization occurs when scoring search results. Without the Many flag, documents are scored according to the number of occurrences of the search term. With the Many flag, documents are scored according to the ratio of occurrences to the document length.

Parameter: entity-IDXOPTS
Possible Values: Stemdex, Casedex, Soundex, Location num, Qualify num
Meaning: For each entity (as enumerated in the Types field in style.did: WORD, ZONE or ATTR), information is supplied about what to store in its index. Valid values are:
Stemdex means the index will have stem variants.
Casedex means the index will store all case variants of a word.
Soundex means the index will contain phonetic representations.
Location{1|2|3|4} means to store the specified number of bytes containing application-specific location data, for example, Location 1.
Qualify{1|2|3|4} means to save that number of bytes of application-specific qualify-instance data.
The above options were controlled previously in the style.did and style.wld files. The Location and Qualify options are intended for use by application developers only.
NOTE: When specifying more than one value where there will be spaces, include everything in double quotes. For example:
"Stemdex Casedex"

Parameter: DOC-FEATURES
Possible Values: TF [MaxFtrs num]
Meaning: TF stands for the term frequency method of feature identification. MaxFtrs indicates the maximum number of features to store per document. For more information, see "Feature Vectors."

Parameter: DOC-SUMMARIES
Possible Values: {XS|LS|LB} [MaxBytes num] [MaxSents num] [TruncSent num]
Meaning: The first part indicates what type of summaries to generate: XS = extract best sentences, LS = leading sentences, LB = leading bytes. The optional parameters limit the summaries. For examples, see "Stored Summaries." Compare with the VDKSUMMARY field in the style.ddd file.

Default style.prm File

The default style.prm file for the File System gateway is shown below.


# style.prm - collection schema parameters
#
# This file is used to enable/disable index schema features through
# macro definitions similar to those allowed by the C preprocessor.
# This file is included in other style files using $include so
# that the selected features are propagated to the schemas of all
# tables in the index. Refer to the "Using the style.prm File"
# chapter in the Collection Building Guide for more information.
# ----------------------------------------------------------------
# The IDX-CONFIG parameter defines the storage format used to
# encode the word positions in the index. WCT (Word Count) format
# is a compact format, storing the ordinal counting position of the
# word from the beginning of the document. PSW (Paragraph,
# Sentence, Word) format takes approximately 15-20% more disk
# space, but stores semantically accurate paragraph and sentence
# boundaries. Optionally, Many may be specified with either WCT or
# PSW to improve the accuracy of the <MANY> operator at the expense
# of disk space and search performance.
# This example enables Word Count word position format (the
# default).
$define IDX-CONFIG "WCT"
# This example turns on Paragraph-Sentence-Word word position
# format.
# It also enables the <MANY> operator accuracy improvement.
#$define IDX-CONFIG "PSW Many"
# ----------------------------------------------------------------
# The IDXOPTS parameters define which index options are applied to
# the various index token tables. The following index options are
# supported for each: Stemdex enables an index by the stem of each
# word. Casedex stores all case variants of a word separately, so
# one can search for case-sensitive terms such as "Jobs", "Apple",
# and "NeXT" more easily. Soundex stores phonetic representations
# of the word, using AT&T's standard soundex algorithm. The
# application may also store 1-4 bytes of application-specific
# data with each word instance, in the form of Location data and/or
# Qualify Instance data. These options are specified separately
# for each token table: word, zone, and zone attribute.
$define WORD-IDXOPTS "Stemdex Casedex"
$define ZONE-IDXOPTS ""
$define ATTR-IDXOPTS "Casedex"
#$define NOUN-IDXOPTS ""
#$define NPHR-IDXOPTS ""
$ifdef NOUN-IDXOPTS
$ifdef NPHR-IDXOPTS
$define NNP
$endif
$endif
# The following example shows how to associate 4 bytes of Location
# and Qualify data with each word instance.
#$define WORD-IDXOPTS "Location4 Qualify4"
# ----------------------------------------------------------------
# Clustering is enabled by uncommenting the DOC-FEATURES line.
# This stores a feature vector for each document in the
# Documents table. These features are used for Clustering
# results and fast Query-by-Example. See the discussions on
# Clustering in the Collection Building Guide for more
# information.
$define DOC-FEATURES "TF"
# ----------------------------------------------------------------
# Document Summarization is enabled by uncommenting one of
# the DOC-SUMMARIES lines below. The summarization data is
# stored in the documents table so that it might easily be
# shown when displaying the results of a search.
# See the discussions on Document Summarization in the
# Collection Building Guide for more information.
# The example below stores the best three sentences of
# the document, but not more than 500 bytes.
$define DOC-SUMMARIES "XS MaxSents 3 MaxBytes 500"
# The example below stores the first four sentences of
# the document, but not more than 500 bytes.
#$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 500"
# The example below stores the first 150 bytes of
# the document, with whitespace compressed.
#$define DOC-SUMMARIES "LB MaxBytes 150"





Copyright © 2002, Verity, Inc. All rights reserved.