Using the style.prm file, you can specify additional data that you want included in the collection indexes. Additional data in collection indexes support some search features, like clustering, query-by-example, and the SOUNDEX operator. The style.prm file can be used to specify case-insensitive word indexes, supporting case-insensitive searching.

Feature Vectors
To use clustering (and also fast query-by-example), you must enable the generation of document feature vectors at indexing time. To do this, include a $define directive with the DOC-FEATURES parameter in the style.prm file, as follows:
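The directive itself does not survive in this copy of the chapter. Judging from the default style.prm file reproduced at the end of this chapter, the minimal form is presumably:

```
$define DOC-FEATURES "TF"
```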
TF specifies the type of feature vector and must be present. TF has an optional argument, MaxFtrs n, that is only rarely used, where n specifies the number of features to store per document in the collection. The complete syntax is:
- $define DOC-FEATURES "TF MaxFtrs n"
When DOC-FEATURES is defined, the VdkFeatures field is automatically included in the internal documents table schema to contain the generated feature vectors.

To generate document summaries at indexing time, include a $define directive with the DOC-SUMMARIES parameter in the style.prm file, as follows:
- $define DOC-SUMMARIES "type [Zone]"
type specifies the type of summary to generate: XS, LS, or LB. The type is required and must appear first in the string, ahead of any of the optional parameters.

| Type | Description |
|---|---|
| XS | Extract the "best" sentences from the document |
| LS | Use the first sentences from the document |
| LB | Use the first bytes of text from the document (with white space compressed) |

When a $define directive with the DOC-SUMMARIES parameter is defined, the VdkSummary field is automatically included in the internal documents table schema to contain the generated summaries.
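
For illustration, the default style.prm file reproduced at the end of this chapter contains one DOC-SUMMARIES line per summary type, and its comments describe enabling one of them. In that file the XS form is active and the other two are commented out:

```
# Best three sentences of the document, but not more than 500 bytes
$define DOC-SUMMARIES "XS MaxSents 3 MaxBytes 500"
# First four sentences of the document, but not more than 500 bytes
#$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 500"
# First 150 bytes of the document, with white space compressed
#$define DOC-SUMMARIES "LB MaxBytes 150"
```
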
If you want to support case-insensitive searches, you need to build case-insensitive word indexes by making an edit in the default style.prm file. To do this, remove the "Casedex" value in this $define directive:
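The directive is missing from this copy. In the default style.prm file reproduced at the end of this chapter, the word index options read as on the commented line below; removing Casedex would presumably leave the second form. This is a sketch that assumes WORD-IDXOPTS is the directive meant here:

```
# Default entry: case variants of each word are stored separately
#$define WORD-IDXOPTS "Stemdex Casedex"
# Edited entry (assumed): Casedex removed for a case-insensitive word index
$define WORD-IDXOPTS "Stemdex"
```
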
If you modify the style.prm file, you can configure the internal, punctuation-based lexer for all document types, except PDF. If you use a Verity Locale, the behavior of the tokenizer supplied with the locale cannot be modified.

PSW and WCT are two different instance vector encodings. As words are put into the word index, their positions are encoded using one of the two encodings. PSW stands for Paragraph-Sentence-Word encoding. WCT stands for Word-Count encoding. WCT encoding is implemented by default; the style.prm file included with Verity products in the default and sample style directories includes this entry:
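The entry is not reproduced in this copy. In the default style.prm file shown at the end of this chapter, the corresponding line is:

```
$define IDX-CONFIG "WCT"
```
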
- Warn E2-0526 (Document Index): Document 1 (report.pdf):
- Sentence 1 in paragraph 0 has more than 255 words - splitting
- sentence.
PSW encoding is activated by un-commenting the following line in the style.prm file:
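The line referred to is missing here. The default style.prm file at the end of this chapter carries a commented-out PSW entry that also turns on the optional Many setting. A sketch of the edited file follows; commenting out the WCT entry is an assumption, made so that only one IDX-CONFIG definition remains:

```
# Assumption: disable the default WCT entry so only one IDX-CONFIG is defined
#$define IDX-CONFIG "WCT"
# PSW entry from the default file, with the leading # removed
$define IDX-CONFIG "PSW Many"
```
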
With PSW encoding, NEAR and NEAR/N queries will not cross sentence or paragraph boundaries. In other words, a NEAR query returns documents in which the search terms appear within N words of each other in the same sentence and paragraph.
The SENTENCE and PARAGRAPH operators are the same regardless of whether the collection was built with PSW or WCT encoding. For PSW collections, the operators use stored position information.
For WCT indexes, the sentence and paragraph boundaries are approximated using 15-word and 100-word rules. These word windows are applied dynamically at search time (for example, as long as the children of a SENTENCE operator are within 15 words of each other, the SENTENCE operator will succeed). This means that SENTENCE and PARAGRAPH operators match documents if the search terms occur within a certain distance of each other, whether or not the terms occur in the same paragraph or sentence.

NOTE: If you need the SENTENCE operator to be accurate, meaning results contain only documents that have the search terms in the same sentence, use PSW encoding. With WCT encoding, results might include documents in which the words are not in the same sentence.

Unless a Soundex index is built, the SOUNDEX operator will not work. To specify that a Soundex index be built, include the Soundex value with the WORD-IDXOPTS parameter. This syntax sets the Stemdex, Casedex, and Soundex values on the default WORD-IDXOPTS parameter:
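The directive is not shown in this copy. Going by the values named above and the form of the WORD-IDXOPTS entry in the default style.prm file, it would presumably read:

```
$define WORD-IDXOPTS "Stemdex Casedex Soundex"
```
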
WORD-IDXOPTS parameter.

Building auxiliary highlight data is considered an advanced feature for application developers using the Verity Developer Kit.

NOTE: The default value stored for the highlight location data (when it is enabled) is the byte offset into the indexing stream to the document.

WORD-IDXOPTS parameter.
style.prm parameter file.

To enable indexing on nouns and noun phrases, un-comment the following $define statements in the style.prm file:
- #$define NOUN-IDXOPTS ""
- #$define NPHR-IDXOPTS ""
The uncommented lines should appear as:
- $define NOUN-IDXOPTS ""
- $define NPHR-IDXOPTS ""
The statements following the $define directives ensure that if these options are active and Casedex is not active (the default), nouns and noun phrases are indexed in upper case. If these options are active and Casedex is active, nouns and noun phrases are indexed case-sensitively.
See "Default style.prm File" in this chapter for a sample file.
style.prm File Syntax
The style.prm file consists of $define directives, several of which are commented out. You may uncomment $define directives to control collection index content. In the style.prm file, $define directives can appear in any order. Also, you can edit or remove $define directives, as appropriate.

If you modify the style.prm file for an existing collection, be sure to re-index the collection.

Blank lines are ignored, and comments are introduced with the # character.

Each $define directive has this structure:
- $define parameter "parameter string"
Square brackets ([]) surround optional items. Slanted text is used for user-replaceable constructs.
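
As a concrete instance of this structure, the following lines are taken from the default style.prm file reproduced later in this chapter. Here IDX-CONFIG is the parameter and the quoted text is the parameter string; the last directive is disabled by the leading # character.

```
# This example enables Word Count word position format (the default).
$define IDX-CONFIG "WCT"
#$define IDX-CONFIG "PSW Many"
```
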
Default style.prm File
The default style.prm file for the File System gateway is shown below.
- # style.prm - collection schema parameters
- #
- # This file is used to enable/disable index schema features through
- # macro definitions similar to those allowed by the C preprocessor.
- # This file is included in other style files using $include so
- # that the selected features are propagated to the schemas of all
- # tables in the index. Refer to the "Using the style.prm File"
- # chapter in the Collection Building Guide for more information.
- # ----------------------------------------------------------------
- # The IDX-CONFIG parameter defines the storage format used to
- # encode the word positions in the index. WCT (Word Count) format
- # is a compact format, storing the ordinal counting position of the
- # word from the beginning of the document. PSW (Paragraph,
- # Sentence, Word) format takes approximately 15-20% more disk
- # space, but stores semantically accurate paragraph and sentence
- # boundaries. Optionally, Many may be specified with either WCT or
- # PSW to improve the accuracy of the <MANY> operator at the expense
- # of disk space and search performance.
- # This example enables Word Count word position format (the
- # default).
- $define IDX-CONFIG "WCT"
- # This example turns on Paragraph-Sentence-Word word position
- # format.
- # It also enables the <MANY> operator accuracy improvement.
- #$define IDX-CONFIG "PSW Many"
- # ----------------------------------------------------------------
- # The IDXOPTS parameters define which index options are applied to
- # the various index token tables. The following index options are
- # supported for each: Stemdex enables an index by the stem of each
- # word. Casedex stores all case variants of a word separately, so
- # one can search for case-sensitive terms such as "Jobs", "Apple",
- # and "NeXT" more easily. Soundex stores phonetic representations
- # of the word, using AT&T's standard soundex algorithm. The
- # application may also store 1-4 bytes of application-specific
- # data with each word instance, in the form of Location data and/or
- # Qualify Instance data. These options are specified separately
- # for each token table: word, zone, and zone attribute.
- $define WORD-IDXOPTS "Stemdex Casedex"
- $define ZONE-IDXOPTS ""
- $define ATTR-IDXOPTS "Casedex"
- #$define NOUN-IDXOPTS ""
- #$define NPHR-IDXOPTS ""
- $ifdef NOUN-IDXOPTS
- $ifdef NPHR-IDXOPTS
- $define NNP
- $endif
- $endif
- # The following example shows how to associate 4 bytes of Location
- # and Qualify data with each word instance.
- #$define WORD-IDXOPTS "Location4 Qualify4"
- # ----------------------------------------------------------------
- # Clustering is enabled by uncommenting the DOC-FEATURES line.
- # This stores a feature vector for each document in the
- # Documents table. These features are used for Clustering
- # results and fast Query-by-Example. See the discussions on
- # Clustering in the Collection Building Guide for more
- # information.
- $define DOC-FEATURES "TF"
- # ----------------------------------------------------------------
- # Document Summarization is enabled by uncommenting one of
- # the DOC-SUMMARIES lines below. The summarization data is
- # stored in the documents table so that it might easily be
- # shown when displaying the results of a search.
- # See the discussions on Document Summarization in the
- # Collection Building Guide for more information.
- # The example below stores the best three sentences of
- # the document, but not more than 500 bytes.
- $define DOC-SUMMARIES "XS MaxSents 3 MaxBytes 500"
- # The example below stores the first four sentences of
- # the document, but not more than 500 bytes.
- #$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 500"
- # The example below stores the first 150 bytes of
- # the document, with whitespace compressed.
- #$define DOC-SUMMARIES "LB MaxBytes 150"