Using the style.uni File


The universal filter is controlled by a style file called style.uni. This style file tells the universal filter which helper filters to load, and in what order, for each document type.

Here is an example of a short style.uni file that can filter Microsoft Word documents, PDF documents, and e-mail documents:


$control: 1
types:
{
  autorec: "flt_rec"
  autorec: "flt_kv -recognize"
  type: "application/msword"
    /format-filter = flt_kv
    /charset = guess
    /def-charset = 1252
  type: "application/pdf"
    /format-filter = "flt_pdf -charmapto 1252"
    /charset = guess
    /def-charset = 1252
  # this is the MIME Content Type for email messages
  type: "message/rfc822"
    /charset = guess
    /def-charset = 1252
    /content-filter = "zone -email -nocharmap"
  # if we get anything else, just skip it.
  default:
    /action = skip
}
$$

The syntax of the style.uni file is described in the next section.

Syntax of style.uni File Statements

The following table describes the statements used in a style.uni file.

Element
Description
$control: 1
The $control statement must be the first noncomment line in the style.uni file. This statement identifies the file as a Verity control file.
types:
The types statement identifies the control file as a style.uni file, and it must appear on the second noncomment line in a style.uni file.

Syntax of style.uni File Keywords

The following table describes the keywords used in a style.uni file.

Element
Description
autorec: "filter"
This argument specifies the name of the filter segment to use as an autorec segment. Valid values are:
flt_rec for the generic automatic recognizer that determines which filter type is appropriate for each document (required);
flt_kv -recognize for the WYSIWYG automatic recognizer that interprets binary file types for the WYSIWYG filters (required if you have WYSIWYG documents in binary formats).
There may be multiple autorec statements in the style.uni file. When multiple statements are used, they are installed in the order that they are specified, with the first one being attached to the gateway and the last one being attached on the other end to the universal filter. This argument can be a document data access (DDA) specification for external DDA filters written by a Verity developer.
type: "type"
The type keyword specifies what the universal filter should do with a particular document type. There may be many type keywords in the style.uni file, one for each content type. This argument specifies the name of the content type token as it is emitted from the autorec segment. It is usually in the form of class/subtype. For a complete list of the file types defined in the default style.uni file, refer to Appendix B, "Universal Filter Document Types".
default:
The default keyword specifies what the universal filter should do with any document type that is not explicitly listed with a type keyword. There can be only one default keyword in the style.uni file.
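
As an illustration of how these keywords fit together, the following annotated skeleton is a minimal sketch rather than a recommended configuration. It reuses only values that appear in the example at the start of this section, and its comments restate the rules from the table above.

```text
# Comment lines may precede the $control statement.
$control: 1
types:
{
  # Installed first, so attached to the gateway.
  autorec: "flt_rec"
  # Installed last, so attached to the universal filter.
  autorec: "flt_kv -recognize"
  # One type keyword per content type token emitted by the autorec segments.
  type: "application/pdf"
    /format-filter = "flt_pdf -charmapto 1252"
  # Only one default keyword; it covers every type not listed above.
  default:
    /action = skip
}
$$
```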

style.uni Keyword Modifiers

For a style.uni file, the type and default keywords can include one or more of the modifiers described in the following table.

Modifier
Description
/format-filter="value"
This modifier specifies that a filter will be used to extract text from a binary file. Valid values are:
flt_kv for the KeyView filters;
flt_pdf for the PDF filter;
flt_xml for the XML filter;
DDA spec for any DDA-based filter.
There can be multiple format-filter modifiers, and the binary information will be filtered through each of the specified filters in the order that they are specified in the style.uni file. The default is to install no filters.
The flt_xml filter can be run without converting meta tags to text elements by using the -nometa flag:
/format-filter="flt_xml -nometa"
The flt_kv filter can be run in-process by using the -noprot flag:
/format-filter="flt_kv -noprot"
To implement in-process filtering for KeyView filters, you must add the -noprot flag for each MIME type.

/content-filter="value"
This modifier specifies that a filter will be used for extracting meta-information from the text. Valid values are:
zone for the zone filter;
flt_meta for HTML documents with meta tags (for more information, see "Extracting META Tags as Fields" in Chapter 8, "The Zone Filter");
DDA spec for any DDA-based filter.
There may be multiple content-filter modifiers, and the text will be filtered through each of the specified filters in the order that they are specified in the style.uni file. The default is to install no filters.
/charset="name"
This modifier is used to specify the character set used to represent characters in the document after it has been format-filtered. The text will be automatically character mapped into the internal character set. The valid settings are:
guess causes the charmap segment to guess what the character set of the text is (it currently has about 99% accuracy on files larger than 512 bytes written in Western European languages);
none causes the charmap segment to pass the text through without any character set mapping;
1252 for code page 1252;
850 for IBM code page 850;
8859 for Latin1 (ISO-8859-1) encoding;
mac1 for Macintosh Roman1 encoding.
Other character sets can be specified, depending on the locale under which the search engine is currently running. The default is to perform no character set mapping.
/def-charset="name"
If the /charset modifier is set to guess, the guess can fail, for example because the file is too short to guess reliably. In that case, the /def-charset modifier specifies the character set to use for character set mapping. The valid values for name are the same as for the /charset modifier above, except guess. The default is none.
/action="action-name"
This optional modifier specifies the action to perform with documents of this type. Valid values are:
index to index a document that should be streamed as normal;
skip to skip this type of document, so that it is not indexed or viewed;
fields-only to stream this type of document for the purposes of extracting field information only to put in the internal documents table. The text of the document will not be indexed or viewed.
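
To show how these modifiers might combine, the fragment below sketches three type entries that would sit inside the types: block. It is an illustration under stated assumptions, not a tested configuration: the content type tokens "text/xml" and "text/html" are assumed here and may differ from what the autorec segment emits (see Appendix B for the defined tokens), and whether flt_meta is normally paired with another content filter is not covered in this section.

```text
# XML documents: extract text with the XML filter, leaving meta tags unconverted.
# "text/xml" is an assumed content type token.
type: "text/xml"
  /format-filter = "flt_xml -nometa"
# HTML documents: no format filter; extract meta tags with the flt_meta content filter.
# "text/html" is an assumed content type token.
type: "text/html"
  /charset = guess
  /def-charset = 8859
  /content-filter = "flt_meta"
# Word documents: run the KeyView filter in-process and keep field information only.
type: "application/msword"
  /format-filter = "flt_kv -noprot"
  /charset = guess
  /def-charset = 1252
  /action = fields-only
```

In the HTML entry, /def-charset = 8859 is the fallback used only when guessing fails. In the Word entry, /action = fields-only means the document text is not indexed or viewed.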

Configuration

You can disable filtering for a particular MIME type in the style.uni file. To disable the use of a KeyView filter for a MIME type entry, place a pound sign (#) at the beginning of each line in the entry, as shown below:

# type: "application/x-lotus-amipro"
# /format-filter = flt_kv
# /charset = guess
# /def-charset = 1252

As a result, the Verity engine will not index or display documents of the MIME type shown.
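
Going by the description of the /action modifier, an explicit skip entry is another way to state the same intent without depending on how unlisted types are handled by the default keyword. The fragment below is a sketch under that assumption, not a documented equivalent of commenting the entry out:

```text
# Sketch: keep the entry, but tell the universal filter to skip this type.
type: "application/x-lotus-amipro"
  /action = skip
```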

Changing the style.uni File

If you change the style.uni file, you must re-index any existing collection before the change applies to the documents already in it.

NOTE: If you change the style.uni file at all after you have created a collection, you run the risk of changing the way a document is filtered. The main consequence is that your highlights may become considerably inaccurate. If you must change the style.uni file, you should use dynamic highlighting from that point onward to guarantee more accurate highlights.





Copyright © 2002, Verity, Inc. All rights reserved.