Using the Zone Filter


By default, the Verity engine invokes the universal filter with the zone filter as a helper filter. If you open the default style.uni file you will see that it includes zone filter specifications for the built-in modes for HTML, e-mail, and Usenet news.

The zone filter can index, and retrieve for display, documents in tagged ASCII formats such as HTML, SGML, e-mail, and Usenet news. The search engine defines the tagged regions of text as zones, and users can search these zones.

If, for a particular collection, you know you will only be indexing one of these tagged ASCII formats, and no other format, then you can avoid the overhead of the universal filter and instead specify a zone filter directly in the style.dft file as described below.

For more information about the universal filter and the syntax of the style.uni and style.dft files, refer to Chapter 6, "Document Filters and Formatting."

Specifying the Zone Filter

The zone filter can be invoked together with the universal filter or as a single filter. By default, the built-in zone modes (for HTML, e-mail, and Usenet news) are all invoked with the universal filter. The zone filter specifications for the built-in zone modes are included in the default style.uni file.

To invoke the zone filter with the universal filter, you need to specify the filter in the style.uni file using the type keyword with the /content-filter modifier, as shown in the sample style.uni syntax below:


type: text/html
/charset = guess
/def-charset = 1252
/content-filter = "zone -html -nocharmap"

NOTE: The -nocharmap argument specifies that the zone filter will not perform character set mapping; instead, the universal filter will use its character set recognizer.

To invoke the zone filter as a single filter, you need to specify the filter in the style.dft file using the field keyword and the /filter modifier, as shown in the sample style.dft file syntax below:


field: DOC
/filter="zone -html"

If the zone filter is invoked as a single filter, the engine indexes every document in the specified mode, so the collection is limited to documents in that format.

Built-in Zone Mode Options

To invoke a built-in zone mode, you need to specify one of the options shown below with the zone argument.

Option    Description
-html     Specifies Hypertext Markup Language format for World Wide Web documents.
-email    Internet e-mail conforming to the RFC822 standard.
-news     Internet Usenet news conforming to the RFC822 standard.

Character Mapping Options

To parse a document properly, the zone filter needs to know which character set the text file is written in. The character mapping options are general enough that you can use them for all Web documents, even those in, for example, Korean, Chinese, Russian, and Czech.

When the zone filter is invoked with the universal filter as a helper filter, the universal filter's character set recognizer should be used. For this reason, you should include the -nocharmap option in your filter specification.

When the zone filter is used as a single filter for the collection, character mapping can be implemented using one of the following options in the filter specification.

Option              Description
-precharmap name    Tells the zone filter to map from the named character set before parsing the document. The name variable is the name of the character set to map from.
-autocharmap        Tells the zone filter to guess the character set of the document, and then map from that character set to the internal character set before parsing the document.
-nocharmap          Tells the zone filter not to perform any character set mapping before parsing the document. When the -html flag and the -nocharmap flag are given together, the document will not be mapped from the HTML standard of ISO-8859-1 before being parsed.

For example, to invoke the zone filter as a single filter and to perform character mapping before parsing the documents, you need to specify the filter in the style.dft file using the field keyword and the /filter modifier, as shown in the sample style.dft file syntax:


field: DOC
/filter="zone -html -precharmap 850"

A key purpose of the -precharmap and the -autocharmap options is to support Japanese and other Asian languages in which Web pages are commonly written in a multibyte character set rather than the HTML standard of ISO-8859-1. In Japan in particular, there are three common character sets, and a Web page might be written in any of those three.

In a case like this, the -autocharmap option is a convenient way of handling a Web page written in an unknown character set. The same mechanism also works for Web pages written in other locales, as long as that locale supports a character set detection function. (Verity's Western European locales do support this function.)
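The difference between mapping from a named character set and guessing the character set can be illustrated outside the engine. The Python sketch below is not Verity code and does not show how the zone filter is implemented. The function names precharmap and autocharmap are hypothetical, borrowed from the option names for readability. It also assumes that the name 850 corresponds to IBM code page 850, and that the three common Japanese character sets are ISO-2022-JP, EUC-JP, and Shift_JIS; the text above does not name them.

```python
# Illustrative sketch only: mimics the idea behind -precharmap and
# -autocharmap using Python's standard codecs. Not Verity's implementation.

# Assumption: these are the three common Japanese character sets.
# ISO-2022-JP is tried first because it is a 7-bit encoding and rejects
# the high bytes that the other two encodings rely on.
JAPANESE_CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def precharmap(raw, charset_name):
    """Map bytes from a known, named character set before parsing."""
    return raw.decode(charset_name)


def autocharmap(raw, candidates=JAPANESE_CANDIDATES):
    """Guess the character set by trying each candidate in order.

    Returns (charset_name, text) for the first candidate that decodes
    the input without error. This is a simple heuristic, not a real
    character set recognizer.
    """
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate character set matched the input")


if __name__ == "__main__":
    # Byte 0x82 is "e with acute accent" in code page 850.
    print(precharmap(b"caf\x82", "cp850"))

    # The same Japanese text encoded three different ways.
    for encoding in JAPANESE_CANDIDATES:
        guessed, text = autocharmap("日本語".encode(encoding))
        print(guessed, text)
```

Strict decoding in a fixed order is a deliberately naive form of detection. It works here because the three encodings reject most of each other's byte sequences, but short or ambiguous input can still be guessed incorrectly, which is why a named character set is preferable when it is known.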

Extracting META Tags as Fields

The universal filter supports filtering <META> tags, which are typically defined in HTML documents, using the special "flt_meta" content filter with the "zone" filter. The "flt_meta" filter can be specified for the text/html document type in the style.uni file.

The default style.uni file automatically invokes the "flt_meta" content filter with the "zone" filter. Here is an example of a short style.uni file that can filter HTML documents with meta tags (the type: statement below also appears in the default style.uni file):


$control: 1
types:
{
autorec: "flt_rec"
autorec: "flt_kv -recognize"
type: text/html
/charset = guess
/def-charset = 1252
/content-filter = "zone -html -nocharmap"
/content-filter = "flt_meta"
# if we get anything else, just skip it
default:
/action = skip
}
$$

The "flt_meta" filter watches markup tokens in the document stream. When the filter encounters a <META> tag, it produces a field token based on the tag, and the field token is then stored as a field in the collection. In the collection's internal documents table, the field name is the value of the meta tag's name attribute, and the field value is the value of the content attribute in the meta tag.

A sample <META> tag in HTML is shown below:

<META name="Abstract" content="This is a long document">

When filtering the HTML above, the "flt_meta" filter produces a field token of this form:

ABSTRACT: This is a long document

The field named "Abstract" is then populated with the value "This is a long document." For the filter to populate a field, a field definition that corresponds to the meta tag's name attribute must appear in the style.ufl or style.sfl file. In the example above, the default style.sfl file aliases the "Abstract" field to the "Snippet" field, so you would not need to add a field definition.
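The mapping from a <META> tag to a field token can be sketched with a few lines of Python. This is only an illustration of the name-to-field, content-to-value rule described above, built on Python's standard html.parser module. It is not the "flt_meta" filter, and the class and function names are hypothetical. Upper-casing the field name is an assumption made to match the ABSTRACT token shown earlier.

```python
# Illustrative sketch only: shows the META-tag-to-field mapping described
# in the text. This is not Verity's flt_meta filter.
from html.parser import HTMLParser


class MetaFieldCollector(HTMLParser):
    """Collect <META name=... content=...> pairs as field name/value pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # Assumption: field names are upper-cased, as in "ABSTRACT".
            self.fields[name.upper()] = content


def extract_meta_fields(html_text):
    """Return a dict of field names to values found in META tags."""
    parser = MetaFieldCollector()
    parser.feed(html_text)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    sample = '<META name="Abstract" content="This is a long document">'
    for field, value in extract_meta_fields(sample).items():
        print(f"{field}: {value}")
```

Run as a script, the sketch prints ABSTRACT: This is a long document. In the engine, the field is only populated when a matching field definition exists in the style.ufl or style.sfl file; the sketch has no equivalent of that check.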

Extracting Zones as Fields

The "zone" filter supports a method for extracting zones as fields that differs from the method used by the "flt_meta" filter to extract meta tags as fields. The "zone" filter watches HTML in the document stream and produces field tokens based on the zone name(s) specified in the style.zon file, where a zone name corresponds to a tag name. In the collection's internal documents table, the tag name becomes the field name and the text enclosed by the tag becomes the field value.

For example, when the zone filter encounters this HTML:

<TITLE>This is the title</TITLE>

the filter produces the following field token:

TITLE: This is the title

Since TITLE is defined as a standard field by default, the zone filter populates the TITLE field with the value "This is the title." For more information about extracting zones as fields, refer to Chapter 8, "The Zone Filter."
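The zone-to-field behavior can be sketched in the same spirit: the tag name becomes the field name and the enclosed text becomes the value. The Python below is a minimal illustration using the standard html.parser module. It is not the "zone" filter, the class and function names are hypothetical, and the zone_names argument is only a stand-in for the zone names that would be listed in the style.zon file.

```python
# Illustrative sketch only: shows how the text of a named zone could
# become a field value. This is not Verity's zone filter.
from html.parser import HTMLParser


class ZoneFieldCollector(HTMLParser):
    """Capture the text enclosed by selected tags as field values."""

    def __init__(self, zone_names):
        super().__init__()
        # Hypothetical stand-in for the zone names given in style.zon.
        self.zone_names = {name.lower() for name in zone_names}
        self.fields = {}
        self._open_zone = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing when a configured zone opens (no nesting here).
        if self._open_zone is None and tag in self.zone_names:
            self._open_zone = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_zone is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Close the zone and store its text under the upper-cased tag name.
        if tag == self._open_zone:
            self.fields[tag.upper()] = "".join(self._buffer).strip()
            self._open_zone = None


def extract_zone_fields(html_text, zone_names):
    """Return a dict mapping zone (tag) names to their enclosed text."""
    parser = ZoneFieldCollector(zone_names)
    parser.feed(html_text)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    fields = extract_zone_fields("<TITLE>This is the title</TITLE>", ["title"])
    for field, value in fields.items():
        print(f"{field}: {value}")
```

Run as a script, the sketch prints TITLE: This is the title. It ignores nested zones and tag attributes, both of which a real filter would have to handle.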





Copyright © 2002, Verity, Inc. All rights reserved.