Defining Zones as Collection Fields


Any zone can also be extracted as a collection field. The differences between zones and fields are described under "Zones vs. Fields," earlier in this chapter.

The following lines in a style.zon file show how to extract the To, From, and Subject lines from an e-mail message as fields as well as zones:


$control: 1
zonespec:
{
# Extract all header lines of this email message as zones
header: *
# Extract these three header lines as fields in addition to
# zones
header: From
/field = yes
header: To
/field = yes
header: Subject
/field = yes
}
$$
Fields listed in a style.zon file must also be listed in the collection's style.ufl or style.sfl file. Otherwise, when the zone filter extracts these fields, the indexer will have no place to store the values and the values will be ignored.

The field definitions for the Verity standard fields are included in the default style.sfl file. In this file, there are field definitions for several fields including "Subject", which is aliased to the field named "Title". All of the built-in zone filter modes automatically populate "Title" by default. The e-mail and news zone modes populate the "To" and "From" fields.

Although the standard field definitions let the zone filter store certain zones as collection fields, sometimes you need to create custom field definitions. For example, if you are using the HTML zone mode and you want to extract the "To" and "From" zones as custom fields, you need to provide a field definition in the style.ufl file for each of those zone names.

Here is the data table of the style.ufl file to use with the previous style.zon file ("Subject" is omitted because the default style.sfl file already defines it):


data-table: ddf
{
# User fields go here. These fields are also listed in
# the style.zon file
varwidth: From dd1
varwidth: To dd4
}
For information about making a field definition in the style.ufl file, refer to Chapter 4, "Field Definitions."
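
The requirement that every field requested in style.zon also has a field definition can be checked mechanically. The following Python sketch is illustrative only and is not part of the Verity tools: the function names are hypothetical, the parsing handles only the "header:", "/field = yes", and "varwidth:" lines shown in the examples above, and it assumes that field names can be compared without regard to case.

```python
import re

def fields_requested(zon_text):
    """Return zone names that are followed by '/field = yes' (simplified parsing)."""
    requested = []
    current = None  # most recent "header:" zone name seen
    for raw in zon_text.splitlines():
        line = raw.strip()
        header = re.match(r"header:\s*(\S+)", line)
        if header:
            current = header.group(1)
        elif current and re.match(r"/field\s*=\s*yes\b", line):
            requested.append(current)
            current = None
    return requested

def fields_defined(ufl_text):
    """Return field names from 'varwidth:' lines, the only keyword in the example."""
    defined = []
    for raw in ufl_text.splitlines():
        entry = re.match(r"varwidth:\s*(\S+)", raw.strip())
        if entry:
            defined.append(entry.group(1))
    return defined

def missing_definitions(zon_text, ufl_text, standard_fields=()):
    """List requested fields with no definition (case-insensitive: an assumption)."""
    known = {name.lower() for name in fields_defined(ufl_text)}
    known |= {name.lower() for name in standard_fields}
    return [n for n in fields_requested(zon_text) if n.lower() not in known]

ZON = """\
header: *
header: From
/field = yes
header: To
/field = yes
header: Subject
/field = yes
"""

UFL = """\
varwidth: From dd1
varwidth: To dd4
"""

# "Subject" is passed as a standard field because the default style.sfl defines it.
print(missing_definitions(ZON, UFL, standard_fields={"Subject"}))
```

If the standard fields are left out of the call, the sketch reports "Subject" as missing, which mirrors the case where the indexer would have no place to store the value.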

Extracting HTML Zones as Fields

The "zone" filter supports a method for extracting zones as fields that differs from the method used by the "flt_meta" filter to extract meta tags as fields (as described in the next section, "Extracting META Tags as Fields").

The "zone" filter watches HTML in the document stream and produces field tokens based on the zone names specified in the style.zon file, where each zone name corresponds to a tag name. In the collection's internal documents table, the field name is the tag name and the field value is the text enclosed by the tag.

For example, when the zone filter encounters this HTML:

<TITLE>This is the title</TITLE>

the filter produces the following field token:

TITLE: This is the title

Since TITLE is defined as a standard field by default, the zone filter populates the TITLE field with the value "This is the title".
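
The mapping from a tag to a field token can be imitated in a few lines of Python. This is a rough sketch of the behavior described above, not the zone filter's actual implementation: the class name ZoneFieldSketch and its attributes are hypothetical, and the standard library's html.parser module stands in for the filter's own HTML handling.

```python
from html.parser import HTMLParser

class ZoneFieldSketch(HTMLParser):
    """Collect 'NAME: value' tokens for tags whose names are listed as zones."""

    def __init__(self, zone_names):
        super().__init__()
        # html.parser reports tag names in lowercase, so compare in lowercase.
        self.zone_names = {name.lower() for name in zone_names}
        self.open_zone = None  # zone tag currently being read, if any
        self.buffer = []       # text pieces seen inside the open zone
        self.tokens = []       # resulting field tokens

    def handle_starttag(self, tag, attrs):
        if tag in self.zone_names and self.open_zone is None:
            self.open_zone = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_zone is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.open_zone:
            value = "".join(self.buffer).strip()
            self.tokens.append(f"{tag.upper()}: {value}")
            self.open_zone = None

sketch = ZoneFieldSketch(["TITLE"])
sketch.feed("<TITLE>This is the title</TITLE>")
sketch.close()
print(sketch.tokens)  # ['TITLE: This is the title']
```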

Extracting META Tags as Fields

The universal filter supports filtering <META> tags, which are typically defined in HTML documents, using the special "flt_meta" content filter with the "zone" filter. The "flt_meta" filter can be specified for the text/html document type in the style.uni file.

The default style.uni file automatically invokes the "flt_meta" content filter with the "zone" filter. Here is an example of a short style.uni file that can filter HTML documents with meta tags (the type: statement below also appears in the default style.uni file):


$control: 1
types:
{
autorec: "flt_rec"
autorec: "flt_kv -recognize"
type: text/html
/charset = guess
/def-charset = 1252
/content-filter = "zone -html -nocharmap"
/content-filter = "flt_meta"
# if we get anything else, just skip it
default:
/action = skip
}
$$
The "flt_meta" filter watches markup tokens in the document stream. When the filter encounters a <META> tag, it produces a field token based on the tag, and the field token is then stored as a field in the collection. In the collection's internal documents table, the field name is the value of the meta tag's name attribute, and the field value is the value of the meta tag's content attribute.

A sample <META> tag in HTML is shown below:

<META name="Abstract" content="This is a long document">

When filtering the HTML above, the "flt_meta" filter produces a field token of this form:

ABSTRACT: This is a long document

When a matching field definition exists, the "Abstract" field is populated with the value "This is a long document". A field definition that corresponds to the meta tag's name attribute must appear in the style.ufl or style.sfl file for the filter to populate the field. In the example above, the field named "Abstract" is aliased to the "Snippet" field in the default style.sfl file, so you do not need to add a field definition.
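
The name/content mapping described in this section can be sketched in Python as follows. The sketch only imitates the documented behavior and does not reflect how "flt_meta" is implemented: the class name MetaFieldSketch is hypothetical, and the standard library's html.parser module is used in place of the filter's own tokenizer.

```python
from html.parser import HTMLParser

class MetaFieldSketch(HTMLParser):
    """Turn <META name=... content=...> tags into 'NAME: content' tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags without both attributes (for example http-equiv tags).
        if name and content is not None:
            self.tokens.append(f"{name.upper()}: {content}")

sketch = MetaFieldSketch()
sketch.feed('<META name="Abstract" content="This is a long document">')
sketch.close()
print(sketch.tokens)  # ['ABSTRACT: This is a long document']
```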


Copyright © 2002, Verity, Inc. All rights reserved.