By default, the Verity engine invokes the universal filter with the zone filter as a helper filter. If you open the default style.uni file, you will see that it includes zone filter specifications for the built-in modes for HTML, e-mail, and Usenet news. To invoke the zone filter as a single filter instead, modify the style.dft file as described below. For more information about the style.uni and style.dft files, refer to Chapter 6, "Document Filters and Formatting."
Specifying the Zone Filter
The zone filter can be invoked together with the universal filter or as a single filter. By default, the built-in zone modes (for HTML, e-mail, and Usenet news) are all invoked with the universal filter. The zone filter specifications for the built-in zone modes are included in the default style.uni file. The zone filter is specified in the style.uni file using the type keyword with the /content-filter modifier, as shown in the sample style.uni syntax below:
- type: text/html
- /charset = guess
- /def-charset = 1252
- /content-filter = "zone -html -nocharmap"
The -nocharmap argument specifies that the zone filter will not perform character set mapping; instead, the universal filter will use its character set recognizer.
To invoke the zone filter as a single filter, you need to specify the filter in the style.dft file using the field keyword and the /filter modifier, as shown in the sample style.dft file syntax below:
- field: DOC
- /filter="zone -html"
zone
argument.
When the zone filter is invoked with the universal filter as a helper filter, the universal filter's character set recognizer should be used. For this reason, you should include the -nocharmap option in your filter specification.
When the zone filter is used as a single filter for the collection, character mapping can be implemented using one of the character mapping options, such as -precharmap or -autocharmap, in the filter specification. These options are specified in the style.dft file using the field keyword and the /filter modifier, as shown in the sample style.dft file syntax:
- field: DOC
- /filter="zone -html -precharmap 850"
The purpose of the -precharmap and -autocharmap options is to support Japanese and other Asian languages in which Web pages are commonly written in a multibyte character set rather than the HTML standard of ISO-8859-1. In Japan in particular, there are three common character sets, and a Web page might be written in any of those three. The -autocharmap option is a convenient way of handling a Web page written in an unknown character set. The same mechanism also works for Web pages written in other locales, as long as the locale supports a character set detection function. (Verity's Western European locales do support this function.)
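The difference between mapping from a fixed character set and detecting the character set can be illustrated with a small Python sketch. It is not Verity code and does not reproduce the zone filter's detection logic. The function names are hypothetical, the reading of "850" as code page 850 is an assumption, and the three Japanese encodings listed are an assumption, since the text does not name them.

```python
# Candidate encodings are an assumption; the text above does not name
# the three common Japanese character sets.
JAPANESE_CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def decode_with_fixed_charset(data, charset="cp850"):
    """Decode bytes from one known character set.

    Loosely analogous to naming a source character set up front
    (assumed here to be code page 850).
    """
    return data.decode(charset)


def decode_with_detection(data, candidates=JAPANESE_CANDIDATES):
    """Try each candidate character set and keep the first clean decode.

    A naive stand-in for a locale's character set detection function.
    Returns a (charset, text) tuple.
    """
    for charset in candidates:
        try:
            return charset, data.decode(charset)
        except UnicodeDecodeError:
            continue  # not valid in this character set, try the next one
    raise ValueError("no candidate character set decoded the input cleanly")


if __name__ == "__main__":
    page = "日本語".encode("euc_jp")
    print(decode_with_detection(page))
```

Trial decoding is only a rough heuristic. Pure ASCII input, for example, decodes under the first candidate regardless of how the page was actually written, so a real detection function would need stronger evidence than a clean decode.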
Extraction of HTML meta tags into fields is configured in the style.uni file. The default style.uni file automatically invokes the "flt_meta" content filter with the "zone" filter. Here is an example of a short style.uni file that can filter HTML documents with meta tags (the type: statement below also appears in the default style.uni file):
- $control: 1
- types:
- {
- autorec: "flt_rec"
- autorec: "flt_kv -recognize"
- type: text/html
- /charset = guess
- /def-charset = 1252
- /content-filter = "zone -html -nocharmap"
- /content-filter = "flt_meta"
- # if we get anything else, just skip it
- default:
- /action = skip
- }
- $$
The field name is the value of the name attribute, and the field value is the value of the content attribute in the meta tag. A sample <META> tag in HTML is shown below:
- <META NAME="Abstract" CONTENT="This is a long document.">
Given this tag and the default style.sfl file, the field named "Abstract" is populated with the value "This is a long document." A field definition that corresponds to the meta tag's name attribute must appear in the style.ufl or style.sfl file in order for the field to be populated by the filter. In the example above, the field named "Abstract" is aliased to the "Snippet" field in the default style.sfl file, so you would not need to add a field definition.
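The mapping from meta tags to fields can be sketched in Python as follows. This is an illustration of the rule described above (the name attribute supplies the field name, the content attribute supplies the field value, and only defined fields are populated). It is not the flt_meta implementation. The class and function names are hypothetical, and exact, case-sensitive matching of field names is an assumption.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collects name/content pairs from <META> tags (hypothetical helper)."""

    def __init__(self, defined_fields):
        super().__init__()
        # Only names with a matching field definition are populated,
        # mirroring the style.ufl / style.sfl requirement.
        self.defined_fields = set(defined_fields)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":  # html.parser lowercases tag names
            return
        attr_map = dict(attrs)  # attribute names are lowercased as well
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name in self.defined_fields and content is not None:
            self.fields[name] = content


def extract_meta_fields(html_text, defined_fields):
    """Return a dict of field name to field value for the defined fields."""
    parser = MetaFieldParser(defined_fields)
    parser.feed(html_text)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    sample = '<HTML><HEAD><META NAME="Abstract" CONTENT="This is a long document."></HEAD></HTML>'
    print(extract_meta_fields(sample, ["Abstract"]))
```

A meta tag whose name has no matching field definition is silently ignored in this sketch, which corresponds to the requirement that a field definition exist before the filter can populate the field.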
Zones can also be mapped to fields in the style.zon file, where a zone name corresponds to a tag name. In the collection's internal documents table, the field is defined as the tag name and the tag value is the field value. For example, when the zone filter encounters this HTML: