Custom Zone Definitions


If you do not use one of the built-in modes listed in the previous sections, you must specify your own zone definitions in the style.zon file.

Built-in vs. Custom Zone Definitions

You can define zones either by specifying one of the built-in modes or by creating your own style.zon file. However, you cannot use the style.zon file to augment or override the behavior of a built-in mode. If you specify a built-in mode, the engine ignores the style.zon file. If the built-in behavior does not meet your requirements, you will need to specify a custom set of zone definitions in the style.zon file.

For most implementations, you can create the custom definitions by modifying the built-in definitions. For information on how to dump the definitions for a built-in mode and modify them for use in a style.zon file, see "Dumping style.zon Information," later in this section.

style.zon File

The structure of the style.zon file is as follows:


$control: 1
zonespec:
{
.
.
.
}
$$
The style.zon file must reside in the style directory of the collection. Note that the file must begin with $control: 1 and zonespec: on the first and second uncommented lines respectively. The file must end with $$ on a line by itself.
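
The structural rules above can be checked mechanically before a style.zon file is placed in the style directory. The following Python sketch is illustrative only and is not part of the product. It assumes that comment lines begin with # and that blank lines do not count; this section does not define the comment syntax. The function name check_zon_structure is hypothetical.

```python
def check_zon_structure(text, comment_prefix="#"):
    """Return a list of structural problems found in style.zon text.

    Assumption: lines starting with comment_prefix are comments and blank
    lines are skipped; the comment syntax is not defined in this section.
    """
    problems = []
    # Keep only non-blank, uncommented lines, stripped of surrounding space.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith(comment_prefix)]

    # Compare the first line with all whitespace removed, so that
    # "$control: 1" and "$control:1" are both accepted.
    if len(lines) < 1 or "".join(lines[0].split()) != "$control:1":
        problems.append("first uncommented line must be '$control: 1'")
    if len(lines) < 2 or lines[1] != "zonespec:":
        problems.append("second uncommented line must be 'zonespec:'")
    if not lines or lines[-1] != "$$":
        problems.append("file must end with '$$' on a line by itself")
    return problems


sample = """$control: 1
zonespec:
{
element: *
}
$$
"""
print(check_zon_structure(sample))  # prints []
```

An empty list means the three rules stated above are satisfied. The sketch does not validate the keywords inside the braces.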

The content of the file depends upon the type of document for which you are creating zones, and how you want the various zones stored in the collection. The syntax for the various style.zon keywords and sample style.zon files for the various document types are included in "Zones for Markup Language Documents" and "Zones for Internet Message Format Documents" in this chapter.

style.zon Default Behavior

If you do not specify a mode argument in the /filter=zone modifier, and no style.zon file is found, the default behavior is equivalent to the following style.zon file:


$control: 1
zonespec:
{
element: *
}
$$
That is, no header lines are extracted as zones, and the parsing starts off in SGML parsing mode. All element tags are extracted as zones, and no entities are translated.

style.zon File Syntax

You can use the style.zon file to specify the tags you want to create as zones by using the element and attribute keywords. The examples in the description of the style.zon file refer to common entities in SGML.

The zonespec Keyword

You can specify a modifier for the zonespec keyword in the style.zon file, as follows.

Modifier             Description
/ignoreattributes    Specify YES or NO. The default is YES. Ignores tag attributes unless overridden by a statement beneath it.

The zonespec keyword appears as follows when using this modifier.


$control: 1
zonespec:
  /ignoreattributes = yes
{
element: *
}
$$

The element Keyword

The element keyword specifies extraction or exclusion of element tags. It uses the following syntax:

element: elementname

where elementname specifies the name of the element (that is, the tag) you want to extract as a zone. Element names are case-insensitive. To extract all tags as zones, use * for elementname. You can use the following optional modifiers with the element keyword.

Modifier    Description
/ignore     Specify YES to ignore the specified element. If you use the asterisk for elementname, only those elements specified with the /ignore=yes modifier are ignored. If you do not use the asterisk, all the elements specified are extracted and those omitted are ignored.
/field      Specify YES to extract the specified element as a field as well as a zone. See "Defining Zones for Virtual Documents" in this chapter. The extracted field value is stored in the elementname field. To extract attribute names, you must also extract the element name.
There are two approaches to specifying the elements to extract as zones. The first is to specify the asterisk, and then to list any tags you do not want extracted using the /ignore modifier. The second is to explicitly list only those elements you want extracted. The following is an example of the first approach:


$control: 1
zonespec:
{
element: *
element: heading3
  /ignore = yes
element: list-item
  /ignore = yes
}
$$
In this case, all elements are extracted as zones except for heading3 and list-item, which are ignored.

The following is an example of the second approach.


$control: 1
zonespec:
{
element: header
element: body
element: title
element: textzone
element: section
element: sub-section
element: footnote
element: appendix
}
$$
In this case, only the eight elements specified are extracted as zones. All the rest are ignored.
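
The selection rule behind the two approaches can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the engine's implementation. The function name is_extracted and the dictionary representation (element name mapped to its /ignore setting, True meaning /ignore = yes) are assumptions made for the example.

```python
def is_extracted(tag, elements):
    """Decide whether a tag becomes a zone.

    elements maps an element name (or "*") to its /ignore setting,
    where True means /ignore = yes. Names are compared
    case-insensitively, as stated for the element keyword.
    """
    spec = {name.lower(): ignore for name, ignore in elements.items()}
    tag = tag.lower()
    if tag in spec:
        # An explicit entry decides for itself.
        return not spec[tag]
    # Unlisted tags are extracted only when the wildcard is present.
    return "*" in spec and not spec["*"]


# First approach: wildcard plus explicit exclusions.
first = {"*": False, "heading3": True, "list-item": True}
# Second approach: explicit list only (shortened for the example).
second = {"header": False, "body": False, "title": False}

print(is_extracted("TITLE", first))      # True
print(is_extracted("heading3", first))   # False
print(is_extracted("footnote", second))  # False
```

With the wildcard, any tag not explicitly ignored is extracted. Without it, any tag not explicitly listed is ignored.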

The attribute Keyword

The attribute keyword specifies extraction or exclusion of attributes within a tag. It is entered in the style.zon file as a child of element and uses the following syntax:

attribute: attributename

where attributename specifies the name of the attribute you want to extract as a zone. Attribute names are case-insensitive. To extract all attribute names as zones, use * for attributename. You can use the following optional modifiers with the attribute keyword.

Modifier    Description
/ignore     Specify YES to ignore the specified attribute. To extract the attribute, specify NO (default). If you use the asterisk for attributename, only those attributes specified with the /ignore=yes modifier are ignored. If you do not use the asterisk, all the attributes specified are extracted and those omitted are ignored.
/field      Specify YES to extract the specified attribute as a field value as well as a zone. See "Defining Zones for Virtual Documents" in this chapter. When a /field=YES modifier is assigned to an attribute, the attribute name and value are prepended to the field value named by the element name. NOTE: Using /field=YES does not cause the attribute information to be extracted into its own field.
/default    Specify the default attribute value you want to use when the attribute name does not occur in the zone tag.
/values     Specify values that may appear in a tag without the corresponding attribute name.
The following is an example of the attribute keyword and its modifiers:


$control: 1
zonespec:
{
element: header
element: body
element: title
{
  attribute: company
    /default: "IBM"
}
element: textzone
element: section
element: sub-section
element: footnote
element: appendix
}
$$
The default behavior is to extract all attributes as zones. There are two circumstances under which this behavior can be suppressed above the attribute keyword level:

In either case you can override the ignore behavior and extract an attribute as a zone by specifying /ignore=no for that attribute.
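
The effect of the /default modifier in the example above can be pictured as a merge of configured defaults with the attributes actually present in a tag. The Python sketch below is a simplified model under that assumption. The function name resolve_attributes and the dictionary representation are hypothetical and do not reflect the engine's internals.

```python
def resolve_attributes(tag_attrs, defaults):
    """Return the attribute values for one tag, applying /default values.

    Attribute names are case-insensitive, so keys are lower-cased.
    """
    resolved = {name.lower(): value for name, value in defaults.items()}
    # Values that actually occur in the tag take precedence over defaults.
    resolved.update({name.lower(): value for name, value in tag_attrs.items()})
    return resolved


defaults = {"company": "IBM"}
# A <title> tag with no company attribute receives the default value.
print(resolve_attributes({}, defaults))                     # {'company': 'IBM'}
# A value present in the tag replaces the default.
print(resolve_attributes({"Company": "Verity"}, defaults))  # {'company': 'Verity'}
```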

The entity Keyword

The entity keyword specifies the translation of entities to their equivalents. It uses the following syntax:

entity: name "value"

where name is the name of the entity as it appears in the document, and value is the way you want the entity to display. You can use the following optional modifiers with the entity keyword.

Modifier    Description
/ignore     Specify either YES or NO. The default is NO.
Entities in SGML are used to specify characters that would otherwise be considered as part of the markup language or that cannot be typed on the normal keyboard.

The entity begins with an ampersand (&) and ends with a semicolon (;) or white space. No space is permitted between the ampersand character and the following entity name. The entity name consists of alphanumeric characters plus any combination of underscores, dashes, and number signs (#). If the entity is terminated with a semicolon, the semicolon is also part of the string that is replaced by the equivalent string. If the entity is terminated by a whitespace character, that whitespace is not considered part of the string that is replaced.

For example, assume the following entities and their translations:

Entity          Translation
&amp;           &
&greaterthan    >
&lessthan       <
The style.zon file would then appear as follows:


$control: 1
zonespec:
{
entity: amp "&"
entity: lessthan "<"
entity: greaterthan ">"
}
$$
The following is a sample of how the actual document would appear in ASCII text form:


Here is some text. First an entity delimited by a semicolon:
S&amp;P's stock index. Second, entities delimited by spaces:
the &greaterthan character and the &lessthan character.
Using the above style.zon file, the resulting document would then appear as follows:


Here is some text. First an entity delimited by a semicolon:
S&P's stock index. Second, entities delimited by spaces:
the > character and the < character.
If an entity is encountered and no translation is given for it in the style.zon file or in the built-in rules, then the text of that entity is passed through the filter unchanged.
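
The entity rules described in this section (an ampersand, a name made of alphanumerics, underscores, dashes, and number signs, then a semicolon that is consumed or white space that is kept) can be approximated with a regular expression. The following Python sketch is an illustration, not the filter's actual implementation. The function name translate_entities is hypothetical, and the sketch assumes that entity names are matched case-sensitively and that an entity may also end at the end of the input, neither of which is stated in this section.

```python
import re

# "&", then a name of alphanumerics, "_", "-", and "#", terminated by a
# semicolon (consumed), by white space (not consumed), or by end of input.
ENTITY = re.compile(r"&([A-Za-z0-9_#-]+)(?:;|(?=\s)|$)")


def translate_entities(text, table):
    """Replace known entities; pass unknown ones through unchanged."""
    def repl(match):
        name = match.group(1)
        if name in table:
            return table[name]
        # No translation given: leave the original text as it is.
        return match.group(0)
    return ENTITY.sub(repl, text)


table = {"amp": "&", "lessthan": "<", "greaterthan": ">"}
source = ("S&amp;P's stock index. "
          "the &greaterthan character and the &lessthan character.")
print(translate_entities(source, table))
# prints: S&P's stock index. the > character and the < character.
```

The substituted text is not scanned again, so the & produced from &amp; cannot start a new entity.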

Entity Substitution

When the zone filter performs entity substitution, it literally replaces the entity string (the ampersand followed by the symbolic name) with the string (typically just one character) specified in the substitution table. The built-in HTML filter is an exception: it interprets non-alphabetic entities as punctuation tokens, so entities such as &amp, &lt, and &gt are not streamed to the literal characters &, <, and >. A custom HTML zone filter performs entity substitution for all entities, and all entities are streamed to literal characters.

Dumping style.zon Information

You can dump the contents of the style.zon file to the standard output. This can be useful in various circumstances, including debugging the style.zon file and modifying the behavior of built-in modes.

Debugging the style.zon File

The style.zon file can be debugged using the -dump flag in the filter specification in the style.uni file. This operation is supported for debugging only, and other entries should not be modified.

To obtain the style.zon file settings in effect, follow these steps:

1. Set the -dump flag in the style.uni file.

2. Run an indexer, such as mkvdk, on a document and pipe the output to a file.

When you attempt to index a document with the -dump flag present in the style.uni file, the filter prints to the standard output a style.zon file with the settings that are in effect at the time of filtering. (Printing to standard output is not supported in a Windows DLL.) After the style.zon file is printed, the actual indexing does not take place.

The -dump option produces output in the character set of the prevailing locale. The output can be mapped to another character set using the -charmap option of mkvdk.

Dumping Information for Built-In Modes

You can obtain information about a built-in mode. A zone filter specification with the -dump flag in the style.uni file for the HTML mode is shown below:


type: text/html
/charset = guess
/def-charset = 1252
/content-filter = "zone -html -dump"

Modifying Built-in Behavior

If you want behavior for HTML, XML, e-mail, or Usenet news that differs from what the built-in modes provide, do the following:

1. Dump the contents of the built-in mode you want to modify, as described above.

2. Edit the output to reflect the behavior you want.

3. Use the modified file as your style.zon file.

4. If the universal filter is used, make a zone filter specification without the -mode argument in the style.uni file. Otherwise, make the specification in the style.dft file, as described earlier in "Using the Zone Filter."

Attribute Extraction

Attributes can be extracted into a field named by an element keyword. This means that the element value and one or more attribute values are stored together in the same collection field. Consider the following style.zon file:


$control: 1
zonespec:
  /ignoreattributes = no
{
element: name
  /field = yes
{
  attribute: first
    /field = yes
  attribute: last
    /field = yes
}
}
$$
A sample document with attributes to be indexed is shown below:


This is <name first="emily" last="shaffer">AAA</name>here.
Another is <name first="al" last="jones">ZZZ</name>here.
When the style.zon file above is in effect and the document above is indexed, the following behavior occurs:





Copyright © 2002, Verity, Inc. All rights reserved.