Zones for Internet Message Format Documents


The zone filter recognizes documents in internet message format that conform to the RFC822 standard. This includes most standard e-mail and Usenet news messages.

How the Zone Filter Parses Internet Message Format Documents

The zone filter parses the headers of Internet-style e-mail and Usenet news messages to create zones.

For example, the following e-mail message can be parsed to extract zones from the headers automatically.


From johns@verity.com Thu Dec 15 11:38:18 1994
From: John Smith <johns@verity.com>
Received: (from johns@localhost) by grimaldi
(8.6.6.Beta9/8.6.6.Beta9) id LAA12705 for johns; Thu, 15 Dec
1994 11:36:35 -0800
Message-Id: <199412151936.LAA12705@grimaldi>
Subject: test message
To: johns (John Smith)
Date: Thu, 15 Dec 1994 11:36:34 -0800 (PST)

This is a test message.
John

Below is the same message with the implicit zone start and end markers added. The markers are shown in square brackets for easy identification.


From johns@verity.com Thu Dec 15 11:38:18 1994
From: [from-beg] John Smith <johns@verity.com>
[from-end] Received: (from johns@localhost) by grimaldi
(8.6.6.Beta9/8.6.6.Beta9) id LAA12705 for johns; Thu, 15 Dec
1994 11:36:35 -0800
Message-Id: <199412151936.LAA12705@grimaldi>
Subject: [subject-beg] test message
[subject-end] To: [to-beg] johns (John Smith)
[to-end]Date: [date-beg] Thu, 15 Dec 1994 11:36:34 -0800 (PST)
[date-end]
This is a test message.
John
Header lines should conform to the RFC822 standard for e-mail and news messages. RFC822 specifies the following syntax:


Header-line-name: data data data \n
[<whitespace>more data, more data more data \n] ...
A header line must begin with the header line name, which consists entirely of alphanumeric characters, underscores, or dashes and is followed by a colon. The rest of the line, up to the newline character, is the text of the header line. Header line names, like tag names, are case-insensitive. (For example, to matches To.)
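
The naming rule above can be sketched as a regular expression. This is a minimal Python illustration, not the zone filter's own code: the pattern HEADER_LINE and the function split_header_line are hypothetical names, and lowercasing the name is just one way to model case-insensitive matching.

```python
import re

# Assumed pattern: a name made only of letters, digits, underscores, or
# dashes, then a colon, then the header text up to the end of the line.
HEADER_LINE = re.compile(r"^([A-Za-z0-9_-]+):(.*)$")


def split_header_line(line):
    """Return (lowercased name, text) for a header line, or None if no match."""
    match = HEADER_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    name, text = match.groups()
    # Lowercasing models case-insensitive names: "To" and "to" compare equal.
    return name.lower(), text.strip()


print(split_header_line("To: johns (John Smith)\n"))
print(split_header_line("From johns@verity.com Thu Dec 15 11:38:18 1994"))
```

Under these assumptions the envelope line that begins with "From " and has no colon after the name is not treated as a header line, which agrees with the annotated sample above, where that line carries no zone markers.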

Optionally, the header line can be continued on the next line with a continuation line. Lines whose first character is a whitespace character are continuation lines. The text of the entire continuation line is included as part of the previous header line. For example, the To header line in the following e-mail spans multiple lines. Again, zone starts and ends are shown in square brackets.


From:[from-beg] John Smith <johns@verity.com>
[from-end]Subject:[subject-beg] another test message
[subject-end]To:[to-beg] johns (John Smith),
toddq@verity.com (Todd Quidnunc),
mick@verity.com (Mickey O'Donnicker),
ralphp@verity.com (Ralph Poobah)
[to-end]

The header section of a document ends at the first line that contains only whitespace characters or that starts with an SGML element tag. After that point, the parser switches from header line parsing to SGML element parsing. If you have an internet message format document that contains embedded markup language tags, you can specify those tags in the style.zon file and they will be extracted as zones.
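
The header rules described so far (header names, continuation lines, and the end of the header section) can be pulled together in a short Python sketch. It is an illustration under stated assumptions rather than the zone filter's implementation: parse_header_zones is a hypothetical name, a line beginning with "<" stands in for "starts with an SGML element tag", and continuation text is joined to the previous header with a single space.

```python
import re

HEADER_LINE = re.compile(r"^([A-Za-z0-9_-]+):(.*)$")


def parse_header_zones(text):
    """Return a list of (zone_name, zone_text) pairs from the header section."""
    zones = []
    in_header = False  # True while the previous line belonged to a header
    for line in text.splitlines():
        # A whitespace-only line, or a line opening with "<" (a simplified
        # stand-in for an SGML element tag), ends the header section.
        if line.strip() == "" or line.startswith("<"):
            break
        if line[0].isspace():
            # Continuation line: its text becomes part of the previous header.
            if in_header:
                name, value = zones[-1]
                zones[-1] = (name, value + " " + line.strip())
            continue
        match = HEADER_LINE.match(line)
        if match:
            zones.append((match.group(1).lower(), match.group(2).strip()))
            in_header = True
        else:
            # Not a header line (for example, the leading "From " line).
            in_header = False
    return zones


message = (
    "From: John Smith <johns@verity.com>\n"
    "Subject: another test message\n"
    "To: johns (John Smith),\n"
    "  toddq@verity.com (Todd Quidnunc)\n"
    "\n"
    "This is a test message.\n"
)
for name, value in parse_header_zones(message):
    print(f"[{name}-beg] {value} [{name}-end]")
```

The sketch stops reading at the blank line, so nothing in the message body is treated as a header, even if a body line happens to contain a colon.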

Zone Filter Specification for E-mail

The following zone filter specification in the style.uni file is appropriate for e-mail documents:


type: message/rfc822
/charset = guess
/def-charset = 1252
/content-filter = "zone -email -nocharmap"

The above specification invokes the built-in e-mail filter. The rules for the built-in e-mail filter are as follows:

Using the style.zon file

If you prefer to define your own tags rather than use the built-in e-mail mode, specify /filter="zone" without the -email option, and use the header keyword in the style.zon file.

The header keyword specifies extraction or exclusion of header lines. The syntax is as follows:

header: headername

where headername specifies the name of the header line you want to extract as a zone. Header names are case-insensitive. To extract all header lines as zones, use an asterisk (*) for headername. You can use the following optional modifiers with the header keyword.

Modifier    Description
/ignore     Specify YES to ignore the specified header. If you use the asterisk for headername, only those headers specified with the /ignore=yes modifier are ignored. If you do not use the asterisk, all the headers specified are extracted and those omitted are ignored.
/field      Specify YES to extract the specified header as a field as well as a zone. See "Defining Zones for Virtual Documents" in this chapter.

There are two approaches to specifying the header lines to extract as zones.

1. Specify the asterisk, and then list any headers you do not want extracted using the /ignore modifier.

2. Explicitly list only those headers you want extracted.

The following is an example of the first approach:


$control: 1
zonespec:
{
header: *
header: received
/ignore = yes
header: message-id
/ignore = yes
}
$$
In this example, all headers are extracted as zones, except received and message-id, which are ignored.

The following is an example of the second approach:


$control: 1
zonespec:
{
header: received
header: message-id
}
$$
In this example, only received and message-id are extracted as zones. All others are ignored.
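
The selection logic behind the two approaches can be modeled in a few lines of Python. This is a sketch of the behavior described above, not the filter's code: select_headers is a hypothetical function, and each rule is represented as an assumed (headername, ignore) pair standing in for a header: entry and its /ignore modifier.

```python
def select_headers(rules, header_names):
    """Return the lowercased header names that would be extracted as zones.

    rules is a hypothetical list of (headername, ignore) pairs that mirrors
    the header: entries and their /ignore modifier in style.zon.
    """
    ignored = {name.lower() for name, ignore in rules if ignore}
    listed = {
        name.lower() for name, ignore in rules if not ignore and name != "*"
    }
    wildcard = any(name == "*" for name, _ in rules)

    selected = []
    for header in header_names:
        key = header.lower()  # header names are case-insensitive
        if key in ignored:
            continue
        if wildcard or key in listed:
            selected.append(key)
    return selected


headers = ["From", "Received", "Message-Id", "Subject"]

# First approach: asterisk plus /ignore = yes for unwanted headers.
first = [("*", False), ("received", True), ("message-id", True)]
print(select_headers(first, headers))

# Second approach: list only the headers to extract.
second = [("received", False), ("message-id", False)]
print(select_headers(second, headers))
```

With the sample header list, the first rule set keeps from and subject, while the second keeps only received and message-id, matching the two style.zon examples.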

Zone Filter Specification for Usenet News

The following zone filter specification in the style.uni file is appropriate for Usenet news documents:


type: message/news
/charset = guess
/def-charset = 1252
/content-filter = "zone -news -nocharmap"

The above specification invokes the built-in Usenet news filter. The rules for the built-in Usenet news filter are as follows:

Using the style.zon file

If the built-in parsing rules for Usenet news documents are not sufficient, you can use the style.zon file and the header keyword to define zones, as described in "Custom Zone Definitions" later in this chapter.





Copyright © 2002, Verity, Inc. All rights reserved.