Zones for Markup Language Documents


Markup languages use tags embedded in the text of documents to specify the document's structure and formatting. Viewers and print programs are designed to read the tags and display or print the document appropriately. The international standard for markup languages is SGML, or Standard Generalized Markup Language. SGML is the basis for HTML, or Hypertext Markup Language, which is the means used to create pages for the World Wide Web. A newer markup language named XML (Extensible Markup Language) uses user-definable tags to extend the capabilities offered by HTML.

There are a number of reasons why you might want to search HTML, SGML, or XML tags as zones. For example, if you are looking for information on Ecuador, you may want to search for the word "Ecuador" in the title or first-level heading. Using the title or first-level heading as a zone helps ensure that the documents retrieved have Ecuador as their primary focus, rather than merely mentioning it briefly in the body of the text.

How the Zone Filter Parses Markup Language Documents

When the zone filter encounters a start zone tag, it opens a new zone. When it encounters an end zone tag, it closes that zone. The indexer makes a zone out of all the text between the two tags. The tags themselves are ignored during filtering.

The syntax for a start zone tag is:

<name[attributes...]>

The syntax for an end zone tag is:

</name[attributes...]>

That is, a start tag begins with a left angle bracket, followed by the element name. An end tag begins with a left angle bracket and a slash, followed by the element name. There can be no space between the left angle bracket and the characters that follow it. The element name is followed by zero or more attributes. Attributes can be arbitrary text, including strings and whitespace characters, but frequently have the form:

AttributeName=Value

where Value can be an identifier, a string literal, a URL, or anything else, so long as it does not contain a right angle bracket.

Exclamation point and question mark metatags are parsed, but ignored. Examples are:


<!name[attributes...]>
<?name[attributes...]>
Here is an example of a document containing valid SGML text and tags.


<HEAD> This text is in the header zone. </HEAD>
<BODY> This text is in the body zone.
<section>
This text is in a body zone AND the section zone, which is
nested inside the body zone.
</section>
</BODY>
Tag names are case-insensitive. They can contain alphabetic characters (upper and lower case), numeric characters, the dash, the underscore, and the number sign (#).
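
The parsing rules above can be sketched in a few lines of Python. This is an illustration only, not the zone filter's implementation: the regular expression, the zones_of function, and the choice to ignore an end tag that does not match the innermost open zone are all assumptions made for the example.

```python
import re

# Assumed pattern built from the rules above: "<", an optional "/", "!" or "?",
# a name with no space before it, then attribute text containing no ">".
TAG = re.compile(r"<([/!?]?)([A-Za-z0-9_#-]+)([^>]*)>")

def zones_of(text):
    """Return (open_zones, text) pairs for each run of text between tags."""
    stack, out, pos = [], [], 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            out.append((tuple(stack), chunk))
        pos = m.end()
        kind, name = m.group(1), m.group(2).lower()  # names are case-insensitive
        if kind == "":
            stack.append(name)      # start tag opens a zone
        elif kind == "/" and stack and stack[-1] == name:
            stack.pop()             # matching end tag closes it
        # "!" and "?" metatags fall through: parsed, then ignored
    tail = text[pos:].strip()
    if tail:
        out.append((tuple(stack), tail))
    return out

doc = ("<HEAD> Header text. </HEAD>\n<BODY> Body text.\n"
       "<section>\nNested text.\n</section>\n</BODY>")
for zones, chunk in zones_of(doc):
    print(zones, chunk)
```

Each printed line pairs a run of text with the zones open around it, so the nested text is reported inside both the body zone and the section zone. The tags themselves never appear in the output.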

Implicit Zone Endings

The zone filter implicitly ends zones that have not been explicitly ended with an end tag. For example:


<ul>
<li> This is the li zone.
</ul>
This is outside all zones.
The </ul> tag ended the ul zone, but also implicitly ended the li zone. It is equivalent to the following, in which the implicit </li> end tag is written out:


<ul>
<li> This is the li zone. </li>
</ul>
Zone end tags are implicit in only two cases:

The filter does not implicitly end any zone when it encounters a start tag. For example, here is an HTML construct:


<ul>
<li> This is in the li zone, which is nested in the ul zone.
<li> This is another li zone, which is also nested in the ul
zone. In HTML, this li ends the previous one. With the
zone filter, it does not.
</ul>
It would be interpreted with the following implicit end tags:


<ul>
<li> This is in the li zone, which is nested in the ul zone.
<li>This is another li zone, which is also nested in the ul
zone.</li>
</li>
</ul>
Implicit end tags are not handled when start tags are encountered because it is not very useful to search contiguous zones. Searching for "text in the li zone" in the preceding examples has little meaning because all the text is already in the li zone.
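
The behavior described in this section can be modeled with a stack, as in the Python sketch below. It is a simplified illustration rather than the filter's actual code: the event tuples and the zone_paths function are hypothetical, and ignoring an end tag that has no open zone is an assumption.

```python
def zone_paths(events):
    """events: ("start", name), ("end", name) or ("text", string) tuples.
    Returns (text, open_zones) pairs showing which zones enclose each text."""
    stack, seen = [], []
    for kind, value in events:
        if kind == "start":
            stack.append(value)          # a start tag never closes another zone
        elif kind == "end" and value in stack:
            # Close the named zone and, implicitly, every zone opened inside it.
            while stack.pop() != value:
                pass
        elif kind == "text":
            seen.append((value, tuple(stack)))
    return seen

events = [
    ("start", "ul"),
    ("start", "li"), ("text", "first item"),
    ("start", "li"), ("text", "second item"),
    ("end", "ul"),
    ("text", "outside"),
]
for text, zones in zone_paths(events):
    print(text, zones)
```

The second item is reported inside two nested li zones because the second start tag did not end the first li zone, and the single ul end tag then closes all three zones at once.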

Zones for HTML Documents

HTML is based on SGML. It is, essentially, one SGML DTD. However, because of its popularity and widespread use, most HTML tags and entities are commonly recognized and read by Web browsers and authoring tools. Therefore, HTML tags and entities are built into the zone filter.

Zone Filter Specification for HTML

The following zone filter specification in the style.uni file is appropriate for HTML documents:


type: text/html
/charset = guess
/def-charset = 1252
/content-filter = "zone -html -nocharmap"
The above specification will invoke the built-in HTML filter.

Supported HTML Tags

The zone filter recognizes most standard tags through HTML 3.0. It automatically extracts certain tags as zones and ignores others.

The zone filter automatically extracts the following tags as zones.

Tag            Description
<a>            anchor
<abbrev>       abbreviation
<acronym>      acronyms
<address>      address
<au>           author name
<banner>       banner
<base>         used to resolve relative addressing
<blockquote>   block quote
<body>         body
<cite>         citation
<code>         code sample
<dfn>          definition
<fn>           footnote
<form>         form
<h1>           heading level 1
<h2>           heading level 2
<h3>           heading level 3
<h4>           heading level 4
<h5>           heading level 5
<h6>           heading level 6
<head>         header block
<html>         html zone
<lang>         alternate language
<link>         provides information relating current document to other documents
<note>         separated notational text
<person>       person element
<q>            quotation
<samp>         sample
<textarea>     text area of a form

The zone filter ignores the following tags.

Tag            Description
<b>            bold
<big>          big print
<br>           line break
<dd>           definition list definition
<del>          deleted text
<dir>          directory list
<div>          division
<dl>           definition list
<dt>           definition list text
<em>           emphasis
<fig>          figure
<hr>           horizontal rule
<i>            italics
<img>          inline image
<input>        forms input field
<ins>          inserted text
<isindex>      current page is an index document
<kbd>          user keyboard entered text
<li>           list item
<math>         math
<menu>         menu list
<nextid>       unique identifier for document
<ol>           ordered list
<option>       forms option list
<p>            paragraph
<pre>          preformatted style
<select>       forms choice list
<small>        small print
<strike>       strike out text
<strong>       strong emphasis (usually bold)
<sub>          subscript
<sup>          superscript
<tab>          tab element
<table>        table
<td>           table data
<th>           table head
<tr>           table row
<tt>           teletype font (monospace)
<u>            underline
<ul>           unordered list
<var>          variable style for names to be supplied

Supported HTML Entities

HTML uses entities to represent certain characters. An entity is a named representation of a character that an HTML browser displays as the character itself. Entities are used for characters that would otherwise be treated as part of the markup language or that cannot be typed on a standard keyboard. For example, the < character denotes the beginning of a tag. To display this character in an HTML browser, enter the entity lt (written &lt;) in your HTML document. Likewise, you can display the character Á by using the entity Aacute (written &Aacute;). The zone filter supports all of the ISO 8859-1 entities specified by the HTML 3.0 proposed standard.
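
To see what entity translation produces, the short Python sketch below decodes the two entities mentioned above. It uses html.unescape from the Python standard library purely as a stand-in; it is not the zone filter, and the sample strings are made up for the example.

```python
from html import unescape

# Sample strings containing the lt and Aacute entities discussed above.
comparison = unescape("1 &lt; 2")
name = unescape("&Aacute;vila")

print(comparison)
print(name)
```

The first string decodes to text containing a literal left angle bracket, and the second begins with the character Á.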

Additional HTML Parsing Rules

The zone filter observes the following additional rules for HTML:

1. No header lines are parsed, and the parsing starts off in markup language parsing mode.

2. The title tag is extracted as a zone and a field. To store this field, you must define title in your style.ufl file. (For more information, refer to "Defining Zones as Collection Fields" in this chapter.)

3. The character set for HTML pages is ISO 8859. The zone filter automatically translates the characters into the internal character set specified at startup.

Zones for SGML and XML Documents

SGML and XML documents rely on a Document Type Definition (DTD) to define their tags. Unlike HTML, different groups of SGML and XML documents use different DTDs. Therefore, you must define the zones you want to extract for each group of SGML and XML documents that share a common DTD.

The Verity zone filter has a known limitation for SGML and XML documents: the combined length of any attribute name and its corresponding attribute value is limited to 256 bytes. If an attribute exceeds this limit, the indexing engine truncates the zone attribute value.
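
If you want to find attributes that would be affected before indexing, a check along the following lines may help. It is a minimal sketch under stated assumptions: the over_limit function is hypothetical, byte length is measured in an encoding you supply (UTF-8 by default here), and the exact point at which the engine truncates is not specified in this chapter.

```python
LIMIT = 256  # bytes allowed for an attribute name plus its value

def over_limit(attrs, encoding="utf-8"):
    """Return the names of attributes whose name plus value exceeds LIMIT bytes.

    attrs maps attribute names to values. The encoding is an assumption;
    use the character set your documents are actually stored in.
    """
    return [
        name
        for name, value in attrs.items()
        if len(name.encode(encoding)) + len(value.encode(encoding)) > LIMIT
    ]

sample = {"id": "chapter-1", "href": "y" * 300}
print(over_limit(sample))
```

Only the href attribute is reported for the sample, since its name and value together come to 304 bytes.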

Zone Filter Specification for SGML and XML

The following zone filter specification in the style.uni file is appropriate for SGML and XML documents:


type: text/sgml
/charset = guess
/content-filter = "zone -nocharmap"

Using the style.zon file

You can use the style.zon file to specify the SGML and XML tags you want to create as zones by using the element and attribute keywords. You can specify the conversion of SGML and XML entities using the entity keyword. A sample style.zon file for SGML is shown below:


$control: 1
zonespec:
{
  element: *
  element: heading3
    /ignore = yes
  element: list-item
    /ignore = yes
}
$$
For complete information about using the style.zon file, refer to "Custom Zone Definitions" later in this chapter.
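
One plausible reading of the sample above is that the * entry extracts every element as a zone, while the heading3 and list-item entries opt out. The Python sketch below models that reading only. The RULES dictionary and is_zone function are hypothetical, this is not a style.zon parser, and the precedence of specific entries over the wildcard is an assumption to be confirmed against "Custom Zone Definitions".

```python
# Hypothetical in-memory form of the sample style.zon entries.
RULES = {
    "*": {"ignore": False},
    "heading3": {"ignore": True},
    "list-item": {"ignore": True},
}

def is_zone(element, rules=RULES):
    """Assumed reading: a specific entry overrides the "*" wildcard entry."""
    rule = rules.get(element.lower(), rules.get("*"))  # tag names are case-insensitive
    return rule is not None and not rule["ignore"]

for tag in ("title", "heading3", "LIST-ITEM"):
    print(tag, is_zone(tag))
```

Under this reading, title would become a zone and the two ignored elements would not.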





Copyright © 2002, Verity, Inc. All rights reserved.