Markup languages use tags embedded in the text of documents to specify the document's structure and formatting. Viewers and print programs are designed to read the tags and display or print the document appropriately. The international standard for markup languages is SGML, or Standard Generalized Markup Language. SGML is the basis for HTML, or Hypertext Markup Language, which is the means used to create pages for the World Wide Web. A newer markup language named XML (Extensible Markup Language) uses user-definable tags to extend the capabilities offered by HTML.
How the Zone Filter Parses Markup Language Documents
When the zone filter encounters a start zone tag, it opens a new zone. When it encounters an end zone tag, it closes that zone. The indexer makes a zone out of all the text between the two tags. The tags themselves are ignored during filtering.
Start tags have the form:
- <name [attributes ...]>
End tags have the form:
- </name [attributes ...]>
Attributes have the form:
- AttributeName=Value
Value can be an identifier, a string literal, a URL, or anything else, so long as it does not contain a right angle bracket. Exclamation point and question mark metatags are parsed, but ignored. Examples are:
- <!name [attributes ...]>
- <?name [attributes ...]>
- <HEAD> This text is in the header zone. </HEAD>
- <BODY> This text is in the body zone.
- <section>
- This text is in a body zone AND the section zone, which is
- nested inside the body zone.
- </section>
- </BODY>
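The nesting behavior in the example above can be sketched in Python. This is only an illustration under stated assumptions, not the zone filter's actual implementation: the `TAG` pattern and the `zones_for_text` function are hypothetical names, and the sketch assumes tag names are compared exactly as written.

```python
import re

# Matches <name ...>, </name ...>, <!name ...> and <?name ...>.
# The attribute text may be anything except a right angle bracket.
TAG = re.compile(r"<([/!?]?)([A-Za-z][\w.-]*)([^>]*)>")

def zones_for_text(doc):
    """Return (text, zones) pairs; zones lists the open zones, outermost first."""
    stack, out, pos = [], [], 0
    for m in TAG.finditer(doc):
        text = doc[pos:m.start()].strip()
        if text:
            out.append((text, list(stack)))
        pos = m.end()
        kind, name = m.group(1), m.group(2)
        if kind == "":
            stack.append(name)            # a start tag opens a new zone
        elif kind == "/" and name in stack:
            while stack.pop() != name:    # an end tag closes its zone
                pass                      # (and any zones still open inside it)
        # "!" and "?" metatags are parsed but otherwise ignored
    tail = doc[pos:].strip()
    if tail:
        out.append((tail, list(stack)))
    return out

doc = ("<HEAD> This text is in the header zone. </HEAD>"
       "<BODY> This text is in the body zone. "
       "<section> Nested text. </section></BODY>")
for text, zones in zones_for_text(doc):
    print(zones, text)
```

Each run of text is reported together with every zone that is open at that point, so the nested text belongs to both the BODY zone and the section zone.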
- <ul>
- <li> This is the li zone.
- </ul>
- This is outside all zones.
The </ul> tag ended the ul zone, but also implicitly ended the li zone. It is equivalent to the following, in which the implicit end tag is written out:
- <ul>
- <li> This is the li zone. </li>
- </ul>
- <ul>
- <li> This is in the li zone, which is nested in the ul zone.
- <li> This is another li zone, which is also nested in the ul
- zone. In HTML, this li ends the previous one. With the
- zone filter, it does not.
- </ul>
- <ul>
- <li> This is in the li zone, which is nested in the ul zone.
- <li>This is another li zone, which is also nested in the ul
- zone.</li>
- </li>
- </ul>
The phrase "another li zone" in the preceding examples has little meaning because all the text is already in the li zone.
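The implicit end tag behavior described above can be made concrete with a small Python sketch that writes out the end tags the filter would infer. It is an approximation for illustration only: `make_explicit` is a hypothetical helper, and leaving an unmatched end tag unchanged is an assumption, since the text does not say how the zone filter treats that case.

```python
import re

# Start and end tags only; metatags (<! and <?) pass through as plain text.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

def make_explicit(doc):
    """Return doc with every implicitly closed zone given a written-out end tag."""
    stack, parts, pos = [], [], 0
    for m in TAG.finditer(doc):
        parts.append(doc[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if not closing:
            # A start tag always nests; it never closes an earlier sibling.
            stack.append(name)
        elif name in stack:
            # Close any inner zones that are still open, innermost first.
            while stack[-1] != name:
                parts.append(f"</{stack.pop()}>")
            stack.pop()
        # Assumption: an end tag with no matching open zone is left as-is.
        parts.append(m.group(0))
    parts.append(doc[pos:])
    return "".join(parts)

print(make_explicit("<ul><li>one</ul>"))
print(make_explicit("<ul><li>first<li>second</ul>"))
```

In the second call the second li nests inside the first rather than ending it, so two end tags for li are written before the end tag for ul.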
The following style.uni file is appropriate for HTML documents:
- type: text/html
- /charset = guess
- /def-charset = 1252
- /content-filter = "zone -html -nocharmap"
The zone filter automatically extracts the following tags as zones.
The < character is used to denote the beginning of a tag. If you want this character to display in an HTML browser, you need to enter the entity lt in your HTML document. Likewise, you can display the character Á by using the entity Aacute. The zone filter supports all of the ISO 8859-1 entities as specified by the HTML 3.0 proposed standard.
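The mapping from entity names to characters can be illustrated with Python's standard `html.entities` table. This table stands in for the zone filter's own entity handling, which is not shown in this chapter; `decode_entity` and `is_latin1_entity` are hypothetical helper names.

```python
from html.entities import name2codepoint

def decode_entity(name):
    """Return the character for a named entity such as 'lt' or 'Aacute'."""
    return chr(name2codepoint[name])

def is_latin1_entity(name):
    """True if the entity names a character in the ISO 8859-1 range (0-255)."""
    cp = name2codepoint.get(name)
    return cp is not None and cp <= 0xFF

print(decode_entity("lt"), decode_entity("Aacute"))
print(is_latin1_entity("Aacute"), is_latin1_entity("euro"))
```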
style.ufl file to be able to store this field. (For more information, refer to "Defining Zones as Collection Fields" in this chapter.)
The Verity zone filter has a known limitation for SGML and XML documents: the length of any SGML attribute name plus its corresponding attribute value is limited to 256 bytes. If an attribute exceeds this limit, the indexing engine truncates the zone attribute value.
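The effect of this limit can be approximated in Python when checking documents before indexing. This is a rough sketch, not the engine's behavior: `truncate_attribute` is a hypothetical helper, and both the byte encoding and the exact cut-off point are assumptions.

```python
LIMIT = 256  # bytes allowed for attribute name plus value, per the limitation above

def truncate_attribute(name, value, encoding="utf-8"):
    """Return value cut so that name plus value fit in LIMIT bytes (assumed model)."""
    name_bytes = name.encode(encoding)
    value_bytes = value.encode(encoding)
    room = max(LIMIT - len(name_bytes), 0)
    # errors="ignore" drops a multi-byte character split by the cut.
    return value_bytes[:room].decode(encoding, errors="ignore")

print(len(truncate_attribute("id", "x" * 300)))
```

Running a check like this over a collection's attributes would flag values at risk of truncation before they reach the indexer.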
The following style.uni file is appropriate for SGML and XML documents:
- type: text/sgml
- /charset = guess
- /content-filter = "zone -nocharmap"
Use the style.zon file to specify the SGML and XML tags you want to create as zones by using the element and attribute keywords. You can specify the conversion of SGML and XML entities using the entity keyword. A sample style.zon file for SGML is shown below:
- $control: 1
- zonespec:
- {
- element: *
- element: heading3
- /ignore = yes
- element: list-item
- /ignore = yes
- }
- $$
For more information about the style.zon file, refer to "Custom Zone Definitions" later in this chapter.