Special Noindex and Noextract Zones


The special noindex and noextract zones are optional. The noindex zone marks text that is not indexed. The noextract zone marks text that is not processed during feature extraction (used for clustering and Query-By-Example) or summarization.

Noindex Zones

The noindex zone is a special zone whose contents are not indexed. Because the contents are not indexed, the Verity engine does not return hits on text marked by the noindex zone.

To mark a zone as a noindex zone, you need to specify the /zone=noindex modifier in the style.dft file. An example style.dft file with noindex zones defined is shown below.


$control: 1
dft:
{
  constant: "Title: "
    /zone=noindex
  field: TITLE
    /zone=noindex
  constant: "\n"
    /zone=noindex
  field: DOC
    /filter=zone
}

In the preceding example, the two constant elements and the TITLE field together go into a zone called noindex, which the indexer treats specially. When the indexer encounters this zone, it continues counting words inside the zone as usual, but it does not put those words into the full-word index. It also does not store the boundaries of the noindex zone in the zone index.
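
The indexer behavior described above can be illustrated with a small Python sketch. This is not Verity code: the token tuples and the build_word_index function are hypothetical names chosen for illustration, and the sketch assumes that zones do not nest. It shows only that word positions keep advancing inside a noindex zone while the words themselves are left out of the word index.

```python
from collections import defaultdict


def build_word_index(tokens):
    """Build a word -> positions map, skipping words inside a noindex zone.

    tokens is a list of (kind, value) tuples, where kind is "word",
    "zone-start", or "zone-end". This token format is an assumption made
    for the sketch, not the Verity token stream.
    """
    index = defaultdict(list)
    position = 0
    in_noindex = False
    for kind, value in tokens:
        if kind == "zone-start" and value == "noindex":
            in_noindex = True
        elif kind == "zone-end" and value == "noindex":
            in_noindex = False
        elif kind == "word":
            if not in_noindex:
                index[value.lower()].append(position)
            # The position counter advances even for words that are not indexed.
            position += 1
    return dict(index)


tokens = [
    ("zone-start", "noindex"),
    ("word", "Title"),
    ("word", "Annual"),
    ("word", "Report"),
    ("zone-end", "noindex"),
    ("word", "Revenue"),
    ("word", "grew"),
]
index = build_word_index(tokens)
print(index)  # {'revenue': [3], 'grew': [4]}
```

Because counting continues inside the zone, "Revenue" is recorded at position 3 rather than position 0, even though the three title words are absent from the index.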

Here's another example with a different syntax, but equivalent meaning:


$control: 1
dft:
{
  zone-start: noindex
  constant: "Title: "
  field: TITLE
  constant: "\n"
  zone-end: noindex
  field: DOC
    /filter=zone
}

A useful consequence of this mechanism is that you can also mark noindex zones directly inside your SGML documents:


This is normal text. More normal text.
<noindex>
This text won't be indexed because the zone filter will spit
out a "zone-start noindex" token when it sees the above
noindex tag. We can put weird words like "onomatopoeia" here
and they won't show up in the full-word index.
</noindex>
These words are outside the noindex zone again, so they will
show up in the index again, i.e. Ya gotta be careful what
you write out here!

With this markup, a search for "onomatopoeia" returns no hit on this text.
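
As a rough illustration of the effect of noindex tags on searchability, the following Python sketch collects the words that fall outside noindex tags. It is a simplified stand-in for the zone filter, not its actual implementation: the searchable_words function and the regular expression are assumptions made for this example, and only plain alphabetic words and unnested noindex tags are handled.

```python
import re

# Matches either a noindex open/close tag or a run of letters.
TAG_OR_WORD = re.compile(r"<(/?)noindex>|([A-Za-z]+)", re.IGNORECASE)


def searchable_words(sgml_text):
    """Return the lowercase words that lie outside <noindex>...</noindex>."""
    words = set()
    inside = False
    for match in TAG_OR_WORD.finditer(sgml_text):
        if match.group(2) is None:
            # A tag matched: group(1) is "" for <noindex>, "/" for </noindex>.
            inside = match.group(1) == ""
        elif not inside:
            words.add(match.group(2).lower())
    return words


document = """This is normal text.
<noindex>
Weird words like "onomatopoeia" go here.
</noindex>
These words are indexed again."""

words = searchable_words(document)
print("onomatopoeia" in words)  # False
print("normal" in words)        # True
```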

Noextract Zones

The noextract zone is a special zone whose contents are not processed during feature extraction. Using this zone, you can selectively exclude sections of a document from feature extraction (used for clustering and Query-By-Example) and from summarization. The summarization and feature extraction component recognizes the special zone token called NOEXTRACT. The feature extractor ignores anything between the start and end of a noextract zone.

The noextract zone is used in the same way as the special noindex zone, which is described in the previous section. Like a noindex zone, a noextract zone can be inserted through the style.dft mechanism, with NOEXTRACT tags in SGML documents, or manually if you are using a custom gateway.

If you are developing a custom gateway using the Verity Gateway Developer Kit, insert a zone token named noextract before and after the text to be ignored, with the start and end flags set appropriately.

The noextract zone is not indexed as a zone by the indexer, though the text within the zone is indexed.
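
The difference between the two consumers of a noextract zone can be sketched in Python. The token tuples and the split_views function below are hypothetical and are not part of the Verity Gateway Developer Kit API; the sketch assumes unnested zones. It shows that text inside a noextract zone still reaches the word index but is withheld from feature extraction and summarization.

```python
def split_views(tokens):
    """Return (indexed_words, extractable_words) for a token stream.

    tokens is a list of (kind, value) tuples with kind "word",
    "zone-start", or "zone-end". This format is assumed for illustration.
    """
    indexed = []
    extractable = []
    in_noextract = False
    for kind, value in tokens:
        if kind == "zone-start" and value.lower() == "noextract":
            in_noextract = True
        elif kind == "zone-end" and value.lower() == "noextract":
            in_noextract = False
        elif kind == "word":
            # Every word is indexed, whether or not it is in the zone.
            indexed.append(value)
            # Only words outside the zone are passed to feature extraction.
            if not in_noextract:
                extractable.append(value)
    return indexed, extractable


tokens = [
    ("word", "Quarterly"),
    ("word", "results"),
    ("zone-start", "noextract"),
    ("word", "Confidential"),
    ("word", "footer"),
    ("zone-end", "noextract"),
    ("word", "Revenue"),
]
indexed, extractable = split_views(tokens)
print(indexed)      # ['Quarterly', 'results', 'Confidential', 'footer', 'Revenue']
print(extractable)  # ['Quarterly', 'results', 'Revenue']
```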

Hidden Elements in Noextract Zones

Hidden elements in noextract zones allow you to add text to the virtual document that gets indexed but cannot be viewed. This provides a way to add document fields to the full-text index for the document, allowing them to be searched faster than with standard field search, but preventing them from being viewed as part of the document. If the fields are enclosed in "hidden" zones, the fields can be searched using standard zone search syntax.

Hidden elements must be placed after all of the visible elements in the virtual document, as defined in the style.dft file. For information about implementing hidden elements in zones, refer to the previous section, "Hidden Elements in Zones."





Copyright © 2002, Verity, Inc. All rights reserved.