The special noindex and noextract zones are optional. The noindex zone is used to mark text that will not be indexed. The noextract zone is used to mark text that will not be processed during feature extraction (for clustering and Query-By-Example) and summarization.
Noindex Zones
The noindex zone is a special zone whose contents are not indexed. Because the contents are not indexed, the Verity engine does not find hits on text marked by the noindex zone. To define a noindex zone, specify the /zone=noindex modifier in the style.dft file. An example style.dft file with noindex zones defined is shown below.
- $control: 1
- dft:
- {
- constant: "Title: "
- /zone=noindex
- field: TITLE
- /zone=noindex
- constant: "\n"
- /zone=noindex
- field: DOC
- /filter=zone
- }
In this example, the text produced by the constant, field, and constant keywords together goes into a zone called noindex. This is a special zone for the indexer. When the indexer sees this zone, it continues counting words inside the zone as it normally does, but it does not put those words into the full-word index. It also does not store the boundaries of the noindex zone in the zone index.
Here is another example with a different syntax but equivalent meaning:
- $control: 1
- dft:
- {
- zone-start: noindex
- constant: "Title: "
- field: TITLE
- constant: "\n"
- zone-end: noindex
- field: DOC
- /filter=zone
- }
A noindex zone can also be marked in the document itself with noindex tags, which the zone filter converts into zone tokens. For example:
- This is normal text. More normal text.
- <noindex>
- This text won't be indexed because the zone filter will spit
- out a "zone-start noindex" token when it sees the above
- noindex tag. We can put weird words like "onomatopoeia" here
- and they won't show up in the full-word index.
- </noindex>
- These words are outside the noindex zone again, so they will
- show up in the index again, i.e. Ya gotta be careful what
- you write out here!
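The indexer behavior described above (word counting continues inside a noindex zone, but those words are left out of the full-word index) can be illustrated with a small Python sketch. This is a toy model, not Verity code: the function name build_word_index, the tag pattern, and the dictionary used as an index are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed, simplified tokenizer: only lowercase <noindex> tags and plain words.
TOKEN_RE = re.compile(r"</?noindex>|\w+")

def build_word_index(text):
    """Toy model: map each indexed word to the list of its word positions."""
    index = defaultdict(list)
    position = 0        # word counter; keeps advancing inside a noindex zone
    in_noindex = False
    for token in TOKEN_RE.findall(text):
        if token == "<noindex>":
            in_noindex = True       # zone boundary itself is not recorded
        elif token == "</noindex>":
            in_noindex = False
        else:
            if not in_noindex:
                index[token.lower()].append(position)
            position += 1           # counted whether or not it was indexed
    return dict(index)

doc = "normal text <noindex>hidden onomatopoeia</noindex> visible again"
word_index = build_word_index(doc)
print(word_index)
```

In this sketch the words after the zone keep the positions they would have had without the zone, because the counter is not paused while the zone is skipped.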
Noextract Zones
The noextract zone is a special zone whose contents are not processed during feature extraction. Using this zone, you can selectively exclude sections of a document from feature extraction (for clustering and Query-By-Example) and summarization. The summarization and feature extraction component recognizes the special zone token called NOEXTRACT. Anything between the start and end of a noextract zone is ignored by the feature extractor.
The use is analogous to that of the special noindex zone, which is described in the previous section. Like a noindex zone, a noextract zone can be inserted with the style.dft mechanism, with NOEXTRACT tags in SGML documents, or manually if you are using a custom gateway.
If you are developing a custom gateway using the Verity Gateway Developer Kit, insert a zone token named noextract before and after the text to be ignored (with the start and end flags set appropriately). The noextract zone is not indexed as a zone by the indexer, though the text within the zone is indexed.
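The difference between what the indexer sees and what the feature extractor sees can be sketched in Python. This is a hypothetical model only: the ZoneToken class and both functions are invented for illustration and do not reflect the Verity Gateway Developer Kit API, which is not shown in this section.

```python
from dataclasses import dataclass
from typing import Iterable, List, Union

@dataclass
class ZoneToken:
    """Hypothetical zone token: a zone name plus a start/end flag."""
    name: str
    start: bool  # True marks the zone start, False marks the zone end

Token = Union[str, ZoneToken]

def words_for_extraction(tokens: Iterable[Token]) -> List[str]:
    """Return the words a feature extractor would consider in this model."""
    depth = 0
    kept = []
    for tok in tokens:
        if isinstance(tok, ZoneToken):
            if tok.name.lower() == "noextract":
                depth += 1 if tok.start else -1
                depth = max(depth, 0)   # tolerate a stray end token
            continue
        if depth == 0:                  # skip words inside a noextract zone
            kept.append(tok)
    return kept

def words_for_indexing(tokens: Iterable[Token]) -> List[str]:
    """The indexer in this model still receives every word."""
    return [tok for tok in tokens if isinstance(tok, str)]

stream = ["Quarterly", "report",
          ZoneToken("noextract", True), "legal", "boilerplate",
          ZoneToken("noextract", False), "summary"]
extract_words = words_for_extraction(stream)
index_words = words_for_indexing(stream)
print(extract_words)
print(index_words)
```

The words between the two noextract tokens are dropped only from the extraction list; the indexing list still contains them, which mirrors the statement that text within the zone is indexed.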
noextract
zones allow you to add text to the virtual document that gets indexed but cannot be viewed. This provides a way to add document fields to the full-text index for the document, so that they can be searched faster than with standard field search while remaining hidden from anyone viewing the document. If the fields are enclosed in "hidden" zones, the fields can be searched using standard zone search syntax.
Hidden elements must be placed after all of the visible elements in the virtual document, as defined in the style.dft file. For information on implementing hidden elements in zones, refer to the previous section, "Hidden Elements in Zones."