The Universal Filter


This section provides an overview of the universal filter and its implementation.

Like any other document filter, the universal filter produces indexable (or viewable) text. The difference is that it selects its filtering dynamically, according to the type of each document, using a number of "helper" subfilters. For example, Microsoft Word documents are filtered with one set of filters (using the KeyView Filter Kit), and HTML documents are filtered with a different set of filters (the current zone filter).

The advantage of the universal filter is that it removes the need to specify the document type and character set of documents before creating the collection, and it allows multiple document types written in multiple character sets to be indexed into the same collection.

The universal filter is configurable. It has a configuration file that tells it how to filter each type of document that it sees. It also allows multiple filters on each document, so that you are not limited to a single type of filter. The goal of the universal filter design, however, was to be able to filter all important document types "out of the box." That is, the default configuration file that ships with the search engine should be sufficient for almost all documents that you might want to index. Configuration is offered in case you have special needs that are not addressed in the standard configuration file.

Invoking the Universal Filter

The universal filter is invoked by default, unless you override the default style.dft file in the styleset for your collection. When you index your collection or view documents in your collection, the universal filter filters each document appropriately. For example, the following command creates a collection called mycoll and indexes the files matched by * in the current directory:

mkvdk -create -collection mycoll -insert *

How the Universal Filter Works

The sections below describe how the universal filter components work together to filter documents during indexing and viewing operations.

Components

The universal filter has a number of different components:

1. The universal filter itself: This segment installs and synchronizes all the other stream segments.

2. The autorecognizer segments: These segments recognize the type of the document.

3. The format filters: The job of the format filters is to extract indexable text from a binary file.

4. The charmap filter: The job of the charmap filter is to guarantee that all text is written in the internal character set.

5. The content filters: The job of the content filters is to extract meta-information such as fields or zones from the text of the document and send that meta-information up the stream.
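
The order in which these components run can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (recognize, format_filter, charmap, content_filter, universal_filter), the byte-prefix type checks, and the use of Latin-1 as the source character set are all hypothetical and are not part of the Verity API. The PDF format filter is deliberately left as a stub.

```python
import re
from typing import Dict, Tuple

# Hypothetical sketch: none of these names are Verity APIs.

def recognize(data: bytes) -> str:
    """Autorecognizer stand-in: guess the document type from the leading bytes."""
    head = data.lstrip()[:64].lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith(b"<html") or head.startswith(b"<!doctype html"):
        return "html"
    return "text"

def format_filter(doc_type: str, data: bytes) -> bytes:
    """Format filter stand-in: a real one extracts text from a binary file."""
    if doc_type == "pdf":
        raise NotImplementedError("binary extraction is out of scope for this sketch")
    return data  # HTML and plain text are already text

def charmap(raw: bytes, source_charset: str = "latin-1") -> str:
    """Charmap stand-in: convert the text to one internal representation."""
    return raw.decode(source_charset)

def content_filter(doc_type: str, text: str) -> Tuple[str, Dict[str, str]]:
    """Content filter stand-in: pull fields out of the text and strip markup."""
    fields: Dict[str, str] = {}
    if doc_type == "html":
        match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
        if match:
            fields["title"] = match.group(1).strip()
        text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(text.split()), fields

def universal_filter(data: bytes) -> Tuple[str, str, Dict[str, str]]:
    """Run the stages in order: recognize, format filter, charmap, content filter."""
    doc_type = recognize(data)
    raw_text = format_filter(doc_type, data)
    text = charmap(raw_text)
    body, fields = content_filter(doc_type, text)
    return doc_type, body, fields
```

In this sketch, calling universal_filter on a small HTML page returns the recognized type, the indexable text with tags removed, and a fields dictionary holding the page title, while plain text passes through with no fields.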

How Filtering Occurs

The following diagrams show how the components described above interact. If the first document to be processed were in PDF format, the following would occur:

If the next document to be processed were a text document, the PDF filters would remain in memory for future use, and the following would occur:

Similarly, if the next document were in Microsoft Word format, the following would occur:

Later, if another PDF document were indexed, the universal filter would reuse the PDF filters it had previously set up.
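
The reuse behavior described above amounts to setting up the filters for a document type once and keeping them for later documents of the same type. A minimal sketch of that idea follows, assuming a hypothetical FilterCache class and made-up filter names; it does not reflect how the engine is actually implemented.

```python
from typing import Callable, Dict, List

class FilterCache:
    """Hypothetical cache: builds the filters for a type once, then reuses them."""

    def __init__(self, factories: Dict[str, Callable[[], List[str]]]) -> None:
        self._factories = factories
        self._loaded: Dict[str, List[str]] = {}
        self.setups: List[str] = []  # document types that triggered a setup

    def get(self, doc_type: str) -> List[str]:
        if doc_type not in self._loaded:
            # First document of this type: set up its filters and remember them.
            self.setups.append(doc_type)
            self._loaded[doc_type] = self._factories[doc_type]()
        return self._loaded[doc_type]

# Made-up filter names, for illustration only.
cache = FilterCache({
    "pdf": lambda: ["pdf-format", "charmap", "content"],
    "text": lambda: ["charmap", "content"],
    "word": lambda: ["word-format", "charmap", "content"],
})

# Same sequence as in the text: PDF, text, Word, then PDF again.
for doc_type in ["pdf", "text", "word", "pdf"]:
    cache.get(doc_type)
```

After the loop, cache.setups holds each type exactly once, because the second PDF document reuses the filters created for the first.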

Character Set Recognition and Mapping

The charmap segment, which is inserted between the format filters and the content filters, guarantees that all text it produces is written in the internal character set. Because different file types are written in different character sets, the charmap segment must sometimes determine the character set of each document's text dynamically. If the /charset=guess modifier is given for any type in the style.uni file, the charmap segment automatically determines the character set of each document and installs the correct character set mapping.
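
To make the idea of guessing and then mapping concrete, here is a deliberately naive sketch. It is not Verity's recognizer: it simply tries a fixed list of strict decoders in order and takes the first that succeeds. The candidate list, the function names guess_charset and to_internal, and the choice of UTF-8 as the internal character set are all assumptions made for this example.

```python
# Hypothetical sketch: trial decoding, not the engine's recognition algorithm.

INTERNAL = "utf-8"  # assumed internal character set, for illustration only
CANDIDATES = ("ascii", "utf-8", "cp1252", "latin-1")

def guess_charset(raw: bytes) -> str:
    """Return the first candidate charset that decodes the bytes without error."""
    for name in CANDIDATES:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # Not reached in practice: latin-1 can decode any byte sequence.
    raise ValueError("no candidate charset matched")

def to_internal(raw: bytes, charset: str = "guess") -> bytes:
    """Map text to the internal charset; 'guess' mirrors the /charset=guess idea."""
    source = guess_charset(raw) if charset == "guess" else charset
    return raw.decode(source).encode(INTERNAL)
```

Trial decoding cannot tell apart character sets that accept the same bytes, which is why a production recognizer relies on statistical analysis of the text rather than on this kind of fallback chain.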

The Verity internationalization infrastructure includes the ability to determine the character set of a piece of text with very high precision. For the Western European locales, the recognition can be more than 99% correct.

The charmap segment can recognize the following character sets in these locales:

Character set recognition is also available in other locales provided by Verity's partners, such as Japanese and Chinese.

Checking File Types

To see how the Verity engine evaluates the document type and character set of a particular document, look at the "info" messages that the engine produces. You can display these messages by running mkvdk with the -verbose flag, as shown below:

mkvdk -verbose -create -collection mycoll -insert mydocs/*

The example above indexes a set of documents into a collection called mycoll in verbose mode. For complete information about using mkvdk, refer to the Verity K2 Indexers Guide.

If errors occur, you may want to check the document type recognized by the engine. For example, if the engine interprets a web page as a plain ASCII document, zone searching will not work; if the autorecognizer assigns the wrong character set to a document, extended characters will not be displayed correctly.

Document types recognized by default are listed in detail in Appendix B, "Universal Filter Document Types".





Copyright © 2002, Verity, Inc. All rights reserved.