The XML Filter


The XML filter supports indexing and viewing well-formed XML documents. Metadata extraction is also supported.

Requirements for Indexing XML Documents

To prepare for indexing XML documents:

1. Make sure that the XML filter (flt_xml.dll, flt_xml.sl, flt_xml.so) resides in the bin directory for the installed platform.

2. Make sure that the style.uni file contains the directive for invoking the XML filter.

3. If custom fields or zones are required, define them in the style.ufl file.

4. Specify custom fields to be populated in the style.xml file.

Requirements for Data Files

To be properly indexed, XML data files must be well-formed XML documents as specified in the Extensible Markup Language Recommendation (http://www.w3.org/TR/REC-xml).

Briefly stated, a well-formed XML document contains elements that begin with a start tag and terminate with an end tag. Exactly one element, called the root or document element, does not appear in the content of any other element. For every other element, if the start tag is in the content of another element, the end tag must be in the content of that same element; in other words, elements nest properly and never overlap.
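
As an illustration of these rules, here is a minimal well-formed document. The element names (catalog, item, name) are invented for this example and do not come from any Verity schema. The closing comment shows a variant that would not be well-formed, because its elements overlap instead of nesting.

```xml
<?xml version="1.0"?>
<!-- Hypothetical element names, chosen only to illustrate nesting. -->
<catalog>
  <item>
    <name>Widget</name>
  </item>
</catalog>
<!-- Not well-formed: <item><name>Widget</item></name>
     (the name element starts inside item but ends outside it). -->
```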

The XML data files must have an .xml extension if the universal filter is used; the universal filter is specified in the style.dft file. XML documents without the .xml extension can be indexed into a collection that contains only XML documents if the style.dft file specifies the XML filter instead of the universal filter. For more information, see the "style.dft File" section under "Style File Configuration" below.

Implementation Summary

Verity support for XML documents is implemented by a new XML filter and controlled using a number of style files.

XML Filter

The new XML filter (flt_xml.dll, flt_xml.sl, flt_xml.so) resides in the bin directory for the installed platform.

Style Files

The following style files are required to enable indexing of XML files. Default style files reside in the /common/vdkstyle directory.

Style File    Description
style.uni     Invokes the XML filter for indexing XML documents.
style.xml     Modifies the default behavior of the XML filter.
style.ufl     Defines custom fields in XML documents. The fields must also be defined in the style.xml file.
style.dft     Specifies whether the universal filter or the XML filter will be used to index the collection. If the XML filter is specified, XML documents can be indexed into their own collection and the .xml file extension for data files is not required.

Style File Configuration

style.uni File

To index XML documents, the style.uni file must include the following lines:


type: "text/xml"
/format-filter = "flt_xml"
/charset = guess
/def-charset = 8859

NOTE: Some versions of the style.uni file specify that text/xml content be handled by flt-zone. Replace that specification with the lines above.

style.xml File

By default, the XML filter indexes regions of the document delimited by XML tags as zones, with the zones given the same name as the XML tag. META tags are automatically indexed as fields unless they are in a suppressed region.
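
To make the default behavior concrete, consider the hypothetical data file below (the element names are invented for illustration). With no style.xml commands in effect, the description above suggests that its content would be indexed into zones named report, title, and body, one per XML tag.

```xml
<?xml version="1.0"?>
<!-- Hypothetical data file; by default each tag name becomes a zone name. -->
<report>
  <title>Quarterly results</title>
  <body>Revenue grew in every region.</body>
</report>
```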

The style.xml file enables administrators to change the default behavior of the indexer for XML documents. Administrators can specify field and zone indexing for regions of the document delimited by XML tags and skip regions of the document delimited by XML tags.

The sample style.xml contains code examples that are commented out.

style.xml Command Syntax:

<command attribute="value"/>

style.xml Command Summary:

Command     Attributes                  Description
field       xmltag, fieldname, index    Indexes the content between the pair of specified XML tags as field values. By default, the field name is the same as the xmltag value, unless otherwise specified by the fieldname attribute.
ignore      xmltag                      Skips indexing of xmltag but indexes the content between the pair of specified XML tags.
preserve    xmltag                      Indexes specified xmltag as a zone if preceded by ignore xmltag="*".
suppress    xmltag                      Suppresses every xmltag embedded within the specified xmltag.

style.xml Command Examples:

The following command ignores all XML tags in the document, indexing only the content:

<ignore xmltag = "*"/>

The following command skips indexing the specified xmltag but indexes the content between the start and end tags of the specified xmltag:

<ignore xmltag = "section_1"/>

The following command indexes xmltag as a zone if there is also an ignore xmltag="*" command:

<preserve xmltag = "section_1"/>
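
Because preserve only takes effect alongside the wildcard ignore command, the two would normally appear together, with ignore first. The fragment below is a sketch that reuses the section_1 tag name from the examples above; the enclosing structure of the style.xml file is not shown because this section does not describe it.

```xml
<!-- Treat every tag as plain content (no zones)... -->
<ignore xmltag = "*"/>
<!-- ...except section_1, which is still indexed as a zone. -->
<preserve xmltag = "section_1"/>
```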

The following command suppresses the entire element identified by xmltag. The tag, its attributes, and its content are not indexed:

<suppress xmltag = "section_1"/>
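
As a sketch of what this means for a data file, consider the hypothetical document below. Only the section_1 tag name comes from the command above; the other element names and the status attribute are invented. With the suppress command in effect, the section_1 element and everything nested inside it would be left out of the index, while the title element would be indexed as usual.

```xml
<?xml version="1.0"?>
<!-- Hypothetical data file used to illustrate suppress. -->
<doc>
  <title>Indexed as usual</title>
  <section_1 status="draft">
    Not indexed when section_1 is suppressed,
    <note>nor is this nested element.</note>
  </section_1>
</doc>
```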

The following command indexes the content between the start and end tags of the specified xmltag as a field, which is given the same name as xmltag:

<field xmltag = "column_1"/>

The following command indexes the content between the start and end tags of the specified xmltag as a field, which is given the name specified in the fieldname attribute:

<field xmltag = "column_2" fieldname = "vdk_field_2"/>

The following command indexes the content between the start and end tags of the specified xmltag as a field, overriding any existing value of the field:

<field xmltag = "column_2" index = "override"/>

NOTE: Both fieldname and index attributes can be used in a field command.
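
For instance, the two attributes from the preceding examples might be combined as follows. This is a sketch that reuses the column_2 and vdk_field_2 names from above; as a custom field, vdk_field_2 would also need to be defined in the style.ufl file.

```xml
<!-- Index column_2 content as the field vdk_field_2, overriding any existing value. -->
<field xmltag = "column_2" fieldname = "vdk_field_2" index = "override"/>
```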

style.ufl File

If administrators have defined custom fields to be populated in the style.xml file, the fields must also be defined in the style.ufl file or style.sfl file, using standard syntax.

style.dft File

To create a collection that contains only XML documents, administrators can modify the style.dft file to invoke the XML filter directly. In this case, the XML documents do not need an .xml extension.

The style.dft file must include the following lines:


$control: 1
dft:
{
field: DOC
/filter="flt_xml"
}

Indexing XML Documents

To index XML documents using a command-line indexer, issue these commands:


mkvdk -create -style styledir -collection collname
mkvdk -collection collname file1.xml file2.xml filen.xml

Or, using a file list (flist.txt):

mkvdk -create -style styledir -collection collname @flist.txt

The specified style directory must contain the modified style.uni and style.xml files to enable XML document indexing support.





Copyright © 2002, Verity, Inc. All rights reserved.