Customized metadata mapping from OmniFind Enterprise Edition
Overview
OmniFind Enterprise Edition appends special metadata to crawled documents.
When the documents reach the IBM Content Analyzer PEAR module, the module
can convert this metadata into a form that IBM Content Analyzer recognizes.
You can view the metadata in Text Miner along with the text analytics data
generated by IBM Content Analyzer.
The metadata conversion rules of the PEAR module can be customized by each
user. The PEAR module includes a default set of conversions for convenience.
This document describes the structure of OmniFind Enterprise Edition metadata,
the default mapping rule, and how to customize the mapping.
OmniFind Enterprise Edition metadata
OmniFind Enterprise Edition appends two kinds of metadata: document metadata
and fields.
Document metadata is a fixed set of metadata, primarily used by
the system. Currently the following items are defined in the document metadata:
- contentLanguage
- metaLanguage
- hasSeparatContent
- metalength
- url
- mimeType
- dataSourceName
- throttleID
- docType
- baseUri
- dataSource
- knownLanguage
- documentId
- crawlerId
- compressed
- actualCharset
- documentName
- httpcode
- charset
- date
- isCompleted
- sequenceNumber
- title
- staticScore
- truncated
- contentlength
- knownCharset
- redirectUrl
- rdstype
Note that not all items are guaranteed to be filled, and the item definitions
can change in future releases.
Fields are more flexible metadata. In addition to the default fields that crawlers
assign to crawled documents, users can define their own fields. The field value may or may not
be in the document itself. For more detail, refer to the OmniFind Enterprise Edition manuals.
Default mapping
A default metadata mapping comes with the IBM Content Analyzer PEAR module
for convenience. By default, the PEAR module converts all metadata based on the following rule:
Source OmniFind Enterprise Edition metadata | Target IBM Content Analyzer metadata |
Document metadata | Category under .ofee |
Field, the value is outside the document | Category under .offield |
Field, the value is inside the document | Text name |
For the first two types of metadata, the metadata name becomes the category name and the
metadata value becomes the category value. To show this metadata in Text Miner, create these
categories in your category tree. If a field value resides in the document content,
the metadata is regarded as a label of the text segment. The field name is therefore
converted into a text name, as in the last row. Text Miner shows which
segment of the document content is labeled by the field.
By looking at the resulting MIML file, you can see what kind of metadata
OmniFind Enterprise Edition attaches. If you want to change the name or category path of
the metadata, or remove unnecessary metadata, write a custom mapping rule
as described in the following section.
Custom mapping
The configuration file for the metadata mapping is $PEAR_ROOT/database/conf/ofmapping.xml.
$PEAR_ROOT is the root directory of the installed PEAR module in OmniFind Enterprise Edition,
which is $ES_NODE_ROOT/data/pearsupport/PearIDN by default. The configuration file
initially looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadataMapping>
  <useDefaultMapping />
</metadataMapping>
The useDefaultMapping element indicates that the default mapping is used.
You can change the configuration file to update the mapping rules. Here is a sample configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<metadataMapping>
  <documentMetadata>
    <metadata>
      <name>url</name>
      <category>.url</category>
    </metadata>
  </documentMetadata>
  <fields>
    <field>
      <name>author</name>
      <textname>Authors</textname>
      <category>.doc.author</category>
    </field>
  </fields>
</metadataMapping>
There are two types of configuration, documentMetadata and fields.
Each contains zero or more settings, one for each metadata item. Each metadata item
is specified with the following three elements.
Element | Description |
name | Name of document metadata or field, as the source of the mapping. |
textname | The name of the text segment shown in Text Miner. If the mapping source is a field and its value is inside the document content, Text Miner labels the text segment with this name. |
category | Category path, as the target of the mapping. If the mapping source is found, its value is mapped to this category. This element applies to all types of metadata. |
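As an illustration of these elements, the following sketch maps two items from the document metadata list, title and dataSourceName, to categories. The category paths .doc.title and .doc.source are hypothetical names chosen for this example, and the empty fields section assumes that a section with zero settings is acceptable. As with the default mapping, the categories would need to exist in your category tree to appear in Text Miner.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<metadataMapping>
  <documentMetadata>
    <!-- title and dataSourceName are taken from the document metadata list -->
    <metadata>
      <name>title</name>
      <!-- hypothetical category path -->
      <category>.doc.title</category>
    </metadata>
    <metadata>
      <name>dataSourceName</name>
      <!-- hypothetical category path -->
      <category>.doc.source</category>
    </metadata>
  </documentMetadata>
  <!-- assumption: no field mappings are needed, so this section is left empty -->
  <fields>
  </fields>
</metadataMapping>
```

Only name and category are used here because textname is described as applying to fields whose value is inside the document content.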
Sample usage scenario
Let's take a look at a sample scenario using the metadata mapping. Here is a DB2 database table
which contains structured data and unstructured (text) data:
NO (integer) | NAME (varchar(50)) | VERSION (integer) | VENDOR (varchar(30)) | COMMENT (varchar(100)) |
1 | DB2 | 9 | IBM | Hybrid data server for both XML and relational data. |
2 | WebSphere Application Server | 6 | IBM | It delivers the secure, scalable, resilient application infrastructure. |
We crawl the table with the DB2 crawler of OmniFind Enterprise Edition, and then
analyze the data with the IBM Content Analyzer PEAR module using a custom metadata
mapping.
In the crawl space view of the DB2 crawler, the table looks as follows. By default,
the column names are used as the field names.
We use the following mapping configuration file. The configuration maps the document metadata
url to the category .url. It also gives the text name Vendor to the field vendor,
and maps that field to the category .company.
<?xml version="1.0" encoding="UTF-8"?>
<metadataMapping>
  <documentMetadata>
    <metadata>
      <name>url</name>
      <category>.url</category>
    </metadata>
  </documentMetadata>
  <fields>
    <field>
      <name>vendor</name>
      <textname>Vendor</textname>
      <category>.company</category>
    </field>
  </fields>
</metadataMapping>
With an appropriate category tree definition, the crawled data is displayed in Text Miner
as follows:
In Text Miner, you can see the categories 'Company' and 'OmniFind URL', which represent
the categories .company and .url defined in the mapping configuration file.
Each document in the right-side pane shows the named text 'Vendor' in addition to
the whole content, as specified by the textname element in the mapping configuration file.
Now that these metadata are mapped into categories, you can use them for
further text analysis.
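The same approach could be extended to other columns of the sample table. The sketch below adds a mapping for the NAME column to the configuration used in this scenario. It assumes the field name is name in lower case, following the vendor field in the scenario; the text name Product and the category .product are hypothetical and would also need a matching entry in the category tree.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<metadataMapping>
  <documentMetadata>
    <metadata>
      <name>url</name>
      <category>.url</category>
    </metadata>
  </documentMetadata>
  <fields>
    <field>
      <name>vendor</name>
      <textname>Vendor</textname>
      <category>.company</category>
    </field>
    <!-- assumed addition: field name, text name, and category path are hypothetical -->
    <field>
      <name>name</name>
      <textname>Product</textname>
      <category>.product</category>
    </field>
  </fields>
</metadataMapping>
```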