Follow the instructions at
http://publib.boulder.ibm.com/infocenter/discover/v8r4/index.jsp?topic=/com.ibm.discovery.es.ad.doc/administering/iiysauima.htm
to upload the IBM Content Analyzer PEAR module to the OmniFind
Enterprise Edition index server. Because the PEAR module is larger than
8 MB, you should copy it onto the OmniFind Enterprise Edition index
server. IBM Content Analyzer PEAR modules are in the $TAKMI_HOME/pear
directory. Each PEAR module is packaged for one language.
The PEAR contents will be installed in
$ES_NODE_ROOT/data/pearsupport/PearIDN, where N is a unique number that
identifies the PEAR module on the index server. This directory is
referred to as $PEAR_ROOT in this document.
If IBM Content Analyzer is not installed on the OmniFind Enterprise
Edition index server, the system library path must include
$PEAR_ROOT/bin to use the Japanese PEAR module. See
this section of the document.
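As a rough illustration of the library path requirement, the sketch below builds an environment in which $PEAR_ROOT/bin comes first in the shared-library search path. The function names are hypothetical. The variable names are the usual operating system conventions (LD_LIBRARY_PATH on Linux and Solaris, LIBPATH on AIX, PATH on Windows); how the variable actually reaches the OmniFind Enterprise Edition processes depends on your installation and is not covered here.

```python
import os
import sys


def library_path_variable(platform=sys.platform):
    """Name of the shared-library search path variable for a platform string."""
    if platform.startswith("win"):
        return "PATH"
    if platform.startswith("aix"):
        return "LIBPATH"
    return "LD_LIBRARY_PATH"  # Linux and Solaris


def env_with_pear_bin(pear_root, env=None, platform=sys.platform):
    """Return a copy of env whose library path starts with <pear_root>/bin."""
    env = dict(os.environ if env is None else env)
    var = library_path_variable(platform)
    sep = ";" if platform.startswith("win") else ":"
    bin_dir = pear_root.rstrip("/\\") + "/bin"
    # Keep existing entries, drop empty ones, and avoid adding a duplicate.
    parts = [p for p in env.get(var, "").split(sep) if p]
    if bin_dir not in parts:
        parts.insert(0, bin_dir)
    env[var] = sep.join(parts)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting a process that needs the PEAR libraries.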
You can also specify how to map OmniFind Enterprise Edition metadata to
IBM Content Analyzer. The mapping is described in
this manual.
$PEAR_ROOT/database is the default IBM Content Analyzer
database structure used by the PEAR module. To use a custom
category tree or custom dictionaries, edit them under this
directory and run the takmi_nlp_resource_deploy command, as you would
for a normal IBM Content Analyzer database.
For the English PEAR module, follow these steps to use IBM
Content Analyzer custom dictionaries inside the OmniFind Enterprise
Edition parser:
- Stop the OmniFind Enterprise Edition parser process.
- Copy $PEAR_ROOT/database/dic/LangWare50/en-XX-TAKMIUserNE.dic to the
$ES_INSTALL_ROOT/configurations/parserservice/jediidata/frost/resources
directory.
- Back up
$ES_NODE_ROOT/master_config/collection_id.parserdriver/specifiers/jfrost.xml,
where collection_id is the collection ID of the target collection.
- Edit
$ES_NODE_ROOT/master_config/collection_id.parserdriver/specifiers/jfrost.xml
as follows:
- To use the IBM Content Analyzer custom dictionary, add
en-XX-TAKMIUserNE.dic to the English LexicalDicts entry as follows:
    <!-- English -->
    <settingsForGroup name="en">
      <nameValuePair>
        <name>LexicalDicts</name>
        <value>
          <array>
            <string>en-XX-TAKMIUserNE.dic</string>
            <string>en-XX-Lex.dic</string>
          </array>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>StopwordDicts</name>
        <value>
          <array>
            <string>en-Stw.dic</string>
          </array>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>SpellCorrectionDicts</name>
        <value>
          <array>
            <string>en-XX-Rules.dic</string>
          </array>
        </value>
      </nameValuePair>
    </settingsForGroup>
- By default, the OmniFind Enterprise Edition parser breaks URLs
and mail addresses into word pieces; for example, “somebody@jp.ibm.com”
becomes the word sequence “somebody”, “@”, “jp”, “.”, “ibm”, “.”,
and “com”. To process URLs or mail addresses as one word,
set the DoURISegmentation option to false as follows:
    <nameValuePair>
      <name>DoURISegmentation</name>
      <value>
        <boolean>false</boolean>
      </value>
    </nameValuePair>
Note that this changes the behavior of OmniFind Enterprise Edition:
the query “somebody” will not match “somebody@jp.ibm.com”
under this configuration.
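The jfrost.xml edits above can also be scripted. The following is a minimal sketch in Python, not an IBM-provided tool. The function names are hypothetical, and it assumes that the element names in jfrost.xml are not namespace-qualified, as in the fragments shown. It only changes a DoURISegmentation entry that already exists, because the fragments do not show where that entry sits in the file. Note that xml.etree.ElementTree drops XML comments when it rewrites a file, so keep the backup.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def find_pair(parent, name):
    """Return the direct <nameValuePair> child whose <name> is `name`, or None."""
    for pair in parent.findall("nameValuePair"):
        if (pair.findtext("name") or "").strip() == name:
            return pair
    return None


def add_lexical_dict(root, dict_name, group="en"):
    """Put dict_name first in the LexicalDicts array of one language group.

    Returns True if the group and its LexicalDicts array were found.
    """
    for grp in root.iter("settingsForGroup"):
        if grp.get("name") != group:
            continue
        pair = find_pair(grp, "LexicalDicts")
        array = pair.find("value/array") if pair is not None else None
        if array is None:
            return False
        names = [(s.text or "").strip() for s in array.findall("string")]
        if dict_name not in names:
            entry = ET.Element("string")
            entry.text = dict_name
            array.insert(0, entry)  # listed first, as in the fragment above
        return True
    return False


def set_uri_segmentation(root, enabled):
    """Update every existing DoURISegmentation pair; return how many changed."""
    count = 0
    for pair in root.iter("nameValuePair"):
        if (pair.findtext("name") or "").strip() != "DoURISegmentation":
            continue
        flag = pair.find("value/boolean")
        if flag is not None:
            flag.text = "true" if enabled else "false"
            count += 1
    return count


def edit_jfrost(path, dict_name="en-XX-TAKMIUserNE.dic", keep_uris_whole=False):
    """Back up the specifier file, then apply the edits in place."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(str(path), str(backup))  # the backup step
    tree = ET.parse(str(path))
    root = tree.getroot()
    if not add_lexical_dict(root, dict_name):
        raise ValueError("no English LexicalDicts array found in " + str(path))
    if keep_uris_whole:
        set_uri_segmentation(root, enabled=False)
    tree.write(str(path), encoding="UTF-8", xml_declaration=True)
    return backup
```

Calling `edit_jfrost(path, keep_uris_whole=True)` on the collection's jfrost.xml writes a copy named jfrost.xml.bak next to it and applies both changes. Run it only while the parser is stopped, as in the steps above.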
NLP processing
Once the PEAR module is associated with a particular OmniFind Enterprise
Edition parser, the parser feeds documents to the PEAR module, and MIML
files are produced in the $PEAR_ROOT/database/db/miml directory.
Configuration parameters
- Database directory
- By default, the PEAR module uses $PEAR_ROOT/database as the database
directory. To change the database directory, replace every
occurrence of $PEAR_ROOT/database with the new database directory path
in the descriptors under the $PEAR_ROOT/desc directory. Both the OmniFind
Enterprise Edition and IBM Content Analyzer admin users must be
able to read and write the new directory.
- Output MIML files
- Configuration parameters for the output MIML files are defined in
MIMLWriteAnnotator.xml and set in OAE_PACK_en.xml and OAE_PACK_ja.xml.
All descriptors are located in the $PEAR_ROOT/desc/text_analysis_engine
directory. The following configuration parameters are defined:
- Language
- Output language: “en” or “ja”.
- DocumentsPerMIML
- The maximum number of documents written to one MIML file.
- OutputDirectory
- The directory where output MIML files are located.
- OutputBasename
- The basename of the MIML files, which defaults to “docset”. The MIML
files are named “docset_YYYYMMDD_HHMMSS_N.miml”.
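Returning to the database directory setting: replacing every occurrence of the old path across the descriptors is easy to get wrong by hand, so a small script can help. The sketch below is one possible approach under stated assumptions, not an IBM-provided tool. The function name is hypothetical, the descriptors are assumed to be .xml files that contain the path as plain UTF-8 text, and `old_dir` must be given exactly as it is written in the descriptors.

```python
from pathlib import Path


def retarget_database_dir(desc_dir, old_dir, new_dir, pattern="*.xml"):
    """Replace old_dir with new_dir in every descriptor under desc_dir.

    Works on raw bytes so that line endings and the rest of each file
    stay untouched. Returns the list of files that were modified.
    """
    old = old_dir.encode("utf-8")
    new = new_dir.encode("utf-8")
    changed = []
    for descriptor in sorted(Path(desc_dir).rglob(pattern)):
        data = descriptor.read_bytes()
        if old not in data:
            continue  # this descriptor does not mention the database path
        descriptor.write_bytes(data.replace(old, new))
        changed.append(descriptor)
    return changed
```

Back up the desc directory before running it, and review the returned list. This is a plain substring replacement, so a path that merely begins with `old_dir` (for example a sibling directory with a longer name) would be rewritten as well.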