How to use the IBM Content Analyzer PEAR module with IBM OmniFind Enterprise Edition

This document describes how to use the IBM Content Analyzer PEAR module with OmniFind Enterprise Edition as a custom UIMA analysis engine. For the definitions of $ES_INSTALL_ROOT and $ES_NODE_ROOT, refer to http://publib.boulder.ibm.com/infocenter/discover/v8r4/index.jsp?topic=/com.ibm.discovery.es.in.doc/installing/iiysidirs.htm
PEAR setup
Follow http://publib.boulder.ibm.com/infocenter/discover/v8r4/index.jsp?topic=/com.ibm.discovery.es.ad.doc/administering/iiysauima.htm to upload the IBM Content Analyzer PEAR module to the OmniFind Enterprise Edition index server. Because the PEAR module is larger than 8 MB, you should copy it onto the OmniFind Enterprise Edition index server. The IBM Content Analyzer PEAR modules are in the $TAKMI_HOME/pear directory. Each PEAR module is packaged for one language.
The PEAR contents are installed in $ES_NODE_ROOT/data/pearsupport/PearIDN, where N is a unique number that identifies the PEAR module on the index server. This directory is referred to as $PEAR_ROOT in this document.
If the OmniFind Enterprise Edition index server does not have IBM Content Analyzer installed, the system library path should contain $PEAR_ROOT/bin in order to use the Japanese PEAR module. See “Cannot load JSA library module” under “Problem determination” in this document.
You can also specify how to map OmniFind Enterprise Edition metadata to IBM Content Analyzer. The mapping is described in this manual.
NLP Resource deployment
$PEAR_ROOT/database is the default IBM Content Analyzer database structure used by the PEAR module. To use a custom category tree or custom dictionaries, edit them under this directory and run the takmi_nlp_resource_deploy command, the same as for normal IBM Content Analyzer databases.
For the English PEAR module, follow these steps to use IBM Content Analyzer custom dictionaries inside the OmniFind Enterprise Edition parser:
  1. Stop the OmniFind Enterprise Edition parser process.
  2. Copy $PEAR_ROOT/database/dic/LangWare50/en-XX-TAKMIUserNE.dic to the $ES_INSTALL_ROOT/configurations/parserservice/jediidata/frost/resources directory.
  3. Back up $ES_NODE_ROOT/master_config/collection_id.parserdriver/specifiers/jfrost.xml, where collection_id is the collection ID of the target collection.
  4. Edit $ES_NODE_ROOT/master_config/collection_id.parserdriver/specifiers/jfrost.xml as follows:
    1. To use the IBM Content Analyzer custom dictionary, add en-XX-TAKMIUserNE.dic to the English LexicalDicts entry as follows:
      
            <!-- English -->
            <settingsForGroup name="en">
              <nameValuePair>
                <name>LexicalDicts</name>
                <value>
                  <array>
                    <string>en-XX-TAKMIUserNE.dic</string>
                    <string>en-XX-Lex.dic</string>
                  </array>
                </value>
              </nameValuePair>
              <nameValuePair>
                 <name>StopwordDicts</name>
                 <value>
                   <array>
                     <string>en-Stw.dic</string>
                   </array>
                 </value>
              </nameValuePair>
              <nameValuePair>
                  <name>SpellCorrectionDicts</name>
                  <value>
                    <array>
                      <string>en-XX-Rules.dic</string>
                    </array>
                  </value>
              </nameValuePair>
            </settingsForGroup>
      
    2. By default, the OmniFind Enterprise Edition parser breaks URLs and mail addresses into word pieces; for example, “somebody@jp.ibm.com” becomes the word sequence “somebody”, “@”, “jp”, “.”, “ibm”, “.”, and “com”. If you need to process URLs or mail addresses as one word, set the DoURISegmentation option to false as follows:
      
      <nameValuePair>
        <name>DoURISegmentation</name>
        <value>
          <boolean>false</boolean>
        </value>
      </nameValuePair>
      
      Please note that this will change the OmniFind Enterprise Edition behavior. The query “somebody” will not match “somebody@jp.ibm.com” under this configuration.
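Steps 3 and 4.1 above can also be scripted. The following is a minimal Python sketch and not part of either product: the function names add_lexical_dict and patch_jfrost are hypothetical, and the code assumes that jfrost.xml contains the settingsForGroup / nameValuePair structure shown in the fragment above. Stop the parser before running it, as in step 1.

```python
import shutil
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def add_lexical_dict(root, dict_name, group="en"):
    """Put dict_name first in the LexicalDicts array of the given group.

    Returns True if the tree was changed, False if the entry already
    existed or no matching LexicalDicts array was found.
    """
    for grp in root.iter():
        if local(grp.tag) != "settingsForGroup" or grp.get("name") != group:
            continue
        for pair in grp:
            if local(pair.tag) != "nameValuePair":
                continue
            name = next((c for c in pair if local(c.tag) == "name"), None)
            if name is None or (name.text or "").strip() != "LexicalDicts":
                continue
            array = next((e for e in pair.iter() if local(e.tag) == "array"), None)
            if array is None:
                continue
            if any((s.text or "").strip() == dict_name for s in array):
                return False
            # Reuse the tag of an existing entry so a namespace, if any, is kept.
            entry = ET.Element(array[0].tag if len(array) else "string")
            entry.text = dict_name
            entry.tail = array[0].tail if len(array) else None
            array.insert(0, entry)
            return True
    return False


def patch_jfrost(path, dict_name="en-XX-TAKMIUserNE.dic"):
    """Back up the descriptor (step 3), then patch it in place (step 4.1)."""
    shutil.copy2(path, path + ".bak")  # backup file name is an assumption
    tree = ET.parse(path)  # note: ElementTree drops XML comments on parse
    changed = add_lexical_dict(tree.getroot(), dict_name)
    if changed:
        tree.write(path, encoding="utf-8", xml_declaration=True)
    return changed
```

Because ElementTree discards comments and may rewrite namespace prefixes when it serializes a file, compare the patched jfrost.xml with the backup before restarting the parser, or use the function only to verify a manual edit.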
NLP processing
Once the PEAR module is associated with a particular OmniFind Enterprise Edition parser, the parser feeds documents to the PEAR module, which produces MIML files in the $PEAR_ROOT/database/db/miml directory.
Configuration parameters
Database directory
By default, the PEAR module uses $PEAR_ROOT/database as the database directory. To change the database directory, replace every occurrence of $PEAR_ROOT/database with the new database directory path in the descriptors under the $PEAR_ROOT/desc directory. Both the OmniFind Enterprise Edition and the IBM Content Analyzer admin users must be able to read and write the new directory.
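The replacement described above can be sketched in Python as follows. This is an illustration only: the function name retarget_database is hypothetical, and the sketch assumes that the descriptors are the *.xml files under the descriptor directory and that the old database path appears in them literally, as an already expanded path. Back up the descriptor directory first.

```python
from pathlib import Path


def retarget_database(desc_dir, old_db, new_db, encoding="utf-8"):
    """Replace every occurrence of old_db with new_db in *.xml files under desc_dir.

    Works on raw bytes so that line endings and all other content stay
    untouched. Returns a dict mapping each changed file to the number of
    replacements made in it.
    """
    old = old_db.encode(encoding)
    new = new_db.encode(encoding)
    changed = {}
    for path in sorted(Path(desc_dir).rglob("*.xml")):
        data = path.read_bytes()
        count = data.count(old)
        if count:
            path.write_bytes(data.replace(old, new))
            changed[str(path)] = count
    return changed
```

Passing the expanded values of $PEAR_ROOT/desc and $PEAR_ROOT/database as the first two arguments returns a per-file replacement count, which makes it easy to confirm that no descriptor was missed.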
Output MIML files
Configuration parameters for the output MIML files are defined in MIMLWriteAnnotator.xml and set in OAE_PACK_en.xml and OAE_PACK_ja.xml. All descriptors are located in the $PEAR_ROOT/desc/text_analysis_engine directory. The following configuration parameters are defined:
Language
Output language: “en” or “ja”.
DocumentsPerMIML
The maximum number of documents written in one MIML.
OutputDirectory
The directory where output MIML files are located.
OutputBasename
The basename of the MIML files; defaults to “docset”. The MIML files are named “docset_YYYYMMDD_HHMMSS_N.miml”.
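To illustrate how OutputBasename and DocumentsPerMIML affect the output, here is a small Python sketch. It is not the annotator's implementation: the function names are hypothetical, and it assumes that YYYYMMDD_HHMMSS is a date and time stamp and that N is a sequence number, neither of which is spelled out above.

```python
import math
from datetime import datetime


def miml_name(basename, when, n):
    """Build a file name of the form <basename>_YYYYMMDD_HHMMSS_<n>.miml."""
    return f"{basename}_{when:%Y%m%d_%H%M%S}_{n}.miml"


def min_miml_files(total_docs, docs_per_miml):
    """Smallest number of MIML files that can hold total_docs documents."""
    return math.ceil(total_docs / docs_per_miml)


print(miml_name("docset", datetime(2008, 1, 2, 3, 4, 5), 0))
print(min_miml_files(2500, 1000))
```

Because DocumentsPerMIML is a maximum, the second function gives only a lower bound on the number of files to expect in OutputDirectory.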
Problem determination
Out of memory error
Refer to http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0702baessler/index.html#pear_heap_size to change the heap size for the PEAR module.
Cannot load JSA library module
The Japanese PEAR module uses a Japanese text analysis engine called JSA. The location of the JSA modules must be in the system library path so that the PEAR module can load them. Although the IBM Content Analyzer installer makes the appropriate setting, check that the following environment variables are defined and in effect for the OmniFind Enterprise Edition parser.
Operating system    Environment variable    JSA modules
Windows             PATH                    libjsa.dll, JniJSA.dll
AIX                 LIBPATH                 libjsa.so, libJniJSA.so
Linux               LD_LIBRARY_PATH         libjsa.so, libJniJSA.so
The IBM Content Analyzer installer changes the appropriate environment variable to include $TAKMI_HOME/uima/component/jsa, where the JSA modules are installed. The JSA modules are also contained in the $PEAR_ROOT/bin directory so that the PEAR module can work without an IBM Content Analyzer installation, but in that case you need to change the environment variable yourself to use the PEAR-contained JSA modules.
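One way to check the setting is a short Python sketch like the one below. It is not a product tool: the names JSA_LOOKUP and find_jsa_modules are hypothetical, and the operating system keys are assumed to match the values returned by Python's platform.system(). The check is only meaningful when run with the same environment as the OmniFind Enterprise Edition parser process.

```python
import os

# Environment variable and JSA module names per operating system,
# taken from the table above.
JSA_LOOKUP = {
    "Windows": ("PATH", ("libjsa.dll", "JniJSA.dll")),
    "AIX": ("LIBPATH", ("libjsa.so", "libJniJSA.so")),
    "Linux": ("LD_LIBRARY_PATH", ("libjsa.so", "libJniJSA.so")),
}


def find_jsa_modules(system, environ=None):
    """Map each JSA module to the first library-path directory that holds it.

    A value of None means the module was not found in any directory
    listed in the environment variable for the given operating system.
    """
    environ = os.environ if environ is None else environ
    var, modules = JSA_LOOKUP[system]
    dirs = [d for d in environ.get(var, "").split(os.pathsep) if d]
    found = {}
    for module in modules:
        found[module] = next(
            (d for d in dirs if os.path.isfile(os.path.join(d, module))), None
        )
    return found
```

For example, find_jsa_modules(platform.system()), after importing platform, reports which directory each module would be loaded from. If any value is None, add $TAKMI_HOME/uima/component/jsa or $PEAR_ROOT/bin to the variable shown in the table.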