Integrating IBM Classification Module categories

IBM Content Analyzer incorporates results of document categorization performed by IBM Classification Module.

IBM Classification Module is an enterprise platform for a wide range of applications that require unstructured content to be automatically categorized. A category that is assigned by IBM Classification Module is a label used to mark text snippets to indicate that they belong to a particular class of text. Categories can represent textual content or indicate some other attribute of an item, such as its source. In general, each category has a specific use within a Relationship Modeling Engine-enabled application. Classification Workbench creates categories in the Knowledge Base based on the categories indicated in the training corpus.
For information on IBM Classification Module, see http://publib.boulder.ibm.com/infocenter/classify/v8r5/index.jsp
Overview
You can use the ICM integration tool, which is provided with IBM Content Analyzer, to incorporate suggestions assigned by IBM Classification Module into IBM Content Analyzer categories. The tool is a stand-alone command-line tool that runs on a JVM on each supported operating system. It reads the text fields of an MIML file and sends the text of each field to IBM Classification Module to obtain the suggestions that IBM Classification Module assigns to that text. After IBM Content Analyzer receives the suggestions, the tool adds them as standard features of the corresponding documents in the MIML file.
The following figure shows a system overview of the TAKMI_ICM2MIML tool.


Figure 1. Overview of the TAKMI_ICM2MIML tool

Prerequisite software
IBM Classification Module V8.5
You must add the client module, bns.jar, and all other JAR files under the Lib directory to the Java classpath of the tool.
IBM Content Analyzer V8.4.2
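The classpath requirement for IBM Classification Module can be sketched in a few lines. The following Python helper is only an illustration and is not part of either product: the function name build_classpath is hypothetical, and the location of the Lib directory is passed in by the caller because it depends on your installation.

```python
import os
from pathlib import Path


def build_classpath(lib_dir):
    """Return a classpath string that lists every .jar file found in lib_dir.

    lib_dir is assumed to be the Lib directory that contains bns.jar;
    its exact location depends on your installation.
    """
    jars = sorted(str(p) for p in Path(lib_dir).glob("*.jar"))
    # bns.jar is the client module, so treat its absence as an error.
    if not any(Path(j).name == "bns.jar" for j in jars):
        raise FileNotFoundError("bns.jar was not found in " + str(lib_dir))
    # os.pathsep is ";" on Windows and ":" on AIX and other UNIX systems.
    return os.pathsep.join(jars)
```

The resulting string can be appended to whatever classpath setting the tool's launch script uses.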
Mapping between IBM Classification Module suggestions and IBM Content Analyzer categories
You can map the suggestions that are returned by IBM Classification Module to IBM Content Analyzer's categories in two ways.
Standard Mapping
You can map the classification information that is retrieved from IBM Classification Module into a StandardFeature element, which represents an IBM Classification Module category.
All suggestions are associated with the common category and each suggestion value is treated as a keyword of the category.
Customized Mapping
Instead of the standardized mapping, you can define a customized mapping between an IBM Classification Module suggestion and an IBM Content Analyzer category.
Every customized mapping must be explicitly defined in a properties file.
An IBM Classification Module suggestion must be associated with an appropriate IBM Content Analyzer category, and different suggestions can be mapped to the same IBM Content Analyzer category when necessary.
However, one IBM Classification Module suggestion cannot be mapped to more than one IBM Content Analyzer category.
For example, the IBM Classification Module suggestions A and B can both be mapped to the IBM Content Analyzer category C, but suggestion A cannot be mapped to both category C and another category.
IBM Classification Module can return multiple suggestions with scores for a single query. The tool uses only the suggestion that has the highest score as a category of IBM Content Analyzer. The suggestions with lower scores are ignored.
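The mapping rules described above can be illustrated with a short Python sketch. It is not the tool's implementation: the function names, the (name, score) pair layout, and the dictionary that stands in for the properties file are all assumptions made for this example. The source does not say how ties between equal scores are resolved; this sketch simply keeps the first one.

```python
def top_suggestion(suggestions):
    """Return the name of the highest-scoring suggestion, or None if empty.

    suggestions is a list of (name, score) pairs. Lower-scoring
    suggestions are ignored. On a tie, the first pair wins (an assumption).
    """
    if not suggestions:
        return None
    return max(suggestions, key=lambda pair: pair[1])[0]


def standard_mapping(suggestion, common_category):
    """Standard mapping: every suggestion goes to one common category,
    and the suggestion value becomes a keyword of that category."""
    return (common_category, suggestion)


def customized_mapping(suggestion, mapping):
    """Customized mapping: look up the explicitly defined category.

    mapping is a dict, so each suggestion has at most one category,
    while several suggestions may share the same category.
    Returns None when no mapping is defined for the suggestion.
    """
    return mapping.get(suggestion)
```

For instance, with the hypothetical mapping {"A": "C", "B": "C"}, both suggestions A and B resolve to category C, and a dictionary cannot express a second category for A, which matches the restriction described above.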
To map suggestions to categories
1. Define the categories that will store the IBM Classification Module suggestions
  • Define the categories to which IBM Classification Module suggestions are related in the category_tree.xml file.
  • Prepare at least one category for the standard mapping.
  • To enable the customized mapping, define the categories that correspond to the IBM Classification Module suggestions.
For information about how to create categories, see Section 2.3 "Designing a Category Tree" of the Operation Guide.
2. Configure the connection to IBM Classification Module and the mappings
In the configuration file that is used by the TAKMI_ICM2MIML tool, specify the parameters to connect to IBM Classification Module. Also, specify the mappings between IBM Classification Module suggestions and IBM Content Analyzer categories.

The sample configuration file, icmbridge_sample_configuration.xml, is provided in the %TAKMI_HOME%/resource directory. For information about how to configure the settings, see the "Configuration parameters" section.
3. Set the system environment variable ICM_HOME
Specify the installation directory of IBM Classification Module as ICM_HOME.
4. Run the tool
Run the command takmi_icm2miml to invoke the Java program that will incorporate results of the document categorization.
  • Ensure that the category tree design for the IBM Classification Module suggestion mapping is complete.
  • Edit the icmbridge configuration file. See "Configuration parameters" for information about how to edit these settings.
  • Open a command window (on Windows) or shell (on AIX) and run the following commands:
        Windows:
        >  takmi_icm2miml.bat  CONFIG_FILE  MIML_FILE  [HEAP_SIZE_MB]
        AIX:
        >  takmi_icm2miml.sh  CONFIG_FILE  MIML_FILE  [HEAP_SIZE_MB]
      
    Where:
  • CONFIG_FILE: Configuration file
  • MIML_FILE: MIML file from which text sections will be retrieved to assign categories that are suggested by IBM Classification Module.
  • HEAP_SIZE_MB: (optional) Java™ heap size to use when running the command. Specify the value in megabytes. Defaults to 256 if omitted.
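If you start the tool from a script, the command line described above can be assembled as in the following Python sketch. The function build_command is hypothetical and is shown only to make the argument order concrete. It assumes that the launch script is on the PATH and that every system other than Windows uses the .sh script, although only AIX is named above.

```python
import platform


def build_command(config_file, miml_file, heap_size_mb=None, system=None):
    """Return the argument list for takmi_icm2miml.

    system defaults to the current platform name; pass "Windows" or
    "AIX" explicitly to build a command line for another system.
    """
    system = system or platform.system()
    # Assumption: any non-Windows system uses the .sh launch script.
    script = "takmi_icm2miml.bat" if system == "Windows" else "takmi_icm2miml.sh"
    command = [script, config_file, miml_file]
    if heap_size_mb is not None:
        heap = int(heap_size_mb)
        if heap <= 0:
            raise ValueError("HEAP_SIZE_MB must be a positive number of megabytes")
        # HEAP_SIZE_MB is optional; the tool defaults to 256 when it is omitted.
        command.append(str(heap))
    return command
```

The returned list can be passed to subprocess.run with check=True so that a nonzero exit status raises an error.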
5. Create an index
The takmi_icm2miml command generates an MIML file that contains the category information assigned by IBM Classification Module. The index must be created for the new MIML file, not the original MIML file, so that the category information can be viewed in TEXT MINER.
Before running the command to create the index, for example, takmi_index.bat, replace the file extension of the original MIML file with any value except "miml". If the original file keeps the "miml" extension, the indexing process creates the index from both MIML files, which can cause incorrect statistics.
For information about how to create the index, see "Indexing" in the Operation Guide.
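Renaming the original MIML file before indexing can be scripted. The following Python sketch is one way to do it; the function name hide_original_miml and the replacement extension ".orig" are arbitrary choices for this example, because any extension other than "miml" satisfies the requirement.

```python
from pathlib import Path


def hide_original_miml(original_path, new_suffix=".orig"):
    """Rename the original MIML file so that indexing does not pick it up.

    Returns the new path. Files that do not end in .miml are left
    unchanged and their path is returned as is.
    """
    if new_suffix.lower() == ".miml":
        raise ValueError("the new extension must not be 'miml'")
    original = Path(original_path)
    if original.suffix.lower() != ".miml":
        return original
    target = original.with_suffix(new_suffix)
    # Refuse to overwrite an existing file with the same name.
    if target.exists():
        raise FileExistsError(str(target))
    original.rename(target)
    return target
```

Call the function with the path of the original MIML file only, not the file that takmi_icm2miml generated, and then run the indexing command.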
6. Analyze the results
Suggestions returned by IBM Classification Module are mapped to the categories as defined in the configuration file. The information can be viewed and analyzed in TEXT MINER.
For information about how to open TEXT MINER, see the Text Miner Guide.
Configuration parameters
The configuration is stored in an XML file that has the following top-level elements.