Integrating IBM Classification Module categories
IBM Content Analyzer incorporates results of document categorization performed by IBM Classification Module.
IBM Classification Module is an enterprise platform for a wide range of applications that require unstructured content to be automatically categorized. A category that is assigned by IBM Classification Module is a label used to mark text snippets to indicate that they belong to a particular class of text. Categories can represent textual content or indicate some other attribute of an item, such as its source. In general, each category has a specific use within a Relationship Modeling Engine-enabled application. Classification Workbench creates categories in the Knowledge Base based on the categories indicated in the training corpus.
For information on IBM Classification Module, see
http://publib.boulder.ibm.com/infocenter/classify/v8r5/index.jsp
Overview
You can use the ICM integration tool, which is provided with IBM Content Analyzer, to incorporate suggestions assigned by IBM Classification Module into IBM Content Analyzer categories. The tool is a stand-alone command-line tool that runs on a JVM on each supported operating system. It reads the text fields of an MIML file and sends the text of each field to IBM Classification Module to obtain category suggestions. After the suggestions are received, the tool adds them as standard features of the corresponding documents in the MIML file.
The following figure shows a system overview of the TAKMI_ICM2MIML tool.
Figure 1. Overview of the TAKMI_ICM2MIML tool
Prerequisite software
IBM Classification Module V8.5
You must add the client module, bns.jar, and all other JAR files under the Lib directory to the Java classpath of the tool.
IBM Content Analyzer V8.4.2
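The classpath requirement for IBM Classification Module can be prepared with a small helper script. The following is a minimal sketch, not part of the product: the function name build_classpath is hypothetical, and the location of the Lib directory must be supplied by the caller because this topic does not state it.

```python
import os
from pathlib import Path


def build_classpath(lib_dir, extra_entries=()):
    """Return a classpath string that lists every .jar file found directly in lib_dir.

    lib_dir is assumed to be the Lib directory that contains bns.jar;
    its actual location depends on your installation.
    """
    jars = sorted(str(path) for path in Path(lib_dir).glob("*.jar"))
    # os.pathsep is ";" on Windows and ":" on Unix-like systems, which is
    # the separator that the JVM expects on each platform.
    return os.pathsep.join(list(extra_entries) + jars)
```

The resulting string can be passed to the JVM through the CLASSPATH environment variable or the -cp option.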
Mapping between IBM Classification Module suggestions and IBM Content Analyzer categories
You can map the suggestions that are returned by IBM Classification Module to IBM Content Analyzer's categories in two ways.
Standard mapping
You can map the classification information that is retrieved from IBM Classification Module into a StandardFeature element, which represents an IBM Classification Module category.
All suggestions are associated with one common category, and each suggestion value is treated as a keyword of that category.
Customized mapping
Instead of the standard mapping, you can define a customized mapping between an IBM Classification Module suggestion and an IBM Content Analyzer category.
Every customized mapping must be explicitly defined in the configuration file.
Each IBM Classification Module suggestion must be associated with an appropriate IBM Content Analyzer category. Different suggestions can be mapped to the same IBM Content Analyzer category when necessary.
However, one IBM Classification Module suggestion cannot be mapped to more than one IBM Content Analyzer category.
For example, the IBM Classification Module suggestions A and B can both be mapped to the IBM Content Analyzer category C, but suggestion A cannot be mapped to both category C and another category.
IBM Classification Module can return multiple suggestions, each with a score, for a single query. The tool uses only the suggestion that has the highest score as the IBM Content Analyzer category and ignores the suggestions with lower scores.
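The selection and mapping rules described in this section can be sketched as follows. This is an illustration only, not the tool's implementation: the function names and the data shapes (a list of name and score pairs, a dictionary of customized mappings) are hypothetical. It also assumes that a customized mapping adds a category in addition to the standard one, which is one possible reading of the rules.

```python
def top_suggestion(suggestions):
    """Return the name of the highest-scoring suggestion, or None if there are none.

    suggestions is a list of (name, score) pairs.
    """
    if not suggestions:
        return None
    return max(suggestions, key=lambda pair: pair[1])[0]


def resolve_categories(suggestions, standard_category, custom_mappings,
                       not_classified=None):
    """Return the (category_path, keyword) pairs to attach to one document."""
    best = top_suggestion(suggestions)
    if best is None:
        # Nothing was suggested: fall back to the notclassified value, if any.
        if not_classified is None:
            return []
        return [(standard_category, not_classified)]
    # Standard mapping: the suggestion becomes a keyword of the common category.
    features = [(standard_category, best)]
    # Customized mapping (assumption: added on top of the standard mapping).
    if best in custom_mappings:
        features.append((custom_mappings[best], best))
    return features
```

Because custom_mappings is a dictionary keyed by suggestion, several suggestions can point to the same category, but one suggestion can never point to two categories.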
To map suggestions to categories
1. Define the categories that will store the IBM Classification Module suggestions
- Define the categories to which IBM Classification Module suggestions are related in the category_tree.xml file.
- Prepare at least one category for the standard mapping.
- To enable the customized mapping, define the categories that correspond to the IBM Classification Module suggestions.
For information about how to create categories, see Section 2.3 "Designing a Category Tree" of the Operation Guide.
2. Configure the connection to IBM Classification Module and the mappings
In the configuration file that is used by the TAKMI_ICM2MIML tool, specify the parameters to connect to IBM Classification Module. Also, specify the mappings between IBM Classification Module suggestions and IBM Content Analyzer categories.
The sample configuration file, icmbridge_sample_configuration.xml, is provided in the %TAKMI_HOME%/resource directory. For information about how to configure the settings, see the "Configuration parameters" section.
3. Set the system environment variable ICM_HOME
Specify the installation directory of IBM Classification Module as ICM_HOME.
4. Run the tool
Run the command takmi_icm2miml to invoke the Java program that will incorporate results of the document categorization.
5. Create an index
The takmi_icm2miml command generates an MIML file that contains the category information assigned by IBM Classification Module.
The index must be created from the new MIML file, not the original MIML file, so that the category information can be viewed in TEXT MINER.
Before you run the command that creates the index, for example, takmi_index.bat, change the file extension of the original MIML file to any value except "miml".
If the original file keeps the "miml" extension, the indexing process creates the index from both MIML files, which can cause incorrect statistics.
For information about how to create the index, see "Indexing" in the Operation Guide.
6. Analyze the results
Suggestions returned by IBM Classification Module are mapped to the categories as defined in the configuration file.
The information can be viewed and analyzed in TEXT MINER.
For information about how to open TEXT MINER, see the Text Miner Guide.
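Step 5 requires that the original MIML file no longer has the "miml" extension when the index is created. A minimal sketch of that renaming step is shown below. The function name and the ".orig" suffix are hypothetical; any extension other than "miml" satisfies the requirement.

```python
from pathlib import Path


def set_aside_original(original_miml, new_suffix=".orig"):
    """Rename original_miml so that its extension is no longer 'miml'.

    Returns the new path. Refuses to overwrite an existing file.
    """
    if new_suffix.lower() == ".miml":
        raise ValueError("new_suffix must differ from .miml")
    source = Path(original_miml)
    target = source.with_suffix(new_suffix)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target
```

Run this kind of step after takmi_icm2miml finishes and before the indexing command starts, so that only the generated MIML file is indexed.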
Configuration parameters
The configuration is stored in an XML file that has the following top-level elements.
- server
- The <server> element specifies the URL of the IBM Classification Module server. The element is specified as follows.
<server>
<url>http://localhost:18087/</url>
</server>
- rmeConfiguration
- The <rmeConfiguration> element specifies parameters for the Relationship Modeling Engine (RME). The element is specified as follows.
<rmeConfiguration>
<knowledgeBase>my knowledge base</knowledgeBase>
<dictionary>my dictionary</dictionary>
</rmeConfiguration>
The following table describes the elements in the rmeConfiguration element:
Element | Description | Required or optional
knowledgeBase | Name of the knowledge base that is available on the IBM Classification Module server. The text is categorized by IBM Classification Module with the rules that are defined in this knowledge base. | Required
dictionary | Name of the dictionary that is available on the IBM Classification Module server. The text is associated with this dictionary. Multiple "dictionary" entries can be defined. Dictionaries that are not available on the IBM Classification Module server are ignored. At least one valid dictionary must be specified. When two or more dictionaries are active on the IBM Classification Module server, the first dictionary is used and the others are ignored. | Required
Only one set of Knowledge Bases and dictionaries is supported. To work with multiple Knowledge Bases or dictionaries, or both, you must run the tool multiple times.
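The dictionary rules in the table can be read as a simple selection procedure. The sketch below is an interpretation, not the tool's code: the function name is hypothetical, and it assumes that "the first dictionary" means the first valid entry in configuration order.

```python
def select_dictionary(configured, available):
    """Return the dictionary to use, given the configured and server-side names.

    configured: dictionary names in the order they appear in the configuration.
    available:  names of the dictionaries that exist on the server.
    """
    # Entries that the server does not know are ignored.
    valid = [name for name in configured if name in available]
    if not valid:
        raise ValueError("at least one configured dictionary must exist on the server")
    # Assumption: when several entries are valid, the first one wins.
    return valid[0]
```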
- suggestionMappings
- The <suggestionMappings> element specifies mappings from IBM Classification Module suggestions to IBM Content Analyzer categories. It contains one <standardFeature> element and a series of <dynamicPath> elements. The element is specified as follows.
<suggestionMappings>
<standardFeature category=".icm_suggestion" notclassified="UNKNOWN"/>
<dynamicPath suggestion="ICM suggestion #1" category=".icm_suggestion.subcategory1"/>
<dynamicPath suggestion="ICM suggestion #2" category=".icm_suggestion.subcategory2"/>
...
</suggestionMappings>
The following table describes the elements in the suggestionMappings element:
Element | Description | Required or optional
standardFeature | The category attribute specifies the category path of the target IBM Content Analyzer standard feature. The suggestion that is returned by IBM Classification Module is used as a value of the standard feature that is specified in the category attribute. When IBM Classification Module returns no suggestions, the value of the notclassified attribute is used. | Required (the notclassified attribute is optional)
dynamicPath | Defines a dynamic mapping from an IBM Classification Module suggestion to a category path of an IBM Content Analyzer standard feature. This element requires the suggestion and category attributes. If the suggestion that is returned by IBM Classification Module matches the specified value, the specified category is added to the document with the specified value. You can define multiple dynamicPath elements in a suggestionMappings element. | Optional
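Before running the tool, it can be useful to check a <suggestionMappings> fragment against the rules in the table. The following sketch uses the Python standard library to do so. The function name read_mappings and the sample fragment are illustrative; the check that one suggestion maps to only one category follows the mapping rule stated earlier in this topic, and the tool itself might validate the file differently.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<suggestionMappings>
  <standardFeature category=".icm_suggestion" notclassified="UNKNOWN"/>
  <dynamicPath suggestion="ICM suggestion #1" category=".icm_suggestion.subcategory1"/>
  <dynamicPath suggestion="ICM suggestion #2" category=".icm_suggestion.subcategory1"/>
</suggestionMappings>
"""


def read_mappings(xml_text):
    """Parse a <suggestionMappings> fragment.

    Returns (standard_category, notclassified, custom), where custom maps
    each suggestion to exactly one category path.
    """
    root = ET.fromstring(xml_text.strip())
    standard = root.find("standardFeature")
    if standard is None:
        raise ValueError("standardFeature is required")
    custom = {}
    for element in root.findall("dynamicPath"):
        suggestion = element.get("suggestion")
        category = element.get("category")
        if suggestion is None or category is None:
            raise ValueError("dynamicPath needs both suggestion and category")
        # Reject a suggestion that is already mapped to a different category.
        if custom.get(suggestion, category) != category:
            raise ValueError(f"{suggestion!r} is mapped to more than one category")
        custom[suggestion] = category
    return standard.get("category"), standard.get("notclassified"), custom
```

In the sample fragment, two different suggestions share one category, which is allowed. Mapping the same suggestion to two categories raises an error.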
- output
- The <output> element specifies the base name of the MIML file that the tool generates. The element is specified as follows.
<output useTimeStamp="true">
<basename>output_miml_file</basename>
</output>
The following table describes the elements in the output element:
Element | Description | Required or optional
basename | Base name of the output MIML files. If this element is not specified, the base name of the input MIML file is used instead. | Optional
If the useTimeStamp attribute is set to "true", the output MIML file is named basename_YYYYMMDD_HHMMSS_N.miml, where YYYYMMDD_HHMMSS is the timestamp and N is an integer that makes the file name unique.
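The naming pattern can be illustrated with a short sketch. It is not the tool's code: the function name is hypothetical, and the starting value of N (0 here) is an assumption because this topic does not state it.

```python
from datetime import datetime


def timestamped_name(basename, now, existing):
    """Build basename_YYYYMMDD_HHMMSS_N.miml so that it is not in existing.

    now is a datetime; existing is a collection of file names already in use.
    """
    stamp = now.strftime("%Y%m%d_%H%M%S")
    n = 0  # assumption: the documented pattern does not say where N starts
    while f"{basename}_{stamp}_{n}.miml" in existing:
        n += 1
    return f"{basename}_{stamp}_{n}.miml"
```

For example, with the base name output_miml_file and a run at 14:07:09 on 5 March 2010, the sketch yields output_miml_file_20100305_140709_0.miml when no file with that name exists yet.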