This document describes the operational flow of the text mining system IBM Content Analyzer and the tools used in its various operational phases. In particular, it describes the series of procedures (preprocessing) that runs from language processing of the data to be analyzed through creation of the index structure used for analysis.
Note that this document assumes that IBM Content Analyzer is already installed. See the Installation Guide for details on installation.
This document is written for system administrators and operational designers. Readers are expected to fully understand the content of the introductory topics.
Term | Meaning |
This is an operating system environment variable for an installation directory. The directory name is decided at the time of installation. * The % symbol at both ends means that this is an environment variable. For example, when %TAKMI_HOME% is set to the value C:/Program Files/takmi, %TAKMI_HOME%/conf/global_config.xml becomes C:/Program Files/takmi/conf/global_config.xml. |
|
Database | This is a language processing result management unit for each analysis target data. For example, when two types of data, "customer inquiry" and "internal document," are individually analyzed, two databases are created, one for each data type. |
You may create a database with an arbitrary name consisting of single-byte alphanumeric characters. In this document, the created database is written as DATABASE_NAME; when reading, replace it with the name of your database. |
|
This is a directory where databases are physically arranged. By default, assuming that the database name is DATABASE_NAME, %TAKMI_HOME%/databases/DATABASE_NAME becomes the database directory. This may be referred to as DATABASE_DIRECTORY. For example, DATABASE_DIRECTORY/conf/database_config.xml is a database_config.xml file in the conf directory, which is in the database directory. |
|
This refers to the settings made in the database configuration file DATABASE_DIRECTORY/conf/database_config.xml; the file name "database_config.xml" refers to the same file. There is one configuration file for each database. |
|
This refers to the settings made in the global configuration file %TAKMI_HOME%/conf/global_config.xml; the file name "global_config.xml" refers to the same file. There is only one global configuration file for each IBM Content Analyzer system. To enable Web applications to use a newly created database, you must register that database in the global settings. |
|
ATML |
This is an input format used in language processing. The target data is always converted into this format before language processing. |
MIML |
This is an output format used in language processing. The result of language processing is returned in a file in this format, and then indexed. |
This topic provides notes on editing configuration files such as global_config.xml and database_config.xml by using the text editor.
These configuration files must be saved in UTF-8 format. When editing these files by using Microsoft® Windows® Notepad, be sure to select "UTF-8" as the encoding method.
More specifically, select "File" and then "Save As" from the Notepad menu, and save the files in UTF-8 format. Do not select "Unicode."
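As a quick way to catch the Notepad pitfall described above, the following Python sketch checks whether a configuration file is valid UTF-8 or was accidentally saved as UTF-16 ("Unicode" in Notepad). The function name and return strings are illustrative.

```python
# Sketch: verify that a configuration file is saved as UTF-8, not UTF-16
# ("Unicode" in Notepad). Function name and messages are illustrative.
import codecs

def check_utf8(path):
    with open(path, "rb") as f:
        data = f.read()
    # Notepad's "Unicode" option writes a UTF-16 byte-order mark.
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 (re-save as UTF-8)"
    try:
        data.decode("utf-8")
        return "UTF-8 OK"
    except UnicodeDecodeError:
        return "not valid UTF-8"
```

Running such a check before preprocessing can save a failed batch run caused by a mis-encoded global_config.xml or database_config.xml.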
This section describes the series of operations before analyzing target data with IBM Content Analyzer. Details on tools to be used are provided in section 3 and later sections. See these other sections if necessary.
First, an overview of the operation flow is provided. The figure below shows the flow of batch processing before the application is ready to analyze data. These processes as a whole are called preprocessing.
Preprocessing is roughly divided into two parts: the processing in which raw data to be analyzed is converted into ATML, which is the standard input data format for IBM Content Analyzer; and the processing in which indices are generated by performing the common processing to the ATML.
IBM Content Analyzer provides the following tools for performing each of the processing steps. When operating the system, establish a workflow for launching these tools in accordance with the requirements. Later sections describe the standard flow shown above.
Name | Description |
Dictionary Editor |
Supports category and dictionary edit operations. |
Data Ingester | Converts a target comma separated values (CSV) file into an ATML file. |
NLP (language processing) | Processes an ATML file and returns the result of language processing in an MIML file. |
Indexer | Generates indices for applications based on the results of language processing. |
In actual system operations, there is another process in which processed data is deleted and replaced by new data. See 4.3 Deleting Data for details.
Resources that are used in data processing are managed in units of databases. These resources are physically stored in a database directory.
In operational designing, first create a database directory. Follow the procedures below:
<?xml version="1.0" encoding="UTF-8"?> <global_config> <params> <param name="language" value="ja"/> </params> <database_entries> <!-- Specify the relative path to the database directory as follows. --> <!-- Specify the absolute path to the database directory as follows. --> </database_entries> </global_config> |
The target data must be prepared in the following CSV (Comma Separated Values) format. If data is not in this format, follow the rules below and convert it into the CSV format beforehand.
The basic CSV format supports the format output by Microsoft Excel. Data saved as Microsoft Excel files can be easily converted into the CSV format by using Excel.
If the target data is in the CSV format from the beginning, check that it conforms to the following rules.
Required rules:
Strongly recommended rules:
Edit a category tree to relate each column in a CSV file to a particular category. Also, edit database settings in the database_config.xml file in order to define the category edited here as a standard item.
Follow the procedures below. This must be done only once when designing a database unless the structure of the original CSV data changes.
Editing the category tree (category_tree.xml)
This section describes how to relate CSV columns to categories and how to edit the category tree by using an example. The established relations are used when the Data Ingester converts the CSV file into an ATML file.
Assume that a CSV file containing the following columns will be processed:
Column 1 | Column 2 | Column 3 | Column 4 |
Date of inquiry | Customer ID | Name | Inquiry |
This CSV file has four columns. The fourth column contains text to be analyzed by language processing. The other three columns contain standard items that are attached to the target data from the beginning. Relate these three items to categories. A category consists of a category name and a category path. The following example shows how to create categories for some of the column names.
Column name | Category path | Category name |
Customer ID | .customer_id | Customer ID |
Name | .fullname | Name |
Start a category path with the period character; from the second character on, use single-byte alphanumeric characters with no further periods. A category path is case-sensitive.
Characters that can be used are 0 to 9, a to z, A to Z, the hyphen, and the underscore. It is useful to give a meaningful name to a category path.
Column names can be used as category names as they are, but you can change them if necessary.
For date data ("date of inquiry" in this example) to be used in time series analysis, use the special category path ".date". This path does not have to be newly created, as it is already provided in the category tree template. Rules for relating the date column to the category are described in the Data Ingester settings.
Next, register these categories in the category tree. The category tree is created in the file DATABASE_DIRECTORY/category/category_tree.xml.
Use the text editor to open the category_tree.xml file, and add the created categories as follows.
<?xml version="1.0" encoding="UTF-8"?> <category_tree> <node id="1" path="date" name="Standard date" features=""/> <node id="2" path="date/dd" name="Date (day)" features="integer"/> <node id="3" path="date/dow" name="Date (day of the week)" features="integer"/> <node id="4" path="date/yyyy" name="Date (year)" features="integer"/> ... <node id="200" path="customer_id" name="Customer ID" features=""/> <node id="201" path="fullname" name="Name" features=""/> </category_tree> |
Add a node element to each category as a subelement directly below the category_tree element. Follow the instructions below when editing the category tree:
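For databases with many standard items, the manual edit described above can also be scripted. The following Python sketch appends node elements directly under the category_tree element, assigning the next free id; the helper function and sample data are illustrative, not part of the product tooling.

```python
# Sketch: append new category nodes to category_tree.xml with ElementTree.
# The helper and the sample data below are illustrative only.
import xml.etree.ElementTree as ET

def add_category(tree_root, path, name, features=""):
    # Each category is a <node> element directly under <category_tree>;
    # pick an id one larger than the largest existing one.
    next_id = max((int(n.get("id")) for n in tree_root.findall("node")), default=0) + 1
    ET.SubElement(tree_root, "node", id=str(next_id), path=path,
                  name=name, features=features)
    return next_id

root = ET.fromstring('<category_tree>'
                     '<node id="1" path="date" name="Standard date" features=""/>'
                     '</category_tree>')
add_category(root, "customer_id", "Customer ID")
```

Remember that the resulting file must still be saved in UTF-8, as noted earlier for all configuration files.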
Editing database settings (database_config.xml)
Edit the database settings after editing the category tree in order to prevent the newly registered categories in the category tree from being edited by the Dictionary Editor tool.
Use the text editor to open DATABASE_DIRECTORY/conf/database_config.xml, and add category_entry elements as subelements directly below the category_entries element, as follows.
<category_entries> <!-- Specifies subroot categories for system-reserved categories. --> <category_entry name="reserved_by_system" value=".date"/> <category_entry name="reserved_by_system" value=".date/dd"/> <category_entry name="reserved_by_system" value=".date/dow"/> <category_entry name="reserved_by_system" value=".date/yyyy"/> <category_entry name="reserved_by_system" value=".date/yyyymm"/> <category_entry name="reserved_by_system" value=".date/yyyyww"/> <category_entry name="reserved_by_system" value=".date/yyyymmdd"/> <category_entry name="reserved_by_system" value=".tkm_ja_base_word"/> <category_entry name="reserved_by_system" value=".tkm_ja_base_phrase"/> <category_entry name="reserved_by_system" value=".customer_id"/> <category_entry name="reserved_by_system" value=".fullname"/> ... |
If necessary, create resources such as categories and dictionary or pattern entries. See the
To convert CSV data into ATML, which is the input data format for IBM Content Analyzer, use the Data Ingester. See Data Ingester for details.
Language processing (NLP) refers to the processing of ATML files and the creation of MIML files that contain the results of language processing. See Section 4.1 for details.
Indexing refers to the processing of MIML files to create indices for high-speed mining. See Section 4.2 for details.
Once indexing is complete, applications that use indices such as Text Miner can be used. This section describes how to check operations with Text Miner.
This section describes how to convert CSV data into the ATML format by using the Data Ingester.
The Data Ingester is a tool that converts a CSV file into an ATML file. This section describes the procedures for running takmi_data_ingester, which is a command to launch the tool, and how to edit the configuration file.
Running the commands
Windows:
> takmi_data_ingester.bat CONFIG_FILE CSV_FILE ATML_FILE [HEAP_SIZE_MB]
AIX:
> takmi_data_ingester.sh CONFIG_FILE CSV_FILE ATML_FILE [HEAP_SIZE_MB]

The meaning of the arguments is as follows:
How to edit the settings
This section describes how to edit the settings of the data_ingester_config_csv2atml.xml file. See Advanced Tool Settings for further detailed settings. When editing and saving this file, be sure to set the character code to UTF-8 format.
The following is an example CSV file with four columns. The first line shows column names, and the actual data is in the second line and on. See Converting Target Data into a CSV File for details on the CSV file format.
Date of inquiry | Customer ID | Name | Inquiry |
2007/04/01 | XX001122 | Taro Sato | The PC does not start. |
2007/04/05 | XX00334455 | Ichiro Suzuki | Prices of new products |
: : |
: : |
: : |
: : |
Assume that the following settings are made in order to convert this CSV file into an ATML file.
<param name="csv.column.index.date" multivalued="no"> <value>1</value> </param> : : <param name="csv.date.format.list" multivalued="true"> <value>yyyy/MM/dd</value> </param> |
Next, relate each column in the CSV file to a category or text as shown below.
<param name="csv.column.text.indexes" multivalued="yes"> <value>4</value> </param> : : <value></value> <value>.customer_id</value> <value>.fullname</value> <value>Inquiry</value> </param> |
Finally, make the following settings to specify the first data line (the second line in this example), and the settings are complete.
Be sure to set the character code to UTF-8 and save the changes.
<param name="csv.row.firstindex" multivalued="no"> <value>2</value> </param> |
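To make the relationship between the settings above and the resulting ATML content concrete, the following Python sketch mimics how the example configuration maps one CSV data row onto a date, standard-item categories, and analysis text. The dictionary output shape is an illustration only; actual ATML is an XML format produced by the Data Ingester.

```python
# Sketch: how the example settings map one CSV data row onto categories and
# analysis text. The mapping tables and output shape are illustrative only.
import csv, io

COLUMN_MAP = {2: ".customer_id", 3: ".fullname"}  # column number -> category path
TEXT_COLUMNS = {4: "Inquiry"}                     # column number -> text label
FIRST_DATA_ROW = 2                                # csv.row.firstindex

def ingest(csv_text):
    rows = list(csv.reader(io.StringIO(csv_text)))
    docs = []
    for row in rows[FIRST_DATA_ROW - 1:]:
        doc = {"categories": {}, "texts": {}}
        for i, cell in enumerate(row, start=1):
            if i == 1:
                doc["date"] = cell  # parsed with "yyyy/MM/dd" by the real tool
            elif i in COLUMN_MAP:
                doc["categories"][COLUMN_MAP[i]] = cell
            elif i in TEXT_COLUMNS:
                doc["texts"][TEXT_COLUMNS[i]] = cell
        docs.append(doc)
    return docs

sample = ("Date of inquiry,Customer ID,Name,Inquiry\n"
          "2007/04/01,XX001122,Taro Sato,The PC does not start.")
```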
This section describes language processing.
Language processing is done for each input data file (ATML file). Before running language processing, it is necessary to allocate the resources used in language processing, which are updated by the dictionary tools and DOCAT.
Allocating language processing resources
Windows:
> takmi_nlp_resource_deploy.bat DATABASE_DIRECTORY [HEAP_SIZE_MB]
AIX:
> takmi_nlp_resource_deploy.sh DATABASE_DIRECTORY [HEAP_SIZE_MB]

The meaning of the arguments is as follows:
Windows:
> takmi_nlp.bat DATABASE_DIRECTORY DATABASE_DIRECTORY/db/atml/input.atml DATABASE_DIRECTORY/db/miml/output.miml [HEAP_SIZE_MB]
AIX:
> takmi_nlp.sh DATABASE_DIRECTORY DATABASE_DIRECTORY/db/atml/input.atml DATABASE_DIRECTORY/db/miml/output.miml [HEAP_SIZE_MB]

The meaning of the arguments is as follows:
This section describes indexing.
There are two types of processing in indexing: creation of new indices and update of indices by file addition.
A new index is created for all data processed by language processing (MIML files). When adding files to update an index, new MIML files and existing MIML files are combined to create an index.
See Advanced Tool Settings for commands to customize the indexing function.
Creating a new index
Windows:
> takmi_index.bat DATABASE_DIRECTORY [HEAP_SIZE_MB]
AIX:
> takmi_index.sh DATABASE_DIRECTORY [HEAP_SIZE_MB]

The meaning of the arguments is as follows:
Windows:
> takmi_index_diff.bat DATABASE_DIRECTORY [HEAP_SIZE_MB]
AIX:
> takmi_index_diff.sh DATABASE_DIRECTORY [HEAP_SIZE_MB]

The meaning of the arguments is as follows:
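Taken together, the ingestion, resource deployment, language processing, and indexing commands described in the preceding sections form a batch pipeline. The following Python sketch shows one way such a workflow could be scripted; only the tool names come from this document, while the argument values, heap size, and error handling are illustrative assumptions.

```python
# Sketch: one possible preprocessing workflow driver. Only the tool names come
# from the document; arguments and error handling are illustrative.
import subprocess

def preprocess_commands(db_dir, config, csv_file, atml, miml, heap="1000"):
    # The order mirrors the preprocessing flow: ingest, deploy resources,
    # run language processing, then build the index.
    return [
        ["takmi_data_ingester.sh", config, csv_file, atml, heap],
        ["takmi_nlp_resource_deploy.sh", db_dir, heap],
        ["takmi_nlp.sh", db_dir, atml, miml, heap],
        ["takmi_index.sh", db_dir, heap],
    ]

def preprocess(db_dir, config, csv_file, atml, miml, heap="1000"):
    for cmd in preprocess_commands(db_dir, config, csv_file, atml, miml, heap):
        subprocess.run(cmd, check=True)  # stop on the first failing step
```

On Windows the ".sh" names would be replaced by the ".bat" equivalents.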
Processing the document which consists of two or more texts
In IBM Content Analyzer 8.4.2 or later, a document that consists of two or more texts can be analyzed text by text.
To use this function, add a text_entry tag whose name attribute is the text name under the text_entries tag, and then create the index. The following is an example of processing a document that consists of two texts named "QUESTION" and "ANSWER".
<impl name="standard"> ... <text_entries> <text_entry name="QUESTION"/> <text_entry name="ANSWER"/> </text_entries> ... </impl> |
This section describes the deletion of data.
If necessary, delete previously created files when running language processing or indexing.
To delete the files, be sure to first stop WebSphere Application Server.
(The files might not be successfully deleted if the server is active.)
The behavior of any application is not guaranteed if the files have been deleted without stopping WebSphere Application Server.
Windows:
> takmi_clear_index.bat DATABASE_DIRECTORY
AIX:
> takmi_clear_index.sh DATABASE_DIRECTORY

To reprocess all the data (to update the dictionary, for example), delete both the MIML files and the index by running the following commands:
Windows:
> takmi_clear_nlp_index.bat DATABASE_DIRECTORY
AIX:
> takmi_clear_nlp_index.sh DATABASE_DIRECTORY

When you run this command, a message appears asking you to confirm that you want to run it. Type "y" to confirm.
As for pre-processed ATML files, establish operational rules so that they are deleted if necessary.
This section describes how to stop and launch the server. The Web applications of IBM Content Analyzer run on WebSphere Application Server, and all of the applications stop or start simply by stopping or starting WebSphere Application Server.
Stop the operation of WebSphere Application Server by a commonly used method. It can be stopped by using command lines, selecting an appropriate icon from the Windows menu, or by stopping the WebSphere Application Server service registered in Windows services. See the WebSphere Application Server documentation for details.
Launch WebSphere Application Server by a commonly used method. It can be activated by using command lines, selecting an appropriate icon from the Windows menu, or by activating the WebSphere Application Server service registered in Windows services. See the WebSphere Application Server documentation for details.
After launching WebSphere Application Server, ensure that the IBM Content Analyzer applications are operating properly. Text Miner is properly operating if the database list is displayed when Text Miner is accessed from the Web browser. Note, however, that database names will not be displayed if they are not registered in the global settings.
This section describes log files that are created by the system.
Logs are written when the preprocessing or server process is run or is in operation. They are written by default to the following locations:
Type | Output destination |
preprocessing | %TAKMI_HOME%/logs |
Server (including Web applications) | Log directories of WebSphere Application Server |
The log directory of WebSphere Application Server varies depending on the operating environment. See the WebSphere Application Server documentation for details.
The name of a log file varies depending on the application that creates the log. Its name is "application name.log" by default, and you may use it for problem determination. The WebSphere Application Server log files such as SystemOut.log and SystemErr.log are also often useful in problem determination.
Settings of the log file are made in the configuration file called DATABASE_DIRECTORY/conf/application name_logging.properties. Specifications of the configuration file conform to the Logging API of Java. See the documentation on the Java Logging API for details.
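Since the logging configuration conforms to the Java Logging API, a per-application properties file might look like the following fragment. The keys and handler classes shown are standard java.util.logging conventions; the actual keys honored by each IBM Content Analyzer application are assumptions here, as is the file name in the pattern.

```properties
# Illustrative java.util.logging-style settings; the actual keys used by each
# IBM Content Analyzer application may differ.
handlers=java.util.logging.FileHandler
.level=INFO
java.util.logging.FileHandler.pattern=%t/application_name.log
java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter
```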
This section describes advanced settings of the tools used in the preprocessing phase.
By editing the Data Ingester configuration file data_ingester_config_csv2atml.xml, it is possible to customize the method of conversion of CSV files into ATML files.
The format of the data_ingester_config_csv2atml.xml file is as follows:
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE ingester_config SYSTEM "data_ingester_config.dtd"> <ingester_config> <data_source impl="com.ibm.research.trl.milne.application.impl.common.prenlp.PreNLPDataSourceCSV"> </data_source> <doc_converter_list> <doc_converter impl="com.ibm.research.trl.milne.application.impl.common.prenlp.PreNLPDocumentConverterATML"> </doc_converter> </doc_converter_list> <doc_serializer impl="com.ibm.research.trl.milne.application.impl.common.prenlp.PreNLPDocumentSerializerATML"> </doc_serializer> </ingester_config> |
<param name="key of this parameter" multivalued="yes"> <value>AAA</value> <value>BBB</value> </param> |
Be sure to specify either "yes" or "no" to the multivalued attribute.
When "no" is specified, only one value can be specified for that parameter.
List of data_source parameters
name | Required | multivalued | Meaning and format of value | Example of value | Note | |
csv.character.encoding | * | no | Character encoding of an input CSV file | UTF-8, MS932, and so on | Specify a character set that Java can interpret. | |
csv.row.firstindex | * | no | Specify a line in the CSV file as a starting point for incorporating text. | 1, 2, 3, ... | ||
csv.column.index.id | no | Specify column numbers that correspond to document ID character strings, with the value 1 for the leftmost column. | 1, 2, 3, ... | If omitted, sequential numbers "1," "2," and so on will be automatically given. | ||
csv.column.index.title | no | Specify column numbers that correspond to document titles, with the value 1 for the leftmost column. | 1, 2, 3, ... | If omitted, the title of each of the documents will be "". | ||
csv.date.format.list | | yes | This is a date format list for interpreting the data in the column specified in csv.column.index.date as dates. Each format must conform to the patterns supported by the Java class java.text.SimpleDateFormat. | | If omitted, only "yyyyMMdd" is used. Specify the values of csv.date.format.list in accordance with the character string format used for date data in the CSV file. |
csv.column.index.date | no | Specify column numbers that correspond to document dates, with the value 1 for the leftmost column. | 1, 2, 3, ... | If omitted, the date of processing is incorporated in each document as the .date standard item. If character strings that are not appropriate for the format are included in the input data, the date of processing will be incorporated in that document (the same behavior as omission). Logs show data character strings that failed to be successfully processed. | ||
csv.column.text.indexes | yes | Specify the column numbers containing text you wish to analyze by language processing, with the value 1 for the leftmost column. | 1, 2, 3, ... | |||
csv.column.names | * | yes | Attach a label (character string) to each line in the input CSV file. Be sure to always set the same number of values as the number of columns in the CSV file. While maintaining the sequence, values will correspond to the order of CSV columns. The attached label functions as a category path designed for a standard item. However, date setting, text label, and space setting are the exceptions. The following rules apply to these settings.
In this case, the first column corresponds to the category ".aaa," and the third column corresponds to the category ".customer_id." Since the second column is set to space characters, the data in the second column is not incorporated into ATML. Note that this space setting cannot be omitted. Because the value "TEXT" set for the fourth column is not a category path (it does not start with the period character), an error occurs if the value 4 is not included in the csv.column.text.indexes setting. If the value is included, the column is incorporated into ATML as text with that label. |
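As a concrete illustration of the csv.date.format.list fallback described in the table above, the following Python sketch mimics trying each date format in order and falling back to the date of processing on failure. The Python strptime patterns stand in for Java SimpleDateFormat patterns (e.g. "%Y/%m/%d" for "yyyy/MM/dd"); the fallback behavior is paraphrased from the table, not taken from the tool's source.

```python
# Sketch: the fallback behavior described for csv.column.index.date and
# csv.date.format.list, using Python strptime patterns in place of Java
# SimpleDateFormat ones.
from datetime import datetime, date

FORMATS = ["%Y/%m/%d", "%Y%m%d"]  # cf. SimpleDateFormat "yyyy/MM/dd", "yyyyMMdd"

def parse_date(value, formats=FORMATS):
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    # Like the tool: fall back to the date of processing and log the failure.
    print("could not parse date:", value)
    return date.today()
```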
List of doc_converter parameters
name | Required | multivalued | Meaning and format of value | Example of value | Note |
string.converter.locale | | no | Determines whether language-dependent conversion is applied to the text subject to language processing. | ja | If omitted, language-dependent conversion is not run. When set to ja, all single-byte characters are converted to double-byte characters. |
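The single-byte to double-byte conversion mentioned for string.converter.locale=ja typically means mapping half-width ASCII to the Unicode fullwidth forms. The following Python sketch shows that kind of normalization; the exact character set and rules the tool applies are assumptions here.

```python
# Sketch: half-width ASCII to full-width conversion of the kind described for
# string.converter.locale=ja. The exact rules the tool applies are assumed.
def to_fullwidth(s):
    out = []
    for ch in s:
        if ch == " ":
            out.append("\u3000")               # ideographic (full-width) space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))  # ASCII -> Fullwidth Forms block
        else:
            out.append(ch)                     # leave other characters as-is
    return "".join(out)
```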
List of doc_serializer parameters
name | Required | multivalued | Meaning and format of value | Example of value | Note |
atml.indent.size | | no | Specify the number of space characters to be used as tag indentation in the output ATML file. Specify 0 for no indentation. | 0, 1, 2, ... | 0 when omitted. |
List of arguments for the language resource deployment processing (takmi_nlp_resource_deploy.bat(.sh))
Parameter | Description |
DATABASE_DIRECTORY | A user dictionary file (.adic.xml) in dic/category/ of the specified database directory and a resource file that reads and uses category/category_tree.xml in language processing will be updated. Files in DATABASE_DIRECTORY/category, DATABASE_DIRECTORY/dic, and DATABASE_DIRECTORY/ie will be updated. |
List of arguments for language processing (takmi_nlp.bat(.sh))
Parameter | Description |
DATABASE_DIRECTORY | Read and run language processing on the files in the category, dic, and ie directories in the specified database directory. |
DATABASE_DIRECTORY/db/atml/input.atml | This is an input file for language processing. It is used only in language processing. It is not viewed by applications such as Text Miner. |
DATABASE_DIRECTORY/db/miml/output.miml | This is an output file for language processing. If the same file already exists, it will be overwritten. This file is referred to by the original document display function of Text Miner. Therefore, if the file is to be overwritten by language processing, it is necessary to stop Text Miner (and applications accessed from Text Miner) first. |
The format of a user dictionary (.adic.xml) file used in language processing is as follows.
<?xml version="1.0" encoding="UTF-8"?>
<dictionary> <entry id="0" lex="Personal computer" pos="noun" cat=".category1" /> <entry id="1" lex="Software system" pos="noun" cat=".category2" /> <entryC id="2" str="PC" pos="noun" eid="0" /> <entryC id="3" str="Software" pos="noun" eid="1" /> </dictionary> |
The word entry means a keyword, and entryC means a synonym. Synonym entryC must be defined after keyword entry. Relevant attributes are as follows.
Attribute | Description |
id | A dictionary entry ID. Specify an integer of 0 or greater. The ID must not duplicate that of any other entry. |
lex | Keyword. Do not define keywords having the identical part-of-speech (pos) or category (cat) information. |
pos | Part-of-speech information. Only noun is supported. |
cat | Category information. Specify information defined in category_tree.xml. Proper operations are not guaranteed if undefined information is input. |
str | Synonym information. Do not define synonyms having the identical pos or entry ID (eid). |
eid | Specify an ID of the corresponding keyword. Proper operations are not guaranteed if an ID that does not exist is specified. Create separate entries to relate the ID to multiple keywords. |
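The entry and entryC rules in the table above lend themselves to a simple consistency check. The following Python sketch validates a user dictionary against two of them (unique ids, and entryC eid values that refer to an existing entry); it is an illustrative helper, not part of the product.

```python
# Sketch: check two user-dictionary rules from the table above: ids are unique
# and every entryC "eid" points at an existing keyword entry.
import xml.etree.ElementTree as ET

def validate_dictionary(xml_text):
    root = ET.fromstring(xml_text)
    errors = []
    ids = [e.get("id") for e in root]
    if len(ids) != len(set(ids)):
        errors.append("duplicate id")
    entry_ids = {e.get("id") for e in root if e.tag == "entry"}
    for syn in root.iter("entryC"):
        if syn.get("eid") not in entry_ids:
            errors.append("entryC %s refers to missing eid %s"
                          % (syn.get("id"), syn.get("eid")))
    return errors

sample = ('<dictionary>'
          '<entry id="0" lex="Personal computer" pos="noun" cat=".category1"/>'
          '<entryC id="2" str="PC" pos="noun" eid="0"/>'
          '<entryC id="3" str="Laptop" pos="noun" eid="9"/>'
          '</dictionary>')
```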
takmi_index, which is run at the time of indexing, consists of a few different types of sub-processes. Description of each of them is provided below in the order of processing in takmi_index.
Processing (tool) | Description |
takmi_generate_config | Updates the MIML file list in database_config.xml for the specified database. |
takmi_index_singlebuild | Processes the MIML files individually and creates an intermediate file index for each MIML file. |
takmi_index_filemerge | Merges individual file indexes created in takmi_index_singlebuild to create an intermediate file index that contains information for all the MIML files. At this point, the index cannot be used yet. Processing by takmi_index_groupmerge is necessary. |
takmi_index_groupmerge | Processes the merged file index by takmi_index_filemerge to create a final index. |
In regular indexing, the processing listed above is carried out by running takmi_index in sequence, and then an index is created.
If there are many MIML files, you might choose to run "intermediate merging," which carries out the merge processing in several batches, to limit memory use. The number of MIML files that makes intermediate merging necessary depends on the average document size and the volume of language processing resources, but a rough guideline is about 200 files. Consider using intermediate merge processing if indexing fails with fewer files.
The tool necessary for doing intermediate merge processing is takmi_index_filemidmerge. This is run instead of takmi_index_filemerge in the takmi_index_filemerge phase.
takmi_index_filemidmerge has the following two functions:
How to create an index by using the intermediate merge processing is shown below.
First, ensure that the old index has been deleted, and run takmi_generate_config and takmi_index_singlebuild by following the procedures below.
> takmi_generate_config -dbdir DATABASE_DIRECTORY -template DATABASE_CONFIG_FILE > takmi_index_singlebuild DATABASE_DIRECTORY HEAP_SIZE_MB
In IBM Content Analyzer 8.4.2 or later, takmi_index_singlebuild can be run in two or more processes by specifying ranges of data_entry line numbers. The following example processes the first two data_entry lines and the third line onward in separate runs.
> takmi_index_singlebuild DATABASE_DIRECTORY HEAP_SIZE_MB 1 2 > takmi_index_singlebuild DATABASE_DIRECTORY HEAP_SIZE_MB 3
Then, select a MIML file group to be merged in the intermediate merge processing. Use a text editor to open DATABASE_DIRECTORY/conf/database_config.xml, search for the data_entries element, and find the list shown below.
<data_entries min_doc_id="0" max_doc_id="599999"> <data_entry path_type="relative" path="db\miml\sample.1.miml" type="miml" min_doc_id="0" max_doc_id="99999"/> <data_entry path_type="relative" path="db\miml\sample.2.miml" type="miml" min_doc_id="100000" max_doc_id="199999"/> <data_entry path_type="relative" path="db\miml\sample.3.miml" type="miml" min_doc_id="200000" max_doc_id="299999"/> <data_entry path_type="relative" path="db\miml\sample.4.miml" type="miml" min_doc_id="300000" max_doc_id="399999"/> <data_entry path_type="relative" path="db\miml\sample.5.miml" type="miml" min_doc_id="400000" max_doc_id="499999"/> <data_entry path_type="relative" path="db\miml\sample.6.miml" type="miml" min_doc_id="500000" max_doc_id="599999"/> </data_entries> |
By running takmi_index_filemidmerge, the intermediate merge processing can be partially done for MIML files.
Run the following command to merge sample.1.miml and sample.2.miml into an intermediate index.
> takmi_index_filemidmerge DATABASE_DIRECTORY -from 0 -to 199999

In the same manner, run the following commands in sequence to first merge sample.3.miml and sample.4.miml, and then to merge sample.5.miml and sample.6.miml.
> takmi_index_filemidmerge DATABASE_DIRECTORY -from 200000 -to 399999 > takmi_index_filemidmerge DATABASE_DIRECTORY -from 400000 -to 599999
Specify the -from and -to values such that the sections defined by -from and -to together cover exactly the range defined by the min_doc_id and max_doc_id of data_entries.
When the processing is completed for all the sections, run the following command to merge the indexes of individual sections.
> takmi_index_filemidmerge DATABASE_DIRECTORY -intervals 0-199999 200000-399999 400000-599999
When using the -intervals option, specify all the sections merged in the previous steps in the "from-to" format.
When all the processing above is successfully completed, run takmi_index_groupmerge. This will complete indexing.
> takmi_index_groupmerge DATABASE_DIRECTORY HEAP_SIZE_MB
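The -from/-to bookkeeping in the procedure above can be error-prone. The following Python sketch checks that a planned set of sections is contiguous and exactly covers the document-id range of data_entries; it is an illustrative helper, not a product tool.

```python
# Sketch: check that a set of -from/-to sections exactly covers the document-id
# range given by min_doc_id/max_doc_id on data_entries.
def covers(intervals, min_doc_id, max_doc_id):
    intervals = sorted(intervals)
    if intervals[0][0] != min_doc_id or intervals[-1][1] != max_doc_id:
        return False
    # Each section must start right after the previous one ends.
    return all(b[0] == a[1] + 1 for a, b in zip(intervals, intervals[1:]))
```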
Before deploying the language processing resources, it is necessary to set the categories for categorizing documents in database_config.xml. See the DOCAT Instruction Manual (in Japanese only) for the DOCAT settings.
The tools used in IBM Content Analyzer are located in the directory. The list of these tools is shown below. For Windows, the file names have the extension ".bat"; for AIX, they have the extension ".sh".
Tool | takmi_alert_correlation |
Function | Does the correlation detection batch processing |
How to use | takmi_alert_correlation DATABASE_NAME MAXIMUM_ANALYSIS_TIME_BY_MINUTE JAVA_HEAP_SIZE_BY_MEGA_BYTES |
Arguments |
|
Tool | takmi_alert_increase |
Function | Does the increase detection batch processing |
How to use | takmi_alert_increase DATABASE_NAME MAXIMUM_ANALYSIS_TIME_BY_MINUTE JAVA_HEAP_SIZE_BY_MEGA_BYTES |
Arguments |
|
Tool | takmi_clear_index |
Function | For the specified database, deletes the index generated by indexing. |
How to use | takmi_clear_index DATABASE_DIRECTORY |
Arguments |
|
Tool | takmi_clear_nlp_index |
Function | For the specified database, deletes the MIML files generated by language processing and the index generated by indexing. |
How to use | takmi_clear_nlp_index DATABASE_DIRECTORY |
Arguments |
|
Tool | takmi_data_ingester |
Function | Converts the specified CSV file into an ATML file. |
How to use | takmi_data_ingester CONFIG_FILE CSV_FILE ATML_FILE |
Arguments |
|
Tool | takmi_generate_config |
Function | For the specified database, updates the MIML file list in database_config.xml. |
How to use | takmi_generate_config -dbdir DATABASE_DIRECTORY -template DATABASE_CONFIG_FILE [-diff] |
Arguments |
|
Tool | takmi_index |
Function | For the specified database, does new indexing processing. |
How to use | takmi_index DATABASE_DIRECTORY [HEAP_SIZE_MB] |
Arguments |
|
Tool | takmi_index_diff |
Function | Does differential indexing for the specified database. |
How to use | takmi_index_diff DATABASE_DIRECTORY [HEAP_SIZE_MB] |
Arguments |
|
Tool | takmi_index_filemerge |
Function | Performs the file merge step of indexing. It is called from takmi_index. |
How to use | takmi_index_filemerge DATABASE_DIRECTORY HEAP_SIZE_MB |
Arguments |
|
Tool | takmi_index_filemidmerge |
Function | Performs the file merge step of indexing in batches. |
How to use |
takmi_index_filemidmerge DATABASE_DIRECTORY -from xxx -to yyy or takmi_index_filemidmerge DATABASE_DIRECTORY -intervals from1-to1 from2-to2 ... |
Arguments |
See 9.3 Indexing for details. Note that the Java heap size for intermediate merging is set to 1,000 MB by default. To change it, set the environment variable JAVA_HEAP_SIZE_BY_MEGA_BYTES_FILEMIDMERGE (in units of MB) before running the tool. For example, run the following commands in Windows:
> set JAVA_HEAP_SIZE_BY_MEGA_BYTES_FILEMIDMERGE=1500
> takmi_index_filemidmerge ... |
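The `-intervals from1-to1 from2-to2 ...` form above takes explicit batch ranges of MIML file numbers. The following is a hypothetical sketch of computing evenly sized batches for that argument; the function name and batching scheme are assumptions, not product behavior.

```shell
#!/bin/sh
# Hypothetical sketch: build the -intervals argument list for
# takmi_index_filemidmerge by splitting file numbers 1..TOTAL into
# batches of size BATCH. Names and the scheme are illustrative only.
make_intervals() {
  total=$1; batch=$2; from=1; out=""
  while [ "$from" -le "$total" ]; do
    to=$(( from + batch - 1 ))            # end of this batch
    [ "$to" -gt "$total" ] && to=$total   # clamp the last batch
    out="$out $from-$to"
    from=$(( to + 1 ))
  done
  echo "$out" | sed 's/^ //'              # trim the leading space
}

make_intervals 10 4   # -> 1-4 5-8 9-10
```

The resulting string could then be passed as `takmi_index_filemidmerge DATABASE_DIRECTORY -intervals $(make_intervals 10 4)`.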
Tool | takmi_index_groupmerge |
Function | Performs the group merge step of indexing. It is called from takmi_index. |
How to use | takmi_index_groupmerge DATABASE_DIRECTORY HEAP_SIZE_MB |
Arguments |
|
Tool | takmi_index_singlebuild |
Function | Processes individual MIML files separately during indexing. It is called from takmi_index. |
How to use | takmi_index_singlebuild DATABASE_DIRECTORY HEAP_SIZE_MB [FROM] [TO] |
Arguments |
|
Tool | takmi_index_singlebuild_diff |
Function | Processes individual MIML files separately during indexing. It is called from takmi_index_diff. |
How to use | takmi_index_singlebuild_diff DATABASE_DIRECTORY [HEAP_SIZE_MB] |
Arguments |
|
Tool | takmi_nlp |
Function | Uses the language processing resources of the specified database to perform language processing on ATML files and create MIML files. |
How to use | takmi_nlp DATABASE_DIRECTORY DATABASE_DIRECTORY/db/atml/input.atml DATABASE_DIRECTORY/db/miml/output.miml |
Arguments |
|
Tool | takmi_nlp_resource_deploy |
Function | For the specified database, it deploys the language processing resources. |
How to use | takmi_nlp_resource_deploy DATABASE_DIRECTORY |
Arguments |
|
Tool | takmi_remove_inactive_index |
Function | For the specified database, it deletes intermediate indexes that are not in use. |
How to use | takmi_remove_inactive_index DATABASE_DIRECTORY |
Arguments |
|
Tool | takmi_set_cp |
Function | It sets environment variables that are necessary for language processing. |
How to use | takmi_set_cp |
Arguments | None |
Tool | takmi_filenet2atml |
Function | Fetches documents from a FileNet P8 server and creates ATML files. |
How to use | takmi_filenet2atml CONFIG_FILE |
Arguments |
|
Tool | takmi_miml2filenet |
Function | Writes category information from MIML files to documents on a FileNet P8 server. |
How to use | takmi_miml2filenet CONFIG_FILE MIML_FILE |
Arguments |
|
Tool | takmi_icm2miml |
Function | This tool incorporates results of document categorization performed by IBM Classification Module into a MIML file. |
How to use | takmi_icm2miml CONFIG_FILE MIML_FILE |
Arguments |
|
This chapter describes directories and files used by IBM Content Analyzer.
The following directories and files are created when databases are created.
Directory and file name | Required | Changes during the operation | Description |
---|---|---|---|
category/category_tree.xml | Yes | Yes | It stores category information. The information is changed or categories are added when the Dictionary Editor is used to save the category tree. |
conf/database_config.xml | Yes | Yes | It has information such as modules to be used or MIML files to be indexed. It is changed as data is added or changed. |
conf/database_config_dictionary.xml | Yes | Yes | It has information on already defined categories to be used by the Dictionary Editor. It is changed when standard item information is changed. |
conf/database_config_miner.xml | Yes | No | It has view information and displayed category information of Text Miner. |
conf/database_config_alerting_system.xml | Yes | No | It has Alerting System setting information. |
conf/database_config_docat.xml | Yes | No | It has DOCAT setting information. |
conf/data_ingester_config_csv2atml.xml | Yes | No | It has Data Ingester setting information. |
conf/default,anonymous | Yes | No | It stores the Dictionary Editor configuration file (it is overwritten each time the tool is used). |
db/atml/*.atml | Yes | Yes | The input file for language processing. |
db/miml/*.miml | Yes | Yes | The output file for language processing. |
db/index | Yes | Yes | It stores the index file that Text Miner uses. |
dic/candidate | No | Yes | It stores the dictionary candidate word list. |
dic/category/*.adic.xml | No | Yes | It stores a dictionary file group for attaching categories (no files by default). Files are created as Dictionary Editor is used. |
dic/jsa/*.jma | No | Yes | (In Japanese only) It is a user dictionary file group (no files by default). Files are created as Dictionary Editor is used. |
dic/jsa/*.ddf | No | Yes | (In Japanese only) It has information on the user dictionary file group. It updates itself by using the *.jma file when language processing is activated. |
dic/jsa/takmi.dso | No | Yes | (In Japanese only) It has the user dictionary information to be used in language processing. It updates itself when language processing is activated. |
pattern/dictionary.pat | No | Yes | It is an information extraction pattern. It updates itself when the Dictionary Editor is used to update the category tree. |
alerting | Yes | Yes | This is a directory for the Alerting System. Directories and files are created when Alerting System tools are used. |
ie | Yes | Yes | (In Japanese only) It is a directory for DOCAT. Directories and files are created when DOCAT tools are used. |
Directory and file name | Read operation during the resource deployment processing | Write operation during the resource deployment processing | Read operation during language processing | Write operation during language processing | Description |
---|---|---|---|---|---|
category/category_tree.xml | Yes | Yes | Yes | No | When a category is updated in Dictionary Editor, a dependency category is created in the added category. Therefore, stop operating Dictionary Editor during this process. |
conf/database_config.xml | Yes | No | Yes | No | It acquires language information. |
conf/database_config_docat.xml | Yes | No | No | No | It reads DOCAT parameter information (working_directory). |
dic/category/*.adic.xml | Yes | No | Yes | No | This file is created when Dictionary Editor is used. Although this file will not be changed, it is still necessary to stop operating Dictionary Editor during this process. This is read during the synonym processing or category attachment processing. |
dic/jsa/*.jma | No | Yes | Yes | No | (In Japanese only) This is created from the *.adic.xml file created by Dictionary Editor. This will not be created if the *.adic.xml file does not exist. This is read by the parser. |
dic/jsa/*.ddf | No | Yes | Yes | No | (In Japanese only) This is created from the *.adic.xml file created by Dictionary Editor. This will not be created if the *.adic.xml file does not exist. This is read by the parser. |
dic/jsa/takmi.dso | No | Yes | Yes | No | (In Japanese only) This is created from the *.adic.xml file created by Dictionary Editor. This will not be created if the *.adic.xml file does not exist. This is read by the parser. |
dic/LangWare50/*.* | No | Yes | Yes | No | (In English only) This is created from the *.adic.xml file created by Dictionary Editor. This will not be created if the *.adic.xml file does not exist. This is read by the parser. |
pattern/auto_generated.pat | No | Yes | Yes | No | This is created from the *.adic.xml file and category_tree.xml created by Dictionary Editor. Therefore, stop operating Dictionary Editor during this process. This is read during the expression extraction processing. |
ie/categorization/*.feature.xml | Yes | Yes | Yes | No | (In Japanese only) This is a categorization trigger information file created by DOCAT. A file is created for each category and is updated as DOCAT is used. Therefore, stop operating Dictionary Editor during this process. This is read during the categorization processing by DOCAT. |
ie/categorization/*.model | Yes | Yes | Yes | No | (In Japanese only) This is a categorization model file created by DOCAT. A file is created for each category and is updated as DOCAT is used. Therefore, stop operating Dictionary Editor during this process. This is read during the categorization processing by DOCAT. |
ie/categorization/*.scfeature.xml | Yes | No(Yes) | Yes | No | (In Japanese only) This is a categorization trigger search condition file created by DOCAT. A file is created for each category and is updated as DOCAT is used. Therefore, stop operating Dictionary Editor during this process. Files are overwritten if working_directory is set. This is read during the categorization processing by DOCAT. |
ie/categorization/*.annotation.xml | Yes | No | No | No | (In Japanese only) This is a document selection part record file created by DOCAT. A file is created for each category and is updated as DOCAT is used. Therefore, stop operating Dictionary Editor during this process. This is not referred to during language processing. |
Directory and file name | Read operation during the system operation | Write operation during the system operation | Description |
---|---|---|---|
alerting/setting | - | - | This is created when Alerting System is launched. Files created in this directory are used only by Alerting System; therefore, it has no effect on tasks such as addition of data. |
alerting/setting/increase_detection_setting | Yes | No | This is a parameter setting file for increase detection. It is updated as the parameter settings are changed. |
alerting/setting/correlation_detection_setting | Yes | No | This is a parameter setting file for correlation detection. It is updated as the parameter settings are changed. |
alerting/batch | - | - | This is created when Alerting System is launched. Files created in this directory are used only by Alerting System; therefore, it has no effect on tasks such as addition of data. |
alerting/batch/increase_detection_report.xml | No | Yes | This is the result of the regular batch processing of increase detection. It is updated as the regular batch processing is carried out. |
alerting/batch/correlation_detection_report.xml | No | Yes | This is the result of the regular batch processing of correlation detection. It is updated as the regular batch processing is carried out. |
Directory and file name | Read operation during the system operation | Write operation during the system operation | Description |
---|---|---|---|
ie/categorization/*.feature.xml | Yes | Yes | (In Japanese only) This is a categorization trigger information file created by DOCAT. A file is created for each category and is updated as DOCAT is used. |
ie/categorization/*.model | Yes | Yes | (In Japanese only) This is a categorization model file created by DOCAT. A file is created for each category and is updated as DOCAT is used. |
ie/categorization/*.scfeature.xml | Yes | No | (In Japanese only) This is a categorization trigger search condition file created by DOCAT. A file is created for each category and is updated as DOCAT is used. |
ie/categorization/*.annotation.xml | Yes | No | (In Japanese only) This is a document selection part record file created by DOCAT. A file is created for each category and is updated as DOCAT is used. |
You can customize the following applications by including JavaScript files and CSS (Cascading Style Sheets) files. Edit or create the configuration file for customization.
Application | Configuration file |
---|---|
MINER | %TAKMI_HOME%/conf/global_config_miner.xml |
DICTIONARY EDITOR | %TAKMI_HOME%/conf/global_config_dic.xml |
DoCAT | %TAKMI_HOME%/conf/global_config_docat.xml |
ALERT | %TAKMI_HOME%/conf/global_config_alerting.xml |
Edit the %TAKMI_HOME%/conf/global_config_miner.xml file. In the following sample file, replace {URL_OF_SCRIPT_FILE} and {URL_OF_CSS_FILE} with values for your environment.
<?xml version="1.0" encoding="UTF-8"?>
<global_config>
  <customize>
    <script href="{URL_OF_SCRIPT_FILE1}" />
    <script href="{URL_OF_SCRIPT_FILE2}" />
    <stylesheet href="{URL_OF_CSS_FILE1}" />
    <stylesheet href="{URL_OF_CSS_FILE2}" />
  </customize>
</global_config> |
Edit the %TAKMI_HOME%/conf/global_config_dic.xml file. In the following sample file, replace {URL_OF_SCRIPT_FILE} and {URL_OF_CSS_FILE} with values for your environment.
<?xml version="1.0" encoding="UTF-8"?>
<global_config>
  <customize>
    <script href="{URL_OF_SCRIPT_FILE1}" />
    <script href="{URL_OF_SCRIPT_FILE2}" />
    <stylesheet href="{URL_OF_CSS_FILE1}" />
    <stylesheet href="{URL_OF_CSS_FILE2}" />
  </customize>
</global_config> |
Edit the %TAKMI_HOME%/conf/global_config_docat.xml file. In the following sample file, replace {URL_OF_SCRIPT_FILE} and {URL_OF_CSS_FILE} with values for your environment.
<?xml version="1.0" encoding="UTF-8"?>
<global_config_docat>
  <params>
    <param name="skewness_scale" value="0.0"/>
  </params>
  <customize>
    <script href="{URL_OF_SCRIPT_FILE1}" />
    <script href="{URL_OF_SCRIPT_FILE2}" />
    <stylesheet href="{URL_OF_CSS_FILE1}" />
    <stylesheet href="{URL_OF_CSS_FILE2}" />
  </customize>
</global_config_docat> |
Edit the %TAKMI_HOME%/conf/global_config_alerting.xml file. In the following sample file, replace {URL_OF_SCRIPT_FILE} and {URL_OF_CSS_FILE} with values for your environment.
<?xml version="1.0" encoding="UTF-8"?>
<global_config>
  <customize>
    <script href="{URL_OF_SCRIPT_FILE1}" />
    <script href="{URL_OF_SCRIPT_FILE2}" />
    <stylesheet href="{URL_OF_CSS_FILE1}" />
    <stylesheet href="{URL_OF_CSS_FILE2}" />
  </customize>
</global_config> |
See the following sample configuration files.
In the following sample file of window title configuration (ALERT, DICTIONARY, DoCAT, MINER), replace {YOUR_TITLE_PREFIX} with a value for your environment.
// customize-takmi-title-sample.js
var CUSTOMIZE_TITLE_PREFIX = "{YOUR_TITLE_PREFIX} ";
document.title = CUSTOMIZE_TITLE_PREFIX + document.title;
The following is a sample file of browser function key configuration (ALERT, DICTIONARY, DoCAT, MINER).
// customize-takmi-keyconfig-sample.js
allowMouseRightClick = false; // Right click disabled
allowKeyFunction1 = true;     // F1 key enabled
allowKeyFunction2 = false;    // F2 key disabled
allowKeyFunction3 = false;    // F3 key disabled
allowKeyFunction4 = false;    // F4 key disabled
allowKeyFunction5 = false;    // F5 key disabled
allowKeyFunction6 = false;    // F6 key disabled
allowKeyFunction7 = false;    // F7 key disabled
allowKeyFunction8 = false;    // F8 key disabled
allowKeyFunction9 = false;    // F9 key disabled
allowKeyFunction10 = false;   // F10 key disabled
allowKeyFunction11 = true;    // F11 key enabled
allowKeyFunction12 = false;   // F12 key disabled
Sample file for logo customization. (MINER)
Replace {URL_OF_LOGO_IMAGE} with a value for your environment.
/* masthead-logo1-sample.css */
div#masthead-logo1 {
  margin-top : 0px;
  margin-left : 0px;
  margin-right : 0px;
  margin-bottom : 0px;
  padding-top : 0px;
  padding-left : 0px;
  padding-right : 0px;
  padding-bottom : 0px;
  background-image : url("{URL_OF_LOGO_IMAGE}");
  width : 200px;
  height : 40px;
  position : absolute;
  top : 8px;
  left : 3px;
}
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.
IBM Corporation
Silicon Valley Lab
Building 090/H-410
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.