IBM Content Analyzer Text Miner Guide
Edition Notice
This edition applies to version 8, release 4 of IBM® Content Analyzer
and to all subsequent releases and modifications until otherwise indicated
in new editions.
This document contains proprietary information of IBM. This proprietary
information is provided in accordance with the license conditions and is
protected by copyright. Information contained in this document provides
no warranties whatsoever for any products. Also, no descriptions provided
in this document should be interpreted as product warranties. Depending
on the system environment, the yen symbol may be displayed as the backslash
symbol, or the backslash symbol may be displayed as the yen symbol.
© Copyright International Business Machines Corporation 2007, 2008. All
rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corp.
1 Introduction
This document describes how to use the IBM Content Analyzer Text Miner
application.
1.1 Screen Layout and Functional Overview
The Text Miner interface consists of four areas: Category Tree, Tools,
Search, and View.
Screen layout:
- (1) Category tree:
Select categories to be analyzed in this area.
- (2) Tools:
Various types of functions are available in this area such as saving analysis
results and showing the online help.
- (3) Search:
Perform operations (such as add and delete) that are associated with search
conditions in this area. It is also possible to add new search conditions
by using the items shown on the view.
- (4) View:
Analyze the retrieved documents narrowed down by the search conditions
in this area. This section consists of tabs for selecting a view type,
buttons for setting the parameters, and a table for showing the analysis
results.
1.2 Troubleshooting Browser problems
This section describes known problems related to the Internet Explorer
settings and their solutions.
- Problem: There are no arrow buttons in the category tree.
Solution: Depending on when data is displayed, the right frame might be
displayed over a part of the left frame, hiding the arrow buttons in the
category tree. When this occurs, change the order of display and refresh
the screen.
Normal state:
The buttons are not displayed:
- Problem: The same dialog appears twice when saving a bookmark, report,
or CSV file.
Solution: Data can be saved without problems by clicking Save each time
the dialog is displayed.
- Problem: The connection from the bookmark or report to the server is blocked.
Solution: To enable the connection for the current session, click the menu bar
on the Internet Explorer window, select Allow Blocked Content, and then
click Yes. To enable the connection every time, perform the following steps.
Before you do so, ensure that the security policy of your environment allows
the connection to be always enabled.
1. In Internet Explorer, click Tools -> Internet Options.
2. Click the Advanced tab and select the check box for Allow active content
to run in files on My Computer.
- Problem: External connections are blocked.
Solution: The following settings can solve the problem, but first make
sure that the security policy of your environment allows these settings.
1. In Internet Explorer, click Tools -> Internet Options -> Security.
2. Select Trusted sites and click Sites. Type the URL of Text Miner
in the Add this site to the zone field. For example, if the URL of Text
Miner is
https://miner.ibm.com:9443/OAE_Text Miner/, you type https://miner.ibm.com.
Because pop-ups might be blocked by settings in, for example, the Google
Toolbar, be sure to change the settings to allow pop-ups for external connections.
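The trusted-site entry in the example above is the Text Miner URL reduced to its scheme and host name, without the port or path. The following Python sketch illustrates that reduction. The function name trusted_site_entry is hypothetical and is not part of the product.

```python
from urllib.parse import urlsplit


def trusted_site_entry(text_miner_url: str) -> str:
    """Return scheme://host for a Text Miner URL, dropping the port and path."""
    parts = urlsplit(text_miner_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not an absolute URL: {text_miner_url!r}")
    return f"{parts.scheme}://{parts.hostname}"


# URL taken from the example in this section.
print(trusted_site_entry("https://miner.ibm.com:9443/OAE_Text Miner/"))
# prints https://miner.ibm.com
```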
1.3 Page Transition
Text Miner can analyze data while it is connected to the server, use the
bookmark and report functions to save analysis information in a local file,
and reconnect to the server from that file. The following graphic shows
the flow of page transitions.
Page transition:
Most of the system operation is done in the analysis screen, and various
types of views can be displayed in this screen. As seen in the following
graphic, you can switch views by clicking corresponding tabs.
View transition:

1.4 Select Database
Text Miner supports analysis of multiple databases. You select databases
on the Select Database screen, which is the first page in page transition.
Databases are usually created for each data format or data content type
such as customer calls, internal e-mails, and repair information.
Because the Select Database screen is on the top page in Text Miner, use
this page if you want to add the Text Miner site to My Favorites of Internet
Explorer.
Select database screen:

2 Category Tree
2.1 Category Tree Display
You can change the category tree display size (frame width) by dragging
your cursor. You can hide the category tree by dragging the line between
the left frame and the main frame.

2.2 Displaying and Selecting Categories
- Changing the order of display:
You can change the order in which categories are displayed by clicking
the Config, Type, and Name radio buttons.
- Click Config to sort the categories in order of creation in the configuration
file. The configuration file cannot be edited with Text Miner.
- Click Type to sort the categories depending on whether they have subcategories.
- Click Name to sort the categories in order of Unicode characters. In Unicode,
characters are arranged in order of numbers, uppercase letters, and lowercase
letters.
- Selecting a category:
The following views are available in Text Miner:
- Views in which categories are not used as parameters: Documents view and
Time Series view.
- Views in which only the vertical category is used as a parameter: Category
view, Topic view, and Delta view.
- Views in which both the vertical category and horizontal category are used
as parameters: 2D Map view.
For example, in the 2D Map view, the keywords and subcategories of the vertical
category are displayed vertically, and the keywords and subcategories of the
horizontal category are displayed horizontally. By using the buttons in the
category tree, you can select the vertical and horizontal categories to be used.
Once a category is selected, it stays selected even if the collapse button
for subcategories is clicked and the category is hidden.
3 Search
3.1 Overview of the Search Function
In Text Miner, search conditions can be used to generate a collection of
documents for analysis. In addition to using a single search condition,
such as keyword search and category search, you can also create a complex
search condition by combining three types of operators: AND, OR, and NOT.
3.2 Keyword Search
In keyword search, documents are searched by keywords that are registered
in the IBM Content Analyzer system dictionary and the user dictionary.
When synonyms are registered as keywords, documents that contain synonyms
of the keywords are also retrieved. For example, if the word "PC"
is set as a synonym of the keyword "personal computer," documents
that contain the word "PC" also meet the search condition. Use
the Dictionary Editor to register synonyms.
- Select Keyword Search as the search type in the select box, enter the keyword in the text field,
and click the Search button to retrieve documents that contain the input keyword.
- In the Category view, select Keywords as the list type, check the listed keywords, and click the Search button to retrieve documents that contain the selected keywords. When
multiple keywords are selected, the OR condition is added to the search
condition.

Keywords that are used in keyword search are not simple character
strings. They are the keywords extracted from the target documents
by language processing. For this reason, although documents that contain
synonyms can be retrieved, documents that merely contain the search text
as part of a longer character string might not
be retrieved. For example, documents that contain the phrase "call
center" will not be retrieved if you run a keyword search with the
term "center." This is because the phrase "call center"
is recognized as one word by the language processing, and the character
string "center" that is contained in that phrase is not recognized
as a word.
In keyword search, the category to which the keyword belongs affects the search
result. Therefore, even for the same keyword, the search results might
be different depending on which Category view or which text field the search
is performed from. Use the Dictionary Editor to create an association between
a category and a keyword. For example, assume there is a dictionary that
contains the following information:
- TP (keyword) = ThinkPad (synonym) is registered in the "PC product"
category.
- TP (keyword) = TrackPoint (synonym) is registered in the "peripheral
device" category.
In this case, when a keyword search is processed with "TP" in
the PC product category, documents that contain "TP" or "ThinkPad"
are retrieved, and when a keyword search is processed with the same keyword
in the peripheral device category, documents that contain "TP"
or "TrackPoint" are retrieved.
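The two behaviors described in this section, synonym matching that depends on the category and matching on extracted terms rather than raw substrings, can be illustrated with a minimal Python sketch. The dictionary layout, the sample documents, and the function keyword_search are assumptions made for illustration only. They do not reflect the internal data structures of IBM Content Analyzer.

```python
from typing import Dict, List, Set

# Hypothetical dictionary: category -> keyword -> keyword plus its synonyms.
DICTIONARY: Dict[str, Dict[str, Set[str]]] = {
    "PC product": {"TP": {"TP", "ThinkPad"}},
    "peripheral device": {"TP": {"TP", "TrackPoint"}},
}

# Hypothetical documents, already split into terms by language processing.
# Note that "call center" is a single term, not two.
DOCUMENTS: Dict[str, List[str]] = {
    "doc1": ["ThinkPad", "battery"],
    "doc2": ["TrackPoint", "cap"],
    "doc3": ["call center", "TP"],
}


def keyword_search(category: str, keyword: str) -> List[str]:
    """Return IDs of documents that contain the keyword or one of its synonyms."""
    # Fall back to the bare keyword when the category defines no synonyms.
    forms = DICTIONARY.get(category, {}).get(keyword, {keyword})
    return sorted(doc_id for doc_id, terms in DOCUMENTS.items()
                  if forms.intersection(terms))


print(keyword_search("PC product", "TP"))         # ['doc1', 'doc3']
print(keyword_search("peripheral device", "TP"))  # ['doc2', 'doc3']
print(keyword_search("PC product", "center"))     # []
```

The last call returns no documents because "center" is compared against whole terms, and the only related term in the sample data is "call center".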
3.3 Category Search
In category search, documents that contain an arbitrary keyword in the
specified category will be retrieved.
- In the Category view, select Subcategories as the list type, check the listed subcategories, and click Search to retrieve documents that contain the selected categories. When multiple
subcategories are selected, the OR condition is added to the search condition.

3.4 Date Search
In date search, documents created on a specified date (including not only
a specific date but also a period of time such as a month or a week) will
be retrieved.
- Select the appropriate check boxes in the Time Series view and click
Search to retrieve documents created on the specified date. When multiple dates
are selected, the OR condition is added to the search condition.

3.5 Search Operators
- Combining operators:
Text Miner supports a type of search condition that is created by combining
one or more OR conditions with one or more AND conditions. In the
search condition shown below, "Sun." and "Sat." are
in the OR relationship and together generate a condition for retrieving
documents dated on Saturdays and Sundays. The keywords "windows xp"
and "boot," and the date condition "Saturdays and Sundays"
are in the AND relationship.
Search conditions:
- Adding an AND condition:
When a search condition is already specified, click AND and then click Search to generate a search condition that requires both the existing and the newly added conditions to be met.
Adding an AND condition:

- Adding an OR condition:
When a search condition is already specified, click OR and then click Search to generate a search condition that is met by either the existing or the newly added condition.
Adding an OR condition:

- Adding an OR condition while an AND condition already exists:
When an OR condition is added and an AND condition already exists, the
last condition on the list subject to the AND condition and the newly added
condition will be in the OR relationship.
Adding an OR condition:

- Adding search conditions by using multiple check boxes:
When multiple check boxes in the Category view and Time Series view are
selected to add search conditions, the checked items are put together as
one OR condition and then added.
After adding an OR condition consisting of multiple dates to the overall
search condition as an AND condition:

- Adding search conditions from cells in various types of views:
Click cells in Topic view, Delta view, or 2D Map view to add search conditions
so that AND conditions for vertical and horizontal items will be added.
In this case, you do not need to click Search.
Adding search conditions from cells:

3.6 Using Operators with category node selection
You can select a node of the current search condition when a new condition
node is being added. In the following cases, nodes A and B are nodes
of the current search condition, and node X is being added to the current
search condition.
Case | Selected node | Operator | Result
Case1 | Parent node is AND | AND | Adds a new node as a leaf node of same parent node.
Case2 | Parent node is AND | OR | Adds an OR node of the selected node and adds a new node.
Case3 | AND | OR | OR node of selected AND node and new node.
Case4 | AND | AND | Adds a new node to the selected AND node.
Case5 | Parent node is OR | AND | Adds an AND node of the selected node and adds new node.
Case6 | Parent node is OR | OR | Adds a new node to the parent OR node.
Case7 | OR | AND | Adds an AND node of selected OR node and adds new node.
Case8 | OR | OR | Adds a new node to the selected OR node.
Case1: Adds a new node as a leaf node of same parent node.

Case2: Adds an OR node of the selected node and adds a new node.

Case3: OR node of selected AND node and new node.

Case4: Adds a new node to the selected AND node.

Case5: Adds an AND node of the selected node and adds new node.

Case6: Adds a new node to the parent OR node.

Case7: Adds an AND node of selected OR node and adds new node.

Case8: Adds a new node to the selected OR node.
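The eight cases above can be read as one rule: when the chosen operator matches the selected operator node, or matches the parent of the selected leaf, the new node is appended there. Otherwise, the selected node is wrapped together with the new node in a new operator node. The following Python sketch implements that reading. The classes Leaf and Op and the functions add_condition, find_parent, and show are hypothetical names, and the sketch is an interpretation of the table, not the product's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(eq=False)              # eq=False: nodes are compared by identity
class Leaf:
    label: str


@dataclass(eq=False)
class Op:
    kind: str                     # "AND" or "OR"
    children: List["Node"] = field(default_factory=list)


Node = Union[Leaf, Op]


def find_parent(root: Node, target: Node) -> Optional[Op]:
    """Return the operator node that directly contains target, if any."""
    if isinstance(root, Op):
        for child in root.children:
            if child is target:
                return root
            found = find_parent(child, target)
            if found is not None:
                return found
    return None


def add_condition(root: Node, selected: Node, operator: str, new: Node) -> Node:
    """Add new relative to selected and return the (possibly new) root."""
    if isinstance(selected, Op) and selected.kind == operator:
        selected.children.append(new)            # Cases 4 and 8
        return root
    parent = find_parent(root, selected)
    if isinstance(selected, Leaf) and parent is not None and parent.kind == operator:
        parent.children.append(new)              # Cases 1 and 6
        return root
    wrapper = Op(operator, [selected, new])      # Cases 2, 3, 5, and 7
    if parent is None:
        return wrapper                           # the selected node was the root
    parent.children[parent.children.index(selected)] = wrapper
    return root


def show(node: Node) -> str:
    """Render the condition tree as a parenthesized expression."""
    if isinstance(node, Leaf):
        return node.label
    return "(" + f" {node.kind} ".join(show(c) for c in node.children) + ")"


a, b, x = Leaf("A"), Leaf("B"), Leaf("X")
root = add_condition(Op("AND", [a, b]), b, "OR", x)   # Case 2
print(show(root))                                     # (A AND (B OR X))
```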

4 View
4.1 Common View Specifications
- Category specification:
When you specify a category for an analysis view, select a category
in the Category Tree shown in the left half of the screen, and then click
the View tab. The category change is applied to the view when the
View tab is clicked. The category specified here is not applied when you
change search conditions or parameters.
- Changing parameters:
When you change parameters with radio buttons or check boxes, changes will
be applied immediately after the screen operation. No other operations
are necessary after clicking the buttons.
- Resetting parameters:
All parameters, excluding vertical categories, horizontal categories, and
list items, will be reset when you click the View tab.
- Black numbers and pink numbers:
Numbers that are written in black show the frequency of appearance (number
of retrieved documents) of keywords, subcategories, or dates in retrieved
documents. Numbers that are written in pink show the statistical index
obtained based on the frequency.
4.2 Top View
This is the view displayed as the initial view after selecting a database.
The mining function is not available in this view.
4.3 Docs View
In the Documents view, documents that meet the current search conditions
can be viewed.
- Page buttons:
Click a triangle button to scroll the pages of listed documents. Documents
are listed in the order in which they were imported.
- Documents per page parameter:
Use radio buttons to specify how many documents should be displayed at
the same time on the screen. The page returns to the first page every time
the number is changed.
- Document display table:
- The first line shows the document title. The title is specified when importing
the data.
- The text body is displayed in the Text field. Depending on the operational settings, the text might not be displayed
or it might be truncated after a certain number of characters.
- Click the link on the "Detail" line to open a window called Document
Inspector to view detailed information about the document.
- Lines below the "Detail" line show the keywords contained in
that document for the displayed category. Standard items are usually set
here.
- Highlighting the document display table and parts that meet the search
condition:
When keywords or character strings are specified as search conditions,
parts of the document that meet the conditions are highlighted. When dependency
phrases are specified as search conditions, the entire phrase will be highlighted.
When the standard item category values, which are equivalent to the standard
columns in databases, are specified as search conditions, the screen shows
"KEYWORD" as the search condition, but there will be no highlighting
of the text because keywords do not exist in the body text.
The Document Inspector screen shows the text, standard information, and
keywords extracted from the text for the document selected in Documents
view. By using this function, you can understand the types of keywords
that are internally extracted by the language processing.
- Document Information: The Document Information area shows the document
ID, title, and the text in the original data, which is to be imported by
IBM Content Analyzer. The full text will be displayed because there is
no limit for the number of characters that can be displayed.
- Standard Information: In the Standard Information area, standard information
that was associated with the document when it was imported by IBM Content
Analyzer is displayed as <category, keyword> pairs. When there are
multiple keywords for a standard item category, multiple lines are used
to show all these keywords.
- Keyword Information: In the Keyword Information area, keywords extracted
by the language processing are displayed as <category, keyword> pairs.
Clicking a radio button in the "Highlight" field will highlight
the corresponding keyword in the original text.
4.4 Category View
For the keywords and subcategories that belong to the category specified
as the vertical category in the category tree, the Category view shows
the frequency of appearance within retrieved documents and the correlation
with search conditions.
- List parameter:
Use the radio buttons to select whether keywords belonging to the vertical
category or subcategories belonging to the vertical category should be
displayed.
- Sort parameter:
Use the radio buttons to specify how to sort keywords and categories in
the list.
- When you click Frequency, keywords and subcategories are listed in order of how frequently they
appear in the retrieved documents.
- When you click Correlation, keywords and subcategories are listed in order of the strength of their correlation with
the retrieved documents.
- When you click Alphabet, keywords and subcategories are sorted in Unicode
character code order. In Unicode, characters are arranged
in order of numbers, uppercase letters, and lowercase letters.
Even when the order of display changes, the keywords and subcategories
to be retrieved are determined based on the frequency. For example, when
you click Correlation and the maximum number of lines to display is set to 100, 100 keywords
that are retrieved based on frequency will be listed in order of correlation
strength.
- Max lines parameter:
Use the radio buttons to specify the maximum number of keywords or categories
that can be listed.
- Search check box (inside the table):
Select a check box and click Search at the top of the screen to add a search condition for the specified item.
When multiple check boxes are selected, the checked items (keywords or categories)
are linked with the OR operator into one search condition, which is then
added to the existing condition.
- Keywords (inside the table):
When keywords are specified as items to be listed, the keywords belonging
to the vertical category are listed in the second column of the table.
- Subcategories (inside the table):
When subcategories are specified as items to be listed, the subcategories
of the vertical category are listed in the second column of the table.
- Frequency (inside the table):
The frequency of appearance of the keywords or subcategories shown on that
line within the collection of retrieved documents (the number of applicable
documents) is presented as numbers and graphs in black.
- Correlation (inside the table):
Strength of correlation between the collection of retrieved documents and
the keywords or subcategories on that line is presented as numbers and
graphs in pink.
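The interaction between the Sort and Max lines parameters described above is a two-step process: items are first selected by frequency, and only the selected items are then reordered. The following Python sketch illustrates this for the Frequency and Alphabet sort options. The function rows_to_display and the sample counts are hypothetical, the tie-breaking rule is an assumption, and the Correlation option is omitted because this guide does not give its formula.

```python
from typing import Dict, List


def rows_to_display(frequency: Dict[str, int], sort_by: str, max_lines: int) -> List[str]:
    """Pick the max_lines most frequent items, then order them for display."""
    # Step 1: selection is always by frequency (ties broken by name; an assumption).
    top = sorted(frequency, key=lambda k: (-frequency[k], k))[:max_lines]
    # Step 2: only the selected items are reordered.
    if sort_by == "frequency":
        return top
    if sort_by == "alphabet":
        # Python compares strings by Unicode code point:
        # digits, then uppercase letters, then lowercase letters.
        return sorted(top)
    raise ValueError(f"unsupported sort option: {sort_by!r}")


counts = {"printer": 40, "Battery": 25, "2GB": 10, "screen": 5}
print(rows_to_display(counts, "frequency", 3))  # ['printer', 'Battery', '2GB']
print(rows_to_display(counts, "alphabet", 3))   # ['2GB', 'Battery', 'printer']
```

In this example, "screen" never appears in the output, regardless of the sort option, because it is not among the three most frequent items.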
4.5 Time Series View
The Time Series view shows how the number of documents that meet the current search
conditions changes over a period of time.
- Time scale parameter:
Use the time scale radio buttons to specify the size of the time axis in
the Time Series graph. Year, Half, Quarter, Month, Week, and Date correspond
to a year, half a year, a quarter, a month, a week, and a day, respectively.
Day of month means a calendar date, and the same calendar dates in different
months will be counted as one. Day of week shows the frequency on each
day of the week.
- From-time parameter:
This parameter defines the left end of the display range of the time line.
This is a function to remove unnecessary data, and this setting will be
applied to the resulting report.
- To-time parameter:
This parameter defines the right end of the display range of the time line.
This is a function to remove unnecessary data, and this setting will be
applied to the resulting report.
- Search check box (inside the table):
Select a check box and click Search at the top of the screen to add a search condition for the specified date.
When you select multiple check boxes, search conditions generated with
the checked dates are linked with the OR operator and then added to the
existing condition.
- Unknown (inside the table):
When there are documents with no date attributes, the word "Unknown"
is displayed where the date is normally displayed, and the frequency of
those documents is shown.
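The time scales described above differ in how document dates are grouped before counting. The following Python sketch shows one possible grouping for each scale, including the Unknown bucket for documents without a date. The function names bucket and frequency and the label formats are hypothetical. ISO week numbering is assumed for the Week scale because this guide does not state how weeks are delimited.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Optional

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def bucket(d: Optional[date], scale: str) -> str:
    """Map a document date to the label of its bucket on the given time scale."""
    if d is None:
        return "Unknown"                      # documents with no date attribute
    if scale == "Year":
        return f"{d.year}"
    if scale == "Half":
        return f"{d.year}-H{(d.month - 1) // 6 + 1}"
    if scale == "Quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if scale == "Month":
        return f"{d.year}-{d.month:02d}"
    if scale == "Week":
        iso = d.isocalendar()                 # assumption: ISO week numbering
        return f"{iso[0]}-W{iso[1]:02d}"
    if scale == "Date":
        return d.isoformat()
    if scale == "Day of month":
        return f"{d.day:02d}"                 # same day in different months is merged
    if scale == "Day of week":
        return WEEKDAYS[d.weekday()]
    raise ValueError(f"unsupported time scale: {scale!r}")


def frequency(dates: Iterable[Optional[date]], scale: str) -> Counter:
    """Count documents per bucket."""
    return Counter(bucket(d, scale) for d in dates)


docs = [date(2008, 1, 5), date(2008, 2, 5), date(2008, 2, 9), None]
print(frequency(docs, "Day of month"))  # 05: 2, 09: 1, Unknown: 1
print(frequency(docs, "Day of week"))   # Sat: 2, Tue: 1, Unknown: 1
```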
4.6 Topic View
In the Topic view, changes over time are analyzed for each keyword or subcategory
belonging to the category specified as the vertical category in the category
tree, and parts with relatively high frequency will be highlighted.
- List parameter:
Use the radio buttons to select whether keywords belonging to the vertical
category or subcategories belonging to the vertical category should be
displayed.
- Sort parameter:
Use the radio buttons to specify how to sort keywords and categories in
the list.
- When you click Frequency, keywords and subcategories are listed in order of how frequently they
appear in the retrieved documents.
- When you click Alphabet, keywords and subcategories are sorted in Unicode
character code order. In Unicode, characters are arranged
in order of numbers, uppercase letters, and lowercase letters.
Even when the order of display changes, the keywords and subcategories
to be retrieved are determined based on the frequency. For example, when
you click Alphabet and the maximum number of lines to display is set to 20, 20 keywords that
are retrieved based on frequency will be listed in alphabetical order.
- Max lines to display parameter:
Use the radio buttons to specify the maximum number of keywords or categories
that can be listed.
- Time scale parameter:
Use the time scale radio buttons to specify the size of the time axis in
the Time Series graph. Year, Month, Week, and Date correspond to a year,
a month, a week, and a day, respectively. Day of month means a calendar
date, and the same calendar dates in different months will be counted as
one. Day of week shows the frequency on each day of the week.
- From-time parameter:
This parameter defines the left end of the display range of the time line.
This is a function to remove unnecessary data, and this setting will be
applied to the resulting report.
- To-time parameter:
This parameter defines the right end of the display range of the time line.
This is a function to remove unnecessary data, and this setting will be
applied to the resulting report.
- Frequency (inside the table):
The total number of applicable documents is shown in a numerical value
for each date.
- Cell value (inside the table):
Frequency of appearance of the keyword or subcategory in each line is shown
as a numerical value for each date. Parts with relatively high frequency
within the retrieved documents are underlined. When the number in a particular
cell is clicked, the AND condition consisting of the vertical and horizontal
items will be added to the current search condition. In this case, you
do not need to click Search. In addition, the operator radio buttons and the NOT check box are both
disabled.
4.7 Delta View
In the Delta view, over-time changes of each keyword or subcategory belonging
to the category specified as the vertical category are shown in a time
series graph, and parts where future increases are predicted will be highlighted.
This view can function as a simpler version of the alert function because
it uses the same indicator that detects increases over time.
- List parameter:
Use the radio buttons to select whether keywords belonging to the vertical
category or subcategories belonging to the vertical category should be displayed.
- Sort parameter:
Use the radio buttons to specify how to sort keywords and categories in
the list.
- When you click Frequency, keywords and subcategories are listed in order of how frequently they
appear in the retrieved documents.
- When you click Alphabet, keywords and subcategories are sorted in order of the Unicode characters.
In Unicode, characters are arranged in order of numbers, uppercase letters,
and lowercase letters.
- When you click Latest alerting indicator, keywords and subcategories are listed in descending order
of the increase indicator on the latest date within
the time line display range. The increase indicator is in the last column
to the right.
Also, keywords and subcategories to be retrieved are determined based on
the frequency, regardless of the sort method. For example, when you click
Alphabet and the maximum number of lines to display is set to 20, 20 keywords that
are retrieved based on frequency will be listed in alphabetical order.
- Max lines to display parameter:
Use the radio buttons to specify the maximum number of keywords or categories
that can be listed.
- Time scale parameter:
Use the time scale radio buttons to specify the size of the time axis in
the Time Series graph. Year, Month, Week, and Date correspond to a year,
a month, a week, and a day, respectively.
- From-time parameter:
This parameter defines the left end of the display range of the time line.
This is a function to remove unnecessary data, and this setting will be
applied to the resulting report.
- To-time parameter:
This parameter defines the right end of the display range of the time line.
This is a function to remove unnecessary data, and this setting will be
applied to the resulting report.
- Increase indicator (inside the table):
The increase indicator for each date is displayed in pink above the Time
Series graph. A value above zero means an increase, and a value below zero
means a decrease. The increase indicator is not provided for the first
four dates because the amount of data is not sufficient for calculating
the increase indicator for these dates.
- Time Series graph (inside the table):
The frequency of appearance of each keyword or subcategory in retrieved
documents is presented as a bar chart.
4.8 2D Map View
The 2D Map view uses a two-dimensional table to show the correlation between
keywords or subcategories that belong to the vertical category specified
in the category tree and keywords or subcategories that belong to the horizontal
category. Cells showing a high correlation between vertical and horizontal
items are highlighted.
- List parameter:
Use the radio buttons provided in the left half of the screen to select
whether keywords belonging to the vertical category or subcategories belonging
to the vertical category should be displayed on the vertical axis of the
table. In the same manner, use the radio buttons provided in the right
half of the screen to select the items to be displayed along the horizontal
axis of the table.
- Sort parameter:
Use the radio buttons to specify how to sort keywords and categories in
the list. The radio buttons provided in the left half of the screen are
for the vertical axis, and the radio buttons provided in the right half
of the screen are for the horizontal axis.
- When you click Frequency, keywords and subcategories are listed in order of how frequently they
appear in the retrieved documents.
- When you click Alphabet, keywords and subcategories are sorted in order of the Unicode characters.
In Unicode, characters are arranged in order of numbers, uppercase letters,
and lowercase letters.
Even when the order of display changes, the keywords and subcategories
to be retrieved are determined based on the frequency. For example, when
you click Alphabet and the maximum number of lines is set to 20, 20 keywords that are retrieved
based on frequency will be listed in alphabetical order.
- Max lines to display parameter:
Use the radio buttons to specify the maximum number of keywords or categories
that can be listed. The radio buttons provided in the left half of the
screen are for the vertical axis, and the radio buttons provided in the
right half of the screen are for the horizontal axis.
- Cell value (inside the table)
The first line in each cell shows in black the number of documents that
contain [vertical keyword/subcategory] and [horizontal keyword/subcategory].
The second line in each cell shows in pink the correlation between the
vertical keyword or subcategory and the horizontal keyword or subcategory.
Cells showing high correlations
are highlighted.
5 Tools
5.1 Bookmark
The bookmark function saves the current search conditions and parameters
in the currently displayed view as a local file. A connection to the server
is reestablished after the saved bookmark is opened, allowing the analysis
to resume with the saved search conditions and parameters.
The bookmark does not save analysis results of views such as graphs and
values, and it cannot be used if the server cannot be accessed. To save
snapshots of analysis results of views, use the report function (see 5.2 Report Function).
- Saving the bookmark:
Click the bookmark link and specify the local destination. Close the blank
window after saving the bookmark.
- Managing the bookmark:
The bookmark can be managed as a local file. Users who access the same
Text Miner server can also share bookmark files, for example through e-mail
or file sharing.
- Accessing the server from the bookmark:
When you open a locally saved bookmark, click JUMP to connect to Text Miner.
- Handling of invalid search conditions:
Because the bookmark does not save analysis results, results that are different
from what they were at the time of bookmark creation will be displayed
if data is changed after the bookmark was created. Note that if a category
used as a search condition in the bookmark is deleted, that search condition
becomes invalid. A warning message is displayed for the invalid condition, and the search
returns 0 results ("all" if the NOT operator is used
in the search).
Example of an invalid search condition:
- Information saved in the bookmark:
A bookmark saves information about the search condition, view selection
information, view parameters, and vertical and horizontal categories. However,
the bookmark does not save the state of the category tree frame on the
left, namely the category sort method and the categories that are currently
expanded.
5.2 Report Function
The currently displayed analysis results can be saved as a local file by
using the report function. Access to the server is not required for opening
the report, and it can be viewed in Internet Explorer. If you can connect
to the server, the analysis can resume with the saved search condition
and parameters, as with a bookmark.
Report edit:
Click the Report link to open the report edit screen. In the edit screen,
you can decide whether the display description should be shown. You can
also edit comments (memos about the report). You can enter up to 2,000 characters
(including line breaks) in the comment field. Also, when there are "Attach
documents" check boxes for displayed items, samples of documents narrowed
down by the checked items will be attached. After editing the report, click
Create to save the report.
Report:
- Open the saved report to see the content without connecting to the server.
- Click the "here" link to reconnect to the server with the search conditions
and parameters that are used in the report, and resume the analysis from
that report.
Link to a created report or sample document:
When you select the Attach documents check box, the in-file link to the
sample document is displayed. Click the link to jump to the corresponding
document, which is provided at the bottom of the file.
Created report or sample document:
By checking the content of the sample document, you can understand the
situation in which the listed items are actually used.

5.3 CSV Output Function
You can save the currently displayed analysis results as a CSV file by
using the CSV output function. Because CSV files can be opened in Excel,
you can use the file to create a customized report.
- Saving a CSV file: Click the CSV output link and specify a local destination.
A blank window might be displayed separately. Close this window after saving
the file.
- Restriction: Opening the CSV file directly from the save window might cause
problems. Save the CSV file to your local system before you open it.
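As an alternative to Excel, a saved CSV file can also be post-processed with a short script. The following is a sketch under stated assumptions: the column names `keyword` and `frequency`, the sample rows, and the helper `top_rows` are made up for illustration, because the actual columns depend on the view that was exported.

```python
import csv
import io

# Made-up sample standing in for a saved CSV file; the real column
# names and layout depend on the view that was exported.
sample = io.StringIO(
    "keyword,frequency\n"
    "printer,120\n"
    "manual,45\n"
    "network,300\n"
)


def top_rows(csv_file, sort_column, limit=2):
    """Return the rows with the largest numeric values in sort_column."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row[sort_column]), reverse=True)
    return rows[:limit]


for row in top_rows(sample, "frequency"):
    print(row["keyword"], row["frequency"])
```

For a real file, pass `open(path, newline="")` instead of the in-memory sample; the appropriate text encoding depends on the environment and is not specified here.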
5.4 Save Function
Click Save to save the current search condition to your system. The file name,
folder name, and comment can be set in the save confirmation window.

- Limitation: File names and folder names must not contain:
  - More than 32 characters
  - Any of the following characters:
    - \ backslash
    - / slash (separator of folders)
    - : colon
    - * asterisk
    - ? question mark
    - " double quotation mark
    - < less than sign
    - > greater than sign
    - | vertical bar
  - Other characters restricted by the server system
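The documented limits can be checked before a name is entered. The following is a minimal sketch; the function `name_problems` and the constants are hypothetical, length is counted in Python characters (an assumption about how the 32-character limit is measured), and the additional characters restricted by the server system are not covered.

```python
FORBIDDEN_CHARS = set('\\/:*?"<>|')  # the nine characters listed above
MAX_LENGTH = 32


def name_problems(name):
    """Return a list of reasons why name breaks the documented limits."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("more than 32 characters")
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    return problems


print(name_problems("weekly_report"))  # no problems: empty list
print(name_problems("a/b:c"))          # reports the slash and the colon
```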
5.5 List Function
Click List to show the search conditions saved in server storage.
A search condition is loaded by selecting a file name in the list.
To delete an item, select the check box of the folder name or file name and
click Delete.

- A folder is not deleted if it contains other files or folders.
5.6 XML Download Function
Click XML Download to download the current search condition as an XML file.
- To reuse a search condition, the same database must be used for the analysis.
5.7 XML Upload Function
Click XML Upload to upload a search condition XML file that was saved with the
XML Download function.
- To reuse a search condition, the same database must be used for the analysis.
5.8 Analyze a document that has multiple text entries
You can analyze a document that has two or more text entries. Select the
text entry name from the Options menu.
- This function requires that the documents have multiple text entries and
that the database_config.xml file is configured for them.
- The selection of a text entry is not reflected in a bookmark, a report, or
an XML download.
6 Statistical Index
6.1 Characteristics of the Text Miner Statistical Index
Values displayed in pink in Text Miner are the indexes that are calculated
by Text Miner, and they are differentiated from raw data such as the number
of documents displayed in black.
Misunderstandings are likely to occur when raw data is interpreted. For
example, suppose that the frequency of the keyword "receive ... mail"
increased by a factor of 1.5 from January to February. To determine whether
the frequency really increased, you must examine the increase rate with
respect to the change in the total number of documents, and take statistical
noise into consideration.
To allow the indexes to be interpreted intuitively, for example as an
increase rate or a correlation strength, Text Miner displays fully corrected
indexes.
- Views that support heuristic analysis:
Text Miner supports the discovery of topics and problems by not only showing
the number of applicable documents in terms of keywords or subcategories
but also by visualizing the elements that are different from others in
keyword distribution in the documents. The specific methods are analysis
of over-time changes of frequency (Time Series, Topic, and Delta views)
and correlation analysis (Category and 2D Map views).
- Noise removal:
Even though Text Miner analyzes a huge number of documents, it is not always
the case that the number of documents that meet individual search conditions
or the number of documents that contain certain keywords or subcategories
is sufficient for statistical processing. Also, in the case of handling
call center data or BBS logs, there is always a certain level of noise
in the number of documents. Therefore, Text Miner always takes the effect of
noise on the reliability of values into account and shows corrected values as
indexes instead of simply showing raw values, such as the difference in the
number of documents or the ratio of the number of documents.
6.2 Correlation
- Applicable view:
Correlation values are used in the Category view and the 2D Map view. The
Category view shows the level of correlation between a search condition and
the listed keywords or subcategories, and the 2D Map view shows the likelihood
of co-occurrence of vertical and horizontal items.
- Values before correction:
The correlation between two document collections A and B is defined as
follows. The letter D represents the entire collection of documents, and
the # symbol represents the number of documents in the collection. The
left and the right sides of the equation are equal to each other.
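Reconstructed from the descriptions of the two sides given later in this section (an inference, because the formula itself is not shown in this text), the definition can be written as:

```latex
\[
\frac{\#(A \cap B) / \#A}{\#B / \#D}
  \;=\;
\frac{\#(A \cap B) / \#D}{(\#A / \#D)\,(\#B / \#D)}
\]
```

Both sides reduce to \#(A \cap B) \cdot \#D / (\#A \cdot \#B), which is why they are equal.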
For example, assuming that two collections of documents are
A = {documents that contain a keyword "PC"}
and
B = {documents that contain a keyword "see ... manual"},
then the left side of the equation will be:
(the percentage of users who want to get an instruction manual, when the
documents containing the keyword "PC" are examined)
divided by
(the percentage of users who want to get an instruction manual, when all the
documents are examined)
This set can be illustrated as described in the following graphic. For
example, when 5% of all the documents are about obtaining an instruction
manual, this figure rises to 20% when only personal computer-related documents
are examined. The correlation value between "PC" and "see
manual" is 4, meaning that the correlation is strong.

The right side of the equation is the ratio of the actual density of A∩B,
which is #(A∩B)/#D, to the product of the density of A and the density of B,
which is (#A/#D)(#B/#D). It represents the deviation from independence of A
and B. The right side is more intuitive than the left side as a 2D Map index.
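The two readings of the uncorrected correlation value can be checked numerically. The following is a minimal sketch; the function names and the document counts are made up to match the example above (5% of all documents and 20% of the "PC" documents concern the manual).

```python
import math


def correlation_left(n_a, n_b, n_ab, n_d):
    """Share of B within A, divided by the share of B within all documents."""
    return (n_ab / n_a) / (n_b / n_d)


def correlation_right(n_a, n_b, n_ab, n_d):
    """Actual density of A and B together, divided by the product of densities."""
    return (n_ab / n_d) / ((n_a / n_d) * (n_b / n_d))


# Made-up counts: 10,000 documents, 1,000 contain "PC" (A),
# 500 concern the manual (B, 5%), 200 are in both (20% of A).
counts = dict(n_a=1000, n_b=500, n_ab=200, n_d=10000)
left = correlation_left(**counts)
right = correlation_right(**counts)
print(left, right, math.isclose(left, right))
```

Both functions give a value of about 4 for these counts, in line with the example.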
- Values after correction:
The reliability of the correlation value decreases as the value of #(A∩B)
(the number of documents that contain both keywords "PC" and "see ...
manual" in this example) becomes smaller in the preceding formula.
Without correction, a high correlation value could be obtained from too few
documents to be reliable, which lowers the efficiency and accuracy of the
analysis. To avoid this situation, Text Miner uses interval estimation to make
low-reliability values smaller. The true correlation value, which is the value
that would be obtained from an infinite number of documents, is treated as
unknown, and the interval estimation method obtains the smallest true value
that can produce the current correlation value, excluding coincidences that
occur below a certain probability.
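The guide does not give the estimation procedure itself. As an illustration of the idea only, the following sketch swaps in a Wilson score lower bound for the proportion #(A∩B)/#A and divides it by #B/#D. This is an assumed stand-in, not Text Miner's method; the function names, the z value of 1.96, and the counts are hypothetical, and the uncertainty in #B/#D is ignored for simplicity.

```python
import math


def wilson_lower(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    centre = p + z * z / (2 * trials)
    spread = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return (centre - spread) / (1 + z * z / trials)


def corrected_correlation(n_a, n_b, n_ab, n_d, z=1.96):
    """Conservative correlation: lower bound of #(A and B)/#A over #B/#D."""
    return wilson_lower(n_ab, n_a, z) / (n_b / n_d)


# Both cases have the same raw correlation (about 4) but very
# different amounts of evidence behind it.
few = corrected_correlation(n_a=5, n_b=500, n_ab=1, n_d=10000)
many = corrected_correlation(n_a=1000, n_b=500, n_ab=200, n_d=10000)
print(round(few, 2), round(many, 2))
```

With only five "PC" documents the corrected value falls below 1, while with a thousand it stays close to the raw value, which is the qualitative behavior described above.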
6.3 Topicality Index
- Applicable view:
The topicality index is used to highlight data in the Topic view.
- Meaning of index and summary of calculations:
The topicality index measures how keywords and subcategories in each line
in the Topic view deviate from the average frequency on each date. Data
showing a higher frequency relative to other data for other dates having
the same keywords and subcategories is highlighted, and normalization is
carried out so that data will not be highlighted in response to over-time
changes in all searched documents. As a result of the normalization, it
is possible to avoid situations in which weekly analysis results lose accuracy
only for the weeks having many holidays because the number of documents
is small for these weeks. The two Topic view windows show data for the same
category and the same date. Two weeks' worth of document data is removed
from August 2004 data in the lower window, but the tendency of highlighting
is the same as the upper window.
Topic view based on all documents:
Topic view based on all documents, but excluding three weeks' worth of
documents from August 2004 data:
- Definition of the index:
For each line, the frequency along the time line is normalized so that
changes linked with the frequency along the time line for all searched
documents (time series values shown in the Frequency line in the Topic
view) will be ignored, and the index is obtained by dividing the difference
between the average normalized time series value and cell frequency by
the variation scale. Its mathematical expression is as follows. The letter
D represents the entire collection of documents, and the # symbol represents
the number of documents in the collection. Here,
M = {documents in a particular month}, and
K = {documents containing the keyword shown on a particular line}.
6.4 Increase Indicator
- Applicable view:
The increase indicator is used to highlight data in the Delta view. The
same increase indicator is used in the alerting function.
- Meaning of index and summary of calculations:
As the name suggests, the increase indicator is an index to measure the
increase in frequency of use of keywords and subcategories along the time
line, and it shows how much the frequency obtained on the current date
varies from the constant state, assuming that the past time series frequency
was constant. Constant noises in the frequency time line are estimated
by using the Poisson distribution, and based on the obtained scale, the
size of variation is calculated in terms of a scaling factor. Normalization
is also carried out so that data will not vary with over-time frequency
changes in all searched documents.
- Definition of the index:
Assuming that the frequency of all searched documents along the time line
is
Global time series:
g_1, g_2, ..., g_N
and that the frequency of items in each line in the Delta
view along the time line is
Keyword time series:
k_1, k_2, ..., k_N
the time series of the accumulated frequency is defined as follows. Here,
the letter D (for decaying factor) is a parameter used for weighted average
values. As the D value increases, the time series frequency of the distant
past weighs more, and as the D value decreases, the time series frequency
of the distant past is ignored. In Text Miner, D = 0.85, and this means
that the time series frequency of the (n-4)-th date contributes to the
calculation of an average value with about half the weight of the frequency
of the n-th date (0.85 to the fourth power is approximately 0.52).
Weighted accumulated global time series:
G_1 = g_1
G_n = D \cdot G_{n-1} + g_n (n = 2, 3, ..., N)
Weighted accumulated keyword time series:
K_1 = k_1
K_n = D \cdot K_{n-1} + k_n (n = 2, 3, ..., N)
Frequency of keywords and subcategories on the n-th date can be estimated
as follows when the variation within all searched documents is taken into
consideration:
Estimated average keyword value:
A_n = g_n \cdot (K_{n-1} / G_{n-1}) (n = 2, 3, ..., N)
By using the preceding data, the increase index X_n on the n-th date is defined as
X_n = 0 (n = 1, 2, 3, 4)
X_n = (k_n - A_n) / \sqrt{A_n} (n >= 5).
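The definition above can be followed step by step in code. The following is a minimal sketch, not Text Miner's implementation: the function name `increase_index` and the sample series are made up, and the guards against a zero accumulated total or a zero expected value are assumptions added so that the sketch never divides by zero.

```python
import math

DECAY = 0.85  # the decaying factor D from the text


def increase_index(global_series, keyword_series, decay=DECAY):
    """Return the increase index for each date (lists are 0-based)."""
    x = [0.0] * len(global_series)
    g_acc = k_acc = 0.0  # weighted accumulations up to the previous date
    for i, (g, k) in enumerate(zip(global_series, keyword_series)):
        # The index is defined as 0 for the first four dates (i = 0..3).
        if i >= 4 and g_acc > 0:
            expected = g * (k_acc / g_acc)  # estimated average keyword value
            if expected > 0:  # assumption: leave the index at 0 otherwise
                x[i] = (k - expected) / math.sqrt(expected)
        # Update the accumulations: new total = D * old total + current value.
        g_acc = decay * g_acc + g
        k_acc = decay * k_acc + k
    return x


# Made-up series: the keyword holds a steady 10% share, then triples.
result = increase_index([100] * 6, [10] * 5 + [30])
print([round(v, 2) for v in result])
print(round(DECAY ** 4, 3))  # weight of a date four steps back
```

In this sample the index stays at about 0 while the keyword keeps its usual share and rises to about 6.3 on the last date, where 30 documents appear against an expected 10.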
Terms of Use
Notices
This information was developed for products and services offered in the
U.S.A.
IBM may not offer the products, services, or features discussed in this
document in other countries. Consult your local IBM representative for
information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to
state or imply that only that IBM product, program, or service may be used.
Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However,
it is the user's responsibility to evaluate and verify the operation of
any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant
you any license to these patents. You can send license inquiries, in writing,
to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
For license inquiries regarding double-byte (DBCS) information, contact
the IBM Intellectual Property Department in your country or send inquiries,
in writing, to:
IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION
"AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not
allow disclaimer of express or implied warranties in certain transactions,
therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical
errors. Changes are periodically made to the information herein; these
changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s)
described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those
Web sites. The materials at those Web sites are not part of the materials
for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way
it believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the
purpose of enabling: (i) the exchange of information between independently
created programs and other programs (including this one) and (ii) the mutual
use of the information which has been exchanged, should contact:
IBM Corporation
Silicon Valley Lab
Building 090/H-410
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.
Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.
The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.
Information concerning non-IBM products was obtained from the suppliers
of those products, their published announcements or other publicly available
sources. IBM has not tested those products and cannot confirm the accuracy
of performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to
the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to
change or withdrawal without notice, and represent goals and objectives
only.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples
include the names of individuals, companies, brands, and products. All
of these names are fictitious and any similarity to the names and addresses
used by an actual business enterprise is entirely coincidental.
Copyright License
This information contains sample application programs in source language,
which illustrate programming techniques on various operating platforms.
You may copy, modify, and distribute these sample programs in any form
without payment to IBM, for the purposes of developing, using, marketing
or distributing application programs conforming to the application programming
interface for the operating platform for which the sample programs are
written. These examples have not been thoroughly tested under all conditions.
IBM, therefore, cannot guarantee or imply reliability, serviceability,
or function of these programs.
Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks.
See for information about IBM trademarks.
The following terms are trademarks or registered trademarks of other companies:
Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Sun Microsystems, Inc. in the United States, other countries,
or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.
Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation
in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and
other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries,
or both.
Other company, product or service names might be trademarks or service
marks of others.