Edition Notice
First Edition (February 2007)
This edition applies to version 8, release 4 of IBM® OmniFind™ Analytics Edition and to all subsequent releases and modifications until otherwise indicated in new editions.
This document contains proprietary information of IBM. This proprietary information is provided in accordance with the license conditions and is protected by copyright. Information contained in this document provides no warranties whatsoever for any products. Also, no descriptions provided in this document should be interpreted as product warranties. Depending on the system environment, the yen symbol may be displayed as the backslash symbol, or the backslash symbol may be displayed as the yen symbol.
© Copyright International Business Machines Corporation 2007. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
1 Introduction
This document describes how to use the IBM OmniFind Analytics Edition Text Miner application.
1.1 Pop-up Help
1.2 Screen Layout and Functional Overview
The Text Miner interface consists of four areas: Category Tree (left), Tools (top right), Search (middle right), and View (bottom right). Screen layout:
-
Category tree:
Select categories to be analyzed in this area.
-
Tools:
Various types of functions are available in this area such as saving analysis results and showing the online help.
-
Search:
Carry out operations (such as add and delete) that are associated with search conditions in this area. It is also possible to add new search conditions by using the items shown on the view.
-
View:
Analyze the retrieved documents narrowed down by the search conditions in this area. This section consists of tabs for selecting a view type, buttons for setting the parameters, and a table for showing the analysis results.
1.3 Browser-specific Issues
This section describes problems caused by the Internet Explorer settings and their solutions.
-
There are no arrow buttons in the category tree.
Depending on when data is displayed, the right frame might be displayed over a part of the left frame, hiding the arrow buttons in the category tree. When this occurs, change the order of display and refresh the screen. Normal state: The buttons are not displayed:
-
The same dialog appears twice when saving a bookmark, report, or CSV file.
Data can be saved without problems by clicking "Save" each time the dialog appears.
-
Connection from the bookmark/report to the server is blocked:
To enable the connection in each session, click the menu bar on the Internet Explorer window to select "Allow Blocked Content", and then click "Yes". To always enable the connection, follow the procedures below, but first make sure that the security policy of your environment allows the connection to be always enabled. Start Internet Explorer. Click "Tools -> Internet Options". Click the "Advanced" tab. Select the check box for "Allow active content to run in files on My Computer".
-
External connection is blocked:
The following settings can solve the problem, but first make sure that the security policy of your environment allows these settings. Start Internet Explorer. Click "Tools -> Internet Options -> Security." Select "Trusted sites" and click the "Site" button. Type the URL of Text Miner in the "Add this site to the zone" field. For example, if the URL of Text Miner is https://miner.ibm.com:9443/OAE_Text Miner/, you would type https://miner.ibm.com. Because pop-ups might be blocked by settings in, for example, the Google Toolbar, be sure to change the settings to allow pop-ups for external connections.
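The trusted-site entry in the example above is the scheme and host name of the Text Miner URL, with the port and path left off. As a minimal sketch of that derivation in Python (the function name `trusted_site_entry` is made up for illustration and is not part of the product):

```python
from urllib.parse import urlsplit


def trusted_site_entry(text_miner_url: str) -> str:
    """Return scheme://host for a URL, dropping the port and the path."""
    parts = urlsplit(text_miner_url)
    return f"{parts.scheme}://{parts.hostname}"


# Example URL taken from the text above.
print(trusted_site_entry("https://miner.ibm.com:9443/OAE_Text Miner/"))
```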
1.4 Page Transition
Text Miner can analyze data while being connected to the server, use the bookmark and report functions to save analysis information in a local file, and reconnect to the server from that file. The page transition diagram, including the local file, is as follows: Page transition: Most of the system operation is carried out in the analysis screen, and various types of views can be displayed in this screen. As seen below, you can switch views by clicking corresponding tabs. View transition:
1.5 Select Database Screen
Text Miner supports analysis of multiple databases. You select databases on the Select Database screen, which is the first page in page transition. Databases are usually created for each data format or data content such as customer calls, internal e-mail, and repair information. Because the Select Database screen is the top page for Text Miner, use this page if you want to add the Text Miner site to My Favorites of Internet Explorer. Select database screen:
2 Category Tree
2.1 Category Tree Display Size
Category tree display size (frame width) can be changed by dragging the white border between the left and right panes with the mouse.
-
To completely hide the category tree, click the "Close" button on the top of the category tree.
-
To view the category tree in the original size, click "Show category tree" in the "Tools" area at the top of the screen.
2.2 Display/Selection of Categories
-
Changing the order of display:
You can change the order in which categories are displayed by using the "Config," "Type," and "Name" radio buttons. "Config" is the default setting.
-
When "Config" is selected, the categories are sorted in order of creation in the configuration file. The configuration file cannot be edited with Text Miner.
-
When "Type" is selected, the categories are sorted depending on whether they have subcategories or not.
-
When "Name" is selected, the categories are sorted by Unicode code point. In this order, numbers come first, followed by uppercase letters and then lowercase letters.
-
Show/Hide subcategories:
When a category has one or more subcategories, a button is displayed in front of the category name so that the subcategories can be shown or hidden.
Status | Image | When the mouse is clicked
Subcategories are not shown. | (button image) | Show subcategories.
Subcategories are shown. | (button image) | Hide subcategories.
-
Selecting a category:
The following views are available in Text Miner:
-
Views in which categories are not used as parameters: Docs view and Time Series view.
-
Views in which only the vertical category is used as a parameter: Category view, Topic view, and Delta view.
-
Views in which both the vertical category and horizontal category are used as parameters: 2D Map
For example, the keywords and subcategories of the vertical category are listed vertically, and the keywords and subcategories of the horizontal category are listed horizontally. By using the buttons in the category tree, you can select the vertical and horizontal categories to be used. Category selection buttons and their meanings:
Button image | Meaning | When the mouse is clicked
(button image) | For vertical category selection. | Select the category shown on the same line as a vertical category.
(button image) | For horizontal category selection. | Select the category shown on the same line as a horizontal category.
(button image) | The corresponding category is selected as a vertical category. | Nothing happens.
(button image) | The corresponding category is selected as a horizontal category. | Nothing happens.
Once a category is selected, it stays selected even if the collapse button for subcategories is clicked and the category is hidden.
3 Search
3.1 Overview of the Search Function
In Text Miner, search conditions can be used to generate a collection of documents for analysis. In addition to using a single search condition, such as keyword search and category search, it is also possible to create a complex search condition by combining three types of operators: AND, OR, and NOT.
3.2 Keyword Search
In keyword search, documents are searched by keywords that are registered in the IBM OmniFind Analytics Edition system dictionary and the user dictionary. When synonyms are registered as keywords, documents that contain synonyms of the keywords are also retrieved. For example, if the word "PC" is set as a synonym of the keyword "personal computer," documents that contain the word "PC" also meet the search condition. Use Dictionary Editor to register synonyms.
-
Select "Keyword Search" as the search type in the select box, enter the keyword in the text field, and click the "Search" button to retrieve documents that contain the input keyword.
-
In Category view, select "Keywords" as the list type, check the listed keywords, and click the "Search" button to retrieve documents that contain the selected keywords. When multiple keywords are selected, the OR condition is added to the search condition.
Keywords that are used in keyword search are not simple character strings; they are the keywords extracted from the target documents by language processing. For this reason, although documents that contain synonyms can be retrieved, a document that contains the search text only as part of a longer word or phrase might not be retrieved. For example, documents containing the phrase "call center" are not retrieved if you run a keyword search with the term "center." This is because language processing recognizes the phrase "call center" as one word, and the character string "center" contained in that phrase is not recognized as a word.
In keyword search, the category to which the keyword belongs affects the search result. Therefore, even for the same keyword, the search result might differ depending on which Category view or which text field the search is performed from. Use Dictionary Editor to create an association between a category and a keyword. For example, assume there is a dictionary that contains the following information:
- TP (keyword) = ThinkPad (synonym) is registered in the "PC product" category.
- TP (keyword) = TrackPoint (synonym) is registered in the "peripheral device" category.
In this case, when a keyword search is processed with "TP" in the PC product category, documents that contain the word "TP" or "ThinkPad" are retrieved, and when a keyword search is processed with the same keyword in the peripheral device category, documents that contain the word "TP" or "TrackPoint" are retrieved.
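The behavior described in this section can be illustrated with a small Python sketch. It is not the product's implementation: the dictionary layout, the document data, and the function name `keyword_search` are all assumptions made for illustration. Each document is represented by the set of words that language processing is assumed to have extracted from it, so "call center" is a single word.

```python
# Hypothetical dictionary: (category, keyword) -> synonyms, mirroring the TP example.
DICTIONARY = {
    ("PC product", "TP"): {"ThinkPad"},
    ("peripheral device", "TP"): {"TrackPoint"},
}

# Made-up documents, each reduced to the words extracted by language processing.
DOCUMENTS = {
    "doc1": {"ThinkPad", "battery"},
    "doc2": {"TrackPoint", "cap"},
    "doc3": {"call center", "TP"},
}


def keyword_search(category, keyword):
    """Return IDs of documents containing the keyword or one of its synonyms."""
    terms = {keyword} | DICTIONARY.get((category, keyword), set())
    return sorted(doc_id for doc_id, words in DOCUMENTS.items() if words & terms)


print(keyword_search("PC product", "TP"))         # doc1 and doc3
print(keyword_search("peripheral device", "TP"))  # doc2 and doc3
print(keyword_search("PC product", "center"))     # no hits: "call center" is one word
```

Matching whole extracted words rather than substrings is what makes the "center" search miss the "call center" document in this sketch.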
3.3 Category Search
In category search, documents that contain an arbitrary keyword in the specified category will be retrieved.
-
In the Category view, select "Subcategories" as the list type, check the listed subcategories, and click the "Search" button to retrieve documents that contain the selected categories. When multiple subcategories are selected, the OR condition is added to the search condition.
3.4 Date Search
In date search, documents created on a specified date (including not only a specific date but also a period of time such as a month or a week) will be retrieved.
-
Select the appropriate check boxes in the Time Series view and click the "Search" button to retrieve documents created on the specified date. When multiple dates are selected, the OR condition is added to the search condition.
3.5 Using Operators
-
Type of search condition that can be created by Text Miner:
Text Miner supports a type of search condition that is created by combining one or more OR conditions with one or more AND conditions. In the search condition shown below, "Sun." and "Sat." are in the OR relationship and together generate a condition for retrieving documents dated on weekend dates. The keywords "windows xp" and "boot," and the date condition "weekend" are in the AND relationship. Search conditions:
-
Adding an AND condition:
While the search condition is already specified, check the "AND" radio button and then click the "Search" button to generate a search condition that includes all the existing and newly added conditions. Adding an AND condition: After adding the AND condition:
-
Adding an OR condition:
While the search condition is already specified, check the "OR" radio button and then click the "Search" button to generate a search condition that meets either the existing or newly added condition. Adding an OR condition: After adding the OR condition:
-
Adding an OR condition while an AND condition already exists:
When an OR condition is added while an AND condition already exists, the last condition on the list subject to the AND condition and the newly added condition will be in the OR relationship. Adding an OR condition: After adding the OR condition (the "computer" cell is replaced by "computer" OR "pc"):
-
Adding search conditions using multiple checkboxes:
When multiple check boxes in the Category view and Time Series view are selected to add search conditions, the checked items are put together as one OR condition and then added. Adding an OR condition consisting of multiple dates to the overall search condition as an AND condition: After adding an OR condition consisting of multiple dates to the overall search condition as an AND condition:
-
Adding search conditions from cells in various types of view:
Click cells in Topic view, Delta view, or 2D Map view to add search conditions so that AND conditions for vertical and horizontal items will be added. In this case, the "Search" button does not have to be clicked, and also, the operator radio buttons and the NOT checkbox are both disabled. Adding search conditions from cells: After adding search conditions from cells:
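The way AND and OR conditions combine can be modeled as a list of OR groups that are joined by AND. The following Python sketch illustrates that model under stated assumptions: the function names `add_condition` and `matches` are hypothetical, the NOT operator is left out, and the product's internal representation is not documented here.

```python
def add_condition(groups, new_terms, operator="AND"):
    """Return a new search condition: a list of OR groups joined by AND.

    AND appends the new terms as a separate OR group.
    OR merges the new terms into the last existing group.
    """
    new_terms = list(new_terms)
    if operator == "AND" or not groups:
        return groups + [new_terms]
    if operator == "OR":
        return groups[:-1] + [groups[-1] + new_terms]
    raise ValueError(f"unsupported operator: {operator}")


def matches(groups, doc_terms):
    """A document matches when every OR group has at least one term in it."""
    return all(any(term in doc_terms for term in group) for group in groups)


# Build the weekend example: "windows xp" AND "boot" AND ("Sun." OR "Sat.").
condition = add_condition([], ["windows xp"])
condition = add_condition(condition, ["boot"], "AND")
condition = add_condition(condition, ["Sun.", "Sat."], "AND")
print(condition)

# Adding OR while an AND condition exists changes only the last group.
print(add_condition([["printer"], ["computer"]], ["pc"], "OR"))
```

In the second call, the "computer" group becomes "computer" OR "pc", which corresponds to the cell replacement described above.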
4 View
4.1 Common View Specifications
-
Category specification:
When specifying a category for an analysis view, select a category in the Category Tree shown in the left half of the screen, and then click the View tab. The category change will be applied to the view when the View tab is clicked. The category specified here will not be applied when changing search conditions or parameters.
-
Changing parameters:
When parameters are changed with radio buttons or check boxes, changes will be applied immediately after the screen operation. No other operations are necessary after clicking the buttons.
-
Resetting parameters:
All parameters, excluding vertical categories, horizontal categories, and list items, will be reset when the View tab is clicked.
-
Black numbers and pink numbers:
Numbers that are written in black show the frequency of appearance (number of retrieved documents) of keywords, subcategories, or dates in retrieved documents. Numbers that are written in pink show the statistical index obtained based on the frequency.
4.2 Top View
This is the view displayed as the initial view after selecting a database. The mining function is not available in this view.
4.3 Docs View
In the Docs view, documents that meet the current search conditions can be viewed.
-
External connection button:
This button appears when applications to be connected by Text Miner are registered. When the button is clicked, the currently displayed document list is sent to a registered external application. The maximum size of search conditions that can be sent through an external connection is limited. It is usually 50,000 characters for the entire form, and if there is an extremely large number of search conditions, a warning will be issued when the external connection dialog appears. Although you cannot see how many characters are contained in the search condition form, the number will be shown when a warning is issued.
-
Page buttons:
Click a triangle button to scroll the pages of listed documents. Documents are listed in order of data import (usually by date).
-
Docs per page parameter:
Use radio buttons to specify how many documents should be displayed at the same time on the screen. The page returns to the first page every time the number is changed.
-
Document display table:
-
The first line shows the document title. The title is specified when importing the data.
-
The text body is displayed in the "Text" field. Depending on the operational settings, the text might not be displayed or it might be truncated after a certain number of characters.
-
Click the link on the "Detail" line to open a pop-up window called Document Inspector to view detailed information about the document.
-
Lines below the "Detail" line show the keywords contained in that document for the displayed category. Standard items are usually set here.
-
Highlighting the document display table/parts that meet the search condition:
When keywords or character strings are specified as search conditions, parts of the document that meet the conditions are highlighted. When dependency phrases are specified as search conditions, the entire phrase will be highlighted. When the standard item category values, which are equivalent to the standard columns in databases, are specified as search conditions, the screen shows "KEYWORD" as the search condition, but there will be no highlighting of the text since keywords do not exist in the body text.
The Document Inspector screen shows the text, standard information, and keywords extracted from the text for the document selected in Docs view. By using this function, it is possible to understand the types of keywords that are internally extracted by the language processing.
-
Document Information: The Document Information area shows the document ID, title, and the text in the original data, which is to be imported by IBM OmniFind Analytics Edition. The full text will be displayed because there is no limit for the number of characters that can be displayed.
-
Standard Information: In the Standard Information area, standard information that was associated with the document when it was imported by IBM OmniFind Analytics Edition is displayed as <category, keyword> pairs. When there are multiple keywords for a standard item category, multiple lines are used to show all these keywords.
-
Keyword Information: In the Keyword Information area, keywords extracted by the language processing are displayed as <category, keyword> pairs. Clicking a radio button in the "Highlight" field will highlight the corresponding keyword in the original text.
4.4 Category View
For the keywords and subcategories that belong to the category specified as the vertical category in the category tree, the Category view lists the frequency of appearance within retrieved documents and the correlation with search conditions.
-
"List" parameter:
Use the radio buttons to select whether keywords belonging to the vertical category or subcategories belonging to the vertical category should be displayed.
-
"Sort" parameter:
Use the radio buttons to specify how to sort keywords and categories in the list.
-
When "Frequency" is selected, keywords and subcategories are listed in order of how frequently they appear in the retrieved documents.
-
When "Correlation" is selected, keywords and subcategories are listed in order of the strength of their correlation with the retrieved documents.
-
When "Alphabet" is selected, keywords and subcategories are sorted by Unicode code point. In this order, numbers come first, followed by uppercase letters and then lowercase letters.
Even when the order of display changes, the keywords and subcategories to be retrieved are determined based on the frequency. For example, when "Correlation" is selected and the maximum number of lines to display is set to 100, 100 keywords that are retrieved based on frequency will be listed in order of correlation strength.
-
"Max lines to display" parameter:
Use the radio buttons to specify the maximum number of keywords or categories that can be listed.
-
Search check box (inside the table):
Select a check box and click the "Search" button at the top of the screen to add a search condition for the specified item. When multiple check boxes are selected, the checked items (keywords or categories) are linked with the OR operator and then added to the existing condition.
-
Keywords (inside the table):
When keywords are specified as items to be listed, the keywords belonging to the vertical category are listed in the second column of the table.
-
Subcategories (inside the table):
When subcategories are specified as items to be listed, the subcategories of the vertical category are listed in the second column of the table.
-
Frequency (inside the table):
The frequency of appearance of the keywords or subcategories shown on that line within the collection of retrieved documents (the number of applicable documents) is presented as numbers and graphs in black.
-
Correlation (inside the table):
Strength of correlation between the collection of retrieved documents and the keywords or subcategories on that line is presented as numbers and graphs in pink.
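The interaction between the "Sort" and "Max lines to display" parameters can be sketched in Python as follows. This is an illustration only: the row data and correlation values are made up, the function name `category_view_rows` is hypothetical, and sorting correlation from strongest to weakest is an assumption. Python compares strings by Unicode code point, which gives the numbers, uppercase, lowercase order described above.

```python
def category_view_rows(rows, sort="Frequency", max_lines=100):
    """Pick the most frequent rows first, then apply the display order."""
    # Retrieval is always based on frequency, regardless of the sort method.
    top = sorted(rows, key=lambda r: r["frequency"], reverse=True)[:max_lines]
    if sort == "Correlation":
        return sorted(top, key=lambda r: r["correlation"], reverse=True)
    if sort == "Alphabet":
        # String comparison in Python follows Unicode code point order.
        return sorted(top, key=lambda r: r["keyword"])
    return top


# Made-up sample rows.
ROWS = [
    {"keyword": "battery", "frequency": 40, "correlation": 1.2},
    {"keyword": "Boot", "frequency": 30, "correlation": 3.5},
    {"keyword": "2GB", "frequency": 20, "correlation": 0.8},
    {"keyword": "adapter", "frequency": 5, "correlation": 9.0},
]

print([r["keyword"] for r in category_view_rows(ROWS, "Alphabet", 3)])
print([r["keyword"] for r in category_view_rows(ROWS, "Correlation", 3)])
```

With a limit of 3, "adapter" never appears in either listing even though it has the highest correlation in the sample data, because it is not among the three most frequent keywords.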
4.5 Time Series View
The Time Series view shows how many documents meet the current search conditions over a period of time.
-
"Time scale" parameter:
Use the time scale radio buttons to specify the size of the time axis in the Time Series graph. "Year," "Half," "Quarter," "Month," "Week," and "Date" correspond to a year, half a year, a quarter, a month, a week, and a day, respectively. "Day of month" means a calendar date, and the same calendar dates in different months will be counted as one. "Day of week" shows the frequency on each day of the week.
-
"From-time" parameter:
This parameter defines the left end of the display range of the time line. This is a function to remove unnecessary data, and this setting will be applied to the resulting report.
-
"To-time" parameter:
This parameter defines the right end of the display range of the time line. This is a function to remove unnecessary data, and this setting will be applied to the resulting report.
-
Search check box (inside the table):
Select a check box and click the "Search" button at the top of the screen to add a search condition for the specified date. When multiple check boxes are selected, the checked dates are linked with the OR operator and then added to the existing condition.
-
Unknown (inside the table):
When there are documents with no date attributes, the word "Unknown" is displayed where the date is normally displayed, and the frequency of those documents is shown.
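The time scale options can be thought of as different ways of grouping document dates before counting them. The Python sketch below illustrates this idea; it is not the product's implementation. The function name `bucket`, the label formats, the use of ISO week numbers for "Week", and the January to June split for "Half" are all assumptions.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def bucket(d, scale):
    """Map a document date to a label for the given time scale."""
    if d is None:
        return "Unknown"  # documents with no date attribute
    if scale == "Year":
        return str(d.year)
    if scale == "Half":
        return f"{d.year}-H{1 if d.month <= 6 else 2}"
    if scale == "Quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if scale == "Month":
        return f"{d.year}-{d.month:02d}"
    if scale == "Week":
        iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if scale == "Date":
        return d.isoformat()
    if scale == "Day of month":
        return f"{d.day:02d}"  # same calendar day in different months is merged
    if scale == "Day of week":
        return WEEKDAYS[d.weekday()]
    raise ValueError(f"unknown time scale: {scale}")


# Made-up document dates; None stands for a document without a date.
DATES = [date(2007, 1, 15), date(2007, 2, 15), date(2007, 2, 17), None]

print(Counter(bucket(d, "Month") for d in DATES))
print(Counter(bucket(d, "Day of month") for d in DATES))
```

In the "Day of month" count, the two documents dated the 15th fall into one bucket even though they come from different months.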
4.6 Topic View
In the Topic view, changes over time are analyzed for each keyword or subcategory belonging to the category specified as the vertical category in the category tree, and parts with relatively high frequency will be highlighted.
-
"List" parameter:
Use the radio buttons to select whether keywords belonging to the vertical category or subcategories belonging to the vertical category should be displayed.
-
"Sort" parameter:
Use the radio buttons to specify how to sort keywords and categories in the list.
-
When "Frequency" is selected, keywords and subcategories are listed in order of how frequently they appear in the retrieved documents.
-
When "Alphabet" is selected, keywords and subcategories are sorted by Unicode code point. In this order, numbers come first, followed by uppercase letters and then lowercase letters.
Even when the order of display changes, the keywords and subcategories to be retrieved are determined based on the frequency. For example, when "Alphabet" is selected and the maximum number of lines to display is set to 20, 20 keywords that are retrieved based on frequency will be listed in alphabetical order.
-
"Max lines to display" parameter:
Use the radio buttons to specify the maximum number of keywords or categories that can be listed.
-
"Time scale" parameter:
Use the time scale radio buttons to specify the size of the time axis in the Time Series graph. "Year," "Month," "Week," and "Date" correspond to a year, a month, a week, and a day, respectively. "Day of month" means a calendar date, and the same calendar dates in different months will be counted as one. "Day of week" shows the frequency on each day of the week.
-
"From-time" parameter:
This parameter defines the left end of the display range of the time line. This is a function to remove unnecessary data, and this setting will be applied to the resulting report.
-
"To-time" parameter:
This parameter defines the right end of the display range of the time line. This is a function to remove unnecessary data, and this setting will be applied to the resulting report.
-
Frequency (inside the table):
The total number of applicable documents is shown in a numerical value for each date.
-
Cell value (inside the table):
Frequency of appearance of the keyword or subcategory in each line is shown as a numerical value for each date. Parts with relatively high frequency within the retrieved document are underlined. Refer to "" for details on highlighting. When the number in a particular cell is clicked, the AND condition consisting of the vertical and horizontal items will be added to the current search condition. In this case, the "Search" button does not have to be clicked. In addition, the operator radio buttons and the NOT checkbox are both disabled.
4.7 Delta View
In the Delta view, over-time changes of each keyword or subcategory belonging to the category specified as the vertical category in the category tree are shown in a time series graph, and parts where future increases are predicted will be highlighted. This view can function as a simpler version of the alert function since it uses the same increase detection indicator as the alert function.
-
"List" parameter:
Use the radio buttons to select whether keywords belonging to the vertical category or subcategories belonging to the vertical category should be displayed.
-
"Sort" parameter:
Use the radio buttons to specify how to sort keywords and categories in the list.
-
When "Frequency" is selected, keywords and subcategories are listed in order of how frequently they appear in the retrieved documents.
-
When "Alphabet" is selected, keywords and subcategories are sorted by Unicode code point. In this order, numbers come first, followed by uppercase letters and then lowercase letters.
-
When "Latest alerting indicator" is selected, keywords and subcategories are listed in descending order of the increase indicator on the latest date within the time line display range (the increase indicator in the last column to the right).
Also, keywords and subcategories to be retrieved are determined based on the frequency, regardless of the sort method. For example, when "Alphabet" is selected and the maximum number of lines to display is set to 20, 20 keywords that are retrieved based on frequency will be listed in alphabetical order.
-
"Max lines to display" parameter:
Use the radio buttons to specify the maximum number of keywords or categories that can be listed.
-
"Time scale" parameter:
Use the time scale radio buttons to specify the size of the time axis in the Time Series graph. "Year," "Month," "Week," and "Date" correspond to a year, a month, a week, and a day, respectively.
-
"From-time" parameter:
This parameter defines the left end of the display range of the time line. This is a function to remove unnecessary data, and this setting will be applied to the resulting report.
-
"To-time" parameter:
This parameter defines the right end of the display range of the time line. This is a function to remove unnecessary data, and this setting will be applied to the resulting report.
-
Increase indicator (inside the table):
The increase indicator for each date is displayed in pink above the Time Series graph. A value above zero means an increase, and a value below zero means a decrease. The increase indicator is not provided for the first four dates since the amount of data is not sufficient for calculating the increase indicator for these dates. Refer to "" for details on the increase indicator.
-
Time Series graph (inside the table):
The frequency of appearance of each keyword or subcategory in retrieved documents is presented as a bar chart.
4.8 2D Map View
The 2D Map view uses a two-dimensional table to show the correlation between keywords or subcategories that belong to the vertical category specified in the category tree and keywords or subcategories that belong to the horizontal category. Cells showing a high correlation between vertical and horizontal items are highlighted.
-
"List" parameter:
Use the radio buttons provided in the left half of the screen to select whether keywords belonging to the vertical category or subcategories belonging to the vertical category should be displayed on the vertical axis of the table. In the same manner, use the radio buttons provided in the right half of the screen to select the items to be displayed along the horizontal axis of the table.
-
"Sort" parameter:
Use the radio buttons to specify how to sort keywords and categories in the list. The radio buttons provided in the left half of the screen are for the vertical axis, and the radio buttons provided in the right half of the screen are for the horizontal axis.
-
When "Frequency" is selected, keywords and subcategories are listed in order of how frequently they appear in the retrieved documents.
-
When "Alphabet" is selected, keywords and subcategories are sorted by Unicode code point. In this order, numbers come first, followed by uppercase letters and then lowercase letters.
Even when the order of display changes, the keywords and subcategories to be retrieved are determined based on the frequency. For example, when "Alphabet" is selected and the maximum number of lines is set to 20, 20 keywords that are retrieved based on frequency will be listed in alphabetical order.
-
"Max lines to display" parameter:
Use the radio buttons to specify the maximum number of keywords or categories that can be listed. The radio buttons provided in the left half of the screen are for the vertical axis, and the radio buttons provided in the right half of the screen are for the horizontal axis.
-
Cell value (inside the table):
The first line in each cell shows in black the number of documents that contain [vertical keyword/subcategory] and [horizontal keyword/subcategory]. The second line in each cell shows in pink the correlation between the vertical keyword or subcategory and the horizontal keyword or subcategory. Refer to "" for details on correlation values. Cells showing high correlations are highlighted.
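The black number in each 2D Map cell is a co-occurrence count: the number of retrieved documents that contain both the vertical item and the horizontal item. A minimal Python sketch of that count follows. The document data and the function name `cooccurrence_table` are made up for illustration, and the pink correlation value is not computed because its formula is not given in this section.

```python
def cooccurrence_table(documents, vertical_items, horizontal_items):
    """Count documents containing both the vertical and the horizontal item."""
    return {
        (v, h): sum(1 for words in documents if v in words and h in words)
        for v in vertical_items
        for h in horizontal_items
    }


# Made-up documents, each reduced to its extracted keywords.
DOCS = [
    {"ThinkPad", "battery"},
    {"ThinkPad", "boot"},
    {"TrackPoint", "battery"},
    {"ThinkPad", "battery", "boot"},
]

table = cooccurrence_table(DOCS, ["ThinkPad", "TrackPoint"], ["battery", "boot"])
for cell, count in table.items():
    print(cell, count)
```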
5 Tools
5.1 Viewing a Category Tree
Click "Show category tree" to reset the size of the category tree display frame in the left half of the screen.
5.2 Bookmark
The bookmark function saves the current search conditions and parameters in the currently displayed view as a local file. A connection to the server is reestablished once the saved bookmark is opened, allowing the analysis to resume with the saved search conditions and parameters. The bookmark does not save analysis results of views such as graphs and values, and it cannot be used if the server cannot be accessed. To save snapshots of analysis results of views, use the report function (see 5.3 Report Function).
-
Saving the bookmark:
When the "bookmark" link is clicked, a dialog for saving the bookmark appears. Specify the local destination. A blank window appears separately. Close this window after saving the bookmark.
-
Managing the bookmark:
The bookmark can be managed as a local file. It is also possible for users accessing the same Text Miner server to exchange bookmark files through e-mail or to share the files.
-
Accessing the server from the bookmark:
When a locally saved bookmark is opened, the "JUMP" button appears. Click the button to connect to Text Miner.
Handling of invalid search conditions: Because the bookmark does not save analysis results, the displayed results will differ from those at the time of bookmark creation if the data has changed since the bookmark was created. Note that if a category used as a search condition in the bookmark is deleted, that search condition becomes invalid. A warning message in red appears for the invalid condition, and the search result becomes 0 hits ("all" if the NOT operator is used in the search). Example of an invalid search condition:
Information saved in the bookmark: A bookmark saves the information contained in the right frame: the search condition, view selection information, view parameters, and vertical and horizontal categories. The bookmark does not save the information in the category tree frame on the left: the category sort method and the categories that are currently expanded.
5.3 Report Function
The currently displayed analysis results can be saved as a local file by using the report function. Opening the report does not require access to the server, and the report can be viewed in Internet Explorer. If it is possible to connect to the server, the analysis can resume with the saved search conditions and parameters, in the same manner as with the bookmark.
Report edit:
Click the "Report" link to open the report edit screen. In the edit screen, you can choose whether the display description is shown, and you can edit comments (memos about the report). Up to 2,000 characters (including line breaks) can be entered in the comment field. When "Attach documents" check boxes are shown for displayed items, samples of the documents narrowed down by each checked item are attached. After editing the report, click the "Create" button at the bottom of the screen to save the report locally.
Report:
-
Open the saved report to see the content without connecting to the server.
-
Click the "here" link to reconnect to the server while maintaining the search conditions and parameters used in the report, and resume the analysis using that report.
Link to a created report/sample document:
When the "Attach documents" check box has been checked, the in-file link to the sample document will be displayed. Click the link to jump to the corresponding document, which is provided at the bottom of the file.
Created report/sample document:
By checking the content of the sample document, it is possible to understand the situation in which the listed items are actually used.
5.4 CSV Output Function
The currently displayed analysis results can be saved as a CSV file by using the CSV output function. Because CSV files can be opened in Excel, you can use the file to create a customized report.
-
Saving a CSV file: When the "CSV output" link is clicked, the CSV file save dialog appears. Specify a local destination. A blank window might appear separately. Close this window after saving the file.
5.5 Help
Click "Help" to see this online instruction manual.
6 Statistical Index
6.1 Characteristics of the Text Miner Statistical Index
Values displayed in pink in Text Miner are indices calculated by Text Miner, and they are distinguished from "raw" data, such as the number of documents, which is displayed in black. Raw data is easily misinterpreted. For example, suppose that the frequency of the keyword "receive ... mail" increased by a factor of 1.5 from January to February. To conclude that the frequency really "increased," it is necessary to compare the increase with the change in the total number of documents and to take statistical noise into consideration; in practice, however, these factors are often overlooked. So that the indices can be interpreted intuitively as an increase rate or a correlation strength, Text Miner displays fully corrected indices.
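The point about raw data can be illustrated with a small calculation. The following is only a sketch: the document counts are hypothetical values chosen for illustration, and the helper name `share` is not part of Text Miner. It shows a keyword whose raw frequency grows by a factor of 1.5 while its share of all documents does not change at all.

```python
def share(keyword_docs: int, total_docs: int) -> float:
    """Fraction of all documents that contain the keyword."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return keyword_docs / total_docs


# Hypothetical counts, chosen only for illustration.
jan_keyword, jan_total = 100, 1000
feb_keyword, feb_total = 150, 1500

# Growth of the raw keyword frequency from January to February.
raw_growth = feb_keyword / jan_keyword

# Growth of the keyword's share of all documents over the same period.
share_growth = share(feb_keyword, feb_total) / share(jan_keyword, jan_total)

print(f"raw growth:   {raw_growth:.2f}")
print(f"share growth: {share_growth:.2f}")
```

In this hypothetical case the total number of documents also grew by a factor of 1.5, so the keyword did not become more prominent. Statistical noise, which the indices also take into account, is not modeled in this sketch.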
-
Views that support heuristic analysis:
Text Miner supports discovery of topics and problems by not only showing the number of applicable documents in terms of keywords or subcategories but also by visualizing the "elements that are different from others" in keyword distribution in the documents. The specific methods are analysis of over-time changes of frequency (Time Series, Topic, and Delta views) and correlation analysis (Category and 2D Map views).
-
Noise removal:
Even though Text Miner analyzes a huge number of documents, it is not always the case that the number of documents that meet individual search conditions or the number of documents that contain certain keywords/subcategories is sufficient for statistical processing. Also, in the case of handling call center data or BBS logs, there is always a certain level of noise in the number of documents. Therefore, Text Miner always takes noise in reliability of values into account and shows corrected values as indices instead of simply showing the values such as "difference in the number of documents" and "ratio of the number of documents."
6.2 Correlation
-
Applicable view:
Correlation values are used in the Category view and the 2D Map view. The Category view shows the level of correlation between a search condition and the listed keywords or subcategories, and the 2D Map view shows the likelihood of co-occurrence of vertical and horizontal items.
-
Values before correction:
Correlation between two document collections A and B is defined as follows. The letter D represents the entire collection of documents, and the # symbol represents the number of documents in a collection. The left and right sides of the equation are equal to each other.
\[
\frac{\#(A \cap B)/\#A}{\#B/\#D} = \frac{\#(A \cap B)/\#D}{(\#A/\#D)\,(\#B/\#D)}
\]
For example, assuming that two collections of documents are
A = {documents that contain a keyword "PC"} and
B = {documents that contain a keyword "see ... manual"},
then the left side of the equation will be:
\[
\frac{\text{percentage of users who wish to get an instruction manual, when the documents containing the keyword "PC" are examined}}{\text{percentage of users who wish to get an instruction manual, when all the documents are examined}}
\]
This set can be illustrated as described below.
For example, when 5% of all the documents are about obtaining an instruction manual but this figure rises to 20% when only personal computer-related documents are examined, the correlation value between "PC" and "see ... manual" is 20% ÷ 5% = 4, meaning that the correlation is strong.

The right side of the equation is the ratio of the actual density of A∩B, which is #(A∩B)/#D, to the product of the density of A and the density of B, which is (#A/#D) (#B/#D). It represents the "deviation from independence of A and B."
The right side is more intuitive than the left side as a 2D Map index.
-
Values after correction:
Reliability of the correlation value becomes lower as the value of #(A∩B) (the number of documents that contain both of the keywords "PC" and "see ... manual," in this example) becomes smaller in the preceding formula. A high correlation value that has no reliability lowers the efficiency and accuracy of analysis, so Text Miner uses an interval estimation to make low-reliability values smaller. The interval estimation method obtains "the smallest A that can realize the current correlation value when the proper correlation value A is an unknown value provided that there are an infinite number of documents, excluding coincidences that occur below a certain probability." Refer to references on probability and statistics for details.
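As an illustration of the value before correction, the following sketch computes both sides of the defining equation from document counts. The function names and the counts in the example are hypothetical and are not part of Text Miner. The interval-estimation correction is not reproduced here, because this document does not give its formula.

```python
import math


def correlation_lhs(n_ab: int, n_a: int, n_b: int, n_d: int) -> float:
    """Left side: (#(A∩B)/#A) / (#B/#D)."""
    if min(n_a, n_b, n_d) <= 0:
        raise ValueError("#A, #B and #D must be positive")
    return (n_ab / n_a) / (n_b / n_d)


def correlation_rhs(n_ab: int, n_a: int, n_b: int, n_d: int) -> float:
    """Right side: (#(A∩B)/#D) / ((#A/#D) * (#B/#D))."""
    if min(n_a, n_b, n_d) <= 0:
        raise ValueError("#A, #B and #D must be positive")
    return (n_ab / n_d) / ((n_a / n_d) * (n_b / n_d))


# Hypothetical counts matching the percentages in the example above:
# 5% of all documents are in B, but 20% of the documents in A are in B.
n_d = 10_000   # all documents
n_a = 1_000    # documents containing "PC"
n_b = 500      # documents containing "see ... manual" (5% of all)
n_ab = 200     # documents containing both (20% of A)

lhs = correlation_lhs(n_ab, n_a, n_b, n_d)
rhs = correlation_rhs(n_ab, n_a, n_b, n_d)
print(lhs, rhs, math.isclose(lhs, rhs))
```

Both functions give a value of about 4 for these counts, which matches the worked example. With a very small #(A∩B), the same formula can still produce a large value from only a handful of documents, which is the situation the correction is meant to suppress.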
6.3 Topicality Index
-
Applicable view:
The topicality index is used to highlight data in Topic view.
-
Meaning of index and summary of calculations:
The topicality index measures how keywords and subcategories in each line in the Topic view deviate from the average frequency on each date. Data showing a higher frequency relative to other data for other dates having the same keywords and subcategories is highlighted, and normalization is carried out so that data will not be highlighted in response to over-time changes in all searched documents. As a result of the normalization, it is possible to avoid situations in which weekly analysis results lose accuracy only for the weeks having many holidays because the number of documents is small for these weeks.
The two Topic view screens below show data for the same category and the same date. Three weeks' worth of document data is removed from February 1998 data in the lower screen, but the tendency of highlighting is the same as the upper screen.
Topic view based on all documents:
Topic view based on all documents, but excluding three weeks' worth of documents from February 1998 data:
-
Definition of the index:
For each line, the frequency along the time line is normalized so that changes linked with the frequency along the time line for all searched documents (the time series values shown in the "Frequency" line in Topic view) will be ignored, and the index is obtained by dividing the difference between the cell frequency and the average normalized time series value by the variation scale. Its mathematical expression is as follows. The letter D represents the entire collection of documents, and the # symbol represents the number of documents in a collection. Here, M = {documents in a particular month}, and K = {documents containing the keyword shown on a particular line}.
\[
\frac{\#(M \cap K) - \#K \times (\#M \div \#D)}{\sqrt{\#K \times (\#M \div \#D)}}
\]
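The expression above can be sketched in a few lines of code. The function name and the example counts are hypothetical. Raising an error when the expected frequency is zero is an assumption of this sketch, since the document does not say how that case is handled.

```python
import math


def topicality_index(n_mk: int, n_k: int, n_m: int, n_d: int) -> float:
    """(#(M∩K) - #K * (#M / #D)) / sqrt(#K * (#M / #D))."""
    if n_d <= 0:
        raise ValueError("#D must be positive")
    # Frequency expected in the cell if the keyword followed the
    # time series of all searched documents.
    expected = n_k * (n_m / n_d)
    if expected <= 0:
        # Assumption of this sketch: treat a zero expectation as an error.
        raise ValueError("expected frequency must be positive")
    return (n_mk - expected) / math.sqrt(expected)


# Hypothetical counts: 1,200 documents in total, 100 of them in the month,
# 48 containing the keyword, and 10 containing the keyword in that month.
index = topicality_index(n_mk=10, n_k=48, n_m=100, n_d=1200)
print(round(index, 3))
```

In this example the expected cell frequency is 48 × (100 ÷ 1200) = 4, so a cell frequency of 10 lies about three variation scales above the expectation.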
6.4 Increase Indicator
-
Applicable view:
The increase indicator is used to highlight data in the Delta view. The same increase indicator is used in the alerting function.
-
Meaning of index and summary of calculations:
As the name suggests, the increase indicator is an index to measure the increase in frequency of use of keywords and subcategories along the time line, and it shows "how much the frequency obtained on the current date varies from the constant state, assuming that the past time series frequency was constant." Constant noises in the frequency time line are estimated by using the Poisson distribution, and based on the obtained scale, the size of variation is calculated in terms of a scaling factor. Normalization is also carried out so that data will not vary with over-time frequency changes in all searched documents.
-
Definition of the index:
Assuming that the frequency of all searched documents along the time line is
Global time series:
\[ g_1, g_2, \ldots, g_n \quad (n = 1, 2, \ldots, N) \]
and that the frequency of items in each line in the Delta view along the time line is
Keyword time series:
\[ k_1, k_2, \ldots, k_n \quad (n = 1, 2, \ldots, N) \]
the time series of the accumulated frequency is defined as follows.
Here, the letter D (for decaying factor) is a parameter used for the weighted average values; it is not the document collection D used in the preceding sections.
As the D value increases, the time series frequency of the distant past carries more weight, and as the D value decreases, the time series frequency of the distant past is increasingly ignored.
In Text Miner, D = 0.85. Because 0.85^4 ≈ 0.52, the time series frequency of the (n-4)-th date contributes to the calculation of an average value with about half the weight of the frequency of the n-th date.
Weighted accumulated global time series:
\[ G_1 = g_1, \qquad G_n = D \times G_{n-1} + g_n \quad (n = 2, 3, \ldots, N) \]
Weighted accumulated keyword time series:
\[ K_1 = k_1, \qquad K_n = D \times K_{n-1} + k_n \quad (n = 2, 3, \ldots, N) \]
Frequency of keywords and subcategories on the n-th date can be estimated as follows while the variation within all searched documents is taken into consideration:
Estimated average keyword value:
\[ A_n = g_n \times \frac{K_{n-1}}{G_{n-1}} \quad (n = 2, 3, \ldots, N) \]
By using the preceding data, the increase index \(X_n\) on the n-th date is defined as
\[
X_n = 0 \quad (n = 1, 2, 3, 4), \qquad
X_n = \frac{k_n - A_n}{\sqrt{A_n}} \quad (n \ge 5).
\]
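The definition above can be sketched as follows. This is an illustration, not the Text Miner implementation: the function name, the parameter names, and the example series are hypothetical. Returning 0 when the estimated average is zero is an assumption of this sketch, because the definition does not cover that case.

```python
import math
from typing import List, Sequence


def increase_indicator(g: Sequence[float], k: Sequence[float],
                       decay: float = 0.85) -> List[float]:
    """Return X_1..X_N for global series g and keyword series k."""
    if len(g) != len(k):
        raise ValueError("g and k must have the same length")
    x: List[float] = []
    g_acc = 0.0  # G_{n-1}, weighted accumulated global time series
    k_acc = 0.0  # K_{n-1}, weighted accumulated keyword time series
    for n, (g_n, k_n) in enumerate(zip(g, k), start=1):
        if n >= 5 and g_acc > 0:
            # Estimated average keyword value A_n = g_n * (K_{n-1} / G_{n-1}).
            a_n = g_n * (k_acc / g_acc)
            # Assumption: X_n = 0 when A_n is zero (undefined in the source).
            x.append((k_n - a_n) / math.sqrt(a_n) if a_n > 0 else 0.0)
        else:
            x.append(0.0)  # X_n = 0 for n = 1..4
        # G_n = D * G_{n-1} + g_n; starting from 0 gives G_1 = g_1.
        g_acc = decay * g_acc + g_n
        k_acc = decay * k_acc + k_n
    return x


# Hypothetical series: the keyword holds a constant 10% share of the
# documents for five dates and then doubles on the sixth date.
g_series = [100, 100, 100, 100, 100, 100]
k_series = [10, 10, 10, 10, 10, 20]
x_values = increase_indicator(g_series, k_series)
print([round(v, 3) for v in x_values])
```

For the first five dates the indicator stays at or near 0, because the keyword frequency matches its estimated average. On the sixth date the estimated average is about 10 while the observed frequency is 20, so the indicator rises to roughly 10 ÷ √10.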
Terms of Use
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not grant you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:
IBM Corporation
Silicon Valley Lab
Building 090/H-410
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.
The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products
and cannot confirm the accuracy of performance, compatibility or any other claims related to
non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
Copyright License
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks.
See for information about IBM trademarks.
The following terms are trademarks or registered trademarks of other companies:
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc.
in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product or service names might be trademarks or service marks of others.