IBM Content Analyzer Text Miner Guide

Edition Notice
This edition applies to version 8, release 4 of IBM® Content Analyzer and to all subsequent releases and modifications until otherwise indicated in new editions.

This document contains proprietary information of IBM. This proprietary information is provided in accordance with the license conditions and is protected by copyright. Information contained in this document provides no warranties whatsoever for any products. Also, no descriptions provided in this document should be interpreted as product warranties. Depending on the system environment, the yen symbol may be displayed as the backslash symbol, or the backslash symbol may be displayed as the yen symbol.

© Copyright International Business Machines Corporation 2007, 2008. All rights reserved.

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

1 Introduction
This document describes how to use the IBM Content Analyzer Text Miner application.
1.1 Screen Layout and Functional Overview
The Text Miner interface consists of four areas: Category Tree, Tools, Search, and View.

Screen layout:

1.2 Troubleshooting Browser Problems
This section describes known problems related to the Internet Explorer settings and their solutions.
1.3 Page Transition
Text Miner can analyze data while it is connected to the server, use the bookmark and report functions to save analysis information in a local file, and reconnect to the server from that file. The following graphic shows the flow of page transitions.

Page transition:



Most of the system operation is done in the analysis screen, and various types of views can be displayed in this screen. As seen in the following graphic, you can switch views by clicking corresponding tabs.

View transition:
1.4 Select Database
Text Miner supports analysis of multiple databases. You select databases on the Select Database screen, which is the first page in page transition. Databases are usually created for each data format or data content type such as customer calls, internal e-mails, and repair information.

Because the Select Database screen is the top page of Text Miner, use this page if you want to add the Text Miner site to My Favorites in Internet Explorer.

Select database screen:
2 Category Tree
2.1 Category Tree Display
You can change the category tree display size (frame width) by dragging your cursor. You can hide the category tree by dragging the line between the left frame and the main frame.
2.2 Displaying and Selecting Categories
3 Search
3.1 Overview of the Search Function
In Text Miner, search conditions can be used to generate a collection of documents for analysis. In addition to using a single search condition, such as keyword search and category search, you can also create a complex search condition by combining three types of operators: AND, OR, and NOT.
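
As a rough illustration of how combined conditions narrow or widen a collection of documents, the following Python sketch models each single search condition as a set of document IDs. The document IDs, the condition names, and the helper functions are hypothetical and are not part of Text Miner.

```python
# Hypothetical document IDs; none of these names come from Text Miner.
ALL_DOCS = {1, 2, 3, 4, 5, 6}

keyword_pc = {1, 2, 3}          # documents matching a keyword search
category_complaint = {2, 3, 5}  # documents matching a category search
date_january = {3, 4, 5}        # documents matching a date search


def and_(*conditions):
    """Documents that meet every condition."""
    return set.intersection(*conditions)


def or_(*conditions):
    """Documents that meet at least one condition."""
    return set.union(*conditions)


def not_(condition):
    """Documents that do not meet the condition."""
    return ALL_DOCS - condition


narrowed = and_(keyword_pc, category_complaint, not_(date_january))
widened = or_(and_(keyword_pc, category_complaint), date_january)

print(sorted(narrowed))  # [2]
print(sorted(widened))   # [2, 3, 4, 5]
```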
3.2 Keyword Search
In keyword search, documents are searched by keywords that are registered in the IBM Content Analyzer system dictionary and the user dictionary. When synonyms are registered as keywords, documents that contain synonyms of the keywords are also retrieved. For example, if the word "PC" is set as a synonym of the keyword "personal computer," documents that contain the word "PC" also meet the search condition. Use the Dictionary Editor to register synonyms.

Keywords that are used in keyword search are not simple character strings. They are the keywords extracted from the target documents by language processing. For this reason, although documents that contain synonyms can be retrieved, a document that merely contains the search text as part of a longer character string might not be retrieved. For example, documents that contain the phrase "call center" will not be retrieved if you run a keyword search with the term "center." This is because the phrase "call center" is recognized as one word by the language processing, and the character string "center" that is contained in that phrase is not recognized as a word.
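
The difference between matching extracted keywords and matching character strings can be sketched in Python as follows. This is only a minimal model under stated assumptions: the synonym table, the per-document keyword lists, and the function names are hypothetical, and the actual language processing in IBM Content Analyzer is not shown here.

```python
# Hypothetical user dictionary: synonym -> registered keyword.
synonyms = {"pc": "personal computer"}

# Hypothetical result of language processing: keywords extracted per document.
doc_keywords = {
    "doc1": ["personal computer", "crash"],
    "doc2": ["pc", "call center"],
    "doc3": ["center", "opening hours"],
}


def canonical(term):
    """Map a term to its registered keyword, if it is a synonym."""
    term = term.lower()
    return synonyms.get(term, term)


def keyword_search(term):
    """Return documents whose extracted keywords equal the search keyword."""
    target = canonical(term)
    return sorted(
        doc
        for doc, keywords in doc_keywords.items()
        if any(canonical(keyword) == target for keyword in keywords)
    )


print(keyword_search("personal computer"))  # ['doc1', 'doc2']
print(keyword_search("center"))             # ['doc3']
```

In this model, doc2 is not returned for "center" because "call center" was extracted as a single keyword, which mirrors the behavior described above.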

In keyword search, the category to which the keyword belongs affects the search result. Therefore, even for the same keyword, the search results might be different depending on which Category view or which text field the search is performed from. Use the Dictionary Editor to create an association between a category and a keyword. For example, assume there is a dictionary in which "TP" is associated with "ThinkPad" in the PC product category and with "TrackPoint" in the peripheral device category. In this case, when a keyword search is processed with "TP" in the PC product category, documents that contain "TP" or "ThinkPad" are retrieved, and when a keyword search is processed with the same keyword in the peripheral device category, documents that contain "TP" or "TrackPoint" are retrieved.
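
The category dependence in the "TP" example can be sketched as a lookup that is scoped by category. The dictionary layout, the document contents, and the search function below are assumptions made for illustration only. They do not reflect the actual dictionary format used by the Dictionary Editor.

```python
# Hypothetical dictionary: category -> keyword -> synonyms of that keyword.
dictionary = {
    "PC product": {"ThinkPad": {"TP"}},
    "peripheral device": {"TrackPoint": {"TP"}},
}

# Hypothetical documents and the terms that appear in them.
docs = {
    "doc1": {"ThinkPad"},
    "doc2": {"TrackPoint"},
    "doc3": {"TP"},
}


def search(category, term):
    """Return documents that contain the term or its category-specific equivalents."""
    accepted = {term}
    for keyword, keyword_synonyms in dictionary[category].items():
        if term == keyword or term in keyword_synonyms:
            accepted |= {keyword} | keyword_synonyms
    return sorted(doc for doc, terms in docs.items() if terms & accepted)


print(search("PC product", "TP"))         # ['doc1', 'doc3']
print(search("peripheral device", "TP"))  # ['doc2', 'doc3']
```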
3.3 Category Search
In category search, documents that contain any keyword belonging to the specified category are retrieved.

3.4 Date Search
In date search, documents created on a specified date or within a specified period, such as a month or a week, are retrieved.
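
One way to picture a date search is as a range check, where a single date, a week, and a month are all treated as periods with a start and an end. The following Python sketch uses hypothetical documents and helper names, and it assumes that a week runs from Monday to Sunday. Text Miner might define periods differently.

```python
from datetime import date, timedelta

# Hypothetical documents and their creation dates.
docs = {
    "doc1": date(2008, 1, 15),
    "doc2": date(2008, 2, 3),
    "doc3": date(2008, 2, 28),
}


def month_range(year, month):
    """First and last day of the given month."""
    start = date(year, month, 1)
    next_month = date(year + month // 12, month % 12 + 1, 1)
    return start, next_month - timedelta(days=1)


def week_range(day):
    """Monday and Sunday of the week that contains `day` (assumed convention)."""
    start = day - timedelta(days=day.weekday())
    return start, start + timedelta(days=6)


def date_search(start, end):
    """Return documents created between start and end, inclusive."""
    return sorted(doc for doc, created in docs.items() if start <= created <= end)


print(date_search(date(2008, 1, 15), date(2008, 1, 15)))  # ['doc1']
print(date_search(*month_range(2008, 2)))                 # ['doc2', 'doc3']
```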

3.5 Search Operators
3.6 Using Operators with Category Node Selection

You can select a node of the current search condition when you add a new condition node. In the following cases, nodes A and B are nodes of the current search condition, and node X is the node that is being added to the current search condition.

Case   Selected node       Operator  Result
Case1  Parent node is AND  AND       Adds the new node as a leaf node of the same parent node.
Case2  Parent node is AND  OR        Adds an OR node of the selected node and the new node.
Case3  AND                 OR        Adds an OR node of the selected AND node and the new node.
Case4  AND                 AND       Adds the new node to the selected AND node.
Case5  Parent node is OR   AND       Adds an AND node of the selected node and the new node.
Case6  Parent node is OR   OR        Adds the new node to the parent OR node.
Case7  OR                  AND       Adds an AND node of the selected OR node and the new node.
Case8  OR                  OR        Adds the new node to the selected OR node.

Case1: Adds a new node as a leaf node of same parent node.

Case2: Adds an OR node of the selected node and adds a new node.

Case3: Adds an OR node of the selected AND node and the new node.

Case4: Adds a new node to the selected AND node.

Case5: Adds an AND node of the selected node and adds new node.

Case6: Adds a new node to the parent OR node.

Case7: Adds an AND node of selected OR node and adds new node.

Case8: Adds a new node to the selected OR node.
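
The eight cases can be read as three rules: add to the selected operator node when it matches the chosen operator, add to the parent when a selected leaf already sits under the chosen operator, and otherwise wrap the selected node in a new operator node. The following Python sketch implements that reading. It is an interpretation of the table, not the product's implementation, and the class and function names are hypothetical. A selected operator node whose parent already uses the chosen operator is not covered by the table, and this sketch simply wraps it.

```python
class Node:
    """A search condition node: an operator ("AND" or "OR") or a leaf condition."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = []
        for child in children:
            self.add(child)

    def add(self, child):
        child.parent = self
        self.children.append(child)

    def is_operator(self):
        return self.label in ("AND", "OR")

    def show(self):
        if not self.children:
            return self.label
        inner = ", ".join(child.show() for child in self.children)
        return f"{self.label}({inner})"


def add_condition(root, selected, operator, new):
    """Add leaf `new` relative to `selected` with `operator`; return the root."""
    parent = selected.parent
    if selected.is_operator() and selected.label == operator:
        selected.add(new)  # Case4 and Case8
        return root
    if not selected.is_operator() and parent is not None and parent.label == operator:
        parent.add(new)  # Case1 and Case6
        return root
    wrapper = Node(operator)  # Case2, Case3, Case5, and Case7
    if parent is None:
        root = wrapper
    else:
        position = next(i for i, c in enumerate(parent.children) if c is selected)
        parent.children[position] = wrapper
        wrapper.parent = parent
    wrapper.add(selected)
    wrapper.add(new)
    return root


def example(top_operator):
    """Build top_operator(A, B) and return the root together with node A."""
    a, b = Node("A"), Node("B")
    return Node(top_operator, [a, b]), a


root, a = example("AND")
print(add_condition(root, a, "AND", Node("X")).show())    # AND(A, B, X)
root, a = example("AND")
print(add_condition(root, a, "OR", Node("X")).show())     # AND(OR(A, X), B)
root, a = example("AND")
print(add_condition(root, root, "OR", Node("X")).show())  # OR(AND(A, B), X)
```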

4 View
4.1 Common View Specifications
4.2 Top View
This is the view displayed as the initial view after selecting a database. The mining function is not available in this view.

4.3 Docs View
In the Documents view, documents that meet the current search conditions can be viewed.



The Document Inspector screen shows the text, standard information, and keywords extracted from the text for the document selected in Documents view. By using this function, you can understand the types of keywords that are internally extracted by the language processing.



4.4 Category View
For the keywords and subcategories that belong to the category specified as the vertical category in the category tree, the Category view shows the frequency of appearance within retrieved documents and the correlation with search conditions (see 6.2 Correlation).


4.5 Time Series View
The Time Series view shows how often documents meet the current search conditions over a period of time.


4.6 Topic View
In the Topic view, changes over time are analyzed for each keyword or subcategory belonging to the category specified as the vertical category in the category tree, and parts with relatively high frequency will be highlighted.


4.7 Delta View
In the Delta view, changes over time for each keyword or subcategory belonging to the category specified as the vertical category are shown in a time series graph, and parts where future increases are predicted are highlighted. This view can function as a simpler version of the alert function because it uses the same indicator, which detects increases over time.


4.8 2D Map View
The 2D Map view uses a two-dimensional table to show the correlation between keywords or subcategories that belong to the vertical category specified in the category tree and keywords or subcategories that belong to the horizontal category. Cells showing a high correlation between vertical and horizontal items are highlighted.


5 Tools
5.1 Bookmark
The bookmark function saves the current search conditions and parameters in the currently displayed view as a local file. A connection to the server is reestablished after the saved bookmark is opened, allowing the analysis to resume with the saved search conditions and parameters.

The bookmark does not save analysis results of views such as graphs and values, and it cannot be used if the server cannot be accessed. To save snapshots of analysis results of views, use the report function (see 5.2 Report Function).

Handling of invalid search conditions:

Because the bookmark does not save analysis results, the results that are displayed will differ from those at the time of bookmark creation if the data has changed since the bookmark was created. Note that if a category that is used as a search condition in the bookmark is deleted, that search condition becomes invalid. A warning message is displayed for the invalid condition, and the search returns 0 results ("all" if the NOT operator is used in the search).
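
The "0 results, or all results with NOT" behavior follows if an invalid condition is treated as matching no documents. The following Python sketch shows that reasoning with a hypothetical category index and hypothetical function names. It does not describe how Text Miner evaluates conditions internally.

```python
# Hypothetical category index: category name -> IDs of matching documents.
ALL_DOCS = {1, 2, 3, 4}
category_index = {"PC product": {1, 2}}


def category_search(name):
    """Return matching documents; a deleted category matches nothing."""
    if name not in category_index:
        print(f"Warning: search condition '{name}' is invalid")
        return set()
    return category_index[name]


def not_(condition):
    """Documents that do not meet the condition."""
    return ALL_DOCS - condition


deleted = category_search("peripheral device")  # category no longer exists
print(len(deleted))        # 0
print(len(not_(deleted)))  # 4, that is, all documents
```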

Example of an invalid search condition:


Information saved in the bookmark:

A bookmark saves information about the search condition, view selection information, view parameters, and vertical and horizontal categories. However, the bookmark does not save the state of the category tree frame on the left, namely the category sort method and the categories that are currently expanded.
5.2 Report Function
The currently displayed analysis results can be saved as a local file by using the report function. Access to the server is not required for opening the report, and it can be viewed in Internet Explorer. If you can connect to the server, the analysis can resume with the saved search condition and parameters, as with a bookmark.

Report edit:
Click the Report link to open the report edit screen. In the edit screen, you can decide whether the display description should be shown. You can also edit comments (memos about the report). You can enter up to 2,000 characters (including line breaks) in the comment field. Also, when there are "Attach documents" check boxes for displayed items, samples of the documents narrowed down by each checked item will be attached. After editing the report, click Create to save the report.



Report:


Link to a created report or sample document:
When you select the Attach documents check box, the in-file link to the sample document is displayed. Click the link to jump to the corresponding document, which is provided at the bottom of the file.



Created report or sample document:
By checking the content of the sample document, you can understand the situation in which the listed items are actually used.

5.3 CSV Output Function
You can save the currently displayed analysis results as a CSV file by using the CSV output function. Because CSV files can be opened in Excel, you can use the file to create a customized report.
5.4 Save Function

Click Save to save the current search condition to your system. You can set the file name, folder name, and comment in the save confirmation window.

5.5 List Function

Click List to show the search conditions saved in server storage.
A search condition is loaded by selecting a file name in the list.
To delete an item, select the box of the folder name or file name and click Delete.

5.6 XML Download Function

Click XML Download to download the current search condition as an XML file.

5.7 XML Upload Function

Click XML Upload to upload a search condition XML file that was saved with the XML Download function.

5.8 Analyze a Document That Has Multiple Text Entries

You can analyze a document that has two or more text entries. Select the text entry name from the Options menu.

6 Statistical Index
6.1 Characteristics of the Text Miner Statistical Index
Values displayed in pink in Text Miner are the indexes that are calculated by Text Miner, and they are differentiated from raw data such as the number of documents displayed in black.

Raw data is easy to misinterpret. For example, suppose that the frequency of the keyword "receive ... mail" increased by a factor of 1.5 from January to February. Before you conclude that the frequency really increased, you must examine the increase rate relative to the change in the total number of documents and take statistical noise into consideration.

So that the indexes can be read intuitively, for example as an increase rate or a correlation strength, Text Miner displays fully corrected indexes.
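
The need to relate a raw increase to the total number of documents can be shown with simple arithmetic. The counts in the following Python sketch are hypothetical, and the calculation is only a plain ratio of shares. It is not the formula that Text Miner uses for its corrected indexes, and it does not model statistical noise.

```python
# Hypothetical monthly counts: (documents containing the keyword, all documents).
january = (100, 1000)
february = (150, 1500)


def share(keyword_docs, total_docs):
    """Fraction of all documents that contain the keyword."""
    return keyword_docs / total_docs


raw_ratio = february[0] / january[0]
relative_ratio = share(*february) / share(*january)

print(raw_ratio)       # 1.5
print(relative_ratio)  # 1.0
```

In this example the raw count grows by a factor of 1.5, but the total number of documents grows by the same factor, so the keyword's share of documents is unchanged. Very small counts, such as an increase from 2 to 3 documents, also give a factor of 1.5 but are far less reliable, which is why noise must be considered as well.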

6.2 Correlation
6.3 Topicality Index
6.4 Increase Indicator
Terms of Use
Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A. 
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan 
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
Silicon Valley Lab
Building 090/H-410
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

Copyright License
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks.

See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks.

The following terms are trademarks or registered trademarks of other companies:

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product or service names might be trademarks or service marks of others.