IBM OmniFind Analytics Edition Overview
Edition Notice
First Edition (February 2007)

This edition applies to version 8, release 4 of IBM® OmniFind™ Analytics Edition and to all subsequent releases and modifications until otherwise indicated in new editions.

This document contains proprietary information of IBM. This proprietary information is provided in accordance with the license conditions and is protected by copyright. Information contained in this document provides no warranties whatsoever for any products. Also, no descriptions provided in this document should be interpreted as product warranties. Depending on the system environment, the yen symbol may be displayed as the backslash symbol, or the backslash symbol may be displayed as the yen symbol.

© Copyright International Business Machines Corporation 2007. All rights reserved.

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

1 Introduction

This document describes the functions of the text mining system, IBM OmniFind Analytics Edition, and some basic concepts that are necessary for understanding these functions. The information in this document is a prerequisite for understanding the other documents and instruction manuals for IBM OmniFind Analytics Edition.

1.1 Target Audience

This document is written for application users, system administrators, and operational designers of IBM OmniFind Analytics Edition.

1.2 Relevant Documents
See the List of Instruction Manuals for other instruction manuals for IBM OmniFind Analytics Edition.
2 Overview of IBM OmniFind Analytics Edition

This section describes the functions of IBM OmniFind Analytics Edition and the system configuration. This section also describes how target data is processed and analyzed.

2.1 What You Can Do With IBM OmniFind Analytics Edition

IBM OmniFind Analytics Edition analyzes text in documents and what customers say during interactions with a call center. The IBM OmniFind Analytics Edition language processing program analyzes the text and extracts relevant information from it.

The following example shows information about an inquiry to a call center. IBM OmniFind Analytics Edition runs the language processing for each document. In this example, one inquiry produces one document.

Item | Description
Date of inquiry | July 17, 2007
Customer ID | 123456
Name | Taro Yamada
Age | 28
Gender | Male
Product group | Notebook PC
Product model | ABC-001
Agent | Ichiro Suzuki
Days required for solving the issue | 2 days
Inquiry | I would like to know how to uninstall the software I bought. Where can I find the information in the instruction manual?

In the record above, items such as "Date of inquiry" and "Name" are attached to the data to be analyzed. In IBM OmniFind Analytics Edition, these items are called standard items. The IBM OmniFind Analytics Edition applications analyze the standard items together with the results of the language processing.

The inquiry itself, in contrast, is free-form text. This is what the IBM OmniFind Analytics Edition language processing program analyzes. The following example is a partial result of the analysis of the inquiry by the IBM OmniFind Analytics Edition language processing program.

Category | Description
Noun -> Verb | software ... uninstall
Verb | purchase
Noun | software
Noun | manual
(and so on) | (and so on)
* Depending on the IBM OmniFind Analytics Edition settings, you might see different results for the same text.

One of the most significant characteristics of IBM OmniFind Analytics Edition is that by customizing language resources such as dictionaries, it can extract not only words such as software and purchase but also dependency expressions such as noun -> verb and expressions of intention such as want and question.

"Category" means the type of data analysis. By default, the analysis categories that IBM OmniFind Analytics Edition offers are parts of speech, such as nouns and verbs, and dependency, which is a link between a noun and a verb. In addition to these categories and the standard items, such as age and gender, you can add a technical term category that is suitable for the target of analysis. For more information, see 3.2 Category and Category Tree.

Extracting expressions of intention requires optional dictionaries that are suitable for the target data field.

How, then, can new knowledge be obtained from the extracted information? If there are, for example, 100,000 documents, reading all of them takes a long time.
IBM OmniFind Analytics Edition offers Text Miner, an online tool that efficiently analyzes the large volume of results produced by the language processing. Text Miner statistically processes the extracted information and interactively answers questions such as the following. "Interactively" means that the processing runs with a response time close to that of a Web search.

  • What kinds of expressions were used, and how often were they used?
  • What are examples of expressions that are highly correlated with a particular expression?
  • How have the requests for a particular product changed over time?

The following figure shows types of dependencies that are found in the documents that mentioned three particular products (ABC-001, ABC-002, and XYZ-999). These product names were retrieved from the call logs from the previous example. (Among all the retrieved dependency patterns, the five most commonly used patterns are selected.)

This result provides the following facts.

  • The frequency value suggests that "I can't do X on the Internet" is the most frequently occurring expression in the retrieved documents, that is, the documents that mention the three products specified as the search conditions.
  • The correlation value suggests that the request to "add more memory" is particularly characteristic of documents that mention these products.

Based on these results, you might, for example, make sure that the instruction manuals for these products explain clearly how to add memory.
Note, however, that a single analysis does not always yield useful knowledge. With Text Miner, you can select (search for) the documents that contain an extracted expression and analyze them further. By repeating analysis and retrieval, IBM OmniFind Analytics Edition supports the discovery of new knowledge. Another characteristic of IBM OmniFind Analytics Edition is that you can change the analysis criteria in real time while you analyze data. See the Text Miner Instruction Manual for more information.
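
The difference between the frequency and correlation values discussed above can be illustrated with a minimal Python sketch. This document does not define the formula that Text Miner uses, so the ratio below (share of the selected documents that contain a keyword, divided by the share of all documents that contain it) is only an assumed stand-in, and the function name and sample data are hypothetical.

```python
from collections import Counter

def keyword_stats(all_docs, selected_docs):
    """Return {keyword: (frequency, correlation)} for the selected documents.

    Each document is a set of keyword strings, and selected_docs must be a
    subset of all_docs.
    frequency   = number of selected documents that contain the keyword
    correlation = share of selected documents that contain the keyword,
                  divided by the share of all documents that contain it
                  (an assumed measure, not the product's actual formula)
    """
    total_counts = Counter(k for doc in all_docs for k in doc)
    selected_counts = Counter(k for doc in selected_docs for k in doc)
    stats = {}
    for keyword, freq in selected_counts.items():
        share_selected = freq / len(selected_docs)
        share_all = total_counts[keyword] / len(all_docs)
        stats[keyword] = (freq, share_selected / share_all)
    return stats

# Hypothetical data: four documents, the first two mention the products.
all_docs = [
    {"internet ... cannot connect", "memory ... add"},
    {"internet ... cannot connect"},
    {"internet ... cannot connect"},
    {"internet ... cannot connect", "screen ... dark"},
]
selected_docs = all_docs[:2]
stats = keyword_stats(all_docs, selected_docs)
```

In this sample, the Internet expression has the highest frequency but a ratio of 1.0 because it appears everywhere, whereas the memory expression is less frequent but has a higher ratio because it is concentrated in the selected documents.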

2.2 Application Functions Offered by IBM OmniFind Analytics Edition

IBM OmniFind Analytics Edition offers multiple applications depending on the purpose or use. This section provides an overview of these applications. For the details of individual applications, see the respective instruction manuals (List of Instruction Manuals).

Application | Functional overview | Form
Text Miner | Real-time statistical analysis of the results of the language processing | Web-based application
Dictionary Editor | Editing of categories or dictionaries to be used in language processing | Web-based application
Alerting System | Statistical analysis of the results of the language processing, focusing on finding problems | Web-based application (checking of the settings and results); batch program (detection)
DOCAT (* Only available in Japanese) | Establishing rules that are useful for document categorization based on the documents found by Text Miner | Web-based application (setting); batch program (categorization)

2.3 IBM OmniFind Analytics Edition System Configuration

The following figure shows how the applications mentioned in the previous section relate to each other. Arrows show the flow of data.

Data to be analyzed, including text, must be prepared in CSV (comma separated values) data format. Many standard spreadsheet applications and relational databases support exporting files to a CSV format.
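
As a rough illustration of preparing such data, the following Python sketch writes one document per CSV line with the standard csv module, which quotes free-form text that contains commas or quotation marks. The column selection, the date format, the header line, and the sample inquiry text are assumptions made for this sketch; the exact layout that the conversion tool expects is described in the Operation Guide.

```python
import csv
import io

# Column names borrowed from the sample record in Section 2.1.
FIELDS = ["Date of inquiry", "Product model", "Inquiry"]

# Hypothetical document; the inquiry text contains a comma and quotes.
documents = [
    {
        "Date of inquiry": "2007-07-17",
        "Product model": "ABC-001",
        "Inquiry": 'The manual says "see the CD", but where is the uninstall procedure?',
    },
]

def to_csv(rows):
    """Serialize records to CSV text: a header line, then one line per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(documents)

# Reading the text back yields the original field values unchanged.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```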

CSV data is first converted to the internal data format for IBM OmniFind Analytics Edition, and then analyzed by the natural language processing ("NLP" in the figure). Results of the language processing are stored in the index structure for analysis.

Text Miner and Alerting System use this index to analyze the results of the language processing.

The DOCAT GUI receives sample documents from Text Miner and helps users build, from these documents, a set of rules for document categorization. These rules are called triggers. When the language processing is run again, the DOCAT program used in the processing categorizes documents based on the triggers. The resulting document categorization can be viewed in Text Miner.

Dictionary Editor supports creating and maintaining categories and dictionaries used in the language processing.

2.4 IBM OmniFind Analytics Edition System Operational Design

This section provides an example of the standard system operational design for the IBM OmniFind Analytics Edition system environment. See the "Operation Guide" for more information.

To design an operation, follow these steps:

  1. Create a database for storing data to be analyzed and analysis results. Register that database in the global setting. The global setting is the list of databases referred to by the applications.
  2. Examine the data to be analyzed and use Dictionary Editor to specify the categories required for analysis.
  3. If necessary, select the method to convert data to be analyzed into a CSV file. Note that each line in the CSV file corresponds to one document in IBM OmniFind Analytics Edition.
  4. Specify the settings so that the CSV file is converted into the internal data format (called the ATML format) for IBM OmniFind Analytics Edition.
    More specifically, in the conversion tool settings, specify which category or text each column in the CSV file corresponds to. See the Operation Guide for more information.
  5. Review a small subset of data to ensure that the conversion of a CSV file into the ATML format, language processing, and indexing (index creation processing) are done properly.
  6. After indexing, start WebSphere Application Server to check that Text Miner recognizes the processing results.
  7. Check the operation of individual applications.
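
The idea behind steps 3 and 4, that each CSV line is one document whose fields map either to a standard item category or to the text to be analyzed, can be sketched in Python as follows. The mapping names, column names, and function are hypothetical, and the sketch does not reproduce the ATML format or the actual conversion tool settings, which are described in the Operation Guide.

```python
import csv
import io

# Hypothetical mapping: CSV column name -> standard item category.
STANDARD_ITEM_COLUMNS = {"date": "Date of inquiry", "model": "Product model"}
# Hypothetical list of columns whose content goes to language processing.
TEXT_COLUMNS = ["inquiry"]

def split_document(row):
    """Split one CSV row (one document) into standard items and free text."""
    standard_items = {cat: row[col] for col, cat in STANDARD_ITEM_COLUMNS.items()}
    text = " ".join(row[col] for col in TEXT_COLUMNS)
    return standard_items, text

sample = "date,model,inquiry\n2007-07-17,ABC-001,How do I uninstall the software?\n"
parsed = [split_document(row) for row in csv.DictReader(io.StringIO(sample))]
```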

When you add new data, follow these steps:

  1. Stop WebSphere Application Server.
  2. Convert the data to be analyzed into a CSV file.
  3. Convert the CSV file into the ATML format, run the language processing, and create the index (index creation processing).
  4. After indexing, start WebSphere Application Server to check that Text Miner recognizes the processing results.
  5. Check the operation of individual applications.

3 Basic Concepts of IBM OmniFind Analytics Edition
This section describes the basic concepts of IBM OmniFind Analytics Edition.
3.1 Database

IBM OmniFind Analytics Edition manages the results of language processing for each type of data to be analyzed. The unit of the management is called a database. When analyzing particular data by using IBM OmniFind Analytics Edition, you must create a database for that data.

A database contains the results of the language processing, the resources necessary for analyzing those results, and the index structure required for real-time analysis. Application users see the database simply as "data to be analyzed" and do not need to be aware of its physical structure.
Operational designers and administrators must understand the physical structure of the database. See the Operation Guide for more information.

3.2 Category and Category Tree

A category is a label given to a keyword. It represents a perspective from which data is analyzed in IBM OmniFind Analytics Edition. IBM OmniFind Analytics Edition uses multiple categories.

Categories can be divided into the standard item category, system category, and user defined category. Their characteristics can be summarized as follows:

Category | Description | Example
Standard item category | A field value that is attached to raw data to be analyzed. It is defined independently of the language processing. | Date of inquiry, age, product ID, and so on
System category | A category provided by IBM OmniFind Analytics Edition by default. This category applies to the results of the language processing. | Part-of-speech system (nouns, verbs, and so on) and dependency
User defined category | A category defined by users based on data characteristics. This category applies to the results of the language processing. | List of particular customers (found in the text)

Categories can have parent-child relationships with other categories.

A category tree is a collection of all categories defined in IBM OmniFind Analytics Edition. It is called a tree because it has a tree structure as in the following example. When a particular category is considered to semantically contain a different category, these two categories can be registered in the category tree as categories having a parent-child relationship.
In a category tree, there is a virtual root category that functions as the parent category for all categories. Usually, users are not aware of this category.
The following example shows a part of a category tree. Note that the tree varies with the language used and with customization.
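
Separately from that example, the structure described in this section (parent-child relationships under a virtual root) can be sketched as a small Python data structure. The class, its methods, and the category names are illustrative assumptions, not the product's internal representation or its actual default tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

@dataclass
class Category:
    name: str
    children: List["Category"] = field(default_factory=list)

    def add(self, name: str) -> "Category":
        """Register a child category and return it."""
        child = Category(name)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, name) for this category and all its descendants."""
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)

# The virtual root is the parent of all categories and is normally hidden.
root = Category("(root)")
pos = root.add("Part of speech")
pos.add("Noun")
pos.add("Verb")
root.add("Dependency").add("Noun -> Verb")
root.add("Standard item").add("Product model")  # hypothetical grouping

# Listing without the root, as a user would see the tree.
listing = [(depth, name) for depth, name in root.walk() if depth > 0]
```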

3.3 Keyword and Dictionary

A keyword is a combination of a category and a character string. When category C and character string S in the example of Section 2.1 are used, the resulting keywords [C, "S"] are: [Name, "Taro Yamada"], [Product model, "ABC-001"], [Noun -> Verb, "software … uninstall"] and so on.

Because a keyword can contain an arbitrary character string, expressions that are not single words are also keywords. The dependency "software … uninstall" in the last example is one such keyword.
Whether two keywords are equal to each other depends on whether both the category and character strings are equal to each other between the two keywords. For example, [Family name, "Nagano"] and [Prefecture, "Nagano"] are two different keywords.
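
This equality rule corresponds to comparing (category, character string) pairs as a whole, as in the following minimal Python sketch. The Keyword type is a hypothetical illustration, not part of the product.

```python
from typing import NamedTuple

class Keyword(NamedTuple):
    category: str
    text: str

family_name = Keyword("Family name", "Nagano")
prefecture = Keyword("Prefecture", "Nagano")
same_again = Keyword("Prefecture", "Nagano")

# Tuples compare field by field, so the same string under different
# categories yields different keywords.
distinct = {family_name, prefecture, same_again}
```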

The IBM OmniFind Analytics Edition language processing extracts information in the form of keywords from text. See the example in Section 2.1 for extraction examples. See the instructions for individual applications about how to handle application-dependent keywords.

A dictionary is a collection of data that is used when the language processing extracts keywords from text. A dictionary contains two types of information:

  1. A collection of keywords
  2. For each keyword in 1., a collection of keywords that belong to the same category and are considered equivalent to it

The following example illustrates how the dictionary works. By following the processing of a short sentence, you can see how IBM OmniFind Analytics Edition uses the dictionary to process text. Note that depending on the IBM OmniFind Analytics Edition settings, you might see different results for the same text.

Target text 1:
"Where in the manual can I find how to uninstall the software?"

By default, IBM OmniFind Analytics Edition has a dictionary (system dictionary) for extracting nouns and dependency. By using the system dictionary, the language processing extracts nouns, verbs, and dependency. Some of the extracted information is as follows:

Category | Description
Noun | manual
Noun -> Verb | software ... uninstall
(and so on) | (and so on)

Now, analyze the following text, which is quite similar to Target text 1 above.

Target text 2:
"Can I find information on how to uninstall the program in the instructions?"

If the system dictionary is the only dictionary that IBM OmniFind Analytics Edition has, it probably returns the following results:

Category | Description
Noun | instructions
Noun -> Verb | program ... uninstall
(and so on) | (and so on)

If you compare the two texts, you can see that Target text 1 and Target text 2 ask the same question. However, because the keywords extracted from Target text 2 differ from those extracted from Target text 1, the application cannot determine that the two texts carry the same information.

This issue can be solved by providing a new dictionary for language processing. Before you run the language processing, you can add a new dictionary (usually called a user dictionary because you edit it) that contains the following knowledge.

Keyword | List of keywords (synonyms) that are equivalent to the keyword on the left
[Noun, "software"] | [Noun, "program"]
[Noun, "manual"] | [Noun, "instructions"]

Two keywords [Noun, "software"] and [Noun, "manual"] are registered in this dictionary. Keywords that are equivalent to these keywords are also registered.
The latter keywords are called synonyms for the original keywords. For example, [Noun, "instructions"] is a synonym for [Noun, "manual"].

The language processing looks up each keyword found in the text in the synonym lists of the dictionary. If the keyword is not registered as a synonym, it is output as is. If it is registered as a synonym, both the synonym and the keyword that it corresponds to are output.

In the previous example, the new keywords [Noun, "software"] and [Noun, "manual"] are extracted from Target text 2 in addition to the keywords [Noun, "program"] and [Noun, "instructions"]. Because these word-level settings also apply to dependencies, the dependency [Noun -> Verb, "software … uninstall"] is newly extracted as well.

As this example shows, how keywords and their synonyms are registered in the dictionary is an important factor in analyzing the results.
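
The word-level synonym lookup described in this section can be approximated by the following Python sketch. The data structures and the function name are hypothetical and only mirror the behavior described in the text; the sketch does not model how the product applies synonyms to dependencies.

```python
from typing import Dict, List, Tuple

Keyword = Tuple[str, str]  # (category, character string)

# User dictionary: registered keyword -> list of its synonyms.
USER_DICTIONARY: Dict[Keyword, List[Keyword]] = {
    ("Noun", "software"): [("Noun", "program")],
    ("Noun", "manual"): [("Noun", "instructions")],
}

# Reverse index: synonym -> registered keyword.
SYNONYM_INDEX: Dict[Keyword, Keyword] = {
    synonym: keyword
    for keyword, synonyms in USER_DICTIONARY.items()
    for synonym in synonyms
}

def expand(extracted: List[Keyword]) -> List[Keyword]:
    """Output every keyword as is; for a registered synonym, also output
    the keyword that it corresponds to."""
    result: List[Keyword] = []
    for keyword in extracted:
        result.append(keyword)
        registered = SYNONYM_INDEX.get(keyword)
        if registered is not None and registered not in result:
            result.append(registered)
    return result

# Nouns that the system dictionary alone would extract from Target text 2.
text2_keywords = [("Noun", "program"), ("Noun", "instructions")]
expanded = expand(text2_keywords)
```

With the user dictionary in place, the nouns from Target text 2 now include "software" and "manual", so the application can relate Target text 2 to Target text 1.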

Terms of Use

Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A. 
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan 
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
Silicon Valley Lab
Building 090/H-410
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

Copyright License
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks.

See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks.

The following terms are trademarks or registered trademarks of other companies:

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product or service names might be trademarks or service marks of others.