IBM Content Analyzer Rule Editor Guide
This document contains proprietary information of IBM. This proprietary information is provided in accordance with the license conditions and is protected by copyright. Information contained in this document provides no warranties whatsoever for any products. Also, no descriptions provided in this document should be interpreted as product warranties. Depending on the system environment, the yen symbol may be displayed as the backslash symbol, or the backslash symbol may be displayed as the yen symbol.
© Copyright International Business Machines Corporation 2008. All rights reserved. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
1 Introduction
This document describes how to use the IBM Content Analyzer Rule Editor.
1.1 Overview
The Rule Editor is a Web application that is used to create and edit rules in the format supported by IBM Content Analyzer. Text Miner can subsequently use these rules to mine information from the document collections. The rule editor is linked with the dictionary tree, and only supports operations on rules that are associated with a category in the category tree. See the Overview document for the dictionary editor.
Specifically, with the rule editor, you can:
Figure 1: Relationship of rule editor with components of the dictionary editor.
1.2 Rule editor files
Similar to the Dictionary Editor, the rule editor supports editing operations by multiple users. To avoid editing conflicts, the Rule Editor includes a mechanism to lock the file to be edited and prevent other users from editing the same file. The synchronization mechanisms are discussed in section 11. The rule editor is associated with "Rich Pattern Format" files, which are XML files with the extension .RPF. All rules created by the rule editor are saved in .RPF files. Note the following points about how RPF files are maintained.
1.3 The dictionary editor and the rule editor
At the level of the user interface, the Rule Editor is tightly integrated with the Dictionary Editor and runs as an application within the Dictionary Editor. The Dictionary Editor operates in the context of a specific database, and the Rule Editor uses the same database. Further, each rule is associated with a category to which it belongs. This category is an element of the category tree in the dictionary. The home page of the rule editor is accessible through a link from the home page of the Dictionary Editor. All operations at the level of a rule, such as rule creation, modification, and deletion, are associated with the category to which the rule belongs. Synchronization of the common resources is also performed across the rule editor and dictionary editor. Figure 2 shows the home page of the dictionary editor with the link Edit Rules. When clicked, this link opens the home page of the Rule Editor.
Figure 2: Home page of the Dictionary Editor
2 Rules and rule files
2.1 Rule definition
A rule is an entity that allows the user to specify what is to be mined from the various documents. A rule consists of two parts: a pattern and a value. The rule editor helps the user define the pattern by enabling the configuration of a set of parameter values. The rule value specifies what the result of applying the pattern to the document should be.
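As an informal illustration of this two-part structure, the following Python sketch models a rule as a pattern plus a value, together with the category that the rule belongs to. The class and field names (Rule, Token, constraint_type, and so on), the category name, and the sample values are all hypothetical and chosen only for this example; they do not reflect the actual RPF schema or any IBM Content Analyzer API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """One element of a rule pattern (hypothetical model, not the RPF schema)."""
    constraint_type: str   # illustrative label for the kind of constraint
    constraint_value: str  # the expression that a word must satisfy


@dataclass
class Rule:
    """A rule pairs a pattern (what to look for) with a value (what to report)."""
    category: str                                   # category the rule belongs to
    pattern: List[Token] = field(default_factory=list)
    value: str = ""                                 # result of applying the pattern


# Build a sample rule with a two-token pattern (all names are made up).
rule = Rule(
    category="sentiment",
    pattern=[Token("String", "love"), Token("String", "music")],
    value="positive",
)
print(len(rule.pattern), rule.value)
```

Keeping the pattern as an ordered list of tokens mirrors the way the guide later describes adding and editing tokens one at a time.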
2.2 Rule structure
Figure 3 specifies a sample rule and its various components.
Figure 3: Components of a rule
2.3 Rule files - RPF format
The RPF format supports all the rules supported by the PAT format. In addition, the RPF format allows you to specify optional constraints for each w (word) element of each rule. The rule editor supports both PAT and RPF files. However, every newly created rule is saved in a file that has the name of the category of that rule.
3 Accessing the rule editor
Figure 4: Home page of the Rule Editor
The rule editor runs as an application within the dictionary editor. When you open the dictionary editor (as described in the overview section of the dictionary editor), you are first prompted to select a database. At this time, the Edit Rules link in the left frame is disabled. After you select a database, the Edit Rules link is highlighted (as shown in Figure 2). After you click this link, you see the home page of the rule editor, which is a view of the category tree. Categories with a plus sign (+) on the left can be expanded down to the lowest level. Each category displays links to Add a new rule and to Browse rules in that category. Also, at the top level, there are links to Browse all rules, Search rules, Backup Rules, Restore Rules, Transform Rules, and Test Rules. This view is subsequently referred to as the home page of the rule editor. Figure 4 shows a screen capture of the dictionary tree.
4 Browsing the rule tree
When you select Edit Rules from the home page of the dictionary editor, you see the dictionary tree (see Figure 4). Each node represents one category and has links to add a new rule or to browse the rules under that category. At the top of the page is a link to Browse all rules. You can view the rules by using one of the following options:
4.1 Browsing rules of one category
Figure 5: View of Browse Rules
Figure 5 shows a screen capture of the view that you see when you select Browse Rules for a specific category from the home page of the rule editor. All rules in that category are displayed. The details displayed are the category name, the tokens in the rule, and the rule value.
4.2 Browsing all rules
Figure 6: View of Browse all rules
Figure 6 shows a screen capture of the view that you see when you select Browse all Rules from the home page of the rule editor. This view is similar to the previous view (in Figure 5) except that here all the rules in all the categories are displayed. Viewing all the rules is useful when you do not know which rule you are looking for or when the number of rules in the system is not very high. The Search Rules results use the same view.
Figure 7: Options for Browsing Rules
In all the previously described cases, you have the following options to handle a particular rule:
5 Creating a new rule in Rule Editor
Every rule belongs to some category, and you can add a new rule by first selecting the category to which the rule belongs. No rule can be added without specifying the category to which it belongs. You can also specify various parameters for the rule. After you click Save, the rule is saved to an XML file. Subsequently, the rule can be viewed by searching for a parameter of the rule or by browsing the rules in that category.
5.1 Flow of work
Figure 10: Flow of tasks in creating a new rule
Figure 10 describes the sequence of steps and the work flow to create a new rule. Follow these general steps to create a new rule:
5.2 Add Rule User Interface
Figure 11: Add Rule User Interface
The previous figure shows the window for adding a rule. Review the following descriptions of parameters for building a rule:
5.3 Edit Token window
Whenever you create a new rule, the rule editor automatically creates an empty token. You can edit the empty token by clicking Edit. The following figure shows the window for editing a token for a particular rule.
Figure 12: Edit Token window
The supported constraint types are String, Lex, Part-of-Speech, Features and Category. You must select one of these constraint types. After you select a constraint type, the relevant page is displayed for entering the value for that constraint. Enter the value for that constraint as follows.
Figure 12 shows a screen capture for the selection of the lex constraint. Figures 13, 14, and 15 show the screen captures for the selection of the constraint values when the constraint types are String, Parts of Speech / Features, and Category, respectively.
Figure 13: View for entering string value
Figure 14: View for entering Parts of Speech / Features value
Figure 15: View for entering Category value
5.4 Regular Expression syntax in ICA
IBM Content Analyzer uses the Java java.util.regex package for regular expression matching. In addition to the Java regular expression processing, all linguistic processing, including that done by LanguageWare, also applies.
The / operator is not part of the Java regular expression syntax. It is used to tell the rule interpreter to invoke the Java regular expression processing on a rule. Therefore, str="love" only matches with "love". str="/love/" matches with strings that contain "love" inside such as "lover", "lovely", "beloved", "glove", and "loves".
The main operator for writing rules is the vertical line (|) operator, which is interpreted as the Boolean OR. This operator has the lowest precedence among all operators. When you write String constraints, it is important to understand that a value without the enclosing / characters is a literal constraint, not a regular expression. The value is not passed to the regular expression interpreter. A list of words separated by | matches only a word that is exactly one of the words in the list.
The ampersand character (&) is another operator that is frequently used in rules. It is interpreted as the Boolean AND. To match a pattern that contains AND operators, all the words must be matched exactly in the text.
The caret, or circumflex character, (^) matches only at the beginning of a line. The dollar sign ($) matches only at the end of a line. Here are some examples:
str="/^love/" matches with strings that start with "love" such as "lover", "lovely", and "loves", in addition to "love", but does not match "beloved" or "glove".
str="/love$/" matches with strings that end with "love" such as "glove" and "truelove", in addition to "love", but does not match "lovely" or "loves".
str="/^love$/" matches with strings that start and end with "love". So it only matches with "love".
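To experiment with these String constraint forms outside the product, a small stand-in can help. The sketch below is written in Python and uses Python's re module in place of java.util.regex; for simple patterns like these, the two engines behave the same way. The function name matches_str_constraint is hypothetical, and the dispatch on the enclosing / characters, the handling of | in literal values, and the case-sensitive comparison are assumptions drawn from the description above, not the actual rule interpreter. The & operator and linguistic processing are not modeled.

```python
import re


def matches_str_constraint(value: str, word: str) -> bool:
    """Return True if `word` satisfies the String constraint `value`.

    Assumed behavior, based on the description in this section:
    - "/.../": the text between the slashes is a regular expression that
      may match anywhere in the word (comparable to Matcher.find() in Java).
    - otherwise: the value is literal; "|" separates alternative words and
      the word must equal one of them exactly.
    """
    if len(value) >= 2 and value.startswith("/") and value.endswith("/"):
        return re.search(value[1:-1], word) is not None
    return word in value.split("|")


examples = [
    ("love", "lover"),        # literal constraint
    ("love|hate", "hate"),    # literal list of alternatives
    ("/love/", "glove"),      # regular expression, unanchored
    ("/^love/", "beloved"),   # anchored at the start
    ("/love$/", "truelove"),  # anchored at the end
    ("/^love$/", "love"),     # anchored at both ends
]
for value, word in examples:
    print(f"{value!r} against {word!r}: {matches_str_constraint(value, word)}")
```

Running the loop against your own word lists is a quick way to check an expression before entering it in the Edit Token window; the Test Rules feature remains the authoritative check.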
If you want the | operator to apply to only part of a regular expression, you need to use parentheses. The parentheses characters are a grouping operator. Because | has the lowest precedence in java.util.regex, the syntax ab|cd is interpreted as (ab)|(cd) and matches "ab" or "cd". In contrast, a(b|c)d matches "abd" or "acd". Here are a few concrete examples:
str="/love|hate/" matches with strings that contain "love" or "hate", such as "lovely" and "hated". In standard Java regular expression syntax, it has the same effect as str="/(love)|(hate)/" and str="/((love)|(hate))/"; the parentheses only make the grouping explicit.
str="/l(ove|ik)e/" matches with strings that contain "love" or "like".
5.5 Writing Constraint Values
As mentioned in section 5.3, the page for specifying the constraint value depends on the constraint type that you select. Figure 12 shows the page for entering the constraint value when the constraint type is lex. Figures 13, 14, and 15 show the pages for entering constraint values when the type is String, Parts of Speech / Features, and Category, respectively.
The constraint value can be any valid expression constructed by using a combination of operators and operands. Note the following tips when you write rules:
5.6 Adding New Tokens
To create an empty token or copy an existing token, first click Add Rule under a category on the home page of the Rule Editor to open the Add Rule window (see Figure 11).
Figure 17: Add / Copy Token
To add an empty token, click Add and then click Submit. To copy an existing token, click Copy Token and then select the Token ID of the token that you want to copy. After you select the Token ID, click Submit.
6 Search
Figure 4 shows the rule tree page, with a Search rules link. This link allows you to search for rules by one of two criteria: the rule name, or a combination of constraint type and constraint value. The search is case insensitive. Figure 18 shows the screen displayed when you click Search Rules.
Figure 18: Search Rules Window
6.1 Search on Rule Name
This option allows you to search for rules based on part of the rule name or the full name. You enter the string and click Submit. All rules that have the string as part of their names will be returned in the search results. The results obtained after searching are shown in the following window.
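The matching behavior described here (a case-insensitive substring match on the rule name) can be sketched in a few lines of Python. The function name search_by_name and the sample rule names are hypothetical; this is only an illustration of the matching logic, not the Rule Editor's implementation.

```python
def search_by_name(rule_names, query):
    """Return the rule names that contain `query`, ignoring case."""
    needle = query.lower()
    return [name for name in rule_names if needle in name.lower()]


# Hypothetical rule names used only for this example.
names = ["LoveExpression", "hate_phrase", "Beloved"]
print(search_by_name(names, "love"))
```

A partial name such as "love" therefore returns every rule whose name contains that string in any letter case.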
6.2 Search on constraint value
Figure 19: Search based on constraint type and value
This option allows you to search for rules based on the constraint type and value. Figure 19 shows a search window that allows you to enter the constraint type to be searched for and the value. All rules that have at least one constraint of this type having this value are returned.
7 Backup Rules
The Backup Rules link, which is provided at the top of the home page of the rule editor, allows you to back up the current rules. When you click this link, you must enter the name of the folder in which you want to save the rule files.
The rule files that are backed up are stored in the directory structure %TAKMI_HOME%/databases/database_name/pattern/backup/rule-files-folder-name.
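As a rough illustration of this layout, the following Python sketch assembles a backup path from its components and splits a backup folder name, such as NewRules.20080131.1737, into the name that you provide and the time stamp that the system adds. The function names are hypothetical. The time stamp format (a date as year, month, day, followed by a time as hours and minutes) is an assumption inferred from that single example, not a documented format.

```python
import posixpath
from datetime import datetime


def backup_dir(takmi_home: str, database_name: str, folder_name: str) -> str:
    """Join the path components of the backup location into one path."""
    return posixpath.join(takmi_home, "databases", database_name,
                          "pattern", "backup", folder_name)


def split_folder_name(folder_name: str):
    """Split "<name>.<date>.<time>" into the name and a datetime.

    Assumes the time stamp looks like 20080131.1737 (yyyymmdd.hhmm).
    """
    name, date_part, time_part = folder_name.rsplit(".", 2)
    stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return name, stamp


# Hypothetical installation directory and database name.
print(backup_dir("/opt/takmi", "mydb", "NewRules.20080131.1737"))
print(split_folder_name("NewRules.20080131.1737"))
```

Splitting from the right keeps a user-supplied name that itself contains periods intact.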
The folder name inside the backup folder consists of two parts: the name that you provide and a time stamp that the system creates. For example:
NewRules.20080131.1737
In this example, the value NewRules is the name that you provided and 20080131.1737 is the time stamp created by the system. See the following screen capture of the Backup Rules window.
Figure 20: Backup Rules window
8 Restore Rules
The Restore Rules link, which is available on the rule editor home page, allows you to restore backed-up rules to the current environment.
Important: Whenever you restore the rules, the current rules are overwritten by the ones that are to be restored. See the following screen capture of the Restore Rules window.
Figure 21: Restore Rules window
After you click Restore Rules, you see the list of all the folder names where the backed up rules reside. Select the folder name that you want to restore and click Restore. You can also see the path where the backed up files are stored. Each folder name is appended with the time stamp. This time stamp helps you to map the files to the date on which you backed up the rule files.
9 Transform Rules
Click Transform Rules to convert the rule file in RPF format to PAT format. After you click this link, you see the list of categories of rule files that will be transformed. See the following screen capture of the Transform Rules window.
Figure 21: Transform Rules window
In addition to this link, the Rule Editor is accompanied by a utility for exporting files in RPF format to files in PAT format. The utility is a batch script named takmi_transform_rules.bat. The script requires the XSLT file takmi-resolve-repetition.xsl in the same directory and the JAR file oae-rule-editor-backend.jar in the lib directory. The batch script can be invoked as:
takmi_transform_rules.bat source_xml_file (in rpf format) destination_xml_file (in pat format)
10 Testing Rules
Figure 22: Test Rules window
Testing rules is one of the most powerful features of the Rule Editor. It lets you try a rule on a small set of unstructured data with a click of a button. The Test Rules link is provided on the home page of the Rule Editor. After clicking the link, you must enter the following parameters:
After you enter the text and select the category, click Test. You then see another window that shows the results of applying the rules in that category to the text provided. The following figure shows the window with the test results.
Figure 23: Test Results Interface
You see the results of the rules tested on the text provided in the previous step, along with the document details, including the document ID, date, and the text on which the rules were tested. You also see the category and the keyword information. Each keyword's starting and ending offset is also shown. To check the consistency of the rule, click the radio button next to one or more keywords. After you select a keyword, the corresponding text is highlighted in the text provided in the document details.
11 Synchronizing among multiple users
The Dictionary Editor supports editing operations by multiple users. The Rule Editor, which is a part of Dictionary Editor, can lock the files to be edited and prevent other users from editing the same files. Problems can be avoided if users know which files might cause conflicts when they edit them. A description of each file type is as follows:
Notices
This information was developed for products and services offered in the U.S.A.
Copyright License
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785, U.S.A.
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation, Licensing, 2-31 Roppongi 3-chome, Minato-ku, Tokyo 106-0032, Japan
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:
IBM Corporation, Silicon Valley Lab, Building 090/H-410, 555 Bailey Avenue, San Jose, CA 95141-1003, U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information contains examples of data and reports used in daily business operations.
To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks.
See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks. The following terms are trademarks or registered trademarks of other companies: Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product or service names might be trademarks or service marks of others.