Documentation
(C) IBM Corp. 1996, 1999

Text Extender: Administration and Programming


Preparing text documents for searching

This section describes how to prepare a database so that its text tables can be searched by Text Extender. The steps are:

  1. Change the text configuration (optional)

  2. Modify the stop-word and abbreviation files (optional)

  3. Modify the document model file (optional)

  4. Create a sample table (optional)

  5. Enable a database for use by Text Extender

  6. Enable a text table for use by Text Extender (optional)

  7. Enable a text column for use by Text Extender

  8. Enable Text Extender to search in external text files (optional).

If you have just installed Text Extender, do these steps in sequence.

Changing the text configuration

Summary

When
When you want to make different default settings used for creating and updating an index.

Command
CHANGE TEXT CONFIGURATION

Authorization
SELECT

When you create an index, the following parameters described in Text configuration settings, are set for that index:

When Text Extender is first installed, default values for these settings are established in the text configuration. To display the current text configuration values, see Displaying the text configuration settings.

To change the text configuration to be used as default values when indexes are created, enter:

     db2tx "CHANGE TEXT CFG USING settings"

Examples

To change the default index type and the default index directory for future indexes:

     db2tx "CHANGE TEXT CONFIGURATION USING
                       INDEXTYPE   precise
                       INDEXOPTION normalized
                       DIRECTORY   DB2TX_INSTOWNER/db2tx/indexes"

To change the default update frequency for indexes so that they are updated at 12:00 or 15:00, on Monday to Friday, if there is a minimum of 100 text documents queued:

     db2tx "CHANGE TEXT CONFIGURATION USING
                       UPDATEFREQ min(100) d(1,2,3,4,5) h(12,15) m(00)"

To stop the periodic updating of an index:

     db2tx "CHANGE TEXT CONFIGURATION USING
                       UPDATEFREQ none"

Modifying the stop-word and abbreviation files

Summary

When
If possible, only once when Text Extender is first installed. Optional.

Command
Your own editor command

Authorization
None

There is one stop-word file and one abbreviation file per language. To understand the implications of editing these files, see Why text documents need to be indexed.
Tip

Before you begin editing one of these files, make a backup copy.

The stop word and abbreviation files on OS/2 and Windows systems are in:

drive:\dmb\db2tx\dict

On AIX, HP-UX, and SUN-Solaris systems, they are in:

DB2TX_INSTOWNER /db2tx/dicts

Use your own editor to edit these files. They use CCSID 850, so ensure that your application CCSID is also set to 850 before you begin.

Remove words and abbreviations that you want to be indexed. Add words that you do not want to be indexed.

Modifying the document model file

Summary

When
If you intend to work with a document's structure. Optional.

Command
Your own editor command

Authorization
None

You can restrict a search to particular sections of documents. This concept, and the use of a document model file to enable Text Extender to recognise such sections, is described in Working with structured documents.

When a server instance is created, an example of a document model file desmodel.ini is created in the server instance directory. There is also an example of a document model for HTML documents in the same file.

Use your own editor to edit document model files.

Later, when you enable the text column that contains the documents, you must specify INDEXPROPERTY SECTIONS_ENABLED.

Creating a sample table

Summary

When
Optional. To create a table that can be used in later examples.

Command
db2 "CREATE TABLE ..."

Authorization
SYSADM or DBADM

This section describes how to create table DB2TX.SAMPLE that is used in many of the examples in this chapter, and in many of the examples in Chapter 5, Searching with Text Extender UDFs.

Alternatively, you can run TXVERIFY and TXSAMPLE to automatically create a sample table to be enabled for each index type. See Preparing a sample database containing enabled text. If you do this, however, you can no longer use the DB2TX.SAMPLE table name as shown in the examples of administration commands to enable a text table and to enable a text column. If you decide not to use TXSAMPLE to automatically create the sample table, make and run a command script to do the following:

  1. Create the sample table:
         db2 "CREATE TABLE db2tx.sample (
                 docid      VARCHAR(30),
                 author     VARCHAR(50),
                 subject    VARCHAR(100),
                 date       TIMESTAMP,
                 comment    LONG VARCHAR)"
    

  2. Fill the table with data:
         db2 "import from \"dessamp.imp\"
         of del modified by DELPRIORITYCHAR
         insert into db2tx.sample"
    

    The file DESSAMP.IMP is in the /samples subdirectory in the directory in which Text Extender is installed.

Preparing a sample database containing enabled text

Text Extender offers utilities for preparing a sample database. They are useful for quickly preparing text for testing Text Extender's search capabilities immediately after installation, and for general test purposes at any time after that.

These utilities are in the following directories:

To prepare a sample database, either at the server or at a client workstation:

  1. At the operating system prompt, run:
    TXVERIFY database-name  [user-id] [password]
    

    This command creates and enables a sample database.

  2. At the operating system prompt, run:
    TXSAMPLE database -name [user-id] [password]
    

    The user ID and password are required only if you are working from a client workstation.

    This command does the following:

    1. Connects to the specified database
    2. Creates the following table:

      • db2tx.sample_p
    3. Imports sample English documents to fill the table
    4. Enables the text column in the table, with the following index types:
      precise
      precise normalized
      linguistic
      linguistic sections-enabled
      Ngram
      Ngram case-enabled
      dual
    5. Waits for the text index to be built.

Enabling a database

Summary

When
Once for each database that contains columns of text to be searched in.

Command
ENABLE DATABASE

Authorization
SYSADM or DBADM

To enable the connected database , enter:

     db2tx "ENABLE DATABASE"

This command takes no parameters. It prepares a database for use by Text Extender.

This command also declares user-defined functions (UDFs) and user-defined distinct types (UDTs) to DB2. These are the SQL functions that you use later to search for text. They are described in Chapter 9, UDTs and UDFs. These declarations apply to all future sessions.

A catalog view, TEXTINDEXES, is created that keeps track of enabled text columns. See Working with the Text Extender catalog view.

This command creates text configuration information for the database, containing default values for index, text, and processing characteristics. They are described in Text configuration settings.

Figure 7. Enabling a database


Enabling a database

Once a database has been enabled, it remains so until you disable it. To reverse the changes made by ENABLE DATABASE, refer to Disabling a database.
Tips

If the environment variable DB2TX_INSTOWNER is used, it must be set to the name of the instance owner before the database is enabled. This is particularly important for UNIX users because, in UNIX, this variable is set by default.

If you later decide to drop an enabled database, you should first disable it to ensure that the declared UDFs, the catalog view, and so on, are removed.

Enabling a text table

Summary

When
Optional. Once to create a common index for all text columns in the table.

Command
ENABLE TEXT TABLE

Authorization
ALTER, SELECT, UPDATE on the table

This step determines whether you have one common index for all the text columns in the table, or several indexes, that is, a separate index for each text column. See Creating one or several text indexes for a table for further information.

To have a common index, run ENABLE TEXT TABLE, then run ENABLE TEXT COLUMN for each text column. To have separate indexes, skip ENABLE TEXT TABLE, and run only ENABLE TEXT COLUMN for each text column. This is illustrated later in this topic.

During this step, Text Extender creates an empty text index that is common to all subsequently enabled text columns. You specify the type of index, how frequently the index is to be updated, and in which directory the index is to be stored. Default values for any parameters that you do not specify are taken from the text configuration settings.

The examples in other chapters assume that the index type is dual.
Tip

If a setting, such as the index update frequency, should be the same for most text tables, it may be more convenient to use text configuration information to specify default settings. See CHANGE TEXT CONFIGURATION.

This step also creates an empty log table for recording which documents in the table are added, changed, or deleted. Triggers are created to keep the log table updated.

You cannot run ENABLE TEXT TABLE for a table that already contains a text column that has been enabled for Text Extender.

To delete an index created by ENABLE TEXT TABLE, see Disabling a text table.
Tip

If you later decide to drop an enabled text table, you should first disable it to ensure that the index, the log table, and so on, are removed.

Examples

The following example enables text table DB2TX.SAMPLE:

     db2tx "ENABLE TEXT TABLE db2tx.sample"

Default values for the index characteristics are taken from the text configuration settings.

The next example explicitly sets the characteristics of the common index that is created for the table.

For a UNIX server:

     db2tx "ENABLE TEXT TABLE    db2tx.sample
                      INDEXTYPE  linguistic
                      UPDATEFREQ min(100) d(1,2,3,4,5) h(12,15) m(00)
                      DIRECTORY  DB2TX_INSTOWNER/db2tx/indexes"

For an OS/2, or Windows NT server:

     db2tx "ENABLE TEXT TABLE    db2tx.sample
                      INDEXTYPE  linguistic
                      UPDATEFREQ min(100) d(1,2,3,4,5) h(12,15) m(00)
                      DIRECTORY  \dmb\db2tx\indexes"

The example sets the index type and the index directory, and then sets the index update frequency so that the index is updated at 12:00 or 15:00, on Monday to Friday, if there is a minimum of 100 text documents queued.

Figure 8. Creating a common index for all text columns in a table


Creating a common index for all columns in a table

Figure 9. Creating a separate index for each text column


Creating a separate index for each text column

Enabling a text column

Summary

When
Once for each column that contains text to be searched.

Command
ENABLE TEXT COLUMN

Authorization
ALTER, SELECT, UPDATE on the table
Tip

If a setting, such as the index update frequency, should be the same for most text columns, use the text configuration information to specify default settings.

To reverse the changes made by ENABLE TEXT COLUMN, use the DISABLE TEXT COLUMN command. To disable all enabled text columns in a table, use the DISABLE TEXT TABLE command.

When you enable a text column, a handle column is added to the table, the document information (format, language, CCSID) is set, a log table is created, and an index is created,

A handle column is added

During this step, Text Extender adds to the table a 60-byte VARCHAR handle column - a column that contains handles associated with the text column that is being enabled. Handles contain information about the text in the associated text column and in the associated external files. This information includes a unique document ID, the document's language, format, and CCSID, and the index name. They are described in The sample table DB2TX.SAMPLE.
Tip

This version of Text Extender cannot work with indexes created using an earlier release. You must disable and then re-enable all Text Extender-enabled databases and tables.

The column containing text blocks is COMMENT. Before you can search through the text in this column, you must prepare the database and the COMMENT column for use by Text Extender.

After this preparation step, the DB2TX.SAMPLE table contains an additional column for handles.
Note:When you subsequently search for text, you specify the handle column, not the text column, as the column to be searched.

The document information is set

You specify the type of text documents you typically store in this text column: their format (such as ASCII), their language, and their CCSID. Defaults for this information can be specified in the text configuration settings. See Text configuration settings.

A log table is created

During this step, a log table and a view called LOGIXnnnnnn is created, where IXnnnnnn is the index name (available from the catalog view). If a default tablespace is specified in text configuration, the log table is stored there; otherwise, it is stored in the DB2 system default tablespace. To optimize performance and the use of disk space, you can specify a different tablespace to be used for the log tables.

Triggers are also created that add information to the log table whenever a document in the column is added or changed. This information causes these documents to be indexed the next time indexing takes place.

If external files are added or changed, these triggers are not aware of the changes. In such cases, to cause the triggers to add the information to the log table, use an UPDATE statement as shown in the example in Updating an index for external files.

If errors occur during indexing, such as when a document queued for indexing could not be found, so-called error events are added to the log table and can be displayed, as described in Displaying error events.
Tip

If you run out of log space in this step, see Enabling a text column in a large table for possible solutions.

In partitioned databases, each table is assigned to a tablespace and a nodegroup. It is important that the log table is assigned to a tablespace that belongs to the same nodegroup as the enabled user table. Text Extender checks this during the ENABLE command.

An index is created

If you intend to have a separate index for each text column, that is, you have skipped the step ENABLE TEXT TABLE, Text Extender creates a separate index for the text column during this step. You specify the type of index, how frequently the index is to be updated, and in which directory the index is to be stored. If, on the other hand, you intend to have one index for the whole table, then you have already run ENABLE TEXT TABLE and specified the index parameters; they are ignored if you repeat them here.

Use the UPDATEINDEX keyword to determine whether the indexing of the text documents in the specified text column begins immediately, or when periodic indexing is next scheduled. If you do not use this keyword, the value specified in the text configuration settings is taken.

Creating indexes of various types for a text column. You can create more than one index for a text column. This can be useful if you want to allow, for example, linguistic and fuzzy search on the same text column, by associating it with different index types, such as linguistic and Ngram indexes. You do this by running ENABLE TEXT COLUMN again, specifying not only the additional type of index to be created, but also a unique handle column name.
Tip

If you are using a partitioned database, and you decided to change the configuration of a nodegroup and start a redistribution of the table rows, you must recreate the text indexes and log tables, that is, you must disable and enable again all Text Extender-enabled tables of that nodegroup.

Enabling a text column in a large table

If you are working with a table that has a large row length, keep in mind that enabling a text column adds a handle column of type DB2TEXTH (VARCHAR 60). Similarly, enabling an external file adds a handle column of type DB2TEXTFH (VARCHAR 210). This could be significant if the table is approaching its maximum row length as determined by DB2.

Also when you enable a text column in large table, use the DB2 UDB REORGANIZE utility to check whether the table needs to be reorganized. When you enable a large table for the first time, the following steps make indexing faster:

  1. Enable the table using the NOUPDATE option. This creates the handles, but does not yet index the documents.

  2. Reorganize the table using the DB2 UDB REORGANIZE utility.

  3. Create the index by running UPDATE INDEX.

When you enable a text column or external files, Text Extender adds a handle column to the table and initializes the handle values, thereby causing DB2 UDB log entries to be written. If there is an unusually large number of log entries to be written, DB2 UDB can run out of log space.

There are two ways to handle this situation; the first is better for performance reasons:

Examples

The following example enables text column COMMENT in table DB2TX.SAMPLE, and assigns the name COMMENTHANDLE to the handle column that is created:

     db2tx "ENABLE TEXT COLUMN    db2tx.sample    comment
                        HANDLE    commenthandle"

Default values for the text information and for the index characteristics are taken from the text configuration settings.

The next example explicitly sets the values for the type of documents that are in the COMMENT column. Default values for the index characteristics are taken from the text configuration settings.

     db2tx "ENABLE TEXT COLUMN    db2tx.sample    comment
                        HANDLE    commenthandle
                        CCSID     819
                        LANGUAGE  uk_english
                        FORMAT    rft"

The next example explicitly sets the values for the characteristics of the index that is created for the COMMENT column. The example sets the index type and the index directory, and sets the index update frequency so that the index is updated at 12:00 or 15:00, on Monday to Friday, if there is a minimum of 100 text documents queued. Default values for the text information are taken from the text configuration settings.

For a UNIX server:

     db2tx "ENABLE TEXT COLUMN     db2tx.sample    comment
                       HANDLE      commenthandle
                       INDEXTYPE   linguistic
                       UPDATEFREQ  min(100) d(1,2,3,4,5) h(12,15) m(00)
                       UPDATEINDEX UPDATE
                       DIRECTORY   DB2TX_INSTOWNER/db2tx/indexes"

For an OS/2 or Windows NT server:

     db2tx "ENABLE TEXT COLUMN     db2tx.sample    comment
                       HANDLE      commenthandle
                       INDEXTYPE   linguistic
                       UPDATEFREQ  min(100) d(1,2,3,4,5) h(12,15) m(00)
                       UPDATEINDEX UPDATE
                       DIRECTORY   \dmb\db2tx\indexes"

Enabling text columns of a nonsupported data type

Text columns must be CHAR, GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, DBCLOB, VARCHAR, LONG VARCHAR, or CLOB to be enabled by Text Extender. If the documents are in a column of a different type, such as a user-defined distinct type (UDT), you must provide a user-defined function that takes the user type as input and provides as output type CHAR, GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, DBCLOB, VARCHAR, LONG VARCHAR, or CLOB.

Use the FUNCTION keyword in ENABLE TEXT COLUMN to specify the name of this function.

Example: You intend to store compressed text in a table.

  1. Create a UDT for the text:
         db2 CREATE DISTINCT TYPE COMPRESSED_TEXT AS CLOB(1M)
    

  2. Create a table and insert the text into it:
         db2 CREATE TABLE MYTEXT (author VARCHAR(50),
                                  text   COMPRESSED_TEXT)
         db2 INSERT ...
    

To enable the text column for use by Text Extender:

  1. Create a UDF called, for example, UNCOMPRESS, that receives a value of type COMPRESSED_TEXT and returns the corresponding uncompressed text as, for example, a CLOB(10M) value.

  2. Enable the text column using the FUNCTION keyword to identify the UNCOMPRESS UDF:
         db2tx "ENABLE TEXT COLUMN MYTABLE  text
                                   FUNCTION uncompress
                                   HANDLE   handle
                                   ..."
    

Maintaining indexes in a multiple-node environment

If the text column that you are enabling belongs to a table that is part of a multiple-node nodegroup, the index directory that you specify must be available on all physical nodes. If you use the default directory specified in the text configuration, make sure that the path is available on all nodes of the nodegroup. If this is not convenient, you can specify a specific path for each node in the ENABLE TEXT COLUMN command.

If you change the node configuration of a nodegroup that contains a table that is enabled for Text Extender, you must reindex the table. You can check if the node configuration has been changed by using the command TXNCHECK.
Tip

If you are using DB2 UDB Extended Enterprise Edition, and have tables partitioned among several nodes, there may be a difference between the system time of the nodes (when one of the nodes is on summer time, for example). To prevent problems, ensure that the difference between the system time settings of the nodes is within the value of DB2's MAX_TIME_DIFF setting. In Windows NT, use the NET TIME command to synchronize the system time settings.

Working with structured documents

A structured document is one that contains sections, such as the document title, author, and subject. You can limit the scope of a search to a particular section of documents.

Structured flat file, XML, and HTML documents are supported. A flat file with marked up sections could look like this:

<title>IBM Dictionary of Computing
<author>McDaniel, George
<subject>Computers, Reference, ...

An XML document might look like this:

<?xml version="1.0"?>
<!DOCTYPE LETTER SYSTEM "letter.dtd">
 
<LETTER>
 <HEADER> This tag has been skipped in the definition, so this text will
          be added to the section named LETTER.
 </HEADER>
 <ADDRESS>
    Text will added to section named LETTER/address.
  <CITY>
    Text will added to section named LETTER/address/City.
  </CITY>
 </ADDRESS>
 <CONTENT>
   Text will be added section named LETTER/Content.
  <NOSECTION>  Text will be added to section named LETTER/Content since element
               "NOSECTION" was not defined.
  </NOSECTION>
  <GREETINGS>
   Text will be added section named LETTER/Content/Greetings.
  </GREETINGS>
 </CONTENT>
</LETTER>

Restrictions:

To make Text Extender aware of such sections when indexing, you must create a document model containing descriptive section names of your choice for use in queries against that section, and the markup tags that identify the sections. Often, you will specify a section name that is the same as the tag name:

Section  Tag
 
Title    title
Author   author
Subject  subject

The document model file. You describe the document model in a document model ini file. Tags that are not defined in the document model file are indexed according to the index type. There is one document model file for each Text Extender server instance. As you can see from the following example, a document model file can contain more than one document model.

;Comments must be preceded by a semicolon
 
[MODELS]
modelname=sample
modelname=play
 
[sample]
title=title
author=author
subject=subject
content=content
 
[play]
play=play
author=author
title=title
scene=scene

For XML documents, the model name defined in the document models file must be the same as the section name identifier and the section name tag of the root element. The names are case-sensitive.

For section-enabled indexes, well-formed XML documents containing root elements with names matching the name of a valid document model will be indexed according to that document definition. If no matching document model is found, there will be an error message and the document will not be indexed. For indexes that are not section-enabled, all at least well-formed XML documents will be indexed and the section name tags will be ignored.

Here is an example of a model definition for documents in XML format. It contains nested sections, that is, sections within other sections.

; list of document models
[MODELS]
modelname = LETTER
;  sample for XML document model definition
;  left side = section name identifier encoding whole path
;  right side = section name tags specifying tag for each element
;               of the path through the tree down to specified
;                  node. Tag delimiter is '/'.
[LETTER]
LETTER = LETTER
LETTER/address = LETTER/ADDRESS
LETTER/Content = LETTER/CONTENT
LETTER/address/City = LETTER/ADDRESS/CITY
LETTER/Content/Greetings = LETTER/CONTENT/GREETINGS

The Letter model entry shows how to model sections to be stored in the index for documents in XML format. Information on the model tree is given on the right-hand side of section entries: each character string denotes an XML element, the "/" delimiter denotes the element hierarchy. The definition given in the configuration file must be a subset of the model as defined in the DTD (document type definition) file for a given document.

The first section entry for a model to be used for XML documents must be for the root element. The complete hierarchy of the model must be defined up to the last element defined as a section. Ordering must be "breadth first". That is, all elements of level n of the model hierarchy which will be used for indexing must be declared before any element of level n+1. You cannot skip the entry "LETTER/Content = LETTER/CONTENT" if you want to use "LETTER/Content/Greetings = LETTER/CONTENT/GREETINGS" as a section in the index. However, you can skip whole branches of the hierarchy: text contained in these will be treated as belonging to the root element.

Restrictions: Searching in nested sections is possible only in XML documents. For Ngram indexes, only one section name can be searched and it must not be a nested section.

When a server instance is created, an example of a document model file, desmodel.ini is created in the server instance subdirectory. There is also an example of a document model file for HTML documents in the same directory. HTML documents cannot contain nested sections.

To enable section support, define document models in desmodel.ini, and enable the text column that contains the documents using INDEXPROPERTY SECTIONS_ENABLED. See ENABLE TEXT COLUMN for details.
Tips

  • The model names and section names are case-sensitive, and they can contain only the characters A-Z, a-z, 0-9 and /.

  • Comments can be added, preceded by a semicolon.

Enabling external text files

Summary

When
Optional. Once for each table associated with external files that are to be searched.

Command
ENABLE TEXT FILES

Authorization
ALTER, SELECT, UPDATE on the table

Text Extender can search not only in text stored in DB2 UDB tables, but also in text documents stored in files. This preparation step is needed if you intend to search for text in external files. The table associated with the external text files must not have been enabled by the command ENABLE TEXT TABLE.

An index is created, a log table is created, and the document information is set, in the same way as described in Enabling a text column.
Tips

  1. If you run out of log space in this step, see Enabling a text column in a large table for possible solutions.

  2. If the table you are enabling is partitioned and using a nodegroup that has multiple physical nodes, make sure that the external files you are referencing in the table columns are located on the node where their respective table partition resides.

A handle column of type DB2TEXTFH is added to an existing DB2 UDB table. The handle column will hold the references for the external files, each handle containing index and the document information (CCSID, format, and language).

See Handles for external files for a description.

You can specify additional parameters, such as the default index characteristics, in the same way as for enabling a text column.

After the index has been created, you can move or delete the external files. You can still search on the files. You can insert new rows in the table and use UPDATE INDEX to update the index with the new file references.

If the table you are enabling is using a nodegroup with multiple physical nodes, make sure that the external files you are referencing in the columns of your table are located on the node where the table partition resides.

Examples

  1. Create a table DB2TX.EXTFILE having at least one column, or use an existing table.

  2. Add handle column FILEHANDLES to table DB2TX.EXTFILE
             db2tx "ENABLE TEXT FILES  db2tx.extfile
                           HANDLE      filehandles
                           INDEXTYPE   linguistic
                           UPDATEFREQ  min(100) d(1,2,3,4,5) h(12,15) m(00)
                           UPDATEINDEX NOUPDATE
                           DIRECTORY   \any\db2tx\indices"
    

  3. Initialize the handle

  4. Update the index
             db2tx "UPDATE INDEX  db2tx.extfile
                          HANDLE  filehandles"
    

Ending the administration session

You have now completed the steps to prepare your text documents to be searched.

If you specified NOUPDATE for the UPDATEINDEX keyword when you enabled the text column, Text Extender does not index the text immediately, but waits for the next periodic indexing. To update the index now, see Updating an index.

When indexing of the documents has finished, you can begin retrieving information as described in Chapter 5, Searching with Text Extender UDFs.

Enter QUIT to end the Text Extender command processor.
Tip

Use GET INDEX STATUS to determine when indexing has finished.


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]