Documentation
(C) IBM Corp. 1996, 1999

Text Extender: Administration and Programming


Thesaurus concepts

A thesaurus is a controlled vocabulary of semantically related terms that usually covers a specific subject area. It can be visualized as a semantic network where each term is represented by a node. If two terms are related to each other, their nodes are connected by a link labeled with the relation name. All terms that are directly related to a given term can be reached by following all connections that leave its node. Further related terms can be reached by iteratively following all connections leaving the nodes reached in the previous step. Figure 5 shows an example of the structure of a very small thesaurus.

Figure 5. A thesaurus displayed as a network



Text Extender lets you expand a search term with additional terms from a thesaurus that you have previously created. To find out how to use thesaurus expansion in a query, refer to Chapter 10, Syntax of search arguments.

Creating a thesaurus for use in a search application requires a thesaurus definition file, which must be compiled into an internal format, the thesaurus dictionary.

The dictionary format used by a linguistic or a precise index differs from the one used by an Ngram index, so the product provides two different thesaurus compilers. The compilers differ slightly in the concepts they are based on, and they require different source formats. Decide which index type you will use before you start defining the thesauri for your search application.

The basic components of a thesaurus are "terms" and "relations".

Terms

A term is a word or expression denoting a concept within the subject domain of the thesaurus. For example, the following could be terms in one or more thesauri:

data processing
helicopter
gross national product

In a Text Extender thesaurus, terms are classified as either descriptors or nondescriptors. A descriptor is the term in a class of synonyms that is preferred for indexing and searching. The other terms in the class are called nondescriptors. For example, outline and shape are synonyms; shape could be the descriptor and outline a nondescriptor.
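The descriptor and nondescriptor distinction can be pictured as a lookup from any term in a synonym class to its preferred term. The following Python sketch only illustrates the concept; it is not part of Text Extender, and the names SYNONYM_CLASSES, build_descriptor_index, and preferred_term are hypothetical.

```python
# Hypothetical synonym classes: each descriptor maps to its nondescriptors.
SYNONYM_CLASSES = {
    "shape": ["outline"],
}


def build_descriptor_index(classes):
    """Map every term of every synonym class to the descriptor of its class."""
    index = {}
    for descriptor, nondescriptors in classes.items():
        index[descriptor] = descriptor
        for term in nondescriptors:
            index[term] = descriptor
    return index


def preferred_term(term, index):
    """Return the descriptor for a term, or the term itself if it is unknown."""
    return index.get(term, term)


INDEX = build_descriptor_index(SYNONYM_CLASSES)
print(preferred_term("outline", INDEX))
```

A term that belongs to no synonym class is returned unchanged, which is one possible choice for this illustration.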

An Ngram thesaurus does not distinguish between descriptors and nondescriptors.

Relations

A relation is an expression of an association between two terms. Relations have the following properties:

Thesaurus expansion can use every relation defined in the thesaurus. You can also specify the depth of the expansion, which is the maximum number of transitions from a source term to a target term. Note, however, that the set of terms can grow exponentially as the depth increases.

The following example shows those terms that are newly added as the depth increases.

Source term:
health

Added at depth 1:
health service, paramedical, medicine, illness

Added at depth 2:
allergology, virology, veterinary medicine, toxicology, surgery,
stomatology, rheumatology, radiotherapy, psychiatry, preventive
medicine, pathology, odontology, nutrition, nuclear medicine,
neurology, nephrology, medical check up, industrial medicine,
hematology, general medicine, epidemiology, clinical trial,
cardiology, cancerology
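The way the term set grows with the depth can be sketched as a breadth-first walk over the network. This Python sketch is an illustration only, not the Text Extender implementation. The LINKS table is hypothetical: the example above does not say which term each deeper term is linked to, so attaching three of them to "medicine" is an assumption.

```python
# Hypothetical links. Which depth-1 term leads to which depth-2 term is an
# assumption made for this illustration.
LINKS = {
    "health": ["health service", "paramedical", "medicine", "illness"],
    "medicine": ["cardiology", "neurology", "surgery"],
}


def expand(term, links, depth):
    """Return a list whose item d holds the terms first reached at depth d."""
    seen = {term}
    levels = [[term]]
    frontier = [term]
    for _ in range(depth):
        next_frontier = []
        for current in frontier:
            for target in links.get(current, []):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        if not next_frontier:
            break  # nothing new can be reached, so stop early
        levels.append(next_frontier)
        frontier = next_frontier
    return levels


for level, terms in enumerate(expand("health", LINKS, 2)):
    print(level, ", ".join(terms))
```

Each term is reported only at the depth where it is first reached, which matches the idea of listing the newly added terms per depth.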

Text Extender thesaurus relations

A Text Extender thesaurus provides the following relation types: associative, synonymous, hierarchical, and other.

In a Text Extender thesaurus there are no predefined relations. You can give each relation a name, such as BROADER TERM; the name can also be a mnemonic abbreviation, such as BT. The common relations used in thesaurus design are:

Associative

An associative relation is a bidirectional relation between descriptors, extending to any depth. It binds two terms that are neither equivalent nor hierarchical, yet are semantically associated to such an extent that the link between them may suggest additional terms for use in indexing or retrieval.

Associative relations are commonly designated as RT (related term). Examples are:

dog RT security
pet RT veterinarian

Synonymous

When a distinction is made between descriptors and nondescriptors, as it is in a Text Extender thesaurus, the synonymous relation is unidirectional between two terms that have the same or similar meaning. In a class of synonyms, one of the terms is designated as the descriptor. The other terms are then called nondescriptors. Refer to Ngram thesaurus relations for a definition of the synonymous relation when no distinction is made between descriptors and nondescriptors.

The common designation USE leads from a given nondescriptor to its descriptor. The common designation USE FOR (UF) leads from the descriptor to each nondescriptor. Examples are:

feline USE cat
lawyer UF advocate

Hierarchical

A hierarchical relation is a unidirectional relation between descriptors that states that one of the terms is more specific, or less general, than the other. This difference leads to representation of the terms as a hierarchy, where one term represents a class, and subordinate terms refer to its member parts. For example, the term "mouse" belongs to the class "rodent".

BROADER TERM and NARROWER TERM are hierarchical relations. For example:

car NT limousine
horse BT equine

Other

A relation of type other is the most general. It represents an association that does not easily fall into one of the other categories. A relation of type other can be bidirectional or unidirectional, has no depth restriction, and can exist between descriptors and nondescriptors.

This relation is often used for new terms in a thesaurus until the proper relation with other terms can be determined.

You can also define your own bidirectional synonymous relation. Use the relation type associative for a synonymous relation between descriptors, or the relation type other for a synonymous relation between arbitrary terms.
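The directionality rules described for the four relation types can be summarized in a small sketch. It models only what this section states (associative is bidirectional, synonymous and hierarchical are unidirectional, other can be either) and says nothing about how Text Extender stores relations internally. The function links_for and its bidirectional_other parameter are hypothetical.

```python
# Direction per relation type, as described in this section. For the type
# "other" the direction is a per-relation choice, so the caller passes it in.
BIDIRECTIONAL_TYPES = {"associative"}
UNIDIRECTIONAL_TYPES = {"synonymous", "hierarchical"}


def links_for(source, target, rltype, bidirectional_other=False):
    """Return the directed links implied by one declared relation."""
    rltype = rltype.lower()
    if rltype in BIDIRECTIONAL_TYPES:
        both_ways = True
    elif rltype in UNIDIRECTIONAL_TYPES:
        both_ways = False
    elif rltype == "other":
        both_ways = bidirectional_other
    else:
        raise ValueError(f"unknown relation type: {rltype}")
    links = [(source, target)]
    if both_ways:
        links.append((target, source))
    return links


print(links_for("dog", "security", "associative"))
print(links_for("car", "limousine", "hierarchical"))
```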

Ngram thesaurus relations

An Ngram thesaurus supports two relation types: associative and synonymous.

There are two predefined relations, each of them based on one of these two types. You can define your own relations based on the type associative. For details, see Creating an Ngram thesaurus.

Associative

An associative relation is a bidirectional relation between two terms that do not express the same concept but relate to each other. The predefined relation RELATED_TO and all user-defined relations are based on this relation type.

Examples are:

tennis RELATED_TO racket
German RELATED_TO sausage

Synonymous

A synonymous relation is a bidirectional relation between two terms that have the same or similar meaning and can be used as alternatives for each other. This relation can, for example, be used for a term and its abbreviation. The predefined relation SYNONYM_OF is the only relation based on this type.

Examples are:

spot SYNONYM_OF stain
US SYNONYM_OF United States

Creating a thesaurus

See also Creating an Ngram thesaurus.

A sample English thesaurus compiler input file, desthes.sgm, is stored in the samples directory of the installation path. On OS/2 and Windows systems, the samples directory is:

drive:\dmb\db2tx\samples

On AIX, HP-UX, and SUN-Solaris systems, the directory is:

DB2TX_INSTOWNER /db2tx/samples

A compiled version of this thesaurus and its SGML input file are stored in the dictionary directory:

drive:\dmb\db2tx\dict
or
DB2TX_INSTOWNER /db2tx/dicts

The files belonging to this thesaurus are called desthes.th1, desthes.th2, ..., and desthes.th6.

To create a thesaurus, first define its content in a file. It is recommended that you use a separate directory for each thesaurus that you define. The file can have any extension except th1 to th6, which are used for the thesaurus dictionary. If you use the same directory for an Ngram thesaurus, see Creating an Ngram thesaurus for further excluded file extensions.

Then compile the file by running:

txthesc -f filename -c ccsid

where filename can contain only the characters a-z, A-Z, and 0-9.

Currently, only CCSID 850 is supported.

txthesc produces thesaurus files named filename without its extension, with the extensions th1 to th6, in the directory where the definition file is located. If a thesaurus with the same name already exists there, it is overwritten without warning.
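Because existing thesaurus files are overwritten without warning, it can be useful to check for them before compiling. The following Python sketch derives the six expected file names from the naming rule above and reports those that already exist. It is a helper idea, not a tool shipped with the product; the function names and the file name mythes.sgm are hypothetical.

```python
from pathlib import Path


def thesaurus_files(definition_file):
    """List the th1 to th6 files expected next to the definition file."""
    path = Path(definition_file)
    return [path.with_suffix(f".th{n}") for n in range(1, 7)]


def files_at_risk(definition_file):
    """Return the expected thesaurus files that already exist on disk."""
    return [p for p in thesaurus_files(definition_file) if p.exists()]


# mythes.sgm is a hypothetical definition file name.
for path in thesaurus_files("mythes.sgm"):
    print(path.name)
```

Calling files_at_risk before running the compiler shows which files of an earlier compilation would be replaced.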

Refer to Chapter 10, Syntax of search arguments to find out how to use a thesaurus in a query.

Specify the content of a thesaurus using the Standard Generalized Markup Language (SGML). The following diagram shows the syntax rules to follow when creating a thesaurus.

>>-<thesaurus>--<header>--<thname>--thesaurus-name--</thname>--->
 
              .----------------------------.
              V                            |
>----<rldef>-----| relation-definition |---+---</rldef>--------->
 
                  .------------------------.
                  V                        |
>----</header>-------| thesaurus-entry |---+--</thesaurus>-----><
 
relation-definition
 
|---<rls>--<rlname>--relation-name--</rlname>------------------->
 
>-----<rltype>--+-ASSOCIATIVE--+---</rltype>----</rls>----------|
                +-SYNONYMOUS---+
                +-HIERARCHICAL-+
                '-OTHER--------'
 
thesaurus-entry
 
|---<en>--unique-number--,----+-1-+--<t>--term--</t>------------>
                              '-2-'
 
>-----+--------------------+--</en>-----------------------------|
      '-| related-terms |--'
 
related-terms
 
|---<r>--------------------------------------------------------->
 
      .------------------------------------------------------.
      |                      .--------------------.          |
      V                      V                    |          |
>--------<l>--relation-name-----<t>--term--</t>---+---</l>---+-->
 
>-----</r>------------------------------------------------------|
 

relation-name can contain only the characters a-z, A-Z, and 0-9.

Figure 6 shows the SGML definition of the thesaurus shown in Figure 5.

Figure 6. The definition of a simple thesaurus

<thesaurus>
<header>
<thname>thesc example thesaurus</thname>
<rldef>
 
<rls>
<rlname>Related Term</rlname>
<rltype>associative</rltype>
</rls>
 
<rls>
<rlname>Narrower Term</rlname>
<rltype>hierarchical</rltype>
</rls>
 
<rls>
<rlname>Instance</rlname>
<rltype>hierarchical</rltype>
</rls>
 
<rls>
<rlname>Synonym</rlname>
<rltype>synonymous</rltype>
</rls>
</rldef>
</header>
 
<en> 2, 1
<t>database management system</t>
<r>
  <l>Narrower Term
  <t>object oriented database management system</t>
  <t>relational database management system</t>
  </l>
 
  <l>Synonym
  <t>DBMS</t>
  </l>
 
  <l>Related Term
  <t>document management system</t>
  </l>
 
  <l>Instance
  <t>database</t>
  </l>
</r>
</en>
<en> 5, 1
<t>relational database management system</t>
<r>
  <l>Narrower Term
  <t>object relational database management system</t>
  </l>
</r>
</en>
 
<en> 3, 1
<t>object relational database management system</t>
<r>
  <l>Instance
  <t>DB2 Universal Database</t>
  </l>
</r>
</en>
 
<en> 6, 1
<t>object oriented database management system</t>
<r>
  <l>Narrower Term
  <t>object relational database management system</t>
  </l>
</r>
</en>
 
<en> 4, 1
<t>document management system</t>
<r>
  <l>Synonym
  <t>library</t>
  </l>
</r>
</en>
 
<en> 9, 1
<t>library</t>
</en>
 
<en> 10, 1
<t>DB2 Universal Database</t>
</en>
 
<en> 11, 1
<t>database</t>
</en>
</thesaurus>
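A definition file in this layout can also be generated from application data. The following Python sketch writes the tags in the order given by the syntax diagram and checks a few of its rules (unique entry numbers, a value of 1 or 2 after the number, relations declared before use). The function render_thesaurus and its argument layout are hypothetical, and the sketch passes the 1-or-2 value through as given.

```python
VALID_TYPES = {"associative", "synonymous", "hierarchical", "other"}


def render_thesaurus(name, relations, entries):
    """Render a thesaurus definition in the layout of Figure 6.

    relations: list of (relation_name, relation_type) pairs.
    entries:   list of (number, flag, term, related) tuples, where flag is
               the 1-or-2 value of the syntax diagram and related maps a
               relation name to a list of target terms.
    """
    lines = ["<thesaurus>", "<header>", f"<thname>{name}</thname>", "<rldef>"]
    declared = set()
    for rlname, rltype in relations:
        if rltype.lower() not in VALID_TYPES:
            raise ValueError(f"unknown relation type: {rltype}")
        declared.add(rlname)
        lines += ["<rls>", f"<rlname>{rlname}</rlname>",
                  f"<rltype>{rltype}</rltype>", "</rls>"]
    lines += ["</rldef>", "</header>"]

    seen_numbers = set()
    for number, flag, term, related in entries:
        if number in seen_numbers:
            raise ValueError(f"entry number {number} is not unique")
        if flag not in (1, 2):
            raise ValueError("the value after the entry number must be 1 or 2")
        seen_numbers.add(number)
        lines += [f"<en> {number}, {flag}", f"<t>{term}</t>"]
        if related:
            lines.append("<r>")
            for rlname, targets in related.items():
                if rlname not in declared:
                    raise ValueError(f"relation not declared: {rlname}")
                if not targets:
                    raise ValueError(f"no target terms for: {rlname}")
                lines.append(f"  <l>{rlname}")
                lines += [f"  <t>{target}</t>" for target in targets]
                lines.append("  </l>")
            lines.append("</r>")
        lines.append("</en>")
    lines.append("</thesaurus>")
    return "\n".join(lines) + "\n"


definition = render_thesaurus(
    "thesc example thesaurus",
    [("Synonym", "synonymous")],
    [
        (4, 1, "document management system", {"Synonym": ["library"]}),
        (9, 1, "library", {}),
    ],
)
print(definition)
```

The example call reproduces two of the entries of Figure 6. The resulting text would still have to be saved to a file and compiled with txthesc.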

Creating an Ngram thesaurus

A sample English Ngram thesaurus compiler input file, desnthes.def, is stored in the dictionary directory of the installation path. On OS/2 and Windows systems, the dictionary directory is:

drive:\dmb\db2tx\dict

On AIX, HP-UX, and SUN-Solaris systems, the dictionary directory is:

DB2TX_INSTOWNER /db2tx/dicts

A compiled version of this sample thesaurus is also stored there. The files belonging to this thesaurus are called desnthes.<extension> with the following extension where n is a digit:

To create an Ngram thesaurus, first define its content in a definition file. You can have several thesauri in the same directory, but it is recommended that you have a separate directory for each thesaurus. The length of the file name without extension must not exceed 8 characters. The extension is optional; if present, it must not exceed 3 characters and should differ from all of the extensions listed above.
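The length limits for the definition file name can be checked before compiling. This Python sketch covers only the two limits stated above (8 characters for the name, 3 for the optional extension). It does not check the reserved extensions, and the function name check_ngram_definition_name is hypothetical.

```python
from pathlib import PurePath

MAX_STEM = 8       # file name without extension
MAX_EXTENSION = 3  # optional extension


def check_ngram_definition_name(file_name):
    """Return a list of problems found in an Ngram definition file name."""
    name = PurePath(file_name).name
    stem, dot, extension = name.rpartition(".")
    if not dot:
        # No period in the name, so there is no extension.
        stem, extension = name, ""
    problems = []
    if len(stem) > MAX_STEM:
        problems.append(f"name '{stem}' is longer than {MAX_STEM} characters")
    if len(extension) > MAX_EXTENSION:
        problems.append(
            f"extension '{extension}' is longer than {MAX_EXTENSION} characters"
        )
    return problems


print(check_ngram_definition_name("desnthes.def"))
print(check_ngram_definition_name("toolongname.defs"))
```

An empty list means that the name passes both length checks.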

If you use the same directory for other Text Extender thesauri, do not use the extensions listed under Creating a thesaurus.

Then compile the file by running:

txthesn -f definition-file-name -ccsid code-page

Here is a list of the code pages supported by an Ngram thesaurus:

Code page  Platforms                                Language
932        AIX, OS/2                                Japanese
942        OS/2                                     Japanese
943        OS/2, Windows                            Japanese
949        OS/2                                     Korean
950        AIX, HP-UX, OS/2, SUN-Solaris, Windows   Traditional Chinese
970        AIX, HP-UX, SUN-Solaris                  Korean
1381       OS/2, Windows                            Simplified Chinese
1363       Windows                                  Korean
1383       AIX, HP-UX, SUN-Solaris                  Simplified Chinese
850        AIX, OS/2                                Latin-1
1252       Windows                                  Latin-1

txthesn produces thesaurus files having the same name as definition-file-name with the extensions mentioned above. The files are created in the same directory as the definition file. If a thesaurus with the same name already exists in this directory, it is overwritten without warning.

The CCSIDs supported are listed in the description of EhwCreateIndex in Text Search Engine: Programming Interfaces.

Specify the content of the thesaurus using the following syntax diagram:

   .---------------------------------.
   V                                 |
>>---+-| group-definition-block |-+--+-------------------------><
     '-| comment-line |-----------'
 
group-definition-block
 
|---|  block-starting-line |--\n-------------------------------->
 
      .-----------------------------------------------.
      V                                               |
>---------+-|   member-term-definition |-----+---\n---+---------|
          '-|   associated-term-definition |-'
 
block-starting-line
 
|---:WORDS----+-----------------------+-------------------------|
              '-|  member-relation |--'
 
member-relation
 
|---+-:SYNONYM----------------+---------------------------------|
    +-:RELATED----------------+
    '-:RELATED--(--number--)--'
 
member-term-definition
 
|---member-term-------------------------------------------------|
 
associated-term-definition
 
|---+-.RELATED_TO----------------+--associated-term-------------|
    +-.SYNONYM_OF----------------+
    '-.RELATED_TO--(--number--)--'
 
comment-line
 
|---#--any-comment----------------------------------------------|
 

Each member term must be written on a single line. Each associated term must be preceded by the relation name. If the member terms are related to each other, specify a member relation.

The length of member terms and associated terms is restricted to 164 characters. Single-byte and double-byte forms of the same letter are regarded as the same. Uppercase and lowercase letters are not distinguished. A term can contain blank characters, but the single-byte period "." and the single-byte colon ":" cannot be used.

The user-defined relations are all based on the associative type. They are identified by unique numbers between 1 and 128.

If an application wants to use symbolic names for its thesaurus relations instead of the relation name and number, it must manage the mapping itself. For example, if the relation OPPOSITE_OF was defined as RELATED_TO(1), the application has to map this name to the internal relation name RELATED_TO(1). Refer to Chapter 10, Syntax of search arguments to find out how to use thesaurus expansion in a query.
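One way for an application to keep such a mapping is a small lookup table. The following Python sketch is an illustration under that assumption and not an interface of the product; the class RelationNames and its methods are hypothetical. It enforces the range of 1 to 128 stated above for user-defined relations.

```python
class RelationNames:
    """Application-side mapping from symbolic names to RELATED_TO(n)."""

    def __init__(self):
        self._numbers = {}

    def define(self, symbolic_name, number):
        """Register a symbolic name for a user-defined relation number."""
        if not 1 <= number <= 128:
            raise ValueError("relation number must be between 1 and 128")
        if symbolic_name in self._numbers:
            raise ValueError(f"name already defined: {symbolic_name}")
        if number in self._numbers.values():
            raise ValueError(f"relation number {number} is already in use")
        self._numbers[symbolic_name] = number

    def internal_name(self, symbolic_name):
        """Return the internal relation name for a symbolic name."""
        return f"RELATED_TO({self._numbers[symbolic_name]})"


names = RelationNames()
names.define("OPPOSITE_OF", 1)
print(names.internal_name("OPPOSITE_OF"))
```

The application would insert the returned internal name wherever it builds a thesaurus definition or a search argument.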

