Using the style.lex File


The character definitions for a collection affect how words are indexed and stored in the word index. The character definitions used by the Verity engine are located in the CTYPE table in the Verity locale. Many accented alphabetic characters are defined for each locale, so a style.lex file may not be required to index and search words with these characters. For nonalphanumeric characters not defined in the locale, use a style.lex file so that the engine treats them as valid characters and words containing them appear in the word index.

For example, if users want to enter nonalphanumeric characters (such as &, /, and ") as search criteria and these characters are not defined in the collection configuration, you can specify them in a style.lex file. If a style.lex file is present in the style directory, the word index is built according to the specifications in that file. For example, if the character "/" is specified as a valid character, the word index will include that character and users can search for such words as "OS/2".
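The effect of adding "/" to the set of valid word characters can be illustrated outside the engine. The following is a minimal sketch in Python, not Verity code: it uses Python's re module as a stand-in for the engine's regular-expression dialect (which may differ), and the names DEFAULT_WORD, SLASH_WORD, and index_words are hypothetical.

```python
import re

# Hypothetical WORD patterns written in Python's regex dialect.
# DEFAULT_WORD mirrors the sample WORD token: alphanumerics, optionally
# joined by periods. SLASH_WORD additionally accepts "/" inside a word.
DEFAULT_WORD = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"
SLASH_WORD = r"[A-Za-z0-9/]+(?:\.[A-Za-z0-9/]+)*"


def index_words(text, word_pattern):
    """Return the substrings that the given pattern would treat as words."""
    return re.findall(word_pattern, text)


text = "Install OS/2 version 2.1 today."
default_words = index_words(text, DEFAULT_WORD)
slash_words = index_words(text, SLASH_WORD)

print(default_words)  # "OS" and "2" are separate words
print(slash_words)    # "OS/2" is kept as one word
```

With the first pattern, "OS/2" is split into two words, so a search for "OS/2" as a single word has nothing to match. With the second pattern, it is kept whole.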

NOTE: The style.lex file is only for 8-bit locales.

style.lex File Syntax Reference

Entries in the style.lex file identify the patterns that the Verity engine interprets as valid characters in words, punctuation such as newlines and white space, and characters used to perform retrievals such as end-of-sentence and end-of-paragraph delimiters. The application developer creates a style.lex file only when it is necessary to override the system defaults.

A sample style.lex file is shown below. This file represents the closest approximation to the default style.lex file used by the Verity engine. The internal implementation is platform dependent, which affects the accuracy of the extended ASCII characters. The style.lex file below handles 7-bit characters only.


# style.lex -- 7-bit only version of internal hardwired lexer
$control: 1
lex:
{
define: ALNUM "[A-Za-z0-9]"
define: W "[ \t\f\r\v]"
token: WORD "{ALNUM}+(\\.{ALNUM}+)*"
token: EOS "[.?!][.?! \t]*"
token: EOP "{W}*\n({W}*\n)+"
token: NEWLINE "{W}*\n"
token: WHITE "{W}+"
token: PUNCT "[^A-Za-z0-9 \t\f\r\v.?!]+"
}
$$

General Information

The first noncomment lines in a style.lex file must be the following:


$control: 1
lex:
After the lex statement, two types of keyword statements can be specified: define statements and token statements. The define statements specify macros used in the style.lex file. The token statements define the words, end-of-sentence and end-of-paragraph delimiters, white space, and so on that occur in the documents contained in the collection. In the sample style.lex file above, the define statements define the allowed letters and numbers (ALNUM) and the valid white space characters (W).

In the style.lex file, the following symbols are used to create the token definitions.

Symbol Type      Symbol  Description
Quotes           ""      Specifies the elements that make up the define statement macro or token statement definition.
Brackets         []      Defines a character class.
Braces           {}      Specifies a macro that was created in a define statement.
Plus             +       Specifies one or more occurrences of a combination of characters and/or numbers.
Asterisk         *       Specifies zero or more occurrences of a combination of characters and/or numbers.
Two Backslashes  \\      Specifies an escape sequence. When two backslashes are used, it is to escape the second backslash. For instance, (\\.) is used to specify a floating decimal.
Pound Sign       #       Specifies that the characters following are a comment.

For additional information regarding regular expressions, refer to Appendix E, "Regular Expressions".

define Statements

The define statements used in the style.lex file specify macros to be used within the following token statements. When define statement macros are used in token statements, the macro is enclosed in braces {}. Use of define statements is optional.
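Macro expansion can be pictured as plain text substitution. The sketch below is an illustration in Python under stated assumptions, not the engine's implementation: the function name expand_macros is hypothetical, macro names are assumed to consist of letters and underscores, and the patterns are written with a single backslash, as they would look after the control-file escaping has been resolved.

```python
import re


def expand_macros(pattern, macros):
    """Replace each {NAME} reference with the text of the matching define."""
    def substitute(match):
        # A function replacement is used so backslashes in the macro text
        # are inserted literally rather than treated as escapes.
        return macros[match.group(1)]

    return re.sub(r"\{([A-Za-z_]+)\}", substitute, pattern)


# The two macros from the sample style.lex file.
macros = {
    "ALNUM": "[A-Za-z0-9]",
    "W": r"[ \t\f\r\v]",
}

word = expand_macros(r"{ALNUM}+(\.{ALNUM}+)*", macros)
eop = expand_macros(r"{W}*\n({W}*\n)+", macros)

print(word)
print(eop)
```

After expansion, the WORD pattern contains no braces and reads as an ordinary regular expression, which is why define statements are a convenience rather than a requirement.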

token Statements

Each token statement contains a flag identifying tokens such as end-of-sentence, end-of-paragraph, and white space. The default patterns used to match these tokens appear in the various token statements. Typical tokens are listed below.

Token      Pattern
WORD       A word represented as any string comprised of alphanumeric characters (both uppercase and lowercase) or a floating decimal.
EOS, SENT  An end-of-sentence character represented as either a period (.), question mark (?), or exclamation point (!). EOS and SENT are identical in meaning and are interchangeable.
NEWLINE    A single end-of-line represented as a newline.
EOP, PARA  An end-of-paragraph represented as two or more newlines. EOP and PARA are identical in meaning and are interchangeable.
WHITE      A blank space represented by one or more white spaces.
PUNCT      Any character except a newline.

Statement Interpretation

Two statements of the same type in the style.lex file are ORed. For example, if you had the following two statements in your style.lex:


token: WORD "[A-Za-z]+"
token: WORD "[0-9]+"
then a word would be defined as any string of alphabetical characters or any string of numeric characters.
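The ORed behavior corresponds to joining the two patterns with alternation. The following Python sketch shows the idea; it assumes Python's re module as a stand-in for the engine's regular-expression dialect, and the names word_patterns, combined, and is_word are hypothetical.

```python
import re

# The two WORD patterns from the statements above.
word_patterns = ["[A-Za-z]+", "[0-9]+"]

# ORing the statements is modeled as a single alternation.
combined = "|".join(f"(?:{p})" for p in word_patterns)


def is_word(candidate):
    """Return True if the whole candidate matches either WORD pattern."""
    return re.fullmatch(combined, candidate) is not None


print(is_word("abc"))     # letters only
print(is_word("123"))     # digits only
print(is_word("abc123"))  # mixed: not a single word under these patterns
```

Note that a mixed string such as "abc123" is not a single word under these two statements, because neither pattern alone accepts both letters and digits.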

The order of the token statements in the style.lex file can determine which token the lexical analyzer ("lexer") returns. The lexer returns the longest string that matches any pattern specified in the style.lex file, together with the token associated with that pattern. Order matters only when that string matches more than one pattern: in that case, the token that appears earliest in the style.lex file is returned.

For example, if the following statements appeared in the order below:


token: PUNCT "."
token: WORD "[A-Z]+"
and the text looked like this:

"XY Z"

then the letters "XY" would be returned as a WORD token, the white space would be returned as a PUNCT token, and the "Z" would be returned as a PUNCT token. "XY" is returned as a WORD token because the WORD pattern matches the longer string (two characters, against one for PUNCT). The "Z" is not returned as a WORD token because it matches the patterns in both token statements with the same length, so the Verity engine selects the first matching pattern, in this case PUNCT.
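The selection rule (longest match first, earliest statement on a tie) can be traced with a small simulation. This is a minimal sketch in Python, not the engine's lexer: the function tokenize and the list RULES are hypothetical, Python's re module stands in for the engine's regular-expression dialect, and characters that match no pattern are simply skipped, which is an assumption made for this illustration.

```python
import re

# Token statements in file order: PUNCT first, then WORD.
RULES = [("PUNCT", "."), ("WORD", "[A-Z]+")]


def tokenize(text, rules):
    """Split text into (token, string) pairs using longest-match lexing."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strictly longer matches win; on a tie the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            pos += 1  # assumption: skip characters no pattern matches
            continue
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize("XY Z", RULES)
swapped = tokenize("XY Z", list(reversed(RULES)))

print(tokens)
print(swapped)
```

Reversing the order of the two statements changes only the token for "Z", which then becomes a WORD token. "XY" is a WORD token in both cases because the longer match takes priority over statement order.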

As shown, a token: WORD statement typically contains a regular expression. If you specify a regular expression that contains a backslash (\), you must enter two backslashes so that the Verity engine interprets the additional backslash as a literal. The double backslash is not needed when specifying a predefined character, such as \t or \n in the sample style.lex file above. This backslash usage is consistent across all Verity control files.
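One way to picture the backslash rule is as a single escape-resolution pass over the quoted string before it reaches the regular-expression matcher. The Python sketch below models that reading only; it is an assumption about the behavior described above, not the engine's documented implementation, and the names ESCAPES and resolve_escapes are hypothetical.

```python
# Assumed set of predefined characters, taken from the escapes that
# appear in the sample style.lex file, plus the backslash itself.
ESCAPES = {"t": "\t", "n": "\n", "f": "\f", "r": "\r", "v": "\v", "\\": "\\"}


def resolve_escapes(entry):
    """Resolve one level of backslash escapes in a quoted control-file string."""
    out = []
    i = 0
    while i < len(entry):
        if entry[i] == "\\" and i + 1 < len(entry) and entry[i + 1] in ESCAPES:
            out.append(ESCAPES[entry[i + 1]])
            i += 2
        else:
            out.append(entry[i])
            i += 1
    return "".join(out)


# As typed in style.lex: two backslashes before the period.
word_entry = r"{ALNUM}+(\\.{ALNUM}+)*"
# As typed in style.lex: a single backslash for the predefined tab character.
white_entry = r"[ \t]"

print(resolve_escapes(word_entry))         # one backslash remains for the regex
print(repr(resolve_escapes(white_entry)))  # contains a real tab character
```

Under this model, the doubled backslash leaves a single backslash for the regular expression, so the period is matched literally, while a predefined character such as \t is converted directly and needs no doubling.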

A style.lex file must specify token statements for all the tokens you want the Verity engine to match. Note that default values for individual token statements are not provided.

Character Mapping

The default style.lex file, as documented in this manual, does not index 8-bit characters, even though they are valid in English documents. In addition, the character set for the style.lex file is the internal character set, even if you set everything else in the application to a different code page (8859, for example). To have the file read in a different code page, add the $charmap option to the style.lex file, as shown here:


$control: 1
$charmap: 8859
lex:
{
[...]
}
$$
The $charmap construct specifies that the contents should be mapped to the internal character set before being used for lexing.





Copyright © 2002, Verity, Inc. All rights reserved.