The character definitions for a collection affect how words are indexed and stored in the word index. The character definitions used by the Verity engine are located in the CTYPE table in the Verity locale. Many accented alphabetic characters are defined for each locale, so a style.lex file may not be required to index and search words that contain these characters. For nonalphanumeric characters that the locale does not specify, use a style.lex file so that the engine interprets these characters and words containing them appear in the word index.
If a style.lex file is present in the style directory, the word index is built according to the specifications in that file. For example, if the character "/" is specified as a valid character, the word index includes that character and users can search for words such as "OS/2".
The style.lex file is only for 8-bit locales.
style.lex File Syntax Reference
Entries in the style.lex file identify the patterns that the Verity engine interprets as valid characters in words, punctuation such as newlines and white space, and characters used to perform retrievals, such as end-of-sentence and end-of-paragraph delimiters. The application developer creates a style.lex file only when it is necessary to override the system defaults.
A sample style.lex file is shown below. This file represents the closest approximation to the default style.lex file used by the Verity engine. The internal implementation is platform dependent, which affects how accurately the extended ASCII characters are handled. The style.lex file below handles 7-bit characters only.
- # style.lex -- 7-bit only version of internal hardwired lexer
- $control: 1
- lex:
- {
- define: ALNUM "[A-Za-z0-9]"
- define: W "[ \t\f\r\v]"
- token: WORD "{ALNUM}+(\\.{ALNUM}+)*"
- token: EOS "[.?!][.?! \t]*"
- token: EOP "{W}*\n({W}*\n)+"
- token: NEWLINE "{W}*\n"
- token: WHITE "{W}+"
- token: PUNCT "[^A-Za-z0-9 \t\f\r\v.?!]+"
- }
- $$
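To make the sample file easier to experiment with, the following Python sketch mirrors its define and token statements. It is an approximation, not the Verity lexer: it assumes Python's re syntax is close enough to the engine's pattern syntax for these patterns, writes the doubled backslash of the control file as a single backslash in a raw string, and uses hypothetical helper names (expand, tokenize).

```python
import re

# Macros taken from the define statements in the sample file.
MACROS = {
    "ALNUM": r"[A-Za-z0-9]",
    "W": r"[ \t\f\r\v]",
}

# Token patterns in the same order as the sample file.
# Assumption: "\\." in the control file corresponds to "\." here.
TOKENS = [
    ("WORD", r"{ALNUM}+(\.{ALNUM}+)*"),
    ("EOS", r"[.?!][.?! \t]*"),
    ("EOP", r"{W}*\n({W}*\n)+"),
    ("NEWLINE", r"{W}*\n"),
    ("WHITE", r"{W}+"),
    ("PUNCT", r"[^A-Za-z0-9 \t\f\r\v.?!]+"),
]


def expand(pattern):
    """Replace each {NAME} macro reference with its definition."""
    return re.sub(r"\{(\w+)\}", lambda m: MACROS[m.group(1)], pattern)


COMPILED = [(name, re.compile(expand(p))) for name, p in TOKENS]


def tokenize(text):
    """Split text into (token, string) pairs using the sample patterns."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # A strictly longer match wins; on a tie the earlier statement stays.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no pattern matches at offset {pos}")
        out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


if __name__ == "__main__":
    for token, value in tokenize("Version 3.5 is out. Next\n"):
        print(token, repr(value))
```

Running the sketch on a short string shows how "3.5" stays a single WORD because of the optional dot group, while a period followed by a space is reported as EOS.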
The first statements in a style.lex file must be the following:
- $control:1
- lex:
Within the lex statement, two types of keyword statements can be specified: define statements and token statements. The define statements specify macros used in the style.lex file. The token statements define words, paragraphs, white space, and so on. In the sample style.lex file above, the define statements define the allowed letters and numbers and the valid white space characters. The token statements define the words, ends of sentences, paragraphs, and so on that occur in the documents contained in the collection.
In the style.lex file, the following symbols are used to create the token definitions.
define Statements
The define statements in the style.lex file specify macros to be used within the token statements that follow. When a macro from a define statement is used in a token statement, the macro name is enclosed in braces {}. Use of define statements is optional.
token Statements
Each token statement contains a flag identifying a token such as end-of-sentence, end-of-paragraph, or white space. The default patterns used to match these tokens appear in the various token statements. Typical tokens are listed below.
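Token statements that define the same token in the style.lex file are ORed. For example, the following two statements in your style.lex file define a WORD as either a run of letters or a run of digits: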
- token: WORD "[A-Za-z]+"
- token: WORD "[0-9]+"
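The effect of ORing the two WORD statements can be sketched in Python by joining the patterns with a regular expression alternation. This is only an illustration under the assumption that Python's re syntax behaves like the engine's for these simple patterns; the helper name is_word is hypothetical.

```python
import re

# The two WORD statements from the text, in file order.
WORD_PATTERNS = [r"[A-Za-z]+", r"[0-9]+"]

# ORing the statements is modeled here as a regex alternation.
WORD = re.compile("|".join(f"(?:{p})" for p in WORD_PATTERNS))


def is_word(s):
    """Return True if the whole string matches one of the WORD patterns."""
    return WORD.fullmatch(s) is not None


print(is_word("index"))  # True: matches the letters pattern
print(is_word("2024"))   # True: matches the digits pattern
print(is_word("OS2"))    # False: neither pattern covers a letter/digit mix
```

A mixed string such as "OS2" is not a single WORD under these two statements, because each pattern accepts only letters or only digits.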
The order of the token statements in the style.lex file determines which token the lexical analyzer ("lexer") returns. The lexer returns the longest string that matches any pattern specified in the style.lex file, together with the token associated with that pattern. If that string matches more than one pattern, the token that appears earliest in the style.lex file is returned.
For example, if the following statements appeared in the order below:
- token: PUNCT "."
- token: WORD "[A-Z]+"
then an input consisting of a multi-letter uppercase string, white space, and the single letter "Z" would be lexed as follows: the multi-letter string would be returned as a WORD token, the white space would be returned as a PUNCT token, and the "Z" would be returned as a PUNCT token. The "Z" is not returned as a WORD token because it matches the patterns in both token statements at the same length, so the Verity engine selects the first matching pattern, in this case PUNCT.
As shown, a token: WORD statement typically contains a regular expression. If you specify a regular expression that contains a backslash (\), you must enter two backslashes so that the Verity engine interprets the additional backslash as a literal. The double-backslash entry is not needed when specifying a predefined character, such as the \t and \n used in the sample file. This backslash usage is consistent across all Verity control files.
A style.lex file must specify token statements for all the tokens you want the Verity engine to match. Default values for individual token statements are not provided.
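The longest-match rule and the file-order tie-break described above can be sketched in Python using the PUNCT and WORD statements from the example. The input string "ABC Z" and the function name lex are hypothetical, and Python's re module only approximates the engine's pattern syntax.

```python
import re

# Statements in file order, as in the example: PUNCT first, then WORD.
PUNCT_FIRST = [("PUNCT", re.compile(r".")), ("WORD", re.compile(r"[A-Z]+"))]
# The same statements with the order reversed, for comparison.
WORD_FIRST = list(reversed(PUNCT_FIRST))


def lex(text, statements):
    """Return (token, string) pairs: longest match wins, earliest on a tie."""
    pos, out = 0, []
    while pos < len(text):
        best = None  # (token name, end offset) of the best match so far
        for name, rx in statements:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces an earlier statement.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no pattern matches at offset {pos}")
        out.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return out


print(lex("ABC Z", PUNCT_FIRST))
# [('WORD', 'ABC'), ('PUNCT', ' '), ('PUNCT', 'Z')]
print(lex("ABC Z", WORD_FIRST))
# [('WORD', 'ABC'), ('PUNCT', ' '), ('WORD', 'Z')]
```

With PUNCT listed first, the lone "Z" ties at one character and goes to PUNCT. Reversing the statements sends it to WORD, which is why the order of token statements matters.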
The style.lex file, as documented in this manual, does not index 8-bit characters, even though they are valid in English documents. In addition, the character set for the style.lex file is the internal character set, even if you set everything else in the application to a different code page (8859, for example), unless you add the $charmap option to the style.lex file, as shown here:
- $control: 1
- $charmap: 8859
- lex:
- {
- [...]
- }
- $$
The $charmap construct specifies that the contents should be mapped to the internal character set before being used for lexing.
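The following Python sketch illustrates why a 7-bit character class such as the sample file's ALNUM definition leaves accented characters out of words. It does not model $charmap or the engine's internal character set. The extended class is a hypothetical illustration covering the Latin-1 letter ranges, not a documented Verity setting.

```python
import re

# 7-bit class, as in the sample file's ALNUM definition.
WORD_7BIT = re.compile(r"[A-Za-z0-9]+")

# Hypothetical extended class that also accepts Latin-1 accented letters
# (U+00C0 to U+00FF, skipping the multiplication and division signs).
WORD_LATIN1 = re.compile(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")

text = "caf\u00e9 au lait"  # "café au lait"

print(WORD_7BIT.findall(text))    # ['caf', 'au', 'lait']
print(WORD_LATIN1.findall(text))  # ['café', 'au', 'lait']
```

With the 7-bit class, the accented letter splits the word, so only "caf" would be treated as a word. Including the accented range keeps the word intact.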