Operators for Regular Expressions


The following table lists the regular expression operators available in the Verity engine and the pattern that each operator matches.

Operator    Matched Pattern
x           The character "x".
\x          The character "x", even if x is an operator. You would use this, for example, to search for the $ character, which is an operator. (The escape sequences listed in the following rows, such as \n and \t, are exceptions.)
\b          A backspace.
\f          A form-feed.
\n          A newline.
\r          A carriage-return.
\t          A tab.
\v          A vertical tab.
[xy]        The character "x" or "y".
[x-z]       The character "x", "y", or "z"; this regular expression searches for a range of characters. For example, the expression [a-z] matches any lowercase letter; [0-9] matches any digit.
.           Any character but newline.
[^z]        Any character but "z".
^x          An "x" at the beginning of a line.
x$          An "x" at the end of a line.
x?          0 or 1 occurrence of "x".
x*          0 or more occurrences of "x".
x+          1 or more occurrences of "x".
x|y         An "x" or a "y".
(x|y)z      "xz" or "yz"; the parentheses are used for grouping.
{symbol}    The translation of a symbol defined earlier in the file.
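
As an informal check of the operators in the table, the following Python sketch runs several of them through Python's re module. Python's re module is only a stand-in here and is not the Verity engine; it agrees with the table for the operators shown but differs elsewhere (for example, \b in Python is a word boundary rather than a backspace). The names CASES and check_cases are hypothetical and introduced only for illustration.

```python
import re

# (pattern, text, expected) triples. Python's re module is a stand-in,
# not the Verity engine; this is an assumption of the sketch.
CASES = [
    (r"\$", "cost: $5", True),   # escaped operator matches a literal "$"
    (r"[x-z]", "y", True),       # range of characters
    (r"[^z]", "z", False),       # any character but "z"
    (r"^x", "xy", True),         # "x" at the beginning of a line
    (r"^x", "yx", False),
    (r"x$", "yx", True),         # "x" at the end of a line
    (r"ax?b", "ab", True),       # 0 or 1 occurrence of "x"
    (r"ax*b", "axxxb", True),    # 0 or more occurrences of "x"
    (r"ax+b", "ab", False),      # 1 or more occurrences of "x"
    (r"(x|y)z", "yz", True),     # grouping with alternation
    (r".", "\n", False),         # "." does not match a newline
]


def check_cases(cases):
    """Return (pattern, text, matched) for every case."""
    return [(p, t, re.search(p, t) is not None) for p, t, _ in cases]


results = check_cases(CASES)
for pattern, text, matched in results:
    print(f"{pattern!r:12} on {text!r:12} -> {matched}")
```

Each expected value in CASES restates the corresponding table row, so a mismatch would point to a difference between the table and the stand-in engine.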

Symbols

You can define symbols to avoid redefining expressions to search for common patterns. Symbol definitions should appear at the top of a file, before any regular expressions that use them. To define symbols, use the define statement with the following syntax.

define: symbol "regular expression"

Element                 Description
symbol                  The word replaced by the quoted pattern; the symbol name can contain any alphanumeric characters.
"regular expression"    A regular expression that the Verity engine uses for matching when it encounters the defined symbol. The double quotes allow the pattern to contain white space.

Symbol Examples

The define statement assigns a name to an expression so that the name can be reused as a symbol in regular expressions. For example, you might use the following definitions at the beginning of a style.lex file:

define: D "[0-9]"

The preceding statement defines the symbol D, which represents any digit.

define: SPACE "[ \t]"

The preceding statement defines the symbol SPACE, which represents either a space or a tab.

You can also use previously defined symbols in other symbol definitions. For example, you might first define a symbol for any digit as follows:

define: D "[0-9]"

You could then use the symbol {D} in other symbol definitions, as in the following definition of a YEAR symbol:

define: YEAR "{D}{D}{D}{D}"

Symbols in regular expressions must be enclosed in braces.
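
To make the substitution concrete, the following Python sketch reads define statements and expands {symbol} references, including symbols that refer to earlier symbols. It is an illustration under stated assumptions, not the Verity engine's implementation: the function names load_symbols and expand are hypothetical, and wrapping each expansion in a non-capturing group is a choice made here so that an expanded symbol behaves as a single unit.

```python
import re

# Matches lines of the form: define: symbol "regular expression"
DEFINE_RE = re.compile(r'^define:\s*([A-Za-z0-9]+)\s+"(.*)"\s*$')
# Matches a symbol reference such as {D}
SYMBOL_RE = re.compile(r"\{([A-Za-z0-9]+)\}")


def expand(pattern, symbols):
    """Replace each {NAME} with its definition (hypothetical helper)."""
    def substitute(match):
        name = match.group(1)
        if name not in symbols:
            raise KeyError(f"undefined symbol: {name}")
        # Assumption: group the expansion so it acts as one unit.
        return "(?:" + symbols[name] + ")"
    return SYMBOL_RE.sub(substitute, pattern)


def load_symbols(lines):
    """Collect definitions in order; later ones may use earlier ones."""
    symbols = {}
    for line in lines:
        m = DEFINE_RE.match(line.strip())
        if m:
            name, body = m.groups()
            symbols[name] = expand(body, symbols)
    return symbols


symbols = load_symbols([
    'define: D "[0-9]"',
    'define: YEAR "{D}{D}{D}{D}"',
])
year_pattern = expand("{YEAR}", symbols)
print(bool(re.fullmatch(year_pattern, "2002")))  # True
print(bool(re.fullmatch(year_pattern, "202")))   # False
```

Because definitions are expanded as they are read, a symbol used before it is defined raises an error in this sketch, which mirrors the rule that definitions appear before the expressions that use them.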

Substrings

Normally, the text returned is the entire string that matches the pattern in the regular expression. The Verity engine includes an extension to regular expression syntax that allows you to identify a string and then select a substring of that string. To define a substring and retrieve only that substring, enclose the substring in angle brackets.

"TITLE:<.*>"

This expression returns the characters that follow the string "TITLE:", excluding the string "TITLE:" itself.

"Volume{SPACE}+<{DIGIT}+>"

This expression returns the one or more digits that follow the string "Volume" and one or more spaces. It assumes that the symbols SPACE and DIGIT have already been defined.
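
The effect of the angle brackets resembles a capture group in other regular expression dialects. The following Python sketch shows that correspondence by rewriting a single <...> pair as a named group and returning only the captured text. The function extract_substring is hypothetical, the sketch assumes at most one unescaped pair of angle brackets, and the symbols are written out by hand as [ \t] and [0-9] because Python's re module does not know the define syntax.

```python
import re


def extract_substring(pattern, text):
    """Return the text matched inside <...>, or the whole match if
    the pattern has no angle brackets (hypothetical helper)."""
    # Replace ">" first so the ">" in "(?P<sub>" is not touched.
    python_pattern = pattern.replace(">", ")", 1).replace("<", "(?P<sub>", 1)
    m = re.search(python_pattern, text)
    if m is None:
        return None
    if "sub" in m.groupdict():
        return m.group("sub")
    return m.group(0)


print(extract_substring("TITLE:<.*>", "TITLE:Annual Report"))
print(extract_substring(r"Volume[ \t]+<[0-9]+>", "Volume  12"))
```

The first call returns the text after "TITLE:", and the second returns only the digits, matching the behavior described for the two expressions above.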

Regular Expression Examples

Some simple examples of regular expressions are presented below.

Example 1

^[0-9]

This expression matches any digit at the beginning of a line.

Example 2

^[0-9]+

This expression matches one or more digits at the beginning of a line.

Example 3

[^0-9]

This expression matches any single character except a digit.

Example 4

"TITLE:.*$"

This expression matches a string beginning with "TITLE:" and followed by any characters until the end of the line.

Example 5

"^Sub(j|ject):.*$"

This expression matches the string "Subj" or "Subject" that occurs at the beginning of a line and is followed by a colon and any other characters until the end of the line.

Example 6

"FIELD:\t"

This expression matches the string "FIELD:" followed by a tab character.
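
The six examples can be tried against a small sample text. The following Python sketch does so with Python's re module as a stand-in for the Verity engine, which is an assumption; the MULTILINE flag is used so that ^ and $ apply to each line, as the descriptions above imply. The sample text and the names TEXT, EXAMPLES, and first_match are hypothetical.

```python
import re

# Hypothetical sample input with five lines.
TEXT = "TITLE: Report\nSubject: Budget\n42 items\nFIELD:\tvalue\nSubj: Misc"

# Example number -> pattern, written without the surrounding quotes.
EXAMPLES = {
    1: r"^[0-9]",
    2: r"^[0-9]+",
    3: r"[^0-9]",
    4: r"TITLE:.*$",
    5: r"^Sub(j|ject):.*$",
    6: r"FIELD:\t",
}


def first_match(pattern, text):
    """Return the first matching string, or None if nothing matches."""
    m = re.search(pattern, text, re.MULTILINE)
    return m.group(0) if m else None


matches = {n: first_match(p, TEXT) for n, p in EXAMPLES.items()}
for number, found in matches.items():
    print(f"Example {number}: {found!r}")
```

With this sample text, Example 1 finds the single digit "4" while Example 2 finds "42", which shows the difference that the + operator makes.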





Copyright © 2002, Verity, Inc. All rights reserved.