style.tde Syntax


This section explains the style.tde syntax used to supply field extraction rules.

This section first presents a template of the style.tde syntax, then describes each style.tde statement in alphabetical order. A statement is a keyword that appears at the beginning of a line and is always immediately followed by a colon (:).

Syntax Template

The following syntax template includes the style.tde syntax relevant to mkvdk for extracting field values.


$control: 1
tde:
{
pre-process:
{
relative-path: yes|no
}
{
datamap:
/docsep = "pattern"
/filter = "filter_name"
/system = "system_call"
/charmap = charmap_code
{
define: pattern_name "pattern"
...
field: field_name FILENAME|TIME|FILETIME|
FILESIZE|FILEPATH|PATTERN|LINE num|"pattern"
/required = yes|no
/which = [1|###|LAST|ALL]
/string-before = "string"
/string-between = "string"
/string-after = "string"
/default = "field_value"
/alsowrite = [field_name|"field_names"]
...
dispatch: field_name
/required = yes|no
/start-line = "string"
/start-pattern = "string"
/end-line = "string"
/end-pattern = "string"
/inclusive = yes|no
...
}
}
}
$$

$control

The $control statement identifies the file as a Verity control file. It must be the first non-comment line in the style.tde file.

Syntax

$control: 1

datamap

The datamap section of the style.tde file defines the document body text and the rules for populating field values from documents. The document body text is used to create a collection's full-word index. Field population rules populate fields in the collection's documents table, as defined by the style.ddd and style.ufl files.

The define, dispatch, and field statements may appear as children of datamap. Each is described later in this section.

Syntax


datamap:
/docsep = "pattern"
/filter = "filter_name"
/system = "system_call"
/charmap = charmap_code
{
datamapping
}
Element
Description
/docsep
This optional modifier identifies the characters that separate documents within a file. If a /docsep modifier is not specified, the default separator is the end-of-file. If you have multiple documents in a file, you must define a dispatch field to avoid indexing the entire file for each document within the file. This modifier can be used only if your documents are in plain text (ASCII) format.
/filter="universal"
This optional modifier specifies that the universal filter is used to read documents during indexing and to present them for viewing. Use it only when you want the engine to convert the format of documents before parsing them. If this modifier is used, a style.dft file with the same /filter modifier is also required.
/system
This optional modifier allows you to specify your own filter. Two substitution variables are available: $filename and $$. The variable $filename represents the input text file name, and $$ represents the output temporary file name to be parsed by the Verity engine.
/charmap
This modifier specifies the character map used to print characters to the screen when a language other than English is used. Note that this modifier is required if a language other than English is used. The following character map codes are available:
1252 for code page 1252;
850 for IBM code page 850;
8859 for ISO-8859;
mac for Macintosh systems

datamapping
This represents a set of statements that define the document body text and the rules for extracting field values from documents. The document body text is used to create the full-word index. Extracted field values may be stored in a collection. To store extracted field values in a collection, the field(s) must be defined in the style.ufl file.
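
As a rough sketch of how these pieces fit together, the fragment below treats one plain-text file as several documents and indexes each one separately. The separator string ENDOFDOC is a hypothetical choice made for illustration, and DOC is assumed to be the dispatch field defined in style.ddd or style.ufl.

```text
# Hypothetical datamap for a plain-text file that holds several documents
datamap:
  /docsep = "ENDOFDOC"
{
  # The dispatch field keeps each document from being indexed as the whole file
  dispatch: DOC
}
```

The fragment leaves out /filter and /system because /docsep applies only to plain-text documents.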

define

The define statement is a child of the datamap statement. It assigns a name to a regular expression. Use this name to represent the expression elsewhere in the datamap section of the style.tde file.

Syntax

define: pattern_name "pattern"

Element
Description
pattern_name
The name assigned to a pattern. The name specified can be up to 128 characters long, and can consist of alphanumeric characters, underscores, and hyphens.
pattern
A regular expression describing a character or set of characters. For more information, see Appendix E, "Regular Expressions." The pattern specified must be enclosed in quotes.
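
For instance, a define statement might name a title pattern so that a field statement can refer to it by name. This is a minimal sketch: the names title_pat and TITLE are hypothetical, and the pattern reuses the form shown for the PATTERN keyword in the field statement description.

```text
# title_pat is a hypothetical pattern name
define: title_pat "Title :<.*>"
# The curly braces mark title_pat as a name assigned by a define statement
field: TITLE PATTERN "{title_pat}"
```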

dispatch

The dispatch statement is a child of the datamap statement. It supplies the rules for populating the dispatch field defined in the style.ddd or style.ufl file. It identifies the document body text to include in the full-word index. If there is no style.dft file, the document body text, as specified by the dispatch statement, is displayed for viewing.

The start and end of the document body can be identified by a line number or a pattern written as a regular expression.

Syntax


dispatch: field_name
/required = yes|no
/start-line = "string"
/start-pattern = "string"
/end-line = "string"
/end-pattern = "string"
/inclusive = yes|no
Element
Description
field_name
The name of the dispatch field as defined in the collection's style.ddd or style.ufl file. The document dispatch field may be assigned a field name other than DOC, but the field name must match the field name specified in the style.dft file.
/required
This optional modifier identifies whether the field is required in order for the Verity engine to include the document in the collection. If you specify yes, the Verity engine ignores the document if the field is not found. The default is no.
/start-line
This optional modifier identifies the line of the document on which the dispatch field begins. If neither /start-line nor /start-pattern is given, the field begins on line 1.
/start-pattern
This optional modifier identifies the pattern, represented by a regular expression, with which the dispatch field begins. If neither /start-line nor /start-pattern is given, the field begins on line 1.
/end-line
This optional modifier identifies the line of the document on which the dispatch field ends. If neither /end-line nor /end-pattern is given, the dispatch field ends at the end of the file. A dollar sign ($) can be used to signify the end of the file.
/end-pattern
This optional modifier identifies the pattern, represented by a regular expression, that occurs at the end of every document. If neither /end-line nor /end-pattern is given, the document ends at the end of the file.
/inclusive
This optional modifier identifies whether a specified start pattern and end pattern are to be included in the dispatch field. By default, these patterns are not included. If you specify yes, both patterns are included in the dispatch field.
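
A line-based dispatch rule might look like the following sketch. It assumes each file begins with a one-line header that should stay out of the index, so the body starts on line 2 and runs to the end of the file. The line values are quoted because the syntax above shows them as strings. The field name DOC is assumed to match the dispatch field in the collection's style files.

```text
# Hypothetical rule: skip a one-line header and index the rest of the file
dispatch: DOC
  /required = yes
  /start-line = "2"
  /end-line = "$"
```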

field

The field statement is a child of the datamap statement. It identifies the rules the Verity engine follows to parse data for a specified field name. Field values can be stored in collections. To store extracted field values in a collection, the field(s) must be defined in the style.ufl file. A field statement must be present for each field for which you want to store extracted values.

Syntax


field: field_name {FILENAME|FILEPATH|FILETIME|FILESIZE|LINE|PATTERN|TIME} num|"pattern"
/required = yes|no
/which = [1|###|LAST|ALL]
/string-before = "string"
/string-between = "string"
/string-after = "string"
/default = "field_value"
/alsowrite = [field_name|"field_names"]
Element
Description
field_name
The name of a field for which you want to extract values from your documents. If this name corresponds to a name in the style.ufl file, the Verity engine stores the extracted field values in the collection. Field names are case-insensitive and may contain alphanumeric characters, hyphens (-), and underscores (_). Field names cannot contain blank spaces.
FILENAME
The name of the source file containing the document, as in this example:
field: DOC_FILENAME FILENAME

FILEPATH
The fully-qualified pathname where the source file containing the document is located, as in this example:
field: DOC_PATHNAME FILEPATH

FILETIME
The time when the source file containing the document was last edited, as in this example:
field: DOC_FILETIME FILETIME

FILESIZE
The size, in bytes, of the source file that contains the document, as in this example:
field: DOC_FILESIZE FILESIZE

LINE num
Assigns the text at that line in the document to the specified field, as in this example:
field: DOC_LINE LINE 3

PATTERN "pattern"
The flag that precedes a regular expression that the Verity engine matches. The specified regular expression must appear in quotes, as in this example:
field: TITLE PATTERN "Title :<.*>"

PATTERN "{pattern_name}"
The flag that precedes a regular expression that the Verity engine matches. The pattern_name variable represents a macro that can be substituted for a long regular expression. The macro must be surrounded by curly braces and must be specified in a define statement. An example is provided at the end of this chapter.
TIME
The time when the document was parsed. For example:
field: DOC_TIME TIME

/required
This optional modifier identifies whether the field is required in order for the Verity engine to include the document in the collection. If you specify yes, the Verity engine ignores the document if the field is not found. The default is no.
/which
This optional modifier specifies how the Verity engine behaves when multiple instances of a particular field are found in a document.
1 - the first instance of the field is used (the default is 1);
a number - the field with the given instance number is used;
LAST - the last instance of the field is used;
ALL - all instances of the field are used.

/string-before
This optional modifier identifies a string to be inserted before the field value.
/string-between
This optional modifier identifies a string to be inserted between field values when the Verity engine extracts multiple values for the field. This modifier is valid only when /which = ALL is specified.
/string-after
This optional modifier identifies a string to be inserted after the field value.
/default
This optional modifier specifies the string that the Verity engine assigns to the field if a field value is not found. If this modifier is not specified, the field is empty and appears blank if displayed in a Verity application.
/alsowrite
Using this modifier, you can store a parsed value in two fields: the field named by the field statement, and the field named by the /alsowrite modifier. /alsowrite also allows you to populate more than two fields with the same value. Simply specify a space-separated list of field names delimited by quote marks:
/alsowrite = "Us1 Us2 Gx1 Gx2"
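
The modifiers above can be combined in a single field statement. The following sketch is an illustration, not a tested rule: it collects every author line in a document, joins the values with a semicolon, falls back to a default when none is found, and copies the result into two further fields. The field names AUTHOR, Byline, and Credit are hypothetical, and the pattern is modeled on the TITLE example above.

```text
# Hypothetical field names; each must be defined in style.ufl to be stored
field: AUTHOR PATTERN "Author :<.*>"
  /which = ALL
  /string-between = "; "
  /default = "unknown"
  /alsowrite = "Byline Credit"
```

Because /string-between is valid only with /which = ALL, the two modifiers appear together here.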

pre-process

The pre-process statement identifies the beginning of the pre-process section of the style.tde file. Include no more than one pre-process statement in the style.tde file. A pre-process statement can have up to 1000 child statements.

Syntax


pre-process:
{
pre-processing
}
Element
Description
pre-processing
A set of statements that define the work done by the Verity engine when documents are initially parsed. For example, the relative-path: NO statement can be specified to build collections with full paths to documents (by default, relative paths are used).
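
A pre-process section that turns off relative paths might be written as in the sketch below. It uses only the relative-path statement shown in the syntax template. No other child statements are assumed.

```text
pre-process:
{
  # Build the collection with full paths instead of the default relative paths
  relative-path: no
}
```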

tde

The tde statement identifies the control file as a style.tde file. It should be the first non-comment line after the $control statement.

Syntax

tde:
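
The ordering rules for $control and tde can be pictured with the skeleton below. It is only an outline of the file layout drawn from the syntax template. The comment inside the braces stands in for the sections described elsewhere in this section.

```text
# Comment lines may come before the control statement
$control: 1
tde:
{
  # pre-process and datamap sections go here
}
$$
```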

style.tde Example with a Custom Macro

Sample style.tde and style.ufl files for defining the use of a custom macro are shown below.


# style.tde example
$control: 1
tde:
{
pre-process:
{
datamap:
{
define: writename "<E.*>"
field: Writer PATTERN "{writename}"
/required = yes
dispatch: DOC
}
}
}
$$


# style.ufl field definition to be used with style.tde above
data-table: dad
{
varwidth: Writer dxa
}





Copyright © 2002, Verity, Inc. All rights reserved.