The PDF Filter


PDF indexing is supported through a dynamically loadable PDF filter (flt_pdf.so or flt_pdf.sl on UNIX, flt_pdf.dll on Windows). By default, the Verity engine invokes the universal filter with the PDF filter as a helper filter.

The following example shows how to create a PDF collection that uses the PDF filter on its own, without the universal filter, and how to add documents to it. Given a Verity installation directory of /usr/verity, the PDF collection can be created as follows:

% mkvdk -collection pdfcoll -create -style /usr/verity/data/stylesets/def_filesystem

To index the file REPORT.PDF into the collection, use the following command:

% mkvdk -collection pdfcoll -insert REPORT.PDF
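
When the two commands above need to be run from a script, they can be assembled as argument lists first. The following is a minimal Python sketch, not part of the Verity toolkit. The helper names mkvdk_create and mkvdk_insert are hypothetical, and only the mkvdk options shown above are used.

```python
import shlex

def mkvdk_create(collection, style_dir):
    """Build the argument list that creates a collection from a style set."""
    return ["mkvdk", "-collection", collection, "-create", "-style", style_dir]

def mkvdk_insert(collection, document):
    """Build the argument list that indexes one document into a collection."""
    return ["mkvdk", "-collection", collection, "-insert", document]

commands = [
    mkvdk_create("pdfcoll", "/usr/verity/data/stylesets/def_filesystem"),
    mkvdk_insert("pdfcoll", "REPORT.PDF"),
]
for argv in commands:
    # Print the command line; to execute it, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping each command as a list avoids shell quoting problems when a file name contains spaces.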

PDF support (as of Verity Developer's Kit V2.0) is versatile because PDF documents can reside in any repository that is supported by a valid Verity gateway. For example, the PDF filter can be used in conjunction with the HTTP gateway to index PDF documents on the World Wide Web.

Custom Lexing Rules Not Supported

The PDF filter streams PDF documents and performs the task of lexing itself. The output of the PDF filter is a series of word tokens and punctuation tokens. These tokens are not processed by the Verity default lexer or by any custom lexer that might be defined in the style.lex file, so there is no way to specify alternative lexing rules.

Specifying the PDF Filter

The PDF filter, named flt_pdf, can be invoked in two ways: together with the universal filter or as a single filter. By default, the PDF filter is invoked with the universal filter.

To invoke the PDF filter with the universal filter, specify it in the style.uni file with the type keyword and the /format-filter modifier, as shown in the sample style.uni syntax below:


type: "application/pdf"
/format-filter = "flt_pdf"
/charset = 1252 #1252 is the default

To invoke the PDF filter as a single filter for a collection using the 850 character set, you must specify the filter in the style.dft file using the field keyword and the /filter modifier, as shown in the sample style.dft file syntax below:


field: DOC
/filter = "flt_pdf -charmapto 850"
/charmap = 850

If the PDF filter is invoked as a single filter, the engine will index PDF documents only, so the collection will be limited to PDF documents.

Using the -fieldoverride Option

The PDF filter specification located in the style.uni file can include a field override option, -fieldoverride, that specifies that the field values generated by the PDF filter override those generated by a Verity gateway.

To use the -fieldoverride option, include it as part of the /format-filter specification as follows:


type: "application/pdf"
/format-filter = "flt_pdf -fieldoverride"
/charset = 1252 #1252 is the default
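
The effect of -fieldoverride can be pictured as a precedence rule applied when the same field is produced by both the gateway and the PDF filter. The Python sketch below only illustrates that rule and is not the engine's implementation. The function name merge_fields and the sample values are hypothetical, and the behavior without the option (the gateway value is kept) is an assumption inferred from the description above.

```python
def merge_fields(gateway_fields, filter_fields, fieldoverride=False):
    """Combine field values from a gateway and from the PDF filter.

    With fieldoverride=True, values from the PDF filter win on conflict.
    Assumption: without the option, the gateway's value is kept.
    """
    if fieldoverride:
        return {**gateway_fields, **filter_fields}
    return {**filter_fields, **gateway_fields}

# Hypothetical values for one document.
gateway = {"Title": "report.pdf", "Date": "2002-01-15"}
pdf_filter = {"Title": "Annual Report", "Author": "J. Smith"}

print(merge_fields(gateway, pdf_filter, fieldoverride=True)["Title"])   # Annual Report
print(merge_fields(gateway, pdf_filter, fieldoverride=False)["Title"])  # report.pdf
```

Fields produced by only one of the two sources are kept in both cases.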

Using the -charmapto Option

The PDF filter specification in the style.dft file can include a character mapping option, -charmapto, that controls the character set output by the filter. This option is used only when the PDF filter is invoked as a single filter. Valid values for the -charmapto option are:

-charmapto Value    Description
1252                For code page 1252
850                 For IBM code page 850
8859                For ISO-8859
mac1                For Macintosh systems

The default character set used is platform-dependent. When the -charmapto option is not specified, the PDF filter uses the platform's default character encoding. On Unix and Windows systems, the default character encoding is 8859; on Macintosh systems it is mac1.
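
The selection rule described above can be summarized in a few lines of Python. This is a sketch under stated assumptions: the function names and the platform labels ("unix", "windows", "macintosh") are hypothetical and exist only for illustration.

```python
# Values accepted by -charmapto, taken from the table above.
VALID_CHARMAPTO = ("1252", "850", "8859", "mac1")

def effective_charset(charmapto=None, platform="unix"):
    """Return the character set the filter would output.

    `platform` is a hypothetical label used only by this sketch.
    """
    if charmapto is not None:
        if charmapto not in VALID_CHARMAPTO:
            raise ValueError(f"unsupported -charmapto value: {charmapto}")
        return charmapto
    # No option given: fall back to the platform default described above.
    return "mac1" if platform == "macintosh" else "8859"

def filter_modifier(charmapto=None):
    """Build the /filter line for the style.dft file."""
    spec = "flt_pdf" if charmapto is None else f"flt_pdf -charmapto {charmapto}"
    return f'/filter = "{spec}"'

print(filter_modifier("850"))                # /filter = "flt_pdf -charmapto 850"
print(effective_charset(None, "windows"))    # 8859
print(effective_charset(None, "macintosh"))  # mac1
```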

PDF Fields

While processing each document, the PDF filter generates a series of field tokens containing information extracted and derived from the PDF document. When these fields are defined in the style.sfl file, they are populated in the collection's internal documents table. PDF fields can be populated by the PDF filter if they exist in the information dictionary for the PDF document.

Standard PDF Fields

The following PDF fields are predefined as standard fields in the default style.sfl file. These fields are populated unless changes are made to the style.sfl file. For the predefined fields, the Adobe PDF field names are mapped to Verity collection field names as described below.

PDF Field Name
(Verity Collection Field Name) Description
PageMap
(PageMap) This field represents a vector of integers, one for each page, describing the number of word instances for each page. This field is required. In the default style.sfl file, the PageMap field is defined as:
varwidth: PageMap xya
/_hexdata=yes

FTS_Author
(Author) The author of the PDF document obtained by reading the value for the Author key in the PDF document's information dictionary. Definition is:
varwidth: Author ddh
/alias=FTS_Author

FTS_Keywords
(Keywords) The keywords for the PDF document obtained by reading the value for the Keywords key in the PDF document's information dictionary. Definition is:
varwidth: Keywords ddh
/alias = FTS_Keywords

FTS_ModificationDate
(Date) The last modification date of the PDF document obtained by reading the value for the ModDate key in the PDF document's information dictionary. Definition is:
fixwidth: Date 4 date
/alias = FTS_ModificationDate

FTS_Title
(Title) The title of the PDF document obtained by reading the value for the Title key in the PDF document's information dictionary. Definition is:
varwidth: Title ddh
/alias= FTS_Title
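
The mapping in the table above can be sketched in Python as follows. This is an illustration only, not the filter's implementation: the function name standard_fields is hypothetical, word instances are approximated by splitting on whitespace (the filter's own lexing rules are not reproduced), and the ModDate value is left as the raw PDF date string rather than converted to a Verity date.

```python
# Information dictionary key -> (PDF field name, Verity collection field name).
INFO_KEY_TO_FIELD = {
    "Author": ("FTS_Author", "Author"),
    "Keywords": ("FTS_Keywords", "Keywords"),  # assumes the Keywords key is the source
    "ModDate": ("FTS_ModificationDate", "Date"),
    "Title": ("FTS_Title", "Title"),
}

def standard_fields(info, page_texts):
    """Return the standard collection fields for one PDF document.

    info       -- the document's information dictionary as a plain dict
    page_texts -- the extracted text of each page, in page order
    """
    fields = {}
    for key, (_pdf_field, collection_field) in INFO_KEY_TO_FIELD.items():
        # A field is populated only if its key exists in the dictionary.
        if key in info:
            fields[collection_field] = info[key]
    # PageMap: one integer per page giving the number of word instances.
    fields["PageMap"] = [len(text.split()) for text in page_texts]
    return fields

# Hypothetical document with two pages and no Keywords entry.
info = {"Title": "Annual Report", "Author": "J. Smith", "ModDate": "D:20020115093000"}
print(standard_fields(info, ["one two three", "four five"]))
```

In this sketch the Keywords field is simply absent from the result because the information dictionary has no Keywords key.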

Optional PDF Fields

There are several optional PDF fields that can be defined as standard fields. These fields exist in the style.sfl file, but are commented out and therefore are not populated by the PDF filter. For information on defining these fields, see "Defining Optional PDF Fields" following this table.

PDF Field Name
Description
DirID
The Adobe path specification for the directory where the PDF file exists. If the PDF document is being pulled from a repository other than the file system, this directory will be the temp directory. Definition is:
varwidth: DirID ddc

FileName
The Adobe filename specification for the PDF document. Definition is:
varwidth: FileName xya

FTS_CreationDate
The creation date of the PDF document obtained by reading the value for the CreationDate key in the PDF document's information dictionary. Definition is:
fixwidth: FTS_CreationDate 4 date

FTS_Creator
The creator of the PDF document obtained by reading the value for the Creator key in the PDF document's information dictionary. Definition is:
varwidth: FTS_Creator xya

FTS_Producer
The producer of the PDF document obtained by reading the value for the Producer key in the PDF document's information dictionary. Definition is:
varwidth: FTS_Producer xya

FTS_Subject
The subject of the PDF document obtained by reading the value for the Subject key in the PDF document's information dictionary. Definition is:
varwidth: FTS_Subject xyd

InstanceID
The changing ID found in the /ID array (position 1) in the trailer of the PDF document. If it does not exist, one is generated using the last modification time. Definition is:
fixwidth: InstanceID 32 text

NumPages
The number of pages in the PDF document. Definition is:
fixwidth: NumPages 4 unsigned-integer

PermanentID
The permanent (unchanging) ID that is found in the /ID array (position 0) in the trailer of the PDF document. If it does not exist, one is generated using the last modification time. Definition is:
fixwidth: PermanentID 32 text

WXEVersion
The version of the Adobe Word Finder used to extract the text from the PDF document. Definition is:
fixwidth: WXEVersion 1 unsigned-integer
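
To show where the PermanentID and InstanceID values come from, the following Python sketch reads the /ID array from the raw bytes of a PDF trailer. It is a simplified illustration, not the filter's implementation: the function name trailer_ids and the sample trailer are hypothetical, and the sketch assumes both identifiers are written as hexadecimal strings in angle brackets.

```python
import re

# Matches "/ID [<hex> <hex>]"; assumes both identifiers are hex strings.
ID_ARRAY = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]")

def trailer_ids(pdf_bytes):
    """Return (permanent_id, instance_id) from the last /ID array, or None."""
    matches = ID_ARRAY.findall(pdf_bytes)
    if not matches:
        # No /ID array present; the caller would have to generate the IDs.
        return None
    # The last match belongs to the most recently written trailer.
    permanent, instance = matches[-1]
    return permanent.decode("ascii"), instance.decode("ascii")

# Hypothetical trailer fragment.
sample = (b"trailer\n<< /Size 22 /Root 1 0 R "
          b"/ID [<0123456789ABCDEF0123456789ABCDEF> "
          b"<FEDCBA9876543210FEDCBA9876543210>] >>\n")
permanent_id, instance_id = trailer_ids(sample)
print(permanent_id)  # position 0 -> PermanentID
print(instance_id)   # position 1 -> InstanceID
```

Each identifier in the sample is 32 hexadecimal characters long, which matches the 32-character width of the InstanceID and PermanentID field definitions.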

Defining Optional PDF Fields

In order to define these optional PDF fields, you must do the following:

Editing the style.uni File

In a text editor, open the style.uni file and add the -fieldoverride option to the PDF filter specification as follows:


type: "application/pdf"
/format-filter = "flt_pdf -fieldoverride"

Editing the style.sfl File

To use one of the optional PDF fields, you must define a field of your own that uses the optional PDF field's definition and aliases the optional PDF field. In a text editor, open the style.sfl file and do the following:

1. Add a comment line at the point where you will insert the new field definition, so you can easily review what you have added. For example:


#My new field to define FTS_CreationDate

2. Define your field by using the appropriate definition from the table of optional PDF fields provided previously, but replace the field name with your own name. For example:


#My new field to define FTS_CreationDate
fixwidth: PdfCreatedDate 4 date

3. Add an alias specification that refers to the optional PDF field. For example:


#My new field to define FTS_CreationDate
fixwidth: PdfCreatedDate 4 date
/alias = FTS_CreationDate

NOTE: When defining a field for FTS_CreationDate, you also need to add an alias to the field Created as follows:


#My new field to define FTS_CreationDate
fixwidth: PdfCreatedDate 4 date
/alias = FTS_CreationDate
/alias = Created

When defining your fields for the other optional fields, you can just alias the optional field itself.
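
If several optional fields need to be added, the steps above can be scripted. The following Python sketch only generates the text to paste into the style.sfl file. The function name sfl_field is hypothetical, and the output format simply follows the examples shown above.

```python
def sfl_field(kind, name, type_spec, aliases, comment=None):
    """Return the style.sfl text for one field definition with its aliases."""
    lines = []
    if comment:
        lines.append("#" + comment)                 # step 1: comment line
    lines.append(f"{kind}: {name} {type_spec}")     # step 2: field definition
    lines.extend(f"/alias = {alias}" for alias in aliases)  # step 3: aliases
    return "\n".join(lines)

# Reproduces the FTS_CreationDate example, including the extra Created alias.
print(sfl_field("fixwidth", "PdfCreatedDate", "4 date",
                ["FTS_CreationDate", "Created"],
                comment="My new field to define FTS_CreationDate"))
```

For the other optional fields, pass a single alias naming the optional field itself.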





Copyright © 2002, Verity, Inc. All rights reserved.