IBM FileNet P8, Version 5.2.1

Enabling PDF-specific text extraction

For more accurate indexing of PDF documents that are written in a right-to-left language, such as Arabic or Hebrew, specify that Content Platform Engine use the Apache PDFBox technology for text extraction.

About this task

Text extraction occurs as part of the process for indexing a document. By default, Content Platform Engine uses the Oracle Outside In Search Export product to extract text from a document. Do not override this default text extraction mechanism unless your documents are written in either a right-to-left language or English. The default mechanism is faster than PDF-specific text extraction.
Restriction: PDF-specific text extraction might produce garbled text when a PDF document contains text that overlays other text, such as revised text or transparent text.

Procedure

To enable PDF-specific text extraction:

Set the following Java™ virtual machine (JVM) parameter to true:
	-Dcom.filenet.cbr.processPDFWithPDFBox=true
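
A parameter that is passed with -D becomes a Java system property, so a small diagnostic class can confirm that the setting reaches a JVM. The following is a minimal sketch for that purpose only. The class name PdfBoxFlagCheck and its method are hypothetical and are not part of Content Platform Engine, and the sketch does not show how Content Platform Engine itself reads the parameter. Where JVM arguments are configured depends on the application server that hosts Content Platform Engine.

```java
// Hypothetical diagnostic class, not part of IBM FileNet P8.
public class PdfBoxFlagCheck {

    // Property name taken from the procedure step above.
    static final String FLAG = "com.filenet.cbr.processPDFWithPDFBox";

    // Boolean.getBoolean returns true only if the system property
    // exists and its value equals "true" (case is ignored).
    static boolean isPdfBoxExtractionRequested() {
        return Boolean.getBoolean(FLAG);
    }

    public static void main(String[] args) {
        System.out.println(FLAG + "=" + isPdfBoxExtractionRequested());
    }
}
```

Running the class with `java -Dcom.filenet.cbr.processPDFWithPDFBox=true PdfBoxFlagCheck` prints `com.filenet.cbr.processPDFWithPDFBox=true`. Running it without the parameter prints `false`, which corresponds to the default text extraction mechanism.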


Last updated: March 2016

© Copyright IBM Corporation 2016.