For more accurate indexing of PDF documents that are written
in a right-to-left language, specify that Content Platform Engine use the Apache PDFBox
technology for text extraction. Arabic and Hebrew are examples of
right-to-left languages.
About this task
Text extraction occurs as part of the process for indexing
a document. By default,
Content Platform Engine uses
the Oracle Outside In Search Export product to extract text from a
document. Do not override this default text extraction mechanism unless
your documents are written in either a right-to-left language or English.
The default mechanism is faster than PDF-specific text extraction.
Restriction: PDF-specific text extraction might produce garbled
text when the PDF document contains text that overlays other text,
such as revised text or transparent text.
Procedure
To enable PDF-specific text extraction:
Set the following Java™ virtual machine (JVM) parameter to true:
-Dcom.filenet.cbr.processPDFWithPDFBox=true
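
A JVM parameter that is passed with -D becomes a Java system property, so one way to confirm that the setting reached a given JVM is to read the property back. The following is a minimal sketch for illustration only: the class name PdfBoxFlagCheck and its method are hypothetical, and the code is not part of Content Platform Engine and does not show how the product itself reads the parameter.

```java
public class PdfBoxFlagCheck {
    // Property name taken from the procedure above.
    static final String PROPERTY = "com.filenet.cbr.processPDFWithPDFBox";

    // Hypothetical helper: Boolean.getBoolean returns true only when the
    // system property exists and equals "true" (ignoring case).
    static boolean isPdfBoxExtractionEnabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        // Report the value as seen by the JVM that runs this class.
        System.out.println(PROPERTY + " = " + isPdfBoxExtractionEnabled());
    }
}
```

Running the compiled class with java -Dcom.filenet.cbr.processPDFWithPDFBox=true PdfBoxFlagCheck reports the property as true; running it without the parameter reports false. The check reflects only the JVM that executes it, so it verifies the parameter syntax rather than the configuration of the application server that hosts Content Platform Engine.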