IBM FileNet P8, Version 5.2.1

Text extraction can fail for PDF documents that use embedded fonts

Text extraction from a Portable Document Format (PDF) document that uses embedded fonts can fail. IBM® Content Search Services is unable to index or search on these documents.

Symptoms

When text is extracted from a PDF document, the extraction fails and garbage characters are displayed. To verify that the issue is caused by embedded fonts with custom encoding, select a portion of the text in the PDF document and paste it into a document in an application such as Microsoft Word. If garbage characters appear in the newly created document, the problem is likely caused by the embedded fonts, and IBM Content Search Services cannot extract the text for indexing. Indexing and searching of these documents will fail.
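The manual copy-and-paste check can be approximated with a small script when many documents must be screened. The following is a minimal sketch, not part of IBM Content Search Services: the function names and the 0.3 threshold are assumptions chosen for illustration, and the input is whatever text was copied or extracted from the PDF document. It reports the share of characters that are control characters, private-use or unassigned code points, or the Unicode replacement character.

```python
import unicodedata


def garbage_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like garbage.

    Hypothetical heuristic: counts control (Cc), private-use (Co), and
    unassigned (Cn) code points, plus the replacement character U+FFFD.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        if unicodedata.category(c) in ("Cc", "Co", "Cn") or c == "\ufffd":
            suspicious += 1
    return suspicious / len(chars)


def looks_like_garbage(text: str, threshold: float = 0.3) -> bool:
    """Flag text whose garbage ratio reaches the (assumed) threshold."""
    return garbage_ratio(text) >= threshold
```

A high ratio suggests that the text cannot be extracted reliably. The check has a known blind spot: a custom encoding that maps to ordinary but incorrect letters is not detected, so visual inspection of the pasted text remains the authoritative test.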

Causes

IBM Content Search Services cannot extract the text from a PDF document that uses embedded fonts with custom encodings that cannot be mapped to any standard codepage. Consequently, these documents cannot be indexed or searched on.
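The effect of an unmappable custom encoding can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of how IBM Content Search Services or any PDF library works internally: the encoding table, the stored codes, and the function names are all hypothetical. The embedded font knows which glyph each code draws, so the page renders correctly, but a text extractor that can only interpret the codes through a standard codepage recovers meaningless characters.

```python
# Hypothetical custom encoding carried by an embedded font:
# character code -> glyph that the reader sees on the page.
CUSTOM_ENCODING = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}

# Character codes as stored in the page content (illustrative values).
stored_codes = bytes([0x01, 0x02, 0x03, 0x03, 0x04])


def rendered_text(codes: bytes) -> str:
    """What a viewer displays: each code is drawn with the font's own glyph."""
    return "".join(CUSTOM_ENCODING[c] for c in codes)


def extracted_text(codes: bytes, codepage: str = "latin-1") -> str:
    """What an extractor recovers if it can only apply a standard codepage."""
    return codes.decode(codepage, errors="replace")


print(rendered_text(stored_codes))         # the page looks correct
print(repr(extracted_text(stored_codes)))  # the extracted text is control characters
```

In this model the displayed page and the extracted text diverge because nothing links the custom codes back to standard characters, which is the situation in which indexing fails.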

Resolving the problem

There is no workaround for this issue. IBM Content Search Services does not support PDF documents that use embedded fonts with custom encodings that cannot be mapped to any standard codepage.


Last updated: October 2015

© Copyright IBM Corporation 2015.