Text extraction from a Portable Document Format (PDF) document
that uses embedded fonts with custom encodings can fail. When extraction
fails, IBM® Content Search Services cannot index or search these documents.
Symptoms
When text is extracted from a PDF document, the extraction
fails and garbage characters are displayed. To verify that this issue is
caused by embedded fonts with custom encoding, select a
portion of the text in the PDF document and paste it into
a document in an application such as Microsoft Word. If garbage characters
appear in the pasted text, the problem is likely caused by
the embedded fonts, and IBM Content Search Services cannot
extract the text for indexing. Indexing and search of these kinds
of documents fail.
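The manual copy-and-paste check can be approximated with a script once the text is available as a string. The following is a minimal sketch and is not part of IBM Content Search Services; the function names garbage_ratio and looks_garbled and the 0.3 threshold are assumptions chosen for illustration. It flags text that consists largely of control, private-use, unassigned, or replacement characters. Some custom encodings produce ordinary-looking but incorrect letters, which this heuristic does not detect, so a visual check is still needed.

```python
import unicodedata

# Unicode general categories that rarely occur in legitimately extracted text:
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}


def garbage_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(
        1
        for c in chars
        # U+FFFD is the Unicode replacement character.
        if unicodedata.category(c) in SUSPICIOUS_CATEGORIES or c == "\ufffd"
    )
    return suspicious / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical helper: True if the suspicious share reaches the threshold."""
    return garbage_ratio(text) >= threshold
```

For example, looks_garbled(pasted_text) can be called on the text that was copied from the PDF document. An empty string is not flagged, so a document that yields no text at all must be checked separately.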
Causes
IBM Content Search Services cannot
extract the text from a PDF document that uses embedded fonts with
custom encodings that cannot be mapped to any standard code page. Consequently,
these documents cannot be indexed or searched.
Resolving the problem
There is no workaround for this issue. IBM Content Search Services does not support PDF
documents that use embedded fonts with custom encodings that cannot
be mapped to any standard code page.