IBM FileNet P8, Version 5.2.1

Indexable document types and text extraction

An indexable document is a document that Content Platform Engine deems eligible for indexing and that the Oracle Outside In Search Export product can convert to text. The specific types of convertible documents depend on the version of the Oracle product that is used in your Content Platform Engine release. Content Platform Engine determines the eligibility of a document for indexing by identifying the MIME type of the document. Some MIME types are considered to be ineligible for indexing.
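The two conditions in this definition can be sketched as a small check. This is a minimal illustration only, not Content Platform Engine code: the function names are hypothetical, and the MIME types in the ineligible set are placeholders, because this topic does not list the MIME types that are actually excluded.

```python
# Placeholder values for illustration; the real list of ineligible MIME
# types is internal to Content Platform Engine and is not given here.
HYPOTHETICAL_INELIGIBLE_MIME_TYPES = {"image/jpeg", "video/mp4"}


def is_eligible_for_indexing(mime_type: str) -> bool:
    """Return True unless the MIME type is in the hypothetical ineligible set."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return base_type not in HYPOTHETICAL_INELIGIBLE_MIME_TYPES


def is_indexable(mime_type: str, convertible_to_text: bool) -> bool:
    """A document is indexable only if both conditions hold:
    its MIME type is eligible, and the text extractor can convert it."""
    return is_eligible_for_indexing(mime_type) and convertible_to_text
```

For example, under these placeholder values, `is_indexable("application/msword", True)` is true, whereas a document with an ineligible MIME type is not indexable even if the extractor could convert it.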

Converting a document to text for purposes of indexing the document is called text extraction. The text extraction step can occur in the following ways:

| Component | Documents | Comment |
| --- | --- | --- |
| Content Platform Engine | Default mechanism | By default, Content Platform Engine uses the Oracle product to perform text extraction. For example, when the content element for an object is a Microsoft Word document, the Oracle product converts the Word format to a simple text format. Content Platform Engine then sends the text to IBM® Content Search Services for indexing. |
| Content Platform Engine | PDF documents (optional) | For extracting text from PDF documents, you can optionally use Apache PDFBox technology instead of the Oracle product. |
| IBM Content Search Services | IBM Content Collector documents | IBM Content Search Services performs text extraction for any documents that are submitted to it by IBM Content Collector. IBM Content Collector is an optional add-on component for Content Platform Engine. |
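The three cases can be summarized as a simple routing decision. The following sketch is illustrative only: the `Extractor` enum, the `choose_extractor` function, and its parameters are hypothetical names, not part of any FileNet API, and the use of the `application/pdf` MIME type to recognize PDF documents is an assumption.

```python
from enum import Enum


class Extractor(Enum):
    """Where text extraction happens, per the cases described above."""
    ORACLE_OUTSIDE_IN = "Oracle Outside In Search Export (in Content Platform Engine)"
    APACHE_PDFBOX = "Apache PDFBox (in Content Platform Engine)"
    CONTENT_SEARCH_SERVICES = "IBM Content Search Services"


def choose_extractor(
    mime_type: str,
    from_content_collector: bool = False,
    use_pdfbox_for_pdf: bool = False,
) -> Extractor:
    """Pick the component that performs text extraction (hypothetical helper)."""
    # Documents submitted by IBM Content Collector are extracted by
    # IBM Content Search Services itself.
    if from_content_collector:
        return Extractor.CONTENT_SEARCH_SERVICES
    # PDFBox is an optional alternative, and only for PDF documents.
    if use_pdfbox_for_pdf and mime_type.strip().lower() == "application/pdf":
        return Extractor.APACHE_PDFBOX
    # Default mechanism: the Oracle product inside Content Platform Engine.
    return Extractor.ORACLE_OUTSIDE_IN
```

In this sketch, enabling the PDFBox option affects only PDF documents; a Word document still goes through the default Oracle path.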

Supported document type information

For information about the document types that the Oracle product supports, do the following steps:

  1. Verify the version of the Oracle product that your Content Platform Engine uses. To do so, check the INSO version that is shown on the Content Engine Startup Context page (ping page). For information about browsing to that page, see the instructions in "Verifying the Content Platform Engine deployment".
  2. If the verified Oracle product version is the latest product version, you can find the information about the supported document types by following these steps:
    1. Browse to the Oracle website.
    2. On the Oracle website, search for "Outside In supported formats".
    3. In the search results, click the link to the document that contains the supported formats for Outside In.
  3. If the verified Oracle product version is not the latest product version, contact Oracle for information about the supported document types for that version.


Last updated: October 2015

© Copyright IBM Corporation 2015.