An indexable document is a document that Content Platform Engine deems eligible for indexing and that the Oracle Outside In Search Export product can convert to text. The document types that can be converted depend on the version of the Oracle product that is used in your Content Platform Engine release. Content Platform Engine determines whether a document is eligible for indexing from the MIME type of the document. Some MIME types are ineligible for indexing.
Text extraction is the conversion of a document to text so that the document can be indexed. Text extraction can occur in the following ways:
Component | Documents | Comment |
---|---|---|
Content Platform Engine | All indexable documents (default mechanism) | By default, Content Platform Engine uses the Oracle product to perform text extraction. For example, when the content element for an object is a Microsoft Word document, the Oracle product converts the Word format to a simple text format. Content Platform Engine then sends the text to IBM® Content Search Services for indexing. |
Content Platform Engine | PDF documents (optional) | For extracting text from PDF documents, you can optionally use Apache PDFBox technology instead of the Oracle product. |
IBM Content Search Services | IBM Content Collector documents | IBM Content Search Services performs text extraction for any documents that are submitted to it by IBM Content Collector. IBM Content Collector is an optional add-on component for Content Platform Engine. |
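The routing that the table describes can be sketched in a few lines of Python. This is an illustration only, not product code or a product API: the names `Document`, `Extractor`, `choose_extractor`, and `use_pdfbox_for_pdf` are hypothetical, and the single entry in `INELIGIBLE_MIME_TYPES` is a placeholder because the actual list of ineligible MIME types is defined by the product. The sketch also assumes that documents submitted by IBM Content Collector are routed to IBM Content Search Services before any eligibility check is made.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Extractor(Enum):
    """Components that can perform text extraction, per the table above."""
    ORACLE_OUTSIDE_IN = "Oracle Outside In Search Export (Content Platform Engine)"
    APACHE_PDFBOX = "Apache PDFBox (Content Platform Engine)"
    CONTENT_SEARCH_SERVICES = "IBM Content Search Services"


# Hypothetical placeholder. The real set of ineligible MIME types is
# determined by Content Platform Engine and is not reproduced here.
INELIGIBLE_MIME_TYPES = frozenset({"application/x-example-ineligible"})


@dataclass(frozen=True)
class Document:
    mime_type: str
    from_content_collector: bool = False  # submitted by IBM Content Collector


def choose_extractor(doc: Document,
                     use_pdfbox_for_pdf: bool = False) -> Optional[Extractor]:
    """Return the component that would extract text, or None if the
    document is not eligible for indexing."""
    # Assumption: Content Collector documents go straight to
    # IBM Content Search Services, which extracts the text itself.
    if doc.from_content_collector:
        return Extractor.CONTENT_SEARCH_SERVICES

    mime_type = doc.mime_type.strip().lower()

    # Eligibility is decided from the MIME type of the document.
    if mime_type in INELIGIBLE_MIME_TYPES:
        return None

    # Optional alternative for PDF documents only.
    if use_pdfbox_for_pdf and mime_type == "application/pdf":
        return Extractor.APACHE_PDFBOX

    # Default mechanism for everything else.
    return Extractor.ORACLE_OUTSIDE_IN
```

For example, `choose_extractor(Document("application/pdf"), use_pdfbox_for_pdf=True)` returns `Extractor.APACHE_PDFBOX`, while the same call without the option returns `Extractor.ORACLE_OUTSIDE_IN`.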
For information about the document types that the Oracle product supports, do the following steps: