Consider the performance tuning options that are related to handling a large number of folders, documents, and objects when you plan a FileNet® P8 system, develop efficient custom applications, or configure a client application.
Retrieving the contents of a folder can be a memory-intensive operation. The methods that retrieve the contents of a folder return a list of all folders and objects that are contained in the folder. When the list is returned, it is stored in memory. If many containees exist, the server can run out of memory. The application can also fail or it can take an excessive amount of time to display the results. To avoid this problem, a custom application can use the search API methods to filter the list of subfolders by using a query. For example, you can define a search query to return only those folder names that start with "Case 2".
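For example, a scoped Content Engine SQL query along the following lines returns only the matching subfolders instead of the entire containee list (a minimal sketch; the /Cases folder path is a hypothetical example):

```sql
-- Return only the immediate subfolders of /Cases whose names start with 'Case 2'
-- (the /Cases path is a hypothetical example)
SELECT This FROM Folder
WHERE This INFOLDER '/Cases' AND FolderName LIKE 'Case 2%'
```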
You cannot limit the number of folders that a client application returns. Each time a user opens an object store or opens a folder, a client application retrieves a list of all the folders that are contained at that same level. The list also contains several properties for each folder. For example, when a user opens an object store, a list of all the top-level folders in that object store is retrieved for display. If the object store or folder that is opened contains an excessive number of items, both performance and usability can be adversely affected. Requiring users to browse through many folders is also not recommended for usability reasons.
Because the FileNet P8 architecture does not impose a limit on the number of folders that can be stored at any given level, you can optimize system performance by keeping no more than 100-200 folders at a given level. For applications that require many folders, consider using the search APIs with filtering to limit the results. In addition to the performance benefits, limiting the number of folders at any given level makes the application easier to use, and it supports a browsing-style interface that does not require users to enter filter conditions.
The number of folders displayed per page in a client application like Workplace is configurable by setting the site preferences.
The number of folders or subfolders that are retrieved by a client application like IBM Content Navigator is not configurable. That is, any folder that is opened through a client application like IBM Content Navigator results in the retrieval of all subfolders at that folder level, regardless of whether those folders are displayed.
As previously noted, retrieving the contents of a folder can be a memory-intensive operation for custom applications and some client applications. Each time a user opens a folder, a client application retrieves the entire contents of the folder along with several properties for each item. If the folder contains an excessive number of documents or custom objects, system performance can be adversely affected.
Workplace provides filtering options for items other than folders. You can use the folder filter options in the site preferences and the user preferences to limit the number of documents and custom objects that a client application like Workplace retrieves. The folder filter options limit how many documents and custom objects are retrieved per folder access, with a default value of 100 per object type. If a folder contains more items than the folder filter options specify, you can retrieve additional items by entering criteria in the filter folder field on the browse page; doing so retrieves only the objects that match the criteria. Applying folder filtering reduces the overhead of the initial retrieval operation. For example, set the filter criteria so that only documents that were added by a specific user are returned.
To minimize the effects on performance, name documents so that the most selective filter criteria come first, as the following examples show:

| Document name that is difficult to filter | Document name that is easy to filter |
|---|---|
| Loan_Origination_Application_John_D_Doe | John_D_Doe_application |
| Loan_Origination_Application_John_Q_Public | John_Q_Public_application |
| Loan_Terms_Complete_John_D_Doe | John_D_Doe_Loan_Terms_Complete |
An index skew is an uneven distribution of values in an index. For example, if 1,000,000 documents are filed in the system and half of those documents are filed in a single folder, then half of the rows in the index on the relationship table have the same tail_id value (the tail_id column identifies the containing folder).
When an index skew occurs, the database might not use the index and might instead perform a full table scan. The reasoning behind the database optimizer is that if too many index entries share the same value, scanning the index and then looking up the rows takes more time than simply reading the rows in the table. The optimizer falls back on this general reasoning because it is not passed the actual search values, which is the case for Content Platform Engine queries that use bound parameters.
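For example, the containment query that Content Platform Engine issues is roughly of the following form (a simplified sketch; the selected head_id column is an illustrative assumption), so the optimizer compiles one general plan without seeing the actual folder ID:

```sql
-- Simplified sketch of a parameterized containment query; the optimizer
-- compiles one plan for the bound parameter, whatever folder is browsed
SELECT head_id FROM relationship WHERE tail_id = ?
```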
Because the optimizer uses a general plan for all values, it has no way to determine whether a searched value is one of the few values that do not have many duplicates. If the search is in the folder that has many documents, the full table scan is slightly better for retrieving the documents. However, if the search is in a folder that has only a few documents, the table scan is much worse. The following solutions are therefore presented to eliminate the table scans; a common use case is a customer who knows not to browse the large folder, yet still encounters table scans when browsing the smaller folders.
Therefore, avoid placing many objects in one folder, regardless of whether they are subfolders, documents, or custom objects. Many objects in one folder relative to the other folders creates an index skew, which can adversely affect both browsing performance and the query plans that the database chooses.
The symptom of skew is that browsing a folder is too slow or times out because the query plan uses table or index scans.
Specifically, the two situations that can encounter poor browsing performance are a folder that contains many subfolders and a folder that contains many filed documents or custom objects. Correspondingly, the two primary database optimizer issues for those cases are skew on the container.parent_container_id index and skew on the relationship.tail_id index. To check for skew, run the following queries:
```sql
select count(*), parent_container_id from container group by parent_container_id order by 1 desc;

select count(*), tail_id from relationship group by tail_id order by 1 desc;
```
The output shows the row count for each value of the grouping column, which makes any skew observable. If more than half of the rows have the same tail_id or parent_container_id value, that index is heavily skewed. If the query plan for the operation shows a full table scan rather than the use of the relevant index, then the cause is most likely index skew.
For more information, see the technote Enabling DB2® Index Use with Skewed Data in the FileNet Content Platform Engine.
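On DB2, a common approach is to gather distribution statistics on the skewed column so that the optimizer can see the uneven value counts. A minimal sketch, assuming the P8USER schema (the technote describes the supported procedure):

```sql
-- Gather distribution statistics on the skewed column and refresh
-- the index statistics (the P8USER schema name is an assumption)
RUNSTATS ON TABLE P8USER.RELATIONSHIP ON COLUMNS (TAIL_ID)
  WITH DISTRIBUTION AND INDEXES ALL
```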
Following are some common approaches that enable Oracle to use an index. The best approaches are listed first.
Histogram statistics enable the "bind peeking" technology (when a query plan is first compiled, the optimizer peeks at the actual data or "bind" values) to choose the best plan for the peeked values, because the value distribution is stored with the histogram statistics.
While the following statistics solution is written for the skew on relationship.tail_id, the skew on container.parent_container_id would be solved analogously.
First, compute the number of distinct values:

```sql
select count(distinct(tail_id)) as NUM_DISTINCT from relationship;
```
If NUM_DISTINCT is less than 254 (the Oracle 9i bucket maximum), then use value-based histograms by specifying the SIZE value to be greater than the number of distinct values (while allowing for growth).
For example, with 16 distinct values, you could specify SIZE 20.
Next, try the following solution to resolve the skew, assuming fewer than 20 distinct values in tail_id:
```sql
EXEC dbms_stats.gather_table_stats(ownname => 'P8User', tabname => 'DocVersion',
  estimate_percent => 20, cascade => TRUE, method_opt => 'FOR ALL INDEXED COLUMNS SIZE 20');

EXEC dbms_stats.gather_table_stats(ownname => 'P8User', tabname => 'Relationship',
  estimate_percent => 20, cascade => TRUE, method_opt => 'FOR ALL INDEXED COLUMNS SIZE 20');
```
This solution keeps the statistics uniform across the DOCVERSION and RELATIONSHIP tables. If this solution does not work, take an experimental approach to determine whether statistics can solve the problem: try estimate_percent=>100 and FOR ALL COLUMNS. If one or both work, reduce estimate_percent to determine the required threshold. Refining the statistics settings for the best performance takes trial and error.
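A minimal sketch of the experimental variant, assuming the same P8User schema as above (validate that it produces good plans first, then tune estimate_percent down):

```sql
-- Experimental: full sample and histograms on all columns; if this produces
-- good plans, reduce estimate_percent step by step to find the threshold
EXEC dbms_stats.gather_table_stats(ownname => 'P8User', tabname => 'Relationship',
  estimate_percent => 100, cascade => TRUE, method_opt => 'FOR ALL COLUMNS SIZE 20');
```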
Rather than "bind peeking", SQL Server has a similar mechanism called "parameter sniffing": at query plan compile time, the optimizer examines the parameter values. Subsequent parameterized executions of the same query form, but with different data, reuse the same query plan, which can be observed in a profiler trace as a "cache hit".
The SQL Server auto-statistics feature always generates histograms. Use the dbcc show_statistics command to see the chosen bucket ranges and the number of rows that were sampled. If skew problems occur, such as query plans that use table or index scans rather than index seeks, it might be necessary to run the create statistics and update statistics commands manually to increase the sample size. Examine the rows-sampled value in the dbcc output and increase the sample size if the query plans are not correct.
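For example, the following T-SQL sketch inspects the sampled statistics and then rebuilds them from a full scan (the index name IX_Relationship_TailId is a hypothetical placeholder):

```sql
-- Show the histogram buckets and the "Rows Sampled" value for the
-- statistics that are attached to the index (index name is hypothetical)
DBCC SHOW_STATISTICS ('relationship', IX_Relationship_TailId);

-- Rebuild the statistics from a full scan if the sample was too small
UPDATE STATISTICS relationship IX_Relationship_TailId WITH FULLSCAN;
```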
Either choice essentially freezes the statistics in place, which might permit poor plans if the data changes. Therefore, monitor performance, because newly added data can make a cached query plan stale, and the plan might benefit from new statistics. The objective is to have the smaller folders browsed first, so that the plan that is compiled from the sniffed parameter uses an index seek rather than a scan.
For manual statistics gathering, SQL Server does provide the option of specifying the exact column to be analyzed. So, tail_id can be charted at a fine granularity by itself, which avoids an expensive statistics-gathering pass over more columns than necessary.
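A minimal sketch of column-targeted statistics (the statistics name tail_id_stats is a hypothetical placeholder):

```sql
-- Create single-column statistics on the skewed column only, sampling
-- every row (the statistics name is a hypothetical placeholder)
CREATE STATISTICS tail_id_stats ON relationship (tail_id) WITH FULLSCAN;
```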