IBM FileNet P8, Version 5.2.1            

Handling a large number of folders, documents, and objects

Consider the performance tuning options that are related to handling a large number of folders, documents, and objects when you plan a FileNet® P8 system, develop efficient custom applications, or configure a client application.

Important:
These tuning options are not meant to replace any application-specific performance studies or any specific recommendations that you might have received from IBM® for FileNet P8.

Folders and subfolders

Retrieving the contents of a folder can be a memory-intensive operation. The methods that retrieve the contents of a folder return a list of all folders and objects that are contained in the folder. When the list is returned, it is stored in memory. If many containees exist, the server can run out of memory. The application can also fail or it can take an excessive amount of time to display the results. To avoid this problem, a custom application can use the search API methods to filter the list of subfolders by using a query. For example, you can define a search query to return only those folder names that start with "Case 2".
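For example, a sketch of such a filtered query in the Content Engine SQL dialect (the folder path /Cases is hypothetical, and the exact property names should be verified against your class definitions):

```sql
-- A sketch, not a definitive query: return only those subfolders of the
-- hypothetical /Cases folder whose names start with "Case 2".
SELECT This FROM Folder
WHERE This INFOLDER '/Cases'
  AND FolderName LIKE 'Case 2%'
```

A query of this form can be issued through the search APIs (for example, the SearchScope and SearchSQL classes in the Java API) so that only the matching subfolders are returned and held in memory, instead of the full containee list.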

You cannot limit the number of folders that a client application returns. Each time a user opens an object store or opens a folder, a client application retrieves a list of all the folders that are contained at that same level. The list also contains several properties for each folder. For example, when a user opens an object store, a list of all the top-level folders in that object store is retrieved for display. If the object store or folder that is opened contains an excessive number of items, both performance and usability can be adversely affected. Requiring users to browse through many folders is also not recommended for usability reasons.

Although the FileNet P8 architecture does not impose a limit on the number of folders that can be stored at any given level, you can optimize system performance by using no more than 100-200 folders at a given level. For applications that require many folders, consider implementing the search APIs and filtering to limit the results. In addition to the performance benefit, limiting the number of folders at any given level makes the application easier to use, and it provides a browsing-style interface that does not require users to enter filter conditions.

The number of folders displayed per page in a client application like Workplace is configurable by setting the site preferences.

The number of folders or subfolders that are retrieved by a client application like IBM Content Navigator is not configurable. That is, any folder that is opened through a client application like IBM Content Navigator results in the retrieval of all subfolders at that folder level, regardless of whether those folders are displayed.

Documents and custom objects

As previously noted, retrieving the contents of a folder can be a memory-intensive operation for custom applications and some client applications. Each time a user opens a folder, a client application retrieves the entire contents of the folder along with several properties for each item. If the folder contains an excessive number of documents or custom objects, system performance can be adversely affected.

Workplace provides filtering options for items other than folders. You can use the folder filter options in the site preferences and the user preferences to limit the number of documents and custom objects that a client application like Workplace retrieves. The folder filter options limit how many documents and custom objects are retrieved per folder access, with a default value of 100 per object type. If a folder contains more items than is specified by the folder filter options, you can retrieve the remaining items by entering criteria in the browse page filter folder field. Doing so retrieves only those objects that match the criteria. Overhead from the initial retrieval operation can be reduced by applying folder filtering. For example, set the filter criteria so that only document names that were added by a specific user are returned.

Possible tuning options

To minimize the effects on performance, consider the following items:

Avoiding index skew

An index skew is an uneven distribution of values in an index. For example, if 1,000,000 documents are filed in the system and half of those documents are filed in one folder, half of the rows in the relationship index have the same tail_id value (the tail_id column identifies the containing folder).

When an index skew occurs, the database might not use the index and might instead perform a full table scan. The database optimizer reasons that if too many values are the same, scanning the index and then looking up the rows takes longer than simply reading the rows in the table. The optimizer must apply this general logic because it is not passed the actual search values; Content Platform Engine uses bound parameters, so the values are not available when the plan is compiled.

Remember:
Bound parameters allow query plan reuse when the same SQL statement is executed again with different search values. They are essential for transaction processing systems, which execute queries frequently and cannot afford to have a query plan recompiled on each execution.
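As an illustration, a folder-browse query issued with a bound parameter might take the following form (the selected column is hypothetical; the point is the ? placeholder):

```sql
-- One SQL string, compiled once; the cached plan is reused for every folder.
SELECT head_id FROM relationship WHERE tail_id = ?
-- Executed with tail_id = <folder A id>, then tail_id = <folder B id>, and
-- so on, without recompiling the plan for each folder.
```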

Because the optimizer uses one general plan for all values, it has no way to determine whether the searched value is one of the few values that do not have many duplicates. If the search is in a folder that has many documents, a full table scan is slightly better for retrieving the documents. However, if the search is in a folder that has only a few documents, a table scan is much worse than an index lookup. The solutions that follow therefore aim to eliminate the table scans. A common use case is a customer who knows not to browse a large folder.

Therefore, avoid placing many objects in one folder, regardless of whether they are subfolders, documents, or custom objects. Many objects in one folder relative to the other folders creates an index skew, which can affect:
  • Browsing operations on folders
  • Browsing operations on documents and custom objects

The symptom of skew is that browsing a folder is too slow or times out because the query plan uses table or index scans.

Specifically, the situations that can encounter poor browsing performance are:
  • Opening a folder that contains many subfolders
  • Opening a folder that contains many documents or custom objects

Correspondingly, the two primary database optimizer issues encountered for those two cases are:
  • Skew on the parent_container_id index of the container table
  • Skew on the tail_id index of the relationship table

Performance diagnosis

To diagnose slow browsing operations that involve full table scans, obtain histograms by running the following queries:
  • Browsing operations on folders:

    select count(*), parent_container_id from container group by parent_container_id order by 1 desc

  • Browsing operations on documents and custom objects:

    select count(*), tail_id from relationship group by tail_id order by 1 desc

The output shows the count for each group-by column value, which makes any skew visible. If more than half of the rows share the same tail_id or parent_container_id value, the corresponding index is heavily skewed.

If the query plan for the operation shows a full table scan rather than the use of the relevant index, then the cause is most likely from index skew.

DB2 solutions for skew

For more information, see the technote Enabling DB2® Index Use with Skewed Data in the FileNet Content Platform Engine.

Oracle solutions for skew

Following are some common approaches that enable Oracle to use an index. The best approaches are listed first.

Histogram statistics store the value distribution, which enables the "bind peeking" technology (when a query plan is first compiled, the optimizer peeks at the actual data or "bind" values) to use those values and choose the best plan for them.

While the following statistics solution is written for the skew on relationship.tail_id, the skew on container.parent_container_id would be solved analogously.

First compute the number of distinct values:

select count(distinct(tail_id)) as NUM_DISTINCT from relationship;

If NUM_DISTINCT is less than 254 (the Oracle 9i bucket maximum), then use value-based histograms by specifying the SIZE value to be greater than the number of distinct values (while allowing for growth).

For example, with 16 distinct values, the size could be SIZE=20.

Next, try the following solution to solve the skew, assuming less than 20 distinct values in tail_id:

exec dbms_stats.gather_table_stats( ownname=> 'P8User', tabname=> 'DocVersion',
	estimate_percent=> 20, cascade=> TRUE, method_opt=> 'FOR ALL INDEXED COLUMNS SIZE 20');
exec dbms_stats.gather_table_stats( ownname=> 'P8User', tabname=> 'Relationship',
	estimate_percent=> 20, cascade=> TRUE, method_opt=> 'FOR ALL INDEXED COLUMNS SIZE 20');

This solution keeps the statistics uniform across the DOCVERSION and RELATIONSHIP tables. If this solution does not work, experiment to determine which statistics work: try estimate_percent=100 and FOR ALL COLUMNS. If one or both work, reduce the estimate_percent to determine the required threshold. Refining the statistics settings for the best performance takes trial and error.
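If gathering histograms on all indexed columns is too expensive, dbms_stats can also target just the skewed column. The following is a sketch that reuses the example owner name 'P8User' from above; verify the exact schema and column names in your environment:

```sql
-- A sketch: build a histogram only on the skewed tail_id column instead of
-- on all indexed columns. SIZE 20 assumes fewer than 20 distinct values.
exec dbms_stats.gather_table_stats( ownname=> 'P8User', tabname=> 'Relationship',
	estimate_percent=> 20, cascade=> TRUE,
	method_opt=> 'FOR COLUMNS TAIL_ID SIZE 20');
```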

Note:
  • Verify that the Oracle statistics job, if it is configured, does not run, because it can overwrite your experimental changes. When you find statistics settings that work, change the job to use them.
  • Execute the query (that is, the folder browse) first on a folder that does not have many documents or subfolders. The optimizer peeks at the bind values at compile time only. If it peeks at the "bad" values for a large folder that needs the table scan, the resulting plan is cached and reused even for the "good" values, the smaller folders for which an index lookup is best. By searching on the good values first, the good plan is cached and reused for all queries of that form.
    • Never browse a large folder first after gathering the statistics that were described previously. The query plan can be marked as obsolete or flushed at any time, such as when new statistics are generated either by a background job or manually. Browsing a large folder first after new statistics are gathered primes the cache with a plan, such as a table scan, that is slow for small folders.

Behavior on SQL Server

Rather than "bind peeking", SQL Server has a similar mechanism that is called "parameter sniffing". At query plan compile time, the optimizer examines the parameter values. Subsequent parameterized executions of the same query form, but with different data, reuse the same query plan, which can be observed in a profiler trace as a "cache hit".

The SQL Server auto-statistics feature always generates histograms. Use the dbcc show_statistics command to see the chosen bucket ranges and the number of rows that were sampled. If skew problems occur, such as query plans that use table or index scans rather than index seeks, it might be necessary to run the create statistics and update statistics commands manually to increase the sample size. Examine the sampled row count in the dbcc output and increase it if query plans are not correct.
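For example, a sketch of inspecting and refreshing the statistics on the relationship table; the statistics object name I_TAIL_ID is hypothetical:

```sql
-- Show the histogram bucket ranges and the number of rows that were sampled.
DBCC SHOW_STATISTICS ('Relationship', I_TAIL_ID);

-- If too few rows were sampled, rebuild the statistics with a full scan.
UPDATE STATISTICS Relationship I_TAIL_ID WITH FULLSCAN;
```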

The following two techniques can be used in SQL Server when a large folder is browsed either before or after statistics become outdated. Without them, an automatic statistics update flushes the query plan and forces a recompile on the next execution, which can result in a poor plan being cached:
  • Set the autostats option to off
  • Use the norecompute option in the create statistics command

Either choice essentially freezes the statistics in place, which might permit poor plans if the data changes. Therefore, monitor performance, because newly added data can outdate the query plan, which might benefit from new statistics. The objective is to have the smaller folders browsed first.

For manual statistics gathering, SQL Server has the option of specifying the exact column to be analyzed. For example, tail_id can be charted at a fine granularity, which avoids expensive statistics gathering on more columns than necessary.
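For example, a sketch that charts only the tail_id column; the statistics object name st_tail_id is hypothetical:

```sql
-- Build column-level statistics on tail_id only, with a full scan for accuracy.
CREATE STATISTICS st_tail_id ON Relationship (tail_id) WITH FULLSCAN;
```

Adding the norecompute option to this command would also freeze the statistics, as noted earlier, with the same monitoring obligations.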



Last updated: October 2015

© Copyright IBM Corporation 2015.