Wednesday 2 May 2012

Increemental Statistics Gathering Feature -11g

Increemental Statistics Gathering Feature -11g

Expensive global statistics collection


In data warehouse environment it is very common to do a bulk load directly into one or more empty partitions. This will make the partition statistics stale and may also make the global statistics stale. Re-gathering statistics for the effected partitions and for the entire table can be very time consuming. Traditionally, statistics collection is done in a two-pass approach:
  • In the first pass we will scan the table to gather the global statistics
  • In the second pass we will scan the partitions that have been changed to gather their partition level statistics.
The full scan of the table for global statistics collection can be very expensive depending on the size of the table. Note that the scan of the entire table is done even if we change a small subset of partitions.


In Oracle Database 11g, we avoid scanning the whole table when computing global statistics by deriving the global statistics from the partition statistics. Some of the statistics can be derived easily and accurately from partition statistics. For example, number of rows at global level is the sum of number of rows of partitions. Even global histogram can be derived from partition histograms. But the number of distinct values (NDV) of a column cannot be derived from partition level NDVs. So, Oracle maintains another structure called a synopsis for each column at the partition level. A synopsis can be considered as sample of distinct values. The NDV can be accurately derived from synopses. We can also merge multiple synopses into one. The global NDV is derived from the synopsis generated by merging all of the partition level synopses. To summarize

  1. Gather statistics and create synopses for the changed partitions only
  2. Oracle automatically merges partition level synopses into a global synopsis
  3. The global statistics are automatically derived from the partition level statistics and global synopses


Incremental maintenance feature is disabled by default. It can be enabled by changing the INCREMENTAL table preference to true. It can also be enabled for a particular schema or at the database level. If you are interested in more details of the incremental maintenance feature

Assume we have table called SALES that is range partitioned by day on the SALES_DATE column. At the end of every day data is loaded into latest partition and partition statistics are gathered. Global statistics are only gathered at the end of every month because gathering them is very time and resource intensive. Use the following steps in order to maintain global statistics after every load.
1 -Turn on incremental feature for the table. 

EXEC DBMS_STATS.SET_TABLE_PREFS('SH','SALES','INCREMENTAL','TRUE');

2 -At the end of every load gather table statistics using GATHER_TABLE_STATS command. You don't need to specify the partition name. Also, do not specify the granularity parameter. The command will collect statistics for partitions where data has change or statistics are missing and update the global statistics based on the partition level statistics and synopsis.
EXEC DBMS_STATS.GATHER_TABLE_STATS('SH','SALES');

Note: that the incremental maintenance feature was introduced in Oracle Database 11g Release 1. However, we also provide a solution in Oracle Database10g Release 2 (10.2.0.4) that simulates the same behavior. The 10g solution is a new value, 'APPROX_GLOBAL AND PARTITION' for the GRANULARITY parameter of the GATHER_TABLE_STATS procedures. It behaves the same as the incremental maintenance feature except that we don't update the NDV for non-partitioning columns and number of distinct keys of the index at the global level. For partitioned column we update the NDV as the sum of NDV at the partition levels. Also we set the NDV of columns of unique indexes as the number of rows of the table. In general, non-partitioning column NDV at the global level becomes stale less often. It may be possible to collect global statistics less frequently then the default (when table changes 10%) since approx_global option maintains most of the global statistics accurately.

Let's take a look at an example to see how you would effectively use the Oracle Database 10g approach.
After the data load is complete, gather statistics using DBMS_STATS.GATHER_TABLE_STATS for the last partition (say SALES_11FEB2009), specify granularity => 'APPROX_GLOBAL AND PARTITION'. It will collect statistics for the specified partition and derive global statistics from partition statistics (except for NDV as described before).

EXEC DBMS_STATS.GATHER_TABLE_STATS ('SH', 'SALES', 'SALES_11FEB2009', GRANULARITY => 'APPROX_GLOBAL AND PARTITION');
It is necessary to install the one off patch for bug 8719831 if you are using the above features in 10.2.0.4 (patch 8877245) or in 11.1.0.7 (patch 8877251)


Let’s take the ORDERS2 table, which is partitioned by month on order_date.  We will begin by enabling incremental statistics for the table and gathering statistics on the table.


After the statistics gather the last_analyzed date for the table and all of the partitions now show 13-Mar-12.

And we now have the following column statistics for the ORDERS2 table.

We can also confirm that we really did use incremental statistics by querying the dictionary table sys.HIST_HEAD$, which should have an entry for each column in the ORDERS2 table.

So, now that we have established a good baseline, let’s move on to the DML. Information is loaded into the latest partition of the ORDERS2 table once a month. Existing orders maybe also be update to reflect changes in their status. Let’s assume the following transactions take place on the ORDERS2 table this month.

After these transactions have occurred we need to re-gather statistic since the partition ORDERS_MAR_2012 now has rows in it and the number of distinct values and the maximum value for the STATUS column have also changed.

Now if we look at the last_analyzed date for the table and the partitions, we will see that the global statistics and the statistics on the partitions where rows have changed due to the update (ORDERS_FEB_2012) and the data load (ORDERS_MAR_2012) have been updated.

The column statistics also reflect the changes with the number of distinct values in the status column increase to reflect the update.



Note : Above info is referenced from Oracle blogs and Oracle Manuals.

2 comments:

  1. Great feature and great explanation as well!!! really gud...thnx for this..

    ReplyDelete
    Replies
    1. Thanks Mate, please share your experiences on stats gather as well, to make it more interactive!!1

      Delete