Thursday 6 December 2012

Histogram


Histograms


Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. As described above, the Optimizer calculates the cardinality for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate. If the data distribution in that column is not uniform (i.e., a data skew) then the cardinality estimate will be incorrect. In order to accurately reflect a non-uniform data distribution, a histogram is required on the column. The presence of a histogram changes the formula used by the Optimizer to estimate the cardinality, and allows it to generate a more accurate execution plan. Oracle automatically determines the columns that need histograms based on the column usage information (SYS.COL_USAGE$), and the presence of a data skew. For example, Oracle will not automatically create a histogram on a unique column if it is only seen in equality predicates. There are two types of histograms, frequency or height-balanced. Oracle determines the type of

histogram to be created based on the number of distinct values in the column.

Frequency Histograms


Frequency histograms are created when the number of distinct values in the column is less than 254. Oracle uses the following steps to create a frequency histogram.

1. Let’s assume that Oracle is creating a frequency histogram on the PROMO_CATEGORY_ID column of the PROMOTIONS table. The first step is to select the PROMO_CATEGORY_ID from  the PROMOTIONS table ordered by PROMO_CATEGORY_ID.
2. Each PROMO_CATEGORY_ID is then assigned to its own histogram bucket3.

PROMO_CATEGORY_ID
2                                             Bucket 1 has endpoint 2
3                                             Bucket 2 has endpoint 3
3                                             Bucket 3 has endpoint 3
--------------                          -------------------------
3                                             Bucket 115 has endpoint 3
--------------                          -------------------------
9                                             Bucket 417 has endpoint 9
--------------                           -------------------------
9                                             Bucket 483 has endpoint 9
--------------                           -------------------------
10                                             Bucket 502 has endpoint 10
10                                             Bucket 503 has endpoint 10

3.At this stage we could have more than 254 histogram buckets, so the buckets that hold the same value are then compressed into the highest bucket with that value. In this case, buckets 2 through 115 are compressed into bucket 115, and buckets 484 through 503 are compressed into bucket 503, and so on until the total number of buckets remaining equals the number of  distinct values in the column Note the above steps are for illustration purposes. The DBMS_STATS package has been optimized to build compressed histograms directly

PROMO_CATEGORY_ID
2                                             Bucket 1 has endpoint 2
3                                             Bucket 3 has endpoint 3
9                                             Bucket 483 has endpoint 9
10                                             Bucket 503 has endpoint 10

4. The Optimizer now accurately determines the cardinality for predicates on the PROMO_CATEGORY_ID column using the frequency histogram. For example, for the predicate PROMO_CATEGORY_ID =10, the Optimizer would first need to determine how many buckets in the histogram have 10 as their end point. It does this by finding the bucket whose endpoint
is 10, bucket 503, and then subtracts the previous bucket number, bucket 483, 503 - 483 = 20.
Then the cardinality estimate would be calculated using the following formula
(number ofbucket endpoints / total number of bucket) X NUM_ROWS,
20/503 X 503,
so the number of rows in the PROMOTOINS table where PROMO_CATEGORY_ID =10 is 20.


Height balanced Histograms


Height-balanced histograms are created when the number of distinct values in the column is greater than 254. In a height-balanced histogram, column values are divided into buckets so that each bucke contains approximately the same number of rows. Oracle uses the following steps to create a heightbalanced histogram.


1. Let’s assume that Oracle is creating a height-balanced histogram on the CUST_CITY_ID
column of the CUSTOMERS table because the number of distinct values in the CUST_CITY_ID
column is greater than 254. Just like with a frequency histogram, the first step is to
select the CUST_CITY_ID from the CUSTOMERS table ordered by CUST_CITY_ID.
2. There are 55,500 rows in the CUSTOMERS table and there is a maximum of 254 buckets in a
histogram. In order to have an equal number of rows in each bucket, Oracle must put 219
rows in each bucket. The 219th CUST_CITY_ID from step one will become the endpoint for
the first bucket. In this case that is 51043. The 438th CUST_CITY_ID from step one will
become the endpoint for the second bucket, and so on until all 254 buckets are filled

ROW_COUNT        CUST_CITY_ID
1                                             51040
----                                             ---------                    ------------------------------219                                             51043                        Bucket 1 has endpoint 51043
5256                                           51166                        Bucket 24 has endpoint 51166
5475 5                                        1166                          Bucket 25 has endpoint 51166
55500                                         52631                        Bucket 254 has endpoint 52631
 
3. Once the buckets have been created Oracle checks to see if the endpoint of the first bucket is the minimum value for the CUST_CITY_ID column. If it is not, a "zero" bucket is added to the histogram that has the minimum value for the CUST_CITY_ID column as its end point.

ROW_COUNT                 CUST_CITY_ID
1                                             51040                            Bucket 0 has endpoint 51040
----                                         ---------                         ------------------------------
219                                          51043                          Bucket 1 has endpoint 51043
5256                                        51166                          Bucket 24 has endpoint 51166
5475                                        51166                          Bucket 25 has endpoint 51166
55500                                      52631                          Bucket 254 has endpoint 52631

4. Just as with a frequency histogram, the final step is to compress the height-balanced histogram, and remove the buckets with duplicate end points. The value 51166 is the end Understanding Optimizer Statistics point for bucket 24 and bucket 25 in our height-balanced histogram on the CUST_CITY_ID column. So, bucket 24 will be compressed in bucket 25

ROW_COUNT                 CUST_CITY_ID
1                                             51040                             Bucket 0 has endpoint 51040
----                                         ---------                          ------------------------------- 
219                                          51043                            Bucket 1 has endpoint 51043
5475                                        51166                            Bucket 25 has endpoint 51166
55500                                      52631                            Bucket 254 has endpoint 52631

5. The Optimizer now computes a better cardinality estimate for predicates on the CUST_CITY_ID column by using the height-balanced histogram. For example, for the predicate CUST_CITY_ID =51806, the Optimizer would first check to see how many buckets in the histogram have 51806 as their end point. In this case, the endpoint for bucket 136,137,138 and 139 is 51806(info found in USER_HISTOGRAMS). The Optimizer then uses the following formula .However, if the predicate was CUST_CITY_ID =52500, which is not the endpoint for any bucket then the Optimizer uses a different formula. For values that are the endpoint for only one bucket or are not an endpoint at all, the Optimizer uses the following formula:
DENSITY X number of rows in the table
where DENSITY is calculated ‘on the fly’ during optimization using an internal formula based  on information in the histogram. The value for DENSITY seen in the dictionary view
USER_TAB_COL_STATISTICS is not the value used by the Optimizer from Oracle Database 10.2.0.4 onwards. This value is recorded for backward compatibility, as this is the value used  in Oracle Database 9i and earlier releases of 10g. Furthermore, if the parameter  OPTIMIZER_FEATURES_ENABLE is set to version release earlier than 10.2.0.4, the value for  DENSITY in the dictionary view will be used.

No comments: