Saturday, 5 May 2012

Oracle Join Nested/Hash/Sort Merge- Performance Tuning

Normally we have three types of  Oracle join:

1. Nested - Loop Join
2. Hash Join
3. Sort Merge Join

I will discuss about characteristics of these three joins today:

1. Nested Loop Joins

Nested loop joins are useful when small subsets of data are being joined and if the join condition is an efficient way of accessing the second table.
It is very important to ensure that the inner table is driven from (dependent on) the outer table. If the inner table's access path is independent of the outer table, then the same rows are retrieved for every iteration of the outer loop, degrading performance considerably. In such cases, hash joins joining the two independent row sources perform better.
See Also:

A nested loop join involves the following steps:
  1. The optimizer determines the driving table and designates it as the outer table.
  2. The other table is designated as the inner table.
  3. For every row in the outer table, Oracle accesses all the rows in the inner table. The outer loop is for every row in the outer table and the inner loop is for every row in the inner table. The outer loop appears before the inner loop in the execution plan, as follows:
    NESTED LOOPS 
      outer_loop 
      inner_loop 
    

1.1 When the Optimizer Uses Nested Loop Joins

The optimizer uses nested loop joins when joining small number of rows, with a good driving condition between the two tables. You drive from the outer loop to the inner loop, so the order of tables in the execution plan is important.
The outer loop is the driving row source. It produces a set of rows for driving the join condition. The row source can be a table accessed using an index scan or a full table scan. Also, the rows can be produced from any other operation. For example, the output from a nested loop join can be used as a row source for another nested loop join.
The inner loop is iterated for every row returned from the outer loop, ideally by an index scan. If the access path for the inner loop is not dependent on the outer loop, then you can end up with a Cartesian product; for every iteration of the outer loop, the inner loop produces the same set of rows. Therefore, you should use other join methods when two independent row sources are joined together.

1.2 Nested Loop Join Hints

If the optimizer is choosing to use some other join method, you can use the USE_NL(table1 table2) hint, where table1 and table2 are the aliases of the tables being joined.


2 Hash Joins

Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows.
This method is best used when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.

2.1 When the Optimizer Uses Hash Joins

The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the following conditions are true:
  • A large amount of data needs to be joined.
  • A large fraction of a small table needs to be joined.

2.2 Hash Join Hints

Apply the USE_HASH hint to instruct the optimizer to use a hash join when joining two tables together.

3. Sort Merge Joins

Sort merge joins can be used to join rows from two independent sources. Hash joins generally perform better than sort merge joins. On the other hand, sort merge joins can perform better than hash joins if both of the following conditions exist:
  • The row sources are sorted already.
  • A sort operation does not have to be done.
However, if a sort merge join involves choosing a slower access method (an index scan as opposed to a full table scan), then the benefit of using a sort merge might be lost.
Sort merge joins are useful when the join condition between two tables is an inequality condition (but not a nonequality) like <, <=, >, or >=. Sort merge joins perform better than nested loop joins for large data sets. You cannot use hash joins unless there is an equality condition.
In a merge join, there is no concept of a driving table. The join consists of two steps:
  1. Sort join operation: Both the inputs are sorted on the join key.
  2. Merge join operation: The sorted lists are merged together.
If the input is already sorted by the join column, then a sort join operation is not performed for that row source. However, a sort merge join always creates a positionable sort buffer for the right side of the join so that it can seek back to the last match in the case where duplicate join key values come out of the left side of the join.

3.1 When the Optimizer Uses Sort Merge Joins

The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:
  • The join condition between two tables is not an equi-join.
  • Because of sorts already required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.

3.2 Sort Merge Join Hints

To instruct the optimizer to use a sort merge join, apply the USE_MERGE hint. You might also need to give hints to force an access path.
There are situations where it is better to override the optimizer with the USE_MERGE hint. For example, the optimizer can choose a full scan on a table and avoid a sort operation in a query. However, there is an increased cost because a large table is accessed through an index and single block reads, as opposed to faster access through a full table scan.


Note : if we consider on the broader look Nested Loop join is performed on small tables with index on driven (inner) column will add edge to it, on the other hand Hash Join is used on Large tables with no indexes and use pga for preparing hash table. Sort Merge join is used in case of medium sized tables.

Above note is referenced from Oracle Internals and manuals.


Example:

SQL> conn hr/*****
Connected.
SQL> create table e as select * from emp;
Table created.
SQL> create table d as select * from dept;
Table created.
SQL> create index e_deptno on e(deptno);
Index created.
Gather D stats as it is
SQL> exec dbms_stats.gather_table_stats('hr','D')
PL/SQL procedure successfully completed.

Set artificial stats for E:
SQL> exec dbms_stats.set_table_stats(ownname => 'hr', tabname => 'E', numrows => 100, numblks => 100, avgrlen => 124);
PL/SQL procedure successfully completed.

Set artificial stats for E_DEPTNO index
SQL> exec dbms_stats.set_index_stats(ownname => 'hr', indname => 'E_DEPTNO', numrows => 100, numlblks => 10);
PL/SQL procedure successfully completed.

Check out the plan:
A) With less number of rows(100 in E), you will see Nested loop getting used.

SQL> select e.ename,d.dname from e, d where e.deptno=d.deptno;
Execution Plan
----------------------------------------------------------
Plan hash value: 3204653704
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 2200 | 6 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| E | 25 | 225 | 1 (0)| 00:00:01 |
| 2 | NESTED LOOPS | | 100 | 2200 | 6 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | D | 4 | 52 | 3 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | E_DEPTNO | 33 | | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------

B) Let us set some more artificial stats to see which plans is getting used:

SQL> exec dbms_stats.set_table_stats(ownname => 'hr', tabname => 'E', numrows => 1000000, numblks => 10000, avgrlen => 124);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_index_stats(ownname => 'hr', indname => 'E_DEPTNO', numrows => 1000000, numlblks => 1000);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_table_stats(ownname => 'hr', tabname => 'D', numrows => 1000000,numblks => 10000 , avgrlen => 124);
PL/SQL procedure successfully completed.

Now we have 1000000 number of rows in E and D table both and index on E(DEPTNO) reflects the same.
Plans changes !!
SQL> select e.ename,d.dname from e, d where e.deptno=d.deptno;
Execution Plan
----------------------------------------------------------
Plan hash value: 51064926
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 250G| 5122G| | 3968K(100)| 13:13:45 |
|* 1 | HASH JOIN | | 250G| 5122G| 20M| 3968K(100)| 13:13:45 |
| 2 | TABLE ACCESS FULL| E | 1000K| 8789K| | 2246 (3)| 00:00:27 |
| 3 | TABLE ACCESS FULL| D | 1000K| 12M| | 2227 (2)| 00:00:27 |
-----------------------------------------------------------------------------------

C) Now to test MERGE JOIN, we set moderate number of rows and do some ordering business.
SQL> exec dbms_stats.set_table_stats(ownname => 'hr', tabname => 'E', numrows => 10000, numblks => 1000, avgrlen => 124);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_index_stats(ownname => 'hr', indname => 'E_DEPTNO', numrows => 10000, numlblks => 100);
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.set_table_stats(ownname => 'hr', tabname => 'D', numrows => 1000, numblks => 100, avgrlen => 124);
PL/SQL procedure successfully completed.
SQL> select e.ename,d.dname from e, d where e.deptno=d.deptno order by e.deptno;
Execution Plan
----------------------------------------------------------
Plan hash value: 915894881
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2500K| 52M| 167 (26)| 00:00:02 |
| 1 | MERGE JOIN | | 2500K| 52M| 167 (26)| 00:00:02 |
| 2 | TABLE ACCESS BY INDEX ROWID| E | 10000 | 90000 | 102 (1)| 00:00:02 |
| 3 | INDEX FULL SCAN | E_DEPTNO | 10000 | | 100 (0)| 00:00:02 |
|* 4 | SORT JOIN | | 1000 | 13000 | 25 (4)| 00:00:01 |
| 5 | TABLE ACCESS FULL | D | 1000 | 13000 | 24 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

 

No comments:

Post a Comment