Join

Read Part 2

IBM DB2 e-kit for Database Professionals

Learn how easy it is to get trained and certified for DB2 for Linux, UNIX, and Windows with the IBM DB2 e-kit for Database Professionals. Register now, and expand your skills portfolio, or extend your DBMS vendor support to include DB2.
If there is one SQL construct that I believe has generated the most confusion since its overhaul in DB2 for OS/390® Version 6, it would have to be outer joins.
Version 6 expanded the capabilities for coding predicates within the ON clause, as well as introducing a host of other optimization and query rewrite enhancements. Enhancing the syntax has definitely increased the potential usage of outer joins, but this also means that there is more to understand. The syntax too has been aligned much more closely with its cousins on the UNIX®, Linux, Windows®, and OS/2® platforms, making it easier to be consistent with your SQL coding across the DB2 family.
In this article, which consists of two parts, I attempt to assemble a guide for coding outer joins to achieve two goals:
  • The most important goal, obtaining the correct result.
  • Secondly, consideration of the performance implications of coding your predicates in different ways.


Part 1 covers the simpler constructs of outer joins, providing a simple comparison of the effect of coding predicates in the ON or WHERE clause. In Part 2, I will cover the more complex topics such as outer join simplification and nesting of outer joins.
The examples in this article use extracts from the DB2 Universal Database (UDB) (non-OS/390) sample database. The data (in Figure 1) is a subset of the full tables. To cater for all outer join combinations, the row with PROJNO = 'IF2000' in the project table has been updated to set the DEPTNO = 'E01'.
For z/OS® and OS/390 users, the table names differ:
DB2 on Workstation table names DB2 for OS/390 and z/OS table names
EMPLOYEE EMP
DEPARTMENT DEPT
PROJECT PROJ


   
   
   

Back to top
Inner to outer joins


Inner joins
For an inner join (or simple table join), only matched rows based on the join predicates are included in the result. Therefore, unmatched rows are not included.
In Figure 2, when joining the Project and Department tables on the DEPTNO column, the row with DEPTNO = 'E01' in the Project (left) table does not have a matched row in the Department table, and is therefore not returned in the result. Similarly, the row with DEPTNO = 'A01' in the department (right) table is also unmatched and not returned.
   
   
   
This example uses the"explicit join" syntax, whereby the keywords"INNER JOIN" (or simply JOIN) are coded between the joined tables. The join predicates are coded in the ON clause. Although this is not mandatory syntax for inner joins, it is for outer joins, and it is therefore good programming practice for consistency. There are a number of other reasons to consider this syntax:
  • It is more descriptive than simply coding a"comma" in the FROM clause to separate tables. This is important as queries become larger.
  • It forces the join predicates (ON clause) to be coded after each join, which means you are less likely to forget to code join predicates.
  • It is easy to determine which join predicates belong to what tables.
  • An inner join can be easily converted to an outer join if required.


And, finally, on the subject of inner joins, people often ask me: "Does it matter in what order I code my tables in the FROM clause?" For retrieving the correct result, the answer is "no." For performance, the answer is "generally, no." The DB2 optimizer evaluates all possible join permutations (sequences) and selects the most efficient one. However, to quote the DB2 UDB for OS/390 and z/OS Administration Guide: "The order of tables or views in the FROM CLAUSE can affect the access path." My interpretation of this statement is that if two (or more) different join sequences equate to the same cost, then the tie-breaker may be the table order in the FROM clause.
Outer join table classification
Before exploring outer join examples, it is important to first understand how we classify tables in the join.
Tables in the FROM clause of an outer join can be classified as either preserved row or NULL- supplying. The preserved row table refers to the table that preserves rows when there is no match in the join operation. Therefore, all rows from the preserved row table that qualify against the WHERE clause will be returned, regardless of whether there is a matched row in the join.
The preserved row table is:
  • The left table in a left outer join.
  • The right table in a right outer join.
  • Both tables in a full outer join.


The NULL-supplying table supplies NULLs when there is an unmatched row. Any column from the NULL- supplying table referred to in the SELECT list or subsequent WHERE or ON clause will contain NULL if there was no match in the join operation.
The NULL-supplying table is:
  • The right table in a left outer join
  • The left table in a right outer join
  • Both tables in a full outer join.


In a full outer join, both tables can preserve rows, and also can supply NULLs. This is significant, because there are rules that apply to purely preserved row tables that do not apply if the table can also supply NULLs.
The order of coding tables in the FROM clause can have extreme significance for left and right outer joins -- and also for outer joins involving more than two tables -- because preserved row and NULL supplying tables behave differently when there is an unmatched row in the join.
Left outer join
Figure 3 shows a simple left outer join.
   
   
   
The left outer join returns those rows that exist in the left table and not in the right table (DEPTNO = 'E01'), plus the inner join rows. Unmatched rows are preserved from the preserved row table and are supplied with NULLs from the NULL-supplying table. That is, when the row is unmatched with the right table (DEPTNO = 'E01'), then the DEPTNO value is NULL-supplied from the Department table.
Note the select list includes DEPTNO from both the preserved row and NULL-supplying table. From the output you can see that it is important to select columns from the preserved row table where possible, otherwise the column value may not exist.
Right outer join
   
   
   
The right outer join returns those rows that exist in the right table and not in the left table (DEPTNO = 'A00'), plus the inner join rows. Unmatched rows are preserved from the preserved row table and are supplied with NULLs from the NULL-supplying table.
For a right outer join, the right table becomes the preserved row table, and the left table is the NULL-supplying table. The DB2 for OS/390 and z/OS optimizer rewrites all right outer joins to become left outer joins, by simply inverting the tables in the FROM clause and by changing the keyword RIGHT to LEFT. This query rewrite can only be seen by the presence of the value"L" in the JOIN_TYPE column of the plan table. For this reason, you should avoid coding right outer joins to avoid confusion when you are interpreting the access path in the plan table.
Full outer joins
   
   
   
The full outer join returns those rows that exist in the right and not in the left (DEPTNO = 'A00'), plus the rows that exist in the left table and not in the right table (DEPTNO = 'E01'), and the inner join rows.
Both tables can supply NULLs but also preserve rows. However, the tables are identified as NULL-supplying because there are"query rewrite" and"WHERE clause predicate evaluation" rules that apply separately to NULL-supplying and to preserved row tables. I'll describe more about these differences in later examples.
In this example, both join columns have been selected to show that either table can supply NULL for unmatched rows.
To ensure that a non-NULL is always returned, code the COALESCE, VALUE, or IFNULL clause, which returns the first argument that is not NULL, as shown here:COALESCE(P.DEPTNO,D.DEPTNO)

Back to top
Outer join predicate types


In releases before DB2 for OS/390 Version 6, predicates could be only applied before the join, or totally after the join. Version 6 introduced the concepts of"during-join" predicates and"after-join-step" predicates.
DB2 can apply before-join predicates before the join to delimit the number of rows that are joined to subsequent tables. These"local", or"table access," predicates are evaluated as regular indexable, stage 1 or stage 2 predicates on the outer table of a pairwise join. Pairwise join is the term used to describe each join step of two or more tables. For example, a row from table 1 and table 2 is joined, and the result is joined to table 3. Each join only joins rows from two tables at a time.
During-join predicates are those coded in the ON clause. For all but full outer joins, these predicates can be evaluated as regular indexable, stage 1 or stage 2 predicates (similar to before-join predicates) on the inner table of a pairwise nested loop or hybrid join. For a full outer join, or any join using a merge scan join, these predicates are applied at stage 2, where the physical joining of rows occurs.
After-join-step predicates can be applied between joins. These are applied after the join in which all columns of the where clause predicate become available (simple or complex predicate separated by OR), and before any subsequent joins.
Totally-after-join predicates are dependent on all joins occurring before they can be applied.
Before-join predicates
Before Version 6 DB2 for OS/390, DB2 had a limited ability to push down WHERE clause predicates for application before the join. Therefore, to ensure a where clause predicate was applied before the join, you had to code the predicate in a nested table expression. This not only added complexity to achieve acceptable performance, but the nested table expression required the additional overhead of materializing the result before the join.
   
   
   
From Version 6 onwards, DB2 can merge the nested table expression into a single query block, and thus avoids any unnecessary materialization. DB2 aggressively merges any nested table expression based upon the standard materialization rules listed in the Administration Guide or the Application Programming and SQL Guide.
Instead of coding these in nested table expressions, the predicates can now be coded in the WHERE clause as shown in Figure 7.
   
   
   
The rule for before-join predicates coded in a WHERE clause is that they must apply to the preserved row table only; or to be more specific, the WHERE clause must not apply to a NULL-supplying table. This means you no longer have to code these predicates in nested table expressions.
For a full outer join, neither table can be identified as only being a preserved row table, and of course, both are NULL-supplying. For NULL-supplying tables, the risks of coding predicates in the WHERE clause are that they will either be applied totally after the join or will cause simplification of the outer join (which I will talk about in Part 2). To apply the predicates before the join, you must code them in nested table expressions as shown in Figure 8.
   
   
   
Because they limit the number of rows that will be joined, before-join predicates are the most efficient of the predicate types described here. If you begin with a 5-million row table, which returns one row after the WHERE clause is applied, it is obviously more efficient to apply the predicate before joining the one row. The other alternative, which is not efficient, is to join 5 million rows, and then apply the predicate to produce a result of one row.
During-join predicates
Coding join predicates in the ON clause is mandatory for outer joins. In DB2 for OS/390 Version 6 and later, you can also code expressions, or"column-to-literal" comparisons (such as DEPTNO = 'D01',) in the ON clause. However, coding expressions in the ON clause can produce very different results from those same expressions coded in a WHERE clause.
This is because predicates in the ON clause, or during-join predicates, do not limit the result rows that are returned; they only limit which rows are joined. Only WHERE clause predicates limit the number of rows of rows that are actually retrieved.
Figure 9 demonstrates the result of coding an expression in the ON clause of a left outer join. This is not the result expected by most people when coding this type of query.
   
   
   
In this example, because there are no WHERE clause predicates to limit the result, all rows of the preserved row (left) table are returned. But the ON clause dictates that the join only occurs when both P.DEPTNO = D.DEPTNO AND P.DEPTNO = 'D01'. When the ON clause is false (that is, P.DEPTNO <> 'D01'), then the row is supplied NULLs for those columns selected from the NULL-supplying table. Similarly, when P.DEPTNO is 'E01', then the first element of the ON clause fails, and the row from the left table is preserved, and the null is supplied from the right table.
When DB2 accesses the first table and determines that the ON clause will fail (such as when P.DEPTNO <> 'D01'), then to improve performance, DB2 immediately supplies NULL for the NULL-supplying table columns without even attempting to join the row.
Now let's talk about during- join predicates for a full outer join. The rules for the ON clause are the same for full joins as for left and right outer joins: the predicates in the ON clause do not limit the resultant rows which are returned, only which rows are joined.
For the example in Figure 10, because there are no WHERE clause predicates to limit the result and because both tables of a FULL JOIN preserve rows, then all rows of the left and right tables are returned. But the ON clause dictates that the join only occurs when P.DEPTNO = 'D01'. When the ON clause is false (that is,    P.DEPTNO <> 'D01'), then the row is supplied NULLs for those columns selected from the opposite table to the table whose row is being preserved.
Note: This syntax is non-OS/390 only because OS/390 does not permit expressions in the ON clause of a full outer join.
   
   
   
To simulate having the non-OS/390 comply with OS/390 DB2 syntax, then we must first derive the expression as a column within a nested table expression, and then perform the join. By first deriving the column DEPT2 as 'D01' in Figure 11, the ON clause effectively becomes a join only when P.DEPTNO = 'D01'.
   
   
   
After-join predicates
Figure 12 contains a query with both after-join-step and totally-after-join predicates.
   
   
   
The first compound predicate in the WHERE clause refers only to tables D and E (D.MGRNO = E.EMPNO OR E.EMPNO IS NULL). Therefore, if the join sequence chosen by the optimizer mimics the coding of the SQL, then DB2 can apply the WHERE clause predicate after the join between D and E, and before the join to P. However, the second compound predicate in the WHERE clause refers to tables D and P (D.MGRNO = P.RESPEMP OR P.RESPEMP IS NULL). These are the first and third tables in the join sequence. This predicate cannot be applied therefore until the third table is joined, which is the final table in the join sequence. Hence this is referred to as a totally-after-join predicate.
It is likely that an after-join-step predicate may revert to a totally-after-join predicate if the table join sequence alters, which is possible given that the DB2 OS/390 optimizer can reorder the table join sequence based upon the lowest cost access path. Given that DB2 is able to apply the predicate as early as possible in between joins to limit the rows required for subsequent joins, then you should also attempt to code your predicates such that DB2 is able to apply them as early in the join sequence as possible.

Back to top
Conclusion
In this article, I described several topics:
  • The order of tables in the FROM clause and the effect on inner and outer joins
  • The differences between these type of joins
  • The different predicate types.


To recap, WHERE clause predicates that are applied to the preserved row table can filter rows as either:
  • Before-join predicates
  • After-join-step or totally-after-join predicates.


If these predicates are currently coded in a nested table expression, you can now write them in the WHERE clause. Before-join predicates are the most efficient predicates, because they limit the number of rows before the join. After-join-step predicates also limit the number of rows for subsequent joins. Totally-after-join predicates are the least efficient, since filtering occurs completely after all joins have taken place.
Predicates in the ON clause are the biggest surprise, because they only filter rows on the NULL-supplying table as during-join predicates. They do not filter rows on the preserved row table, as WHERE clause predicates do.
In Part 2 of this article, I will describe what happens if WHERE clause predicates are coded against the NULL-supplying table.
I hope that this article has given you some insight into outer joins and has given you some clues on how to solve the mystery of where to code your outer join predicates.