Difference between NOT IN, LEFT JOIN and NOT EXISTS.
The objective is to fetch all records in one table that are not present in another table. The most common code I’ve seen at client sites includes the use of NOT IN, because this keyword is included in most programming languages and programmers tend to use this technique when writing stored procedures in the database too. The code example I have found in most cases is this simple and understandable
SELECT a.iFROM #a AS aWHERE a.i NOT IN (SELECT b.j FROM #b AS b)
OR a.i IS NULL
The basic idea is to get all records from table #a where the value in column i is either NULL or not present in column j of table #b. What basically happens behind the scene is that the NOT IN part creates a list of values and stores them in a temporary table and then matches the values from column i in table #a against this temporary table. If there is not a match, or value from table #a is NULL, the column value is valid and returned to query.
There are at least three other ways to do this, which all are much more efficient!
SELECT a.iFROM #a AS aWHERE a.i NOT IN (SELECT b.j FROM #b AS b)
UNION ALL
SELECT a.iFROM #a AS aWHERE a.i IS NULL
SELECT a.iFROM #a AS aLEFT JOIN #b AS b ON b.j = a.iWHERE b.j IS NULL
SELECT a.iFROM #a AS aWHERE NOT EXISTS (SELECT * FROM #b AS b
WHERE b.j = a.i)
Running these four queries simultaneously gives exactly the same result.
Original query takes 95% of the batch, second query (first alternative with UNION ALL) takes 2%, and the last two takes only 1% each!
This is the average profiler results for the four queries after 100 executions each
CPU | Reads | Writes | Duration | Rows | |
Query 1 | 1 767 | 26 292 | 0 | 1 772 | 2 193 |
Query 2 | 6 | 57 | 0 | 6 | 2 193 |
Query 3 | 6 | 24 | 0 | 6 | 2 193 |
Query 4 | 3 | 24 | 0 | 3 | 2 193 |
As you can see, all three alternative approaches are superior to the NOT IN approach!
Next question is obviously “Which approach to use”?
- Second query can be the best choice if there are few records because there are now 5 table scans compared to two table scans in the original query.
- Third query is fast, but can be slow if table #b has many more records than table #a.
- Fourth query will always perform fast because the EXISTS parts ends directly when first matching record is found. But if the second table #b is a mix of more JOINS this will not perform that fast.
It seems that LEFT JOIN and NOT EXISTS are both superior to NOT IN.
Which one of the two approaches you choose, must be decided carefully for every query needing this technique.
Which one of the two approaches you choose, must be decided carefully for every query needing this technique.
Also remember that if table #b were not a temporary table, SQL Server would have places a lot of locks on that table.