Joins Algorithms
Prev: sorting—aggregations Next: query-execution-i
Joins
For a tuple , and a tuple that match on join attributes, the join operator concatenates and together into a new output tuple.
This can be done a few ways:
- Copy all the data: This copies all the data from all matching tuples in and into an intermediate result table. This has the advantage that the query never needs to go back to either table, but it also potentially wastes a lot of memory.
- Record Ids: Only copy the ids of matching tuples, and fetch their data later. This is called late materialization.
Cost Analysis
The cost analysis metric will be the number of disk I/Os used to compute the join.
Set as the pages in , and as the pages in .
Nested Loop Join
This algorithm involves using two for loops, with the smaller table on the outer loop, and the larger table in the inner loop.
Simple Nested Loop Join The regular algorithm. Cost:
Block Nested Loop Join For each block in the outer table, fetch each block from the inner table, and work on each pair of blocks. This is no faster in terms of comparisons, but less costly in terms of I/O. Cost: .
Index Nested Loop Join If the inner table has an index, we can use that. Assume that each index lookup costs a constant, . Cost: .
Sort-Merge Join
First sort the tuples on the join key, and then merge them where they match.
Sort Cost for : Sort Cost for : Merge Cost: Cost: Sort + Merge
Hash Join
Basic Hash Join Hash each item in and on the join key. Then, for each bucket of a hash function that matches, check the members of each hash bucket in and .
This has a variable cost, depending on the amount of collisions, but is .
Grace Hash Join When hash tables do not fit in memory, swapping to disk would be expensive. Thus, this algorithm hashes the inner table into partitions that are written out to disk as well.
- Build: Scan both the outer and inner tables and populate a hash table using the hash function on the join attributes. These are written out to disk as needed. If a bucket doesn’t fit in memory, recurisvely partition it with a second hash function to further divide the bucket.
- Probe: For each bucket, retrieve the pages in the outer and inner tables. Then perform a nested loop join on all tuples.
Partition Cost: Probe Cost: Total Cost:
Prev: sorting—aggregations Next: query-execution-i