Partitioned Parallel Hash Join in Parallel Database / Parallel Hash Join Technique
Partitioned Parallel Hash Join
The
Sequential Hash Join technique discussed in the post Hash Join technique in DBMS
can be parallelized. .
The Technique
Please
go through the post Hash Join technique in DBMS for better understanding of
Parallel Hash-Join.
Assume that we have,
·
n
processors, P0, P1, …, Pn-1
·
two
tables, r and s (r and s are already partitioned into disks of n processors),
·
s
is the smaller table
Parallel Hash-Join Algorithm:
Step
1: Choose a hash function h1. h1 should be able to take the join
attribute value of each tuple of relations r and s, and maps them to
one of the n processors.
Step
2: Smaller relations s is taken as the Build relation.
Step
3: Each processor Pi reads the tuples of smaller relation s in it’s disk Di, and
sends them to different processors based on the hash function h1.
Step
4: On receiving the records of si at every destination processor Pi,
the Pi further partitions the tuples using another hash function h2. This hash function is used for
joining the tuples locally at every processor. This step is independent for any
processor.
Step
5: On completion of the distribution of all the records of smaller relation s,
it is now the turn for the larger relation r. Step 3 is done for all the
records of relation r.
Step
6: On receiving the records of ri at every destination processor Pi,
the Pi further partitions the tuples using another hash function h2. This phase is called the Probe
Phase.
Step
7: At last, all the records which can be joined will be in same processors’
same partitions. Each processor Pi executes the build and probe phases of the
hash–join algorithm on the local partitions ri and si of
r and s to produce a partition of the final result of the hash–join.
Points to remember:
Hash
Join at each processor is independent of the other.
It
works for equi-join and natural join conditions