Showing posts with label Parallel Database Concepts. Show all posts
Showing posts with label Parallel Database Concepts. Show all posts

Sunday, June 22, 2014

Partitioned Parallel Hash Join



Partitioned Parallel Hash Join in Parallel Database / Parallel Hash Join Technique

Partitioned Parallel Hash Join

The Sequential Hash Join technique discussed in the post Hash Join technique in DBMS can be parallelized. .

The Technique

Please go through the post Hash Join technique in DBMS for better understanding of Parallel Hash-Join.

Assume that we have,

·         n processors, P0, P1, …, Pn-1
·         two tables, r and s (r and s are already partitioned into disks of n processors),
·         s is the smaller table

Parallel Hash-Join Algorithm:

Step 1: Choose a hash function h1. h1 should be able to take the join attribute value of each tuple of relations r and s, and maps them to one of the n processors.
Step 2: Smaller relations s is taken as the Build relation.
Step 3: Each processor Pi reads the tuples of smaller relation s in it’s disk Di, and sends them to different processors based on the hash function h1.
Step 4: On receiving the records of si at every destination processor Pi, the Pi further partitions the tuples using another hash function h2. This hash function is used for joining the tuples locally at every processor. This step is independent for any processor.
Step 5: On completion of the distribution of all the records of smaller relation s, it is now the turn for the larger relation r. Step 3 is done for all the records of relation r.
Step 6: On receiving the records of ri at every destination processor Pi, the Pi further partitions the tuples using another hash function h2. This phase is called the Probe Phase.
Step 7: At last, all the records which can be joined will be in same processors’ same partitions. Each processor Pi executes the build and probe phases of the hash–join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash–join.

Points to remember:

Hash Join at each processor is independent of the other.
It works for equi-join and natural join conditions

Wednesday, March 12, 2014

Parallel Database - Intraoperation Parallelism



Intra-operation Parallelism

It is about parallelizing a single relational operation given in a query.
SELECT * FROM Email ORDER BY Start_Date;
In the above query, the relational operation is Sorting. Since a table may have large number of records in it, the operation can be performed on different subsets of the table in multiple processors, which reduces the time required to sort.

Consider another query,
SELECT * FROM Student, CourseRegd WHERE Student.Regno = CourseRegd.Regno;
In this query, the relational operation is Join. The query joins two tables Student, and CourseRegd on common attribute Regno. Parallelism is required here if the size of tables is very large. Usually, order of tuples does not matter in DBMS. Hence, the tables arranged in random order needs every record of one table should be matched with every record of other table to complete the join process. For example, if Student has 10000 records and CourseRegd has 60000 records, then join requires 10000 X 60000 comparisons. If we exploit parallelism in here, we could achieve better performance.

There are many such relational operations which can be executed in parallel using many processors on subsets of the table/tables mentioned in the query. The following list includes the relational operations and various techniques used to implement those operations in parallel.


Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents