Comparison of Data Partitioning Strategies with Example:
ENo
|
EName
|
DOB
|
Designation
|
Salary
|
|
E101
|
Kumar
|
14-JUN-1980
|
Lecturer
|
10002
|
R1
|
E102
|
Melvin
|
12-MAR-1975
|
AP
|
23000
|
R2
|
E103
|
Ram
|
10-MAR-1978
|
AP
|
22051
|
R3
|
E107
|
Jacob
|
28-MAY-1970
|
Professor
|
30011
|
R4
|
E109
|
Nisha
|
10-AUG-1981
|
Lecturer
|
9031
|
R5
|
E105
|
Shaik
|
20-DEC-1971
|
Professor
|
29002
|
R6
|
E104
|
Vinay
|
15-SEP-1982
|
Attender
|
4004
|
R7
|
E106
|
Menon
|
30-AUG-1978
|
AP
|
21016
|
R8
|
Table
1 - Employee
Let us
suppose a table Employee with the schema Employee(ENo, EName, DOB, Designation,
Salary). Table 1 shows Employee table with sample data. We assume that Shared Nothing Parallel Architecture is
used. For partitioning this table using Hash or Range partitioning we need
one or more attributes as partitioning attributes. Let us choose Salary
as partitioning attribute.
The
partitioning attribute salary is represented in the following figures inside
big rectangle with all these ranges in small rectangles. The big rectangle
means the whole table (Employee). Consider the
salary values differently for different partitioning. That is, treat the values
as 100 to MAX for Round-robin and Hash partitioning (not as group of values
instead treat as one group), and as different vectors for Range partitioning in
the following discussions. The same representation is used for all
the figures (Round-robin, Hash, and Range partitioning).
Note: We
cannot physically partition a table using more than one partitioning technique.
We have taken this example to explain
with different partitioning techniques’ advantages and disadvantages.
Round-robin technique:
In
Round-robin technique, we use the function i mod n to distribute data into
multiple disks. Here, variable i
represents the position (physical order) of the record in the actual table. Variable
n represents the number of target
disks into which the table should be partitioned.
For our table
Employee, Figure 1 shows the data distribution using Round-robin partitioning. Here,
you would find distribution links from every small rectangle to every disk.
That means, all the ranges are available with all the disks. For example, the
records of Table 1, according to the function i mod 4, will be sent to
the 4 disks as follows;
Disk 0 : R4, R8 (i.e, records with salary values 30011,
and 21016 respectively)
Disk 1 : R1, R5 (i.e, records with salary values 10002,
and 9031 respectively)
Disk 2 : R2, R6 (i.e, records with salary values 23000,
and 29002 respectively)
Disk 3 : R3, R7 (i.e, records with salary values 22051,
and 4004 respectively)
By looking at
the sample data and the distribution to different disks, it is clear that the
distribution is not even distribution.
Figure 1 - Data distribution in Shared Nothing Parallel Architecture using Round-robin technique |
Advantages:
Distribution is equal, so load is equal (all
the disks have equal amount of records)
Good for sequential scan.
Disadvantages:
Hard to answer range queries because of
uneven distribution. For this, let us consider the following query;
SELECT
* FROM Employee WHERE Salary BETWEEN 10000 AND 20000;
The above
query is a Range query. This query
demands all the records of Employee table for which the salary value falls between
10002 and 20000. In our case, the distribution is not based on attribute; it
was actually based on the record position. Hence, all the salary values falls
in the given range may be in all the disks. So, we need complete scan on all
the disks for answering this query.
Hard to answer point queries. Consider
another query;
SELECT
* FROM Employee WHERE EName = ‘Murugan’;
This query is
a Point query. Query needs the
records of employee ‘Murugan’. Again, for this query too we cannot identify the
exact disk where the records with the specified values would be stored. Hence,
we need full scan on all the disks to answer this query.
Hash Partitioning:
Hash
partitioning uses Hash function. Let us use the following hash function on
partitioning attribute Salary.
h (attribute mod n) = h (salary mod 4) =
h(4004 mod 4) = 0 à Disk 0
For this
function, the records will be distributed as follows;
Disk 0 : R2, R7, R8
Disk 1 :
Disk 2 : R1, R6
Disk 3 : R3, R4, R5
In Hash
partitioning, we have uneven distribution. It is shown in Figure 2.
Figure 2 - Data distribution in Shared Nothing Parallel Architecture using Hash Partitioning technique |
Advantages:
Sequential can be done efficiently by
exploiting all the processors in parallel. For example, the query,
SELECT
* FROM Employee ORDER BY Phone;
can be
executed in parallel using all the 4 processors.
Best suited for point queries. Consider
the following query; (point query on partitioning attribute Salary)
SELECT
* FROM Employee WHERE Salary = 12000;
This query
can be executed efficiently by only one of the available processors. The reason
is, on applying the hash function h(salary mod 4) which is h(12000 mod 4), will point to disk 0 where
the actual record is stored. During this time we can exploit other processors
for executing something else. That is, many numbers of simple transactions can
be executed in parallel by various processors, which lead to higher Transaction Throughput.
Disadvantages:
Hard to answer range queries. Consider
the query,
SELECT
* FROM Employee WHERE Salary BETWEEN 10000 AND 20000;
For this
query, the salary in the range 10000 to 20000 will be distributed to almost all
the partitions. For example, 10000 mod 4 will point Disk 0, 10001 mod 4 will point
Disk 1, and so on. Hence, this query cannot be executed at a particular
processor. Instead, it has to be executed by all the processors in parallel.
Hard to answer point queries on
non-partitioning attributes. Consider the following query;
SELECT
* FROM Employee WHERE EName = ‘Murugan’;
Our table is
partitioned using Salary as the
partitioning attribute using Hash partitioning. Hence, EName is the non-partitioning attribute. So, the queries with Salary in the WHERE clause can be
efficiently executed at one processor, all the other queries which do not have Salary in WHERE clause must be executed
at the all the processors as we do not know in which partition the value ‘Murugan’
would be available.
Range Partitioning:
For Range
partitioning, we need a range vector. For this we have chosen the following vector
as range vector;
v
[5000, 10000, 25000]
These vector
represents 4 partitions, viz, 5000 and less, 10000 and less, 25000 and less,
and the other salary values more than 25000. For this range vector, our data
will be distributed as follows;
Disk 0 : R7 (only record 7 is <=5000)
Disk 1 : R5 (only record 5 is >5000 and
<=10000)
Disk 2 : R2, R3, R8 (the range >10000 and
<=25000)
Disk 3 : R4, R6 (the range >25000)
Figure 3
shows the data distribution in Range partitioning.
Figure 3 - Data distribution in Shared Nothing Parallel Architecture using Range Partitioning technique |
Advantages:
Good for
sequential scan. For example, the query,
SELECT
* FROM Employee ORDER BY Phone;
can be
executed efficiently by all the processors in parallel.
Point queries
can be executed efficiently as in the case of Hash partitioning. For example,
the query,
SELECT
* FROM Employee WHERE Salary = 12000;
need to be
executed by only one of all the available processors. The reason is, the salary
value 12000, in our case, belongs to Disk 2. Hence, we need to scan only the
records at Disk 2.
Range queries
can be executed efficiently compared to other partitioning techniques. For example,
the query,
SELECT
* FROM Employee WHERE Salary BETWEEN 10001 AND 20000;
This query
needs one to few disks to be scanned based on the values specified in the WHERE
clause. For our query, we need to scan only Disk 2. Because, the range 10001
and 20000 stored in Disk 2 only.
Disadvantages:
Execution
Skew might occur. For example, consider the query,
SELECT
* FROM Employee WHERE Salary BETWEEN 10001 AND 26000;
This range
falls in Disk 2 and 3. Unfortunately, because of the bad range vectors, disk 2
and 3 have more number of records compared to any other disks. Hence, though
the query needs to be executed at Processor 2 and 3, it consumes more time as
there are more number of records.