TOPICS (Click to Navigate)

Pages

Friday, March 7, 2014

Comparison of Data Partitioning Strategies with Example



Comparison of Data Partitioning Strategies with Example:


ENo
EName
DOB
Designation
Salary

E101
Kumar
14-JUN-1980
Lecturer
10002
R1
E102
Melvin
12-MAR-1975
AP
23000
R2
E103
Ram
10-MAR-1978
AP
22051
R3
E107
Jacob
28-MAY-1970
Professor
30011
R4
E109
Nisha
10-AUG-1981
Lecturer
9031
R5
E105
Shaik
20-DEC-1971
Professor
29002
R6
E104
Vinay
15-SEP-1982
Attender
4004
R7
E106
Menon
30-AUG-1978
AP
21016
R8
Table 1 - Employee

Let us suppose a table Employee with the schema Employee(ENo, EName, DOB, Designation, Salary). Table 1 shows Employee table with sample data. We assume that Shared Nothing Parallel Architecture is used. For partitioning this table using Hash or Range partitioning we need one or more attributes as partitioning attributes. Let us choose Salary as partitioning attribute.
The partitioning attribute salary is represented in the following figures inside big rectangle with all these ranges in small rectangles. The big rectangle means the whole table (Employee). Consider the salary values differently for different partitioning. That is, treat the values as 100 to MAX for Round-robin and Hash partitioning (not as group of values instead treat as one group), and as different vectors for Range partitioning in the following discussions. The same representation is used for all the figures (Round-robin, Hash, and Range partitioning).
Note: We cannot physically partition a table using more than one partitioning technique. We have taken this example to explain with different partitioning techniques’ advantages and disadvantages.

Round-robin technique:
In Round-robin technique, we use the function i mod n to distribute data into multiple disks. Here, variable i represents the position (physical order) of the record in the actual table. Variable n represents the number of target disks into which the table should be partitioned.
For our table Employee, Figure 1 shows the data distribution using Round-robin partitioning. Here, you would find distribution links from every small rectangle to every disk. That means, all the ranges are available with all the disks. For example, the records of Table 1, according to the function i mod 4, will be sent to the 4 disks as follows;
Disk 0    : R4, R8 (i.e, records with salary values 30011, and 21016 respectively)
Disk 1    : R1, R5 (i.e, records with salary values 10002, and 9031 respectively)
Disk 2    : R2, R6 (i.e, records with salary values 23000, and 29002 respectively)
Disk 3    : R3, R7 (i.e, records with salary values 22051, and 4004 respectively)
By looking at the sample data and the distribution to different disks, it is clear that the distribution is not even distribution.
Figure 1 - Data distribution in Shared Nothing Parallel Architecture using Round-robin technique

Advantages:
Distribution is equal, so load is equal (all the disks have equal amount of records)
Good for sequential scan.
Disadvantages:
Hard to answer range queries because of uneven distribution. For this, let us consider the following query;
SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 20000;
The above query is a Range query. This query demands all the records of Employee table for which the salary value falls between 10002 and 20000. In our case, the distribution is not based on attribute; it was actually based on the record position. Hence, all the salary values falls in the given range may be in all the disks. So, we need complete scan on all the disks for answering this query.
Hard to answer point queries. Consider another query;
SELECT * FROM Employee WHERE EName = ‘Murugan’;
This query is a Point query. Query needs the records of employee ‘Murugan’. Again, for this query too we cannot identify the exact disk where the records with the specified values would be stored. Hence, we need full scan on all the disks to answer this query.

Hash Partitioning:
Hash partitioning uses Hash function. Let us use the following hash function on partitioning attribute Salary.
h (attribute mod n) = h (salary mod 4) = h(4004 mod 4) = 0 à Disk 0
For this function, the records will be distributed as follows;
Disk 0    : R2, R7, R8
Disk 1    :
Disk 2    : R1, R6
Disk 3    : R3, R4, R5
In Hash partitioning, we have uneven distribution. It is shown in Figure 2.
Figure 2 - Data distribution in Shared Nothing Parallel Architecture using Hash Partitioning technique

Advantages:
Sequential can be done efficiently by exploiting all the processors in parallel. For example, the query,
SELECT * FROM Employee ORDER BY Phone;
can be executed in parallel using all the 4 processors.
Best suited for point queries. Consider the following query; (point query on partitioning attribute Salary)
SELECT * FROM Employee WHERE Salary = 12000;
This query can be executed efficiently by only one of the available processors. The reason is, on applying the hash function h(salary mod 4) which is  h(12000 mod 4), will point to disk 0 where the actual record is stored. During this time we can exploit other processors for executing something else. That is, many numbers of simple transactions can be executed in parallel by various processors, which lead to higher Transaction Throughput.
Disadvantages:
Hard to answer range queries. Consider the query,
SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 20000;
For this query, the salary in the range 10000 to 20000 will be distributed to almost all the partitions. For example, 10000 mod 4 will point Disk 0, 10001 mod 4 will point Disk 1, and so on. Hence, this query cannot be executed at a particular processor. Instead, it has to be executed by all the processors in parallel.
Hard to answer point queries on non-partitioning attributes. Consider the following query;
SELECT * FROM Employee WHERE EName = ‘Murugan’;
Our table is partitioned using Salary as the partitioning attribute using Hash partitioning. Hence, EName is the non-partitioning attribute. So, the queries with Salary in the WHERE clause can be efficiently executed at one processor, all the other queries which do not have Salary in WHERE clause must be executed at the all the processors as we do not know in which partition the value ‘Murugan’ would be available.

Range Partitioning:
For Range partitioning, we need a range vector. For this we have chosen the following vector as range vector;
v [5000, 10000, 25000]
These vector represents 4 partitions, viz, 5000 and less, 10000 and less, 25000 and less, and the other salary values more than 25000. For this range vector, our data will be distributed as follows;
Disk 0    : R7 (only record 7 is <=5000)
Disk 1    : R5 (only record 5 is >5000 and <=10000)
Disk 2    : R2, R3, R8 (the range >10000 and <=25000)
Disk 3    : R4, R6 (the range >25000)
Figure 3 shows the data distribution in Range partitioning.
Figure 3 - Data distribution in Shared Nothing Parallel Architecture using Range Partitioning technique

Advantages:
Good for sequential scan. For example, the query,
SELECT * FROM Employee ORDER BY Phone;
can be executed efficiently by all the processors in parallel.
Point queries can be executed efficiently as in the case of Hash partitioning. For example, the query,
SELECT * FROM Employee WHERE Salary = 12000;
need to be executed by only one of all the available processors. The reason is, the salary value 12000, in our case, belongs to Disk 2. Hence, we need to scan only the records at Disk 2.
Range queries can be executed efficiently compared to other partitioning techniques. For example, the query,
SELECT * FROM Employee WHERE Salary BETWEEN 10001 AND 20000;
This query needs one to few disks to be scanned based on the values specified in the WHERE clause. For our query, we need to scan only Disk 2. Because, the range 10001 and 20000 stored in Disk 2 only.

Disadvantages:
Execution Skew might occur. For example, consider the query,
SELECT * FROM Employee WHERE Salary BETWEEN 10001 AND 26000;
This range falls in Disk 2 and 3. Unfortunately, because of the bad range vectors, disk 2 and 3 have more number of records compared to any other disks. Hence, though the query needs to be executed at Processor 2 and 3, it consumes more time as there are more number of records.