Sunday, February 23, 2014

Data Partitioning Strategies in Parallel Database Systems



Data Partitioning Strategies in Parallel Database Systems


Partitioning the tables/databases is very important step in parallelizing the database activities. By partitioning the data equally into many different processors’ workload, we can achieve better performance (better parallelism) of the whole system. We have the following types of fragmentation (partitioning) techniques in Parallel database;

Horizontal Fragmentation – Partitioning the tables using the conditions  specified through WHERE clause of SQL query to distribute bunch of tuples (records). For example, consider the following schema;
STUDENT (Regno, SName, Address, Branch, Phone)
In this table, Branch column is used to store the academic branch in which the student admitted in. suppose, there are various branches like ‘BTech CIVIL’, ‘BTech MECH’, ‘BTech CSE’, and so on, then the following query would be used to partition the table using horizontal partitioning.
SELECT * FROM student WHERE Branch = branch name;
In the result, there won’t be any changes in the schema of the table; i.e the structure of the table is unaltered. Only, ‘BTech CIVIL’ students fragmented into one partition, ‘BTech CSE’ students fragmented into one partition and so on.
Vertical Fragmentation – Partitioning tables using the decomposition rules to break and distribute the tables into multiple partitions vertically (different schemas) is called Vertical fragmentation.
For example, if we would like to break STUDENT into different tables like STUDENT(Regno, SName, Address, Branch) and STU_PHONE(Regno, Phone), i.e into two different schema on columns is Vertical partitioning.

Among the above discussed fragmentation techniques, partitioning any relation with respect to Parallel databases involves Horizontal Partitioning. Horizontal data partition helps us to distribute the data into several processors to execute queries on them simultaneously.

Partitioning Strategies:

There are various partitioning strategies proposed to manage the data distribution into multiple processors evenly.
Let us assume that in our parallel database system we have n processors P0, P1, P2, …, Pn-1 and n disks D0, D1, D2, …, Dn-1 where we partition our data. The value of n is chosen according to the degree of parallelism required. The partitioning strategies are,

Round-Robin Partitioning

 

                It is the simplest form of partitioning strategy. Here, the data are distributed into various disks in the fashion, first record into first disk, second record into second disk, and so on. If the number of available disks n is 10, then first record goes to D1 (1 mod 10 = 1), second record goes to D2 (2 mod 10 =2), and so on and 10th record goes to D0 (10 mod 10 = 0), 11th record goes to D1 (11 mod 10 = 1). This scheme distributes data evenly in all the disks.

Hash Partitioning

 

                This strategy identifies one or more attributes as partitioning attributes (which is not required in Round-Robin partitioning). We need a hash function which is carefully chosen (careful to avoid data skewness) which takes the identified partitioning attributes as input to hash function.
For example, consider the following table;
EMPLOYEE(ENo, EName, DeptNo, Salary, Age)
If we choose DeptNo attribute as the partitioning attribute and if we have 10 disks to distribute the data, then the following would be a hash function;
h(DeptNo) = DeptNo mod 10
If we have 10 departments, then according to the hash function, all the employees of department 1 will go into disk 1, department 2 to disk 2 and so on.

As another example, if we choose the EName of the employees as partitioning attribute, then we could have the following hash function;
h(EName) = (Sum of ASCII value of every character in the name) mod n,
where n is the number of disks/partitions needed.

Range Partitioning

 

In Range Partitioning we identify one or more attributes as partitioning attributes. Then we choose a range partition vector to partition the table into n disks. The vector is the values present in the partitioning attribute.
For example, for the EMPLOYEE relation given above, if the partitioning attribute is Salary, then the vector would be one as follows;
[5000, 15000, 30000],
where every value means the individual range of salaries. That is, 5000 represents the first range (0 – 5000), 15000 represents the range (5001 – 15000), 30000 represents the third range (15001 – 30000), and it includes the final range which is (30001 – rest). Hence, the vector with 3 values represents 4 disks/partitions.

The above discussed partition strategies must be chosen carefully according to the workload of your parallel database system. The workload may involve many components like, which attribute is frequently used in any queries as filtering condition, the number of records in a table, the size of the database, approximate number of incoming queries in a day, etc.

Detailed Example:




Let us start with the following table Emp_table. Emp_table instance has 14 records and every record stores information about the name of the employee; his/her work grade, and the department name. Assume that we have 3 processors namely P0, P1, P2, and 3 Disks associated with those 3 processors namely D0, D1, D2.



Emp_table
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
BLAKE
4
SALES
FORD
4
RESEARCH
KING
5
ACCOUNTING
SCOTT
4
RESEARCH
MILLER
2
ACCOUNTING
TURNER
3
SALES
WARD
2
SALES
MARTIN
2
SALES
ADAMS
1
RESEARCH
JONES
4
RESEARCH
JAMES
1
SALES
CLARK
4
ACCOUNTING
ALLEN
3
SALES
Table 1 – Emp_table

A. Round-Robin Partitioning:

In this strategy we partition records in a round-robin manner using the function i mod n, where i is the record position in the table and n is the number of partitions/disks which is in our case 3. On the application of partitioning technique, first record goes into D1, second record goes into D2, third record goes into D0, fourth record goes into D1, and so on. After distribution of records, we will get the following partitions;



Emp_table_Partition0
ENAME
GRADE
DNAME
FORD
4
RESEARCH
MILLER
2
ACCOUNTING
MARTIN
2
SALES
JAMES
1
SALES
Table 2 – Records 3, 6, 9, 12 mod 3

Emp_table_Partition1
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
KING
5
ACCOUNTING
TURNER
3
SALES
ADAMS
1
RESEARCH
CLARK
4
ACCOUNTING
Table 3 – Records 1, 4, 7, 10, 13 mod 3

Emp_table_Partition2
ENAME
GRADE
DNAME
BLAKE
4
SALES
SCOTT
4
RESEARCH
WARD
2
SALES
JONES
4
RESEARCH
 ALLEN
3
SALES
Table 4 – Records 2, 5, 8, 11, 14 mod 3

B. Hash Partitioning:

Let us take GRADE attribute of the Emp_table to explain Hash partitioning. Let us choose a hash function as follows;
h(GRADE) = (GRADE mod n)
where GRADE is the value of GRADE attribute of a record and n is number of partitions which is 3 in our case. While applying the hash partitioning on GRADE, we will get the following partitions of Emp_table. For example, the GRADE of ‘Smith’ is 1 and while hashing the function shows partition 1 (i.e 1 mod 3 = 1). The GRADE of ‘Blake’ is 4, then (4 mod 3) directs to partition 1. The GRADE of ‘King’ is 5 which directs to partition 2 (5 mod 3 = 2).

Emp_table_Partition0
ENAME
GRADE
DNAME
TURNER
3
SALES
 ALLEN
3
SALES
Table 5 – GRADEs 3 mod 3

Emp_table_Partition1
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
BLAKE
4
SALES
FORD
4
RESEARCH
SCOTT
4
RESEARCH
ADAMS
1
RESEARCH
JONES
4
RESEARCH
JAMES
1
SALES
CLARK
4
ACCOUNTING
Table 6 – GRADEs 1, 4 mod 3

Emp_table_Partition2
ENAME
GRADE
DNAME
KING
5
ACCOUNTING
MILLER
2
ACCOUNTING
WARD
2
SALES
MARTIN
2
SALES
Table 7 – GRADEs 2, 5 mod 3

C. Range Partitioning:

Let us consider GRADE of Emp_table to partition under range partitioning. For applying range partition, we need to first identify partitioning vector, [v0, v1, …, vn-2]. Let us choose the following vector as range partitioning vector for our case;
[2, 4]
According to the vector, the records having the GRADE value 2 and less will go into partition 0, greater than 2 and less than or equal to 4 will go into partition 1, and all the other values (greater than 4) will go into partition 2 as depicted in the following tables.

Emp_table_Partition0
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
MILLER
2
ACCOUNTING
WARD
2
SALES
MARTIN
2
SALES
ADAMS
1
RESEARCH
JAMES
1
SALES
Table 8 – GRADE values 1 and 2

Emp_table_Partition1
ENAME
GRADE
DNAME
BLAKE
4
SALES
FORD
4
RESEARCH
SCOTT
4
RESEARCH
TURNER
3
SALES
JONES
4
RESEARCH
CLARK
4
ACCOUNTING
ALLEN
3
SALES
Table 9 – GRADE values 3 and 4

Emp_table_Partition2
ENAME
GRADE
DNAME
KING
5
ACCOUNTING
Table 10 – GRADE value 5 and above

Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents

data recovery