Thursday, March 6, 2014

Comparison of Data-Partitioning Strategies in Parallel Database



Comparison of different data-partitioning strategies based on the data-access types:

We have already discussed about different data-partitioning techniques, namely, Round-robin, Hash and Range Partitioning in an older post.


The above said data-partitioning strategies are useful in partitioning data across several processors to establish a complete parallel system. Here, the term ‘Complete Parallel System’ means the Parallel Database System which can deliver the expected improvement in the performance of the whole system. Hence, it cannot be achieved directly without considering how the database is going to be accessed by the end user and other related details.
The following list would help us in choosing the right partitioning technique;

  • The workload of the system

a.       What is the size of the database?
b.      How many users are accessing the system?
c.       What type of accesses they are? (reading/writing) and so on.

  • What type of data-accesses? (Scanning the entire relation, Point queries, Range queries).

  • The nature of data stored in the table.

a.       The data types
b.      Cardinality of data of every column (uniqueness of values stored in a column) etc.

The above are not the complete list. Different systems try with different options. In this post, let us discuss about select of partitioning strategies based on the access type on data, i.e, based on the type of query which tries to scan the entire relation, range queries or point queries.
Let us use the following three queries for explaining the concept;

  A.      SELECT * FROM Employee WHERE Emp_ID = ‘E101’; (Assume Emp_ID is Primary key)
  B.      SELECT * FROM Employee WHERE EName = ‘Murugan’; (Assume EName is non-key attribute)
  C.      SELECT * FROM Employee ORDER BY Phone;
  D.      SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 20000;

 

1.Round-robin Technique

Round-robin partitioning strategy partitions the given relational table into set of disks based on the position of the records. That is, it sends ith record into disk (i mod n). hence, this strategy ensures even distribution of records into available disks.

Advantages:

  • Suitable for queries which requires scanning the entire relation. For example, Query C can be handled efficiently by distributing data evenly among the available disks.
  • Due to even distribution of records, the workload is well balanced among disks.

Disadvantages:

  • Records are not distributed in any order in this strategy. Hence, we cannot handle point queries and range queries efficiently (when compared to other strategies). So, the strategy is not well suited for Point and Range queries. Query A, B and D cannot be answered efficiently (when compared to other strategies. That is, for example, query b can be answered efficiently if Hash partitioning strategy is used).

 

2. HashPartitioning

Hash partitioning strategy works by using a hash function which is designed to distribute the records into many disks based on the input values (partitioning attribute values).

Advantages:

  • Sequential scan can be done efficiently. Query C can be answered efficiently as the data are available in all the disks, sorting can be done in parallel
  • Well balanced data distribution is possible provided that good portioning attributes are chosen and good hash function is chosen.
  • Point queries can be executed very efficiently compare to Round-robin partitioning. The reason is, in Round-robin technique to answer query B, one has to search in all the disks. But in the case of Hash partitioning the similar values could be found in one location only. Hence, the other processors could be used to handle other queries. This leads to higher Transaction throughput.

Disadvantages:

  • It is difficult to answer range queries (when compared to Range partitioning technique). The data distributed unevenly. Hence, to answer Query D, we have to use all the processors.
  • Not good for partitioning on non-partitioning attributes. Suppose, if the table Employee is partitioned using Hash partitioning on Emp_ID attribute, then query B has to scan all the disks.

 

3. RangePartitioning

Range partitioning strategy partitions the data based on the partitioning attributes values. We need to find set of range vectors on which we are about to partition. For example, the records with Salary range 100 to 5000 will be in disk 1, 5001 to 10000 in disk 2, and so on.

Advantages:

  • Sequential scan can be done in parallel using all the processors. Query C can be executed efficiently.
  • Well balanced data distribution is possible, only if good partitioning vector is chosen. One has to choose the partitioning vector which would at least near equally into many disks.
  • Point queries can be handled efficiently just like Hash partitioning (when compare to Round-robin technique). Queries A and B can be handled efficiently.
  • For range queries, according to the range given in the query, we may need to scan one to few disks as whole. Hence, Range queries can be handled efficiently. Query D can be executed efficiently (if not execution skew present).
  • Higher throughput and good response time.

Disadvantages:

  • If the query specifies large range values, i.e, if the result consists of large number of records in very few disks among the available disk, an I/O bottleneck may be created on those disks. This is called Execution skew. If the same query executed in Round-robin or Hash partitioned disks, we could get better performance compared to Range partitioning.

The Table 1 given below, compares the partitioning strategies.

Partitioning Strategy
Advantages
Disadvantages
Round-robin
·         Sequential scan of entire relation
·         Well balanced data distribution
·         Records are distributed in random. No specific order is followed.
·         Hard to answer range queries.
Hash Partitioning
·         Sequential scan of entire relation
·         Well balanced data distribution
·         For point queries, only one disk has to be searched (only if the data partitioned on the query attribute).
·         Hard to answer range queries
·         Not so good for point queries on non-partitioning attribute.

Range Partitioning
·         Sequential scan of entire relation
·         Well balanced data distribution
·         For point queries, only one disk has to be searched (only if the data partitioned on the query attribute).
·         For range queries, only few disks have to be searched according to the range specified in the query (only if the data partitioned on the query attribute).
·         Execution skew might occur.
Table 1 – Comparison of Partitioning Strategies
Figure 1 - Round-robin Partitioning Strategy in Shared Nothing Parallel Architecture

Figure 2 - Hash Partitioning Strategy in Shared Nothing Parallel Architecture

Figure 3 - Range Partitioning Strategy in Shared Nothing Parallel Architecture









2 comments:

Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents

data recovery