How do we perform Duplicate data elimination, Projection, and Aggregation in parallel in parallel database system?
Assume that
the table is partitioned and stored in disks D0, D1, …, Dn-1 with processors
P0, P1, …, Pn-1. For example, consider the following figure where the table Employee is partitioned used Round-robin
partitioning technique.
Figure 1 - Partitioning Employee table |
Duplication elimination:
Duplicate
elimination is about removing duplicate values or neglecting repeated values
that are stored in an attribute for various reasons; For example, in one or all
of the following situations we need duplicate elimination;
- When we would like to count the number of unique values present in a table under a particular attribute.
- When we need to retrieve the values and present only the unique values of an attribute.
- When we need to join two tables, etc.
The main
approach used for duplicate elimination is by sorting the data on the attribute
where in the duplicate values to be removed.
Duplicate
elimination can be achieved in the following two ways in parallel database;
1. During
parallel sort, if we find any repeated values while partitioning, those can be
discarded immediately. (This method is for tables that are not partitioned). For
example, in Figure 1, while you partition the data, you send tuples to
different partitions based on the partition attribute and conditions. Suppose that
we are partitioning on Salary attribute. During the process, if you find any of
the repeated values which are already send into a partition, you can discard
those values if they repeat. In our example, salary value 5000 occurs in two
records. Hence, one can be accepted and the other can be discarded.
2. We can
partition the table into many partitions (using range or hash partitioning),
and instruct all the processors to sort the data locally and remove the
duplicates. (This works only for the data that are hash or range partitioned on
the duplicate elimination attribute)
Projection:
Projection
means selection of one or more attributes from a table with the records stored
in them. This operation can be parallelized as follows;
1. Projection
without duplicate elimination: while you read data into various disks during
partitioning, you can project the required columns.
2. Projection
with duplicate elimination: any of the techniques suggested in Duplicate
elimination section above can be used.
Aggregation:
Aggregation
operation involves finding the count of records, sum of values stored in an
attribute, minimum or maximum value of all the values stored in an attribute,
and average value of all the attribute values. This operation basically needs
grouping. That is, for example, we can find the sum of salary for all the
records of Employee table, or we can find sum of salary with some filter conditions.
In the first case, all the records of Employee table come under one group. In
the later case, we choose the group based on the conditions included.