TOPICS (Click to Navigate)

Pages

Friday, September 11, 2015

File organization in DBMS - Hash file organization

Hash File Organization / Advantages and Diadvantages


Hash File Organization


It is a file organization technique where a hash function is used to compute the address of a record. It uses the value of an attribute or set of attributes as input and gives the location (page/block/bucket) where the record can be stored.

For example, let us consider the following table Student;

RegNo
SName
Gen
Phone
1
Sundar
M
9898786756
3
Karthik
M
8798987867
4
John
M
7898886756
2
Ram
M
9897786772
5
Martin
M
9765430231
6
Rashmi
F
8976543990

A hash function is a function which maps the large set of values into smaller set of files/locations/values. Let us organize the above table using the phone attribute value as input for the hash function.
h(phone mod 10)
In the above hash function, phone is the phone attribute’s value of each record. 10 is the number of buckets/pages where we want to store our table. [10 buckets means bucket0, bucket1, …, bucket9].
For our example,
For 1st record, h(9898786756 mod 10)  = 6 ie., the first record has to be stored in 6th bucket.
For 2nd record, h(8798987867 mod 10) = 7 ie., the second record has be stored in 7th bucket.
For 4th record, h(7898886756 mod 10) = 6 ie., the fourth record has be stored in 6th bucket [like 1st]
For 5th record, h(9765430231 mod 10) = 1 ie., the 5th record has to be stored in 1st bucket.
For last record, h(8976543990 mod 10) = 0 ie., the last record has to be stored in 0th bucket.

Important points for consideration:



  • If bucket(s) is/are full, then overflow buckets can be used to store more records.
  • Hash function has to be chosen with extra care to avoid uneven distribution. That is, a bad hash function may assign more records to few buckets and less to others.
  • The attribute(s) that is frequently used for data manipulation can be chosen as the input for the hash function.
  • Same hash function that was used to store the records has to be used for deletion, modification or selection of records.
  • Two types of hashing : static and dynamic

How would we locate a bucket for inserting/deleting/updating/reading a record?


Let us assume that the following query is executed.

SEELCT * FROM Student WHERE phone = 8976543990;

For searching the record, we has to use the same hash function that we used for storing the records. Hence, h(8976543990 mod 10) = 0. And the result points to the 0th bucket. It actually gives us the quick access to the required record.

Advantages



  • Quick access to records in terms of selection. [If queried on the attribute that was used for hashing]
  • Easy to insert, delete, or update a record.

Disadvantages



  • Records are randomly stored in scattered locations. May waste a lot of space in case of small files.
  • For queries that involve ranges, hash file organization is not efficient. [eg. SELECT * FROM Emp WHERE Salary BETWEEN 10000 AND 25000]
  • If querying attribute is not the hashed attribute, you may need to scan the entire table for retrieval.
  • Frequent update to the hashed column results in movement of data between buckets which actually affects the system performance.


1 comment: