Find the TF-IDF of terms of a given document and a collection of documents, how to calculate tf-idf, the use of tf-idf in finding the importance of a term in a document, term frequency-inverse document frequency
Question:
Given a document X containing terms t1, t2 and t3 with frequencies (inside brackets) as follows;
t1(3), t2(2), t3(1)
Let us assume that the collection contains 10,000 documents and document frequencies of these terms are as follows;
t1(50), t2(1300), t3(250)
Then, find the TF-IDF weight of terms t1, t2, and t3 in the document X.
Solution:
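TF-IDF (Term Frequency-Inverse Document Frequency) is a measure of how relevant a term is to a given document within a collection of documents.
TFt,d measures how often a term t occurs in a document d, relative to the length of d. It can be calculated as follows;
TFt,d = (number of times t occurs in d) / (total number of words in d)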
For example, if the document D1 has 54 words in it and the term ‘quick’ occurs 10 times, then TF‘quick’,D1 = 10/54 = 0.19.
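The TF example above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `term_frequency` and the 54-word toy document are assumptions made up for the example, and the document is assumed to be already split into words.

```python
def term_frequency(term, words):
    """Relative frequency of `term` in a tokenized document (list of words)."""
    if not words:
        return 0.0  # avoid dividing by zero for an empty document
    return words.count(term) / len(words)


# Hypothetical 54-word document in which 'quick' occurs 10 times.
d1 = ["quick"] * 10 + ["other"] * 44

tf_quick = term_frequency("quick", d1)
print(round(tf_quick, 2))  # 0.19
```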
DFt refers to the number of documents in which the term t appears.
For example, if 120 documents contain the word ‘quick’, then DF‘quick’ = 120.
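Document frequency counts documents, not occurrences: a document that contains the term several times still counts once. The sketch below shows this on a tiny made-up collection; the function name `document_frequency` and the three sample documents are hypothetical.

```python
def document_frequency(term, documents):
    """Number of documents that contain `term` at least once."""
    return sum(1 for words in documents if term in words)


# Hypothetical collection of three tokenized documents.
docs = [
    ["the", "quick", "brown", "fox"],
    ["quick", "quick", "answer"],  # 'quick' twice, still counted once
    ["a", "slow", "turtle"],
]

df_quick = document_frequency("quick", docs)
print(df_quick)  # 2
```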
IDFt is the inverse measure used to calculate the informativeness of the given term t, that is, how common or rare the term is in the entire document set. The closer it is to 0, the more common the term is. It can be calculated as follows;
IDFt = log(N / DFt)
Here, N is the number of documents in the given collection, and DFt is the document frequency of term t.
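A small sketch of the IDF calculation follows. It assumes the natural logarithm, which is the base that reproduces log(10000/50) = 5.3 used in this post; other bases (2 or 10) are also common and only rescale the values. The function name `inverse_document_frequency` is an assumption for illustration.

```python
import math


def inverse_document_frequency(n_docs, doc_freq):
    """IDF = log(N / DF), using the natural logarithm (an assumption)."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log(n_docs / doc_freq)


idf_rare = inverse_document_frequency(10_000, 50)        # rare term
idf_common = inverse_document_frequency(10_000, 1300)    # more common term
idf_everywhere = inverse_document_frequency(10_000, 10_000)  # in every document

print(round(idf_rare, 2))    # 5.3
print(round(idf_common, 2))  # 2.04
print(idf_everywhere)        # 0.0
```

The last value shows why an IDF close to 0 signals a very common term: a term that appears in every document gets log(1) = 0.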
The TF-IDF weight of a term is the product of its TF weight and its IDF weight.
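The document X contains 3 + 2 + 1 = 6 words in total (assuming it contains only the terms t1, t2 and t3). The IDF values below use the natural logarithm.
TF-IDF for term t1;
TFt1 = (number of times t1 occurs in X)/(number of words in X) = 3/6 = 0.5
IDFt1 = log(No. of docs in the collection/No. of docs in which t1 appears) = log(10000/50) = 5.298
TF-IDF for t1 = (3/6) * 5.298 = 2.65
TF-IDF for term t2;
TFt2 = 2/6 = 0.33
IDFt2 = log(10000/1300) = 2.040
TF-IDF for t2 = (2/6) * 2.040 = 0.68
TF-IDF for term t3;
TFt3 = 1/6 = 0.17
IDFt3 = log(10000/250) = 3.689
TF-IDF for t3 = (1/6) * 3.689 = 0.61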
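The whole calculation can be checked with a short Python sketch. It rests on two assumptions that should be read as such: the length of X is taken to be the sum of the listed term frequencies (3 + 2 + 1 = 6, so X contains no other words), and the logarithm is natural. A different length assumption scales every TF, and therefore every TF-IDF weight, by the same factor. The names `tf_idf`, `term_counts` and `doc_freqs` are made up for the example.

```python
import math

N_DOCS = 10_000                                   # documents in the collection
term_counts = {"t1": 3, "t2": 2, "t3": 1}         # frequencies in document X
doc_freqs = {"t1": 50, "t2": 1300, "t3": 250}     # document frequencies

# Assumption: X contains only these terms, so its length is 6 words.
doc_length = sum(term_counts.values())


def tf_idf(count, doc_length, n_docs, doc_freq):
    """TF-IDF = (count / doc_length) * ln(n_docs / doc_freq)."""
    tf = count / doc_length
    idf = math.log(n_docs / doc_freq)
    return tf * idf


weights = {
    term: tf_idf(count, doc_length, N_DOCS, doc_freqs[term])
    for term, count in term_counts.items()
}

for term, weight in weights.items():
    print(f"{term}: {weight:.2f}")
```

Under these assumptions t1 gets the largest weight, because it is both the most frequent term in X and the rarest term in the collection.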
******************