Naive Bayes classifier solved exercise in NLP

Tuesday, March 24, 2020

Naïve Bayes Classifier

Question:

A Naive Bayes text classifier has to decide whether the document ‘Chennai Hyderabad’ is about India (class India) or about England (class England).

a) Estimate the probabilities that are needed for this decision from the following document collection using Maximum Likelihood estimation (no smoothing).

Doc. No.	Document	Class
1	Chennai Mumbai	India
2	Delhi London Hyderabad	England
3	Chennai Kolkata	India
4	Delhi Hyderabad Pune	India
5	London Bristol Chennai	England

b) Based on the estimated probabilities, which class does the classifier predict? Explain. Show that you have understood the Naïve Bayes classification rule.

Solution:

a) Probability estimation

A Naïve Bayes classifier needs two kinds of probabilities to solve this problem: the conditional probability, denoted P(word | class), and the prior probability, denoted P(class).

Conditional probability

Let w_i be a word from a vocabulary of n words and c_j be one of m classes. The likelihood of each individual word given a class can be estimated by maximum likelihood as follows:

$$P(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w} \mathrm{count}(w, c_j)}$$

Here,

count(w_i, c_j) is the number of times the word w_i appears in the documents of class c_j, and

the denominator is the total number of words in all documents listed under class c_j.
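The maximum-likelihood estimate described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the names `TRAINING_DOCS` and `estimate_likelihoods` are hypothetical, whitespace tokenization is assumed, and exact fractions are used so the results can be compared with the hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Training collection from the question: (document text, class label).
TRAINING_DOCS = [
    ("Chennai Mumbai", "India"),
    ("Delhi London Hyderabad", "England"),
    ("Chennai Kolkata", "India"),
    ("Delhi Hyderabad Pune", "India"),
    ("London Bristol Chennai", "England"),
]


def estimate_likelihoods(docs):
    """Return {class: {word: P(word | class)}} using unsmoothed MLE."""
    counts = {}  # class label -> Counter of word occurrences in that class
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.split())

    likelihoods = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())  # all word tokens in the class
        likelihoods[label] = {
            word: Fraction(n, total) for word, n in word_counts.items()
        }
    return likelihoods


likelihoods = estimate_likelihoods(TRAINING_DOCS)
print(likelihoods["India"]["Chennai"])      # 2/7
print(likelihoods["England"]["Hyderabad"])  # 1/6
```

Words that never occur in a class are simply absent from that class's dictionary, which corresponds to a probability of 0 when no smoothing is applied.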

Prior probability

The prior probability of a class is how often that class occurs in the training collection, regardless of the words in a document. It can be estimated as follows:

$$P(c_j) = \frac{N_{c_j}}{N}$$

Here,

N_{c_j} is the number of documents that are listed under class c_j, and

N is the total number of documents in the collection.
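The prior is the share of training documents that belong to each class, which can be sketched as follows. The names `LABELS` and `estimate_priors` are hypothetical and chosen only for this illustration; the labels are those of the five training documents in the question.

```python
from collections import Counter
from fractions import Fraction

# Class labels of the five training documents, in document order.
LABELS = ["India", "England", "India", "India", "England"]


def estimate_priors(labels):
    """Return {class: P(class)} as the fraction of documents in each class."""
    doc_counts = Counter(labels)
    total_docs = len(labels)
    return {label: Fraction(n, total_docs) for label, n in doc_counts.items()}


priors = estimate_priors(LABELS)
print(priors["India"], priors["England"])  # 3/5 2/5
```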

For the given problem, we need to calculate these probabilities for the words of the test document ‘Chennai Hyderabad’, as follows:

Conditional probability estimation

P(Chennai | India) = 2/7

[How is P(Chennai | India) = 2/7? In the training data, the word 'Chennai' occurs 2 times in the documents listed under the class 'India' (once in document 1 and once in document 3), hence 2 in the numerator. The documents under the class 'India' contain 7 words in total (2 words in document 1, 2 in document 3, and 3 in document 4), hence 7 in the denominator. The remaining conditional probabilities are calculated the same way.]

P(Hyderabad | India) = 1/7

P(Chennai | England) = 1/6

P(Hyderabad | England) = 1/6

Prior probability estimation

P(India) = 3/5 [How is P(India) = 3/5? In the training data, 3 out of the 5 documents are listed under the class 'India'.]

P(England) = 2/5

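b) To predict the class of the test document ‘Chennai Hyderabad’, we compare the posterior probability of each class given the document.

As per Naïve Bayes, the posterior probability of a class c_j given a document with n words is proportional to the prior of the class multiplied by the conditional probabilities of the words:

P(c_j | w_1, w_2, …, w_n) ∝ P(c_j) * P(w_1 | c_j) * P(w_2 | c_j) * … * P(w_n | c_j)

The classifier predicts the class for which this value is largest. The normalizing constant P(w_1, w_2, …, w_n) is the same for every class, so it can be ignored when comparing classes.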
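The rule above can be written as a short function. This is a sketch under stated assumptions: `PRIORS`, `LIKELIHOODS`, and `class_score` are hypothetical names, and the tables hold only the estimates from part (a) that the test document needs, not the full vocabulary.

```python
from fractions import Fraction
from math import prod

# Estimates taken from part (a); only the words of the test document are listed.
PRIORS = {"India": Fraction(3, 5), "England": Fraction(2, 5)}
LIKELIHOODS = {
    "India": {"Chennai": Fraction(2, 7), "Hyderabad": Fraction(1, 7)},
    "England": {"Chennai": Fraction(1, 6), "Hyderabad": Fraction(1, 6)},
}


def class_score(words, label):
    """Return P(label) multiplied by P(word | label) for every word."""
    return PRIORS[label] * prod(LIKELIHOODS[label][word] for word in words)


test_doc = "Chennai Hyderabad".split()
print(class_score(test_doc, "India"))    # 6/245
print(class_score(test_doc, "England"))  # 1/90
```

The function returns the unnormalized score, which is sufficient for choosing between classes.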

P(India | ‘Chennai Hyderabad’) ∝ P(India) * P(Chennai | India) * P(Hyderabad | India)

= 3/5 * 2/7 * 1/7

= 0.6 * 0.286 * 0.143

= 0.0245

P(England | ‘Chennai Hyderabad’) ∝ P(England) * P(Chennai | England) * P(Hyderabad | England)

= 2/5 * 1/6 * 1/6

= 0.4 * 0.167 * 0.167

≈ 0.0111 (exactly 1/90)

After the calculation, we find that P(India | ‘Chennai Hyderabad’) > P(England | ‘Chennai Hyderabad’), because the value computed for India is the larger of the two. Hence, the predicted class of the given document is India.
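The comparison can be checked with a few lines of Python. The sketch below also divides each score by the sum of both scores to obtain normalized posteriors; that normalization step is an addition for illustration and is not needed for the decision itself. The variable names are hypothetical.

```python
from fractions import Fraction

# Unnormalized scores P(class) * P(Chennai | class) * P(Hyderabad | class).
scores = {
    "India": Fraction(3, 5) * Fraction(2, 7) * Fraction(1, 7),
    "England": Fraction(2, 5) * Fraction(1, 6) * Fraction(1, 6),
}

# Decision rule: pick the class with the largest score.
predicted = max(scores, key=scores.get)

# Optional: normalize so that the two posteriors sum to 1.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}

print(predicted)  # India
print({label: round(float(p), 3) for label, p in posteriors.items()})
# {'India': 0.688, 'England': 0.312}
```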
