Natural Language Processing MCQ - Bigram probability calculation using MLE
1. To compute the bigram probability P(wn|wn-1) using the Maximum Likelihood Estimate (MLE), we count the occurrences of the bigram (wn-1wn) in a corpus and normalize by the count of all bigrams that start with wn-1. This normalization step ensures that the estimate lies between 0 and 1.
P(wn|wn-1) = Count(wn-1wn) / Sum over w of Count(wn-1w)
Here, w ranges over every word that follows wn-1 in the corpus.
This equation can be simplified by replacing the sum of bigram counts in the denominator with the unigram count of wn-1. Why is this replacement valid?
a) Bigram count can only be normalized by unigram count
b) Sum of all bigram counts that start with the word wn-1 is equal to the unigram count of the same word
c) Normalization using the bigram count would make the estimate greater than 1 in some cases.
d) None of the above.
Answer: (b) Sum of all bigram counts that start with the word wn-1 is equal to the unigram count of the same word
Let us calculate the bigram probability P(increase | to) with both normalizations: first by the sum of bigram counts, then by the unigram count. (Note: hereafter ‘C’ stands for ‘Count’.)
Normalizing by sum of all bigram counts
For this case, we normalize by the total count of bigrams that start with the word “to”.
P(increase | to) = C(“to increase”) / [C(“to increase”) + C(“to be”) + C(“to fill”)] = 2 / [2 + 1 + 1] = 2/4 = 0.5
Normalizing by unigram count
For this case, we normalize by the unigram count of the word “to”.
P(increase | to) = C(“to increase”) / C(“to”) = 2/4 = 0.5
The word “to” occurs only 4 times in the corpus, and each occurrence is followed by exactly one word. Hence, the counts of all bigrams that start with “to” sum to exactly 4, which is the unigram count of “to”. For this reason, we can simplify the equation by normalizing with the unigram count instead of the sum of all bigram counts.
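The comparison above can be checked with a short Python sketch. The three sentences below are a hypothetical toy corpus, not the corpus from the question. They were chosen only so that the counts match the worked example (C(“to increase”) = 2, C(“to be”) = 1, C(“to fill”) = 1, C(“to”) = 4). The function names are also illustrative.

```python
from collections import Counter

# Hypothetical toy corpus (an assumption, not the corpus from the question),
# built so that C("to increase") = 2, C("to be") = 1, C("to fill") = 1.
sentences = [
    "we plan to increase sales and to increase profit",
    "they want to be ready",
    "she has to fill the form",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Pair each token with its successor within the same sentence.
    bigram_counts.update(zip(tokens, tokens[1:]))


def p_by_bigram_sum(word, prev):
    """MLE estimate normalized by the sum of counts of bigrams starting with prev."""
    total = sum(c for (first, _), c in bigram_counts.items() if first == prev)
    return bigram_counts[(prev, word)] / total


def p_by_unigram(word, prev):
    """MLE estimate normalized by the unigram count of prev."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(p_by_bigram_sum("increase", "to"))  # 0.5
print(p_by_unigram("increase", "to"))     # 0.5
```

Both functions return the same value here because every occurrence of “to” is followed by another word. For a word that ends a sentence (such as “profit” in this toy corpus), the two denominators differ unless an end-of-sentence marker is appended to each sentence before counting.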
