You are building an ngram model of a corpus. Should you stem the words and do the counts or leave them in the surface form? Give pros and cons and include what characteristics of the corpus might influence your decision.
Question:
You are building an ngram model of a corpus. Should you stem the words and do the counts or leave them in the surface form? Give pros and cons and include what characteristics of the corpus might influence your decision.
Answer:
- Stemming the words means there will be fewer types, since only base forms remain. This means that some generalizations will be captured (he swam, he swims → he swim). However, some distinctions will be lost, such as subject-verb agreement (I swim vs. she swims).
- Stemming is a good idea when there is only a small amount of data, so that each ngram has few examples, or in highly inflected languages, where each word has many different forms.
- If a large amount of data is available, however, ngrams over the surface forms can be more powerful and precise.
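
The trade-off in the answer above can be illustrated with a minimal Python sketch that counts bigrams over a tiny made-up corpus, once on surface forms and once on stems. The names `toy_stem` and `bigram_counts` and the corpus itself are hypothetical. The stemmer is a crude suffix stripper written only for this illustration, not a real stemming algorithm, and it does not handle irregular forms such as "swam".

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical, very crude suffix stripper (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


# Made-up toy corpus, already tokenized on whitespace.
corpus = "he swims daily . she swims daily . they swim daily . i swim daily".split()

surface = bigram_counts(corpus)
stemmed = bigram_counts([toy_stem(t) for t in corpus])

# Fewer distinct bigram types after stemming, so each type has more evidence.
print(len(surface), len(stemmed))

# Evidence for "swim daily" is pooled across "swim" and "swims" ...
print(surface[("swim", "daily")], stemmed[("swim", "daily")])

# ... but the agreement distinction "she swims" vs. "i swim" is no longer visible.
print(("she", "swims") in surface, ("she", "swims") in stemmed)
```

On a corpus this small the pooled counts are the main benefit. With much more data, the surface-form counts would have enough examples on their own and would keep the distinctions that stemming removes.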