Preprocessing technique
|
How?
|
Benefits
|
Extract root words
|
* Stemming (rule-based, dictionary-based, corpus-based)
* Lemmatization
|
1. Improves recall
2. Reduces index size
|
Stop word removal
|
A stop word list can be used
|
1. Improves retrieval efficiency
2. Reduces index size
|
Tokenization
(break sentences into tokens/keywords)
|
A typical solution is to split a sentence at non-letter characters, mostly whitespace.
|
Tokens are indexed for further processing.
|
Normalization
|
* Case folding (convert all text to lower case)
* Spelling variations (map variants to a common spelling)
* Diacritics/accent marks on letters (naïve to naive)
|
Variation between equivalent forms of a term is reduced, so they match the same index entry
|
Detecting common phrases
|
Index meaningful phrases as single units
|
More effective retrieval, because phrases are not broken up into a bag of words
|
Building index
|
Add preprocessed terms to an inverted index (it stores, for each term, the list of documents in which the term appears)
|
It is a lookup table for quickly finding all documents that contain a word.
|
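The steps in the table can be chained into one small pipeline. What follows is only a minimal sketch in Python under several assumptions: the stop word list, the suffix rules, the phrase list and the sample documents are toy values invented for illustration, and the stemmer is a naive suffix stripper rather than a published algorithm such as Porter's. All function names (`normalize`, `tokenize`, `merge_phrases`, `stem`, `preprocess`, `build_inverted_index`, `search`) are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Toy values, assumed for illustration only; real lists are far larger.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}
SUFFIXES = ("ing", "ed", "es", "s")   # naive rule-based stemming rules
PHRASES = {("new", "york")}           # phrases to keep as a single term


def normalize(text):
    """Case folding plus removal of diacritics (naïve -> naive)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split already-normalized text at non-letter characters."""
    return [tok for tok in re.split(r"[^a-z]+", text) if tok]


def merge_phrases(tokens):
    """Join known two-word phrases so they are indexed as one term."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, tokenize, merge phrases, drop stop words, then stem."""
    tokens = merge_phrases(tokenize(normalize(text)))
    return [stem(tok) for tok in tokens if tok not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = preprocess(query)
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "The naïve user is indexing documents",
    2: "Indexed documents improve retrieval",
    3: "A user retrieves the document in New York",
}
index = build_inverted_index(docs)

print(search(index, "Indexing documents"))  # [1, 2]
print(search(index, "naive users"))         # [1]
print(search(index, "New York"))            # [3]
```

The query goes through the same `preprocess` function as the documents, which is why "Indexing" matches "Indexed" and "naive" matches "naïve". Because "New York" is stored as the single term `new_york`, a query for "york" alone finds nothing, which illustrates the trade-off of indexing phrases as units.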