Common preprocessing steps used in information retrieval tasks, Significance of preprocessing in information retrieval, All you need to know about text preprocessing in information retrieval

Question:

What are the common preprocessing steps used in information retrieval tasks?

Answer:

Preprocessing technique	How?	Benefits
Extract root words	* Stemming (rule-based, dictionary-based, corpus-based) * Lemmatization	1. Improves recall 2. Reduces index size
Stop word removal	Filter out tokens that appear in a stop word list	1. Improves retrieval efficiency 2. Reduces index size
Tokenization (break sentences into tokens/keywords)	A typical solution is to split a sentence at non-letter characters, most often whitespace.	Produces the tokens that later steps process and index.
Normalization	* Case folding (convert all text to lower case) * Spelling variations (map variants to one common spelling) * Diacritics/accent marks on letters (naïve to naive)	Variant forms of the same word map to a single index term
Detecting common phrases	Index meaningful phrases as single units	More effective retrieval, because phrases are not broken up into a bag of words
Building index	Add the preprocessed terms to an inverted index (for each term, it stores the list of documents in which the term appears)	Acts as a lookup table for quickly finding all documents that contain a word.
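
The steps in the table can be chained into a small pipeline. The following is a minimal sketch in Python using only the standard library, not a production implementation. The stop word list, the `toy_stem` suffix stripper, the function names, and the two sample documents are all assumptions made for illustration; a real system would typically use an established stemmer or lemmatizer and a much longer stop word list. Phrase detection is left out to keep the sketch short.

```python
import re
import unicodedata
from collections import defaultdict

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def normalize(text):
    """Case-fold and strip diacritics, e.g. 'Naïve' becomes 'naive'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split at runs of non-letter characters and drop empty strings.

    Assumes the text was already normalized to lower-case ASCII letters.
    """
    return [tok for tok in re.split(r"[^a-z]+", text) if tok]


def toy_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, tokenize, remove stop words, then stem."""
    tokens = tokenize(normalize(text))
    return [toy_stem(tok) for tok in tokens if tok not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each preprocessed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "The naïve approach is indexing documents",
    2: "Documents are indexed in a lookup table",
}
index = build_inverted_index(docs)
```

With these two documents, `index["index"]` and `index["document"]` both contain document ids 1 and 2: "indexing" and "indexed" reduce to the same root, which is the recall improvement the table refers to, and stop words such as "the" never enter the index.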