Document Triage
Document triage is the process of converting a set of digital files into
well-defined text documents. It is one of the two stages of text
pre-processing, the other being text segmentation.
The document triage process may involve one or more of the following steps,
depending on the origin of the files being processed:
Character encoding identification – For a document to be machine readable,
its characters must be represented in a character encoding, which specifies
how text is stored as binary data. Many encoding schemes exist (for example
ASCII, and Unicode encodings such as UTF-8 and UTF-16). This step determines
which character encoding a text file uses.
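As a rough illustration of the character encoding identification step, the sketch below first looks for a byte-order mark and then tries a short list of candidate encodings with strict decoding. The function name guess_encoding, the BOMS table, and the candidate list are assumptions made for this example; production systems usually rely on statistical detectors instead.

```python
import codecs

# Byte-order marks, checked longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, candidates=("ascii", "utf-8")) -> str:
    """Return a plausible codec name for raw bytes (hypothetical helper)."""
    # 1. A byte-order mark, when present, is the strongest hint.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise accept the first candidate that decodes without errors.
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # 3. Latin-1 maps every byte to a character, so it never fails;
    #    it is used here only as a last-resort fallback.
    return "latin-1"
```

For instance, guess_encoding(open(path, "rb").read()) would give a codec name that can then be passed to bytes.decode. The order of the candidates matters: pure ASCII text is also valid UTF-8, so the stricter encoding is tried first.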
Language identification – A document may contain text in a single language
or in multiple languages. This step identifies the language(s) used in the
document.
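One simple way to approximate language identification is to count how many common function words of each language appear in the text. The sketch below assumes tiny hand-picked stopword lists and hypothetical helper names (identify_language, identify_languages); real systems typically use character n-gram models trained on much larger data.

```python
import re

# Tiny illustrative stopword lists (an assumption for this sketch,
# far too small for real use).
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "les", "et", "est", "des", "une", "dans"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "spanish": {"el", "los", "las", "y", "es", "una", "que", "con"},
}


def identify_language(text: str) -> str:
    """Return the language whose stopwords occur most often, or 'unknown'."""
    # Keep runs of letters only (Unicode-aware), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def identify_languages(document: str) -> list:
    """Label each non-empty line separately to handle mixed-language files."""
    return sorted(
        {identify_language(line) for line in document.splitlines() if line.strip()}
    )
```

Labelling line by line is one possible way to handle documents that mix languages; very short lines carry few stopwords, so their labels are unreliable.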
Text sectioning – Identifies the actual content within a file while
discarding undesirable elements, such as images, tables, headers, links, and
HTML formatting.
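For HTML input, text sectioning can be sketched with the standard-library HTML parser: markup is dropped, and the contents of elements treated as non-content are skipped. The class name ContentExtractor, the helper extract_text, and the choice of SKIP_TAGS and INLINE_TAGS are assumptions for this example; note that it removes link markup but keeps the link text.

```python
from html.parser import HTMLParser

# Elements whose entire contents are discarded (an assumed choice).
SKIP_TAGS = {"script", "style", "table", "header", "footer", "nav"}
# Inline elements that should not introduce a word break.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}


class ContentExtractor(HTMLParser):
    """Collects visible text while ignoring markup and skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any element from SKIP_TAGS
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in INLINE_TAGS:
            self.chunks.append(" ")  # block boundary acts as whitespace

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in INLINE_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html: str) -> str:
    """Return the text content of an HTML string with whitespace normalized."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Images produce no text and disappear on their own, while tables and page headers are removed through SKIP_TAGS. Which elements count as undesirable depends on the collection being processed, so the two tag sets would need adjusting per source.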