A Tokenized document is a text file that has been processed by a tokenizer, a software program that breaks the text up into smaller pieces called tokens. Tokens can be words, numbers, or punctuation marks.
The purpose of tokenization is to make the text easier to work with by reducing it to its essential elements. For example, if you were looking for all instances of the word “cat” in a document, a tokenizer would identify each occurrence of “cat” as a separate token, making them easier to find and count.
Tokenized documents are often used as input for other software programs that perform text analytics, such as text mining or natural language processing.
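As a minimal sketch of the idea, the following Python snippet tokenizes a string into word and punctuation tokens using a simple regular expression (real tokenizers handle many more edge cases, such as contractions and hyphenated words):

```python
import re

def tokenize(text):
    # Match runs of word characters, or single non-word, non-space
    # characters (punctuation marks) as individual tokens.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The cat sat. The cat ran!")
print(tokens)
# → ['The', 'cat', 'sat', '.', 'The', 'cat', 'ran', '!']
print(tokens.count("cat"))
# → 2
```

Once the text is reduced to a token list, finding every instance of “cat” is a simple list operation rather than a string search.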
What is the definition of Tokenized document outside of Text Analytics?
The term Tokenized document can also refer to a document that has been divided into tokens in a different way, such as by sentence or by paragraph. This type of tokenization is sometimes used for the purpose of information retrieval, so that each token can be treated as a separate unit of information.
How is Tokenized document different from related terms?
The term Tokenized document is sometimes used interchangeably with the term Text file, but there is a subtle difference between the two. A Text file is simply a file that contains text, while a Tokenized document has been through some level of processing, such as tokenization.
The term Parsed document is also similar to Tokenized document, but there is a key distinction between the two. A Parsed document has not only been tokenized, but the tokens have also been analyzed to extract additional information, such as the part of speech for each word.
A Tokenized document has only been divided into tokens, without any further analysis.
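The distinction can be made concrete with a toy example. Here, tokenization yields bare tokens, while “parsing” pairs each token with a part-of-speech label; the lookup table is purely illustrative, since a real parser would use a trained model rather than a fixed dictionary:

```python
# Illustrative part-of-speech lookup; a real parser uses a trained model.
POS = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}

def tokenize(text):
    # Tokenized document: tokens only, no further analysis.
    return text.lower().split()

def parse(text):
    # Parsed document: each token paired with extra information (here, POS).
    return [(tok, POS.get(tok, "UNK")) for tok in tokenize(text)]

print(tokenize("The cat sleeps"))
# → ['the', 'cat', 'sleeps']
print(parse("The cat sleeps"))
# → [('the', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')]
```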