A tokenizer is a software program that takes in text and breaks it up into smaller pieces called tokens. In the context of text analytics, tokenizers are used to break apart words or phrases in order to better analyze the meaning of the text.
Tokenizers can be customized to break text up in a variety of ways, depending on the needs of the project. For example, a tokenizer might split text by sentence, by word, or on white space. Tokenizers can also be configured to recognize certain types of tokens, such as email addresses or URLs.
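As a rough sketch of what such customization can look like, the following Python function tokenizes on a regular expression that keeps URLs and email addresses intact. The pattern and function name here are illustrative, not taken from any particular library:

```python
import re

# Illustrative pattern: URLs and email addresses are kept whole,
# runs of word characters become word tokens, and punctuation
# marks stand alone as single-character tokens.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"               # URLs
    r"|[\w.+-]+@[\w-]+\.[\w.]+"   # email addresses
    r"|\w+"                       # ordinary words
    r"|[^\w\s]"                   # single punctuation marks
)

def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text)

print(tokenize("Email me at jo@example.com, or see https://example.com for details."))
# ['Email', 'me', 'at', 'jo@example.com', ',', 'or', 'see',
#  'https://example.com', 'for', 'details', '.']
```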
Outside of Text Analytics
The term tokenizer can also refer to a piece of software used in data security. In this context, a tokenizer takes in sensitive data and replaces it with a token: a string of random characters that has no mathematical relationship to the original value. The token can then stand in for the data in downstream systems, so the real value never has to be exposed. Unlike encryption, tokenization cannot be reversed by an algorithm alone; recovering the original data requires access to the mapping between tokens and values.
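Here is a toy sketch of that idea in Python. The in-memory dictionary stands in for the hardened token vault a real system would use, and the function names are illustrative only:

```python
import secrets

# Toy token vault mapping random tokens back to original values.
# Real systems keep this mapping in a separate, hardened service.
_vault: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random, unrelated token."""
    token = secrets.token_hex(8)  # 16 random hex characters
    _vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; only the vault holder can do this."""
    return _vault[token]

card = "4111 1111 1111 1111"
tok = tokenize(card)
print(tok)              # e.g. 'f3a91c0d27b4e8a1'; reveals nothing about the input
print(detokenize(tok))  # '4111 1111 1111 1111'
```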
Comparison with Other Terms
Tokenizer is sometimes used interchangeably with lemmatizer and stemmer, but there are key differences between these terms. A lemmatizer takes in a word and reduces it to its dictionary base form, also called a lemma. For example, the lemma of “running” is “run.” A stemmer takes in a word and strips affixes to produce a root form, which is not always a real word or the same as the lemma; depending on the algorithm, the stem of “running” could be “run,” but it could also be “runn.”
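The distinction is easy to see with NLTK's built-in Porter stemmer and WordNet lemmatizer (this assumes the WordNet data has already been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet corpus: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Here the stem and the lemma happen to agree.
print(stemmer.stem("running"))               # 'run'
print(lemmatizer.lemmatize("running", "v"))  # 'run' (treated as a verb)

# Here they diverge: the stem is a truncation, the lemma a dictionary word.
print(stemmer.stem("studies"))               # 'studi'
print(lemmatizer.lemmatize("studies"))       # 'study'
```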
Both lemmatizers and stemmers are used in text analytics to reduce the number of distinct word forms in a corpus and make analysis more consistent. Tokenizers, however, also serve purposes outside text analytics, such as protecting sensitive data as described above.
Tokenizer tools
There are a variety of tokenizer tools available, both open source and commercial. Some popular open source tokenizers include the Natural Language Toolkit (NLTK) and the Apache OpenNLP library. These libraries can be used to develop custom tokenizers for specific projects. For example, NLTK includes a sent_tokenize() function that can be used to split text into sentences, and a word_tokenize() function that can be used to split text into words.
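For instance, applying both functions to a short string (the Punkt sentence-tokenizer models must be downloaded once beforehand):

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenizers split text. NLTK makes this easy!"
print(sent_tokenize(text))
# ['Tokenizers split text.', 'NLTK makes this easy!']
print(word_tokenize(text))
# ['Tokenizers', 'split', 'text', '.', 'NLTK', 'makes', 'this', 'easy', '!']
```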
OpenNLP also provides sentence and word tokenizers, along with pre-trained tokenizer models for other languages such as French and Spanish.
There are also many commercial tokenizer offerings, such as the Rosette Text Analytics platform from Basis Technology. Rosette includes a number of text analytics capabilities, including tokenization, lemmatization, and stemming.