An N-gram is a contiguous sequence of N items, typically words, that is processed as a unit. The term is most common in text analytics, but it is also used in other fields, where the items can be of any kind (characters or other symbols, for example), not just words.
N-gram First Used
The idea behind N-grams is usually traced to Claude Shannon, an American mathematician, whose work on information theory in the late 1940s and early 1950s modeled English text using statistics over sequences of letters and words. N-gram models were later popularized by Frederick Jelinek, a Czech-American researcher, whose speech recognition work applied them to sequences of words in a sentence.
Applications Using N-grams
N-grams are used in many different applications, such as natural language processing, computational linguistics, and speech recognition.
N-grams can be unigrams (single items), bigrams (pairs of adjacent items), trigrams (triplets of adjacent items), or higher-order n-grams. For example:
- unigram: “dog”
- bigram: “dog food”
- trigram: “dog food bowl”
- 4-gram: “dog food bowl dish”
- 5-gram: “dog food bowl dish table”
As you can see, N-grams can be of any length. The length of an N-gram is referred to as its order.
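Extracting N-grams of a given order from a list of items amounts to sliding a window of that length across the sequence. The following is a minimal sketch in Python; the function name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of any particular library.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 windows of size n
    # (and none at all when n > L, since the range is then empty).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical input: the example phrase split on whitespace.
tokens = "dog food bowl dish table".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Here `bigrams` holds the four adjacent pairs starting with `("dog", "food")`, and `trigrams` holds the three adjacent triplets. Asking for an order larger than the sequence length simply yields an empty list.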
Applications for N-grams
Some common applications for N-grams include:
- Information retrieval: N-grams are often used to index and retrieve documents that contain a given sequence of items.
- Statistical modeling: N-grams are often used in statistical models, such as probabilistic language models, which are used to predict the likelihood of a sequence of items occurring.
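- Text generation: An N-gram model can generate text by repeatedly choosing the next item based on the items that precede it. (The Google Books Ngram Viewer, by contrast, does not generate text; it charts how often n-grams appear in a corpus of books over time.)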
- Word embeddings: N-grams can be used to create word embeddings, which are vector representations of words that capture the context in which they occur. fastText, for example, builds word vectors from character n-grams.
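To make the statistical modeling item above more concrete, here is a minimal sketch of a bigram language model in Python. It estimates the probability of each word given the previous word from raw counts, with no smoothing. The function names, the `<s>` and `</s>` boundary markers, and the toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram_model(sentences: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next | previous) from bigram counts (no smoothing)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        # Boundary markers let the model learn how sentences start and end.
        padded = ["<s>"] + sentence + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    model: Dict[str, Dict[str, float]] = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: c / total for word, c in followers.items()}
    return model


def sentence_probability(model: Dict[str, Dict[str, float]], sentence: List[str]) -> float:
    """Multiply the bigram probabilities along the sentence."""
    padded = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        # Unseen bigrams get probability 0 because nothing is smoothed.
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "dog", "eats"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "sleeps"],
]
model = train_bigram_model(corpus)
p_dog_given_the = model["the"]["dog"]
p_sentence = sentence_probability(model, ["the", "dog", "sleeps"])
```

In this toy corpus "the" is followed by "dog" in two of its three occurrences, so `p_dog_given_the` is 2/3. A practical model would normally add smoothing so that word pairs never seen in training do not force the whole sequence probability to zero.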
N-grams are used in many different fields because they offer a simple way to capture local context in sequential data, whether the items are words, characters, or other symbols.