A shingle is a contiguous sequence of words (or other tokens) taken from a document. In text analytics, shingles are generally created by sliding a fixed-size window across the words of a document, one word at a time, and joining the words inside each window. For example, if we slide a window of three words across the following sentence:
“The quick brown fox jumps over the lazy dog”
we would create the following shingles:
“The quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy”, “the lazy dog”
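The sliding window described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `word_shingles` and the parameter `k` are made up for this example, and words are assumed to be separated by whitespace.

```python
def word_shingles(text, k=3):
    """Return the k-word shingles of text, in order of appearance."""
    words = text.split()  # assumption: whitespace tokenization is good enough
    # Slide a window of k words across the text, moving one word at a time.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


sentence = "The quick brown fox jumps over the lazy dog"
for shingle in word_shingles(sentence, k=3):
    print(shingle)
```

A text with fewer than `k` words yields an empty list, since no full window fits.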
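Shingles are most often used to compare documents, for example in near-duplicate detection, where two documents are considered similar if they share many shingles. They can also serve as features for text analytics algorithms such as Latent Dirichlet Allocation (LDA) for topic modeling, or for term co-occurrence analysis.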
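As a rough sketch of how shingle windows might feed a term co-occurrence analysis, the example below counts, for each pair of words, the number of windows in which both appear. The function name `cooccurrence_counts` and the choice to count windows (rather than, say, sentences) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text, k=3):
    """Count, for each unordered word pair, the k-word windows containing both."""
    words = text.lower().split()
    counts = Counter()
    for i in range(len(words) - k + 1):
        # Distinct words in this window, sorted so each pair has one canonical order.
        window = sorted(set(words[i:i + k]))
        counts.update(combinations(window, 2))
    return counts


counts = cooccurrence_counts("the quick brown fox", k=3)
for pair, n in counts.most_common():
    print(pair, n)
```

Because windows overlap, words that sit close together are counted more than once, which acts as a simple proximity weighting.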
Benefits of Shingles
There are several benefits to using shingles in text analytics:
- Shingles can help to better identify topics within a document by providing more context than single words.
- Shingle sets can be hashed and compressed into short, fixed-size signatures (for example with MinHash), which reduces the cost of comparing large documents.
- Shingling can also improve the accuracy of algorithms that depend on word order, such as near-duplicate detection or phrase-based text classification.
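The point about context can be illustrated with two sentences that contain exactly the same words in a different order. The sketch below, with a hypothetical helper `shingle_set`, shows that their word lists match while their two-word shingles do not overlap at all.

```python
def shingle_set(text, k=2):
    """Return the set of distinct k-word shingles in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


a = "dog bites man"
b = "man bites dog"

# Same words, so a representation based on single words cannot tell them apart.
same_words = sorted(a.split()) == sorted(b.split())
# Different word order, so the two sentences share no 2-word shingle.
shared = shingle_set(a) & shingle_set(b)

print(same_words)  # True
print(shared)      # set()
```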
Drawbacks of Shingles
However, there are also some drawbacks to using shingles:
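- Long shingles can capture too much context: a long word sequence rarely repeats across documents, so the resulting features are sparse and individual topics become harder to identify.
- Shingling can also increase the time and memory required to process a document, because a document of n words yields n - k + 1 overlapping shingles of k words each.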
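To make the memory cost concrete, the sketch below measures how many characters are stored when every shingle is kept as its own string. The helper names are hypothetical, and storing shingles as plain strings is an assumption; hashing each shingle to an integer is a common way to reduce this cost.

```python
def word_shingles(text, k):
    """Return the k-word shingles of text, in order of appearance."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def stored_characters(text, k):
    """Total characters needed to store every k-word shingle as a string."""
    return sum(len(shingle) for shingle in word_shingles(text, k))


text = "The quick brown fox jumps over the lazy dog"
for k in (1, 2, 3, 4):
    n_shingles = len(word_shingles(text, k))
    print(f"k={k}: {n_shingles} shingles, {stored_characters(text, k)} characters")
```

Because neighboring shingles overlap, most words are stored up to k times, so the total grows roughly in proportion to the window size.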
Alternatives to Shingles
There are a few alternatives and closely related representations in text analytics:
- N-grams: An n-gram is any sequence of N consecutive items, and word n-grams are produced with the same sliding window as shingles, so a three-word shingle is simply a word 3-gram. The term n-gram is more general, since the items can also be characters. N-grams are usually counted as features, while shingles are usually treated as a set for comparing documents. For example, the word 3-grams of the same sentence as before begin with “The quick brown”, “quick brown fox”, “brown fox jumps” and continue through “the lazy dog”.
- Bag of words: A bag of words represents a document as a vector with one entry per vocabulary word, holding that word’s frequency in the document. Word order and position are discarded.
- TF-IDF: TF-IDF is a numerical representation of a document in which each word is assigned a weight. This weight is calculated by multiplying the word’s frequency in the document by its inverse document frequency, which is lower for words that appear in many documents of the collection.
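The three representations above can be sketched side by side in plain Python. The function names are hypothetical, an n-gram is treated here as any run of n consecutive items, and the TF-IDF weighting uses raw counts with idf = log(N / df), which is only one of several common variants.

```python
import math
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items (words, characters, ...)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def bag_of_words(text):
    """Map each word to its count in the text; word order is discarded."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for bag in bags for word in bag)
    return [
        {word: count * math.log(n_docs / df[word]) for word, count in bag.items()}
        for bag in bags
    ]


docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(ngrams("fox", 2))            # character 2-grams
print(ngrams(docs[0].split(), 3))  # word 3-grams
print(bag_of_words(docs[0]))
print(tf_idf(docs)[0])
```

In this toy collection, "the" appears in every document, so its inverse document frequency is log(1) = 0 and its TF-IDF weight vanishes, while a word found in only one document, such as "brown", receives the largest weight.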