Shingle

A shingle is a contiguous sequence of words (or, less commonly, characters) taken from a document. In text analytics, shingles are generally created by sliding a fixed-size window across the words of a document and joining the words inside each window. For example, if we slide a window of three words across the following sentence:

“The quick brown fox jumps over the lazy dog”

we would create the following shingles:

“The quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy”, “the lazy dog”
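The sliding window above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization; the function name shingles and the parameter w are our own choices, not part of any library.

```python
def shingles(text, w=3):
    """Return the w-word shingles of text, in order of appearance."""
    # Assumption: words are separated by whitespace; punctuation is left as-is.
    words = text.split()
    # Slide a window of w words across the text, one word at a time.
    return [" ".join(words[i:i + w]) for i in range(len(words) - w + 1)]


sentence = "The quick brown fox jumps over the lazy dog"
for shingle in shingles(sentence):
    print(shingle)
```

A text with fewer than w words yields no shingles at all, which is worth keeping in mind for very short documents.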

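Shingles are used as features in text analytics, for example as input to topic models such as Latent Dirichlet Allocation (LDA) or in term co-occurrence analysis. They are also a standard building block for near-duplicate detection, where two documents are compared by how many shingles they share.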
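As one illustration of term co-occurrence analysis, the sketch below counts how often two words fall inside the same shingle window. It assumes lowercased, whitespace-separated words, and the function name cooccurrence_counts is hypothetical. Because neighbouring windows overlap, a pair of adjacent words is counted once for every window that contains both, so the numbers are window counts rather than counts of distinct occurrences.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text, w=3):
    """Count unordered word pairs that appear together in a w-word window."""
    # Assumption: lowercase the text and split on whitespace.
    words = text.lower().split()
    counts = Counter()
    for i in range(len(words) - w + 1):
        window = words[i:i + w]
        # Each unordered pair of distinct words is counted once per window.
        for pair in combinations(sorted(set(window)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts("The quick brown fox jumps over the lazy dog")
for pair, n in counts.most_common(5):
    print(pair, n)
```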

Benefits of Shingles

There are several benefits to using shingles in text analytics:

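  • Shingles preserve local word order, so they carry more context than single words. This can help to identify topics within a document: “lazy dog” says more than “lazy” and “dog” taken separately.
  • Shingle sets can be hashed or summarized into compact signatures (for example with MinHash), which helps when comparing large documents. Shingling on its own does not reduce dimensionality; it enlarges the feature space.
  • Shingle features can improve the results of algorithms that depend on phrasing, such as near-duplicate detection.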

Drawbacks of Shingles

However, there are also some drawbacks to using shingles:

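  • Long shingles are very specific: most of them occur only once or twice in a collection, which makes the data sparse and can make it difficult to identify general topics.
  • Shingling increases the time and memory required to process a document, because a document usually contains many more distinct shingles than distinct words.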
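The memory cost can be made concrete by counting distinct shingles for a few window sizes. The snippet below is a rough sketch on a made-up sentence; the helper distinct_shingles is our own name, and real corpora will show a much steeper growth than this toy text.

```python
def distinct_shingles(words, w):
    """Return the set of distinct w-word shingles, stored as tuples."""
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


# Toy text invented for illustration only.
text = ("the cat sat on the mat and the cat sat on the rug "
        "while the dog sat on the mat")
words = text.split()

# Compare how many distinct features each window size produces.
for w in (1, 2, 3):
    print(f"w={w}: {len(distinct_shingles(words, w))} distinct shingles")
```

With w = 1 the shingles are just the distinct words, so the first line serves as the baseline for the larger windows.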

Alternatives to Shingles

There are a few related and alternative representations in text analytics:

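N-grams: A word N-gram is a contiguous sequence of N words, so word-level N-grams and shingles are essentially the same thing. “Shingle” is the term more often used when a document is treated as a set of such sequences, while “N-gram” is also commonly applied to sequences of characters. For example, if we take an N of 3 from the same sentence as before, we would create the 3-grams “The quick brown”, “quick brown fox”, “brown fox jumps”, and so on through “the lazy dog”.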
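A generic N-gram helper for any N can be sketched as follows. It works on a list of tokens and returns tuples rather than joined strings; the name ngrams is our own, and the whitespace tokenization is an assumption.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    # Zip n staggered views of the token list; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "The quick brown fox jumps over the lazy dog".split()
print(ngrams(tokens, 2)[:3])  # first three 2-grams
print(len(ngrams(tokens, 3)))  # number of 3-grams in the sentence
```

Passing a list of characters instead of words gives character N-grams with the same function.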

Bag of words: A bag of words represents a document as a vector with one entry per vocabulary word. Each entry records how often that word occurs in the document (or simply whether it occurs). Word order and position are discarded.
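A bag of words can be sketched with a counter from the Python standard library. The example assumes lowercasing and whitespace tokenization, and bag_of_words is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(text):
    """Map each word to the number of times it occurs in text."""
    # Assumption: lowercase and split on whitespace; word order is discarded.
    return Counter(text.lower().split())


bag = bag_of_words("The quick brown fox jumps over the lazy dog")
print(bag["the"], bag["fox"], bag["cat"])

# Two texts with the same words in a different order get the same bag.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

The last line shows what the representation gives up compared with shingles: the two sentences are indistinguishable once order is dropped.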

TF-IDF: TF-IDF is a numerical representation of a document in which each word is given a weight. This weight is calculated by multiplying the word’s frequency in the document (term frequency) by its inverse document frequency, so words that appear in most documents of a collection receive low weights.
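The weighting can be sketched as below. TF-IDF has several variants; this one assumes raw counts for term frequency and log(N / df) for inverse document frequency, where N is the number of documents and df is the number of documents containing the word. The function name tf_idf and the sample documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Assumed variant: weight = raw count * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}
        )
    return weights


docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", {word: round(value, 3) for word, value in w.items()})
```

In this variant a word that occurs in every document, such as “the” here, gets a weight of zero, while a word found in only one document gets the highest weight.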

© 2023 VeritasNLP, All Rights Reserved. Website designed by Mohit Ranpura.