Stop words are words that are filtered out before or after processing natural language text. The term usually refers to the most common words in a language, but there is no single definitive stop word list. Because these words are frequent yet carry little meaning on their own, excluding them from analysis often improves results.
Examples of Stop Words
The following are examples of stop words in the English language:
“a”, “about”, “above”, “after”, “again”, “against”, “all”, “am”, “an”, “and”,
“any”, “are”, “aren’t”, “as”, “at”, “be”, “because”, “been”, “before”, “being”, “below”,
“between”, “both”, “but”, “by”, “can’t”, “cannot”, “could”, “couldn’t”, “did”, “didn’t”.
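As a minimal sketch of what filtering looks like in practice, the snippet below removes the example words listed above from a sentence. The names STOP_WORDS, tokenize, and remove_stop_words are illustrative choices, the tokenizer is a deliberately simple regular expression, and the set is only the partial list shown here, not a complete stop word list.

```python
import re

# Partial list copied from the examples above; real lists are longer.
STOP_WORDS = {
    "a", "about", "above", "after", "again", "against", "all", "am", "an",
    "and", "any", "are", "aren't", "as", "at", "be", "because", "been",
    "before", "being", "below", "between", "both", "but", "by", "can't",
    "cannot", "could", "couldn't", "did", "didn't",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [token for token in tokenize(text) if token not in stop_words]


print(remove_stop_words("Both cats are sleeping because all birds flew away again"))
# ['cats', 'sleeping', 'birds', 'flew', 'away']
```

Using a set keeps each membership check cheap, which matters once the list grows to a few hundred words.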
Stop Words Removal Tools
Different tools use different lists of stop words. Some common stop word removal tools are:
- NLTK (Natural Language Toolkit): NLTK is a Python library that ships pre-defined stop word lists for multiple languages. The English list has well over 100 entries, with the exact count varying by version.
- Stop Word Filter: This is a Java-based tool that uses a list of stop words.
- Snowball: Snowball is a small string processing language designed for creating stemming algorithms for use in information retrieval. The project also provides stop word lists for multiple languages.
- R: R has a package called tm (text mining) that includes a set of stop words for multiple languages.
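For the NLTK case, loading the English list might look like the sketch below. It assumes NLTK is installed and that the stop word corpus has already been fetched with nltk.download("stopwords"). The helper name load_english_stop_words and the small fallback set are hypothetical additions so the snippet still runs when NLTK or its data is missing.

```python
def load_english_stop_words():
    """Return NLTK's English stop words, or a tiny fallback set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed.
        # LookupError: the corpus has not been downloaded yet.
        # The fallback is a hypothetical stand-in for illustration only.
        return {"a", "an", "and", "are", "as", "at", "be", "but", "by"}


stop_words = load_english_stop_words()
print(len(stop_words), "stop words loaded")
```

Because different tools ship different lists, the same text can yield different tokens depending on which list is loaded, so it is worth recording which list and version an analysis used.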
Advantages of Removing Stop Words
There are a few advantages to removing stop words:
- It can help improve the results of your text analytics by removing common, meaningless words.
- It can make your text analytics more efficient by reducing the amount of data that needs to be processed.
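The efficiency point can be illustrated with a small sketch that counts tokens before and after filtering. The stop word set here is a hypothetical, deliberately tiny list, and the sentence is made up for the example. Real savings depend on the corpus and the list used.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = {"a", "an", "and", "are", "at", "by", "the"}


def split_tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stop_words]


text = "a cat and a dog are at the door and the cat is by the window"
tokens = split_tokens(text)
kept = drop_stop_words(tokens)

print(len(tokens), "tokens before,", len(kept), "after")
# 16 tokens before, 6 after
print(Counter(kept).most_common(1))
# [('cat', 2)]
```

With the stop words gone, the most frequent remaining token is a content word, which is usually what a frequency-based analysis is looking for.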
Disadvantages of Removing Stop Words
Removing stop words also has a few disadvantages:
- It can remove important context from your data. For example, many lists treat “not” as a stop word, so if you are analyzing sentiment and the text includes the phrase “not good”, stop word removal would leave only “good” and reverse the meaning of the phrase.
- It can create issues with words that have more than one sense. For example, if a list treated “fly” as a stop word, removing it would strip the meaning from a phrase such as “fly fishing”.
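The negation problem, and one possible mitigation, can be sketched as follows. The stop word set, the NEGATIONS set, and the keep parameter are all hypothetical names chosen for this example. The idea is simply to exempt a few words from removal when the task, such as sentiment analysis, depends on them.

```python
# Hypothetical stop word list that, like many real lists, includes "not".
STOP_WORDS = {"a", "is", "not", "the", "was"}

# Words to protect from removal for sentiment-style tasks (an assumption).
NEGATIONS = {"no", "not", "never"}


def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except those explicitly listed in `keep`."""
    return [t for t in tokens if t not in stop_words or t in keep]


tokens = "the food was not good".split()

naive = remove_stop_words(tokens, STOP_WORDS)
print(naive)
# ['food', 'good']

safer = remove_stop_words(tokens, STOP_WORDS, keep=NEGATIONS)
print(safer)
# ['food', 'not', 'good']
```

The naive result reads as positive, while the version that keeps negations preserves the original meaning. Which words to protect depends on the task, so the exemption list is best treated as a tunable setting.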