The Robots Exclusion Protocol, also known as REP, is a standard that website owners use to communicate with web crawlers and other web robots. It specifies how a plain-text file called “robots.txt”, placed at the root of a website, tells web robots which parts of the site they should not crawl. REP governs crawling rather than indexing: a disallowed URL can still appear in a search index if other pages link to it.
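REP is commonly used by website owners to keep web robots away from pages that are not useful to crawl, such as login pages or sign-up forms. Because “robots.txt” is publicly readable and compliance is voluntary, it should not be relied on to protect sensitive or confidential information. Some crawlers also honor a nonstandard “Crawl-delay” directive, which asks a robot to wait a given number of seconds between requests so that it does not overload the website’s server; REP itself has no way to limit the number of pages crawled per day.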
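As a rough illustration of how a crawler might read a crawl delay, the sketch below uses Python's standard `urllib.robotparser` module. The file contents, the `ExampleBot` user agent, the `polite_delay` helper, and the one-second fallback are all assumptions made for this example, and not every crawler honors `Crawl-delay`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /login
"""


def polite_delay(robots_txt, user_agent, default=1.0):
    """Return the seconds to wait between requests for this user agent.

    Falls back to `default` (an assumed value) when the file does not
    declare a Crawl-delay for the agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# A crawler would sleep for this many seconds between requests.
print(polite_delay(ROBOTS_TXT, "ExampleBot"))
```

A crawler would typically call `time.sleep()` with the returned value between consecutive requests to the same host.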
The Robots Exclusion Protocol is not enforced by law or by any technical mechanism, so compliance is voluntary. Even so, most reputable web robots honor the standard and respect the instructions in a website’s “robots.txt” file.
Why is the Robots Exclusion Protocol important?
As a text analytics company, we often crawl websites to collect data that can be used for various purposes, such as sentiment analysis or market research. When we crawl a website, we check for a “robots.txt” file to see if there are any instructions from the website owner about which pages we should not crawl.
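The check described above can be sketched with Python's standard library. This is a minimal sketch, not our production crawler: the `ExampleTextBot` user agent and the `robots_url` and `load_rules` helpers are hypothetical names introduced for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleTextBot"  # hypothetical crawler name


def robots_url(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt always lives at the root of the scheme + host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def load_rules(page_url):
    """Download and parse the site's robots.txt (performs a network request)."""
    parser = RobotFileParser(robots_url(page_url))
    parser.read()
    return parser


print(robots_url("https://example.com/blog/post?id=3"))
```

Before fetching a page, a crawler could call `load_rules(url).can_fetch(USER_AGENT, url)` and skip the page when the result is `False`.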
For example, a website owner may have a “robots.txt” file that contains the following instructions:
User-agent: *
Disallow: /login
This tells every web robot not to crawl the “/login” page on the website, or any other URL whose path begins with “/login”.
We follow these instructions and do not crawl the “/login” page on the website.
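To make the “/login” example concrete, the sketch below parses a robots.txt that disallows “/login” for all robots and asks whether a few URLs may be crawled. It assumes rules of the form `User-agent: *` followed by `Disallow: /login`; the `ExampleTextBot` user agent, the `example.com` URLs, and the `may_crawl` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents matching the example in the text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""

USER_AGENT = "ExampleTextBot"  # hypothetical crawler name


def may_crawl(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/login",
    "https://example.com/login/reset",
    "https://example.com/about",
):
    print(url, may_crawl(ROBOTS_TXT, USER_AGENT, url))
```

Rules are matched by path prefix, so both “/login” and “/login/reset” are reported as disallowed, while “/about” remains crawlable.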