Regulations Haven't Stopped AI Companies: They Continue to Collect Data from the Internet

AI companies have reportedly been bypassing the guidelines set out in robots.txt files.

With the rise of artificial intelligence, companies entering this field need vast amounts of data to develop their tools. The most obvious source for this data is, of course, the internet. However, not all internet content is free to be used for training AI. Websites use a file called robots.txt to specify whether, and by which crawlers, their content may be collected.
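As a minimal sketch of how this works, the Python standard library's `urllib.robotparser` can check whether a given crawler is permitted to fetch a URL. The rules and bot names below are hypothetical examples, not any real site's policy:

```python
# Sketch: checking robots.txt rules before crawling, using Python's
# standard-library robotparser. "ExampleAIBot" is a hypothetical
# crawler name; the rules are illustrative only.
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks one AI crawler from the entire site
# while allowing all other crawlers:
rules = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/1"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True
```

Note that this check is purely advisory: nothing in the protocol technically prevents a crawler from ignoring the result, which is exactly the behavior the report describes.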

According to a report by Reuters, many AI developers are choosing to bypass the directives in this file and are collecting data from these sites anyway. One of the companies drawing the most criticism for this practice is Perplexity, which bills itself as a "free AI search engine," but it is far from the only one.

OpenAI, Anthropic…

Reports indicate that many AI developers continue to extract content from sites while ignoring their robots.txt files. Although the report does not name the companies directly, Reuters has learned that OpenAI and Anthropic are among them. A server used by Perplexity was also found not to be honoring these guidelines. Perplexity CEO Aravind Srinivas previously stated that the company had "no case of bypassing the protocol first and then lying about it."

On the other hand, the robots.txt protocol has been in use since the 1990s and is not legally binding. Perhaps creating a new, stricter, and more detailed protocol could help address this issue.
