Date: 2023-08-08

OpenAI recently launched a new website crawling bot to scan website content for training its large language models (LLMs). However, website owners and creators quickly discovered this bot, called GPTBot, and began sharing tips on how to block it from scraping their sites’ data. To address these concerns, OpenAI added a support page that provides instructions on how to block GPTBot by modifying a website’s robots.txt file.
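As an illustration of that robots.txt approach, the sketch below uses Python's standard-library `urllib.robotparser` to check that a two-line rule naming the `GPTBot` user agent denies it access to a whole site while leaving other crawlers unaffected. The helper name `is_allowed`, the `example.com` URL, and the `SomeOtherBot` agent are hypothetical and used only for demonstration; the check shows how a compliant crawler would interpret the file, not whether any particular bot actually honors it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body that disallows the GPTBot user agent site-wide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch the URL.

    Hypothetical helper: parses the robots.txt text in memory instead of
    downloading it from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/post.html"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

Because robots.txt is advisory, a rule like this only affects crawlers that choose to respect it, and it does nothing about content that was collected before the rule was added.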

It is unclear whether blocking GPTBot will completely keep content out of LLM training data, since other web scraping efforts may still collect it, but many website owners have already taken action. Outlets such as The Verge and Casey Newton’s Substack newsletter, Platformer, have added the robots.txt rule to prevent OpenAI from accessing their content. Neil Clarke, editor of the sci-fi magazine Clarkesworld, also announced that the magazine would block GPTBot.

In response to the backlash, OpenAI announced a $395,000 grant and partnership with New York University’s Arthur L. Carter Journalism Institute. This initiative aims to help students develop responsible ways to leverage AI in the news industry. OpenAI expressed support for addressing challenges related to the ethical implementation of AI in journalism, but its announcement did not mention the controversy surrounding web scraping.

Blocking GPTBot may provide some control over the use of open web content, but it remains uncertain whether it will effectively stop LLMs from using public data that is not behind a paywall. LLMs and other generative AI platforms have already used vast collections of public data, such as Google’s Colossal Clean Crawled Corpus and Common Crawl, for training. Data or content captured in those earlier scraping efforts is likely a permanent part of the training information used by AI platforms.

Web scraping practices for AI training have faced legal challenges. Although the U.S. Ninth Circuit Court of Appeals ruled that scraping publicly accessible data is legal, the practice has still come under scrutiny. OpenAI has been sued for allegedly copying book text without consent and for collecting personal data in violation of privacy laws, and other lawsuits accuse it of training LLMs on copyrighted works without consent. Platforms such as X and Reddit, meanwhile, have taken measures to restrict access to their datasets.

Overall, the backlash against OpenAI’s web scraping bot highlights ongoing concerns and legal debates surrounding the use of public data for training AI models.