Date: 2023-08-11

OpenAI recently revealed details about its web crawler, GPTBot, which is used to retrieve webpages for training its AI models like GPT-4. The company added information about GPTBot to its online documentation site, stating that crawled webpages may be used to enhance future models. OpenAI believes that allowing GPTBot access to websites can improve the accuracy, capabilities, and safety of AI models.

OpenAI says that filters keep GPTBot away from paywalled sources, websites that collect personally identifiable information, and content that violates OpenAI’s policies. However, the option to block GPTBot comes too late to affect the current training data for ChatGPT and GPT-4, which was collected years ago without any announcement.

The documentation does not clarify whether blocking GPTBot will prevent web-browsing versions of ChatGPT or ChatGPT plugins from accessing real-time information on websites. OpenAI has been contacted for further clarification on this matter.

To identify GPTBot, OpenAI specifies its user agent token as “GPTBot”, with the full user agent string being “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”. The documentation also provides instructions for blocking GPTBot using robots.txt, a standard text file placed at the root of a site that tells web crawlers which paths they may or may not crawl.
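As a rough illustration of how a site operator might spot the crawler in server logs, the Python sketch below checks a request's User-Agent header for the token quoted above. The function name `is_gptbot` is hypothetical, and because user agent strings can be spoofed, a match is only a hint, not proof of origin.

```python
GPTBOT_TOKEN = "GPTBot"  # user agent token quoted in OpenAI's documentation


def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token.

    Hypothetical helper: the comparison is case-insensitive, and the
    header can be forged, so treat a match as a hint only.
    """
    return GPTBOT_TOKEN.lower() in user_agent.lower()


if __name__ == "__main__":
    full_string = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "GPTBot/1.0; +https://openai.com/gptbot)"
    )
    print(is_gptbot(full_string))                                    # True
    print(is_gptbot("Mozilla/5.0 (X11; Linux x86_64) Firefox/116.0"))  # False
```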

By adding two lines to a site’s robots.txt file, admins can block GPTBot:

User-agent: GPTBot
Disallow: /
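Admins who want to sanity-check the rule before deploying it could run it through Python's standard `urllib.robotparser` module. This is a minimal sketch: the helper name `can_fetch` and the `example.com` URL are placeholders, and Python's parser is only one interpretation of robots.txt, so it does not prove how GPTBot itself will behave.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # GPTBot is shut out of the whole site...
    print(can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/"))        # False
    # ...while crawlers not named in the file are unaffected.
    print(can_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/"))  # True
```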

OpenAI also allows admins to restrict GPTBot’s access to specific parts of a site by pairing Allow and Disallow directives under the GPTBot token in robots.txt:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
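To see how these per-directory rules play out, the sketch below feeds them to Python's standard `urllib.robotparser` and queries a few paths. The helper name `gptbot_allowed`, the `example.com` host, and the page names are hypothetical. Note that a path matching no rule is allowed by default, so a site that wants everything else closed off would also need a broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# The partial-access rules from the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_allowed(path: str) -> bool:
    """Return True if the rules above let GPTBot fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch("GPTBot", "https://example.com" + path)


if __name__ == "__main__":
    print(gptbot_allowed("/directory-1/page.html"))  # True: explicitly allowed
    print(gptbot_allowed("/directory-2/page.html"))  # False: explicitly disallowed
    print(gptbot_allowed("/elsewhere/page.html"))    # True: no rule matches
```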

Additionally, OpenAI has published the IP address blocks from which GPTBot operates, so admins can also block the crawler at the firewall.
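Firewall syntax varies by product, but the underlying check is a simple range match. The Python sketch below shows that logic with the standard `ipaddress` module. The CIDR ranges are placeholders taken from address space reserved for documentation, not OpenAI's actual ranges, and the names `HYPOTHETICAL_GPTBOT_RANGES` and `is_blocked` are invented for illustration.

```python
import ipaddress

# PLACEHOLDER ranges from reserved documentation address space.
# These are NOT GPTBot's real ranges; substitute the blocks that
# OpenAI publishes.
HYPOTHETICAL_GPTBOT_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in HYPOTHETICAL_GPTBOT_RANGES]


def is_blocked(client_ip: str) -> bool:
    """Return True if `client_ip` falls inside any listed range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in NETWORKS)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))   # True: inside the first placeholder range
    print(is_blocked("203.0.113.5"))  # False: outside both placeholder ranges
```

Unlike a robots.txt rule, which relies on the crawler choosing to comply, a range check like this is enforced by the site itself.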

It’s important to note that blocking GPTBot does not guarantee that a site’s data will not be used in training future AI models. Other data sets scraped from websites, such as The Pile, are not affiliated with OpenAI and can be used to train other language models.

The reaction to the option of blocking GPTBot has been mixed. Some individuals and organizations that had previously criticized OpenAI’s data usage said they intend to block GPTBot from accessing their content. Larger websites, however, face a dilemma: blocking language model crawlers could leave gaps in what future AI models know about them, shrinking their cultural footprint and online presence.

Generative AI is still evolving, and for now OpenAI has given websites a way to opt out of GPTBot’s crawling. How much blocking AI training crawlers will matter remains to be seen, particularly if AI chatbots become more prevalent as user interfaces.