Date: 2023-08-08

OpenAI has introduced a web crawler called GPTBot with the aim of improving its artificial intelligence (AI) models. The company stated that web pages crawled by GPTBot may be used to improve future models, and that the collected data is filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or contain text that violates its policies. According to OpenAI, allowing GPTBot to access a website can help AI models become more accurate and improve their general capabilities and safety.

A web crawler, also known as a bot, is software that automatically visits websites and extracts their content. Search engines typically use crawlers to index pages for display in search results. OpenAI provided instructions on how to prevent GPTBot from accessing a website, either partially or entirely. Options include blocking the crawler’s IP addresses or adding rules for the GPTBot user agent to the site’s robots.txt file, which tells web crawlers which parts of a site they may access.
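To make the robots.txt option concrete, the following is a minimal sketch of what such rules might look like and how a site owner could check them with Python's standard `urllib.robotparser` module. The `GPTBot` user-agent token comes from OpenAI's announcement. The `/public/` and `/private/` paths, the `example.com` URLs, and the helper name `gptbot_allowed` are illustrative assumptions, not part of OpenAI's instructions.

```python
from urllib.robotparser import RobotFileParser

# Rules that block GPTBot from the entire site.
ROBOTS_BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rules that grant partial access. The directory names are hypothetical.
ROBOTS_PARTIAL = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    print(gptbot_allowed(ROBOTS_BLOCK_ALL, "https://example.com/"))
    print(gptbot_allowed(ROBOTS_PARTIAL, "https://example.com/public/post"))
    print(gptbot_allowed(ROBOTS_PARTIAL, "https://example.com/private/data"))
```

Note that robots.txt is advisory: it only works if the crawler chooses to honor it, which OpenAI says GPTBot does. Rules addressed to `GPTBot` do not affect other crawlers.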

OpenAI clarified that GPTBot’s requests to websites will come from the IP address ranges published on the OpenAI website.
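For site owners who prefer the IP-based option, here is a rough sketch of checking whether a request's source address falls inside a list of address ranges, using Python's standard `ipaddress` module. The range shown is a placeholder from the block reserved for documentation (192.0.2.0/24), not a real GPTBot range. The actual ranges must be taken from OpenAI's website, and the names `HYPOTHETICAL_GPTBOT_RANGES` and `is_from_listed_range` are assumptions made for illustration.

```python
from ipaddress import ip_address, ip_network

# Placeholder only: 192.0.2.0/24 is reserved for documentation examples.
# Replace it with the ranges OpenAI actually publishes.
HYPOTHETICAL_GPTBOT_RANGES = [ip_network("192.0.2.0/24")]


def is_from_listed_range(remote_addr: str, ranges=HYPOTHETICAL_GPTBOT_RANGES) -> bool:
    """Return True if remote_addr belongs to any of the given networks."""
    addr = ip_address(remote_addr)
    return any(addr in network for network in ranges)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "198.51.100.5"):
        verdict = "listed" if is_from_listed_range(candidate) else "not listed"
        print(candidate, verdict)
```

In practice this kind of check usually lives in a firewall or web server configuration rather than in application code. The sketch only shows the matching logic.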

In a related development, OpenAI and other AI companies made commitments to the White House to develop a watermarking system that would inform internet users when content was generated by AI. However, those commitments do not include a pledge to stop using internet data for training purposes.

By launching GPTBot, OpenAI aims to improve its AI models with web data while filtering out paywalled, PII-gathering, and policy-violating sources. Website owners can control GPTBot’s access to their sites, which gives them a degree of flexibility and transparency over how their content is used.