Date: 2023-08-09

OpenAI, the creator of ChatGPT, has introduced a new web crawler named GPTBot. Web crawlers are programs that scan websites automatically; search engines use them to index pages, and AI companies use them to gather text for training language models. The launch comes as OpenAI continues to develop its large language models (LLMs), such as GPT-3.5 and GPT-4.

GPTBot is intended to collect publicly available web data that may be used to improve future AI models. OpenAI states that allowing GPTBot to access a site can help AI models become more accurate and improve their general capabilities and safety. The company says it filters the crawled pages to remove sources that require paywall access, are known to gather personally identifiable information (PII), or contain text that violates OpenAI’s policies.

Site owners can block GPTBot if they do not want their content used for training AI systems. To do so, they add a “User-agent: GPTBot” line followed by “Disallow: /” to their site’s robots.txt file. Alternatively, they can let GPTBot crawl only specific parts of a site by listing individual directories under “Allow:” and “Disallow:” directives in the same GPTBot entry.
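The two robots.txt variants described above can be checked locally before they are deployed. The following is a minimal sketch using Python's standard urllib.robotparser module. The domain example.com and the directory names directory-1 and directory-2 are placeholders chosen for illustration, and the helper name gptbot_can_fetch is hypothetical. Python's parser is only an approximation of how a real crawler interprets the file, so treat the result as a sanity check rather than a guarantee of GPTBot's behavior.

```python
from urllib.robotparser import RobotFileParser

# Variant 1: block GPTBot from the entire site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Variant 2: allow one directory and block another.
# The directory names are placeholders, not real paths.
PARTIAL_ACCESS = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_can_fetch(robots_txt: str, url: str, agent: str = "GPTBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, rules in [("block all", BLOCK_ALL), ("partial", PARTIAL_ACCESS)]:
        for path in ["/", "/directory-1/page.html", "/directory-2/page.html"]:
            url = "https://example.com" + path
            print(name, path, gptbot_can_fetch(rules, url))
```

Because both entries name only the GPTBot user agent, other crawlers are unaffected by these rules; a site that wants to restrict them as well needs separate entries.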

OpenAI has not publicly disclosed whether web crawlers were used to train its existing language models, but there is speculation that data collected by GPTBot may be used to train future iterations such as GPT-5. OpenAI has filed a trademark application for GPT-5, hinting at its potential development. GPT-5 is expected to be larger and more powerful than GPT-4, OpenAI’s most capable model to date.

OpenAI has also faced lawsuits alleging that data used to train ChatGPT was taken without permission. Amid these concerns, some platforms, including Stack Overflow, Reddit, and Twitter, have considered charging AI companies for access to their data.

By releasing GPTBot together with instructions for blocking it, OpenAI signals a commitment to transparency and to giving site owners control over how web crawlers access their content.