OpenAI has introduced GPTBot, a web crawler designed to collect data for training the company’s next generation of AI systems. Together with OpenAI’s trademark application for “GPT-5,” the launch suggests that a new model release is forthcoming. Like the crawlers used by search engines such as Google and Bing, GPTBot gathers publicly available data from websites. Website owners who do not want their content included in the dataset can add a “disallow” rule for GPTBot to their site’s robots.txt file. OpenAI says the collected data will be filtered to remove personally identifiable information and content that violates the company’s policies.
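
The opt-out is expressed in a site’s robots.txt file. As a minimal sketch, the Python snippet below uses the standard library’s urllib.robotparser to check how a rule that blocks GPTBot would be read. The robots.txt content, the example.com URL, and the helper name is_allowed are assumptions made for illustration. The sketch shows only how a standards-following parser reads the rule, not how GPTBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block GPTBot from the whole site
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # illustrative URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Under these assumptions, the rule blocks GPTBot while other crawlers, such as Googlebot, remain unaffected. A site owner could narrow the rule to specific directories by replacing “/” with a path prefix.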

GPTBot’s opt-out approach has raised concerns among technology ethicists, who argue that it leaves questions of consent unresolved. Some users on Hacker News defend OpenAI’s decision, arguing that AI models need current data to stay up to date. Earlier this year, OpenAI updated its privacy policies in response to criticism of its data scraping practices, particularly with regard to training large language models like ChatGPT. The trademark application for GPT-5 further indicates that OpenAI is preparing to train its next model.

In contrast to OpenAI’s focus on gathering extensive data for its models, Meta, the social media giant, has adopted a different strategy. Meta provides an open-source language model but restricts its use by competitors and large businesses. Meta does not disclose the datasets it uses or the information it collects, but it allows users to fine-tune the model with their own data.

OpenAI’s ChatGPT remains widely used, and the company’s partnership with Microsoft has bolstered Bing’s capabilities. As OpenAI continues to drive advances in AI, its expanding collection of internet data raises concerns about copyright and consent. Balancing transparency, ethics, and capabilities will be a complex challenge as AI systems grow more sophisticated.