Date: 2023-08-08

OpenAI has introduced GPTBot, a web crawler designed to gather data for training future AI systems. The launch coincides with OpenAI’s trademark application for “GPT-5,” which suggests that a new model is on the way.

GPTBot will collect publicly available data from websites, similar to popular search engines such as Google and Bing. It will avoid paywalled, sensitive, and prohibited content, but otherwise consider accessible information as fair game unless website owners specifically disallow it.
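As a rough illustration of the opt-out described above, the sketch below assumes the block is expressed through a site's robots.txt file using the user-agent token `GPTBot`, which is how OpenAI's crawler documentation describes it. The sample rules, the `is_allowed` helper, and the example.com URL are hypothetical and serve only to show how a site owner might check what their rules permit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, allow all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

A check like this only reports what the rules say. Whether any given crawler honors them is up to the crawler's operator.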

OpenAI has reassured users that GPTBot’s data scraping process will remove personally identifiable information (PII) and text that violates the company’s policies. However, concerns have been raised regarding the opt-out approach, with some technology ethicists arguing that it raises consent issues.

The announcement follows an earlier controversy over the data scraping OpenAI used to train its Large Language Models (LLMs), including ChatGPT. In response to those concerns, the company updated its privacy policies in April.

The trademark application for GPT-5 suggests that OpenAI is actively developing its next model for a future launch. Taken together with the new crawler, it implies a possible departure from the company’s previous focus on transparency and AI safety. Given the popularity of ChatGPT and OpenAI’s need for more diverse and up-to-date data, training that model will likely involve large-scale web scraping.

In contrast, social media giant Meta has introduced an open-source LLM. Although Meta has not disclosed its dataset sources, users can fine-tune the model using their own data. Meta aims to build a profitable business by sharing its data with third parties.

OpenAI’s ChatGPT, integrated with Microsoft’s Bing search engine, currently serves over 1.5 billion monthly active users. As the AI landscape becomes increasingly competitive, striking a balance between transparency, ethics, and capabilities will continue to pose challenges. The expansion of internet data collection further raises concerns regarding copyright and consent.