The New York Times (NYT) has recently updated its terms of service (TOS) to explicitly prohibit the scraping of its articles and images for artificial intelligence (AI) training purposes. The move responds to the growing number of commercial AI language applications built on data scraped from the internet without authorization.

Under sections 2.1 and 4.1 of the updated TOS, the NYT specifies that its content, including articles, videos, images, and metadata, cannot be used to train AI models without express written permission. The TOS emphasizes that the content is intended for personal, non-commercial use, and it specifically excludes the development of software programs and the training of AI systems from that use.

Nonetheless, restrictions of this kind have not historically deterred companies from treating the internet as a vast dataset. Major language models like OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2, and Google’s PaLM 2 have all been trained on large datasets obtained through web scraping. Through unsupervised learning, these models analyze the relationships between words in that text to build an internal representation of language.

The use of scraped data for AI training remains contentious, with unresolved legal challenges that include a lawsuit accusing OpenAI of plagiarism. In response to such concerns, news organizations, including the Associated Press, have called for a legal framework to protect content used in AI applications.

OpenAI, anticipating further legal challenges, has taken steps to address the criticism. It recently published instructions that let websites block its AI-training web crawler using robots.txt. Some sites and authors have already indicated that they intend to block the crawler.
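To make the robots.txt mechanism concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It assumes the crawler identifies itself with the user-agent token `GPTBot`, which is the name OpenAI documents for its crawler. The robots.txt contents, the `is_allowed` helper, and the example.com URL are hypothetical and for illustration only; they are not taken from any publisher's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to opt out of
# OpenAI's crawler while leaving other crawlers unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/2023/08/some-article.html"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

With this file, a well-behaved crawler that announces itself as `GPTBot` would skip the whole site, while crawlers with other user-agent tokens are not affected. A robots.txt file is advisory: it only works if the crawler chooses to read and obey it.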

However, previously scraped data, including content from the New York Times, is already incorporated into GPT-4. Whether OpenAI and other AI vendors will honor content owners’ requests to exclude their material remains to be seen. If they do not, the industry may see more AI-related lawsuits or regulatory intervention.