One of the concerns surrounding generative AI is its secretive nature. Companies like Meta and OpenAI use large quantities of written material to create systems that can provide human-like answers to questions. However, the full extent of the texts used in training these programs is largely unknown to the public.

In a recent lawsuit filed in California, writers Sarah Silverman, Richard Kadrey, and Christopher Golden accused Meta of violating copyright laws. They alleged that Meta used their books to train LLaMA, a language model similar to OpenAI’s GPT-4. The lawsuit did not provide specific information about which books were used to train LLaMA.

However, an analysis of the dataset Meta used to train LLaMA revealed that the training data included upwards of 170,000 books, most of them published in the past 20 years. Among them were works by Michael Pollan, Rebecca Solnit, Jon Krakauer, James Patterson, Stephen King, George Saunders, Zadie Smith, and Junot Díaz. The dataset, known as “Books3,” was used to train not only LLaMA but also other generative-AI programs, including Bloomberg’s BloombergGPT and EleutherAI’s GPT-J.

Books3 is one component of a larger training collection known as “the Pile,” which also draws on other sources such as YouTube video subtitles, European Parliament documents, Enron Corporation emails, and more. Because generative AI works by analyzing relationships between words, the subject matter of the text matters less than the sheer quantity available.

The Books3 dataset is large enough that specialized programs were needed to manage and examine it. That analysis identified more than 170,000 books, both fiction and nonfiction, from large and small publishers alike. The collection included works by Elena Ferrante, Rachel Cusk, Haruki Murakami, Jennifer Egan, Jonathan Franzen, bell hooks, David Grann, and Margaret Atwood, as well as books by L. Ron Hubbard and John F. MacArthur.

Using pirated books to train generative AI is a common practice. Although some efforts are under way to build datasets exclusively from licensed material, the widespread availability and use of copyrighted works raise concerns about intellectual property rights and the impact on the publishing industry.

Although this practice is not widely known outside the AI community, it highlights the ethical questions surrounding the development and use of generative AI models. Transparency and collaboration with creators and rights holders are crucial to ensuring that copyrighted material is used within legal and ethical boundaries.