There is a common saying that if people knew how sausages were made, they would never eat them. The stereotype is unfair to much of the meat-processing industry, but it is a useful cautionary principle for any product whose manufacturing process is shrouded in secrecy. Which brings us to the tech companies currently promoting their generative AI marvels, particularly the large language models (LLMs) that compose fluent, coherent English sentences in response to human prompts.

The standard explanation for how this works centers on the extensive data on which the models are trained. Human-published material in machine-readable form is crawled to build an enormous corpus, which is then used for training. The technology combines massive computing power, sophisticated algorithms (including the “transformer” architecture introduced by Google researchers in 2017), and neural networks, an approach revived in 1986 when computer scientist Geoff Hinton and colleagues popularized the backpropagation training method. The resulting models generate text by statistical prediction: given the words so far, they repeatedly estimate which word is most likely to come next.
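The idea of "determining the most likely word to follow" can be illustrated with a deliberately tiny sketch. It is not how a production LLM works: real systems use transformer neural networks over sub-word tokens and billions of learned parameters, whereas this example merely counts which word follows which in a few words of text. The corpus and the function names (`train_bigram_counts`, `most_likely_next`, `generate`) are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def most_likely_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


def generate(following, start, max_words=8):
    """Extend `start` by repeatedly appending the most likely next word."""
    words = [start.lower()]
    while len(words) < max_words:
        nxt = most_likely_next(following, words[-1])
        if nxt is None:
            break  # nothing ever followed this word in the training text
        words.append(nxt)
    return " ".join(words)


# Hypothetical toy "training data"; a real model is trained on a vast crawled corpus.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigram_counts(corpus)

print(most_likely_next(model, "the"))  # cat
print(generate(model, "the"))
```

Even at this scale the point of the paragraph is visible: the program has no idea what a cat is. It only reproduces the statistics of whatever text it was given.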

However, these machines are essentially expensive “stochastic parrots” and do not possess true intelligence. The tech industry benefits from the belief that AI could pose an existential threat, because that belief diverts attention from the actual harm caused by the technology as it is deployed today.

One crucial principle in computing is GIGO: garbage in, garbage out. It applies to LLMs, whose output can only be as good as the data they are trained on. Unfortunately, AI companies are secretive about that data, most of which is gathered by web crawlers that systematically browse the internet. One source of such crawled data is Common Crawl, a public archive of web pages, and it is unclear how much pirated material has ended up in the training data.

The environmental cost of these systems is a further concern. In 2019, training an early LLM was estimated to emit 300,000 kg of CO2, the equivalent of 125 round-trip flights between New York and Beijing. Although companies claim to offset these emissions, they are conspicuously secretive about the environmental costs of their operations.
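As a quick sanity check on that comparison, the two quoted figures imply a particular amount of CO2 per round trip. The short calculation below only divides one number from the text by the other. It does not verify either figure independently, and the text does not say whether a "flight" is counted per passenger or per aircraft.

```python
# Figures as quoted in the text; nothing here is independently measured.
TRAINING_EMISSIONS_KG = 300_000   # estimated CO2 from training one early LLM (2019)
ROUND_TRIP_FLIGHTS = 125          # New York-Beijing round trips said to be equivalent

# Emissions per round trip implied by the comparison.
kg_per_round_trip = TRAINING_EMISSIONS_KG / ROUND_TRIP_FLIGHTS

print(f"{kg_per_round_trip:,.0f} kg of CO2 per round trip")  # 2,400 kg of CO2 per round trip
```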

It is therefore crucial to address the lack of transparency surrounding these inscrutable, corporately owned machines. Regulators should prioritize formalizing and mandating detailed disclosure of the measurement and control practices used by those who develop and operate advanced AI systems.

Ultimately, transparency is essential to understanding the true nature of generative AI technology and its implications. It is vital to shed light on how these “sausages” are made to ensure responsible and ethical use of these powerful tools.