2023-08-17

Researchers from Arthur AI conducted a study on AI models from Meta, OpenAI, Cohere, and Anthropic to analyze their performance. The study revealed that some models had a higher tendency to fabricate facts or “hallucinate” compared to others. Cohere’s AI model displayed the highest rate of hallucinations, while Meta’s Llama 2 had more hallucinations overall than GPT-4 and Claude 2.

GPT-4 outperformed all the models tested in the study and showed a significant reduction in hallucinations compared to its prior version, GPT-3.5. For example, on math questions, GPT-4 hallucinated between 33% and 50% less depending on the category.

The research comes amid growing concerns about misinformation generated by AI systems, particularly ahead of the 2024 U.S. presidential election. According to Adam Wenchel, co-founder and CEO of Arthur AI, this report is the first to comprehensively examine rates of hallucination, rather than simply providing a single ranking number.

Hallucinations occur when large language models (LLMs) generate false information and present it as fact. One instance cited in the study involved “bogus” cases generated by ChatGPT appearing in a New York federal court filing, which could lead to sanctions for the attorneys involved.

During the study, researchers tested the AI models in categories such as combinatorial mathematics, U.S. presidents, and Moroccan political leaders. The questions were designed to challenge the LLMs’ reasoning abilities. Overall, OpenAI’s GPT-4 performed best and hallucinated less than GPT-3.5. In contrast, Meta’s Llama 2 hallucinated more than GPT-4 and Anthropic’s Claude 2.

The study also examined the models’ tendency to hedge their answers with warning phrases. GPT-4 showed a 50% increase in hedging compared to GPT-3.5, making it potentially more frustrating to use. Cohere’s AI model did not hedge at all, while Claude 2 demonstrated the most self-awareness by accurately gauging its knowledge and answering questions only when supported by training data.
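As a rough illustration of how hedging might be measured, the sketch below flags responses that contain common warning phrases and reports the share of flagged responses. The phrase list and the names `HEDGE_PATTERNS`, `is_hedged`, and `hedge_rate` are hypothetical; the article does not describe the criteria the study actually used.

```python
import re

# Hypothetical hedging phrases, chosen for illustration only.
# The study's real criteria are not described in this article.
HEDGE_PATTERNS = [
    r"\bas an ai\b",
    r"\bi(?: am|'m) not sure\b",
    r"\bi (?:cannot|can't) (?:provide|answer|verify)\b",
    r"\bi do(?: not|n't) have (?:enough )?(?:information|access)\b",
]
_HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), re.IGNORECASE)


def is_hedged(response: str) -> bool:
    """Return True if the response contains any of the hedging phrases."""
    # Normalize typographic apostrophes so "I’m" matches "I'm".
    normalized = response.replace("\u2019", "'")
    return _HEDGE_RE.search(normalized) is not None


def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as hedged (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_hedged(r) for r in responses) / len(responses)
```

Comparing `hedge_rate` across two models on the same prompts gives a relative figure of the kind quoted above, although simple phrase matching will miss hedges worded differently.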

The key takeaway for users and businesses is to test AI models on their own workloads to understand how they perform in a real-world context. Benchmarks alone may not accurately reflect how the models behave in practical applications.
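One way to act on this advice is a small evaluation harness run against your own questions. The sketch below is a minimal example under stated assumptions: `Case`, `evaluate`, `wrong_rate`, and `stub_model` are hypothetical names, the model is any callable that returns an answer string or `None` when it declines to answer, and exact-match scoring stands in for the more careful grading a real evaluation would need.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple, Optional


class Case(NamedTuple):
    category: str
    question: str
    expected: str


def evaluate(model: Callable[[str], Optional[str]], cases: Iterable[Case]) -> dict:
    """Tally correct, wrong, and abstained answers per category."""
    tallies = defaultdict(lambda: {"correct": 0, "wrong": 0, "abstained": 0})
    for case in cases:
        answer = model(case.question)
        if answer is None:
            outcome = "abstained"  # the model declined to answer
        elif answer.strip().lower() == case.expected.strip().lower():
            outcome = "correct"
        else:
            outcome = "wrong"  # a crude proxy for a hallucinated answer
        tallies[case.category][outcome] += 1
    return {category: dict(counts) for category, counts in tallies.items()}


def wrong_rate(tally: dict) -> float:
    """Share of wrong answers among all cases in one category."""
    total = sum(tally.values())
    return tally["wrong"] / total if total else 0.0


# Illustrative workload and a stand-in model; replace both with your own.
CASES = [
    Case("math", "How many ways can 3 books be ordered?", "6"),
    Case("math", "What is 10 choose 2?", "45"),
    Case("presidents", "Who was the 16th U.S. president?", "Abraham Lincoln"),
]


def stub_model(question: str) -> Optional[str]:
    canned = {
        "How many ways can 3 books be ordered?": "6",
        "What is 10 choose 2?": "40",  # deliberately wrong
    }
    return canned.get(question)  # None means the model abstains


if __name__ == "__main__":
    for category, tally in evaluate(stub_model, CASES).items():
        print(category, tally, f"wrong rate: {wrong_rate(tally):.2f}")
```

Keeping abstentions separate from wrong answers matters here, since a model that declines to answer is behaving differently from one that fabricates a response.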