Date: 2023-08-17

Arthur, an AI performance platform trusted by large organizations, has launched Arthur Bench, an open-source evaluation tool for comparing large language models (LLMs), prompts, and hyperparameters for generative text models. This tool will allow businesses to assess the performance of different LLMs in real-world scenarios, assisting them in making data-driven decisions when adopting AI technologies.

In addition to Arthur Bench, Arthur has unveiled The Generative Assessment Project (GAP), a research initiative that ranks the strengths and weaknesses of language model offerings from industry leaders such as OpenAI, Anthropic, and Meta. The research suggests that Anthropic may have a competitive advantage over OpenAI’s GPT-4 in terms of “reliability” within specific domains. For example, Anthropic’s Claude-2 model outperformed GPT-4 in avoiding factual mistakes and providing appropriate responses when answering history questions. The goal of GAP is to share insights and best practices regarding behavior differences in language models with the public.

Arthur Bench is the latest addition to Arthur’s suite of LLM-focused products, following Arthur Shield. It offers several benefits to businesses:

1. Model Selection & Validation: Arthur Bench helps companies compare different LLM options using a consistent metric, enabling them to determine the best fit for their applications.

2. Budget & Privacy Optimization: Not all applications require the most advanced and expensive LLMs. Arthur Bench helps identify cost-effective models that perform the required tasks equally well. Additionally, it allows for greater control over data privacy by leveraging in-house models.

3. Translating Academic Benchmarks to Real-World Performance: Bench enables companies to test and compare the performance of different models quantitatively, using standard metrics. It also allows for the customization of benchmarks based on specific business needs and customer requirements.
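
To make the idea of comparing models on a consistent metric concrete, here is a minimal sketch in plain Python. It does not use Arthur Bench's actual API. The function names `token_f1` and `compare_models`, the model names, and the sample data are all hypothetical, and token-overlap F1 is only a stand-in for whatever metric a team would actually choose.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a reference answer.

    Illustrative metric only; not taken from Arthur Bench.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def compare_models(outputs_by_model: dict, references: list) -> dict:
    """Average the same metric over the same references for every model."""
    scores = {}
    for name, outputs in outputs_by_model.items():
        if len(outputs) != len(references):
            raise ValueError(f"{name}: expected {len(references)} outputs")
        total = sum(token_f1(o, r) for o, r in zip(outputs, references))
        scores[name] = total / len(references)
    return scores


if __name__ == "__main__":
    # Hypothetical reference answers and model outputs.
    references = ["the treaty was signed in 1919"]
    outputs_by_model = {
        "model_a": ["the treaty was signed in 1919"],
        "model_b": ["it was signed sometime after the war"],
    }
    for name, score in compare_models(outputs_by_model, references).items():
        print(f"{name}: {score:.2f}")
```

Because every candidate is scored against the same references with the same function, the resulting numbers are directly comparable, which is the property the list above describes. A business-specific benchmark would swap in its own reference set and metric.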

According to Priyanka Oberoi, Staff Data Scientist at Axios HQ, an early user of Arthur Bench, the tool has helped their team develop an internal framework for standardized LLM evaluation and meaningful performance metrics.

