Date: 2023-08-02

When OpenAI released its latest text-generating AI, GPT-4, in March 2023, it excelled at identifying prime numbers. A few months later, the same test produced drastically different results. The discrepancy highlights the complexity of large AI models: instead of steadily improving at every task, they follow a winding road full of speed bumps and detours.

A recent preprint study by three computer scientists from Stanford University and the University of California, Berkeley, compared how GPT-4 and its predecessor, GPT-3.5, performed in March and in June. The researchers found significant differences between the two models and also observed changes in each model’s output over time.

In the June tests, GPT-4 gave less verbose answers than in March and was less inclined to explain itself. It began appending accurate but potentially disruptive descriptions to computer code snippets, and it became less likely to provide offensive or inappropriate responses. GPT-4 also improved slightly at solving visual reasoning problems and became harder to manipulate into bypassing its content moderation safeguards.

However, determining whether GPT-4 became better or worse overall between March and June is challenging. The definition of “better” is subjective, and OpenAI has not released benchmark data for every update. Explaining the changes in GPT-4’s performance is further complicated by OpenAI’s reluctance to disclose development and training details. Nonetheless, it is clear that GPT-4’s behavior has changed since its release, which can be problematic for developers and researchers who rely on this AI in their work.

Changes in AI behavior over time, known as “model drift,” have been observed in previous studies. These shifts force developers and users to adapt their approaches and prompts, and they can affect applications built on these models.
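One way a developer might watch for this kind of drift is to rerun a fixed set of questions against each model version and compare the scores. The Python sketch below illustrates the idea with the prime-number task. It is only an illustration under stated assumptions: `march_snapshot` and `june_snapshot` are hypothetical stand-ins written for this example, not real API calls, and they do not reproduce the study's numbers. The helper names and the 0.05 tolerance are likewise arbitrary choices.

```python
from typing import Callable, List, Tuple


def is_prime(n: int) -> bool:
    """Ground truth, computed by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(model: Callable[[int], bool], numbers: List[int]) -> float:
    """Fraction of numbers for which the model's answer matches the ground truth."""
    correct = sum(model(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


def compare_snapshots(
    old: Callable[[int], bool],
    new: Callable[[int], bool],
    numbers: List[int],
) -> Tuple[float, float, float]:
    """Return (old accuracy, new accuracy, change) on the same fixed test set."""
    old_acc = accuracy(old, numbers)
    new_acc = accuracy(new, numbers)
    return old_acc, new_acc, new_acc - old_acc


def has_drifted(delta: float, tolerance: float = 0.05) -> bool:
    """Flag drift when accuracy moves by more than the tolerance, in either direction."""
    return abs(delta) > tolerance


# Hypothetical stand-ins for two versions of a model (assumptions for this
# sketch). A real check would send the same prompt to each pinned version.
def march_snapshot(n: int) -> bool:
    return is_prime(n)  # pretends the older version always answers correctly


def june_snapshot(n: int) -> bool:
    return False  # pretends the newer version always answers "not prime"


NUMBERS = list(range(2, 22))  # fixed test set, reused for every comparison
old_acc, new_acc, delta = compare_snapshots(march_snapshot, june_snapshot, NUMBERS)

if __name__ == "__main__":
    print(f"old={old_acc:.2f} new={new_acc:.2f} drifted={has_drifted(delta)}")
```

Keeping the test set fixed is the important design choice here: if the questions change between runs, a change in score no longer says anything about the model.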

Two main factors shape an AI model’s behavior: its parameters and the data used to train it. Large AI models like GPT-4 have numerous parameters that guide their behavior, and modifying those parameters can lead to unexpected consequences. To refine a model’s performance, developers often employ fine-tuning, a process that incorporates new information and feedback into the model, similar to gene editing in biology.
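The side effects of adjusting parameters can be seen even in a toy setting. The sketch below is an analogy only and does not describe how GPT-4 is trained: it assumes a "model" with a single parameter `w`, trains it on one made-up task, then continues training on a second made-up task. Because both tasks share the same parameter, improving on the second task makes the first one worse. All data, names, and settings are invented for illustration.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]


def mse(w: float, data: Data) -> float:
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient_step(w: float, data: Data, lr: float) -> float:
    """One gradient-descent update of w on the mean squared error."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def train(w: float, data: Data, steps: int = 200, lr: float = 0.05) -> float:
    """Run a fixed number of gradient steps starting from w."""
    for _ in range(steps):
        w = gradient_step(w, data, lr)
    return w


# Two made-up tasks that pull the shared parameter in different directions.
task_a: Data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # follows y = 2x
task_b: Data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # follows y = 3x

w_base = train(0.0, task_a)      # initial training on task A
w_tuned = train(w_base, task_b)  # continued training ("fine-tuning") on task B

if __name__ == "__main__":
    print(f"task A error: before={mse(w_base, task_a):.4f} after={mse(w_tuned, task_a):.4f}")
    print(f"task B error: before={mse(w_base, task_b):.4f} after={mse(w_tuned, task_b):.4f}")
```

A real model has vastly more parameters, so the trade-offs are far harder to predict than in this one-parameter case, but the mechanism is the same: an update aimed at one behavior also moves parameters that other behaviors depend on.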

Understanding and managing the complexities of AI models like GPT-4 is crucial for using them effectively and for mitigating unintended consequences.