A Google research scientist, Nicholas Carlini, has demonstrated that OpenAI’s GPT-4 large language model (LLM) can help break safeguards placed around other machine learning models. His paper explores how AI-Guardian, a defense against adversarial attacks on models, can be undone using GPT-4, and it includes Python code suggested by GPT-4 for defeating the defense. The attacks reduce the robustness of AI-Guardian from 98 percent to just 8 percent. AI-Guardian, developed by Hong Zhu, Shengzhi Zhang, and Kai Chen, aims to prevent adversarial examples from tricking machine learning models. However, Carlini and GPT-4 were able to identify the backdoor trigger function used by AI-Guardian and construct adversarial examples that bypass it.

One potential limitation of Carlini’s approach is that it requires access to the confidence vector from the defense model, which may not always be available in real-world scenarios. Zhang said that the team has worked on an improved prototype system that is not vulnerable to Carlini’s approach. Despite these limitations, Carlini’s work demonstrates that GPT-4 can quickly and coherently describe attack methods and solutions. The research emphasizes the value of working with GPT-4 as a coding assistant.
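
To illustrate why access to the confidence vector matters, here is a minimal Python sketch. It is not Carlini’s attack and not AI-Guardian’s code: the toy classifier, its `WEIGHTS`, and the helper names `confidence_vector`, `label_only`, and `estimate_gradient` are all hypothetical. It shows a standard idea, finite-difference gradient estimation, under the assumption that the attacker can only query the model’s outputs.

```python
import math

# Hypothetical toy classifier (an assumption for illustration only):
# two classes, three input features, linear logits followed by softmax.
WEIGHTS = [[1.0, -2.0, 0.5], [-1.0, 1.5, 0.25]]


def confidence_vector(x):
    """Return softmax probabilities, one per class."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_only(x):
    """Return a one-hot vector: what an attacker sees without confidences."""
    probs = confidence_vector(x)
    top = probs.index(max(probs))
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


def estimate_gradient(query, x, cls, eps=1e-4):
    """Central finite-difference estimate of d query(x)[cls] / dx."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((query(up)[cls] - query(down)[cls]) / (2 * eps))
    return grad


x = [0.5, 0.1, -0.3]
grad_conf = estimate_gradient(confidence_vector, x, cls=0)
grad_label = estimate_gradient(label_only, x, cls=0)
print("with confidence vector:", grad_conf)
print("with labels only:      ", grad_label)
```

In this toy setup the label-only estimate is all zeros, because a tiny perturbation does not flip the predicted class, while the confidence-based estimate gives the attacker a direction in which to change the input.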

Carlini stated that the purpose of his work was more to demonstrate the value of LLM assistants than to showcase a novel attack technique. He acknowledged that manually crafting an attack algorithm would have been faster, but said the ability to communicate with a machine learning model in natural language and have it carry out an attack is both surprising and concerning. The research highlights both the capabilities and the risks of collaboration between humans and AI language models like GPT-4.