Date: 2023-08-19

Artificial intelligence (AI) systems have made significant progress thanks to Large Language Models (LLMs) like ChatGPT, Bard, and Llama-2. These LLMs have showcased their remarkable capabilities by assisting in tool utilization, improving human evaluations, and simulating human interactive behaviors. However, as these LLMs are extensively deployed, ensuring the security and reliability of their responses becomes a major challenge.

Recent research has focused on how LLMs understand and use non-natural languages, specifically ciphers. The research team has introduced CipherChat, a framework designed to evaluate whether safety alignment methods developed for natural languages carry over to non-natural languages. In CipherChat, humans interact with LLMs through cipher-based prompts, system role assignments, and enciphered demonstrations. The framework examines the LLMs’ understanding of ciphers, their participation in conversations, and their sensitivity to inappropriate content.
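To make the idea of an enciphered input concrete, here is a minimal sketch of a simple substitution cipher in Python. The article does not name the ciphers used in CipherChat, so the Caesar shift below, the shift value of 3, and the function names `caesar_shift`, `encipher`, and `decipher` are illustrative assumptions, not the team's implementation.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave all other characters unchanged."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            shifted.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            shifted.append(ch)  # digits, spaces, and punctuation pass through
    return "".join(shifted)


def encipher(text: str, shift: int = 3) -> str:
    # Hypothetical helper: the shift of 3 is an assumption chosen for illustration.
    return caesar_shift(text, shift)


def decipher(text: str, shift: int = 3) -> str:
    # Reverses encipher() by shifting in the opposite direction.
    return caesar_shift(text, -shift)


if __name__ == "__main__":
    message = "Hello, world!"
    encoded = encipher(message)
    print(encoded)            # Khoor, zruog!
    print(decipher(encoded))  # Hello, world!
```

The enciphered text is unreadable at a glance but trivially reversible, which is the kind of non-natural-language input a safety mechanism trained on plain text may not have been exposed to.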

The study emphasizes the need for safety alignment methods tailored to non-natural languages like ciphers, so that safety keeps pace with the capabilities of LLMs. LLMs excel in human languages, and they also demonstrate unexpected proficiency in comprehending non-natural languages. Developing safety regulations that cover these non-traditional forms of communication is therefore crucial.

To assess CipherChat, the team ran experiments with realistic human ciphers on modern LLMs such as ChatGPT and GPT-4. The results revealed that certain ciphers can bypass GPT-4’s safety alignment with high success rates. This underscores the importance of developing customized safety alignment mechanisms for non-natural languages to ensure the reliability of LLMs’ responses.

Additionally, the research points to what the team calls a "secret cipher" within LLMs. The team suggests that LLMs may possess a latent ability to decipher certain encoded inputs, indicating a unique cipher-related capability. Building on this observation, the team introduced SelfCipher, a framework that activates this latent capability using only role-play scenarios and a limited number of natural language demonstrations. The effectiveness of SelfCipher suggests that these hidden abilities can be harnessed to improve how LLMs decipher encoded inputs and generate meaningful responses.

This research underscores the importance of addressing safety concerns in non-natural languages and highlights the promising applications of LLMs in deciphering and generating responses to encoded inputs.