Google DeepMind Develops RT-2: A Language Model for Robotic Learning

By Robert Andrew

Aug 2, 2023
Researchers from Google DeepMind have introduced RT-2, a system that uses large language models (LLMs) to improve robots’ ability to learn and perform new tasks. RT-2 aims to improve on its predecessor, RT-1, whose success rate on novel, unseen scenarios was only 32%.

RT-2 (Robotics Transformer 2) is a vision-language-action (VLA) model. It draws on text and images from the web, alongside robot demonstration data, to generalize to new tasks. Unlike earlier robot models, RT-2 does not need to be explicitly trained on every object or concept it encounters. For example, it has a general notion of what trash is and how to dispose of it, and it can work out the steps needed to do so without having been trained on that specific task.

To build RT-2, the researchers started from two pretrained vision-language models: PaLI-X, with 55 billion parameters, and PaLM-E, an embodied multimodal language model, in a 12-billion-parameter version. These models are trained on large web-scale datasets, from which they learn relationships between words and visual concepts and how to interpret context. RT-2 adapts them with robot data so that they can output robot actions as well as text, carrying their web-learned knowledge over to the control of a physical robot.
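In the RT-2 paper, robot actions are produced the same way the model produces text. Each action consists of a termination flag, six end-effector displacements (three positional, three rotational), and a gripper extension value, and every continuous dimension is discretized into 256 bins, so the whole action becomes a short string of integers. The Python sketch that follows illustrates that encoding under stated assumptions: the normalized range [-1, 1] and the helper names (`discretize`, `bin_center`, `encode_action`, `decode_action`) are hypothetical and are not DeepMind's code.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 256  # bins per continuous action dimension, as described for RT-2
DELTA_DIMS = 6  # 3 positional + 3 rotational end-effector displacements


def discretize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value in [low, high] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)  # out-of-range values are clipped
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * num_bins), num_bins - 1)  # value == high maps to the last bin


def bin_center(index: int, low: float, high: float, num_bins: int = NUM_BINS) -> float:
    """Recover a continuous value as the center of bin `index`."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def encode_action(terminate: bool, deltas: Sequence[float], gripper: float,
                  low: float = -1.0, high: float = 1.0) -> str:
    """Render one action as a space-separated string of integer tokens.

    Assumption: all continuous dimensions share the hypothetical range [low, high].
    """
    if len(deltas) != DELTA_DIMS:
        raise ValueError(f"expected {DELTA_DIMS} displacement values")
    tokens = [1 if terminate else 0]  # discrete termination command
    tokens += [discretize(d, low, high) for d in deltas]
    tokens.append(discretize(gripper, low, high))
    return " ".join(str(t) for t in tokens)


def decode_action(text: str, low: float = -1.0,
                  high: float = 1.0) -> Tuple[bool, List[float], float]:
    """Parse a token string emitted by the model back into a continuous action."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != DELTA_DIMS + 2:
        raise ValueError(f"expected {DELTA_DIMS + 2} tokens, got {len(tokens)}")
    terminate = tokens[0] == 1
    deltas = [bin_center(t, low, high) for t in tokens[1:1 + DELTA_DIMS]]
    gripper = bin_center(tokens[-1], low, high)
    return terminate, deltas, gripper


if __name__ == "__main__":
    text = encode_action(False, [0.0, 0.5, -0.5, 0.0, 0.0, 0.25], 1.0)
    print(text)
    print(decode_action(text))
```

Because decoding returns the center of each bin, an in-range value comes back within half a bin width of the original, which is the precision cost of treating actions as vocabulary tokens.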

While RT-2 shows improved generalization, semantic understanding, and visual reasoning, it cannot yet learn new physical actions on its own. Instead, it applies the motions it already learned from robot data to new scenarios. Even so, RT-2 is a significant step toward general-purpose robots that can learn and adapt to a wide range of situations.

Looking ahead, the researchers expect further advances in robot capabilities, possibly through successors such as RT-3 or RT-4. Such future iterations would aim to improve robots’ learning abilities and broaden their range of skills.

