CityLife

The Power of AI Models

DeepMind’s Robotics Transformer Version 2: Advancing Language and Vision Models in Robotics

By Mampho Brescia

Jul 31, 2023
DeepMind’s Robotics Transformer Version 2 (RT-2) is a significant milestone in robotics. This vision-language-action model combines images, text, and coordinate data to understand and carry out a range of robotic tasks. Unlike previous projects, RT-2 not only generates a plan of action but also outputs the coordinates needed to complete a given command.

RT-2 builds on Google’s vision-language models PaLI-X and PaLM-E. PaLI-X handles image and text tasks, and PaLM-E uses language and images to generate commands that drive robots. RT-2 goes a step further by integrating the physics of robots with the language and image neural networks.

Its predecessor, RT-1, was based on a smaller language and vision model, whereas RT-2 is based on larger language models. These models have more neural weights (parameters), which tends to make them more capable. RT-2 is trained on combinations of images, text, and actions extracted from recorded robot data.

Once trained, RT-2 can perform tasks by taking natural-language commands and images as input. It generates action plans accompanied by coordinate movements. The model showcases the ability to generalize to new real-world scenarios, understand symbols and object relations, and perform reasoning and human recognition tasks.
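To make "action plans accompanied by coordinate movements" more concrete, here is a minimal sketch of how a text-producing model's output could be turned into robot movements. It is an illustration only, not DeepMind's implementation. It assumes the model returns a string of eight integers (a stop flag, three position changes, three rotation changes, and a gripper value), each integer being one of 256 bins. The names `RobotAction`, `bin_to_value`, and `decode_action`, the bin count, and the value ranges are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

NUM_BINS = 256  # assumption: each action dimension is split into 256 bins


@dataclass
class RobotAction:
    """Hypothetical container for one decoded robot movement."""
    terminate: bool                              # True when the task is finished
    delta_position: Tuple[float, float, float]   # change in x, y, z
    delta_rotation: Tuple[float, float, float]   # change in the three rotation axes
    gripper: float                               # 0.0 = one extreme, 1.0 = the other


def bin_to_value(bin_index: int, low: float, high: float) -> float:
    """Map a bin index in [0, NUM_BINS - 1] linearly onto [low, high]."""
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"bin index {bin_index} is outside 0..{NUM_BINS - 1}")
    return low + (high - low) * bin_index / (NUM_BINS - 1)


def decode_action(
    text: str,
    pos_range: Tuple[float, float] = (-1.0, 1.0),  # placeholder range
    rot_range: Tuple[float, float] = (-1.0, 1.0),  # placeholder range
) -> RobotAction:
    """Parse a space-separated string of eight integers into a RobotAction."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 integers, got {len(tokens)}")
    terminate = tokens[0] == 1
    position = tuple(bin_to_value(t, *pos_range) for t in tokens[1:4])
    rotation = tuple(bin_to_value(t, *rot_range) for t in tokens[4:7])
    gripper = bin_to_value(tokens[7], 0.0, 1.0)
    return RobotAction(terminate, position, rotation, gripper)


if __name__ == "__main__":
    # Example string made up for illustration, not real model output.
    print(decode_action("0 128 91 241 5 101 127 255"))
```

The point of the design is that movements are written as ordinary text, so a model built for language and images can emit them like any other words, and a small decoder on the robot side converts them back into numbers.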

DeepMind’s RT-2 represents a significant advance in instructing robots in real time using language and visual cues. Its integration of language, images, and the physics of robots opens up new possibilities for seamless human-robot interaction and paves the way for further progress in robotics.

