Date: 2023-07-31

DeepMind has developed a new vision-language-action model called RT-2 (Robotics Transformer 2), which combines images, text, and coordinate data to generate instructions for robots. Given an image and a command, RT-2 produces both a plan of action and the coordinates needed to carry out the command.

The key insight of RT-2 is that robot actions can be treated as just another language. The actions, recorded as coordinates in space, are converted to tokens and used in training alongside language tokens and image tokens. Because robot coordinates are part of the model's vocabulary, RT-2 can emit actions for the robot the same way it emits text.
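The idea of turning coordinates into tokens can be illustrated with a short sketch. It assumes each action dimension is normalized to the range [-1, 1] and split into 256 uniform bins, with the bin indices written out as a space-separated string; the bin count, the range, and the function names are illustrative assumptions, not details given in this article.

```python
from typing import Sequence

NUM_BINS = 256  # assumed number of discrete bins per action dimension


def discretize(value: float, low: float = -1.0, high: float = 1.0,
               num_bins: int = NUM_BINS) -> int:
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)        # keep out-of-range values in bounds
    fraction = (clipped - low) / (high - low)   # position within the range, 0.0 to 1.0
    return min(int(fraction * num_bins), num_bins - 1)


def action_to_text(action: Sequence[float]) -> str:
    """Render an action vector as a string of integer tokens (hypothetical format)."""
    return " ".join(str(discretize(v)) for v in action)


if __name__ == "__main__":
    # A made-up 3-dimensional displacement, purely for illustration.
    print(action_to_text([-1.0, 0.0, 1.0]))
```

Once actions are plain strings of integers, they can sit in the same training sequences as ordinary text, which is what lets a single model handle both.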

A significant milestone in RT-2 is its use of coordinates, which bridges the gap between low-level robot programming and the neural networks used for language and images. Normally, the physics of a robot requires separate programming, but RT-2 combines all of these elements into one cohesive model.

RT-2 builds upon previous Google efforts, namely PaLI-X and PaLM-E, both of which are vision-language models. PaLI-X focuses on image and text tasks, while PaLM-E takes it a step further by using language and image to drive a robot through generated commands. RT-2 goes beyond PaLM-E by not only generating plans of action but also producing coordinates for movement in space.

Unlike its predecessor, RT-1, RT-2 is built on much larger vision-language models, PaLI-X and PaLM-E, which contain billions of parameters. RT-2’s training data combines image-and-text pairs with actions extracted from recorded robot data.

Once trained, RT-2 can be tested by providing natural-language commands and images, just like interacting with a language model such as ChatGPT. The model is capable of generalizing to various real-world situations and exhibits emergent capabilities, such as re-purposing learned skills to place objects near specific locations and interpreting relationships between objects.
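Interacting with such a model amounts to sending a prompt and decoding the reply. The sketch below assumes the reply is a space-separated string of integer bin indices (256 bins over the range [-1, 1]) and maps each one back to the center of its bin. The prompt template, bin count, range, and function names are all hypothetical choices for illustration; no real model is called.

```python
from typing import List


def build_prompt(instruction: str) -> str:
    """Wrap a natural-language command in a question-style prompt (hypothetical template)."""
    return f"Q: what action should the robot take to {instruction}? A:"


def text_to_action(reply: str, low: float = -1.0, high: float = 1.0,
                   num_bins: int = 256) -> List[float]:
    """Convert a string of integer tokens back into continuous values (bin centers)."""
    bins = [int(token) for token in reply.split()]
    for b in bins:
        if not 0 <= b < num_bins:
            raise ValueError(f"token {b} is outside the range 0..{num_bins - 1}")
    width = (high - low) / num_bins
    return [low + (b + 0.5) * width for b in bins]


if __name__ == "__main__":
    print(build_prompt("pick up the apple"))
    # A made-up reply string standing in for model output.
    print(text_to_action("0 128 255"))
```

Decoding to bin centers means the recovered values are only approximations of the original coordinates, with an error of at most half a bin width.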

DeepMind’s RT-2 represents a significant advancement in instructing robots using a combination of vision, language, and coordinate data. It opens up possibilities for more intuitive and natural human-machine communication in the future.