Date: 2023-07-31

DeepMind has developed a new large language model called RT-2, which combines images, text, and robot coordinate data so that people can instruct machines in real time. The researchers at DeepMind suggest that this vision-language-action model could make giving commands to a machine as easy as talking to OpenAI’s ChatGPT. The key insight is that robot actions can be treated as just another language: just as training on text lets ChatGPT generate new text, training on action data lets the model generate new actions.

The robot's actions, expressed along its degrees of freedom, are encoded as coordinates in space within the robotics transformer. During training, these coordinates are fed into the model alongside language tokens and image tokens. This integration of coordinates is a significant milestone, as it combines the physics of robots with language and image neural nets.
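
One way to picture how coordinates can sit alongside language tokens is to discretize each coordinate into an integer bin and write the bins out as text. The sketch below is illustrative only: the bin count, the normalized value range, and the function names are assumptions made for this example, not details given in the article.

```python
from typing import List, Sequence

NUM_BINS = 256          # assumed number of discrete bins per coordinate
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each coordinate


def action_to_tokens(action: Sequence[float],
                     num_bins: int = NUM_BINS,
                     low: float = LOW,
                     high: float = HIGH) -> List[int]:
    """Map each continuous coordinate to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)        # keep the value in range
        fraction = (clipped - low) / (high - low)   # rescale to 0.0 .. 1.0
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def tokens_to_text(tokens: Sequence[int]) -> str:
    """Render bin indices as a space-separated string, like words in a sentence."""
    return " ".join(str(token) for token in tokens)
```

With these assumptions, a movement becomes a short string of integers that a language model can read and write in the same way it handles words.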

RT-2 builds upon two previous Google efforts, PaLI-X and PaLM-E, which are vision-language models. While PaLI-X focuses solely on image and text tasks, PaLM-E takes it a step further by using language and image to generate commands that drive robots. RT-2 extends this capability by not only generating plans of action but also providing coordinates for movement in space.

Unlike its predecessor, RT-1, RT-2 is built on the much larger PaLI-X and PaLM-E models. Having more parameters generally makes such models more capable. RT-2 is trained on a combination of image-text pairs and robot action data.
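
Training on a combination of two data sources can be sketched as drawing each batch partly from one pool and partly from the other. The example below is a minimal illustration, not the researchers' training code: the `mixed_batch` helper, the example dictionaries, and the 50/50 default ratio are all hypothetical.

```python
import random
from typing import Dict, List, Optional


def mixed_batch(web_examples: List[Dict[str, str]],
                robot_examples: List[Dict[str, str]],
                batch_size: int,
                robot_fraction: float = 0.5,   # assumed mixing ratio
                rng: Optional[random.Random] = None) -> List[Dict[str, str]]:
    """Draw one batch containing both image-text and robot action examples."""
    rng = rng or random.Random()
    n_robot = round(batch_size * robot_fraction)
    # Sample with replacement from each pool, then shuffle the two together.
    batch = [rng.choice(robot_examples) for _ in range(n_robot)]
    batch += [rng.choice(web_examples) for _ in range(batch_size - n_robot)]
    rng.shuffle(batch)
    return batch


# Hypothetical examples: both kinds share one format, a prompt and a text target.
web = [{"source": "web", "prompt": "What is in the image?", "target": "an apple"}]
robot = [{"source": "robot", "prompt": "pick up the apple", "target": "0 128 255"}]
batch = mixed_batch(web, robot, batch_size=8, rng=random.Random(0))
```

Because the robot targets are plain strings here, the same model and the same loss could in principle handle both kinds of example.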

Once RT-2 is trained, it can be tested by providing natural-language commands and images to the model. The model will generate both a plan of action and a series of coordinates necessary to carry out those actions. This allows the robot to perform tasks such as picking up objects, moving them, and dropping them based on typed commands.
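
To make the step from model output to robot movement concrete, the sketch below splits a response into a plan and a list of coordinates. The `Plan: ... Action: ...` layout, the bin count, and the value range are hypothetical choices made for illustration. The article does not specify the model's actual output format.

```python
from typing import List, Sequence, Tuple


def parse_response(response: str) -> Tuple[str, List[int]]:
    """Split a response of the assumed form 'Plan: ... Action: ...'."""
    plan_part, _, action_part = response.partition("Action:")
    plan = plan_part.replace("Plan:", "", 1).strip()
    tokens = [int(piece) for piece in action_part.split()]
    return plan, tokens


def tokens_to_action(tokens: Sequence[int],
                     num_bins: int = 256,   # assumed bin count
                     low: float = -1.0,     # assumed normalized range
                     high: float = 1.0) -> List[float]:
    """Convert bin indices back to coordinates, using the center of each bin."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


plan, tokens = parse_response("Plan: pick up the apple. Action: 0 128 255")
coordinates = tokens_to_action(tokens)
```

A robot controller would then consume `coordinates` one step at a time. That part depends on the hardware and is left out here.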

One key aspect of RT-2 is its ability to generalize to real-world situations and handle new objects. The model demonstrates reasoning, symbol understanding, and human recognition capabilities. It can interpret relations between objects and determine which object to pick and where to place it, even when these relations are not provided in the robot demonstrations.

Overall, RT-2 represents a significant advance in training models to understand and execute human instructions in real time, bridging the gap between language, vision, and robotics.