CityLife

DeepMind’s Robotics Transformer 2: Translating Vision and Language into Action

By Vicky Stavropoulou

Jul 31, 2023

DeepMind has developed RT-2 (Robotics Transformer 2), a new vision-language-action model that combines images, text, and robot coordinate data to generate instructions for robots. Given an image and a command as input, RT-2 produces both a plan of action and the coordinates needed to carry out the command.

The key insight of RT-2 is that robot actions can be represented as another language. The actions, recorded as coordinates in space, are treated as tokens during training, similar to language tokens and image tokens. By incorporating robot coordinates into the model, RT-2 can generate meaningful actions for the robot.
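To make the idea of "actions as tokens" concrete, here is a minimal Python sketch of one common way to turn continuous coordinates into discrete tokens: clip each value to a fixed range and assign it to one of a fixed number of bins. The bin count of 256, the range of -1.0 to 1.0, and the function names are assumptions chosen for illustration, not details taken from this article or from DeepMind's code.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumption: number of discrete bins per action dimension


def action_to_tokens(action: Sequence[float], low: float = -1.0,
                     high: float = 1.0, num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action value to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value in range
        fraction = (clipped - low) / (high - low)  # rescale to 0.0 .. 1.0
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def tokens_to_action(tokens: Sequence[int], low: float = -1.0,
                     high: float = 1.0, num_bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to the centre value of each bin."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


def tokens_to_text(tokens: Sequence[int]) -> str:
    """Render the bin indices as a space-separated string, like ordinary text."""
    return " ".join(str(token) for token in tokens)


if __name__ == "__main__":
    movement = [-1.0, 0.0, 1.0]  # hypothetical 3-D displacement
    tokens = action_to_tokens(movement)
    print(tokens_to_text(tokens))
    print(tokens_to_action(tokens))
```

Once actions are written this way, a model can emit them with the same output mechanism it uses for words. The round trip is lossy: a decoded value can differ from the original by up to half a bin width.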

A significant milestone in RT-2 is its use of coordinates, which bridges the gap between low-level robot programming and the neural networks used for language and images. Normally, the physical control of a robot requires separate programming, but RT-2 folds all of these elements into a single model.

RT-2 builds upon previous Google efforts, namely PaLI-X and PaLM-E, both of which are vision-language models. PaLI-X focuses on image and text tasks, while PaLM-E goes a step further by using language and images to drive a robot through generated commands. RT-2 goes beyond PaLM-E by not only generating plans of action but also producing coordinates for movement in space.

Unlike its predecessor, RT-1, RT-2 is built on much larger models, PaLI-X and PaLM-E, which contain billions of parameters. RT-2’s training data combines image and text pairs with actions extracted from recorded robot data.

Once trained, RT-2 can be tested by giving it natural-language commands and images, much as one would interact with a language model such as ChatGPT. The model generalizes to a variety of real-world situations and exhibits emergent capabilities, such as repurposing learned skills to place objects near specific locations and interpreting relationships between objects.
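The interaction described above can be pictured as a simple loop: show the model an image and a command, read back a short string of numbers, decode it into a movement, and repeat until the model signals that it is done. The Python sketch below is hypothetical throughout. The `policy` callable stands in for a trained model and is not a real API. The eight-number output layout (a stop flag followed by seven movement values), the 256 bins, and the -1.0 to 1.0 range are likewise assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: (image, instruction) -> generated text.
Policy = Callable[[object, str], str]


def decode_step(text: str, num_bins: int = 256, low: float = -1.0,
                high: float = 1.0) -> Tuple[bool, List[float]]:
    """Split generated text into a stop flag and seven movement values."""
    tokens = [int(piece) for piece in text.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 numbers, got {len(tokens)}")
    width = (high - low) / num_bins
    # Assumed layout: the first number is the stop flag, the rest are bin indices.
    values = [low + (token + 0.5) * width for token in tokens[1:]]
    return tokens[0] == 1, values


def run_episode(policy: Policy, get_image: Callable[[], object],
                apply_action: Callable[[List[float]], None],
                instruction: str, max_steps: int = 50) -> int:
    """Query the policy repeatedly; return how many actions were applied."""
    for step in range(max_steps):
        done, action = decode_step(policy(get_image(), instruction))
        if done:
            return step
        apply_action(action)
    return max_steps


if __name__ == "__main__":
    # Stub policy that moves twice and then stops, in place of a real model.
    replies = iter(["0 128 128 128 128 128 128 255"] * 2
                   + ["1 128 128 128 128 128 128 128"])
    applied: List[List[float]] = []
    steps = run_episode(lambda image, text: next(replies), lambda: None,
                        applied.append, "pick up the apple")
    print(steps, applied)
```

The point of the sketch is that the robot-facing code only has to parse text. All of the perception and planning is assumed to happen inside the model call.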

DeepMind’s RT-2 represents a significant advancement in instructing robots using a combination of vision, language, and coordinate data. It opens up possibilities for more intuitive and natural human-machine communication in the future.
