Date: 2023-08-05

Google’s DeepMind unit has introduced RT-2, a vision-language-action (VLA) model that is more efficient than previous robot control models. The “RT” stands for “robotics transformer,” and the model is intended to improve how robots interact with their environment and how accurately they perform tasks.

RT-2 is a learning model that understands both words and images. It draws on information from multiple sources, including the internet and robotics data, and applies what it has learned in real-world scenarios. By combining language and visual input, the model can handle complex tasks that it has never encountered and was not explicitly trained for.

RT-2 builds on two existing vision-language models, the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E), which serve as its backbones. As a VLA model, it lets robots interpret both language and visuals and then take appropriate actions. Like the large language models behind chatbots such as ChatGPT, it was trained on large amounts of text, along with images drawn from the internet.

According to the researchers, robots equipped with the RT-2 model can perform a wide range of complex tasks by utilizing both visual and language data. For example, they can organize files alphabetically by reading the labels on the documents and sorting them accordingly.
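
To make the sorting example concrete, here is a minimal Python sketch of the ordering step alone, assuming the labels have already been read from the documents. The function name `plan_alphabetical_order` and the sample labels are hypothetical and do not come from the RT-2 paper; the model itself maps images and instructions to actions end to end and does not expose a function like this.

```python
def plan_alphabetical_order(labels):
    """Return document positions in alphabetical order of their labels.

    `labels` is a list of strings, assumed to have been read from the
    documents already (hypothetical input, not an RT-2 interface).
    """
    # Sort positions by label, ignoring case so "budget" and "Budget" match.
    return sorted(range(len(labels)), key=lambda i: labels[i].casefold())


# Hypothetical labels, listed in the order the documents currently sit.
labels = ["Invoices", "contracts", "Receipts", "Budget"]
order = plan_alphabetical_order(labels)
sorted_labels = [labels[i] for i in order]
```

Returning positions instead of the sorted strings keeps the link between each label and the physical document it was read from, which is what a robot would need in order to decide which document to pick up first.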

The model is detailed in the paper “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” by Anthony Brohan and his colleagues, which is available on the DeepMind blog.