Researchers at the Department of Automation and Beijing National Research Centre for Information Science and Technology have proposed a TAsk Planning Agent (TaPA) for embodied tasks with physical scene constraints. The agent generates executable plans grounded in the objects actually present in a scene by aligning large language models (LLMs) with visual perception models.

According to the researchers, TaPA can generate grounded plans without constraining task types or target objects. They created a multimodal dataset of visual scenes, instructions, and corresponding plans, then fine-tuned the pre-trained LLaMA network on it to serve as the task planner.

The embodied agent collects RGB images from various viewpoints, which lets TaPA generate executable actions that account for both the scene information and the human instruction.
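The step from multi-view images to a plan can be pictured as merging what a perception model reports in each view into one scene-level object list, then handing that list to the planner together with the instruction. The sketch below is illustrative only: the function names, the `min_views` filter, and the input format are assumptions made for this example, not the authors' implementation, and the detector itself is left out.

```python
from collections import Counter
from typing import Iterable


def aggregate_scene_objects(
    per_view_labels: Iterable[Iterable[str]], min_views: int = 1
) -> list[str]:
    """Merge per-view detection labels into one sorted scene object list.

    `per_view_labels` holds the labels a (hypothetical) detector returned
    for each RGB image. `min_views` is an assumed filter that keeps only
    labels seen in at least that many views.
    """
    counts: Counter = Counter()
    for labels in per_view_labels:
        # Count each label once per view so a cluttered view cannot dominate.
        counts.update({label.strip().lower() for label in labels})
    return sorted(label for label, n in counts.items() if n >= min_views)


def build_planner_input(instruction: str, scene_objects: list[str]) -> str:
    """Format the instruction and object list as text for a language-model planner.

    The wording of this template is an assumption for illustration.
    """
    return (
        f"Objects in the scene: {', '.join(scene_objects)}\n"
        f"Instruction: {instruction}\n"
        "Plan:"
    )


views = [["Sofa", "table"], ["table", "lamp"], ["sofa"]]
objects = aggregate_scene_objects(views)
planner_text = build_planner_input("Turn on the lamp.", objects)
```

Restricting the planner's input to objects that were actually observed is what keeps the resulting plan tied to the physical scene.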

To generate the multimodal dataset, the researchers drew on vision-language models and large multimodal models, using GPT-3.5 to produce instructions and action plans at scale for training the planning agent.
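As a rough illustration of how such data could be organized, the sketch below defines one record per scene, instruction, and plan, and assembles a text prompt from a scene's object list. The record fields, the prompt wording, and the JSON Lines storage are all assumptions for this example. The article does not give the authors' actual prompt or file format, and no call to GPT-3.5 is made here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlanningSample:
    """One hypothetical dataset record: scene, instruction, and action plan."""

    scene_id: str
    objects: list[str]
    instruction: str
    plan: list[str]


def build_generation_prompt(objects: list[str], num_pairs: int = 3) -> str:
    """Build a prompt asking a text model for instruction/plan pairs.

    The wording is an assumption, not the prompt used by the researchers.
    """
    object_list = ", ".join(sorted(set(objects)))
    return (
        f"The scene contains these objects: {object_list}.\n"
        f"Write {num_pairs} household instructions that use only these objects, "
        "and give a numbered step-by-step action plan for each."
    )


def to_jsonl_line(sample: PlanningSample) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(sample), ensure_ascii=False)


prompt = build_generation_prompt(["sofa", "lamp", "sofa"], num_pairs=2)
sample = PlanningSample(
    scene_id="scene_001",
    objects=["lamp", "sofa"],
    instruction="Turn on the lamp.",
    plan=["Step 1. Walk to the lamp", "Step 2. Switch on the lamp"],
)
line = to_jsonl_line(sample)
```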

The resulting dataset contains 80 indoor scenes with 15,000 instructions and action plans, and the task planner was fine-tuned on it starting from a pre-trained LLM. To explore the surrounding 3D scene, the agent relies on image collection strategies that combine location selection criteria with rotated cameras.
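One simple way to picture an image collection strategy of this kind is a uniform grid of standing positions with the camera rotated in fixed steps at each one. The sketch below enumerates such viewpoints. The grid layout, the parameter names, and the example values are assumptions chosen for illustration and are not necessarily the selection criteria used by the researchers.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewpoint:
    x: float        # position along the room's width
    y: float        # position along the room's depth
    yaw_deg: float  # camera heading in degrees


def grid_viewpoints(
    width: float, depth: float, grid_step: float, yaw_step_deg: float
) -> list[Viewpoint]:
    """Enumerate camera poses on a uniform grid with evenly spaced headings."""
    if grid_step <= 0:
        raise ValueError("grid_step must be positive")
    if not 0 < yaw_step_deg <= 360:
        raise ValueError("yaw_step_deg must be in (0, 360]")

    nx = math.floor(width / grid_step) + 1   # grid points along the width
    ny = math.floor(depth / grid_step) + 1   # grid points along the depth
    n_yaw = int(360 // yaw_step_deg)         # headings per location

    return [
        Viewpoint(i * grid_step, j * grid_step, k * yaw_step_deg)
        for i in range(nx)
        for j in range(ny)
        for k in range(n_yaw)
    ]


# A 4 x 2 room sampled every 2 units, with the camera turned in 90-degree steps.
poses = grid_viewpoints(width=4, depth=2, grid_step=2, yaw_step_deg=90)
```

A finer grid or a smaller rotation step covers more of the scene but produces more images for the perception model to process.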

According to the researchers, TaPA achieves a higher success rate in generating action plans than state-of-the-art LLMs and large multimodal models. It also demonstrates a better understanding of input objects than those models, with fewer hallucination cases.

The researchers also note that the tasks in the collected multimodal dataset are complex enough to call for new optimization methods.

In summary, the researchers have developed TaPA, a task planning agent that generates executable plans from visual scenes and human instructions. According to the reported results, it outperforms the compared models at generating action plans and understanding input objects.