Long-term action anticipation (LTA), the task of predicting a sequence of future human actions from observed video, matters in settings such as self-driving cars and assistance with routine domestic chores. Accurate prediction is hard, however, because human behavior is ambiguous and unpredictable.

This study explores the use of large language models (LLMs) for LTA. The researchers hypothesize that LLMs, having been trained on procedural text such as recipes, encode useful prior knowledge for the task. That knowledge can help answer two questions: which actions are likely to come next, and which steps remain to reach a goal.

To test this hypothesis, the researchers develop a two-stage framework called AntGPT. In the first stage, supervised action recognition models identify the actions already performed in the observed video. In the second stage, the recognized actions are passed to OpenAI GPT models, which predict the future actions. Depending on the LTA approach, the GPT models are either used to generate future actions autoregressively or are fine-tuned for the task.
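The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `parse_actions`, `anticipate`), and the stand-in `toy_llm` are all hypothetical, and the first stage is assumed to have already produced a list of recognized (verb, noun) actions. In practice `toy_llm` would be replaced by a call to a real language model.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (verb, noun), e.g. ("cut", "onion")


def build_prompt(observed: List[Action], num_future: int) -> str:
    """Turn recognized actions into a text prompt (hypothetical wording)."""
    history = ", ".join(f"{verb} {noun}" for verb, noun in observed)
    return (
        f"Observed actions: {history}\n"
        f"Predict the next {num_future} actions as 'verb noun' pairs, "
        "separated by commas.\n"
        "Future actions:"
    )


def parse_actions(completion: str, num_future: int) -> List[Action]:
    """Parse a comma-separated completion into exactly num_future actions.

    Malformed chunks (fewer than two words) are dropped. If too few valid
    actions remain, the last one is repeated as simple padding; this is an
    assumption of this sketch, not a documented step of the method.
    """
    actions: List[Action] = []
    for chunk in completion.split(","):
        words = chunk.strip().lower().split()
        if len(words) >= 2:
            actions.append((words[0], " ".join(words[1:])))
    actions = actions[:num_future]
    while actions and len(actions) < num_future:
        actions.append(actions[-1])
    return actions


def anticipate(
    observed: List[Action], num_future: int, llm: Callable[[str], str]
) -> List[Action]:
    """Second stage: ask the language model for future actions."""
    return parse_actions(llm(build_prompt(observed, num_future)), num_future)


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed completion."""
    return "pour water, stir soup, taste"


observed_actions = [("take", "pot"), ("cut", "onion")]
predicted = anticipate(observed_actions, 3, toy_llm)
print(predicted)
```

Keeping the model behind a plain `Callable[[str], str]` makes the sketch independent of any particular API and easy to test with a fixed completion.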

Quantitative and qualitative evaluations on several LTA benchmarks (Ego4D LTA v1 and v2, EPIC-Kitchens-55, and EGTEA GAZE+) demonstrate the effectiveness of AntGPT in long-term action prediction. The results also show that LLMs can infer high-level goals from observed actions and perform counterfactual action anticipation.

The study proposes using language models to infer goals and to model temporal dynamics, both in service of better long-term action anticipation. AntGPT is presented as a framework that combines LLMs with computer vision models, and the researchers discuss the design decisions, benefits, and limitations of using LLMs for LTA.

In summary, this research advances the understanding and application of language models in long-term action anticipation. It highlights the potential of LLMs to improve predictions and human-machine communication across a range of domains.