Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks
- Recent progress in text-guided image synthesis has yielded models able to generate complex novel images
- Investigating if such tools can be used to construct more general-purpose agents
- Sequential decision making problem cast as text-conditioned video generation problem
- Text-encoded specification of desired goal used to synthesize set of future frames
- Control actions extracted from generated video
- Leveraging text as underlying goal specification enables combinatorial generalization to novel goals
- Policy-as-video formulation can represent environments with different state and action spaces in unified space of images
- Leveraging pretrained language embeddings and widely available videos enables knowledge transfer
Paper Content
Introduction
- Building models that solve a diverse set of tasks is a dominant paradigm in vision and language
- Large pretrained models have demonstrated zero-shot learning of new language tasks
- Models have shown zero-shot classification and object recognition capabilities
- Training agents faces challenge of environmental diversity
- Prior work has tried to encode different environments with universal tokens
- Video used as universal interface for conveying action and observation behavior
- Text used as universal interface for expressing task descriptions
- Model enables combinatorial generalization, multi-task learning, action planning, and internet-scale knowledge transfer
Problem formulation
- Introduces a new abstraction, the Unified Predictive Decision Process (UPDP), as an alternative to the Markov Decision Process (MDP)
- Presents an instantiation of UPDP with diffusion models
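As a rough sketch of the abstraction, a UPDP can be written as a tuple containing an image space, a space of text task descriptions, a finite horizon H, and a conditional video generator, with actions recovered afterwards by a separate policy. The symbols below (other than x_0, c, and H, which appear later in this summary) are notation assumed here and may differ from the paper's exact presentation.

```latex
\begin{align*}
\mathcal{G} &= \langle \mathcal{X}, \mathcal{C}, H, \rho \rangle
  && \text{image space, text task space, horizon, video planner} \\
\rho(\cdot \mid x_0, c) &: \mathcal{X} \times \mathcal{C} \to \Delta(\mathcal{X}^{H})
  && \text{distribution over } H \text{ future frames} \\
\pi(\cdot \mid \{x_h\}_{h=0}^{H}, c) &: \mathcal{X}^{H+1} \times \mathcal{C} \to \Delta(\mathcal{A}^{H})
  && \text{actions inferred from the planned frames}
\end{align*}
```

Unlike an MDP, this formulation has no reward function and no explicit environment-specific dynamics model: the task is specified entirely by the text c.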
Markov decision process
- Markov Decision Process (MDP) is a broad abstraction used to formulate many sequential decision making problems
- Many RL algorithms have been derived from MDPs with empirical successes
- Existing algorithms are typically unable to combinatorially generalize across different environments
- Lack of universal state interface across different control environments
- Explicit requirement of real-valued reward function in an MDP
- Dynamics model in an MDP is environment and agent dependent
- Unified Predictive Decision Process (UPDP) exploits images as a universal interface across environments, texts as task specifiers to avoid reward design, and a task-agnostic planning module
- UPDP bypasses reward design, state extraction and explicit planning, and allows for non-Markovian modeling of image-based state space
- UPDP decouples the video-based planner from action selection, which is deferred to a separate step
- UPDP leverages existing large text-video models that have been pretrained on massive, web-scale datasets
- UPDP uses a continuous-time diffusion model to define a forward process and a generative process to reverse the forward process
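The forward and reverse processes mentioned in the last bullet can be illustrated with a minimal sketch. It assumes a variance-preserving cosine noise schedule and treats a "frame" as a flat list of floats; the function names and the schedule are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def cosine_schedule(t):
    """Assumed schedule for t in [0, 1], with alpha^2 + sigma^2 = 1."""
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return alpha, sigma


def forward_noise(x0, t, rng):
    """Forward process: sample x_t = alpha_t * x_0 + sigma_t * eps."""
    alpha, sigma = cosine_schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [alpha * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_x0(xt, eps_hat, t):
    """Invert the forward step given a noise estimate eps_hat.

    In a trained model, eps_hat would come from a neural denoiser
    conditioned on the text; here it is supplied by the caller.
    """
    alpha, sigma = cosine_schedule(t)
    return [(x - sigma * e) / alpha for x, e in zip(xt, eps_hat)]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [0.5, -1.0, 2.0]
    xt, eps = forward_noise(x0, 0.5, rng)
    # With the true noise, the clean sample is recovered up to rounding.
    print(recover_x0(xt, eps, 0.5))
```

The generative process reverses this corruption step by step, replacing the true noise with a learned estimate.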
Decision making with videos
- Proposed approach UniPi is an instantiation of the diffusion UPDP
- UniPi incorporates two main components: a diffusion model and a task-specific action generator
Universal video-based planner
- Text-to-video models have been successful
- We want to construct a video diffusion module as a trajectory planner
- This is more challenging than typical text-to-video models
- We use a constrained video synthesis model
- We tile the observed first frame across time as conditioning so that the environment stays consistent across generated frames
- We use hierarchical planning
- We use flexible behavioral modulation
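The tiling idea above can be sketched as follows: the observed first frame is repeated along the time axis and concatenated with each noisy frame, so the denoiser sees the true environment state at every step. This is a minimal sketch assuming channel-wise concatenation and a (time, height, width, channels) layout; `tile_first_frame` is a hypothetical helper, not a function from the paper's code.

```python
import numpy as np


def tile_first_frame(noisy_video, first_frame):
    """Concatenate the observed first frame onto every noisy frame.

    noisy_video: array of shape (H, height, width, C)
    first_frame: array of shape (height, width, C)
    returns:     array of shape (H, height, width, 2 * C)
    """
    horizon = noisy_video.shape[0]
    # Repeat the conditioning frame across the time axis (a read-only view).
    tiled = np.broadcast_to(first_frame, (horizon,) + first_frame.shape)
    # Stack along channels so each frame carries its own copy of the context.
    return np.concatenate([noisy_video, tiled], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.normal(size=(5, 8, 8, 3))
    first = rng.uniform(size=(8, 8, 3))
    print(tile_first_frame(noisy, first).shape)  # (5, 8, 8, 6)
```

The denoising network would then take the widened input and predict noise only for the first C channels.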
Task specific action adaptation
- Train a small inverse-dynamics model to estimate actions from input images
- Generate an action sequence given the first frame x_0 and the task description c by synthesizing H image frames and applying the learned inverse-dynamics model
- Inferred actions can be executed via closed-loop or open-loop control
- Use open-loop control for computational efficiency
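The steps above can be sketched end to end: a planner produces H future frames from x_0 and c, an inverse-dynamics model labels each consecutive pair of frames with an action, and the actions are executed open loop. The toy planner, toy inverse dynamics, and one-dimensional "frames" below are hypothetical stand-ins for the video diffusion model and the learned action model.

```python
def plan_actions(x0, c, planner, inverse_dynamics):
    """Synthesize frames with the planner, then infer one action per transition."""
    frames = [x0] + list(planner(x0, c))
    return [inverse_dynamics(prev, nxt) for prev, nxt in zip(frames, frames[1:])]


def execute_open_loop(state, actions, step):
    """Run all planned actions without replanning in between."""
    for action in actions:
        state = step(state, action)
    return state


# --- Hypothetical toy stand-ins (assumptions, not the paper's models) ---

def toy_planner(x0, c, horizon=4):
    """Pretend 'video': interpolate from x0 to the number at the end of the text."""
    goal = float(c.split()[-1])
    return [x0 + (goal - x0) * (h + 1) / horizon for h in range(horizon)]


def toy_inverse_dynamics(prev_frame, next_frame):
    """Pretend action model: the action is the change between frames."""
    return next_frame - prev_frame


def toy_step(state, action):
    """Pretend environment: actions add to the state."""
    return state + action


if __name__ == "__main__":
    actions = plan_actions(0.0, "move to 4", toy_planner, toy_inverse_dynamics)
    print(actions)  # [1.0, 1.0, 1.0, 1.0]
    print(execute_open_loop(0.0, actions, toy_step))  # 4.0
```

A closed-loop variant would call `plan_actions` again after every executed action, trading extra video generation cost for the ability to correct drift.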
Experimental evaluation
Combinatorial policy synthesis
- Measured ability of UniPi to generalize to different language tasks
- Used combinatorial robot planning tasks
- Robot must manipulate blocks to satisfy language instructions
- Split language instructions into two sets, one seen during training and one held out and seen only during testing
- Compared UniPi to three separate representative approaches
- Measured final task completion accuracy
- UniPi generalizes well to seen and novel combinations of language prompts
- Ablated UniPi on seen language instructions and in-relation-to tasks
- All components of UniPi are crucial for good performance
- Assessed ability of UniPi to adapt at test time to new constraints
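The seen/novel split of instructions can be illustrated with a small sketch that enumerates combinations of words and holds some out for testing. The vocabulary, the instruction template, and the split fraction here are hypothetical and chosen only for illustration; they are not the paper's task set.

```python
import itertools
import random


def split_instructions(colors, relations, train_fraction, seed=0):
    """Enumerate instruction combinations and split them into seen and novel sets."""
    combos = [
        f"put the {a} block {rel} the {b} block"
        for a, rel, b in itertools.product(colors, relations, colors)
        if a != b  # a block cannot be placed relative to itself
    ]
    rng = random.Random(seed)
    rng.shuffle(combos)
    cut = int(len(combos) * train_fraction)
    return combos[:cut], combos[cut:]


def completion_accuracy(successes):
    """Fraction of evaluation episodes that ended with the task completed."""
    return sum(successes) / len(successes)


if __name__ == "__main__":
    seen, novel = split_instructions(
        ["red", "green", "blue"], ["left of", "on top of"], train_fraction=0.7
    )
    print(len(seen), len(novel))  # 8 4
    print(completion_accuracy([True, True, False, True]))  # 0.75
```

Every word in a novel instruction may appear during training, but the specific combination does not, which is what makes the test a measure of combinatorial generalization.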
Multi-environment transfer
- Evaluated ability of UniPi to learn across different tasks and generalize to unseen environments
- Used language guided manipulation tasks from Shridhar et al., 2022
- Trained method using demonstrations from 10 separate tasks
- Evaluated ability to transfer to 3 different test tasks
- Generated 200k videos of language execution in environment
- Used same baseline methods as in Section 4.1
- Results presented in Table 3 and video visualizations in Figure 6
Real world transfer
- UniPi is evaluated to see if it can generalize to real world scenarios and construct complex behaviors.
- Training data consists of an internet-scale pretraining dataset and a smaller real-world robotic dataset.
- Pretraining on internet-scale video data helps with generating plans for robots.
- UniPi with pretraining is able to generalize to novel task commands and scenes not seen during training.
Related work
- Models trained to generate environment rewards and dynamics can be used for reinforcement learning and planning
- Learning a world model requires data in a strict state-action-reward format
- Diffusion models have been applied to different decision making problems
- Text-conditioned video policies can learn the world model and conduct hierarchical planning
- Prior approaches to learning generalist agents can only operate in environments that share the same state and action spaces
- Text commands can be used to learn multi-task and generalist control policies
- Images can be used as a universal state and action space to enable broad knowledge transfer
Conclusion
- Representing policies using text-conditioned video generation enables effective combinatorial generalization, multi-task learning, and real world transfer.
- Generative models and data on the internet can be used to generate general-purpose decision making systems.
- The inverse dynamics model is trained on action annotations from 20k generated videos.
- Pretraining enables combinatorial generalization.
- Generated video plans are robust to background changes.
- UniPi generates video plans for new test tasks in the multitask setting.
- Task completion accuracy is reported for the multitask environment.
- Video generation quality of UniPi is reported for the real environment.