Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Artificial Intelligence goal is to construct an agent that can solve a variety of tasks
Recent progress in text-guided image synthesis has yielded models with ability to generate complex novel images
Investigating if such tools can be used to construct more general-purpose agents
Sequential decision making problem cast as text-conditioned video generation problem
Text-encoded specification of desired goal used to synthesize set of future frames
Control actions extracted from generated video
Leveraging text as underlying goal specification enables combinatorial generalization to novel goals
Policy-as-video formulation can represent environments with different state and action spaces in unified space of images
Leveraging pretrained language embeddings and widely available videos enables knowledge transfer

Paper Content

Introduction

Building models that solve a diverse set of tasks is a dominant paradigm in vision and language
Large pretrained models have demonstrated zero-shot learning of new language tasks
Models have shown zero-shot classification and object recognition capabilities
Training agents faces challenge of environmental diversity
Universal tokens used to encode different environments
Video used as universal interface for conveying action and observation behavior
Text used as universal interface for expressing task descriptions
Model enables combinatorial generalization, multi-task learning, action planning, and internet-scale knowledge transfer

Problem formulation

Introduces a new abstraction, the Unified Predictive Decision Process (UPDP), as an alternative to the Markov Decision Process (MDP)
Presents an instantiation of UPDP with diffusion models

Markov decision process

Markov Decision Process (MDP) is a broad abstraction used to formulate many sequential decision making problems
Many RL algorithms have been derived from MDPs with empirical successes
Existing algorithms are typically unable to combinatorially generalize across different environments
Lack of universal state interface across different control environments
Explicit requirement of real-valued reward function in an MDP
Dynamics model in an MDP is environment and agent dependent
Unified Predictive Decision Process (UPDP) exploits images as a universal interface across environments, texts as task specifiers to avoid reward design, and a task-agnostic planning module
UPDP bypasses reward design, state extraction and explicit planning, and allows for non-Markovian modeling of image-based state space
UPDP isolates video-based planner from deferred action selection
UPDP leverages existing large text-video models that have been pretrained on massive, web-scale datasets
UPDP uses a continuous-time diffusion model to define a forward process and a generative process to reverse the forward process

Decision making with videos

Proposed approach UniPi is an instantiation of the diffusion UPDP
UniPi incorporates two main components: a diffusion model and a task-specific action generator

Universal video-based planner

Text-to-video models have been successful
We want to construct a video diffusion module as a trajectory planner
This is more challenging than typical text-to-video models
We use a constrained video synthesis model
We use tiling to ensure environment consistency
We use hierarchical planning
We use flexible behavioral modulation

Task specific action adaptation

Train a small model to estimate actions given input images
Generate an action sequence given x 0 and c by synthesizing H image frames and applying the learned inverse-dynamics model
Inferred actions can be executed via closed-loop or open-loop control
Use open-loop control for computational efficiency

Experimental evaluation

Combinatorial policy synthesis

Measured ability of UniPi to generalize to different language tasks
Used combinatorial robot planning tasks
Robot must manipulate blocks to satisfy language instructions
Split language instructions into two sets, one seen during training and one seen during testing
Compared UniPi to three separate representative approaches
Measured final task completion accuracy
UniPi generalizes well to seen and novel combinations of language prompts
Ablated UniPi on seen language instructions and in-relation-to tasks
All components of UniPi are crucial for good performance
Assessed ability of UniPi to adapt at test time to new constraints

Multi-environment transfer

Evaluated ability of UniPi to learn across different tasks and generalize to unseen environments
Used language guided manipulation tasks from Shridhar et al., 2022
Trained method using demonstrations from 10 separate tasks
Evaluated ability to transfer to 3 different test tasks
Generated 200k videos of language execution in environment
Used same baseline methods as in Section 4.1
Results presented in Table 3 and video visualizations in Figure 6

Real world transfer

UniPi is evaluated to see if it can generalize to real world scenarios and construct complex behaviors.
Training data consists of an internet-scale pretraining dataset and a smaller real-world robotic dataset.
Pretraining on internet-scale video data helps with generating plans for robots.
UniPi with pretraining is able to generalize to novel task commands and scenes not seen during training.

Models trained to generate environment rewards and dynamics can be used for reinforcement learning and planning
Learning a world model requires data in a strict state-action-reward format
Diffusion models have been applied to different decision making problems
Text-conditioned video policies can learn the world model and conduct hierarchical planning
Learning generalist agents can only operate under environments with the same state and action spaces
Text commands can be used to learn multi-task and generalist control policies
Images can be used as a universal state and action space to enable broad knowledge transfer

Conclusion

Representing policies using text-conditioned video generation enables effective combinatorial generalization, multi-task learning, and real world transfer.
Generative models and data on the internet can be used to generate general-purpose decision making systems.
Inverse dynamics model trained on action annotations from 20k generated videos.
Pretraining enables combinatorial generalization.
Robust to background change.
Generates video plans on different new test tasks in the multitask setting.
Task completion accuracy on multitask environment.
Video generation quality of UniPi on real environment.

Link to paper#

Abstract#

Paper Content#

Introduction#

Problem formulation#

Markov decision process#

Decision making with videos#

Universal video-based planner#

Task specific action adaptation#

Experimental evaluation#

Combinatorial policy synthesis#

Multi-environment transfer#

Real world transfer#

Related work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Problem formulation

Markov decision process

Decision making with videos

Universal video-based planner

Task specific action adaptation

Experimental evaluation

Combinatorial policy synthesis

Multi-environment transfer

Real world transfer

Related work

Conclusion