Abstract

  • A goal of artificial intelligence is to construct an agent that can solve a variety of tasks
  • Recent progress in text-guided image synthesis has yielded models with the ability to generate complex, novel images
  • The paper investigates whether such tools can be used to construct more general-purpose agents
  • Sequential decision making problem cast as text-conditioned video generation problem
  • Text-encoded specification of desired goal used to synthesize set of future frames
  • Control actions extracted from generated video
  • Leveraging text as underlying goal specification enables combinatorial generalization to novel goals
  • Policy-as-video formulation can represent environments with different state and action spaces in unified space of images
  • Leveraging pretrained language embeddings and widely available videos enables knowledge transfer

Paper Content

Introduction

  • Building models that solve a diverse set of tasks is a dominant paradigm in vision and language
  • Large pretrained models have demonstrated zero-shot learning of new language tasks
  • Models have shown zero-shot classification and object recognition capabilities
  • Training agents faces challenge of environmental diversity
  • Prior work encodes different environments with universal tokens in a sequence-modeling framework
  • Video used as universal interface for conveying action and observation behavior
  • Text used as universal interface for expressing task descriptions
  • Model enables combinatorial generalization, multi-task learning, action planning, and internet-scale knowledge transfer

Problem formulation

  • Introduces a new abstraction, the Unified Predictive Decision Process (UPDP), as an alternative to the Markov Decision Process (MDP)
  • Presents an instantiation of UPDP with diffusion models
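
As a rough illustration of the abstraction, a UPDP can be pictured as a planner that maps a first image and a text goal to a fixed number of future images, with action selection deferred to a separate component. The sketch below is a minimal Python rendering under that reading; the names `UPDP`, `Planner`, `ActionGenerator`, and both toy stand-ins are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type aliases: an image is a flat list of pixel values here.
Image = List[float]
Action = float

# Planner: (first image x_0, text goal c, horizon H) -> H future images.
Planner = Callable[[Image, str, int], List[Image]]
# Action generator: (image sequence, text goal c) -> one action per transition.
ActionGenerator = Callable[[Sequence[Image], str], List[Action]]


@dataclass
class UPDP:
    """Toy container for the pieces of a predictive decision process."""
    horizon: int
    planner: Planner
    action_generator: ActionGenerator

    def act(self, x0: Image, goal: str) -> List[Action]:
        # Plan in image space first, then defer action selection.
        frames = self.planner(x0, goal, self.horizon)
        return self.action_generator([x0] + frames, goal)


def toy_planner(x0: Image, goal: str, horizon: int) -> List[Image]:
    # Stand-in for a text-conditioned video model: brighten every pixel by
    # a fixed step per frame. The goal text is ignored in this toy.
    return [[p + 0.1 * (h + 1) for p in x0] for h in range(horizon)]


def toy_action_generator(frames: Sequence[Image], goal: str) -> List[Action]:
    # Stand-in for a learned policy: mean pixel change between frames.
    return [
        sum(b - a for a, b in zip(prev, nxt)) / len(prev)
        for prev, nxt in zip(frames, frames[1:])
    ]


process = UPDP(horizon=3, planner=toy_planner, action_generator=toy_action_generator)
actions = process.act([0.0, 0.0], "brighten the image")
```

Note that no reward function or environment-specific state appears anywhere in the interface: the task enters only through the text goal, and the environment only through images.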

Markov decision process

  • Markov Decision Process (MDP) is a broad abstraction used to formulate many sequential decision making problems
  • Many RL algorithms have been derived from MDPs with empirical successes
  • Existing algorithms are typically unable to combinatorially generalize across different environments
  • Lack of universal state interface across different control environments
  • Explicit requirement of real-valued reward function in an MDP
  • Dynamics model in an MDP is environment and agent dependent
  • Unified Predictive Decision Process (UPDP) exploits images as a universal interface across environments, texts as task specifiers to avoid reward design, and a task-agnostic planning module
  • UPDP bypasses reward design, state extraction and explicit planning, and allows for non-Markovian modeling of image-based state space
  • UPDP isolates video-based planner from deferred action selection
  • UPDP leverages existing large text-video models that have been pretrained on massive, web-scale datasets
  • UPDP uses a continuous-time diffusion model to define a forward process and a generative process to reverse the forward process
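
The forward process of a diffusion model can be illustrated with a few lines of Python. The sketch below assumes the common Gaussian form, in which a clean trajectory is scaled by `alpha` and mixed with Gaussian noise scaled by `sigma`; the cosine schedule and the function names are assumptions chosen for illustration, not details confirmed by this summary. A real generative process would train a network to predict the noise; here the true noise is reused to show that the forward step is invertible once the noise is known.

```python
import math
import random
from typing import List, Tuple


def cosine_schedule(t: float) -> Tuple[float, float]:
    """Return (alpha, sigma) for t in [0, 1] with alpha^2 + sigma^2 = 1.

    Assumed schedule, picked only for illustration.
    """
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def forward_noise(
    tau: List[float], t: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Sample a noisy trajectory: alpha * tau + sigma * eps, eps ~ N(0, I)."""
    alpha, sigma = cosine_schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in tau]
    noisy = [alpha * x + sigma * e for x, e in zip(tau, eps)]
    return noisy, eps


def recover_clean(noisy: List[float], eps: List[float], t: float) -> List[float]:
    """Invert the forward step given the noise (valid for t < 1).

    A trained denoiser would supply an estimate of eps instead.
    """
    alpha, sigma = cosine_schedule(t)
    return [(y - sigma * e) / alpha for y, e in zip(noisy, eps)]
```

At `t = 0` the trajectory is returned unchanged, and as `t` approaches 1 the sample approaches pure noise, which is the starting point the generative process reverses from.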

Decision making with videos

  • Proposed approach UniPi is an instantiation of the diffusion UPDP
  • UniPi incorporates two main components: a diffusion model and a task-specific action generator

Universal video-based planner

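  • Text-to-video models have recently succeeded at generating videos from text descriptions
  • The goal is to build a video diffusion module that acts as a trajectory planner
  • This is harder than typical text-to-video generation, because the plan must start from the observed image rather than be an unconstrained video
  • Constrained video synthesis: generation is conditioned on the first observed frame
  • Tiling: the observed frame is supplied as context when denoising every frame, to keep the environment consistent
  • Hierarchical planning: a coarse video plan is generated first and then refined in time
  • Flexible behavioral modulation: plans can be steered toward new constraints at test time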
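
One plausible reading of the tiling step is that the observed first frame is repeated across time and concatenated, along the channel axis, onto every noisy frame the denoiser sees. The sketch below shows only that tensor bookkeeping with nested lists; the layout `[time][channels][pixels]` and the function name `tile_first_frame` are assumptions made for illustration.

```python
from typing import List

# Hypothetical layout: a frame is [channels][pixels], a video is a list of frames.
Frame = List[List[float]]


def tile_first_frame(noisy_video: List[Frame], first_frame: Frame) -> List[Frame]:
    """Append the observed first frame's channels to every noisy frame.

    The inputs are copied, so the caller's lists are left untouched.
    """
    return [
        [list(ch) for ch in frame] + [list(ch) for ch in first_frame]
        for frame in noisy_video
    ]


# Five noisy RGB frames of four pixels each, plus one observed RGB frame.
noisy = [[[0.0] * 4 for _ in range(3)] for _ in range(5)]
observed = [[1.0] * 4 for _ in range(3)]
conditioned = tile_first_frame(noisy, observed)
```

Each conditioned frame now carries six channels: three noisy ones to be denoised and three clean ones that show the denoiser what the scene looks like at every time step.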

Task specific action adaptation

  • Train a small model to estimate actions given input images
  • Generate an action sequence given x_0 and c by synthesizing H image frames and applying the learned inverse-dynamics model
  • Inferred actions can be executed via closed-loop or open-loop control
  • Use open-loop control for computational efficiency
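
The difference between the two control modes can be made concrete with a toy one-dimensional environment. In the sketch below the "image" is a single number, the planner is a hand-written linear interpolation, and the inverse-dynamics model is a hand-written difference rather than a learned network; all names are hypothetical. Open-loop control plans once and executes every inferred action, while closed-loop control replans after each step.

```python
from typing import List, Tuple

Image = List[float]  # toy image: a single value encoding position


def toy_planner(x0: Image, goal: float, horizon: int) -> List[Image]:
    """Stand-in video planner: interpolate linearly toward the goal."""
    return [[x0[0] + (goal - x0[0]) * (h + 1) / horizon] for h in range(horizon)]


def inverse_dynamics(frame: Image, next_frame: Image) -> float:
    """Stand-in for the learned model: the action is the position change."""
    return next_frame[0] - frame[0]


def actions_from_video(x0: Image, frames: List[Image]) -> List[float]:
    """Apply inverse dynamics to each pair of consecutive frames."""
    seq = [x0] + frames
    return [inverse_dynamics(a, b) for a, b in zip(seq, seq[1:])]


def step(x: Image, action: float) -> Image:
    """Toy environment transition."""
    return [x[0] + action]


def run_open_loop(x0: Image, goal: float, horizon: int) -> Tuple[Image, int]:
    """Plan once, then execute all actions. Returns (final image, planner calls)."""
    x = x0
    for a in actions_from_video(x0, toy_planner(x0, goal, horizon)):
        x = step(x, a)
    return x, 1


def run_closed_loop(x0: Image, goal: float, horizon: int) -> Tuple[Image, int]:
    """Replan before every step and execute only the first action."""
    x, calls = x0, 0
    for remaining in range(horizon, 0, -1):
        frames = toy_planner(x, goal, remaining)
        calls += 1
        x = step(x, actions_from_video(x, frames)[0])
    return x, calls
```

Both modes reach the goal in this noise-free toy, but closed-loop control calls the planner once per step. With an expensive video model as the planner, that cost difference is the efficiency argument for open-loop execution.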

Experimental evaluation

Combinatorial policy synthesis

  • Measured ability of UniPi to generalize to different language tasks
  • Used combinatorial robot planning tasks
  • Robot must manipulate blocks to satisfy language instructions
  • Split language instructions into two sets: one seen during training and one held out and seen only during testing
  • Compared UniPi to three separate representative approaches
  • Measured final task completion accuracy
  • UniPi generalizes well to seen and novel combinations of language prompts
  • Ablated UniPi on seen language instructions and in-relation-to tasks
  • All components of UniPi are crucial for good performance
  • Assessed ability of UniPi to adapt at test time to new constraints
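
The evaluation protocol above (build instructions combinatorially, hold some out, report completion accuracy) can be sketched as follows. The vocabulary, the `train_fraction` value, and the function names are all illustrative assumptions, not figures or identifiers from the paper.

```python
import itertools
import random
from typing import List, Sequence, Tuple


def build_instructions(
    verbs: Sequence[str], objects: Sequence[str], relations: Sequence[str]
) -> List[str]:
    """Enumerate every verb/object/relation combination as a text instruction."""
    return [
        f"{v} the {o} {r}"
        for v, o, r in itertools.product(verbs, objects, relations)
    ]


def split_instructions(
    instructions: Sequence[str], train_fraction: float, seed: int
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then split into seen and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(instructions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def completion_accuracy(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the task was completed."""
    return sum(successes) / len(successes) if successes else 0.0


# Hypothetical vocabulary for a block-manipulation task.
all_instructions = build_instructions(
    ["put", "move"],
    ["red block", "blue block", "green block"],
    ["in the box", "next to the bowl"],
)
seen, novel = split_instructions(all_instructions, train_fraction=0.7, seed=0)
```

Accuracy would then be reported separately on `seen` and `novel`, so that a gap between the two numbers isolates how well a method generalizes to unseen combinations.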

Multi-environment transfer

  • Evaluated ability of UniPi to learn across different tasks and generalize to unseen environments
  • Used language-guided manipulation tasks from Shridhar et al., 2022
  • Trained method using demonstrations from 10 separate tasks
  • Evaluated ability to transfer to 3 different test tasks
  • Generated 200k videos of language execution in environment
  • Used the same baseline methods as in the combinatorial policy synthesis experiments (Section 4.1)
  • Results presented in Table 3 and video visualizations in Figure 6

Real world transfer

  • UniPi is evaluated to see if it can generalize to real world scenarios and construct complex behaviors.
  • Training data consists of an internet-scale pretraining dataset and a smaller real-world robotic dataset.
  • Pretraining on internet-scale video data helps with generating plans for robots.
  • UniPi with pretraining is able to generalize to novel task commands and scenes not seen during training.

Related work

  • Models trained to generate environment rewards and dynamics can be used for reinforcement learning and planning
  • Learning a world model requires data in a strict state-action-reward format
  • Diffusion models have been applied to different decision making problems
  • Text-conditioned video policies can learn the world model and conduct hierarchical planning
  • Prior approaches to learning generalist agents can only operate in environments that share the same state and action spaces
  • Text commands can be used to learn multi-task and generalist control policies
  • Images can be used as a universal state and action space to enable broad knowledge transfer

Conclusion

  • Representing policies using text-conditioned video generation enables effective combinatorial generalization, multi-task learning, and real world transfer.
  • Generative models and data on the internet can be used to generate general-purpose decision making systems.
  • The inverse dynamics model is trained on action annotations from 20k generated videos.
  • Pretraining enables combinatorial generalization.
  • UniPi is robust to background changes.
  • UniPi generates video plans for new test tasks in the multitask setting.
  • Task completion accuracy is reported for the multitask environment.
  • Video generation quality of UniPi is reported for the real environment.