Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large language models are good at complex tasks
  • Enabling general inference in the real world is challenging
  • Embodied language models incorporate real-world continuous sensor modalities
  • Input to the model is multi-modal sentences
  • Model is trained end-to-end for multiple embodied tasks
  • Evaluations show model can address a variety of embodied reasoning tasks
  • Model benefits from diverse joint training
  • Largest model is a visual-language generalist with state-of-the-art performance

Paper Content

Introduction

  • LLMs demonstrate strong reasoning capabilities across various domains
  • Limitation of LLMs is the issue of grounding
  • Previous work interfaces LLMs with learned robotic policies and affordance functions
  • LLMs are only provided with textual input, which is insufficient for many tasks
  • Current state-of-the-art visuallanguage models cannot directly solve robotic reasoning tasks
  • Proposed embodied language models incorporate continuous inputs from sensor modalities
  • Embodied language models are trained end-to-end to output sequential decisions
  • Evaluated in a variety of settings
  • Multi-task training improves performance
  • Transfer across tasks leads to high data-efficiency for robotics tasks
  • PaLM-E-562B achieves state-of-the-art performance on the OK-VQA benchmark
  • PaLM-E-562B exhibits a wide array of capabilities including zero-shot multimodal chain-of-thought reasoning
  • Recent years have seen a growing interest in large vision-language models (VLMs)
  • VLMs can be applied to tasks such as visual question answering, captioning, optical character recognition, and object detection
  • Different methods exist for integrating images into VLMs
  • Prior works focus on combining vision and language inputs in an embodied setting with the goal of direct action prediction
  • LLMs contain vast amounts of internalized knowledge about the world
  • LLMs have been employed in embodied domains to understand natural language goals and to represent planning
  • PaLM-E is trained to generate plans directly without relying on auxiliary models for grounding
  • PaLM-E investigates a generalist, multi-embodiment model, across multiple modalities

Palm-e: an embodied multimodal language model

  • PaLM-E injects continuous observations into the language embedding space of a pre-trained language model
  • PaLM-E is a decoder-only LLM that generates textual completions autoregressively
  • Inputs to PaLM-E consist of text and continuous observations
  • Output of PaLM-E is text generated auto-regressively
  • Decoder-only LLMs predict the probability of a piece of text
  • Prefix-decoder-only LLMs can be conditioned on a prefix
  • Tokens are embedded into a word token embedding space
  • Continuous observations are injected into the LLM by mapping them into the language embedding space
  • PaLM-E can be used to generate text or decisions/plans
  • Low-level policies translate decisions into low-level actions
  • PaLM-E is integrated into a control-loop to sequence and control low-level policies

Input & scene representations for different sensor modalities

  • Incorporate different modalities into PaLM-E
  • Set up encoders to map modalities into language embedding space
  • Investigate state estimation vectors, Vision Transformers (ViTs) and Object Scene Representation Transformer (OSRT)
  • Consider object-centric representations to factor observations into tokens
  • Investigate ViT token learner architecture
  • Label multi-modal tokens in input prompt to enable PaLM-E to reference objects

Training recipes

  • PaLM-E is a decoder-only model that uses multi-modal sentences and text tokens to make predictions.
  • The loss function is a cross-entropy loss averaged over the non-prefix tokens.
  • PaLM-E is based on the pretrained 8B, 62B, and 540B parameter variants of PaLM.
  • The model can be trained with all components updated, or with the LLM frozen and only the input encoders trained.
  • Co-training across tasks is investigated in experiments.

Experiments

  • Experiments conducted on 3 robot embodiments in simulation and with 2 real robots
  • Evaluated on vision-language tasks such as VQA, image captioning, and language modeling
  • Comparing different input representations with respect to performance, generalization, and data-efficiency
  • Investigating transfer by training on a mixture of tasks
  • PaLM-E capable of guiding a real robot through a multi-stage tabletop manipulation task
  • PaLM-E capable of one-shot and zero-shot learning
  • PaLM-E can generalize zero-shot to tasks involving novel object pairs and unseen objects

Robot environments / tasks

  • PaLM-E is trained on expert data from 3 different robot environments
  • TAMP tasks involve large combinatorics and many decision sequences are infeasible
  • Language-Table environment includes several objects, large cardinality of language, and complex pushing dynamics
  • PaLM-E has to reason about the poses of the objects
  • PaLM-E is integrated into the control loop to execute plans in the real world

Tamp environment

  • Planning success rates and VQA performance for TAMP environment tested
  • LLM frozen in experiments
  • Input representations trained on 96,000 training scenes of TAMP environment
  • Pre-trained LLM improves performance when increasing number of objects
  • 62B LLM shows better out-of-distribution generalization than 8B variant
  • SayCan baseline has difficulty solving environment
  • Significant differences between input representations when training on 1% of dataset
  • ViT-4B performance more than doubles when co-trained on other robot environments and vision-language datasets
  • OSRT leads to best performance
  • PaLI not able to solve tasks

Language-table environment

  • PaLM-E is a control loop that takes a long-horizon task and an image as input and outputs an instruction for a low-level policy.
  • Joint training on internet-scale vision and language results in a more effective model for robot planning.
  • Scaling the 12B model to the 84B model leads to improvements on 2 of 3 tasks.
  • SayCan and zero-shot PaLI are not effective, unable to solve the easiest task tested.

Mobile manipulation environment

  • PaLM-E is tested on challenging and diverse mobile manipulation tasks
  • PaLM-E needs to plan a sequence of navigation and manipulation actions based on a human instruction
  • 3 use cases are developed to test PaLM-E’s embodied reasoning abilities: affordance prediction, failure detection, and long-horizon planning
  • PaLM-E outperforms other models on VQA and planning tasks
  • PaLM-E is tested on a real robot in a kitchen and can carry out long-horizon mobile manipulation tasks
  • PaLM-E is tested under adversarial disturbances

Performance on general visual-language tasks

  • PaLM-E model achieves highest reported number on OK-VQA
  • PaLM-E outperforms models finetuned specifically on OK-VQA
  • PaLM-E achieves highest performance on VQA v2 with frozen LLM

Performance on general language tasks

  • PaLM-E has been tested on 21 general language benchmarks for NLU and NLG tasks
  • Performance increases with increasing model scale
  • There is less catastrophic forgetting of language capabilities with increasing model scale

Summary of experiments & discussion

  • Transfer of PaLM-E leads to increased performance
  • Co-training on the “full mixture” achieves more than double the performance
  • Adding LLM/ViT pre-training and training on the full mixture improves performance
  • Geometric input representation is data-efficient
  • Two paths to retain language capabilities during multimodal training

Conclusion

  • Proposed to build an embodied language model by injecting multi-modal information
  • Experiments showed that off-the-shelf models are not sufficient for embodied reasoning tasks
  • Proposed PaLM-E, a single model that is able to control different robots in simulation and in the real world
  • PaLM-E is trained on a mixture of diverse tasks across multiple robot embodiments as well as general vision-language tasks
  • PaLM-E is capable of transfer from vision-language domains into embodied decision making
  • PaLM-E is capable of one-shot and zero shot generalization
  • PaLM-E is robust to adversarial disturbances
  • PaLM-E is able to reason over multiple images
  • PaLM-E is able to interactively guide a real robot through long-horizon manipulation tasks
  • PaLM-E is able to solve robotics tasks from very few training examples in the robotics domain
  • Dataset sampling frequency and ratio for the “full mixture” referred to in experiments