Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large language models are good at complex tasks
- Enabling general inference in the real world is challenging
- Embodied language models incorporate real-world continuous sensor modalities
- Input to the model is multi-modal sentences
- Model is trained end-to-end for multiple embodied tasks
- Evaluations show model can address a variety of embodied reasoning tasks
- Model benefits from diverse joint training
- Largest model is a visual-language generalist with state-of-the-art performance
Paper Content
Introduction
- LLMs demonstrate strong reasoning capabilities across various domains
- Limitation of LLMs is the issue of grounding
- Previous work interfaces LLMs with learned robotic policies and affordance functions
- LLMs are only provided with textual input, which is insufficient for many tasks
- Current state-of-the-art visuallanguage models cannot directly solve robotic reasoning tasks
- Proposed embodied language models incorporate continuous inputs from sensor modalities
- Embodied language models are trained end-to-end to output sequential decisions
- Evaluated in a variety of settings
- Multi-task training improves performance
- Transfer across tasks leads to high data-efficiency for robotics tasks
- PaLM-E-562B achieves state-of-the-art performance on the OK-VQA benchmark
- PaLM-E-562B exhibits a wide array of capabilities including zero-shot multimodal chain-of-thought reasoning
Related work
- Recent years have seen a growing interest in large vision-language models (VLMs)
- VLMs can be applied to tasks such as visual question answering, captioning, optical character recognition, and object detection
- Different methods exist for integrating images into VLMs
- Prior works focus on combining vision and language inputs in an embodied setting with the goal of direct action prediction
- LLMs contain vast amounts of internalized knowledge about the world
- LLMs have been employed in embodied domains to understand natural language goals and to represent planning
- PaLM-E is trained to generate plans directly without relying on auxiliary models for grounding
- PaLM-E investigates a generalist, multi-embodiment model, across multiple modalities
Palm-e: an embodied multimodal language model
- PaLM-E injects continuous observations into the language embedding space of a pre-trained language model
- PaLM-E is a decoder-only LLM that generates textual completions autoregressively
- Inputs to PaLM-E consist of text and continuous observations
- Output of PaLM-E is text generated auto-regressively
- Decoder-only LLMs predict the probability of a piece of text
- Prefix-decoder-only LLMs can be conditioned on a prefix
- Tokens are embedded into a word token embedding space
- Continuous observations are injected into the LLM by mapping them into the language embedding space
- PaLM-E can be used to generate text or decisions/plans
- Low-level policies translate decisions into low-level actions
- PaLM-E is integrated into a control-loop to sequence and control low-level policies
Input & scene representations for different sensor modalities
- Incorporate different modalities into PaLM-E
- Set up encoders to map modalities into language embedding space
- Investigate state estimation vectors, Vision Transformers (ViTs) and Object Scene Representation Transformer (OSRT)
- Consider object-centric representations to factor observations into tokens
- Investigate ViT token learner architecture
- Label multi-modal tokens in input prompt to enable PaLM-E to reference objects
Training recipes
- PaLM-E is a decoder-only model that uses multi-modal sentences and text tokens to make predictions.
- The loss function is a cross-entropy loss averaged over the non-prefix tokens.
- PaLM-E is based on the pretrained 8B, 62B, and 540B parameter variants of PaLM.
- The model can be trained with all components updated, or with the LLM frozen and only the input encoders trained.
- Co-training across tasks is investigated in experiments.
Experiments
- Experiments conducted on 3 robot embodiments in simulation and with 2 real robots
- Evaluated on vision-language tasks such as VQA, image captioning, and language modeling
- Comparing different input representations with respect to performance, generalization, and data-efficiency
- Investigating transfer by training on a mixture of tasks
- PaLM-E capable of guiding a real robot through a multi-stage tabletop manipulation task
- PaLM-E capable of one-shot and zero-shot learning
- PaLM-E can generalize zero-shot to tasks involving novel object pairs and unseen objects
Robot environments / tasks
- PaLM-E is trained on expert data from 3 different robot environments
- TAMP tasks involve large combinatorics and many decision sequences are infeasible
- Language-Table environment includes several objects, large cardinality of language, and complex pushing dynamics
- PaLM-E has to reason about the poses of the objects
- PaLM-E is integrated into the control loop to execute plans in the real world
Tamp environment
- Planning success rates and VQA performance for TAMP environment tested
- LLM frozen in experiments
- Input representations trained on 96,000 training scenes of TAMP environment
- Pre-trained LLM improves performance when increasing number of objects
- 62B LLM shows better out-of-distribution generalization than 8B variant
- SayCan baseline has difficulty solving environment
- Significant differences between input representations when training on 1% of dataset
- ViT-4B performance more than doubles when co-trained on other robot environments and vision-language datasets
- OSRT leads to best performance
- PaLI not able to solve tasks
Language-table environment
- PaLM-E is a control loop that takes a long-horizon task and an image as input and outputs an instruction for a low-level policy.
- Joint training on internet-scale vision and language results in a more effective model for robot planning.
- Scaling the 12B model to the 84B model leads to improvements on 2 of 3 tasks.
- SayCan and zero-shot PaLI are not effective, unable to solve the easiest task tested.
Mobile manipulation environment
- PaLM-E is tested on challenging and diverse mobile manipulation tasks
- PaLM-E needs to plan a sequence of navigation and manipulation actions based on a human instruction
- 3 use cases are developed to test PaLM-E’s embodied reasoning abilities: affordance prediction, failure detection, and long-horizon planning
- PaLM-E outperforms other models on VQA and planning tasks
- PaLM-E is tested on a real robot in a kitchen and can carry out long-horizon mobile manipulation tasks
- PaLM-E is tested under adversarial disturbances
Performance on general visual-language tasks
- PaLM-E model achieves highest reported number on OK-VQA
- PaLM-E outperforms models finetuned specifically on OK-VQA
- PaLM-E achieves highest performance on VQA v2 with frozen LLM
Performance on general language tasks
- PaLM-E has been tested on 21 general language benchmarks for NLU and NLG tasks
- Performance increases with increasing model scale
- There is less catastrophic forgetting of language capabilities with increasing model scale
Summary of experiments & discussion
- Transfer of PaLM-E leads to increased performance
- Co-training on the “full mixture” achieves more than double the performance
- Adding LLM/ViT pre-training and training on the full mixture improves performance
- Geometric input representation is data-efficient
- Two paths to retain language capabilities during multimodal training
Conclusion
- Proposed to build an embodied language model by injecting multi-modal information
- Experiments showed that off-the-shelf models are not sufficient for embodied reasoning tasks
- Proposed PaLM-E, a single model that is able to control different robots in simulation and in the real world
- PaLM-E is trained on a mixture of diverse tasks across multiple robot embodiments as well as general vision-language tasks
- PaLM-E is capable of transfer from vision-language domains into embodied decision making
- PaLM-E is capable of one-shot and zero shot generalization
- PaLM-E is robust to adversarial disturbances
- PaLM-E is able to reason over multiple images
- PaLM-E is able to interactively guide a real robot through long-horizon manipulation tasks
- PaLM-E is able to solve robotics tasks from very few training examples in the robotics domain
- Dataset sampling frequency and ratio for the “full mixture” referred to in experiments