Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs can learn and leverage Internet-scale knowledge through pre-training with autoregressive models.
  • LLMs are not suitable for settings with embodied agents due to lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints.
  • Language-conditioned robotic policies can provide the necessary grounding for the agent to be correctly situated in the real world, but are limited by the lack of high-level semantic understanding.
  • We must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment.
  • We demonstrate this guided decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting.

Paper Content

Introduction

  • Recent works have demonstrated robots that can understand and act on natural language.
  • Large language models (LLMs) are used to generate text from web-scale data.
  • Applying LLMs to embodied settings is a challenge.
  • Robots must understand instructions, determine steps needed to fulfill them, and sequence them appropriately.
  • Grounded Decoding (GD) is a scalable, general approach to planning with LLMs in embodied domains.
  • GD combines token probabilities from LLMs and token-conditioned robotic functions.
  • Decoding strategies for large language models is an active area of research
  • Recent works have focused on developing decoding heuristics for natural text generation
  • External classifiers are used to maximize language-space utilities when decoding language models
  • Classifier-guided decoding methods have been developed for offline domains such as image captioning and task-oriented dialog
  • Training language models to understand embodiment is an active area of research
  • SayCan uses a large language model and a value function to select robotic skills
  • Grounded Decoding jointly decodes the language model and the grounded model at the token level
  • Task and motion planning seeks to solve high-level instructions via sequencing tasks
  • Machine learning is used to accelerate planning and enable new domains
  • Grounding functions model the probability of tokens given the robot’s state

Problem formulation.

  • LLM can generate text not grounded in physical state
  • Grounded Decoding (GD) proposed to guide generation of token sequences with grounding function conditioned on embodiment of system
  • GD factorized into token decoding
  • GD proceeds through process similar to probabilistic filtering
  • GD provides grounded scoring function
  • GD used in context of robot control

Experiments

  • Demonstrated Grounded Decoding on three different environments
  • Used a variety of grounding functions to show generality and flexibility
  • Tabletop manipulation environment with grounding functions for affordances, safety, and preferences
  • 2D Maze environment built from Minigrid with RL-trained value function-based grounding functions
  • Real robot in an office kitchen with CLIP-based grounding function

Long-horizon tabletop manipulation

  • Experiment with simulated tabletop manipulation scene based on RAVENS and CLIPort
  • 20 tasks specified via natural language instructions
  • CLIPort predicts unnormalized logits over pixel space used as affordances
  • Safety grounding function used for 3 tasks in Box Packing task family
  • Preference grounding function used for 2 tasks in Box Packing task family
  • Results grouped by task category in Table 1
  • Supervised methods perform poorly on unseen tasks
  • Grounded Decoding best results with beam search
  • Grounding functions composed of robot, environment, and policy

2d maze

  • Grounded Decoding is evaluated on Minigrid tasks
  • Tasks are divided into three categories: Easy, Medium, and Hard
  • Easy tasks have short horizons and are fully described by the instruction
  • Medium tasks have short and long horizons and have step-by-step instructions
  • Hard tasks have complex, long-horizon instructions with ambiguous instructions
  • Grounded Decoding uses a language model as a planner to decompose instructions
  • It combines language model planning with an affordance function grounded in the agent’s observations
  • Performance is compared to a solitary policy, a hierarchical algorithm, and a hierarchical algorithm with an ungrounded language model
  • Beam search improves performance in long-horizon tasks

Mobile manipulation in a kitchen

  • Implemented same mobile manipulation platform and skills as Say-Can
  • Performed instruction following tasks
  • Split tasks into two categories: Unambiguous and Ambiguous
  • Modified SayCan algorithm to enable grounded decoding
  • Added object detection score as grounding function
  • Found that performance is recovered when queries are explicit and gain 25% in planning performance when queries are ambiguous

Analysis

Comparison to saycan

  • GD is two orders of magnitude more efficient than SayCan
  • GD allows open-vocabulary grounding beyond just affordances

Breakdown of failure reasons

  • Hierarchical approaches have imperfect low-level policies for step execution.
  • Grounded Decoding reduces planning failure by incorporating grounded scene information into the decoding process.
  • Beam search performs better than greedy search by being aware of full-length single-step instructions.

Grounded action manifold

  • Goal of work is to investigate integration of grounded information into language model decoding
  • Used t-SNE plot to illustrate extent to which grounded models help narrow down search space for language models
  • Represented instructions as dots in figure, computed affordance values with respect to four different scenes
  • Grouped dots using t-SNE and BERT embeddings
  • Grounded models can effectively identify achievable skills to produce actionable manifold within language space
  • Language alone does not perfectly group actionable skills
  • Language space is much larger, composed of roughly 40,000 token vocabulary at each step

Conclusions, limitations, & future

  • Presented Grounded Decoding (GD), an approach for leveraging the knowledge and capabilities of large language models in embodied settings
  • GD resembles probabilistic filtering, by decoding tokens that have high probabilities under the language model and under the grounding model
  • GD is a general, flexible, and expressive approach to embodied tasks
  • Demonstrated GD on three embodied domains, showing it is capable of solving complex, long-horizon tasks
  • Limitations of GD include: quality and availability of general grounding functions, prompt engineering to steer LLMs to desired action space, and joint decoding may be limiting compared to a single model
  • Flexibility of GD enables many other grounding functions and ways to integrate grounding
  • Development and integration of a foundation model for grounding would improve performance significantly
  • Evaluated GD on simulated tabletop rearrangement tasks, Minigrid 2D Maze tasks, and real-world kitchen mobile manipulation tasks
  • Used open-source gym-minigrid suite of environments with one simple change -instead of the default observation space which is a 7 × 7 egocentric window, our agent has access to entire grid
  • Tasks grouped in three categories: Easy, Medium, and Hard
  • Train low-level primitives using PPO and a single multi-task policy conditioned on the CLIP embeddings
  • Perform hindsight relabeling to allow generalization to partial strings
  • Use task string-conditioned value function estimates from learned policy to obtain a visually grounded affordance function