Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs can learn and leverage Internet-scale knowledge through pre-training with autoregressive models.
LLMs are not suitable for settings with embodied agents due to lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints.
Language-conditioned robotic policies can provide the necessary grounding for the agent to be correctly situated in the real world, but are limited by the lack of high-level semantic understanding.
We must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment.
We demonstrate this guided decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting.

Paper Content

Introduction

Recent works have demonstrated robots that can understand and act on natural language.
Large language models (LLMs) are trained on web-scale data and can generate text that draws on it.
Applying LLMs to embodied settings is a challenge.
Robots must understand instructions, determine steps needed to fulfill them, and sequence them appropriately.
Grounded Decoding (GD) is a scalable, general approach to planning with LLMs in embodied domains.
GD combines token probabilities from LLMs and token-conditioned robotic functions.

Related work

Decoding strategies for large language models are an active area of research
Recent works have focused on developing decoding heuristics for natural text generation
External classifiers are used to maximize language-space utilities when decoding language models
Classifier-guided decoding methods have been developed for offline domains such as image captioning and task-oriented dialog
Training language models to understand embodiment is an active area of research
SayCan uses a large language model and a value function to select robotic skills
Grounded Decoding jointly decodes the language model and the grounded model at the token level
Task and motion planning seeks to solve high-level instructions via sequencing tasks
Machine learning is used to accelerate planning and enable new domains
Grounding functions model the probability of tokens given the robot’s state

Problem formulation.

An LLM can generate text that is not grounded in the physical state of the environment
Grounded Decoding (GD) is proposed to guide the generation of token sequences with a grounding function conditioned on the embodiment of the system
GD is factorized into token-level decoding
GD proceeds through a process similar to probabilistic filtering
GD provides a grounded scoring function
GD is used in the context of robot control
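
The token-level factorization above can be illustrated with a minimal sketch. It assumes that each candidate token is scored by the sum of its log-probability under the language model and its log-probability under the grounding function, and that the highest-scoring token is chosen greedily. The functions `lm_probs` and `grounding_prob`, the tiny vocabulary, and all probability values are hypothetical stand-ins, not the models or numbers used in the paper.

```python
import math

OBJECT_TOKENS = {"apple", "sponge"}  # hypothetical object vocabulary

def lm_probs(prefix):
    """Toy stand-in for an LLM: next-token probabilities given the prefix."""
    table = {
        (): {"pick": 1.0},
        ("pick",): {"up": 1.0},
        ("pick", "up"): {"the": 1.0},
        ("pick", "up", "the"): {"sponge": 0.6, "apple": 0.4},
    }
    # Any prefix not listed in the table ends the sequence.
    return table.get(tuple(prefix), {"<eos>": 1.0})

def grounding_prob(token, visible_objects):
    """Toy grounding function: how realizable a token is in the current scene."""
    if token not in OBJECT_TOKENS:
        return 1.0  # uninformative for non-object tokens
    return 0.95 if token in visible_objects else 0.05

def grounded_decode(visible_objects, max_len=10):
    """Greedily pick the token maximizing log p_LM + log p_G at each step."""
    prefix = []
    for _ in range(max_len):
        scores = {
            tok: math.log(p) + math.log(grounding_prob(tok, visible_objects))
            for tok, p in lm_probs(prefix).items()
        }
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix

print(grounded_decode({"apple"}))            # ['pick', 'up', 'the', 'apple']
print(grounded_decode({"apple", "sponge"}))  # ['pick', 'up', 'the', 'sponge']
```

In this toy setup the language model alone prefers "sponge", but when only the apple is visible the grounding term overrides that preference.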

Experiments

Demonstrated Grounded Decoding on three different environments
Used a variety of grounding functions to show generality and flexibility
Tabletop manipulation environment with grounding functions for affordances, safety, and preferences
2D Maze environment built from Minigrid with RL-trained value function-based grounding functions
Real robot in an office kitchen with CLIP-based grounding function

Long-horizon tabletop manipulation

Experiment with simulated tabletop manipulation scene based on RAVENS and CLIPort
20 tasks specified via natural language instructions
CLIPort predicts unnormalized logits over pixel space used as affordances
Safety grounding function used for 3 tasks in Box Packing task family
Preference grounding function used for 2 tasks in Box Packing task family
Results grouped by task category in Table 1
Supervised methods perform poorly on unseen tasks
Grounded Decoding achieves its best results with beam search
Grounding functions composed of robot, environment, and policy
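
One plausible way to use several grounding functions together (affordance, safety, preference) is to treat them as independent factors and add their log-scores. The sketch below assumes exactly that; the per-object logits, the safety and preference values, and the reduction of pixel-space logits to one score per object are all illustrative assumptions rather than details taken from the paper.

```python
import math

def softmax(logits):
    """Normalize a dict of unnormalized logits into probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical: the maximum unnormalized pixel logit that a CLIPort-style
# model assigns to picking each candidate object.
max_pixel_logit = {"knife": 2.0, "red block": 1.5, "blue block": -3.0}

affordance = softmax(max_pixel_logit)
safety = {"knife": 0.01, "red block": 1.0, "blue block": 1.0}     # assumed values
preference = {"knife": 0.5, "red block": 0.3, "blue block": 0.7}  # assumed values

def combined_log_score(obj, factors):
    """Sum of log-scores, i.e. the log of the product of all factors."""
    return sum(math.log(f[obj]) for f in factors)

def best_object(factors):
    """Return the candidate with the highest combined log-score."""
    candidates = factors[0].keys()
    return max(candidates, key=lambda o: combined_log_score(o, factors))

print(best_object([affordance]))                      # knife
print(best_object([affordance, safety]))              # red block
print(best_object([affordance, safety, preference]))  # red block
```

Adding the safety factor moves the choice away from the knife even though it has the highest affordance, which is the kind of behavior a safety grounding function is meant to produce.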

2D maze

Grounded Decoding is evaluated on Minigrid tasks
Tasks are divided into three categories: Easy, Medium, and Hard
Easy tasks have short horizons and are fully described by the instruction
Medium tasks have short and long horizons and have step-by-step instructions
Hard tasks have complex, long-horizon, ambiguous instructions
Grounded Decoding uses a language model as a planner to decompose instructions
It combines language model planning with an affordance function grounded in the agent’s observations
Performance is compared to a solitary policy, a hierarchical algorithm, and a hierarchical algorithm with an ungrounded language model
Beam search improves performance in long-horizon tasks
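
A small sketch can show why beam search may help. It assumes the grounding function only becomes informative once a full object phrase has been decoded, so a greedy decoder commits to a first token before the grounding signal arrives. The scene, the probabilities, and the function names are hypothetical and chosen only to make the effect visible.

```python
import math

# Hypothetical scene: the only objects present are a red key and a blue door.
SCENE = {("red", "key"), ("blue", "door")}

def lm_probs(prefix):
    """Toy language model over two-token object phrases."""
    if len(prefix) == 0:
        return {"red": 0.6, "blue": 0.4}
    if len(prefix) == 1:
        return {"door": 0.7, "key": 0.3}
    return {"<eos>": 1.0}

def grounding_prob(prefix, token):
    """Toy grounding: only informative once the color and noun are both known."""
    if len(prefix) == 1 and token != "<eos>":
        return 0.9 if (prefix[0], token) in SCENE else 0.05
    return 1.0

def beam_search(beam_width, max_len=4):
    """Keep the beam_width best prefixes by cumulative log p_LM + log p_G."""
    beams = [(0.0, [])]  # (cumulative log score, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, p in lm_probs(tokens).items():
                new_score = score + math.log(p) + math.log(grounding_prob(tokens, tok))
                if tok == "<eos>":
                    finished.append((new_score, tokens))
                else:
                    candidates.append((new_score, tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    return max(finished, key=lambda c: c[0])[1]

print(beam_search(beam_width=1))  # ['red', 'key']   (greedy)
print(beam_search(beam_width=2))  # ['blue', 'door'] (higher joint score)
```

With a beam width of 1 the decoder commits to "red" and ends with a joint score of 0.162, while a beam width of 2 keeps "blue" alive and finds "blue door" with a joint score of 0.252.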

Mobile manipulation in a kitchen

Used the same mobile manipulation platform and skills as SayCan
Performed instruction following tasks
Split tasks into two categories: Unambiguous and Ambiguous
Modified SayCan algorithm to enable grounded decoding
Added object detection score as grounding function
Found that performance is recovered when queries are explicit and that planning performance improves by 25% when queries are ambiguous
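
How a detection score can act as a grounding function is sketched below. The summary only states that an object detection score was added; the mapping from detections to partially decoded object names, the floor value, and the example objects and probabilities are assumptions made for illustration.

```python
def detection_grounding(partial_name, detections, floor=1e-3):
    """Best detection confidence among objects whose name starts with partial_name.

    Returns a small floor value when nothing in the scene matches, so that
    undetected objects are strongly discouraged but never scored as zero.
    """
    matches = [
        conf for name, conf in detections.items()
        if name == partial_name or name.startswith(partial_name + " ")
    ]
    return max(matches, default=floor)

def rerank(lm_probs, detections):
    """Pick the object name maximizing p_LM * detection-based grounding."""
    scored = {
        name: p * detection_grounding(name, detections)
        for name, p in lm_probs.items()
    }
    return max(scored, key=scored.get)

# Hypothetical detector output and language model preferences for an
# ambiguous request such as "bring me a soda".
detections = {"coke can": 0.9, "apple": 0.2}
lm_probs = {"pepsi can": 0.5, "coke can": 0.3, "apple": 0.2}

print(detection_grounding("coke", detections))  # 0.9
print(rerank(lm_probs, detections))             # coke can
```

Here the language model prefers "pepsi can", but no such object is detected, so the grounded choice is "coke can".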

Analysis

Comparison to SayCan

GD is two orders of magnitude more efficient than SayCan
GD allows open-vocabulary grounding beyond just affordances

Breakdown of failure reasons

Hierarchical approaches have imperfect low-level policies for step execution.
Grounded Decoding reduces planning failure by incorporating grounded scene information into the decoding process.
Beam search performs better than greedy search by being aware of full-length single-step instructions.

Grounded action manifold

Goal of work is to investigate integration of grounded information into language model decoding
Used t-SNE plot to illustrate extent to which grounded models help narrow down search space for language models
Represented instructions as dots in figure, computed affordance values with respect to four different scenes
Grouped dots using t-SNE and BERT embeddings
Grounded models can effectively identify achievable skills to produce actionable manifold within language space
Language alone does not perfectly group actionable skills
The language space is much larger, with a vocabulary of roughly 40,000 tokens at each step

Conclusions, limitations, & future

Presented Grounded Decoding (GD), an approach for leveraging the knowledge and capabilities of large language models in embodied settings
GD resembles probabilistic filtering, by decoding tokens that have high probabilities under the language model and under the grounding model
GD is a general, flexible, and expressive approach to embodied tasks
Demonstrated GD on three embodied domains, showing it is capable of solving complex, long-horizon tasks
Limitations of GD include: quality and availability of general grounding functions, prompt engineering to steer LLMs to desired action space, and joint decoding may be limiting compared to a single model
Flexibility of GD enables many other grounding functions and ways to integrate grounding
Development and integration of a foundation model for grounding would improve performance significantly
Evaluated GD on simulated tabletop rearrangement tasks, Minigrid 2D Maze tasks, and real-world kitchen mobile manipulation tasks
Used the open-source gym-minigrid suite of environments with one simple change: instead of the default observation space, which is a 7 × 7 egocentric window, the agent has access to the entire grid
Tasks grouped in three categories: Easy, Medium, and Hard
Train low-level primitives using PPO and a single multi-task policy conditioned on the CLIP embeddings
Perform hindsight relabeling to allow generalization to partial strings
Use task string-conditioned value function estimates from learned policy to obtain a visually grounded affordance function
