Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can learn and leverage Internet-scale knowledge through pre-training with autoregressive models.
- LLMs are not suitable for settings with embodied agents due to lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints.
- Language-conditioned robotic policies can provide the necessary grounding for the agent to be correctly situated in the real world, but are limited by the lack of high-level semantic understanding.
- We must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment.
- We demonstrate this guided decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting.
Paper Content
Introduction
- Recent works have demonstrated robots that can understand and act on natural language.
- Large language models (LLMs) are trained on web-scale data to generate text.
- Applying LLMs to embodied settings is a challenge.
- Robots must understand instructions, determine steps needed to fulfill them, and sequence them appropriately.
- Grounded Decoding (GD) is a scalable, general approach to planning with LLMs in embodied domains.
- GD combines token probabilities from LLMs and token-conditioned robotic functions.
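The combination of LLM token probabilities with grounded scores can be illustrated for a single decoding step. This is a minimal sketch, not the paper's implementation: the function name `combine_token_scores`, the three-token vocabulary, and all probability values are hypothetical, and multiplying then renormalizing is just one simple way to merge the two distributions.

```python
def combine_token_scores(lm_probs, grounding_probs):
    """Multiply LM and grounding probabilities per token, then renormalize."""
    joint = {tok: p * grounding_probs.get(tok, 0.0) for tok, p in lm_probs.items()}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("no token is supported by both models")
    return {tok: p / total for tok, p in joint.items()}


# Hypothetical next-token distributions for the prefix "pick up the ..."
lm_probs = {"apple": 0.5, "sponge": 0.3, "block": 0.2}           # language prior
grounding_probs = {"apple": 0.05, "sponge": 0.9, "block": 0.6}   # scene feasibility

combined = combine_token_scores(lm_probs, grounding_probs)
best_token = max(combined, key=combined.get)
print(best_token)
```

In this toy case the language model alone prefers "apple", but the grounded score shifts the choice to a token that is feasible in the scene.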
Related work
- Decoding strategies for large language models are an active area of research
- Recent works have focused on developing decoding heuristics for natural text generation
- External classifiers are used to maximize language-space utilities when decoding language models
- Classifier-guided decoding methods have been developed for offline domains such as image captioning and task-oriented dialog
- Training language models to understand embodiment is an active area of research
- SayCan uses a large language model and a value function to select robotic skills
- Grounded Decoding jointly decodes the language model and the grounded model at the token level
- Task and motion planning seeks to solve high-level instructions via sequencing tasks
- Machine learning is used to accelerate planning and enable new domains
- Grounding functions model the probability of tokens given the robot’s state
Problem formulation
- An LLM can generate text that is not grounded in the physical state of the environment
- Grounded Decoding (GD) is proposed to guide the generation of token sequences with a grounding function conditioned on the embodiment of the system
- GD is factorized into token-level decoding
- GD proceeds through a process similar to probabilistic filtering
- GD provides a grounded scoring function
- GD is used in the context of robot control
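One way to write the token-level factorization described above is sketched below. The notation is assumed for illustration and may differ from the paper's: \ell is the instruction or prompt, s is the embodied state, w_{1:N} is the decoded token sequence, p_LM is the language model, and p_G is the grounding function. The first line treats the two models as independent sources of evidence, which is a simplifying assumption.

```latex
% Assumed notation: \ell = instruction/prompt, s = embodied state,
% w_{1:N} = decoded tokens, \mathcal{W} = token vocabulary.
\begin{align}
p(w_{1:N} \mid s, \ell)
  &\propto p_{\mathrm{LM}}(w_{1:N} \mid \ell)\; p_{\mathrm{G}}(w_{1:N} \mid s) \\
  &= \prod_{n=1}^{N} p_{\mathrm{LM}}(w_n \mid w_{1:n-1}, \ell)\;
                     p_{\mathrm{G}}(w_n \mid w_{1:n-1}, s), \\
w_n^{*}
  &= \arg\max_{w_n \in \mathcal{W}}\;
     p_{\mathrm{LM}}(w_n \mid w_{1:n-1}, \ell)\;
     p_{\mathrm{G}}(w_n \mid w_{1:n-1}, s).
\end{align}
```

The last line is the greedy per-token choice; the resemblance to probabilistic filtering comes from reweighting the language prior by grounded evidence at every step.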
Experiments
- Demonstrated Grounded Decoding on three different environments
- Used a variety of grounding functions to show generality and flexibility
- Tabletop manipulation environment with grounding functions for affordances, safety, and preferences
- 2D Maze environment built from Minigrid with RL-trained value function-based grounding functions
- Real robot in an office kitchen with CLIP-based grounding function
Long-horizon tabletop manipulation
- Experiment with simulated tabletop manipulation scene based on RAVENS and CLIPort
- 20 tasks specified via natural language instructions
- CLIPort predicts unnormalized logits over pixel space used as affordances
- Safety grounding function used for 3 tasks in Box Packing task family
- Preference grounding function used for 2 tasks in Box Packing task family
- Results grouped by task category in Table 1
- Supervised methods perform poorly on unseen tasks
- Grounded Decoding achieves its best results with beam search
- Grounding functions composed of robot, environment, and policy
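The use of unnormalized pixel-space logits as affordances can be sketched as follows. This is an illustration under stated assumptions, not CLIPort's API or the paper's procedure: the helper names, the tiny 2x3 logit maps, and the choice to summarize each map by its maximum and then apply a softmax across candidates are all hypothetical.

```python
import math


def affordance_from_logits(pixel_logits):
    """Summarize a 2D map of unnormalized logits by its maximum value (assumption)."""
    return max(max(row) for row in pixel_logits)


def softmax(scores):
    """Numerically stable softmax over a dict of raw scores."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


# Hypothetical per-object logit maps, standing in for a policy's pixel-space output
logit_maps = {
    "red block": [[0.1, 2.5, 0.3], [0.0, 1.0, 0.2]],
    "blue bowl": [[-1.0, -0.5, 0.0], [0.2, 0.1, -2.0]],
}

raw = {name: affordance_from_logits(m) for name, m in logit_maps.items()}
affordances = softmax(raw)  # comparable scores across candidate objects
print(affordances)
```

Other reductions (for example a sum or a thresholded count of pixels) would be equally plausible; the point is only that a dense logit map must be reduced to one score per candidate before it can weight tokens.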
2D maze
- Grounded Decoding is evaluated on Minigrid tasks
- Tasks are divided into three categories: Easy, Medium, and Hard
- Easy tasks have short horizons and are fully described by the instruction
- Medium tasks have short and long horizons and have step-by-step instructions
- Hard tasks have complex, long-horizon, ambiguous instructions
- Grounded Decoding uses a language model as a planner to decompose instructions
- It combines language model planning with an affordance function grounded in the agent’s observations
- Performance is compared to a solitary policy, a hierarchical algorithm, and a hierarchical algorithm with an ungrounded language model
- Beam search improves performance in long-horizon tasks
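A toy example can show why beam search may help when grounded evidence arrives late in a plan. The sketch below is an assumption-laden illustration, not the paper's code: `grounded_beam_search`, `toy_lm`, `toy_grounding`, the vocabulary, and all probabilities are hypothetical. Each hypothesis is scored by the sum of log language-model probability and log grounding probability.

```python
import math


def grounded_beam_search(lm_step, grounding, beam_width, max_len, eos="<eos>"):
    """Beam search over tokens scored by log p_LM + log p_G."""
    beams = [((), 0.0)]  # (token tuple, cumulative log score)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for tok, p_lm in lm_step(tokens).items():
                p_g = grounding(tokens, tok)
                if p_lm <= 0.0 or p_g <= 0.0:
                    continue  # prune tokens that either model rules out
                new_score = score + math.log(p_lm) + math.log(p_g)
                candidates.append((tokens + (tok,), new_score))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]


def toy_lm(prefix):
    """Hypothetical language model: next-token probabilities for a given prefix."""
    table = {
        (): {"pick": 0.6, "go": 0.4},
        ("pick",): {"key": 1.0},
        ("go",): {"to": 1.0},
        ("go", "to"): {"key": 1.0},
    }
    return table.get(prefix, {"<eos>": 1.0})


def toy_grounding(prefix, token):
    """Hypothetical grounding: the key is out of reach, so picking it is unlikely."""
    if prefix == ("pick",) and token == "key":
        return 0.1
    if prefix == ("go", "to") and token == "key":
        return 0.9
    return 1.0


greedy_plan = grounded_beam_search(toy_lm, toy_grounding, beam_width=1, max_len=5)
beam_plan = grounded_beam_search(toy_lm, toy_grounding, beam_width=2, max_len=5)
print(greedy_plan)
print(beam_plan)
```

With a beam width of 1 the search commits to "pick" before the grounding function can penalize "key"; with a width of 2 the better-grounded "go to key" hypothesis survives and wins.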
Mobile manipulation in a kitchen
- Implemented the same mobile manipulation platform and skills as SayCan
- Performed instruction following tasks
- Split tasks into two categories: Unambiguous and Ambiguous
- Modified SayCan algorithm to enable grounded decoding
- Added object detection score as grounding function
- Found that performance is recovered when queries are explicit, and that planning performance gains 25% when queries are ambiguous
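The role of an object detection score as a grounding function for ambiguous queries can be sketched as below. Everything here is hypothetical and for illustration only: the `detection_grounding` helper, the detector output format of (label, confidence) pairs, the `floor` value for undetected objects, and the language-model scores.

```python
def detection_grounding(detections, candidate_objects, floor=0.01):
    """Map each candidate object name to its best detection confidence.

    Objects that were never detected receive a small floor score (assumption).
    """
    scores = {}
    for name in candidate_objects:
        confs = [conf for label, conf in detections if label == name]
        scores[name] = max(confs, default=floor)
    return scores


# Hypothetical detector output: (label, confidence)
detections = [("coke can", 0.92), ("sponge", 0.80), ("coke can", 0.40)]

# Hypothetical LM scores for the ambiguous query "bring me something to drink"
lm_scores = {"coke can": 0.35, "water bottle": 0.45, "sponge": 0.20}

grounding = detection_grounding(detections, lm_scores)
joint = {name: lm_scores[name] * grounding[name] for name in lm_scores}
chosen = max(joint, key=joint.get)
print(chosen)
```

The language model alone would favor "water bottle", which is not in the scene; weighting by detection confidence selects a drink that is actually present.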
Analysis
Comparison to SayCan
- GD is two orders of magnitude more efficient than SayCan
- GD allows open-vocabulary grounding beyond just affordances
Breakdown of failure reasons
- Hierarchical approaches have imperfect low-level policies for step execution.
- Grounded Decoding reduces planning failure by incorporating grounded scene information into the decoding process.
- Beam search performs better than greedy search by being aware of full-length single-step instructions.
Grounded action manifold
- Goal of work is to investigate integration of grounded information into language model decoding
- Used t-SNE plot to illustrate extent to which grounded models help narrow down search space for language models
- Represented instructions as dots in the figure and computed their affordance values with respect to four different scenes
- Positioned the dots by projecting BERT embeddings of the instructions with t-SNE
- Grounded models can effectively identify achievable skills to produce actionable manifold within language space
- Language alone does not perfectly group actionable skills
- The language space is much larger, with a vocabulary of roughly 40,000 tokens to choose from at each step
Conclusions, limitations, & future work
- Presented Grounded Decoding (GD), an approach for leveraging the knowledge and capabilities of large language models in embodied settings
- GD resembles probabilistic filtering, by decoding tokens that have high probabilities under the language model and under the grounding model
- GD is a general, flexible, and expressive approach to embodied tasks
- Demonstrated GD on three embodied domains, showing it is capable of solving complex, long-horizon tasks
- Limitations of GD include the quality and availability of general grounding functions, the prompt engineering needed to steer LLMs toward the desired action space, and the possibility that joint decoding is limiting compared to a single model
- Flexibility of GD enables many other grounding functions and ways to integrate grounding
- Development and integration of a foundation model for grounding would improve performance significantly
- Evaluated GD on simulated tabletop rearrangement tasks, Minigrid 2D Maze tasks, and real-world kitchen mobile manipulation tasks
- Used the open-source gym-minigrid suite of environments with one simple change: instead of the default observation space, a 7 × 7 egocentric window, the agent has access to the entire grid
- Tasks grouped in three categories: Easy, Medium, and Hard
- Train low-level primitives using PPO and a single multi-task policy conditioned on the CLIP embeddings
- Perform hindsight relabeling to allow generalization to partial strings
- Use task string-conditioned value function estimates from learned policy to obtain a visually grounded affordance function
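Turning value estimates into an affordance function can be sketched as follows. This is a minimal illustration under assumptions, not the paper's training or inference code: `toy_value` is a hypothetical stand-in for a learned, task string-conditioned value function (here a discounted return of 0.9 per step of distance), and clipping to [0, 1] is one simple way to treat a value as a feasibility score.

```python
def affordance_from_value(value_fn, observation, candidate_strings):
    """Clip value estimates to [0, 1] so they can weight candidate task strings."""
    return {
        s: min(1.0, max(0.0, value_fn(observation, s)))
        for s in candidate_strings
    }


def toy_value(observation, task_string):
    """Hypothetical stand-in for a learned value function.

    Returns 0.9 ** distance if an object named in the task string is present
    in the observation, and 0.0 otherwise.
    """
    for obj, distance in observation.items():
        if obj in task_string:
            return 0.9 ** distance
    return 0.0


# Hypothetical full-grid observation: object name -> distance in steps
observation = {"red key": 2, "blue door": 5}
candidates = ["go to the red key", "go to the blue door", "go to the green ball"]

affordances = affordance_from_value(toy_value, observation, candidates)
best = max(affordances, key=affordances.get)
print(affordances)
```

A task string that names an object absent from the grid receives zero affordance, which is the kind of signal that lets the decoder rule out ungrounded plans.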