Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Systems for language-guided human-robot interaction must be adaptive and efficient.
Existing instruction-following agents cannot adapt and require many demonstrations to learn.
LILAC is a framework for incorporating and adapting to natural language corrections.
LILAC splits agency between the human and robot.
Real-time corrections refine the human’s control space.
A user study shows that LILAC achieves higher task completion rates and is preferred by users.

Paper Content

Shared autonomy regime

User can get stuck in an “irrecoverable” state
Prior work only allows for issuing a single language utterance for the entire task
Our approach allows users to provide language corrections at any point during execution, allowing the robot to adapt online

Introduction

Research in natural language for robotics has focused on interactions between humans and robots
Existing systems require large amounts of data or make restrictive assumptions
Robot in Figure 1 trying to execute a long-horizon task with several critical states
Existing approaches fail to complete the task repeatably
User can provide real-time corrections to refine the control space
LILAC (Language-Informed Latent Actions with Corrections) allows for online adaptation
LILAC learns from 10-20 demonstrations instead of thousands
Split agency between human and robot
User study shows LILAC has higher task success rates
LILAC is more reliable, precise, and easy to use
LILA, the prior approach, learns a single, static mapping
Online language corrections allow user to quickly diagnose the problem and refine the robot’s behavior

Related work

Incorporating language corrections for manipulation
Learning language-conditioned policies
Incorporating other forms of corrective feedback
Data-efficient online corrections
Post-hoc corrections
No need for hand-designed correction primitives
No need for prior environment dynamics
Real-time, online approach

LILAC: framing corrections

LILAC builds on LILA, an architecture introduced by Karamcheti et al.
LILAC incorporates natural language corrections in a data-driven way.
LILAC focuses on directional and referential corrections.

Problem statement

Problem defined by the tuple (S, A, T, U, C*, Z)
S denotes state of robot and environment
A denotes robot’s 6-DoF delta in end-effector pose
T is a stochastic unobserved transition function
U denotes high-level natural language instruction
C* denotes stack of language corrections
Z denotes user-provided input via low-dimensional control device
Goal is to learn a function F: S × Z × U × C* → A that maps state, user input, instruction, and corrections to a robot action
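To make the formulation concrete, here is a minimal Python sketch of the inputs and output of F. The class and field names, and the split of the state into proprioception and object positions, are illustrative assumptions rather than the paper's code; the transition function T is unobserved, so it is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical containers for the (S, A, T, U, C*, Z) formulation.
# Names and layouts are assumptions; only the 6-DoF action comes from the text.


@dataclass
class State:
    proprio: List[float]           # S: robot proprioceptive state
    object_positions: List[float]  # S: flattened object positions


@dataclass
class LanguageContext:
    instruction: str                                      # U: high-level instruction
    corrections: List[str] = field(default_factory=list)  # C*: stack, top = last element

    def push_correction(self, utterance: str) -> None:
        """Push a new real-time correction onto the stack."""
        self.corrections.append(utterance)


# A: 6-DoF end-effector delta (dx, dy, dz, droll, dpitch, dyaw)
Action = List[float]

# F : S x Z x U x C* -> A, with Z the low-dimensional control input
Policy = Callable[[State, Sequence[float], LanguageContext], Action]


def zero_policy(state: State, z: Sequence[float], lang: LanguageContext) -> Action:
    """Placeholder F that ignores its inputs and returns a zero 6-DoF action."""
    return [0.0] * 6
```

Any learned F only has to respect this signature: a 2-DoF joystick input plus the current language context in, one 6-DoF action out.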

Modeling: inference & learning

Given a state, an instruction, and the stack of corrections C*, F maps low-dimensional user control inputs to high-dimensional robot actions.
The state space consists of robot’s proprioceptive state and object positions.
Language (the instruction and the correction stack, taken in last-in-first-out order) is encoded with a frozen DistilRoBERTa language model.
GPT-3 is used to modulate the amount of state information.
FiLM is used to incorporate language.
A two-layer MLP is used to predict basis vectors.
Gram-Schmidt is used to orthonormalize the basis vectors.
The dataset consists of language and trajectory pairs.
The training process is framed as a state-and-language conditional autoencoder.
The loss function minimizes the mean squared error between the high-DoF robot action and the reconstructed action.
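The inference path above can be sketched in plain Python (standard library only). This is a hedged illustration, not the LILAC implementation: the FiLM generator, layer sizes, random initialization, and the projection-based stand-in encoder are all assumptions, a fixed vector stands in for the frozen DistilRoBERTa embedding, and training (minimizing the MSE with an autodiff framework) is omitted.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def linear(x: Sequence[float], weights: List[Vector], bias: Sequence[float]) -> Vector:
    """Dense layer; `weights` has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def film(features: Sequence[float], gamma: Sequence[float], beta: Sequence[float]) -> Vector:
    """FiLM: per-feature scale (gamma) and shift (beta)."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def gram_schmidt(vectors: List[Vector], eps: float = 1e-8) -> List[Vector]:
    """Modified Gram-Schmidt; raises if the vectors are (nearly) linearly dependent."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < eps:
            raise ValueError("basis vectors are (nearly) linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


class LatentActionDecoder:
    """Hypothetical decoder: (state, language, z) -> 6-DoF action."""

    def __init__(self, state_dim: int, lang_dim: int, hidden_dim: int = 32,
                 control_dim: int = 2, action_dim: int = 6, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[Vector]:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.control_dim, self.action_dim = control_dim, action_dim
        # FiLM generator: language embedding -> (gamma, beta) over state features.
        self.w_gamma, self.b_gamma = init(state_dim, lang_dim), [1.0] * state_dim
        self.w_beta, self.b_beta = init(state_dim, lang_dim), [0.0] * state_dim
        # Two-layer MLP: conditioned state -> control_dim raw basis vectors in R^action_dim.
        self.w1, self.b1 = init(hidden_dim, state_dim), [0.0] * hidden_dim
        out = control_dim * action_dim
        self.w2, self.b2 = init(out, hidden_dim), [0.0] * out

    def basis(self, state: Sequence[float], lang_emb: Sequence[float]) -> List[Vector]:
        gamma = linear(lang_emb, self.w_gamma, self.b_gamma)
        beta = linear(lang_emb, self.w_beta, self.b_beta)
        hidden = [gelu(h) for h in linear(film(state, gamma, beta), self.w1, self.b1)]
        flat = linear(hidden, self.w2, self.b2)
        d = self.action_dim
        raw = [flat[i * d:(i + 1) * d] for i in range(self.control_dim)]
        return gram_schmidt(raw)  # orthonormal control basis

    def decode(self, state: Sequence[float], lang_emb: Sequence[float],
               z: Sequence[float]) -> Vector:
        """Action = sum_i z_i * b_i over the orthonormal basis vectors."""
        basis = self.basis(state, lang_emb)
        return [sum(zi * b[j] for zi, b in zip(z, basis)) for j in range(self.action_dim)]


def project(action: Sequence[float], basis: List[Vector]) -> Vector:
    """Stand-in encoder (assumption): coordinates of the action in the basis."""
    return [sum(a * bj for a, bj in zip(action, b)) for b in basis]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between reconstructed and demonstrated actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```

Because the basis is orthonormal, projecting an action onto it gives the latent coordinates that reconstruct it best, so the reconstruction error is zero exactly when the action lies in the span of the predicted basis vectors.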

Gating instructions vs. corrections

Scaling LILAC to diverse language requires treating instructions and corrections differently
Different language utterances require different amounts of object/environment state-dependence
An example utterance is given to illustrate the concept of state-dependence
A gating function is used to predict a discrete value to signify state-independence
GPT-3 is used to identify corrections
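A minimal sketch of the gating idea follows. The keyword heuristic is a hypothetical stand-in for the GPT-3 gate, the word lists are invented for illustration, and masking object positions with zeros is an assumed way of enforcing state-independence; the helper that orders the correction stack last-in-first-out is likewise an illustrative assumption about the encoding order.

```python
from typing import List, Sequence

# Hypothetical vocabularies; a real gate would query a language model instead.
DIRECTIONAL_WORDS = {"left", "right", "up", "down", "forward", "backward",
                     "back", "higher", "lower", "closer", "further"}
OBJECT_WORDS = {"cup", "marker", "drawer", "bowl", "block"}


def gate(utterance: str) -> int:
    """Return 1 if the utterance looks state-independent (purely directional), else 0."""
    tokens = [t.strip(".,!?").lower() for t in utterance.split()]
    has_direction = any(t in DIRECTIONAL_WORDS for t in tokens)
    mentions_object = any(t in OBJECT_WORDS for t in tokens)  # referential cue
    return 1 if has_direction and not mentions_object else 0


def conditioning_state(proprio: Sequence[float], objects: Sequence[float],
                       utterance: str) -> List[float]:
    """Keep proprioception; zero out object positions when the gate fires."""
    if gate(utterance):
        return list(proprio) + [0.0] * len(objects)
    return list(proprio) + list(objects)


def lifo_order(instruction: str, corrections: Sequence[str]) -> List[str]:
    """Newest correction first, original instruction last (last-in-first-out)."""
    return list(reversed(corrections)) + [instruction]
```

With this split, "move to the left" is decoded without object state, while "move left of the cup" still sees where the cup is, which is the distinction between directional and referential corrections described above.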

Reproducibility

Released an open source codebase with complete pipeline
Model architecture uses GELU activations and 128-unit hidden layers
Training is efficient and can be done on consumer laptop CPUs
Training uses AdamW optimizer with default learning rate and weight decay
Dataset consists of high-level task and correction utterances
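The optimizer can be written out as a single AdamW update in plain Python. The text above only says "default" learning rate and weight decay; this sketch assumes PyTorch's torch.optim.AdamW defaults (lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2), and the class is a hypothetical re-implementation for illustration rather than the released code.

```python
import math
from typing import List, Sequence, Tuple


class AdamW:
    """Minimal AdamW over a flat list of scalar parameters (updated in place)."""

    def __init__(self, params: List[float], lr: float = 1e-3,
                 betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-8,
                 weight_decay: float = 1e-2) -> None:
        self.params = params
        self.lr, self.eps, self.wd = lr, eps, weight_decay
        self.b1, self.b2 = betas
        self.m = [0.0] * len(params)  # first-moment estimates
        self.v = [0.0] * len(params)  # second-moment estimates
        self.t = 0                    # step counter for bias correction

    def step(self, grads: Sequence[float]) -> None:
        self.t += 1
        for i, g in enumerate(grads):
            # Decoupled weight decay: shrink the parameter directly.
            self.params[i] -= self.lr * self.wd * self.params[i]
            # Exponential moving averages of the gradient and squared gradient.
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g * g
            m_hat = self.m[i] / (1 - self.b1 ** self.t)
            v_hat = self.v[i] / (1 - self.b2 ** self.t)
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
```

The weight decay is decoupled: the parameter is shrunk by lr × weight_decay before the Adam step, instead of adding an L2 term to the gradient. For example, minimizing (x - 3)^2 from x = 0 with lr=0.05 drives x close to 3 within a few thousand steps.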

User study preliminaries

Evaluating LILAC against language-conditioned approaches for full and shared autonomy
User study conducted with 12 participants
Environment is a multi-task “desk” environment with 5 tasks of varying complexity
50 full-task demonstrations collected
Correction demonstrations collected with associated language utterances
Participants recruited from university students, 8 male/4 female
Robot used is a Franka Emika Panda
Within-subjects user study with 3 candidate methods
Hypotheses tested regarding LILAC’s performance relative to the baseline strategies
Baseline implementations trained on same data as LILAC
Qualitative measures tracked via survey questions

User study results

LILAC achieves highest success rate across all subtasks
LILAC is significantly more performant than imitation learning and LILA baselines
LILAC is subjectively preferred by users
Visualizations show LILAC allows for precise, targeted control
LILAC allows users to stay closer to training state distribution

Figure panels: training state distribution; LILA (no corrections); LILAC (ours)

Discussion

Limitations of current approach
Need for context-sensitive language corrections
Easily overused corrections
Need for more natural and intuitive control spaces
Ambiguous interpretation of corrections

Conclusion

Argued that scalable systems for language-driven human-robot interaction must be able to exhibit adaptivity and sample efficiency
Presented LILAC as a potential answer
LILAC is built within the shared autonomy paradigm
User study comparing LILAC with language-conditioned imitation learning and language-informed shared autonomy
LILAC is subjectively preferred by users and objectively performant
LILAC incorporates language corrections efficiently
GPT-3 used to provide transfer learning
Results from user study across three conditions
Qualitative trajectories across different control strategies
Fully autonomous imitation learning fails; LILA and LILAC are able to reach objects but fail to precisely aim and grasp
LILA deviates from the observed state distribution, while LILAC stays close to the states seen during training
