Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Systems for language-guided human-robot interaction must be adaptive and efficient.
  • Existing instruction-following agents cannot adapt and require many demonstrations to learn.
  • LILAC is a framework for incorporating and adapting to natural language corrections.
  • LILAC splits agency between the human and robot.
  • Real-time corrections refine the human’s control space.
  • A user study shows that LILAC achieves higher task completion rates and is preferred by users.

Paper Content

Shared autonomy regime

  • User can get stuck in an “irrecoverable” state
  • Prior work only allows for issuing a single language utterance for the entire task
  • Our approach allows users to provide language corrections at any point during execution, allowing the robot to adapt online

Introduction

  • Research in natural language for robotics has focused on interactions between humans and robots
  • Existing systems require large amounts of data or make restrictive assumptions
  • Robot in Figure 1 trying to execute a long-horizon task with several critical states
  • Existing approaches fail to complete the task repeatably
  • User can provide real-time corrections to refine the control space
  • LILAC: Language-Informed Latent Actions with Corrections allows for adaptation
  • LILAC learns from 10-20 demonstrations instead of thousands
  • Split agency between human and robot
  • User study shows LILAC has higher task success rates
  • LILAC is more reliable, precise, and easy to use
  • LILA learns a single, static mapping
  • Online language corrections allow user to quickly diagnose the problem and refine the robot’s behavior
  • Incorporating language corrections for manipulation
  • Learning language-conditioned policies
  • Incorporating other forms of corrective feedback
  • Data-efficient online corrections
  • Post-hoc corrections
  • No need for hand-designed correction primitives
  • No need for prior environment dynamics
  • Real-time, online approach

LILAC: framing corrections

  • LILAC builds off of LILA, an architecture introduced by Karamcheti et al.
  • LILAC incorporates natural language corrections in a data-driven way.
  • LILAC focuses on directional and referential corrections.

Problem statement

  • Problem defined by elements (S, A, T, U, C*, Z)
  • S denotes state of robot and environment
  • A denotes robot’s 6-DoF delta in end-effector pose
  • T is a stochastic unobserved transition function
  • U denotes high-level natural language instruction
  • C* denotes stack of language corrections
  • Z denotes user-provided input via low-dimensional control device
  • Goal is to learn a function F that maps state, input, instruction, and corrections to robot action
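The inputs to F listed above can be sketched as a small container. This is only an illustration: the class name `ControlContext`, its field names, and the choice to treat the top of the correction stack as the currently active utterance are assumptions made for this sketch, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ControlContext:
    """Hypothetical bundle of the non-control inputs to F at one timestep."""
    state: Sequence[float]   # s in S: robot and environment state
    instruction: str         # u in U: high-level natural language instruction
    corrections: List[str] = field(default_factory=list)  # C*: stack of corrections

    def push_correction(self, text: str) -> None:
        """The user issues a new correction; it goes on top of the stack."""
        self.corrections.append(text)

    def pop_correction(self) -> str:
        """The most recent correction is resolved and removed."""
        return self.corrections.pop()

    def active_language(self) -> str:
        """Assumed rule: the top correction is active, else the instruction."""
        return self.corrections[-1] if self.corrections else self.instruction
```

With this layout, F would take a `ControlContext` plus the low-dimensional input z and return a 6-DoF action; once every correction has been popped, control falls back to the original instruction.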

Modeling: inference & learning

  • Given a state, a language instruction, and a stack of corrections c, F maps low-dimensional user control inputs to high-dimensional robot actions.
  • The state space consists of robot’s proprioceptive state and object positions.
  • Language is encoded using a last-in-first-out strategy and a frozen Distil-RoBERTa language model.
  • GPT-3 is used to modulate the amount of state information.
  • FiLM is used to incorporate language.
  • A two-layer MLP is used to predict basis vectors.
  • Gram-Schmidt is used to orthonormalize the basis vectors.
  • The dataset consists of language and trajectory pairs.
  • The training process is framed as a state-and-language conditional autoencoder.
  • The loss function minimizes the mean squared error between the high-DoF robot action and the reconstructed action.
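The basis-vector steps above can be sketched in plain Python. This is a minimal sketch under assumptions: the hard-coded vectors stand in for the MLP outputs, encoding by projection onto the orthonormal basis is one plausible reading of the autoencoder framing, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors: List[Vector], eps: float = 1e-10) -> List[Vector]:
    """Orthonormalize vectors in order (modified Gram-Schmidt)."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = dot(w, b)  # component of w along an earlier basis vector
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < eps:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


def decode(z: Vector, basis: List[Vector]) -> Vector:
    """Map a low-dimensional input z to a high-dimensional action: sum_i z_i * b_i."""
    dim = len(basis[0])
    return [sum(zi * b[j] for zi, b in zip(z, basis)) for j in range(dim)]


def encode(action: Vector, basis: List[Vector]) -> Vector:
    """Assumed encoder: project the action onto each orthonormal basis vector."""
    return [dot(b, action) for b in basis]


def reconstruction_mse(action: Vector, basis: List[Vector]) -> float:
    """Mean squared error between an action and its reconstruction."""
    recon = decode(encode(action, basis), basis)
    return sum((a - r) ** 2 for a, r in zip(action, recon)) / len(action)
```

Because the basis is orthonormal, any demonstrated action that lies in its span reconstructs with zero error, so minimizing this loss pushes the predicted basis toward the directions present in the demonstrations.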

Gating instructions vs. corrections

  • LILAC is a computer science approach to scaling language
  • Different language utterances require different amounts of object/environment state-dependence
  • An example utterance is given to illustrate the concept of state-dependence
  • A gating function is used to predict a discrete value to signify state-independence
  • GPT-3 is used to identify corrections
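A binary gate of this kind can be sketched as follows. The keyword rule is a hypothetical stand-in for the GPT-3 based classifier, and the choice to zero out the object positions while keeping the proprioceptive state is an assumption made for illustration; the function names are not from the paper.

```python
from typing import Callable, List, Sequence

# Assumed vocabulary of purely directional words (illustrative only).
DIRECTIONAL_WORDS = {"left", "right", "up", "down", "forward", "backward"}


def keyword_gate(utterance: str) -> int:
    """Hypothetical gate: 0 = state-independent, 1 = state-dependent."""
    words = {w.strip(".,!?") for w in utterance.lower().split()}
    return 0 if words & DIRECTIONAL_WORDS else 1


def gated_state(
    proprio: Sequence[float],
    object_positions: Sequence[float],
    utterance: str,
    gate: Callable[[str], int] = keyword_gate,
) -> List[float]:
    """Keep proprioception; multiply object positions by the discrete gate."""
    alpha = gate(utterance)
    return list(proprio) + [alpha * x for x in object_positions]
```

Under this sketch a directional correction such as "move left" hides the object positions from the model, while a referential one such as "towards the blue marker" leaves them visible. A keyword rule misclassifies mixed phrases like "to the left of the cup", which is one reason a language model is a more natural choice for the gate.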

Reproducibility

  • Released an open source codebase with complete pipeline
  • Model architecture uses GELU activation and 128 parameters
  • Training is efficient and can be done on consumer laptop CPUs
  • Training uses AdamW optimizer with default learning rate and weight decay
  • Dataset consists of high-level task and correction utterances

User study preliminaries

  • Evaluating LILAC against language-conditioned approaches for full and shared autonomy
  • User study conducted with 12 participants
  • Environment is a multi-task “desk” environment with 5 tasks of varying complexity
  • 50 full-task demonstrations collected
  • Correction demonstrations collected with associated language utterances
  • Participants recruited from university students, 8 male/4 female
  • Robot used is a Franka Emika Panda
  • Within-subjects user study with 3 candidate methods
  • Hypotheses tested regarding LILAC’s performance relative to the baseline strategies
  • Baseline implementations trained on same data as LILAC
  • Qualitative measures tracked via survey questions

User study results

  • LILAC achieves highest success rate across all subtasks
  • LILAC is significantly more performant than imitation learning and LILA baselines
  • LILAC is subjectively preferred by users
  • Visualizations show LILAC allows for precise, targeted control
  • LILAC allows users to stay closer to training state distribution

Figure: state distributions for the training data, LILA (no corrections), and LILAC (ours).

Discussion

  • Limitations of current approach
  • Need for context-sensitive language corrections
  • Easily overused corrections
  • Need for more natural and intuitive control spaces
  • Ambiguous interpretation of corrections

Conclusion

  • Argued that scalable systems for language-driven human-robot interaction must be able to exhibit adaptivity and sample efficiency
  • Presented LILAC as a potential answer
  • LILAC is built within the shared autonomy paradigm
  • User study comparing LILAC with language-conditioned imitation learning and language-informed shared autonomy
  • LILAC is subjectively preferred by users and objectively performant
  • LILAC incorporates language corrections efficiently
  • GPT-3 used to provide transfer learning
  • Results from user study across three conditions
  • Qualitative trajectories across different control strategies
  • Fully autonomous imitation learning fails; LILA and LILAC are able to reach objects but fail to precisely aim and grasp
  • LILA deviates from the observed training state distribution, while LILAC stays close to the states seen at training