Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Systems for language-guided human-robot interaction must be adaptive and efficient.
- Existing instruction-following agents cannot adapt and require many demonstrations to learn.
- LILAC is a framework for incorporating and adapting to natural language corrections.
- LILAC splits agency between the human and robot.
- Real-time corrections refine the human’s control space.
- User study shows higher task completion rates and is preferred by users.
Paper Content
Shared autonomy regime
- User can get stuck in an “irrecoverable” state
- Prior work only allows for issuing a single language utterance for the entire task
- Our approach allows users to provide language corrections at any point during execution, allowing the robot to adapt online
Introduction
- Research in natural language for robotics has focused on interactions between humans and robots
- Existing systems require large amounts of data or make restrictive assumptions
- Robot in Figure 1 trying to execute a long-horizon task with several critical states
- Existing approaches fail to complete the task repeatably
- User can provide real-time corrections to refine the control space
- LILAC: Language-Informed Latent Actions with Corrections allows for adaptation
- LILAC learns from 10-20 demonstrations instead of thousands
- Split agency between human and robot
- User study shows LILAC has higher task success rates
- LILAC is more reliable, precise, and easy to use
- LILA learns a single, static mapping
- Online language corrections allow user to quickly diagnose the problem and refine the robot’s behavior
Related work
- Incorporating language corrections for manipulation
- Learning language-conditioned policies
- Incorporating other forms of corrective feedback
- Data-efficient online corrections
- Post-hoc corrections
- No need for hand-designed correction primitives
- No need for prior environment dynamics
- Real-time, online approach
Lilac: framing corrections
- LILAC builds off of LILA, an architecture introduced by Karamcheti et al.
- LILAC incorporates natural language corrections in a data-driven way.
- LILAC focuses on directional and referential corrections.
Problem statement
- Problem defined by elements (S, A, T , U, C * , Z)
- S denotes state of robot and environment
- A denotes robot’s 6-DoF delta in end-effector pose
- T is a stochastic unobserved transition function
- U denotes high-level natural language instruction
- C* denotes stack of language corrections
- Z denotes user-provided input via low-dimensional control device
- Goal is to learn a function F that maps state, input, instruction, and corrections to robot action
Modeling: inference & learning
- Given a state, language, and c, F maps low-dimensional user control inputs to high-dimensional robot actions.
- The state space consists of robot’s proprioceptive state and object positions.
- Language is encoded using a last-in-first-out strategy and a frozen Distil-RoBERTa language model.
- GPT-3 is used to modulate the amount of state information.
- FiLM is used to incorporate language.
- A two-layer MLP is used to predict basis vectors.
- Gram-Schmidt is used to orthonormalize the basis vectors.
- The dataset consists of language and trajectory pairs.
- The training process is framed as a state-and-language conditional autoencoder.
- The loss function minimizes the mean squared error between the high-DoF robot action and the reconstructed action.
Gating instructions vs. corrections
- LILAC is a computer science approach to scaling language
- Different language utterances require different amounts of object/environment state-dependence
- An example utterance is given to illustrate the concept of state-dependence
- A gating function is used to predict a discrete value to signify state-independence
- GPT-3 is used to identify corrections
Reproducibility
- Released an open source codebase with complete pipeline
- Model architecture uses GELU activation and 128 parameters
- Training is efficient and can be done on consumer laptop CPUs
- Training uses AdamW optimizer with default learning rate and weight decay
- Dataset consists of high-level task and correction utterances
User study preliminaries
- Evaluating LILAC against language-conditioned approaches for full and shared autonomy
- User study conducted with 12 participants
- Environment is a multi-task “desk” environment with 5 tasks of varying complexity
- 50 full-task demonstrations collected
- Correction demonstrations collected with associated language utterances
- Participants recruited from university students, 8 male/4 female
- Robot used is a Franka Emika Panda
- Within-subjects user study with 3 candidate methods
- Hypotheses tested regarding LILAC’s performance relative to the baseline strategies
- Baseline implementations trained on same data as LILAC
- Qualitative measures tracked via survey questions
User study results
- LILAC achieves highest success rate across all subtasks
- LILAC is significantly more performant than imitation learning and LILA baselines
- LILAC is subjectively preferred by users
- Visualizations show LILAC allows for precise, targeted control
- LILAC allows users to stay closer to training state distribution
Training state distribution lila (no corrections) lilac (ours)
Discussion
- Limitations of current approach
- Need for context-sensitive language corrections
- Easily overused corrections
- Need for more natural and intuitive control spaces
- Ambiguous interpretation of corrections
Conclusion.
- Argued that scalable systems for language-driven human-robot interaction must be able to exhibit adaptivity and sample efficiency
- Presented LILAC as a potential answer
- LILAC is built within the shared autonomy paradigm
- User study comparing LILAC with language-conditioned imitation learning and language-informed shared autonomy
- LILAC is subjectively preferred by users and objectively performant
- LILAC incorporates language corrections efficiently
- GPT-3 used to provide transfer learning
- Results from user study across three conditions
- Qualitative trajectories across different control strategies
- Fully autonomous imitation learning fails, LILA and LILAC able to reach objects but fail to precisely aim and grasp
- LILA deviates from observed state distribution, LILAC close to those seen at training