Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Studies the problem of training an instruction-following agent from user feedback
Human users instruct the agent in natural language and provide binary feedback as they observe it executing their instructions
Learning is cast as a contextual bandit problem
15.4% absolute improvement in instruction execution accuracy over time
The approach is robust to several design variations
The feedback signal is roughly equivalent to supervised demonstration data as a learning signal

Paper Content

Introduction

Human-agent interactions expose language learning signals
Example of signal: explicit feedback from users
Learning from this signal reduces data costs and enables continual improvement
Signal differs from gold-standard annotated data
Learning from user feedback in human-agent interactions is studied in this paper
Setup: two participants collaborate towards a common goal in a shared world
Challenge: complexity of learning signal
Approach: contextual bandit scenario
Experiment: instruction execution accuracy improves by 15.4% absolute over the rounds of deployment

Technical overview

Two participants collaborate to collect sets of matching cards in a 3D environment
Leader plans and describes follower’s part of the plan using natural language instructions
Follower’s role is to follow instructions
Leader can provide binary feedback signals to the follower
Agent’s task is to map natural language instructions and observations to follower actions
Executing an instruction produces a sequence of observations and actions that ends with the STOP action
Agent parameters are optimized through rounds of continual learning
Main metric is instruction execution accuracy, evaluated through human judgments and user feedback

Continual learning

Estimate policy parameters from user feedback
Process progresses in rounds
Each round includes deploying agent policy, computing rewards from user feedback, and optimizing policy parameters
Initialize the process with a policy parameterized by θ₁, estimated on human demonstration data
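
The round structure above can be sketched as a short loop. This is a minimal illustration, not the paper's code: `deploy` and `train` are hypothetical callables standing in for the deployment and optimization steps, and the choice to retrain on the demonstrations plus all feedback data collected so far is an assumption.

```python
from typing import Callable, List, Tuple

# (instruction, observation, action, reward); field layout is illustrative.
Example = Tuple[str, str, str, float]


def continual_learning(
    demo_examples: List[Example],
    deploy: Callable[[object], List[Example]],
    train: Callable[[List[Example]], object],
    num_rounds: int,
) -> List[object]:
    """Alternate deployment and training, returning the policy of every round."""
    data = list(demo_examples)
    policy = train(data)  # initial policy, estimated from demonstrations only
    policies = [policy]
    for _ in range(num_rounds):
        # Deploy the current policy and turn user feedback into reward examples.
        new_examples = deploy(policy)
        # Assumption: data is aggregated across rounds before retraining.
        data.extend(new_examples)
        # Parameters are re-estimated from scratch at the end of the round.
        policy = train(data)
        policies.append(policy)
    return policies
```

Passing the two steps in as functions keeps the loop independent of the model and of how interactions are collected.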

Deployment interactions

Users collaborate with the agent and give it tasks by typing natural language instructions.
For each instruction, a sequence of actions is sampled from the policy.
The user can provide binary feedback signals and manually reboot the follower at any point during instruction execution.
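
Sampling an action sequence for one instruction might look like the sketch below. The `policy` and `step` callables, the `max_steps` cap, and the choice to store each action's sampling probability are all assumptions made for illustration; user reboots are not modeled.

```python
import random
from typing import Callable, Dict, Hashable, List, Tuple

Trace = List[Tuple[Hashable, str, float]]  # (observation, action, probability)


def sample_execution(
    policy: Callable[[str, Hashable], Dict[str, float]],
    instruction: str,
    initial_observation: Hashable,
    step: Callable[[Hashable, str], Hashable],
    rng: random.Random,
    max_steps: int = 25,
) -> Trace:
    """Sample actions from the policy until STOP or until max_steps is reached."""
    trace: Trace = []
    obs = initial_observation
    for _ in range(max_steps):
        dist = policy(instruction, obs)  # maps each action to its probability
        actions = list(dist)
        weights = [dist[a] for a in actions]
        action = rng.choices(actions, weights=weights, k=1)[0]
        # Keep the sampling probability; off-policy corrections need it later.
        trace.append((obs, action, dist[action]))
        if action == "STOP":
            break
        obs = step(obs, action)  # hypothetical environment transition
    return trace
```

Recording the probability at sampling time is convenient because the deployed policy is replaced after every round.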

Dataset construction

A training dataset is created from traces collected in round ρ
Each example in the dataset is a tuple of instruction, agent observation, action, and numerical reward
Positive feedback is given a value of +1 and negative feedback is given a value of -1
Reward is computed from feedback signals and corrected for human response delay
If no feedback is given or reward is 0, no example is created
Reward is heuristically propagated to actions that otherwise receive no reward
Reward is prevented from being propagated if it is noisy or results in an invalid set
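
The conversion from feedback signals to reward examples can be sketched as follows. The specifics are assumptions chosen for illustration: a fixed `delay` of 0.2 seconds for the response-delay correction, taking the sign of the summed feedback as the reward, and propagating each reward backward to earlier unrewarded actions. The paper's actual heuristics, including the guards against noisy feedback and invalid sets, are not reproduced here.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Step = Tuple[float, str, str]  # (timestamp, observation, action), sorted by time
Feedback = Tuple[float, int]   # (timestamp, +1 or -1)


def build_examples(
    instruction: str,
    steps: List[Step],
    feedback: List[Feedback],
    delay: float = 0.2,        # assumed human response delay, in seconds
    propagate: bool = True,
) -> List[Tuple[str, str, str, float]]:
    """Turn one execution trace and its feedback into reward examples."""
    times = [t for t, _, _ in steps]
    totals = [0] * len(steps)
    for t, value in feedback:
        # Shift the signal back in time, then credit the latest action before it.
        idx = bisect_right(times, t - delay) - 1
        if idx >= 0:  # feedback preceding every action is dropped
            totals[idx] += value
    rewards: List[Optional[int]] = [
        1 if s > 0 else -1 if s < 0 else None for s in totals
    ]
    if propagate:
        carried: Optional[int] = None
        for i in range(len(steps) - 1, -1, -1):
            if rewards[i] is None:
                rewards[i] = carried  # inherit from the next rewarded action
            else:
                carried = rewards[i]
    # Actions that still have no reward produce no example.
    return [
        (instruction, obs, action, float(r))
        for (_, obs, action), r in zip(steps, rewards)
        if r is not None
    ]
```

With `propagate=False` only actions that directly received non-cancelling feedback yield examples, which corresponds to the simple reward.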

Parameter optimization

Maximize expected immediate reward
Train from scratch at end of each round
Process initial human demonstration data to same form as reward data
Create example with action reward set to +1
Use inverse propensity scoring (IPS) to debias learning from data collected by earlier policies and avoid exploding gradients
Clip the IPS coefficient to a maximum of 1 to avoid overfitting
Update parameters with gradient updates across batches
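
A clipped-IPS bandit objective of the kind described above can be written in a few lines. This is a sketch under assumptions: the coefficient is the ratio of the current policy's probability to the probability recorded at deployment, and the loss is the negative reward-weighted log-probability averaged over a batch. In a real training loop the coefficient would be treated as a constant when differentiating.

```python
import math
from typing import List, Tuple


def clipped_ips(new_prob: float, logged_prob: float, max_coeff: float = 1.0) -> float:
    """Ratio of current to deployment-time action probability, clipped from above."""
    return min(new_prob / logged_prob, max_coeff)


def bandit_loss(
    examples: List[Tuple[float, float, float]],  # (reward, new_prob, logged_prob)
    max_coeff: float = 1.0,
) -> float:
    """Average of -coeff * reward * log(new_prob) over the batch."""
    total = 0.0
    for reward, new_prob, logged_prob in examples:
        coeff = clipped_ips(new_prob, logged_prob, max_coeff)
        # Minimizing this raises the probability of positively rewarded actions
        # and lowers the probability of negatively rewarded ones.
        total += -coeff * reward * math.log(new_prob)
    return total / len(examples)
```

The clip matters when the current policy already assigns an action a much higher probability than the deployed policy did: the ratio would otherwise exceed 1 and amplify that example.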

Experimental setup

Initialization Data: 8,790 instructions from 456 randomly sampled human-human interactions used for demonstration training dataset
Model: Neural network with parameters θ, instruction and observation embedded independently, convolutions used to mix features from both inputs, modified LingUNet used to generate the action distribution
Deployment: Fixed number of interactions per user, users instructed to use reboot button sparingly, feedback used to train agent for future rounds
Evaluation: Post-hoc manual evaluation, accuracy adjusted down for instructions rebooted by user

Results and analysis

Conducted two experiments
11-round experiment to observe long-term learning and user behavior trends
5-round experiment to compare learning design decisions

Long-term experiment

Evaluated over 11 rounds of deployment and training
Collected 3,368 games and 46,573 instructions
Cost of $15,944.45 USD
Agent instruction execution accuracy improved from 66.7% to 82.1%, a 15.4% absolute improvement
Game score increased from 3.3 to 5.3
User perception of agent improved over time
Error types decreased significantly over time
Instruction length remained stable
Instruction content changed over time

Comparison of learning design choices

Conducted a second deployment experiment to study the impact of initial demonstration data, negative feedback, and reward propagation heuristics
Compared feedback learning signal to supervised demonstration data
Designed and deployed five system variations
Reward propagation has minor benefit relative to the simple reward
Negative feedback is important
Learning can likely start from a much weaker initial agent
Feedback data is roughly equivalent to supervised data as a learning signal
Training through user interaction is less expensive than collecting supervised demonstration data

Related work

Learning for instruction following commonly relies on data with varying levels of supervision
Data includes gold-standard human demonstrations and goal annotations
Learning and deployment shifted into human-agent interactions
Limited work on continual learning for language-related tasks
Cast learning as a contextual bandit problem
Human feedback through annotation or selection of intended output studied for semantic parsing and summarization
Post-interaction feedback studied in the context of dialogue
Focus on sequential execution of instructions with realtime feedback
Inspired by TAMER and COACH

Discussion

Proposed approach for learning to follow instructions through interaction with human users
Demonstrated effectiveness through multiple rounds of training and deployment
Experimented with various learning design decisions, showing robustness of learning signal and approach

C evaluation

Randomly sample instruction execution traces for manual evaluation
Excluding rebooted instructions from evaluation creates a biased sample
Adjusted correctness rate is calculated based on reboot rates
Assume all rebooted instructions are incorrect executions
Hex distance between stopping positions is used to measure accuracy
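
The reboot adjustment follows from the assumption that every rebooted instruction counts as incorrect: the correctness rate measured on non-rebooted instructions is scaled by the fraction of instructions that were not rebooted. The sketch below shows that arithmetic, together with a hex distance for positions given in axial coordinates. The function names and the axial coordinate convention are assumptions; the environment may index hexes differently.

```python
from typing import Tuple


def adjusted_correctness(num_correct: int, num_evaluated: int, reboot_rate: float) -> float:
    """Correctness over all instructions, counting rebooted ones as incorrect.

    num_correct / num_evaluated is measured on non-rebooted instructions only.
    """
    if num_evaluated <= 0:
        raise ValueError("num_evaluated must be positive")
    if not 0.0 <= reboot_rate <= 1.0:
        raise ValueError("reboot_rate must be in [0, 1]")
    return (num_correct / num_evaluated) * (1.0 - reboot_rate)


def hex_distance(a: Tuple[int, int], b: Tuple[int, int]) -> int:
    """Number of hex steps between two positions in axial (q, r) coordinates."""
    dq = a[0] - b[0]
    dr = a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2
```

For example, if 90 of 100 sampled non-rebooted executions are judged correct and 10% of all instructions were rebooted, the adjusted rate is 0.9 × 0.9 = 0.81.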

D crowdsourcing details

Qualified workers must have a HIT approval rate of over 90% with at least 100 approved HITs.
Qualified workers receive a $2.00 bonus.
Workers are split into two pools: expert and novice.
Expert workers receive a 50% higher bonus per game than novice workers.

E additional results

Five systems were deployed for five rounds in an experiment comparing learning design choices.
Users were asked to rate three post-interaction statements on a Likert scale, and the distribution of ratings is reported.

E.2 evaluation on static data

Evaluation of instruction-following agents is done using static data from Suhr et al. (2019).
Improvement in SWSD is due to adding training data from human-agent interactions.
Evaluation also includes instruction execution accuracy, game scores, and feedback from users.
Performance on the static evaluation data is measured using SWSD.
