
  • Problem of training an instruction-following agent through user feedback
  • Human users instruct agent using natural language and provide binary feedback
  • Learning cast as a contextual bandit problem
  • 15.4% absolute improvement in instruction execution accuracy over time
  • Robust to design variations
  • Feedback is roughly equivalent to supervised demonstration data as a learning signal

Paper Content


  • Human-agent interactions expose language learning signals
  • Example of signal: explicit feedback from users
  • Learning from this signal reduces data costs and enables continual improvement
  • Signal differs from gold-standard annotated data
  • Learning from user feedback in human-agent interactions is studied in this paper
  • Setup: two participants collaborate towards a common goal in a shared world
  • Challenge: complexity of learning signal
  • Approach: contextual bandit scenario
  • Experiment: dramatic improvements in agent behavior observed

Technical overview

  • Two participants collaborate to collect sets of matching cards in a 3D environment
  • Leader plans and describes follower’s part of the plan using natural language instructions
  • Follower’s role is to follow instructions
  • Leader can provide binary feedback signals to the follower
  • Agent’s task is to map natural language instructions and observations to follower actions
  • An instruction execution is a sequence of observations and actions, with STOP as the final action
  • Agent parameters are optimized through rounds of continual learning
  • Main metric is instruction execution accuracy, evaluated through human judgments and user feedback

Continual learning

  • Estimate policy parameters from user feedback
  • Process progresses in rounds
  • Each round includes deploying agent policy, computing rewards from user feedback, and optimizing policy parameters
  • Initialize process with policy parameterized by θ₁ estimated on human demonstration data
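
The round structure above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `train` and `deploy` are hypothetical callables supplied by the caller, the example tuple layout is an assumption, and aggregating data across rounds before retraining is also an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical example layout: (instruction, observation, action, reward).
Example = Tuple[str, str, str, float]


def continual_learning(
    demo_data: List[Example],
    num_rounds: int,
    train: Callable[[List[Example]], object],
    deploy: Callable[[object], List[Example]],
) -> List[object]:
    """Alternate deployment and training for a fixed number of rounds."""
    # The initial policy is estimated on human demonstration data only.
    policy = train(demo_data)
    policies = [policy]
    data = list(demo_data)  # copy so the caller's list is not modified
    for _ in range(num_rounds):
        # Deploy the current policy and collect reward-labeled examples.
        round_examples = deploy(policy)
        # Assumption: data from all rounds is aggregated before retraining.
        data = data + round_examples
        policy = train(data)
        policies.append(policy)
    return policies
```

Passing `train` and `deploy` as arguments keeps the loop independent of any particular model or interaction platform.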

Deployment interactions

  • Users collaborate with the agent and give it tasks by typing natural language instructions.
  • For each instruction, a sequence of actions is sampled from the policy.
  • The user can provide binary feedback signals and manually reboot the follower at any point during instruction execution.
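
Sampling an action sequence for one instruction might look like the sketch below. The callables `observe`, `policy`, and `step`, the `max_steps` cap, and the action names are all hypothetical and chosen for illustration. User reboots are left out for brevity.

```python
import random
from typing import Callable, Dict, Hashable, List, Optional, Tuple

STOP = "STOP"


def sample_execution(
    instruction: str,
    observe: Callable[[], Hashable],
    policy: Callable[[str, Hashable], Dict[str, float]],
    step: Callable[[str], None],
    max_steps: int = 25,
    rng: Optional[random.Random] = None,
) -> List[Tuple[Hashable, str, float]]:
    """Sample actions from the policy until STOP or the step cap is reached."""
    rng = rng or random.Random()
    trace = []
    for _ in range(max_steps):
        obs = observe()
        probs = policy(instruction, obs)  # maps each action to its probability
        actions = list(probs)
        weights = [probs[a] for a in actions]
        action = rng.choices(actions, weights=weights, k=1)[0]
        # Record the sampling probability; off-policy corrections need it later.
        trace.append((obs, action, probs[action]))
        if action == STOP:
            break
        step(action)  # apply the action in the environment
    return trace
```

Storing the probability of each sampled action alongside the trace is a design choice that makes it possible to reweight examples when a later policy is trained on this data.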

Dataset construction

  • A training dataset is created from traces collected in round ρ
  • Each example in the dataset is a tuple of instruction, agent observation, action, and numerical reward
  • Positive feedback is given a value of +1 and negative feedback is given a value of -1
  • Reward is computed from feedback signals and corrected for human response delay
  • If no feedback is given or reward is 0, no example is created
  • Reward is heuristically propagated to actions that otherwise receive no reward
  • Reward is prevented from being propagated if it is noisy or results in an invalid set
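
A rough sketch of turning feedback signals into reward examples follows. The delay value, the rule of attributing each signal to the latest action that started at least `delay` seconds earlier, and the rule of summing multiple signals per action are assumptions made for illustration, not the paper's exact procedure. Reward propagation is omitted.

```python
from bisect import bisect_right
from typing import List, Tuple

Step = Tuple[float, str, str]               # (start time, observation, action)
Feedback = Tuple[float, int]                # (time, +1 or -1)
RewardExample = Tuple[str, str, str, float]  # (instruction, observation, action, reward)


def build_examples(
    instruction: str,
    steps: List[Step],
    feedback: List[Feedback],
    delay: float = 0.2,  # hypothetical human response delay, in seconds
) -> List[RewardExample]:
    """Attribute feedback to actions and keep only actions with nonzero reward."""
    times = [t for t, _, _ in steps]  # assumed sorted in increasing order
    totals = [0] * len(steps)
    for t, signal in feedback:
        # Shift the signal back by the response delay, then find the latest
        # action that had already started at that corrected time.
        idx = bisect_right(times, t - delay) - 1
        if idx >= 0:
            totals[idx] += signal
    examples = []
    for (_, obs, action), total in zip(steps, totals):
        if total == 0:
            continue  # no feedback, or signals cancel out: no example
        reward = 1.0 if total > 0 else -1.0
        examples.append((instruction, obs, action, reward))
    return examples
```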

Parameter optimization

  • Maximize expected immediate reward
  • Train from scratch at end of each round
  • Process initial human demonstration data to same form as reward data
  • Create example with action reward set to +1
  • Use inverse propensity scoring (IPS) to debias policies and avoid exploding gradients
  • Clip IPS coefficient to max of 1 to avoid overfitting
  • Update parameters with gradient updates across batches
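
The clipped IPS weighting can be illustrated for a single example with a softmax policy over action logits. This is a sketch under stated assumptions: the coefficient is taken to be the ratio of the current policy's probability to the probability under the policy that collected the data (a common form of IPS), it is treated as a constant when differentiating, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_ips(p_current: float, p_behavior: float, max_coef: float = 1.0) -> float:
    """IPS ratio, clipped to a maximum of `max_coef`."""
    return min(max_coef, p_current / p_behavior)


def reward_gradient(
    logits: List[float], action: int, reward: float, p_behavior: float
) -> List[float]:
    """Gradient of coef * reward * log pi(action) with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    coef = clipped_ips(probs[action], p_behavior)
    return [
        coef * reward * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]
```

With two equally likely actions, a positive example whose action was collected with probability 0.25 has ratio 2, which is clipped to 1. A negative example collected with probability 1.0 has ratio 0.5, so its update is scaled down as the current policy moves away from that action.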

Experimental setup

  • Initialization Data: 8,790 instructions from 456 randomly sampled human-human interactions used for demonstration training dataset
  • Model: Neural network policy with parameters θ; the instruction and observation are embedded independently, convolutions mix features from both inputs, and a modified LingUNet generates the action distribution
  • Deployment: Fixed number of interactions per user, users instructed to use reboot button sparingly, feedback used to train agent for future rounds
  • Evaluation: Post-hoc manual evaluation, accuracy adjusted down for instructions rebooted by user

Results and analysis

  • Conducted two experiments
  • 11-round experiment to observe long-term learning and user behavior trends
  • 5-round experiment to compare learning design decisions

Long-term experiment

  • Evaluated over 11 rounds of deployment and training
  • Collected 3,368 games and 46,573 instructions
  • Cost of $15,944.45 USD
  • Agent accuracy improved from 66.7% to 82.1%
  • Game score increased from 3.3 to 5.3
  • User perception of agent improved over time
  • Error types decreased significantly over time
  • Instruction length remained stable
  • Instruction content changed over time

Comparison of learning design choices

  • Conducted a second deployment experiment to study the impact of initial demonstration data, negative feedback, and reward propagation heuristics
  • Compared feedback learning signal to supervised demonstration data
  • Designed and deployed five system variations
  • Reward propagation has minor benefit relative to the simple reward
  • Negative feedback is important
  • Learning can likely start from a much weaker initial agent, trained on less demonstration data
  • Feedback data is roughly equivalent to supervised data as a learning signal
  • Training through user interaction is less expensive

Related work

  • Learning for instruction commonly relies on data with varying levels of supervision
  • Data includes gold-standard human demonstrations and goal annotations
  • Learning and deployment shifted into human-agent interactions
  • Limited work on continual learning for language-related tasks
  • Cast learning as a contextual bandit problem
  • Human feedback through annotation or selection of intended output studied for semantic parsing and summarization
  • Post-interaction feedback studied in the context of dialogue
  • Focus on sequential execution of instructions with realtime feedback
  • Inspired by TAMER and COACH

Conclusion


  • Proposed approach for learning to follow instructions through interaction with human users
  • Demonstrated effectiveness through multiple rounds of training and deployment
  • Experimented with various learning design decisions, showing robustness of learning signal and approach

C evaluation

  • Randomly sample instruction execution traces for manual evaluation
  • Excluding rebooted instructions from evaluation creates a biased sample
  • Adjusted correctness rate is calculated based on reboot rates
  • Assume all rebooted instructions are incorrect executions
  • Hex distance between stopping positions is used to measure accuracy
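
The reboot adjustment described above can be written as a small calculation. This sketch assumes the manually evaluated sample contains only non-rebooted instructions and that every rebooted instruction counts as incorrect. The function name and the numbers in the usage note are illustrative, not results from the paper.

```python
def adjusted_correctness(
    sample_correct: int,
    sample_size: int,
    num_rebooted: int,
    num_total: int,
) -> float:
    """Scale the sampled correctness rate by the share of non-rebooted instructions."""
    if sample_size <= 0 or num_total <= 0:
        raise ValueError("sample_size and num_total must be positive")
    reboot_rate = num_rebooted / num_total
    sampled_rate = sample_correct / sample_size
    # Rebooted instructions contribute zero correct executions.
    return sampled_rate * (1.0 - reboot_rate)
```

For example, if 90 of 100 sampled non-rebooted executions are judged correct and 200 of 1,000 instructions were rebooted, the adjusted rate is 0.9 × 0.8 = 0.72.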

D crowdsourcing details

  • Qualified workers must have a HIT approval rate of over 90% with at least 100 approved HITs.
  • Qualified workers receive a $2.00 bonus.
  • Workers are split into two pools: expert and novice.
  • Expert workers receive a 50% higher bonus per game than novice workers.

E additional results

  • Five systems were deployed for five rounds in an experiment comparing learning design choices.
  • Users rated three post-interaction statements on a Likert scale, and the distribution of ratings is reported.

E.2 evaluation on static data

  • Agents are also evaluated on the static data from Suhr et al. (2019), using success weighted by stopping distance (SWSD).
  • Improvement in SWSD is due to adding training data from human-agent interactions.
  • This complements the main evaluation, which covers instruction execution accuracy, game scores, and feedback from users.