Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Problem of training an instruction-following agent through user feedback
- Human users instruct agent using natural language and provide binary feedback
- Learning cast as a contextual bandit problem
- 15.4% absolute improvement in instruction execution over time
- Robust to design variations
- Feedback signal equivalent to supervised demonstration data
Paper Content
Introduction
- Human-agent interactions expose language learning signals
- Example of signal: explicit feedback from users
- Learning from this signal reduces data costs and enables continual improvement
- Signal differs from gold-standard annotated data
- Learning from user feedback in human-agent interactions is studied in this paper
- Setup: two participants collaborate towards a common goal in a shared world
- Challenge: complexity of learning signal
- Approach: contextual bandit scenario
- Experiment: dramatic improvements in agent behavior observed
Technical overview
- Two participants collaborate to collect sets of matching cards in a 3D environment
- Leader plans and describes follower’s part of the plan using natural language instructions
- Follower’s role is to follow instructions
- Leader can provide binary feedback signals to the follower
- Agent’s task is to map natural language instructions and observations to follower actions
- Goal is to generate a sequence of observations and actions, ending with STOP
- Agent parameters are optimized through rounds of continual learning
- Main metric is instruction execution accuracy, evaluated through human judgments and user feedback
Continual learning
- Estimate policy parameters from user feedback
- Process progresses in rounds
- Each round includes deploying agent policy, computing rewards from user feedback, and optimizing policy parameters
- Initialize process with policy parameterized by θ 1 estimated on human demonstration data
Deployment interactions
- Users collaborate with the agent and give it tasks by typing natural language instructions.
- For each instruction, a sequence of actions is sampled from the policy.
- The user can provide binary feedback signals and manually reboot the follower at any point during instruction execution.
Dataset construction
- A training dataset is created from traces collected in round ρ
- Each example in the dataset is a tuple of instruction, agent observation, and numerical reward
- Positive feedback is given a value of +1 and negative feedback is given a value of -1
- Reward is computed from feedback signals and corrected for human response delay
- If no feedback is given or reward is 0, no example is created
- Reward is heuristically propagated to actions that otherwise receive no reward
- Reward is prevented from being propagated if it is noisy or results in an invalid set
Parameter optimization
- Maximize expected immediate reward
- Train from scratch at end of each round
- Process initial human demonstration data to same form as reward data
- Create example with action reward set to +1
- Use IPS to debias policies and avoid exploding gradients
- Clip IPS coefficient to max of 1 to avoid overfitting
- Update parameters with gradient updates across batches
Experimental setup
- Initialization Data: 8,790 instructions from 456 randomly sampled human-human interactions used for demonstration training dataset
- Model: Neural network with parameters θ, instruction and observation embedded independently, convolutions used to mix features from both inputs, modified LIN-GUNET used to generate action distribution
- Deployment: Fixed number of interactions per user, users instructed to use reboot button sparingly, feedback used to train agent for future rounds
- Evaluation: Post-hoc manual evaluation, accuracy adjusted down for instructions rebooted by user
Results and analysis
- Conducted two experiments
- 11-round experiment to observe long-term learning and user behavior trends
- 5-round experiment to compare learning design decisions
Long-term experiment
- Evaluated over 11 rounds of deployment and training
- Collected 3,368 games and 46,573 instructions
- Cost of $15,944.45 USD
- Agent accuracy improved from 66.7 to 82.1
- Game score increased from 3.3 to 5.3
- User perception of agent improved over time
- Error types decreased significantly over time
- Instruction length remained stable
- Instruction content changed over time
Comparison of learning design choices
- Conducted a second deployment experiment to study the impact of initial demonstration data, negative feedback, and reward propagation heuristics
- Compared feedback learning signal to supervised demonstration data
- Designed and deployed five system variations
- Reward propagation has minor benefit relative to the simple reward
- Negative feedback is important
- Can likely start with a much weaker agent
- Feedback data is roughly equivalent to supervised data as a learning signal
- Training through user interaction is less expensive
Related work
- Learning for instruction commonly relies on data with varying levels of supervision
- Data includes gold-standard human demonstrations and goal annotations
- Learning and deployment shifted into human-agent interactions
- Limited work on continual learning for language-related tasks
- Cast learning as a contextual bandit problem
- Human feedback through annotation or selection of intended output studied for semantic parsing and summarization
- Post-interaction feedback studied in the context of dialogue
- Focus on sequential execution of instructions with realtime feedback
- Inspired by TAMER and COACH
Discussion
- Proposed approach for learning to follow instructions through interaction with human users
- Demonstrated effectiveness through multiple rounds of training and deployment
- Experimented with various learning design decisions, showing robustness of learning signal and approach
C evaluation
- Randomly sample instruction execution traces for manual evaluation
- Excluding rebooted instructions from evaluation creates a biased sample
- Adjusted correctness rate is calculated based on reboot rates
- Assume all rebooted instructions are incorrect executions
- Hex distance between stopping positions is used to measure accuracy
D crowdsourcing details
- Qualified workers must have a HIT approval rate of over 90% with at least 100 approved HITs.
- Qualified workers receive a $2.00 bonus.
- Workers are split into two pools: expert and novice.
- Expert workers receive a 50% higher bonus per game than novice workers.
E additional results
- Five systems were deployed for five rounds in an experiment comparing learning design choices.
- Users were asked to provide Likert distribution for three post-interaction statements.
E.2 evaluation on static data
- Evaluation of instruction-following agents is done using static data from Suhr et al. (2019).
- Improvement in SWSD is due to adding training data from human-agent interactions.
- Evaluation also includes instruction execution accuracy, game scores, and feedback from users.
- Static evaluation data is measured using SWSD.