Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Real-world applications of language models involve human-LM interaction.
HALIE is a new framework to evaluate human-LM interaction.
HALIE captures interactive process, subjective experience, and preference.
Five tasks are designed to capture different forms of interaction.
Better non-interactive performance does not always translate to better human-LM interaction.

Paper Content

Introduction

Language models have advanced and can be used for a wide range of tasks
Evaluation of language models is currently non-interactive
Most benchmarks focus on non-interactive evaluation
HALIE framework expands on non-interactive evaluation by considering the full process, first-person experience, and preference beyond quality
LMs are already used interactively to brainstorm, paraphrase, reformulate, autocomplete, and write code
Goal is to augment human capabilities rather than automate them

Social dialogue

Dialogue is a popular mode of interaction for language models
We evaluate human-LM interaction in the context of open-ended dialogue about social situations
Task: given a social scenario, users converse with a system until they choose to finish
System logic: user input, possible actions, updating dialogue history
User study: 189 crowd workers, 10 scenarios, survey questions
Results: instruction tuning improves performance on most quality metrics, but not specificity
Users may prefer to interact with a more specific LM

Question answering

Question answering is a task in NLP
Users can query a system multiple times to answer a question
System consists of multiple-choice question, user input, and system output
342 crowd workers recruited to answer questions with and without assistance from an LM
Users with LM assistance generally outperformed an LM alone
Count number of queries needed to answer each question as a proxy measurement for efficiency
TextDavinci achieved highest accuracy while requiring least effort
TextBabbage performed better than Davinci on most metrics
Instruction tuned models were perceived most favorably in survey evaluation
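
The query-count proxy for efficiency mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the event log format, the action names ("query", "answer"), and the user and question identifiers are all hypothetical.

```python
from statistics import mean

# Hypothetical interaction log: one (user_id, question_id, action_kind) per event.
events = [
    ("u1", "q1", "query"), ("u1", "q1", "query"), ("u1", "q1", "answer"),
    ("u1", "q2", "answer"),                      # answered without querying the LM
    ("u2", "q1", "query"), ("u2", "q1", "answer"),
]

def queries_per_question(events):
    """Count LM queries issued for each (user, question) pair."""
    counts = {}
    for user, question, kind in events:
        key = (user, question)
        counts.setdefault(key, 0)   # keep questions answered with zero queries
        if kind == "query":
            counts[key] += 1
    return counts

counts = queries_per_question(events)
average_queries = mean(counts.values())  # lower means less effort per question
```

Keeping zero-query questions in the dictionary matters: dropping them would inflate the average and make a model look more demanding than it was.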

Crossword puzzles

Crossword puzzles have been studied as a challenging task for AI systems
Solving a crossword puzzle is a generative task requiring open-ended responses to clues
Crossword puzzle task provides additional structure, whereby a user can check whether a candidate answer satisfies the lexical constraints of the puzzle
Clues are often not straightforward and a user might need to reformulate the query to find the desired information
System logic includes a state of a crossword puzzle, selected clue, user letters entered in the puzzle, dialogue history, and user input
User study recruited 350 workers on Amazon Mechanical Turk, split across four language models and five puzzles
Survey questions asked users to rate different qualities of the AI assistant on a 5-point Likert scale
Results show that users significantly preferred TextDavinci over other models with respect to helpfulness
Misinformation was particularly pernicious using TextBabbage
Short prompts exacerbate misinformation and toxicity
Users demonstrate diverse engagement behavior

Text summarization

Text summarization is a long-standing problem in NLP
We focus on human-LM interaction for single-document summarization
System feeds previous human-edited summaries back to the LM as in-context examples to improve future summaries
Task is to edit model-generated summary to be consistent, relevant, and coherent
964 documents randomly selected from XSum dataset
39 crowd workers recruited on Amazon Mechanical Turk
Summary-level questions ask about the consistency, relevance, and coherence of the original and edited summaries
Session-level questions evaluate users’ overall perceptions of the summarization system
100 documents randomly sampled and assessed by 3 different evaluators

Metaphor generation

Metaphors are used to communicate complex or abstract ideas
Creating metaphors requires divergent, lateral thinking
Prior work designed metaphor generation tools to help with ideation
Task is to write metaphorical sentences that evoke a given metaphor
System logic consists of seed metaphor, user sentence history, user input, and system suggestions
32 workers recruited on Amazon Mechanical Turk to come up with metaphorical sentences using the system
10 minutes given to each user to come up with as many sentences as possible
Evaluation criteria from Gero and Chilton (2019) used

Framework

Introduce HALIE framework for evaluating human-LM interaction
Describe tasks and system construction for studying human-LM interaction
Use interaction traces to represent interaction process
Propose dimensions and metrics for evaluating human-LM interaction

Solving tasks interactively

Studying human-LM interaction in the context of tasks
Five tasks studied: social dialogue, question answering, crossword puzzles, text summarization, and metaphor generation
Tasks span from goal-oriented to open-ended
High coverage on common usages of LMs reported by Ouyang et al. (2022)

Constructing an interactive system

Human-LM interaction can be designed in different ways
Language models take a text prompt and decoding parameters as input and return a set of text completions
System logic defines states, user actions, and a transition function
Interaction traces are sequences of state-action pairs
Design considerations include how much control users have over prompts, decoding parameters, and how completions are shown to users
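
The pieces listed above (an LM call, states, user actions, a transition function, and a trace of state-action pairs) can be put together in a small sketch. Everything here is hypothetical: the class and function names, the "send" and "finish" action kinds, and the echoing stand-in LM are illustrative choices, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class State:
    history: Tuple[str, ...] = ()   # dialogue history so far
    done: bool = False

@dataclass(frozen=True)
class Action:
    kind: str        # hypothetical action kinds: "send" or "finish"
    text: str = ""

def fake_lm(prompt: str, temperature: float = 0.0, n: int = 1) -> List[str]:
    """Stand-in LM: a text prompt plus decoding parameters in, n completions out."""
    last_line = prompt.split("\n")[-1]
    return [f"echo: {last_line}" for _ in range(n)]

def transition(state: State, action: Action, lm=fake_lm) -> State:
    """System logic: map the current state and a user action to the next state."""
    if action.kind == "finish":
        return State(state.history, done=True)
    prompt = "\n".join(state.history + (action.text,))
    reply = lm(prompt, temperature=0.0, n=1)[0]
    return State(state.history + (action.text, reply))

def run(actions):
    """Apply actions in order and record the trace of (state, action) pairs."""
    state, trace = State(), []
    for action in actions:
        trace.append((state, action))
        state = transition(state, action)
    return state, trace

final_state, trace = run([Action("send", "hi"), Action("finish")])
```

In this sketch the design considerations show up as code decisions: the user never sees or edits the prompt, the decoding parameters are fixed, and only the first completion is shown.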

Evaluating human-LM interaction

Evaluation of human-LM interaction should consider the entire interaction process, not just the final output.
Evaluation should reflect the perspective of the user who interacts with the LM.
Evaluation should consider both objective metrics (e.g. accuracy) and subjective metrics (e.g. enjoyment).
Evaluation should consider all combinations of the three dimensions (targets, perspectives, and criteria).
Metrics can be proxies for different quality or preference metrics depending on the task.
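
To make "all combinations of the three dimensions" concrete, the sketch below enumerates them. The two values per dimension are taken from the distinctions drawn in this summary (output versus process, third-party versus first-person, quality versus preference); the variable names are illustrative only.

```python
from itertools import product

TARGETS = ("output", "process")
PERSPECTIVES = ("third-party", "first-person")
CRITERIA = ("quality", "preference")

# Every (target, perspective, criterion) combination: 2 * 2 * 2 = 8 cells.
cells = list(product(TARGETS, PERSPECTIVES, CRITERIA))

# Non-interactive evaluation occupies a single cell of this grid.
non_interactive = ("output", "third-party", "quality")
uncovered = [cell for cell in cells if cell != non_interactive]

for target, perspective, criterion in cells:
    print(f"{target:8} {perspective:12} {criterion}")
```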

Experiments

Introduction of five tasks and four state-of-the-art LMs
Evaluation of human-LM interaction
Text summarization evaluated by third-party evaluators for consistency, relevance, and coherence
First-person perspectives evaluated by edit distance and survey responses for helpfulness and improvement
Discrepancy between metrics based on third-party and first-person perspectives
TextDavinci better at in-context learning user’s preferences for summarization
Davinci required the least effort and had the highest acceptance rate for suggestions
TextDavinci most helpful and satisfactory according to survey responses
Third-party evaluators considered sentences written with TextDavinci to be the worst
Satisfaction and reuse may depend on different factors
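
The edit distance used above as a first-person measure for summarization can be illustrated with a standard Levenshtein distance between the model-generated summary and the user's edited version. The summary does not say which variant or granularity the authors used, so the word-level comparison below is an assumption, and the example sentences are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))          # distance from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance to empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

original = "the cat sat".split()
edited = "the dog sat down".split()
word_edits = edit_distance(original, edited)   # one substitution, one insertion
```

A small distance means the user accepted the summary nearly as generated, so it can serve as a rough proxy for how much effort the model saved.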

Related work

Evaluation of LM-based language generation systems distinguishes between model completions and interaction traces
Evaluation of model completions has a long history in NLP
Evaluation traditionally centers on specific tasks
Users perceived TextDavinci to be the most helpful and were most satisfied with its results
Users perceived Davinci to be the easiest to work with and were most willing to reuse it
Positive correlation between helpfulness, satisfaction, ease of interaction, and willingness to reuse
Evaluation of interaction traces covers broader range of evaluation dimensions
Interactive systems studied in many disparate communities
Interaction often understood as feedback signal for model improvement
Focus on broader user experience with LMs

Discussion

Challenges encountered while designing tasks and systems for benchmarking LMs in interactive settings
Potential solutions and paths forward for benchmarking LMs in interactive settings

Low latency matters

Latency affects human-LM interaction and user perception.
Guidelines recommend interactive systems respond within 0.1-1 seconds.
Some models were too slow, which influenced user perception and led to their exclusion.
Meeting latency standards can be a challenge, especially with larger models.
Factors beyond model scale (e.g. compute resources, query optimization, API support) affect latency.
Latency is crucial for positive user experience in real world.
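
One way to check a system against the response-time guideline above is to time each LM call and report the share that finish within the limit. This is a sketch under stated assumptions: the 1.0 second limit is the upper end of the quoted range, and the helper names are hypothetical.

```python
import time

MAX_LATENCY_S = 1.0  # assumed limit: upper end of the 0.1-1 second guideline

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fraction_within_limit(latencies, limit=MAX_LATENCY_S):
    """Fraction of recorded latencies at or below the limit."""
    if not latencies:
        return 0.0
    return sum(t <= limit for t in latencies) / len(latencies)

# Hypothetical latencies (seconds) recorded during a pilot run.
pilot = [0.2, 0.9, 1.5, 3.0]
share_ok = fraction_within_limit(pilot)
```

Measuring at the point where the user waits, rather than at the model server, captures the other factors listed above, such as API overhead and query handling.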

Complexity of interactive study design

User study design must account for individual differences in how users interact with LMs.
It is desirable for each user to interact with most/all models.
Sequential effects need to be controlled for to avoid undesired confounding.
It is recommended to recruit a large number of diverse users to allow for more flexibility.
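
When each user interacts with several models, one common way to control for sequential effects is to counterbalance the order with a Latin square. The sketch below uses a simple cyclic square; it is an illustration of the general technique, not the assignment procedure used in the paper, and the model labels are placeholders.

```python
def latin_square_orders(models):
    """Cyclic Latin square: each model appears once in every position."""
    n = len(models)
    return [[models[(row + col) % n] for col in range(n)] for row in range(n)]

def assign_order(user_index, models):
    """Give each user one row of the square, cycling through rows."""
    orders = latin_square_orders(models)
    return orders[user_index % len(orders)]

models = ["A", "B", "C", "D"]          # placeholder model labels
orders = latin_square_orders(models)
order_for_user_5 = assign_order(5, models)
```

A cyclic square balances position but not which model immediately precedes which, so a study concerned with carryover effects would need a stricter design or randomized orders across a larger pool of users.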

Potential impact on users

LMs can generate toxic, biased, or otherwise undesirable text
Exposure to this text can cause psychological harm
Responsible deployment of interactive human-LM systems requires a harm mitigation strategy
User accommodation reflects underlying properties of the models
Task framing can affect user accommodation
Human-LM interaction can have lasting and longitudinal impacts on users
