Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Real-world applications of language models involve human-LM interaction.
- HALIE is a new framework to evaluate human-LM interaction.
- HALIE captures interactive process, subjective experience, and preference.
- Five tasks are designed to capture different forms of interaction.
- Non-interactive performance does not always result in better human-LM interaction.
Paper Content
Introduction
- Language models have advanced and can be used for a wide range of tasks
- Evaluation of language models is currently non-interactive
- Most benchmarks focus on non-interactive evaluation
- HALIE framework expands on non-interactive evaluation by considering the full process, first-person experience, and preference beyond quality
- LMs are already used interactively to brainstorm, paraphrase, reformulate, autocomplete, and write code
- Goal is to augment human capabilities rather than automate them
Social dialogue
- Dialogue is a popular mode of interaction for language models
- We evaluate human-LM interaction in the context of open-ended dialogue about social situations
- Task: given a social scenario, users converse with a system until they choose to finish
- System logic: user input, possible actions, updating dialogue history
- User study: 189 crowd workers, 10 scenarios, survey questions
- Results: instruction tuning improves performance on most quality metrics, but not specificity
- Users may prefer to interact with a more specific LM
Question answering
- Question answering is a task in NLP
- Users can query a system multiple times to answer a question
- System consists of multiple-choice question, user input, and system output
- 342 crowd workers recruited to answer questions with and without assistance from an LM
- Users with LM assistance generally outperformed an LM alone
- Count number of queries needed to answer each question as a proxy measurement for efficiency
- TextDavinci achieved highest accuracy while requiring least effort
- TextBabbage performed better than Davinci on most metrics
- Instruction tuned models were perceived most favorably in survey evaluation
Crossword puzzles
- Crossword puzzles have been studied as a challenging task for AI systems
- Solving a crossword puzzle is a generative task requiring open-ended responses to clues
- Crossword puzzle task provides additional structure, whereby a user can check whether a candidate answer satisfies the lexical constraints of the puzzle
- Clues are often not straightforward and a user might need to reformulate the query to find the desired information
- System logic includes a state of a crossword puzzle, selected clue, user letters entered in the puzzle, dialogue history, and user input
- User study recruited 350 workers on Amazon Mechanical Turk, split across four language models and five puzzles
- Survey questions asked users to rank different qualities of the AI assistant on a 5-point Likert scale
- Results show that users significantly preferred Text-Davinci over other models with respect to helpfulness
- Misinformation was particularly pernicious using TextBabbage
- Short prompts exacerbate misinformation and toxicity
- Users demonstrate diverse engagement behavior
Text summarization
- Text summarization is a long-standing problem in NLP
- We focus on human-LM interaction for single-document summarization
- System provides previous human-edited summaries as examples to the system to improve future summaries
- Task is to edit model-generated summary to be consistent, relevant, and coherent
- 964 documents randomly selected from XSum dataset
- 39 crowd workers recruited on Amazon Mechanical Turk
- Summary-level questions ask consistency, relevance, and coherence of the original and edited summaries
- Session-level questions evaluate users’ overall perceptions of the summarization system
- 100 documents randomly sampled and assessed by 3 different evaluators
Metaphor generation
- Metaphors are used to communicate complex or abstract ideas
- Creating metaphors requires divergent, lateral thinking
- Prior work designed metaphor generation tools to help with ideation
- Task is to write metaphorical sentences that evoke a given metaphor
- System logic consists of seed metaphor, user sentence history, user input, and system suggestions
- 32 workers recruited on Amazon Mechanical Turk to come up with metaphorical sentences using the system
- 10 minutes given to each user to come up with as many sentences as possible
- Evaluation criteria from Gero and Chilton (2019) used
Framework
- Introduce HALIE framework for evaluating human-LM interaction
- Describe tasks and system construction for studying human-LM interaction
- Use interaction traces to represent interaction process
- Propose dimensions and metrics for evaluating human-LM interaction
Solving tasks interactively
- Studying human-LM interaction in the context of tasks
- Five tasks studied: social dialogue, question answering, crossword puzzles, text summarization, and metaphor generation
- Tasks span from goal-oriented to open-ended
- High coverage on common usages of LMs reported by Ouyang et al. (2022)
Constructing an interactive system
- Human-LM interaction can be designed in different ways
- Language models take a text prompt and decoding parameters as input and return a set of text completions
- System logic defines states, user actions, and a transition function
- Interaction traces are sequences of state-action pairs
- Design considerations include how much control users have over prompts, decoding parameters, and how completions are shown to users
Evaluating human-lm interaction
- Evaluation of human-LM interaction should consider the entire interaction process, not just the final output.
- Evaluation should reflect the perspective of the user who interacts with the LM.
- Evaluation should consider both objective metrics (e.g. accuracy) and subjective metrics (e.g. enjoyment).
- Evaluation should consider all combinations of the three dimensions (targets, perspectives, and criteria).
- Metrics can be proxies for different quality or preference metrics depending on the task.
Experiments
- Introduction of five tasks and four state-of-the-art LMs
- Evaluation of human-LM interaction
- Text summarization evaluated by third-party evaluators for consistency, relevancy, and coherence
- First-person perspectives evaluated by edit distance and survey responses for helpfulness and improvement
- Discrepancy between metrics based on third-party and first-person perspectives
- TextDavinci better at in-context learning user’s preferences for summarization
- Davinci requires least effort and highest acceptance rate for suggestions
- TextDavinci most helpful and satisfactory according to survey responses
- Third-party evaluators considered sentences written with TextDavinci to be the worst
- Satisfaction and reuse may depend on different factors
Related work
- Evaluation of LM-based language generation systems distinguishes between model completions and interaction traces
- Evaluation of model completions has a long history in NLP
- Evaluation traditionally centers on specific tasks
- Users perceived TextDavinci to be the most helpful and satisfied with results
- Users perceived Davinci to be the easiest to work with and willing to reuse the model
- Positive correlation between helpfulness, satisfaction, ease of interaction, and willingness to reuse
- Evaluation of interaction traces covers broader range of evaluation dimensions
- Interactive systems studied in many disparate communities
- Interaction often understood as feedback signal for model improvement
- Focus on broader user experience with LMs
Discussion
- Challenges encountered while designing tasks and systems for benchmarking LMs in interactive settings
- Potential solutions and paths forward for benchmarking LMs in interactive settings
Low latency matters
- Latency affects human-LM interaction and user perception.
- Guidelines recommend interactive systems respond within 0.1-1 seconds.
- Some models were too slow, influencing user perception and causing exclusion.
- Meeting latency standards can be a challenge, especially with larger models.
- Factors beyond model scale (e.g. compute resources, query optimization, API support) affect latency.
- Latency is crucial for positive user experience in real world.
Complexity of interactive study design
- User study design must account for individual differences in how users interact with LMs.
- It is desirable for each user to interact with most/all models.
- Sequential effects need to be controlled for to avoid undesired confounding.
- It is recommended to recruit a large number of diverse users to allow for more flexibility.
Potential impact on users
- LMs can generate toxic, biased, or otherwise undesirable text
- Exposure to this text can cause psychological harm
- Responsible deployment of interactive human-LM systems requires a harm mitigation strategy
- User accommodation reflects underlying properties of the models
- Task framing can affect user accommodation
- Human-LM interaction can have lasting and longitudinal impacts on users