Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs encode parametric knowledge about world facts and have shown good performance in knowledge-driven NLP tasks.
- LLMs may overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
- This paper seeks to assess and enhance LLMs’ contextual faithfulness.
- Opinion-based prompts and counterfactual demonstrations are the most effective methods for improving faithfulness.
- Experiments on three datasets of two standard NLP tasks show significant improvement in faithfulness to contexts.
Paper Content
Introduction
- Large language models have made advances in solving NLP problems
- LLMs can answer factual questions without external context
- LLMs can achieve comparable results to supervised approaches
Knowledge conflict
- Evaluated knowledge conflict setting using counterfactual datasets
- Measured frequency of exact match of original and substituted answers
- Used memorization ratio to assess model’s reluctance to update prediction
- Conducted experiments in three different settings
- Combination of OPIN + IN-STR prompting and counterfactual demonstrations most effective
- Opinion-based prompts generally perform better than other templates
- Counterfactual demonstrations lead to improved performance
Prediction with abstention
- Dataset created based on RealTime QA
- Added “I don’t know” choice and relabeled dataset
- 113 test instances, 63 answerable and 50 unanswerable
- Calculate probability of choice and take largest as prediction
- Report accuracy on all, answerable and unanswerable
- OPIN + INSTR prompt best for both zero-shot and few-shot settings
- OPIN prompt second best
- All proposed prompting templates better on unanswerable instances
- Use of demonstrations improves LLMs’ ability to make selective predictions
Related work
- LLMs have shown promising results in closed-book QA tasks
- Facts stored in model parameters may become outdated
- Some studies have explored ways to identify and edit facts stored in model parameters
- Selective prediction with abstention is an important problem in trustworthy AI
- Abstention is preferred when context is irrelevant to the question
- Neeman et al. propose answerability augmentation to address the problem
Method
- Focuses on context-specific NLP tasks
- Input is (c, q) for free-form generation tasks, (c, q, o) for tasks with close decision spaces
- Desired output is either free-form text or a choice
- Solved by prompting LLMs
- Two proposed methods: opinion-based prompts and counterfactual demonstrations
Opinion-based prompting
- Given an input (c, q, o), a base prompting template is used
- Two types of prompting templates are investigated: opinion-based and instructed
- Opinion-based prompts transform questions into opinion-seeking questions
- Instructed prompts explicitly instruct LLMs to read the context
- APE is used to generate instructions for prompted questions
- Experiments show that all prompting templates perform better than the base prompting template
Demonstrations
- Demonstrations are a standard way to do few-shot inference on LLMs.
- Previous works propose to finetune LLMs using counterfactual instances.
- We propose to use counterfactual instances as demonstrations for LLMs.
- We use KATE to retrieve the most relevant counterfactual instances.
- We encode test and counterfactual instances with RoBERTa nli+sts-b.
- We select top counterfactual instances based on cosine similarity.
- Using original (factual) instances as demonstrations underperforms counterfactual demonstrations.
Experiments
- Experimental setups for evaluation of proposed methods
- Knowledge conflict and prediction with abstention
- Additional analysis on results across different model sizes and original datasets
- Case study with examples of prompts and LLMs’ outputs
Experimental setup
- Experiments conducted using Instruct-GPT model
- Compared baseline against prompting templates
- Evaluated effectiveness of templates in zero-shot and few-shot settings
Additional analysis
- OPIN + INSTR consistently outperforms other prompts across different model sizes
- Memorization ratio decreases with increased model size
- Larger LLMs have more severe memorization on full evaluation set
- Opinion-based prompts yield similar or better results on original dataset
Case study
- LLM ignores context and returns memorized answer
- Opinion-based prompts and instructions lead to more faithful response
- Base prompts return a choice, potentially incorrect
- Opinion-based prompts and instructions can abstain from making predictions and return “I don’t know”
Conclusion
- LLMs may ignore context and make unfaithful predictions
- Two methods, opinion-based prompts and counterfactual demonstrations, can improve LLMs’ faithfulness to contexts
- Evaluated methods on two tasks, machine reading comprehension and relation extraction, and observed significant improvement in faithfulness