Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs encode parametric knowledge about world facts and have shown good performance in knowledge-driven NLP tasks.
LLMs may overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
This paper seeks to assess and enhance LLMs’ contextual faithfulness.
Opinion-based prompts and counterfactual demonstrations are the most effective methods for improving faithfulness.
Experiments on three datasets of two standard NLP tasks show significant improvement in faithfulness to contexts.

Paper Content

Introduction

Large language models have made advances in solving NLP problems
LLMs can answer factual questions without external context
LLMs can achieve comparable results to supervised approaches

Knowledge conflict

Evaluated knowledge conflict setting using counterfactual datasets
Measured how often predictions exactly match the original answers and how often they match the substituted answers
Used memorization ratio to assess model’s reluctance to update prediction
Conducted experiments in three different settings
Combination of OPIN + INSTR prompting and counterfactual demonstrations most effective
Opinion-based prompts generally perform better than other templates
Counterfactual demonstrations lead to improved performance
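
The two exact-match frequencies and the memorization ratio listed above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes the commonly used definition MR = p_o / (p_o + p_s), where p_o and p_s are the exact-match rates against the original and substituted answers. The function names and the simple lowercase normalization are hypothetical.

```python
def exact_match(prediction: str, answer: str) -> bool:
    """Simplified exact match: ignore case and surrounding whitespace (an assumption)."""
    return prediction.strip().lower() == answer.strip().lower()


def memorization_ratio(predictions, original_answers, substituted_answers):
    """Return (p_s, p_o, MR) over a counterfactual evaluation set.

    p_o: fraction of predictions matching the original (memorized) answer.
    p_s: fraction of predictions matching the substituted (in-context) answer.
    MR : p_o / (p_o + p_s), assumed definition; lower means more faithful.
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("empty evaluation set")
    p_o = sum(exact_match(p, o) for p, o in zip(predictions, original_answers)) / n
    p_s = sum(exact_match(p, s) for p, s in zip(predictions, substituted_answers)) / n
    denom = p_o + p_s
    mr = p_o / denom if denom > 0 else 0.0
    return p_s, p_o, mr
```

A model that always follows the context would score p_s = 1 and MR = 0, while one that always returns its memorized answer would score p_o = 1 and MR = 1.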

Prediction with abstention

Dataset created based on RealTime QA
Added “I don’t know” choice and relabeled dataset
113 test instances, 63 answerable and 50 unanswerable
Calculate the probability of each choice and take the most probable one as the prediction
Report accuracy on all, answerable, and unanswerable instances
OPIN + INSTR prompt best for both zero-shot and few-shot settings
OPIN prompt second best
All proposed prompting templates better on unanswerable instances
Use of demonstrations improves LLMs’ ability to make selective predictions
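
The selection and scoring steps above can be sketched as follows. This is a hedged illustration that assumes the per-choice probabilities have already been obtained from the model; the data layout (`probs`, `label`) and the function names are hypothetical, not taken from the paper.

```python
IDK = "I don't know"  # the added abstention choice


def predict_choice(choice_probs: dict) -> str:
    """Pick the choice with the largest probability."""
    return max(choice_probs, key=choice_probs.get)


def abstention_accuracy(examples):
    """Accuracy on all, answerable, and unanswerable instances.

    Each example is a dict with:
      'probs': mapping from choice text to model probability (assumed given)
      'label': gold choice; IDK marks an unanswerable instance
    """
    buckets = {"all": [], "answerable": [], "unanswerable": []}
    for ex in examples:
        correct = predict_choice(ex["probs"]) == ex["label"]
        buckets["all"].append(correct)
        key = "unanswerable" if ex["label"] == IDK else "answerable"
        buckets[key].append(correct)
    # None signals an empty subset rather than a misleading 0.0
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

Reporting the three numbers separately makes it visible when a prompt gains on unanswerable instances at the cost of answerable ones.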

Related work

LLMs have shown promising results in closed-book QA tasks
Facts stored in model parameters may become outdated
Some studies have explored ways to identify and edit facts stored in model parameters
Selective prediction with abstention is an important problem in trustworthy AI
Abstention is preferred when context is irrelevant to the question
Neeman et al. propose answerability augmentation to address the problem

Method

Focuses on context-specific NLP tasks
Input is (c, q) for free-form generation tasks and (c, q, o) for tasks with closed decision spaces, where c is the context, q the question, and o the options
Desired output is either free-form text or a choice
Solved by prompting LLMs
Two proposed methods: opinion-based prompts and counterfactual demonstrations

Opinion-based prompting

Given an input (c, q, o), a base prompting template is used
Two types of prompting templates are investigated: opinion-based and instructed
Opinion-based prompts transform questions into opinion-seeking questions
Instructed prompts explicitly instruct LLMs to read the context
APE is used to generate instructions for prompted questions
Experiments show that all prompting templates perform better than the base prompting template
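
To make the four template variants concrete, here is a small sketch of a prompt builder. The wording of the instruction and the "Bob said" framing are illustrative assumptions, not the exact templates from the paper, and `build_prompt` is a hypothetical helper.

```python
# Illustrative instruction text (an assumption, not the APE-generated one)
INSTRUCTION = "Instruction: read the given information and answer the corresponding question."

STYLES = {"base", "opin", "instr", "opin+instr"}


def build_prompt(context, question, options=None, style="base"):
    """Build a prompt for input (c, q) or (c, q, o) in one of four styles."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    q = question.strip()
    lines = []
    if "instr" in style:
        # Instructed prompt: explicitly tell the model to read the context
        lines.append(INSTRUCTION)
    if "opin" in style:
        # Opinion-based prompt: attribute the context to a narrator and
        # turn the question into an opinion-seeking question
        lines.append(f'Bob said, "{context}"')
        q = q.rstrip("?").rstrip() + " in Bob's opinion?"
    else:
        lines.append(context)
    lines.append(f"Q: {q}")
    if options:
        lines.append("Options: " + " ".join(options))
    lines.append("A:")
    return "\n".join(lines)
```

For example, `build_prompt(c, q, o, style="opin+instr")` combines the instruction with the opinion framing, the variant reported as most effective.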

Demonstrations

Demonstrations are a standard way to do few-shot inference on LLMs.
Previous works propose to finetune LLMs using counterfactual instances.
We propose to use counterfactual instances as demonstrations for LLMs.
We use KATE to retrieve the most relevant counterfactual instances.
We encode test and counterfactual instances with RoBERTa nli+sts-b.
We select top counterfactual instances based on cosine similarity.
Using original (factual) instances as demonstrations underperforms counterfactual demonstrations.
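
The retrieval step can be sketched as a nearest-neighbor search over embeddings. This is a simplified stand-in for KATE: it assumes each instance has already been encoded into a vector (by RoBERTa nli+sts-b in the source, by any sentence encoder here), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_demonstrations(test_vec, pool, k=2):
    """Return the k counterfactual instances most similar to the test instance.

    pool: list of (instance, embedding) pairs, embeddings assumed precomputed.
    """
    ranked = sorted(pool, key=lambda item: cosine(test_vec, item[1]), reverse=True)
    return [instance for instance, _ in ranked[:k]]
```

The selected instances would then be formatted with the same prompt template as the test instance and prepended to it as demonstrations.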

Experiments

Experimental setups for evaluation of proposed methods
Two evaluation settings: knowledge conflict and prediction with abstention
Additional analysis on results across different model sizes and original datasets
Case study with examples of prompts and LLMs’ outputs

Experimental setup

Experiments conducted using the InstructGPT model
Compared baseline against prompting templates
Evaluated effectiveness of templates in zero-shot and few-shot settings

Additional analysis

OPIN + INSTR consistently outperforms other prompts across different model sizes
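On the filtered evaluation set (instances the model answers correctly without context), memorization ratio generally decreases with increased model size
On the full evaluation set, however, larger LLMs show more severe memorization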
Opinion-based prompts yield similar or better results on original dataset

Case study

LLM ignores context and returns memorized answer
Opinion-based prompts and instructions lead to more faithful response
Base prompts return a choice, potentially incorrect
Opinion-based prompts and instructions can abstain from making predictions and return “I don’t know”

Conclusion

LLMs may ignore context and make unfaithful predictions
Two methods, opinion-based prompts and counterfactual demonstrations, can improve LLMs’ faithfulness to contexts
Evaluated methods on two tasks, machine reading comprehension and relation extraction, and observed significant improvement in faithfulness
