Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models often generate incoherent outputs.
SituationSupervision is a family of approaches to improve coherence in LMs.
SituationSupervision has two components: an auxiliary situation modeling task and a latent state inference procedure.
SituationSupervision can be applied to fine-tuning and prompting.
SituationSupervision requires only a small number of state annotations to produce major coherence improvements.

Paper Content

Introduction

Language models encode a distribution over texts given contexts
Most language models are implemented as deep neural networks
Sampling from language models produces naturalistic text
Language models are prone to failure modes such as incoherent, untruthful, or unreliable text
Humans avoid these failure modes by maintaining explicit beliefs about entities and relations in a discourse
Language models may benefit from explicit modeling of situation state
Language models can be adapted with auxiliary prediction tasks
Adapting language models requires data for auxiliary supervision
Language modeling with explicit situations can be formulated as a latent variable problem
Inferred states can be used to supervise small models and prompt large ones

Auxiliary situation modeling

Assume access to a pre-trained language model and two sources of supervision
Training data consists of text-only examples (T, T') and examples (T, S, T') annotated with situation descriptions S, where T is the context and T' is the text that follows it
Situation descriptions consist of declarative sentences about relevant entities
Two auxiliary prediction schemes use these annotations to improve the language model's ability to model the conditional text distribution
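The two kinds of training example described above can be pictured as one record type with an optional situation field. The following is a minimal sketch, not the paper's data format: the class name `Example`, its field names, the helper `split_by_annotation`, and the sample sentences are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One training example: context T, optional situation S, next text T'."""
    context: str                           # T: the preceding text
    next_text: str                         # T': the text to be predicted
    situation: Optional[List[str]] = None  # S: declarative sentences, if annotated

    @property
    def is_annotated(self) -> bool:
        return self.situation is not None


def split_by_annotation(examples):
    """Separate text-only examples from situation-annotated ones."""
    text_only = [ex for ex in examples if not ex.is_annotated]
    annotated = [ex for ex in examples if ex.is_annotated]
    return text_only, annotated


# Invented sample data: the same (T, T') pair with and without a situation S.
examples = [
    Example("You open the chest.", "You take the key from the chest."),
    Example(
        "You open the chest.",
        "You take the key from the chest.",
        situation=["The chest is open.", "The key is in the chest."],
    ),
]
text_only, annotated = split_by_annotation(examples)
```

Keeping the situation optional lets a single dataset mix a small annotated portion with a larger text-only portion.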

Situation modeling for fine-tuning

Encoder-decoder models consist of an encoder and a decoder.
Parameters of the encoder and decoder are chosen to maximize a standard training objective, the likelihood of the next text given its context.
An auxiliary loss is added to improve state representations.
An auxiliary decoder is trained to predict state representations from the encoded context.
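One way to write the combined objective, as a sketch rather than the paper's exact formulation: here X denotes all training examples, X_A the annotated subset, T' the text that follows context T, and θ all model parameters (encoder, text decoder, and auxiliary decoder). The equal weighting of the two terms is an assumption.

```latex
\mathcal{L}(\theta)
  = \sum_{(T,\,T') \in X} \log p_\theta(T' \mid T)
  \;+\; \sum_{(T,\,S,\,T') \in X_A} \log p_\theta(S \mid T)
```

The first term is the ordinary language modeling likelihood; the second is the auxiliary term that rewards encoded contexts from which the situation description S can be decoded.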

Prompting

Used 25 sentences (3 stories) in P for TW and 80 sentences (16 stories) in P for TRIP.
Held out state annotations on 13 sentences (2 stories) in TW and 60 sentences (12 stories) in TRIP.
Fully annotated passages with state improved performance by 6.5% in TW and 11% in TRIP.
Incorporating generated latent states into the prompt helped performance by 7.1% in TW and 8.9% in TRIP.

Situation prediction for prompting

Approach described is general
Costly to apply in LMs with large numbers of parameters
Prompts can induce models to build better task representations
Prompts have three components: task description, training set, input context
Training set can include unannotated and annotated examples
Prompt string constructed with control token to predict annotations and text
Similar to existing “scratchpad” and “chain-of-thought” methods
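The three prompt components listed above can be assembled with a few lines of string handling. This is a minimal sketch under stated assumptions: the control tokens `[State]` and `[Next]`, the function names, and the layout of the prompt are hypothetical and are not taken from the paper.

```python
STATE_TAG = "[State]"  # hypothetical control token announcing a situation description
TEXT_TAG = "[Next]"    # hypothetical control token announcing the next text


def format_example(context, next_text=None, situation=None):
    """Render one example; annotated examples show the state before the text."""
    lines = [context]
    if situation is not None:
        lines.append(f"{STATE_TAG} {' '.join(situation)}")
    if next_text is not None:
        lines.append(f"{TEXT_TAG} {next_text}")
    return "\n".join(lines)


def build_prompt(task_description, training_set, input_context):
    """Concatenate task description, training examples, and the input context.

    training_set is a list of (context, next_text, situation_or_None) triples.
    The prompt ends with STATE_TAG so that the model is asked to write a
    situation description first and the next text afterwards.
    """
    blocks = [task_description]
    for context, next_text, situation in training_set:
        blocks.append(format_example(context, next_text, situation))
    blocks.append(f"{input_context}\n{STATE_TAG}")
    return "\n\n".join(blocks)


# Placeholder strings stand in for real passages.
prompt = build_prompt(
    "Continue the story.",
    [("A", "B", ["s1.", "s2."]), ("C", "D", None)],
    "E",
)
```

Ending the prompt with the state token is what makes this resemble scratchpad-style prompting: the model produces its intermediate description before committing to a continuation.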

Latent state inference

State supervision requires ground-truth state annotations to be effective.
State annotations are difficult to collect and design.
This section describes how to obtain state annotations automatically.
Re-formulate two approaches as latent variable models.
Inference problem is easier at training time than prediction time.
Most work on semisupervised inference of auxiliary prediction targets focuses on automatic optimization of prompts and reasoning chains.
Inferred latent variables have not been used to train scaffold decoders or design intermediate state representation for multi-step text generation.

Latent state inference for fine-tuning

Intuitively, a good state representation is one that is both predictable from context and useful for predicting subsequent text.
Introduce another encoder-decoder into the model to guide inference of states for auxiliary prediction.
Model has two pathways for predicting T': one that uses encoder representations to predict it directly from T, and another which generates textual state descriptions S from decoder representations.
Optimize complete likelihood by initializing parameters and inferring missing states that maximize probability of next sentences.
Alternating between E-step and M-step to train parameters and maximize likelihood.
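The alternation described above can be sketched as a hard-EM loop. This is an illustration under assumptions, not the paper's implementation: candidate states are assumed to come from a user-supplied `candidates_fn`, a state is scored by how predictable it is from the context plus how useful it is for the next text, and `toy_fit` is a count-based stand-in for actual model training that returns unnormalized scores rather than true log-probabilities.

```python
import math
from collections import Counter


def e_step(unannotated, candidates_fn, log_p_state, log_p_next):
    """For each (T, T') pair, pick the candidate S with the highest
    log p(S | T) + log p(T' | S, T)."""
    inferred = []
    for context, next_text in unannotated:
        best = max(
            candidates_fn(context),
            key=lambda s: log_p_state(s, context) + log_p_next(next_text, s, context),
        )
        inferred.append((context, best, next_text))
    return inferred


def hard_em(annotated, unannotated, candidates_fn, fit, num_rounds=3):
    """Alternate between inferring states (E-step) and refitting (M-step).

    fit(examples) must return a pair (log_p_state, log_p_next) of scorers.
    """
    log_p_state, log_p_next = fit(annotated)  # initialise from annotated data
    inferred = []
    for _ in range(num_rounds):
        inferred = e_step(unannotated, candidates_fn, log_p_state, log_p_next)
        log_p_state, log_p_next = fit(annotated + inferred)
    return inferred, (log_p_state, log_p_next)


def toy_fit(examples):
    """Toy stand-in for training: add-one co-occurrence scores, not probabilities."""
    state_counts = Counter((t, s) for t, s, _ in examples)
    next_counts = Counter((s, n) for _, s, n in examples)

    def log_p_state(s, context):
        return math.log(1 + state_counts[(context, s)])

    def log_p_next(next_text, s, context):
        return math.log(1 + next_counts[(s, next_text)])

    return log_p_state, log_p_next


# Invented toy data: one annotated example and one unannotated example.
annotated = [("open door", "door is open", "walk through door")]
unannotated = [("open door", "walk through door")]
inferred, _ = hard_em(
    annotated,
    unannotated,
    candidates_fn=lambda context: ["door is open", "door is closed"],
    fit=toy_fit,
)
```

In a real setting `fit` would fine-tune the encoder-decoder and `candidates_fn` would sample state descriptions from it; the loop structure stays the same.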

Latent state inference for prompting

Work on few-shot prompting finds benefits from adding extra examples to prompts
Produce extra examples for a seed prompt by finding situation descriptions that improve prediction of T' on unannotated examples
Choose the descriptions that maximize the probability of T', then add the newly annotated examples to the prompt during training and evaluation
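The procedure above amounts to growing a seed prompt one automatically annotated example at a time. The sketch below assumes two hypothetical wrappers around a language model API, `propose_states` and `score_next`; neither name comes from the paper, and reusing newly annotated examples when annotating later ones is a design choice made here for illustration.

```python
def annotate_for_prompt(seed_examples, unannotated, propose_states, score_next):
    """Grow a seed prompt with automatically annotated examples.

    seed_examples: list of (context, state, next_text) with hand-written states.
    unannotated:   list of (context, next_text) pairs.
    propose_states(prompt_examples, context) -> list of candidate state strings.
    score_next(prompt_examples, context, state, next_text) -> float, higher is
        better (for instance, the model's log-probability of next_text).
    Both callables are hypothetical wrappers around a language model.
    """
    prompt_examples = list(seed_examples)
    for context, next_text in unannotated:
        candidates = propose_states(prompt_examples, context)
        if not candidates:
            continue  # nothing proposed: leave this example out of the prompt
        best = max(
            candidates,
            key=lambda s: score_next(prompt_examples, context, s, next_text),
        )
        prompt_examples.append((context, best, next_text))
    return prompt_examples
```

A caller would pass the seed examples with hand-written states and receive a longer example list that can be rendered into the prompt string used at evaluation time.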

Experimental setup

Evaluated SITUATIONSUPERVISION on English language modeling datasets
TW is derived from TextWorld and consists of generated transcripts of players navigating a house
TRIP features pairs of plausible and implausible stories requiring physical commonsense reasoning
Chunks in TW consist of action description and game response, chunks in TRIP consist of single sentence and plausibility judgment
Annotated passages in XA have corresponding state information
Fine-tuning experiments use BART-base; prompting experiments use GPT3 da-vinci-002
Evaluate TW by sampling next actions and computing fraction of coherent sentences
Evaluate TRIP by training models to predict OK or Not OK after each sentence
Report sentence-wise metrics for TW and passage-wise metrics for TRIP
Compare models trained using ordinary language modeling techniques and state supervision
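The two reporting granularities mentioned above can be made concrete with a pair of small helpers. This is a sketch with invented function names, and it assumes that a TRIP passage counts as correct only when every sentence-level label in it is predicted correctly; the paper's exact scoring rule may differ.

```python
def coherence_rate(judgments):
    """Sentence-wise metric: fraction of sampled next sentences judged coherent."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(1 for ok in judgments if ok) / len(judgments)


def passage_accuracy(predicted, gold):
    """Passage-wise metric: fraction of passages whose per-sentence labels
    (for example "OK" / "Not OK") are all predicted correctly."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

For example, `coherence_rate([True, False, True, True])` counts three coherent samples out of four, and a passage with a single mislabeled sentence contributes nothing to `passage_accuracy`.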

Experiments

Analysis

Choice of state is important for TW environment
TW environment provides detailed ground-truth state annotations
Experiments use subset of entities and properties
Known state consists of facts deducible from prior context
Causally relevant state consists of facts relevant to next sentence
Choice of state matters, relevant known state outperforms full state
Optimize design of TRIP state annotations
Text-only prompting vs. auxiliary prompting with original/handcrafted states
Supplement existing examples with state annotations instead of additional text-only examples

Conclusion

Reasoning about the world is necessary for effective text generation
Auxiliary supervision can improve language models’ ability to reason
Increasing state supervision is more efficient than text-only supervision
Semantic supervision can improve LM generation coherence
State annotations are harder to collect
Latent supervision algorithms can be used to improve coherence
TW state consists of facts about player location, accessible items, and accessible doorways
TRIP state captures known facts and ground-truth acceptable completion
TW generation diversity is not harmed by SITUATIONSUPERVISION
Prompting with SITUATIONSUPERVISION increases diversity
BART-base model used for fine-tuning
GPT3 used for prompting
Generation temperature of 0.7 used for TW
Generation temperature of 0.9 used for TRIP
Recall used to measure diversity
Ablating state reranking improves coherence
