Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language models often generate incoherent outputs.
- SituationSupervision is a family of approaches to improve coherence in LMs.
- SituationSupervision has two components: an auxiliary situation modeling task and a latent state inference procedure.
- SituationSupervision can be applied to fine-tuning and prompting.
- SituationSupervision requires only a small number of state annotations to produce major coherence improvements.
Paper Content
Introduction
Background and related work
- Language models encode a distribution over texts given contexts
- Most language models are implemented as deep neural networks
- Sampling from language models produces naturalistic text
- Language models are prone to failure modes such as incoherent, untruthful, or unreliable text
- Humans avoid these failure modes by maintaining explicit beliefs about entities and relations in a discourse
- Language models may benefit from explicit modeling of situation state
- Language models can be adapted with auxiliary prediction tasks
- Adapting language models requires data for auxiliary supervision
- Language modeling with explicit situations can be formulated as a latent variable problem
- Inferred states can be used to supervise small models and prompt large ones
Auxiliary situation modeling
- Assume access to a pre-trained language model and two sources of supervision
- Training data consists of text examples (T, T ) and examples (T, S, T ) annotated with situation descriptions
- Situation descriptions consist of declarative sentences about relevant entities
- Two auxiliary prediction schemes use annotations to improve language model’s ability to model conditional text distribution
Situation modeling for fine-tuning
- Encoder-decoder models consist of an encoder and a decoder.
- Parameters of the encoder and decoder are chosen to maximize a standard training.
- An auxiliary loss is added to improve state representations.
- An auxiliary decoder is trained to predict state representations from the encoded context.
Prompting
- Used 25 sentences (3 stories) in P for TW and 80 sentences (16 stories) in P for TRIP.
- Held out state annotations on 13 sentences (2 stories) in TW and 60 sentences (12 stories) in TRIP.
- Fully annotated passages with state improved performance by 6.5% in TW and 11% in TRIP.
- Incorporating generated latent states into the prompt helped performance by 7.1% in TW and 8.9% in TRIP.
Situation prediction for prompting
- Approach described is general
- Costly to apply in LMs with large numbers of parameters
- Prompts can induce models to build better task representations
- Prompts have three components: task description, training set, input context
- Training set can include unannotated and annotated examples
- Prompt string constructed with control token to predict annotations and text
- Similar to existing “scratchpad” and “chainof-thought” methods
Latent state inference
- State supervision requires ground-truth state annotations to be effective.
- State annotations are difficult to collect and design.
- This section describes how to obtain state annotations automatically.
- Re-formulate two approaches as latent variable models.
- Inference problem is easier at training time than prediction time.
- Most work on semisupervised inference of auxiliary prediction targets focuses on automatic optimization of prompts and reasoning chains.
- Inferred latent variables have not been used to train scaffold decoders or design intermediate state representation for multi-step text generation.
Latent state inference for fine-tuning
- Intuitively, a good state representation is one that is both predictable from context and useful for predicting subsequent text.
- Introduce another encoder-decoder into the model to guide inference of states for auxiliary prediction.
- Model has two pathways for predicting T: one that uses encoder representations to predict it directly from T, and another which generates textual state descriptions S from decoder representations.
- Optimize complete likelihood by initializing parameters and inferring missing states that maximize probability of next sentences.
- Alternating between E-step and M-step to train parameters and maximize likelihood.
Latent state inference for prompting
- Work on few-shot prompting finds benefits from adding extra examples to prompts
- Produce extra examples for a seed prompt by finding situation descriptions that improve prediction of T on unannotated examples
- Choose prompts to maximize, then add newly annotated examples to the prompt during training and evaluation
Experimental setup
- Evaluated SITUATIONSUPERVISION on English language modeling datasets
- TW derived from TextWorld, generate transcripts of players navigating a house
- TRIP features pairs of plausible and implausible stories requiring physical commonsense reasoning
- Chunks in TW consist of action description and game response, chunks in TRIP consist of single sentence and plausibility judgment
- Annotated passages in XA have corresponding state information
- Models fine-tuned using BART-base and GPT3 da-vinci-002
- Evaluate TW by sampling next actions and computing fraction of coherent sentences
- Evaluate TRIP by training models to predict OK or Not OK after each sentence
- Report sentence-wise metrics for TW and passage-wise metrics for TRIP
- Compare models trained using ordinary language modeling techniques and state supervision
Experiments
Analysis
- Choice of state is important for TW environment
- TW environment provides detailed groundtruth state annotations
- Experiments use subset of entities and properties
- Known state consists of facts deducible from prior context
- Causally relevant state consists of facts relevant to next sentence
- Choice of state matters, relevant known state outperforms full state
- Optimize design of TRIP state annotations
- Text-only prompting vs. auxiliary prompting with original/handcrafted states
- Supplement existing examples with state annotations instead of additional text-only examples
Conclusion
- Reasoning about the world is necessary for effective text generation
- Auxiliary supervision can improve language models’ ability to reason
- Increasing state supervision is more efficient than text-only supervision
- Semantic supervision can improve LM generation coherence
- State annotations are harder to collect
- Latent supervision algorithms can be used to improve coherence
- TW state consists of facts about player location, accessible items, and accessible doorways
- TRIP state captures known facts and ground-truth acceptable completion
- TW generation diversity is not harmed by SITUATION-SUPERVISION
- Prompting with SITUATION-SUPERVISION increases diversity
- BARTbase model used for fine-tuning
- GPT3 used for prompting
- Generation temperature of 0.7 used for TW
- Generation temperature of 0.9 used for TRIP
- Recall used to measure diversity
- Ablating state reranking improves coherence