  • Language models often generate incoherent outputs.
  • SituationSupervision is a family of approaches to improve coherence in LMs.
  • SituationSupervision has two components: an auxiliary situation modeling task and a latent state inference procedure.
  • SituationSupervision can be applied to fine-tuning and prompting.
  • SituationSupervision requires only a small number of state annotations to produce major coherence improvements.

Paper Content


  • Language models encode a distribution over texts given contexts
  • Most language models are implemented as deep neural networks
  • Sampling from language models produces naturalistic text
  • Language models are prone to failure modes such as incoherent, untruthful, or unreliable text
  • Humans avoid these failure modes by maintaining explicit beliefs about entities and relations in a discourse
  • Language models may benefit from explicit modeling of situation state
  • Language models can be adapted with auxiliary prediction tasks
  • Adapting language models requires data for auxiliary supervision
  • Language modeling with explicit situations can be formulated as a latent variable problem
  • Inferred states can be used to supervise small models and prompt large ones

Auxiliary situation modeling

  • Assume access to a pre-trained language model and two sources of supervision
  • Training data consists of text-only examples (T, T′) and examples (T, S, T′) annotated with situation descriptions S
  • Situation descriptions consist of declarative sentences about relevant entities
  • Two auxiliary prediction schemes use these annotations to improve the language model’s ability to model the conditional text distribution
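
The two sources of supervision above can be pictured as a single record type in which the situation description is optional. This is only an illustrative sketch: the names `Example` and `split_by_annotation` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    """One training example: context T, continuation T', optional situation S."""
    context: str                        # T
    continuation: str                   # T'
    state: Optional[List[str]] = None   # S: declarative sentences, if annotated


def split_by_annotation(examples: List[Example]) -> Tuple[List[Example], List[Example]]:
    """Separate annotated (T, S, T') examples from text-only (T, T') examples."""
    annotated = [ex for ex in examples if ex.state is not None]
    unannotated = [ex for ex in examples if ex.state is None]
    return annotated, unannotated
```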

Situation modeling for fine-tuning

  • Encoder-decoder models consist of an encoder and a decoder.
  • Parameters of the encoder and decoder are chosen to maximize a standard training objective (the likelihood of the next text given its context).
  • An auxiliary loss is added to improve state representations.
  • An auxiliary decoder is trained to predict state representations from the encoded context.
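
One way to write the combined objective described above is as a sum of the usual text likelihood and an auxiliary state likelihood. The symbols here are assumptions made for illustration: θ_E for the encoder, θ_D for the text decoder, θ_S for the auxiliary decoder, X for the text-only examples, and X_A for the annotated examples. The summary does not say whether the two terms are weighted, so an unweighted sum is shown.

```latex
\mathcal{L}(\theta_E, \theta_D, \theta_S)
  = \sum_{(T,\, T') \in X} \log p(T' \mid T;\ \theta_E, \theta_D)
  \;+\; \sum_{(T,\, S,\, T') \in X_A} \log p(S \mid T;\ \theta_E, \theta_S)
```

Because both terms share θ_E, gradients from the state-prediction term shape the encoder representations that the text decoder also reads.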


  • Used 25 sentences (3 stories) in P for TW and 80 sentences (16 stories) in P for TRIP.
  • Held out state annotations on 13 sentences (2 stories) in TW and 60 sentences (12 stories) in TRIP.
  • Fully annotated passages with state improved performance by 6.5% in TW and 11% in TRIP.
  • Incorporating generated latent states into the prompt helped performance by 7.1% in TW and 8.9% in TRIP.

Situation prediction for prompting

  • Approach described is general
  • Costly to apply in LMs with large numbers of parameters
  • Prompts can induce models to build better task representations
  • Prompts have three components: task description, training set, input context
  • Training set can include unannotated and annotated examples
  • Prompt string constructed with control token to predict annotations and text
  • Similar to existing “scratchpad” and “chain-of-thought” methods
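
The prompt layout described above can be sketched as a small string-building function. This is a minimal sketch under stated assumptions: the control tokens `[State]` and `[Next]` and the function name `build_prompt` are hypothetical, and the exact prompt format used in the paper is not given here.

```python
def build_prompt(task_description, train_examples, input_context,
                 state_tag="[State]", text_tag="[Next]"):
    """Assemble a prompt from a task description, training examples, and a query.

    train_examples: list of (context, state_or_None, continuation) tuples.
    state_tag / text_tag: hypothetical control tokens marking what follows.
    """
    parts = [task_description.strip(), ""]
    for context, state, continuation in train_examples:
        parts.append(context.strip())
        if state is not None:
            # Annotated example: show the situation before the continuation.
            parts.append(f"{state_tag} {state.strip()}")
        parts.append(f"{text_tag} {continuation.strip()}")
        parts.append("")
    # End with the query context followed by the state tag, so the model is
    # cued to predict the situation first and the next text second.
    parts.append(input_context.strip())
    parts.append(state_tag)
    return "\n".join(parts)
```

Ending the prompt on the state tag is one possible design choice; ending on the text tag instead would ask the model for the continuation directly, as in text-only prompting.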

Latent state inference

  • State supervision requires ground-truth state annotations to be effective.
  • State annotations are difficult to collect and design.
  • This section describes how to obtain state annotations automatically.
  • Re-formulate two approaches as latent variable models.
  • Inference problem is easier at training time than prediction time.
  • Most work on semi-supervised inference of auxiliary prediction targets focuses on automatic optimization of prompts and reasoning chains.
  • Inferred latent variables have not been used to train scaffold decoders or design intermediate state representation for multi-step text generation.

Latent state inference for fine-tuning

  • Intuitively, a good state representation is one that is both predictable from context and useful for predicting subsequent text.
  • Introduce another encoder-decoder into the model to guide inference of states for auxiliary prediction.
  • Model has two pathways for predicting T′: one that uses encoder representations to predict it directly from T, and another which generates textual state descriptions S from decoder representations and then uses them to predict T′.
  • Optimize complete likelihood by initializing parameters and inferring missing states that maximize probability of next sentences.
  • Alternate between an E-step, which infers the missing states, and an M-step, which updates the parameters to maximize the likelihood.
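
The alternation described above resembles a hard-EM loop, which can be sketched generically as follows. Everything here is an assumption made for illustration: `hard_em`, `candidate_states`, `score`, and `fit` are hypothetical names, and the count-based toy model at the bottom merely stands in for the encoder-decoder so that the loop can be run end to end.

```python
from collections import Counter


def hard_em(annotated, unannotated, candidate_states, score, fit, n_iters=3):
    """Alternate between inferring missing states and refitting the model.

    annotated:        list of (T, S, T_next) triples with gold states
    unannotated:      list of (T, T_next) pairs
    candidate_states: callable (model, T) -> list of candidate states
    score:            callable (model, T, S, T_next) -> float, standing in for
                      log p(S | T) + log p(T_next | S, T)
    fit:              callable (triples) -> model
    """
    model = fit(annotated)  # initialize from the annotated examples only
    inferred = []
    for _ in range(n_iters):
        # E-step: pick the highest-scoring state for each unannotated pair.
        inferred = [
            (T, max(candidate_states(model, T),
                    key=lambda S: score(model, T, S, T_next)), T_next)
            for T, T_next in unannotated
        ]
        # M-step: refit on gold states plus the inferred ones.
        model = fit(annotated + inferred)
    return model, inferred


# Toy stand-ins (not a language model): the "model" counts how often a state
# co-occurs with a continuation, and the score is that count.
def toy_fit(triples):
    return Counter((S, T_next) for _, S, T_next in triples)


def toy_candidates(model, T):
    return ["door open", "door closed"]


def toy_score(model, T, S, T_next):
    return model[(S, T_next)]


gold = [("open door", "door open", "walk through"),
        ("close door", "door closed", "knock")]
pairs = [("push door", "walk through")]
toy_model, toy_inferred = hard_em(gold, pairs, toy_candidates, toy_score, toy_fit)
```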

Latent state inference for prompting

  • Work on few-shot prompting finds benefits from adding extra examples to prompts
  • Produce extra examples for a seed prompt by finding situation descriptions that improve prediction of T′ on unannotated examples
  • Choose the situation descriptions that maximize the probability of the observed continuation, then add the newly annotated examples to the prompt during training and evaluation
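
The selection step above can be sketched as picking, for each unannotated example, the candidate situation description under which the observed continuation scores highest. This is a sketch under assumptions: `propose` and `log_prob` are hypothetical callables (in practice they would wrap calls to the language model), and the function name is made up for illustration.

```python
def annotate_with_best_state(unannotated, propose, log_prob):
    """Annotate (T, T_next) pairs with the best-scoring candidate state.

    propose:  callable T -> list of candidate situation descriptions
    log_prob: callable (T_next, T, S) -> float, a stand-in for
              log p(T_next | T, S) under the prompted language model
    """
    annotated = []
    for T, T_next in unannotated:
        candidates = propose(T)
        if not candidates:
            continue  # nothing to choose from; leave this pair unannotated
        best = max(candidates, key=lambda S: log_prob(T_next, T, S))
        annotated.append((T, best, T_next))
    return annotated
```

The returned triples have the same shape as hand-annotated examples, so they can be appended to the seed prompt's training set.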

Experimental setup

  • Evaluated SituationSupervision on English language modeling datasets
  • TW is derived from TextWorld and consists of generated transcripts of players navigating a house
  • TRIP features pairs of plausible and implausible stories requiring physical commonsense reasoning
  • Chunks in TW consist of action description and game response, chunks in TRIP consist of single sentence and plausibility judgment
  • Annotated passages in X_A have corresponding state information
  • Fine-tuning experiments use BART-base; prompting experiments use GPT3 da-vinci-002
  • Evaluate TW by sampling next actions and computing fraction of coherent sentences
  • Evaluate TRIP by training models to predict OK or Not OK after each sentence
  • Report sentence-wise metrics for TW and passage-wise metrics for TRIP
  • Compare models trained using ordinary language modeling techniques and state supervision
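
The two kinds of metric mentioned above can be sketched as follows. The exact definitions used in the paper are not given in this summary, so both functions are assumptions: the sentence-wise metric is taken to be the share of sampled sentences judged coherent, and the passage-wise metric is taken to be the share of passages whose per-sentence OK / Not OK labels are all correct.

```python
def coherent_fraction(is_coherent_flags):
    """Sentence-wise metric: share of sampled sentences judged coherent."""
    flags = list(is_coherent_flags)
    return sum(flags) / len(flags) if flags else 0.0


def passage_accuracy(predicted, gold):
    """Passage-wise metric: share of passages whose label sequences match.

    predicted, gold: lists of per-passage label lists, e.g. ["OK", "Not OK"].
    A passage counts as correct only if every sentence label matches.
    """
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```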



  • Choice of state is important for TW environment
  • TW environment provides detailed ground-truth state annotations
  • Experiments use subset of entities and properties
  • Known state consists of facts deducible from prior context
  • Causally relevant state consists of facts relevant to next sentence
  • Choice of state matters, relevant known state outperforms full state
  • Optimize design of TRIP state annotations
  • Text-only prompting vs. auxiliary prompting with original/handcrafted states
  • Supplement existing examples with state annotations instead of additional text-only examples


  • Reasoning about the world is necessary for effective text generation
  • Auxiliary supervision can improve language models’ ability to reason
  • Increasing state supervision is more efficient than text-only supervision
  • Semantic supervision can improve LM generation coherence
  • State annotations are harder to collect
  • Latent supervision algorithms can be used to improve coherence
  • TW state consists of facts about player location, accessible items, and accessible doorways
  • TRIP state captures known facts and ground-truth acceptable completion
  • TW generation diversity is not harmed by SituationSupervision
  • Prompting with SituationSupervision increases diversity
  • BART-base model used for fine-tuning
  • GPT3 used for prompting
  • Generation temperature of 0.7 used for TW
  • Generation temperature of 0.9 used for TRIP
  • Recall used to measure diversity
  • Ablating state reranking improves coherence
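
The use of recall as a diversity measure, mentioned in the list above, can be illustrated with a short function. This is an assumed formulation, not necessarily the paper's: recall is taken here to be the share of valid next sentences that appear at least once among the sampled generations, and `generation_recall` is a hypothetical name.

```python
def generation_recall(samples, valid_next):
    """Share of valid next sentences covered by at least one sample."""
    valid = set(valid_next)
    if not valid:
        return 0.0
    return len(valid & set(samples)) / len(valid)
```

Under this formulation, a model that repeats one safe action scores low even if every sample is coherent, which is why recall complements the coherence metric.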