Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Task of text generation with constraints specified in natural language
- Created benchmark Cognac with knowledge-intensive constraints from databases
- State-of-the-art language models fail on this task
- Proposed solution CognacGen to leverage language model’s internal knowledge
- Three forms of guidance and prefix-tuning approaches to distill guidance
- Empirical evaluations demonstrate CognacGen can generalize to unseen instructions and outperform baselines
Paper Content
Introduction
- Language models are becoming increasingly good at generating text
- A key question is how to control them to produce what is required while preventing unwanted generations
- This is especially important for reducing issues of toxicity, bias, and misinformation
- Prior work has used special control codes or classifiers to modify the LM’s probability distribution
- This paper considers the problem of controlling generation in LMs with constraints specified in natural language
- A new benchmark called COGNAC is created with two datasets based on WordNet and Wikidata
- State-of-the-art LMs fail to follow simple language constraints
- A new language model generation method called COGNAC-GEN is developed to follow linguistic guidance without retraining
- COGNAC-GEN outperforms prior methods and other strong baselines by a significant margin
- It is able to improve generation even with imperfect guidance and can generalize to unseen instructions
Task setup
- Problem of conditional text generation with topics and constraints provided in natural language
- Input includes a topic, example generations, and a constraint
- Goal is to train LMs to generate fluent on-topic content while respecting the constraint
- Model has to understand the topic and constraint specified in natural language
- Topics and constraints are knowledge-intensive
- Model has to respect both the topic and the constraint simultaneously
Dataset collection
- Two new datasets based on WordNet and Wikidata created for COGNAC benchmark
- WordNet dataset constructed using five root nodes and their leaf nodes as topics and constraints
- Wikidata dataset constructed using five properties and their values as topics and constraints
- 35 unique templates created to reflect diverse nature of instructions
- 107,555 and 18,900 unique instructions created for WordNet and Wikidata respectively
Evaluation metrics
- Evaluated different generation methods of LMs using metrics that test for correctness and fluency
- Main metric is whether the generation conforms to the constraint while staying on topic
- Reported on-topic score, constraint violation score, Copy-BLEU score, repetition, and perplexity
Overview
- COGNAC model will benefit from querying itself
- Conditional probability is explicitly factorized
- Probability conditioned on demonstrations
- Probability evaluated if task is performed
- Probability evaluated if constraint is conformed
- Generation model can be modeled with pre-trained LM
Guided generation
- COGNACGEN updates the next token prediction probability from the generation model by modifying the logits using guidance.
- Greedy decoding is used to generate from the modified probability.
Guidance model
- Construct a guidance model to modify the guidance probabilities of a language model
- Three variations of guidance model: binary verifier, top-k token, and textual example
- Binary verifier evaluates the probability of a token being a type of a given constraint
- Top-k token uses the next token probability distribution from the guidance model
- Textual example uses a trie tree based approach to decide what guidance to apply at each step
Tackling diverse natural language instructions
- Method assumes topic and constraint are given
- Model takes natural language instruction and demonstrations as input
- Model needs to infer topic and constraint entities from full input
- Model is finetuned using prefix-tuning
- Model’s own knowledge is distilled and generalized to unseen instructions
Experimental setup
- Evaluations performed under two settings for both datasets in COGNAC
- Control code setting: binary verifier, top-k token, and textual example guidance
- NL instruction setting: unseen instruction templates
- Baselines: fine-tuned model, self-debiasing technique
- Model details: GPT-2 XL, top-k token, textual example guidance, GPT-3, InstructGPT
Results
- COGNACGEN (textual example) outperforms self-debiasing baseline by 12 points on WordNet and 16 points on Wikidata
- Fine-tuned baseline achieves lowest IC scores across both datasets
- Textual example guidance performs better than top-k token guidance and binary verifier
- COGNACGEN textual example achieves higher performance than GPT-3 (legacy) by 10.0 points on WordNet and 11.6 points on Wikidata
- COGNACGEN struggles to avoid violating constraints for ‘Art’ category
- Struggles to stay on topic for knowledge-heavy categories
- Oracle model has access to knowledge base and achieves IC of 73 compared to COGNACGEN’s 35.8
- Ablating away trie-based generation causes drastic performance degradation
- Existing approaches limited to high-level concepts and predetermined binary or categorical codes
Conclusion
- Introduced a new task for controllable generation in language models with constraints specified in natural language
- Developed COGNAC, a new benchmark containing knowledge-based constraints
- State-of-the-art language models like GPT-3 fail to conform to the provided instructions
- Developed COGNACGEN, a method to use knowledge internal to a language model to guide its own generations
- Innovations such as guidance self-distillation using prefix-tuning and a trie-based decoding scheme based on the guidance of textual examples
- Generates on-topic text that violates constraints less frequently compared to several baselines
- Method requires training only the prefix parameters and can easily be scaled to larger models
- Room to improve on COGNAC
- Reducing undesirable generations in LMs while promoting desirable text
- Benchmark limited by the comprehensiveness of the underlying knowledge bases used
- Implicit biases might cause unfairness to the end users of the model