Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can generate fluent text when the output follows natural language patterns.
- LLMs struggle when the output is confined to a limited ontology.
- MSP is a parameter-efficient procedure for generating data in a controlled manner.
- MSP produces diverse and natural text while preserving label semantics.
- MSP achieves state-of-the-art results on three benchmarks.
Paper Content
Introduction
- Complex NLU systems require large amounts of labeled data to be useful
- Low resource settings are common when expanding a system into a new domain
- Domain-adaptive few-shot learning is the task of learning a target domain from limited data
- Large language models are effective classifiers in low resource settings
- Data augmentation techniques can be used to tackle limited data issues
- LLMs can be used as a tool for controlled data generation
- Mixture of Soft Prompts (MSP) is a novel method that mixes attribute-specific soft prompts to generate diverse, class-conditioned training data
- MSP outperforms a model of the same size by up to 30%
Task formulation
- User needs to (re)train model when expanding product or adding new feature
- Few-shot natural language understanding can take many forms
- NLU tasks in real life are complex
- Given dataset with n training examples from group of s source domains
- Each training example has natural language input and structured output label
- Goal is to expand into target domain t with m examples, where m ≪ n
Few-shot direct prediction
- Pre-training a large neural network can help with low-resource scenarios.
- LLMs have shown good performance in few-shot tasks.
Data-centered alternative
- Use data augmentation to produce additional training examples
- Combine original seed data with synthesized data to train a downstream model
- Benefits of using LLMs as a data augmentation tool: inspectable, flexible, faster inference, transferable across model types
Prompt construction
- Soft prompt tuning is a method to leverage the power of LLMs without the onerous computation requirements of training from scratch.
- Soft prompts are initialized with the name and description of an attribute.
- The full input contains four parts: instruction prefix, soft prompts, meta-data, and exemplars.
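As a rough illustration of the four-part input listed above, the sketch below assembles a sequence of embedding vectors in the order instruction prefix, attribute soft prompts, meta-data, exemplars. Everything in it is hypothetical: `embed_token` is a toy stand-in for a frozen LLM embedding layer, `DIM` is tiny, and the helper names (`init_soft_prompt`, `build_input`) are invented for this example rather than taken from the paper's code.

```python
import hashlib
from typing import List

DIM = 4  # toy embedding width (assumption); real LLM embeddings are far wider

Vector = List[float]


def embed_token(token: str) -> Vector:
    """Toy stand-in for a frozen embedding layer: hash a token into DIM floats."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(DIM)]


def embed_text(text: str) -> List[Vector]:
    """Embed whitespace-separated tokens (real systems use a subword tokenizer)."""
    return [embed_token(tok) for tok in text.split()]


def init_soft_prompt(name: str, description: str, length: int) -> List[Vector]:
    """Initialize a soft prompt of `length` vectors from an attribute's name and description.

    Assumes the name is non-empty, so there is at least one source vector.
    """
    source = embed_text(f"{name} {description}")
    # Cycle through the source vectors so the prompt always has exactly `length` rows.
    return [list(source[i % len(source)]) for i in range(length)]


def build_input(
    prefix: List[Vector],
    attribute_prompts: List[List[Vector]],
    metadata: str,
    exemplars: List[str],
) -> List[Vector]:
    """Concatenate the four parts: prefix, soft prompts, meta-data, exemplars."""
    sequence = list(prefix)
    for prompt in attribute_prompts:
        sequence.extend(prompt)
    sequence.extend(embed_text(metadata))
    for exemplar in exemplars:
        sequence.extend(embed_text(exemplar))
    return sequence


prefix = init_soft_prompt("instruction", "generate a user utterance", length=3)
prompts = [
    init_soft_prompt("intent", "what the user wants", length=2),
    init_soft_prompt("slot", "entity mentioned by the user", length=2),
]
full_input = build_input(
    prefix, prompts, "domain: banking", ["transfer money now", "check balance"]
)
```

In a real setup only the prefix and attribute prompts would be trainable, while the meta-data and exemplars would pass through the frozen embedding layer.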
Attribute mixing
- Model is conditioned on desired attributes during generation
- Prior works focus on single attribute constraint
- Task contains multiple attributes
- Five different methods of composing attributes for data generation
- Methods include Concat, Pooling, Attention, Bottleneck, and CNN Mixture
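To make the idea of composing attributes more concrete, here is a minimal sketch of three of the five mixing methods. It assumes each attribute's soft prompt is a list of `length` vectors of equal width. Concat and Pooling are written in their most obvious form (sequence concatenation and element-wise mean); the attention variant scores each prompt against a hypothetical `query` vector, which is an assumption made for illustration and may not match the paper's formulation. Bottleneck and CNN mixing are omitted because the summary gives no detail on them.

```python
import math
from typing import List

Vector = List[float]
Prompt = List[Vector]  # one soft prompt: `length` vectors of width `dim`


def concat_mix(prompts: List[Prompt]) -> Prompt:
    """Concat: place the attribute prompts one after another along the sequence axis."""
    mixed: Prompt = []
    for prompt in prompts:
        mixed.extend(prompt)
    return mixed


def pooling_mix(prompts: List[Prompt]) -> Prompt:
    """Pooling: element-wise mean over attributes, keeping the prompt length fixed."""
    n = len(prompts)
    length, dim = len(prompts[0]), len(prompts[0][0])
    return [
        [sum(p[t][d] for p in prompts) / n for d in range(dim)]
        for t in range(length)
    ]


def attention_mix(prompts: List[Prompt], query: Vector) -> Prompt:
    """Attention (illustrative): softmax-weighted sum of prompts, scored against `query`."""
    scores = []
    for prompt in prompts:
        # Summarize each prompt by its mean vector, then take a dot product with the query.
        summary = [sum(vec[d] for vec in prompt) / len(prompt) for d in range(len(query))]
        scores.append(sum(q * s for q, s in zip(query, summary)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    length, dim = len(prompts[0]), len(prompts[0][0])
    return [
        [sum(w * p[t][d] for w, p in zip(weights, prompts)) for d in range(dim)]
        for t in range(length)
    ]
```

Concat grows the input with every added attribute, whereas pooling and attention keep the mixed prompt at a fixed length regardless of how many attributes are combined.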
Data denoising
- Generate 20% more data and filter to reduce noise
- Set keep-rate inversely proportional to how often attributes occur to balance data
- Weight examples according to distance from seed example to improve label preservation
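The three denoising steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: keep-rates are normalized so the rarest attribute is always kept, distance is approximated with token-level Jaccard distance, and examples closer to their seed receive a higher weight. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def num_to_generate(target: int) -> int:
    """Over-generate by 20%: integer ceiling of 1.2 * target."""
    return (target * 6 + 4) // 5


def keep_rates(example_attributes: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Keep-rate per attribute, inversely proportional to its frequency.

    Normalized (an assumption) so the rarest attribute has a keep-rate of 1.0.
    """
    counts = Counter(attr for attrs in example_attributes for attr in attrs)
    rarest = min(counts.values())
    return {attr: rarest / count for attr, count in counts.items()}


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance, used here as a simple stand-in distance."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def example_weight(generated: str, seed: str) -> float:
    """Training weight in [0, 1]; closer to the seed means a higher weight (assumption)."""
    return 1.0 - jaccard_distance(generated, seed)
```

For example, if attribute `a` appears four times and attribute `b` once, `keep_rates` keeps every `b` example and roughly a quarter of the `a` examples, which pushes the filtered data toward balance.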
Experimental setup
Datasets and tasks
- Tested on 3 diverse, multi-attribute natural language understanding datasets
- Task 1: Multi-aspect intent detection, measured by F1 score
- Task 2: Cross-domain named entity recognition
- Generated utterances typically preserve desired semantic attributes and lexical entities
- Under-sampling over-represented attributes balances generated data
Baseline methods
- FLAN-T5 XXL is the base model used for generating data
- GODEL is the smaller downstream LLM used
- Data augmentation techniques used include EDA, masked in-filling, BART-large, RTT, CLM, DExperts, and CVAE
Automatic evaluation
- Evaluated synthesized data quantitatively with three metrics
- Distinct@K measures diversity of text based on unique n-grams
- Perplexity measures text fluency with GPT2-large
- Correctness checks how well synthetic data preserves attribute labels
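For readers unfamiliar with the first two metrics, the sketch below shows one common way to compute them. It assumes whitespace tokenization and the usual Distinct-n definition (unique n-grams divided by total n-grams, pooled over all texts); the paper's exact Distinct@K aggregation may differ. Perplexity is computed from per-token log-probabilities, which in the paper's setup would come from GPT2-large and here are simply passed in as a list.

```python
import math
from typing import List, Sequence


def distinct_n(texts: Sequence[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams across all texts (0.0 if none exist)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Higher Distinct-n indicates more varied text, while lower perplexity indicates text the scoring model finds more fluent, so the two metrics often pull in opposite directions.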
Implementation details
- Instruction prefix set to 100 tokens
- Attribute token length set to 20
- Learning rate for teacher model set to 3e-2
- Learning rate for student set to 3e-5
- Augmentation methods generate 4 new datapoints per seed example
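The settings listed above could be gathered into a single configuration object, as in the sketch below. The class and field names are hypothetical; only the values come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MSPConfig:
    """Hypothetical container for the hyperparameters reported in the summary."""

    instruction_prefix_tokens: int = 100  # length of the instruction prefix
    attribute_prompt_tokens: int = 20     # length of each attribute soft prompt
    teacher_lr: float = 3e-2              # learning rate for the teacher (generator)
    student_lr: float = 3e-5              # learning rate for the student (downstream model)
    generations_per_seed: int = 4         # new datapoints per seed example


config = MSPConfig()
```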
Main results
- MSP achieves state-of-the-art results across all three datasets
- MSP leverages LLMs for data augmentation
- MSP outperforms data synthesis baselines on 8 out of 9 domains
- Meta-learning and data augmentation can be combined for better results
- All DA and CTG methods outperform naive GODEL baseline
- RTT leads to drop in performance for CrossNER
- Problems with RTT persist in TOPv2
- MSP is able to reliably handle lexical, semantic and structural constraints
Synthesized data quality
- LLMs used as intermediate data augmentation tool for few-shot learning
- Interpretability, flexibility and modularity of MSP
- MSP yields higher quality data than other data augmentation methods
- MSP yields better performance in downstream tasks
- MSP leverages LLMs to generate data rather than direct predictions
- MSP related to techniques that combine multiple prompts
- MSP controls generation as a means to an end
- MSP uses BLEU score as a proxy for measuring model convergence
- Oracle attribute classifier based on DeBERTa-XLarge used for automatic evaluation
- Oracle attribute classifier reaches over 90% accuracy
- K=2 works best for number of exemplars
- Number of generations per seed example set to 4
- Learning rate set to 0.3
- Batch size set to 8 and gradient accumulation set to 3 steps
- Downstream task learning rate set to 3e-5
- Data augmentation promotes diversity and label preservation
- Temperature parameter of LLM can be increased to increase diversity
- Exemplars can be shuffled or excluded to minimize copying behavior
- Novel attribute combinations can be composed
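Two of the diversity levers mentioned above, sampling temperature and exemplar shuffling or exclusion, can be illustrated with a short sketch. The temperature function is the standard temperature-scaled softmax; `prepare_exemplars` and its `drop_prob` parameter are hypothetical names introduced here, not part of the paper.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax: higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def prepare_exemplars(
    exemplars: Sequence[str], rng: random.Random, drop_prob: float = 0.0
) -> List[str]:
    """Randomly drop exemplars with probability `drop_prob`, then shuffle the rest."""
    kept = [ex for ex in exemplars if rng.random() >= drop_prob]
    rng.shuffle(kept)
    return kept
```

Raising the temperature spreads probability mass over more tokens, which increases variety at some cost to fluency; shuffling or dropping exemplars makes it harder for the model to copy any one of them verbatim.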