Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs can generate fluent text when the output follows natural language patterns.
  • LLMs struggle when the output is confined to a limited ontology.
  • MSP is a parameter-efficient procedure for generating data in a controlled manner.
  • MSP produces diverse and natural text while preserving label semantics.
  • MSP achieves state-of-the-art results on three benchmarks.

Paper Content

Introduction

  • Complex NLU systems require large amounts of labeled data to be useful
  • Low resource settings are common when expanding a system into a new domain
  • Domain-adaptive fewshot learning is the task of learning a target domain from limited data
  • Large language models are effective classifiers in low resource settings
  • Data augmentation techniques can be used to tackle limited data issues
  • LLMs can be used as a tool for controlled data generation
  • MSP is a novel method for combining the Mixture of Soft Prompts to generate diverse, class-conditioned training data
  • MSP outperforms a model of the same size by up to 30%

Task formulation

  • User needs to (re)train model when expanding product or adding new feature
  • Few-shot natural language understanding can take many forms
  • NLU tasks in real life are complex
  • Given dataset with n training examples from group of s source domains
  • Each training example has natural language input and structured output label
  • Goal is to expand into target domain t with m examples, where m « n

Few-shot direct prediction

  • Pre-training a large neural network can help with low-resource scenarios.
  • LLMs have shown good performance in few-shot tasks.

Data-centered alternative

  • Use data augmentation to produce additional training examples
  • Combine original seed data with synthesized data to train a downstream model
  • Benefits of using LLMs as a data augmentation tool: inspectable, flexible, faster inference, transferable across model types

Prompt construction

  • Soft prompt tuning is a method to leverage the power of LLMs without the onerous computation requirements of training from scratch.
  • Soft prompts are initialized with the name and description of an attribute.
  • The full input contains four parts: instruction prefix, soft prompts, meta-data, and exemplars.

Attribute mixing

  • Model is conditioned on desired attributes during generation
  • Prior works focus on single attribute constraint
  • Task contains multiple attributes
  • Five different methods of composing attributes for data generation
  • Methods include Concat, Pooling, Attention, Bottleneck, and CNN Mixture

Data denoising

  • Generate 20% more data and filter to reduce noise
  • Set keep-rate inversely proportional to how often attributes occur to balance data
  • Weight examples according to distance from seed example to improve label preservation

Experimental setup

Datasets and tasks

  • Tested on 3 diverse, multi-attribute natural language understanding datasets
  • Task 1: Multi-aspect intent detection, measured by F1 score
  • Task 2: Cross-domain named entity recognition
  • Generated utterances typically preserve desired semantic attributes and lexical entities
  • Under-sampling over-represented attributes balances generated data

Baseline methods

  • FLAN-T5 XXL is the base model used for generating data
  • GODEL is the smaller downstream LLM used
  • Data augmentation techniques used include EDA, masked in-filling, BART-large, RTT, CLM, DExperts, and CVAE

Automatic evaluation

  • Evaluated synthesized data quantitatively with three metrics
  • Distinct@K measures diversity of text based on unique n-grams
  • Perplexity measures text fluency with GPT2-large
  • Correctness checks how well synthetic data preserves attribute labels

Implementation details

  • Instruction prefix set to 100 tokens
  • Attribute token length set to 20
  • Learning rate for teacher model set to 3e-2
  • Learning rate for student set to 3e-5
  • Augmentation methods generate 4 new datapoints per seed example

Main results

  • MSP achieves state-of-the-art results across all three datasets
  • MSP leverages LLMs for data augmentation
  • MSP outperforms data synthesis baselines on 8 out of 9 domains
  • Meta-learning and data augmentation can be combined for better results
  • All DA and CTG methods outperform naive GODEL baseline
  • RTT leads to drop in performance for CrossNER
  • Problems with RTT persist in TOPv2
  • MSP is able to reliably handle lexical, semantic and structural constraints

Synthesized data quality

  • LLMs used as intermediate data augmentation tool for few-shot learning
  • Interpretability, flexibility and modularity of MSP
  • MSP yields higher quality data than other data augmentation methods
  • MSP yields better performance in downstream tasks
  • MSP leverages LLMs to generate data rather than direct predictions
  • MSP related to techniques that combine multiple prompts
  • MSP controls generation as a means to an end
  • MSP uses BLEU score as a proxy for measuring model convergence
  • Oracle attribute classifier based on DeBERTa-XLarge used for automatic evaluation
  • Oracle attribute classifier reaches over 90% accuracy
  • K=2 works best for number of exemplars
  • Num generations set to 4
  • Learning rate set to 0.3
  • Batch size set to 8 and gradient accumulation set to 3 steps
  • Downstream task learning rate set to 3e-5
  • Data augmentation promotes diversity and label preservation
  • Temperature parameter of LLM can be increased to increase diversity
  • Exemplars can be shuffled or excluded to minimize copying behavior
  • Novel attribute combinations can be composed