Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- PLMs have shown impressive unaided performance across many NLP tasks.
- Adding a few labeled in-context exemplars can improve PLMs.
- Designing prompts for complex tasks like dialogue state tracking is difficult.
- Building in-context exemplars for dialogue tasks is difficult due to short model input lengths.
- A meta-learning scheme and novel training method are used to stabilize the model and find ideal in-context examples.
- A saliency model is used to limit dialogue text length and include more exemplars per query.
- Highly competitive results are achieved for few-shot DST on MultiWOZ.
Paper Content
Introduction
- Tremendous gains have been made on dialogue state tracking using large pre-trained language models
- Fine-tuning these systems requires significant amounts of data
- Prompting has emerged as a technique for achieving strong performance with less resources
- In-context exemplars provide a pattern for the model to follow
- Difficulty of hand-crafting prompts and targets is a challenge
- Dialogue sequence lengths are often much longer than other tasks
- Selecting the exemplars is difficult due to sparsity
- Aim to achieve good results with a low-resource model setting
- Meta in-context learning framework to stabilize training and reduce variance
- Inspired by summarization work to condense dialogue histories
- Novel loss function to train a retrieval model to select ideal exemplars
- Works on any sort of language model
Related works
Few-shot dialog state tracking
- Recent works on dialogue state tracking use large pre-trained LMs
- Few-shot learning can be achieved with transfer learning or data augmentation
- Clustering techniques like prototypical networks have been successful
Meta in-context learning with prompting
- Few-shot techniques of meta-learning and prompting with large PLMs are used
- Pre-training a model to learn how to learn is used to get away with only a few examples at test time
- Methods which circumvent the need to calculate second-order gradients have been applied to the task of DST
- Prompts have been found to work well on a wide variety of NLP tasks
- Prompt engineering has become its own complex task
- Meta in-context learning on classification tasks has been successful
- Aim to side-step the prompt design issue altogether by applying metalearning to teach a model to recognize arbitrary instructions
Exemplar retrieval
- Retrieval with dense vectors can be used for in-context learning (Liu et al., 2022).
- Dense vectors have been used for dialogue in open-domain chat and knowledge-base retrieval (Adolphs et al., 2021; Komeili et al., 2022; Eric et al., 2017; Lee et al., 2021).
Our method
- Proposal of a Stabilized dialogue state tracker
- Leverages Meta incontext learning, dialogue Summarization and a novel Multi-part training loss
- SM2 for fine-tuning a retrieval model
Preliminaries
- DST aims to understand customer intentions in a conversation
- DST predicts a cumulative dialogue state based on dialogue history
- Few-shot setup only allows access to a small percentage of labeled data
- Model receives no gradient signal from task-specific data, relies on in-context learning
Stabilized meta-learning
- PLMs understand instructions written in natural language
- Minor tweaks in prompt text can cause extreme changes in generated output
- Meta-ICL stabilizes the variance of prompts
- Meta-learning uses labeled data from support sets to adapt a model
- Meta-ICL avoids costly loss calculation by using in-context learning
- MultiWOZ is the held out target task
- Model familiarizes itself with complex DST prompts during meta-training
- Any prompt can be used to instruct the model, including random tokens
Dialogue compression
- Dialogue context is condensed to fit more exemplars into the model input sequence.
- Dialogue history is summarized instead of removing prior utterances.
- Heuristics are used to identify non-salient utterances and filter them away.
Multi-part retrieval training
- Exemplars are important for in-context learning
- Exemplars are retrieved based on their proximity to the query example
- An SBERT embedder is used to encode exemplars into a shared embedding space
- Two categories of training techniques are explored to improve the performance of the retrieval model
- A multi-contrastive loss and a multi-MSE loss are used to modify the target label
- The target label is modified to include multiple parts such as domain, slot and value
Model input
- Model input consists of context summary, current turn (2 utterances), domain and slot, and exemplars
- During meta-training, a final [value] token is added to the model input which is what is hoped to be predicted when testing the left out query set
Experiments
- Training implementation details outlined
- Key experiments discussed
Training setup
- Considered 4 datasets as support sets
- Used MultiWOZ 2.1 versions
- Selected best models through early stopping on validation data
- Set learning rate to 3e-4, used Adafactor optimizer and cosine scheduler with warmup of 10,000 steps
- Best system used an ensemble of exemplar embedders trained with κ = [20,30,40] and learning rate of 3e-5
Prompt variations
- Model training is considered stable if different prompts produce similar outcomes
- Six prompts are collected based on common sense and prior work
- Prompts are designed by others to avoid biasing the rankings
- Prompts take the form of statements, questions, schema, naive, none and random
- Baseline is in-context learning without meta-training
- Variance among scores is measured before and after metalearning
Filtering threshold
- Two experts annotated 50 dialogs to verify the saliency model.
- Results of the model were tested with different filtering thresholds, ranging from 0.1 to 0.9.
- Maximum F1-score was reached at 0.6, but 0.4 was chosen as the filtering threshold for higher recall.
- Qualitative examples of irrelevant sentences removed can be found in section 5.4.
Retrieval methods
- Adapted SBERT to DST task with 4 different objective functions
- Tested with number of pairs per exemplar from 10 to 100
- Found κ = 30 to work best
- Included default SBERT model without fine-tuning
- Evaluated results with MRR@10, NDCG@10 and MAP@100
- Multi-part cosine loss showed strongest ability to select meaningful exemplars
Results and analysis
- Goal is to achieve strong results on DST without prompt engineering
- Analyze ability of best performing models
- Discuss performance stability across different prompts
Main results
- In-context learning methods outperform fine-tuning with few-shot data
- SM2-11b model achieves best joint goal accuracy on MultiWOZ 2.1 and 2.4
- SM2-3b outperforms IC-DST 2.7b models
- SM2 models exhibit 2x reduction in variance over models trained under other regimes
- Meta-learning from SM2 stabilizes prompt performance across multiple model types
Ablation study
- Removing saliency filtering causes a 1-2% drop in model performance
- Disabling context summarization causes a bigger decrease in accuracy
- Using the default SBERT embedder leads to a nearly 10% drop, suggesting exemplar selection is most critical
- Ideas are independently applicable to other NLP tasks
Additional discussion
- Fine-tuning performs best
- SM2 outperforms in-context learning
- Transfer learning from source datasets to target dataset does not work as well
- Performance increases from 1% to 5% data, but not from 5% to 10%
- Statement prompt does best, Random does worst, but still above chance
Qualitative analysis
- Utterance with “domain=restaurant” and “slots=price range, food type” receives high score
- Second exemplar E2 discusses different topic, producing low score
- Sentence embedder effectively distinguishes value of exemplars
- Saliency model successfully conserves token space
- Short sentences and those without dialog state info are safe for removal
Conclusion
- Method of performing few-shot dialogue state tracking by leveraging large pre-trained LMs with prompts
- Does not require any gradient-based training for the target task
- Leverages in-context learning to guide model generation
- Stabilizes training across prompts with Meta-ICL
- Applies saliency filtering and context summarization to reduce dialogue length
- Fine-tunes a sentence embedder with a custom loss objective to improve exemplar retrieval
- Reaches state-of-the-art results on MultiWOZ when limited to models under 100 billion parameters
- Plans to explore techniques that push model and data efficiency even further
- Applies framework to other task-oriented dialog datasets
- Runs experiments with different random seeds
- Uses softmax-based contrastive loss and max-margin contrastive loss
- Adjusts batch size and adopts AdaFactor as the optimizer
- Ensemble decoding for multiple times using different retrieval embedders
- Uses verbalizers to map natural sounding output to the more limited slot-values in the ontology
- Squeezes multiple in-context exemplars, dialogue query with conversational context, and a full prompt into the finite input length of a large PLM
- Leverages embeddings to search for exemplars in dialogue
- Fine-tunes sentence embedder with various loss functions
- Measures instability as standard deviation of the accuracy scores