Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

PLMs have shown impressive unaided performance across many NLP tasks.
Adding a few labeled in-context exemplars can improve PLMs.
Designing prompts for complex tasks like dialogue state tracking is difficult.
Building in-context exemplars for dialogue tasks is difficult because model input lengths are limited.
A meta-learning scheme and novel training method are used to stabilize the model and find ideal in-context examples.
A saliency model is used to limit dialogue text length and include more exemplars per query.
Highly competitive results are achieved for few-shot DST on MultiWOZ.

Paper Content

Introduction

Tremendous gains have been made on dialogue state tracking using large pre-trained language models
Fine-tuning these systems requires significant amounts of data
Prompting has emerged as a technique for achieving strong performance with fewer resources
In-context exemplars provide a pattern for the model to follow
Difficulty of hand-crafting prompts and targets is a challenge
Dialogue sequence lengths are often much longer than other tasks
Selecting the exemplars is difficult due to sparsity
Aim to achieve good results with a low-resource model setting
Meta in-context learning framework to stabilize training and reduce variance
Inspired by summarization work to condense dialogue histories
Novel loss function to train a retrieval model to select ideal exemplars
Works on any sort of language model

Related works

Few-shot dialog state tracking

Recent works on dialogue state tracking use large pre-trained LMs
Few-shot learning can be achieved with transfer learning or data augmentation
Clustering techniques like prototypical networks have been successful

Meta in-context learning with prompting

Few-shot techniques of meta-learning and prompting with large PLMs are used
Meta-learning pre-trains a model to learn how to learn, so that only a few examples are needed at test time
Methods which circumvent the need to calculate second-order gradients have been applied to the task of DST
Prompts have been found to work well on a wide variety of NLP tasks
Prompt engineering has become its own complex task
Meta in-context learning on classification tasks has been successful
Aim to sidestep the prompt design issue altogether by applying meta-learning to teach a model to recognize arbitrary instructions

Exemplar retrieval

Retrieval with dense vectors can be used for in-context learning (Liu et al., 2022).
Dense vectors have been used for dialogue in open-domain chat and knowledge-base retrieval (Adolphs et al., 2021; Komeili et al., 2022; Eric et al., 2017; Lee et al., 2021).

Our method

Proposes a Stabilized dialogue state tracker, referred to as SM2
Leverages Meta in-context learning, dialogue Summarization and a novel Multi-part training loss
The multi-part training loss is used for fine-tuning a retrieval model

Preliminaries

DST aims to understand customer intentions in a conversation
DST predicts a cumulative dialogue state based on dialogue history
Few-shot setup only allows access to a small percentage of labeled data
Model receives no gradient signal from task-specific data, relies on in-context learning
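The idea of a cumulative dialogue state can be sketched as below. This is a minimal illustration, assuming the state is a mapping from (domain, slot) pairs to values and that a later turn overwrites an earlier value for the same slot; the function name and the toy turns are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

# Assumed representation: (domain, slot) -> value
DialogueState = Dict[Tuple[str, str], str]


def accumulate_state(turn_updates: List[DialogueState]) -> List[DialogueState]:
    """Return the cumulative dialogue state after each turn."""
    state: DialogueState = {}
    history: List[DialogueState] = []
    for update in turn_updates:
        # Later turns overwrite earlier values for the same (domain, slot).
        state = {**state, **update}
        history.append(state)
    return history


# Toy dialogue: the user changes the food type in the third turn.
turns = [
    {("restaurant", "food"): "italian"},
    {("restaurant", "price range"): "cheap"},
    {("restaurant", "food"): "thai"},
]
states = accumulate_state(turns)
```

The last entry of `states` holds the state a tracker would be expected to predict after the final turn.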

Stabilized meta-learning

PLMs understand instructions written in natural language
Minor tweaks in prompt text can cause extreme changes in generated output
Meta-ICL stabilizes the variance of prompts
Meta-learning uses labeled data from support sets to adapt a model
Meta-ICL avoids costly loss calculation by using in-context learning
MultiWOZ is the held-out target task
Model familiarizes itself with complex DST prompts during meta-training
Any prompt can be used to instruct the model, including random tokens
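One way to picture meta-training with a held-out target is the episode sampler below. It is a rough sketch under stated assumptions: every episode draws its exemplars and its query from a single support dataset, the target dataset is never sampled, and the dataset names other than MultiWOZ are placeholders. The function names are hypothetical and the paper's actual sampling procedure may differ.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (input text, target value)


def sample_episode(pool: List[Example], k: int, rng: random.Random):
    """Pick k in-context exemplars plus one distinct query from one dataset."""
    picked = rng.sample(pool, k + 1)
    return picked[:k], picked[k]


def meta_training_episodes(
    datasets: Dict[str, List[Example]], target: str, k: int, n: int, seed: int = 0
):
    """Sample n episodes from every dataset except the held-out target."""
    rng = random.Random(seed)
    support = {name: ex for name, ex in datasets.items() if name != target}
    names = sorted(support)
    episodes = []
    for _ in range(n):
        name = rng.choice(names)
        exemplars, query = sample_episode(support[name], k, rng)
        episodes.append((name, exemplars, query))
    return episodes


# Placeholder data; "support_a" and "support_b" are hypothetical dataset names.
toy = {
    "MultiWOZ": [(f"m{i}", "v") for i in range(5)],
    "support_a": [(f"a{i}", "v") for i in range(5)],
    "support_b": [(f"b{i}", "v") for i in range(5)],
}
episodes = meta_training_episodes(toy, target="MultiWOZ", k=3, n=10)
```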

Dialogue compression

Dialogue context is condensed to fit more exemplars into the model input sequence.
Dialogue history is summarized instead of removing prior utterances.
Heuristics are used to identify non-salient utterances and filter them away.
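The filtering step can be sketched as follows. The scoring heuristic here is an assumption made for illustration only: it rewards utterances that are not very short and that mention a term from a slot vocabulary. The names `saliency_score`, `compress` and `slot_terms` are hypothetical, and the paper's saliency model is not reproduced here.

```python
from typing import List, Set


def saliency_score(utterance: str, slot_terms: Set[str], min_len: int = 4) -> float:
    """Toy saliency score in [0, 1]: half from length, half from slot-term overlap."""
    tokens = [t.strip(".,?!") for t in utterance.lower().split()]
    if not tokens:
        return 0.0
    length_part = min(len(tokens) / min_len, 1.0)
    overlap_part = 1.0 if any(t in slot_terms for t in tokens) else 0.0
    return 0.5 * length_part + 0.5 * overlap_part


def compress(utterances: List[str], slot_terms: Set[str], threshold: float = 0.4) -> List[str]:
    """Keep only utterances whose score reaches the filtering threshold."""
    return [u for u in utterances if saliency_score(u, slot_terms) >= threshold]


history = ["Hello there", "I want a cheap italian restaurant", "Thanks!"]
kept = compress(history, slot_terms={"cheap", "italian", "restaurant"})
```

With these toy inputs the greeting and the thanks fall below the threshold, so only the request survives, which frees space for additional exemplars.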

Multi-part retrieval training

Exemplars are important for in-context learning
Exemplars are retrieved based on their proximity to the query example
An SBERT embedder is used to encode exemplars into a shared embedding space
Two categories of training techniques are explored to improve the performance of the retrieval model
A multi-contrastive loss and a multi-MSE loss are used to modify the target label
The target label is modified to include multiple parts such as domain, slot and value
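A multi-part target can be illustrated with the sketch below. It assumes that each label is a (domain, slot, value) triple, that every matching part contributes equally to the similarity target, and that the multi-MSE loss compares the cosine similarity of two embeddings against that target. These choices are assumptions for illustration; the exact weighting and loss definition in the paper may differ.

```python
import math
from typing import List, Sequence, Tuple

Label = Tuple[str, str, str]  # (domain, slot, value)


def multi_part_target(a: Label, b: Label) -> float:
    """Fraction of label parts (domain, slot, value) shared by two examples."""
    return sum(x == y for x, y in zip(a, b)) / 3


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def multi_mse_loss(pairs: List[Tuple[Sequence[float], Sequence[float], Label, Label]]) -> float:
    """Mean squared error between embedding similarity and the multi-part target."""
    errors = [
        (cosine(ea, eb) - multi_part_target(la, lb)) ** 2
        for ea, eb, la, lb in pairs
    ]
    return sum(errors) / len(errors)


same = ("restaurant", "food", "thai")
other = ("restaurant", "food", "italian")
loss = multi_mse_loss([([1.0, 0.0], [1.0, 0.0], same, same)])
```

Under this target, two examples that share domain and slot but differ in value receive partial credit instead of a hard negative label.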

Model input

Model input consists of context summary, current turn (2 utterances), domain and slot, and exemplars
During meta-training, the final [value] is included in the model input; at test time it is the part the model is expected to predict for the held-out query set
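The input layout described above might be assembled as in the following sketch. The bracketed field markers, the ordering of exemplars before the query, and the reading of "2 utterances" as one system and one user utterance are all assumptions; `format_example` and `build_input` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def format_example(
    summary: str,
    turn: Tuple[str, str],
    domain: str,
    slot: str,
    value: Optional[str] = None,
) -> str:
    """Serialize one example; the value is left blank for the query."""
    system_utt, user_utt = turn
    text = (
        f"[context] {summary} [turn] {system_utt} {user_utt} "
        f"[domain] {domain} [slot] {slot} [value]"
    )
    if value is not None:
        text += f" {value}"
    return text


def build_input(exemplars: List[Dict], query: Dict) -> str:
    """Concatenate labeled exemplars followed by the unlabeled query."""
    parts = [format_example(**ex) for ex in exemplars]
    parts.append(format_example(**query))
    return "\n".join(parts)


exemplar = {
    "summary": "User is looking for a place to eat.",
    "turn": ("What price range?", "Something cheap please."),
    "domain": "restaurant",
    "slot": "price range",
    "value": "cheap",
}
query = {
    "summary": "User wants to book a hotel.",
    "turn": ("Which area?", "In the north."),
    "domain": "hotel",
    "slot": "area",
}
model_input = build_input([exemplar], query)
```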

Experiments

Training implementation details outlined
Key experiments discussed

Training setup

Considered 4 datasets as support sets
Used MultiWOZ 2.1 versions
Selected best models through early stopping on validation data
Set learning rate to 3e-4, used Adafactor optimizer and cosine scheduler with warmup of 10,000 steps
Best system used an ensemble of exemplar embedders trained with κ = [20,30,40] and learning rate of 3e-5
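The learning-rate schedule can be sketched as a plain function. The base rate of 3e-4 and the 10,000 warmup steps come from the notes above; the linear shape of the warmup and the `total_steps` value are assumptions, since neither is stated here.

```python
import math


def lr_at(
    step: int,
    base_lr: float = 3e-4,
    warmup_steps: int = 10_000,
    total_steps: int = 100_000,  # hypothetical; not given in the notes
) -> float:
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In this sketch the rate peaks exactly at the end of warmup and then falls smoothly, which is the usual motivation for pairing warmup with a cosine decay.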

Prompt variations

Model training is considered stable if different prompts produce similar outcomes
Six prompts are collected based on common sense and prior work
Prompts are designed by others to avoid biasing the rankings
Prompts take the form of statements, questions, schema, naive, none and random
Baseline is in-context learning without meta-training
Variance among scores is measured before and after meta-learning
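Measuring stability across the six prompt styles could look like the sketch below. It assumes instability is the standard deviation of one accuracy score per prompt style; the choice of population rather than sample standard deviation is an assumption, and the scores in the usage lines are toy values, not results from the paper.

```python
from statistics import pstdev
from typing import Dict

PROMPT_STYLES = ["statements", "questions", "schema", "naive", "none", "random"]


def prompt_instability(scores: Dict[str, float]) -> float:
    """Population standard deviation of accuracy across the six prompt styles."""
    missing = [s for s in PROMPT_STYLES if s not in scores]
    if missing:
        raise ValueError(f"missing scores for prompt styles: {missing}")
    return pstdev([scores[s] for s in PROMPT_STYLES])


# Toy values for illustration only.
flat = {s: 0.5 for s in PROMPT_STYLES}
spread = dict(zip(PROMPT_STYLES, [0.4, 0.6, 0.4, 0.6, 0.4, 0.6]))
flat_std = prompt_instability(flat)
spread_std = prompt_instability(spread)
```

A lower value means the model's accuracy depends less on which prompt style is used.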

Filtering threshold

Two experts annotated 50 dialogs to verify the saliency model.
Results of the model were tested with different filtering thresholds, ranging from 0.1 to 0.9.
Maximum F1-score was reached at 0.6, but 0.4 was chosen as the filtering threshold for higher recall.
Qualitative examples of irrelevant sentences removed can be found in section 5.4.
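The threshold sweep can be sketched as follows, assuming the positive class is "salient" (an utterance that should be kept) and that an utterance is kept when its score reaches the threshold. Under that assumption a lower threshold keeps more utterances and therefore favors recall. The scores and gold labels below are toy values, not the expert annotations.

```python
from typing import Dict, List, Tuple


def prf(scores: List[float], gold: List[bool], threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 for keeping utterances with score >= threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def sweep(scores: List[float], gold: List[bool]) -> Dict[float, Tuple[float, float, float]]:
    """Evaluate thresholds 0.1, 0.2, ..., 0.9."""
    return {t / 10: prf(scores, gold, t / 10) for t in range(1, 10)}


toy_scores = [0.2, 0.5, 0.7, 0.9]
toy_gold = [False, True, False, True]
results = sweep(toy_scores, toy_gold)
```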

Retrieval methods

Adapted SBERT to DST task with 4 different objective functions
Tested with number of pairs per exemplar from 10 to 100
Found κ = 30 to work best
Included default SBERT model without fine-tuning
Evaluated results with MRR@10, NDCG@10 and MAP@100
Multi-part cosine loss showed strongest ability to select meaningful exemplars
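Two of the ranking metrics named above can be computed as in this sketch, which uses the standard definitions of MRR@k and NDCG@k with a log2 discount. Each ranking is represented as a list of relevance grades in retrieved order; that representation is an assumption made for illustration.

```python
import math
from typing import List, Sequence


def mrr_at_k(rankings: List[Sequence[float]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for rels in rankings:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(rankings)


def ndcg_at_k(rels: Sequence[float], k: int = 10) -> float:
    """Normalized discounted cumulative gain for one ranking."""

    def dcg(values: Sequence[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(values[:k]))

    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0


mrr = mrr_at_k([[0, 1, 0], [1, 0, 0]])
ndcg_perfect = ndcg_at_k([1, 0, 0])
ndcg_second = ndcg_at_k([0, 1])
```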

Results and analysis

Goal is to achieve strong results on DST without prompt engineering
Analyze ability of best performing models
Discuss performance stability across different prompts

Main results

In-context learning methods outperform fine-tuning with few-shot data
SM2-11b model achieves best joint goal accuracy on MultiWOZ 2.1 and 2.4
SM2-3b outperforms IC-DST 2.7b models
SM2 models exhibit 2x reduction in variance over models trained under other regimes
Meta-learning from SM2 stabilizes prompt performance across multiple model types
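Joint goal accuracy, the metric reported above, counts a turn as correct only when the entire predicted state matches the gold state. A minimal sketch, assuming states are dictionaries keyed by (domain, slot); the toy states are illustrative and not drawn from MultiWOZ.

```python
from typing import Dict, List, Tuple

State = Dict[Tuple[str, str], str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


gold_states = [
    {("hotel", "area"): "north"},
    {("hotel", "area"): "north", ("hotel", "stars"): "4"},
]
pred_states = [
    {("hotel", "area"): "north"},
    {("hotel", "area"): "north", ("hotel", "stars"): "3"},  # one wrong slot
]
jga = joint_goal_accuracy(pred_states, gold_states)
```

A single wrong slot makes the whole turn count as incorrect, which is why the metric is stricter than per-slot accuracy.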

Ablation study

Removing saliency filtering causes a 1-2% drop in model performance
Disabling context summarization causes a bigger decrease in accuracy
Using the default SBERT embedder leads to a nearly 10% drop, suggesting exemplar selection is most critical
Ideas are independently applicable to other NLP tasks

Additional discussion

Fine-tuning performs best
SM2 outperforms in-context learning
Transfer learning from source datasets to target dataset does not work as well
Performance increases from 1% to 5% data, but not from 5% to 10%
Statement prompt does best, Random does worst, but still above chance

Qualitative analysis

Utterance with “domain=restaurant” and “slots=price range, food type” receives high score
Second exemplar E2 discusses different topic, producing low score
Sentence embedder effectively distinguishes value of exemplars
Saliency model successfully conserves token space
Short sentences and those without dialog state info are safe for removal

Conclusion

Method of performing few-shot dialogue state tracking by leveraging large pre-trained LMs with prompts
Does not require any gradient-based training for the target task
Leverages in-context learning to guide model generation
Stabilizes training across prompts with Meta-ICL
Applies saliency filtering and context summarization to reduce dialogue length
Fine-tunes a sentence embedder with a custom loss objective to improve exemplar retrieval
Reaches state-of-the-art results on MultiWOZ when limited to models under 100 billion parameters
Plans to explore techniques that push model and data efficiency even further
Applies framework to other task-oriented dialog datasets
Runs experiments with different random seeds
Uses softmax-based contrastive loss and max-margin contrastive loss
Adjusts batch size and adopts AdaFactor as the optimizer
Ensemble decoding runs multiple times, each time using a different retrieval embedder
Uses verbalizers to map natural sounding output to the more limited slot-values in the ontology
Squeezes multiple in-context exemplars, dialogue query with conversational context, and a full prompt into the finite input length of a large PLM
Leverages embeddings to search for exemplars in dialogue
Fine-tunes sentence embedder with various loss functions
Measures instability as standard deviation of the accuracy scores
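The verbalizer step mentioned above can be illustrated with a small mapping function. This sketch assumes a verbalizer can be approximated by case-insensitive exact matching with a fuzzy-match fallback from Python's standard `difflib`; the function name, the cutoff and the toy ontology are hypothetical, and the paper's verbalizers may be built differently.

```python
import difflib
from typing import List, Optional


def verbalize(generated: str, ontology_values: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Map free-form model output to the closest allowed ontology value, if any."""
    text = generated.strip().lower()
    lowered = {v.lower(): v for v in ontology_values}
    if text in lowered:
        return lowered[text]
    # Fall back to the closest string match above the similarity cutoff.
    matches = difflib.get_close_matches(text, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


price_values = ["cheap", "moderate", "expensive"]
exact = verbalize("Moderate ", price_values)
fuzzy = verbalize("expensiv", price_values)
unknown = verbalize("zzz", price_values)
```

Returning `None` for unmatched output leaves the caller free to decide whether to drop the slot or keep the raw generation.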
