Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Various techniques have been developed to improve dense retrieval (DR).
Existing dense retrievers often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval.
A generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size.
Common data augmentation practices are often inefficient and sub-optimal.
DRAGON is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations.

Paper Content

Introduction

Bi-encoder based neural retrievers allow documents to be pre-computed and stored
Training data is often scarce in real-world scenarios
SPLADE++ and ColBERTv2 use more expressive representations
Dense retrieval (DR) is a simpler bi-encoder retrieval model
Pretraining, query augmentation and distillation can improve DR effectiveness
Tradeoff between supervised and zero-shot effectiveness
GTR-XXL breaks the effectiveness tradeoff but is inefficient
Unified framework of data augmentation for contrastive learning
Key to training a generalizable DR is to create diverse relevance labels
Cheap and large-scale augmented queries can be used instead of neural generative queries
Progressive label augmentation strategy proposed
DRAGON breaks the supervised and zero-shot effectiveness tradeoff without increasing model size

Background

Retrieval task and contrastive learning approach for dense retrieval introduced
Unified framework for understanding recent approaches to improve dense retrieval training provided

Training dense retrieval models

Task is to retrieve documents to maximize ranking metrics
Dense retrieval uses bi-encoder architecture and dot product between encoded vectors as similarity score
Contrastive learning used to train DR models by contrasting positive pairs against negatives
Data augmentation used to increase the number of training queries and improve data quality
Query augmentation includes sentence cropping and pseudo query generation
Supervised label augmentation used on augmented queries
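
The scoring and loss described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy vectors stand in for encoder outputs, and the function names `dot` and `contrastive_loss` are hypothetical. It computes the negative log-likelihood of the positive passage under a softmax over dot-product scores.

```python
import math

def dot(u, v):
    """Dot product between two equal-length vectors (the similarity score)."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(query_vec, positive_vec, negative_vecs):
    """Negative log-likelihood of the positive under a softmax over all scores."""
    scores = [dot(query_vec, positive_vec)]
    scores += [dot(query_vec, n) for n in negative_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[0]

# Toy 3-dimensional embeddings standing in for encoder outputs (assumed values).
query = [1.0, 0.0, 1.0]
positive = [0.9, 0.1, 0.8]
negatives = [[-0.5, 0.3, 0.0], [0.0, 1.0, -0.2]]
loss = contrastive_loss(query, positive, negatives)
```

The loss shrinks as the positive's score rises above the negatives' scores, which is what pushes the two encoders to place relevant passages close to the query.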

Settings for empirical studies

Introduce basic experimental settings in Section 3
Detailed settings in Section 4
Use MS MARCO and BEIR datasets
Evaluate models on MS MARCO Dev
Report MRR@10 and Recall@1000
Use BEIR for zero-shot evaluations
Report averaged nDCG@10 over BEIR-13
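
For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute them for a single query. The function names and toy ids are hypothetical, ranked lists are assumed to contain no duplicates, and nDCG is written with linear gain, which is one of several conventions in use.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG with linear gain; gains maps doc id to graded relevance."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7"]          # toy system output
relevant = {"d1", "d9"}              # toy judgments
gains = {"d1": 1, "d9": 1}
```

Dataset-level numbers are the mean of these per-query values over all evaluation queries.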

Pilot studies on data augmentation

Discuss exploration space of data augmentation
Conduct empirical studies on how to better train a dense retriever
Propose data augmentation recipe to train DRAGON

An exploration of data augmentation

Query augmentation uses sentences from MS MARCO corpus and synthetic queries from docT5query
Label augmentation uses multiple sources of supervision from existing sparse, dense and multi-vector retrievers
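
Sentence cropping needs no generative model, which is what makes it cheap at corpus scale. The sketch below is one naive way to do it, assuming a regex-based sentence split and a hypothetical `min_words` filter; the paper's exact cropping procedure is not given in this summary.

```python
import re

def crop_sentences(passage, min_words=3):
    """Split a passage into sentences and keep each one as a pseudo query."""
    # Naive split after ., ! or ? followed by whitespace (an assumption).
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    # Drop very short fragments that carry little query-like content.
    return [p for p in parts if len(p.split()) >= min_words]

passage = ("Dense retrieval encodes queries and passages into vectors. "
           "Scores are dot products. It scales well with approximate search!")
queries = crop_sentences(passage)
```

Each returned sentence can then be sent to a teacher retriever to obtain relevance labels, exactly as a human-written query would be.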

Training with diverse supervisions

Introduced search space for query and label augmentation
Training dense retriever on augmented data is not trivial
Need to create supervised training data using a teacher from augmented queries
Need to train dense retriever to learn from multiple sources of supervision
Strategies to train dense retriever with diverse supervisions: linear score fusion, uniform supervision, progressive supervision
Train dense retriever using contrastive loss
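
Uniform and progressive supervision differ only in which teacher supplies the labels at a given training step. The sketch below shows one possible reading, assuming training is split into equal stages and that stage n samples uniformly from the first n teachers; the stage layout and function names are assumptions, not details confirmed by this summary.

```python
import random

def uniform_teacher(teachers, rng):
    """Uniform supervision: any teacher may label any training step."""
    return rng.choice(teachers)

def progressive_teacher(teachers, step, total_steps, rng):
    """Progressive supervision: unlock one more teacher at each stage."""
    stage_len = max(1, total_steps // len(teachers))
    stage = min(step // stage_len, len(teachers) - 1)
    # Sample uniformly among the teachers unlocked so far.
    return rng.choice(teachers[: stage + 1])

teachers = ["uniCOIL", "Contriever", "ColBERTv2"]
rng = random.Random(0)
schedule = [progressive_teacher(teachers, s, 9, rng) for s in range(9)]
```

With nine steps and three teachers, the first three steps always use the first teacher, and later steps draw from a growing pool.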

Empirical studies

Empirical studies conducted on how to better train a dense retriever
Three teachers used for supervised labels: uniCOIL, Contriever, ColBERTv2
Models trained with single and multiple sources of supervision
Progressive supervision used: uniCOIL → Contriever → ColBERTv2
Learning from diverse relevance labels from multiple retrievers is key to gaining generalization capability
Different supervision trajectories affect models’ zero-shot retrieval effectiveness
Number of augmented queries is key to successful training
Mixture of cropped sentences and generative queries yields strong retrieval effectiveness
Final recipe proposed: 20 epochs, trajectory of progressive supervision, mixture of cropped sentences and synthetic queries
DRAGON variants: DRAGON-S, DRAGON-Q, DRAGON+
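
Mixing the two query types comes down to sampling from both pools. The sketch below is illustrative only: the 50/50 default for `cropped_fraction`, the `size` parameter, and the function name are assumptions, since this summary does not state the mixing ratio.

```python
import random

def mix_queries(cropped, generated, cropped_fraction=0.5, size=None, seed=0):
    """Sample a training query pool from two augmentation sources."""
    rng = random.Random(seed)
    if size is None:
        size = 2 * min(len(cropped), len(generated))
    # Cap each share by what the corresponding source can supply.
    n_cropped = min(len(cropped), round(size * cropped_fraction))
    n_generated = min(len(generated), size - n_cropped)
    pool = rng.sample(cropped, n_cropped) + rng.sample(generated, n_generated)
    rng.shuffle(pool)  # interleave the two query types
    return pool

cropped = [f"c{i}" for i in range(10)]     # stand-ins for cropped sentences
generated = [f"g{i}" for i in range(10)]   # stand-ins for synthetic queries
pool = mix_queries(cropped, generated, cropped_fraction=0.5, size=8)
```

Setting `cropped_fraction` to 1.0 or 0.0 gives a pool built from a single query type, which is a convenient way to compare single-source and mixed training.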

Comparison with the state of the art

Datasets

Evaluated model supervised effectiveness on TREC DL queries
TREC DL queries have on average 95 graded relevance labels per query
Evaluated models on 18 datasets in BEIR
Evaluated models on LoTTE, which consists of questions and answers from 5 topics
Reported retrieval effectiveness of Success@5 on search and forum queries
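
Success@5 counts a query as solved when at least one relevant passage appears in the top 5 results. A minimal sketch follows, with hypothetical function names and toy ids:

```python
def success_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def mean_success_at_k(runs, qrels, k=5):
    """runs: query id -> ranked doc ids; qrels: query id -> relevant ids."""
    scores = [success_at_k(runs[q], qrels.get(q, set()), k) for q in runs]
    return sum(scores) / len(scores) if scores else 0.0

runs = {"q1": ["a", "b", "c", "d", "e", "f"], "q2": ["x", "y", "z"]}
qrels = {"q1": {"f"}, "q2": {"y"}}
score = mean_success_at_k(runs, qrels)
```

Here `q1` fails because its only relevant passage sits at rank 6, while `q2` succeeds at rank 2.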

Baseline models

DRAGON is compared to dense retrievers using bert-base-uncased
Knowledge Distillation, Contrastive Pre-Training, Masked Auto-Encoding Pre-Training and Domain Adaptation techniques are used
Different corpora are used for pre-training
MS MARCO training queries are used for fine-tuning
Results are reported from original papers and pyserini

Implementation details

Trained dense retrievers using 32 A100 GPUs with batch size of 64 and learning rate of 3e-5
Used asymmetric dual encoder with two distinctly parameterized encoders and in-batch negative mining
Maximum query and passage lengths set to 32 and 128 for MS MARCO training and evaluation
Maximum input lengths set to 512 for BEIR evaluation
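
With in-batch negatives, every other passage in the batch serves as a negative for a given query, so no separate negative sampling is needed. The sketch below assumes the query and passage vectors come from the two separate encoders and uses toy values; `in_batch_loss` is a hypothetical name.

```python
import math

def in_batch_loss(query_vecs, passage_vecs):
    """Mean cross-entropy where passage i is the positive for query i."""
    losses = []
    for i, q in enumerate(query_vecs):
        # Score the query against every passage in the batch.
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]     # toy query-encoder outputs
passages = [[2.0, 0.0], [0.0, 2.0]]    # toy passage-encoder outputs
aligned = in_batch_loss(queries, passages)
shuffled = in_batch_loss(queries, passages[::-1])
```

The aligned batch yields a much lower loss than the shuffled one, since each query scores highest against its own passage.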

Results

Baseline dense retrievers perform well on MS MARCO dev set but not on TREC DL queries
DRAGON variants show strong retrieval effectiveness on MS MARCO dev and TREC DL queries
Dense retrievers perform better in zero-shot evaluations than MS MARCO dev queries
DRAGON variants outperform dense retrievers in zero-shot evaluations
DRAGON+ combined with MAE pre-training sees further improvement on zero-shot evaluations
Mixing different types of queries can mitigate the issue of queries being far different from human-like queries

Discussions

DRAGON-S is trained with augmented relevance labels from a cross encoder
Re-ranking with the cross encoder does not improve retrieval effectiveness
DRAGON-S benefits from masked autoencoding rather than contrastive pre-training
It is challenging to normalize relevance scores from a sparse and dense retriever
Sentence cropping yields a generalizable dense retriever
Cropped sentences provide diverse queries from a passage
Cropped sentences have more unique augmented relevant passages than generative queries
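
The normalization difficulty noted above arises because sparse and dense retrievers score on unrelated scales. The sketch below shows min-max normalization followed by a weighted sum, purely as an illustration of linear score fusion; it is an assumed scheme, not the method evaluated in the paper, and it expects non-empty score dictionaries.

```python
def min_max(scores):
    """Rescale one retriever's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_fusion(score_lists, weights=None):
    """Rank passages by a weighted sum of normalized scores."""
    weights = weights or [1.0 / len(score_lists)] * len(score_lists)
    fused = {}
    for w, scores in zip(weights, score_lists):
        for d, s in min_max(scores).items():
            # Passages missing from a retriever's list contribute 0 for it.
            fused[d] = fused.get(d, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

sparse = {"p1": 12.0, "p2": 8.0, "p3": 4.0}   # toy sparse-retriever scores
dense = {"p2": 0.9, "p3": 0.8, "p4": 0.1}     # toy dense-retriever scores
ranking = linear_fusion([sparse, dense])
```

The result depends heavily on the normalization and weights chosen, which is one reason sampling a single teacher per step can be simpler than fusing scores.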

Related work

Knowledge Distillation is a technique used to improve the effectiveness of DR
Previous work used soft labels from KD and relevant passages labeled by humans
Recent work mines more positive samples using cross encoders to augment labels
Curriculum Learning is used to improve machine learning tasks
Pre-training techniques include contrastive pre-training and masked auto encoding pre-training
Supervised contrastive learning is used on artificially created text pairs

Conclusion

We present DRAGON, a dense retriever trained with diverse data augmentation
We propose a unified framework of data augmentation to understand recent progress of training dense retrievers
We study how to improve dense retrieval training through query and relevance label augmentation
We propose a diverse data augmentation recipe, query augmentation with the mixture of sentence cropping and generative queries, and progressive relevance label augmentation with multiple teachers
We demonstrate that a single BERT-base-sized dense retriever can achieve state-of-the-art effectiveness in both supervised and zero-shot retrieval tasks
We study the impact of top-k positive sampling
We observe that treating top-10 passages from each teacher as positives yields the best supervised and zero-shot effectiveness
We estimate the accuracy and diversity of supervision
We compare with existing state-of-the-art dense retrievers
We compare uniform and progressive supervision effectiveness
We list data statistics of MS MARCO dataset and DRAGONs’ detailed effectiveness on LoTTE
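
Top-k positive sampling, mentioned above, can be sketched as follows. Treating the teacher's top 10 passages as positives follows the text; drawing hard negatives from the next `negative_window` ranks is an assumption made for illustration, as are the function and parameter names.

```python
import random

def sample_pair(teacher_ranking, top_k=10, negative_window=40, rng=None):
    """Sample one (positive, hard negative) pair from a teacher's ranked list."""
    rng = rng or random.Random(0)
    positives = teacher_ranking[:top_k]
    # Assumed choice: negatives come from the ranks just below the positives.
    negatives = teacher_ranking[top_k:top_k + negative_window]
    if not positives or not negatives:
        raise ValueError("ranking too short to sample a positive and a negative")
    return rng.choice(positives), rng.choice(negatives)

ranking = [f"p{i}" for i in range(100)]   # toy teacher ranking, best first
positive, negative = sample_pair(ranking)
```

Raising `top_k` gives more varied positives at the cost of noisier labels, which is the tradeoff the top-k study examines.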
