Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Various techniques have been developed to improve dense retrieval.
  • Existing dense retrievers (DRs) often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval.
  • A generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size.
  • Common data augmentation practices are often inefficient and sub-optimal.
  • DRAGON is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations.

Paper Content

Introduction

  • Bi-encoder based neural retrievers allow documents to be pre-computed and stored
  • Training data is often scarce in real-world scenarios
  • SPLADE++ and ColBERTv2 use more expressive representations
  • Dense retrieval (DR) is a simpler bi-encoder retrieval model
  • Pretraining, query augmentation and distillation can improve DR effectiveness
  • Tradeoff between supervised and zero-shot effectiveness
  • GTR-XXL breaks the effectiveness tradeoff but is inefficient
  • Unified framework of data augmentation for contrastive learning
  • Key to training a generalizable DR is to create diverse relevance labels
  • Cheap and large-scale augmented queries can be used instead of neural generative queries
  • Progressive label augmentation strategy proposed
  • DRAGON breaks the supervised and zero-shot effectiveness tradeoff without increasing model size

Background

  • Retrieval task and contrastive learning approach for dense retrieval introduced
  • Unified framework for understanding recent approaches to improve dense retrieval training provided

Training dense retrieval models

  • Task is to retrieve documents to maximize ranking metrics
  • Dense retrieval uses bi-encoder architecture and dot product between encoded vectors as similarity score
  • Contrastive learning used to train DR models by contrasting positive pairs against negatives
  • Data augmentation used to increase the number of training queries and improve data quality
  • Query augmentation includes sentence cropping and pseudo query generation
  • Supervised label augmentation used on augmented queries
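
The contrastive objective described above can be sketched as follows, assuming an InfoNCE-style softmax loss over dot-product scores. The function names (`dot`, `contrastive_loss`) and the toy vectors are illustrative and not taken from the paper; in a real system the vectors would come from the query and passage encoders.

```python
import math

def dot(u, v):
    """Dot product, used as the bi-encoder similarity score."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(query_vec, positive_vec, negative_vecs):
    """Negative log-softmax of the positive score against all candidates."""
    scores = [dot(query_vec, positive_vec)]
    scores += [dot(query_vec, n) for n in negative_vecs]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_score = max(scores)
    log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_z - scores[0]

# Toy example: the positive aligns with the query, the negative does not.
loss_good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Here `loss_good` is smaller than `loss_bad`, since the loss shrinks as the positive pair scores higher than the negatives.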

Settings for empirical studies

  • Introduce basic experimental settings in Section 3
  • Detailed settings in Section 4
  • Use MS MARCO and BEIR datasets
  • Evaluate models on MS MARCO Dev
  • Report MRR@10 and Recall@1000
  • Use BEIR for zero-shot evaluations
  • Report averaged nDCG@10 over BEIR-13
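
For readers less familiar with the metrics listed above, here is a minimal per-query sketch. It assumes each ranking contains unique passage ids and that nDCG uses a linear gain with a log2 discount; the function names are illustrative, and reported numbers would normally come from a standard evaluation toolkit rather than code like this.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG with linear gain; `gains` maps passage id -> graded relevance."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Each function scores a single query; the reported figure is the mean over all queries in a dataset, and for BEIR the per-dataset nDCG@10 values are then averaged.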

Pilot studies on data augmentation

  • Discuss exploration space of data augmentation
  • Conduct empirical studies on how to better train a dense retriever
  • Propose data augmentation recipe to train DRAGON

An exploration of data augmentation

  • Query augmentation uses sentences from MS MARCO corpus and synthetic queries from docT5query
  • Label augmentation uses multiple sources of supervision from existing sparse, dense and multi-vector retrievers
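
Sentence-based query augmentation can be illustrated with a small sketch. The regex sentence splitter and the `min_words` filter are assumptions made for this example, not details from the paper; generating synthetic queries with docT5query requires a trained model and is not shown.

```python
import re

def crop_sentences(passage, min_words=3):
    """Split a passage into sentences and keep those long enough to act as queries."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [s for s in sentences if len(s.split()) >= min_words]

passage = "Dense retrieval encodes text. It is fast! Ok."
pseudo_queries = crop_sentences(passage)
```

Each surviving sentence serves as an augmented query. Its relevance labels are then produced by a teacher retriever, as described in the label augmentation bullet above.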

Training with diverse supervisions

  • Introduced search space for query and label augmentation
  • Training dense retriever on augmented data is not trivial
  • Need to create supervised training data using a teacher from augmented queries
  • Need to train dense retriever to digest multiple supervisions
  • Strategies to train dense retriever with diverse supervisions: linear score fusion, uniform supervision, progressive supervision
  • Train dense retriever using contrastive loss
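
The three strategies named above can be sketched roughly as follows. This is one possible reading, not the paper's implementation: min-max normalization before fusion, equal-length progressive stages, and all function names are assumptions made for illustration.

```python
import random

def min_max_normalize(scores):
    """Rescale one teacher's scores to [0, 1] so teachers become comparable."""
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    return {pid: (s - low) / span if span > 0 else 0.0
            for pid, s in scores.items()}

def fuse_scores(teacher_scores, weights):
    """Linear score fusion: weighted sum of normalized scores per passage."""
    fused = {}
    for scores, weight in zip(teacher_scores, weights):
        for pid, s in min_max_normalize(scores).items():
            fused[pid] = fused.get(pid, 0.0) + weight * s
    return fused

def uniform_teacher(teachers, rng):
    """Uniform supervision: pick one teacher at random for this step."""
    return rng.choice(teachers)

def progressive_teachers(trajectory, epoch, total_epochs):
    """Progressive supervision: unlock one more teacher per equal-length stage."""
    stage = min(epoch * len(trajectory) // total_epochs, len(trajectory) - 1)
    return trajectory[: stage + 1]

trajectory = ["uniCOIL", "Contriever", "ColBERTv2"]
rng = random.Random(0)
# Within a stage, sample uniformly among the teachers unlocked so far.
teacher = uniform_teacher(progressive_teachers(trajectory, 7, 20), rng)
```

With 20 epochs and three teachers, this schedule uses only the first teacher for epochs 0 to 6, the first two for epochs 7 to 13, and all three afterwards.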

Empirical studies

  • Empirical studies conducted on how to better train a dense retriever
  • Three teachers used for supervised labels: uniCOIL, Contriever, ColBERTv2
  • Models trained with single and multiple sources of supervision
  • Progressive supervision used: uniCOIL → Contriever → ColBERTv2
  • Learning from diverse relevance labels from multiple retrievers is key to gaining generalization capability
  • Different supervision trajectories affect models’ zero-shot retrieval effectiveness
  • The number of augmented queries is key to successful training
  • Mixture of cropped sentences and generative queries yields strong retrieval effectiveness
  • Final recipe proposed: 20 epochs, trajectory of progressive supervision, mixture of cropped sentences and synthetic queries
  • DRAGON variants: DRAGON-S, DRAGON-Q, DRAGON+
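
Mixing the two query types could look like the sketch below. The 50/50 default ratio, the `mix_queries` helper, and the toy query strings are assumptions for illustration; the summary above does not state the exact mixing proportions.

```python
import random

def mix_queries(cropped, synthetic, cropped_fraction=0.5, seed=0):
    """Sample a shuffled mixture of cropped-sentence and synthetic queries."""
    rng = random.Random(seed)
    total = 2 * min(len(cropped), len(synthetic))
    # Clamp so we never request more queries than a pool contains.
    n_cropped = min(len(cropped), round(total * cropped_fraction))
    n_synthetic = min(len(synthetic), total - n_cropped)
    mixed = rng.sample(cropped, n_cropped) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed

mixed = mix_queries(["c1", "c2", "c3"], ["s1", "s2"])
```

Fixing the seed keeps the mixture reproducible across runs, which helps when comparing training recipes.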

Comparison with the state of the art

Datasets

  • Evaluated model supervised effectiveness on TREC DL queries
  • TREC DL queries have on average 95 graded relevance labels per query
  • Evaluated models on 18 datasets in BEIR
  • Evaluated models on LoTTE, which consists of questions and answers from 5 topics
  • Reported retrieval effectiveness of Success@5 on search and forum queries
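
Success@5 is simple to state in code. The sketch below assumes `runs` maps each query id to a ranked list of passage ids and `qrels` maps each query id to its set of relevant passage ids; both names are illustrative.

```python
def success_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant passage appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def mean_success_at_k(runs, qrels, k=5):
    """Average Success@k over all queries that have a ranking."""
    if not runs:
        return 0.0
    total = sum(success_at_k(ranking, qrels.get(qid, set()), k)
                for qid, ranking in runs.items())
    return total / len(runs)

runs = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
qrels = {"q1": {"c"}, "q2": {"z"}}
score = mean_success_at_k(runs, qrels)
```

In this toy example one of the two queries has a relevant passage in its top 5, so `score` is 0.5.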

Baseline models

  • DRAGON is compared to dense retrievers using bert-base-uncased
  • Knowledge Distillation, Contrastive Pre-Training, Masked Auto-Encoding Pre-Training and Domain Adaptation techniques are used
  • Different corpora are used for pre-training
  • MS MARCO training queries are used for fine-tuning
  • Results are reported from original papers and pyserini

Implementation details

  • Trained dense retrievers using 32 A100 GPUs with batch size of 64 and learning rate of 3e-5
  • Used asymmetric dual encoder with two distinctly parameterized encoders and in-batch negative mining
  • Maximum query and passage lengths set to 32 and 128 for MS MARCO training and evaluation
  • Maximum input lengths set to 512 for BEIR evaluation
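
In-batch negative mining can be illustrated with a toy sketch: for query i, passage i in the same batch is treated as the positive and every other passage in the batch as a negative. The vectors here are hand-written stand-ins for the outputs of the two encoders, and `in_batch_loss` is an illustrative name rather than code from the paper.

```python
import math

def in_batch_loss(query_vecs, passage_vecs):
    """Mean softmax cross-entropy where passage i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        # Score the query against every passage in the batch (dot product).
        scores = [sum(a * b for a, b in zip(query, passage))
                  for passage in passage_vecs]
        max_score = max(scores)
        log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower for `aligned` than for `swapped`, because in the aligned batch each query scores highest on its own passage. Larger batches supply more negatives per query at no extra encoding cost.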

Results

  • Baseline dense retrievers perform well on MS MARCO dev set but not on TREC DL queries
  • DRAGON variants show strong retrieval effectiveness on MS MARCO dev and TREC DL queries
  • Dense retrievers perform better in zero-shot evaluations than MS MARCO dev queries
  • DRAGON variants outperform dense retrievers in zero-shot evaluations
  • DRAGON+ combined with MAE pre-training sees further improvement on zero-shot evaluations
  • Mixing different types of queries can mitigate the issue of augmented queries differing greatly from human-like queries

Discussions

  • DRAGON-S is trained with augmented relevance labels from a cross encoder
  • Re-ranking with the cross encoder does not improve retrieval effectiveness
  • DRAGON-S benefits from masked autoencoding rather than contrastive pre-training
  • It is challenging to normalize relevance scores from a sparse and dense retriever
  • Sentence cropping yields a generalizable dense retriever
  • Cropped sentences provide diverse queries from a passage
  • Cropped sentences have more unique augmented relevant passages than generative queries

Related work

  • Knowledge Distillation is a technique used to improve the effectiveness of DR
  • Previous work used soft labels from KD and relevant passages labeled by humans
  • Recent work mines more positive samples using cross encoders to augment labels
  • Curriculum Learning is used to improve machine learning tasks
  • Pre-training techniques include contrastive pre-training and masked auto encoding pre-training
  • Supervised contrastive learning is used on artificially created text pairs

Conclusion

  • We present DRAGON, a dense retriever trained with diverse data augmentation
  • We propose a unified framework of data augmentation to understand recent progress of training dense retrievers
  • We study how to improve dense retrieval training through query and relevance label augmentation
  • We propose a diverse data augmentation recipe, query augmentation with the mixture of sentence cropping and generative queries, and progressive relevance label augmentation with multiple teachers
  • We demonstrate that a single BERT-base-sized dense retriever can achieve state-of-the-art effectiveness in both supervised and zero-shot retrieval tasks

Appendix

  • We study the impact of top-k positive sampling
  • We observe that treating top-10 passages from each teacher as positives yields the best supervised and zero-shot effectiveness
  • We estimate the accuracy and diversity of supervision
  • We compare with existing state-of-the-art dense retrievers
  • We compare uniform and progressive supervision effectiveness
  • We list data statistics of MS MARCO dataset and DRAGONs’ detailed effectiveness on LoTTE
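
Top-k positive sampling from a teacher's ranking might be sketched as below. Treating the top 10 passages as positives follows the bullet above; drawing negatives from the remainder of the ranked list, and the `build_training_example` helper itself, are assumptions made for this illustration.

```python
import random

def build_training_example(query, ranked_ids, top_k=10, rng=None):
    """Sample one positive from the teacher's top k and one negative from the rest."""
    rng = rng or random.Random(0)
    positives = ranked_ids[:top_k]
    negatives = ranked_ids[top_k:]  # assumption: lower-ranked passages act as negatives
    if not positives or not negatives:
        return None  # ranking too short to provide both labels
    return {
        "query": query,
        "positive": rng.choice(positives),
        "negative": rng.choice(negatives),
    }

ranking = [f"p{i}" for i in range(50)]
example = build_training_example("what is dense retrieval", ranking,
                                 rng=random.Random(1))
```

Resampling on every epoch lets a query see different positives from the same teacher, which is one way such labels can add diversity.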