Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Doc2Query is a technique used to improve the first-stage retrieval effectiveness of search engines.
  • Sequence-to-sequence models are known to “hallucinate” content that is not present in the source text.
  • This work explores techniques for filtering out hallucinated, harmful queries prior to indexing.
  • Using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query.
  • Code, data, and a live demonstration are available at https://github.com/terrierteam/pyterrier_doc2query.

Paper Content

Introduction

  • Neural network models can improve search effectiveness
  • Some approaches focus on re-ranking document sets, others on improving the first stage
  • Doc2Query is a first-stage approach that uses a sequence-to-sequence model to generate queries
  • This approach has been shown to be effective, but can generate content that does not reflect the input text
  • This paper proposes a filtering phase to remove poor queries prior to indexing
  • The classical lexical mismatch problem is a key issue in information retrieval.
  • Various approaches have been used to address this problem, including query reformulation, query expansion models, and document expansion.
  • Recently, transformer-based language models have been used to represent text in embedding spaces.
  • Doc2Query is a sequence-to-sequence model that maps a document to queries that it might be able to answer.
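  • Doc2Query-- extends Doc2Query with two phases: a generation phase that produces candidate queries and a filtering phase that scores each query against its source document with a relevance model.
  • The filtering threshold is global: it is derived from the distribution of relevance scores across all expansion queries rather than set per document.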
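
The filtering phase described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation or the pyterrier_doc2query API: the function names `global_threshold` and `filter_expansions` and the parameter `drop_fraction` are hypothetical, and the relevance scores are assumed to have been computed already by some relevance model.

```python
from typing import Dict, List, Tuple

# Each document maps to its generated queries and their (precomputed) relevance scores.
Expansions = Dict[str, List[Tuple[str, float]]]


def global_threshold(scores: List[float], drop_fraction: float) -> float:
    """Return the score that separates the lowest `drop_fraction` of queries
    from the rest, computed over the scores of ALL expansion queries."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    ranked = sorted(scores)
    cut = int(len(ranked) * drop_fraction)  # number of queries to discard
    return ranked[cut]


def filter_expansions(expansions: Expansions, drop_fraction: float) -> Dict[str, List[str]]:
    """Keep only queries whose score reaches the global threshold."""
    all_scores = [score for queries in expansions.values() for _, score in queries]
    threshold = global_threshold(all_scores, drop_fraction)
    return {
        doc_id: [query for query, score in queries if score >= threshold]
        for doc_id, queries in expansions.items()
    }


if __name__ == "__main__":
    # Toy data; the scores are made up for illustration.
    toy = {
        "d1": [("what is doc2query", 0.9), ("capital of france", 0.1)],
        "d2": [("how does bm25 work", 0.7), ("bm25 inventor birthday", 0.3)],
    }
    print(filter_expansions(toy, drop_fraction=0.5))
    # {'d1': ['what is doc2query'], 'd2': ['how does bm25 work']}
```

Because the threshold is shared across the corpus, a document whose queries all score poorly can lose every expansion, while a document with uniformly good queries keeps all of them.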

Experimental setup

  • Conducted experiments to answer two research questions
  • Used MS MARCO v1 passage corpus and five test collections
  • Evaluated using Reciprocal Rank at 10 (RR@10) and nDCG@10
  • Used T5 Doc2Query model from Nogueira and Lin
  • Three neural relevance models for filtering: ELECTRA, MonoT5, and TCT-ColBERT
  • Used PyTerrier toolkit with PISA index
  • Inference conducted on NVIDIA 3090 GPU
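
For readers unfamiliar with the two evaluation measures, the sketch below computes them for a single query. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, any label above 0 is treated as relevant for RR@10 (some test collections use a higher cutoff), and nDCG uses linear gain (some tools use an exponential gain instead).

```python
import math
from typing import Dict, List


def reciprocal_rank_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """1 / rank of the first relevant document in the top k, or 0 if there is none."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if qrels.get(doc_id, 0) > 0:  # assumption: label > 0 counts as relevant
            return 1.0 / rank
    return 0.0


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of an ideal top-k ranking."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ranking = ["a", "b", "c"]  # system output, best first
    qrels = {"b": 1}           # relevance judgments for this query
    print(reciprocal_rank_at_k(ranking, qrels))  # 0.5
    print(ndcg_at_k(ranking, qrels))
```

Reported scores are the mean of these per-query values over all queries in a test collection.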

Results

  • Relevance filtering can improve the retrieval of Doc2Query models
  • All filters significantly improved performance on Dev and Dev2 datasets
  • Performance on Eval dataset also improved
  • Relevance filtering improved retrieval effectiveness at each value of n (the number of queries generated per document)
  • Filtering reduces index size and query processing time
  • Filtering adds GPU time before indexing, but the improved effectiveness offsets this cost
  • Doc2Query-- provides higher effectiveness at lower query-time costs

Conclusions

  • Doc2Query-- is a new approach for improving the effectiveness and efficiency of document expansion
  • An improvement in retrieval effectiveness of up to 16% can be achieved with Doc2Query--
  • Doc2Query-- reduces index size by 48% and mean query execution time by 30%
  • Relevance filtering could potentially apply to other approaches such as generating alternative forms of queries, training data, or natural language responses