Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Doc2Query is a technique used to improve the first-stage retrieval effectiveness of search engines.
- Sequence-to-sequence models are known to “hallucinate” content that is not present in the source text.
- This work explores techniques for filtering out these harmful queries prior to indexing.
- Using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query.
- Code, data, and a live demonstration are available at https://github.com/terrierteam/pyterrier_doc2query.
Paper Content
Introduction
- Neural network models can improve search effectiveness
- Some approaches focus on re-ranking document sets, others on improving the first stage
- Doc2Query is a first-stage approach that uses a sequence-to-sequence model to generate queries
- This approach has been shown to be effective, but can generate content that does not reflect the input text
- This paper proposes a filtering phase to remove poor queries prior to indexing
Related work
- The classical lexical mismatch problem is a key issue in information retrieval.
- Various approaches have been used to address this problem, including query reformulation, query expansion models, and document expansion.
- Recently, transformer-based language models have been used to represent text in embedding spaces.
- Doc2Query is a sequence-to-sequence model that maps a document to queries that it might be able to answer.
- Doc2Query has a generation phase and a filtering phase.
- The relevance threshold is determined by the distribution of relevance scores across all expansion queries.
Experimental setup
- Conducted experiments to answer two research questions
- Used MS MARCO v1 passage corpus and five test collections
- Evaluated using Reciprocal Rank at 10 and nDCG@10
- Used T5 Doc2Query model from Nogueira and Lin
- Three neural relevance models for filtering: ELECTRA, MonoT5, and TCT-ColBERT
- Used PyTerrier toolkit with PISA index
- Inference conducted on NVIDIA 3090 GPU
Results
- Relevance filtering can improve the retrieval of Doc2Query models
- All filters significantly improved performance on Dev and Dev2 datasets
- Performance on Eval dataset also improved
- Relevance filtering improved retrieval effectiveness at each value of n
- Filtering reduces index size and query processing time
- Filtering increases GPU time but improved effectiveness makes up for cost
- Doc2Query–provides higher effectiveness at lower query-time costs
Conclusions
- Doc2Query is a new approach for improving the effectiveness and efficiency of document expansion
- 16% improvement in retrieval effectiveness can be achieved with Doc2Query
- Doc2Query reduces index size by 48% and mean query execution time by 30%
- Relevance filtering could potentially apply to other approaches such as generating alternative forms of queries, training data, or natural language responses