Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

InPars introduced a method to use LLMs in information retrieval tasks
InPars-v2 uses open-source LLMs and existing rerankers to generate synthetic query-document pairs
A BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark
Code, synthetic data, and finetuned models are open-sourced
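
To make the two-stage pipeline from the abstract concrete, here is a minimal sketch of "retrieve with BM25, then rerank". It is not the authors' code. The `BM25` class is a tiny in-memory implementation written for illustration, the `k1` and `b` values are placeholder assumptions, and `toy_rerank_score` is a hypothetical stand-in for a neural reranker such as monoT5, which would instead score each query-document pair with a finetuned model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumption made for brevity)."""
    return text.lower().split()


class BM25:
    """Tiny in-memory Okapi BM25 index, for illustration only."""

    def __init__(self, docs: Dict[str, str], k1: float = 0.9, b: float = 0.4):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.doc_len = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        self.avg_len = sum(self.doc_len.values()) / len(docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tf.values() for term in tf)
        self.n_docs = len(docs)

    def score(self, query: str, doc_id: str) -> float:
        tf = self.tf[doc_id]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = self.df[term]
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            total += idf * tf[term] * (self.k1 + 1) / (tf[term] + norm)
        return total

    def search(self, query: str, k: int) -> List[Tuple[str, float]]:
        """Return up to k documents with a positive BM25 score, best first."""
        scored = [(doc_id, self.score(query, doc_id)) for doc_id in self.tf]
        scored = [(doc_id, s) for doc_id, s in scored if s > 0]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]


def toy_rerank_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a neural reranker: Jaccard term overlap."""
    q, d = set(tokenize(query)), set(tokenize(document))
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_and_rerank(
    query: str,
    docs: Dict[str, str],
    index: BM25,
    rerank_score: Callable[[str, str], float],
    k: int = 1000,
) -> List[Tuple[str, float]]:
    """Stage 1: BM25 candidates. Stage 2: re-score candidates with the reranker."""
    candidates = index.search(query, k)
    reranked = [(doc_id, rerank_score(query, docs[doc_id])) for doc_id, _ in candidates]
    return sorted(reranked, key=lambda x: x[1], reverse=True)
```

The reranker is passed in as a plain callable so that the toy scorer can be swapped for a real model without changing the pipeline. Only the documents returned by the first stage are ever scored by the second, which is what keeps an expensive reranker affordable.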

Data augmentation is a tool to improve AI models when there is not enough in-domain training data
Previous work used LLMs to generate synthetic training data for information retrieval models
Bonifacio et al. proposed InPars to generate queries from documents in the corpus using LLMs
Promptagator model uses dataset-specific prompts, a larger LLM and a fully trainable retrieval pipeline
This work extends Bonifacio et al. by using a reranker to select the best synthetically generated examples
An open-source query generator is used, and source code and data are provided to reproduce the results on TPUs
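
The filtering step described above, in which a reranker selects the best synthetically generated examples, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. `SyntheticPair`, `select_top_pairs`, and `toy_score` are hypothetical names, the `keep` cutoff is a free parameter, and `toy_score` is a simple term-overlap stand-in for the relevance score a real reranker would produce.

```python
from typing import Callable, List, NamedTuple


class SyntheticPair(NamedTuple):
    """A generated query together with the document it was generated from."""
    query: str
    document: str


def select_top_pairs(
    pairs: List[SyntheticPair],
    score_fn: Callable[[str, str], float],
    keep: int,
) -> List[SyntheticPair]:
    """Score every pair with the reranker and keep the `keep` highest-scoring ones."""
    scored = [(score_fn(p.query, p.document), p) for p in pairs]
    # Sort on the score only; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:keep]]


def toy_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query terms found in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


pairs = [
    SyntheticPair("what do cats eat", "cats eat fish and small birds"),
    SyntheticPair("capital of france", "paris is the capital of france"),
    SyntheticPair("best pizza topping", "the stock market fell today"),
]
selected = select_top_pairs(pairs, toy_score, keep=2)
for pair in selected:
    print(pair.query)
```

The pairs that survive the filter would then serve as positive examples for finetuning the reranker. The query that has nothing to do with its document is the kind of noisy generation the filter is meant to discard.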

Four systems are compared: BM25; monoT5-3B finetuned on MS MARCO; monoT5-3B finetuned on MS MARCO and then further finetuned on InPars-v1 data; and monoT5-3B finetuned on MS MARCO and then further finetuned on InPars-v2 data.
Results show that InPars-v2 is substantially better than InPars-v1 on TREC-News, Climate-FEVER, Robust and Touche.
Averaged over all BEIR datasets, the results are better than those of Promptagator and RankT5.