Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Reproducibility is an ideal that all researchers agree with
  • Reproducibility is difficult to achieve in practice
  • The author’s research group has had success building a “culture of reproducibility”
  • Reproducibility efforts should yield easy-to-use, well-packaged, and self-contained software artifacts
  • The primary beneficiaries of reproducibility efforts are those making the investments
  • Social processes and standardized tools are important for achieving reproducibility
  • The dogfood principle ties these ideas together

Paper Content

Introduction

  • Appeal to self interest instead of altruism
  • Engineer social processes to promote virtuous cycles
  • Build standardized tools to reduce technical barriers
  • Reproducibility is an ideal that no researcher would dispute
  • Today, it’s expected that each paper is accompanied by a code repository
  • Many researchers make model checkpoints publicly available
  • Voorhees et al. (2016) found that none of the 79 Open Runs were reproducible
  • Computational notebooks are often touted as a solution to reproducibility
  • Pimentel et al. (2019) found that only 24% of notebooks executed without errors
  • Reproducibility is a noble goal, but not obligatory
  • Competing priorities often take precedence
  • Goal is to offer a path forward towards building a culture of reproducibility
  • Science is a systematic attempt to accumulate and organize knowledge
  • Reproducibility and replicability are the mechanisms to accumulate knowledge
  • Funding for research is provided by governments via tax dollars
  • Researchers should share their results in the broadest way possible
  • FAIR principles: Findable, Accessible, Interoperable, and Reusable
  • Software artifacts should be responsibly and securely managed
  • Appeal to self interest to prioritize reproducibility
  • Investing in reproducibility helps a research group iterate more rapidly
  • Repeatability is a good first step towards reproducibility
  • Good reproducibility looks like repeatability + social processes + standardized tools

How?

  • Motivate importance of reproducibility by appealing to self interest
  • Engineer social processes to promote virtuous cycles
  • Build standardized tools to reduce technical barriers
  • Follow the “dogfood principle”: use your own “product”
  • Pyserini provides competitive baselines and foundation for first-stage retrieval
  • Pyserini makes it easy to run experiments on standard IR test collections
  • Self interest allows students and collaborators to build on each other’s results

Social processes: from repeatability to reproducibility

  • Experiments must be documented to be repeatable
  • Documentation is called a reproducibility guide
  • Guide contains command-line invocations and descriptions
  • Goal is to reproduce same results
  • Self interest and reciprocity motivate students to write good documentation
  • Reproduction guides provide onboarding paths for new students
  • Reproduction logs record successful reproductions and code versions used
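A reproduction log of the kind described above can be sketched as a small record type. This is only an illustration: the class name, the field names, and the one-line output format are assumptions, not the format the author's group actually uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReproductionLogEntry:
    """One successful reproduction of a guide (all field names are hypothetical)."""
    reproducer: str  # who ran the guide end to end
    when: date       # when the reproduction succeeded
    commit: str      # code version (for example a git commit hash) that was used

    def to_line(self) -> str:
        # Render the entry as a single list item that could be appended to the guide.
        return (
            f"+ Results reproduced by {self.reproducer} "
            f"on {self.when.isoformat()} (commit {self.commit})"
        )


def append_entry(log: list[str], entry: ReproductionLogEntry) -> list[str]:
    """Return a new log with the entry added; the existing log is left untouched."""
    return log + [entry.to_line()]


# Example with made-up values.
log = append_entry([], ReproductionLogEntry("alice", date(2023, 1, 15), "abc1234"))
print(log[0])
```

Recording the code version next to each entry is what makes a later failure diagnosable: if the guide stops working, the log shows the last version at which it was known to work.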

Standardized tools: from reproducibility to two-click reproductions

  • Reproducibility can be divided into social and technical aspects.
  • Social processes are more important than technical tools.
  • Technical tools can lower barriers to reproducibility.
  • Anserini and Pyserini have built regression tests with test harnesses.
  • Regression tests are integrated with two-click reproductions.
  • The run_regression.py script builds the index, verifies index statistics, performs retrieval, evaluates the outputs, and checks effectiveness.
  • Pyserini tests compare output with Anserini’s output.
  • Reproduction matrix organizes experimental conditions.
  • Reproduction matrix is backed by a script that iterates through all rows.
  • CI/CD framework is adapted to research.
  • Regression tests ensure code continues to generate expected results.
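The idea of a reproduction matrix backed by a script that iterates through its rows, with each row checked against expected effectiveness, can be sketched as follows. This is a minimal illustration under stated assumptions and not the actual Anserini or Pyserini harness: `Condition`, `check_condition`, `run_matrix`, and `fake_run` are hypothetical names, the tolerance is an arbitrary choice, and all scores are made up.

```python
from typing import Callable, Dict, List, NamedTuple


class Condition(NamedTuple):
    """One row of a reproduction matrix (fields are illustrative)."""
    name: str
    expected: Dict[str, float]  # metric name -> score the code is expected to produce


def check_condition(
    cond: Condition,
    run: Callable[[str], Dict[str, float]],
    tolerance: float = 1e-4,
) -> List[str]:
    """Run one condition and return mismatch descriptions (an empty list means pass)."""
    actual = run(cond.name)
    failures = []
    for metric, want in cond.expected.items():
        got = actual.get(metric)
        if got is None or abs(got - want) > tolerance:
            failures.append(f"{cond.name}: {metric} expected {want:.4f}, got {got}")
    return failures


def run_matrix(
    matrix: List[Condition],
    run: Callable[[str], Dict[str, float]],
    tolerance: float = 1e-4,
) -> List[str]:
    """Iterate through every row of the matrix and collect all failures."""
    failures: List[str] = []
    for cond in matrix:
        failures.extend(check_condition(cond, run, tolerance))
    return failures


def fake_run(name: str) -> Dict[str, float]:
    # Stand-in for "index, retrieve, evaluate"; the scores are invented for the example.
    scores = {
        "bm25": {"MAP": 0.2500, "nDCG@10": 0.4000},
        "bm25+rm3": {"MAP": 0.2800, "nDCG@10": 0.3900},
    }
    return scores[name]


matrix = [
    Condition("bm25", {"MAP": 0.2500, "nDCG@10": 0.4000}),
    Condition("bm25+rm3", {"MAP": 0.2800, "nDCG@10": 0.4100}),  # deliberately stale value
]
failures = run_matrix(matrix, fake_run)
for failure in failures:
    print(failure)
```

In a CI-style setup, a non-empty `failures` list would fail the build, which is how a regression test signals that the code no longer generates the expected results.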

Other considerations

  • This section covers issues that don’t fit into the “what”, “why”, and “how” narrative

Scoping and timing of reproducibility efforts

  • Reproduction guides and automated regression testing lie along a spectrum of “reproduction rigor”
  • Criteria for which retrieval models or experimental results receive the regression treatment are not clear-cut
  • Rough heuristic is to ask if the work is something to extend further
  • Scoping the effort is an important part of the reproducibility discussion
  • Building tests for ineffective contrastive settings or ablation conditions provides little value
  • Decisions are made on a case-by-case basis
  • Any addition to the test suites incurs a permanent maintenance commitment

Bootstrapping reproducibility

  • The cold start problem is how to get a virtuous cycle of reproducibility started
  • Self-interest is a motivator for participating in reproducibility ecosystems
  • Long-term vision and commitment to a shared codebase is necessary to reap rewards of initial investment
  • Research statements can provide inspiration for identifying opportunities for contributing software artifacts

The critical role of leadership

  • The author is the overall architect of Pyserini and Anserini
  • The author acts as the engineering manager and tech lead
  • The author serves as the institutional memory of the research group
  • The author coordinates multiple overlapping research projects
  • The author introduces students to existing features in Anserini and Pyserini

Conclusions

  • Building a culture of reproducibility is hard work, but worthwhile
  • Sharing software artifacts to recreate work is important
  • The author’s work is valuable to the community
  • Academic community is recognizing importance of reproducibility
  • Reproducibility narrative works for certain “styles” of research
  • Getting flywheel spinning is hard
  • Adopt software engineering best practices