Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Reproducibility is an ideal that all researchers agree with
- Reproducibility is difficult to achieve in practice
- The author’s research group has had success building a “culture of reproducibility”
- Reproducibility efforts should yield easy-to-use, well-packaged, and self-contained software artifacts
- The primary beneficiaries of reproducibility efforts are those making the investments
- Social processes and standardized tools are important for achieving reproducibility
- The dogfood principle ties these ideas together
Paper Content
Introduction
- Appeal to self interest instead of altruism
- Engineer social processes to promote virtuous cycles
- Build standardized tools to reduce technical barriers
- Reproducibility is an ideal that no researcher would dispute
- Today, it’s expected that each paper is accompanied by a code repository
- Many researchers make model checkpoints publicly available
- Voorhees et al. (2016) found that none of the 79 Open Runs were reproducible
- Computational notebooks are often touted as a solution to reproducibility
- Pimentel et al. (2019) found that only 24% of notebooks executed without errors
- Reproducibility is a noble goal, but not obligatory
- Competing priorities often take precedence
- Goal is to offer a path forward towards building a culture of reproducibility
- Science is a systematic attempt to accumulate and organize knowledge
- Reproducibility and replicability are the mechanisms to accumulate knowledge
- Funding for research is provided by governments via tax dollars
- Researchers should share their results in the broadest way possible
- FAIR principles: Findable, Accessible, Interoperable, and Reusable
- Software artifacts should be responsibly and securely managed
- Appeal to self interest to prioritize reproducibility
- Investing in reproducibility helps research group iterate more rapidly
- Repeatability is a good first step towards reproducibility
- Good reproducibility looks like repeatability + social processes + standardized tools
How?
- Motivate importance of reproducibility by appealing to self interest
- Engineer social processes to promote virtuous cycles
- Build standardized tools to reduce technical barriers
- Use “dogfood principle” to use own “product”
- Pyserini provides competitive baselines and foundation for first-stage retrieval
- Pyserini makes it easy to run experiments on standard IR test collections
- Self interest allows students and collaborators to build on each other’s results
Social processes: from repeatability to reproducibility
- Experiments must be documented to be repeatable
- Documentation is called a reproducibility guide
- Guide contains command-line invocations and descriptions
- Goal is to reproduce same results
- Self interest and reciprocity motivate students to write good documentation
- Reproduction guides provide onboarding paths for new students
- Reproduction logs record successful reproductions and code versions used
Standardized tools: from reproducibility to two-click reproductions
- Reproducibility can be divided into social and technical aspects.
- Social processes are more important than technical tools.
- Technical tools can lower barriers to reproducibility.
- Anserini and Pyserini have built regression tests with test harnesses.
- Regression tests are integrated with two-click reproductions.
- Run_regression.py builds the index, verifies index stats, performs retrieval, evaluates outputs and checks effectiveness.
- Pyserini tests compare output with Anserini’s output.
- Reproduction matrix organizes experimental conditions.
- Reproduction matrix is backed by a script that iterates through all rows.
- CI/CD framework is adapted to research.
- Regression tests ensure code continues to generate expected results.
Other considerations
- Discusses a variety of issues
- Issues don’t fit into “what”, “why”, and “how” narrative
Scoping and timing of reproducibility efforts
- Reproduction guides and automated regression testing lie along a spectrum of “reproduction rigor”
- Criteria for which retrieval models or experimental results receive the regression treatment is not clear-cut
- Rough heuristic is to ask if the work is something to extend further
- Scoping the effort is an important part of the reproducibility discussion
- Building tests for ineffective contrastive settings or ablation conditions provides little value
- Decisions are made on a case-by-case basis
- Any addition to the test suites incurs a permanent maintenance commitment
Bootstrapping reproducibility
- Cold start process is the process of starting a virtuous cycle of reproducibility
- Self-interest is a motivator for participating in reproducibility ecosystems
- Long-term vision and commitment to a shared codebase is necessary to reap rewards of initial investment
- Research statements can provide inspiration for identifying opportunities for contributing software artifacts
The critical role of leadership
- I am the overall architect of Pyserini and Anserini
- I am the engineering manager and tech lead
- I am the institutional memory of the research group
- I coordinate multiple overlapping research projects
- I introduce students to existing features in Anserini and Pyserini
Conclusions
- Building a culture of reproducibility is hard work, but worthwhile
- Sharing software artifacts to recreate work is important
- Jimmy’s work is valuable to the community
- Academic community is recognizing importance of reproducibility
- Reproducibility narrative works for certain “styles” of research
- Getting flywheel spinning is hard
- Adopt software engineering best practices