Abstract

  • Keeping up with research literature is important for scientists
  • This paper examines literature review practices of data scientists
  • Data science is a field with an exponential rise in papers
  • Tools have been developed to help data scientists cope with the deluge of research
  • Interviews and think-aloud protocols were conducted to uncover challenges faced by data scientists
  • Challenges include seeking and sensemaking of papers, understanding papers with missing details, and grappling with the deluge
  • Data scientists rely on peers online and in-person for help

Paper Content

Introduction

  • Literature reviews are important for scientists.
  • There is a lot of research literature, and it is increasing.
  • Data scientists are facing an information overload.
  • No prior work has examined the literature review practices of data scientists.
  • Data scientists have training in computer science and other disciplines.
  • Literature reviews involve information seeking and sensemaking.
  • Stages of a literature review: identifying potential sources, evaluating sources for relevance and reliability, interacting with sources, and synthesis and presentation
  • Examined literature review practices of data scientists
  • Examined information seeking and sensemaking
  • Examined search as learning process
  • Examined practices of information seeking and sensemaking in isolation
  • Overview of information seeking systems and practices for knowledge workers
  • Overview of sensemaking systems and practices for scholarly literature
  • Examination of data scientists’ practices grounded in prior work
  • Overview of work on search as learning and effects of tasks on learning-oriented goals
  • Overview of recent work examining data scientists’ practices

Info-seeking & sensemaking: practices

  • Information seeking practices of knowledge workers studied
  • Computer scientists’ goals and tools for literature searches
  • STEM researchers’ information-seeking behavior and use of social media/reference management tools
  • Studies examining behaviors in specific systems
  • Challenges of personal information management in engineering researchers
  • Tool use patterns in managing ideas and capturing context/metadata

Info-seeking & sensemaking: systems

  • Prior work has explored a range of strategies for information seeking
  • Strategies include discovery through seed papers and citation chaining, metadata such as authors or keywords, and specialized querying methods
  • Sensemaking systems help users make sense of collections of documents and facilitate reading and note-taking for individual documents
  • Systems for sensemaking also contain components to aid information seeking
  • Systems help determine salient time-aligned trends in the scientific literature
  • Reading aids for individual papers have examined highlights over salient text and augmentations for reading equations and terms
  • Note-taking systems help generate and organize text snippets and notes from reading papers
  • Integrated systems for literature review span search and discovery, organization, synthesis, and composition
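
Citation chaining from seed papers, mentioned above, can be sketched as a breadth-first traversal of a citation graph. This is a minimal illustration, not a system described in the paper: the function name citation_chain, the references mapping, and the max_hops cutoff are all assumptions made for the example.

```python
from collections import deque


def citation_chain(seeds, references, max_hops=2):
    """Follow references outward from seed papers (hypothetical helper).

    seeds: iterable of paper ids to start from.
    references: dict mapping a paper id to the list of paper ids it cites.
    max_hops: how many citation links to follow from any seed.
    Returns a dict mapping each discovered paper id to its hop distance.
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        paper = queue.popleft()
        if distance[paper] == max_hops:
            continue  # do not expand papers at the hop limit
        for cited in references.get(paper, []):
            if cited not in distance:  # keep the shortest distance only
                distance[cited] = distance[paper] + 1
                queue.append(cited)
    return distance


# Toy citation graph with made-up paper ids.
toy_references = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
print(citation_chain(["A"], toy_references, max_hops=2))
```

Following references backward in time is only one direction of chaining; passing a mapping from each paper to the papers that cite it would give forward chaining with the same function.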

Search as learning

  • Search as learning studies how learning occurs in the context of the search process
  • Learning is considered the formation of mental models of knowledge, their retention over time, and their application
  • Search is viewed as a series of tasks in a constructive process
  • Most work in SAL has focused on classroom students or crowd workers, not knowledge workers

User characteristics.

  • Examining user characteristics often focuses on domain knowledge
  • Willoughby et al. found that domain knowledge is beneficial for searching
  • Roy et al. found that prior topic knowledge helps gain knowledge during a search session
  • Vakkari et al. found that domain knowledge only influences searching for users with knowledge of the search system

System characteristics.

  • Prior work has explored search and reading/note-taking features.
  • Di Sciascio et al. found a transparent and controllable search system to deliver better learning outcomes than PubMed.
  • Qiu et al. found web-search interfaces to lead to better knowledge gain compared to conversational search interfaces.
  • Freund et al. found simpler text environments to lead to improved comprehension.
  • Roy et al. found highlights in reading to improve topic coverage in essay writing tasks.
  • Note taking leads to the inclusion of more facts.
  • Syed et al. found automatically generated questions embedded in the text to improve learning outcomes.

Search behaviors.

  • Vakkari et al. found that as vocabulary improved, queries became more specific
  • Dosso et al. found domain knowledge did not influence query length or number, but medicine students used more domain-specific vocabulary
  • Vakkari and Huuskonen found increased effort in examining documents was associated with improved essays
  • Moraes et al. found search paired with instructor lectures improved learning outcomes
  • Urgo and Arguello found that “pathways” toward learning objectives varied in cognitive complexity
  • This study focuses on the practices and challenges of search and the use of search results by data scientists

Practices of data scientists

  • Data scientists are an emerging body of knowledge workers
  • Work of Crisan et al. helps paint a picture of who data scientists are
  • Kross and Guo and Koesten et al. have examined dataset-related information-seeking needs of data scientists
  • Range of work examines data science workflows, documentation practices, and analysis decisions
  • Challenges and needs of data scientists in developing fair machine learning systems and adopting explainable machine learning systems have been examined
  • How data scientists seek and make sense of expanding research literature remains unknown

Study methods

Study design

  • Conducted semi-structured interviews and think-aloud observations
  • 20 participants, 13 in Ph.D. programs and 7 in industry/non-profit organizations

Study participants

  • Recruited participants using social media, email invitations, and university mailing lists
  • Asked respondents if they identified as data scientists and to submit 3 research papers they found useful or enjoyable
  • 20 participants selected on a first-come, first-served basis
  • Participants compensated with $25 gift card
  • 11 he/him, 9 she/hers, 2 they/them
  • 14 Ph.D. degrees, 6 master’s degrees
  • 13 in universities, 7 in non-profit/for-profit industry labs
  • Average of 4 research papers published

Study procedure

  • Study sessions were conducted by the authors and lasted 1 hour
  • Consent obtained and demographic survey filled out
  • Semi-structured interview about research focus and goals, practices, and challenges for literature reviews
  • Think-aloud conducted over Zoom screenshare
  • 3 task scenarios used as prompts for think-aloud

Data collection and analysis

  • Automatic transcription of audio was obtained from Zoom and corrected for errors
  • 3 rounds of coding by 3 authors and a thematic analysis of the interview and think-aloud transcripts
  • First round of coding included open and axial coding, with inter-coder agreement of 0.92 (nominal Krippendorff’s alpha)
  • 167 codes generated, of which 51 noted names of tools used by participants or logistic aspects of the study
  • 59 new codes added, of which 32 noted names of tools or logistic aspects
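
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is not the authors' analysis code; the function name and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (illustrative sketch).

    units: list of lists; each inner list holds the labels that the coders
    assigned to one unit, with None marking a missing label.
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        # Every ordered pair of labels from different coders adds 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal distance: 0 for matching labels, 1 for differing labels.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - (n - 1) * observed / expected


# Toy example: two coders label ten units with made-up binary codes.
coder_1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(krippendorff_alpha_nominal([list(pair) for pair in zip(coder_1, coder_2)]))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, so the reported 0.92 reflects high agreement among coders.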

Results

  • Information Search Process consists of 4 stages: formulation of an information need, query formulation and search, assessment of search results, and synthesis of documents
  • Participants focus on understanding problems and solutions, and seek solutions for direct application or as baseline systems
  • Participants seek novelty and build on prior work
  • Challenges in formulating queries are corroborated in prior work
  • Participants also solicit recommendations from expert peers

How do data scientists find literature?

  • Data scientists find literature both through active search and through passive discovery.
  • Automated methods for discovery include following individuals on social media, following authors, and subscribing to email alerts and newsletters.
  • Participants noted being overwhelmed with alerts and trapped in a disciplinary bubble.
  • Participants receive recommendations from peers, which have advantages such as understanding interests and deeper engagement.

How do data scientists select papers?

  • Participants faced a large volume of similar papers in heavily crowded disciplines of data science.
  • Participants turned to surveys or good reviews of the literature to find salient papers.
  • Participants treated repeated references to specific concepts or papers as a sign that those papers were worth examining.
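
The repeated-reference heuristic in the last point can be sketched as a simple count over the reference lists of papers already skimmed. This is only an illustration of the idea, not a tool used by participants: the function name frequently_cited, the min_count threshold, and the paper ids are hypothetical.

```python
from collections import Counter


def frequently_cited(reference_lists, min_count=2):
    """Rank references by how many skimmed papers cite them (sketch).

    reference_lists: one list of cited paper ids per skimmed paper.
    min_count: minimum number of citing papers for a reference to be kept.
    Returns (paper id, count) pairs, most cited first, ties sorted by id.
    """
    counts = Counter()
    for refs in reference_lists:
        counts.update(set(refs))  # count each reference once per citing paper
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(paper, count) for paper, count in ranked if count >= min_count]


# Made-up reference lists for three skimmed papers.
skimmed = [["p1", "p2"], ["p2", "p3"], ["p2", "p1", "p1"]]
print(frequently_cited(skimmed))
```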

Establishing the credibility of papers.

  • Establishing credibility of papers is a challenge
  • Indicators of credibility include authors, affiliations, publication venues, and citation counts
  • Content of papers sometimes does not match the information scent
  • Challenges with exaggerated/re-branded claims and needing to sift through many similar papers

Everyone skims papers.

  • Participants skimmed individual papers to make quick judgments about correctness.
  • Participants relied on knowing the discipline to know where to look for specific information.
  • Skimming is often interspersed with information seeking.

What challenges do data scientists face in reading papers?

  • Participants noted the challenge of missing details in papers.
  • Participants noted a tension between including a lot of detail and readers wanting a high-level idea.
  • Participants noted the value of augmentations provided by code.
  • Code is currently available alongside only 25% of papers.

  • Struggling to understand math in papers
  • Leveraging code, blogs, and talks to aid understanding
  • Understanding the “delta” of a paper compared to other work
  • Difficulty in establishing if a paper is poorly written or if the participant is missing context
  • Recent work in NLP exploring methods to explain relationships between papers

How do data scientists lean on social ties?

  • Leveraging social ties for paper discovery
  • Leaning on peers for other purposes

Collaboratively brainstorming and making sense of papers.

  • Participants noted the value of group discussions centered on papers to keep up with the literature, help brainstorm ideas, or spark new research directions
  • Participants noted the value of discussions with collaborators to understand the details of specific important papers
  • Participants noted the value of sharing notes and literature with collaborators to establish the provenance of ideas or correctness of information
  • Participants sought weaker social ties and online discussions to seek recommendations from experts on forums and establish the credibility of papers
  • Participants found value in interacting with authors directly or passively through recorded talks and forums such as Twitter or Reddit
  • Visual communication and the incentive for authors to communicate their ideas were found to be useful for understanding
  • Alternative publication formats and author engagement through social media are under-explored

Discussion

  • Examined practices and challenges of data scientists reviewing scientific literature
  • Anchored results along formulation of an information need, query formulation and search, assessment of search results, synthesis of documents, and leveraging social ties
  • Noted challenges across themes and speculated future work

Support cross-disciplinary access

  • Data science is seeing exponential growth
  • Subdisciplines are emerging with their own norms
  • Scientists are tasked with conducting work in fragmented disciplines
  • Challenges of fragmented knowledge are echoed in information search process
  • Query recommendation and verbose queries can help with search in unknown disciplines
  • Explanations and adaptive document layouts can help with skimming and understanding papers

Facilitate reliance on close peers

  • Participants relied on close peers for recommendations, brainstorming, credibility of papers, and understanding of papers.
  • Collaborative practices of data scientists have been examined in the context of code and data work.
  • Evidence suggests bringing “friends-into-the-loop” of recommender systems results in more accurate and diverse recommendations.
  • Early work melding sensemaking, reading, and discovery in a collaborative feed-reader promises to leverage trust in peers.
  • Users prefer to interleave egocentric search with lightweight communication.
  • Careful design is needed for virtual communication to not curb creative ideation.
  • Interoperability of new tools is important for uptake.

Leverage the knowledge context of papers

  • Close community of expert peers not always available
  • Variety of resources sought to augment papers
  • Knowledge context used in web search engine result pages
  • Push for greater use of knowledge context
  • Questions remain about retrieval and presentation of knowledge context
  • Room for more complex augmentations

Future work and limitations

  • Examine longer-term activities such as synthesis and composition
  • Recruit more varied participants from different locations
  • Examine incentive structures surrounding disseminating research
  • Consider influence of COVID-19 pandemic on work practices

Conclusions

  • Number of scholarly publications is increasing
  • Examined information seeking and sensemaking practices of data scientists
  • Ran exploratory interview study with 20 data scientists
  • Established their goals for accessing the literature
  • Examined their practices and challenges in search and discovery, selection of search results, skimming and reading
  • Reliance on peers in a number of these tasks
  • Challenges arising from fragmented scientific disciplines
  • Challenges of missing detail and mathematical content in reading papers
  • Leverage knowledge context surrounding scientific papers in the form of code, blogs, talks, and forums
  • Leverage existing scientific literature to find open problems and solutions