Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Keeping up with research literature is important for scientists
This paper examines literature review practices of data scientists
Data science is a field with an exponential rise in papers
Tools have been developed to help data scientists cope with the deluge of research
Interviews and think-aloud protocols were conducted to uncover challenges faced by data scientists
Challenges include seeking and sensemaking of papers, understanding papers with missing details, and grappling with the deluge
Data scientists rely on peers online and in-person for help

Paper Content

Introduction

Literature reviews are important for scientists.
There is a lot of research literature, and it is increasing.
Data scientists are facing an information overload.
No prior work has examined the literature review practices of data scientists.
Data scientists have training in computer science and other disciplines.
Literature reviews involve information seeking and sensemaking.

Search

Identify potential sources
Evaluate sources for relevance and reliability

Interacting with sources

Synthesis & presentation

Examined literature review practices of data scientists
Examined information seeking and sensemaking
Examined search as learning process
Examined practices of information seeking and sensemaking in isolation

Overview of information seeking systems and practices for knowledge workers
Overview of sensemaking systems and practices for scholarly literature
Examination of data scientists’ practices grounded in prior work
Overview of work on search as learning and effects of tasks on learning-oriented goals
Overview of recent work examining data scientists’ practices

Info-seeking & sensemaking -practices

Information seeking practices of knowledge workers studied
Computer scientists’ goals and tools for literature searches
STEM researchers’ information-seeking behavior and use of social media/reference management tools
Studies examining behaviors in specific systems
Challenges of personal information management in engineering researchers
Tool use patterns in managing ideas and capturing context/metadata

Info-seeking & sensemaking -systems

Prior work has explored a range of strategies for information seeking
Strategies include discovery through seed papers and citation chaining, metadata such as authors or keywords, and specialized querying methods
Sensemaking systems aid collections of documents, facilitate reading and note-taking of individual documents
Systems for sensemaking also contain components to aid information seeking
Systems help determine salient time-aligned trends in the scientific literature
Reading aids for individual papers have examined highlights over salient text and augmentations for reading equations and terms
Note-taking systems help generate and organize text snippets and notes from reading papers
Integrated systems for literature review span search and discovery, organization, synthesis, and composition

Search as learning

Search as learning studies how learning occurs in the context of the search process
Learning is considered the formation of mental models of knowledge, their retention over time, and their application
Search is viewed as a series of tasks in a constructive process
Most work in SAL has focused on classroom students or crowd workers, not knowledge workers

User characteristics.

Examining user characteristics often focuses on domain knowledge
Willoughby et al. found that domain knowledge is beneficial for searching
Roy et al. found that prior topic knowledge helps gain knowledge during a search session
Vakkari et al. found that domain knowledge only influences searching for users with knowledge of the search system

System characteristics .

Prior work has explored search and reading/note-taking features.
Di Sciascio et al. found a transparent and controllable search system to deliver better learning outcomes than PubMed.
Qiu et al. found web-search interfaces to lead to better knowledge gain compared to conversational search interfaces.
Freund et al. found simpler text environments to lead to improved comprehension.
Roy et al. found highlights in reading to improve topic coverage in essay writing tasks.
Note taking leads to the inclusion of more facts.
Syed et al. found automatically generated questions embedded in the text to improve learning outcomes.

Search behaviors.

Vakkari et al. found that as vocabulary improved, queries became more specific
Dosso et al. found domain knowledge did not influence query length or number, but medicine students used more domain-specific vocabulary
Vakkari and Huuskonen found increased effort in examining documents was associated with improved essays
Moraes et al. found search paired with instructor lectures improved learning outcomes
Urgo and Arguello found “pathway” toward learning objectives varied in cognitive complexity
Study focuses on practices and challenges of search and use of search results by data scientists

Practices of data scientists

Data scientists are a currently emerging body of knowledge workers
Work of Crisan et al. helps paint a picture of who data scientists are
Kross and Guo and Koesten et al. have examined dataset-related information-seeking needs of data scientists
Range of work examines data science workflows, documentation practices, and analysis decisions
Challenges and needs of data scientists in developing fair machine learning systems and adopting explainable machine learning systems have been examined
How data scientists seek and make sense of expanding research literature remains unknown

Study methods

Study design

Conducted semi-structured interviews and think-aloud observations
20 participants, 13 in Ph.D. programs and 7 in industry/non-profit organizations

Study participants

Recruited participants using social media, email invitations, and university mailing lists
Asked respondents if they identified as data scientists and to submit 3 research papers they found useful or enjoyable
20 participants selected on a first-come-first-serve basis
Participants compensated with $25 gift card
11 he/him, 9 she/hers, 2 they/them
14 Ph.D. degrees, 6 master’s degrees
13 in universities, 7 in non-profit/for-profit industry labs
Average of 4 research papers published

Study procedure

Study sessions conducted by authors and lasted 1 hour
Consent obtained and demographic survey filled out
Semi-structured interview about research focus and goals, practices, and challenges for literature reviews
Think-aloud conducted over Zoom screenshare
3 task scenarios used as prompts for think-aloud

Data collection and analysis

Automatic transcription of audio was obtained from Zoom and corrected for errors
3 rounds of coding by 3 authors and a thematic analysis of the interview and think-aloud transcripts
First round of coding included open and axial coding, with agreement of 0.92 in terms of nominal Krippendorff’s alpha
167 codes generated, of which 51 noted names of tools used by participants or logistic aspects of the study
59 new codes added, of which 32 noted names of tools or logistic aspects

Results

Information Search Process consists of 4 stages: Formulation of an information need, query formulation and search, assessment of search results, and synthesis of documents
Participants focus on understanding problems and solutions, and seek solutions for direct application or as baseline systems
Participants seek novelty and build on prior work
Challenges in formulating queries are corroborated in prior work
Participants also solicit recommendations from expert peers

How

4.2.2

Data scientists find literature through search and passively.
Automated methods for discovery include following individuals on social media, subscribing to email alerts, following authors, and newsletters.
Participants noted being overwhelmed with alerts and trapped in a disciplinary bubble.
Participants receive recommendations from peers, which have advantages such as understanding interests and deeper engagement.

How do data scientists select papers?

Participants faced a large volume of similar papers in heavily crowded disciplines of data science.
Participants turned to surveys or good reviews of the literature to find salient papers.
Participants leveraged repeated references to specific concepts or papers as a sign of having found the papers worthy of examination.

Establishing the credibility of papers.

Establishing credibility of papers is a challenge
Indicators of credibility include authors, affiliations, publication venues, and citation counts
Content of papers sometimes does not match the information scent
Challenges with exaggerated/re-branded claims and needing to sift through many similar papers

Everyone skims papers.

Participants skimmed individual papers to make quick decisions of correctness.
Participants relied on knowing the discipline to know where to look for specific information.
Skimming is often interspersed with information seeking.

What challenges do data scientists face in reading papers?

Participants noted the challenge of missing details in papers.
Participants noted a tension between including a lot of detail and readers wanting a high-level idea.
Participants noted the value of augmentations provided by code.
Availability of code alongside papers is currently at 25%.

4.4.2

Struggling to understand math in papers
Leveraging code, blogs, and talks to aid understanding
Understanding the “delta” of a paper compared to other work
Difficulty in establishing if a paper is poorly written or if the participant is missing context
Recent work in NLP exploring methods to explain relationships between papers

Leveraging social ties for paper discovery
Leaning on peers for other purposes

Collaboratively brainstorming and making sense of papers.

Participants noted the value of group discussions centered on papers to keep up with the literature, help brainstorm ideas, or spark new research directions
Participants noted the value of discussions with collaborators to understand the details of specific important papers
Participants noted the value of sharing notes and literature with collaborators to establish the provenance of ideas or correctness of information
Participants sought weaker social ties and online discussions to seek recommendations from experts on forums and establish the credibility of papers
Participants found value in interacting with authors directly or passively through recorded talks and forums such as Twitter or Reddit
Visual communication and incentive for authors to communicate their idea was found to be useful for understanding
Alternative publication formats and author engagement through social media are under-explored

Discussion

Examined practices and challenges of data scientists reviewing scientific literature
Anchored results along formulation of an information need, query formulation and search, assessment of search results, synthesis of documents, and leveraging social ties
Noted challenges across themes and speculated future work

Support cross-disciplinary access

Data science is seeing exponential growth
Subdisciplines are emerging with their own norms
Scientists are tasked with conducting work in fragmented disciplines
Challenges of fragmented knowledge are echoed in information search process
Query recommendation and verbose queries can help with search in unknown disciplines
Explanations and adaptive document layouts can help with skimming and understanding papers

Facilitate reliance on close peers

Participants relied on close peers for recommendations, brainstorming, credibility of papers, and understanding of papers.
Collaborative practices of data scientists have been examined in the context of code and data work.
Evidence suggests bringing “friends-into-the-loop” of recommender systems results in more accurate and diverse recommendations.
Early work melding sensemaking, reading, and discovery in a collaborative feed-reader promises to leverage trust in peers.
Users prefer to interleave egocentric search with lightweight communication.
Careful design is needed for virtual communication to not curb creative ideation.
Interoperability of new tools is important for uptake.

Leverage the knowledge context of papers

Close community of expert peers not always available
Variety of resources sought to augment papers
Knowledge context used in web search engine result pages
Push for greater use of knowledge context
Questions remain about retrieval and presentation of knowledge context
Room for more complex augmentations

Future work and limitations

Examine longer-term activities such as synthesis and composition
Recruit more varied participants from different locations
Examine incentive structures surrounding disseminating research
Consider influence of COVID-19 pandemic on work practices

Conclusions

Number of scholarly publications is increasing
Examined information seeking and sensemaking practices of data scientists
Ran exploratory interview study with 20 data scientists
Established their goals for accessing the literature
Examined their practices and challenges in search and discovery, selection of search results, skimming and reading
Reliance on peers in a number of these tasks
Challenges arising from fragmented scientific disciplines
Challenges of missing detail and mathematical content in reading papers
Leverage knowledge context surrounding scientific papers in the form of code, blogs, talks, and forums
Leverage existing scientific literature to find open problems and solutions

Link to paper#

Abstract#

Paper Content#

Introduction#

Search#

Interacting with sources#

Synthesis & presentation#

Related work#

Info-seeking & sensemaking -practices#

Info-seeking & sensemaking -systems#

Search as learning#

User characteristics.#

System characteristics .#

Search behaviors.#

Practices of data scientists#

Study methods#

Study design#

Study participants#

Study procedure#

Data collection and analysis#

Results#

How#

4.2.2#

How do data scientists select papers?#

Establishing the credibility of papers.#

Everyone skims papers.#

What challenges do data scientists face in reading papers?#

4.4.2#

How do data scientists lean on social ties?#

Collaboratively brainstorming and making sense of papers.#

Discussion#

Support cross-disciplinary access#

Facilitate reliance on close peers#

Leverage the knowledge context of papers#

Future work and limitations#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Search

Interacting with sources

Synthesis & presentation

Related work

Info-seeking & sensemaking -practices

Info-seeking & sensemaking -systems

Search as learning

User characteristics.

System characteristics .

Search behaviors.

Practices of data scientists

Study methods

Study design

Study participants

Study procedure

Data collection and analysis

Results

How

4.2.2

How do data scientists select papers?

Establishing the credibility of papers.

Everyone skims papers.

What challenges do data scientists face in reading papers?

4.4.2

How do data scientists lean on social ties?

Collaboratively brainstorming and making sense of papers.

Discussion

Support cross-disciplinary access

Facilitate reliance on close peers

Leverage the knowledge context of papers

Future work and limitations

Conclusions