Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Keeping up with research literature is important for scientists
- This paper examines literature review practices of data scientists
- Data science is a field with an exponential rise in papers
- Tools have been developed to help data scientists cope with the deluge of research
- Interviews and think-aloud protocols were conducted to uncover challenges faced by data scientists
- Challenges include seeking and sensemaking of papers, understanding papers with missing details, and grappling with the deluge
- Data scientists rely on peers online and in-person for help
Paper Content
Introduction
- Literature reviews are important for scientists.
- There is a lot of research literature, and it is increasing.
- Data scientists are facing an information overload.
- No prior work has examined the literature review practices of data scientists.
- Data scientists have training in computer science and other disciplines.
- Literature reviews involve information seeking and sensemaking.
Search
- Identify potential sources
- Evaluate sources for relevance and reliability
Interacting with sources
Synthesis & presentation
- Examined literature review practices of data scientists
- Examined information seeking and sensemaking
- Examined search as learning process
- Examined practices of information seeking and sensemaking in isolation
Related work
- Overview of information seeking systems and practices for knowledge workers
- Overview of sensemaking systems and practices for scholarly literature
- Examination of data scientists’ practices grounded in prior work
- Overview of work on search as learning and effects of tasks on learning-oriented goals
- Overview of recent work examining data scientists’ practices
Info-seeking & sensemaking -practices
- Information seeking practices of knowledge workers studied
- Computer scientists’ goals and tools for literature searches
- STEM researchers’ information-seeking behavior and use of social media/reference management tools
- Studies examining behaviors in specific systems
- Challenges of personal information management in engineering researchers
- Tool use patterns in managing ideas and capturing context/metadata
Info-seeking & sensemaking -systems
- Prior work has explored a range of strategies for information seeking
- Strategies include discovery through seed papers and citation chaining, metadata such as authors or keywords, and specialized querying methods
- Sensemaking systems aid collections of documents, facilitate reading and note-taking of individual documents
- Systems for sensemaking also contain components to aid information seeking
- Systems help determine salient time-aligned trends in the scientific literature
- Reading aids for individual papers have examined highlights over salient text and augmentations for reading equations and terms
- Note-taking systems help generate and organize text snippets and notes from reading papers
- Integrated systems for literature review span search and discovery, organization, synthesis, and composition
Search as learning
- Search as learning studies how learning occurs in the context of the search process
- Learning is considered the formation of mental models of knowledge, their retention over time, and their application
- Search is viewed as a series of tasks in a constructive process
- Most work in SAL has focused on classroom students or crowd workers, not knowledge workers
User characteristics.
- Examining user characteristics often focuses on domain knowledge
- Willoughby et al. found that domain knowledge is beneficial for searching
- Roy et al. found that prior topic knowledge helps gain knowledge during a search session
- Vakkari et al. found that domain knowledge only influences searching for users with knowledge of the search system
System characteristics .
- Prior work has explored search and reading/note-taking features.
- Di Sciascio et al. found a transparent and controllable search system to deliver better learning outcomes than PubMed.
- Qiu et al. found web-search interfaces to lead to better knowledge gain compared to conversational search interfaces.
- Freund et al. found simpler text environments to lead to improved comprehension.
- Roy et al. found highlights in reading to improve topic coverage in essay writing tasks.
- Note taking leads to the inclusion of more facts.
- Syed et al. found automatically generated questions embedded in the text to improve learning outcomes.
Search behaviors.
- Vakkari et al. found that as vocabulary improved, queries became more specific
- Dosso et al. found domain knowledge did not influence query length or number, but medicine students used more domain-specific vocabulary
- Vakkari and Huuskonen found increased effort in examining documents was associated with improved essays
- Moraes et al. found search paired with instructor lectures improved learning outcomes
- Urgo and Arguello found “pathway” toward learning objectives varied in cognitive complexity
- Study focuses on practices and challenges of search and use of search results by data scientists
Practices of data scientists
- Data scientists are a currently emerging body of knowledge workers
- Work of Crisan et al. helps paint a picture of who data scientists are
- Kross and Guo and Koesten et al. have examined dataset-related information-seeking needs of data scientists
- Range of work examines data science workflows, documentation practices, and analysis decisions
- Challenges and needs of data scientists in developing fair machine learning systems and adopting explainable machine learning systems have been examined
- How data scientists seek and make sense of expanding research literature remains unknown
Study methods
Study design
- Conducted semi-structured interviews and think-aloud observations
- 20 participants, 13 in Ph.D. programs and 7 in industry/non-profit organizations
Study participants
- Recruited participants using social media, email invitations, and university mailing lists
- Asked respondents if they identified as data scientists and to submit 3 research papers they found useful or enjoyable
- 20 participants selected on a first-come-first-serve basis
- Participants compensated with $25 gift card
- 11 he/him, 9 she/hers, 2 they/them
- 14 Ph.D. degrees, 6 master’s degrees
- 13 in universities, 7 in non-profit/for-profit industry labs
- Average of 4 research papers published
Study procedure
- Study sessions conducted by authors and lasted 1 hour
- Consent obtained and demographic survey filled out
- Semi-structured interview about research focus and goals, practices, and challenges for literature reviews
- Think-aloud conducted over Zoom screenshare
- 3 task scenarios used as prompts for think-aloud
Data collection and analysis
- Automatic transcription of audio was obtained from Zoom and corrected for errors
- 3 rounds of coding by 3 authors and a thematic analysis of the interview and think-aloud transcripts
- First round of coding included open and axial coding, with agreement of 0.92 in terms of nominal Krippendorff’s alpha
- 167 codes generated, of which 51 noted names of tools used by participants or logistic aspects of the study
- 59 new codes added, of which 32 noted names of tools or logistic aspects
Results
- Information Search Process consists of 4 stages: Formulation of an information need, query formulation and search, assessment of search results, and synthesis of documents
- Participants focus on understanding problems and solutions, and seek solutions for direct application or as baseline systems
- Participants seek novelty and build on prior work
- Challenges in formulating queries are corroborated in prior work
- Participants also solicit recommendations from expert peers
How
4.2.2
- Data scientists find literature through search and passively.
- Automated methods for discovery include following individuals on social media, subscribing to email alerts, following authors, and newsletters.
- Participants noted being overwhelmed with alerts and trapped in a disciplinary bubble.
- Participants receive recommendations from peers, which have advantages such as understanding interests and deeper engagement.
How do data scientists select papers?
- Participants faced a large volume of similar papers in heavily crowded disciplines of data science.
- Participants turned to surveys or good reviews of the literature to find salient papers.
- Participants leveraged repeated references to specific concepts or papers as a sign of having found the papers worthy of examination.
Establishing the credibility of papers.
- Establishing credibility of papers is a challenge
- Indicators of credibility include authors, affiliations, publication venues, and citation counts
- Content of papers sometimes does not match the information scent
- Challenges with exaggerated/re-branded claims and needing to sift through many similar papers
Everyone skims papers.
- Participants skimmed individual papers to make quick decisions of correctness.
- Participants relied on knowing the discipline to know where to look for specific information.
- Skimming is often interspersed with information seeking.
What challenges do data scientists face in reading papers?
- Participants noted the challenge of missing details in papers.
- Participants noted a tension between including a lot of detail and readers wanting a high-level idea.
- Participants noted the value of augmentations provided by code.
- Availability of code alongside papers is currently at 25%.
4.4.2
- Struggling to understand math in papers
- Leveraging code, blogs, and talks to aid understanding
- Understanding the “delta” of a paper compared to other work
- Difficulty in establishing if a paper is poorly written or if the participant is missing context
- Recent work in NLP exploring methods to explain relationships between papers
How do data scientists lean on social ties?
- Leveraging social ties for paper discovery
- Leaning on peers for other purposes
Collaboratively brainstorming and making sense of papers.
- Participants noted the value of group discussions centered on papers to keep up with the literature, help brainstorm ideas, or spark new research directions
- Participants noted the value of discussions with collaborators to understand the details of specific important papers
- Participants noted the value of sharing notes and literature with collaborators to establish the provenance of ideas or correctness of information
- Participants sought weaker social ties and online discussions to seek recommendations from experts on forums and establish the credibility of papers
- Participants found value in interacting with authors directly or passively through recorded talks and forums such as Twitter or Reddit
- Visual communication and incentive for authors to communicate their idea was found to be useful for understanding
- Alternative publication formats and author engagement through social media are under-explored
Discussion
- Examined practices and challenges of data scientists reviewing scientific literature
- Anchored results along formulation of an information need, query formulation and search, assessment of search results, synthesis of documents, and leveraging social ties
- Noted challenges across themes and speculated future work
Support cross-disciplinary access
- Data science is seeing exponential growth
- Subdisciplines are emerging with their own norms
- Scientists are tasked with conducting work in fragmented disciplines
- Challenges of fragmented knowledge are echoed in information search process
- Query recommendation and verbose queries can help with search in unknown disciplines
- Explanations and adaptive document layouts can help with skimming and understanding papers
Facilitate reliance on close peers
- Participants relied on close peers for recommendations, brainstorming, credibility of papers, and understanding of papers.
- Collaborative practices of data scientists have been examined in the context of code and data work.
- Evidence suggests bringing “friends-into-the-loop” of recommender systems results in more accurate and diverse recommendations.
- Early work melding sensemaking, reading, and discovery in a collaborative feed-reader promises to leverage trust in peers.
- Users prefer to interleave egocentric search with lightweight communication.
- Careful design is needed for virtual communication to not curb creative ideation.
- Interoperability of new tools is important for uptake.
Leverage the knowledge context of papers
- Close community of expert peers not always available
- Variety of resources sought to augment papers
- Knowledge context used in web search engine result pages
- Push for greater use of knowledge context
- Questions remain about retrieval and presentation of knowledge context
- Room for more complex augmentations
Future work and limitations
- Examine longer-term activities such as synthesis and composition
- Recruit more varied participants from different locations
- Examine incentive structures surrounding disseminating research
- Consider influence of COVID-19 pandemic on work practices
Conclusions
- Number of scholarly publications is increasing
- Examined information seeking and sensemaking practices of data scientists
- Ran exploratory interview study with 20 data scientists
- Established their goals for accessing the literature
- Examined their practices and challenges in search and discovery, selection of search results, skimming and reading
- Reliance on peers in a number of these tasks
- Challenges arising from fragmented scientific disciplines
- Challenges of missing detail and mathematical content in reading papers
- Leverage knowledge context surrounding scientific papers in the form of code, blogs, talks, and forums
- Leverage existing scientific literature to find open problems and solutions