Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large language models have been applied to a variety of natural language tasks.
  • Adoption of these models in real-world settings has been limited by their tendency to generate incorrect and toxic statements.
  • This paper explores the ability of large language models to streamline medical guidelines and recommendation referencing.
  • Improved factual grounding, helpfulness, and safety are demonstrated in clinical scenarios.

Paper Content

Introduction

  • Language model pre-training is a powerful training paradigm in NLP
  • Performance improvements have been observed to scale with model and dataset size
  • LLMs can be prone to generating factually invalid statements
  • LLMs can reproduce social biases and generate statements reinforcing stereotypes
  • Different ways of steering LLM outputs to align with user intent have been explored
  • LLMs have been used in transformative applications
  • LLMs are at risk of adversarial attacks
  • Almanac is a framework to explore the role of LLMs in the clinical workflow
  • Pre-training transformers on scientific and biomedical corpora has improved performance on biomedical tasks
  • Smaller domain-specific language models can be beneficial even with limited data
  • Large language models are prone to hallucinations and biases
  • Leveraging language models for language understanding and modeling capabilities can improve accuracy of question answering
  • External tools can be used to retrieve knowledge and improve clinically useful tasks

Results

  • Almanac outperforms its counterpart (ChatGPT) in factuality and performs comparably in completeness and safety
  • ChatGPT struggles to provide references: 1 was correct, 15 were invalid, and 4 were incorrect
  • Almanac is able to correctly reference various sources for fact-checking

Dataset

  • The task of medical question answering is evaluated
  • Existing datasets are not sufficient for capturing clinical scenarios
  • ClinicalQA is a novel benchmark of clinical questions
  • Summary statistics and samples are provided in tables
  • ClinicalQA can serve as a benchmark for LM-based clinical decision-making support systems
  • Future work will expand the dataset to include more varied examples and multi-modal inputs

Architecture

  • Almanac consists of many components working together to achieve accurate document retrieval, reasoning, and question-answering.
  • Database stores content semantically, using information-dense vectors and a similarity metric.
  • Browser accesses predetermined domains to fetch information from the web.
  • Retriever encodes queries and reference materials into the same high-dimensional space.
  • Language model extracts relevant information from scored context to formulate an answer.
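
The components above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hashed bag-of-words `embed` function stands in for a learned text encoder, `VectorStore` is a hypothetical in-memory database, `ALLOWED_DOMAINS` is a placeholder allowlist for the browser, and `build_prompt` only assembles the scored context that a language model would receive.

```python
import math
import re
import zlib
from urllib.parse import urlparse

DIM = 256  # assumed embedding size for this toy example
ALLOWED_DOMAINS = {"example.org"}  # hypothetical allowlist of predetermined domains


def is_allowed(url: str) -> bool:
    """Browser step: accept only URLs on (or under) an allowlisted domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def embed(text: str) -> list[float]:
    """Toy stand-in for a text encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Similarity metric; inputs are unit-length, so the dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


class VectorStore:
    """Database step: keeps each passage next to its vector."""

    def __init__(self) -> None:
        self.items: list[tuple[str, str, list[float]]] = []

    def add(self, text: str, source: str) -> None:
        self.items.append((text, source, embed(text)))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str, str]]:
        """Retriever step: encode the query into the same space and rank passages."""
        q = embed(query)
        scored = [(cosine(q, vec), text, source) for text, source, vec in self.items]
        return sorted(scored, key=lambda hit: hit[0], reverse=True)[:k]


def build_prompt(query: str, hits: list[tuple[float, str, str]]) -> str:
    """Language model step: present scored context so the answer can cite it."""
    context = "\n".join(
        f"[{i + 1}] ({source}, score={score:.2f}) {text}"
        for i, (score, text, source) in enumerate(hits)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

In use, passages fetched from allowlisted pages would be added with `store.add(text, source)`, and `build_prompt(query, store.search(query))` would produce the text passed to the language model. Keeping the source label next to each passage is what lets a reader trace an answer back to the documents behind it.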

Evaluation

  • Almanac is evaluated on a subset of ClinicalQA
  • A framework with physician feedback is used to ensure alignment with 3 key objectives: factuality, completeness, and safety
  • Current LLM evaluation metrics rely on automated methods such as BLEU
  • A rubric is used to assess outputs against these three objectives
  • Each physician independently evaluates questions within their respective specialties
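
One way to tabulate such physician ratings is sketched below. The summary does not describe the rubric's scale or how scores were combined, so everything here is an assumption for illustration: the `Rating` record, the binary 0/1 score per criterion, and the per-criterion mean are hypothetical, and the sample values are placeholders rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

CRITERIA = ("factuality", "completeness", "safety")


@dataclass
class Rating:
    """One physician's judgment of one answer (hypothetical record layout)."""
    system: str
    question_id: str
    rater: str
    scores: dict  # criterion -> 0 or 1 (assumed binary rubric)


def aggregate(ratings):
    """Average each criterion per system across all questions and raters."""
    buckets = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        for criterion in CRITERIA:
            buckets[r.system][criterion].append(r.scores[criterion])
    return {
        system: {criterion: mean(values) for criterion, values in per_system.items()}
        for system, per_system in buckets.items()
    }


# Placeholder ratings, not data from the paper.
sample = [
    Rating("system_a", "q1", "rater_1", {"factuality": 1, "completeness": 1, "safety": 1}),
    Rating("system_a", "q2", "rater_2", {"factuality": 1, "completeness": 0, "safety": 1}),
    Rating("system_b", "q1", "rater_1", {"factuality": 0, "completeness": 1, "safety": 1}),
]
summary = aggregate(sample)
```

Because each physician rates only questions in their own specialty, a fuller analysis would probably also group by specialty; that is omitted here to keep the sketch short.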

Discussion

  • LLMs can hallucinate incorrect facts and have biases
  • The system is designed to distill information from a given context, so it does not need continuous training
  • The system mitigates explainability concerns by enabling clinicians to review the documents used to formulate an output
  • Knowledge retrieval improves on black-box NLP models, and more domain-specific language models are expected to further improve performance

Conclusion

  • Combining text encoders, databases, and large language models can provide accurate outputs to medical queries.
  • Current practices involve clinicians manually searching and curating medical documents.
  • Almanac recasts clinical queries as search and retrieval tasks, while performing knowledge distillation via an LLM.