Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large language models have been used in natural language tasks.
Adoption of these models in real-world settings has been limited because they can produce incorrect and toxic statements.
This paper explores the ability of large language models to streamline the referencing of medical guidelines and recommendations.
Improved factual grounding, helpfulness, and safety are demonstrated in clinical scenarios.

Language model pre-training is a powerful training paradigm in NLP
Performance improvements have been observed to scale with model and dataset size
LLMs can be prone to generating factually invalid statements
LLMs can reproduce social biases and generate statements reinforcing stereotypes
Different ways of steering LLM outputs to align with user intent have been explored
LLMs have been used in transformative applications
LLMs are at risk of adversarial attacks
Almanac is a framework to explore the role of LLMs in the clinical workflow

Pre-training transformers on scientific and biomedical corpora has improved performance on biomedical tasks
Smaller domain-specific language models can be beneficial even with limited data
Large language models are prone to hallucinations and biases
Leveraging the language understanding and modeling capabilities of language models can improve the accuracy of question answering
External tools can be used to retrieve knowledge and improve clinically useful tasks

Almanac outperforms its counterpart (ChatGPT) in factuality and is evenly matched with it in completeness and safety
ChatGPT struggles to provide references, producing 1 correct, 15 invalid, and 4 incorrect references
Almanac is able to correctly reference various sources for fact-checking

The task of medical question answering is evaluated
Existing datasets are not sufficient for capturing clinical scenarios
ClinicalQA is a novel benchmark of clinical questions
Summary statistics and samples are provided in tables
ClinicalQA can serve as a benchmark for LM-based clinical decision-making support systems
Future work will expand the dataset to include more varied examples and multi-modal inputs

Almanac consists of many components working together to achieve accurate document retrieval, reasoning, and question-answering.
The database stores content semantically, using information-dense vectors and a similarity metric.
The browser accesses predetermined domains to fetch information from the web.
The retriever encodes queries and reference materials into the same high-dimensional space.
The language model extracts relevant information from the scored context to formulate an answer.
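
The way these components fit together can be illustrated with a minimal sketch. It is not Almanac's implementation: the bag-of-words `embed` function is a toy stand-in for a learned text encoder, `ALLOWED_DOMAINS` is a hypothetical allowlist playing the role of the predetermined domains, cosine similarity is an assumed choice of similarity metric, and `build_prompt` only shows how scored context might be handed to a language model.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical allowlist standing in for the "predetermined domains".
ALLOWED_DOMAINS = {"example.org"}


def is_allowed(url: str) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def embed(text: str) -> Counter:
    """Toy encoder: sparse word counts (assumption; a real system would use a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[tuple[str, str]], k: int = 2):
    """Score (url, text) pairs from allowed domains against the query; return the top k."""
    q = embed(query)  # query and documents share the same vector space
    scored = [
        (cosine(q, embed(text)), url, text)
        for url, text in documents
        if is_allowed(url)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, scored) -> str:
    """Assemble the scored context into a prompt that asks for cited answers."""
    context = "\n".join(
        f"[{i + 1}] ({url}) {text}" for i, (_, url, text) in enumerate(scored)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above and cite the numbered sources."
    )
```

Because each retrieved passage keeps its source URL, the answer can point back to the documents it was drawn from, which is what makes the output reviewable.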

Almanac is evaluated on a subset of ClinicalQA
The evaluation framework uses physician feedback to check alignment with 3 key objectives
Current LLM evaluation metrics rely on automated methods such as BLEU
A rubric is used to assess outputs for factuality, completeness, and safety
Each physician independently evaluates questions within their respective specialties
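
As a rough illustration of how such rubric ratings could be tabulated, the sketch below averages per-criterion scores for each system. The 0/1 scoring scale, the record layout, and the `summarize` helper are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

# The three rubric objectives named above.
CRITERIA = ("factuality", "completeness", "safety")


def summarize(evaluations):
    """Average rubric scores per system.

    evaluations: iterable of (system_name, scores) pairs, one per physician
    rating of one answer, where scores maps each criterion to a number
    (assumed here to be 0 or 1).
    """
    by_system = defaultdict(list)
    for system, scores in evaluations:
        by_system[system].append(scores)
    return {
        system: {c: mean(row[c] for row in rows) for c in CRITERIA}
        for system, rows in by_system.items()
    }
```

Calling `summarize` on a list of ratings returns one dictionary per system with a mean score for each of the three criteria, which allows a side-by-side comparison.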

LLMs can hallucinate incorrect facts and have biases
The system is designed to distill information from a given context, so it does not need continuous training
The system mitigates explainability concerns by enabling clinicians to review the documents used to formulate an output
Knowledge retrieval improves on black-box NLP models, and more domain-specific language models are expected to further improve performance

Combining text encoders, databases, and large language models can provide accurate outputs to medical queries.
Current practices involve clinicians manually searching and curating medical documents.
Almanac refactors clinical queries into search and retrieval tasks, while performing knowledge distillation via an LLM.