Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large-language models have been used in natural language tasks.
- Adoption of these models in real-world settings has been limited due to incorrect and toxic statements.
- This paper explores the ability of large-language models to streamline medical guidelines and recommendation referencing.
- Improved factual grounding, helpfulness, and safety is demonstrated in clinical scenarios.
Paper Content
Introduction
- Language model pre-training is a powerful training paradigm in NLP
- Performance improvements have been observed to scale with model and dataset size
- LLMs can be prone to generating factually invalid statements
- LLMs can reproduce social biases and generate statements reinforcing stereotypes
- Different ways of steering LLM outputs to align with user-intent have been explored
- LLMs have been used in transformative applications
- LLMs are at risk of adversarial attacks
- Almanac is a framework to explore the role of LLMs in the clinical workflow
Related work
- Pre-training transformers on scientific and biomedical corpora has improved performance on biomedical tasks
- Smaller domain-specific language models can be beneficial even with limited data
- Large language models are prone to hallucinations and biases
- Leveraging language models for language understanding and modeling capabilities can improve accuracy of question answering
- External tools can be used to retrieve knowledge and improve clinically useful tasks
Results
- Almanac outperforms its counterpart in factuality and has evenly matched performances in completeness and safety
- ChatGPT struggles to provide references with 1 correct, 15 invalid, and 4 incorrect references
- Almanac is able to correctly reference various sources for fact-checking
Dataset
- Task of medical question answering is evaluated
- Existing datasets are not sufficient for capturing clinical scenarios
- ClinicalQA is a novel benchmark of clinical questions
- Summary statistics and samples are provided in tables
- ClinicalQA can serve as a benchmark for LM-based clinical decision-making support systems
- Future work will expand dataset to include more varied examples and multi-modal inputs
Architecture
- Almanac consists of many components working together to achieve accurate document retrieval, reasoning, and question-answering.
- Database stores content semantically, using information-dense vectors and a similarity metric.
- Browser accesses predetermined domains to fetch information from the web.
- Retriever encodes queries and reference materials into the same high-dimensional space.
- Language model extracts relevant information from scored context to formulate an answer.
Evaluation
- Evaluated Almanac on subset of ClinicalQA
- Framework with physician feedback to ensure alignment with 3 key objectives
- Current LLM evaluation metrics rely on automated methods such as BLEU
- Rubric to assess outputs with aim of addressing factuality, completeness, and safety
- Each physician independently evaluates questions within their respective specialties
Discussion
- LLMs can hallucinate incorrect facts and have biases
- System designed to distill information from a given context, no need for continuous training
- System mitigates explainability concerns by enabling clinicians to review documents used to formulate output
- Improved knowledge retrieval on black-box NLP models, more domain-specific language models will further improve performance
Conclusion
- Combining text encoders, databases, and large language models can provide accurate outputs to medical queries.
- Current practices involve clinicians manually searching and curating medical documents.
- Almanac refactors clinical queries into search and retrieval tasks, while performing knowledge distillation via LLM.