Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have been used for natural language understanding and generation.
There is no standard to evaluate model predictions and reasoning across tasks.
MultiMedQA is a benchmark combining existing open question answering datasets.
Human evaluation is proposed to assess model answers.
PaLM and Flan-PaLM are evaluated on MultiMedQA and achieve state-of-the-art accuracy.
Instruction prompt tuning is introduced to align LLMs to new domains.

Paper Content

Introduction

Medicine is a humane endeavor that relies on language
AI models for medicine and healthcare have largely failed to utilize language
Recent advances in large language models offer an opportunity to rethink AI systems
Potential applications of large language models in medicine include knowledge retrieval, clinical decision support, summarization, triaging, etc.
Safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks
Curation of MultiMedQA, a benchmark for medical question answering
Evaluation of PaLM and its instruction-tuned variant, Flan-PaLM, on MultiMedQA
Flan-PaLM exceeded SOTA performance on MedQA (USMLE), MedMCQA, PubMedQA, and MMLU clinical topics
Introduction of instruction prompt tuning to further align Flan-PaLM to the medical domain
Med-PaLM’s answers to consumer medical questions compared favorably with clinician-generated answers
Key limitations of LLMs revealed through human evaluation framework

Related work

LLMs have shown impressive performance on NLP tasks
LLMs owe their success to scaling up transformer-based models
LLMs are often trained using self-supervision at large scale
LLMs have demonstrated promising results across a wide range of tasks
LLMs have the capacity to act as implicit knowledge bases
LLMs for science and biomedicine have been developed

Methods

Dataset: MultiMedQA benchmark for medical question answering
Framework for human evaluation: rating framework for model and clinician answers
Modeling: Large language models and methods to align to medical domain

Datasets

LLMs have potential in medicine, specifically medical question answering
Several medical question answering datasets already exist
Datasets vary in format, capabilities tested, domain, question source, and labels/metadata
MultiMedQA includes multiple-choice and long-form answer questions
MultiMedQA includes MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, and MMLU clinical topics
HealthSearchQA is a new dataset of curated commonly searched health queries
MultiMedQA allows for probing of LLMs along multiple axes
Future work may include other relevant datasets

Framework for human evaluation

LLMs can be used to answer medical questions
Objective accuracy metrics on multiple-choice questions are a robust measure of model performance, but omit important details
Human evaluation of long-form answers to medical questions is proposed
Evaluation axes include agreement with scientific consensus, possibility and likelihood of harm, evidence of comprehension, reasoning and retrieval ability, presence of inappropriate, incorrect or missing content, and possibility of bias in the answer
Evaluation is done by clinicians and lay users
LLMs are adapted and aligned with domain-specific data
Prompting strategies are used to achieve fast in-context learning
Few-shot, chain-of-thought and self-consistency prompting are used
Prompt tuning is a simple and computationally inexpensive approach to finetuning LLMs
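
As an illustration of the self-consistency prompting mentioned above, here is a minimal sketch: sample several chain-of-thought completions for the same prompt and take a majority vote over the final answers. The `sample` callable, the `Answer: (X)` completion format, and the helper names are assumptions made for this example; they are not taken from the paper.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter found in a completion, or None.

    Assumes (hypothetically) that completions state the answer as 'Answer: (X)'.
    """
    matches = re.findall(r"Answer:\s*\(?([A-E])\)?", completion)
    return matches[-1] if matches else None


def self_consistency(sample: Callable[[str], str], prompt: str, n: int = 5) -> Optional[str]:
    """Sample n completions and return the most frequent answer letter.

    `sample` stands in for a stochastic (temperature > 0) call to a model.
    """
    votes: Counter = Counter()
    for _ in range(n):
        choice = extract_choice(sample(prompt))
        if choice is not None:
            votes[choice] += 1
    if not votes:
        return None  # no completion contained a parseable answer
    return votes.most_common(1)[0][0]
```

In practice `sample` would wrap a model call with a few-shot chain-of-thought prompt; the vote share of the winning answer can also be kept as a rough confidence signal.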

Results

Flan-PaLM 540B achieved 79.0% accuracy on the PubMedQA dataset, slightly above the previous state of the art set by BioGPT.
Human performance on PubMedQA is 78.0%.

State-of-the-art performance on MMLU clinical topics

MMLU dataset contains multiple-choice questions from several clinical knowledge, medicine and biology related topics
Flan-PaLM 540B achieved state-of-the-art performance on all of these subsets
Instruction tuning improves performance on medical question answering
Scaling improves performance on medical question answering
Chain-of-Thought (CoT) prompting did not improve performance
Self-consistency (SC) leads to strong improvement in multiple-choice performance
LLMs can generate statements inconsistent with fact
Selective prediction task used to measure relationship between LLM uncertainty and statement accuracy
140 questions evaluated by clinicians to understand how answers related to current scientific consensus
Clinicians’ answers were judged to be aligned with the scientific consensus in 92.9% of questions
Flan-PaLM was found to be in agreement with the scientific consensus in only 61.9% of answers
Med-PaLM (instruction prompt-tuned Flan-PaLM) was found to be in agreement with the scientific consensus in 92.6% of answers
Clinicians asked to identify evidence of correct/incorrect medical reading comprehension, medical knowledge retrieval and medical reasoning capabilities
Expert generated answers were considerably superior to Flan-PaLM and Med-PaLM
Clinician answers showed evidence of inappropriate/incorrect content in only 1.4% of the cases, compared to 16.1% for Flan-PaLM and 18.7% for Med-PaLM
Clinician answers were judged to have missing important information in 11.1% of the cases, compared to 47.2% for Flan-PaLM and 15.1% for Med-PaLM
Instruction prompt tuning helped improve model performance in omission of important information
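
The selective prediction task listed above can be sketched as follows: rank questions by a confidence score, withhold (defer) the least confident fraction, and measure accuracy on the rest. This is only an illustrative sketch; the function name, the `(confidence, is_correct)` input format, and the choice of confidence score (for example, the self-consistency vote share) are assumptions for this example.

```python
from typing import List, Tuple


def accuracy_with_deferral(results: List[Tuple[float, bool]],
                           deferring_fraction: float) -> float:
    """Accuracy on the questions kept after deferring the least confident ones.

    results: one (confidence, is_correct) pair per question (hypothetical format).
    deferring_fraction: share of questions to withhold, in [0, 1).
    """
    if not results:
        raise ValueError("results must not be empty")
    if not 0.0 <= deferring_fraction < 1.0:
        raise ValueError("deferring_fraction must be in [0, 1)")
    # Most confident first, then drop the tail that is deferred.
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    keep = len(ranked) - int(len(ranked) * deferring_fraction)
    kept = ranked[:keep]
    return sum(1 for _, correct in kept if correct) / len(kept)
```

If confidence is informative, accuracy should rise as `deferring_fraction` increases; a flat curve would suggest the score carries little information about correctness.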
