Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs have been used for natural language understanding and generation.
  • There is no standard to evaluate model predictions and reasoning across tasks.
  • MultiMedQA is a benchmark combining existing open question answering datasets.
  • Human evaluation is proposed to assess model answers.
  • PaLM and Flan-PaLM are evaluated on MultiMedQA and achieve state-of-the-art accuracy.
  • Instruction prompt tuning is introduced to align LLMs to new domains.

Paper Content

Introduction

  • Medicine is a humane endeavor that relies on language
  • AI models for medicine and healthcare have largely failed to utilize language
  • Recent advances in large language models offer an opportunity to rethink AI systems
  • Potential applications of large language models in medicine include knowledge retrieval, clinical decision support, summarization, triaging, etc.
  • Safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks
  • Curation of MultiMedQA, a benchmark for medical question answering
  • Evaluation of PaLM and its instructed-tuned variant, Flan-PaLM, on MultiMedQA
  • Flan-PaLM exceeded SOTA performance on MedQA (USMLE), MedMCQA, PubMedQA, and MMLU clinical topics
  • Introduction of instruction prompt tuning to further align Flan-PaLM to the medical domain
  • Med-PaLM’s answers to consumer medical questions compared favorably with clinician-generated answers
  • Key limitations of LLMs revealed through human evaluation framework
  • LLMs have shown impressive performance on NLP tasks
  • LLMs owe their success to scaling up transformer-based models
  • LLMs are often trained using self-supervision on large scale
  • LLMs have demonstrated promising results across a wide range of tasks
  • LLMs have the capacity to act as implicit knowledge bases
  • LLMs for science and biomedicine have been developed

Methods

  • Dataset: MultiMedQA benchmark for medical question answering
  • Framework for human evaluation: rating framework for model and clinician answers
  • Modeling: Large language models and methods to align to medical domain

Datasets

  • LLMs have potential in medicine, specifically medical question answering
  • Existing medical question answering datasets exist
  • Datasets vary in format, capabilities tested, domain, question source, and labels/metadata
  • MultiMedQA includes multiple-choice and long-form answer questions
  • MultiMedQA includes MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, and MMLU clinical topics
  • HealthSearchQA is a new dataset of curated commonly searched health queries
  • MultiMedQA allows for probing of LLMs along multiple axes
  • Future work may include other relevant datasets

Framework for human evaluation

  • LLMs can be used to answer medical questions
  • Objective accuracy metrics on multiple-choice questions are a robust measure of model performance, but omit important details
  • Human evaluation of long-form answers to medical questions is proposed
  • Evaluation includes agreement with scientific consensus, possibility and likelihood of harm, evidence of comprehension, reasoning and retrieval ability, presence of inappropriate, incorrect or missing content and possibility of bias in the answer
  • Evaluation is done by clinicians and lay users
  • LLMs are adapted and aligned with domain-specific data
  • Prompting strategies are used to achieve fast in-context learning
  • Few-shot, chain-of-thought and self-consistency prompting are used
  • Prompt tuning is a simple and computationally inexpensive approach to finetuning LLMs

Results

  • PubMedGPT and BioGPT models achieved 79.0% accuracy on the PubMedQA dataset.
  • Human performance on PubMedQA is 78.0%.

State-of-the-art performance on mmlu clinical topics

  • MMLU dataset contains multiple-choice questions from several clinical knowledge, medicine and biology related topics
  • Flan-PaLM 540B achieved state of the art performance on all these subsets
  • Instruction tuning improves performance on medical question answering
  • Scaling improves performance on medical question answering
  • Chain-of-Thought (CoT) prompting did not improve performance
  • Self-consistency (SC) leads to strong improvement in multiple-choice performance
  • LLMs can generate statements inconsistent with fact
  • Selective prediction task used to measure relationship between LLM uncertainty and statement accuracy
  • 140 questions evaluated by clinicians to understand how answers related to current scientific consensus
  • Clinicians’ answers were judged to be aligned with the scientific consensus in 92.9% of questions
  • Flan-PaLM was found to be in agreement with the scientific consensus in only 61.9% of answers
  • Med-PaLM (instruction prompt-tuned Flan-PaLM) was found to be in agreement with the scientific consensus in 92.9% of answers
  • Clinicians asked to identify evidence of correct/incorrect medical reading comprehension, medical knowledge retrieval and medical reasoning capabilities
  • Expert generated answers were considerably superior to Flan-PaLM and Med-PaLM
  • Clinician answers showed evidence of inappropriate/incorrect content in only 1.4% of the cases, compared to 16.1% for Flan-PaLM and 18.7% for Med-PaLM
  • Clinician answers were judged to have missing important information in 2.9% of the cases, compared to 47.2% for Flan-PaLM and 15.1% for Med-PaLM
  • Instruction prompt tuning helped improve model performance in omission of important information