Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • ChatGPT is a large language model with human-like expression and reasoning abilities
  • This study investigates the feasibility of using ChatGPT to translate radiology reports into plain language
  • Radiology reports from 62 low-dose chest CT lung cancer screening scans and 76 brain MRI metastases screening scans were collected
  • ChatGPT can successfully translate radiology reports with an average score of 4.1
  • ChatGPT offers general relevant suggestions and specific suggestions for 37% of cases
  • GPT-4 can significantly improve the quality of translated reports

Paper Content

Introduction

  • OpenAI released ChatGPT, a state-of-the-art NLP model in November 2022
  • ChatGPT has received global attention with over 100 million users
  • ChatGPT can answer general queries and perform various tasks
  • ChatGPT is a quantum leap compared to previous NLP models
  • ChatGPT is being adapted for downstream tasks
  • ChatGPT is being investigated for clinical usage

Methodology

Report acquisition

  • Collected 62 chest CT and 76 brain MRI screening reports from Atrium Health Wake Forest Baptist clinical database
  • Reports generated between February 1st and 13th
  • Reports de-identified by removing sensitive patient information
  • Chest CT reports followed low-dose protocol without contrast agents
  • Patients aged 53-80 with average of 66.9 years old (32 male, 30 female)
  • Reports finalized by 11 experienced radiologists with average of 278 ± 57 words
  • Reports classified into 6 classes based on Lung-RADS category
  • Brain MRI reports followed brain tumor protocol with/without contrast agent
  • Patients aged 5-98 with average of 55.0 years old (45 male, 31 female)
  • Reports finalized by 14 experienced radiologists with 247 ± 92 words
  • Reports classified into 3 classes based on metastases findings

Experimental design

  • ChatGPT was given 3 prompts
  • ChatGPT responses were collected in mid February
  • Prompts were related to radiology reports

Performance evaluation

  • Experienced radiologists evaluated the quality of ChatGPT responses based on 3 aspects: overall score, completeness, and correctness.
  • Statistical analysis was conducted to record the number of places where information was missing or misinterpreted, as well as high-frequency suggestions, the percentage of specific suggestions, and the percentage of inappropriate suggestions.

Chatgpt-translated reports versus the original reports

  • ChatGPT generated plain language versions with generally fewer words in both chest CT and brain MRI cases.
  • ChatGPT can reduce the length of the original reports by up to 54%.
  • ChatGPT replaces medical terminologies with common words.
  • ChatGPT integrates information from different sections of the original report.

Evaluation of chatgpt translations by radiologists

  • Radiologists evaluated the quality of translated reports using 3 metrics.
  • ChatGPT performed well on chest CT and brain MRI scan reports.
  • 53% of chest CT reports and 34-35% of brain MRI scan reports were rated with an overall score of 4 or 5.

Evaluation of chatgpt-generated suggestions

  • ChatGPT cannot provide medical advice or treatment
  • ChatGPT provides general suggestions for patients and healthcare providers
  • Statistical analysis shows that the suggestions are highly relevant
  • Suggestions based on radiology reports include “follow up with doctors” and “communicate the findings clearly to patient”
  • ChatGPT provides specific suggestions based on findings in the radiology report

Robustness of chatgpt’s translations

  • ChatGPT’s translation of radiology reports is not unique
  • 10 translations of the same chest CT radiology report were collected and evaluated
  • 55.2% of all translated points were good translations
  • 19.2%, 24.8%, and 0.8% of points were completely omitted, partially translated, and misinterpreted respectively
  • Lung nodule findings were inaccurately translated

Optimized prompt for improved translation

  • ChatGPT tends to generate different responses given the same input
  • A reason for this is the ambiguity of the prompts
  • Optimized the prompt to be comprehensive and specific
  • Quality of translation increased from 55.2% to 77.2%

Different prompts on chatgpt’s performance

  • Investigated effect of prompt engineering on ChatGPT’s performance
  • Changed first prompt into 5 different formats
  • Evaluated ChatGPT’s responses with same method as before
  • Results similar to original prompt, worse than optimized prompt
  • Fourth prompt designed by ChatGPT performed slightly better than other four

Chatgpt’s ensemble learning results

  • Investigated ChatGPT’s performance via ensemble learning
  • Randomly selected 5 translated reports and input into ChatGPT for information integration
  • Results of 10 ensemble learning presented in Table 10
  • ChatGPT cannot generate significantly better results through ensemble learning

Comparison with gpt-4

  • OpenAI launched GPT-4 with impressive performance on multi-modal tasks
  • GPT-4 improved quality of translated reports with higher good rates and lower other rates
  • GPT-4’s results on original prompt was competitive with ChatGPT using optimized prompt
  • GPT-4 almost achieved 100% good rate
  • GPT-4 still has some randomness

Discussions

  • ChatGPT can be used for multiple purposes such as writing news, telling stories, and language translation
  • ChatGPT has three merits for radiology report translation: conciseness, clarity, and comprehensiveness
  • ChatGPT deletes redundant words and summarizes multiple findings in a single sentence
  • ChatGPT replaces complicated medical terminologies with commonly-used words
  • ChatGPT has a strong ability to understand the original radiology report and integrate information
  • ChatGPT’s responses are uncertain and can generate distinctive responses each time
  • ChatGPT does not have a built-in template for its generated report translation

Conclusion

  • Investigated feasibility and utility of ChatGPT in clinical applications
  • Evaluated ChatGPT’s performance with an overall score of 4.098
  • ChatGPT’s translations tend to over-simplify or over-look key points
  • GPT-4 can significantly improve quality of translated reports