Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

ChatGPT is a large language model with human-like expression and reasoning abilities
This study investigates the feasibility of using ChatGPT to translate radiology reports into plain language
Radiology reports from 62 low-dose chest CT lung cancer screening scans and 76 brain MRI metastases screening scans were collected
ChatGPT can successfully translate radiology reports into plain language, with an average radiologist score of 4.1 out of 5
ChatGPT's suggestions are generally relevant, and it offers specific suggestions in about 37% of cases
GPT-4 can significantly improve the quality of translated reports

Paper Content

Introduction

OpenAI released ChatGPT, a state-of-the-art NLP model, in November 2022
ChatGPT has received global attention with over 100 million users
ChatGPT can answer general queries and perform various tasks
The authors describe ChatGPT as a major advance over previous NLP models
ChatGPT is being adapted for downstream tasks
ChatGPT is being investigated for clinical usage

Methodology

Report acquisition

Collected 62 chest CT and 76 brain MRI screening reports from Atrium Health Wake Forest Baptist clinical database
Reports generated between February 1st and 13th
Reports de-identified by removing sensitive patient information
Chest CT reports followed low-dose protocol without contrast agents
Patients were aged 53-80, with a mean age of 66.9 years (32 male, 30 female)
Reports were finalized by 11 experienced radiologists and averaged 278 ± 57 words
Reports classified into 6 classes based on Lung-RADS category
Brain MRI reports followed brain tumor protocol with/without contrast agent
Patients were aged 5-98, with a mean age of 55.0 years (45 male, 31 female)
Reports were finalized by 14 experienced radiologists and averaged 247 ± 92 words
Reports classified into 3 classes based on metastases findings

Experimental design

ChatGPT was given 3 prompts
ChatGPT responses were collected in mid February
Prompts were related to radiology reports

Performance evaluation

Experienced radiologists evaluated the quality of ChatGPT responses based on 3 aspects: overall score, completeness, and correctness.
Statistical analysis was conducted to record the number of places where information was missing or misinterpreted, as well as high-frequency suggestions, the percentage of specific suggestions, and the percentage of inappropriate suggestions.
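The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the `Rating` schema, its field names, and the toy values are assumptions made for the example, with missing information standing in for completeness and misinterpretation standing in for correctness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rating:
    """One radiologist's rating of one translated report (hypothetical schema)."""
    overall: int         # overall score on a 1-5 scale
    missing: int         # number of places where information was missing
    misinterpreted: int  # number of places where information was misinterpreted


def summarize(ratings: list[Rating]) -> dict[str, float]:
    """Average each evaluation aspect over all rated reports."""
    if not ratings:
        raise ValueError("no ratings to summarize")
    return {
        "mean_overall": mean(r.overall for r in ratings),
        "mean_missing": mean(r.missing for r in ratings),
        "mean_misinterpreted": mean(r.misinterpreted for r in ratings),
    }


# Toy ratings for illustration only; these are not values from the study.
ratings = [Rating(5, 0, 0), Rating(4, 1, 0), Rating(3, 0, 1), Rating(4, 0, 0)]
summary = summarize(ratings)
print(summary)
```

Reporting the mean number of missing and misinterpreted places per report, alongside the mean overall score, keeps the completeness and correctness aspects separate from the headline score.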

ChatGPT-translated reports versus the original reports

ChatGPT generated plain language versions with generally fewer words in both chest CT and brain MRI cases.
ChatGPT can reduce the length of the original reports by up to 54%.
ChatGPT replaces medical terminologies with common words.
ChatGPT integrates information from different sections of the original report.
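A length reduction such as the one reported above can be measured by comparing word counts. The sketch below assumes whitespace-delimited word counting, which may differ from how the authors counted words; the function names and the two sample sentences are invented for illustration.

```python
def word_count(text: str) -> int:
    """Count whitespace-delimited words."""
    return len(text.split())


def length_reduction(original: str, translated: str) -> float:
    """Fraction by which the translation is shorter than the original, by word count."""
    n_original = word_count(original)
    if n_original == 0:
        raise ValueError("original report is empty")
    return 1 - word_count(translated) / n_original


# Invented example sentences; not taken from any report in the study.
original = "No suspicious pulmonary nodules are identified within either lung on this examination"
translated = "No worrying spots were found in the lungs"
reduction = length_reduction(original, translated)
print(f"Length reduction: {reduction:.0%}")
```

A negative result would indicate that the plain-language version is longer than the original, which can happen when terminology is replaced by explanations.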

Evaluation of ChatGPT translations by radiologists

Radiologists evaluated the quality of translated reports using 3 metrics.
ChatGPT performed well on chest CT and brain MRI scan reports.
53% of chest CT reports and 34-35% of brain MRI scan reports were rated with an overall score of 4 or 5.

Evaluation of ChatGPT-generated suggestions

ChatGPT cannot provide medical advice or treatment
ChatGPT provides general suggestions for patients and healthcare providers
Statistical analysis shows that the suggestions are highly relevant
Suggestions based on radiology reports include “follow up with doctors” and “communicate the findings clearly to patient”
ChatGPT provides specific suggestions based on findings in the radiology report

Robustness of ChatGPT’s translations

ChatGPT’s translation of a given radiology report is not unique: repeated runs produce different outputs
10 translations of the same chest CT radiology report were collected and evaluated
55.2% of all translated points were good translations
19.2%, 24.8%, and 0.8% of points were completely omitted, partially translated, and misinterpreted, respectively
Lung nodule findings were inaccurately translated
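The four rates above are shares of labelled information points pooled over the repeated translations. A minimal sketch of that tally follows, assuming each key point in each translation receives exactly one of four labels; the label names, the function, and the toy data are hypothetical and do not reproduce the study's counts.

```python
from collections import Counter

LABELS = ("good", "omitted", "partial", "misinterpreted")


def label_rates(labels_per_translation: list[list[str]]) -> dict[str, float]:
    """Pool per-point labels over all translations and return each label's share."""
    counts = Counter(label for labels in labels_per_translation for label in labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labelled points")
    return {label: counts[label] / total for label in LABELS}


# Toy data: three translations of one report, four key points each.
translations = [
    ["good", "good", "partial", "omitted"],
    ["good", "partial", "good", "good"],
    ["good", "omitted", "misinterpreted", "good"],
]
rates = label_rates(translations)
for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```

Because every point gets exactly one label, the shares sum to 100%, as the reported figures do (55.2% + 19.2% + 24.8% + 0.8%).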

Optimized prompt for improved translation

ChatGPT tends to generate different responses given the same input
A reason for this is the ambiguity of the prompts
Optimized the prompt to be comprehensive and specific
Quality of translation increased from 55.2% to 77.2%
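The improvement can be read either as an absolute gain in percentage points or as a relative gain over the baseline. The short calculation below uses only the two rates reported above; the variable names are chosen for the example.

```python
baseline_good_rate = 55.2   # % of points translated well with the original prompt
optimized_good_rate = 77.2  # % of points translated well with the optimized prompt

# Absolute gain is measured in percentage points.
absolute_gain = optimized_good_rate - baseline_good_rate

# Relative gain expresses the same change as a fraction of the baseline.
relative_gain = absolute_gain / baseline_good_rate

print(f"Absolute gain: {absolute_gain:.1f} percentage points")
print(f"Relative gain: {relative_gain:.1%}")
```

The two figures answer different questions, so it helps to state which one is meant when quoting the improvement.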

Different prompts on ChatGPT’s performance

Investigated effect of prompt engineering on ChatGPT’s performance
Changed first prompt into 5 different formats
Evaluated ChatGPT’s responses with same method as before
Results similar to original prompt, worse than optimized prompt
Fourth prompt designed by ChatGPT performed slightly better than other four

ChatGPT’s ensemble learning results

Investigated ChatGPT’s performance via ensemble learning
Randomly selected 5 translated reports and input into ChatGPT for information integration
Results of 10 ensemble learning runs are presented in Table 10
ChatGPT cannot generate significantly better results through ensemble learning
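The ensemble procedure above (sample 5 translated reports, then ask the model to integrate them) can be sketched as follows. The integration prompt wording, the function names, and the `ask_model` callable are all assumptions made for illustration; the summary does not give the prompt the authors used, and no real API is called here.

```python
import random
from typing import Callable, Optional


def build_integration_prompt(translations: list[str]) -> str:
    """Assemble a single prompt from several translations (hypothetical wording)."""
    numbered = "\n\n".join(
        f"Translation {i}:\n{text}" for i, text in enumerate(translations, start=1)
    )
    return (
        "The following are plain-language versions of the same radiology report. "
        "Integrate them into one plain-language version.\n\n" + numbered
    )


def ensemble_translate(
    translations: list[str],
    ask_model: Callable[[str], str],
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Randomly pick k translations and ask the model to merge them."""
    rng = rng or random.Random()
    chosen = rng.sample(translations, k)  # sampling without replacement
    return ask_model(build_integration_prompt(chosen))


def echo_model(prompt: str) -> str:
    """Stand-in for a language model call: returns the prompt unchanged."""
    return prompt


# Placeholder pool of 10 translations of one report.
pool = [f"plain-language version {i}" for i in range(10)]
result = ensemble_translate(pool, echo_model, k=5, rng=random.Random(0))
print(result)
```

Passing the model as a callable keeps the sampling and prompt assembly testable without network access; repeating the call with different seeds would give the 10 ensemble runs.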

Comparison with GPT-4

OpenAI launched GPT-4 with impressive performance on multi-modal tasks
GPT-4 improved quality of translated reports with higher good rates and lower other rates
GPT-4’s results on the original prompt were competitive with ChatGPT using the optimized prompt
GPT-4 achieved a good rate of almost 100%
GPT-4 still has some randomness

Discussions

ChatGPT can be used for multiple purposes such as writing news, telling stories, and language translation
ChatGPT has three merits for radiology report translation: conciseness, clarity, and comprehensiveness
ChatGPT deletes redundant words and summarizes multiple findings in a single sentence
ChatGPT replaces complicated medical terminologies with commonly-used words
ChatGPT has a strong ability to understand the original radiology report and integrate information
ChatGPT’s responses are non-deterministic: it can generate a different response each time
ChatGPT does not have a built-in template for its generated report translation

Conclusion

Investigated feasibility and utility of ChatGPT in clinical applications
Evaluated ChatGPT’s performance with an overall score of 4.098
ChatGPT’s translations tend to oversimplify or overlook key points
GPT-4 can significantly improve quality of translated reports
