Abstract
- ChatGPT is a large language model with human-like expression and reasoning abilities
- This study investigates the feasibility of using ChatGPT to translate radiology reports into plain language
- Radiology reports from 62 low-dose chest CT lung cancer screening scans and 76 brain MRI metastases screening scans were collected
- ChatGPT can successfully translate radiology reports, with an average score of 4.1 on a five-point scale
- ChatGPT offers generally relevant suggestions, plus specific suggestions for 37% of cases
- GPT-4 can significantly improve the quality of translated reports
Paper Content
Introduction
- OpenAI released ChatGPT, a state-of-the-art NLP model, in November 2022
- ChatGPT has received global attention with over 100 million users
- ChatGPT can answer general queries and perform various tasks
- ChatGPT is a quantum leap compared to previous NLP models
- ChatGPT is being adapted for downstream tasks
- ChatGPT is being investigated for clinical usage
Methodology
Report acquisition
- Collected 62 chest CT and 76 brain MRI screening reports from the Atrium Health Wake Forest Baptist clinical database
- Reports generated between February 1st and 13th
- Reports de-identified by removing sensitive patient information
- Chest CT reports followed low-dose protocol without contrast agents
- Patients aged 53-80, with an average age of 66.9 years (32 male, 30 female)
- Reports finalized by 11 experienced radiologists, with an average length of 278 ± 57 words
- Reports classified into 6 classes based on Lung-RADS category
- Brain MRI reports followed brain tumor protocol with/without contrast agent
- Patients aged 5-98, with an average age of 55.0 years (45 male, 31 female)
- Reports finalized by 14 experienced radiologists, with an average length of 247 ± 92 words
- Reports classified into 3 classes based on metastases findings
Experimental design
- ChatGPT was given 3 prompts
- ChatGPT responses were collected in mid-February
- Prompts asked for a plain-language translation of the radiology report and for suggestions for patients and healthcare providers
Performance evaluation
- Experienced radiologists evaluated the quality of ChatGPT responses based on 3 aspects: overall score, completeness, and correctness.
- Statistical analysis was conducted to record the number of places where information was missing or misinterpreted, as well as high-frequency suggestions, the percentage of specific suggestions, and the percentage of inappropriate suggestions.
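The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the `Rating` schema, the 1-5 scale for the overall score, and the toy values are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rating:
    """One radiologist's evaluation of one translated report (hypothetical schema)."""
    overall: int          # overall score, assumed to be on a 1-5 scale
    missing: int          # places where information was missing (completeness)
    misinterpreted: int   # places where information was misinterpreted (correctness)


def summarize(ratings):
    """Return mean scores and the share of reports rated 4 or 5 overall."""
    return {
        "mean_overall": mean(r.overall for r in ratings),
        "mean_missing": mean(r.missing for r in ratings),
        "mean_misinterpreted": mean(r.misinterpreted for r in ratings),
        "share_rated_4_or_5": sum(r.overall >= 4 for r in ratings) / len(ratings),
    }


# Toy ratings for illustration only; these are not data from the study.
toy = [Rating(5, 0, 0), Rating(4, 1, 0), Rating(3, 1, 1), Rating(4, 0, 0)]
summary = summarize(toy)
print(summary)
```

Keeping the missing and misinterpreted counts separate from the overall score mirrors the three evaluation aspects listed above, so each can be reported on its own.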
ChatGPT-translated reports versus the original reports
- ChatGPT generated plain language versions with generally fewer words in both chest CT and brain MRI cases.
- ChatGPT can reduce the length of the original reports by up to 54%.
- ChatGPT replaces medical terminologies with common words.
- ChatGPT integrates information from different sections of the original report.
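A length reduction such as the one reported above can be computed from word counts. The sketch below assumes whitespace-separated tokens are an adequate proxy for report length; the function names and the two example sentences are made up for illustration and do not come from the study.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens, a simple proxy for report length."""
    return len(text.split())


def length_reduction(original: str, translated: str) -> float:
    """Fraction by which the translation is shorter than the original report."""
    n_orig = word_count(original)
    if n_orig == 0:
        raise ValueError("original report is empty")
    return 1 - word_count(translated) / n_orig


# Made-up sentences for illustration; not taken from any report in the study.
original = "There is a 4 mm solid pulmonary nodule in the right upper lobe"
translated = "A small spot was seen in the right lung"
reduction = length_reduction(original, translated)
print(f"{reduction:.0%}")
```

A negative result would mean the translation is longer than the original, which is possible when a term needs a longer plain-language explanation.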
Evaluation of ChatGPT translations by radiologists
- Radiologists evaluated the quality of translated reports using 3 metrics.
- ChatGPT performed well on chest CT and brain MRI scan reports.
- 53% of chest CT reports and 34-35% of brain MRI scan reports were rated with an overall score of 4 or 5.
Evaluation of ChatGPT-generated suggestions
- ChatGPT cannot provide medical advice or treatment
- ChatGPT provides general suggestions for patients and healthcare providers
- Statistical analysis shows that the suggestions are highly relevant
- Suggestions based on radiology reports include “follow up with doctors” and “communicate the findings clearly to patient”
- ChatGPT provides specific suggestions based on findings in the radiology report
Robustness of ChatGPT’s translations
- ChatGPT’s translation of a radiology report is not unique: the same report can yield different translations
- 10 translations of the same chest CT radiology report were collected and evaluated
- 55.2% of all translated points were good translations
- 19.2%, 24.8%, and 0.8% of points were completely omitted, partially translated, and misinterpreted, respectively
- Lung nodule findings were inaccurately translated
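The four rates above come from grading each key point of each translation and tallying the labels. A minimal sketch of that tally follows; the category names and the toy labels are assumptions for illustration and do not reproduce the study's data.

```python
from collections import Counter

# Hypothetical label names for the four outcomes described above.
CATEGORIES = ("good", "omitted", "partial", "misinterpreted")


def category_rates(labels):
    """Return the share of graded points in each category, as percentages."""
    if not labels:
        raise ValueError("no graded points")
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {c: 100 * counts[c] / total for c in CATEGORIES}


# Toy labels for illustration only; these are not the study's gradings.
toy_labels = ["good"] * 6 + ["omitted"] * 2 + ["partial"] * 2
rates = category_rates(toy_labels)
print(rates)
```

Because every point receives exactly one label, the four percentages always sum to 100, as the reported figures do (55.2 + 19.2 + 24.8 + 0.8).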
Optimized prompt for improved translation
- ChatGPT tends to generate different responses given the same input
- A reason for this is the ambiguity of the prompts
- Optimized the prompt to be comprehensive and specific
- The share of good translations increased from 55.2% to 77.2%
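The change from 55.2% to 77.2% can be read either as an absolute gain in percentage points or as a relative gain over the baseline. The short calculation below shows both; the variable names are illustrative only.

```python
baseline = 55.2   # good-translation rate with the original prompt (%)
optimized = 77.2  # good-translation rate with the optimized prompt (%)

absolute_gain = optimized - baseline            # in percentage points
relative_gain = absolute_gain / baseline * 100  # as a percent of the baseline

print(f"absolute gain: {absolute_gain:.1f} percentage points")  # 22.0
print(f"relative gain: {relative_gain:.1f}%")                   # 39.9%
```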
Effect of different prompts on ChatGPT’s performance
- Investigated effect of prompt engineering on ChatGPT’s performance
- Changed first prompt into 5 different formats
- Evaluated ChatGPT’s responses with same method as before
- Results similar to original prompt, worse than optimized prompt
- The fourth prompt, designed by ChatGPT itself, performed slightly better than the other four
ChatGPT’s ensemble learning results
- Investigated ChatGPT’s performance via ensemble learning
- Randomly selected 5 translated reports and input into ChatGPT for information integration
- Results of 10 ensemble learning runs are presented in Table 10
- ChatGPT cannot generate significantly better results through ensemble learning
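The ensemble step above amounts to sampling 5 translations at random and asking the model to merge them. A rough sketch of the sampling and prompt assembly is shown below. The function name, the instruction wording, and the placeholder translations are hypothetical, and the call to the model itself is deliberately left out.

```python
import random


def build_ensemble_prompt(translations, k=5, seed=None):
    """Pick k translations at random and wrap them in one integration prompt."""
    if len(translations) < k:
        raise ValueError("need at least k translations")
    rng = random.Random(seed)
    chosen = rng.sample(translations, k)  # sampling without replacement
    numbered = "\n\n".join(
        f"Translation {i}:\n{text}" for i, text in enumerate(chosen, 1)
    )
    # Hypothetical instruction wording; the study's exact prompt is not given here.
    instruction = (
        "Combine the following translations of one radiology report "
        "into a single plain-language version."
    )
    return instruction + "\n\n" + numbered, chosen


# Placeholder texts standing in for 10 translations of the same report.
translations = [f"(translation text {i})" for i in range(1, 11)]
prompt, chosen = build_ensemble_prompt(translations, k=5, seed=0)
print(prompt)
```

Passing a `seed` makes a given selection reproducible; repeating the call with different seeds would give the kind of repeated ensemble runs described above.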
Comparison with GPT-4
- OpenAI launched GPT-4 with impressive performance on multi-modal tasks
- GPT-4 improved the quality of translated reports, with a higher rate of good translations and lower rates of omitted, partially translated, and misinterpreted points
- GPT-4’s results with the original prompt were competitive with ChatGPT’s results using the optimized prompt
- GPT-4 achieved a good rate of almost 100%
- GPT-4 still has some randomness
Discussions
- ChatGPT can be used for multiple purposes such as writing news, telling stories, and language translation
- ChatGPT has three merits for radiology report translation: conciseness, clarity, and comprehensiveness
- ChatGPT deletes redundant words and summarizes multiple findings in a single sentence
- ChatGPT replaces complicated medical terminologies with commonly-used words
- ChatGPT has a strong ability to understand the original radiology report and integrate information
- ChatGPT’s responses are not deterministic: it can generate a different response each time
- ChatGPT does not have a built-in template for its generated report translation
Conclusion
- Investigated feasibility and utility of ChatGPT in clinical applications
- ChatGPT’s translations received an overall score of 4.098
- ChatGPT’s translations tend to oversimplify or overlook key points
- GPT-4 can significantly improve quality of translated reports