Abstract
- ChatGPT is a large language model with mass adoption
- Evaluating ChatGPT’s performance is challenging due to its closed nature and continuous updates
- Data contamination is an issue when evaluating ChatGPT
- Stance detection is used as a case study to highlight the issue of data contamination
- Fair model evaluation is a challenge in the age of closed and continuously trained models
Paper Content
Introduction
Methods
- Zhang et al. used either the November 30 or December 15 version of ChatGPT to obtain their results
- Used the test sets of SemEval 2016 Task 6 and P-Stance for the experiments
- Used the same prompt for both datasets
- Manually collected responses from the Jan 30 ChatGPT for 860 tweets from SemEval 2016 Task 6
- Collected and included responses for 2157 tweets in the P-Stance test dataset
- Used an open-source API to automate the collection of responses from the Feb 13 ChatGPT Plus for both datasets
- Manually extracted stance labels from responses when explicitly mentioned
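The label extraction described above was done by hand. As a rough illustration only, the sketch below shows how an explicitly stated label could be pulled from a free-text response automatically. The label set (`favor`, `against`, `neutral`) and the function name `extract_stance` are assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Optional

# Assumed label vocabulary (hypothetical); the summary does not list the labels used.
LABELS = ("favor", "against", "neutral")


def extract_stance(response: str) -> Optional[str]:
    """Return the one stance label named in the response.

    Returns None when no label, or more than one distinct label,
    is mentioned, so that ambiguous responses are not guessed at.
    """
    found = {
        label
        for label in LABELS
        if re.search(rf"\b{re.escape(label)}\b", response, re.IGNORECASE)
    }
    return found.pop() if len(found) == 1 else None
```

Responses for which this returns `None` would still need manual review, which matches the restriction to explicitly mentioned labels.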
Evaluation metric and results
- Macro-F and micro-F scores are shown for different versions of ChatGPT
- Performance is improved in recent versions of ChatGPT compared to V1
- Performance is higher on SemEval than on the P-Stance dataset
- Feb 13 ChatGPT Plus shows a performance drop compared to Jan 30 ChatGPT, but still an improvement over V1
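For readers unfamiliar with the two metrics, the following is a minimal sketch of how macro- and micro-averaged F scores are commonly computed. It assumes standard F1 averaging over a caller-supplied list of classes; this summary does not state which classes the paper averages over, so the `labels` argument and the function name `f_scores` are illustrative assumptions.

```python
from typing import Sequence, Tuple


def f_scores(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (macro_f1, micro_f1) computed over the given labels."""
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 for one class; defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    # Macro: unweighted mean of per-class F1.
    macro = sum(per_class) / len(per_class)
    # Micro: F1 computed from counts pooled across classes.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return macro, micro
```

Macro averaging weights every class equally, while micro averaging weights every prediction equally, so the two can diverge when classes are imbalanced.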
Discussion
- The closed nature of the model makes it impossible to verify whether an existing dataset was used in training
- Data may have leaked into the model
- Data leakage likely leads to a boost in apparent performance
- Case study of potential contamination with documented evidence
- Care must be taken to ensure pre-training and fine-tuning data of models are not contaminated
- Making claims about zero-shot or few-shot inference capabilities of models requires careful inspection of training datasets
Data availability
- ChatGPT responses can be made available upon request
- Figure 1 shows updates of ChatGPT since its release
- Figure 2 shows evolution of zero-shot performance
- Network errors sometimes forced the team to open a new chat session
- Team could not try multiple queries or estimate uncertainty of performance
- The goal of the article is to highlight the possibility of data leakage and the impossibility of verifying its absence in a closed model
- Model creators should pay closer attention to training datasets, create mechanisms to scrutinize data leakage, and build systems to prevent data contamination