Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

ChatGPT is a large language model with mass adoption
Evaluating ChatGPT’s performance is challenging due to its closed nature and continuous updates
Data contamination is an issue when evaluating ChatGPT
Stance detection is used as a case study to highlight the issue of data contamination
Fair model evaluation is a challenge in the age of closed and continuously trained models

Zhang et al. used either the November 30 or December 15 version of ChatGPT to obtain their results
Used the test sets of SemEval 2016 Task 6 and P-Stance to perform experiments
Used the same prompt for both datasets
The authors manually collected responses from the Jan 30 version of ChatGPT for 860 tweets from SemEval 2016 Task 6
Collected and included responses for 2157 tweets in the P-Stance test dataset
Used an open-source API to automate the collection of responses from the Feb 13 version of ChatGPT Plus for both datasets
Manually extracted stance labels from responses when explicitly mentioned
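
The label-extraction step above was done by hand in the paper. As a rough illustration only, the sketch below shows how such extraction could be automated when a response names exactly one stance. The label set (favor, against, neutral) and the function name extract_stance are assumptions for this example and are not taken from the paper.

```python
import re

# Hypothetical label set; the summary does not list the exact labels used.
STANCE_PATTERN = re.compile(r"\b(favor|against|neutral)\b", re.IGNORECASE)


def extract_stance(response):
    """Return the stance label if exactly one distinct label is mentioned.

    Returns None when no label or several different labels appear, so that
    ambiguous responses can be left for manual review.
    """
    found = {match.lower() for match in STANCE_PATTERN.findall(response)}
    return found.pop() if len(found) == 1 else None


if __name__ == "__main__":
    print(extract_stance("The stance of this tweet is against."))
    print(extract_stance("It could be read as favor or against."))
```

Returning None for ambiguous responses keeps the automated step conservative: anything the pattern cannot resolve unambiguously would still need a human reader.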

Macro-F and micro-F scores are shown for different versions of ChatGPT
Performance is improved in recent versions of ChatGPT compared to V1 (the version used by Zhang et al.)
Performance is higher on SemEval than on the P-Stance dataset
Feb 13 ChatGPT Plus performs worse than Jan 30 ChatGPT, but still better than V1
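
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute macro- and micro-averaged F scores from gold and predicted labels. It averages over every label that appears. Stance-detection benchmarks sometimes average over a subset of classes only, so this is an assumption and may differ from the exact protocol in the paper. The function name and the toy labels are illustrative.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts across classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    gold = ["favor", "against", "favor", "neutral"]
    pred = ["favor", "against", "against", "neutral"]
    print(f1_scores(gold, pred))
```

Macro averaging treats rare and frequent classes alike, while micro averaging is dominated by the frequent ones, which is why reporting both is informative on imbalanced stance datasets.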

Closed nature of the model makes it impossible to verify whether existing datasets were used in training
Possibility of data leakage to model
Data leakage likely leads to boost in apparent performance
Case study of potential contamination with documented evidence
Care must be taken to ensure pre-training and fine-tuning data of models are not contaminated
Making claims about zero-shot or few-shot inference capabilities of models requires careful inspection of training datasets

ChatGPT responses can be made available upon request
Figure 1 shows updates of ChatGPT since its release
Figure 2 shows evolution of zero-shot performance
Network errors sometimes forced the team to open a new chat session
Team could not try multiple queries or estimate uncertainty of performance
Goal of article is to highlight possibility of data leakage and impossibility of verifying lack of data leakage with closed model
Model creators should pay closer attention to training datasets, create mechanisms to scrutinize data leakage, and build systems to prevent data contamination