Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • ChatGPT is a large language model with mass adoption
  • Evaluating ChatGPT’s performance is challenging due to its closed nature and continuous updates
  • Data contamination is an issue when evaluating ChatGPT
  • Stance detection is used as a case study to highlight the issue of data contamination
  • Fair model evaluation is a challenge in the age of closed and continuously trained models

Paper Content



  • Zhang et al. used either the November 30 or December 15 version of ChatGPT to obtain their results
  • Used the test sets of SemEval 2016 Task 6 and P-Stance to perform experiments
  • Used the same prompt for both datasets
  • Manually collected responses from the Jan 30 version of ChatGPT for 860 tweets from SemEval 2016 Task 6
  • Collected and included responses for 2157 tweets in the P-Stance test dataset
  • Used an open-source API to automate the collection of responses from the Feb 13 version of ChatGPT Plus for both datasets
  • Manually extracted stance labels from responses when explicitly mentioned
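
The label-extraction step above was done by hand in the paper. As an illustration only, a first automated pass could look like the sketch below. The label vocabulary and the `extract_stance` helper are hypothetical and not taken from the paper. Responses that mention no label, or more than one, return `None` and are left for manual review.

```python
import re
from typing import Optional

# Assumed label vocabulary (hypothetical); adapt it to the wording of the actual prompt.
LABEL_PATTERNS = {
    "FAVOR": re.compile(r"\b(?:in[- ]favor|favor)\b", re.IGNORECASE),
    "AGAINST": re.compile(r"\bagainst\b", re.IGNORECASE),
    "NONE": re.compile(r"\b(?:neutral|none)\b", re.IGNORECASE),
}


def extract_stance(response: str) -> Optional[str]:
    """Return the stance label if exactly one label is mentioned, else None."""
    found = [
        label
        for label, pattern in LABEL_PATTERNS.items()
        if pattern.search(response)
    ]
    # Zero or several labels mentioned: ambiguous, so defer to a human annotator.
    return found[0] if len(found) == 1 else None


if __name__ == "__main__":
    print(extract_stance("The stance of the tweet is against the target."))
    print(extract_stance("It could be read as in favor or against."))
```

Returning `None` for ambiguous responses keeps the automated pass conservative, so that only explicitly stated labels are accepted without a human check.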

Evaluation metric and results

  • Macro-F and micro-F scores are shown for different versions of ChatGPT
  • Performance improved in recent versions of ChatGPT compared to V1 (the Nov 30 or Dec 15 version used by Zhang et al.)
  • Performance is higher on SemEval than on the P-Stance dataset
  • Feb 13 ChatGPT Plus shows a performance drop compared to Jan 30 ChatGPT, but still an improvement over V1
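
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how macro-F and micro-F can be computed from gold and predicted labels. The function names and the toy labels are illustrative assumptions. The paper may average over a different subset of classes, so this is not a reproduction of its evaluation script.

```python
from collections import Counter
from typing import Sequence, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (macro-F, micro-F) over the given label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    # Macro-F: unweighted mean of the per-class F1 scores.
    macro = sum(f1_from_counts(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro-F: F1 computed on counts pooled across classes.
    micro = f1_from_counts(
        sum(tp[l] for l in labels),
        sum(fp[l] for l in labels),
        sum(fn[l] for l in labels),
    )
    return macro, micro


if __name__ == "__main__":
    gold = ["FAVOR", "FAVOR", "AGAINST", "NONE"]
    pred = ["FAVOR", "AGAINST", "AGAINST", "NONE"]
    print(macro_micro_f1(gold, pred, ["FAVOR", "AGAINST", "NONE"]))
```

Macro-F weights every class equally regardless of size, while micro-F pools the counts and is therefore dominated by the frequent classes. When every class is included in a single-label task, micro-F equals accuracy.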


  • Closed nature of the model makes it impossible to verify whether an existing dataset was used in training
  • Possibility of data leakage to model
  • Data leakage likely leads to boost in apparent performance
  • Case study of potential contamination with documented evidence
  • Care must be taken to ensure pre-training and fine-tuning data of models are not contaminated
  • Making claims about zero-shot or few-shot inference capabilities of models requires careful inspection of training datasets
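
The inspection called for above is only possible when the training corpus is accessible, which is not the case for a closed model. Under that assumption, a simple word n-gram overlap check is one common way to flag test examples that may have leaked. The sketch below is illustrative: the function names, the whitespace tokenization, and the choice of `n` are assumptions, not something described in the paper.

```python
from typing import Iterable, Sequence, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased, whitespace-tokenized word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(
    test_examples: Sequence[str],
    training_documents: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of test examples sharing at least one n-gram with the training text."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in training_documents:
        train_ngrams |= word_ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if word_ngrams(ex, n) & train_ngrams)
    return flagged / len(test_examples) if test_examples else 0.0


if __name__ == "__main__":
    train = ["a b c d e f"]
    test = ["x a b c y", "p q r s"]
    print(contaminated_fraction(test, train, n=3))
```

A check like this catches only verbatim overlap. Paraphrased or reformatted test items would pass unnoticed, so a low score does not prove the absence of leakage.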

Data availability

  • ChatGPT responses can be made available upon request
  • Figure 1 shows updates of ChatGPT since its release
  • Figure 2 shows evolution of zero-shot performance
  • Network errors sometimes forced the team to open a new chat session
  • Team could not try multiple queries or estimate uncertainty of performance
  • Goal of article is to highlight possibility of data leakage and impossibility of verifying lack of data leakage with closed model
  • Model creators should pay closer attention to training datasets, create mechanisms to scrutinize data leakage, and build systems to prevent data contamination