Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • ChatGPT is a chatbot service released by OpenAI
  • Robustness of ChatGPT is unclear
  • Evaluated from adversarial and out-of-distribution perspective
  • Results show ChatGPT does not have consistent advantages
  • ChatGPT performs well on translation tasks
  • ChatGPT provides informal suggestions for medical tasks

Paper Content

Introduction

  • Large language models (LLMs) have achieved significant performance on NLP tasks
  • LLMs have in-context learning capability
  • ChatGPT is a chatbot service released by OpenAI
  • It has attracted over 100 million users
  • Evaluating potential risks behind ChatGPT is important
  • Robustness refers to the ability to withstand disturbances or external factors
  • Robustness threats include OOD samples, adversarial inputs, long-tailed samples, and noisy inputs
  • This paper evaluates ChatGPT’s adversarial and OOD robustness
  • Zero-shot robustness evaluation is used
  • Results show ChatGPT has consistent advantage on adversarial and OOD classification tasks
  • Performance is far from perfection, indicating room for improvement

Background

  • Foundation models are used for natural language processing tasks
  • ChatGPT is a generative foundation model in the GPT-3.5 series
  • ChatGPT is trained using reinforcement learning from human feedback
  • Foundation models are also used for computer vision, music generation, biology, and speech recognition
  • Previous evaluations of ChatGPT have shown mixed results
  • There are concerns that ChatGPT should be regulated
  • Evaluations on ethics have been done
  • Robustness evaluation is currently under-explored

Robustness

  • Adversarial robustness is a type of classification task where a d-dimensional input and output are given and a -bounded, imperceptible perturbation is added to the original input.
  • OOD robustness is a type of generalization which aims to learn an optimal classifier on an unseen distribution by training on existing data.

Datasets and tasks

Adversarial datasets

  • Adopt AdvGLUE and ANLI benchmarks to evaluate adversarial robustness
  • AdvGLUE modified version of GLUE benchmark with different kinds of adversarial noise
  • 5 tasks from AdvGLUE: SST-2, QQP, MNLI, QNLI, and RTE
  • Adopt AdvGLUE development set for evaluation
  • Construct AdvGLUE-T dataset for adversarial machine translation
  • ANLI dataset created by Facebook AI Research with 16,000 premise-hypothesis pairs
  • ANLI divided into 3 parts (R1, R2, R3) with R3 being the most difficult and diverse
  • Select ANLI R3 test set for evaluating adversarial robustness

Out-of-distribution datasets

  • Two new datasets (Flipkart and DDXPlus) for OOD robustness evaluation
  • Flipkart is a product review dataset and DDXPlus is a medical diagnosis dataset
  • Subsets of each dataset are randomly sampled to form test sets

Experiment

  • ChatGPT is compared to 8 existing popular foundation models
  • Attack success rate is used as the metric for robustness on AdvGLUE and ANLI
  • F1-score is used as the metric for OOD classification tasks
  • GPT-3 models outperform the fine-tuned models
  • ChatGPT is readable and reasonable to humans, even given adversarial inputs

Case study

  • ChatGPT is challenged by both word-level and sentence-level adversarial inputs.
  • Adversarial inputs are common in everyday interactions, so defensive strategies are necessary.
  • It is difficult to analyze why ChatGPT performs poorly on OOD inputs.

Discussion

Adversarial attack remains a major threat

  • Adversarial inputs remain a major threat to safety-critical applications.
  • Foundation models might never cover all distributions of possible adversarial inputs.
  • Pre-trained models can be trained on human-generated or algorithm-generated adversarial inputs to improve robustness.
  • Reducing defects through fine-tuning could be impossible for large models.
  • Open question on how to defend against adversarial attack.

Can ood generalization be solved by large foundation models?

  • Large models have potential to achieve superior performance on OOD datasets
  • Large models use huge training data and parameters, which can lead to overfitting or generalization
  • Adding OOD data into training set is enough for large models
  • It is unknown when and why LLMs will overfit
  • Training data of large models could encompass similar distributions to test sets
  • ID-OOD performances can be positively or inversely correlated
  • Regularization and other techniques should be developed to improve OOD performance of language models

Beyond nlp foundation models

  • Adversarial and OOD robustness exist in multiple domains, not just natural language.
  • Most research comes from machine learning and computer vision communities.
  • ViT-22B is a large vision foundation model that shows superior performance on image classification tasks.

Limitation

  • Only zero-shot classification is performed
  • Difficult to find larger datasets for evaluation
  • Most evaluations on text classification, minor evaluations on machine translation
  • ChatGPT mainly designed to be a chatbot service

Conclusion

  • This paper presented a preliminary evaluation of the robustness of ChatGPT
  • Acknowledged the advance of large foundation models on adversarial and out-of-distribution robustness
  • Experiments show that there is still room for improvement to ChatGPT and other large models
  • In-depth analysis and discussion beyond NLP area
  • Highlighted potential research directions regarding foundation models
  • ChatGPT usage and authors
  • Rate of an adversarial attack method
  • Generalization error and hypothesis set
  • Superior performance of large foundation models
  • VC-dimension and correlation with datasets
  • Introduction to foundation models used in experiments
  • OOD generalization and adaptation research
  • Interpretation of success of large foundation models
  • Questions and sentences entailment
  • Translate sentence from English to Chinese
  • Classify sentence into positive or negative