Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ChatGPT is a chatbot service released by OpenAI
- Robustness of ChatGPT is unclear
- Evaluated from adversarial and out-of-distribution perspective
- Results show ChatGPT has consistent advantages on most adversarial and OOD tasks, but its absolute performance is far from perfect
- ChatGPT performs well on translation tasks
- ChatGPT provides informal suggestions for medical tasks
Paper Content
Introduction
- Large language models (LLMs) have achieved significant performance on NLP tasks
- LLMs have in-context learning capability
- ChatGPT is a chatbot service released by OpenAI
- It has attracted over 100 million users
- Evaluating potential risks behind ChatGPT is important
- Robustness refers to the ability to withstand disturbances or external factors
- Robustness threats include OOD samples, adversarial inputs, long-tailed samples, and noisy inputs
- This paper evaluates ChatGPT’s adversarial and OOD robustness
- Zero-shot robustness evaluation is used
- Results show ChatGPT has a consistent advantage on adversarial and OOD classification tasks
- Absolute performance is still far from perfect, indicating room for improvement
Background
- Foundation models are used for natural language processing tasks
- ChatGPT is a generative foundation model in the GPT-3.5 series
- ChatGPT is trained using reinforcement learning from human feedback
- Foundation models are also used for computer vision, music generation, biology, and speech recognition
- Previous evaluations of ChatGPT have shown mixed results
- There are concerns that ChatGPT should be regulated
- Evaluations on ethics have been done
- Robustness evaluation is currently under-explored
Robustness
- Adversarial robustness concerns a model's behavior when a bounded, imperceptible perturbation is added to the original d-dimensional input while the correct output stays the same.
- OOD robustness is a type of generalization which aims to learn an optimal classifier on an unseen distribution by training on existing data.
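One common way to write the two notions above is sketched here. The symbols are assumptions made for illustration and are not necessarily the paper's notation: $f$ is the classifier, $(x, y)$ an input and its label, $\delta$ the perturbation, $\epsilon$ the perturbation budget, $\ell$ a loss, and $P_{\text{train}}$, $P_{\text{test}}$ the training and unseen test distributions.

```latex
% Adversarial robustness: the prediction should survive any small perturbation
f(x + \delta) = y \quad \text{for all } \delta \text{ with } \|\delta\|_p \le \epsilon

% OOD robustness: only samples from P_train are available, yet the goal is
% low risk on an unseen distribution P_test that differs from P_train
f^{*} = \arg\min_{f} \; \mathbb{E}_{(x, y) \sim P_{\text{test}}} \big[ \ell(f(x), y) \big],
\qquad P_{\text{test}} \neq P_{\text{train}}
```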
Datasets and tasks
Adversarial datasets
- Adopt AdvGLUE and ANLI benchmarks to evaluate adversarial robustness
- AdvGLUE is a modified version of the GLUE benchmark with different kinds of adversarial noise
- 5 tasks from AdvGLUE: SST-2, QQP, MNLI, QNLI, and RTE
- Adopt AdvGLUE development set for evaluation
- Construct AdvGLUE-T dataset for adversarial machine translation
- ANLI is a dataset created by Facebook AI Research with 16,000 premise-hypothesis pairs
- ANLI is divided into 3 parts (R1, R2, R3), with R3 being the most difficult and diverse
- Select ANLI R3 test set for evaluating adversarial robustness
Out-of-distribution datasets
- Two new datasets (Flipkart and DDXPlus) for OOD robustness evaluation
- Flipkart is a product review dataset and DDXPlus is a medical diagnosis dataset
- Subsets of each dataset are randomly sampled to form test sets
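The random subsampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_test_set`, the `seed` parameter, and the choice of uniform sampling without replacement are all assumptions.

```python
import random


def sample_test_set(examples, k, seed=0):
    """Draw up to k examples uniformly at random, without replacement.

    A fixed seed (hypothetical default: 0) makes the subset reproducible.
    """
    rng = random.Random(seed)
    k = min(k, len(examples))  # never ask for more items than exist
    return rng.sample(examples, k)
```

Fixing the seed matters here because a different random subset would give a slightly different score, which makes comparisons between models harder to reproduce.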
Experiment
- ChatGPT is compared to 8 existing popular foundation models
- Attack success rate is used as the metric for robustness on AdvGLUE and ANLI
- F1-score is used as the metric for OOD classification tasks
- GPT-3 models outperform the fine-tuned models
- ChatGPT's responses are readable and reasonable to humans, even given adversarial inputs
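The two metrics named above can be sketched in a few lines. The definitions used here are assumptions: attack success rate is taken as the fraction of adversarial examples the model gets wrong, and F1 is computed for a single positive class. The paper's exact averaging scheme may differ.

```python
def attack_success_rate(predictions, labels):
    """Fraction of adversarial examples whose prediction differs from the gold label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


def f1_score(predictions, labels, positive):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Lower attack success rate means a more robust model, while higher F1 means better OOD classification, so the two columns of a results table read in opposite directions.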
Case study
- ChatGPT is challenged by both word-level and sentence-level adversarial inputs.
- Adversarial inputs are common in everyday interactions, so defensive strategies are necessary.
- It is difficult to analyze why ChatGPT performs poorly on OOD inputs.
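To make the difference between word-level and sentence-level adversarial inputs concrete, here is a toy sketch. It is not how AdvGLUE or ANLI examples were built: the character-swap typo, the helper names, and the default distractor phrase are all hypothetical choices for illustration.

```python
def swap_typo(word):
    """Swap the two characters around the middle of a word; leave short words unchanged."""
    if len(word) < 4:
        return word
    mid = len(word) // 2
    chars = list(word)
    chars[mid - 1], chars[mid] = chars[mid], chars[mid - 1]
    return "".join(chars)


def word_level_perturb(sentence, target):
    """Word-level: introduce a typo into every occurrence of one target word."""
    return " ".join(swap_typo(w) if w == target else w for w in sentence.split())


def sentence_level_perturb(sentence, distractor="and true is true"):
    """Sentence-level: append a distracting clause that should not change the label."""
    return f"{sentence} {distractor}"
```

In both cases a human reader would still assign the original label, which is what makes a changed model prediction count as a robustness failure.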
Discussion
Adversarial attack remains a major threat
- Adversarial inputs remain a major threat to safety-critical applications.
- Foundation models might never cover all distributions of possible adversarial inputs.
- Pre-trained models can be trained on human-generated or algorithm-generated adversarial inputs to improve robustness.
- Reducing defects through fine-tuning could be impossible for large models.
- Open question on how to defend against adversarial attack.
Can OOD generalization be solved by large foundation models?
- Large models have potential to achieve superior performance on OOD datasets
- Large models use huge training data and parameter counts, which can lead either to overfitting or to better generalization
- Adding OOD data into the training set may be enough for large models
- It is unknown when and why LLMs will overfit
- Training data of large models could encompass similar distributions to test sets
- ID-OOD performances can be positively or inversely correlated
- Regularization and other techniques should be developed to improve OOD performance of language models
Beyond NLP foundation models
- Adversarial and OOD robustness problems exist in multiple domains, not just natural language.
- Most research comes from machine learning and computer vision communities.
- ViT-22B is a large vision foundation model that shows superior performance on image classification tasks.
Limitation
- Only zero-shot classification is performed
- Difficult to find larger datasets for evaluation
- Most evaluations are on text classification, with only minor evaluations on machine translation
- ChatGPT is mainly designed to be a chatbot service
Conclusion
- This paper presented a preliminary evaluation of the robustness of ChatGPT
- Acknowledged the advance of large foundation models on adversarial and out-of-distribution robustness
- Experiments show that there is still room for improvement to ChatGPT and other large models
- In-depth analysis and discussion beyond NLP area
- Highlighted potential research directions regarding foundation models
- ChatGPT usage and authors
- Rate of an adversarial attack method
- Generalization error and hypothesis set
- Superior performance of large foundation models
- VC-dimension and correlation with datasets
- Introduction to foundation models used in experiments
- OOD generalization and adaptation research
- Interpretation of success of large foundation models
- Questions and sentences entailment
- Translate sentence from English to Chinese
- Classify sentence into positive or negative
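The last three items above describe zero-shot prompts for entailment, translation, and sentiment classification. A minimal sketch of how such prompts might be built and how a free-text reply might be mapped to a label is given below. The template wording, the `TEMPLATES` dictionary, and both function names are assumptions, not the prompts used in the paper.

```python
# Hypothetical prompt templates; the paper's exact wording may differ.
TEMPLATES = {
    "sentiment": 'Classify the following sentence as positive or negative: "{text}"',
    "translation": 'Translate the following sentence from English to Chinese: "{text}"',
    "entailment": 'Does the sentence "{premise}" entail "{hypothesis}"? Answer yes or no.',
}


def build_prompt(task, **fields):
    """Fill the template for a task with the given fields."""
    return TEMPLATES[task].format(**fields)


def parse_sentiment(response):
    """Map a free-text reply to 'positive' or 'negative'; None if ambiguous."""
    text = response.lower()
    has_pos = "positive" in text
    has_neg = "negative" in text
    if has_pos == has_neg:  # neither label, or both labels, appear in the reply
        return None
    return "positive" if has_pos else "negative"
```

Because a chatbot replies in free text, some parsing step like `parse_sentiment` is needed before a metric can be computed, and replies that name neither label (or both) have to be handled explicitly.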