Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Pretrained Foundation Models (PFMs) are used for various downstream tasks with different data modalities
- Pretraining is used to provide reasonable parameter initialization for a wide range of applications
- GPT and BERT use Transformers to train on large datasets
- AI has made waves in a variety of fields over the past few years
- This study provides a comprehensive review of recent research advancements, current and future challenges, and opportunities for PFMs
Paper Content
Introduction
- PFMs are essential components of AI in the era of big data
- PFMs are studied in the three major AI fields: natural language processing (NLP), computer vision (CV), and graph learning (GL)
- PFMs are powerful general models that are effective in various fields or across fields
- PFMs have demonstrated great potential in learning feature representations in various learning tasks
- PFMs perform well when trained on multiple tasks with a large-scale corpus and then fine-tuned on similar small-scale tasks
PFMs and pretraining
- PFMs are built on pretraining, a technique that uses large amounts of data and tasks
- Pretraining originates from transfer learning in CV tasks
- When applied to NLP, LMs capture rich knowledge beneficial for downstream tasks
- Pretraining data can be derived from any unlabeled text corpus
- Early pretraining was static, but dynamic pretraining techniques have been proposed
- PFMs are used for text, image, and graph tasks
- PFMs have two major advantages: they need only minor fine-tuning for downstream tasks, and they have already been vetted for quality
- Related work focuses on model efficiency, security, and compression
Contribution and organization
- Several survey studies have reviewed pretrained models for specific areas
- Bommasani et al. summarize the opportunities and risks of foundation models
- Comprehensive review of PFMs in different areas (e.g., CV, NLP, GL, Speech, Video)
- Summarize existing models from traditional to PFMs
- Discuss advanced topics about PFMs (unified PFMs, model efficiency and compression, security and privacy)
Basic components
- PFMs are large neural network models used for neural information processing
- Transformer is a popular model architecture for PFMs in NLP and CV
- Training large models requires various datasets for pretraining
- After training, the model should be fine-tuned for efficacy, efficiency, and privacy
Transformer for PFMs
- Transformer is a novel architecture that uses attention mechanisms without recurrence or convolution
- Attention mechanism assigns weights to encoded input representations
- Self-attention connects different positions in a single sequence
- Transformer helps solve long-range dependency issues in NLP
- ViT and GTN use transformer structures for CV and GL
- Transformers are scalable enough to drive groundbreaking capabilities for PFMs
- Largest language models have 100B+ parameters
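The attention bullets above can be made concrete with a minimal sketch of scaled dot-product self-attention. It uses only the standard library and, for brevity, skips the learned query, key, and value projections that a real Transformer layer applies; the function names are illustrative and not taken from the paper.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a single sequence.

    Each argument is a list of float vectors, one per position.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this position with every position in the same sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence: three positions with 2-dimensional representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)  # identity projections (assumption)
```

Every position attends to every other position in one step, which is why distance within the sequence does not limit which dependencies can be captured.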
Learning mechanisms for PFMs
- Deep learning models outperform traditional learning models in CV, NLP, and GL
- Supervised learning uses a training dataset with labels to learn a function
- Semi-supervised learning uses a training dataset with labels and an unlabelled dataset to learn a function
- Weakly-supervised learning uses a training dataset with inaccurate labels to learn a function
- Self-supervised learning uses the information in the data itself to learn feature representations
- Reinforcement learning models the learning process as an interaction between an agent and an environment
- The agent seeks to learn an optimal policy for sequential decision-making problems
- The agent aims to maximize the expectation of long-term return from each state
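As an illustration of the "long-term return" in the reinforcement learning bullets, the sketch below computes the discounted return from each step of one episode. The discount factor `gamma` and the reward values are assumptions made for the example; the summary does not specify them.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: the only reward arrives at the final step.
episode_rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
```

A policy is preferred when it yields a higher expected value of these returns from each state.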
Pretraining tasks for PFMs
- Pretraining is an initialization framework used with fine-tuning
- Parameters of the model are trained on pre-set tasks to capture specific attributes, structure, and community information
- Pretraining tasks can be divided into five categories
- Masked Language Modeling (MLM) erases some words randomly in the input sequence and then predicts these erased words
- Denoising AutoEncoder (DAE) adds noise to the original corpus and reconstructs the original input
- Replaced Token Detection (RTD) is a discriminative task that determines whether the LM has replaced the current token
- Next Sentence Prediction (NSP) checks whether the second of two input sentences actually follows the first in the source text
- Sentence Order Prediction (SOP) uses two contiguous fragments from a document as positive samples and the same fragments with their order swapped as negative samples
- Pretext tasks are created for the encoder networks to perform during the pretraining phase
- Frame Order Learning Task involves frame processing through time steps
- Data Reconstruction Task pretrains models by predicting the missing center part
- Miscellaneous tasks include Graph Information Completion, Graph Property Prediction, and integration mechanisms
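Two of the language pretraining tasks listed above, MLM and SOP, can be sketched in a few lines. This is a simplified illustration: the `[MASK]` symbol and the 15% default masking probability follow common BERT practice and are assumptions here, and the helper names are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Erase tokens at random; return the masked sequence and the targets to predict."""
    if rng is None:
        rng = random.Random()
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token  # the model must recover this word
        else:
            masked.append(token)
    return masked, targets

def sop_examples(first_fragment, second_fragment):
    """Build one positive and one negative SOP example from contiguous fragments."""
    positive = ((first_fragment, second_fragment), 1)  # original order
    negative = ((second_fragment, first_fragment), 0)  # swapped order
    return positive, negative

tokens = "pretraining gives the model a reasonable parameter initialization".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
positive, negative = sop_examples("He opened the door.", "Then he walked in.")
```

In both cases the labels come from the text itself, so any unlabeled corpus can serve as training data.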
PFMs for natural language processing
- NLP is a research field that combines linguistics and computer science
- PFM is a popular technology in NLP
- PFM models syntactic and semantic representations of words
- Table 1 lists numerous PFMs
- Autoregressive language model, contextual LM, and permuted LM are word representation learning models
- Neural network architectures for PFM design and masking design are presented
- Boosting methods, multi-task learning, and different downstream tasks are summarized
- Instruction-aligning methods, such as RLHF and Chain-of-Thoughts, are applied in PFMs
Word representation methods
- Large-scale pretrained models have achieved better performance than humans in certain tasks
- Pretraining LMs are divided into three branches: autoregressive LM, contextual LM, and permuted LM
- Autoregressive LM predicts the next possible word based on the preceding words
- GPT-2 uses an autoregressive LM and multi-task learning
- GPT-3 has 175 billion parameters and trains with 45 Terabytes of data
- ELMo uses bi-directional Long Short-Term Memory (LSTM)
- BERT uses a stacked multi-layer bi-directional Transformer
- RoBERTa uses a larger batch size and more unlabeled data
- Permuted LM combines the advantages of the autoregressive LM and the autoencoder LM
- XLNet and MPNet are common permuted LM models
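The difference between the autoregressive and permuted branches can be written out as objectives. The notation below is a sketch rather than the paper's: $x_1, \dots, x_T$ is the token sequence, $\theta$ the model parameters, and $\mathcal{Z}_T$ the set of all orderings of the positions $1, \dots, T$.

```latex
% Autoregressive LM: left-to-right factorization of the sequence probability
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1})

% Permuted LM: the same factorization, averaged over orderings z of the positions
\max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_{z_t} \mid x_{z_1}, \dots, x_{z_{t-1}}\right) \right]
```

Because a token can appear late in some orderings, it is eventually predicted from context on both sides, while each individual prediction stays autoregressive.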
Model architecture designing methods
- ELMo uses a multi-layer RNN structure with bi-directional LSTM
- ELMo introduces contextual information and alleviates the polysemy problem, but its feature extraction is weak
- PFMs have two main directions: fine-tuning (e.g. BERT) and zero/few-shot prompts (e.g. GPT)
- BART is a denoising autoencoder built on a seq2seq model with a bi-directional encoder: it corrupts the text with noise and learns to reconstruct the original
Masking designing methods
- Attention mechanism aggregates words and sentences into text vectors
- BERT's masking scheme limits its ability to learn NLG tasks
- SpanBERT uses dynamic masking and single-segment pretraining
- SpanBERT predicts each masked span from the tokens at its boundary and drops the NSP pretraining task
- MASS randomly masks a contiguous fragment of the encoder's input sequence
- UniLM applies different masks to the two sentences in the input
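One way to picture the UniLM-style masking mentioned above is as a set of attention-mask matrices over a pair of segments. The sketch below is an illustration under stated assumptions, not the authors' code: entry `[i][j]` is 1 when position `i` may attend to position `j`, and the function names are hypothetical.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def left_to_right_mask(n):
    """Each position may attend only to itself and earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def seq2seq_mask(src_len, tgt_len):
    """First segment is fully visible; second segment is generated left to right."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1  # every position sees the whole first segment
            elif i >= src_len and j <= i:
                mask[i][j] = 1  # second-segment positions see themselves and earlier ones
    return mask

pair_mask = seq2seq_mask(src_len=2, tgt_len=2)
```

Switching the mask while keeping the same Transformer weights is what lets a single model cover bidirectional, unidirectional, and encoder-decoder style objectives.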
Boosting methods
- Pretraining models need lots of data and hardware, making it challenging to retrain
- ERNIE Tiny and ALBERT reduce the number of parameters of the model without performance loss
- ERNIE 2.0 uses multi-task learning to pretrain on lexical, syntactic, and semantic tasks
- UniLM uses three pretraining tasks: unidirectional LM, bidirectional LM, and encoder-decoder LM
- BERT-WWM and ZEN use N-gram to enhance performance
- ChatGPT and Bard use supervised learning and external knowledge sources to improve model performance
Instruction-aligning methods
- Instruction-aligning methods aim to let LMs follow human intents and generate meaningful outputs.
- Supervised Fine-Tuning (SFT) is a technique to unlock knowledge and apply it to specific tasks.
- Reinforcement Learning (RL) is used to enhance various models in NLP tasks and align LMs with human preferences.
- Chain-of-Thoughts (CoT) is a series of intermediate reasoning steps that can improve the ability of large LMs to perform complex reasoning.
- Fine-tuning with CoT can make LMs slightly more harmless.
- Claude uses RL from AI Feedback (RLAIF) to control outputs to be less harmful.
Summary
- Traditional feature learning methods often lead to information loss
- People began to focus on the distribution of data and attributes in the graph data as self-supervised signals to pretrain the graph model
- Contrastive learning strategies have been applied to the pretraining of graph models
- Joint pretraining of speech and text has been researched to apply methods from the NLP community to speech
PFMs for computer vision
- Popularity of PFM used in NLP motivates researchers to explore PFM in CV
- Pretraining adjusts a model's parameters on a general dataset so that training on other tasks is faster
- Self-supervised learning (SSL) learns representations that generalize to various tasks from labels produced by human-designed pretext tasks rather than by manual annotation
- SSL reduces reliance on data annotations
- The general SSL pipeline consists of a pretext task for pretraining followed by a supervised fine-tuning stage on the downstream task
- Pretraining PFMs in CV can be done with specific pretext tasks, frame order, generation, reconstruction, memory bank, sharing, and clustering
Learning by specific pretext task
- Early stage of unsupervised learning involves designing a pretext task and predicting the answer
- Dosovitskiy et al. pretrain Exemplar CNN to discriminate patches from unlabelled data
- Context prediction uses handcrafted supervised signal as label for pair classification
- Inpainting pretrains models by predicting the missing center part
- Colorization evaluates how colorization as a pretext task can help learn semantic representation
- Split-Brain Autoencoder learns representations in self-supervised way by forcing network to solve cross-channel prediction tasks
- Jigsaw pretrains Context-Free Network in self-supervised manner by designing Jigsaw puzzle as pretext task
- CDJP learns image representation by complicating pretext tasks
- Noroozi et al. use counting visual primitives as pretext task
- NAT learns representation by aligning output of backbone CNN to low-dimensional noise
- RotNet predicts different rotations of images
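The pretext-task idea running through this list can be illustrated with rotation prediction: the label is generated from the image itself, so no annotation is needed. The sketch below assumes the four rotations 0, 90, 180, and 270 degrees and represents an image as a small list of rows; both choices are for illustration only.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_examples(image):
    """Return (rotated_image, label) pairs, where label k means k * 90 degrees."""
    examples = []
    current = [list(row) for row in image]
    for label in range(4):
        examples.append((current, label))
        current = rotate90(current)
    return examples

image = [[1, 2],
         [3, 4]]
examples = rotation_examples(image)
```

A network trained to predict the label from the rotated image has to recognize object orientation, which pushes it toward semantically useful features.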
Learning by frame order
- Learning sequence data involves frame processing over time.
- Contrastive Predictive Coding (CPC) is a model that predicts future frames in latent space.
- CPC components include input sequence $x_t$, latent sequence $z_t$, and context latent representation $c_t$.
- CPC models a “density ratio” $f_k$ to represent the mutual information between $c_t$ and $x_{t+k}$.
- CPC v2 improves CPC by pretraining on unsupervised representations.
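For reference, the density ratio and the resulting contrastive loss are usually written as follows in the CPC literature. The weight matrix $W_k$ and the sample set $X$ (one positive sample plus negatives) are notation brought in from that literature, not from this summary.

```latex
% The density ratio is proportional to how much the context changes the likelihood of x_{t+k}
f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}

% A common parameterization: a log-bilinear score between the future latent and the context
f_k(x_{t+k}, c_t) = \exp\!\left( z_{t+k}^{\top} W_k \, c_t \right)

% InfoNCE loss: pick out the true future sample among the candidates in X
\mathcal{L}_N = - \mathbb{E}_{X} \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right]
```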
Learning by generation
- GAN-based approach has not been fully exploited
- BiGANs proposed to project data back into latent space
- BigBiGAN achieves SOTA in unsupervised representation learning on ImageNet
- GANs components used to produce data-latent pairs
- Final loss is sum of data-specific and data-joint terms
- Discriminator learns to discriminate between sample pairs from raw data, latent distribution and encoded vector
Learning by reconstruction
- iGPT and ViT models have adapted the masked prediction task from language to image data
- BEiT is the first to outperform a conventional SOTA method without pretraining
- BEiT consists of two stages: token embedding and tokenizer training
- MAE proposes an end-to-end solution with a higher masking ratio than BERT
- SimMIM also uses a higher masking ratio and random masking strategy
- hViT and local attention have been proposed to improve efficiency
- UM-MAE and LoMaR sample small windows to focus attention on local regions
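The random masking used by MAE-style methods can be sketched as below. The 0.75 default reflects the ratio commonly reported for MAE, compared with roughly 0.15 for BERT; the function name and the patch count are assumptions for the example.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Split patch indices into a visible set (encoder input) and a masked set (targets)."""
    if rng is None:
        rng = random.Random()
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

# Hypothetical image cut into a 4 x 4 grid of patches.
visible, masked = random_patch_mask(num_patches=16, mask_ratio=0.75, rng=random.Random(0))
```

Only the visible patches need to pass through the encoder, which is one reason a high masking ratio also reduces pretraining cost.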
Learning by memory bank
- NPID is the first method to use instances to learn representations for downstream tasks
- NPID uses a memory bank to store feature representations for instance-level classification
- LA maximizes a metric of local aggregation to move similar data instances together in the embedding space
- PIRL argues that semantic representations are invariant under pretext transformation tasks
- PIRL uses a memory bank to store representations for comparison
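A memory bank of the kind described above can be sketched as a table of one normalized feature per training instance. The momentum update and the temperature value below are assumptions chosen for illustration, and the class is hypothetical rather than an implementation from NPID or PIRL.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

class MemoryBank:
    """Stores one feature vector per instance index."""

    def __init__(self, momentum=0.5):
        self.momentum = momentum
        self.features = {}

    def update(self, index, feature):
        feature = l2_normalize(feature)
        if index in self.features:
            old = self.features[index]
            # Blend the stored feature with the new one to smooth updates.
            mixed = [self.momentum * o + (1 - self.momentum) * f
                     for o, f in zip(old, feature)]
            feature = l2_normalize(mixed)
        self.features[index] = feature

    def instance_probs(self, query, temperature=0.07):
        """Softmax over stored instances: how likely is `query` to be each one?"""
        query = l2_normalize(query)
        indices = list(self.features)
        logits = [sum(q * f for q, f in zip(query, self.features[i])) / temperature
                  for i in indices]
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        return {i: e / total for i, e in zip(indices, exps)}

bank = MemoryBank()
bank.update(0, [1.0, 0.0])
bank.update(1, [0.0, 1.0])
probs = bank.instance_probs([0.9, 0.1])
```

Storing features this way means comparisons against many instances do not require recomputing their representations at every step.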
Learning by sharing
- SSL methods often use two encoder networks, one for each data augmentation of the input
- Two different or same encoders are used to extract contrastive representations
- SSL can be divided into two categories: Soft Sharing and Hard Sharing
- Momentum Contrast (MoCo) uses momentum to control the slight difference between two encoders
- InfoNCE Loss is used for pretraining in MoCo
- Bootstrap Your Own Latent (BYOL) does not use negative samples
- SimCLR uses hard parameter-sharing architecture
- Swapping Assignments between multiple Views of the same image (SwAV) uses clustering instead of comparison between pairs
- SElf-supERvised (SEER) uses RegNetY architectures
- Simple Siamese (SimSiam) networks avoid the design of negative sample pairs, large batches, and momentum encoders
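The two MoCo ingredients named above, the momentum-controlled key encoder and the InfoNCE loss, can be sketched on plain lists of floats. The defaults `m=0.999` and `temperature=0.07` are values commonly cited for MoCo and are assumptions here; encoder parameters are flattened to a single list purely for illustration.

```python
import math

def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy of picking the positive key among all candidate keys."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]

new_key_params = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Methods such as BYOL and SimSiam keep the two-branch layout but drop the negative keys, so the second function would not apply to them.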
Learning by clustering
- DeepCluster is the first model to use clustering for large-scale dataset learning.
- SwAV and PCL bridge contrastive learning with clustering.
- Clustering helps encode more semantic information.
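Clustering-based pretraining in the style of DeepCluster alternates between grouping features and training on the group assignments as pseudo-labels. The sketch below shows only the grouping step with a small k-means; the fixed initial centroids and toy features are assumptions for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pseudo_labels(features, centroids, iterations=10):
    """Assign each feature to its nearest centroid and return the pseudo-labels."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every feature.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [f for f, label in zip(features, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

features = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
pseudo_labels, centroids = kmeans_pseudo_labels(features, centroids=[[0.0, 0.0], [10.0, 10.0]])
```

The pseudo-labels would then supervise an ordinary classification loss, after which the features are re-extracted and clustered again.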