Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Pretrained Foundation Models (PFMs) are used for various downstream tasks with different data modalities
Pretraining is used to provide reasonable parameter initialization for a wide range of applications
GPT and BERT use Transformers to train on large datasets
AI has made waves in a variety of fields over the past few years
This study provides a comprehensive review of recent research advancements, current and future challenges, and opportunities for PFMs

Paper Content

Introduction

PFMs are essential components of AI in the era of big data
PFMs are studied in the three major AI fields: natural language processing (NLP), computer vision (CV), and graph learning (GL)
PFMs are powerful general models that are effective in various fields or across fields
PFMs have demonstrated great potential in learning feature representations in various learning tasks
PFMs perform well when trained on multiple tasks with a large-scale corpus and then fine-tuned on similar small-scale tasks

PFMs and pretraining

PFMs are based on pretraining, a technique that uses large amounts of data and tasks
Pretraining originates from transfer learning in CV tasks
When applied to NLP, pretrained language models (LMs) capture rich knowledge beneficial for downstream tasks
Pretraining data can be derived from any unlabeled text corpus
Early pretraining was static, but dynamic pretraining techniques have been proposed
PFMs are used for text, image, and graph tasks
PFMs have two major advantages: they need only minor fine-tuning, and they have already been vetted for quality
Related work focuses on model efficiency, security, and compression

Contribution and organization

Several survey studies have reviewed pretrained models for specific areas
Bommasani et al. summarize the opportunities and risks of foundation models
Comprehensive review of PFMs in different areas (e.g., CV, NLP, GL, Speech, Video)
Summarize existing models from traditional to PFMs
Discuss advanced topics about PFMs (unified PFMs, model efficiency and compression, security and privacy)

Basic components

PFMs are large neural network models used for neural information processing
Transformer is a popular model architecture for PFMs in NLP and CV
Training large models requires various datasets for pretraining
After training, the model should be fine-tuned for efficacy, efficiency, and privacy

Transformer for PFMs

Transformer is a novel architecture that uses attention mechanisms without recurrence or convolution
Attention mechanism assigns weights to encoded input representations
Self-attention connects different positions in a single sequence
Transformer helps solve long-range dependency issues in NLP
ViT and GTN use transformer structures for CV and GL
Transformers are scalable enough to drive groundbreaking capabilities for PFMs
Largest language models have 100B+ parameters
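The attention notes above can be made concrete with a minimal sketch of scaled dot-product self-attention. It is written in plain Python for readability, and it assumes the queries, keys, and values are the raw token vectors; a real Transformer first applies learned projections, which are omitted here. The function names are illustrative, not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    queries, keys, values: lists of equal-length float vectors, one per position.
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, values))
                        for d in range(len(values[0]))])
    return outputs, weights

# Toy sequence of three 2-d token vectors, used as Q, K and V at once
# (assumption: no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out, attn_weights = self_attention(tokens, tokens, tokens)
```

Every position attends to every other position in a single step, which is why long-range dependencies do not have to pass through a recurrent chain.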

Learning mechanisms for PFMs

Deep learning models outperform traditional learning models in CV, NLP, and GL
Supervised learning uses a training dataset with labels to learn a function
Semi-supervised learning uses a training dataset with labels and an unlabelled dataset to learn a function
Weakly-supervised learning uses a training dataset with inaccurate labels to learn a function
Self-supervised learning uses the information in the data itself to learn feature representations
Reinforcement learning models the learning process as an interaction between an agent and an environment
The agent seeks to learn an optimal policy for sequential decision-making problems
The agent aims to maximize the expectation of long-term return from each state
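As a small illustration of "long-term return", the sketch below computes the discounted return from each step of one episode. The discount factor `gamma` and the toy reward sequence are assumptions for the example; the notes above do not specify them.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: the only reward arrives at the last step.
episode_rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
```

A policy is then judged by the expectation of these returns over many episodes, not by the immediate reward alone.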

Pretraining tasks for PFMs

Pretraining is an initialization framework used with fine-tuning
Parameters of the model are trained on pre-set tasks to capture specific attributes, structure, and community information
Pretraining tasks can be divided into five categories
Masked Language Modeling (MLM) erases some words randomly in the input sequence and then predicts these erased words
Denoising AutoEncoder (DAE) adds noise to the original corpus and reconstructs the original input
Replaced Token Detection (RTD) is a discriminant task that determines whether the LM has replaced the current token
Next Sentence Prediction (NSP) predicts whether the second of two input sentences actually follows the first in the source text
Sentence Order Prediction (SOP) uses two contiguous fragments from a document as positive samples and the same two fragments in swapped order as negative samples
Pretext tasks are created for the encoder networks to perform during the pretraining phase
Frame Order Learning Task involves frame processing through time steps
Data Reconstruction Task pretrains models by predicting the missed center part
Miscellaneous tasks include Graph Information Completion, Graph Property Prediction, and integration mechanisms
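Two of the language pretraining tasks listed above, MLM and SOP, reduce to simple data-preparation steps. The sketch below shows one way to build their training examples; the `[MASK]` string, the 15% default masking probability, and the function names are assumptions for illustration, and the model that predicts the targets is left out.

```python
import random

MASK = "[MASK]"  # assumed placeholder token

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM: randomly erase tokens; return (masked input, {position: original token})."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

def sop_examples(fragment_a, fragment_b):
    """SOP: original order is the positive sample (1), swapped order the negative (0)."""
    return [((fragment_a, fragment_b), 1), ((fragment_b, fragment_a), 0)]

sentence = "pretraining gives a good parameter initialization".split()
masked_sentence, mlm_targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
sop_pairs = sop_examples("The model is pretrained.", "Then it is fine-tuned.")
```

In both cases the labels come from the text itself, so no manual annotation is needed.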

PFMs for natural language processing

NLP is a research field that combines linguistics and computer science
PFM is a popular technology in NLP
PFM models syntactic and semantic representations of words
Table 1 lists numerous PFMs
Autoregressive language model, contextual LM, and permuted LM are word representation learning models
Neural network architectures for PFM design and methods for mask design are presented
Boosting methods, multi-task learning, and different downstream tasks are summarized
Instruction-aligning methods, such as RLHF and Chain-of-Thoughts, are applied in PFMs

Word representations methods

Large-scale pretrained models have achieved better performance than humans in certain tasks
Pretraining LMs are divided into three branches: autoregressive LM, contextual LM, and permuted LM
Autoregressive LM predicts the next possible word based on the preceding words
GPT-2 uses an autoregressive LM and multi-task learning
GPT-3 has 175 billion parameters and trains with 45 Terabytes of data
ELMo uses bi-directional Long Short-Term Memory (LSTM)
BERT uses a stacked multi-layer bi-directional Transformer
RoBERTa uses a larger batch size and unlabeled data
Permuted LM combines the advantages of the autoregressive LM and the autoencoder LM
XLNet and MPNet are common permuted LM models
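The autoregressive idea, scoring a sentence as a product of next-word probabilities, can be sketched with a toy count-based model. This is a deliberate simplification: it conditions on only one preceding word and uses no smoothing, whereas GPT-style models condition on the whole prefix with a Transformer. The corpus and function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-word frequencies for each preceding word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Estimated distribution over the next word given the preceding word."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

def sequence_log_prob(counts, sentence):
    """Chain rule: log p(w_1..w_n) = sum over t of log p(w_t | previous word)."""
    words = ["<s>"] + sentence.split()
    return sum(math.log(next_word_probs(counts, p)[n])
               for p, n in zip(words, words[1:]))

corpus = ["the cat sat", "the dog sat", "the cat ran"]
model = train_bigram(corpus)
probs_after_the = next_word_probs(model, "the")
log_prob = sequence_log_prob(model, "the cat sat")
```

Because there is no smoothing, a word pair that never occurs in the toy corpus raises a `KeyError` in `sequence_log_prob`.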

Model architecture designing methods

ELMo uses a multi-layer RNN structure with bi-directional LSTM
ELMo introduces contextual information and mitigates the polysemy problem, but has weak feature extraction
PFMs have two main directions: fine-tuning (e.g. BERT) and zero/few-shot prompts (e.g. GPT)
BART is a denoising autoencoder built on a seq2seq model: it corrupts the input text (for example, by shuffling sentence order) and learns to reconstruct the original

Masking designing methods

Attention mechanism aggregates words and sentences into text vectors
BERT's masked, bidirectional pretraining limits its ability to learn natural language generation (NLG) tasks
SpanBERT uses dynamic masking and single segment pretraining
SpanBERT predicts span using token closest to boundary and eliminates NSP pretraining task
MASS randomly masks input sequence of encoder
UniLM applies different attention masks to the two sentences in the input data
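Masking choices like these can be expressed as attention masks that say which positions may see which. The sketch below builds three such masks (bidirectional, unidirectional, and sequence-to-sequence) in the spirit of UniLM; the function name, the mode strings, and the 0/1 encoding are assumptions for illustration rather than the paper's notation.

```python
def attention_mask(seq_len, mode, source_len=0):
    """Build a square mask; mask[i][j] == 1 means position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if mode == "bidirectional":
                allowed = True
            elif mode == "unidirectional":
                # Each position sees itself and everything to its left.
                allowed = j <= i
            elif mode == "seq2seq":
                # Source positions see the whole source; target positions see
                # the source plus earlier target positions (and themselves).
                allowed = j < source_len if i < source_len else j <= i
            else:
                raise ValueError(f"unknown mode: {mode}")
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask

# Two source tokens followed by two target tokens.
seq2seq_mask = attention_mask(4, "seq2seq", source_len=2)
causal_mask = attention_mask(3, "unidirectional")
```

Switching the mask while keeping the same network is what lets one model cover both understanding-style and generation-style objectives.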

Boosting methods

Pretraining models need lots of data and hardware, making it challenging to retrain
ERNIE Tiny and ALBERT reduce the number of parameters of the model without performance loss
ERNIE 2.0 uses multi-task learning to pretrain lexical, grammar, and semantics
UniLM uses three pretraining tasks: unidirectional LM, bidirectional LM, and encoder-decoder LM
BERT-WWM and ZEN use N-gram to enhance performance
ChatGPT and Bard use supervised learning and external knowledge sources to improve model performance

Instruction-aligning methods

Instruction-aligning methods aim to let LMs follow human intents and generate meaningful outputs.
Supervised Fine-Tuning (SFT) is a technique to unlock knowledge and apply it to specific tasks.
Reinforcement Learning (RL) is used to enhance various models in NLP tasks and align LMs with human preferences.
Chain-of-Thoughts (CoT) is a series of intermediate reasoning steps that can improve the ability of large LMs to perform complex reasoning.
Fine-tuning with CoT can make LMs slightly more harmless.
Claude uses RL from AI Feedback (RLAIF) to control outputs to be less harmful.

Summary

Traditional feature learning methods often lead to information loss
People began to focus on the distribution of data and attributes in the graph data as self-supervised signals to pretrain the graph model
Contrastive learning strategies have been applied to the pretraining of graph models
Joint pretraining of speech and text has been researched to apply methods from the NLP community to speech

PFMs for computer vision

Popularity of PFM used in NLP motivates researchers to explore PFM in CV
Pretraining is adjusting parameters on a general dataset to make other tasks train faster
Self-supervised learning (SSL) uses labels derived automatically from human-designed pretext tasks to learn representations that can be generalized to various tasks
SSL reduces reliance on data annotations
The general SSL pipeline consists of a pretext-task pretraining stage followed by a supervised learning stage on the downstream task
Pretraining PFMs in CV can be done with specific pretext tasks, frame order, generation, reconstruction, memory bank, sharing, and clustering

Learning by specific pretext task

Early stage of unsupervised learning involves designing a pretext task and predicting the answer
Dosovitskiy et al. pretrain Exemplar CNN to discriminate patches from unlabelled data
Context prediction uses handcrafted supervised signal as label for pair classification
Inpainting pretrains models by predicting the missed center part
Colorization evaluates how colorization as a pretext task can help learn semantic representation
Split-Brain Autoencoder learns representations in self-supervised way by forcing network to solve cross-channel prediction tasks
Jigsaw pretrains Context-Free Network in self-supervised manner by designing Jigsaw puzzle as pretext task
CDJP learns image representation by complicating pretext tasks
Noroozi et al. use counting visual primitives as pretext task
NAT learns representation by aligning output of backbone CNN to low-dimensional noise
RotNet predicts different rotations of images
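The rotation pretext task is a convenient example of how such tasks produce labels for free. The sketch below turns one image into four training samples, one per rotation, with the rotation index as the label. The four 90-degree steps follow the usual RotNet setup; the helper names and the tiny 2x2 "image" are illustrative assumptions.

```python
def rotate90(image):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_pretext_samples(image):
    """Return [(rotated_image, label)] for labels 0..3, meaning 0/90/180/270 degrees."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples

image = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(image)
```

A network trained to predict the label has to recognize object orientation, which pushes it toward semantically useful features.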

Learning by frame order

Learning sequence data involves frame processing over time.
Contrastive Predictive Coding (CPC) is a model that predicts future frames in latent space.
CPC components include the input sequence x_t, the latent sequence z_t, and the context latent representation c_t.
CPC models a "density ratio" f_k to represent the mutual information between c_t and x_{t+k}.
CPC v2 improves CPC by pretraining on unsupervised representations.
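For reference, the density ratio can be written out as below. The proportionality and the log-bilinear form follow the original CPC formulation as commonly stated; the per-step weight matrix W_k is not mentioned in the notes above and is introduced here as an assumption from that formulation.

```latex
f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})},
\qquad
f_k(x_{t+k}, c_t) = \exp\left( z_{t+k}^{\top} W_k \, c_t \right)
```

Here z_{t+k} is the encoder output for the future input x_{t+k}, so the score is high when the context c_t predicts that future latent well.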

Learning by generation

GAN-based approach has not been fully exploited
BiGANs proposed to project data back into latent space
BigBiGAN achieves SOTA in unsupervised representation learning on ImageNet
GANs components used to produce data-latent pairs
Final loss is sum of data-specific and data-joint terms
Discriminator learns to discriminate between sample pairs from raw data, latent distribution and encoded vector

Learning by reconstruction

iGPT and ViT models have adapted the masked prediction task from language to image data
BEiT is the first to outperform a conventional SOTA method without pretraining
BEiT consists of two stages: token embedding and tokenizer training
MAE proposes an end-to-end solution with a higher masking ratio than BERT
SimMIM also uses a higher masking ratio and random masking strategy
hViT and local attention have been proposed to improve efficiency
UM-MAE and LoMaR sample small windows to focus attention on local regions
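The high-ratio random masking used by MAE-style methods can be sketched as a split of patch indices into visible and masked sets. The 0.75 ratio is the value commonly associated with MAE and is treated as an assumption here, since the notes only say the ratio is higher than BERT's; the function name is illustrative.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# A 4x4 grid of patches: the encoder would see only the visible ones,
# and the decoder would reconstruct the masked ones.
visible, masked = random_patch_mask(num_patches=16, mask_ratio=0.75)
```

Processing only the visible patches is also what makes a high masking ratio cheaper to train.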

Learning by memory bank

NPID is the first method to use instances to learn representations for downstream tasks
NPID uses a memory bank to store feature representations for instance-level classification
LA maximizes a metric of local aggregation to move similar data instances together in the embedding space
PIRL argues that semantic representations are invariant under pretext transformation tasks
PIRL uses a memory bank to store representations for comparison
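A memory bank in this sense is a table of stored feature vectors, one per training instance, that later batches compare against. The sketch below is a minimal version under stated assumptions: the momentum-style blending of old and new features and the cosine similarity lookup are illustrative choices, not a reproduction of NPID or PIRL.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class MemoryBank:
    """Stores one feature vector per training instance, keyed by instance index."""

    def __init__(self, momentum=0.5):
        self.momentum = momentum  # assumed blending weight for stored features
        self.features = {}

    def update(self, index, feature):
        """Blend the new feature into the stored one, then re-normalize."""
        feature = l2_normalize(feature)
        old = self.features.get(index)
        if old is not None:
            feature = l2_normalize([self.momentum * o + (1 - self.momentum) * f
                                    for o, f in zip(old, feature)])
        self.features[index] = feature

    def similarities(self, query):
        """Cosine similarity between a query feature and every stored instance."""
        query = l2_normalize(query)
        return {i: sum(q * f for q, f in zip(query, feat))
                for i, feat in self.features.items()}

bank = MemoryBank()
bank.update(0, [1.0, 0.0])
bank.update(1, [0.0, 2.0])
sims = bank.similarities([3.0, 0.0])
```

Reading comparisons from the bank avoids recomputing features for every other instance at each training step.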

Learning by sharing

SSL often uses two encoder networks, one for each augmented view of the data
Two different or same encoders are used to extract contrastive representations
SSL can be divided into two categories: Soft Sharing and Hard Sharing
Momentum Contrast (MoCo) uses momentum to control the slight difference between two encoders
InfoNCE Loss is used for pretraining in MoCo
Bootstrap Your Own Latent (BYOL) does not use negative samples
SimCLR uses hard parameter-sharing architecture
Swapping Assignments between multiple Views of the same image (SwAV) uses clustering instead of comparison between pairs
SElf-supERvised (SEER) uses RegNetY architectures
Simple Siamese (SimSiam) networks avoid the design of negative sample pairs, large batches, and momentum encoders
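Two ingredients named above for MoCo, the momentum link between the two encoders and the InfoNCE loss, are sketched below on plain lists. The momentum 0.999 and temperature 0.07 are the values usually quoted for MoCo and are assumptions here; parameters are flattened to lists of floats for simplicity, and the function names are illustrative.

```python
import math

def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: negative log-softmax score of the positive key."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

query = [1.0, 0.0]
loss_aligned = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
new_key_params = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
```

The loss is small when the query matches its positive key and large when a negative key matches better. Methods such as BYOL and SimSiam drop the negative keys altogether.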

Learning by clustering

DeepCluster is the first model to use clustering for large-scale dataset learning.
SwAV and PCL bridge contrastive learning with clustering.
Clustering helps encode more semantic information.
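Clustering-based pretraining alternates between grouping the current features and training the network to predict each sample's group. The sketch below covers only the grouping half, using plain k-means on 1-D features to produce pseudo-labels; the toy features, the initial centroids, and the function name are assumptions for illustration.

```python
def kmeans_pseudo_labels(features, centroids, iterations=10):
    """Plain k-means on 1-D features; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each feature gets the index of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
                  for x in features]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [x for x, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

# Hypothetical 1-D features produced by an encoder for six images.
features = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
pseudo_labels, centers = kmeans_pseudo_labels(features, centroids=[0.0, 1.0])
```

The pseudo-labels would then serve as classification targets for the next round of training, after which the features are re-clustered.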
