  • PFMs are used for various downstream tasks with different data modalities
  • Pretraining is used to provide reasonable parameter initialization for a wide range of applications
  • GPT and BERT use Transformers to train on large datasets
  • AI has made waves in a variety of fields over the past few years
  • This study provides a comprehensive review of recent research advancements, current and future challenges, and opportunities for PFMs

Paper Content


  • PFMs are essential components of AI in the era of big data
  • PFMs are studied in the three major AI fields: NLP, CV and GL
  • PFMs are powerful general models that are effective in various fields or across fields
  • PFMs have demonstrated great potential in learning feature representations in various learning tasks
  • PFMs perform well when pretrained on multiple tasks with a large-scale corpus and then fine-tuned on similar small-scale tasks

PFMs and pretraining

  • PFMs are built on pretraining, a technique that trains a model on large amounts of data and tasks
  • Pretraining originates from transfer learning in CV tasks
  • When applied to NLP, LMs capture rich knowledge beneficial for downstream tasks
  • Pretraining data can be derived from any unlabeled text corpus
  • Early pretraining was static, but dynamic pretraining techniques have been proposed
  • PFMs are used for text, image, and graph tasks
  • PFMs have two major advantages: they need only minor fine-tuning, and their quality has already been vetted
  • Related work focuses on model efficiency, security, and compression

Contribution and organization

  • Several survey studies have reviewed pretrained models for specific areas
  • Bommasani et al. summarize the opportunities and risks of foundation models
  • Comprehensive review of PFMs in different areas (e.g., CV, NLP, GL, Speech, Video)
  • Summarize existing models from traditional to PFMs
  • Discuss advanced topics about PFMs (unified PFMs, model efficiency and compression, security and privacy)

Basic components

  • PFMs are large neural network models used for neural information processing
  • Transformer is a popular model architecture for PFMs in NLP and CV
  • Training large models requires various datasets for pretraining
  • After training, the model should be fine-tuned for efficacy, efficiency, and privacy

Transformer for PFMs

  • Transformer is a novel architecture that uses attention mechanisms without recurrence or convolution
  • Attention mechanism assigns weights to encoded input representations
  • Self-attention connects different positions in a single sequence
  • Transformer helps solve long-range dependency issues in NLP
  • ViT and GTN use transformer structures for CV and GL
  • Transformers are scalable enough to drive groundbreaking capabilities for PFMs
  • Largest language models have 100B+ parameters
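
The self-attention idea above can be sketched in a few lines. This is a minimal illustration, not the Transformer's full layer: it assumes queries, keys, and values are all the raw input vectors (no learned projections, no multiple heads), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (assumption: no projections).

    x: list of n token vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] is how strongly position i attends to j.
    """
    d = len(x[0])
    weights = []
    for query in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors.
    outputs = [
        [sum(w * value[c] for w, value in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Because every position attends to every other position in one step, distant tokens interact directly, which is the property credited above with easing long-range dependencies.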

Learning mechanisms for PFMs

  • Deep learning models outperform traditional learning models in CV, NLP, and GL
  • Supervised learning uses a training dataset with labels to learn a function
  • Semi-supervised learning uses a training dataset with labels and an unlabelled dataset to learn a function
  • Weakly-supervised learning uses a training dataset with inaccurate labels to learn a function
  • Self-supervised learning uses the information in the data itself to learn feature representations
  • Reinforcement learning models the learning process as an interaction between an agent and an environment
  • The agent seeks to learn an optimal policy for sequential decision-making problems
  • The agent aims to maximize the expectation of long-term return from each state
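
The "long-term return" in the reinforcement learning bullets is commonly formalized as a discounted sum of future rewards. The sketch below assumes that formulation; the discount factor `gamma` and the function name are illustrative and do not come from the source.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

An optimal policy would then be one that maximizes the expectation of these returns from each state.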

Pretraining tasks for PFMs

  • Pretraining is an initialization framework used with fine-tuning
  • Parameters of the model are trained on pre-set tasks to capture specific attributes, structure, and community information
  • Pretraining tasks can be divided into five categories
  • Masked Language Modeling (MLM) erases some words randomly in the input sequence and then predicts these erased words
  • Denoising AutoEncoder (DAE) adds noise to the original corpus and reconstructs the original input
  • Replaced Token Detection (RTD) is a discriminant task that determines whether the LM has replaced the current token
  • Next Sentence Prediction (NSP) predicts whether the second of two input sentences actually follows the first in the source text
  • Sentence Order Prediction (SOP) uses two contiguous fragments from a document as positive samples and the exchange order of the two fragments as negative samples
  • Pretext tasks are created for the encoder networks to perform during the pretraining phase
  • Frame Order Learning Task involves frame processing through time steps
  • Data Reconstruction Task pretrains models by predicting the missing center part
  • Miscellaneous tasks include Graph Information Completion, Graph Property Prediction, and integration mechanisms
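
Two of the language pretraining tasks above, MLM and SOP, come down to how training samples are built from raw text. The following is a rough sketch under stated assumptions: the 15% default masking probability is borrowed from common BERT practice rather than from this summary, and the helper names and the `[MASK]` string are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact symbol is an assumption

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM sample: hide random tokens and remember what was hidden."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model must predict this token at position i
        else:
            masked.append(tok)
    return masked, targets

def sop_pair(fragment_a, fragment_b, swap):
    """SOP sample: label 1 for the original order, 0 for the swapped order."""
    if swap:
        return (fragment_b, fragment_a), 0
    return (fragment_a, fragment_b), 1

sentence = "the quick brown fox jumps over the lazy dog".split()
masked_sentence, mlm_targets = mask_tokens(sentence, mask_prob=0.5, seed=1)
```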

PFMs for natural language processing

  • NLP is a research field that combines linguistics and computer science
  • PFM is a popular technology in NLP
  • PFM models syntactic and semantic representations of words
  • Table 1 lists numerous PFMs
  • Autoregressive language model, contextual LM, and permuted LM are word representation learning models
  • Neural network architectures for PFM designing and masking designing are presented
  • Boosting methods, multi-task learning, and different downstream tasks are summarized
  • Instruction-aligning methods, such as RLHF and Chain-of-Thoughts, are applied in PFMs

Word representations methods

  • Large-scale pretrained models have achieved better performance than humans in certain tasks
  • Pretraining LMs are divided into three branches: autoregressive LM, contextual LM, and permuted LM
  • Autoregressive LM predicts the next possible word based on the preceding words
  • GPT-2 uses an autoregressive LM and multi-task learning
  • GPT-3 has 175 billion parameters and trains with 45 Terabytes of data
  • ELMo uses bi-directional Long Short-Term Memory (LSTM)
  • BERT uses a stacked multi-layer bi-directional Transformer
  • RoBERTa uses a larger batch size and more unlabeled data
  • Permuted LM combines the advantages of the autoregressive LM and the autoencoder LM
  • XLNet and MPNet are common permuted LM models
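
The difference between the autoregressive and permuted branches is easiest to see in their training objectives. The formulas below follow the standard left-to-right factorization and the permutation objective as formulated for XLNet; the symbols ($T$ for sequence length, $\mathcal{Z}_T$ for the set of permutations of $1, \dots, T$, $\theta$ for model parameters) are notation introduced here, not taken from this summary.

```latex
% Autoregressive LM: each token is predicted from the tokens before it.
\[
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
\]
% Permuted LM: the same factorization, averaged over sampled orders z.
\[
\max_{\theta}\; \mathbb{E}_{z \sim \mathcal{Z}_T}\!\left[\sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{z_{<t}}\right)\right]
\]
```

Averaging over factorization orders lets each token condition on context from both sides while the model still predicts one token at a time, which is how the permuted LM combines the two approaches.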

Model architecture designing methods

  • ELMo uses a multi-layer RNN structure with bi-directional LSTM
  • ELMo introduces contextual information and alleviates the polysemy problem, but its feature extraction ability is weak
  • PFMs have two main directions: fine-tuning (e.g. BERT) and zero/few-shot prompts (e.g. GPT)
  • BART is a denoising autoencoder built on a seq2seq model; it corrupts the input with noise such as document rearrangement, encodes it bi-directionally, and learns to reconstruct the original text

Masking designing methods

  • Attention mechanism aggregates words and sentences into text vectors
  • The BERT model is limited in its ability to learn natural language generation (NLG) tasks
  • SpanBERT uses dynamic masking and single segment pretraining
  • SpanBERT predicts each masked span from the tokens closest to its boundary and drops the NSP pretraining task
  • MASS randomly masks input sequence of encoder
  • UniLM applies a different mask to each of the two sentences in the input
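
Mask design of the kind used by UniLM can be pictured as a matrix saying which positions may attend to which. The sketch below is an illustration under assumptions: the function name, the `kind` labels, and the 0/1 list-of-lists layout are hypothetical, and the "seq2seq" pattern (the first sentence sees itself fully, the second sees the first plus its own left context) reflects the commonly described UniLM scheme.

```python
def attention_mask(kind, src_len, tgt_len=0):
    """Build a mask where mask[i][j] == 1 means position i may attend to position j."""
    n = src_len + tgt_len
    if kind == "bidirectional":
        # Every position sees every position (BERT-style).
        return [[1] * n for _ in range(n)]
    if kind == "unidirectional":
        # Each position sees itself and everything to its left (GPT-style).
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if kind == "seq2seq":
        mask = []
        for i in range(n):
            if i < src_len:
                # First sentence: full attention within the first sentence only.
                row = [1 if j < src_len else 0 for j in range(n)]
            else:
                # Second sentence: all of the first sentence plus its own left context.
                row = [1 if j <= i else 0 for j in range(n)]
            mask.append(row)
        return mask
    raise ValueError(f"unknown mask kind: {kind}")

seq2seq_mask = attention_mask("seq2seq", src_len=2, tgt_len=2)
```

Switching the mask while sharing the same network is what allows one model to be trained on several language modeling objectives.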

Boosting methods

  • Pretraining models need lots of data and hardware, making it challenging to retrain
  • ERNIE Tiny and ALBERT reduce the number of model parameters while largely preserving performance
  • ERNIE 2.0 uses multi-task learning to pretrain on lexical, syntactic, and semantic tasks
  • UniLM uses three pretraining tasks: unidirectional LM, bidirectional LM, and encoder-decoder LM
  • BERT-WWM and ZEN use N-gram to enhance performance
  • ChatGPT and Bard use supervised learning and external knowledge sources to improve model performance

Instruction-aligning methods

  • Instruction-aligning methods aim to let LMs follow human intents and generate meaningful outputs.
  • Supervised Fine-Tuning (SFT) is a technique to unlock knowledge and apply it to specific tasks.
  • Reinforcement Learning (RL) is used to enhance various models in NLP tasks and align LMs with human preferences.
  • Chain-of-Thoughts (CoT) is a series of intermediate reasoning steps that can improve the ability of large LMs to perform complex reasoning.
  • Fine-tuning with CoT can make LMs slightly more harmless.
  • Claude uses RL from AI Feedback (RLAIF) to control outputs to be less harmful.
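
As a rough illustration of Chain-of-Thoughts prompting, a few-shot prompt can pair each example question with its intermediate reasoning before the answer. The prompt layout, the function name, and the dictionary keys below are hypothetical choices for this sketch, not a format prescribed by the source.

```python
def build_cot_prompt(examples, question):
    """Assemble a few-shot prompt whose examples spell out their reasoning steps."""
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The final question is left open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

demo = [{
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "reasoning": "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    "answer": "5",
}]
prompt = build_cot_prompt(demo, "A shelf holds 4 books and 3 are added. How many books are there?")
```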


  • Traditional feature learning methods often lead to information loss
  • People began to focus on the distribution of data and attributes in the graph data as self-supervised signals to pretrain the graph model
  • Contrastive learning strategies have been applied to the pretraining of graph models
  • Joint pretraining of speech and text has been researched to apply methods from the NLP community to speech

PFMs for computer vision

  • Popularity of PFM used in NLP motivates researchers to explore PFM in CV
  • Pretraining is adjusting parameters on a general dataset to make other tasks train faster
  • Self-supervised learning (SSL) uses labels derived from human-designed pretext tasks, rather than manual annotation, to learn representations that generalize to various tasks
  • SSL reduces reliance on data annotations
  • The general SSL pipeline consists of a pretext-task stage followed by a supervised learning stage on the downstream task
  • Pretraining PFMs in CV can be done with specific pretext tasks, frame order, generation, reconstruction, memory bank, sharing, and clustering

Learning by specific pretext task

  • Early stage of unsupervised learning involves designing a pretext task and predicting the answer
  • Dosovitskiy et al. pretrain Exemplar CNN to discriminate patches from unlabelled data
  • Context prediction uses handcrafted supervised signal as label for pair classification
  • Inpainting pretrains models by predicting the missing center part of an image
  • Colorization is evaluated as a pretext task for learning semantic representations
  • Split-Brain Autoencoder learns representations in self-supervised way by forcing network to solve cross-channel prediction tasks
  • Jigsaw pretrains Context-Free Network in self-supervised manner by designing Jigsaw puzzle as pretext task
  • CDJP learns image representation by complicating pretext tasks
  • Noroozi et al. use counting visual primitives as pretext task
  • NAT learns representation by aligning output of backbone CNN to low-dimensional noise
  • RotNet predicts different rotations of images

Learning by frame order

  • Learning sequence data involves frame processing over time.
  • Contrastive Predictive Coding (CPC) is a model that predicts future frames in latent space.
  • CPC components include input sequence $x_t$, latent sequence $z_t$, and context latent representation $c_t$.
  • CPC models a “density ratio” $f_k$ to represent the mutual information between $c_t$ and $x_{t+k}$.
  • CPC v2 improves CPC by pretraining on unsupervised representations.
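
For reference, the density ratio and the resulting loss can be written as follows. This follows the formulation in the original CPC work as commonly cited; the weight matrix $W_k$ and the sample set $X$ (one positive sample plus negatives) are symbols introduced here and do not appear in this summary.

```latex
% Density ratio: proportional to how much more likely x_{t+k} is given the context c_t.
\[
f_k(x_{t+k}, c_t) \;\propto\; \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})},
\qquad
f_k(x_{t+k}, c_t) = \exp\!\left(z_{t+k}^{\top} W_k \, c_t\right)
\]
% InfoNCE loss: pick the true future sample out of the set X.
\[
\mathcal{L}_N = -\,\mathbb{E}_{X}\!\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right]
\]
```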

Learning by generation

  • GAN-based approach has not been fully exploited
  • BiGAN was proposed to project data back into the latent space
  • BigBiGAN achieves SOTA in unsupervised representation learning on ImageNet
  • GAN components are used to produce data-latent pairs
  • Final loss is sum of data-specific and data-joint terms
  • Discriminator learns to discriminate between sample pairs from raw data, latent distribution and encoded vector

Learning by reconstruction

  • iGPT and ViT models have adapted the masked prediction task from language to image data
  • BEiT is the first to outperform a conventional SOTA method without pretraining
  • BEiT consists of two stages: token embedding and tokenizer training
  • MAE proposes an end-to-end solution with a much higher masking ratio (about 75%) than BERT (about 15%)
  • SimMIM also uses a higher masking ratio and random masking strategy
  • hViT and local attention have been proposed to improve efficiency
  • UM-MAE and LoMaR sample small windows to focus attention on local regions
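
Random masking with a high ratio, as used by MAE and SimMIM, amounts to splitting the patch indices of an image into a small visible set and a large hidden set. The sketch below assumes a 75% default ratio (the value usually reported for MAE, not stated in this summary) and hypothetical names.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) lists using uniform random masking."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # patches the model must reconstruct
    visible = sorted(order[num_masked:])   # patches the encoder actually sees
    return visible, masked

# A 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196)
```

Since the encoder only processes the visible patches, a higher masking ratio also lowers the cost of each pretraining step.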

Learning by memory bank

  • NPID is the first method to use instances to learn representations for downstream tasks
  • NPID uses a memory bank to store feature representations for instance-level classification
  • LA maximizes a metric of local aggregation to move similar data instances together in the embedding space
  • PIRL argues that semantic representations are invariant under pretext transformation tasks
  • PIRL uses a memory bank to store representations for comparison
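
A memory bank in this sense is a table with one stored feature per training instance, consulted when comparing a new feature against all instances. The class below is a simplified sketch, not the NPID or PIRL implementation: the momentum-style update, the temperature value, and every name are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class MemoryBank:
    def __init__(self, momentum=0.5):
        self.momentum = momentum  # assumed smoothing factor for stored features
        self.features = {}        # instance index -> unit-length feature

    def update(self, index, feature):
        """Store a feature, blending it with the previous entry if one exists."""
        feature = l2_normalize(feature)
        old = self.features.get(index)
        if old is not None:
            m = self.momentum
            feature = l2_normalize([m * o + (1 - m) * f for o, f in zip(old, feature)])
        self.features[index] = feature

    def instance_probabilities(self, query, temperature=0.07):
        """Softmax over similarities between the query and every stored instance."""
        query = l2_normalize(query)
        logits = {
            i: sum(q * f for q, f in zip(query, feat)) / temperature
            for i, feat in self.features.items()
        }
        peak = max(logits.values())
        exps = {i: math.exp(value - peak) for i, value in logits.items()}
        total = sum(exps.values())
        return {i: e / total for i, e in exps.items()}

bank = MemoryBank()
bank.update(0, [1.0, 0.0])
bank.update(1, [0.0, 1.0])
probs = bank.instance_probabilities([0.9, 0.1])
```

Storing features this way avoids recomputing every instance's representation at each step, at the cost of the stored entries being slightly stale.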

Learning by sharing

  • SSL by sharing feeds differently augmented views of the data through two encoder networks
  • Two different or same encoders are used to extract contrastive representations
  • SSL can be divided into two categories: Soft Sharing and Hard Sharing
  • Momentum Contrast (MoCo) uses momentum to control the slight difference between two encoders
  • InfoNCE Loss is used for pretraining in MoCo
  • Bootstrap Your Own Latent (BYOL) does not use negative samples
  • SimCLR uses hard parameter-sharing architecture
  • Swapping Assignments between multiple Views of the same image (SwAV) uses clustering instead of comparison between pairs
  • SElf-supERvised (SEER) uses RegNetY architectures
  • Simple Siamese (SimSiam) networks avoid the design of negative sample pairs, large batches, and momentum encoders
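
Two pieces of MoCo mentioned above, the momentum link between the encoders and the InfoNCE loss, can be sketched as follows. This is a toy version under assumptions: parameters are flat lists of floats, features are plain vectors, and the momentum and temperature defaults are commonly quoted values rather than figures from this summary.

```python
import math

def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss: negative log-probability of picking the positive key."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]
    # log-sum-exp computed stably, then subtract the positive logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]

aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

With `m` close to 1 the key encoder changes only slightly per step, which is the "slight difference between two encoders" that soft sharing maintains; hard sharing, as in SimCLR, would instead use the same parameters for both branches.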

Learning by clustering

  • DeepCluster is the first model to use clustering for large-scale dataset learning.
  • SwAV and PCL bridge contrastive learning with clustering.
  • Clustering helps encode more semantic information.
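
Clustering-based pretraining in the style of DeepCluster groups the current features and then treats the cluster assignments as pseudo-labels for training. The sketch below covers only the clustering step with a small k-means; the deterministic initialization from the first `k` features and the function name are simplifying assumptions.

```python
def kmeans_pseudo_labels(features, k, iterations=10):
    """Cluster feature vectors with k-means and return one pseudo-label per vector."""
    # Assumption: initialize centroids from the first k features for reproducibility.
    centroids = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each feature takes the label of its nearest centroid.
        for i, feature in enumerate(features):
            distances = [
                sum((a - b) ** 2 for a, b in zip(feature, centroid))
                for centroid in centroids
            ]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [f for f, label in zip(features, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

pseudo_labels = kmeans_pseudo_labels(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], k=2
)
```

In a full pipeline these pseudo-labels would supervise a classification loss, after which features are re-extracted and clustered again.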