Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Machine Learning-guided solutions have made progress in protein learning tasks.
  • Data availability is a constraint for success in scientific discovery tasks.
  • Deep learning models pretrained on millions of protein sequences have shown promise.
  • Representation Learning via Dictionary Learning (R2DL) is an end-to-end representation learning framework.
  • R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
  • R2DL can improve data efficiency by up to $10^5$ times over baselines.

Paper Content

Introduction

  • Recent advances in AI have led to major innovations in biology
  • Deep learning models aim to learn a highly accurate and compressed representation of the biological system
  • Successful applications include high-quality protein structure prediction and novel peptide discoveries
  • Deep learning models require large amounts of labeled data, which is scarce and sparse in biology
  • Pretraining methods leverage large amounts of sequence data to learn features that explain variance in sequences
  • Pretraining has enabled meaningful density modelling across protein functions, structures, and families
  • Pretraining is resource-intensive and expensive
  • This work proposes a lightweight, more accurate alternative method to large-scale pretraining
  • This method reprograms an existing foundation model of high capacity that is trained on data from a different domain
  • This method is defined by its ability to achieve the performance of transformers that are pretrained on billions of tokens
  • This method is tested on seven different physicochemical, structural, and functional property prediction tasks
  • This method is over 105 times more data-efficient than existing pretraining methods

Results

  • R2DL is a framework for embedding protein sequence datasets
  • R2DL uses a transformer pretrained on an English text corpus to learn representations for downstream protein prediction tasks

R2dl framework formulation

  • R2DL is a method to reprogram a source model (pretrained language model) to classify or predict the regression values of protein sequences.
  • The source model is BERT, a bidirectional transformer, which has been finetuned for different language tasks.
  • The input data of the source model is tokenized at the word level, while the input data of the target task is tokenized at the character level.
  • R2DL learns a linear mapping between the source model embedding space and the target task embedding space.
  • R2DL uses a k-SVD solver to select k English tokens and use their linearly combined embeddings as the embedding of a target token.

R2dl training and optimization procedure

  • Pretrained classifier C is used on source-task dataset with source tokens denoted by {v Si } |V S | i=1 and target-task dataset with target tokens denoted by {V T j } |V T | j=1.
  • R2DL finds a linear mapping function Θ to represent target tokens as a sparse encoding of the dictionary V S.
  • Map Θ is used to reprogram C to classify protein sequences.
  • Number of trainable parameters in R2DL is size of matrix Θ, which is usually much smaller than number of parameters in pretrained deep neural network classifier C.

Benchmark tasks and evaluation

  • Four physicochemical structure and property prediction tasks from a protein benchmark are considered
  • Secondary structure prediction involves predicting a structure for each amino acid in a protein sequence
  • Solubility prediction involves mapping a protein sequence to a label of membrane-bound or water soluble
  • Homology detection is a sequence classification task with labels representing different protein folds
  • Stability prediction is a regression task
  • Three biomedically relevant function prediction tasks are also considered
  • Data efficiency is used as a metric to compare the performance of R2DL to established benchmarks

Model baselines and data

  • Two types of models are considered: supervised and unsupervised
  • Supervised models are trained from scratch using labeled datasets
  • Unsupervised models are pretrained on protein sequence data without labels
  • Pretraining on unlabeled data is more advantageous
  • Pretraining corpus size is 1.7 million peptide sequences, with only 0.005% labeled
  • Baseline model for physicochemical property prediction is pretrained on Pfam corpus of 31 million protein domains
  • R2DL repurposes existing pretrained English language models

Data efficiency and accuracy of reprogramming

  • R2DL outperforms task-specific baselines for all prediction tasks
  • R2DL outperforms standard, supervised LSTM by up to 29.3%
  • R2DL is up to 10,000 times more data efficient than task-specific baselines
  • R2DL has higher predictive accuracy than baseline LDA model by 3%
  • R2DL has better classification accuracy than LDA model with imbalanced datasets

R2dl performance vs. pretraining performance in low data settings

  • Tested task-specific predictive performance of R2DL in reduced-data training settings
  • Compared performance of R2DL to baseline models
  • R2DL outperforms pretrained models in Toxicity, Secondary Structure, Homology and Solubility tasks
  • R2DL outperforms LDA model in antibody affinity dataset
  • R2DL outperforms random guess until cutoff point

Correlation between learned embeddings and evolutionary distances

  • R2DL model compared to individual protein task benchmarks
  • R2DL shows interpretable correspondences between learned embeddings and protein properties
  • Figures 5(a-c) show t-SNE projection of task-specific R2DL embeddings
  • Clear separation between different protein classes
  • Euclidean distances between latent embeddings and pairwise evolutionary distances between protein sequences show correlation of close to 1.0

Discussion

  • Proposed a new framework, R2DL, to reprogram large language models for various protein tasks
  • Demonstrates powerful predictive performance across tasks involving evolutionary understanding, structure prediction, property prediction and protein engineering
  • Alternative to pretraining large language models on up to 106 protein sequences
  • Requires only a pretrained natural language model, small-sized labeled protein data set and small amount of cross-domain finetuning
  • Up to 105 times more data-efficient than any current methods
  • Mitigates need to train large pretrained models for peptide learning tasks
  • Cheaper to run and more accurate than pretraining
  • Paves the path to extending this approach to other domains where pretrained LLMs do not exist

Method

Representation of tokens

  • R2DL framework uses two input datasets: English language text dataset and protein sequence dataset
  • Protein sequence dataset has a vocabulary size of 20 (due to 20 natural amino acids)
  • English text dataset is embedded in a latent space (V S ) using a pretrained language model
  • Protein sequence data is also embedded in the same latent space (V T )
  • Token embedding matrix is of dimensions (n, m) where n is number of tokens and m is length of embedding vectors
  • Same encoding scheme of V S and V T is used across all downstream tasks

Procedure description of the r2dl framework for a protein task

  • Pretrained English sentence classifier C is used
  • Target model training data X for task
  • Class mapping label function, h
  • Maximum number of iterations T 1 for updates to Θ
  • Number of iterations T 2 for k-SVD
  • Step size {α t } T1 t=1
  • Random initialization of Θ
  • Objective function for k-SVD
  • Calculate the loss and perform gradient descent
  • Output protein sequence labels for protein sequence x of task

Data

  • Provide five biologically relevant downstream physicochemical property prediction tasks
  • Datasets vary in size from 4,000 to 50,000
  • Secondary Structure Prediction task has 8,678 data samples and a benchmark of 80% accuracy
  • Solubility task has 16,253 data samples and a benchmark of 91% accuracy
  • Antigen Affinity task has 4,000 data samples and a benchmark of 0.87 Spearman’s ρ
  • Antimicrobial Prediction task has 6,489 data samples and a benchmark of 88% accuracy

R2dl settings and hyperparameter details amp

  • Dataset size is 8112
  • Training set size is 6489
  • Test set size is 812
  • L 0 norm used in objective function
  • 10,000 k-SVD iterations
  • 0.045 used as a parameter

Toxicity

  • Dataset size is 10,192
  • Training set size is 8153
  • Test set size is 1020
  • L 0 norm used in objective function
  • 10,000 k-SVD iterations
  • 0.045 used

Secondary structure

  • Dataset size is 9270
  • Training set size is 7416
  • Test set size is 1854
  • L 0 norm used in objective function
  • 9000 k-SVD iterations
  • 0.38 used as parameter

Stability

  • Used 56,126 Stability dataset
  • Training set size of 44,900 and test set size of 11,226
  • Used L 0 norm in objective function
  • 6,000 k-SVD iterations and = 0.29

Homology

  • Dataset size: 13,048
  • Training set size: 10,438
  • Test set size: 2,610
  • Objective function: L 0 norm
  • k-SVD iterations: 4,000
  • Parameter: 0.73

Solubility

  • Dataset size is 43,876
  • Training set size is 35,100
  • Test set size is 8,775
  • L 0 norm used in objective function
  • 9,000 k-SVD iterations
  • 0.42 used as parameter

Data and code availability

  • Links to protein sequence data and code are available on Github
  • 3 comparable methods introduced in Figure 1
  • Data efficiency of R2DL vs. pretrained methods
  • Confusion matrix of baseline model and R2DL model for antibody affinity prediction task
  • Task-specific evaluation of R2DL performance compared to baseline models
  • 6 downstream protein tasks: secondary structure, stability, homology, solubility, antimicrobial, toxicity
  • R2DL performs better when fewer labeled training data samples are available