Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Machine Learning-guided solutions have made progress in protein learning tasks.
Data availability is a constraint for success in scientific discovery tasks.
Deep learning models pretrained on millions of protein sequences have shown promise.
Representation Learning via Dictionary Learning (R2DL) is an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
R2DL can improve data efficiency by up to $10^5$ times over baselines.

Paper Content

Introduction

Recent advances in AI have led to major innovations in biology
Deep learning models aim to learn a highly accurate and compressed representation of the biological system
Successful applications include high-quality protein structure prediction and novel peptide discoveries
Deep learning models require large amounts of labeled data, which is scarce and sparse in biology
Pretraining methods leverage large amounts of sequence data to learn features that explain variance in sequences
Pretraining has enabled meaningful density modelling across protein functions, structures, and families
Pretraining is resource-intensive and expensive
This work proposes a lightweight, more accurate alternative method to large-scale pretraining
This method reprograms an existing foundation model of high capacity that is trained on data from a different domain
This method is defined by its ability to achieve the performance of transformers that are pretrained on billions of tokens
This method is tested on seven different physicochemical, structural, and functional property prediction tasks
This method is over 105 times more data-efficient than existing pretraining methods

Results

R2DL is a framework for embedding protein sequence datasets
R2DL uses a transformer pretrained on an English text corpus to learn representations for downstream protein prediction tasks

R2dl framework formulation

R2DL is a method to reprogram a source model (pretrained language model) to classify or predict the regression values of protein sequences.
The source model is BERT, a bidirectional transformer, which has been finetuned for different language tasks.
The input data of the source model is tokenized at the word level, while the input data of the target task is tokenized at the character level.
R2DL learns a linear mapping between the source model embedding space and the target task embedding space.
R2DL uses a k-SVD solver to select k English tokens and use their linearly combined embeddings as the embedding of a target token.

R2dl training and optimization procedure

Pretrained classifier C is used on source-task dataset with source tokens denoted by {v Si } |V S | i=1 and target-task dataset with target tokens denoted by {V T j } |V T | j=1.
R2DL finds a linear mapping function Θ to represent target tokens as a sparse encoding of the dictionary V S.
Map Θ is used to reprogram C to classify protein sequences.
Number of trainable parameters in R2DL is size of matrix Θ, which is usually much smaller than number of parameters in pretrained deep neural network classifier C.

Benchmark tasks and evaluation

Four physicochemical structure and property prediction tasks from a protein benchmark are considered
Secondary structure prediction involves predicting a structure for each amino acid in a protein sequence
Solubility prediction involves mapping a protein sequence to a label of membrane-bound or water soluble
Homology detection is a sequence classification task with labels representing different protein folds
Stability prediction is a regression task
Three biomedically relevant function prediction tasks are also considered
Data efficiency is used as a metric to compare the performance of R2DL to established benchmarks

Model baselines and data

Two types of models are considered: supervised and unsupervised
Supervised models are trained from scratch using labeled datasets
Unsupervised models are pretrained on protein sequence data without labels
Pretraining on unlabeled data is more advantageous
Pretraining corpus size is 1.7 million peptide sequences, with only 0.005% labeled
Baseline model for physicochemical property prediction is pretrained on Pfam corpus of 31 million protein domains
R2DL repurposes existing pretrained English language models

Data efficiency and accuracy of reprogramming

R2DL outperforms task-specific baselines for all prediction tasks
R2DL outperforms standard, supervised LSTM by up to 29.3%
R2DL is up to 10,000 times more data efficient than task-specific baselines
R2DL has higher predictive accuracy than baseline LDA model by 3%
R2DL has better classification accuracy than LDA model with imbalanced datasets

R2dl performance vs. pretraining performance in low data settings

Tested task-specific predictive performance of R2DL in reduced-data training settings
Compared performance of R2DL to baseline models
R2DL outperforms pretrained models in Toxicity, Secondary Structure, Homology and Solubility tasks
R2DL outperforms LDA model in antibody affinity dataset
R2DL outperforms random guess until cutoff point

Correlation between learned embeddings and evolutionary distances

R2DL model compared to individual protein task benchmarks
R2DL shows interpretable correspondences between learned embeddings and protein properties
Figures 5(a-c) show t-SNE projection of task-specific R2DL embeddings
Clear separation between different protein classes
Euclidean distances between latent embeddings and pairwise evolutionary distances between protein sequences show correlation of close to 1.0

Discussion

Proposed a new framework, R2DL, to reprogram large language models for various protein tasks
Demonstrates powerful predictive performance across tasks involving evolutionary understanding, structure prediction, property prediction and protein engineering
Alternative to pretraining large language models on up to 106 protein sequences
Requires only a pretrained natural language model, small-sized labeled protein data set and small amount of cross-domain finetuning
Up to 105 times more data-efficient than any current methods
Mitigates need to train large pretrained models for peptide learning tasks
Cheaper to run and more accurate than pretraining
Paves the path to extending this approach to other domains where pretrained LLMs do not exist

Method

Representation of tokens

R2DL framework uses two input datasets: English language text dataset and protein sequence dataset
Protein sequence dataset has a vocabulary size of 20 (due to 20 natural amino acids)
English text dataset is embedded in a latent space (V S ) using a pretrained language model
Protein sequence data is also embedded in the same latent space (V T )
Token embedding matrix is of dimensions (n, m) where n is number of tokens and m is length of embedding vectors
Same encoding scheme of V S and V T is used across all downstream tasks

Procedure description of the r2dl framework for a protein task

Pretrained English sentence classifier C is used
Target model training data X for task
Class mapping label function, h
Maximum number of iterations T 1 for updates to Θ
Number of iterations T 2 for k-SVD
Step size {α t } T1 t=1
Random initialization of Θ
Objective function for k-SVD
Calculate the loss and perform gradient descent
Output protein sequence labels for protein sequence x of task

Data

Provide five biologically relevant downstream physicochemical property prediction tasks
Datasets vary in size from 4,000 to 50,000
Secondary Structure Prediction task has 8,678 data samples and a benchmark of 80% accuracy
Solubility task has 16,253 data samples and a benchmark of 91% accuracy
Antigen Affinity task has 4,000 data samples and a benchmark of 0.87 Spearman’s ρ
Antimicrobial Prediction task has 6,489 data samples and a benchmark of 88% accuracy

R2dl settings and hyperparameter details amp

Dataset size is 8112
Training set size is 6489
Test set size is 812
L 0 norm used in objective function
10,000 k-SVD iterations
0.045 used as a parameter

Toxicity

Dataset size is 10,192
Training set size is 8153
Test set size is 1020
L 0 norm used in objective function
10,000 k-SVD iterations
0.045 used

Secondary structure

Dataset size is 9270
Training set size is 7416
Test set size is 1854
L 0 norm used in objective function
9000 k-SVD iterations
0.38 used as parameter

Stability

Used 56,126 Stability dataset
Training set size of 44,900 and test set size of 11,226
Used L 0 norm in objective function
6,000 k-SVD iterations and = 0.29

Homology

Dataset size: 13,048
Training set size: 10,438
Test set size: 2,610
Objective function: L 0 norm
k-SVD iterations: 4,000
Parameter: 0.73

Solubility

Dataset size is 43,876
Training set size is 35,100
Test set size is 8,775
L 0 norm used in objective function
9,000 k-SVD iterations
0.42 used as parameter

Data and code availability

Links to protein sequence data and code are available on Github
3 comparable methods introduced in Figure 1
Data efficiency of R2DL vs. pretrained methods
Confusion matrix of baseline model and R2DL model for antibody affinity prediction task
Task-specific evaluation of R2DL performance compared to baseline models
6 downstream protein tasks: secondary structure, stability, homology, solubility, antimicrobial, toxicity
R2DL performs better when fewer labeled training data samples are available

Link to paper#

Abstract#

Paper Content#

Introduction#

Results#

R2dl framework formulation#

R2dl training and optimization procedure#

Benchmark tasks and evaluation#

Model baselines and data#

Data efficiency and accuracy of reprogramming#

R2dl performance vs. pretraining performance in low data settings#

Correlation between learned embeddings and evolutionary distances#

Discussion#

Method#

Representation of tokens#

Procedure description of the r2dl framework for a protein task#

Data#

R2dl settings and hyperparameter details amp#

Toxicity#

Secondary structure#

Stability#

Homology#

Solubility#

Data and code availability#

Link to paper

Abstract

Paper Content

Introduction

Results

R2dl framework formulation

R2dl training and optimization procedure

Benchmark tasks and evaluation

Model baselines and data

Data efficiency and accuracy of reprogramming

R2dl performance vs. pretraining performance in low data settings

Correlation between learned embeddings and evolutionary distances

Discussion

Method

Representation of tokens

Procedure description of the r2dl framework for a protein task

Data

R2dl settings and hyperparameter details amp

Toxicity

Secondary structure

Stability

Homology

Solubility

Data and code availability