Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proteins power a variety of processes in cells
- Protein design enables engineering of cellular behavior
- Structure-based protein design looks for designable, novel, and diverse structures
- Search-based methods are limited due to the large space of sequences and structures
- Generative models learn the low-dimensional structure of complex data distributions
- Genie is a generative model of protein structures that performs discrete-time diffusion
- Genie generates more designable, novel, and diverse protein backbones than existing models
Paper Content
Introduction
- Proteins play an essential role in cellular processes
- Evolution has explored a small subregion of foldable protein space
- Protein design efforts have focused on optimizing functional properties of naturally occurring proteins
- Recent advances in protein structure prediction methods have enabled new approaches to explore structure space
- Generative models can capture complex data distributions and have been applied to protein design
- Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) have been used
- Denoising Diffusion Probabilistic Models (DDPMs) have shown promise in generating high quality 2D images
- Multiple prior efforts have applied generative modeling to structure-based protein design
- FoldingDiff uses internal coordinates to parameterize proteins
- ProtDiff uses atomic coordinates in Cartesian space
- AlphaFold2 combines implicit reasoning in a latent space with geometric reasoning in Cartesian space
- Genie combines aspects of SE(3)-equivariant reasoning with DDPMs to create a diffusion process over protein backbone geometry in Cartesian space
Methods
- Genie is a DDPM that generates protein backbones as a sequence of Cα atomic coordinates
- Genie performs diffusion directly in Cartesian space and uses an SE(3)-equivariant denoiser
- Section 2.1 describes the tailored implementation of DDPMs for protein backbone generation
- Section 2.2 provides details on the SE(3)-equivariant denoiser
- Sections 2.3 and 2.4 describe how to train and sample from the model
Denoising diffusion probabilistic model
- A protein with N residues is represented as a sequence of N Cα coordinates
- The forward process adds isotropic Gaussian noise to the sample following a cosine variance schedule
- The reverse process is modeled as a Markov chain with Gaussian transitions
- Starting the reverse process from pure white noise and then iteratively removing noise generates protein backbones de novo
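The forward process above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the commonly used cosine schedule in which the cumulative signal level follows a squared cosine, and the offset `s`, the step count `T=1000`, and the function names are all assumptions made for the example.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..T under a cosine schedule.

    The offset s and this exact functional form are assumptions for the sketch.
    """
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def forward_noise(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar[t]) * x0, (1 - alpha_bar[t]) * I)."""
    eps = rng.standard_normal(x0.shape)  # isotropic Gaussian noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(T=1000)
x0 = rng.standard_normal((64, 3))  # toy stand-in for 64 C-alpha coordinates
x_t, eps = forward_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

As `t` approaches `T`, `alpha_bar[t]` goes to zero and `x_t` becomes indistinguishable from white noise, which is why sampling can start from pure noise.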
Noise prediction
- First element of each frame F_i is a rotation matrix
- Second element is translation vector
- Edge cases of N- and C-termini of proteins handled by assigning frames of second and second-to-last residues to first and last residues
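One way to build such per-residue frames from Cα coordinates alone is sketched below. The summary does not say how the rotation is constructed, so the Frenet-Serret-style construction (tangent, normal, binormal from consecutive residues) and the column ordering are assumptions; only the (rotation, translation) pairing and the handling of the termini come from the text above. The function name `backbone_frames` is hypothetical.

```python
import numpy as np

def _unit(v):
    """Normalize vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def backbone_frames(ca):
    """Return rotations R of shape (N, 3, 3) and translations t of shape (N, 3).

    Assumed construction: a right-handed (tangent, normal, binormal) basis
    built from three consecutive C-alpha atoms. Fails if three consecutive
    atoms are exactly collinear (zero cross product).
    """
    n = ca.shape[0]
    if n < 3:
        raise ValueError("need at least 3 residues to build interior frames")
    tangent = _unit(ca[1:] - ca[:-1])                      # tangent[i]: residue i -> i+1
    binormal = _unit(np.cross(tangent[:-1], tangent[1:]))  # interior residues 1..N-2
    normal = np.cross(binormal, tangent[1:])               # unit length by construction
    R = np.empty((n, 3, 3))
    R[1:-1] = np.stack([tangent[1:], normal, binormal], axis=-1)  # basis vectors as columns
    # Termini lack a neighbor on one side, so they borrow the adjacent frame.
    R[0] = R[1]
    R[-1] = R[-2]
    return R, ca.copy()  # translation is the C-alpha position itself

ca = np.random.default_rng(0).standard_normal((10, 3))  # toy coordinates
R, t = backbone_frames(ca)
```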
SE(3)-invariant encoder
- Generates and refines single and paired residue representations
- Extracts updated coordinates from translation component of frames
- Computes predicted noise as difference between coordinates
Training
- Genie is trained to reverse the diffusion process so that it can generate novel protein backbones.
- Training Genie involves minimizing the error in noise prediction for each diffusion step.
- The maximum sequence length considered is 128.
- The loss is defined as the sum of per residue L2 distances between true and predicted noise vectors.
- Training data is from the Structural Classification of Proteins extended (SCOPe) dataset.
- 8,766 domains are used, with 3,942 domains having at most 128 residues.
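The loss described above can be written compactly. The sketch below follows the wording of this summary literally (a sum of unsquared per-residue L2 distances); whether the original objective squares the distance is not stated here. The optional `mask` for padding shorter proteins up to the maximum length is an assumption added for illustration.

```python
import numpy as np

def noise_prediction_loss(eps_true, eps_pred, mask=None):
    """Sum over residues of the L2 distance between true and predicted noise.

    eps_true, eps_pred: arrays of shape (N, 3).
    mask: optional array of shape (N,), 1 for real residues and 0 for padding
          (hypothetical; not described in the summary).
    """
    per_residue = np.linalg.norm(eps_true - eps_pred, axis=-1)  # shape (N,)
    if mask is not None:
        per_residue = per_residue * mask
    return per_residue.sum()

eps_true = np.zeros((4, 3))
eps_pred = np.tile([3.0, 4.0, 0.0], (4, 1))  # each residue is off by distance 5
loss_all = noise_prediction_loss(eps_true, eps_pred)
loss_masked = noise_prediction_loss(eps_true, eps_pred, mask=np.array([1, 1, 0, 0]))
```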
Sampling
- Generate a new protein backbone of length N
- Sample a random sequence of coordinates
- Feed sequence through reverse diffusion process
- Update rule is given by Equation 3
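The sampling steps above can be sketched as a loop. Equation 3 is not reproduced in this summary, so the update below is the standard DDPM ancestral sampling rule and may differ in detail from the paper's. The `predict_noise` callable is a placeholder for the trained denoiser, and the short linear `betas` schedule is a toy choice for brevity rather than the cosine schedule named earlier.

```python
import numpy as np

def sample_backbone(n_residues, betas, predict_noise, rng):
    """Run reverse diffusion from white noise to a set of C-alpha coordinates."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal((n_residues, 3))  # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        # Posterior mean under the standard DDPM parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def zero_denoiser(x, t):
    """Placeholder denoiser that predicts no noise (stand-in for the trained model)."""
    return np.zeros_like(x)

betas = np.linspace(1e-4, 0.02, 50)  # toy schedule, illustration only
backbone = sample_backbone(64, betas, zero_denoiser, np.random.default_rng(0))
```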
Results
- Genie was evaluated by generating 10 proteins for each sequence length between 50 and 128 residues
- Genie was assessed on designability, diversity, and novelty
- Genie outperformed ProtDiff and FoldingDiff on all three criteria
Designability
- Genie outperforms ProtDiff and FoldingDiff in terms of designability
- Designability is measured by scTM score and pLDDT score
- scTM score ranges from 0 to 1, higher numbers indicate higher likelihood of designability
- pLDDT score ranges from 0 to 100, higher numbers indicate higher confidence in prediction
- scTM > 0.5 and pLDDT > 70 are used as cutoffs for designability
- Genie generates more designable protein structures than ProtDiff and FoldingDiff
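Applying the two cutoffs above is a simple filter. The sketch below assumes both conditions must hold at once and that the inequalities are strict, as written; the `Design` record and the scores in the example are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Design:
    name: str
    sctm: float   # self-consistency TM score, in [0, 1]
    plddt: float  # predicted confidence, in [0, 100]

def is_designable(design, sctm_cutoff=0.5, plddt_cutoff=70.0):
    """True when both scores strictly exceed their cutoffs."""
    return design.sctm > sctm_cutoff and design.plddt > plddt_cutoff

designs = [
    Design("a", 0.82, 91.0),
    Design("b", 0.61, 55.0),  # fails the pLDDT cutoff
    Design("c", 0.34, 88.0),  # fails the scTM cutoff
    Design("d", 0.50, 70.0),  # sits exactly on both cutoffs, so excluded
]
kept = [d.name for d in designs if is_designable(d)]
```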
Diversity
- Evaluated diversity by considering relative proportion of secondary structure elements (SSEs)
- Used Protein Secondary Element Assignment (P-SEA) algorithm to identify SSEs
- Genie designs are more diverse, with 254 mainly α-helical, 25 mainly β-strand, and 176 mixed α/β domains
- Genie achieves lower average maximum TM score than ProtDiff and FoldingDiff, suggesting more diverse domains
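The "average maximum TM score" metric can be illustrated as follows. The sketch assumes that, for each design, the maximum is taken over all other designs in the same set and that these maxima are then averaged; the summary does not spell out the reference set, and the 3x3 matrix is made-up data.

```python
import numpy as np

def average_max_tm(tm):
    """Mean over designs of the highest TM score to any *other* design.

    tm: symmetric (M, M) matrix of pairwise TM scores.
    Lower values mean the designs resemble each other less.
    """
    off_diag = np.asarray(tm, dtype=float).copy()
    np.fill_diagonal(off_diag, -np.inf)  # ignore each design's self-comparison
    return off_diag.max(axis=1).mean()

tm = np.array([
    [1.0, 0.3, 0.6],
    [0.3, 1.0, 0.4],
    [0.6, 0.4, 1.0],
])
score = average_max_tm(tm)
```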
Novelty
- Novelty of generated protein structures is a key feature of any structure-based protein design tool.
- 98 out of 455 (21.5%) confidently designable structures generated by Genie are novel.
- Multidimensional scaling (MDS) is applied to the pairwise TM scores of all 455 confidently designable domains to visualize the design space of Genie.
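A rough sketch of how pairwise TM scores could be turned into a 2D map is given below. It uses classical (eigendecomposition-based) MDS and treats `1 - TM` as the dissimilarity; both choices are assumptions, since the summary does not say which MDS variant or distance was used. The small matrix is made-up data.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points so that Euclidean distances approximate `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues arise when dist is not Euclidean; clip them to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

tm = np.array([
    [1.0, 0.3, 0.6],
    [0.3, 1.0, 0.4],
    [0.6, 0.4, 1.0],
])
embedding = classical_mds(1.0 - tm)  # assumed dissimilarity: 1 - TM
```

Each row of `embedding` is a 2D point, so structurally similar designs (high TM score) land close together on the resulting plot.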
Conclusion
- Presented Genie, a novel DDPM for de novo protein design
- Dual representations for protein residues used
- Noise prediction accomplished by combining IPA with backbone updates
- Future directions include expanding Genie to include a sequence generation module and facilitating application of Genie to biologically functional designs
- Sinusoidal encoding of diffusion step and residue index used
- Relative positional encoding used to compute pair representation
- Single Feature Network and Pair Feature Network used
- Pair Transform Network uses 5 layers of triangular multiplicative updates
- Genie implemented in PyTorch
- Adam optimizer used with learning rate of 10^-4
- Trained on 12 Nvidia A100 GPUs with effective batch size of 48
- Trained for 50,000 epochs
- Evaluated against ProtDiff and FoldingDiff as baselines
- Additional evaluations of Genie and other methods by replacing OmegaFold with ESMFold
- Self-consistency Template Modeling (scTM) pipeline used
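The sinusoidal encoding of the diffusion step and residue index listed above can be sketched as follows. This uses the familiar Transformer-style formulation; the embedding dimension, the `max_period` constant, and the sine/cosine interleaving are assumptions, as the summary gives no such details.

```python
import numpy as np

def sinusoidal_encoding(positions, dim, max_period=10000.0):
    """Map integer positions (residue indices or diffusion steps) to (len, dim) features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pos = np.asarray(positions, dtype=float)[:, None]                  # (L, 1)
    freqs = np.exp(-np.log(max_period) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = pos * freqs                                               # (L, dim/2)
    enc = np.empty((pos.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)  # even channels
    enc[:, 1::2] = np.cos(angles)  # odd channels
    return enc

residue_features = sinusoidal_encoding(np.arange(128), dim=32)  # one row per residue
step_features = sinusoidal_encoding([250], dim=32)              # a single diffusion step
```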