Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proteins power a variety of processes in cells
  • Protein design enables engineering of cellular behavior
  • Structure-based protein design looks for designable, novel, and diverse structures
  • Search-based methods are limited due to the large space of sequences and structures
  • Generative models learn the low-dimensional structure of complex data distributions
  • Genie is a generative model of protein structures that performs discrete-time diffusion
  • Genie generates more designable, novel, and diverse protein backbones than existing models

Paper Content

Introduction

  • Proteins play an essential role in cellular processes
  • Evolution has explored a small subregion of foldable protein space
  • Protein design efforts have focused on optimizing functional properties of naturally occurring proteins
  • Recent advances in protein structure prediction methods have enabled new approaches to explore structure space
  • Generative models can capture complex data distributions and have been applied to protein design
  • Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) have been used
  • Denoising Diffusion Probabilistic Models (DDPMs) have shown promise in generating high quality 2D images
  • Multiple prior efforts have applied generative modeling to structure-based protein design
  • FoldingDiff uses internal coordinates to parameterize proteins
  • ProtDiff uses atomic coordinates in Cartesian space
  • AlphaFold2 combines implicit reasoning in a latent space with geometric reasoning in Cartesian space
  • Genie combines aspects of SE(3)-equivariant reasoning with DDPMs to create a diffusion process over protein backbone geometry in Cartesian space

Methods

  • Genie is a DDPM that generates protein backbones as a sequence of C α atomic coordinates
  • Genie performs diffusion directly in Cartesian space and uses an SE(3)-equivariant denoiser
  • Section 2.1 describes tailored implementation of DDPMs for protein backbone generation
  • Section 2.2 provides details on the SE(3)-equivariant denoiser
  • Section 2.3 and 2.4 describe how to train and sample from the model

Denoising diffusion probabilistic model

  • Denotes a sequence of Cα coordinates of length N, corresponding to a protein with N residues
  • Adds isotropic Gaussian noise to the sample following a cosine variance schedule
  • Reverse process modeled as a Gaussian process
  • Starting the reverse process from pure white noise and then iteratively removing noise generates protein backbones de novo

Noise prediction

  • First element of F i is rotation matrix
  • Second element is translation vector
  • Edge cases of N- and C-termini of proteins handled by assigning frames of second and second-to-last residues to first and last residues

Se(3)-invariant encoder

  • Generates and refines single and paired residue representations
  • Extracts updated coordinates from translation component of frames
  • Computes predicted noise as difference between coordinates

Training

  • Genie is a computer program that can reverse the diffusion process and generate novel protein backbones.
  • Training Genie involves minimizing the error in noise prediction for each diffusion step.
  • The maximum sequence length considered is 128.
  • The loss is defined as the sum of per residue L2 distances between true and predicted noise vectors.
  • Training data is from the Structural Classification of Proteins extended (SCOPe) dataset.
  • 8,766 domains are used, with 3,942 domains having at most 128 residues.

Sampling

  • Generate a new protein backbone of length N
  • Sample a random sequence of coordinates
  • Feed sequence through reverse diffusion process
  • Update rule is given by Equation 3

Results

  • Genie was evaluated by generating 10 proteins for each sequence length between 50 and 128 residues
  • Genie was assessed on designability, diversity, and novelty
  • Genie outperformed ProtDiff and FoldingDiff on all three criteria

Designability

  • Genie outperforms ProtDiff and FoldingDiff in terms of designability
  • Designability is measured by scTM score and pLDDT score
  • scTM score ranges from 0 to 1, higher numbers indicate higher likelihood of designability
  • pLDDT score ranges from 0 to 100, higher numbers indicate higher confidence in prediction
  • scTM > 0.5 and pLDDT > 70 are used as cutoffs for designability
  • Genie generates more designable protein structures than ProtDiff and FoldingDiff

Diversity

  • Evaluated diversity by considering relative proportion of secondary structure elements (SSEs)
  • Used Protein Secondary Element Assignment (P-SEA) algorithm to identify SSEs
  • Genie designs are more diverse, with 254 mainly α-helical, 25 mainly β-strand, and 176 α, β-mixed domains
  • Genie achieves lower average maximum TM score than ProtDiff and FoldingDiff, suggesting more diverse domains

Novelty

  • Novelty of generated protein structures is a key feature of any structure-based protein design tool.
  • 98 out of 455 (21.5%) confidently designable structures generated by Genie are novel.
  • Multidimensional scaling (MDS) is applied to the pairwise TM scores of all 455 confidently designable domains to visualize the design space of Genie.

Conclusion

  • Presented Genie, a novel DDPM for de novo protein design
  • Dual representations for protein residues used
  • Noise prediction accomplished by combining IPA with backbone updates
  • Future directions include expanding Genie to include a sequence generation module and facilitating application of Genie to biologically functional designs
  • Sinusoidal encoding of diffusion step and residue index used
  • Relative positional encoding used to compute pair representation
  • Single Feature Network and Pair Feature Network used
  • Pair Transform Network uses 5 layers of triangular multiplicative updates
  • Genie implemented in PyTorch
  • Adam optimizer used with learning rate of 10-4
  • Trained on 12 A100 Nvidia GPUs with effective batch size of 48
  • Trained for 50000 epochs
  • Evaluated using ProtDiff and FoldingDiff
  • Additional evaluations of Genie and other methods by replacing OmegaFold with ESMFold
  • Self-consistency Template Modeling (scTM) pipeline used