Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Coarse-graining (CG) is a method used to accelerate molecular simulations of protein dynamics.
  • Backmapping is the reverse operation: restoring the atomistic details lost in the CG representation.
  • Machine learning (ML) has been used to produce accurate and efficient CG simulations of proteins, but fast and reliable backmapping remains a challenge.
  • Rule-based methods produce poor all-atom geometries, needing computationally costly refinement.
  • ML approaches outperform traditional baselines but are not transferable between proteins and sometimes generate unphysical atom placements.
  • This work addresses both issues to build a fast, transferable, and reliable generative backmapping tool for CG protein representations.

Paper Content

Introduction

  • Protein dynamics ranges from large to small movements
  • Connected to essential biological functions
  • Data scarce, so research on conformational ensembles started recently
  • Experimental structure determination methods not suitable for describing individual dynamic states
  • Conformational ensembles generated using simulations
  • Atomistic simulations too computationally expensive
  • Coarse-grained simulations used to overcome limitations
  • Atom-level contacts essential to understand molecular recognition
  • Backmapping required to get complete picture of protein function
  • Popular backmapping methods involve two steps
  • Data-driven methods proposed to achieve efficiency and successful restoration
  • Proposed deep generative backmapping tool with transferability across protein space
  • Model reconstructs protein all-atom structure from alpha carbon of each amino acid
  • Model utilizes equivariant encoder and loss functions to enforce physical constraints

Methods

Data

  • PED contains 227 entries of protein structural ensembles
  • Most entries are generated by computer and experimentally constrained
  • Experimental validation reduces potential bias from sampling errors
  • 84 proteins selected for training, 4 proteins for testing

CG mapping scheme

  • Each amino acid residue is represented as one bead centered at its Cα atom
  • Popular medium resolution coarse-grained models include CABS and MARTINI
  • Backmapping algorithms start from the Cα trace level
  • Internal coordinates are used to preserve bond topology
  • GenZProt reconstructs bond topology by generating internal coordinate representation of each atom
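
The Cα mapping described above can be sketched in a few lines. This is only an illustration: the function name `ca_trace` and the array layout (one atom name and one coordinate row per atom) are assumptions, not part of the paper's code.

```python
import numpy as np

def ca_trace(atom_names, coords):
    """Return the coarse-grained Cα trace of an all-atom structure.

    atom_names: sequence of PDB-style atom names, one per atom (assumed layout).
    coords:     (n_atoms, 3) array of Cartesian coordinates.
    """
    mask = np.array([name == "CA" for name in atom_names])
    return np.asarray(coords)[mask]  # one bead per residue

# Hypothetical two-residue fragment with made-up coordinates.
names = ["N", "CA", "C", "O", "N", "CA"]
xyz = np.arange(18, dtype=float).reshape(6, 3)
beads = ca_trace(names, xyz)
```

Backmapping is the inverse problem: recovering all rows of `xyz` given only `beads` and the residue identities.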

Internal coordinate-based structure generation

  • Generates internal coordinates (Z-matrix), which are converted to Cartesian coordinates
  • Placement of an atom A in 3D space is determined from three anchor atoms B, C, D and a set of internal coordinates
  • Backbone atoms are placed using Cα atoms as anchors and side chain atoms are placed sequentially
  • Machine learning model can learn to predict placement of backbone atoms relative to three adjacent Cα atoms
  • Atoms are added to 3D space sequentially
  • Decoder generates all internal coordinates simultaneously in one shot
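
The conversion from internal to Cartesian coordinates can be sketched as follows. This is a minimal sketch, assuming the common convention that atom A is defined by the bond length |A-B|, the bond angle A-B-C, and the dihedral A-B-C-D; the function names and the numeric values in the usage example are illustrative and not taken from the paper.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in radians."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    y = np.dot(b2, np.cross(n1, n2))
    x = np.linalg.norm(b2) * np.dot(n1, n2)
    return np.arctan2(y, x)

def place_atom(b, c, d, bond, angle, torsion):
    """Place atom A from anchors B, C, D and one Z-matrix row.

    bond    = distance |A-B|
    angle   = bond angle A-B-C (radians)
    torsion = dihedral A-B-C-D (radians)
    Anchors must not be collinear.
    """
    cb = (b - c) / np.linalg.norm(b - c)   # unit vector from C to B
    n = np.cross(c - d, cb)                # normal of the D-C-B plane
    n /= np.linalg.norm(n)
    m = np.cross(n, cb)                    # completes the local frame
    local = np.array([
        -bond * np.cos(angle),
        bond * np.sin(angle) * np.cos(torsion),
        bond * np.sin(angle) * np.sin(torsion),
    ])
    return b + local[0] * cb + local[1] * m + local[2] * n

# Illustrative use: place one backbone atom relative to three adjacent Cα beads.
ca_prev = np.array([0.0, 0.0, 0.0])
ca_i = np.array([3.8, 0.0, 0.0])
ca_next = np.array([5.0, 3.6, 0.0])
atom = place_atom(ca_i, ca_prev, ca_next, bond=1.53,
                  angle=np.deg2rad(110.0), torsion=np.deg2rad(-60.0))
```

Because each placement depends only on already-placed anchors, a decoder can emit every (bond, angle, torsion) triple at once and the Cartesian structure is then built by applying `place_atom` in sequence.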

VAE framework

  • Model is based on VAE framework introduced in (Wang et al., 2022).
  • Modeling task is to find distribution of all-atom structure x conditioned on CG structure X.
  • Distribution is factorized as a latent variable model with a prior and decoder.
  • Encoder is introduced to train the prior and decoder.
  • During training, CG latent variable z is sampled from encoder.
  • During sampling, latent variable is sampled from prior.
  • Latent representation is passed to decoder to generate all-atom structure.
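
The factorization described above can be written out as a standard conditional latent variable model. The parameter symbols θ (decoder), ψ (prior), and φ (encoder) are notation assumed here for illustration; only x, X, and z come from the summary.

```latex
% Conditional distribution of the all-atom structure x given the CG structure X
p(x \mid X) = \int p_\theta(x \mid X, z)\, p_\psi(z \mid X)\, dz

% Evidence lower bound, using an encoder q_\phi(z \mid x, X)
\log p(x \mid X) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, X)}\!\left[ \log p_\theta(x \mid X, z) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x, X) \,\|\, p_\psi(z \mid X) \right)
```

During training z is drawn from the encoder q; at sampling time the all-atom structure is unknown, so z is drawn from the prior, which conditions on X only.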

Model architecture

  • Introduce an equivariant encoder and prior architecture to learn spatial interdependence of atom and residue placements
  • Model molecular structures as graphs with nodes as residues and atoms
  • Use geometric tensors to represent node attributes and SE(3)-equivariant neural networks to perform message passing
  • Digitize protein molecular graph by assigning residue and atom identity as initial node attributes
  • Encoder performs message passing at three levels: atom-atom, atom-residue, and residue-residue
  • Prior performs message passing at residue level only
  • Decoder architecture allows flexibility on torsion angles and gives constrained predictions on local structures
  • Train model to maximize the Evidence Lower Bound (ELBO), i.e. minimize the negative ELBO as the loss
  • Supervise model on topology and atom placements in 3D space
  • Use Mean-Squared-Error (MSE) loss term on bond lengths and periodic angular loss term for angles
  • Use RMSD loss term in Cartesian coordinate space
  • Use steric clash loss as an auxiliary learning objective
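
The summary names the loss terms but not their functional forms, so the sketch below uses common choices as assumptions: a `1 - cos` penalty for periodic angles and a hinge penalty on non-bonded pairs closer than a cutoff. The function names and the 2.0 Å default cutoff are hypothetical.

```python
import numpy as np

def bond_mse_loss(pred, target):
    """Mean squared error on bond lengths."""
    return float(np.mean((pred - target) ** 2))

def periodic_angle_loss(pred, target):
    """Angle loss that is invariant to 2*pi shifts (assumed form: 1 - cos)."""
    return float(np.mean(1.0 - np.cos(pred - target)))

def steric_clash_loss(coords, bonded_mask, cutoff=2.0):
    """Hinge penalty on non-bonded atom pairs closer than `cutoff` (assumed form).

    coords:      (n_atoms, 3) Cartesian coordinates.
    bonded_mask: (n_atoms, n_atoms) boolean array, True where atoms are bonded.
    """
    i, j = np.triu_indices(len(coords), k=1)        # each pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    penalty = np.clip(cutoff - dist, 0.0, None)     # zero beyond the cutoff
    return float(penalty[~bonded_mask[i, j]].sum())
```

A plain MSE on angles would treat π and -π as maximally different even though they describe the same geometry, which is why a periodic form is needed for torsions.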

Experiments

  • Performed ablation studies on model architecture and loss functions
  • Compared model with CGVAE
  • CGVAE modified to take multiple proteins as training data
  • Performed five random seed experiments and reported mean and variance of metrics
  • Referred to structures decoded from encoder-sampled latent variables as reconstructed
  • Referred to structures generated from prior sampling as sampled structures

Test proteins

  • Tested model with four proteins of varying flexibility and compactness
  • PED00055 and PED00090 are mostly globular with short disordered tails
  • PED00151 is an IDP
  • PED00218 is a complex of a globular protein and an IDP

Metrics

  • Evaluated model performance with 3 metrics: root-mean-square deviation (RMSD), graph edit distance (GED), and Steric Clash Score
  • Reported RMSD between ground truth and reconstructed structures
  • Measured sample quality by how well the original chemical bond graph is preserved (quantified by GED ratio)
  • Reported ratio of steric clash occurrence in atom-atom pairs within 5.0 Å distance
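
The three metrics can be sketched as below. Several details are assumptions rather than facts from the summary: the 1.2 Å clash threshold, the exclusion of bonded pairs from the clash count, and the GED proxy, which counts edge insertions and deletions under a fixed atom correspondence and normalizes by the number of reference bonds. All function names are hypothetical.

```python
import numpy as np

def rmsd(x_true, x_pred):
    """Root-mean-square deviation between two (n_atoms, 3) structures."""
    return float(np.sqrt(np.mean(np.sum((x_true - x_pred) ** 2, axis=1))))

def steric_clash_ratio(coords, bonded_mask, neighbor_cutoff=5.0, clash_cutoff=1.2):
    """Fraction of non-bonded pairs within `neighbor_cutoff` that clash.

    clash_cutoff is an assumed threshold; bonded_mask is a boolean
    (n_atoms, n_atoms) array, True where atoms are bonded.
    """
    i, j = np.triu_indices(len(coords), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    near = (dist < neighbor_cutoff) & ~bonded_mask[i, j]
    if not near.any():
        return 0.0
    return float(np.mean(dist[near] < clash_cutoff))

def ged_ratio(true_bonds, pred_bonds):
    """Edge edits needed to turn the predicted bond graph into the reference,
    divided by the number of reference bonds (assumed normalization)."""
    true_set = {tuple(sorted(e)) for e in true_bonds}
    pred_set = {tuple(sorted(e)) for e in pred_bonds}
    return len(true_set ^ pred_set) / len(true_set)
```

In this sketch `pred_bonds` would come from the generated structure, for example by thresholding interatomic distances, while `true_bonds` is the known chemical bond graph.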

Results

  • Transferable models trained with 88 protein ensembles (m1-m4) show best performance for every metric
  • Equivariant encoder/prior is important for model performance
  • Models with Cartesian coordinate decoder (m3, m4) fail to give high-quality reconstructions
  • Internal coordinate-based decoding coupled with equivariant encoder/prior can faithfully keep topology
  • Model trained on single protein structure (m5, m6) performs worse than generalized model (m1)
  • Learning objective L_xyz is critical for optimal model performance; L_torsion slightly improves model performance
  • Removing L_steric increases steric clash ratio
  • Reconstructed and sampled structures recover topology faithfully and avoid steric clashes
  • Long-range interactions are preserved
  • Torsion angle distribution is recovered well
  • Sampling speed is approximately 0.009 seconds per frame
  • Model can be used for protein-protein docking
  • Model can be applied to nucleic acids and nucleic acid-protein complexes