Abstract
- Coarse-graining (CG) is a method used to accelerate molecular simulations of protein dynamics.
- Backmapping is the opposite operation of bringing lost atomistic details back from the CG representation.
- Machine learning (ML) has been used to produce accurate and efficient CG simulations of proteins, but fast and reliable backmapping remains a challenge.
- Rule-based methods produce poor all-atom geometries, needing computationally costly refinement.
- ML approaches outperform traditional baselines but are not transferable between proteins and sometimes generate unphysical atom placements.
- This work addresses both issues to build a fast, transferable, and reliable generative backmapping tool for CG protein representations.
Paper Content
Introduction
- Protein dynamics ranges from large to small movements
- Connected to essential biological functions
- Data scarce, so research on conformational ensembles started recently
- Experimental structure determination methods not suitable for describing individual dynamic states
- Conformational ensembles generated using simulations
- Atomistic simulations too computationally expensive
- Coarse-grained simulations used to overcome limitations
- Atom-level contacts essential to understand molecular recognition
- Backmapping required to get complete picture of protein function
- Popular backmapping methods involve two steps: rule-based initial atom placement followed by computationally costly refinement
- Data-driven methods proposed to achieve efficiency and successful restoration
- Proposed deep generative backmapping tool with transferability across protein space
- Model reconstructs the all-atom protein structure from the alpha carbon (Cα) position of each amino acid
- Model utilizes equivariant encoder and loss functions to enforce physical constraints
Methods
Data
- The Protein Ensemble Database (PED) contains 227 entries of protein structural ensembles
- Most entries are computationally generated and experimentally constrained
- Experimental validation reduces potential bias from sampling errors
- 84 proteins selected for training, 4 proteins for testing
CG mapping scheme
- Each amino acid residue is represented as one bead centered at its Cα
- Popular medium resolution coarse-grained models include CABS and MARTINI
- Backmapping algorithms start from the Cα trace level
- Internal coordinates are used to preserve bond topology
- GenZProt reconstructs bond topology by generating internal coordinate representation of each atom
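The Cα-based mapping above can be illustrated with a minimal sketch. It assumes the all-atom structure is already available as a list of PDB-style atom names plus a coordinate array containing protein atoms only; the function name `ca_trace` is hypothetical and not taken from the paper's code.

```python
import numpy as np

def ca_trace(atom_names, coords):
    """Return one bead per residue: the coordinates of every alpha carbon.

    Assumes protein atoms only, so the name "CA" always means alpha carbon.
    """
    atom_names = np.asarray(atom_names)
    coords = np.asarray(coords, dtype=float)
    return coords[atom_names == "CA"]

# Two residues with backbone atoms N, CA, C, O each (dummy coordinates).
names = ["N", "CA", "C", "O", "N", "CA", "C", "O"]
xyz = np.arange(24, dtype=float).reshape(8, 3)
beads = ca_trace(names, xyz)  # shape (2, 3): one bead per residue
```

Backmapping is the inverse problem: recovering all eight rows of `xyz` from the two rows of `beads`.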
Internal coordinate-based structure generation
- Model generates internal coordinates (a Z-matrix), which are then converted to Cartesian coordinates
- Placement of an atom A in 3D space is determined from three anchor atoms B, C, D and a set of internal coordinates
- Backbone atoms are placed using Cα atoms as anchors, and side chain atoms are placed sequentially
- Machine learning model can learn to predict placement of backbone atoms relative to three adjacent Cα atoms
- Atoms are added to 3D space sequentially
- Decoder generates all internal coordinates simultaneously in one shot
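The placement of one atom from three anchors can be sketched as below. This is a generic internal-to-Cartesian conversion, not necessarily the paper's exact routine. The convention is an assumption made for illustration: A is bonded to B, the bond angle is A-B-C, and the torsion is A-B-C-D, all in radians. The function name `place_atom` is hypothetical.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def place_atom(b, c, d, bond_length, bond_angle, torsion):
    """Place atom A from anchors B, C, D and three internal coordinates.

    Assumed convention: |A-B| = bond_length, angle(A, B, C) = bond_angle,
    dihedral(A, B, C, D) = torsion (angles in radians).
    """
    b, c, d = (np.asarray(p, dtype=float) for p in (b, c, d))
    cb = _unit(b - c)                  # unit vector pointing from C to B
    n = _unit(np.cross(c - d, cb))     # normal of the plane through D, C, B
    m = np.cross(n, cb)                # completes the local right-handed frame
    # Coordinates of A in the local frame (cb, m, n) centred on B.
    local = bond_length * np.array([
        -np.cos(bond_angle),
        np.sin(bond_angle) * np.cos(torsion),
        np.sin(bond_angle) * np.sin(torsion),
    ])
    return b + local[0] * cb + local[1] * m + local[2] * n
```

For example, `place_atom([0, 0, 1], [0, 0, 0], [1, 0, 0], 1.0, np.pi / 2, np.pi / 2)` places A at approximately (0, 1, 1). Once A is placed, it can serve as an anchor for the next atom, which is why the Cartesian conversion is sequential even when all internal coordinates are predicted at once.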
VAE framework
- Model is based on VAE framework introduced in (Wang et al., 2022).
- Modeling task is to find distribution of all-atom structure x conditioned on CG structure X.
- Distribution is factorized as a latent variable model with a prior and decoder.
- Encoder is introduced to train the prior and decoder.
- During training, CG latent variable z is sampled from encoder.
- During sampling, latent variable is sampled from prior.
- Latent representation is passed to decoder to generate all-atom structure.
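The factorization described above can be written out as follows. This is the standard conditional-VAE form, given as a sketch; the parameter symbols θ (decoder), ψ (prior), and φ (encoder) are notation assumed here for illustration and are not taken from the summary.

```latex
\begin{align}
p(x \mid X) &= \int p_\theta(x \mid X, z)\, p_\psi(z \mid X)\, \mathrm{d}z \\
\log p(x \mid X) &\ge \mathbb{E}_{q_\phi(z \mid x, X)}\big[\log p_\theta(x \mid X, z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x, X)\,\|\,p_\psi(z \mid X)\big)
\end{align}
```

The first line is the latent variable model with prior and decoder. The second line is the bound that the encoder makes tractable: during training z is drawn from the encoder q, while at sampling time only the prior and decoder are needed.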
Model architecture
- Introduce an equivariant encoder and prior architecture to learn spatial interdependence of atom and residue placements
- Model molecular structures as graphs with nodes as residues and atoms
- Use geometric tensors to represent node attributes and SE(3)-equivariant neural networks to perform message passing
- Digitize protein molecular graph by assigning residue and atom identity as initial node attributes
- Encoder performs message passing at three levels: atom-atom, atom-residue, and residue-residue
- Prior performs message passing at residue level only
- Decoder architecture allows flexibility on torsion angles and gives constrained predictions on local structures
- Train model to minimize the negative Evidence Lower Bound (ELBO) objective
- Supervise model on topology and atom placements in 3D space
- Use Mean-Squared-Error (MSE) loss term on bond lengths and periodic angular loss term for angles
- Use RMSD loss term in Cartesian coordinate space
- Use steric clash loss as an auxiliary learning objective
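The loss terms listed above can be sketched in NumPy as follows. The exact functional forms used in the paper are not given in this summary, so these are common choices assumed for illustration: `1 - cos` as the periodic angle penalty and a hinge on pairwise distance as the steric penalty. The function names and the `min_dist` value are hypothetical.

```python
import numpy as np

def bond_length_mse(pred, true):
    """Mean squared error on bond lengths."""
    return float(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

def periodic_angle_loss(pred, true):
    """Angle penalty that is zero whenever the angles agree modulo 2*pi."""
    delta = np.asarray(pred) - np.asarray(true)
    return float(np.mean(1.0 - np.cos(delta)))

def steric_clash_loss(coords, bonded_pairs, min_dist=2.0):
    """Hinge penalty on non-bonded atom pairs closer than min_dist.

    min_dist (same units as coords) is an assumed threshold.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # each pair counted once
    for i, j in bonded_pairs:
        mask[min(i, j), max(i, j)] = False            # skip bonded pairs
    return float(np.clip(min_dist - dist[mask], 0.0, None).sum())
```

A plain MSE on angles would penalize a prediction of 359° against a target of 1° as a large error, which is why a periodic form is used for angular terms. In a real training loop these would be written in an autodiff framework rather than NumPy.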
Experiments
- Performed ablation studies on model architecture and loss functions
- Compared model with CGVAE
- CGVAE modified to take multiple proteins as training data
- Performed five random seed experiments and reported mean and variance of metrics
- Referred to structures decoded from encoder-sampled latent variables as reconstructed
- Referred to structures generated from prior sampling as sampled structures
Test proteins
- Tested model with four proteins of varying flexibility and compactness
- PED00055 and PED00090 are mostly globular with short disordered tails
- PED00151 is an intrinsically disordered protein (IDP)
- PED00218 is a complex of a globular protein and an IDP
Metrics
- Evaluated model performance with 3 metrics: root-mean-square deviation (RMSD), graph edit distance (GED), and Steric Clash Score
- Reported the RMSD between ground truth and reconstructed structures
- Measured sample quality by how well the original chemical bond graph is preserved (quantified by GED ratio)
- Reported ratio of steric clash occurrence in atom-atom pairs within 5.0 Å distance
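Two of these metrics can be sketched as below. The 5.0 Å neighbor cutoff comes from the text. The clash distance of 1.2 Å is an assumed value, and the sketch assumes the input holds heavy atoms only and does not treat bonded pairs specially. Function names are hypothetical.

```python
import numpy as np

def rmsd(ref, recon):
    """Root-mean-square deviation between two (N, 3) coordinate arrays.

    No superposition is applied: both structures are assumed to share
    the same frame because they are built on the same CG trace.
    """
    ref = np.asarray(ref, dtype=float)
    recon = np.asarray(recon, dtype=float)
    return float(np.sqrt(np.mean(np.sum((ref - recon) ** 2, axis=-1))))

def steric_clash_ratio(coords, neighbor_cutoff=5.0, clash_dist=1.2):
    """Fraction of atom pairs within neighbor_cutoff that are closer
    than clash_dist (clash_dist is an assumed threshold, in angstroms)."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)   # every unordered pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=-1)
    near = dist < neighbor_cutoff
    if not near.any():
        return 0.0
    return float(np.mean(dist[near] < clash_dist))
```

Restricting the denominator to pairs within 5.0 Å keeps the ratio from shrinking simply because a protein is large and most of its atom pairs are far apart.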
Results
- Transferable models trained with 84 protein ensembles (m1-m4) show best performance for every metric
- Equivariant encoder/prior is important for model performance
- Models with Cartesian coordinate decoder (m3, m4) fail to give high-quality reconstructions
- Internal coordinate-based decoding coupled with equivariant encoder/prior can faithfully keep topology
- Models trained on a single protein (m5, m6) perform worse than the generalized model (m1)
- The learning objective L_xyz is critical for optimal model performance; L_torsion slightly improves model performance
- Removing L_steric increases the steric clash ratio
- Reconstructed and sampled structures recover topology faithfully and avoid steric clashes
- Long-range interactions are preserved
- Torsion angle distribution is recovered well
- Sampling speed is approximately 0.009 seconds per frame
- Model can be used for protein-protein docking
- Model can be applied to nucleic acids and nucleic acid-protein complexes