Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites.
- Predictive tools are limited, either operating with overly rigid constraints or by decoding lossy and nonphysical discretized spectra vectors.
- A new intermediate strategy is introduced for predicting mass spectra from molecules by treating them as sets of chemical formulae.
- A prefix tree structure is used to decode the formula set, atom-type by atom-type, representing a general method for ordered multiset decoding.
Paper Content
Background
- Introduction to tandem mass spectrometry
- Previous approaches to modeling the process
- Utilizing tools to discover molecules from new spectra
- Further details on physical process of mass spectrometry in [23]
Tandem mass spectrometry
- MS/MS measure fragmentation patterns of molecules in a multi-stage process
- Input is a solution containing a precursor molecule with a chemical formula
- Ionized by bonding or associating with an adduct
- Mass analyzer measures mass-to-charge ratio
- Filtered into a collision cell, broken down into product ions
- Product ions measured by second mass analyzer
- Spectrum prediction strategies include fragmentation, binned, and formula prediction
Mass spectrum libraries
- Spectrum predictors are used to build large libraries of molecule spectra.
- These libraries are used to determine properties of unidentified molecules and to infer an unknown molecule’s structure from a newly observed spectrum.
- Evaluation of spectrum predictors is based on accuracy of prediction and ability to assist with retrieval.
Model
- SCARF is a model for predicting mass spectra from precursor molecules
- SCARF is composed of two learned functions: one mapping from the original molecule to a set of product formulae, and one mapping from the set of formulae to the respective intensities
- SCARF-Thread predicts a prefix tree for new molecules at test time
- SCARF-Thread treats the counts of subsequent atoms as dependent only on the counts of atoms predicted so far
- SCARF-Thread poses the prediction of the set of child nodes as a multi-label binary classification problem
- SCARF-Thread takes as input an embedding of the overall molecule, a vector representing the counts of the atoms in the prefix so far, the difference of the counts predicted so far from the precursor molecule, and a one-hot representation of the atom for which the counts are currently being predicted
- SCARF-Weave uses a Set Transformer to take into account all of the formulae present in the mass spectrum when predicting final intensities
- SCARF-Weave represents the inputs by concatenating a vector embedding of the initial molecular graph with count-based embeddings of the product formula and its difference from the precursor formula
Training and inference
- SCARF-Thread and SCARF-Weave are trained separately
- SCARF-Weave is trained using a cosine loss
- SCARF-Thread is trained using binary cross entropy losses
- Teacher forcing is used to train each level of the tree in parallel
- Top 300 product formulae are always picked when generating from the model
Experiments
- Evaluated SCARF on prediction of spectra
- Evaluated SCARF on identification of unknown molecules
Dataset
- SCARF is trained and validated on two libraries: NIST20 and CANOPUS.
- MAGMa algorithm is used to standardize product formulae annotations for supervision.
- SCARF can be trained with any product formula annotations.
Spectra prediction
- SCARF-Thread is a computer program used to predict product formulae
- SCARF-Thread is compared to several baselines and outperforms them
- SCARF-Thread is able to cover 90% and 74% of true formulae in the ground truth test set for NIST20 and CANOPUS respectively
- SCARF-Weave is used to predict intensity of spectra
- SCARF-Weave is compared to three baselines and is more accurate
- SCARF-Weave is more physically-grounded and operates 2 orders of magnitude faster than CFM-ID
Retrieval
- Forward spectrum prediction can be used to determine molecular structure assignments.
- A retrieval task was designed to showcase the potential of forward spectrum prediction models.
- SCARF achieved an improved top-1 and top-5 retrieval accuracy compared to the second best method.
Related work
- Forward models predict the spectrum given the molecule
- Backward models start from the spectrum and predict features or structure of the molecule
- Deep representation learning techniques are used for both small molecules and proteomics
- SCARF-Thread generates a set as output
- Each member of the set represents a multiset of atom types
Conclusion
- Introduced SCARF, an approach utilizing prefix tree data structures to decode mass spectra from molecules
- Combines advantages of previous neural and fragment based approaches
- More accurate in predicting experimentally-observed spectra and better able to identify and label unknown spectra
- Data dependent, reliant upon quality of product formula label assignment
- Future directions involve more carefully modeling covariates, grounding product formulae in molecular graph substructures
- Benchmarked models in terms of retrieval accuracy
- Showcased additional spectra predictions from model trained on NIST20
- NIST20 dataset filtered, normalized, filtered, and subsetted
- Assume no adduct switching in formulation
- SCARF-Thread predicts prefix tree for new molecules at test time
- Multi-label binary classification task to predict binary label for each possible count
- SCARF-Weave takes in product formulae and predicts their intensities
- Graph neural network (GNN) atom features