Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites.
Existing predictive tools are limited: they either operate under overly rigid constraints or decode lossy, nonphysical discretized spectrum vectors.
A new intermediate strategy is introduced for predicting mass spectra from molecules by treating them as sets of chemical formulae.
A prefix tree structure is used to decode the formula set, atom-type by atom-type, representing a general method for ordered multiset decoding.

Paper Content

Background

Introduction to tandem mass spectrometry
Previous approaches to modeling the process
Utilizing tools to discover molecules from new spectra
Further details on physical process of mass spectrometry in [23]

Tandem mass spectrometry

MS/MS measures the fragmentation patterns of molecules in a multi-stage process
The input is a solution containing a precursor molecule with a chemical formula
The precursor is ionized by bonding or associating with an adduct
A first mass analyzer measures the mass-to-charge ratio
The precursor ions are filtered into a collision cell and broken down into product ions
The product ions are measured by a second mass analyzer
Spectrum prediction strategies include fragmentation, binned, and formula prediction
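
As a worked illustration of the mass-to-charge measurement described above, the sketch below computes the monoisotopic mass of a formula and the m/z of its ion. It assumes a singly charged protonated adduct, [M+H]+, which is only one of several possible adducts. The element table covers four elements, and the function names are illustrative rather than taken from the paper.

```python
import re

# Monoisotopic masses in daltons for a few common elements (standard values).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
}
PROTON_MASS = 1.00727647  # mass of H+ (a hydrogen atom minus one electron)


def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())


def mz_protonated(formula):
    """m/z of the singly charged [M+H]+ ion (assumed adduct)."""
    return monoisotopic_mass(formula) + PROTON_MASS


# Glucose as a precursor: the [M+H]+ ion sits at about 181.0707.
glucose_mz = mz_protonated("C6H12O6")
```

Other adducts shift the value by a different mass, so the adduct has to be known before a measured m/z can be matched to a formula.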

Mass spectrum libraries

Spectrum predictors are used to build large libraries of molecule spectra.
These libraries are used to determine properties of unidentified molecules and to infer an unknown molecule’s structure from a newly observed spectrum.
Evaluation of spectrum predictors is based on accuracy of prediction and ability to assist with retrieval.

Model

SCARF is a model for predicting mass spectra from precursor molecules
SCARF is composed of two learned functions: one mapping from the original molecule to a set of product formulae, and one mapping from the set of formulae to the respective intensities
SCARF-Thread predicts a prefix tree for new molecules at test time
SCARF-Thread treats the counts of subsequent atoms as dependent only on the counts of atoms predicted so far
SCARF-Thread poses the prediction of the set of child nodes as a multi-label binary classification problem
SCARF-Thread takes as input an embedding of the overall molecule, a vector representing the counts of the atoms in the prefix so far, the difference of the counts predicted so far from the precursor molecule, and a one-hot representation of the atom for which the counts are currently being predicted
SCARF-Weave uses a Set Transformer to take into account all of the formulae present in the mass spectrum when predicting final intensities
SCARF-Weave represents the inputs by concatenating a vector embedding of the initial molecular graph with count-based embeddings of the product formula and its difference from the precursor formula
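
The prefix tree and the SCARF-Thread inputs listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the atom ordering `ELEMENTS`, the nested-dict tree, zero-padding of not-yet-predicted atoms, and all function names are hypothetical choices made for the example.

```python
ELEMENTS = ("C", "H", "N", "O")  # assumed fixed atom ordering (hypothetical)


def build_prefix_tree(formulae):
    """Build a nested-dict prefix tree from count tuples ordered like ELEMENTS."""
    root = {}
    for counts in formulae:
        node = root
        for count in counts:
            node = node.setdefault(count, {})
    return root


def child_labels(tree, prefix, max_count):
    """Binary targets for one node: label[c] is 1 if count c extends the prefix."""
    node = tree
    for count in prefix:
        node = node[count]
    return [1 if c in node else 0 for c in range(max_count + 1)]


def thread_inputs(prefix, precursor, mol_embedding):
    """Concatenate the four inputs into one flat feature vector.

    Atoms not yet predicted are padded with zero counts (an assumption).
    """
    depth = len(prefix)
    padded = list(prefix) + [0] * (len(ELEMENTS) - depth)
    diff = [p - c for p, c in zip(precursor, padded)]
    one_hot = [1 if i == depth else 0 for i in range(len(ELEMENTS))]
    return list(mol_embedding) + padded + diff + one_hot


# Toy example: a C6H12O6 precursor with three product formulae.
precursor = (6, 12, 0, 6)
products = {(6, 10, 0, 5), (6, 8, 0, 4), (3, 6, 0, 3)}
tree = build_prefix_tree(products)
carbon_labels = child_labels(tree, (), max_count=6)       # which C counts occur
hydrogen_labels = child_labels(tree, (6,), max_count=12)  # H counts given C=6
features = thread_inputs((6,), precursor, mol_embedding=[0.5, -0.5])
```

Each node's label vector is the multi-label target for one classification step, so formulae that share a prefix share the work of predicting it.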

Training and inference

SCARF-Thread and SCARF-Weave are trained separately
SCARF-Weave is trained using a cosine loss
SCARF-Thread is trained using binary cross entropy losses
Teacher forcing is used to train each level of the tree in parallel
Top 300 product formulae are always picked when generating from the model
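
The two losses and the fixed-size selection step can be written out in a few lines. The sketch below assumes the cosine loss is one minus the cosine similarity between predicted and true intensity vectors, and that every candidate formula carries a single score used for ranking. Neither detail is stated in this summary, and the function names are illustrative.

```python
import heapq
import math


def cosine_loss(predicted, target):
    """1 - cosine similarity between two intensity vectors of equal length."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm if norm > 0 else 1.0


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy over the per-count labels of one tree node."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def top_k_formulae(scored, k=300):
    """Keep the k highest-scoring candidates from a {formula: score} mapping."""
    return heapq.nlargest(k, scored, key=scored.get)


# Toy usage with k=2 instead of 300.
kept = top_k_formulae({"C6H10O5": 0.9, "C3H6O3": 0.5, "CH2O": 0.1}, k=2)
```

With teacher forcing, the label vectors for every level of the tree come from the ground-truth formulae, so the cross entropy terms for all levels can be computed without waiting for earlier predictions.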

Experiments

Evaluated SCARF on prediction of spectra
Evaluated SCARF on identification of unknown molecules

Dataset

SCARF is trained and validated on two libraries: NIST20 and CANOPUS.
The MAGMa algorithm is used to standardize product formula annotations for supervision.
SCARF can be trained with any product formula annotations.

Spectra prediction

SCARF-Thread is the model component that predicts product formulae
SCARF-Thread is compared to several baselines and outperforms them
SCARF-Thread covers 90% and 74% of the true formulae in the NIST20 and CANOPUS test sets, respectively
SCARF-Weave is used to predict intensity of spectra
SCARF-Weave is compared to three baselines and is more accurate
SCARF-Weave is more physically grounded and runs two orders of magnitude faster than CFM-ID
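
The coverage figures above measure how many of the true product formulae appear in the predicted set. A minimal sketch of that metric follows, assuming formulae are compared as exact strings and that an empty ground-truth set scores zero. Both are conventions chosen for this example, and the function name is illustrative.

```python
def coverage(predicted, true):
    """Fraction of true product formulae that appear in the predicted set."""
    true = set(true)
    if not true:
        return 0.0  # assumed convention for an empty ground-truth set
    return len(true & set(predicted)) / len(true)


# Toy example: two of the four true formulae are recovered.
predicted = {"C6H10O5", "C3H6O3", "CH2O"}
true = {"C6H10O5", "C3H6O3", "C2H4O2", "C4H8O4"}
score = coverage(predicted, true)  # 0.5
```

Averaging this value over all test spectra gives a dataset-level coverage number comparable in spirit to the percentages reported above.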

Retrieval

Forward spectrum prediction can be used to determine molecular structure assignments.
A retrieval task was designed to showcase the potential of forward spectrum prediction models.
SCARF achieved higher top-1 and top-5 retrieval accuracy than the second-best method.
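
A retrieval task of this kind can be sketched as follows: predict a spectrum for each candidate molecule, rank the candidates by similarity to the observed spectrum, and count how often the true molecule lands in the top k. The sketch assumes spectra are fixed-length intensity vectors compared by cosine similarity. The candidate names and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two intensity vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_candidates(observed, predicted_by_candidate):
    """Sort candidate names by similarity of their predicted spectrum to the observed one."""
    return sorted(
        predicted_by_candidate,
        key=lambda name: cosine_similarity(observed, predicted_by_candidate[name]),
        reverse=True,
    )


def top_k_accuracy(rankings, true_names, k):
    """Fraction of queries whose true candidate appears in the first k ranks."""
    hits = sum(1 for ranking, truth in zip(rankings, true_names) if truth in ranking[:k])
    return hits / len(true_names)


# Toy query with three hypothetical candidates.
observed = [1.0, 0.0, 0.5]
candidates = {"A": [1.0, 0.0, 0.4], "B": [0.0, 1.0, 0.0], "C": [0.5, 0.5, 0.5]}
ranking = rank_candidates(observed, candidates)
```

Top-1 and top-5 accuracy are the same function evaluated at k=1 and k=5 over all query spectra.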

Related work

Forward models predict the spectrum given the molecule
Backward models start from the spectrum and predict features or structure of the molecule
Deep representation learning techniques are used for both small molecules and proteomics
SCARF-Thread generates a set as output
Each member of the set represents a multiset of atom types

Conclusion

Introduced SCARF, an approach utilizing prefix tree data structures to decode mass spectra from molecules
Combines advantages of previous neural and fragment based approaches
More accurate in predicting experimentally-observed spectra and better able to identify and label unknown spectra
Data dependent, reliant upon quality of product formula label assignment
Future directions involve more carefully modeling covariates, grounding product formulae in molecular graph substructures
Benchmarked models in terms of retrieval accuracy
Showcased additional spectra predictions from model trained on NIST20
NIST20 dataset filtered, normalized, and subsetted
Assume no adduct switching in formulation
SCARF-Thread predicts prefix tree for new molecules at test time
Multi-label binary classification task to predict binary label for each possible count
SCARF-Weave takes in product formulae and predicts their intensities
Graph neural network (GNN) atom features
