Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites.
  • Existing predictive tools are limited: they either operate under overly rigid constraints or decode lossy, nonphysical discretized spectrum vectors.
  • A new intermediate strategy is introduced for predicting mass spectra from molecules by treating the spectra as sets of chemical formulae.
  • A prefix tree structure is used to decode the formula set, atom-type by atom-type, representing a general method for ordered multiset decoding.
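
To make the prefix tree idea concrete, here is a minimal sketch of how a set of product formulae could be stored and decoded one atom type at a time. The element ordering, the nested-dict layout, and the helper names (`to_counts`, `build_prefix_tree`, `decode`) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical fixed element ordering; each tree level corresponds to one element.
ELEMENT_ORDER = ("C", "H", "N", "O")

def to_counts(formula):
    """Convert a formula dict such as {"C": 2, "H": 6} into an ordered count tuple."""
    return tuple(formula.get(el, 0) for el in ELEMENT_ORDER)

def build_prefix_tree(formulae):
    """Insert every formula into a nested-dict prefix tree keyed by atom count."""
    root = {}
    for formula in formulae:
        node = root
        for count in to_counts(formula):
            node = node.setdefault(count, {})
    return root

def decode(tree, prefix=()):
    """Walk the tree level by level and return all complete count tuples."""
    if len(prefix) == len(ELEMENT_ORDER):
        return [prefix]
    results = []
    for count in sorted(tree):
        results.extend(decode(tree[count], prefix + (count,)))
    return results

# Toy product formulae, invented for illustration only.
product_formulae = [
    {"C": 2, "H": 4, "O": 1},
    {"C": 2, "H": 6, "O": 1},
    {"C": 1, "H": 4},
]
tree = build_prefix_tree(product_formulae)
decoded = decode(tree)
```

Formulae that share a prefix (here, the two formulae with two carbons) share a branch, so the decoder only has to decide on carbon counts once before moving on to hydrogen.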

Paper Content

Background

  • Introduction to tandem mass spectrometry
  • Previous approaches to modeling the process
  • Utilizing tools to discover molecules from new spectra
  • Further details on the physical process of mass spectrometry are given in [23]

Tandem mass spectrometry

  • Tandem mass spectrometry (MS/MS) measures the fragmentation patterns of molecules in a multi-stage process
  • The input is a solution containing a precursor molecule with a chemical formula
  • The precursor is ionized by bonding or associating with an adduct
  • A first mass analyzer measures the mass-to-charge ratio (m/z) of the ionized precursor
  • The ions are filtered into a collision cell, where they are broken down into product ions
  • The product ions are measured by a second mass analyzer
  • Spectrum prediction strategies include fragmentation prediction, binned prediction, and formula prediction
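
As a rough illustration of why a chemical formula pins down a peak position, the sketch below computes a monoisotopic mass and the m/z of a singly charged [M+H]+ ion. The choice of adduct, the glucose example, and the function names are assumptions made for illustration; the summary does not state which adducts the paper considers.

```python
# Monoisotopic masses (in Da) of a few common elements.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
}
PROTON_MASS = 1.00727646  # mass gained by an [M+H]+ ion with charge +1

def formula_mass(counts):
    """Sum the monoisotopic masses of all atoms in a formula dict."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in counts.items())

def mz_m_plus_h(counts):
    """m/z of the singly charged [M+H]+ ion (charge 1, so m/z equals mass)."""
    return formula_mass(counts) + PROTON_MASS

glucose = {"C": 6, "H": 12, "O": 6}  # C6H12O6, used only as an example
neutral_mass = formula_mass(glucose)
precursor_mz = mz_m_plus_h(glucose)
```

Because each product formula maps to an exact mass in this way, predicting formulae avoids the rounding that a binned spectrum vector introduces.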

Mass spectrum libraries

  • Spectrum predictors are used to build large libraries of molecule spectra.
  • These libraries are used to determine properties of unidentified molecules and to infer an unknown molecule’s structure from a newly observed spectrum.
  • Evaluation of spectrum predictors is based on accuracy of prediction and ability to assist with retrieval.

Model

  • SCARF is a model for predicting mass spectra from precursor molecules
  • SCARF is composed of two learned functions: one mapping from the original molecule to a set of product formulae, and one mapping from the set of formulae to the respective intensities
  • SCARF-Thread predicts a prefix tree for new molecules at test time
  • SCARF-Thread treats the counts of subsequent atoms as dependent only on the counts of atoms predicted so far
  • SCARF-Thread poses the prediction of the set of child nodes as a multi-label binary classification problem
  • SCARF-Thread takes as input an embedding of the overall molecule, a vector representing the counts of the atoms in the prefix so far, the difference of the counts predicted so far from the precursor molecule, and a one-hot representation of the atom for which the counts are currently being predicted
  • SCARF-Weave uses a Set Transformer to take into account all of the formulae present in the mass spectrum when predicting final intensities
  • SCARF-Weave represents the inputs by concatenating a vector embedding of the initial molecular graph with count-based embeddings of the product formula and its difference from the precursor formula
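
The SCARF-Thread inputs and the multi-label step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the element ordering, the zero-padding of not-yet-predicted counts, the 0.5 threshold, and the cap at the precursor's atom count are all choices made here, and the probabilities are hard-coded stand-ins for a learned network's output.

```python
ELEMENT_ORDER = ("C", "H", "N", "O")  # hypothetical element ordering

def thread_inputs(mol_embedding, prefix, precursor):
    """Concatenate the four input pieces for one prefix-tree node."""
    n = len(ELEMENT_ORDER)
    depth = len(prefix)
    # Counts chosen so far, zero-padded for elements not yet decided (assumption).
    prefix_counts = list(prefix) + [0] * (n - depth)
    # Difference between the precursor counts and the counts chosen so far.
    remaining = [precursor[i] - prefix_counts[i] for i in range(n)]
    # One-hot marker of the element whose count is being predicted next.
    one_hot = [1 if i == depth else 0 for i in range(n)]
    return list(mol_embedding) + prefix_counts + remaining + one_hot

def expand_children(probabilities, max_count, threshold=0.5):
    """Multi-label step: keep every candidate count whose probability passes
    the threshold and that does not exceed the precursor's count."""
    return [
        count
        for count, p in enumerate(probabilities)
        if count <= max_count and p >= threshold
    ]

# Toy values: a 2-dimensional molecule embedding and a C6H12O6 precursor.
features = thread_inputs([0.1, 0.2], prefix=(2,), precursor=(6, 12, 0, 6))
children = expand_children([0.1, 0.9, 0.2, 0.8, 0.7], max_count=3)
```

Each surviving count becomes a child node, and the same procedure is repeated for the next element in the ordering.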

Training and inference

  • SCARF-Thread and SCARF-Weave are trained separately
  • SCARF-Weave is trained using a cosine loss
  • SCARF-Thread is trained using binary cross entropy losses
  • Teacher forcing is used to train each level of the tree in parallel
  • When generating from the model, the top 300 product formulae are always selected
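
A minimal sketch of the two training objectives and the top-k selection step, written in plain Python for readability. It assumes the cosine loss is one minus cosine similarity and that each candidate formula already carries a single score; how the paper computes those scores is not stated in this summary, and the function names are hypothetical.

```python
import math

def cosine_loss(pred, target):
    """One minus the cosine similarity between two intensity vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm if norm > 0 else 1.0

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy over independent yes/no labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def top_k_formulae(scores, k=300):
    """Return the k highest-scoring formulae (k=300 in the summary above)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [formula for formula, _ in ranked[:k]]

# Toy scores, invented for illustration only.
selected = top_k_formulae({"C2H4O": 0.9, "CH4": 0.1, "C2H6O": 0.5}, k=2)
```

In practice these losses would come from a deep learning framework; the hand-written versions only show what is being optimized.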

Experiments

  • Evaluated SCARF on prediction of spectra
  • Evaluated SCARF on identification of unknown molecules

Dataset

  • SCARF is trained and validated on two libraries: NIST20 and CANOPUS.
  • MAGMa algorithm is used to standardize product formulae annotations for supervision.
  • SCARF can be trained with any product formula annotations.

Spectra prediction

  • SCARF-Thread is the model component that predicts product formulae
  • SCARF-Thread is compared to several baselines and outperforms them
  • SCARF-Thread covers 90% and 74% of the true formulae in the NIST20 and CANOPUS test sets, respectively
  • SCARF-Weave is used to predict intensity of spectra
  • SCARF-Weave is compared to three baselines and is more accurate
  • SCARF-Weave is more physically grounded and runs two orders of magnitude faster than CFM-ID
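
The coverage figures above can be read as the fraction of ground-truth product formulae that appear in the predicted set. The sketch below shows that calculation on toy data; the function name and the formula strings are made up for illustration, and the paper's exact evaluation protocol may differ.

```python
def coverage(predicted, true):
    """Fraction of ground-truth formulae that appear in the predicted set."""
    true_set = set(true)
    if not true_set:
        return 0.0  # convention chosen here for an empty ground truth
    return len(true_set & set(predicted)) / len(true_set)

# Toy example: two of the four true formulae are recovered.
predicted_formulae = {"C2H4O", "CH4", "C3H8"}
true_formulae = {"C2H4O", "CH4", "C2H6O", "H2O"}
fraction_covered = coverage(predicted_formulae, true_formulae)
```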

Retrieval

  • Forward spectrum prediction can be used to determine molecular structure assignments.
  • A retrieval task was designed to showcase the potential of forward spectrum prediction models.
  • SCARF achieved improved top-1 and top-5 retrieval accuracy compared to the second-best method.
  • Forward models predict the spectrum given the molecule
  • Backward models start from the spectrum and predict features or structure of the molecule
  • Deep representation learning techniques are used for both small molecules and proteomics
  • SCARF-Thread generates a set as output
  • Each member of the set represents a multiset of atom types

Conclusion

  • Introduced SCARF, an approach utilizing prefix tree data structures to decode mass spectra from molecules
  • Combines advantages of previous neural and fragment based approaches
  • More accurate in predicting experimentally-observed spectra and better able to identify and label unknown spectra
  • Limitation: the approach is data dependent and relies on the quality of product formula label assignments
  • Future directions include modeling covariates more carefully and grounding product formulae in molecular graph substructures
  • Benchmarked models in terms of retrieval accuracy
  • Showcased additional spectra predictions from model trained on NIST20
  • NIST20 dataset is filtered, normalized, and subsetted
  • The formulation assumes no adduct switching
  • Multi-label binary classification task to predict binary label for each possible count
  • SCARF-Weave takes in product formulae and predicts their intensities
  • Graph neural network (GNN) atom features