Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Activity and property prediction models are used in drug discovery and materials sciences.
  • Currently, these models have to be trained or fine-tuned for new tasks.
  • Scientific language models can be used for low-data tasks without training or fine-tuning.
  • Predictive quality of language models is lacking.
  • A novel type of activity prediction model is proposed that can adapt to new tasks at inference time.
  • Modularized architecture and pre-training objective are proposed.
  • Experiments show improved predictive performance on few-shot learning benchmarks and zero-shot problems.

Paper Content

Introduction

  • Activity and property prediction models are used in computational drug discovery, analogous to language models in NLP and image classification models in computer vision
  • Predictions are based on chemical structure and properties
  • Machine learning methods have been used since the early 90s
  • Deep learning has been used to create molecule encoders
  • Activity prediction models are used to select or rank molecules for further testing or to flag unwanted properties
  • Activity prediction models are usually trained on pairs of molecules and activity labels from biological experiments
  • Few data points are available for training
  • Scientific language models can utilize both natural language and chemical structure
  • SLMs tokenize SMILES representations of chemical structures and embed them in the same embedding space as language tokens
  • SLMs are suboptimal activity predictors
  • Cross-modal contrastive learning is proposed to use an effective molecule encoder and utilize chemical databases as training or pre-training data
  • CLIP uses both an effective language encoder and an effective vision encoder
  • Chemical databases contain orders of magnitude more molecules with associated biological information than biomedical texts
  • Zero-shot problem in drug discovery is equivalent to the library design problem
  • Scientific language models process chemical inputs
  • Large language models are trained with masking objective
  • MolT5 uses special objective and fine-tunes on molecule caption generation
  • Input tokens represent chemical structures or sub-structures
  • Activity and property prediction models improved with Deep Learning
  • Few-shot learning or low-resource drug discovery approaches exist
  • Cross-modal contrastive learning methods used in drug discovery
  • Chemical structure and natural language used as data modalities

Experiments and results

  • Experiment a) tested ability of models to predict activity of molecules from only a textual description
  • Experiment b) tested if learned molecule representations of different methods are rich and transferable
  • Experiment c) tested use of method as a retrieval system to rank active molecules higher than inactive molecules

Zero-shot transfer

  • CLAMP is a method used to predict the activity of molecules
  • FS-Mol and PubChem databases are used
  • Three different splits of PubChem are constructed
  • 1-NN, soft-NN, FH, GAL 125M, and KV-PLM are compared
  • CLAMP outperforms other methods except FH on two splits

Representation learning

  • Linear probing used to assess robustness and transferability of embeddings
  • Datasets used: MoleculeNet, BACE, BBBP, ClinTox, HIV, SIDER, Tox21, Tox21-10k
  • Methods compared: Morgan fingerprints, chemical descriptors, Grover, Galactica1
  • CLAMP performs best on average and significantly outperforms other methods
  • CLAMP improves predictive performance on ToxCast

Retrieval and library design

  • Retrieval task: rank molecules from chemical database based on given bioassay
  • Metric: enrichment factor (EF)
  • Dataset: PubChem, 190 assays for 10k molecules, 2,543 assays for 1M molecules
  • Methods compared: CLAMP, FH, KV-PLM, Galactica
  • Results: CLAMP outperforms all other methods by 10-fold (10k) and 250-fold (1M)

Discussion

  • Proposed contrastive learning method CLAMP exhibits best performance at zero-shot prediction drug discovery tasks
  • Pre-trained molecule-encoder of CLAMP yields transferable representations
  • Scientific language models can in principle be used for zero-shot activity prediction, but not performing well and computationally demanding
  • CLAMP embedding space prime candidate for conditional generation
  • Cold-start problem identified by recommender systems and matrix factorization research community
  • Contrastive learning suggested to learn similarities between users and items
  • Data-driven strategies not possible for new bioassays
  • Textual description of bioassay can be leveraged with contrastive learning
  • Datasets used in this work overviewed
  • FS-Mol dataset constructed based on ChEMBL
  • Standard setting in few-shot learning with ∆AP of 20-25%
  • CLAMP reaches ∆AP of 19.4% without any training data
  • Datasets from MoleculeNet benchmark used
  • BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ToxCast, Tox21-10k
  • Temporal split to simulate situation in which new molecules and bioassays have to be predicted