Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Activity and property prediction models are used in drug discovery and materials sciences.
- Currently, these models have to be trained or fine-tuned for new tasks.
- Scientific language models can be used for low-data tasks without training or fine-tuning.
- Predictive quality of language models is lacking.
- A novel type of activity prediction model is proposed that can adapt to new tasks at inference time.
- Modularized architecture and pre-training objective are proposed.
- Experiments show improved predictive performance on few-shot learning benchmarks and zero-shot problems.
Paper Content
Introduction
- Activity and property prediction models are used in computational drug discovery, analogous to language models in NLP and image classification models in computer vision
- Predictions are based on chemical structure and properties
- Machine learning methods have been used since the early 90s
- Deep learning has been used to create molecule encoders
- Activity prediction models are used to select or rank molecules for further testing or to flag unwanted properties
- Activity prediction models are usually trained on pairs of molecules and activity labels from biological experiments
- Few data points are available for training
- Scientific language models can utilize both natural language and chemical structure
- SLMs tokenize SMILES representations of chemical structures and embed them in the same embedding space as language tokens
- SLMs are suboptimal activity predictors
- Cross-modal contrastive learning is proposed to use an effective molecule encoder and utilize chemical databases as training or pre-training data
- CLIP uses both an effective language encoder and an effective vision encoder
- Chemical databases contain orders of magnitude more molecules with associated biological information than biomedical texts
- Zero-shot problem in drug discovery is equivalent to the library design problem
Related work
- Scientific language models process chemical inputs
- Large language models are trained with masking objective
- MolT5 uses special objective and fine-tunes on molecule caption generation
- Input tokens represent chemical structures or sub-structures
- Activity and property prediction models improved with Deep Learning
- Few-shot learning or low-resource drug discovery approaches exist
- Cross-modal contrastive learning methods used in drug discovery
- Chemical structure and natural language used as data modalities
Experiments and results
- Experiment a) tested ability of models to predict activity of molecules from only a textual description
- Experiment b) tested if learned molecule representations of different methods are rich and transferable
- Experiment c) tested use of method as a retrieval system to rank active molecules higher than inactive molecules
Zero-shot transfer
- CLAMP is a method used to predict the activity of molecules
- FS-Mol and PubChem databases are used
- Three different splits of PubChem are constructed
- 1-NN, soft-NN, FH, GAL 125M, and KV-PLM are compared
- CLAMP outperforms other methods except FH on two splits
Representation learning
- Linear probing used to assess robustness and transferability of embeddings
- Datasets used: MoleculeNet, BACE, BBBP, ClinTox, HIV, SIDER, Tox21, Tox21-10k
- Methods compared: Morgan fingerprints, chemical descriptors, Grover, Galactica1
- CLAMP performs best on average and significantly outperforms other methods
- CLAMP improves predictive performance on ToxCast
Retrieval and library design
- Retrieval task: rank molecules from chemical database based on given bioassay
- Metric: enrichment factor (EF)
- Dataset: PubChem, 190 assays for 10k molecules, 2,543 assays for 1M molecules
- Methods compared: CLAMP, FH, KV-PLM, Galactica
- Results: CLAMP outperforms all other methods by 10-fold (10k) and 250-fold (1M)
Discussion
- Proposed contrastive learning method CLAMP exhibits best performance at zero-shot prediction drug discovery tasks
- Pre-trained molecule-encoder of CLAMP yields transferable representations
- Scientific language models can in principle be used for zero-shot activity prediction, but not performing well and computationally demanding
- CLAMP embedding space prime candidate for conditional generation
- Cold-start problem identified by recommender systems and matrix factorization research community
- Contrastive learning suggested to learn similarities between users and items
- Data-driven strategies not possible for new bioassays
- Textual description of bioassay can be leveraged with contrastive learning
- Datasets used in this work overviewed
- FS-Mol dataset constructed based on ChEMBL
- Standard setting in few-shot learning with ∆AP of 20-25%
- CLAMP reaches ∆AP of 19.4% without any training data
- Datasets from MoleculeNet benchmark used
- BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ToxCast, Tox21-10k
- Temporal split to simulate situation in which new molecules and bioassays have to be predicted