- Modular pipelines
-
Data:
- Datasets:
-
Data Enrichment:
- Genomic/transcriptomic data
- Interactomic data
- Polypharmacology
- KSPA
- Latent Dirichlet Allocation (Predictive Toxicogenomics Space, PTGS; see the sketch after this list)
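As a rough illustration of the PTGS-style idea, the sketch below fits scikit-learn's `LatentDirichletAllocation` to a synthetic non-negative (samples x genes) count matrix; the matrix, the topic count and all variable names are placeholders, not values taken from the PTGS work.

```python
# Hypothetical sketch: LDA over a (samples x genes) count matrix, PTGS-style.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X_counts = rng.poisson(lam=3.0, size=(200, 500))   # placeholder transcriptomic counts

lda = LatentDirichletAllocation(n_components=15, random_state=0)
topic_weights = lda.fit_transform(X_counts)         # per-sample topic loadings
gene_loadings = lda.components_                     # per-topic gene weights

print(topic_weights.shape, gene_loadings.shape)     # (200, 15), (15, 500)
```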
-
Data preprocessing
- SMILES: canonicalization, cleaning, duplicate removal, etc. (see the preprocessing sketch after this list)
- TODO: check other data sources/types
- Imbalanced classes
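A minimal preprocessing sketch, assuming RDKit is available and the data sit in a pandas DataFrame with hypothetical `smiles`/`label` columns: SMILES are canonicalized, unparseable entries dropped, duplicates collapsed, and class imbalance is expressed as class weights usable by most scikit-learn estimators.

```python
# Hypothetical preprocessing sketch: canonical SMILES + class-weight handling.
import numpy as np
import pandas as pd
from rdkit import Chem
from sklearn.utils.class_weight import compute_class_weight

def canonical_smiles(smi):
    """Return the RDKit canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None

df = pd.DataFrame({
    "smiles": ["C1=CC=CC=C1", "c1ccccc1", "CCO", "not_a_smiles"],  # toy data
    "label":  [1, 1, 0, 0],
})
df["smiles"] = df["smiles"].map(canonical_smiles)
df = df.dropna(subset=["smiles"]).drop_duplicates(subset=["smiles"])

# Class weights for imbalanced labels (usable via class_weight= in scikit-learn models).
weights = compute_class_weight("balanced", classes=np.unique(df["label"]), y=df["label"])
print(df, weights)
```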
-
Feature engineering (ideas largely drawn from the DeepTox feature set listed below)
- ECFP (Morgan fingerprints; see the fingerprint sketch after this list)
- DFS
- 3D features based on MOPAC
- Quantum-mechanical descriptors
- Tanimoto
- Minmax
- Various 2D, 3D and pharmacophore kernels
- In-house toxicophore and scaffold features
- Dimensionality reduction:
- PCA
- FastICA
- Manifold learning (e.g., t-SNE, UMAP)
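A small fingerprint sketch, assuming RDKit: Morgan/ECFP-style bit vectors are computed for a few toy molecules and projected to 2D with PCA; radius, bit length and component count are illustrative defaults, not tuned values.

```python
# Hypothetical sketch: ECFP-style Morgan fingerprints + PCA projection.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# radius=2 with 2048 bits roughly corresponds to ECFP4.
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]
X = np.array([[int(b) for b in fp.ToBitString()] for fp in fps])

pca = PCA(n_components=2)
coords = pca.fit_transform(X)     # 2D coordinates, e.g. for chemical-space plots
print(coords.shape)
```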
-
Feature selection:
- Variance Inflation Factor (VIF)
- Univariate F-test (scikit-learn `f_classif` / `SelectKBest`)
- Random-forest feature importance (see the feature-selection sketch after this list)
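A feature-selection sketch on synthetic data, assuming the F-test item above refers to scikit-learn's univariate ANOVA F-test; it shows VIF via statsmodels, `SelectKBest(f_classif)`, and random-forest importances side by side. All dataset sizes and thresholds are placeholders.

```python
# Hypothetical feature-selection sketch: VIF, univariate F-test, forest importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, n_redundant=0, random_state=0)

# 1) Variance Inflation Factor: flag strongly collinear descriptors.
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

# 2) Univariate ANOVA F-test keeping the top-k features.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
mask = selector.get_support()

# 3) Random-forest impurity-based feature importances.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_

print(np.round(vif, 2), mask, np.round(importances, 3))
```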
-
ML setup:
- Validation algorithm:
- Hold-out test set + CV on the training set
- Nested CV (see the sketch after this block)
- Cluster-based nested CV (outer splits grouped by chemical cluster/scaffold)
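A nested-CV sketch with scikit-learn on synthetic data: the inner loop tunes hyperparameters, the outer loop estimates generalization of the whole tuning procedure. For the cluster-nested variant, the outer splitter could be swapped for a group-aware splitter such as `GroupKFold` over chemical cluster labels. Grid and fold counts are placeholders.

```python
# Hypothetical nested-CV sketch: inner loop tunes, outer loop evaluates.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Cluster-nested variant: replace outer_cv with GroupKFold and pass groups=cluster_ids.

param_grid = {"max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=inner_cv, scoring="roc_auc")

scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```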
-
ML algorithms:
- Scikit-learn (RF, SVM, Elastic Nets, Gradient boosting, etc.)
- Ensembles (XGBoost, LightGBM, stacking, etc.; stacking sketch after this list)
- PyTorch (LSTM, 1D-CNN, GNN, GANs, etc.)
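A stacking sketch built from scikit-learn base learners; XGBoost or LightGBM estimators could be dropped into the same `estimators` list. Hyperparameters are illustrative, not tuned.

```python
# Hypothetical stacking sketch combining several scikit-learn base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=30, random_state=0)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("gbm", GradientBoostingClassifier(random_state=0)),
]
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(max_iter=1000))

print(cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean())
```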
-
Evaluation:
- Metrics:
- Explainability:
- Applicability domain
- Attribution score
- Chemical space visualization
- Shapley values (SHAP; see the sketch after this list)
- ELI5
- LIME
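An explainability sketch, assuming the `shap` package is installed: a tree explainer over a random-forest model on synthetic data. The exact structure of the returned attributions (a list of per-class arrays vs. a single array) depends on the shap version and model type, so the sketch only inspects the shape.

```python
# Hypothetical explainability sketch: SHAP values for a random-forest tox model.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:50])   # per-sample, per-feature attributions

# Structure (list of per-class arrays vs. one array) varies across shap versions.
print(np.shape(shap_values))
```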
-
Active Learning to grow the set of labelled samples (see the sketch below)
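A minimal pool-based active-learning sketch using uncertainty sampling with a random forest on synthetic data; in practice the "reveal labels" step would be replaced by new assay measurements. Batch size and number of acquisition rounds are placeholders.

```python
# Hypothetical active-learning sketch: pool-based uncertainty sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labelled = np.zeros(len(y), dtype=bool)
labelled[np.random.default_rng(0).choice(len(y), size=20, replace=False)] = True

model = RandomForestClassifier(n_estimators=100, random_state=0)
for _ in range(5):                                    # five acquisition rounds
    model.fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[~labelled])[:, 1]
    uncertainty = np.abs(proba - 0.5)                 # closest to 0.5 = least certain
    pick = np.where(~labelled)[0][np.argsort(uncertainty)[:10]]
    labelled[pick] = True                             # "send to the lab", then reveal labels

print(labelled.sum(), "labelled samples after acquisition")
```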
-
Generative models:
-
Generative models + labelling with Active Learning
-
Supervised Learning:
- LSTM with SMILES (see the sketch after this list)
- GNN with graph representations (maybe SDFs?)
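A minimal PyTorch sketch of a character-level LSTM classifier over SMILES; tokenization, padding, and the single gradient step are deliberately simplistic, and all sizes are placeholders rather than recommended settings.

```python
# Hypothetical sketch: character-level LSTM classifier over SMILES strings (PyTorch).
import torch
import torch.nn as nn

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
labels = torch.tensor([0, 1, 0], dtype=torch.float32)

vocab = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(smiles))))}  # 0 = padding
max_len = max(len(s) for s in smiles)
X = torch.zeros(len(smiles), max_len, dtype=torch.long)
for i, s in enumerate(smiles):
    X[i, :len(s)] = torch.tensor([vocab[ch] for ch in s])

class SmilesLSTM(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return self.head(out[:, -1]).squeeze(-1)      # last time step -> toxicity logit

model = SmilesLSTM(vocab_size=len(vocab) + 1)
loss = nn.BCEWithLogitsLoss()(model(X), labels)
loss.backward()                                       # single illustrative gradient step
print(float(loss))
```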
-
Assessment/evaluation:
- Similarity (assumption: structural similarity implies similar activity; Tanimoto sketch after this list)
- Direct mapping to activity (using Supervised Learning)
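A sketch of the similarity-equals-activity assumption, assuming RDKit: each generated molecule is scored by its maximum Tanimoto similarity to a small set of illustrative "known actives"; both lists are placeholders.

```python
# Hypothetical sketch: score generated molecules by Tanimoto similarity to known actives.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

actives   = ["c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]       # illustrative "known actives"
generated = ["c1ccccc1N", "CCCCCC", "CC(=O)Oc1ccccc1"]    # illustrative generated SMILES

def fp(smi):
    mol = Chem.MolFromSmiles(smi)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

active_fps = [fp(s) for s in actives]
for s in generated:
    sims = DataStructs.BulkTanimotoSimilarity(fp(s), active_fps)
    # Max similarity to any active as a crude activity proxy (the stated assumption).
    print(s, round(max(sims), 3))
```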
-
DeepTox (Mayr et al., 2016):
- Machine Learning methods:
- Deep Neural Networks (DNNs, the core DeepTox models)
- SVMs with various kernels
- Random Forests
- Elastic Nets
- Features and kernels:
- ECFP
- DFS
- 3D features based on MOPAC
- Quantum-mechanical descriptors
- Tanimoto (precomputed Tanimoto-kernel SVM sketch after this block)
- Minmax
- Various 2D, 3D and pharmacophore kernels
- In-house toxicophore and scaffold features
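A sketch of an SVM with a precomputed Tanimoto kernel over binary fingerprints (random placeholders here, standing in for ECFP-style bit vectors); a minmax kernel would follow the same pattern with element-wise min/max sums in place of the bit intersection/union counts.

```python
# Hypothetical sketch: SVM with a precomputed Tanimoto kernel on binary fingerprints.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 256))          # placeholder binary fingerprints
y = rng.integers(0, 2, size=60)                 # placeholder toxicity labels

def tanimoto_kernel(A, B):
    """Tanimoto similarity between binary fingerprint matrices A and B."""
    inter = A @ B.T
    counts_a = A.sum(axis=1, keepdims=True)
    counts_b = B.sum(axis=1, keepdims=True)
    return inter / (counts_a + counts_b.T - inter)

K_train = tanimoto_kernel(X, X)
svm = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

# Prediction on new fingerprints uses the kernel between test and training sets.
X_new = rng.integers(0, 2, size=(5, 256))
print(svm.predict(tanimoto_kernel(X_new, X)))
```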