Human Organ Level Toxicity Endpoints Predictor from China Pharmaceutical University (HOLTEP-CPU) is a collection of five models that predict specific toxicity endpoints of given molecular structures at the human organ level.
The five models are Chemprop (a deep learning model) and RandomForest, SVM, XGboosting and LogisticRegression (machine learning models). The supported toxicity endpoints are carcinogenicity, cardiotoxicity, developmental_toxicity, hepatotoxicity, nephrotoxicity, neurotoxicity, reproductive_toxicity and skin_sensitization.
In addition, a transfer learning model based on Chemprop is deployed with the help of pkuwangsw/COVIDVS. It has advantages in predicting skin_sensitization.
This repository contains the results of:
Hu Y, Ren Q, Liu X, et al. In Silico Prediction of Human Organ Toxicity via Artificial Intelligence Methods[J]. Chemical Research in Toxicology, 2023. https://doi.org/10.1021/acs.chemrestox.2c00411
- Clone repository
git clone https://github.com/Wenying-Yu-Lab/HOLTEP-CPU.git
- Change directory
cd HOLTEP-CPU
- (Optional) Add pkuwangsw/COVIDVS as a git submodule to apply the transfer learning model later
git submodule add https://github.com/pkuwangsw/COVIDVS.git COVIDVS
- Create a new conda environment
conda create -n holtep python=3.8
- Activate the environment
conda activate holtep
- Install CUDA (11.3) and cuDNN
conda install cudatoolkit=11.3 cudnn
- Install PyTorch (1.12.1)
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
- Install chemprop and other dependencies
conda install -c conda-forge rdkit
pip install git+https://github.com/bp-kelley/descriptastorus
pip install chemprop
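After installation, it can help to confirm that the packages are importable from the `holtep` environment. The following is a minimal sketch and not part of the repository: the helper name `missing_packages` is hypothetical, and the import names (`torch`, `rdkit`, `chemprop`, `descriptastorus`) are assumed to correspond to the packages installed in the steps above.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED = ("torch", "rdkit", "chemprop", "descriptastorus")


def missing_packages(names=REQUIRED):
    """Return the names in `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks the packages up; it does not import them, so it says nothing about whether the installed CUDA build matches the local driver.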
usage: python -u predict.py [-h] [-m {Chemprop,Randomforest,SVM,XGboosting,LogisticRegression}] [-f FOLD_INDEX] [-s SAVE_DIR] {carcinogenicity,cardiotoxicity,developmental_toxicity,hepatotoxicity,nephrotoxicity,neurotoxicity,reproductive_toxicity,skin_sensitization} data_path

This tool predicts the specific toxicity endpoints of given molecular structures at the human organ level.

positional arguments:
  {carcinogenicity,cardiotoxicity,developmental_toxicity,hepatotoxicity,nephrotoxicity,neurotoxicity,reproductive_toxicity,skin_sensitization}
                        Specify the toxicity endpoint for prediction from the given list.
  data_path             Specify the file containing a series of SMILES data of the structures for prediction.

optional arguments:
  -h, --help            show this help message and exit
  -m {Chemprop,Randomforest,SVM,XGboosting,LogisticRegression}, --model {Chemprop,Randomforest,SVM,XGboosting,LogisticRegression}
                        Specify the model to be used from the given list (default: Chemprop).
  -f FOLD_INDEX, --fold FOLD_INDEX
                        Specify the fold index where the model is used (default: 1, max: 10).
  -s SAVE_DIR, --save-dir SAVE_DIR
                        Specify a directory to save results (default: ./predictions).
- Examples:
- Do not forget to activate the environment before running predictions.
conda activate holtep
- Example 1. Model: Chemprop from fold 1 of "cardiotoxicity".
python -u predict.py cardiotoxicity ./model_data/cardiotoxicity/test.csv
- Example 2. Model: Randomforest from fold 3 of "developmental_toxicity".
python -u predict.py -m Randomforest -f 3 developmental_toxicity ./model_data/developmental_toxicity/test.csv
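To run the same model over several endpoints, the command line from the examples above can be assembled programmatically. This is a sketch under stated assumptions, not code from the repository: `build_command` and `run_all` are hypothetical helpers, only the options documented in the usage text (`-m`, `-f`, `-s`) are used, and `./model_data/<endpoint>/test.csv` is assumed to exist for every endpoint, as it does in the two examples.

```python
import subprocess
import sys

ENDPOINTS = [
    "carcinogenicity", "cardiotoxicity", "developmental_toxicity",
    "hepatotoxicity", "nephrotoxicity", "neurotoxicity",
    "reproductive_toxicity", "skin_sensitization",
]
MODELS = ["Chemprop", "Randomforest", "SVM", "XGboosting", "LogisticRegression"]


def build_command(endpoint, data_path, model="Chemprop", fold=1,
                  save_dir="./predictions"):
    """Build the argument list for one predict.py run (hypothetical helper)."""
    # Endpoint and model names are case sensitive, so compare them exactly.
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if not 1 <= fold <= 10:
        raise ValueError("fold index must be between 1 and 10")
    return [sys.executable, "-u", "predict.py",
            "-m", model, "-f", str(fold), "-s", save_dir,
            endpoint, data_path]


def run_all(model="Randomforest", fold=3):
    """Run predict.py once per endpoint; call from the repository root."""
    for endpoint in ENDPOINTS:
        # Assumed location of the input file for each endpoint.
        data_path = f"./model_data/{endpoint}/test.csv"
        subprocess.run(build_command(endpoint, data_path, model, fold), check=True)
```

Calling `run_all()` from the repository root with the `holtep` environment active would repeat Example 2 for every endpoint. `sys.executable` is used so that the interpreter of the active environment runs `predict.py`.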
- Specifications
  - Specification for input data
    - A file containing a list of SMILES data (`*.smi`) is supported.
    - A `*.csv` file is supported with additional restrictions:
      - The column that contains the SMILES data should be the first column of the table.
      - If the table contains headers, then the column header containing the SMILES data should be `smiles` (case insensitive).
  - Specification for toxicity endpoints
    - Available toxicity endpoints:
      - `carcinogenicity`
      - `cardiotoxicity`
      - `developmental_toxicity`
      - `hepatotoxicity`
      - `nephrotoxicity`
      - `neurotoxicity`
      - `reproductive_toxicity`
      - `skin_sensitization`
    - Toxicity endpoints are case sensitive.
  - Specification for models
    - Available models:
      - `Chemprop` (default)
      - `Randomforest`
      - `SVM`
      - `XGboosting`
      - `LogisticRegression`
    - Model names are case sensitive.
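The input rules for `*.csv` files can be illustrated with a short sketch that writes a conforming file and checks the header of an existing one. The function names `write_smiles_csv` and `first_column_is_smiles` are hypothetical and not part of the repository, and the three SMILES strings (ethanol, benzene, aspirin) are only illustrative. The check applies to tables that have a header row; a header is optional according to the specification.

```python
import csv


def write_smiles_csv(smiles_list, path):
    """Write SMILES strings to a CSV whose first (and only) column is `smiles`."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles"])  # header of the first column
        for smiles in smiles_list:
            writer.writerow([smiles])


def first_column_is_smiles(path):
    """Check that the first column header is `smiles`, ignoring case."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), None)  # None for an empty file
    return bool(header) and header[0].strip().lower() == "smiles"


if __name__ == "__main__":
    # Illustrative molecules: ethanol, benzene, aspirin.
    write_smiles_csv(["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"], "my_molecules.csv")
    print(first_column_is_smiles("my_molecules.csv"))
```

The resulting file can then be passed as `data_path` to `predict.py`.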
- Steps for the transfer learning model
- Activate the environment before running predictions
conda activate holtep
- Create a directory to save results (skip if it exists)
mkdir predictions
- Change directory
cd COVIDVS
- Generate the descriptors
python generatorFeatures.py ../model_data/skin_sensitization/test.csv ../model_data/skin_sensitization/test-feat.npy 0
- The first argument: data for prediction
- The second argument: file path to save features (can be specified arbitrarily)
- Prediction
python predict.py --gpu 0 --test_path ../model_data/skin_sensitization/test.csv --features_path ../model_data/skin_sensitization/test-feat.npy --preds_path ../predictions/transfer_learning.csv --checkpoint_dir ../model_data/skin_sensitization/transfer_learning/up_AT_mouse_ip_LD50/fold_0/model_0/
- `--test_path`: data for prediction
- `--features_path`: feature file generated in the previous step
- `--preds_path`: file path to save results (can be specified arbitrarily)
- `--checkpoint_dir`: directory where the `model.pt` file is located (replace `up_AT_mouse_ip_LD50` with a desired learner)
- Change directory back
cd ../
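The two COVIDVS steps above (descriptor generation, then prediction) can be chained from the repository root. This is a minimal sketch and not part of either repository: `transfer_learning_commands` and `run_transfer_learning` are hypothetical helpers, and the script names, flags and paths are copied from the commands shown in the steps. The trailing `0` passed to `generatorFeatures.py` is reproduced as given; its meaning is not documented here.

```python
import subprocess
import sys


def transfer_learning_commands(test_csv, features_npy, preds_csv,
                               checkpoint_dir, gpu=0):
    """Return the feature-generation and prediction commands as argument lists."""
    generate = [
        sys.executable, "generatorFeatures.py", test_csv, features_npy,
        "0",  # copied from the documented command; meaning not documented here
    ]
    predict = [
        sys.executable, "predict.py",
        "--gpu", str(gpu),
        "--test_path", test_csv,
        "--features_path", features_npy,
        "--preds_path", preds_csv,
        "--checkpoint_dir", checkpoint_dir,
    ]
    return generate, predict


def run_transfer_learning(covidvs_dir="COVIDVS", learner="up_AT_mouse_ip_LD50"):
    """Run both steps inside the COVIDVS directory; paths are relative to it."""
    data = "../model_data/skin_sensitization"
    generate, predict = transfer_learning_commands(
        f"{data}/test.csv",
        f"{data}/test-feat.npy",
        "../predictions/transfer_learning.csv",
        f"{data}/transfer_learning/{learner}/fold_0/model_0/",
    )
    for command in (generate, predict):
        # cwd replaces the explicit `cd COVIDVS` and `cd ../` steps.
        subprocess.run(command, cwd=covidvs_dir, check=True)
```

Because `cwd` is set per command, the working directory of the calling process is left unchanged. The `predictions` directory still has to exist before `run_transfer_learning()` is called.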
- Notice
- The transfer learning model can only be used to predict `skin_sensitization`.
- Refer to pkuwangsw/COVIDVS for more details.