Activity Prediction for Chemical Compounds

This project was carried out by Annika Schavemaker, Anna Fernandez-Rajal and Anoogna Suresh Babu as part of the course ID2214 Programming for Data Science.

The goal is to find a suitable model for predicting the as-yet-unknown test labels. It is important that the chosen model not only achieves good accuracy on the validation data but also has a large area under the ROC curve (AUC-ROC).
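The difference between the two metrics can be illustrated with a small, dependency-free sketch. The labels below are made up and assume a heavily imbalanced data set (an assumption, not a statement about the project data); the helper names `accuracy` and `roc_auc` are hypothetical. AUC-ROC is computed here as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC-ROC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical, heavily imbalanced labels: 2 active compounds out of 100.
y_true = [1, 1] + [0] * 98

# A model that scores every compound identically and always predicts "inactive".
constant_scores = [0.0] * 100
always_inactive = [0] * 100

print(accuracy(y_true, always_inactive))  # 0.98
print(roc_auc(y_true, constant_scores))   # 0.5
```

Under these assumptions a model that never predicts an active compound still reaches 98% accuracy, while its AUC-ROC of 0.5 shows that it cannot rank compounds at all.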

Processing the data

To generate features from SMILES strings, the open-source cheminformatics toolkit RDKit was used. For the extraction of per-atom features, additional functions as described by Hirohara et al. were used [1]. These covered the number of hydrogen atoms, unsaturation, formal charge, total valence, ring membership, aromaticity, chirality, and hybridization. Because features had to be extracted from every atom of every SMILES string, processing the data took a long time.
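As a rough illustration of per-atom feature extraction, the sketch below reads several of the listed properties directly from RDKit atom objects. It is not the feature functions of Hirohara et al. and not necessarily what `functions.py` in this project does: the function name `atom_features` and the dictionary keys are hypothetical, and the unsaturation feature is left out.

```python
from rdkit import Chem


def atom_features(smiles):
    """Return one feature dict per atom of the molecule encoded by `smiles`."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    rows = []
    for atom in mol.GetAtoms():
        rows.append({
            "symbol": atom.GetSymbol(),
            "num_hydrogens": atom.GetTotalNumHs(),      # explicit + implicit Hs
            "formal_charge": atom.GetFormalCharge(),
            "total_valence": atom.GetTotalValence(),
            "in_ring": atom.IsInRing(),
            "aromatic": atom.GetIsAromatic(),
            "chirality": str(atom.GetChiralTag()),
            "hybridization": str(atom.GetHybridization()),
        })
    return rows


# Ethanol has three heavy atoms: C, C and O.
ethanol = atom_features("CCO")
for row in ethanol:
    print(row["symbol"], row["num_hydrogens"], row["hybridization"])
```

Because every atom of every molecule yields its own row, the number of rows grows with molecule size, which is consistent with the long processing time noted above.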

The preprocessed data were split into a training set (75%) and a validation set (25%), after which a column filter was used to separate labels from features. The label the model aims to predict is 'ACTIVE'.

Different representations were applied, and the data were normalized with a MinMax Scaler followed by a Standard Scaler and principal component analysis (PCA). This step was performed to reduce the large number of features while losing as little information as possible. Another approach, applied after splitting the initial data, used only a RobustScaler before PCA. This aimed to reduce the influence of outliers, but for our selected features it did not lead to relevant changes.
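The RobustScaler variant can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data, assuming scikit-learn is the library in use; the 95% explained-variance threshold, the stratified split, and the random seeds are illustrative choices that do not come from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the processed feature table (not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (rng.random(400) < 0.1).astype(int)  # hypothetical 'ACTIVE' labels

# 75% training / 25% validation, keeping the class ratio similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale, then keep enough principal components to explain 95% of the variance.
reducer = make_pipeline(
    RobustScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_train_red = reducer.fit_transform(X_train)  # fitted on the training set only
X_val_red = reducer.transform(X_val)          # reuses the fitted transform

print(X_train.shape, X_val.shape)
print(X_train_red.shape[1], "components kept")
```

Fitting the scaler and PCA on the training set only, and then applying the fitted transform to the validation set, keeps information from the validation data out of the preprocessing step.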

Results and Conclusion

The results show that the accuracies of all models are very high, ranging from 98% to 99%. The models differ more clearly in the area under the ROC curve, which ranges from 52.3% for the decision tree to 77.8%. This highest AUC on the validation set was obtained with an artificial neural network with the parameters solver = adam and alpha = 0.0001. We therefore estimate the AUC on the test set to be approximately 77%.

Model	ROC AUC	Accuracy
K-nearest neighbor	69.6%	99.0%
Logistic regression	77.0%	98.8%
Naive Bayes	74.2%	98.6%
Decision tree	52.3%	98.1%
Random forest	76.9%	98.9%
XGBoost	77.3%	98.8%
Artificial neural network	77.8%	98.9%
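The two columns of the table can be computed for a neural network along the following lines. This is a sketch only, assuming scikit-learn's `MLPClassifier` (the parameter names solver and alpha match its interface, but the project notebooks are not shown here). The data are synthetic, and every setting other than `solver` and `alpha` is an illustrative choice, so the printed numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic, imbalanced stand-in data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# solver and alpha follow the text; the other settings are illustrative.
model = MLPClassifier(solver="adam", alpha=0.0001, max_iter=500, random_state=0)
model.fit(X_train, y_train)

# AUC-ROC uses the predicted probability of the positive class,
# accuracy uses the hard 0/1 predictions.
val_scores = model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"AUC-ROC: {val_auc:.3f}  accuracy: {val_acc:.3f}")
```

Note that AUC-ROC is computed from continuous scores rather than from hard class predictions, which is why it can separate models whose accuracies are nearly identical.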

References

[1] Hirohara, M., Saito, Y., Koda, Y. et al. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics 19 (Suppl 19), 526 (2018). https://doi.org/10.1186/s12859-018-2523-5
