CNN models for antibiotic resistance prediction in M. tuberculosis genomes. From Green & Yoon et al
preprint: https://www.biorxiv.org/content/10.1101/2021.12.06.471431v1
Software can be installed by downloading the codebase. Required dependencies are listed on a per-subdirectory basis in environment_reqs.txt. Installation of the codebase and all required dependencies takes minutes.
The MD-CNN, SD-CNN, and WDNN were trained on an NVIDIA GeForce GTX Titan X GPU. Model training takes 6-12 hours.
The LogReg + L2 training was performed on a CPU with 50GB RAM.
DeepLIFT calculations were performed on an NVIDIA GeForce GTX Titan X GPU.
Downstream data analysis was performed on a MacBook Pro laptop running macOS Catalina v. 10.15.7
All sequences input to the model were aligned to the M. tuberculosis H37Rv reference genome, with fasta nucleotide sequences extracted for the positions listed in Table 3 of the published manuscript. The sequences for each locus are input as a multiple nucleotide sequence alignment - meaning that gaps are present in the input reference sequence if any of the input sequences has an insertion at that position. All provided input fasta sequence files are aligned accordingly.
To input new sequences to the model, the new sequences need to be appended to the existing input fasta sequence files so that the frame of the alignment is unchanged. We recommend doing this with MAFFT v7.490, mafft --add <new_fasta_alignment_file> --keeplength <previous_fasta_alignment_file> > <output_alignment>
All scripts in the model_training subdirectories are executed with python <script.py> <parameter_file.txt>
, with parameter files in yaml format. Paths within the parameter file should be adjusted for your local environment.
data_analysis: contains jupyter notebooks for comparing models and running analysis on DeepLIFT importance scores
md_cnn: code for running multidrug convolutional neural network (MD-CNN) model and corresponding DeepLIFT analysis
output_data: output accuracy files created by running the models on train and test data, list of known resistance variants, output DeepLIFT scores
sd_cnn: code for running single drug convolutional neural network (SD-CNN) models and corresponding DeepLIFT analysis
regression_l2: code for running regression + L2 regularization baseline
wdnn: code for running previous state of art wide and deep neural network (WDNN)
input_data: input fasta and phenotype data for training, test and CRyPTIC datasets, as well as in silico mutagenized strains