MIC Prediction and Feature Selection using XGBoost for Neural Networks
Contig FASTA files were collected from Klebsiella pneumoniae (KPN).
There are two algorithms that will be used for prediction:
- Neural Network
- Gradient Boosted Forests (More specifically, XGBoost)
If you would like to test your own data with our best NN model, follow these steps. You should be in the repository's main directory to call step 2.
- Collect FASTA files and the MIC CSV file (put them in the `data` folder)
- Run `python script` to run the test pipeline
Because testing the NN model does not require changing any files, the base file for `script` was made into the test pipeline. The pipeline automatically runs the same commands as the following four steps:
- Run `python get_format_data.py test`
- Run `python nn.py files` to generate the NN test file
- Run `python nn.py test` to test the NN model
- Run `python nn_analysis.py` to generate the error rate CSV and MIC heatmap
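
As a rough illustration of what chaining those four commands looks like, here is a minimal sketch. It is not the repository's actual pipeline code: the function names, the `dry_run` flag, and the assumption that every command runs from inside the `script` directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# The four commands listed above, in the order given in this README.
TEST_STEPS = [
    ["get_format_data.py", "test"],
    ["nn.py", "files"],
    ["nn.py", "test"],
    ["nn_analysis.py"],
]


def build_commands(python=sys.executable):
    """Return the full argument list for each step, using the given interpreter."""
    return [[python, *step] for step in TEST_STEPS]


def run_test_pipeline(script_dir="script", dry_run=False):
    """Run each step in order from script_dir, stopping at the first failure.

    script_dir is an assumption; point it at wherever the scripts live.
    """
    commands = build_commands()
    for argv in commands:
        print("Running:", " ".join(argv[1:]))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on the output of a failed step.
            subprocess.run(argv, cwd=Path(script_dir), check=True)
    return commands
```

Calling `run_test_pipeline(dry_run=True)` only prints the steps, which is a cheap way to confirm the order before running anything against real data.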
Note that all `python` commands below must be run from within the `script` directory.
- Collect FASTA files and the MIC CSV file (put them in the `data` folder)
- Run `python get_format_data.py`
- Zip up `train.libsvm` and put that zip file in an S3 bucket
- Put `xgboost.ipynb` in SageMaker, modify it to point to `train.zip` in S3, and run all cells
- Collect the trained XGBoost model and the control XGBoost model and put them in `output/models`
- Run `python xgboost_test.py` to get `feature_importance.csv` and test results for the XGBoost models
- Use `feature_importance.csv` to modify `nn.py` with the most important features (all features with importance >= 10, in sorted order)
- Run `python nn.py files` to generate the NN training, validation, and testing files. Take note of the console output to update the `nn.py` file.
- Run `python nn.py cv` to do a grid search using 10-fold cross-validation to determine the best learning rate and architecture combination. Update `BEST_LEARNING_RATE`, `BEST_ARCH_SIZE`, and `BEST_DROPOUT_PROB` at the top of `nn.py` when finished.
- Run `python nn.py train` and note the file name of the best model hdf5 file. Modify `nn.py` with that name (delete all other hdf5 files)
- Run `python nn.py test` to test the NN model
- Run `python nn_analysis.py` to generate the error rate CSV and MIC heatmap
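
The feature-selection step above (keep every feature with importance >= 10, in sorted order) can be sketched as follows. This is an illustration only: the column names `feature` and `importance` are guesses about the layout of `feature_importance.csv`, "sorted order" is assumed to mean highest importance first, and the sample rows are made up.

```python
import csv
import io

IMPORTANCE_THRESHOLD = 10  # from the step above: keep importance >= 10


def select_features(csv_file, name_col="feature", score_col="importance"):
    """Return names of features at or above the threshold, highest score first.

    name_col and score_col are hypothetical column names; adjust them to
    match the real header of feature_importance.csv.
    """
    kept = []
    for row in csv.DictReader(csv_file):
        score = float(row[score_col])
        if score >= IMPORTANCE_THRESHOLD:
            kept.append((row[name_col], score))
    # list.sort is stable, so features with equal scores keep their file order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept]


# Made-up rows standing in for feature_importance.csv.
sample = io.StringIO(
    "feature,importance\n"
    "feat_a,12\n"
    "feat_b,3\n"
    "feat_c,40\n"
    "feat_d,10\n"
)
print(select_features(sample))  # ['feat_c', 'feat_a', 'feat_d']
```

With a real file, `open("feature_importance.csv", newline="")` can be passed in place of the `StringIO` sample; the resulting list is what would be copied into `nn.py`.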
For more information/detail on each step, visit the respective folders/script files.