d25b0d4
commit 7611c1a
Showing 12,883 changed files with 2,717,034 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
*.pkl | ||
*.csv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,161 @@ | ||
# | ||
## Overview | ||
|
||
This project is my from-scratch implementation of [this research paper](https://ernest-bonat.medium.com/rna-seq-gene-expression-classification-using-machine-learning-algorithms-de862e60bfd0#4592). Its objective is to identify gene expression patterns associated with different conditions or diseases by preprocessing RNA-Seq data and training classification models on it. RNA sequencing (RNA-Seq) is a technique used to quantify and analyze gene expression levels across different conditions or samples. Classifying RNA-Seq data can help identify which genes are differentially expressed between healthy and diseased samples, or between different disease states, thus aiding the diagnosis, treatment, and understanding of various diseases. | ||
|
||
## Table of Contents | ||
|
||
- [Overview](#overview) | ||
- [Dataset](#dataset) | ||
- [Project Structure](#project-structure) | ||
- [Running the project](#running-the-project) | ||
- [Models Used](#models-used) | ||
- [Results](#results) | ||
- [Contributing](#contributing) | ||
- [Acknowledgements](#acknowledgements) | ||
|
||
## Dataset | ||
|
||
The dataset used in this project: | ||
|
||
- **Source**: https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq | ||
|
||
- **Description**: This dataset is part of the RNA-Seq (HiSeq) PANCAN data set. It is a random extraction of gene expressions of patients having different types of tumor: BRCA (Breast invasive carcinoma), KIRC (Kidney renal clear cell carcinoma), COAD (Colon adenocarcinoma), LUAD (Lung adenocarcinoma) and PRAD (Prostate adenocarcinoma). | ||
|
||
The downloaded dataset file TCGA-PANCAN-HiSeq-801x20531.tar.gz contains two CSV files. | ||
|
||
- **data.csv**: Contains the feature set (X) representing gene expression data. | ||
- **labels.csv**: Contains the target labels (y) corresponding to the samples in the feature set. | ||
|
||
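As a minimal sketch of unpacking the archive, the snippet below extracts only the CSV members with the standard-library `tarfile` module. The helper name `extract_csvs` and the idea of flattening the archive's internal folder are assumptions made for illustration; the repository may unpack the file differently.

```python
import shutil
import tarfile
from pathlib import Path


def extract_csvs(archive_path, dest_dir):
    """Extract only the CSV members of a .tar.gz archive into dest_dir.

    Hypothetical helper, not part of the repository.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            # Drop any folder prefix stored in the archive so the files
            # land directly in dest_dir.
            target = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)  # streams, so large CSVs are fine
            extracted.append(target)
    return sorted(extracted)
```

For example, `extract_csvs("data/raw/TCGA-PANCAN-HiSeq-801x20531.tar.gz", "data/raw")` would be expected to leave `data.csv` and `labels.csv` in `data/raw/`, assuming the archive was downloaded to that location.
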
## Project Structure | ||
|
||
The outline of the project repository: | ||
|
||
``` | ||
RNASeq-GeneExpression-ML/ | ||
├── data/ | ||
│ ├── raw/ | ||
│ ├── processed/ | ||
│ │ ├── split_data | ||
│ │ ├── transformed | ||
├── notebooks/ | ||
│ ├── data_preprocessing.ipynb | ||
│ ├── model_training.ipynb | ||
│ ├── model_evaluation.ipynb | ||
├── src/ | ||
│ ├── load_data.py | ||
│ ├── split.py | ||
│ ├── smote.py | ||
│ ├── pca.py | ||
│ ├── models/ | ||
│ │ ├── logistic_regression.py | ||
│ │ ├── decision_tree.py | ||
│ │ ├── random_forest.py | ||
│ │ ├── svm.py | ||
│ │ ├── naive_bayes.py | ||
│ │ ├── knn.py | ||
│ │ ├── mlp.py | ||
│ │ ├── gradient_boosting.py | ||
│ │ ├── ada_boost.py | ||
│ │ ├── xgboost.py | ||
│ │ ├── lightgbm.py | ||
│ │ ├── catboost.py | ||
│ ├── evaluate_models.py | ||
├── results/ | ||
│ ├── logistic_regression.txt | ||
│ ├── decision_tree.txt | ||
│ ├── random_forest.txt | ||
│ ├── svm.txt | ||
│ ├── naive_bayes.txt | ||
│ ├── knn.txt | ||
│ ├── mlp.txt | ||
│ ├── gradient_boosting.txt | ||
│ ├── ada_boost.txt | ||
│ ├── xgboost.txt | ||
│ ├── lightgbm.txt | ||
│ ├── catboost.txt | ||
├── README.md | ||
├── requirements.txt | ||
``` | ||
|
||
## Running the Project | ||
|
||
1. **Clone the repository:** | ||
|
||
``` | ||
git clone git@github.com:shalinis602/RNASeq-GeneExpression-ML.git | ||
cd RNASeq-GeneExpression-ML | ||
``` | ||
|
||
2. **Install dependencies:** | ||
``` | ||
pip install -r requirements.txt | ||
``` | ||
|
||
3. **Fast load the dataset and verify its contents:** | ||
|
||
``` | ||
python src/load_data.py | ||
``` | ||
|
||
4. **Split the data into training, testing and validation sets:** | ||
|
||
``` | ||
python src/split.py | ||
``` | ||
|
||
5. **Preprocess the dataset:** | ||
|
||
``` | ||
python src/smote.py | ||
python src/pca.py | ||
``` | ||
|
||
6. **Train the models:** | ||
|
||
``` | ||
python src/models/random_forest.py | ||
``` | ||
|
||
7. **Evaluate the models:** | ||
|
||
``` | ||
python src/evaluate_models.py | ||
``` | ||
|
||
Alternatively, you can run the Jupyter notebooks located in the `notebooks/` directory to interactively execute the code. | ||
|
||
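The splitting and preprocessing steps above can be sketched roughly as follows. This is not the code in `src/split.py` or `src/pca.py`: the 60/20/20 ratio, the 10 principal components, the standardization step, and the synthetic data are all assumptions chosen to keep the example small. SMOTE is only marked by a comment because it lives in the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (assumption: the real X has
# 801 rows and 20531 gene columns; a small random matrix keeps this fast).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.repeat(["BRCA", "KIRC", "COAD", "LUAD", "PRAD"], 40)

# Stratified 60/20/20 split into train, validation, and test sets
# (the ratios are an assumption).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

# Oversampling such as SMOTE would be applied here, to the training split
# only, so that synthetic samples never leak into validation or test data.

# Fit scaling and PCA on the training split, then reuse them on the others.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10, random_state=42).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_val_pca = pca.transform(scaler.transform(X_val))
X_test_pca = pca.transform(scaler.transform(X_test))

print(X_train_pca.shape, X_val_pca.shape, X_test_pca.shape)
```

Fitting the scaler and PCA on the training split alone is a deliberate choice: it keeps information from the validation and test sets out of the learned transformation.
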
## Models Used | ||
|
||
- **Logistic Regression** | ||
- **Decision Tree** | ||
- **Random Forest** | ||
- **Support Vector Machine (SVM)** | ||
- **Naive Bayes** | ||
- **k-Nearest Neighbors (KNN)** | ||
- **Multilayer Perceptron (MLP)** | ||
- **Gradient Boosting** | ||
- **AdaBoost** | ||
- **XGBoost** | ||
- **LightGBM** | ||
- **CatBoost** | ||
|
||
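Several of the models listed above share the scikit-learn estimator interface (`fit`, `predict`, `score`), so they can be compared in one loop. The sketch below is illustrative only: the data is synthetic, the hyperparameters are placeholders rather than the repository's settings, and XGBoost, LightGBM, and CatBoost are left out because they ship as separate packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters here are illustrative, not the repository's settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(kernel="rbf"),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split

# Print the models from best to worst test accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.3f}")
```
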
## Results | ||
|
||
- **Accuracy** | ||
- **Precision** | ||
- **Recall** | ||
- **F1-Score** | ||
- **Confusion Matrix** | ||
|
||
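To make the metrics above concrete, here is a small pure-Python sketch that derives all of them from a confusion matrix. The function names and the choice of macro averaging are assumptions for illustration; the repository's evaluation script may compute or average them differently.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true classes and columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def macro_report(y_true, y_pred, labels):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    cm = confusion_matrix(y_true, y_pred, labels)
    n = len(labels)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "confusion_matrix": cm,
    }


# Toy labels, invented for this example.
labels = ["BRCA", "KIRC", "LUAD"]
y_true = ["BRCA", "BRCA", "KIRC", "KIRC", "LUAD", "LUAD"]
y_pred = ["BRCA", "KIRC", "KIRC", "KIRC", "LUAD", "BRCA"]
report = macro_report(y_true, y_pred, labels)
for key, value in report.items():
    print(key, value)
```

Macro averaging weights every tumor class equally, which is one reasonable choice when class sizes differ; a support-weighted average would be an equally valid alternative.
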
## Contributing | ||
|
||
If you find any issues or have suggestions for improvements or expanding the project, feel free to open an issue or submit a pull request. | ||
|
||
1. Fork the repository. | ||
2. Create a new branch (`git checkout -b feature-branch`). | ||
3. Commit your changes (`git commit -am 'Add new feature'`). | ||
4. Push to the branch (`git push origin feature-branch`). | ||
5. Create a new Pull Request. | ||
|
||
## Acknowledgements | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
{ | ||
"cells": [], | ||
"metadata": {}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
6 changes: 6 additions & 0 deletions
notebooks/.ipynb_checkpoints/data_preprocessing-checkpoint.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
{ | ||
"cells": [], | ||
"metadata": {}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "07816290", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Data information:\n", | ||
"<class 'pandas.core.frame.DataFrame'>\n", | ||
"RangeIndex: 801 entries, 0 to 800\n", | ||
"Columns: 20532 entries, Unnamed: 0 to gene_20530\n", | ||
"dtypes: float64(20531), object(1)\n", | ||
"memory usage: 125.5+ MB\n", | ||
"None\n", | ||
"\n", | ||
"Label information:\n", | ||
"<class 'pandas.core.frame.DataFrame'>\n", | ||
"RangeIndex: 801 entries, 0 to 800\n", | ||
"Data columns (total 1 columns):\n", | ||
" # Column Non-Null Count Dtype \n", | ||
"--- ------ -------------- ----- \n", | ||
" 0 Class 801 non-null object\n", | ||
"dtypes: object(1)\n", | ||
"memory usage: 6.4+ KB\n", | ||
"None\n", | ||
"\n", | ||
"Deserialized X:\n", | ||
"<class 'pandas.core.frame.DataFrame'>\n", | ||
"RangeIndex: 801 entries, 0 to 800\n", | ||
"Columns: 20531 entries, gene_0 to gene_20530\n", | ||
"dtypes: float64(20531)\n", | ||
"memory usage: 125.5 MB\n", | ||
"None\n", | ||
"\n", | ||
"Deserialized y:\n", | ||
"0 PRAD\n", | ||
"1 LUAD\n", | ||
"2 PRAD\n", | ||
"3 PRAD\n", | ||
"4 BRCA\n", | ||
"Name: Class, dtype: object\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import os\n", | ||
"import pandas as pd\n", | ||
"import pickle\n", | ||
"\n", | ||
"# Define serialization functions\n", | ||
"def pickle_serialize_object(filename, obj):\n", | ||
" with open(filename, 'wb') as f:\n", | ||
" pickle.dump(obj, f)\n", | ||
"\n", | ||
"def pickle_deserialize_object(filename):\n", | ||
" with open(filename, 'rb') as f:\n", | ||
" return pickle.load(f)\n", | ||
"\n", | ||
"# Step 1: Load the CSV files\n", | ||
"data_path = '../data/raw/data.csv'\n", | ||
"labels_path = '../data/raw/labels.csv'\n", | ||
"\n", | ||
"data = pd.read_csv(data_path)\n", | ||
"label = pd.read_csv(labels_path)\n", | ||
"\n", | ||
"# Step 2: Prepare data for serialization\n", | ||
"X = data.drop(columns=['Unnamed: 0']) # Drop the 'Unnamed: 0' column if it's not needed\n", | ||
"\n", | ||
"# Drop the 'Unnamed: 0' column from label if present\n", | ||
"if 'Unnamed: 0' in label.columns:\n", | ||
" label = label.drop(columns=['Unnamed: 0'])\n", | ||
"\n", | ||
"y = label['Class'] # Use the 'Class' column as the target\n", | ||
"\n", | ||
"# Step 3: Inspect the data\n", | ||
"print(\"Data information:\")\n", | ||
"print(data.info())\n", | ||
"print(\"\\nLabel information:\")\n", | ||
"print(label.info())\n", | ||
"\n", | ||
"# Serialize the features and target\n", | ||
"processed_data_dir = '../data/processed/'\n", | ||
"\n", | ||
"os.makedirs(processed_data_dir, exist_ok=True)\n", | ||
"\n", | ||
"pickle_serialize_object(os.path.join(processed_data_dir, 'rna_seq_X.pkl'), X)\n", | ||
"pickle_serialize_object(os.path.join(processed_data_dir, 'rna_seq_y.pkl'), y)\n", | ||
"\n", | ||
"# Step 4: Deserialize the data\n", | ||
"X = pickle_deserialize_object(os.path.join(processed_data_dir, 'rna_seq_X.pkl'))\n", | ||
"y = pickle_deserialize_object(os.path.join(processed_data_dir, 'rna_seq_y.pkl'))\n", | ||
"\n", | ||
"# Verify the deserialized data\n", | ||
"print(\"\\nDeserialized X:\")\n", | ||
"print(X.info())\n", | ||
"\n", | ||
"print(\"\\nDeserialized y:\")\n", | ||
"print(y.head())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "41c68f93", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |