NAME |
---|
Hyun Soo Kim |
Matthew Armstrong |
Phuong Thao Quach |
Suin Kang |
Zarren Ali |
This project applies Convolutional Neural Networks (CNNs), a class of machine learning models widely used in computer vision, to image classification. CNNs learn to extract features from images, and those features can then be used to classify new, similar images.
The project is divided into two primary tasks:
- CNN Encoder for Human Tissue Image Classification: This involves training a CNN encoder on a dataset of human tissue images to classify colon cancer (dataset 1). The outcomes are visualized using t-SNE (t-Distributed Stochastic Neighbor Embedding), a technique for high-dimensional data visualization.
- Feature Extraction and Model Evaluation Across Datasets: Utilizing the trained CNN encoder from the first task and a pre-trained CNN encoder from ImageNet, features are extracted from two additional datasets: a prostate cancer dataset (dataset 2) and an animal faces dataset (dataset 3). The project then focuses on training supervised machine learning models to evaluate and compare the performance of these CNN encoders across diverse datasets.
Training a CNN encoder presents specific challenges, such as the need for large, diverse image datasets and the computational demands of training complex models. To overcome these, strategies like data augmentation and the use of advanced hardware for faster processing are employed.
The performance of the models for both tasks is assessed using precision, recall, F1-score, and accuracy, reported alongside the support (the number of true samples) for each class.
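As a reminder of what these quantities mean, the following sketch computes them from scratch for a toy three-class problem. The function name `per_class_report` and the integer labels are illustrative assumptions; Scikit-Learn's `classification_report` reports the same quantities.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, F1, and support."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support[label],
        }
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, report


# Toy labels for three hypothetical classes (0, 1, 2)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, report = per_class_report(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}")
for label, scores in report.items():
    print(label, scores)
```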
Download links for the datasets required for this assignment are provided below. The first three links lead to the unprocessed data required by the project. The next four datasets were generated by feature extraction with two encoders: our ResNet18 model trained in Task 1 and a ResNet18 model pretrained with ImageNet weights. The final three links lead to sampled versions of datasets 1, 2, and 3, each comprising 100 images. Classes are distributed as evenly as possible within each sampled dataset; because 100 images cannot be split equally among three classes, one class has one more image than the others.
- Dataset 1 Original
- Dataset 2 Original
- Dataset 3 Original
- Dataset 2 Extracted by Task 1 Model
- Dataset 3 Extracted by Task 1 Model
- Dataset 2 Extracted by ImageNet Model
- Dataset 3 Extracted by ImageNet Model
- Sampled Dataset 1
- Sampled Dataset 2
- Sampled Dataset 3
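The near-even class split of the sampled datasets can be illustrated with a short sketch. This is not necessarily how the samples were actually drawn; the helper names, the toy file names, and the class labels `a`, `b`, `c` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_counts(total, num_classes):
    """Split `total` as evenly as possible; the first `total % num_classes` classes get one extra."""
    base, extra = divmod(total, num_classes)
    return [base + 1 if i < extra else base for i in range(num_classes)]


def sample_balanced(items, total, seed=0):
    """Draw `total` (path, label) pairs with a near-even number per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)
    labels = sorted(by_label)
    counts = balanced_counts(total, len(labels))
    sampled = []
    for label, count in zip(labels, counts):
        # sample without replacement within each class
        sampled.extend((path, label) for path in rng.sample(by_label[label], count))
    return sampled


# Toy stand-in for a dataset: 50 hypothetical image paths per class
items = [(f"img_{label}_{i}.png", label) for label in ("a", "b", "c") for i in range(50)]
subset = sample_balanced(items, 100)
print(balanced_counts(100, 3))  # [34, 33, 33]
print(len(subset))              # 100
```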
To successfully run the Python code in this repository, several libraries and dependencies need to be installed. The code primarily relies on popular Python libraries such as NumPy, Matplotlib, Pandas, Seaborn, and Scikit-Learn for data manipulation, statistical analysis, and machine learning tasks.
For deep learning models, the code uses PyTorch, including modules such as `torch.nn`, together with the companion `torchvision` package. Ensure that you have the latest version of PyTorch installed.
Additionally, the project uses the `Orion` library, an asynchronous hyperparameter optimization framework. It can be installed directly from its GitHub repository with `!pip install git+https://github.com/epistimio/orion.git@develop`, and its related `profet` package with `!pip install orion[profet]`.
Here is a comprehensive list of all the required libraries:
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-Learn
- PyTorch (along with `torch.nn`, `torch.optim`, `torch.utils.data`, etc.)
- Torchvision (including datasets, models, transforms)
- Orion (including the `profet` package)
- Argparse (for parsing command-line options)
- TSNE (from Scikit-Learn for dimensionality reduction techniques)
- KNeighborsClassifier, GridSearchCV (from Scikit-Learn for machine learning models)
- RandomForestClassifier (from Scikit-Learn for machine learning models)
- Classification metrics from Scikit-Learn (confusion_matrix, classification_report, etc.)
For visualization and data analysis, Matplotlib and Seaborn are extensively used. Ensure all these libraries are installed in your environment to avoid any runtime errors.
To install these libraries, you can use pip (Python's package installer). For most libraries, installation is as simple as running `pip install library-name`. For specific versions or sources, refer to the respective library documentation.
All notebooks were written in Google Colab and are intended for use in Google Colab only.
Open the notebook "task1_training_testing.ipynb".
CAUTION: Every dataset is available via `gdown` in the notebook. However, depending on which dataset you wish to use (the original with 6000 images or the sample with 100 images), read the instructions in the notebook carefully and adjust the code accordingly (comment/uncomment).
- How To Train?
  - Run the cell that imports the required libraries.
  - Run the cell section `1. Data Loading and Preprocessing`.
    - By default, the sample dataset (100 images) will be loaded.
  - Run the cell section `2. Training` for training and validation.
- How To Test?
  - No need to upload anything; the test run dataset is available for download via `gdown`.
  - Make sure you run the `1. Data Loading and Preprocessing` part.
  - The pretrained model from Task 1, `resnet18_model_98.pth`, is available via `gdown` in the `3. Testing` block.
  - Move the `.pth` file to the same directory as the notebook.
  - Run the cell section `3. Testing`.
  - Run the cell section `4. Feature extraction and t-SNE visualization`.
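The t-SNE step in the last section can be sketched as follows. This is a minimal illustration that assumes the features are already available as a NumPy array; here, random 512-dimensional vectors stand in for encoder outputs, and the perplexity value and cluster layout are arbitrary choices rather than the notebook's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted features: 3 classes, 30 samples each, 512 dims
features = np.vstack(
    [rng.normal(loc=center, scale=1.0, size=(30, 512)) for center in (0.0, 3.0, 6.0)]
)
labels = np.repeat([0, 1, 2], 30)

# Project the features to 2-D; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(features)

print(embedding.shape)  # (90, 2)
```

The two columns of `embedding` can then be drawn as a scatter plot (for example with Matplotlib or Seaborn), colored by `labels`, to see how well the encoder separates the classes.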
- For Feature Extraction and t-SNE: Run the notebook titled "Task2_Feature_Extraction.ipynb". If you want to save the extracted datasets as CSV files, run the code under "Save dataset to csv file". Otherwise, skip these code blocks.
- For KNN classification: Run the notebook titled "Task2_KNN.ipynb".
- For RF classification: Run the notebook titled "Task2_RF.ipynb".
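The classification stage of Task 2 can be sketched as below: a KNN classifier tuned with `GridSearchCV` and a random forest, both trained on feature vectors and scored with `classification_report`. The synthetic data from `make_classification`, the 64-dimensional feature size, and the hyperparameter grid are assumptions for illustration only; the notebooks operate on the extracted dataset features and may use different settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for extracted features (3 classes)
X, y = make_classification(
    n_samples=300, n_features=64, n_informative=16, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN: choose the number of neighbours by 3-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=3)
search.fit(X_train, y_train)
knn_accuracy = search.score(X_test, y_test)

# Random forest with a fixed, illustrative configuration
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
rf_accuracy = forest.score(X_test, y_test)

print("best KNN params:", search.best_params_)
print(f"KNN accuracy: {knn_accuracy:.3f}, RF accuracy: {rf_accuracy:.3f}")
rf_report = classification_report(y_test, forest.predict(X_test))
print(rf_report)
```

Running the same pipeline once on features from the Task 1 encoder and once on features from the ImageNet encoder gives a like-for-like comparison of the two encoders.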
To run the pre-trained models on the provided sample test datasets, follow the instructions below for each notebook:
- For Task 1, open the notebook titled "task1_training_testing.ipynb" and follow the instructions in its code cells. If the instructions in the notebook differ from those given here, follow the notebook.
  - Besides being submitted in the .zip file, the sample dataset is also available via `gdown`, so you do not have to upload anything on your end.
  - The pretrained model from Task 1, `resnet18_model_98.pth`, is available via `gdown` in the `3. Testing` block.
- For Task 2, open the notebook titled "Task2_Feature_Extraction.ipynb" and run the code cells one by one, following the instructions in the notebook. If the instructions in the notebook differ from those given here, follow the notebook.
  - All the sample datasets are downloaded via `gdown` in the notebook, so you do not have to upload anything on your end.