Use AWS SageMaker to fine-tune a pretrained model for image classification, applying SageMaker profiling, debugging, hyperparameter tuning, and other good ML engineering practices. This can be done on either the provided dog breed classification dataset or one of your choice.
Enter AWS through the gateway in the course and open SageMaker Studio. Download the starter files and make the dataset available.
The provided dataset is the dog breed classification dataset, which can be found in the classroom. The project is designed to be dataset independent, so if there is a dataset that is more interesting or relevant to your work, you are welcome to use it to complete the project.
Upload the data to an S3 bucket through the AWS Gateway so that SageMaker has access to the data.
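One way to do this from a Studio notebook is with the SageMaker Python SDK; the sketch below assumes the dataset has been extracted to a local `dogImages/` folder and uses an assumed key prefix:

```python
import sagemaker

session = sagemaker.Session()

# Upload the local dataset folder (assumed to be "dogImages/" with train/valid/test
# splits) to the default SageMaker bucket under an assumed key prefix.
inputs = session.upload_data(
    path="dogImages",
    bucket=session.default_bucket(),
    key_prefix="dog-breed-classification",
)
print(inputs)  # e.g. s3://<default-bucket>/dog-breed-classification
```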
In this project, I decided to fine-tune a ResNet-101 image classification model. For this model, I tuned the following hyperparameters (see the tuner sketch after this list):
Learning Rate: The learning rate determines the step size at which the model adjusts its internal parameters during training. A high learning rate may cause the model to converge quickly but risk overshooting the optimal solution. Conversely, a low learning rate may result in slow convergence or getting stuck in suboptimal solutions. Finding an appropriate learning rate is essential for achieving faster convergence and better generalization.
Batch Size: The batch size refers to the number of training examples used in each iteration of gradient descent. A larger batch size can lead to faster training as more examples are processed simultaneously, but it requires more memory. On the other hand, a smaller batch size can provide a noisier estimate of the gradient but may allow for better generalization. Selecting an appropriate batch size depends on factors such as available computational resources, dataset size, and model complexity.
Number of Epochs: An epoch represents a complete pass through the entire training dataset. Training for too few epochs may result in underfitting, where the model fails to capture complex patterns in the data. Conversely, training for too many epochs can lead to overfitting, where the model becomes too specialized to the training data and performs poorly on unseen examples. Determining the optimal number of epochs involves finding a balance between underfitting and overfitting by monitoring performance on a validation set.
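A minimal sketch of how the search over these hyperparameters can be wired up with the SageMaker tuner is shown below. The entry point name (`hpo.py`), the metric regex, and the ranges are illustrative assumptions rather than the exact values used; `inputs` is the S3 URI from the upload step above.

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    CategoricalParameter,
    IntegerParameter,
)

# Estimator for the tuning jobs; "hpo.py" is an assumed training script name.
estimator = PyTorch(
    entry_point="hpo.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.8",
    py_version="py36",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
)

# Illustrative search space for the three hyperparameters discussed above.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([32, 64, 128]),
    "epochs": IntegerParameter(2, 5),
}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="average test loss",
    objective_type="Minimize",
    # The regex must match whatever the training script logs; this one is an assumption.
    metric_definitions=[
        {"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}
    ],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=4,
    max_parallel_jobs=2,
)

tuner.fit({"training": inputs})
```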
## Debugging and Profiling
In order to perform debugging and profiling, I used the SageMaker Debugger and Profiler tools. In practice, you need to modify your estimator definition by adding debugging and profiling configurations. Besides that, you also need to add debugger hooks to your training script so you can monitor the metrics you are interested in.
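As a sketch, the estimator for the final training job can be configured roughly as follows; the script name, rule selection, save intervals, and hyperparameter values are assumptions for illustration:

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import (
    Rule, ProfilerRule, rule_configs,
    DebuggerHookConfig, CollectionConfig,
    ProfilerConfig, FrameworkProfile,
)

# Built-in rules that watch the emitted tensors and the instance metrics.
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

# Save the loss, weight, and bias tensors to S3 at the given interval.
hook_config = DebuggerHookConfig(
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "25"}),
        CollectionConfig(name="weights"),
        CollectionConfig(name="biases"),
    ]
)

# Sample system metrics every 500 ms and profile a few framework steps.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)

estimator = PyTorch(
    entry_point="train_model.py",  # assumed training script name
    role=sagemaker.get_execution_role(),
    framework_version="1.8",
    py_version="py36",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    hyperparameters={"lr": 0.01, "batch_size": 64, "epochs": 4},  # e.g. best values from tuning
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config,
)

# Inside the training script (train_model.py), the hook is created from the job
# configuration and registered on the model and the loss, e.g.:
#   import smdebug.pytorch as smd
#   hook = smd.Hook.create_from_json_file()
#   hook.register_module(model)
#   hook.register_loss(criterion)
```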
For my particular experiment, I chose to monitor the loss, weights, and biases during the training process. The respective tensors are sampled and saved to an S3 bucket for further analysis. You can also programmatically create a trial and access the outputs:
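For instance, with the smdebug library; the tensor name below is an assumption that depends on the model and loss function used, and `estimator` refers to the training job defined above:

```python
from smdebug.trials import create_trial

# Point the trial at the debug output of the latest training job.
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

# List the tensors saved under the "losses" collection.
print(trial.tensor_names(collection="losses"))

# Fetch the saved values of a tensor across steps, e.g. the cross-entropy loss.
loss = trial.tensor("CrossEntropyLoss_output_0")
for step in loss.steps():
    print(step, loss.value(step))
```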
Then you can see, for instance, your train and validation loss curves:
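A plot like that can be produced from the trial object, for example (again, the tensor name is an assumption):

```python
import matplotlib.pyplot as plt
from smdebug.core.modes import ModeKeys

loss = trial.tensor("CrossEntropyLoss_output_0")

# Plot the loss values saved under the TRAIN and EVAL modes.
for mode, label in [(ModeKeys.TRAIN, "train"), (ModeKeys.EVAL, "validation")]:
    steps = loss.steps(mode=mode)
    values = [loss.value(s, mode=mode) for s in steps]
    plt.plot(steps, values, label=label)

plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```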
Besides that, the SageMaker Profiler generates an HTML report for you, analyzing your training job according to several different metrics:
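The report is written to the training job's rule output path in S3 and can be downloaded for local viewing; the sketch below assumes the `estimator` defined above and the default rule output layout:

```python
import sagemaker

# Assumed location of the ProfilerReport rule output for the latest training job.
rule_output_path = (
    f"{estimator.output_path.rstrip('/')}/"
    f"{estimator.latest_training_job.job_name}/rule-output"
)

# Download the folder containing profiler-report.html.
sagemaker.s3.S3Downloader.download(rule_output_path, "profiler_report")
```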
Fortunately, my model didn't trigger any alerts, and the loss curve is not too noisy. However, looking at the CPU utilization metrics, it seems the instance I chose (ml.c5.2xlarge) is underutilized, since it rarely exceeds 37% CPU utilization. Hence, if I were to build a retraining pipeline or simply repeat the experiment in the future, I could choose a smaller instance and incur lower costs.
The model that was fine-tuned in the previous step was deployed to a SageMaker endpoint. In summary, you need to create an inference script that will run on the endpoint instance and be responsible for making predictions on incoming data. Besides that, it is important to define a predictor, which serializes the data to be sent to the endpoint and deserializes the received responses.
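A minimal sketch of this setup is shown below; the inference script name, serialization format, instance type, and test image path are assumptions, and `estimator` refers to the training job from the previous step:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import IdentitySerializer
from sagemaker.deserializers import JSONDeserializer

# A predictor that sends raw JPEG bytes and parses the JSON response.
class ImagePredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session=None):
        super().__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=IdentitySerializer("image/jpeg"),
            deserializer=JSONDeserializer(),
        )

# "inference.py" (assumed name) implements model_fn / input_fn / predict_fn
# and runs on the endpoint instance.
model = PyTorchModel(
    model_data=estimator.model_data,
    role=sagemaker.get_execution_role(),
    entry_point="inference.py",
    framework_version="1.8",
    py_version="py36",
    predictor_cls=ImagePredictor,
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Usage: send a local test image and receive the class scores.
with open("test_dog.jpg", "rb") as f:
    response = predictor.predict(f.read())
print(response)
```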