Use AWS SageMaker to fine-tune a pretrained model for image classification, applying SageMaker profiling, debugging, hyperparameter tuning, and other good ML engineering practices. This can be done on either the provided dog breed classification dataset or one of your choice.
Enter AWS through the gateway in the course and open SageMaker Studio. Download the starter files and make the dataset available.
The provided dataset is the dog breed classification dataset, which can be found in the classroom. The project is designed to be dataset independent, so if there is a dataset that is more interesting or relevant to your work, you are welcome to use it to complete the project.
Upload the data to an S3 bucket through the AWS Gateway so that SageMaker has access to the data.
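One way to do this from a Studio notebook is with the SageMaker Python SDK; the sketch below assumes the dataset has been extracted to a local `dogImages/` folder and uses an assumed key prefix:

```python
import sagemaker

session = sagemaker.Session()

# Upload the local dataset folder (assumed to be "dogImages/" with train/valid/test
# splits) to the default SageMaker bucket under an assumed key prefix.
inputs = session.upload_data(
    path="dogImages",
    bucket=session.default_bucket(),
    key_prefix="dog-breed-classification",
)
print(inputs)  # e.g. s3://<default-bucket>/dog-breed-classification
```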
In this project, I decided to fine-tune a ResNet-101 image classification model. For this model, I tuned the following hyperparameters (see the tuner sketch after this list):
Learning Rate: The learning rate determines the step size at which the model adjusts its internal parameters during training. A high learning rate may cause the model to converge quickly but risk overshooting the optimal solution. Conversely, a low learning rate may result in slow convergence or getting stuck in suboptimal solutions. Finding an appropriate learning rate is essential for achieving faster convergence and better generalization.
Batch Size: The batch size refers to the number of training examples used in each iteration of gradient descent. A larger batch size can lead to faster training as more examples are processed simultaneously, but it requires more memory. On the other hand, a smaller batch size can provide a noisier estimate of the gradient but may allow for better generalization. Selecting an appropriate batch size depends on factors such as available computational resources, dataset size, and model complexity.
Number of Epochs: An epoch represents a complete pass through the entire training dataset. Training for too few epochs may result in underfitting, where the model fails to capture complex patterns in the data. Conversely, training for too many epochs can lead to overfitting, where the model becomes too specialized to the training data and performs poorly on unseen examples. Determining the optimal number of epochs involves finding a balance between underfitting and overfitting by monitoring performance on a validation set.
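A minimal sketch of how the search over these hyperparameters can be wired up with the SageMaker tuner is shown below. The entry point name (`hpo.py`), the metric regex, and the ranges are illustrative assumptions rather than the exact values used; `inputs` is the S3 URI from the upload step above.

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    CategoricalParameter,
    IntegerParameter,
)

# Estimator for the tuning jobs; "hpo.py" is an assumed training script name.
estimator = PyTorch(
    entry_point="hpo.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.8",
    py_version="py36",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
)

# Illustrative search space for the three hyperparameters discussed above.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([32, 64, 128]),
    "epochs": IntegerParameter(2, 5),
}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="average test loss",
    objective_type="Minimize",
    # The regex must match whatever the training script logs; this one is an assumption.
    metric_definitions=[
        {"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}
    ],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=4,
    max_parallel_jobs=2,
)

tuner.fit({"training": inputs})
```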
## Debugging and Profiling
In order to perform debugging and profiling, I used the SageMaker Debugger and Profiler tools. In practice, you need to modify your estimator definition by adding debugging and profiling configurations. Besides that, you also need to add debugger hooks to your training script so you can monitor the metrics you are interested in.
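As a sketch, the estimator for the final training job can be configured roughly as follows; the script name, rule selection, save intervals, and hyperparameter values are assumptions for illustration:

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import (
    Rule, ProfilerRule, rule_configs,
    DebuggerHookConfig, CollectionConfig,
    ProfilerConfig, FrameworkProfile,
)

# Built-in rules that watch the emitted tensors and the instance metrics.
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

# Save the loss, weight, and bias tensors to S3 at the given interval.
hook_config = DebuggerHookConfig(
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "25"}),
        CollectionConfig(name="weights"),
        CollectionConfig(name="biases"),
    ]
)

# Sample system metrics every 500 ms and profile a few framework steps.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)

estimator = PyTorch(
    entry_point="train_model.py",  # assumed training script name
    role=sagemaker.get_execution_role(),
    framework_version="1.8",
    py_version="py36",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    hyperparameters={"lr": 0.01, "batch_size": 64, "epochs": 4},  # e.g. best values from tuning
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config,
)

# Inside the training script (train_model.py), the hook is created from the job
# configuration and registered on the model and the loss, e.g.:
#   import smdebug.pytorch as smd
#   hook = smd.Hook.create_from_json_file()
#   hook.register_module(model)
#   hook.register_loss(criterion)
```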
For my particular experiment, I chose to monitor the loss, weights, and biases during the training process. The respective tensors are sampled and saved to an S3 bucket for further analysis. You can also programmatically create a trial and access the outputs:
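For instance, with the smdebug library; the tensor name below is an assumption that depends on the model and loss function used, and `estimator` refers to the training job defined above:

```python
from smdebug.trials import create_trial

# Point the trial at the debug output of the latest training job.
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

# List the tensors saved under the "losses" collection.
print(trial.tensor_names(collection="losses"))

# Fetch the saved values of a tensor across steps, e.g. the cross-entropy loss.
loss = trial.tensor("CrossEntropyLoss_output_0")
for step in loss.steps():
    print(step, loss.value(step))
```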
Then you can see, for instance, your train and validation loss curves:
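A plot like that can be produced from the trial object, for example (again, the tensor name is an assumption):

```python
import matplotlib.pyplot as plt
from smdebug.core.modes import ModeKeys

loss = trial.tensor("CrossEntropyLoss_output_0")

# Plot the loss values saved under the TRAIN and EVAL modes.
for mode, label in [(ModeKeys.TRAIN, "train"), (ModeKeys.EVAL, "validation")]:
    steps = loss.steps(mode=mode)
    values = [loss.value(s, mode=mode) for s in steps]
    plt.plot(steps, values, label=label)

plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```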
Besides that, the SageMaker Profiler generates an HTML report for you, analyzing your training job according to several different metrics:
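The report is written to the training job's rule output path in S3 and can be downloaded for local viewing; the sketch below assumes the `estimator` defined above and the default rule output layout:

```python
import sagemaker

# Assumed location of the ProfilerReport rule output for the latest training job.
rule_output_path = (
    f"{estimator.output_path.rstrip('/')}/"
    f"{estimator.latest_training_job.job_name}/rule-output"
)

# Download the folder containing profiler-report.html.
sagemaker.s3.S3Downloader.download(rule_output_path, "profiler_report")
```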
Fortunately, my model didn't trigger any alerts, and the loss curve is not too noisy. However, looking at the CPU utilization metrics, it seems the instance I chose (ml.c5.2xlarge) is underutilized, since it rarely exceeds 37% CPU utilization. Hence, if I were to build a retraining pipeline or simply repeat the experiment in the future, I could choose a smaller instance and incur lower costs.
The model that was fine-tuned in the previous step was deployed to a SageMaker endpoint. In summary, you need to create an inference script that will run on the endpoint instance and be responsible for making predictions on incoming data. Besides that, it is important to define a predictor, which serializes the data to be sent to the endpoint and deserializes the received responses.
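A minimal sketch of this setup is shown below; the inference script name, serialization format, instance type, and test image path are assumptions, and `estimator` refers to the training job from the previous step:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import IdentitySerializer
from sagemaker.deserializers import JSONDeserializer

# A predictor that sends raw JPEG bytes and parses the JSON response.
class ImagePredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session=None):
        super().__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=IdentitySerializer("image/jpeg"),
            deserializer=JSONDeserializer(),
        )

# "inference.py" (assumed name) implements model_fn / input_fn / predict_fn
# and runs on the endpoint instance.
model = PyTorchModel(
    model_data=estimator.model_data,
    role=sagemaker.get_execution_role(),
    entry_point="inference.py",
    framework_version="1.8",
    py_version="py36",
    predictor_cls=ImagePredictor,
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Usage: send a local test image and receive the class scores.
with open("test_dog.jpg", "rb") as f:
    response = predictor.predict(f.read())
print(response)
```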