In this workshop, we will train a deep learning model in a distributed manner using Databricks. We will discuss how to use Delta Lake to prepare structured, semi-structured, or unstructured datasets, and Petastorm to distribute those datasets efficiently across a cluster. We will also cover how to use Horovod for distributed training on both CPU- and GPU-based hardware. This example aims to serve as a reusable template that you can tailor to your specific modeling needs.
The workshop consists of a series of Databricks notebooks split into two parts.
In part 1, we look at how to leverage Spark's parallelism to train deep learning models in a distributed manner. The notebooks outline the following:
- Data prep
  - How to create a Delta table from JPEG image sources using the binary file data source reader (see the first sketch after this list)
- Single-node training
- Distributed training (see the second sketch after this list)
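As a concrete reference for the data prep step above, here is a minimal sketch of reading JPEG files with Spark's binary file data source and persisting them as a Delta table. The source and destination paths are placeholder assumptions; in a Databricks notebook a `spark` session already exists.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "binaryFile" loads each file as a row with path, modificationTime,
# length, and content (the raw bytes) columns.
images_df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")        # keep only JPEG files
    .option("recursiveFileLookup", "true")    # descend into subdirectories
    .load("/tmp/images/raw")                  # placeholder source path
)

# Persist as a Delta table so downstream steps read a versioned, ACID dataset.
images_df.write.format("delta").mode("overwrite").save("/tmp/images/delta")
```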
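And here is a minimal sketch of the distributed training pattern, assuming a Databricks ML Runtime cluster where `HorovodRunner` and a `spark` session are available. The Delta path, column names (`features`, `label`), input shape, and `np=2` are illustrative assumptions, not the workshop's exact code.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf
from petastorm.spark import SparkDatasetConverter, make_spark_converter
from sparkdl import HorovodRunner  # ships with the Databricks ML Runtime

# Directory Petastorm uses to materialize the DataFrame as cached Parquet files.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")  # placeholder path

# Hypothetical prepared table with numeric "features" and integer "label" columns.
train_df = spark.read.format("delta").load("/tmp/images/train_features")
converter = make_spark_converter(train_df)

def train():
    hvd.init()  # one Horovod process per slot (CPU core or GPU)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),  # placeholder shape
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    # Scale the learning rate with the worker count, a standard Horovod idiom,
    # and wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size()))
    model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
    with converter.make_tf_dataset(batch_size=32,
                                   cur_shard=hvd.rank(),
                                   shard_count=hvd.size()) as dataset:
        # Petastorm yields namedtuple batches; map them to (features, label) pairs.
        ds = dataset.map(lambda batch: (batch.features, batch.label))
        model.fit(ds, steps_per_epoch=100, epochs=2,
                  callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
                  verbose=2 if hvd.rank() == 0 else 0)

# np=2 launches two parallel training processes; a negative np runs locally
# on the driver, which is handy for debugging before scaling out.
HorovodRunner(np=2).run(train)
```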
In part 2, we look at how to parallelize both hyperparameter tuning and model inference. We illustrate:
- Model tuning with Hyperopt (see the first sketch after this list)
  - Tuning a single-node DL model with Hyperopt
  - Tuning a distributed Horovod process with Hyperopt
- Distributed model inference (see the second sketch after this list)
  - How to package a custom Pyfunc with preprocessing/post-processing steps
  - Applying that logged custom Pyfunc in a single-node inference setting
  - Applying that logged custom Pyfunc in a distributed inference setting
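Below is a minimal sketch of the single-node tuning pattern with Hyperopt's `SparkTrials`, which runs each trial as a Spark task so independent training jobs evaluate in parallel across the cluster. The stand-in objective, search space, and parallelism are illustrative assumptions. For tuning a Horovod process, the same `fmin` call is typically paired with Hyperopt's default `Trials` class instead, since each trial is itself distributed via `HorovodRunner`.

```python
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

def objective(params):
    # Stand-in for real training: in the workshop this would fit a single-node
    # DL model with these hyperparameters and return its validation loss.
    loss = (params["lr"] - 0.01) ** 2 + params["batch_size"] * 1e-5
    return {"loss": loss, "status": STATUS_OK}

search_space = {
    "lr": hp.loguniform("lr", -7, -2),                    # roughly 1e-3 to 0.14
    "batch_size": hp.choice("batch_size", [32, 64, 128]),
}

# SparkTrials fans the trials out across the cluster's workers.
trials = SparkTrials(parallelism=4)  # placeholder parallelism
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=16, trials=trials)
print(best)
```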
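And a minimal sketch of the inference pattern: a custom MLflow Pyfunc wrapping placeholder pre/post-processing around a model, logged once and then applied both on a single node and as a Spark UDF. The class name, artifact path, and commented-out scoring calls are hypothetical.

```python
import mlflow
import mlflow.pyfunc
import pandas as pd

class ImageClassifierWrapper(mlflow.pyfunc.PythonModel):  # placeholder name
    """Pyfunc bundling preprocessing and post-processing with the model."""

    def load_context(self, context):
        # Load the underlying trained model from the logged artifact.
        import tensorflow as tf
        self.model = tf.keras.models.load_model(context.artifacts["keras_model"])

    def _preprocess(self, df: pd.DataFrame):
        # Placeholder preprocessing, e.g. decoding and scaling image bytes.
        return df.values

    def _postprocess(self, raw):
        # Placeholder post-processing, e.g. mapping logits to class indices.
        return pd.Series(raw.argmax(axis=1))

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        return self._postprocess(self.model.predict(self._preprocess(model_input)))

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        "model",
        python_model=ImageClassifierWrapper(),
        artifacts={"keras_model": "/dbfs/tmp/trained_keras_model"},  # placeholder
    )

model_uri = f"runs:/{run.info.run_id}/model"

# Single-node inference: load the logged Pyfunc and score a pandas DataFrame.
loaded_model = mlflow.pyfunc.load_model(model_uri)
# predictions = loaded_model.predict(features_pdf)

# Distributed inference: the same logged Pyfunc applied as a Spark UDF.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri)
# scored_df = features_df.withColumn("prediction", predict_udf(*features_df.columns))
```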
Databricks ML Runtime 7.3 LTS or above is recommended. Please use the Repos feature to clone this repository and access the notebooks.