In this tutorial you will learn the basics of Dask, specifically the following:
- How to parallelize simple for-loop code with Dask Futures
- How to scale up Pandas code with Dask Dataframes
Additionally you will work on real large scale data using a cluster of machines on the cloud
-
Clone this repository
In your terminal:
git clone https://github.com/mrocklin/dask-tutorial cd dask-tutorial
Alternatively, you can download the zip file of the repository at the top of the main page of the repository. This is a good option if you don't have experience with git.
-
Create Conda Environment
In your terminal navigate to the directory where you have cloned/downloaded the
dask-tutorial
repository and install the required packages:conda env create -f binder/environment.yml
This will create a new environment called
dask-tutorial
. To activate the environment do:conda activate dask-tutorial
Alternatively, you can run
pip install -r requirements.txt
. This may or may not work as well. We recommend doing this from a fresh Python environment (this will make synchronizing with your cluster easier). -
Establish Coiled Access
This tutorial will use Dask clusters on the cloud. We will get these clusters using a SaaS product, Coiled. You can either ...
-
Sign up (it's free and there's no commitment) as follows:
coiled login
You'll be asked to authenticate with GitHub to make an account. Don't worry about connecting to your cloud resources. We'll add you to the
dask-tutorials
team, which is connected to an AWS account of ours.To get this access, ask to be added in the #dask-tutorial channel. You'll also want to set your default account to
dask-tutorials
:coiled config set account dask-tutorials
Alternatively, you can also ...
-
Use a short-lived auth token
coiled login --token 65924ef194cc4b658ff37c1c11caa357-2ad71e4ceeafd5a771f553306cff95eb9624ee2d --account dask-tutorials
This should just work, but will expire in a few days and you won't be able to access the web view.
-
-
Open Jupyter Lab
Once your environment has been activated and you are in the
dask-tutorial
repository, start Jupyter Lab:jupyter lab
You will see a notebooks directory, click on there and you will be ready to go.
We recommend Jupyter Lab due to the Dask Jupyter extension.
-
Setup virtual environment
conda create -n dask-tutorial python=3.10 coiled jupyter conda activate dask-tutorial
-
Establish Coiled Access
-
Sign up (it's free and there's no commitment) as follows:
coiled login
You'll be asked to authenticate with GitHub to make an account. Don't worry about connecting to your cloud resources. We'll add you to the
dask-tutorials
team, which is connected to an AWS account of ours.To get this access, ask to be added in the #dask-tutorial channel. You'll also want to set your default account to
dask-tutorials
:coiled config set account dask-tutorials
Alternatively, you can also ...
-
Use a short-lived auth token
coiled login --token 65924ef194cc4b658ff37c1c11caa357-2ad71e4ceeafd5a771f553306cff95eb9624ee2d --account dask-tutorials
This should just work, but will expire in a few days and you won't be able to access the web view.
-
-
Start Coiled notebook
coiled notebook up --software jupytercon-notebook
Note: Don't forget to shut down your notebook after you're done!
The website mybinder.org serves pre-configured Jupyter notebooks for
free that you can also use. Here is the link → .
However, mybinder.org has tragically lost some of their funding recently, and so availability is not what it once was. We recommend running locally if possible.