Add Azure how-to: Deploy Kubeflow pipelines with AKS spot instances #77

Merged 1 commit on Sep 23, 2023
6 changes: 6 additions & 0 deletions .wordlist.txt
@@ -169,4 +169,10 @@ RBAC
macOS
microservices
MLOps
ideation
hyperparameter
checkpointing
datastores
UI
backoff

2 changes: 1 addition & 1 deletion aws/index.rst
@@ -7,7 +7,7 @@ of products and services - offered by both Amazon Web Services (AWS) and Canonical. These images
have an optimised kernel that boots faster, has a smaller footprint and includes AWS-specific drivers.

**These images provide a foundation for deploying cloud-based software solutions,** specifically
for softwares built on Ubuntu and running on AWS. They focus on providing the optimal tools
for software built on Ubuntu and running on AWS. They focus on providing the optimal tools
and features needed to run specific workloads.

.. include:: ../reuse/common-intro.txt
(Six new image files added for this how-to; binary files are not displayed in the diff.)
186 changes: 186 additions & 0 deletions azure/azure-how-to/deploy-kubeflow-pipelines-with-aks-spot-instances.rst
@@ -0,0 +1,186 @@
Deploy Kubeflow pipelines with AKS spot instances
=================================================

`Charmed Kubeflow`_ is an MLOps platform that delivers an end-to-end solution for AI/ML applications. It includes Kubeflow Pipelines, an engine for orchestrating different MLOps workflows.

Kubeflow pipelines can be created using `Azure spot instances`_ on an AKS cluster. This can save costs as long as appropriate tasks are chosen for the spot instances and their eviction is handled gracefully.

.. hlist::
   :columns: 2

   * **Use spot instances for:**
   * --- data processing
   * --- distributed training & hyperparameter tuning
   * --- model training (with checkpointing)
   * --- batch inference
   * **Don't use spot instances for:**
   * --- Kubernetes control plane
   * --- notebooks and dashboards
   * --- datastores or databases
   * --- model serving for online inference


To create Kubeflow pipelines with spot instances, start with an AKS cluster containing an existing deployment of Charmed Kubeflow. Refer to :doc:`install-kubeflow-on-aks` for instructions.

Add spot instances
------------------

In the cluster details screen of your Kubeflow deployment on AKS, go to :guilabel:`Settings` > :guilabel:`Node pools` > :guilabel:`Add node pool`:

.. image:: deploy-kubeflow-pipelines-with-aks-spot-instances-images/kubeflow-aks-02.png
   :align: center

Select :guilabel:`Enable Azure Spot instances`:

.. image:: deploy-kubeflow-pipelines-with-aks-spot-instances-images/kubeflow-aks-03.png
   :align: center

Specify when and how your nodes should be evicted using :guilabel:`Eviction type` and :guilabel:`Eviction policy`. If there is a maximum price above which running your workload is not worthwhile, specify that as well:

.. image:: deploy-kubeflow-pipelines-with-aks-spot-instances-images/kubeflow-aks-04.png
   :align: center

Create the pool after specifying all the required fields. Once the VMs are up, check that all the nodes are available in the portal's :guilabel:`Node pools` tab or on the command line using ``kubectl get nodes``.
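
You can also verify this from Python. The following minimal sketch (an illustration, not part of the example repository) filters on the ``kubernetes.azure.com/scalesetpriority=spot`` label that AKS applies to spot nodes; it assumes the ``kubernetes`` Python package is installed and that your kubeconfig already points at the AKS cluster:

.. code::

   # Illustrative check; requires the "kubernetes" Python package.
   from kubernetes import client, config

   # Load credentials from the local kubeconfig
   # (for example, set up with "az aks get-credentials").
   config.load_kube_config()

   core = client.CoreV1Api()
   spot_nodes = core.list_node(
       label_selector="kubernetes.azure.com/scalesetpriority=spot")
   for node in spot_nodes.items:
       print(node.metadata.name)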


Update pipeline to allow task execution on spot instances
---------------------------------------------------------

To allow tasks to execute on spot instances, use the Kubernetes `toleration` mechanism. Add an appropriate toleration to each task, and if the tasks must run on a certain type of node (for example, one with a GPU), include an `affinity` configuration to restrict the nodes the pods can be scheduled on.

Sample code
~~~~~~~~~~~

To try out the changes needed, you can clone this `Kubeflow-examples`_ repository and use the notebooks from the ``aks-spot-instances`` folder to create new notebooks in your Kubeflow deployment. Run the first notebook (01-base-workload) and use the Kubeflow Pipelines UI to see the workflow run details:

.. image:: deploy-kubeflow-pipelines-with-aks-spot-instances-images/kubeflow-aks-10.png
   :align: center

To check where the workflow steps were executed, run ``kubectl get pods -n admin -o wide``. In the ``NODE`` column, you'll see that all the tasks were executed on ``agentpool``:

.. image:: deploy-kubeflow-pipelines-with-aks-spot-instances-images/kubeflow-aks-11.png
   :align: center

The code used to create the pipeline was:

.. code::

   import kfp
   from kfp import dsl

   # web_downloader_op and create_step_merge_csv are pipeline components
   # defined earlier in the notebook.
   @dsl.pipeline(
       name="base_pipeline",
       description="my base pipeline",
   )
   def my_pipeline(url):
       web_downloader_task = web_downloader_op(url=url)
       merge_csv_task = create_step_merge_csv(file=web_downloader_task.outputs['data'])

   client = kfp.Client()

   client.create_run_from_pipeline_func(
       my_pipeline,
       arguments={
           'url': 'https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz'
       })

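The two components used above (``web_downloader_op`` and ``create_step_merge_csv``) are created earlier in the notebook. As a rough sketch of what such definitions can look like with the KFP v1 SDK (the exact implementations in the example repository may differ):

.. code::

   # Illustrative component definitions; the example repository may define them differently.
   import kfp.components as comp

   def download_data(url: str, data_path: comp.OutputPath()):
       import urllib.request
       urllib.request.urlretrieve(url, data_path)

   def merge_csv(file_path: comp.InputPath('Tarball'), output_csv: comp.OutputPath('CSV')):
       import glob
       import tarfile
       import pandas as pd

       tarfile.open(name=file_path, mode="r|gz").extractall('data')
       df = pd.concat(
           [pd.read_csv(f, header=None) for f in glob.glob('data/*.csv')])
       df.to_csv(output_csv, index=False, header=False)

   # The '_path' suffix is stripped from output names, so the downloader's
   # output is referenced as outputs['data'] in the pipeline.
   web_downloader_op = comp.create_component_from_func(
       download_data, base_image='python:3.9')
   create_step_merge_csv = comp.create_component_from_func(
       merge_csv, base_image='python:3.9', packages_to_install=['pandas'])
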

Now, run the second notebook (02-spot-workload) to use spot instances. The pipeline definition in this code includes a `toleration` and an `affinity`. This configuration is done at the pipeline level and applies to each task; the tasks themselves are unaware of it.


.. code::

   import kfp
   from kfp import dsl
   from kubernetes.client.models import V1Toleration, V1Affinity, V1NodeAffinity, V1NodeSelector, V1NodeSelectorTerm, V1NodeSelectorRequirement

   @dsl.pipeline(
       name="base_pipeline",
       description="my base pipeline",
   )
   def my_pipeline(url):
       web_downloader_task = web_downloader_op(url=url)
       merge_csv_task = create_step_merge_csv(file=web_downloader_task.outputs['data'])

       # Tolerate the taint that AKS puts on spot nodes so the tasks are
       # allowed to be scheduled there.
       toleration = V1Toleration(
           key="kubernetes.azure.com/scalesetpriority",
           operator="Equal",
           value="spot",
           effect="NoSchedule",
       )
       web_downloader_task.add_toleration(toleration)
       merge_csv_task.add_toleration(toleration)

       # Require the tasks to be scheduled on spot nodes only.
       aff = V1Affinity(node_affinity=V1NodeAffinity(
           required_during_scheduling_ignored_during_execution=V1NodeSelector(
               node_selector_terms=[V1NodeSelectorTerm(
                   match_expressions=[V1NodeSelectorRequirement(
                       key='kubernetes.azure.com/scalesetpriority',
                       operator='In',
                       values=['spot'])])]))
       )
       web_downloader_task.add_affinity(aff)
       merge_csv_task.add_affinity(aff)

   client = kfp.Client()

   client.create_run_from_pipeline_func(
       my_pipeline,
       arguments={
           'url': 'https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz'
       })

The results will be the same as before, except for the node names. Run ``kubectl get pods -n admin -o wide | grep "spot-pipeline"`` to see that the tasks were executed on the ``spotpool``.

.. image:: deploy-kubeflow-pipelines-with-aks-spot-instances-images/kubeflow-aks-13.png
   :align: center


Handle eviction gracefully
--------------------------

To handle spot eviction gracefully, set the `retry` policy for all tasks that are to be executed on a spot instance. This ensures that when a task fails, the pipeline retries it after waiting for the configured ``backoff_duration``, and keeps doing so until the specified maximum number of retries (``num_retries``) is reached.

You can use the third notebook (03-spot-retries-workload) to try this out. Here, the backoff duration is set to 5 minutes to give a new node enough time to spawn, and the maximum retry count is set to 5:

.. code::

   import kfp
   from kfp import dsl
   from kubernetes.client.models import V1Toleration, V1Affinity, V1NodeAffinity, V1NodeSelector, V1NodeSelectorTerm, V1NodeSelectorRequirement

   @dsl.pipeline(
       name="base_pipeline",
       description="my base pipeline",
   )
   def my_pipeline(url):
       web_downloader_task = web_downloader_op(url=url)
       merge_csv_task = create_step_merge_csv(file=web_downloader_task.outputs['data'])

       # Tolerate the taint that AKS puts on spot nodes so the tasks are
       # allowed to be scheduled there.
       toleration = V1Toleration(
           key="kubernetes.azure.com/scalesetpriority",
           operator="Equal",
           value="spot",
           effect="NoSchedule",
       )
       web_downloader_task.add_toleration(toleration)
       merge_csv_task.add_toleration(toleration)

       # Require the tasks to be scheduled on spot nodes only.
       aff = V1Affinity(node_affinity=V1NodeAffinity(
           required_during_scheduling_ignored_during_execution=V1NodeSelector(
               node_selector_terms=[V1NodeSelectorTerm(
                   match_expressions=[V1NodeSelectorRequirement(
                       key='kubernetes.azure.com/scalesetpriority',
                       operator='In',
                       values=['spot'])])]))
       )
       web_downloader_task.add_affinity(aff)
       merge_csv_task.add_affinity(aff)

       # Retry evicted tasks: wait 5 minutes between attempts so a replacement
       # node has time to spawn, and stop after 5 retries.
       web_downloader_task.set_retry(num_retries=5, backoff_duration="5m")
       merge_csv_task.set_retry(num_retries=5, backoff_duration="5m")

   client = kfp.Client()

   client.create_run_from_pipeline_func(
       my_pipeline,
       arguments={
           'url': 'https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz'
       })

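As the number of tasks grows, adding the same toleration, affinity and retry policy to every task becomes repetitive. The sketch below (an illustration based on the KFP v1 SDK, not taken from the example repository) registers an op transformer on the pipeline configuration so that the same settings are applied to every task automatically:

.. code::

   # Illustrative consolidation using a KFP v1 op transformer.
   from kfp import dsl
   from kubernetes.client.models import (V1Affinity, V1NodeAffinity, V1NodeSelector,
                                         V1NodeSelectorRequirement, V1NodeSelectorTerm,
                                         V1Toleration)

   SPOT_KEY = "kubernetes.azure.com/scalesetpriority"

   def run_on_spot(op):
       """Apply the spot toleration, affinity and retry policy to a single task."""
       op.add_toleration(V1Toleration(
           key=SPOT_KEY, operator="Equal", value="spot", effect="NoSchedule"))
       op.add_affinity(V1Affinity(node_affinity=V1NodeAffinity(
           required_during_scheduling_ignored_during_execution=V1NodeSelector(
               node_selector_terms=[V1NodeSelectorTerm(
                   match_expressions=[V1NodeSelectorRequirement(
                       key=SPOT_KEY, operator="In", values=["spot"])])]))))
       op.set_retry(num_retries=5, backoff_duration="5m")
       return op

   @dsl.pipeline(
       name="base_pipeline",
       description="my base pipeline",
   )
   def my_pipeline(url):
       web_downloader_task = web_downloader_op(url=url)
       merge_csv_task = create_step_merge_csv(file=web_downloader_task.outputs['data'])

       # The transformer is applied to every task in this pipeline at compile time.
       dsl.get_pipeline_conf().add_op_transformer(run_on_spot)
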
.. _`Charmed Kubeflow`: https://charmed-kubeflow.io
.. _`Azure spot instances`: https://azure.microsoft.com/en-gb/products/virtual-machines/spot/
.. _`Kubeflow-examples`: https://github.com/Barteus/kubeflow-examples
1 change: 1 addition & 0 deletions azure/azure-how-to/index.rst
@@ -11,4 +11,5 @@ Step-by-step guides for some of the common tasks related to Ubuntu on Azure are
get-ubuntu-pro
create-pro-fips-golden-image
install-kubeflow-on-aks
deploy-kubeflow-pipelines-with-aks-spot-instances
upgrade-from-focal-to-jammy
2 changes: 1 addition & 1 deletion azure/index.rst
@@ -6,7 +6,7 @@ of products and services - offered by both Microsoft Azure and Canonical. These images have an optimised
kernel that boots faster, has a smaller footprint and includes Azure-specific drivers.

**These images provide a foundation for deploying cloud-based software solutions,** specifically
for softwares built on Ubuntu and running on Azure. They focus on providing the optimal tools
for software built on Ubuntu and running on Azure. They focus on providing the optimal tools
and features needed to run specific workloads.

.. include:: ../reuse/common-intro.txt
2 changes: 1 addition & 1 deletion google/index.rst
@@ -6,7 +6,7 @@ to a wide range of products and services - offered by both Google Cloud and Canonical. These images
have an optimised kernel that boots faster, has a smaller footprint and includes GCP-specific drivers.

**These images provide a foundation for deploying cloud-based software solutions,** specifically
for softwares built on Ubuntu and running on Google cloud. They focus on providing the optimal tools
for software built on Ubuntu and running on Google cloud. They focus on providing the optimal tools
and features needed to run specific workloads.

.. include:: ../reuse/common-intro.txt
2 changes: 1 addition & 1 deletion ibm/index.rst
@@ -7,7 +7,7 @@ of products and services - offered by both IBM cloud and Canonical. These images have an optimised
kernel that boots faster, has a smaller footprint and includes IBM-specific drivers.

**These images provide a foundation for deploying cloud-based software solutions,** specifically
for softwares built on Ubuntu and running on IBM cloud. They focus on providing the optimal tools
for software built on Ubuntu and running on IBM cloud. They focus on providing the optimal tools
and features needed to run specific workloads.

.. include:: ../reuse/common-intro.txt
2 changes: 1 addition & 1 deletion oracle/index.rst
@@ -6,7 +6,7 @@ of products and services - offered by both Oracle cloud and Canonical. These images have an optimised
kernel that boots faster, has a smaller footprint and includes Oracle-specific drivers.

**These images provide a foundation for deploying cloud-based software solutions,** specifically
for softwares built on Ubuntu and running on Oracle cloud. They focus on providing the optimal tools
for software built on Ubuntu and running on Oracle cloud. They focus on providing the optimal tools
and features needed to run specific workloads.

.. include:: ../reuse/common-intro.txt