Doc: gang-scheduling for kubeflow training-operator (#3851)
Signed-off-by: Fabio Grätz <[email protected]>
Co-authored-by: Fabio Grätz <[email protected]>
fg91 and Fabio Grätz authored Jul 10, 2023
1 parent b46fbd2 commit 49868b6
Showing 2 changed files with 65 additions and 0 deletions.
2 changes: 2 additions & 0 deletions rsts/deployment/configuration/general.rst
@@ -290,6 +290,8 @@ The legacy technique is to set configuration options in Flyte's K8s plugin confi
These two approaches can be used simultaneously, where the K8s plugin configuration will override the default PodTemplate values.
.. _using-k8s-podtemplates:

*******************************
Using K8s PodTemplates
*******************************
63 changes: 63 additions & 0 deletions rsts/deployment/plugins/k8s/index.rst
@@ -55,6 +55,69 @@ Install the K8S Operator
export KUBECONFIG=$KUBECONFIG:~/.kube/config:~/.flyte/k3s/k3s.yaml
kustomize build "https://github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.5.0" | kubectl apply -f -
**Optional: Using a Gang Scheduler**

With the default Kubernetes scheduler, some worker pods of a distributed training job may be scheduled
later than others because of resource constraints. This often causes the job to fail with a timeout error. To avoid
this, you can use a gang scheduler, which schedules the worker pods only once all of them can be scheduled at
the same time.
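
To make the "all or nothing" behaviour concrete, the following is a minimal sketch of a ``PodGroup`` object,
assuming the coscheduling plugin from the Kubernetes scheduler plugins and its ``PodGroup`` custom resource.
The object name and the values are hypothetical. With gang scheduling enabled, the training-operator is expected
to create such an object for each job, so you would normally not write it by hand:

.. code-block:: yaml

   # Illustrative only: name and values are hypothetical.
   apiVersion: scheduling.x-k8s.io/v1alpha1
   kind: PodGroup
   metadata:
     name: example-training-job
   spec:
     # No pod of the group is scheduled until this many pods fit at once.
     minMember: 4
     # How long the scheduler waits for the whole group before giving up.
     scheduleTimeoutSeconds: 60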

To `enable gang scheduling for the Kubeflow training-operator <https://www.kubeflow.org/docs/components/training/job-scheduling/>`_,
you can install the `Kubernetes scheduler plugins <https://github.com/kubernetes-sigs/scheduler-plugins/tree/master>`_:

1. Install the `scheduler plugin as a second scheduler <https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/manifests/install/charts/as-a-second-scheduler>`_.
2. Configure the Kubeflow training-operator to use the new scheduler:

Create a manifest called ``kustomization.yaml`` with the following content:

.. code-block:: yaml

   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   resources:
   - github.com/kubeflow/training-operator/manifests/overlays/standalone
   patchesStrategicMerge:
   - patch.yaml

Create a patch file called ``patch.yaml`` with the following content:

.. code-block:: yaml

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: training-operator
   spec:
     template:
       spec:
         containers:
         - name: training-operator
           command:
           - /manager
           - --gang-scheduler-name=scheduler-plugins

Install the patched kustomization with:

.. code-block:: bash

   kustomize build path/to/overlay/directory | kubectl apply -f -

3. Use a Flyte pod template with ``template.spec.schedulerName: scheduler-plugins-scheduler``
to use the new gang scheduler for your tasks.

See the :ref:`using-k8s-podtemplates` section for more information on pod templates in Flyte.
You can set the scheduler name in the pod template passed to the ``@task`` decorator. However, to prevent the
two schedulers from competing for resources, we recommend setting the scheduler name in the pod template
in the ``flyte`` namespace, which is applied to all tasks. Non-distributed training tasks can be scheduled by the
gang scheduler as well.
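
As a minimal sketch, a pod template in the ``flyte`` namespace that selects the gang scheduler could look like the
following. The template name ``flyte-template`` is an assumption and has to match the default pod template name
configured for your deployment. The ``default`` container and its no-op image are placeholders, included only
because Kubernetes requires a pod template to define at least one container:

.. code-block:: yaml

   apiVersion: v1
   kind: PodTemplate
   metadata:
     # Assumed name; use the default pod template name from your Flyte configuration.
     name: flyte-template
     namespace: flyte
   template:
     spec:
       # Hand pods created from this template to the second scheduler.
       schedulerName: scheduler-plugins-scheduler
       containers:
         # Placeholder container; Flyte fills in the actual task container.
         - name: default
           image: docker.io/rwgrim/docker-noop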



.. group-tab:: Ray

Install the Ray Operator: