Add taint to user and worker nodes #2605

Open
wants to merge 37 commits into
base: main
Conversation

Adam-D-Lewis
Member

@Adam-D-Lewis Adam-D-Lewis commented Aug 1, 2024

Reference Issues or PRs

Fixes #2507

  • I need to test running pods with Argo Workflow through Nebari Workflow Controller before merging this PR

What does this implement/fix?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

How to test this PR

A few possible ways to test:

  • You could deploy without specifying taints, then add taints, redeploy, and make sure things still work.
  • You could deploy with taints specified and make sure things work.
  • You could do a local deployment and make sure it deploys and things work.

Things working is defined as:

  • JupyterLab server spins up for the user
  • Dask scheduler and workers spin up for the user (when using the built-in nebari-git-dask environment). Use some code like this

Any other comments?

@@ -41,10 +41,33 @@ class ExistingInputVars(schema.Base):
kube_context: str


class DigitalOceanNodeGroup(schema.Base):
Member Author

@Adam-D-Lewis Adam-D-Lewis Aug 19, 2024

Duplicate class, so I deleted it

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Aug 19, 2024

This method works as intended when tested on GCP. However, one issue is that certain daemonsets won't run on the tainted nodes. I saw the issue with the rook ceph csi-cephfsplugin daemonset from my rook PR, but I expect it would also be an issue for the monitoring daemonset pods. So we'd likely need to add the appropriate toleration to those daemonsets.

@@ -45,6 +45,13 @@ resource "helm_release" "rook-ceph" {
},
csi = {
enableRbdDriver = false, # necessary to provision block storage, but saves some cpu and memory if not needed
provisionerReplicas : 1, # default is 2 on different nodes
pluginTolerations = [
Member Author

This runs the CSI driver on all nodes, even those with NoSchedule taints, but not on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does, so I copied it here.

effect = "NoSchedule"
},
{
operator = "Exists"
Member Author

This runs promtail on all nodes, even those with NoSchedule taints, but not on nodes with NoExecute taints. This is what the nebari-prometheus-node-exporter daemonset does, so I copied it here. Promtail is what exports logs from the node, so we still want it to run on the user and worker nodes.

Comment on lines +100 to +109
{
key = "node-role.kubernetes.io/master"
operator = "Exists"
effect = "NoSchedule"
},
{
key = "node-role.kubernetes.io/control-plane"
operator = "Exists"
effect = "NoSchedule"
},
Member Author

These top two are the default values for this Helm chart.
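
As a rough illustration of the toleration behaviour described in the review comments above, here is a minimal Python sketch of how a toleration is matched against a taint. It simplifies the Kubernetes matching rules and is not the scheduler's actual code. The `Taint` and `Toleration` classes and the `untolerated_taints` helper are hypothetical names, and the `dedicated=user` taint is only an example value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    operator: str = "Equal"
    key: Optional[str] = None
    value: Optional[str] = None
    effect: Optional[str] = None


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    # A toleration without an effect matches every effect.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # A toleration without a key must use operator Exists and matches every key.
    if toleration.key is None:
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return (toleration.value or "") == taint.value


def untolerated_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Return the taints that none of the tolerations match."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# The toleration added in this diff: no key, operator Exists, effect NoSchedule.
plugin_toleration = Toleration(operator="Exists", effect="NoSchedule")
user_taint = Taint(key="dedicated", value="user", effect="NoSchedule")
evicting_taint = Taint(key="dedicated", value="user", effect="NoExecute")

blocked_by_no_schedule = untolerated_taints([user_taint], [plugin_toleration])
blocked_by_no_execute = untolerated_taints([evicting_taint], [plugin_toleration])
```

Under these simplified rules the keyless `Exists` toleration covers any NoSchedule taint regardless of key, while a NoExecute taint with the same key and value stays untolerated.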

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Aug 21, 2024

Okay, so things are working for the user node group. I tried adding a taint to the worker node group, but the dask scheduler won't run on the tainted worker node group. See this commit to see what I tried in a quick test. I do see the new scheduler_extra_pod_config value in /var/lib/dask-gateway/config.json in the dask gateway pod, but the scheduler tolerations look like this:

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

so I think possibly the merge isn't going as expected, but I need to verify. The docs say that "This dict will be deep merged with the scheduler pod spec (a V1PodSpec object) before submission. Keys should match those in the kubernetes spec, and should be camelCase."
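
For reference, here is a minimal sketch of what a deep merge like the one quoted from the docs would be expected to produce. This is a generic recursive dict merge written for illustration, not dask-gateway's implementation; `deep_merge`, the sample `pod_spec`, and the `dedicated=worker` toleration are all hypothetical. The two tolerations in the output above are, as far as I know, the ones Kubernetes adds to pods by default, so their presence alone suggests the extra toleration never reached the pod spec.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base.

    Nested dicts are merged key by key; any other value (including lists)
    in override replaces the value in base.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical scheduler pod spec before the extra config is applied.
pod_spec = {
    "restartPolicy": "Never",
    "containers": [{"name": "dask-scheduler"}],
}

# Hypothetical extra pod config, using camelCase keys as the docs require.
extra_pod_config = {
    "tolerations": [
        {
            "key": "dedicated",
            "operator": "Equal",
            "value": "worker",
            "effect": "NoSchedule",
        }
    ]
}

merged_spec = deep_merge(pod_spec, extra_pod_config)
```

If the merge ran as documented, `merged_spec` would carry the extra toleration alongside the untouched `restartPolicy` and `containers` keys.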

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Oct 25, 2024

I managed to get the tolerations applied to the scheduler pod in this commit. I would have expected the c.KubeClusterConfig.scheduler_extra_pod_config to get merged with the options returned by the function passed to c.Backend.cluster_options, but it wasn't.

  • I should verify this and maybe submit an issue to dask-gateway.

I still need to apply the toleration to the dask workers.
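
A minimal sketch of the approach, assuming the function passed to c.Backend.cluster_options can return per-cluster overrides for both scheduler_extra_pod_config and worker_extra_pod_config. The helper names `tolerations_for` and `cluster_profile` and the taint dict layout are hypothetical, not the code in this PR.

```python
def tolerations_for(taints):
    """Build Kubernetes tolerations (camelCase keys) that match the given taints."""
    return [
        {
            "key": taint["key"],
            "operator": "Equal",
            "value": taint["value"],
            "effect": taint["effect"],
        }
        for taint in taints
    ]


def cluster_profile(worker_taints):
    """Return option overrides that put the same tolerations on scheduler and worker pods."""
    return {
        "scheduler_extra_pod_config": {"tolerations": tolerations_for(worker_taints)},
        "worker_extra_pod_config": {"tolerations": tolerations_for(worker_taints)},
    }


# Hypothetical taint on the worker node group.
example_taints = [{"key": "dedicated", "value": "worker", "effect": "NoSchedule"}]
profile = cluster_profile(example_taints)
```

Setting the tolerations in the returned options, rather than relying on a merge with the static config, keeps the scheduler and worker pods consistent with each other.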

@@ -227,18 +229,23 @@ def base_username_mount(username, uid=1000, gid=100):
}


def worker_profile(options, user):
Member Author

I renamed this function since it affects the scheduler as well and not just the worker

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Oct 31, 2024

Okay, things were working as expected for the JupyterLab pod and the Dask worker and scheduler pods on GKE. I need to test on:

  • AWS
  • Azure

I also need to test:

  • running an Argo Workflows pod. (Update: This worked. The tolerations were copied over when run with jupyterflow-override.)

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Nov 4, 2024

I tested on GCP, Azure, and AWS, and it works well on all three. I did the following in my test:

  • Created an admin user
  • Launched a dask cluster
  • Verified taints set on user and worker nodes
  • Verified tolerations set on the jupyter user pod and the dask scheduler/worker pods

I also tested removing the taints on Azure and AWS and saw that the taints were removed successfully.

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Nov 4, 2024

  • I want to at least create an issue about prompting users on upgrade to ask whether they want to add the taints for potential cost reductions.

Update: done now - #2824

@@ -150,6 +201,22 @@ class AWSNodeGroupInputVars(schema.Base):
permissions_boundary: Optional[str] = None
ami_type: Optional[AWSAmiTypes] = None
launch_template: Optional[AWSNodeLaunchTemplate] = None
node_taints: list[dict]

@field_validator("node_taints", mode="before")
Member Author

@Adam-D-Lewis Adam-D-Lewis Nov 4, 2024

This code is repeated (see line 233 in this file) for the GCP and AWS NodeGroupInputVars classes, but that's because the format expected by the GCP and AWS terraform modules for taints happens to be the same. I think the required formats for the different modules could evolve separately, so I chose to duplicate the code in this case.
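
A minimal standard-library sketch of the kind of conversion such a validator might perform, written as plain functions rather than a pydantic field_validator. It assumes taints are written as key=value:Effect strings (as in dedicated=user:NoSchedule) and that the terraform modules want upper-snake-case effects such as NO_SCHEDULE. Both the helper names and the target format are assumptions, not the code in this PR.

```python
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}

# Assumed mapping from Kubernetes effect names to the terraform-side spelling.
TERRAFORM_EFFECTS = {
    "NoSchedule": "NO_SCHEDULE",
    "PreferNoSchedule": "PREFER_NO_SCHEDULE",
    "NoExecute": "NO_EXECUTE",
}


def parse_taint(taint: str) -> dict:
    """Parse a 'key=value:Effect' string into a dict with key, value, and effect."""
    key_value, separator, effect = taint.rpartition(":")
    if not separator or effect not in VALID_EFFECTS:
        raise ValueError(f"taint {taint!r} must end with one of {sorted(VALID_EFFECTS)}")
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"taint {taint!r} is missing a key")
    return {"key": key, "value": value, "effect": effect}


def to_terraform_taint(taint: str) -> dict:
    """Convert a taint string to the dict shape assumed for the terraform modules."""
    parsed = parse_taint(taint)
    parsed["effect"] = TERRAFORM_EFFECTS[parsed["effect"]]
    return parsed


example = to_terraform_taint("dedicated=user:NoSchedule")
```

Keeping one copy of this per provider class, as the comment above argues, would let each copy change independently if a module's expected format diverges.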

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Nov 6, 2024

We should add some instructions to the docs about adding other node groups (e.g. GPUs). Users should add the user taint to other user node profiles in order to prevent the same issue this PR prevents.

@Adam-D-Lewis Adam-D-Lewis added this to the 2024.11.2 release milestone Nov 7, 2024
@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Dec 10, 2024

Rather than making users put a consistent taint on each node group, maybe we should just have a "type" field on node groups to simplify this. It's less flexible, but I think it's flexible enough for the use cases we expect.
This would replace taints: [dedicated=user:NoSchedule].

google_cloud_platform:
  project: qhub-279316
  region: us-central1
  kubernetes_version: 1.28.9-gke.1289000
  tags:
  - "nebari-quansight-dev"
  node_groups:
    general:
      instance: n1-standard-8
      min_nodes: 1
      max_nodes: 1
      type: general   # NEW

    user:
      instance: n1-standard-4
      min_nodes: 0
      max_nodes: 200
      type: user   # NEW

    large:
      instance: n1-standard-8
      min_nodes: 0
      max_nodes: 200
      type: user   # NEW

    worker:
      instance: n1-standard-4
      min_nodes: 0
      max_nodes: 1000
      type: worker   # NEW

^ We discussed this in a group meeting. This is not the ideal solution. Instead, we should just:

  • add a default value for the taints so if unspecified the taint is applied.
  • use a single taint for user and worker nodes.
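
A rough sketch of the default-taint idea from the two points above, under stated assumptions: node groups are plain dicts keyed by name, only groups named user and worker are tainted by default, and an explicitly configured taints list (even an empty one) is left alone. `DEFAULT_TAINT`, `TAINTED_BY_DEFAULT`, and `apply_default_taints` are hypothetical names, and the taint value is a placeholder rather than the one chosen in this PR.

```python
import copy

# Placeholder taint shared by user and worker node groups (hypothetical value).
DEFAULT_TAINT = "dedicated=example:NoSchedule"

# Assumption: only these node groups receive the default taint.
TAINTED_BY_DEFAULT = ("user", "worker")


def apply_default_taints(node_groups: dict) -> dict:
    """Return a copy of node_groups where groups without 'taints' get a default."""
    result = copy.deepcopy(node_groups)
    for name, group in result.items():
        if "taints" not in group:
            # Unspecified: taint user/worker groups, leave other groups untainted.
            group["taints"] = [DEFAULT_TAINT] if name in TAINTED_BY_DEFAULT else []
    return result


example_groups = {
    "general": {"instance": "n1-standard-8"},
    "user": {"instance": "n1-standard-4"},
    "worker": {"instance": "n1-standard-4", "taints": []},  # explicit opt-out
}
with_defaults = apply_default_taints(example_groups)
```

Treating an explicit empty list as an opt-out is one way to keep the default overridable without a separate flag.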

@viniciusdc viniciusdc removed this from the 2024.12.2 release milestone Dec 12, 2024
@Adam-D-Lewis
Member Author

Okay, this PR is ready for review!

@viniciusdc
Contributor

Hi, @Adam-D-Lewis. Were the changes we discussed last week, regarding the default taints and overrides, applied?

@Adam-D-Lewis
Member Author

Adam-D-Lewis commented Jan 6, 2025

Hi, @Adam-D-Lewis. Were the changes we discussed last week, regarding the default taints and overrides, applied?

Yes, they were in this commit. See the set_missing_taints_to_default_taints method as a starting point. Happy to answer any other questions or walk through it if needed.

@Adam-D-Lewis
Member Author

Looks like merging in main broke many tests.

@viniciusdc
Contributor

It looks like 'provider_enum_name_map' is not recognized, though it seems it was not removed...

@Adam-D-Lewis
Member Author

Failing test appears unrelated to this PR since it's a playwright test and this PR makes no changes to the UI/UX other than the command line when running nebari init.

@viniciusdc
Contributor

Thanks @Adam-D-Lewis, that's related to the recent conda-store update.

@marcelovilla
Member

@Adam-D-Lewis the failing tests have been addressed in #2911

Labels
None yet
Projects
Status: New 🚦
Development

Successfully merging this pull request may close these issues.

[BUG] - Nodes don't scale down on GKE and AKS
4 participants