Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Ensure no 6+ minute delays for disrupted stateful workloads #6484

Merged
merged 3 commits into from
Aug 2, 2024

Conversation

AndrewSirenko
Copy link
Member

Fixes #N/A

Description

Add tests that validate Karpenter PR 1294's solution to 6+ minute delays for disrupted stateful workloads outlined in #6336.

Add two storage E2E tests that ensure:

  • Disrupted stateful workloads run on a new node within 5 minutes
  • Un-drain-able stateful workloads (due to tolerations) do not block node deletion

Also adds read RBAC permissions for volumeattachment objects. (Let me know if this should be a separate fix PR)

How was this change tested?

Running FOCUS="StatefulSets" make e2etests locally on cluster with and without modified Karpenter from Karpenter PR 1294.

NOTE: The test should run a disrupted stateful workload on a new node within 5 minutes currently fails due to Karpenter PR 1294 not being merged. This test will validate that the 6+ minute delay is eliminated.

NOTE-2: We are not testing whether the 1+ minute EBS DetachVolume & EC2 TerminateInstances race delay occurs because the smaller time frame is likely to induce flakes.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AndrewSirenko AndrewSirenko requested a review from a team as a code owner July 10, 2024 14:14
Copy link

netlify bot commented Jul 10, 2024

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit 70c0c89
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/66ad58a397548800085dfdb1

@coveralls
Copy link

coveralls commented Jul 10, 2024

Pull Request Test Coverage Report for Build 10222622868

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 79.267%

Files with Coverage Reduction New Missed Lines %
pkg/providers/amifamily/ami.go 1 91.67%
Totals Coverage Status
Change from base Build 10222515675: -0.01%
Covered Lines: 5972
Relevant Lines: 7534

💛 - Coveralls

@arturkasperek
Copy link

@AndrewSirenko please also remember that not only statefulset should be handled. Also, we should handle cases where we have deployment with volume (single replica deployment)

@AndrewSirenko
Copy link
Member Author

@AndrewSirenko please also remember that not only statefulset should be handled. Also, we should handle cases where we have deployment with volume (single replica deployment)

I can remove "statefulset" from the test title to make it more clear this test implicitly covers both the statefulset and deployment cases. Thanks!

But I'm not sure if a separate test using deployment instead of statefulset would provide any additional coverage for karpenter (because the underlying mechanism for node draining and workload disruption is the same in both cases).

@jmdeal let me know if you think the additional test provides additional coverage, or would just take up computing time.

@AndrewSirenko
Copy link
Member Author

/hold

Until Karpenter PR 1294 is merged

test/suites/storage/suite_test.go Outdated Show resolved Hide resolved
test/suites/storage/suite_test.go Outdated Show resolved Hide resolved
test/suites/storage/suite_test.go Outdated Show resolved Hide resolved
@AndrewSirenko
Copy link
Member Author

/hold

Testing e2e tests against kubernetes-sigs/karpenter#1294

@AndrewSirenko
Copy link
Member Author

/karpenter snapshot

@AndrewSirenko
Copy link
Member Author

x-ref: Testing e2e tests against kubernetes-sigs/karpenter#1294 stacked on top of #6614 here: https://github.com/aws/karpenter-provider-aws/pull/6615/commits

Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/karpenter snapshot

Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/karpenter snapshot

Copy link
Contributor

github-actions bot commented Aug 2, 2024

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-7f0247a634554de8aa3b50968b3869435195b629.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-7f0247a634554de8aa3b50968b3869435195b629" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

jonathan-innis
jonathan-innis previously approved these changes Aug 2, 2024
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

jonathan-innis
jonathan-innis previously approved these changes Aug 2, 2024
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@jonathan-innis jonathan-innis merged commit b345a32 into aws:main Aug 2, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants