Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: dedicated hosts selectors #4553

Conversation

preflightsiren
Copy link

@preflightsiren preflightsiren commented Sep 4, 2023

Fixes: #3182

Description
Wanted to create a draft PR to start a conversation about the approach. I believe following https://karpenter.sh/v0.27.3/contributing/design-guide/ that adding to the AWSNodeTemplate best follows "Identify an opinionated default that solves the majority of use cases."
Of course I'd appreciate any and all help to mould this design.

This design focuses on a simplified configuration, just the minimum needed to use host resource groups.

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector: { ... }        # required, discovers tagged subnets to attach to instances
  securityGroupSelector: { ... } # required, discovers tagged security groups to attach to instances
  licenseSelector: 
    name: myLicense
  hostResourceGroup: 
    name: my-hrg-name

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@preflightsiren preflightsiren requested a review from a team as a code owner September 4, 2023 06:28
@netlify
Copy link

netlify bot commented Sep 4, 2023

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 9fe0f2f
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/6553feb9f0f43700085e5771
😎 Deploy Preview https://deploy-preview-4553--karpenter-docs-prod.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@preflightsiren
Copy link
Author

Created this to split away from #3859 since the approach has significantly changed from when it was first opened.

@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch 2 times, most recently from cc3a842 to 3754d8e Compare September 15, 2023 00:44
@preflightsiren
Copy link
Author

Functionality wise this change is pretty complete, it's all wired up through the nodeclass controller, and extends the amifamily resolver to dynamicaly generate the Launch Template.

Todo:

  • update the website docs
  • fix/review the code comments
  • clean up any unused references to licenProvider
  • Add placement group provider

@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch from 3754d8e to 98ebfb1 Compare September 18, 2023 01:26
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level question: We are thinking of supporting only v1beta1 and dropping support for v1alpha5 in the near-term. Does it make sense to just release these fields only under the v1beta1 API, marking them as alpha fields so that we don't have to reason about the conversion logic in this PR. Up to this point, we've just been handling conversion logic for existing API and trying to push new features into the soon-to-be released beta API

pkg/apis/v1alpha1/provider.go Outdated Show resolved Hide resolved
pkg/apis/v1alpha1/awsnodetemplate.go Outdated Show resolved Hide resolved
pkg/apis/v1beta1/ec2nodeclass.go Show resolved Hide resolved
pkg/apis/v1beta1/ec2nodeclass.go Outdated Show resolved Hide resolved
pkg/apis/v1beta1/ec2nodeclass.go Outdated Show resolved Hide resolved
pkg/providers/launchtemplate/launchtemplate.go Outdated Show resolved Hide resolved
pkg/providers/hostresourcegroup/hostresourcegroup.go Outdated Show resolved Hide resolved
pkg/providers/hostresourcegroup/hostresourcegroup.go Outdated Show resolved Hide resolved
pkg/providers/hostresourcegroup/hostresourcegroup.go Outdated Show resolved Hide resolved
pkg/apis/v1beta1/ec2nodeclass.go Show resolved Hide resolved
@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch from 7eb2915 to 67d0ba7 Compare September 19, 2023 06:00
@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch 2 times, most recently from 963af8c to 062a0bc Compare September 21, 2023 01:42
@preflightsiren
Copy link
Author

Ok I think this PR is now essentially complete or very near to, and is ready for a thorough review.
I have addressed all the feedback so far and re-tested in our test clusters to ensure the selector logic works as expected.
I have added some test cases around the 3 different providers, updating all the test infrastructure to load fake APIs for the aws services, and plumb them into the providers should anyone want to expand the test usage. I have also added some sanity checks into the utils v1alpha1 <-> v1beta1 conversion to ensure they'll maintain compatibility.

@jonathan-innis
Copy link
Contributor

@preflightsiren Did you get a chance to see this comment here on the v1beta1 API? I think it could save you a lot of conversion logic if you just supported the single, newer API

High-level question: We are thinking of supporting only v1beta1 and dropping support for v1alpha5 in the near-term. Does it make sense to just release these fields only under the v1beta1 API, marking them as alpha fields so that we don't have to reason about the conversion logic in this PR. Up to this point, we've just been handling conversion logic for existing API and trying to push new features into the soon-to-be released beta API

@preflightsiren
Copy link
Author

@preflightsiren Did you get a chance to see this comment here on the v1beta1 API? I think it could save you a lot of conversion logic if you just supported the single, newer API

High-level question: We are thinking of supporting only v1beta1 and dropping support for v1alpha5 in the near-term. Does it make sense to just release these fields only under the v1beta1 API, marking them as alpha fields so that we don't have to reason about the conversion logic in this PR. Up to this point, we've just been handling conversion logic for existing API and trying to push new features into the soon-to-be released beta API

Thanks Jon, I must have only replied in my mind and not in github! reposting what I commented in slack:

  1. My team currently uses the v1alpha1.AWSNodeTemplate resources today, and would require an uplift to EC2NodeClass.
  2. The conversion logic is already implemented, is there value to removing it ahead of the v1alpha1 deprecations

@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch from 5b26af9 to 914e187 Compare October 9, 2023 00:50
@preflightsiren
Copy link
Author

As per our conversation: I've removed the v1alpha1 and conversion code for awsnodetemplates, all the placement selectors and status fields only apply to the v1beta1.ec2nodeclass resource.

I've documented the fields in the ec2nodeclass documents in preview

@jonathan-innis jonathan-innis self-assigned this Oct 9, 2023
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/karpenter snapshot

@preflightsiren
Copy link
Author

sorry @jonathan-innis, don't know how I missed that amongst all the make presubmit checks performed :(

@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch from f9788c7 to 03ba9be Compare October 20, 2023 08:07
@preflightsiren preflightsiren force-pushed the feat-3182-dedicated-hosts-selectors branch from 56d74e7 to 57c79ee Compare October 26, 2023 00:27
pkg/apis/v1beta1/ec2nodeclass_status.go Outdated Show resolved Hide resolved
@@ -174,6 +193,25 @@ func (r Resolver) Resolve(ctx context.Context, nodeClass *v1beta1.EC2NodeClass,
return resolvedTemplates, nil
}

func (r Resolver) resolvePlacement(nodeClass *v1beta1.EC2NodeClass) *Placement {
var placement *Placement
hrg := nodeClass.Status.HostResourceGroup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an asynchronous sync, how do we handle it if the current values in the spec haven't resolved to the status? Are we alright tolerating a bit of eventual consistency here or should we handle cases where the spec generation doesn't line up with the observed generation in the status? Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it all depends on how eventually consistent the system will be. I don't believe it's an issue unless the consistency goes beyond the observed behaviour (reconciled in seconds).

If only 1 property is set, provisioning will error and the reconciliation loop with start afresh.

I believe this is all a binary operation that will fail or succeed, and far less likely to partially succeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it's an issue unless the consistency goes beyond the observed behaviour

It's more of an issue about correctly handling the failure scenarios when we don't resolve the host resource group details for whatever reason. We now shouldn't launch instances since we aren't going to accurately represent a user's desired state for the instance but we haven't done any checking in our resolve path to ensure that's the case.

I think it's quite a bit safer to just wholesale transfer over to using status across all of our resolved fields, where we intentionally think over how we are going to handle this eventually consistent resolution of the status consistently across all fields.

@@ -161,6 +178,8 @@ func (r Resolver) Resolve(ctx context.Context, nodeClass *v1beta1.EC2NodeClass,
DetailedMonitoring: aws.BoolValue(nodeClass.Spec.DetailedMonitoring),
AMIID: amiID,
InstanceTypes: instanceTypes,
Licenses: nodeClass.Status.Licenses,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here on the licenses. I like that we are moving to using the resolved status information in this launch path (I think we want to go this direction long-term) but I also want to think about how we handle the eventual consistency problem here when we don't have a sync path to get this information. If you wanted a shorter path forward, you could just re-get the information here. The info is already cached in memory, so the expense of calling out to the provider again is minimal

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to leave this eventually consistent; but I'm happy to take input on this direction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably avoid using the status detail here for now until we thoroughly test using the status details or have a way for representing and checking the version of the spec that we have observed in the status (e.g. observedGeneration). Better for now to just re-get the info based on whatever the current spec is.

@@ -260,6 +260,8 @@ func (p *Provider) createLaunchTemplate(ctx context.Context, options *amifamily.
TagSpecifications: []*ec2.LaunchTemplateTagSpecificationRequest{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to add some testing to the launch template suite_test.go to handle these new parameters and make sure that everything resolves properly?

- arn:aws:license-manager:us-west-2:111122223333:license/lic-123456789
hostResourceGroup:
name: my-hrg
arn:aws:resource-groups:us-west-2:111122223333:group/my-hrg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just store the ARN in the host resource group in the same way that we are storing it with the other info?

logging.FromContext(ctx).Errorf("discovering placement groups, %w", err)
return nil, err
}
for i := range output.PlacementGroups {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for the selector to resolve into multiple Placement Groups? Is there a reason that you would want to resolve into multiple placement groups?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand your question. here: we're iterating through the list of placementgroups that is the output of the api response. We can only use a single Placement Group in any of our LaunchTemplate configurations, so we resolve this list down to a single placement group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the question was more if the API should be represented as "terms" or if it need only be represented as a selector. Is there a need to have the flexibility to select between two placement groups in the API? From what you are saying, I don't think so. We may only need a placementGroupSelector if we don't need the terms component

@@ -52,6 +52,15 @@ type EC2NodeClassSpec struct {
// +kubebuilder:validation:Enum:={AL2,Bottlerocket,Ubuntu,Custom,Windows2019,Windows2022}
// +required
AMIFamily *string `json:"amiFamily"`
// LicenseSelectorTerms is a list of LicenseSelectors. The terms are ORed.
// +optional
LicenseSelectorTerms []LicenseSelectorTerm `json:"licenseSelectorTerms,omitempty" hash:"ignore"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick high-level question: Which of these things actually need to be terms? Is there a need for someone to specify multiple of any of these things and have all of the options be passed through or us to somehow order and pick one based on some heuristic?

If there's not a reason that we should support more than one resolution of these things per-EC2NodeClass, it may be worth having this be a pure selector, no sets of terms that are ORed

}
}
}
if p.cm.HasChanged(fmt.Sprintf("placementgroup/%t/%s", nodeClass.IsNodeTemplate, nodeClass.Name), match) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if p.cm.HasChanged(fmt.Sprintf("placementgroup/%t/%s", nodeClass.IsNodeTemplate, nodeClass.Name), match) {
if p.cm.HasChanged(fmt.Sprintf("placementgroup/%s", nodeClass.IsNodeTemplate, nodeClass.Name), match) {

Same comment. If only for beta, no need to have the boolean


var match *ec2.PlacementGroup
// Look up all ec2 placement groups
output, err := p.ec2api.DescribePlacementGroupsWithContext(ctx, &ec2.DescribePlacementGroupsInput{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How big is this output? Is there a pages version so that we don't have to handle the massive output?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine it being large, and perhaps the API owners felt the same, hence no paged function - https://docs.aws.amazon.com/sdk-for-go/api/service/ec2/#EC2.DescribePlacementGroups


var licenses []string
// Look up all License Configurations
output, err := p.licensemanager.ListLicenseConfigurationsWithContext(ctx, &licensemanager.ListLicenseConfigurationsInput{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How big is this output? Is there a pages version that we can use to reduce the memory overhead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type: string
type: object
type: array
licenseSelectorTerms:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digging a little deeper into the docs, trying to understand why the license configuration piece is needed. Can you achieve dedicated hosts by associating license configurations with the AMIs that you are planning to launch? From what I can tell, if the AMI has an associated license configuration, you may not have a need to specify the configuration directly.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's two options:

  1. Associate it with each AMI
  2. Specify it in launch template

In some cases it makes sense to associate with AMIs, in others it does not, such as when the an AMI is used both with & without license rules or where doing so just adds unnecessary workflow steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use-case for an AMI being used without license rules? When you are bringing in software at runtime that will have a license attached to it? Why not bake it into the AMI in that case?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so dedicated host resource groups are a peculiar thing. They are implemented within license manager, but they don't have to have anything to do with software licenses https://docs.aws.amazon.com/license-manager/latest/userguide/host-resource-groups.html

You can use host resource groups to separate hosts by purpose, for example, development test hosts versus production, organizational unit, or license constraint.

We use them to get a dynamically allocated pool of dedicated hosts that can be shared across accounts.


var match *ec2.PlacementGroup
// Look up all ec2 placement groups
output, err := p.ec2api.DescribePlacementGroupsWithContext(ctx, &ec2.DescribePlacementGroupsInput{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are filters that we can pass into the placement group: https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-placement-groups.html. We may want to use this instead of getting all of the placement groups and filtering on the client-side.

@preflightsiren
Copy link
Author

After reviewing the work needed to finish this feature, I can't justify it. I'm happy for others to build off its current state.

@lorrrrrrrenzo
Copy link
Contributor

Hello @jonathan-innis, would you support a PR that specifically dealt with Placement Groups aspect of the work here? I'm willing to pick up a subset of the work done by @preflightsiren here, but cannot justify working on the full set of features it is trying to add.

Thanks.

@lorrrrrrrenzo lorrrrrrrenzo mentioned this pull request Dec 12, 2023
3 tasks
@aws aws locked and limited conversation to collaborators Oct 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for Host Resource Groups
4 participants