-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add network topology information to nodes #1025
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This issue is currently awaiting triage. If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @wwvela. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
pkg/providers/v1/instances_v2.go
Outdated
topology, err := c.ec2.DescribeInstanceTopology(topologyRequest) | ||
if err != nil || topology == nil { | ||
klog.Infof("topology api not supported for instance type: %s or region: %s", instanceType, region) | ||
return additionalLabels, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shouldn't return here, you should wrap the following block in an else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we will use the supportedInstanceType instead of instance prefix now, the error throw our there should not because not supported issue. I throw the error directly here in new version
pkg/providers/v1/instances_v2.go
Outdated
if c.topologySupportedInstance(instanceType) && slices.Contains(c.cfg.Global.TopologySupportedRegions, region) { | ||
topologyRequest := &ec2.DescribeInstanceTopologyInput{InstanceIds: []*string{&instanceID}} | ||
topology, err := c.ec2.DescribeInstanceTopology(topologyRequest) | ||
if err != nil || topology == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to check what the err
is before we say that the instance isn't supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revised the topologySupportedInstance to check the supported instance types instead of the instance prefix, so the DescribeInstanceTopology api should supported here. I will throw error directly
pkg/providers/v1/instances_v2.go
Outdated
return additionalLabels, nil | ||
} | ||
|
||
// TopologySupportedInstancePrefixes is help to filter instance types supported by topology API | ||
func (c *Cloud) topologySupportedInstance(instanceType string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (c *Cloud) topologySupportedInstance(instanceType string) bool { | |
func (c *Cloud) isTopologySupportedInstance(instanceType string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised
pkg/providers/v1/instances_v2.go
Outdated
// TopologySupportedInstancePrefixes is help to filter instance types supported by topology API | ||
func (c *Cloud) topologySupportedInstance(instanceType string) bool { | ||
for _, prefix := range c.cfg.Global.TopologySupportedInstancePrefixes { | ||
if strings.HasPrefix(instanceType, prefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably just want to work with a list of fully-formed instance type names, prefix matching has some ambiguity and configuration footguns ("p4"
would match all p4dn
instances, for example). The list isn't going to change very often and we don't need the benefit of shorthand IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to be updating the cloud-provider each time there's a new AWS instance type, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be a bad idea to do some dynamic discover based on an DescribeInstanceTopology
result? then cache the instance types' support with a TTL? from what i can tell the api just returns an empty list when the instance-id's type is not supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed this with Carter before. We don't want to trigger this api for all nodes, because some cx might have hundreds or thousands of nodes in their cluster, if we call this api for all nodes, it very likely trigger throttle issue.
And DescribeInstanceTopology only supported for very limited instances types which lots of cx might not use. check the supported instance type and regions here. https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to be updating the cloud-provider each time there's a new AWS instance type, though?
We don't need to update cloud-provider, TopologySupportedInstanceTypes and TopologySupportedRegions is added in cloud-config where will accept the input in KCP CCM manifest. We only need to update the manifest with the cloud-config input.
pkg/providers/v1/instances_v2.go
Outdated
for index, networkNode := range topology.NetworkNodes { | ||
label := LabelNetworkNode + strconv.Itoa(index) | ||
additionalLabels[label] = *networkNode | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear how you would use this. How would you "query" the labels using a pod's nodeSelector, or a telemetry system, if the index of the networkNode appears in the label key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after reading through the instance topology docs i think i have an idea how this works?
Assuming the goal is to run workloads that share the same topology, you'd need to know the network nodes that your capacity sits on first, then use that in a nodeSelector
for the corresponding layer index. AFAIK there's no way to apply a nodeSelector
based on some heuristic like "schedule pods to nodes where the label topology.k8s.aws/network-node-layer-1
matches", so this logic would have to be handled by a high level system. the only guarentee is that there are always at least 3 layers.
However, this scheme breaks down if network nodes can be part of different layers, which i couldn't immediately grok from the docs but i assumed wasn't possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this scheme breaks down if network nodes can be part of different layers
yep, if trees with different roots can overlap (share network nodes).
The long form doc is helpful, the API doc doesn’t even say that order is meaningful in the networkNodeSet. 😄
I’m wondering if the “layer” number needs to be in the labels. Is it enough to just say “schedule this pod within this networkNode” instead of “within this networkNode which is the layer 2 networkNode for the instance”?
—-
A primary motivation for this feature was observability, being able to correlate issues with something more granular than AZ. If you want to slice and dice your instance metrics using network nodes, should you care about the layer of the network node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we inverted this, so that the label key contains the networkNode and the value is the layer number, that would allow you to use the Exists operator in a nodeAffinity to place the pod within a networkNode, without needing the layer to also match: https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/node-selector-requirement/#NodeSelectorRequirement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, if trees with different roots can overlap (share network nodes).
We can assume that's not the case for now.
A primary motivation for this feature was observability, being able to correlate issues with something more granular than AZ
Do you mean the feature to apply arbitrary labels from within the cloud provider? I think observability at the zone-id level is where we started, but I'd like to think we can help with scheduling here too. The next thing I'm thinking is a label that describes whether an instance has a GPU or not, so we don't need to schedule GPU pods (Device Plugin, EFA etc) on nodes that don't need it etc.
Assuming the goal is to run workloads that share the same topology, you'd need to know the network nodes that your capacity sits on first, then use that in a nodeSelector for the corresponding layer index.
Right that's what I was expecting you'd need to do. You don't know the topology until you launch the instance first, so it forces this ordering of sorts where you launch the instance first, you realize what your network nodes are, and you should ideally see a common network node across your instances. I think you could maybe define weighted node affinities across the network node layers so you pack it as close as possible?
/ok-to-test |
/kind feature |
pkg/providers/v1/aws_ec2.go
Outdated
} else if len(resp.Instances) == 0 { | ||
return nil, nil | ||
} | ||
return resp.Instances[0], err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this wrapper's logic seems strange, shouldn't we return the list of []*ec2.InstanceTopology
instead of just the first item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DescribeInstanceTopology can accept a list of instance IDs, if we pass only 1 instance ID, it will return 1 according topology info something like below.
"Instances": [
{
"InstanceId": "i-xxx",
"InstanceType": "p5e.48xlarge",
"NetworkNodes": [
"nn-xxxxx",
"nn-xxxxx",
"nn-xxxxx"
],
"AvailabilityZone": "us-west-2c",
"ZoneId": "usw2-az3"
}
]
But yeah, it seems more reasonable to return all instances in api and get the first instance in where trigger it and get result. Will revised in next
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, because we still allow the full ec2.DescribeInstanceTopologyInput
payload which could specify more than one instance id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised
@@ -20,4 +20,5 @@ const ( | |||
// LabelZoneID is a topology label that can be applied to any resource | |||
// but will be initially applied to nodes. | |||
LabelZoneID = "topology.k8s.aws/zone-id" | |||
LabelNetworkNode = "topology.k8s.aws/network-node-" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LabelNetworkNode = "topology.k8s.aws/network-node-" | |
LabelNetworkNode = "topology.k8s.aws/network-node-layer-" |
IIUC, the info you're embedding into the key is about the layer of the network node in the node set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised in next version
pkg/providers/v1/instances_v2.go
Outdated
for index, networkNode := range topology.NetworkNodes { | ||
label := LabelNetworkNode + strconv.Itoa(index) | ||
additionalLabels[label] = *networkNode | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after reading through the instance topology docs i think i have an idea how this works?
Assuming the goal is to run workloads that share the same topology, you'd need to know the network nodes that your capacity sits on first, then use that in a nodeSelector
for the corresponding layer index. AFAIK there's no way to apply a nodeSelector
based on some heuristic like "schedule pods to nodes where the label topology.k8s.aws/network-node-layer-1
matches", so this logic would have to be handled by a high level system. the only guarentee is that there are always at least 3 layers.
However, this scheme breaks down if network nodes can be part of different layers, which i couldn't immediately grok from the docs but i assumed wasn't possible.
func (s *awsSdkEC2) DescribeInstanceTopology(request *ec2.DescribeInstanceTopologyInput) ([]*ec2.InstanceTopology, error) { | ||
resp, err := s.ec2.DescribeInstanceTopology(request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just realized this api is paginated, should we use logic similar to the DescribeInstances
wrapper below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of the DescribeInstanceTopology is actually only 1 element in the array because we will not do group describe for multiple instances. And it will only return 1 topology info for 1 instance something like
{
"Instances": [
{
"InstanceId": "i-xxx",
"InstanceType": "p5e.48xlarge",
"NetworkNodes": [
"nn-xxx",
"nn-xxx",
"nn-xxx"
],
"AvailabilityZone": "us-west-2c",
"ZoneId": "usw2-az3"
}
]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we test this with large instances because even though the result can be only 1 instance because of filters, not all filters are applied at the server side for the ec2 and for filter which are applied later, we have to go over multiple pages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our current usage, a lack of pagination should be fine. The only risk would be if we were re-using for some other use where we were describing multiple instances, which, AFAIK, isn't the type of thing that we ever do in the cloud provider.
I could be wrong, but when you're supplying a primary key like an instance ID, I've never seen EC2 not return the resource in question on the first call.
pkg/providers/v1/aws_ec2.go
Outdated
} else if len(resp.Instances) == 0 { | ||
return nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this branch necessary? wouldn't it be better to have an empty array as a default with no error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to return []*ec2.InstanceTopology{}, nil
pkg/providers/v1/topology.go
Outdated
func (t *topologyCache) mightSupportTopology(instanceType string, region string) bool { | ||
t.mutex.RLock() | ||
defer t.mutex.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this lock might need to belong in the caller getNodeTopology
? since that's where the possible writes occur to the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised
unsupportedInstance []string | ||
unsupportedRegion []string | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just future proofing here, is there a minimal cache with TTL implementation we could use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not neccessary for the topology catch to have TTL implementation, as the already supported instance type and region will not changed
} | ||
return topology[0], nil | ||
} | ||
return nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a static error to throw here is better? it seems odd to return no error with nil
then we could check something like errors.Is(err, ErrTopologyUnsupported)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to throw error if the instance not supported by the api? as we will just skip adding the topology label for it
@@ -8,6 +8,7 @@ import "github.com/aws/aws-sdk-go/service/ec2" | |||
type EC2 interface { | |||
// Query EC2 for instances matching the filter | |||
DescribeInstances(request *ec2.DescribeInstancesInput) ([]*ec2.Instance, error) | |||
DescribeInstanceTopology(request *ec2.DescribeInstanceTopologyInput) ([]*ec2.InstanceTopology, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to update the docs right for this new permission?
https://github.com/kubernetes/cloud-provider-aws/blob/master/docs/prerequisites.md
Also can you update the description with release notes to make sure users are aware of this while upgrading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised
v1 "k8s.io/api/core/v1" | ||
"k8s.io/apimachinery/pkg/types" | ||
cloudprovider "k8s.io/cloud-provider" | ||
"strconv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you sort the imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised
return nil, err | ||
} | ||
|
||
additionalLabels, err := c.getAdditionalLabels(zone.FailureDomain, string(instanceID), instanceType, zone.Region) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called regularly for each node, what would be the impact on the performance and also will it cause any ec2 limits issue as we are making another ec2 api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should not cause ec2 limit. The topologyCache is designed to avoid this situation, the describeInstanceTopology api will only triggered once for those unsupported regions and nodes with unsupported instance types and will cache it in the topologyCache. So describeInstanceTopology api will only triggered for supported instance type or those not cached unsupported instance types.
func (s *awsSdkEC2) DescribeInstanceTopology(request *ec2.DescribeInstanceTopologyInput) ([]*ec2.InstanceTopology, error) { | ||
resp, err := s.ec2.DescribeInstanceTopology(request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we test this with large instances because even though the result can be only 1 instance because of filters, not all filters are applied at the server side for the ec2 and for filter which are applied later, we have to go over multiple pages?
defer t.mutex.RUnlock() | ||
if t.mightSupportTopology(instanceType, region) { | ||
topologyRequest := &ec2.DescribeInstanceTopologyInput{InstanceIds: []*string{&instanceID}} | ||
topology, err := t.cloud.ec2.DescribeInstanceTopology(topologyRequest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per offline discussion, we need to gracefully handle access denied when making this API call. Non-EKS customers may not have enough permissions to make their call, and they can update them if they want this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the grace handle for permission missing and add the e2e test
hack/e2e/run.sh
Outdated
@@ -128,7 +128,7 @@ if [[ "${UP}" = "yes" ]]; then | |||
--run-id="${test_run_id}" \ | |||
--cloud-provider=aws \ | |||
--cluster-name="${CLUSTER_NAME}" \ | |||
--create-args="--dns=none --zones=${ZONES} --node-size=m5.large --master-size=m5.large --override=cluster.spec.cloudControllerManager.cloudProvider=aws --override=cluster.spec.cloudControllerManager.image=${IMAGE_NAME}:${IMAGE_TAG}" \ | |||
--create-args="--dns=none --zones=${ZONES} --node-size=p4d.24xlarge --master-size=m5.large --override=cluster.spec.cloudControllerManager.cloudProvider=aws --override=cluster.spec.cloudControllerManager.image=${IMAGE_NAME}:${IMAGE_TAG}" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not going to work. These instance types are super expensive and not readily available, so it won't be practical to test here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I have change the instance type back to m5.large in latest change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the instance type supported by the DescribeInstanceTopology are expensive. So I'm not sure if can still use the e2e to test this change or it will be better to test it by creating a dev cluster and replace the CCM manually
tests/e2e/nodes.go
Outdated
framework.ExpectNoError(err) | ||
|
||
client, err := kubernetes.NewForConfig(clientConfig) | ||
framework.ExpectNoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spacing is off here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to just run go fmt
or whatever the k8s dev command is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised and verified by make check
tests/e2e/nodes.go
Outdated
@@ -44,4 +47,62 @@ var _ = Describe("[cloud-provider-aws-e2e] nodes", func() { | |||
gomega.Expect(node.Labels).To(gomega.HaveKey("topology.k8s.aws/zone-id")) | |||
} | |||
}) | |||
|
|||
It("should correctly label nodes with instance type p3dn.24xlarge", func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test name is out of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised
func (s *awsSdkEC2) DescribeInstanceTopology(request *ec2.DescribeInstanceTopologyInput) ([]*ec2.InstanceTopology, error) { | ||
resp, err := s.ec2.DescribeInstanceTopology(request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our current usage, a lack of pagination should be fine. The only risk would be if we were re-using for some other use where we were describing multiple instances, which, AFAIK, isn't the type of thing that we ever do in the cloud provider.
I could be wrong, but when you're supplying a primary key like an instance ID, I've never seen EC2 not return the resource in question on the first call.
topologyRequest := &ec2.DescribeInstanceTopologyInput{InstanceIds: []*string{&instanceID}} | ||
topology, err := t.cloud.ec2.DescribeInstanceTopology(topologyRequest) | ||
if err != nil { | ||
klog.Errorf("Error describing instance topology: %q", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This should be a warn, or have a log for each branch so that there's an explicit log for each scenario.
// gracefully handle the DecribeInstanceTopology access missing error | ||
klog.Infof("Not authorized to perform: ec2:DescribeInstanceTopology, permission missing") | ||
return nil, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cartermckinnon Does this cover all of the scenarios that you talked about that should be gracefully handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I believe so
unsupportedInstance []string | ||
unsupportedRegion []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Wouldn't a map be more efficient for these?
// if both instanceType and region are not in unsupported cache, the instance type and region might be supported | ||
// or we haven't check the supportness and cache it yet. If they are unsupported and not cached, we will run | ||
// describeTopology api once for them | ||
// Initialize the map if it's unset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove
framework.Failf("Error checking EC2 describeInstanceTopology access: %v", err) | ||
} | ||
allowed := result.Status.Allowed | ||
supportedInstanceType := "p4d.24xlarge" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: May be brittle if topology information is ever backported to vanilla instances, though would be easy enough to fix.
if err != nil { | ||
framework.Failf("Error checking EC2 describeInstanceTopology access: %v", err) | ||
} | ||
allowed := result.Status.Allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not something that we can have expectations on what it will be?
/retest |
2 similar comments
/retest |
/retest |
topology, err := t.cloud.ec2.DescribeInstanceTopology(topologyRequest) | ||
if err != nil { | ||
klog.Errorf("Error describing instance topology: %q", err) | ||
if strings.Contains(err.Error(), "The functionality you requested is not available in this region") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be casting these to an awserr
and then comparing against error code constants, instead of doing string things.
see: https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/handling-errors.html
return nil, err | ||
} else if len(topology) == 0 { | ||
// instanceType is not support topology info and the result is empty | ||
t.unsupportedInstance = append(t.unsupportedInstance, instanceType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you haven't acquired a write lock before modifying supportedInstance/supportedRegion
if t.unsupportedInstance == nil { | ||
t.unsupportedInstance = []string{} | ||
} | ||
if t.unsupportedRegion == nil { | ||
t.unsupportedRegion = []string{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done when you create the topologyCache
, and I'd also move the locking down into this function instead of assuming the caller has acquired the correct lock
// gracefully handle the DecribeInstanceTopology access missing error | ||
klog.Infof("Not authorized to perform: ec2:DescribeInstanceTopology, permission missing") | ||
return nil, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I believe so
@wwvela: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
I'm taking over this change from @wwvela in #1038. She's on vacation, so I'm trying to get this over the line. I've done some significant refactoring to address issues discussed in this review, even some I said I wouldn't fix...
I don't see much for unit tests here, so I plan on adding those this morning. FYI @cartermckinnon @ndbaker1 @kmala |
/close Completed in #1038 |
@mmerkes: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR is to Add network topology information to nodes for Issue #998.
Which issue(s) this PR fixes:
Fixes #998
Does this PR introduce a user-facing change?:
NONE