
aws-node-termination-handler pod is stuck in pending right after "kops rolling-update cluster --yes" #16870

Open
stl-victor-sudakov opened this issue Oct 2, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@stl-victor-sudakov

stl-victor-sudakov commented Oct 2, 2024

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.30.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

v1.29.3

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops upgrade cluster --name XXX --kubernetes-version 1.29.9 --yes
kops --name XXX update cluster --yes --admin
kops --name XXX rolling-update cluster --yes

5. What happened after the commands executed?
Cluster did not pass validation at the very beginning of the upgrade procedure:

$ kops rolling-update cluster --yes --name XXX
Detected single-control-plane cluster; won't detach before draining
NAME                            STATUS          NEEDUPDATE      READY   MIN     TARGET  MAX     NODES
control-plane-us-west-2c        NeedsUpdate     1               0       1       1       1       1
nodes-us-west-2c                NeedsUpdate     4               0       4       4       4       4
I1002 15:03:05.336312   37988 instancegroups.go:507] Validating the cluster.
I1002 15:03:29.806323   37988 instancegroups.go:566] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
I1002 15:04:22.511826   37988 instancegroups.go:566] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
[...]

I1002 15:18:58.830547   37988 instancegroups.go:563] Cluster did not pass validation within deadline: system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
E1002 15:18:58.830585   37988 instancegroups.go:512] Cluster did not validate within 15m0s
Error: control-plane node not healthy after update, stopping rolling-update: "error validating cluster: cluster did not validate within a duration of \"15m0s\""

When I looked up why the pod was pending, I found the following in "describe pod aws-node-termination-handler-577f866468-mmlx7":

0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 Preemption is not helpful for scheduling.

There is another aws-node-termination-handler- pod running at the moment (the old one):

$ kubectl -n kube-system get pods -l k8s-app=aws-node-termination-handler
NAME                                            READY   STATUS    RESTARTS          AGE
aws-node-termination-handler-577f866468-mmlx7   0/1     Pending   0                 69m
aws-node-termination-handler-6c9c8d7948-fxsrl   1/1     Running   1338 (4h1m ago)   133d
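
The "didn't have free ports" part of the message suggests the pod spec requests a host port (or host networking), and the old Running pod appears to be holding that port on the only node matching the affinity. A check along these lines should show it (illustrative command; the exact field layout depends on how kops renders the manifest):

$ kubectl -n kube-system get deployment aws-node-termination-handler -o yaml | grep -iE -B2 -A2 'hostPort|hostNetwork'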

6. What did you expect to happen?

I expected the cluster to be upgraded to Kubernetes 1.29.9.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-07-05T02:16:44Z"
  generation: 9
  name: YYYY
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://XXXX/YYYY
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-west-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-west-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - X.X.X.X/24
  kubernetesVersion: 1.29.9
  masterPublicName: api.YYYY
  networkCIDR: 172.22.0.0/16
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - X.X.X.X/24
  subnets:
  - cidr: 172.22.32.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-07-05T02:16:48Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: YYYY
  name: control-plane-us-west-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-07-05T02:16:49Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: YYYY
  name: nodes-us-west-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3a.xlarge
  maxSize: 4
  minSize: 4
  role: Node
  subnets:
  - us-west-2c

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Please see above the validation log.

9. Anything else do we need to know?
Now I would like to know how to recover from this situation and how to get rid of the aws-node-termination-handler-577f866468-mmlx7 pod which is now left in Pending state.

@k8s-ci-robot added the kind/bug label Oct 2, 2024
@stl-victor-sudakov
Author

I have tried killing the running pod, and now I again have one pod running and one pending:

$ kubectl -n kube-system get pods -l k8s-app=aws-node-termination-handler
NAME                                            READY   STATUS    RESTARTS   AGE
aws-node-termination-handler-577f866468-bj4gd   0/1     Pending   0          41h
aws-node-termination-handler-6c9c8d7948-vt7hh   1/1     Running   0          3m30s

@nuved

nuved commented Oct 4, 2024

Hi @stl-victor-sudakov
You should find out the reason for the Pending state by running something like:
kubectl describe pod aws-node-termination-handler-577f866468-bj4gd -n kube-system
That can happen when the scheduler cannot find anywhere to place the second pod.

Most of the time this means the new control-plane nodes have not joined the cluster properly, which is why the scheduler cannot place the pod on the target nodes.

@stl-victor-sudakov
Author

stl-victor-sudakov commented Oct 4, 2024

@nuved I think I have already posted the error message above, but I don't mind repeating it. The relevant part of "kubectl -n kube-system describe pod aws-node-termination-handler-577f866468-bj4gd" is:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  13m (x6180 over 2d3h)  default-scheduler  0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 Preemption is not helpful for scheduling.

There is actually only one control node in the cluster. Is there any additional information I could provide?

UPD: the complete "describe pod" output can be seen here: https://termbin.com/0sy6 (so as not to clutter the conversation with excessive output).

@nuved

nuved commented Oct 4, 2024 via email

@stl-victor-sudakov
Author

stl-victor-sudakov commented Oct 4, 2024

$ kubectl get nodes -o wide
NAME                  STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
i-01dbd1dccc0e30845   Ready    node            91d    v1.29.3   172.22.43.73    35.90.140.78     Ubuntu 22.04.4 LTS   6.5.0-1018-aws   containerd://1.7.16
i-02cf4b0fed779eb54   Ready    control-plane   135d   v1.29.3   172.22.48.131   34.222.92.123    Ubuntu 22.04.4 LTS   6.5.0-1018-aws   containerd://1.7.16
i-05569e161b2556a75   Ready    node            91d    v1.29.3   172.22.35.18    34.213.33.180    Ubuntu 22.04.4 LTS   6.5.0-1018-aws   containerd://1.7.16
i-06c219f4c3404e207   Ready    node            91d    v1.29.3   172.22.56.240   54.203.143.227   Ubuntu 22.04.4 LTS   6.5.0-1018-aws   containerd://1.7.16
i-0d1c604064d671d98   Ready    node            91d    v1.29.3   172.22.61.60    18.237.56.79     Ubuntu 22.04.4 LTS   6.5.0-1018-aws   containerd://1.7.16
$

It is a single-control-plane cluster. Also:

$ kops get instances
Using cluster from kubectl context: devXXXXXXX


ID                      NODE-NAME               STATUS          ROLES                           STATE   INTERNAL-IP     EXTERNAL-IP     INSTANCE-GROUP         MACHINE-TYPE
i-01dbd1dccc0e30845     i-01dbd1dccc0e30845     NeedsUpdate     node                                    172.22.43.73                    nodes-us-west-2c.YYYY                       t3a.xlarge
i-02cf4b0fed779eb54     i-02cf4b0fed779eb54     NeedsUpdate     control-plane, control-plane            172.22.48.131                   control-plane-us-west-2c.masters.YYYY       t3a.medium
i-05569e161b2556a75     i-05569e161b2556a75     NeedsUpdate     node                                    172.22.35.18                    nodes-us-west-2c.YYYY                       t3a.xlarge
i-06c219f4c3404e207     i-06c219f4c3404e207     NeedsUpdate     node                                    172.22.56.240                   nodes-us-west-2c.YYYY                       t3a.xlarge
i-0d1c604064d671d98     i-0d1c604064d671d98     NeedsUpdate     node                                    172.22.61.60                    nodes-us-west-2c.YYYY                       t3a.xlarge
$

@stl-victor-sudakov
Author

There is exactly one instance i-02cf4b0fed779eb54 in the control-plane-us-west-2c.masters.dev2XXXXX AWS autoscaling group, and it is healthy according to AWS.

@nuved

nuved commented Oct 7, 2024

Probably you just need to adjust the replica count manually and set it to 1:
kubectl edit deployment aws-node-termination-handler -n kube-system

I'm not sure how you can change the replica count via kops, but it should work.
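
For example, a scale command along these lines should do the same thing as editing the deployment by hand (the command is illustrative; the deployment name and namespace are taken from the pod listing above):

$ kubectl -n kube-system scale deployment aws-node-termination-handler --replicas=1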

@stl-victor-sudakov
Author

Manually deleting the ReplicaSet that contained the old aws-node-termination-handler pod did the trick (the pod was finally replaced), but this should happen automatically and not prevent the "kops rolling-update cluster" command from running smoothly.
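
For reference, the clean-up was roughly as follows; the old ReplicaSet name here is inferred from the pod name hash and may differ in other clusters:

$ kubectl -n kube-system get rs -l k8s-app=aws-node-termination-handler
$ kubectl -n kube-system delete rs aws-node-termination-handler-6c9c8d7948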

@axclever

Manually deleting the ReplicaSet that contained the old aws-node-termination-handler pod did the trick (the pod was finally replaced), but this should happen automatically and not prevent the "kops rolling-update cluster" command from running smoothly.

It solved my problem! Many thanks!

@stl-victor-sudakov
Author

stl-victor-sudakov commented Nov 14, 2024

@axclever Actually it was not my idea; I received this advice in the kOps office hours. It is a problem which manifests itself only in clusters with a single control-plane node.
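
My understanding of the mechanics (an inference, not something documented for this case): the deployment's default RollingUpdate strategy creates the new pod before removing the old one, but because the pod binds a host port and only the single control-plane node matches its node affinity, the new pod can never be scheduled while the old one is still running, so the rollout and the kops validation wait on each other. The update strategy can be inspected with something like:

$ kubectl -n kube-system get deployment aws-node-termination-handler -o jsonpath='{.spec.strategy}'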
