Essentially a continuation of #9991, with the same observations as @bitfactory-henno-schooljan; filing a new issue as requested. Since 582064d9c, Talos appears to latch onto a temporary node name, and nodestatuses.kubernetes.talos.dev is no longer populated on any of the nodes. Talos shows the nodes as not ready while they are ready and operational in Kubernetes, and `talosctl health` passes without issues. My nodes receive their addresses and names through DHCP.
Logs
# Configures cluster member discovery.
discovery:
    enabled: true # Enable the cluster membership discovery feature.
    # Configure registries used for cluster member discovery.
    registries:
        # Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information.
        kubernetes:
            disabled: true # Disable Kubernetes discovery registry.
$ talosctl -n node-1 get machinestatus -oyaml
node: node-1
metadata:
    namespace: runtime
    type: MachineStatuses.runtime.talos.dev
    id: machine
    version: 14
    owner: runtime.MachineStatusController
    phase: running
    created: 2025-01-18T12:47:08Z
    updated: 2025-01-18T12:47:32Z
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: nodeReady
              reason: node "node-1" status is not available yet
$ talosctl -n node-1 logs controller-runtime | egrep -i "nodestatus|talos-vcx-l96"
node-1: 2025-01-18T12:03:45.326Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.NodeStatusController"}
node-1: 2025-01-18T12:03:48.286Z INFO setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-vcx-l96", "domainname": ""}
node-1: 2025-01-18T12:03:48.287Z INFO setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-vcx-l96", "domainname": ""}
node-1: 2025-01-18T12:03:50.289Z DEBUG starting node watcher {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "nodename": "talos-vcx-l96"}
node-1: 2025-01-18T12:03:50.290Z DEBUG waiting for node cache sync {"component": "controller-runtime", "controller": "k8s.NodeStatusController"}
node-1: 2025-01-18T12:04:00.224Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "2 error(s) occurred:\n\terror getting node: nodes \"talos-vcx-l96\" is forbidden: User \"system:node:node-1\" cannot get resource \"nodes\" in API group \"\" at the cluster scope: node 'node-1' cannot read 'talos-vcx-l96', only its own Node object\n\terror getting node: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"}
So it seems that AuthorizeNodeWithSelectors is still stopping Talos, but this time it queries the completely wrong node. I presume talos-vcx-l96 is the temporary name of the node before it receives the proper name through DHCP. The node status controller also latches onto this name (starting node watcher {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "nodename": "talos-vcx-l96"}) and never receives the status (since the temporary node name is now gone), causing the readiness issue.
Environment

- Talos version: [talosctl version --nodes <problematic nodes>]
- Kubernetes version: [kubectl version --short]
Can you tell me which specific logs to retrieve instead? I'm sorry, but sanitizing the entire support bundle is a very extensive operation that I don't have the resources for currently. It exposes a lot of cluster internals that I'd rather not have public.