Essentially a continuation of #9991, with the same observations as @bitfactory-henno-schooljan; filing a new issue as requested. Since 582064d9c, Talos appears to latch onto a temporary node name, and nodestatuses.kubernetes.talos.dev is no longer populated on any of the nodes. Talos shows the nodes as not ready while they are ready and operational in Kubernetes, and `talosctl health` passes without issues. My nodes receive their addresses and names through DHCP.
Logs
# Configures cluster member discovery.
discovery:
    enabled: true # Enable the cluster membership discovery feature.
    # Configure registries used for cluster member discovery.
    registries:
        # Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information.
        kubernetes:
            disabled: true # Disable Kubernetes discovery registry.
$ talosctl -n node-1 get machinestatus -oyaml
node: node-1
metadata:
    namespace: runtime
    type: MachineStatuses.runtime.talos.dev
    id: machine
    version: 14
    owner: runtime.MachineStatusController
    phase: running
    created: 2025-01-18T12:47:08Z
    updated: 2025-01-18T12:47:32Z
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: nodeReady
              reason: node "node-1" status is not available yet
$ talosctl -n node-1 logs controller-runtime | egrep -i "nodestatus|talos-vcx-l96"
node-1: 2025-01-18T12:03:45.326Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.NodeStatusController"}
node-1: 2025-01-18T12:03:48.286Z INFO setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-vcx-l96", "domainname": ""}
node-1: 2025-01-18T12:03:48.287Z INFO setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-vcx-l96", "domainname": ""}
node-1: 2025-01-18T12:03:50.289Z DEBUG starting node watcher {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "nodename": "talos-vcx-l96"}
node-1: 2025-01-18T12:03:50.290Z DEBUG waiting for node cache sync {"component": "controller-runtime", "controller": "k8s.NodeStatusController"}
node-1: 2025-01-18T12:04:00.224Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "2 error(s) occurred:\n\terror getting node: nodes \"talos-vcx-l96\" is forbidden: User \"system:node:node-1\" cannot get resource \"nodes\" in API group \"\" at the cluster scope: node 'node-1' cannot read 'talos-vcx-l96', only its own Node object\n\terror getting node: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"}
So it seems that AuthorizeNodeWithSelectors is still stopping Talos, but this time it queries the completely wrong node. I presume talos-vcx-l96 is the temporary name of the node before it receives the proper name through DHCP. The node status controller also latches onto this name (starting node watcher {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "nodename": "talos-vcx-l96"}) and never receives the status (since the temporary node name is now gone), causing the readiness issue.
Environment

- Talos version: [talosctl version --nodes <problematic nodes>]
- Kubernetes version: [kubectl version --short]
Can you tell me which specific logs to retrieve instead? I'm sorry, but sanitizing the entire support bundle is a very extensive operation that I don't have the resources for currently. It exposes a lot of cluster internals that I'd rather not have public.