Talos v1.9.2 nodes won't become ready, Kubernetes works #10163

Open
twelho opened this issue Jan 18, 2025 · 2 comments

Comments

@twelho (Contributor) commented Jan 18, 2025

Bug Report

Description

Essentially a continuation of #9991, with the same observations as @bitfactory-henno-schooljan; opened as a new issue as requested. 582064d9c appears to latch onto a temporary node name, and nodestatuses.kubernetes.talos.dev is no longer populated on any of the nodes. Talos reports the nodes as not ready even though they are ready and operational in Kubernetes, and talosctl health passes without issues. My nodes receive their addresses and hostnames through DHCP.
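
For reference, this is the kind of query that now comes back empty on every node (node name as above; the empty result is my observation, not verbatim output):

$ talosctl -n node-1 get nodestatuses.kubernetes.talos.dev
(no entries returned)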

Logs

# Configures cluster member discovery.
discovery:
    enabled: true # Enable the cluster membership discovery feature.
    # Configure registries used for cluster member discovery.
    registries:
        # Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information
        kubernetes:
            disabled: true # Disable Kubernetes discovery registry.
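
(As a sanity check, not part of the original report: with the Kubernetes registry disabled, discovery membership can still be inspected with something like the following, assuming the default service registry remains enabled.)

$ talosctl -n node-1 get members
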
$ talosctl -n node-1 get machinestatus -oyaml
node: node-1
metadata:
    namespace: runtime
    type: MachineStatuses.runtime.talos.dev
    id: machine
    version: 14
    owner: runtime.MachineStatusController
    phase: running
    created: 2025-01-18T12:47:08Z
    updated: 2025-01-18T12:47:32Z
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: nodeReady
              reason: node "node-1" status is not available yet

$ talosctl -n node-1 logs controller-runtime | egrep -i "nodestatus|talos-vcx-l96"
node-1: 2025-01-18T12:03:45.326Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.NodeStatusController"}
node-1: 2025-01-18T12:03:48.286Z INFO setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-vcx-l96", "domainname": ""}
node-1: 2025-01-18T12:03:48.287Z INFO setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-vcx-l96", "domainname": ""}
node-1: 2025-01-18T12:03:50.289Z DEBUG starting node watcher {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "nodename": "talos-vcx-l96"}
node-1: 2025-01-18T12:03:50.290Z DEBUG waiting for node cache sync {"component": "controller-runtime", "controller": "k8s.NodeStatusController"}
node-1: 2025-01-18T12:04:00.224Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "2 error(s) occurred:\n\terror getting node: nodes \"talos-vcx-l96\" is forbidden: User \"system:node:node-1\" cannot get resource \"nodes\" in API group \"\" at the cluster scope: node 'node-1' cannot read 'talos-vcx-l96', only its own Node object\n\terror getting node: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"}

So it seems that AuthorizeNodeWithSelectors is still blocking Talos, but this time it queries the wrong node entirely. I presume talos-vcx-l96 is the temporary name the node uses before it receives its proper name through DHCP. The node status controller also latches onto this name (starting node watcher {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "nodename": "talos-vcx-l96"}) and never receives a status update (since the temporary node name no longer exists), causing the readiness issue.
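
As a possible workaround (my assumption, untested here), pinning the hostname statically in the machine config should prevent the node from ever booting under a temporary generated name:

machine:
    network:
        hostname: node-1 # static hostname instead of relying on DHCP

This defeats the purpose of DHCP-assigned names, though, so it is a stopgap at best.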

Environment

  • Talos version (talosctl version --nodes <problematic nodes>):
Client:
        Tag:         v1.9.2
        SHA:         09758b3f
        Built:       
        Go version:  go1.23.4
        OS/Arch:     linux/amd64
Server:
        NODE:        node-1
        Tag:         v1.9.2
        SHA:         09758b3f
        Built:       
        Go version:  go1.23.4
        OS/Arch:     linux/amd64
        Enabled:     RBAC
  • Kubernetes version (kubectl version --short):
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.32.0
@smira (Member) commented Jan 18, 2025

Please submit a talosctl support bundle.
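
(For reference, a support bundle is generated with something along the lines of the following; exact flags may vary by version. By default it writes a support.zip archive in the current directory.)

$ talosctl -n node-1 support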

@twelho (Contributor, Author) commented Jan 18, 2025

Could you tell me which specific logs to retrieve instead? I'm sorry, but sanitizing the entire support bundle is a very extensive operation that I don't currently have the resources for, and it exposes a lot of cluster internals that I'd rather not make public.
