[BUG] Kubernetes metrics missing after 7.61.0 upgrade #33270

Closed
ja-sg opened this issue Jan 22, 2025 · 4 comments · Fixed by #33271

ja-sg commented Jan 22, 2025

Agent Environment

Datadog Operator Version: 2.5.1 (upgraded from 2.5.0)
Datadog Agent Version: 7.61.0 (upgraded from 7.60.1)
Datadog Cluster Agent Version: 7.61.0 (upgraded from 7.60.1)
Platform: Kubernetes (deployed via Helm)

Error logs that might be relevant:
Cluster Agent

2025-01-15 09:00:21 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command
2025-01-15 09:00:21 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command
2025-01-15 09:00:21 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command
2025-01-15 09:00:22 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command
2025-01-15 09:00:22 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command
2025-01-15 09:00:22 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command
2025-01-15 08:58:27 UTC | CLUSTER | ERROR | (pkg/collector/corechecks/loader.go:73 in Load) | core.loader: could not configure check kubernetes_state_core: resource customresourcedefinitions does not exist. Available resources: apiextensions.k8s.io/v1, Resource=customresourcedefinitions,clusterroles,configmaps,deployments,leases,replicationcontrollers,volumeattachments,cronjobs,ingresses,ingressclasses,persistentvolumes,poddisruptionbudgets,validatingwebhookconfigurations,core/v1, Resource=pods_extended,autoscaling.k8s.io/v1beta2, Resource=verticalpodautoscalers,daemonsets,horizontalpodautoscalers,jobs,resourcequotas,rolebindings,batch/v1, Resource=jobs_extended,endpointslices,services,storageclasses,core/v1, Resource=nodes_extended,statefulsets,endpoints,mutatingwebhookconfigurations,persistentvolumeclaims,replicasets,roles,serviceaccounts,namespaces,apiregistration.k8s.io/v1, Resource=apiservices,clusterrolebindings,limitranges,networkpolicies,pods,certificatesigningrequests,nodes,secrets

Agent

2025-01-20 17:51:14 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:71 in Error) | check:redisdb | Error running check: [{"message":"Error while reading from 10.2.5.55:6379 : (104, 'Connection reset by peer')","traceback":"Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/connection.py\", line 512, in read_response\n response = self._parser.read_response(disable_decoding=disable_decoding)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/_parsers/resp2.py\", line 15, in read_response\n result = self._read_response(disable_decoding=disable_decoding)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/_parsers/resp2.py\", line 25, in _read_response\n raw = self._buffer.readline()\n ^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/_parsers/socket.py\", line 115, in readline\n self._read_from_socket()\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/_parsers/socket.py\", line 65, in _read_from_socket\n data = self._sock.recv(socket_read_size)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nConnectionResetError: [Errno 104] Connection reset by peer\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/base/checks/base.py\", line 1290, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/redisdb/redisdb.py\", line 552, in check\n self._check_db()\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/datadog_checks/redisdb/redisdb.py\", line 207, in _check_db\n conn.ping()\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/commands/core.py\", line 1208, in ping\n 
return self.execute_command(\"PING\", **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/client.py\", line 545, in execute_command\n conn = self.connection or pool.get_connection(command_name, **options)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/connection.py\", line 1074, in get_connection\n connection.connect()\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/connection.py\", line 289, in connect\n self.on_connect()\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/connection.py\", line 391, in on_connect\n self.read_response()\n File \"/opt/datadog-agent/embedded/lib/python3.12/site-packages/redis/connection.py\", line 520, in read_response\n raise ConnectionError(\nredis.exceptions.ConnectionError: Error while reading from 10.2.5.55:6379 : (104, 'Connection reset by peer')\n"}]
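
To separate the check-loading failure from the surrounding noise in logs like the ones above, a small script can help. This is a minimal sketch, assuming each log line follows the `timestamp | component | level | (location) | message` layout seen in this issue; the names `AgentLogLine`, `parse_line`, and `failed_checks` are hypothetical helpers, not part of the Datadog Agent.

```python
import re
from typing import Iterable, NamedTuple, Optional, Set


class AgentLogLine(NamedTuple):
    timestamp: str
    component: str
    level: str
    location: str
    message: str


def parse_line(line: str) -> Optional[AgentLogLine]:
    """Split one log line into its five pipe-separated fields.

    Assumes the layout seen in this issue; returns None for lines
    that do not have all five fields.
    """
    # maxsplit=4 keeps any " | " inside the message itself intact.
    parts = line.strip().split(" | ", 4)
    if len(parts) != 5:
        return None
    return AgentLogLine(*parts)


# Matches the loader error reported above and captures the check name.
CHECK_LOAD_FAILURE = re.compile(r"could not configure check (\S+):")


def failed_checks(lines: Iterable[str]) -> Set[str]:
    """Return the names of checks that logged a configuration error."""
    names = set()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed.level != "ERROR":
            continue
        match = CHECK_LOAD_FAILURE.search(parsed.message)
        if match:
            names.add(match.group(1))
    return names


# Sample lines taken from this issue (the second one is shortened).
sample = [
    "2025-01-15 09:00:21 UTC | CLUSTER | ERROR | (pkg/orchestrator/redact/pod.go:109 in func1) | Failed to parse cmd from pod, obscuring whole command",
    "2025-01-15 08:58:27 UTC | CLUSTER | ERROR | (pkg/collector/corechecks/loader.go:73 in Load) | core.loader: could not configure check kubernetes_state_core: resource customresourcedefinitions does not exist.",
]
print(failed_checks(sample))
```

Run against the Cluster Agent logs, this singles out `kubernetes_state_core` as the check that failed to load, which matches the missing `kubernetes_state*` metrics.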

Describe what happened:

After upgrading the Datadog Operator and related components, Kubernetes cluster metrics stopped appearing on Datadog dashboards.

Describe what you expected:

Kubernetes state cluster metrics should continue to appear on Datadog dashboards after upgrading the components.

Steps to reproduce the issue:

Upgrade the following components:
    Datadog Operator: 2.5.0 -> 2.5.1
    Datadog Agent: 7.60.1 -> 7.61.0
    Datadog Cluster Agent: 7.60.1 -> 7.61.0
Observe missing Kubernetes state metrics on Datadog dashboards.

Additional environment details (Operating System, Cloud provider, etc):

Kubernetes cluster running on AKS.

@julien-redel

Same environment, same problem: no more kubernetes_state* metrics.
Reverting to 7.60.1 fixes it.

@FlorentClarret
Member

Hello everyone,

Thank you for reaching out and reporting this issue.

We took a look and can confirm this issue. We have a fix ready here that we are going to ship with Agent 7.62.0 early next week. In the meantime, we recommend pinning the Agent version to 7.60.1, as @julien-redel mentioned.
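
For anyone scripting a guard around upgrades, here is a minimal sketch of a version check based on what this thread reports: 7.60.1 works, 7.61.0 is broken, and 7.62.0 carries the fix. Treating every 7.61.x tag as affected is an assumption (only 7.61.0 is confirmed here), and `parse_version` and `is_affected` are hypothetical helper names. The sketch also assumes plain `MAJOR.MINOR.PATCH` tags without suffixes.

```python
from typing import Tuple

# Boundaries taken from this thread; the 7.61.x range is an assumption,
# since only 7.61.0 was reported as broken.
FIRST_BROKEN = (7, 61, 0)
FIRST_FIXED = (7, 62, 0)


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' tag (an optional leading 'v' is dropped)."""
    major, minor, patch = tag.strip().lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def is_affected(tag: str) -> bool:
    """Return True if the tag falls in the range assumed to miss kubernetes_state* metrics."""
    return FIRST_BROKEN <= parse_version(tag) < FIRST_FIXED


for tag in ("7.60.1", "7.61.0", "7.62.0"):
    print(tag, "affected" if is_affected(tag) else "ok")
```

Tuple comparison keeps the check simple and avoids string comparison pitfalls such as "7.9.0" sorting after "7.61.0".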

We'll update this issue once the fix is out.

Really sorry for the inconvenience.

@FlorentClarret
Member

Hi everyone,

As promised, the fix for this bug is shipped with the Datadog Agent 7.62.0, which was released today: https://github.com/DataDog/datadog-agent/releases/tag/7.62.0

Thanks for your patience on this one.

I'm going to close the issue for now, but feel free to reach out if you have any questions.

@julien-redel

I can confirm the problem no longer occurs with 7.62.0.
Thanks, @FlorentClarret
