Exporter Pod is continuously terminated #62

Open
jslouisyou opened this issue May 22, 2024 · 1 comment

@jslouisyou

Hello WEKA,
I'm using weka/export to build a container image and deploy it into Kubernetes.
However, the pod is continuously terminated without any error (I can't see any error logs in the pod); it exits with exitCode: 1.
Below are the Pod manifest, the configuration, and the logs.
Could you please check why this pod is continuously terminated?

  • containerStatus:
  containerStatuses:
  - containerID: containerd://ffd6e89aafc94f2205c5c82e94d9b7673e9f9cbf05aad1d181b12acfda523db7
    image: wekasolutions/export:latest
    imageID: wekasolutions/export@sha256:b6f94edb3511531110b95038b60241d2294124dc3e9faca3bca9adfb499874fa
    lastState:
      terminated:
        containerID: containerd://5d92d8b0ef84be1baa6c3e47feb945589aedffe75b32a4e8a266b73b19995ed8
        exitCode: 1
        finishedAt: "2024-05-22T04:24:57Z"
        reason: Error
        startedAt: "2024-05-22T03:49:59Z"
    name: weka-exporter
    ready: true
    restartCount: 344
    started: true
    state:
      running:
        startedAt: "2024-05-22T04:24:58Z"
  • Pod manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: weka-exporter
  namespace: monitoring-infra
spec:
  selector:
    matchLabels:
      app: weka-exporter
  template:
    metadata:
      labels:
        app: weka-exporter
    spec:
      hostNetwork: true
      containers:
        - name: weka-exporter
          image: wekasolutions/export:latest
          args:
            - --no_syslog
          volumeMounts:
            - name: weka-auth-token
              mountPath: /weka/.weka
              readOnly: true
            - name: weka-exporter-config
              mountPath: /weka/export.yml
              subPath: export.yml
          env:
            - name: no_proxy
              value: <IP Addresses of WEKA backend>
      volumes:
        - name: weka-auth-token
          secret:
            secretName: weka-auth-token
        - name: weka-exporter-config
          configMap:
            items:
              - key: export.yml
                path: export.yml
            name: weka-exporter-config
  • weka-exporter-config - I have 6 WEKA backend servers.
    exporter:
      listen_port: 8001
      loki_host:
      loki_port: 3100
      timeout: 30.0
      max_procs: 8
      max_threads_per_proc: 100
      backends_only: True
      datapoints_per_collect: 5
      certfile: null
      keyfile: null

    cluster:
      auth_token_file: auth-token.json
      hosts:
        - host1
        - host2
        - host3
        - host4
        - host5
        - host6
      force_https: False
      verify_cert: False
      mgmt_port: 14000
.....
  • Logs (the logs below repeat continuously; I disabled syslog)
gathering
gathering weka data from cluster test1
starting 100 threads
Cluster test1 Using 6 hosts
populating datastructures for cluster test1
Gather complete: cluster=test1, total elapsed=10.33
stats returned. total time = 10.43s 69 api calls made. Wed May 22 04:24:57 2024
@davegreen

Are you actively querying the endpoint with Prometheus at this point? What config do you use there?

There appears to be an issue where, if you access the endpoint, cancel the request, and then make any other request before the results of the first one are returned, the container crashes. From your logs, a collection currently takes over 10s, so the Prometheus default scrape timeout (10s) might be a contributor. If you have any liveness probe or scrape timing out in less than that time, this could be why.
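
If the Deployment does have a probe pointed at the exporter, this is roughly what I mean by giving it more headroom than the gather time (a sketch only: the path and timings are placeholders, and 8001 is just the listen_port from your export.yml):

livenessProbe:
  httpGet:
    path: /              # placeholder; use whatever path the exporter actually serves
    port: 8001           # listen_port from export.yml
  periodSeconds: 60      # probe less often than a gather cycle takes
  timeoutSeconds: 30     # comfortably above the ~10s gather time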

I work around this (temporarily) by increasing the scrape timeout in Prometheus so it doesn't cut scrapes short, and by making sure it doesn't scrape the container too often, even if it takes ages to return data. I've also made sure there are no direct user-facing access paths (like Ingresses) to the service, so only Prometheus can reach it and other users can't accidentally trigger this issue either.
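
On the Prometheus side, something like this is what I mean (a sketch only: the job name and target are assumptions, 8001 is the listen_port from your export.yml, and you should pick timings that fit your cluster; note scrape_timeout can't exceed scrape_interval):

scrape_configs:
  - job_name: weka-exporter            # hypothetical job name
    scrape_interval: 2m                # scrape less often, so a slow gather finishes before the next one starts
    scrape_timeout: 60s                # well above the ~10s gather time, so in-flight scrapes aren't cancelled
    static_configs:
      - targets: ['<node-ip>:8001']    # exporter listen_port from export.yml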

You can easily test this by loading the endpoint in a browser: if you cancel the page load and then wait, it's fine. If instead you make any other request, the container will crash when it attempts to return the originally requested data.
