argo-workflows/metrics/ #10319

2023-01-06T08:05:12Z

giscus[bot]
bot Jan 6, 2023

argo-workflows/metrics/

https://argoproj.github.io/argo-workflows/metrics/

rubaiat-hossain · 2023-01-06T08:05:13Z

rubaiat-hossain
Jan 6, 2023 — with giscus

I've deployed Argo inside a local kind cluster with the following commands:

kubectl create namespace argo

kubectl create \
  --filename https://github.com/argoproj/argo-workflows/releases/download/v3.4.4/namespace-install.yaml \
  --namespace argo

kubectl wait deployment workflow-controller \
  --for condition=Available \
  --namespace argo

kubectl create rolebinding default-admin \
  --clusterrole cluster-admin \
  --namespace argo \
  --serviceaccount=argo:default

However, when I try to label the workflow-controller-metrics service so that Prometheus can scrape it, kubectl returns: Error from server (NotFound): services workflow-controller-metrics not found

Here's the output of kubectl get svc --namespace argo:

NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
argo-server   ClusterIP   10.96.45.141   <none>        2746/TCP   20m

Is my Argo installation faulty? Why can't kubectl find the workflow-controller-metrics service?

4 replies

nicolas-vivot Apr 3, 2023 — with giscus

It looks like you have to deploy the service yourself. The service should point to the metrics port of the workflow-controller, which is 9090 by default unless you changed it in the controller configuration.

Example:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: workflow-controller
  name: workflow-controller-metrics
  namespace: argo
spec:
  ports:
  - name: metrics
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: workflow-controller

nicolas-vivot Apr 3, 2023 — with giscus

By the way, even though the documentation says there is a default metrics configuration, you must declare it explicitly in the workflow controller ConfigMap so that the Prometheus metrics server is started within each workflow controller pod. Otherwise, whatever you use to scrape the metrics may get a connection refused error.
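A minimal sketch of such a declaration, assuming the ConfigMap keeps its default name workflow-controller-configmap and lives in the argo namespace used earlier in this thread. The port and path below simply mirror the values mentioned here (9090 and /metrics); check them against the documentation for your Argo version before applying.

```yaml
# Sketch only: add the metricsConfig key to your existing ConfigMap
# instead of replacing other keys you may already have under data.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap  # assumed default name
  namespace: argo
data:
  metricsConfig: |
    enabled: true    # start the Prometheus metrics server
    path: /metrics   # path the scraper should request
    port: 9090       # must match the targetPort of the metrics Service
```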

nicolas-vivot Apr 3, 2023 — with giscus

Last but not least: if you have set up HA, you must deploy a headless service so that the scraper reaches every pod behind the service instead of being load balanced to a single one. Without it, you will end up with inaccurate metrics. The metrics server is actually started only on the master instance of the workflow controller, so if you load balance, some requests may be routed to an instance where the metrics server is not running and fail with connection refused.
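As a rough illustration, a headless variant of the Service from the earlier reply could look like the sketch below. The only change is clusterIP: None, which makes DNS return the individual pod addresses instead of a single virtual IP. The name, namespace, labels, and port are carried over from that example and are assumptions about your setup; your scraper also needs to discover the individual endpoints rather than use the service name as a single target.

```yaml
# Sketch only: headless Service exposing the controller metrics port.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: workflow-controller
  name: workflow-controller-metrics
  namespace: argo
spec:
  clusterIP: None        # headless: no virtual IP, no load balancing
  ports:
  - name: metrics
    port: 9090           # assumed metrics port, see metricsConfig
    protocol: TCP
    targetPort: 9090
  selector:
    app: workflow-controller
```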

rubaiat-hossain Apr 19, 2023

I've solved this issue already. Appreciate your detailed feedback; thanks.

hongdiao · 2023-11-01T03:14:07Z

hongdiao
Nov 1, 2023 — with giscus

Template level CPU/Memory Metrics are not collected if the retryStrategy is enabled.

Hello Experts,

I am using template-level metrics to collect CPU and memory information. For example, the CPU metric is defined at the template level as follows:

  "metrics": {
    "prometheus": [
      {
        "name": "template_exec_cpu_gauge",
        "labels": [
          {
            "key": "template_name",
            "value": "My_Template_Name"
          }
        ],
        "help": "CPU gauge by template name",
        "gauge": {
          "value": "{{resourcesDuration.cpu}}",
          "realtime": false
        }
      }
    ]
  }

What I found is that if retryStrategy is enabled at the template level, like this:

  "retryStrategy": {
    "limit": 10
  },

then the CPU gauge reported by the Argo server metrics endpoint (http://localhost:9090/metrics) is always 0:

argo_workflows_template_exec_cpu_gauge{template_name="My_Template_Name"} 0

However, the template runs a heavy computation. When I click this template's node in the Argo Workflows UI (both the Retry node and the Pod node), I can see the CPU and memory usage:

RESOURCES DURATION
15m*(1 cpu),15m*(100Mi memory)

When I disable retryStrategy, the metric is collected correctly from the Argo server metrics endpoint (http://localhost:9090/metrics):

argo_workflows_template_exec_cpu_gauge{template_name="My_Template_Name"} 915

Is this expected behavior? How can I collect template CPU and memory metrics when retryStrategy is enabled?

Thanks in advance for your help.

0 replies
