CronJob is not created for deleted and renamed test workflow #6146

Open
vsukhin opened this issue Jan 31, 2025 · 11 comments
Labels: backend · bug 🐛 Something is not working as should be · Bug (Created by Linear-GitHub Sync)

Comments

@vsukhin (Collaborator) commented Jan 31, 2025

Describe the bug
CronJob is not created for deleted and renamed test workflow

To Reproduce
Steps to reproduce the behavior:
(⎈|pequod:osp)➜ ~ k get cronjob | grep open-patent-service-dmz
open-patent-service-dmz-k6-probe-2792369560789523000 */5 * * * * Europe/Amsterdam False 0 100s 48m
(⎈|pequod:osp)➜ ~ k get testworkflow | grep open-patent-service-dmz
open-patent-service-dmz-k6-probe 49m
(⎈|pequod:osp)➜ ~ k delete testworkflow open-patent-service-dmz-k6-probe
testworkflow.testworkflows.testkube.io "open-patent-service-dmz-k6-probe" deleted
(⎈|pequod:osp)➜ ~ k get testworkflow | grep open-patent-service-dmz <<<<<<<Pushed Testworkflow C (renaming B)>>>>>>>>>
(⎈|pequod:osp)➜ ~ k get testworkflow | grep open-patent-service-dmz
open-patent-service-dmz-k6-probe-apigee 77s
open-patent-service-dmz-k6-probe-general 77s <<<<<<<renamed with suffix -general>>>>>>>>
(⎈|pequod:osp)➜ ~ k get cronjob | grep open-patent-service-dmz
open-patent-service-dmz-k6-probe-2792369560789523000 */5 * * * * Europe/Amsterdam False 0 85s

Expected behavior
A CronJob is created for each of the renamed test workflows.

Version / Cluster

  • Which testkube version? 2.1.86
  • What Kubernetes cluster? (e.g. GKE, EKS, Openshift etc, local KinD, local Minikube) local KinD
  • What Kubernetes version? 1.25

By Tao Aleixandre https://testkubeworkspace.slack.com/archives/C06D9EYTQ2J/p1738256445119279

@vsukhin (Collaborator, Author) commented Jan 31, 2025

@olensmar @jmorante-ks I will pick it up

@tao-aleixandre commented Jan 31, 2025

Thanks a lot @vsukhin
Just to add more info:
All three TestWorkflows (A, B, C) are deployed by the same ArgoCD application.
I tried to reproduce the error in a LAB cluster, but it does not happen there: B is renamed to C and A is created.
Where is the state/information about the TestWorkflows persisted? If it's MongoDB, I could do a cleanup, for instance, or share more details to help the investigation.

@vsukhin (Collaborator, Author) commented Jan 31, 2025

Thank you, @Fernandoramas. Let me check it a bit more deeply.

@tao-aleixandre

@vsukhin I'm starting to suspect that it is something related to the reconciliation loop of TestWorkflows. This might be directly connected with the leak of goroutines that @rangoo94 was looking at: slack discussion.

@rangoo94 (Member) commented Jan 31, 2025

This might be directly connected with the leak of goroutines [...]

@tao-aleixandre, it cannot be related. The creation and deletion of CronJobs happens in the testkube-operator, while the Slack discussion is about the testkube-api-server.

@vsukhin (Collaborator, Author) commented Jan 31, 2025

Hey @tao-aleixandre, if I understand it right, it happens only for the ArgoCD app. Then it looks pretty much like a race condition issue, because it happens within the same app. A TestWorkflow is a k8s object and its state is persisted in etcd. The Operator watches for changes in TestWorkflow objects using the reconciliation method. If there is a change in the cronjob section of the test workflow, the Operator will create/delete the related CronJob resources. So I'm not sure about the order of notifications the Operator receives when you do everything in the same app. In the future the Enterprise control plane will store Test Workflow resources in MongoDB, but that's not the case for OSS.
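
For readers following along, here is a minimal controller-runtime sketch of the flow described above. It is illustrative only, not the actual testkube-operator code: the import path for the TestWorkflow API types, the `cronJobNameFor` helper, and the cleanup-by-owner-reference behaviour are assumptions.

```go
// Illustrative sketch of a reconciler that ensures a CronJob per TestWorkflow
// with a cronjob event. NOT the actual testkube-operator implementation.
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the TestWorkflow CRD types.
	testworkflowsv1 "github.com/kubeshop/testkube-operator/api/testworkflows/v1"
)

type TestWorkflowReconciler struct {
	client.Client
}

// Reconcile is called whenever a change to a TestWorkflow object is reported.
func (r *TestWorkflowReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var wf testworkflowsv1.TestWorkflow
	if err := r.Get(ctx, req.NamespacedName, &wf); err != nil {
		// Workflow deleted: its CronJobs are expected to be cleaned up
		// (e.g. via owner references); nothing else to do here.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Build the desired CronJob for the workflow's cronjob event.
	cj := &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cronJobNameFor(&wf),
			Namespace: wf.Namespace,
		},
		// Spec (schedule, job template) would be derived from wf.Spec.Events (omitted).
	}
	if err := r.Create(ctx, cj); err != nil {
		// A stale CronJob left behind by a deleted workflow, or another
		// workflow whose generated name collides, surfaces here as the
		// "already exists" Reconciler error shown in the log below.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}

// cronJobNameFor is a placeholder: workflow name plus some deterministic
// suffix (the suffix scheme is discussed further down in the thread).
func cronJobNameFor(wf *testworkflowsv1.TestWorkflow) string {
	return wf.Name + "-suffix"
}
```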

@vsukhin (Collaborator, Author) commented Jan 31, 2025

Regarding this bug,

{"L":"ERROR","T":"2025-01-30T17:04:49.586Z","M":"Reconciler error","controller":"testworkflow","controllerGroup":"testworkflows.testkube.io","controllerKind":"TestWorkflow","TestWorkflow":{"name":"open-patent-service-dmz-k6-probe","namespace":"osp"},"namespace":"osp","name":"open-patent-service-dmz-k6-probe","reconcileID":"b1dab96a-bde9-49be-881d-ab70ac954777","error":"cronjobs.batch \"open-patent-service-dmz-k6-probe-2792369560789523000\" already exists"

We create the CronJob with a name composed of the Test Workflow name + unique ID (UID):
// UID is the unique in time and space value for this object. It is typically generated by
// the server on successful creation of a resource and is not allowed to change on PUT
// operations.

So, if this happens, then the Operator receives 2 notifications with the same UID.
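
For illustration only: a Kubernetes UID is a UUID string, while the observed suffix (2792369560789523000) is a 19-digit number, so some deterministic numeric encoding of the workflow identity is presumably involved. The sketch below uses an FNV-1a hash purely as an example; it is not necessarily what the operator actually does.

```go
// Illustrative only: derive a fixed-length numeric suffix from an object UID.
// FNV-1a is an arbitrary choice here; the operator's real derivation is not
// shown in this thread.
package main

import (
	"fmt"
	"hash/fnv"
)

func suffixFor(uid string) string {
	h := fnv.New64a()
	h.Write([]byte(uid))
	return fmt.Sprintf("%d", h.Sum64()) // a uint64 prints as up to 20 digits
}

func main() {
	// Hypothetical UID (a UUID string, like the ones in the metadata shown later).
	fmt.Println("open-patent-service-dmz-k6-probe-" + suffixFor("75432e35-f2af-4309-801f-be62a6f6d5eb"))
}
```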

@tao-aleixandre

Thanks a lot for the detailed explanation. I have the goal this year to start making some contributions to the OSS project, so understanding it in depth helps me with that goal.
I found an important pattern: CronJob names have a limit of 52 characters.
TestWorkflow A: open-patent-service-dmz-k6-probe -> 32 chars + UID suffix -> 20 = 52 characters.
TestWorkflow B: open-patent-service-dmz-k6-probe-apigee -> 39 chars + UID suffix -> 20 = 59 ❌
It gets trimmed to open-patent-service-dmz-k6-probe (32 chars) + UID suffix (20), which matches the name already generated for TestWorkflow A, and the Reconciler fails since it's the same name.
I don't know why the UIDs are that long; I'm guessing it's because of the number of TestWorkflows we have (161).

From your explanation I'm guessing that both UIDs match because they are deployed by the same ArgoCD app.
We will take care internally of controlling the number of characters; we can close this if you see fit. Thanks a lot for the support.
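
A worked example of that length budget: the 52-character limit comes from Kubernetes itself, since CronJob names must leave room for the 11 characters the controller appends when it creates Jobs (63 - 11 = 52). The trimming rule below is an assumption inferred from the observed names, not confirmed operator behaviour.

```go
// Worked example of the 52-character budget, assuming the CronJob name is
// "<workflow-name>-<19-digit-suffix>" with the workflow name trimmed to fit.
package main

import "fmt"

const limit = 52 // max CronJob name length (63 minus the 11-char Job suffix)

func cronJobName(workflowName, suffix string) string {
	max := limit - len(suffix) - 1 // room left for the name before "-<suffix>"
	if len(workflowName) > max {
		workflowName = workflowName[:max] // assumed trimming behaviour
	}
	return workflowName + "-" + suffix
}

func main() {
	suffix := "2792369560789523000" // 19 digits

	a := cronJobName("open-patent-service-dmz-k6-probe", suffix)        // 32 chars, fits exactly
	b := cronJobName("open-patent-service-dmz-k6-probe-apigee", suffix) // 39 chars, trimmed to 32

	fmt.Println(a, len(a)) // open-patent-service-dmz-k6-probe-2792369560789523000 52
	fmt.Println(b, len(b)) // the very same name -> the create fails with "already exists"
	fmt.Println(a == b)    // true
}
```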

@tao-aleixandre

(⎈|pequod:osp)➜ ~ k get cronjob | grep open-patent-service-dmz
oopen-patent-service-dmz-k6-apig-2792369560789523000   */5 * * * *       Europe/Amsterdam   False     0        <none>          13s
oopen-patent-service-dmz-k6-gene-2792369560789523000   */5 * * * *       Europe/Amsterdam   False     0        <none>          13s

@vsukhin (Collaborator, Author) commented Jan 31, 2025

Sure, we're happy about user contributions!

Yes, I see, you're welcome. Interesting that 2 different workflows have the same UID. Can you show the metadata for both workflows - resourceVersion, generation, creationTimestamp, etc.?

@tao-aleixandre

TestWorkflow 1

apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  annotations:
    argocd.argoproj.io/tracking-id: <...>
  creationTimestamp: "2025-01-31T15:32:40Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: osp-open-patent-service-dmz-k6
    class: synthetic-probe
    cluster: pequod
    env: non-production
    group: oopen-patent-service-dmz-k6
    namespace: osp
    team: ops-team
    test-type: k6
  name: oopen-patent-service-dmz-k6-apigee
  namespace: osp
  resourceVersion: "821768529"
  uid: 75432e35-f2af-4309-801f-be62a6f6d5eb
spec:
  container:
    env:
   <...>
    resources:
      requests:
        cpu: 1
        memory: 1Gi
    workingDir: /data/repo/tests/non-production
  content:
    git:
      paths:
      - tests/
      revision: non-production
      tokenFrom:
         <...>
  events:
  - cronjob:
      cron: '*/5 * * * *'
  job:
    activeDeadlineSeconds: 115
  pod:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                test-type: k6
            topologyKey: kubernetes.io/hostname
          weight: 50
    labels:
      class: synthetic-probe
      test-type: k6
      testkube-type: execution-pod
    securityContext:
      runAsUser: 1001
    volumes:
    - name: ca-pemstore
      secret:
        items:
        - key: ca.crt
          path: ca.crt
        secretName: epo-root-ca
    - configMap:
        items:
        - key: verifyChecks.js
          path: verifyChecks.js
        name: osp-oopen-patent-service-dmz-k6-k6-verifychecks
      name: verifychecks-volume
  steps:
  - condition: always
    container:
      image: grafana/k6:0.53.0
      imagePullPolicy: IfNotPresent
      volumeMounts:
      - mountPath: /etc/ssl/certs/internal-ca.pem
        name: ca-pemstore
        subPath: ca.crt
    name: k6 complete execution
    steps:
    - name: k6 run
      run:
        args:
        - run
        - apigee.js
        - --out
        - json=/data/repo/tests/non-production/k6-output.json
        - --include-system-env-vars
        - --no-usage-report
        - --tag
        - testName=apigee
        - --tag
        - groupName=oopen-patent-service-dmz-k6
        - --tag
        - workflowName=oopen-patent-service-dmz-k6-apigee
        - --insecure-skip-tls-verify
        env:
        - name: K6_VUS
          value: "1"
    - artifacts:
        paths:
        - k6-output.json
      condition: always
      name: Save artifacts
      workingDir: /data/repo/tests/non-production
  - condition: always
    container:
      env:
      - name: payloadExecutionId
        value: '{{ execution.id }}'
      image: node:20.18.0-alpine3.20
      imagePullPolicy: IfNotPresent
      volumeMounts:
      - mountPath: /data/repo/tests/non-production/verifyChecks.js
        name: verifychecks-volume
        readOnly: false
        subPath: verifyChecks.js
    name: send data to PagerDuty with verifyChecks
    shell: |
      ls -l
      node verifyChecks.js
    workingDir: /data/repo/tests/non-production

TestWorkflow 2

apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  annotations:
    argocd.argoproj.io/tracking-id: <...>
  creationTimestamp: "2025-01-31T15:32:40Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: osp-open-patent-service-dmz-k6
    class: synthetic-probe
    cluster: pequod
    env: non-production
    group: oopen-patent-service-dmz-k6
    namespace: osp
    team: ops-team
    test-type: k6
  name: oopen-patent-service-dmz-k6-general
  namespace: osp
  resourceVersion: "821768530"
  uid: 0b7f3338-b9ef-4188-bc9c-4f50d887f62d
spec:
  container:
    env:
    <...>
    resources:
      requests:
        cpu: 1
        memory: 1Gi
    workingDir: /data/repo/tests/non-production
  content:
    git:
      paths:
      - tests/
      revision: non-production
      tokenFrom:
        <...>
  events:
  - cronjob:
      cron: '*/5 * * * *'
  job:
    activeDeadlineSeconds: 115
  pod:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                test-type: k6
            topologyKey: kubernetes.io/hostname
          weight: 50
    labels:
      class: synthetic-probe
      test-type: k6
      testkube-type: execution-pod
    securityContext:
      runAsUser: 1001
    volumes:
    - name: ca-pemstore
      secret:
        items:
        - key: ca.crt
          path: ca.crt
        secretName: epo-root-ca
    - configMap:
        items:
        - key: verifyChecks.js
          path: verifyChecks.js
        name: osp-oopen-patent-service-dmz-k6-k6-verifychecks
      name: verifychecks-volume
  steps:
  - condition: always
    container:
      image: grafana/k6:0.53.0
      imagePullPolicy: IfNotPresent
      volumeMounts:
      - mountPath: /etc/ssl/certs/internal-ca.pem
        name: ca-pemstore
        subPath: ca.crt
    name: k6 complete execution
    steps:
    - name: k6 run
      run:
        args:
        - run
        - general.js
        - --out
        - json=/data/repo/tests/non-production/k6-output.json
        - --include-system-env-vars
        - --no-usage-report
        - --tag
        - testName=general
        - --tag
        - groupName=oopen-patent-service-dmz-k6
        - --tag
        - workflowName=oopen-patent-service-dmz-k6-general
        - --insecure-skip-tls-verify
        env:
        - name: K6_VUS
          value: "1"
    - artifacts:
        paths:
        - k6-output.json
      condition: always
      name: Save artifacts
      workingDir: /data/repo/tests/non-production
  - condition: always
    container:
      env:
      - name: payloadExecutionId
        value: '{{ execution.id }}'
      image: node:20.18.0-alpine3.20
      imagePullPolicy: IfNotPresent
      volumeMounts:
      - mountPath: /data/repo/tests/non-production/verifyChecks.js
        name: verifychecks-volume
        readOnly: false
        subPath: verifyChecks.js
    name: send data to PagerDuty with verifyChecks
    shell: |
      ls -l
      node verifyChecks.js
    workingDir: /data/repo/tests/non-production
