Auto promoted Freight Error Patching ArgoCD Application #2473

Closed
wmiller112 opened this issue Aug 27, 2024 · 6 comments · Fixed by #2499

Comments

@wmiller112
Contributor

wmiller112 commented Aug 27, 2024

Description

After updating from 0.8.4 to 0.8.6 I'm consistently hitting this error for automatically promoted freight only. A subsequent manual promotion of the same freight, or an initial manual promotion of freight into the same stage is successful.

Promotion Errored
error executing Argo CD promotion mechanism: error patching Argo CD Application "<app-name>": failed to patch the object: Operation cannot be fulfilled on applications.argoproj.io "<app-name>": the object has been modified; please apply your changes to the latest version and try again

Steps to Reproduce

  1. Warehouse subscribed to a git repo and image repo
  2. Stage with promotionMechanisms.gitRepoUpdates and promotionMechanisms.argoCDAppUpdates configured with multiple Argo CD Applications
    • The stage can accept freight directly from the warehouse or require it to pass through another stage
    • The Applications are annotated with the authorized stage
    • autoPromotionEnabled is set for the stage on the project
  3. New image tag discovered and freight generated
  4. Auto promotion begins, freight changes are committed to git
  5. Error occurs as the argocd app refresh/sync process starts
  6. Manually promoting the same freight into the stage succeeds

Version

Client Version: v0.8.7
Server Version: v0.8.6

Logs

"error executing Promotion" error="error executing Argo CD promotion mechanism: error patching Argo CD Application \"<app-name>\": failed to patch the object: Operation cannot be fulfilled on applications.argoproj.io \"<app-name>\": the object has been modified; please apply your changes to the latest version and try again" freight=cd45f097cc7d86b05091f22181b96c087523f888 namespace=<app-ns> promotion=canary.01j6axmchf8jyx54svs7a0mb6a.cd45f09 stage=canary
@wmiller112
Contributor Author

This seems to have been a result of having the Argo CD auto-sync policy enabled on the applications managed by the stage. Up to this point, I hadn't seen any particular reason to turn it off or on. I assumed it didn't matter what triggered the sync, since the branch the Argo CD app is associated with only gets updated when new freight is ready and pushed to the branch. With the optimistic locking for Argo CD app sync added in 0.8.5, I see now that it interferes with Kargo's ability to trigger the sync.

@wmiller112
Contributor Author

Never mind - this persists sporadically with auto-sync disabled. I'm wondering if it's the application being refreshed (not synced) by Argocd while kargo is trying to trigger its own refresh/sync? They get refreshed fairly quickly because they are configured with webhooks, but I imagine if that were the case, even without webhook configured, a stage that's responsible for enough Applications would run into that eventually when a polling refresh coincided with a kargo promotion.

Either way, I'm curious whether it makes sense to have some kind of retry here (getting the resource, generating the patch, and attempting to apply it), so that a single conflict doesn't error the promotion.

@wmiller112 wmiller112 reopened this Sep 3, 2024
@hiddeco
Contributor

hiddeco commented Sep 3, 2024

Any chance you can catch the individual versions of the object from around the timeframe the issue occurs?

Asking because I would like to understand the core of the issue better, before trying to address it.

@wmiller112
Contributor Author

I captured all apps before the promotion, and the failed app after it. The only thing that stands out in the diff is status.resources with an HPA recommendation. The latest item in status.history is a previous kargo-controller triggered action (about 15 minutes before the current promotion) that was successful. The status.reconciledAt, however, is about 20 seconds after the failed promotion began.

@hiddeco
Contributor

hiddeco commented Sep 3, 2024

I'll try to reproduce this tomorrow to get a better idea.

I am a bit suspicious about what causes there to be a race, as due to how this is written, the chances of this happening based on an external factor should be quite slim (i.e., I would not expect this to be caused by ArgoCD itself). At the same time, if Kargo itself issues the patch twice, I would expect the client to catch up fairly quickly, making the chance of this happening also slim.

In any of the above scenarios, retrying would indeed be the best solution. But we need to be sure about the actual cause.

@wmiller112
Contributor Author

Sounds good, let me know if I can provide any more details to help replicate it. Even with the chance of it happening being slim, I'm curious in what case we wouldn't want to retry. Maybe I'm misunderstanding the goal of the write lock, but if it is just to prevent overwriting changes made by some other tooling, should something else interacting with the Argo CD Application be a potential blocker for a promotion?
