Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating ClusterCatalog hangs for a while, without any signal for what's going on. #432

Open
anik120 opened this issue Oct 14, 2024 · 4 comments
Labels
v1.x Issues related to features added after v1.0

Comments

@anik120
Copy link
Collaborator

anik120 commented Oct 14, 2024

I created a ClusterCatalog with a private image from redhat.registry.io, but it hung for a while without any signal about what's going on..

NAME                         LASTUNPACKED   SERVING   AGE
redhat-community-operators   1s             True      14m

Notice the last unpacked (1s) vs the age (14m).
When I inspected the logs, I saw that the unpacker ran into intermittent issues, before it was able to pull the image successfully. For the first 14m though, there was no signal on the CR for what's going on (you can see the unpacker struggling if you inspect the controller logs though).

We should restructure our controller to be a state machine, so that the unpacking operation can be transitioned through the different states, instead of the entire operation being an all-in blocking operation.

The different states for the transition machine could be something like:

Initiated->Pulling->Unpacking->Unpacked

@LalatenduMohanty LalatenduMohanty added the v1.x Issues related to features added after v1.0 label Oct 15, 2024
@joelanford
Copy link
Member

I believe this is a result of synchronous image pulling. The fix for this should be to design and implement async image pulling. At that point, the reconciler could kick off a background goroutine to pull an image, and then immediately update status to say "Pulling" somehow.

Then later when the image pull is complete (or fails), we could signal the reconciler to tell it to reconcile again at which point it could find the result and continue unpacking/processing it.

@anik120
Copy link
Collaborator Author

anik120 commented Oct 16, 2024

@joelanford you'd mentioned this call but I don't think I picked up your points that well in the call:

What was the reason you mentioned again for not doing a state machine again? What you're saying here

At that point, the reconciler could kick off a background goroutine to pull an image, and then immediately update status to say "Pulling" somehow.
Then later when the image pull is complete (or fails), we could signal the reconciler to tell it to reconcile again at which point it could find the result and continue unpacking/processing it.

seems much easier to do if we just requeueAfter in each state, until we reach a "final" state (of success or error), with exponential backoff in case of errors.

This is a level-based design too.

we could signal the reconciler to tell it to reconcile again

sounds like you're suggesting an edge-based design.

@joelanford
Copy link
Member

Maybe I'm misinterpreting what you mean by a state machine design, but I've seen cases where controllers are implemented as state machines where they make assumptions that the status of an object is correct and then execute a state's logic based on that assumption. But if the status is incorrect, then the logic that is executed is also incorrect and/or incomplete, and that might mean we never successfully run other parts of the state machine that we think will always run.

The correct design of a controller is to read the actual and current state of the world and then do as much as possible to drive toward that desired state.

we could signal the reconciler to tell it to reconcile again

sounds like you're suggesting an edge-based design.

Not quite. Think of the async image puller a lot like the apiserver and our client that fetches from it, and the watches we register with it.

  1. We ask it: "do you have the data we expect?" If so, great we use it and continue on with that data.
  2. If it doesn't have the data we expect, we ask it to fetch that data in the background (kind of like how we used to create a unpack pod), and then update our status to say "don't have it yet, but it's being fetched"
  3. Just like when the apiserver sends an event via an informer about a change in an object in etcd, we could have something like an informer against the async image puller, and we'd register that as a source.Channel (or similar) on our controller such that we would just get a generic event for "this catalog needs to be reconciled"

So then anytime we reconcile, no matter why we're reconciling, we're looking at current state (what is in our APIserver cache, what is in our image cache) and then taking action based on what we see there.

@anik120
Copy link
Collaborator Author

anik120 commented Oct 28, 2024

I think I see what you mean, let me try to find some time to prototype this. I'll update here once the prototype is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.x Issues related to features added after v1.0
Projects
Status: No status
Development

No branches or pull requests

3 participants