Workers stuck due to nonexistent migrations #1228

grzleadams · 2024-03-21T00:01:24Z

Version
Pulp Operator: 1.0.0-beta.3
(The following are both important to note, as I'll describe later.)
Affected node: pulp-minimal:stable@0f66fa60566b
Unaffected node: pulp-minimal:stable@4b30cd0edbb7

Describe the bug
I'm not sure if this belongs in the operator, oci-images, or another repo, so let me know if I should open it somewhere else.
We deployed Pulp in our K8s cluster and everything was fine. A few days later, I went to bump the worker replicas, but the new pods just sat there with the following in their logs:

error: Failed to initialize NSS library
Database migration in progress. Waiting...

After a while (and stumbling across this issue), it occurred to me that the new pods were probably on nodes that hadn't previously pulled the minimal image, and sure enough, the image digests on the two nodes didn't match. The API pod was on the "old" image; the new pods were on the "new" one. So, unless I'm misunderstanding what's happening here, these workers will never come up until the API pod is refreshed (with imagePullPolicy set to Always, or with the "unaffected" nodes cordoned, to ensure it comes up on the newer image).
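For anyone who wants to check for the same mismatch, here is a rough sketch of the comparison described above. It assumes the pod list has the shape returned by `kubectl get pods -o json` and relies on the standard `status.containerStatuses[].image` and `imageID` fields. The function names, registry name, and digests are made up for illustration and are not part of Pulp or the operator.

```python
from collections import defaultdict


def group_pods_by_image_id(pod_list: dict) -> dict:
    """Map each (image, imageID) pair reported in pod status to the pods running it."""
    groups = defaultdict(list)
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            groups[(status["image"], status["imageID"])].append((name, node))
    return dict(groups)


def mixed_digests(pod_list: dict) -> dict:
    """Return image references (name:tag) that resolve to more than one imageID."""
    ids_by_image = defaultdict(set)
    for image, image_id in group_pods_by_image_id(pod_list):
        ids_by_image[image].add(image_id)
    return {img: sorted(ids) for img, ids in ids_by_image.items() if len(ids) > 1}


# Hypothetical data: placeholder registry and digests, not the real ones.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "pulp-api-0"},
            "spec": {"nodeName": "node-a"},
            "status": {"containerStatuses": [{
                "image": "example.test/pulp-minimal:stable",
                "imageID": "example.test/pulp-minimal@sha256:aaaa",
            }]},
        },
        {
            "metadata": {"name": "pulp-worker-1"},
            "spec": {"nodeName": "node-a"},
            "status": {"containerStatuses": [{
                "image": "example.test/pulp-minimal:stable",
                "imageID": "example.test/pulp-minimal@sha256:aaaa",
            }]},
        },
        {
            "metadata": {"name": "pulp-worker-2"},
            "spec": {"nodeName": "node-b"},
            "status": {"containerStatuses": [{
                "image": "example.test/pulp-minimal:stable",
                "imageID": "example.test/pulp-minimal@sha256:bbbb",
            }]},
        },
    ]
}

if __name__ == "__main__":
    for image, ids in mixed_digests(SAMPLE_PODS).items():
        print(f"{image} resolves to {len(ids)} different digests: {ids}")
```

On a real cluster, the same functions could be fed `json.load(sys.stdin)` from the kubectl output. Any image reference that shows up with more than one imageID suggests that nodes have cached different builds behind the same tag, though the exact form of the `image` string can vary between container runtimes.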

To Reproduce
Steps to reproduce the behavior:

1. Deploy Pulp operator and a Pulp instance.
2. Wait a few days (or until there's a new re-tag of pulp-minimal:latest).
3. Cordon nodes currently running workers.
4. Scale workers up and ensure they come up on "fresh" nodes with the newer image.

Expected behavior
I'm not sure... it's easy enough for the user to be more explicit with the images they want to use. On the other hand, that involves providing a bunch of values in the Pulp spec that can easily be mixed up. It seems like the Operator should recognize that a "new" image exists and update all the pods (in rolling fashion, of course), to ensure the migrations actually happen.

Additional context
None... but I wanted to say thanks for all the work that goes into Pulp!

grzleadams · 2024-03-21T00:45:48Z

I did just notice that the docs say:

WARN: defining a different image than the one used by API pods can cause unexpected behaviors!

So maybe the key would be to move away from tagging latest for the image, since users can unintentionally fall into the situation that the docs warn about?

git-hyagi · 2024-04-05T18:46:02Z

Hi @grzleadams,

Thank you for the detailed description, and sorry for the late response!

> It seems like the Operator should recognize that a "new" image exists and update all the pods (in rolling fashion, of course), to ensure the migrations actually happen.

The operator's current logic checks whether spec.{image,image_version} changed and, if so, triggers a reconciliation to update the Deployments with the new image. I think the problem is that, by default, we leave ImagePullPolicy as IfNotPresent. Changing it to Always should avoid the error caused by a "cached"/"old" image version with the same tag:

spec:
  image_pull_policy: Always

> So maybe the key would be to move away from tagging latest for the image, since users can unintentionally fall into the situation that the docs warn about?

Hmm... that is a good point. Another idea would be to tie the pulp-minimal image version to the pulp-operator image version: instead of allowing users to define which version of Pulp they would like to install, installing pulp-operator v1 would install pulpcore v1. But this would bring other issues, for example, in air-gapped environments where users point to custom registries, or in a QA environment pointing to custom images.

grzleadams added Issue Triage-Needed labels Mar 21, 2024

mikedep333 removed the Triage-Needed label Mar 26, 2024
