K8SPXC-1534 | [bug] fix issue with inconsistent secret reconciliation #1945

Open
wants to merge 3 commits into
base: main
from

Conversation

mayankshah1607
Member

@mayankshah1607 mayankshah1607 commented Jan 17, 2025


CHANGE DESCRIPTION

Problem:

Creating a new PXC cluster can sometimes result in pods that go into the CreateContainerConfigError state and never recover from it.

STR:

  1. Create a dummy user Secret:

       apiVersion: v1
       kind: Secret
       metadata:
         name: cluster1-secrets
       type: Opaque
       stringData:
         monitor: test

  2. Create a minimal PXC cluster and reference this Secret in .spec.secretsName

We expect cluster1-secrets to be updated with all the other users (6 users), and a copy of this secret named internal-<clusterName> to be created. However, while cluster1-secrets is reconciled, the internal-<clusterName> secret still contains only the monitor user from the originally created secret, and is never updated or reconciled with cluster1-secrets.
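One way to confirm the symptom is to compare the keys of the two secrets. The following is only a minimal sketch: `missingKeys` is a hypothetical helper (not part of the operator), the secrets are modeled as plain maps, and the user names other than `monitor` are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// missingKeys returns the keys that exist in the user secret but are absent
// from the internal copy, sorted for stable output.
// Hypothetical helper for illustration only.
func missingKeys(userSecret, internalSecret map[string][]byte) []string {
	var missing []string
	for k := range userSecret {
		if _, ok := internalSecret[k]; !ok {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Placeholder data: "userA" and "userB" stand in for the users the
	// operator adds; only "monitor" comes from the STR above.
	user := map[string][]byte{
		"monitor": []byte("test"),
		"userA":   []byte("generated"),
		"userB":   []byte("generated"),
	}
	internal := map[string][]byte{"monitor": []byte("test")}

	fmt.Println(missingKeys(user, internal)) // prints "[userA userB]"
}
```

A non-empty result means the internal secret was built from a non-reconciled version of the user secret.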

NOTE: This issue is not consistently reproducible; I was able to see it happen in only about 1 in every 5-6 attempts. However, it was observed multiple times during Everest QA testing.

Cause:

With respect to the Secret reconciliation, there are 2 steps executed in order:

  1. Reconcile the .spec.secretsName (includes validating and filling out defaults) [1]
  2. Create a copy of this secret named internal-[clusterName] and mount it onto pods. [2]

The result of (1) is written back to the Kube API, and in step (2) we read this secret from the Kube API again to create the copy. While this logic looks fine overall, it assumes that step (2) reads a consistent result from the Kube API, which may not be the case. Step (2) can therefore create a copy based on an outdated (non-reconciled) secret, which is what leads to the CreateContainerConfigError.

Since the controller does not watch this secret, the kube client cache may not be immediately updated, resulting in an inconsistent read in step (2).
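The race can be illustrated with a small simulation. This is a sketch under stated assumptions, not the operator's code: `laggingClient`, `reconcileUsers` and `reconcileInternal` are hypothetical names, secrets are plain maps, and the cache is refreshed only when `sync` is called (standing in for a watch event).

```go
package main

import "fmt"

// Secret is a simplified stand-in for the data of a Kubernetes Secret.
type Secret map[string]string

// copySecret returns an independent copy of s.
func copySecret(s Secret) Secret {
	out := make(Secret, len(s))
	for k, v := range s {
		out[k] = v
	}
	return out
}

// laggingClient models a client that writes to the API server directly but
// serves reads from a local cache, which is refreshed only by sync.
type laggingClient struct {
	api   map[string]Secret
	cache map[string]Secret
}

func (c *laggingClient) update(name string, s Secret) { c.api[name] = copySecret(s) }
func (c *laggingClient) get(name string) Secret       { return copySecret(c.cache[name]) }
func (c *laggingClient) sync(name string)             { c.cache[name] = copySecret(c.api[name]) }

// reconcileUsers is step (1): fill in missing users and write the result back.
func reconcileUsers(c *laggingClient, name string, users []string) {
	s := c.get(name)
	for _, u := range users {
		if _, ok := s[u]; !ok {
			s[u] = "generated"
		}
	}
	c.update(name, s)
}

// reconcileInternal is step (2): re-read the secret and copy it.
func reconcileInternal(c *laggingClient, name string) Secret {
	return c.get(name)
}

func main() {
	initial := Secret{"monitor": "test"}
	c := &laggingClient{
		api:   map[string]Secret{"cluster1-secrets": copySecret(initial)},
		cache: map[string]Secret{"cluster1-secrets": copySecret(initial)},
	}

	// "userA" is a placeholder for a user added by the operator.
	reconcileUsers(c, "cluster1-secrets", []string{"monitor", "userA"})

	// No cache refresh happened in between, so step (2) sees the old data.
	internal := reconcileInternal(c, "cluster1-secrets")
	fmt.Println(len(c.api["cluster1-secrets"]), len(internal)) // prints "2 1"
}
```

If `c.sync("cluster1-secrets")` runs between the two steps, the copy contains both users, which matches the intermittent nature of the failure.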

Solution:

This PR introduces the following changes to address this issue:

  1. Watch the Secret so that the cache is synced on time.
  2. Instead of reading the result from the Kube API in step (2), the result of step (1) is shared directly with step (2). This guarantees that both steps work with the same data, and it also avoids an unnecessary API call.
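The second change can be sketched as follows. This is a simplified illustration rather than the actual diff: `reconcileUsers` and `buildInternal` are hypothetical names, secrets are plain maps, and the user names other than `monitor` are placeholders.

```go
package main

import "fmt"

// Secret is a simplified stand-in for the data of a Kubernetes Secret.
type Secret map[string]string

// reconcileUsers is step (1): it fills in missing users and returns the
// reconciled secret to the caller instead of only writing it back.
func reconcileUsers(current Secret, users []string) Secret {
	out := make(Secret, len(current)+len(users))
	for k, v := range current {
		out[k] = v
	}
	for _, u := range users {
		if _, ok := out[u]; !ok {
			out[u] = "generated"
		}
	}
	return out
}

// buildInternal is step (2): it derives the internal copy from the object
// handed over by step (1), so no second read (and no stale cache) is involved.
func buildInternal(reconciled Secret) Secret {
	out := make(Secret, len(reconciled))
	for k, v := range reconciled {
		out[k] = v
	}
	return out
}

func main() {
	user := Secret{"monitor": "test"}

	// "userA" and "userB" are placeholders for users added by the operator.
	reconciled := reconcileUsers(user, []string{"monitor", "userA", "userB"})
	internal := buildInternal(reconciled)

	fmt.Println(len(reconciled), len(internal)) // prints "3 3"
}
```

Because step (2) never re-reads the secret, its input cannot be older than the output of step (1), regardless of how quickly the cache catches up.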

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

* add a watch in the cache for secret specified in .spec.secretsName
* once the original secret is reconciled, use the updated secret object
to reconcile the internal secret, rather than fetching from the cache, which
could be outdated

Signed-off-by: Mayank Shah <[email protected]>
@pull-request-size pull-request-size bot added the size/M 30-99 lines label Jan 17, 2025
@it-percona-cla

it-percona-cla commented Jan 17, 2025

CLA assistant check
All committers have signed the CLA.

@mayankshah1607 mayankshah1607 changed the title from "K8SPXC-1534 | [bug] new clusters may sometimes go into CreateContainerConfigError state" to "K8SPXC-1534 | [bug] fix issue with inconsistent secret reconciliation" Jan 17, 2025
@mayankshah1607 mayankshah1607 marked this pull request as ready for review January 17, 2025 11:07
@JNKPercona
Collaborator

Test name Status
affinity-8-0 passed
auto-tuning-8-0 passed
cross-site-8-0 passed
custom-users-8-0 passed
demand-backup-cloud-8-0 passed
demand-backup-encrypted-with-tls-8-0 passed
demand-backup-8-0 passed
haproxy-5-7 passed
haproxy-8-0 passed
init-deploy-5-7 passed
init-deploy-8-0 passed
limits-8-0 passed
monitoring-2-0-8-0 passed
one-pod-5-7 passed
one-pod-8-0 passed
pitr-8-0 passed
pitr-gap-errors-8-0 passed
proxy-protocol-8-0 passed
proxysql-sidecar-res-limits-8-0 passed
pvc-resize-5-7 passed
pvc-resize-8-0 passed
recreate-8-0 passed
restore-to-encrypted-cluster-8-0 passed
scaling-proxysql-8-0 passed
scaling-8-0 passed
scheduled-backup-5-7 passed
scheduled-backup-8-0 passed
security-context-8-0 passed
smart-update1-8-0 passed
smart-update2-8-0 passed
storage-8-0 passed
tls-issue-cert-manager-ref-8-0 passed
tls-issue-cert-manager-8-0 passed
tls-issue-self-8-0 passed
upgrade-consistency-8-0 passed
upgrade-haproxy-5-7 passed
upgrade-haproxy-8-0 passed
upgrade-proxysql-5-7 passed
upgrade-proxysql-8-0 passed
users-5-7 passed
users-8-0 passed
validation-hook-8-0 passed
We ran 42 out of 42 tests.

commit: 079482d
image: perconalab/percona-xtradb-cluster-operator:PR-1945-079482d0

@egegunes egegunes added this to the v1.17.0 milestone Jan 17, 2025
4 participants