Report
MongoDB Backup is stuck in Status: Waiting and the backup-agent container is not doing anything after the Kubernetes scheduler restarted the backup-agent container during the execution of a restore:
More about the problem
I expect to see an ongoing backup after asking for a backup through a PerconaServerMongoDBBackup yml definition when no other actions (backups / restores) are in progress.
Steps to reproduce
Start a MongoDB cluster in unsafe mode with only 1 replica (this is useful for development environments; a sketch of such a cluster definition is included at the end of this report) and fill it with some data (let's say about 600MB of gzipped data);
Do a MongoDB backup and wait for its completion (Status = Ready) via a PerconaServerMongoDBBackup yml definition that uploads the backup to our AWS S3 bucket (see the example manifests right after these steps);
Drop the collections on the MongoDB ReplicaSet (just to avoid the _id clashes at the next step);
Now ask for a restore of the above backup through a PerconaServerMongoDBRestore yml definition (this works as intended, since I saw the logs and the data inside the MongoDB ReplicaSet);
Ask for another backup with another PerconaServerMongoDBBackup yml definition (keep in mind that at this point the previous restore process is still in progress);
The backup2 will be put in Status=Waiting;
At this point the Kubernetes scheduler should kill the backup-agent container from the MongoDB replica pod because of memory issues and restart it;
Now if you do a kubectl get psmdb-backup, you'll see that backup2 is in Error status and if you do a kubectl get psmdb-restore, you'll see that restore1 is also in Error status (OK, I can take that);
From this point onwards, no backup/restore will be possible through any yml, because they will all just be queued with Status=Waiting.
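For reference, the backup and restore requests from the steps above were expressed as manifests along these lines. This is only a minimal sketch: the resource names (backup1, backup2, restore1) are inferred from the statuses mentioned above, the clusterName is taken from the pod name in the log below, and the storageName is an assumption, since the original yml definitions are not reproduced in this report.
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: backup1                          # backup2 at the later step is requested with an identical manifest
spec:
  clusterName: mongodb-percona-cluster   # inferred from the pod name in the log below
  storageName: s3-us-east                # assumed; must match a storage defined in the cluster cr.yaml
---
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: mongodb-percona-cluster
  backupName: backup1                    # restore from the completed first backup
Each request is created with kubectl apply -f on its manifest; the resulting objects are the ones listed by kubectl get psmdb-backup / kubectl get psmdb-restore above.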
The new backup-agent container logs state that it is waiting for incoming requests:
2024/03/05 16:36:01 [entrypoint] starting `pbm-agent`
2024-03-05T16:36:05.000+0000 I pbm-agent:
Version: 2.3.0
Platform: linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2024-03-05T16:36:05.000+0000 I starting PITR routine
2024-03-05T16:36:05.000+0000 I node: rs0/mongodb-percona-cluster-rs0-0.mongodb-percona-cluster-rs0.default.svc.cluster.local:27017
2024-03-05T16:36:05.000+0000 I listening for the commands
Versions
Kubernetes v1.27.9 in an 8-node cluster with 4GB of RAM each, on Azure Cloud
Anything else?
The same bug also applies to cronjobs (so it's not an issue triggered by the on-demand backup/restore requests): they are kept in Waiting status.
The bug does NOT happen when using a ReplicaSet with at least 3 replicas (the default topology).
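For context, the single-replica "unsafe" cluster from step 1 was defined roughly as follows. This is a sketch under assumptions: the storage name, bucket and credentials secret are made up, and depending on the operator version the switch that allows a 1-member replica set is spec.allowUnsafeConfigurations (older releases) or the newer spec.unsafeFlags block.
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb-percona-cluster
spec:
  allowUnsafeConfigurations: true    # permits the 1-member replica set used in step 1
  replsets:
  - name: rs0
    size: 1                          # single replica; the bug does not reproduce with size: 3
  backup:
    enabled: true
    storages:
      s3-us-east:                    # assumed name, referenced by storageName in the backup manifest above
        type: s3
        s3:
          bucket: my-backup-bucket                       # assumed
          region: us-east-1                              # assumed
          credentialsSecret: my-cluster-s3-credentials   # assumed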