Report
MongoDB Backup is stuck in Status: Waiting and the backup-agent container is not doing anything after the Kubernetes scheduler restarted the backup-agent container during the execution of a restore:
More about the problem
I expect to see an ongoing backup after asking for a backup through a PerconaServerMongoDBBackup yml definition when no other actions (backups / restores) are in progress.
Steps to reproduce
Start a MongoDB cluster in unsafe mode with only 1 replica (this is useful for development environments; a sketch of such a cluster definition is included at the end of this report) and fill it with some data (let's say about 600MB of gzipped data);
Do a MongoDB backup and wait for its completion (Status = Ready) via a PerconaServerMongoDBBackup yml definition that uploads the backup to our AWS S3 bucket (see the example manifests right after these steps);
Drop the collections on the MongoDB ReplicaSet (just to avoid the _id clashes at the next step);
Now ask for a restore of the above backup through a PerconaServerMongoDBRestore yml definition (this works as intended, since I saw the logs and the data inside the MongoDB ReplicaSet);
Ask for another backup with another PerconaServerMongoDBBackup yml definition (keep in mind that at this point the previous restore process is still in progress);
The backup2 will be put in Status=Waiting;
At this point the Kubernetes scheduler should kill the backup-agent container from the MongoDB replica pod because of memory issues and restart it;
Now if you do a kubectl get psmdb-backup, you'll see that backup2 is in Error status and if you do a kubectl get psmdb-restore, you'll see that restore1 is also in Error status (OK, I can take that);
From this point onwards, no backup/restore will be possible through any yml, because they will all just be queued with Status=Waiting.
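For reference, the backup and restore requests from the steps above were expressed as manifests along these lines. This is only a minimal sketch: the resource names (backup1, backup2, restore1) are inferred from the statuses mentioned above, the clusterName is taken from the pod name in the log below, and the storageName is an assumption, since the original yml definitions are not reproduced in this report.
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: backup1                          # backup2 at the later step is requested with an identical manifest
spec:
  clusterName: mongodb-percona-cluster   # inferred from the pod name in the log below
  storageName: s3-us-east                # assumed; must match a storage defined in the cluster cr.yaml
---
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: mongodb-percona-cluster
  backupName: backup1                    # restore from the completed first backup
Each request is created with kubectl apply -f on its manifest; the resulting objects are the ones listed by kubectl get psmdb-backup / kubectl get psmdb-restore above.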
The new backup-agent container logs state that it is waiting for incoming requests:
2024/03/05 16:36:01 [entrypoint] starting `pbm-agent`
2024-03-05T16:36:05.000+0000 I pbm-agent:
Version: 2.3.0
Platform: linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2024-03-05T16:36:05.000+0000 I starting PITR routine
2024-03-05T16:36:05.000+0000 I node: rs0/mongodb-percona-cluster-rs0-0.mongodb-percona-cluster-rs0.default.svc.cluster.local:27017
2024-03-05T16:36:05.000+0000 I listening for the commands
Versions
Kubernetes v1.27.9 in an 8-node cluster with 4GB of RAM each, on Azure Cloud
Anything else?
The same bug also applies to cronjobs (so it's not an issue triggered by the on-demand backup/restore requests): they are kept in Waiting status.
The bug does NOT happen when using a ReplicaSet with at least 3 replicas (the default topology).
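For context, the single-replica "unsafe" cluster from step 1 was defined roughly as follows. This is a sketch under assumptions: the storage name, bucket and credentials secret are made up, and depending on the operator version the switch that allows a 1-member replica set is spec.allowUnsafeConfigurations (older releases) or the newer spec.unsafeFlags block.
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb-percona-cluster
spec:
  allowUnsafeConfigurations: true    # permits the 1-member replica set used in step 1
  replsets:
  - name: rs0
    size: 1                          # single replica; the bug does not reproduce with size: 3
  backup:
    enabled: true
    storages:
      s3-us-east:                    # assumed name, referenced by storageName in the backup manifest above
        type: s3
        s3:
          bucket: my-backup-bucket                       # assumed
          region: us-east-1                              # assumed
          credentialsSecret: my-cluster-s3-credentials   # assumed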