k8s.io disaster recovery plan #70
There are 2 things:
IMO it does. I don't want to be over-pedantic but if we don't force ourselves to do it, it won't get done. :( Brainstorming: This doesn't have to be the most amazing, elegant, automatic thing in the world. It might simply be:
If that is too onerous, what corners can we cut?
Isn't GCR just a GCS bucket fronted by a proxy/API? Could the backend bucket just be backed up/copied? Would the GCP storage transfer service be enough? Or a cron'd gsutil sync?
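For reference, a minimal sketch of the cron'd gsutil sync option; the bucket names below are placeholders (GCR for a given project is backed by a bucket named artifacts.<project-id>.appspot.com), so this is an illustration rather than a concrete proposal:

```sh
#!/usr/bin/env bash
# Sketch: copy the GCS bucket backing the prod GCR into a backup bucket.
# Bucket names are placeholders, not the real project buckets.
set -euo pipefail

SRC="gs://artifacts.k8s-prod-example.appspot.com"  # assumed prod backing bucket
DST="gs://k8s-gcr-backup-example"                  # assumed backup bucket

# -m parallelizes, -r recurses; the -d flag (delete extra objects in DST) is
# deliberately omitted so deletions in prod do not propagate to the backup.
gsutil -m rsync -r "${SRC}" "${DST}"
```

Run from cron or a Kubernetes CronJob, e.g. a crontab entry like `0 2 * * * /path/to/gcr-bucket-backup.sh` for a daily sync.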
Not as simple as a dumb bucket-wise copy, but: GCR just stores digests of images. As long as there is a reference to it, it won't get deleted. So, we could copy everything into another GCR, but namespace it under a timestamped folder. E.g. gcr.io/backup/20190701/..., and it won't eat up a ton of storage because Docker already de-dupes things. And we could also try turning on that lifecycle thing for the underlying GCS (bucket) layer: https://cloud.google.com/storage/docs/lifecycle.
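To make the lifecycle idea concrete, a rough sketch of an age-based rule on the backup bucket (the bucket name and the 90-day age are illustrative assumptions, not values agreed in this thread):

```sh
# Sketch: expire objects in a hypothetical backup bucket after 90 days.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 90}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://artifacts.k8s-backup-example.appspot.com
```

One caveat to check: because the registry de-dupes layers across snapshots, an age-based delete on the backing bucket could remove blobs that newer snapshots still reference, so a rule like this would need care.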
+1 on listx's suggestion. Let's narrow down the solutions so that we can get started and unblock releasing the promoter to the rest of the community.
Is backing up the GCS bucket good enough, or do we need to do it at the GCR level? Copy image by image, tags and so on?
@listx What happens if the bucket or project gets deleted?
@cblecker Can you clarify?
/assign @amy Please continue the discussion. I'm following the thread & will write up a google doc of some options for next week's meeting.
@listx I guess I'm not clear on your multiple-GCRs-with-digests proposal. Isn't GCR just a proxy fronting an underlying GCS bucket scoped to a project?
Yes. But AFAIK GCS alone does not auto-dedup data. A quick google search led me to https://cloud.google.com/solutions/partners/storreduce-cloud-deduplication which supports my assumption. Ultimately we would be taking daily(?) snapshots of all the images in k8s.gcr.io. If deduplication is free (via another Docker Registry such as GCR), then we can even take hourly snapshots and it won't matter much.
Short of reaching consensus on the initial backup approach, let's try to identify some invariants. (1) job duration < 24 hrs: I think we want the backups to happen at least daily. Do these points sound reasonable as a first stab at this problem? I think using the promoter's As for where this backup job logic should live --- I'm guessing github.com/kubernetes/k8s.io, or some other k8s repo (and not this promoter repo).
Looks like there is already a GCS disaster recovery script underway here: kubernetes/k8s.io#334. We should probably follow the same infrastructural patterns established there.
The pattern that I'm proposing in #334 is a different script for copying everything, with a no-overwrite / no-delete policy (I implemented that in code; @thockin pointed out that we can probably just use a retention policy). (Edit: different as in not reusing the same code that we use for promotion) However, for registries which naturally de-dup, I agree with the suggestions of using a date suffix. And nice find on gcrane, @listx! How about:
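A sketch of what such a date-suffixed gcrane copy might look like (registry names are placeholders, and the recursive-copy flag should be verified against the gcrane version in use):

```sh
#!/usr/bin/env bash
# Sketch: copy every repository from the prod registry into a date-stamped
# path in a backup registry. Since the registry de-dupes layers, repeated
# snapshots mostly cost metadata rather than storage.
set -euo pipefail

SRC="gcr.io/k8s-prod-example"                       # assumed prod registry
DST="gcr.io/k8s-backup-example/$(date -u +%Y%m%d)"  # e.g. .../20190701

# -r/--recursive copies all repositories under SRC, preserving digests and tags.
gcrane cp -r "${SRC}" "${DST}"
```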
Of course, it'll take some time to translate that from bash to a programming language ;-) And while this solution does dedup, it doesn't protect against accidental/malicious tag deletion if someone gets access. If we do want to protect against that, another option is to rsync the bucket underlying GCR, and then also export the manifests and upload them. This is relatively cheap, and we also can then have a GCS bucket with a retention policy to prevent overwriting.
(This one probably does need some work, because I cheated when creating the directories: it fails on nested images.) The downside is that it isn't trivial to restore from that, and that we're making some assumptions about the structure of GCR. But we could easily bring up a server that serves from this structure - whether that's a temporary one for DR, or because we want some mirrors that don't use GCR. If we're really sneaky, it's even possible to serve directly from GCS, I believe.
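A rough sketch of the manifest-export half, with the same caveat about nested images (repository enumeration is stubbed out, and all names are placeholders):

```sh
#!/usr/bin/env bash
# Sketch: save each tag's manifest to a file and upload the lot to a GCS
# bucket with a retention policy, so tags can be reconstructed after
# accidental deletion or malicious re-tagging.
set -euo pipefail

REGISTRY="gcr.io/k8s-prod-example"        # assumed prod registry
BUCKET="gs://k8s-gcr-manifests-example"   # assumed retention-locked bucket
STAMP="$(date -u +%Y%m%d)"
REPOS="pause coredns"                     # placeholder; a real job would enumerate repos

for repo in ${REPOS}; do
  mkdir -p "manifests/${repo}"
  for tag in $(crane ls "${REGISTRY}/${repo}"); do
    # The manifest pins the exact layer digests behind this tag.
    crane manifest "${REGISTRY}/${repo}:${tag}" > "manifests/${repo}/${tag}.json"
  done
done

gsutil -m cp -r manifests "${BUCKET}/${STAMP}/"
```

The retention policy itself is a one-time setting on the bucket, e.g. `gsutil retention set 90d gs://k8s-gcr-manifests-example` (the period is again just an example).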
I think it makes sense to just start out with something simple like this. One thing to note here is that the backup GCR will have its own separate service account for write access to the backups. It doesn't buy us a ton of security but it's better than the status quo. Are there any volunteers for this initial implementation using EDIT: I'd like to clarify that I will take an initial stab at the implementation (you should see a PR this week); I just wanted to see if other people on this thread wanted to chip in. :)
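On the separate-service-account point, a sketch of roughly what that could look like (project, account, and bucket names are all placeholders):

```sh
# Sketch: a dedicated writer identity for the backup registry's backing bucket.
gcloud iam service-accounts create gcr-backup-writer \
  --project=k8s-backup-example

# Grant it write access to the bucket that backs the backup GCR. The backup
# job would then authenticate as this account instead of reusing prod
# credentials.
gsutil iam ch \
  "serviceAccount:gcr-backup-writer@k8s-backup-example.iam.gserviceaccount.com:roles/storage.objectAdmin" \
  gs://artifacts.k8s-backup-example.appspot.com
```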
An additional thought: I think it makes sense for the backup GCR to additionally mirror the latest snapshot of the prod GCR. This way, we could just redirect the vanity domain k8s.gcr.io to point to the backup GCR in case the prod GCR gets hosed, so that we don't have to wait for the backfill process to finish (there would be very minimal downtime). The one slightly ugly part is that now the backup GCR looks like this:
where the
I suppose the missing piece here is that the backup GCR has to be made smart enough to only mirror good states (i.e., if an attacker re-tags all images, we don't want the backup mirror to do the same --- there would have to be some sort of delta heuristic for the backup process to detect and know when not to mirror false-positive states of the original).
Are there any thoughts about using the promoter directly for performing backups? We should be able to do this once #118 is merged. The backup process would be:
I think steps 1 and 2 can be glued together with either a shell script or Go binary (we already have the framework for this sort of "glue" code in our e2e tests, so we can reuse the code there if we decide to use Go instead of bash). I think this is 1/2 of Disaster Recovery. The other 1/2 would be the Restoration process that restores backed-up images to a test GCR. This is actually pretty similar to the other half:
I think the only missing piece is some easy way of making the promoter promote directly from a snapshot YAML, by allowing the user to supply the missing
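To make that glue concrete, a very rough sketch of how the backup steps above could be wired together in bash; the -snapshot and -manifest flag names here are assumptions about the promoter CLI rather than its actual interface:

```sh
#!/usr/bin/env bash
# Sketch: promoter-driven backup. Flag names are illustrative assumptions,
# not the promoter's documented CLI.
set -euo pipefail

STAMP="$(date -u +%Y%m%d)"

# 1. Snapshot the current state of the prod registry as a promoter manifest.
cip -snapshot=gcr.io/k8s-prod-example > "snapshot-${STAMP}.yaml"

# 2. Run the promoter against that manifest to copy everything into the
#    backup registry.
cip -manifest="snapshot-${STAMP}.yaml"
```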
I am working on a doc to sum everything up + an initial implementation. Will share with this thread soon... stay tuned!
Here is a writeup of an initial approach/design: https://docs.google.com/document/d/1od5y-Z2xP9mVmg2Yztnv-GQ7D-orj9HsTmeVvNHkzzA/edit?usp=sharing
Mailing list link: https://groups.google.com/d/msg/kubernetes-wg-k8s-infra/cseCwgALwdk/iOYkaEYFCAAJ
You must be a member of the kubernetes-wg-k8s-infra Google group in order to access the document.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/unassign @amy
I'm gonna assign myself to it too because I think it's an important topic we should work on sooner or later. /assign
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
This still needs doing I think.
Broad issue to track what our disaster recovery plan is if the k8s.io registry somehow gets deleted.
One suggestion was creating a backup registry that snapshots the k8s.io registry.