k8s.io disaster recovery plan #70
There are 2 things:
IMO it does. I don't want to be over-pedantic but if we don't force ourselves to do it, it won't get done. :( Brainstorming: This doesn't have to be the most amazing, elegant, automatic thing in the world. It might simply be:
If that is too onerous, what corners can we cut?
Isn't GCR just a GCS bucket fronted by a proxy/API? Could the backend bucket just be backed up/copied? Would the GCP storage transfer service be enough? Or a cron'd gsutil sync?
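For reference, a minimal sketch of the cron'd gsutil sync option; the bucket names below are placeholders (GCR for a given project is backed by a bucket named artifacts.<project-id>.appspot.com), so this is an illustration rather than a concrete proposal:

```sh
#!/usr/bin/env bash
# Sketch: copy the GCS bucket backing the prod GCR into a backup bucket.
# Bucket names are placeholders, not the real project buckets.
set -euo pipefail

SRC="gs://artifacts.k8s-prod-example.appspot.com"  # assumed prod backing bucket
DST="gs://k8s-gcr-backup-example"                  # assumed backup bucket

# -m parallelizes, -r recurses; the -d flag (delete extra objects in DST) is
# deliberately omitted so deletions in prod do not propagate to the backup.
gsutil -m rsync -r "${SRC}" "${DST}"
```

Run from cron or a Kubernetes CronJob, e.g. a crontab entry like `0 2 * * * /path/to/gcr-bucket-backup.sh` for a daily sync.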
Not as simple as a dumb bucket-wise copy, but: GCR just stores digests of images. As long as there is a reference to it, it won't get deleted. So, we could copy everything into another GCR, but namespace it under a timestamped folder. E.g. gcr.io/backup/20190701/..., and it won't eat up a ton of storage because Docker already de-dupes things. And we could also try turning on that lifecycle thing for the underlying GCS (bucket) layer: https://cloud.google.com/storage/docs/lifecycle.
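To make the lifecycle idea concrete, a rough sketch of an age-based rule on the backup bucket (the bucket name and the 90-day age are illustrative assumptions, not values agreed in this thread):

```sh
# Sketch: expire objects in a hypothetical backup bucket after 90 days.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 90}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://artifacts.k8s-backup-example.appspot.com
```

One caveat to check: because the registry de-dupes layers across snapshots, an age-based delete on the backing bucket could remove blobs that newer snapshots still reference, so a rule like this would need care.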
+1 on listx's suggestion. Let's narrow down the solutions so that we can get started and unblock releasing the promoter to the rest of the community.
Is backing up the GCS bucket good enough, or do we need to do it at the GCR level? Copy image by image, tags and so on?
@listx What happens if the bucket or project gets deleted?
@cblecker Can you clarify?
/assign @amy Please continue the discussion. I'm following the thread & will write up a google doc of some options for next week's meeting.
@listx I guess I'm not clear on your multiple-GCRs-with-digests proposal. Isn't GCR just a proxy fronting an underlying GCS bucket scoped to a project?
Yes. But AFAIK GCS alone does not auto-dedup data. A quick google search led me to https://cloud.google.com/solutions/partners/storreduce-cloud-deduplication which supports my assumption. Ultimately we would be taking daily(?) snapshots of all the images in k8s.gcr.io. If deduplication is free (via another Docker Registry such as GCR), then we can even take hourly snapshots and it won't matter much.
Short of reaching consensus on the initial backup approach, let's try to identify some invariants. (1) job duration < 24 hrs: I think we want the backups to happen at least daily. Do these points sound reasonable as a first stab at this problem? I think using the promoter's As for where this backup job logic should live --- I'm guessing github.com/kubernetes/k8s.io, or some other k8s repo (and not this promoter repo).
Looks like there is already a GCS disaster recovery script underway here: kubernetes/k8s.io#334. We should probably follow the same infrastructural patterns established there.
The pattern that I'm proposing in #334 is a different script for copying everything, with a no-overwrite / no-delete policy (I implemented that in code; @thockin pointed out that we can probably just use a retention policy). (Edit: different as in not reusing the same code that we use for promotion) However, for registries which naturally de-dup, I agree with the suggestions of using a date suffix. And nice find on gcrane, @listx! How about:
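A sketch of what such a date-suffixed gcrane copy might look like (registry names are placeholders, and the recursive-copy flag should be verified against the gcrane version in use):

```sh
#!/usr/bin/env bash
# Sketch: copy every repository from the prod registry into a date-stamped
# path in a backup registry. Since the registry de-dupes layers, repeated
# snapshots mostly cost metadata rather than storage.
set -euo pipefail

SRC="gcr.io/k8s-prod-example"                       # assumed prod registry
DST="gcr.io/k8s-backup-example/$(date -u +%Y%m%d)"  # e.g. .../20190701

# -r/--recursive copies all repositories under SRC, preserving digests and tags.
gcrane cp -r "${SRC}" "${DST}"
```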
Of course, it'll take some time to translate that from bash to a programming language ;-) And while this solution does dedup, it doesn't protect against accidental/malicious tag deletion if someone gets access. If we do want to protect against that, another option is to rsync the bucket underlying GCR, and then also export the manifests and upload them. This is relatively cheap, and we also can then have a GCS bucket with a retention policy to prevent overwriting.
(This one probably does need some work, because I cheated when creating the directories: it fails on nested images.) The downside is that it isn't trivial to restore from that, and that we're making some assumptions about the structure of GCR. But we could easily bring up a server that serves from this structure - whether that's a temporary one for DR, or because we want some mirrors that don't use GCR. If we're really sneaky, it's even possible to serve directly from GCS, I believe.
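A rough sketch of the manifest-export half, with the same caveat about nested images (repository enumeration is stubbed out, and all names are placeholders):

```sh
#!/usr/bin/env bash
# Sketch: save each tag's manifest to a file and upload the lot to a GCS
# bucket with a retention policy, so tags can be reconstructed after
# accidental deletion or malicious re-tagging.
set -euo pipefail

REGISTRY="gcr.io/k8s-prod-example"        # assumed prod registry
BUCKET="gs://k8s-gcr-manifests-example"   # assumed retention-locked bucket
STAMP="$(date -u +%Y%m%d)"
REPOS="pause coredns"                     # placeholder; a real job would enumerate repos

for repo in ${REPOS}; do
  mkdir -p "manifests/${repo}"
  for tag in $(crane ls "${REGISTRY}/${repo}"); do
    # The manifest pins the exact layer digests behind this tag.
    crane manifest "${REGISTRY}/${repo}:${tag}" > "manifests/${repo}/${tag}.json"
  done
done

gsutil -m cp -r manifests "${BUCKET}/${STAMP}/"
```

The retention policy itself is a one-time setting on the bucket, e.g. `gsutil retention set 90d gs://k8s-gcr-manifests-example` (the period is again just an example).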
I think it makes sense to just start out with something simple like this. One thing to note here is that the backup GCR will have its own separate service account for write access to the backups. It doesn't buy us a ton of security but it's better than the status quo. Are there any volunteers for this initial implementation using EDIT: I'd like to clarify that I will take an initial stab at the implementation (you should see a PR this week); I just wanted to see if other people on this thread wanted to chip in. :)
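On the separate-service-account point, a sketch of roughly what that could look like (project, account, and bucket names are all placeholders):

```sh
# Sketch: a dedicated writer identity for the backup registry's backing bucket.
gcloud iam service-accounts create gcr-backup-writer \
  --project=k8s-backup-example

# Grant it write access to the bucket that backs the backup GCR. The backup
# job would then authenticate as this account instead of reusing prod
# credentials.
gsutil iam ch \
  "serviceAccount:gcr-backup-writer@k8s-backup-example.iam.gserviceaccount.com:roles/storage.objectAdmin" \
  gs://artifacts.k8s-backup-example.appspot.com
```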
An additional thought: I think it makes sense for the backup GCR to additionally mirror the latest snapshot of the prod GCR. This way, we could just redirect the vanity domain k8s.gcr.io to point to the backup GCR in case the prod GCR gets hosed, so that we don't have to wait for the backfill process to finish (there would be very minimal downtime). The one slightly ugly part is that now the backup GCR looks like this:
where the
I suppose the missing piece here is that the backup GCR has to be made smart enough to only mirror good states (i.e., if an attacker re-tags all images, we don't want the backup mirror to do the same --- there would have to be some sort of delta heuristic for the backup process to detect and know when not to mirror false-positive states of the original).
Are there any thoughts about using the promoter directly for performing backups? We should be able to do this once #118 is merged. The backup process would be:
I think steps 1 and 2 can be glued together with either a shell script or Go binary (we already have the framework for this sort of "glue" code in our e2e tests, so we can reuse the code there if we decide to use Go instead of bash). I think this is 1/2 of Disaster Recovery. The other 1/2 would be the Restoration process that restores backed-up images to a test GCR. This is actually pretty similar to the other half:
I think the only missing piece is some easy way of making the promoter promote directly from a snapshot YAML, by allowing the user to supply the missing
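To make that glue concrete, a very rough sketch of how the backup steps above could be wired together in bash; the -snapshot and -manifest flag names here are assumptions about the promoter CLI rather than its actual interface:

```sh
#!/usr/bin/env bash
# Sketch: promoter-driven backup. Flag names are illustrative assumptions,
# not the promoter's documented CLI.
set -euo pipefail

STAMP="$(date -u +%Y%m%d)"

# 1. Snapshot the current state of the prod registry as a promoter manifest.
cip -snapshot=gcr.io/k8s-prod-example > "snapshot-${STAMP}.yaml"

# 2. Run the promoter against that manifest to copy everything into the
#    backup registry.
cip -manifest="snapshot-${STAMP}.yaml"
```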
I am working on a doc to sum everything up + an initial implementation. Will share with this thread soon... stay tuned!
Here is a writeup of an initial approach/design: https://docs.google.com/document/d/1od5y-Z2xP9mVmg2Yztnv-GQ7D-orj9HsTmeVvNHkzzA/edit?usp=sharing
Mailing list link: https://groups.google.com/d/msg/kubernetes-wg-k8s-infra/cseCwgALwdk/iOYkaEYFCAAJ
You must be a member of the kubernetes-wg-k8s-infra Google group in order to access the document.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/unassign @amy
I'm gonna assign myself to it too because I think it's an important topic we should work on sooner or later. /assign
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
This still needs doing I think.
Broad issue to track what our disaster recovery plan is if the k8s.io registry somehow gets deleted.
One suggestion was creating a backup registry that snapshots the k8s.io registry.