Add DEP-06: Immutable ETCD Backups #884

seshachalam-yv · 2024-10-01T11:46:39Z

How to categorize this PR?

/area backup
/area disaster-recovery
/area security
/area compliance
/area storage
/kind enhancement

What this PR does / why we need it:
This PR adds DEP-06: Immutable ETCD Backups. The proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Add DEP-06: Immutable ETCD Backups

docs/proposals/06-immutable-etcd-backups.md

ashwani2k

Thanks @seshachalam-yv @ishan16696 @renormalize for the proposal.
It captures things well, but I've left some open points, especially on the structure, as well as some details on how it addresses design considerations.

docs/proposals/06-immutable-etcd-backups.md

ashwani2k · 2024-10-10T05:36:55Z

docs/proposals/06-immutable-etcd-backups.md

+
+### Excluding Snapshots Under Specific Circumstances
+
+Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:


This can happen outside of immutable backup scenarios as well, so how is it handled there? I'm guessing it is currently done by manually deleting the affected snapshots.
With this new approach, the same mechanism should apply there as well.

If snapshots are mutable, this is achieved through deletion of snapshots.
The same functionality will be achieved through custom metadata tags. Will enhance the doc for this.

ashwani2k · 2024-10-10T05:37:59Z

docs/proposals/06-immutable-etcd-backups.md

+
+Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
+
+- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).


It's not clear from the doc who takes care of attaching the custom metadata flag and how it's consumed. Can we describe this here to avoid any unintended interpretation of the flow?

Human operators add these tags; will include this.
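To make the agreed flow concrete (a human operator attaches the tag, the restore path skips tagged objects), here is a minimal sketch. It assumes snapshot metadata is available as a plain string map; the `Snapshot` type and `filterExcluded` helper are hypothetical and are not the `etcd-backup-restore` implementation.

```go
package main

import "fmt"

// Snapshot is a hypothetical, simplified view of a backup object together with
// the custom metadata (tags or annotations) attached to it in the object store.
type Snapshot struct {
	Name     string
	Metadata map[string]string
}

// excludeKey is the metadata key an operator sets to "true" to skip a snapshot.
const excludeKey = "x-etcd-snapshot-exclude"

// filterExcluded drops snapshots tagged for exclusion. Only metadata is read;
// the immutable snapshot data itself is never modified.
func filterExcluded(snaps []Snapshot) []Snapshot {
	kept := make([]Snapshot, 0, len(snaps))
	for _, s := range snaps {
		if s.Metadata[excludeKey] == "true" {
			continue // operator marked this snapshot as unusable for restoration
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	snaps := []Snapshot{
		{Name: "full-1"},
		{Name: "delta-2", Metadata: map[string]string{excludeKey: "true"}},
		{Name: "delta-3"},
	}
	for _, s := range filterExcluded(snaps) {
		fmt.Println(s.Name) // only snapshots without the exclusion tag are printed
	}
}
```

Because only metadata is consulted, the same check works whether or not the bucket is immutable, which matches the point that tagging replaces deletion.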

ashwani2k · 2024-10-10T05:43:35Z

docs/proposals/06-immutable-etcd-backups.md

+
+## Implementation Steps
+
+1. **Enhance the Compaction Job:**


I think we should give the job for hibernated full snapshots a new name, and ensure that it has flags and a flow that can leverage the existing compaction feature and extend it with the additional changes required for immutable backup snapshotting and garbage collection.

Also, in practical terms we cannot have a compaction job for a hibernated cluster, so it would be even more confusing to see a compaction job running for a hibernated cluster.

ashwani2k · 2024-10-10T05:52:05Z

docs/proposals/06-immutable-etcd-backups.md

+  - Configure buckets with appropriate immutability settings before deploying ETCD clusters.
+  - Ensure that the immutability periods align with organizational policies.
+
+- **Compaction Job Configuration:**


What is the retry threshold for this job?
What happens if it fails to run for a period of 24 hours?
What happens if druid is down?
What happens when druid comes back up, especially for failed jobs which have breached the retry threshold?
What happens if we breach the bucket retention period? Would there then be no data to restore when a hibernated cluster wakes up?
Does garbage collection run independently, or only in sequence after the job takes a full snapshot on its run?

docs/proposals/06-immutable-etcd-backups.md

Co-authored-by: Saketh Kalaga <[email protected]>

docs/proposals/06-immutable-etcd-backups.md

Co-authored-by: Saketh Kalaga <[email protected]>

renormalize

Thanks a ton for this PR @seshachalam-yv! It was no small feat to bring the DEP to this stage.

renormalize · 2025-01-28T07:40:39Z

docs/proposals/06-immutable-etcd-backups.md

+- "@renormalize"
+- "@ishan16696"


Suggested change

- "@renormalize"

- "@ishan16696"

Since the DEP has changed significantly since the original draft, and both of us have only acted as reviewers after the first draft, it is not right to have our names in the author section.

ashwani2k

This is such a neat proposal now, compared to where it started.
No nitpicks; it was sufficiently detailed without getting into implementation semantics and was able to convey the intent aptly.

Thanks a lot for taking all the inputs @seshachalam-yv and to the reviewers for the detailed review.

unmarshall · 2025-01-29T04:15:53Z

docs/proposals/06-immutable-etcd-backups.md

+
+## Summary
+
+This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.


"By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created": to be precise, this statement tells the reader that once the immutability feature is enabled, it ensures immutability of backups for eternity. This is not true. Therefore, you must mention that this holds only for a configured retention or immutability duration.

docs/proposals/06-immutable-etcd-backups.md

unmarshall · 2025-01-29T04:31:31Z

docs/proposals/06-immutable-etcd-backups.md

+
+Currently, `etcd-druid` does not directly configure immutable buckets. The specific method of enabling immutability depends on your use case:
+
+- **Large-Scale Consumers (e.g., Gardener):**  


Thinking about it again: irrespective of whether it's a large-scale or small-scale consumer, the choice to automate or do it manually applies equally to both. Therefore this section only needs to mention the following:

Creating and configuring immutable buckets on providers is not handled by etcd-druid and must be done by the consumers. For a large-scale consumer like Gardener, provider extensions are leveraged to automate both the creation and configuration of buckets. For more details, see BackupBucket and refer to the issue.

This simplifies the section further.

unmarshall · 2025-01-29T04:34:30Z

docs/proposals/06-immutable-etcd-backups.md

+1. **Configure or Update the Immutable Bucket**  
+   - Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.  
+   - Refer to the [Getting Started guide](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#configure-bucket-level-immutability) for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
+   - For AWS S3, for example, you enable Object Lock at bucket creation; for Azure Blob Storage, you configure Immutable Blob Storage at the container scope.


If you are providing hints for 2 out of the 3 supported providers, then I would say just complete the list and provide it as bullet points, one for each provider. Alternatively, you can completely remove this point, since you have already provided a link to detailed documentation in the etcd-backup-restore repo.

unmarshall · 2025-01-29T05:29:45Z

docs/proposals/06-immutable-etcd-backups.md

+}
+```
+
+If `immutability` is not specified, `etcd-druid` will assume that the bucket is mutable, and no immutability-related logic applies.


Just add a line saying that we have defined a new type to allow future enhancements to the immutability specification.

unmarshall · 2025-01-29T05:32:30Z

docs/proposals/06-immutable-etcd-backups.md

+1. **`reupload-snapshot`**
+   - Downloads the latest full snapshot from the object store.
+   - Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
+   - Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.


You should also mention that if a bucket is shared across several etcd clusters, this enhancement would increase the storage cost; some numbers would be nice as well.

unmarshall · 2025-01-29T05:35:15Z

docs/proposals/06-immutable-etcd-backups.md

+
+##### Operator Task Enhancements
+
+The `ExtendFullSnapshotImmutabilityTask` will create a cron job that:


What you have missed is that this task should only be active while the cluster is hibernated. Once the etcd cluster comes out of hibernation, this task should no longer exist. So please add a section on the lifecycle of this task.
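The requested lifecycle can be expressed as a small reconcile decision. This sketch assumes a hypothetical boolean hibernation signal and a hypothetical `taskExists` flag; it does not reflect the actual `etcd-druid` operator-task API.

```go
package main

import "fmt"

// Action is the hypothetical outcome of reconciling the immutability-extension task.
type Action string

const (
	CreateTask Action = "create"
	DeleteTask Action = "delete"
	NoOp       Action = "none"
)

// reconcileTask keeps the task only while the cluster is hibernated and its
// backups are immutable; once the cluster wakes up (and regular full
// snapshots resume), the task is removed.
func reconcileTask(hibernated, immutable, taskExists bool) Action {
	want := hibernated && immutable
	switch {
	case want && !taskExists:
		return CreateTask
	case !want && taskExists:
		return DeleteTask
	default:
		return NoOp
	}
}

func main() {
	fmt.Println(reconcileTask(true, true, false))  // cluster just hibernated
	fmt.Println(reconcileTask(false, true, true))  // cluster woke up
	fmt.Println(reconcileTask(true, false, false)) // mutable bucket: nothing to do
}
```

Deriving the action from desired versus observed state keeps the task idempotent across repeated reconciles.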

unmarshall · 2025-01-29T05:36:22Z

docs/proposals/06-immutable-etcd-backups.md

+###### Example Task Config
+
+```go
+type ExtendFullSnapshotImmutabilityTaskConfig struct {


So one task is extending the snapshot's immutability and also doing garbage collection? If that is the case then the name does not reflect that.

unmarshall · 2025-01-29T05:37:54Z

docs/proposals/06-immutable-etcd-backups.md

+  GarbageCollectionConfig *GarbageCollectionConfig `json:"garbageCollectionConfig,omitempty"`
+}
+
+type GarbageCollectionConfig struct {


I did not quite understand this type in the context of ExtendFullSnapshotImmutabilityTaskConfig; why is it so complicated? If GC needs to be done via this task as well, then it should only be done for the copied last full snapshot (taken prior to hibernation). I am confused why delta snapshot retention comes into this, what the exponential GC policy is, and how it is relevant with respect to this task.

Co-authored-by: Madhav Bhargava <[email protected]>

gardener-prow · 2025-01-29T07:09:06Z

@seshachalam-yv: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-etcd-druid-e2e-kind | `b8655fb` | link | true | `/test pull-etcd-druid-e2e-kind` |

shreyas-s-rao

@seshachalam-yv the flow is neat and understandable from a new-reader's perspective too. Thanks a lot for taking the time and making all the changes! I have a few nits. Please address them, thanks.

Please also update https://github.com/gardener/etcd-druid/blob/master/docs/README.md to include link to this proposal.

shreyas-s-rao · 2025-01-30T06:35:26Z

docs/proposals/06-immutable-etcd-backups.md

+title: Immutable etcd Cluster Backups
+dep-number: 06
+creation-date: 2024-09-25
+status: implementable


Is the proposed feature just immutable backups, or does it also include handling hibernated clusters? I'm a bit confused, since the Non-Goals section mentions that the proposal doesn't focus on handling immutable backups for hibernated clusters, but the Handling of Hibernated Clusters section talks about how druid can handle hibernated clusters.

Because if the proposal is just about immutable backups, then the proposal is already in implemented status, correct?

shreyas-s-rao · 2025-01-30T06:40:43Z

docs/proposals/06-immutable-etcd-backups.md

+- **Immutability Period:** The duration for which data must remain immutable before it can be modified or deleted.
+- **Bucket-Level Immutability:** A policy that applies a uniform immutability period to all objects within a bucket.
+- **Object-Level Immutability:** A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
+- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.


Suggested change

- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.

- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed, in order to free up storage space.

shreyas-s-rao · 2025-01-30T06:42:48Z

docs/proposals/06-immutable-etcd-backups.md

+
+`etcd-druid` provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the `spec.backup` section in an Etcd custom resource. Periodic backups are taken via the `etcd-backup-restore` sidecar container that runs in each etcd member pod.
+
+Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.


Suggested change

Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.

Periodic backups of an etcd cluster state ensure the ability to recover from a data loss or a quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.

shreyas-s-rao · 2025-01-30T06:46:07Z

docs/proposals/06-immutable-etcd-backups.md

+
+### Non-Goals
+
+- Implementing hibernation support via `etcd.spec` or annotations on the `Etcd` CR (i.e., specifying an intent for hibernation), as noted in [gardener/etcd-druid#922](https://github.com/gardener/etcd-druid/issues/922).


This point seems out of context here. A better wording for this would be:
Ensuring immutability of snapshots when the etcd cluster is hibernated for a period longer than the configured immutability period.

This is more in line with this proposal's theme, i.e., immutable backups. The issue you mentioned, by contrast, simply talks about providing certain features upon hibernation of the etcd cluster, which is a general feature and not necessarily related to immutable backups.

shreyas-s-rao · 2025-01-30T06:49:50Z

docs/proposals/06-immutable-etcd-backups.md

+1. **Object-Level Immutability:** Allows setting immutability periods independently for each object within a bucket.
+2. **Bucket-Level Immutability:** Applies a uniform immutability policy to all objects in a bucket.
+
+### Comparison of Bucket-Level and Object-Level Immutability


Can you please also add line items for Support for enabling bucket-level immutability in existing buckets and Support for enabling bucket-level immutability in new buckets?

shreyas-s-rao · 2025-01-30T06:51:13Z

docs/proposals/06-immutable-etcd-backups.md

+
+### Recommended Approach
+
+Given the nuances across providers:


Suggested change

Given the nuances across providers:

At the time of writing this proposal, these are the current limitations seen across providers:

shreyas-s-rao · 2025-01-30T08:15:28Z

docs/proposals/06-immutable-etcd-backups.md

+   - Downloads the latest full snapshot from the object store.
+   - Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
+   - Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.


Wouldn't it be more optimal to utilise provider APIs to perform a server-side copy of the object with a new name? That way we save on download+upload time as well as network costs.

If so, can we rename this to renew-snapshot rather than reupload-snapshot? Because the task is to renew or refresh the snapshot's timestamp, rather than download+upload.

If not, then the term reupload is incorrect, since it should technically be spelled re-upload; the command would then be re-upload-snapshot.

shreyas-s-rao · 2025-01-30T08:17:39Z

docs/proposals/06-immutable-etcd-backups.md

+
+These changes are compatible with existing etcd clusters and current backup processes.
+
+- **Backward Compatibility:**


What about reverting from immutable backups to mutable backups? Is that possible? If so, can you please elaborate on the nuances of this? If not, then can you add that as a limitation here?

shreyas-s-rao · 2025-01-30T08:18:35Z

docs/proposals/06-immutable-etcd-backups.md

+- **Key:** `x-etcd-snapshot-exclude`  
+- **Value:** `true`  
+
+Because these tags or annotations do not modify the underlying snapshot data, they are permissible even for immutable objects. Once these annotations are in place, `etcd-backup-restore` will detect them and skip the tagged snapshots during restoration, thus preventing unwanted snapshots from being used. For more details, see the [Ignoring Snapshots during Restoration](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#ignoring-snapshots-during-restoration).


Please add a limitation note for AWS, since we currently don't support this for S3 buckets, at least at the time of writing this proposal.

ishan16696

few nits

ishan16696 · 2025-01-30T06:06:15Z

docs/proposals/06-immutable-etcd-backups.md

+
+## Summary
+
+This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.


Suggested change

This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.

This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups taken by `etcd-backup-restore` can neither be modified nor deleted once created for a configured retention period. This approach strengthens the reliability and fault tolerance of the etcd restoration process.

ishan16696 · 2025-01-30T06:09:52Z

docs/proposals/06-immutable-etcd-backups.md

+- **Immutability Period:** The duration for which data must remain immutable before it can be modified or deleted.
+- **Bucket-Level Immutability:** A policy that applies a uniform immutability period to all objects within a bucket.
+- **Object-Level Immutability:** A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
+- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.


Can you mention this link for more information on GC: https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/garbage_collection.md

ishan16696 · 2025-01-30T06:13:13Z

docs/proposals/06-immutable-etcd-backups.md

+- **Google Cloud Storage (GCS):** [Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
+- **Amazon S3 (S3):** [Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
+- **Azure Blob Storage (ABS):** [Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)


Can you mention OpenStack here?

Note
Currently, OpenStack object storage (Swift) doesn't support immutability for objects: https://blueprints.launchpad.net/swift/+spec/immutability-middleware.

ishan16696 · 2025-01-30T07:21:13Z

docs/proposals/06-immutable-etcd-backups.md

+| Precedence between bucket-level and object-level immutability periods     | Max(bucket, object)     | Object-level | Max(bucket, object)                 |
+
+> [!NOTE]
+> In AWS S3, it is possible to decrease the bucket-level immutability period; however, this action may be blocked by specific bucket policy settings.  


Suggested change

> In AWS S3, it is possible to decrease the bucket-level immutability period; however, this action may be blocked by specific bucket policy settings.

> In AWS S3, it is possible to increase and decrease the bucket-level immutability period; however, this action can be blocked by configuring specific bucket policy settings.

ishan16696 · 2025-01-30T07:24:59Z

docs/proposals/06-immutable-etcd-backups.md

+1. **Configure or Update the Immutable Bucket**  
+   - Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.  
+   - Refer to the [Getting Started guide](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#configure-bucket-level-immutability) for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
+   - For AWS S3, for example, you enable Object Lock at bucket creation; for Azure Blob Storage, you configure Immutable Blob Storage at the container scope.


For AWS S3, for example, you enable Object Lock at bucket creation

Can you explain what you mean by this?
IMO, this line can be removed.

ishan16696 · 2025-01-30T08:45:58Z

docs/proposals/06-immutable-etcd-backups.md

+   - A new operator task type, `ExtendFullSnapshotImmutabilityTask`, will periodically invoke the `reupload-snapshot` and `garbage-collect` commands.
+
+3. **Enhance `etcd-backup-restore`**  
+   - Introduce new CLI commands:


Suggested change

- Introduce new CLI commands:

- Introduce new CLI sub-commands:

ishan16696 · 2025-01-30T08:46:47Z

docs/proposals/06-immutable-etcd-backups.md

+3. **Enhance `etcd-backup-restore`**  
+   - Introduce new CLI commands:
+     - **`reupload-snapshot`** for re-uploading snapshots.  
+     - **`garbage-collect`** for removing older backups whose immutability period has expired.


Why do we want to GC the old snapshots? Why can't we leave them, as we currently do?

ishan16696 · 2025-01-30T09:53:23Z

docs/proposals/06-immutable-etcd-backups.md

+2. **`garbage-collect`**
+   - Scans the object store for older snapshots and deletes them if their immutability period has expired and they are no longer needed, following the standard [garbage collection policy](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/garbage_collection.md).


Same as above.

seshachalam-yv requested a review from a team as a code owner October 1, 2024 11:46

seshachalam-yv changed the title ~~Add DEP-06: Immutable ETCD Backups~~ Add DEP-06: Immutable ETCD Backups Oct 2, 2024

anveshreddy18 self-assigned this Oct 3, 2024

ishan16696 requested changes Oct 9, 2024

gardener-robot added the needs/changes label Oct 9, 2024

gardener-robot-ci-3 added reviewed/ok-to-test and removed reviewed/ok-to-test labels Oct 10, 2024

ashwani2k requested changes Oct 10, 2024

anveshreddy18 reviewed Oct 10, 2024

ishan16696 assigned ishan16696 and seshachalam-yv Oct 10, 2024

renormalize self-assigned this Oct 10, 2024

renormalize added this to the v0.25.0 milestone Nov 13, 2024

gardener-robot added size/l needs/second-opinion and removed size/m labels Nov 21, 2024

gardener-robot-ci-1 added the reviewed/ok-to-test label Nov 21, 2024

shreyas-s-rao assigned ishan16696, renormalize and anveshreddy18 and unassigned seshachalam-yv Jan 21, 2025

seshachalam-yv force-pushed the feature/dep-06-immutable-etcd-backups branch from f832315 to 3df82c8 January 22, 2025 05:56

gardener-robot added size/m and removed size/xl labels Jan 22, 2025

gardener-robot-ci-1 added reviewed/ok-to-test and removed reviewed/ok-to-test labels Jan 22, 2025

Add DEP-06: Immutable ETCD Backups

805f277

Co-authored-by: Saketh Kalaga <[email protected]>

seshachalam-yv force-pushed the feature/dep-06-immutable-etcd-backups branch from 3df82c8 to 805f277 January 22, 2025 07:56

gardener-robot-ci-2 added the reviewed/ok-to-test label Jan 22, 2025

gardener-robot-ci-3 removed the reviewed/ok-to-test label Jan 22, 2025

renormalize requested changes Jan 28, 2025

docs/proposals/06-immutable-etcd-backups.md Outdated

docs/proposals/06-immutable-etcd-backups.md Outdated

Apply suggestions from code review

8cd21aa

Co-authored-by: Saketh Kalaga <[email protected]>

gardener-robot-ci-2 added the reviewed/ok-to-test label Jan 28, 2025

gardener-robot-ci-1 removed the reviewed/ok-to-test label Jan 28, 2025

renormalize approved these changes Jan 28, 2025

ashwani2k approved these changes Jan 28, 2025

unmarshall requested changes Jan 29, 2025

Apply suggestions from code review

b8655fb

Co-authored-by: Madhav Bhargava <[email protected]>

gardener-robot-ci-2 added reviewed/ok-to-test and removed reviewed/ok-to-test labels Jan 29, 2025

shreyas-s-rao requested changes Jan 30, 2025

shreyas-s-rao unassigned shreyas-s-rao, ashwani2k, renormalize and unmarshall Jan 30, 2025

ishan16696 requested changes Jan 30, 2025
