Add DEP-06: Immutable ETCD Backups #884
Thanks @seshachalam-yv @ishan16696 @renormalize for the proposal.
It captures things well, but I've raised some open points, especially on the structure and on some details, since the proposal addresses design considerations.
### Excluding Snapshots Under Specific Circumstances

Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
This can also happen outside of immutable backup scenarios, so how is it handled there? I'm guessing it is currently handled by manually deleting the affected snapshots.
But with this new approach, the same mechanism should apply there as well.
If snapshots are mutable, this is achieved through deletion of snapshots.
The same functionality will be achieved through custom metadata tags. Will enhance the doc for this.
Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:

- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).
It's not clear from the doc who takes care of attaching the custom metadata flag and how it's consumed. Can we describe that here to avoid any unintended interpretation of the flow?
Human operators add these tags; will include this.
## Implementation Steps

1. **Enhance the Compaction Job:**
I think we should give the job for Hibernated Full Snapshots a new name, and ensure that we have flags and a flow which leverage the existing compaction feature and enhance it with the additional changes required for immutable backup snapshotting and garbage collection.
Also, in practical terms we cannot have a compaction job for a hibernated cluster, so it would be even more confusing to see a compaction job running for a hibernated cluster.
- Configure buckets with appropriate immutability settings before deploying ETCD clusters.
- Ensure that the immutability periods align with organizational policies.

- **Compaction Job Configuration:**
What is the retry threshold for this job?
What happens if it fails to run for a period of 24 hours?
What happens if druid is down?
What happens when druid comes back up, especially for failed jobs which have breached the retry threshold?
What happens if we breach the bucket retention period? Is it possible that there is no data to restore on wake-up of hibernated clusters?
Does garbage collection run independently, or only in sequence after the job takes a full snapshot on its run?
f832315 to 3df82c8
Co-authored-by: Saketh Kalaga <[email protected]>
3df82c8 to 805f277
Co-authored-by: Saketh Kalaga <[email protected]>
Thanks a ton for this PR @seshachalam-yv! It was no small feat to bring the DEP to this stage.
- "@renormalize"
- "@ishan16696"
Since the DEP has changed significantly since the original draft, and both of us have only acted as reviewers after the first draft, it is not right to have our names in the author section.
This is such a neat proposal now, compared to where it started.
No nitpicks; it was sufficiently detailed without getting into implementation semantics and conveyed the intent aptly.
Thanks a lot for taking all the inputs @seshachalam-yv and to the reviewers for the detailed review.
## Summary

This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.
"By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created": to be precise, this statement tells the reader that once the immutability feature is enabled, backups stay immutable for eternity. This is not true. Therefore, you must mention that this holds only for a configured retention (immutability) duration.
Currently, `etcd-druid` does not directly configure immutable buckets. The specific method of enabling immutability depends on your use case:

- **Large-Scale Consumers (e.g., Gardener):**
Thinking about it again: irrespective of whether it's a large-scale or small-scale consumer, the choice to automate or do it manually applies equally to both. Therefore, this section only needs to mention the following:
Creating and configuring immutable buckets on providers is not handled by etcd-druid and must be done by the consumers. For a large-scale consumer like Gardener, provider extensions are leveraged to automate both the creation and configuration of buckets. For more details, see BackupBucket and refer to the issue.
This simplifies the section further.
1. **Configure or Update the Immutable Bucket**
   - Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.
   - Refer to the [Getting Started guide](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#configure-bucket-level-immutability) for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
   - For AWS S3, for example, you enable Object Lock at bucket creation; for Azure Blob Storage, you configure Immutable Blob Storage at the container scope.
If you are providing hints for 2 out of the 3 supported providers, then I would say just complete the list and provide it as bullet points, one for each provider. Alternatively, you can remove this point completely, since you have already linked to detailed documentation in the etcd-backup-restore repo.
} | ||
``` | ||

If `immutability` is not specified, `etcd-druid` will assume that the bucket is mutable, and no immutability-related logic applies.
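
A minimal sketch of how the "not specified means mutable" rule could be expressed in Go, assuming an optional pointer field. `StoreConfig`, `ImmutabilityConfig`, `RetentionType`, and `isImmutable` are hypothetical names chosen for illustration; the proposal's actual type definitions are not shown in this excerpt.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ImmutabilityConfig and StoreConfig are hypothetical stand-ins for the
// proposal's types; only the `immutability` field name comes from the text.
type ImmutabilityConfig struct {
	RetentionType string `json:"retentionType"`
}

type StoreConfig struct {
	Immutability *ImmutabilityConfig `json:"immutability,omitempty"`
}

// isImmutable reports whether immutability-related logic should apply.
// A nil Immutability field means the bucket is treated as mutable.
func isImmutable(s StoreConfig) bool {
	return s.Immutability != nil
}

func main() {
	for _, raw := range []string{`{}`, `{"immutability":{"retentionType":"bucket"}}`} {
		var s StoreConfig
		if err := json.Unmarshal([]byte(raw), &s); err != nil {
			panic(err)
		}
		fmt.Printf("%s -> immutable=%t\n", raw, isImmutable(s))
	}
}
```

Using a pointer keeps "field omitted" distinguishable from "field present but empty", which is one way to leave room for later additions to the immutability specification.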
Just add a line saying that we have defined a new type to allow for future enhancements to the immutability specification.
1. **`reupload-snapshot`**
   - Downloads the latest full snapshot from the object store.
   - Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
   - Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.
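
The rename step above can be sketched as follows. This is only an illustration: it assumes the snapshot name ends in `-<unix seconds>` with no further suffix (for example, no compression extension), and `renewSnapshotName` is a hypothetical helper, not an `etcd-backup-restore` function.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// renewSnapshotName returns a copy of name whose trailing Unix timestamp is
// replaced by nowUnix. The "<prefix>-<unix seconds>" layout is an assumption
// made for this sketch; it is not taken from the proposal.
func renewSnapshotName(name string, nowUnix int64) (string, error) {
	idx := strings.LastIndex(name, "-")
	if idx < 0 {
		return "", fmt.Errorf("snapshot name %q has no timestamp suffix", name)
	}
	if _, err := strconv.ParseInt(name[idx+1:], 10, 64); err != nil {
		return "", fmt.Errorf("snapshot name %q: invalid timestamp: %w", name, err)
	}
	return name[:idx+1] + strconv.FormatInt(nowUnix, 10), nil
}

func main() {
	renamed, err := renewSnapshotName("Full-00000000-00000042-1700000000", 1700086400)
	if err != nil {
		panic(err)
	}
	fmt.Println(renamed) // Full-00000000-00000042-1700086400
}
```

Because the new name differs from the old one, the upload creates a fresh object instead of attempting to overwrite the existing immutable one.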
You should also mention that if a bucket is shared across several etcd clusters, then this enhancement would increase the storage cost; some numbers would be nice.
##### Operator Task Enhancements

The `ExtendFullSnapshotImmutabilityTask` will create a cron job that:
What you have missed is that this task should only be active as long as the cluster is hibernated. Once the etcd cluster comes out of hibernation, this task should no longer exist. So please add a section on the lifecycle of this task.
###### Example Task Config

```go
type ExtendFullSnapshotImmutabilityTaskConfig struct {
```
So one task is extending the snapshot's immutability and also doing garbage collection? If that is the case then the name does not reflect that.
	GarbageCollectionConfig *GarbageCollectionConfig `json:"garbageCollectionConfig,omitempty"`
}

type GarbageCollectionConfig struct {
I did not quite understand this type in the context of `ExtendFullSnapshotImmutabilityTaskConfig`.
Why is it that complicated? If GC needs to be done via this task as well, then it should only be done for the copied last full snapshot (taken prior to hibernation). I am confused about why delta snapshot retention comes in here, what the exponential GC policy is, and how it is relevant to this task.
Co-authored-by: Madhav Bhargava <[email protected]>
@seshachalam-yv: The following test failed, say
@seshachalam-yv the flow is neat and understandable from a new-reader's perspective too. Thanks a lot for taking the time and making all the changes! I have a few nits. Please address them, thanks.
Please also update https://github.com/gardener/etcd-druid/blob/master/docs/README.md to include link to this proposal.
title: Immutable etcd Cluster Backups
dep-number: 06
creation-date: 2024-09-25
status: implementable
Is the proposed feature just immutable backups, or does it also include handling hibernated clusters? I'm a bit confused, since the `non-goals` section mentions that the proposal doesn't focus on handling immutable backups for hibernated clusters, but the `Handling of Hibernated Clusters` section talks about how druid can handle hibernated clusters.
Because if the proposal is just about immutable backups, then the proposal is already in `implemented` status, correct?
- **Immutability Period:** The duration for which data must remain immutable before it can be modified or deleted.
- **Bucket-Level Immutability:** A policy that applies a uniform immutability period to all objects within a bucket.
- **Object-Level Immutability:** A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.
- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.
- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed, in order to free up storage space.
`etcd-druid` provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the `spec.backup` section in an Etcd custom resource. Periodic backups are taken via the `etcd-backup-restore` sidecar container that runs in each etcd member pod.

Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.
Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.
Periodic backups of an etcd cluster state ensure the ability to recover from a data loss or a quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.
### Non-Goals

- Implementing hibernation support via `etcd.spec` or annotations on the `Etcd` CR (i.e., specifying an intent for hibernation), as noted in [gardener/etcd-druid#922](https://github.com/gardener/etcd-druid/issues/922).
This point seems out of context here. A better wording for this would be:
"Ensuring immutability of snapshots when the etcd cluster is hibernated for a period longer than the configured immutability period."
This is more in line with this proposal's theme, i.e., immutable backups. The issue you mention here, by contrast, simply talks about providing certain features upon hibernation of the etcd cluster, which is a general feature and not necessarily related to immutable backups.
1. **Object-Level Immutability:** Allows setting immutability periods independently for each object within a bucket.
2. **Bucket-Level Immutability:** Applies a uniform immutability policy to all objects in a bucket.

### Comparison of Bucket-Level and Object-Level Immutability
Can you please also add line items for `Support for enabling bucket-level immutability in existing buckets` and `Support for enabling bucket-level immutability in new buckets`?
### Recommended Approach

Given the nuances across providers:
Given the nuances across providers:
At the time of writing this proposal, these are the current limitations seen across providers:
- Downloads the latest full snapshot from the object store.
- Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
- Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.
Wouldn't it be more optimal to utilise provider APIs to perform a server-side copy of the object with a new name? That way we save on download+upload time as well as network costs.
If so, can we rename this to `renew-snapshot` rather than `reupload-snapshot`? Because the task is to renew or refresh the snapshot's timestamp, rather than download+upload.
If not, then the term `reupload` is incorrect, since it technically needs to be `re-upload`. So it'll be `re-upload-snapshot`.
These changes are compatible with existing etcd clusters and current backup processes.

- **Backward Compatibility:**
What about reverting from immutable backups to mutable backups? Is that possible? If so, can you please elaborate on the nuances of this? If not, then can you add that as a limitation here?
- **Key:** `x-etcd-snapshot-exclude`
- **Value:** `true`

Because these tags or annotations do not modify the underlying snapshot data, they are permissible even for immutable objects. Once these annotations are in place, `etcd-backup-restore` will detect them and skip the tagged snapshots during restoration, thus preventing unwanted snapshots from being used. For more details, see the [Ignoring Snapshots during Restoration](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#ignoring-snapshots-during-restoration).
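
As a rough illustration of the skip behaviour described above, the sketch below filters a snapshot list by the `x-etcd-snapshot-exclude` tag. The `Snapshot` type and the `restorable` function are hypothetical; they do not reflect the actual `etcd-backup-restore` implementation, and how a provider exposes custom metadata is not modelled here.

```go
package main

import "fmt"

// excludeKey and excludeValue are the tag key and value named in the proposal.
const (
	excludeKey   = "x-etcd-snapshot-exclude"
	excludeValue = "true"
)

// Snapshot is a hypothetical view of a stored object with its custom metadata.
type Snapshot struct {
	Name     string
	Metadata map[string]string
}

// restorable returns the snapshots that are not tagged for exclusion,
// preserving their original order.
func restorable(snaps []Snapshot) []Snapshot {
	kept := make([]Snapshot, 0, len(snaps))
	for _, s := range snaps {
		if s.Metadata[excludeKey] == excludeValue {
			continue // tagged by an operator: skip during restoration
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	snaps := []Snapshot{
		{Name: "full-1"},
		{Name: "delta-2", Metadata: map[string]string{excludeKey: excludeValue}},
		{Name: "delta-3"},
	}
	for _, s := range restorable(snaps) {
		fmt.Println(s.Name)
	}
}
```

Only the exact value `true` excludes a snapshot in this sketch; any other value, or a missing tag, leaves the snapshot eligible.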
Please add a `limitation` note for AWS, since we currently don't support this for S3 buckets, at least at the time of writing this proposal.
few nits
## Summary

This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.
This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.
This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups taken by `etcd-backup-restore` can neither be modified nor deleted once created for a configured retention period. This approach strengthens the reliability and fault tolerance of the etcd restoration process.
- **Immutability Period:** The duration for which data must remain immutable before it can be modified or deleted.
- **Bucket-Level Immutability:** A policy that applies a uniform immutability period to all objects within a bucket.
- **Object-Level Immutability:** A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.
Can you mention this link for more information on GC: https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/garbage_collection.md
- **Google Cloud Storage (GCS):** [Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
- **Amazon S3 (S3):** [Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
- **Azure Blob Storage (ABS):** [Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
Can you mention OpenStack here?
Note
Currently, Openstack object storage (swift) doesn't support immutability for objects: https://blueprints.launchpad.net/swift/+spec/immutability-middleware.
| Precedence between bucket-level and object-level immutability periods | Max(bucket, object) | Object-level | Max(bucket, object) |

> [!NOTE]
> In AWS S3, it is possible to decrease the bucket-level immutability period; however, this action may be blocked by specific bucket policy settings.
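
The two precedence behaviours in the table row above can be sketched as follows. The column headers are not visible in this excerpt, so the sketch deliberately does not map either behaviour to a provider. `Precedence`, `MaxOfBoth`, `ObjectLevel`, and `effectivePeriod` are hypothetical names, and the sketch assumes both periods are set.

```go
package main

import (
	"fmt"
	"time"
)

// Precedence names the two behaviours listed in the table row.
type Precedence int

const (
	MaxOfBoth   Precedence = iota // the longer of the two periods applies
	ObjectLevel                   // the object-level period applies
)

// effectivePeriod returns the immutability period that governs an object,
// given the bucket-level and object-level periods.
func effectivePeriod(p Precedence, bucket, object time.Duration) time.Duration {
	if p == ObjectLevel {
		return object
	}
	if bucket > object {
		return bucket
	}
	return object
}

func main() {
	bucket, object := 96*time.Hour, 24*time.Hour
	fmt.Println(effectivePeriod(MaxOfBoth, bucket, object))   // 96h0m0s
	fmt.Println(effectivePeriod(ObjectLevel, bucket, object)) // 24h0m0s
}
```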
> In AWS S3, it is possible to decrease the bucket-level immutability period; however, this action may be blocked by specific bucket policy settings.
> In AWS S3, it is possible to increase and decrease the bucket-level immutability period; however, this action can be blocked by configuring specific bucket policy settings.
1. **Configure or Update the Immutable Bucket**
   - Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.
   - Refer to the [Getting Started guide](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#configure-bucket-level-immutability) for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
   - For AWS S3, for example, you enable Object Lock at bucket creation; for Azure Blob Storage, you configure Immutable Blob Storage at the container scope.
"For AWS S3, for example, you enable Object Lock at bucket creation": can you explain what you mean by this?
IMO, this line can be removed.
   - A new operator task type, `ExtendFullSnapshotImmutabilityTask`, will periodically invoke the `reupload-snapshot` and `garbage-collect` commands.

3. **Enhance `etcd-backup-restore`**
   - Introduce new CLI commands:
- Introduce new CLI commands:
- Introduce new CLI sub-commands:
3. **Enhance `etcd-backup-restore`**
   - Introduce new CLI commands:
     - **`reupload-snapshot`** for re-uploading snapshots.
     - **`garbage-collect`** for removing older backups whose immutability period has expired.
Why do we want to GC the old snapshots? Why can't we leave them in place, as we currently do?
2. **`garbage-collect`**
   - Scans the object store for older snapshots and deletes them if their immutability period has expired and they are no longer needed, following the standard [garbage collection policy](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/garbage_collection.md).
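
The deletion condition above combines two checks: the immutability period has elapsed, and the snapshot is no longer needed. A minimal sketch under stated assumptions follows; `Snapshot`, `deletable`, and `collect` are hypothetical names, and the `Needed` flag stands in for whatever the garbage collection policy decides, which is not modelled here.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// Snapshot is a hypothetical, minimal view of a stored snapshot object.
type Snapshot struct {
	Name        string
	CreatedUnix int64 // creation time in Unix seconds
	Needed      bool  // still required by the retention policy (decided elsewhere)
}

// deletable reports whether snap may be garbage collected at nowUnix:
// its immutability period must have elapsed and the policy must not need it.
func deletable(snap Snapshot, immutability time.Duration, nowUnix int64) bool {
	unlockAt := snap.CreatedUnix + int64(immutability/time.Second)
	return !snap.Needed && nowUnix >= unlockAt
}

// collect returns the names of the snapshots that deletable accepts.
func collect(snaps []Snapshot, immutability time.Duration, nowUnix int64) []string {
	var names []string
	for _, s := range snaps {
		if deletable(s, immutability, nowUnix) {
			names = append(names, s.Name)
		}
	}
	return names
}

func main() {
	now := int64(10 * day / time.Second)
	snaps := []Snapshot{
		{Name: "old-unneeded", CreatedUnix: 0},
		{Name: "old-needed", CreatedUnix: 0, Needed: true},
		{Name: "recent", CreatedUnix: now - int64(day/time.Second)},
	}
	fmt.Println(collect(snaps, 4*day, now)) // [old-unneeded]
}
```

Attempting to delete a snapshot before its period expires would be rejected by the provider anyway, so checking the expiry first simply avoids issuing requests that cannot succeed.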
same as above
How to categorize this PR?
/area backup
/area disaster-recovery
/area security
/area compliance
/area storage
/kind enhancement
What this PR does / why we need it:
This PR adds DEP-06: Immutable ETCD Backups. The proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Release note: