Add DEP-06: Immutable ETCD Backups #884

Merged
258 changes: 258 additions & 0 deletions docs/proposals/06-immutable-etcd-backups.md
@@ -0,0 +1,258 @@
---
title: Immutable etcd Cluster Backups
dep-number: 06
creation-date: 2024-09-25
status: implementable
authors:
- "@seshachalam-yv"
- "@renormalize"
- "@ishan16696"
reviewers:
- "@etcd-druid-maintainers"
---

Comment on lines +8 to +9
Member

Suggested change: remove the following two author entries.
- "@renormalize"
- "@ishan16696"

Since the DEP has changed significantly since the original draft, and both of us have only acted as reviewers after the first draft, it is not right to have our names in the author section.

# DEP-06: Immutable etcd Cluster Backups

## Summary

This proposal introduces immutable backups for etcd clusters managed by `etcd-druid`. By leveraging cloud provider immutability features, backups can neither be modified nor deleted once created. This approach strengthens the reliability and fault tolerance of the etcd restoration process.

## Terminology

- **etcd-druid:** An etcd operator that configures, provisions, reconciles, and monitors etcd clusters.
- **etcd-backup-restore:** A sidecar container that manages backups and restores of etcd cluster state. For more information, see the [etcd-backup-restore](https://github.com/gardener/etcd-backup-restore/blob/master/README.md) documentation.
- **WORM (Write Once, Read Many):** A storage model in which data, once written, cannot be modified or deleted until certain conditions are met.
- **Immutability:** The property of an object that prevents it from being modified or deleted after creation.
- **Immutability Period:** The duration for which data must remain immutable before it can be modified or deleted.
- **Bucket-Level Immutability:** A policy that applies a uniform immutability period to all objects within a bucket.
- **Object-Level Immutability:** A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
- **Garbage Collection:** The process of deleting old snapshot data that is no longer needed to free up storage space.
- **Hibernation:** A state in which an etcd cluster is scaled down to zero replicas, effectively pausing its operations. This is typically done to save costs when the cluster is not needed for an extended period. During hibernation, the cluster's data remains intact, and it can be resumed to its previous state when required.

## Motivation

`etcd-druid` provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the `spec.backup` section in an Etcd custom resource. Periodic backups are taken via the `etcd-backup-restore` sidecar container that runs in each etcd member pod.

Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing `WORM` protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.

### Goals

- Protect backup data against modifications and deletions post-creation through immutability policies offered by storage providers.

### Non-Goals

- Implementing hibernation support via `etcd.spec` or annotations on the `Etcd` CR (i.e., specifying an intent for hibernation), as noted in [gardener/etcd-druid#922](https://github.com/gardener/etcd-druid/issues/922).
- Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).

## Proposal

This proposal aims to improve backup storage security by using immutability features available on major cloud providers.

### Supported Cloud Providers

- **Google Cloud Storage (GCS):** [Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
- **Amazon S3 (S3):** [Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
- **Azure Blob Storage (ABS):** [Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)


#### Types of Immutability

1. **Object-Level Immutability:** Allows setting immutability periods independently for each object within a bucket.
2. **Bucket-Level Immutability:** Applies a uniform immutability policy to all objects in a bucket.

### Comparison of Bucket-Level and Object-Level Immutability

| Feature | GCS | S3 | ABS |
|---------------------------------------------------------------------------|-------------------------|--------------|-------------------------------------|
| Can bucket-level immutability period be increased? | Yes | Yes* | Yes (only 5 times) |
| Can bucket-level immutability period be decreased? | No | Yes* | No |
| Is bucket-level immutability a prerequisite for object-level immutability?| No | Yes | Yes (for existing buckets) |
| Can object-level immutability period be increased? | Yes | Yes | Yes |
| Can object-level immutability period be decreased? | No | No | No |
| Support for enabling object-level immutability in existing buckets | No | Yes | Yes |
| Support for enabling object-level immutability in new buckets | Yes | Yes | Yes |
| Precedence between bucket-level and object-level immutability periods | Max(bucket, object) | Object-level | Max(bucket, object) |

> [!NOTE]
> In AWS S3, it is possible to decrease the bucket-level immutability period; however, this action may be blocked by specific bucket policy settings.
> For GCS, object-level immutability is not yet supported for existing buckets; see [this issue](https://issuetracker.google.com/issues/346679415?pli=1).
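
The precedence row of the table can be expressed as a small function. This is only an illustrative sketch: the `Provider` type, the `EffectiveRetention` name, and the convention that a zero object-level period means "not set" are assumptions made for this example and are not part of `etcd-druid` or `etcd-backup-restore`.

```go
package main

import (
	"fmt"
	"time"
)

// Provider identifies a storage provider from the comparison table.
type Provider string

const (
	GCS Provider = "GCS"
	S3  Provider = "S3"
	ABS Provider = "ABS"
)

// EffectiveRetention returns the period that applies to an object when both
// a bucket-level and an object-level period may be configured.
// Assumption: objectPeriod == 0 means no object-level period is set.
func EffectiveRetention(p Provider, bucketPeriod, objectPeriod time.Duration) time.Duration {
	if objectPeriod == 0 {
		return bucketPeriod
	}
	if p == S3 {
		// S3: the object-level period takes precedence.
		return objectPeriod
	}
	// GCS and ABS: the longer of the two periods applies.
	if bucketPeriod > objectPeriod {
		return bucketPeriod
	}
	return objectPeriod
}

func main() {
	bucket, object := 96*time.Hour, 48*time.Hour
	for _, p := range []Provider{GCS, S3, ABS} {
		fmt.Printf("%s: %s\n", p, EffectiveRetention(p, bucket, object))
	}
}
```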

### Recommended Approach

Given the nuances across providers:

- **S3 and ABS** typically require bucket-level immutability as a prerequisite for object-level immutability.
- **GCS** does not currently support object-level immutability in existing buckets.
- **ABS** requires a [migration process](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-version-level-worm-policies#migration) to enable version-level immutability on existing containers.

Consequently, the authors recommend **bucket-level immutability**. This approach simplifies configuration and ensures a uniform immutability policy for all backups in a bucket.

### Configuring Immutable Backups

Currently, `etcd-druid` does not directly configure immutable buckets. The specific method of enabling immutability depends on your use case:

- **Large-Scale Consumers (e.g., Gardener):**
Typically, these consumers automate the configuration of immutability for both existing and new buckets, as detailed [here](https://github.com/gardener/gardener/issues/10866).

- **Standalone Consumers of `etcd-druid`:**
These users must manually configure immutability settings using their cloud provider's CLI, SDK, or web console.

#### Prerequisites

1. **Configure or Update the Immutable Bucket**
- Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.
- Refer to the [Getting Started guide](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#configure-bucket-level-immutability) for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
- For AWS S3, for example, you enable Object Lock at bucket creation; for Azure Blob Storage, you configure Immutable Blob Storage at the container scope.
Contributor

If you are providing hints for 2 out of the 3 supported providers then i would say just complete the list and provide it as bullet points - one for each provider. Alternatively, you can completely remove this point since you have already provided link to a detailed documentation in etcd-backup-restore repo.

Contributor Author

Removed.


2. **Set `Etcd.spec.backup.store` to Reference This Bucket**
- In the `Etcd` custom resource (CR), specify the `store` configuration (e.g., `spec.backup.store`) to point to the bucket name you configured above.
Contributor

It would be helpful if you provide a sample yaml. So either choose one from config/samples which illustrates what you are proposing or create a new one and put it there and then give a reference to the section/lines in that yaml.

Contributor Author

Yes, Added the sample yaml


3. **Provide Valid Credentials in a Kubernetes Secret**
- The `store` section of the `Etcd` CR must reference a `Secret` containing valid credentials.
- Confirm that this secret has the proper permissions to upload and retrieve snapshots from the immutable bucket.
- See the [Getting Started guide](https://github.com/gardener/etcd-druid/blob/master/docs/deployment/getting-started-locally/getting-started-locally.md#setting-up-cloud-provider-object-store-secret) for an example.

4. **(Gardener Only) Credential Rotation**
- **Note**: In Gardener-based setups, credential rotation is automatically handled by `gardener/gardener`. No manual rotation is required.
Contributor

handled by gardener/gardener? - either it should point to the component within gardener which handles this or it should at least provide a link to the g/g repo. Just having gardener/gardener makes sense for us but not for anyone reading this from outside.
Can you also reach out to the core colleagues and ask for a link to any documentation which describes how gardener does credential rotation specifically for backup buckets?

Since in this section which is titled Gardener Only you also have a line about non-gardener use case then the title of this section should rather just be Credential Rotation

Contributor Author

I've removed the bullet point since Gardener does not handle cloud provider credential rotation for the seed bucket, as stated [here](https://gardener.cloud/docs/getting-started/features/credential-rotation/#user-provided-credentials).

Credential rotation must be managed by the operator, so I've added the following note:

Note

etcd-druid does not handle cloud provider credential rotation. It is the responsibility of the operator to manage credential rotation.

- In non-Gardener environments, you must manage credential updates and secret rotation yourself.

By following these steps, you will have set up an immutable bucket for storing etcd backups, along with the necessary references in your `Etcd` specification and Kubernetes secret.

### Handling of Hibernated Clusters

When an etcd cluster is hibernated for a period longer than the bucket’s immutability period, backups might become mutable again (depending on the cloud provider, see [Comparison of Bucket-Level and Object-Level Immutability](#comparison-of-bucket-level-and-object-level-immutability)). This possibility undermines the intended guarantees of immutability and may expose backups to accidental or malicious alterations.

As mentioned in [gardener/etcd-druid#922](https://github.com/gardener/etcd-druid/issues/922), a clear hibernation signal is needed. Since hibernation is not yet supported in `etcd-druid` and is out of scope for this proposal, we only address the method for maintaining immutability.
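
To make the risk concrete, the sketch below computes how long the last snapshot would be mutable before the cluster resumes. It assumes that the final snapshot is uploaded at the moment hibernation starts and that nothing re-uploads it afterwards; the function name `MutableWindow` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// MutableWindow returns how long the last snapshot is unprotected before the
// cluster wakes up, given the hibernation duration and the bucket-level
// immutability period. A result of 0 means the snapshot stays immutable for
// the whole hibernation.
func MutableWindow(hibernation, immutabilityPeriod time.Duration) time.Duration {
	if hibernation <= immutabilityPeriod {
		return 0
	}
	return hibernation - immutabilityPeriod
}

func main() {
	period := 4 * 24 * time.Hour      // example immutability period: 4 days
	hibernation := 10 * 24 * time.Hour // example hibernation: 10 days
	fmt.Println("unprotected for:", MutableWindow(hibernation, period))
}
```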

#### Proposal

To mitigate the risk of backups becoming mutable during extended hibernation under bucket-level immutability, the authors propose the following approach:

1. **Prerequisite: Take a Final Full Snapshot Before Hibernation**
Contributor

To ensure that there is no data loss, one must cut-off traffic to etcd before taking a full snapshot. This is then followed by hibernation of an etcd cluster. This you have also mentioned in etcd Controller Enhancements section below but in summarising the approach this is missed.

Contributor Author

Thanks, @unmarshall, for catching this! I've now added the note.

- Before scaling the etcd cluster down to zero replicas, the etcd controller triggers an [on-demand full snapshot](https://github.com/gardener/etcd-druid/blob/master/docs/proposals/05-etcd-operator-tasks.md#trigger-on-demand-fulldelta-snapshot). This ensures that the latest state of the etcd cluster is captured and securely stored before hibernation commences.
Contributor

We also discussed that it is possible that the full snapshot taken prior to hibernation gets corrupted. Then you stand a chance to lose data for at least 24 hours. In addition the full snapshot prior to the one that you are about to take will become mutable after some time. So now you get into a situation where the final full snapshot is corrupt and the one prior to that is mutable and therefore susceptible to modifications. Should we mention this gap?

Contributor Author (@seshachalam-yv, Jan 31, 2025)

Yes, I've added it as limitation as part of 5c4ab0e

2. **Periodically Re-Upload the Snapshot**
- Re-uploading the latest full snapshot resets its immutability period in the bucket, thereby keeping the backups protected during hibernation.
- By default, the re-upload schedule is determined by `etcd.spec.backup.fullSnapshotSchedule`. At present, this interval cannot be customized exclusively for re-uploads; future enhancements may introduce a dedicated configuration parameter.
- A new operator task type, `ExtendFullSnapshotImmutabilityTask`, will periodically invoke the `reupload-snapshot` and `garbage-collect` commands.

3. **Enhance `etcd-backup-restore`**
- Introduce new CLI commands:
- **`reupload-snapshot`** for re-uploading snapshots.
- **`garbage-collect`** for removing older backups whose immutability period has expired.

By capturing a final full snapshot before hibernation, periodically re-uploading it to preserve immutability, and removing stale backups, etcd backups remain safeguarded against accidental or malicious alterations until the cluster is resumed.

##### ETCD API Changes

A new field is introduced in `Etcd.spec.backup.store` to indicate the immutability strategy:

```go
// StoreSpec defines parameters for storing etcd backups.
type StoreSpec struct {
	// ...
	// Immutability configuration for the backup store.
	Immutability *ImmutabilitySpec `json:"immutability,omitempty"`
}

// ImmutabilitySpec defines immutability settings.
type ImmutabilitySpec struct {
	// RetentionType indicates the type of immutability approach. For example, "Bucket".
	RetentionType string `json:"retentionType,omitempty"`
}
```

If `immutability` is not specified, `etcd-druid` will assume that the bucket is mutable, and no immutability-related logic applies.
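
The defaulting rule above could be checked with a nil-safe helper along the following lines. This is a sketch only: the types are trimmed copies of the proposed API, and `IsBucketImmutable` is a hypothetical helper name, not an existing `etcd-druid` function.

```go
package main

import "fmt"

// ImmutabilitySpec is a trimmed copy of the proposed type.
type ImmutabilitySpec struct {
	RetentionType string `json:"retentionType,omitempty"`
}

// StoreSpec is a trimmed copy of the proposed type; other fields are omitted.
type StoreSpec struct {
	Immutability *ImmutabilitySpec `json:"immutability,omitempty"`
}

// IsBucketImmutable reports whether bucket-level immutability handling should
// apply. A missing store or a missing immutability section means the bucket
// is treated as mutable.
func IsBucketImmutable(store *StoreSpec) bool {
	if store == nil || store.Immutability == nil {
		return false
	}
	return store.Immutability.RetentionType == "Bucket"
}

func main() {
	fmt.Println(IsBucketImmutable(&StoreSpec{}))
	fmt.Println(IsBucketImmutable(&StoreSpec{
		Immutability: &ImmutabilitySpec{RetentionType: "Bucket"},
	}))
}
```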
Contributor

Just add a line saying that we have defined a new type to allow us future enhancements to the immutability specification.

Contributor Author

Done


##### `etcd-backup-restore` Enhancements

The authors propose adding two new commands to the `etcd-backup-restore` CLI (`etcdbrctl`) to maintain immutability during hibernation and to clean up older snapshots:

1. **`reupload-snapshot`**
- Downloads the latest full snapshot from the object store.
- Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
- Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.
Contributor

you should also mention that if a bucket is shared across several etcd clusters then this enhancement would increase the storage cost and perhaps some numbers would be nice.


2. **`garbage-collect`**
- Scans the object store for older snapshots and deletes them if their immutability period has expired and they are no longer needed, following the standard [garbage collection policy](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/garbage_collection.md).
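
The rename step of `reupload-snapshot` might look roughly like the sketch below. It assumes snapshot object names of the form `Full-<startRevision>-<lastRevision>-<unixTimestamp>` with an optional suffix such as `.gz`; that naming scheme and the `RenameWithTimestamp` helper are assumptions for illustration and should be checked against the actual `etcd-backup-restore` implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RenameWithTimestamp replaces the trailing Unix timestamp in a snapshot name
// with unixSeconds and keeps any suffix (for example ".gz") untouched.
// Assumed name layout: <kind>-<startRevision>-<lastRevision>-<timestamp>[suffix].
func RenameWithTimestamp(name string, unixSeconds int64) (string, error) {
	parts := strings.SplitN(name, "-", 4)
	if len(parts) != 4 {
		return "", fmt.Errorf("unexpected snapshot name %q", name)
	}
	last := parts[3]
	// Count the leading digits that form the old timestamp.
	digits := 0
	for digits < len(last) && last[digits] >= '0' && last[digits] <= '9' {
		digits++
	}
	if digits == 0 {
		return "", fmt.Errorf("snapshot name %q has no timestamp", name)
	}
	parts[3] = strconv.FormatInt(unixSeconds, 10) + last[digits:]
	return strings.Join(parts, "-"), nil
}

func main() {
	renamed, err := RenameWithTimestamp("Full-00000000-00012345-1700000000.gz", 1700086400)
	if err != nil {
		fmt.Println("rename failed:", err)
		return
	}
	fmt.Println(renamed)
}
```

Because the new name differs from the original, the upload creates a new object with its own immutability period instead of attempting to overwrite the existing immutable one.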

##### etcd Controller Enhancements

When a hibernation flow is initiated (by external tooling or higher-level operators), the [etcd controller](https://github.com/gardener/etcd-druid/blob/master/docs/development/controllers.md#etcd-controller) can:

1. Remove etcd’s client and peer ports (2379 and 2380, respectively) from the etcd Service to block application traffic.
2. Trigger an [on-demand full snapshot](https://github.com/gardener/etcd-druid/blob/master/docs/proposals/05-etcd-operator-tasks.md#trigger-on-demand-fulldelta-snapshot) via an `EtcdOperatorTask`.
3. Scale down the `StatefulSet` replicas to zero, provided the previous snapshot step is successful.
4. If `etcd.spec.backup.store.immutability.retentionType` is `"Bucket"`, create the `ExtendFullSnapshotImmutabilityTask`, scheduled according to `etcd.spec.backup.fullSnapshotSchedule`.
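
The ordering of these four steps, and the rule that scale-down only happens after a successful snapshot, can be sketched as follows. The `HibernationSteps` struct, the `Hibernate` function, and `ErrSnapshotFailed` are hypothetical names introduced for this example; the real controller would call the Kubernetes API rather than plain functions.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSnapshotFailed is a hypothetical sentinel error used by this sketch.
var ErrSnapshotFailed = errors.New("on-demand full snapshot failed")

// HibernationSteps models each controller action as a function so that the
// ordering logic can be exercised without a cluster.
type HibernationSteps struct {
	BlockTraffic     func() error
	TakeFullSnapshot func() error
	ScaleDown        func() error
	CreateExtendTask func() error
}

// Hibernate runs the steps in order and stops at the first failure, so the
// StatefulSet is never scaled down without a successful final snapshot.
func Hibernate(s HibernationSteps, retentionType string) error {
	if err := s.BlockTraffic(); err != nil {
		return fmt.Errorf("blocking traffic: %w", err)
	}
	if err := s.TakeFullSnapshot(); err != nil {
		return fmt.Errorf("taking full snapshot: %w", err)
	}
	if err := s.ScaleDown(); err != nil {
		return fmt.Errorf("scaling down: %w", err)
	}
	if retentionType != "Bucket" {
		// Mutable bucket: no immutability extension is needed.
		return nil
	}
	if err := s.CreateExtendTask(); err != nil {
		return fmt.Errorf("creating immutability task: %w", err)
	}
	return nil
}

func main() {
	step := func(name string) func() error {
		return func() error { fmt.Println("step:", name); return nil }
	}
	steps := HibernationSteps{
		BlockTraffic:     step("block traffic"),
		TakeFullSnapshot: step("full snapshot"),
		ScaleDown:        step("scale down"),
		CreateExtendTask: step("create task"),
	}
	if err := Hibernate(steps, "Bucket"); err != nil {
		fmt.Println("hibernation aborted:", err)
	}
}
```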

##### Operator Task Enhancements

The `ExtendFullSnapshotImmutabilityTask` will create a cron job that:

- Runs `etcdbrctl reupload-snapshot` to extend the immutability of the most recent snapshot.
- Runs `etcdbrctl garbage-collect --garbage-collection-policy <policy>` to remove old snapshots.

Periodically re-uploading the latest snapshot during hibernation continually extends the immutability period, so the backups remain **protected throughout the hibernation period**.
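
Each re-upload creates an additional copy that cannot be garbage-collected until its own immutability period ends, which affects storage cost. As a rough estimate, and assuming a fixed re-upload interval and that every copy is deleted as soon as its period expires, the number of copies held at the same time is the immutability period divided by the interval, rounded up. The helper name `ImmutableCopies` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// ImmutableCopies estimates how many re-uploaded copies of the final snapshot
// are still immutable at any point in time, for a given immutability period
// and re-upload interval. It returns 0 for non-positive inputs.
func ImmutableCopies(immutabilityPeriod, reuploadInterval time.Duration) int {
	if immutabilityPeriod <= 0 || reuploadInterval <= 0 {
		return 0
	}
	// Integer ceiling of immutabilityPeriod / reuploadInterval.
	return int((immutabilityPeriod + reuploadInterval - 1) / reuploadInterval)
}

func main() {
	period := 96 * time.Hour   // example immutability period
	interval := 24 * time.Hour // example daily re-upload
	fmt.Println("copies held at once:", ImmutableCopies(period, interval))
}
```

If the interval is longer than the immutability period, the estimate is 1 and the snapshot is unprotected for part of each cycle, so the re-upload interval should be shorter than the immutability period.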

###### Example Task Config

```go
type ExtendFullSnapshotImmutabilityTaskConfig struct {
	// Schedule defines a cron schedule (e.g., "0 */6 * * *").
	Schedule *string `json:"schedule,omitempty"`

	// GarbageCollectionConfig specifies the configuration for snapshot GC.
	GarbageCollectionConfig *GarbageCollectionConfig `json:"garbageCollectionConfig,omitempty"`
}

type GarbageCollectionConfig struct {
	// GarbageCollectionPolicy (e.g., "LimitBased" or "Exponential").
	GarbageCollectionPolicy *string `json:"garbageCollectionPolicy,omitempty"`
	// MaxBackupsLimitBasedGC sets the maximum number of full snapshots to keep.
	MaxBackupsLimitBasedGC *int32 `json:"maxBackupsLimitBasedGC,omitempty"`
	// DeltaSnapshotRetentionPeriod indicates how long to keep delta snapshots (e.g., "72h").
	DeltaSnapshotRetentionPeriod *metav1.Duration `json:"deltaSnapshotRetentionPeriod,omitempty"`
}
```

Contributor

i did not quite understand this type in context of ExtendFullSnapshotImmutabilityTaskConfig - why is it that complicated? If GC needs to be done via this task as well then it should only be done for copied last full snapshot (taken prior to hibernation). I am confused why is delta snapshot retention coming here and what is exponential GC policy and how is relevant w.r.t this task.

Contributor Author

Now, I've removed the garbage collection config and ExtendFullSnapshotImmutabilityTask will automatically garbage collect the snapshots it has created, ensuring storage efficiency and preventing stale backups.

**Sample YAML**:

```yaml
spec:
  config:
    schedule: "0 */6 * * *"
    garbageCollectionConfig:
      garbageCollectionPolicy: "LimitBased"
      maxBackupsLimitBasedGC: 5
      deltaSnapshotRetentionPeriod: "72h"
```

## Compatibility

These changes are compatible with existing etcd clusters and current backup processes.

- **Backward Compatibility:**
- Clusters without immutable buckets continue to function without any changes.
- **Forward Compatibility:**
- Clusters can opt in to immutable backups by configuring the bucket accordingly (as described in [Configuring Immutable Backups](#configuring-immutable-backups)) and setting `etcd.spec.backup.store.immutability.retentionType` to `"Bucket"`.
- The enhanced hibernation logic in the etcd controller is additive, meaning it does not interfere with existing workflows.

### Impact for Operators

In scenarios where you want to exclude certain snapshots from an etcd restore, you previously could simply delete them from object storage. However, when bucket-level immutability is enabled, deleting existing immutable snapshots is no longer possible. To address this need, most cloud providers allow adding **custom annotations or tags** to objects—even immutable ones—so they can be logically excluded without physically removing them.

`etcd-backup-restore` supports ignoring snapshots based on annotations or tags, rather than deleting them. Operators can add the following key-value pair to any snapshot object to exclude it from future restores:

- **Key:** `x-etcd-snapshot-exclude`
- **Value:** `true`

Because these tags or annotations do not modify the underlying snapshot data, they are permitted even on immutable objects. Once they are in place, `etcd-backup-restore` detects them and skips the tagged snapshots during restoration, preventing unwanted snapshots from being used. For more details, see [Ignoring Snapshots during Restoration](https://github.com/gardener/etcd-backup-restore/blob/master/docs/usage/enabling_immutable_snapshots.md#ignoring-snapshots-during-restoration).
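
The exclusion rule can be pictured as a simple filter over the listed snapshots. The sketch below is not the `etcd-backup-restore` implementation: the `Snapshot` type and `RestorableSnapshots` function are hypothetical, and only the key `x-etcd-snapshot-exclude` and value `true` come from the text above.

```go
package main

import "fmt"

// ExcludeTagKey is the tag or annotation key described above.
const ExcludeTagKey = "x-etcd-snapshot-exclude"

// Snapshot is a minimal, hypothetical view of a snapshot object and its tags.
type Snapshot struct {
	Name string
	Tags map[string]string
}

// RestorableSnapshots returns the snapshots that are not marked for
// exclusion, preserving their original order.
func RestorableSnapshots(all []Snapshot) []Snapshot {
	kept := make([]Snapshot, 0, len(all))
	for _, s := range all {
		if s.Tags[ExcludeTagKey] == "true" {
			continue // tagged by an operator: skip during restoration
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	snapshots := []Snapshot{
		{Name: "full-1"},
		{Name: "full-2", Tags: map[string]string{ExcludeTagKey: "true"}},
		{Name: "full-3"},
	}
	for _, s := range RestorableSnapshots(snapshots) {
		fmt.Println(s.Name)
	}
}
```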

## References

- [GCS Bucket Lock](https://cloud.google.com/storage/docs/bucket-lock)
- [AWS S3 Object Lock](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
- [Azure Immutable Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-policy-configure-container-scope?tabs=azure-portal)
- [etcd-backup-restore Documentation](https://github.com/gardener/etcd-backup-restore/blob/master/README.md)
- [Gardener Issue: 10866](https://github.com/gardener/gardener/issues/10866)

---