Update WAL failover docs with additional feedback (#19189)
* Update WAL failover docs with additional feedback

Fixes DOC-11733
rmloveland authored Jan 14, 2025
1 parent 8aa0ac9 commit 23cc953
Showing 6 changed files with 16 additions and 0 deletions.
8 changes: 8 additions & 0 deletions src/current/v24.3/wal-failover.md
@@ -27,6 +27,10 @@ When a disk stalls on a node, it could be due to complete hardware failure or it

WAL failover uses a secondary disk to fail over WAL writes to when transient disk stalls occur. This limits the write impact to a few hundreds of milliseconds (the [failover threshold, which is configurable](#unhealthy-op-threshold)). Note that WAL failover **only preserves availability of writes**. If reads to the underlying storage are also stalled, operations that read and do not find data in the block cache or page cache will stall.

+The following diagram shows how WAL failover works at a high level. For more information about the WAL, memtables, and SSTables, refer to the [Architecture » Storage Layer documentation]({% link {{ page.version.version }}/architecture/storage-layer.md %}).
+
+<img src="{{ 'images/v24.3/wal-failover-overview.png' | relative_url }}" alt="WAL failover overview diagram" style="border:1px solid #eee;max-width:100%" />
+
## Create and configure a cluster to be ready for WAL failover

The steps to provision a cluster that has a single data store versus a multi-store cluster are slightly different. In this section, we will provide high-level instructions for setting up each of these configurations. We will use [GCE](https://cloud.google.com/compute/docs) as the environment. You will need to translate these instructions into the steps used by the deployment tools in your environment.
@@ -371,6 +375,10 @@ If a disk stalls for longer than the duration of [`COCKROACH_ENGINE_MAX_SYNC_DUR

In a [multi-store](#multi-store-config) cluster, if a disk for a store has a transient stall, WAL will failover to the second store's disk. When the stall on the first disk clears, the WAL will failback to the first disk. WAL failover will daisy-chain from store _A_ to store _B_ to store _C_.

+The following diagram shows the behavior of WAL writes during a disk stall with and without WAL failover enabled.
+
+<img src="{{ 'images/v24.3/wal-failover-behavior.png' | relative_url }}" alt="how long WAL writes take during a disk stall with and without WAL failover enabled" style="border:1px solid #eee;max-width:100%" />
+
## FAQs

### 1. What are the benefits of WAL failover?
8 changes: 8 additions & 0 deletions src/current/v25.1/wal-failover.md
@@ -27,6 +27,10 @@ When a disk stalls on a node, it could be due to complete hardware failure or it

WAL failover uses a secondary disk to fail over WAL writes to when transient disk stalls occur. This limits the write impact to a few hundreds of milliseconds (the [failover threshold, which is configurable](#unhealthy-op-threshold)). Note that WAL failover **only preserves availability of writes**. If reads to the underlying storage are also stalled, operations that read and do not find data in the block cache or page cache will stall.

+The following diagram shows how WAL failover works at a high level. For more information about the WAL, memtables, and SSTables, refer to the [Architecture &raquo; Storage Layer documentation]({% link {{ page.version.version }}/architecture/storage-layer.md %}).
+
+<img src="{{ 'images/v25.1/wal-failover-overview.png' | relative_url }}" alt="WAL failover overview diagram" style="border:1px solid #eee;max-width:100%" />
+
## Create and configure a cluster to be ready for WAL failover

The steps to provision a cluster that has a single data store versus a multi-store cluster are slightly different. In this section, we will provide high-level instructions for setting up each of these configurations. We will use [GCE](https://cloud.google.com/compute/docs) as the environment. You will need to translate these instructions into the steps used by the deployment tools in your environment.
@@ -371,6 +375,10 @@ If a disk stalls for longer than the duration of [`COCKROACH_ENGINE_MAX_SYNC_DUR

In a [multi-store](#multi-store-config) cluster, if a disk for a store has a transient stall, WAL will failover to the second store's disk. When the stall on the first disk clears, the WAL will failback to the first disk. WAL failover will daisy-chain from store _A_ to store _B_ to store _C_.

+The following diagram shows the behavior of WAL writes during a disk stall with and without WAL failover enabled.
+
+<img src="{{ 'images/v25.1/wal-failover-behavior.png' | relative_url }}" alt="how long WAL writes take during a disk stall with and without WAL failover enabled" style="border:1px solid #eee;max-width:100%" />
+
## FAQs

### 1. What are the benefits of WAL failover?
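
Reviewer's illustration (not part of the commit): the behavior the new diagrams describe, WAL write latency during a disk stall with and without failover, can be sketched as a toy model. This is a simplified sketch under stated assumptions, not CockroachDB's or Pebble's implementation. The `WalWriter` class, the `simulate` helper, the 100 ms threshold value, and the 1 ms secondary-disk latency are all hypothetical and chosen only for illustration. The model also assumes the writer can always observe the primary disk's current latency.

```python
from dataclasses import dataclass


@dataclass
class WalWriter:
    """Toy model that routes each WAL write to a primary or secondary disk."""

    # Hypothetical stand-in for the configurable failover threshold.
    threshold_ms: float = 100.0
    on_secondary: bool = False

    def write(self, primary_latency_ms: float, secondary_latency_ms: float = 1.0) -> float:
        """Return the latency (ms) that one WAL write observes."""
        if primary_latency_ms > self.threshold_ms:
            # Primary is stalled. The first affected write waits out the
            # threshold before failing over; later writes go straight to
            # the secondary disk.
            waited = 0.0 if self.on_secondary else self.threshold_ms
            self.on_secondary = True
            return waited + secondary_latency_ms
        # Primary is healthy (again): fail back and write to it directly.
        self.on_secondary = False
        return primary_latency_ms


def simulate(primary_latencies_ms, failover_enabled, threshold_ms=100.0):
    """Return per-write latencies for a sequence of primary-disk latencies."""
    writer = WalWriter(threshold_ms=threshold_ms)
    return [
        writer.write(latency) if failover_enabled else latency
        for latency in primary_latencies_ms
    ]


# A transient stall: two consecutive writes hit a 5-second stall on the primary.
stall = [1, 5000, 5000, 1]
with_failover = simulate(stall, failover_enabled=True)
without_failover = simulate(stall, failover_enabled=False)
```

In this model, the worst write with failover enabled is bounded by the threshold plus one secondary-disk write, while without failover every write during the stall takes as long as the stall itself. As the docs text notes, this only models write availability; stalled reads are not helped by WAL failover.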