Update WAL failover docs with additional feedback (#19189)

* Update WAL failover docs with additional feedback

Fixes DOC-11733
cockroachdb · Jan 14, 2025 · 23cc953
1 parent 8aa0ac9
Showing 6 changed files with 16 additions and 0 deletions.
diff --git a/src/current/images/v24.3/wal-failover-behavior.png b/src/current/images/v24.3/wal-failover-behavior.png
diff --git a/src/current/images/v24.3/wal-failover-overview.png b/src/current/images/v24.3/wal-failover-overview.png
diff --git a/src/current/images/v25.1/wal-failover-behavior.png b/src/current/images/v25.1/wal-failover-behavior.png
diff --git a/src/current/images/v25.1/wal-failover-overview.png b/src/current/images/v25.1/wal-failover-overview.png
diff --git a/src/current/v24.3/wal-failover.md b/src/current/v24.3/wal-failover.md
@@ -27,6 +27,10 @@ When a disk stalls on a node, it could be due to complete hardware failure or it
 
 WAL failover uses a secondary disk to fail over WAL writes to when transient disk stalls occur. This limits the write impact to a few hundreds of milliseconds (the [failover threshold, which is configurable](#unhealthy-op-threshold)). Note that WAL failover **only preserves availability of writes**. If reads to the underlying storage are also stalled, operations that read and do not find data in the block cache or page cache will stall.
 
+The following diagram shows how WAL failover works at a high level. For more information about the WAL, memtables, and SSTables, refer to the [Architecture &raquo; Storage Layer documentation]({% link {{ page.version.version }}/architecture/storage-layer.md %}).
+
+<img src="{{ 'images/v24.3/wal-failover-overview.png' | relative_url }}" alt="WAL failover overview diagram"  style="border:1px solid #eee;max-width:100%" />
+
 ## Create and configure a cluster to be ready for WAL failover
 
 The steps to provision a cluster that has a single data store versus a multi-store cluster are slightly different. In this section, we will provide high-level instructions for setting up each of these configurations. We will use [GCE](https://cloud.google.com/compute/docs) as the environment. You will need to translate these instructions into the steps used by the deployment tools in your environment.
@@ -371,6 +375,10 @@ If a disk stalls for longer than the duration of [`COCKROACH_ENGINE_MAX_SYNC_DUR
 
 In a [multi-store](#multi-store-config) cluster, if a disk for a store has a transient stall, WAL will failover to the second store's disk. When the stall on the first disk clears, the WAL will failback to the first disk. WAL failover will daisy-chain from store _A_ to store _B_ to store _C_.
 
+The following diagram shows the behavior of WAL writes during a disk stall with and without WAL failover enabled.
+
+<img src="{{ 'images/v24.3/wal-failover-behavior.png' | relative_url }}" alt="how long WAL writes take during a disk stall with and without WAL failover enabled"  style="border:1px solid #eee;max-width:100%" />
+
 ## FAQs
 
 ### 1. What are the benefits of WAL failover?

diff --git a/src/current/v25.1/wal-failover.md b/src/current/v25.1/wal-failover.md
@@ -27,6 +27,10 @@ When a disk stalls on a node, it could be due to complete hardware failure or it
 
 WAL failover uses a secondary disk to fail over WAL writes to when transient disk stalls occur. This limits the write impact to a few hundreds of milliseconds (the [failover threshold, which is configurable](#unhealthy-op-threshold)). Note that WAL failover **only preserves availability of writes**. If reads to the underlying storage are also stalled, operations that read and do not find data in the block cache or page cache will stall.
 
+The following diagram shows how WAL failover works at a high level. For more information about the WAL, memtables, and SSTables, refer to the [Architecture &raquo; Storage Layer documentation]({% link {{ page.version.version }}/architecture/storage-layer.md %}).
+
+<img src="{{ 'images/v25.1/wal-failover-overview.png' | relative_url }}" alt="WAL failover overview diagram"  style="border:1px solid #eee;max-width:100%" />
+
 ## Create and configure a cluster to be ready for WAL failover
 
 The steps to provision a cluster that has a single data store versus a multi-store cluster are slightly different. In this section, we will provide high-level instructions for setting up each of these configurations. We will use [GCE](https://cloud.google.com/compute/docs) as the environment. You will need to translate these instructions into the steps used by the deployment tools in your environment.
@@ -371,6 +375,10 @@ If a disk stalls for longer than the duration of [`COCKROACH_ENGINE_MAX_SYNC_DUR
 
 In a [multi-store](#multi-store-config) cluster, if a disk for a store has a transient stall, WAL will failover to the second store's disk. When the stall on the first disk clears, the WAL will failback to the first disk. WAL failover will daisy-chain from store _A_ to store _B_ to store _C_.
 
+The following diagram shows the behavior of WAL writes during a disk stall with and without WAL failover enabled.
+
+<img src="{{ 'images/v25.1/wal-failover-behavior.png' | relative_url }}" alt="how long WAL writes take during a disk stall with and without WAL failover enabled"  style="border:1px solid #eee;max-width:100%" />
+
 ## FAQs
 
 ### 1. What are the benefits of WAL failover?