
LDR overview and monitoring pages #19085

Merged
merged 11 commits into from
Nov 14, 2024
2 changes: 1 addition & 1 deletion src/current/_includes/v24.3/ldr/multiple-tables.md
@@ -1 +1 @@
There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level of observability will be at the job level. One table in one LDR job will allow for table-level observability.
kathancox marked this conversation as resolved.

Contributor: lol i was just checking if this was an include
@@ -0,0 +1,9 @@
Contributor: Is this missing the "description" field in LDR responses?

Contributor Author: yes it is, I am adding. Thanks for catching.

Field | Response
---------+----------
`job_id` | The job's ID. Use with [`CANCEL JOB`]({% link {{ page.version.version }}/cancel-job.md %}), [`PAUSE JOB`]({% link {{ page.version.version }}/pause-job.md %}), [`RESUME JOB`]({% link {{ page.version.version }}/resume-job.md %}), [`SHOW JOB`]({% link {{ page.version.version }}/show-jobs.md %}).
`status` | Status of the job: `running`, `paused`, or `canceled`. {% comment %}check these{% endcomment %}
`targets` | The fully qualified name of the table(s) that are part of the LDR job.
`replicated_time` | The latest [timestamp]({% link {{ page.version.version }}/timestamp.md %}) at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. `replicated_time` is updated periodically (every 30s). {% comment %}To confirm this line is accurate{% endcomment %}
`replication_start_time` | The start time of the LDR job.
`conflict_resolution_type` | The type of [conflict resolution]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#conflict-resolution): `LWW` last write wins.
`description` | Description of the job including the replicating table(s) and the source cluster connection.
@@ -5,6 +5,12 @@
{
"title": "Logical Data Replication",
"items": [
{
"title": "Overview",
"urls": [
"/${VERSION}/logical-data-replication-overview.html"
]
},
{
"title": "Set Up Logical Data Replication",
"urls": [
@@ -16,6 +22,12 @@
"urls": [
"/${VERSION}/manage-logical-data-replication.html"
]
},
{
"title": "Monitor Logical Data Replication",
"urls": [
"/${VERSION}/logical-data-replication-monitoring.html"
]
}
]
}
644 changes: 644 additions & 0 deletions src/current/images/v24.3/east-west-region.svg
544 changes: 544 additions & 0 deletions src/current/images/v24.3/unidirectional.svg
147 changes: 147 additions & 0 deletions src/current/v24.3/logical-data-replication-monitoring.md
@@ -0,0 +1,147 @@
---
title: Logical Data Replication Monitoring
summary: Monitor and observe LDR jobs between a source and destination table.
toc: true
docs_area: manage
---

{{site.data.alerts.callout_info}}
{% include feature-phases/preview.md %}
{{site.data.alerts.end}}

You can monitor [**logical data replication (LDR)**]({% link {{ page.version.version }}/logical-data-replication-overview.md %}) using:

- [`SHOW LOGICAL REPLICATION JOBS`](#sql-shell) in the SQL shell to view a list of LDR jobs on the cluster.
- The **Logical Data Replication** dashboard on the [DB Console](#db-console) to view metrics at the cluster level. {% comment %}To add link later to dashboard page{% endcomment %}
- [Prometheus and Alertmanager](#prometheus) to track and alert on LDR metrics.
- Metrics export with [Datadog](#datadog).
- [Metrics labels](#metrics-labels) to view metrics at the job level.

{{site.data.alerts.callout_info}}
{% include {{ page.version.version }}/ldr/multiple-tables.md %}
{{site.data.alerts.end}}

{% comment %}To add to an include{% endcomment %}
When you start an LDR stream, one job is created on each cluster:

- The _history retention job_ on the source cluster, which runs while the LDR job is active to protect changes in the table from [garbage collection]({% link {{ page.version.version }}/architecture/storage-layer.md %}#garbage-collection) until they have been applied to the destination cluster. The history retention job is viewable in the [DB Console](#db-console) or with [`SHOW JOBS`]({% link {{ page.version.version }}/show-jobs.md %}). Any manual changes to the history retention job could disrupt the LDR job.
- The `logical replication` job on the destination cluster. You can view the status of this job in the SQL shell with `SHOW LOGICAL REPLICATION JOBS` and the DB Console [**Jobs** page](#jobs-page).

## SQL Shell

In the destination cluster's SQL shell, you can query `SHOW LOGICAL REPLICATION JOBS` to view the LDR jobs running on the cluster:

{% include_cached copy-clipboard.html %}
~~~ sql
SHOW LOGICAL REPLICATION JOBS;
~~~
~~~
job_id | status | targets | replicated_time
----------------------+---------+---------------------------+------------------
1012877040439033857 | running | {database.public.table} | NULL
(1 row)
~~~

For additional detail on each LDR job, use the `WITH details` option:

{% include_cached copy-clipboard.html %}
~~~ sql
SHOW LOGICAL REPLICATION JOBS WITH details;
~~~
~~~
job_id | status | targets | replicated_time | replication_start_time | conflict_resolution_type | description
----------------------+----------+--------------------------------+-------------------------------+-------------------------------+--------------------------+-----------------------------------------------------------------------------------------
1010959260799270913 | running | {movr.public.promo_codes} | 2024-10-24 17:50:05+00 | 2024-10-10 20:04:42.196982+00 | LWW | LOGICAL REPLICATION STREAM into movr.public.promo_codes from external://cluster_a
1014047902397333505 | canceled | {defaultdb.public.office_dogs} | 2024-10-24 17:30:25+00 | 2024-10-21 17:54:20.797643+00 | LWW | LOGICAL REPLICATION STREAM into defaultdb.public.office_dogs from external://cluster_a
~~~

### Responses

{% include {{ page.version.version }}/ldr/show-logical-replication-responses.md %}

Contributor: ugh, i may have to backport some changes to this table. this is fine to merge as is.

Contributor Author: OK


## Recommended LDR metrics to track

- Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster, to when it is applied on the destination cluster. An LDR _commit_ is when the job either applies a row successfully to the destination cluster or adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq).
- `logical_replication.commit_latency-p50`
- `logical_replication.commit_latency-p99`
- Replication lag: How far behind the source cluster is from the destination cluster at a specific point in time. The replication lag is equivalent to [RPO]({% link {{ page.version.version }}/disaster-recovery-overview.md %}) during a disaster. Calculate the replication lag with this metric, for example: `time.now() - replicated_time_seconds`.
- `logical_replication.replicated_time_seconds`
msbutler marked this conversation as resolved.
- Row updates applied: These metrics indicate whether the destination cluster is actively receiving and applying data from the source cluster.
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`

## DB Console

In the DB Console, you can use:

- The [**Metrics** dashboard]({% link {{ page.version.version }}/ui-overview-dashboard.md %}) for LDR to view metrics for the job on the destination cluster.
- The [**Jobs** page]({% link {{ page.version.version }}/ui-jobs-page.md %}) to view the history retention job on the source cluster and the LDR job on the destination cluster.

The metrics for LDR in the DB Console are at the **cluster** level. This means that if multiple LDR jobs are running on a cluster, the DB Console will show the average metrics across jobs.

### Metrics dashboard

You can use the [**Logical Data Replication** dashboard]({% link {{ page.version.version }}/ui-overview-dashboard.md %}) of the destination cluster to monitor the following metric graphs at the **cluster** level:

- Replication latency
- Replication lag
- Row updates applied
- Logical bytes reviewed
- Batch application processing time: 50th percentile
- Batch application processing time: 99th percentile
- DLQ causes
- Retry queue size

{% comment %}Dashboard page in the DB Console docs to be added with more information per other dashboards. Link to there from this section.{% endcomment %}

To track replicated time, ingested events, and events added to the DLQ at the **job** level, refer to [Metrics labels](#metrics-labels).

### Jobs page

On the [**Jobs** page]({% link {{ page.version.version }}/ui-jobs-page.md %}), select:

- The **Replication Producer** in the source cluster's DB Console to view the _history retention job_.
- The **Logical Replication Ingestion** job in the destination cluster's DB Console. When you start LDR, the **Logical Replication Ingestion** job will show a bar that tracks the initial scan progress of the source table's existing data.

## Monitoring and alerting

### Prometheus

You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer to the [Monitor CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}) tutorial for steps to set up Prometheus.

#### Metrics labels

To view metrics at the job level, you can use the `label` option when you start LDR to add a metrics label to the LDR job. This enables [child metric]({% link {{ page.version.version }}/child-metrics.md %}) export; child metrics are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:

- `logical_replication.replicated_time_seconds`
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`

- `logical_replication.scanning_ranges`
- `logical_replication.catchup_ranges`

Contributor: while we're here, you could add the .scanning_ranges and .lagging_ranges metrics. added on friday cockroachdb/cockroach@9eb6c8b

Contributor Author: Thanks, I have added these to this label list

To use metrics labels, ensure you have enabled the child metrics cluster setting:

{% include_cached copy-clipboard.html %}
~~~ sql
SET CLUSTER SETTING server.child_metrics.enabled = true;
~~~

When you start LDR, include the `label` option:

{% include_cached copy-clipboard.html %}
~~~ sql
CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name}
ON 'external://{source_external_connection}'
INTO TABLE {database.public.table_name} WITH label=ldr_job;
~~~

Contributor: is this WITH option included in the setup page too?

Contributor Author: I need to add this in a few places

For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}) page.
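As a sketch of how labeled child metrics might be consumed, the following example filters Prometheus exposition-format lines by label value. The sample metric names, the dot-to-underscore conversion, and the `label` key are assumptions for illustration; check your cluster's actual scrape output for the exact names:

```python
# Illustrative scrape output; real exported names and label keys may differ.
SAMPLE_SCRAPE = """\
logical_replication_events_ingested{label="ldr_job"} 12042
logical_replication_events_dlqed{label="ldr_job"} 3
logical_replication_events_ingested{label="other_job"} 77
"""

def metrics_for_label(scrape_text: str, label_value: str) -> dict[str, float]:
    """Return {metric_name: value} for series carrying the given label value."""
    out: dict[str, float] = {}
    needle = f'label="{label_value}"'
    for line in scrape_text.splitlines():
        if needle in line:
            name = line.split("{", 1)[0]       # metric name precedes the label set
            value = float(line.rsplit(" ", 1)[1])  # value follows the final space
            out[name] = value
    return out

print(metrics_for_label(SAMPLE_SCRAPE, "ldr_job"))
# {'logical_replication_events_ingested': 12042.0, 'logical_replication_events_dlqed': 3.0}
```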

### Datadog

You can export metrics to Datadog for LDR jobs. For steps to set up metrics export, refer to the [Monitor CockroachDB Self-Hosted with Datadog]({% link {{ page.version.version }}/datadog.md %}) tutorial.

## See also

- [Set Up Logical Data Replication]({% link {{ page.version.version }}/set-up-logical-data-replication.md %})
- [Manage Logical Data Replication]({% link {{ page.version.version }}/manage-logical-data-replication.md %})
54 changes: 54 additions & 0 deletions src/current/v24.3/logical-data-replication-overview.md
@@ -0,0 +1,54 @@
---
title: Logical Data Replication
summary: An overview of CockroachDB logical data replication (LDR).
toc: true
---

{{site.data.alerts.callout_info}}
{% include feature-phases/preview.md %}
{{site.data.alerts.end}}

{% include_cached new-in.html version="v24.3" %} **Logical data replication (LDR)** continuously replicates tables from an active _source_ CockroachDB cluster to an active _destination_ CockroachDB cluster. Both source and destination can receive application reads and writes, and participate in [_bidirectional_](#use-cases) LDR replication for eventual consistency in the replicating tables. The active-active setup between clusters can provide protection against cluster, datacenter, or region failure while still achieving single-region low latency reads and writes in the individual CockroachDB clusters. Each cluster in an LDR job still benefits individually from [multi-active availability]({% link {{ page.version.version }}/multi-active-availability.md %}) with CockroachDB's built-in [Raft replication]({% link {{ page.version.version }}/demo-replication-and-rebalancing.md %}) providing data consistency across nodes, zones, and regions.

{{site.data.alerts.callout_success}}
Cockroach Labs also has a [physical cluster replication]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) tool that continuously replicates data for transactional consistency from a primary cluster to an independent standby cluster.
{{site.data.alerts.end}}

## Use cases

You can run LDR in a _unidirectional_ or _bidirectional_ setup to meet different use cases that support:

- [High availability and single-region write latency in two-datacenter deployments](#achieve-high-availability-and-single-region-write-latency-in-two-datacenter-2dc-deployments)
- [Workload isolation between clusters](#achieve-workload-isolation-between-clusters)

{{site.data.alerts.callout_info}}
For a comparison of CockroachDB high availability and resilience features and tooling, refer to the [Data Resilience]({% link {{ page.version.version }}/data-resilience.md %}) page.
{{site.data.alerts.end}}

### Achieve high availability and single-region write latency in two-datacenter (2DC) deployments

Maintain [high availability]({% link {{ page.version.version }}/data-resilience.md %}#high-availability) and resilience to region failures with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly during datacenter or region failures. If you set up two single-region clusters in LDR, both clusters can receive application reads and writes with low, single-region write latency. Then, in a datacenter, region, or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the two single-region clusters are deployed in US East and US West to provide low latency to each region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.

Contributor: @alicia-l2 are we planning to publicly document the pros and cons of using our MR feature suite vs LDR for the first use case? as well as zone cfgs + execution locality backup/cdc vs ldr for the second use case?

Contributor: We can link to blog posts describing this for the first use case. For the second use case, @kathancox maybe we can try to get more specific about hardware/cluster-specific isolation? Maybe above both use cases we should add a note saying that we consider this a tool and an alternative deployment option to our native Raft/MR architecture.

Contributor Author: Should we just link to the Data Resilience page? I think that covers some of this, particularly when it comes to MR feature comparison. I'm adding a note linking to that now. Anything further in terms of execution locality + cdc comps, we should probably create another docs issue for that.

@msbutler (Nov 13, 2024): yeah, i think we should revisit how we publicly document the pros and cons of LDR vs PCR vs CRDB replication, but i don't think we need to block this PR.

Contributor Author: OK yes, I think it's something that @alicia-l2 and I are working on gradually. We recently published this page https://www.cockroachlabs.com/docs/dev/data-resilience and created a new top-level section, so building out this area is a good idea.

kathancox marked this conversation as resolved.

<img src="{{ 'images/v24.3/east-west-region.svg' | relative_url }}" alt="Diagram showing bidirectional LDR from cluster A to B and back again from cluster B to A." style="width:60%" />

### Achieve workload isolation between clusters

Isolate critical application workloads from non-critical application workloads. For example, you may want to run jobs like [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) or [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}) from one cluster to isolate these jobs from the cluster receiving the principal application traffic.

<img src="{{ 'images/v24.3/unidirectional.svg' | relative_url }}" alt="Diagram showing unidirectional LDR from a source cluster to a destination cluster with the destination cluster supporting secondary workloads plus jobs and the source cluster accepting the main application traffic." style="width:80%" />

## Features

- **Table-level replication**: When you initiate LDR, it will replicate all of the source table's existing data to the destination table. From then on, LDR will replicate ongoing changes from the source table to the destination table to achieve eventual consistency.
- **Last write wins conflict resolution**: LDR uses [_last write wins (LWW)_ conflict resolution]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#conflict-resolution), which will use the latest [MVCC]({% link {{ page.version.version }}/architecture/storage-layer.md %}#mvcc) timestamp to resolve a conflict in row insertion.
- **Dead letter queue (DLQ)**: When LDR starts, the job will create a [DLQ table]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) with each replicating table in order to track unresolved conflicts. You can interact and manage this table like any other SQL table.
- **Replication modes**: LDR offers different _modes_ that apply data differently during replication, which allows you to optimize for throughput or for preserving constraints during replication.
- **Monitoring**: To [monitor]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}) LDR's initial progress, current status, and performance, you can view metrics available in the DB Console, Prometheus, and Metrics Export.
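The last-write-wins rule described above can be illustrated with a toy sketch. This is not CockroachDB's implementation; plain floats stand in for MVCC timestamps, and the tie-breaking behavior here is an assumption of the sketch:

```python
from dataclasses import dataclass

@dataclass
class Row:
    value: str
    mvcc_timestamp: float  # stand-in for CockroachDB's MVCC timestamp

def resolve_lww(local: Row, incoming: Row) -> Row:
    """Toy last-write-wins resolution: keep the row with the later timestamp."""
    return incoming if incoming.mvcc_timestamp > local.mvcc_timestamp else local

local = Row("promo_a", mvcc_timestamp=100.0)
incoming = Row("promo_b", mvcc_timestamp=105.0)
print(resolve_lww(local, incoming).value)  # promo_b
```

In the real system, rows that cannot be applied are routed to the DLQ table rather than silently dropped.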

## Get started

- To set up unidirectional or bidirectional LDR, follow the [Set Up Logical Data Replication]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}) tutorial.
- Once you've set up LDR, use the [Manage Logical Data Replication]({% link {{ page.version.version }}/manage-logical-data-replication.md %}) page to coordinate and manage different parts of the job.
- For an overview of metrics to track and monitoring tools, refer to the [Monitor Logical Data Replication]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}) page.

{% comment %}move known limitations to here after PR 1 merges{% endcomment %}
2 changes: 1 addition & 1 deletion src/current/v24.3/manage-logical-data-replication.md
@@ -176,7 +176,7 @@ If you have a unidirectional LDR setup, you should cancel the running LDR stream

## Jobs and LDR

You can run changefeed and backup [jobs]({% link {{ page.version.version }}/show-jobs.md %}) on any cluster that is involved in an LDR job. Both source and destination clusters in LDR are active, which means they can both serve production reads and writes as well as run [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}), [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}), and so on.
You can run changefeed and backup [jobs]({% link {{ page.version.version }}/show-jobs.md %}) on any cluster that is involved in an LDR job. Both source and destination clusters in LDR are active, which means they can both serve production reads and writes as well as run [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}) and [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}).

{{site.data.alerts.callout_success}}
You may want to run jobs like [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) from one cluster to isolate these jobs from the cluster receiving the principal application traffic. {% comment %} add link to ldr overview page that will describe this workload isolation topology {% endcomment %}