
LDR overview and monitoring pages #19085

Merged
merged 11 commits into from
Nov 14, 2024
2 changes: 1 addition & 1 deletion src/current/_includes/v24.3/ldr/multiple-tables.md
@@ -1 +1 @@
There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level of observability will be at the job level. One table in one LDR job will allow for table-level observability.
kathancox marked this conversation as resolved.

Contributor: lol i was just checking if this was an include
@@ -0,0 +1,9 @@
Contributor: Is this missing the "description" field in LDR responses?

Contributor Author: yes it is, I am adding. Thanks for catching.

Field | Response
---------+----------
`job_id` | The job's ID. Use with [`CANCEL JOB`]({% link {{ page.version.version }}/cancel-job.md %}), [`PAUSE JOB`]({% link {{ page.version.version }}/pause-job.md %}), [`RESUME JOB`]({% link {{ page.version.version }}/resume-job.md %}), [`SHOW JOB`]({% link {{ page.version.version }}/show-jobs.md %}).
`status` | Status of the job: `running`, `paused`, or `canceled`. {% comment %}check these{% endcomment %}
`targets` | The fully qualified name of the table(s) that are part of the LDR job.
`replicated_time` | The latest [timestamp]({% link {{ page.version.version }}/timestamp.md %}) at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. `replicated_time` is updated periodically (every 30s). {% comment %}To confirm this line is accurate{% endcomment %}
`replication_start_time` | The start time of the LDR job.
`conflict_resolution_type` | The type of [conflict resolution]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#conflict-resolution): `LWW` last write wins.
`description` | Description of the job including the replicating table(s) and the source cluster connection.
@@ -5,6 +5,12 @@
{
"title": "Logical Data Replication",
"items": [
{
"title": "Overview",
"urls": [
"/${VERSION}/logical-data-replication-overview.html"
]
},
{
"title": "Set Up Logical Data Replication",
"urls": [
@@ -16,6 +22,12 @@
"urls": [
"/${VERSION}/manage-logical-data-replication.html"
]
},
{
"title": "Monitor Logical Data Replication",
"urls": [
"/${VERSION}/logical-data-replication-monitoring.html"
]
}
]
}
644 changes: 644 additions & 0 deletions src/current/images/v24.3/east-west-region.svg
544 changes: 544 additions & 0 deletions src/current/images/v24.3/unidirectional.svg
147 changes: 147 additions & 0 deletions src/current/v24.3/logical-data-replication-monitoring.md
@@ -0,0 +1,147 @@
---
title: Logical Data Replication Monitoring
summary: Monitor and observe LDR jobs between a source and destination table.
toc: true
docs_area: manage
---

{{site.data.alerts.callout_info}}
{% include feature-phases/preview.md %}
{{site.data.alerts.end}}

You can monitor [**logical data replication (LDR)**]({% link {{ page.version.version }}/logical-data-replication-overview.md %}) using:

- [`SHOW LOGICAL REPLICATION JOBS`](#sql-shell) in the SQL shell to view a list of LDR jobs on the cluster.
- The **Logical Data Replication** dashboard on the [DB Console](#db-console) to view metrics at the cluster level. {% comment %}To add link later to dashboard page{% endcomment %}
- [Prometheus and Alertmanager](#prometheus) to track and alert on LDR metrics.
- Metrics export with [Datadog](#datadog).
- [Metrics labels](#metrics-labels) to view metrics at the job level.

{{site.data.alerts.callout_info}}
{% include {{ page.version.version }}/ldr/multiple-tables.md %}
{{site.data.alerts.end}}

{% comment %}To add to an include{% endcomment %}
When you start an LDR stream, one job is created on each cluster:

- The _history retention job_ on the source cluster, which runs while the LDR job is active to protect changes in the table from [garbage collection]({% link {{ page.version.version }}/architecture/storage-layer.md %}#garbage-collection) until they have been applied to the destination cluster. The history retention job is viewable in the [DB Console](#db-console) or with [`SHOW JOBS`]({% link {{ page.version.version }}/show-jobs.md %}). Any manual changes to the history retention job could disrupt the LDR job.
- The `logical replication` job on the destination cluster. You can view the status of this job in the SQL shell with `SHOW LOGICAL REPLICATION JOBS` and the DB Console [**Jobs** page](#jobs-page).

## SQL Shell

In the destination cluster's SQL shell, you can query `SHOW LOGICAL REPLICATION JOBS` to view the LDR jobs running on the cluster:

{% include_cached copy-clipboard.html %}
~~~ sql
SHOW LOGICAL REPLICATION JOBS;
~~~
~~~
job_id | status | targets | replicated_time
----------------------+---------+---------------------------+------------------
1012877040439033857 | running | {database.public.table} | NULL
(1 row)
~~~

For additional detail on each LDR job, use the `WITH details` option:

{% include_cached copy-clipboard.html %}
~~~ sql
SHOW LOGICAL REPLICATION JOBS WITH details;
~~~
~~~
job_id | status | targets | replicated_time | replication_start_time | conflict_resolution_type | description
----------------------+----------+--------------------------------+-------------------------------+-------------------------------+--------------------------+-----------------------------------------------------------------------------------------
1010959260799270913 | running | {movr.public.promo_codes} | 2024-10-24 17:50:05+00 | 2024-10-10 20:04:42.196982+00 | LWW | LOGICAL REPLICATION STREAM into movr.public.promo_codes from external://cluster_a
1014047902397333505 | canceled | {defaultdb.public.office_dogs} | 2024-10-24 17:30:25+00 | 2024-10-21 17:54:20.797643+00 | LWW | LOGICAL REPLICATION STREAM into defaultdb.public.office_dogs from external://cluster_a
~~~

### Responses

{% include {{ page.version.version }}/ldr/show-logical-replication-responses.md %}

Contributor: ugh, i may have to backport some changes to this table. this is fine to merge as is.

Contributor Author: OK


## Recommended LDR metrics to track

- Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster, to when it is applied on the destination cluster. An LDR _commit_ is when the job either applies a row successfully to the destination cluster or adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq).
- `logical_replication.commit_latency-p50`
- `logical_replication.commit_latency-p99`
- Replication lag: How far behind the source cluster is from the destination cluster at a specific point in time. The replication lag is equivalent to [RPO]({% link {{ page.version.version }}/disaster-recovery-overview.md %}) during a disaster. Calculate the replication lag with this metric, for example: `time.now() - replicated_time_seconds`.
- `logical_replication.replicated_time_seconds`
msbutler marked this conversation as resolved.
- Row updates applied: These metrics indicate whether the destination cluster is actively receiving and applying data from the source cluster.
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`

## DB Console

In the DB Console, you can use:

- The [**Metrics** dashboard]({% link {{ page.version.version }}/ui-overview-dashboard.md %}) for LDR to view metrics for the job on the destination cluster.
- The [**Jobs** page]({% link {{ page.version.version }}/ui-jobs-page.md %}) to view the history retention job on the source cluster and the LDR job on the destination cluster.

The metrics for LDR in the DB Console are at the **cluster** level. This means that if multiple LDR jobs are running on a cluster, the DB Console will show the average metrics across jobs.

### Metrics dashboard

You can use the [**Logical Data Replication** dashboard]({% link {{ page.version.version }}/ui-overview-dashboard.md %}) of the destination cluster to monitor the following metric graphs at the **cluster** level:

- Replication latency
- Replication lag
- Row updates applied
- Logical bytes reviewed
- Batch application processing time: 50th percentile
- Batch application processing time: 99th percentile
- DLQ causes
- Retry queue size

{% comment %}Dashboard page in the DB Console docs to be added with more information per other dashboards. Link to there from this section.{% endcomment %}

To track replicated time, ingested events, and events added to the DLQ at the **job** level, refer to [Metrics labels](#metrics-labels).

### Jobs page

On the [**Jobs** page]({% link {{ page.version.version }}/ui-jobs-page.md %}), select:

- The **Replication Producer** in the source cluster's DB Console to view the _history retention job_.
- The **Logical Replication Ingestion** job in the destination cluster's DB Console. When you start LDR, the **Logical Replication Ingestion** job will show a bar that tracks the initial scan progress of the source table's existing data.

## Monitoring and alerting

### Prometheus

You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer to the [Monitor CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}) tutorial for steps to set up Prometheus.

#### Metrics labels

To view metrics at the job level, you can use the `label` option when you start LDR to add a metrics label to the LDR job. This enables [child metric]({% link {{ page.version.version }}/child-metrics.md %}) export; child metrics are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:

- `logical_replication.replicated_time_seconds`
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`

- `logical_replication.scanning_ranges`
- `logical_replication.catchup_ranges`

Contributor: while we're here, you could add the .scanning_ranges and .lagging_ranges metrics. added on friday cockroachdb/cockroach@9eb6c8b

Contributor Author: Thanks, I have added these to this label list

To use metrics labels, ensure you have enabled the child metrics cluster setting:

{% include_cached copy-clipboard.html %}
~~~ sql
SET CLUSTER SETTING server.child_metrics.enabled = true;
~~~

When you start LDR, include the `label` option:

{% include_cached copy-clipboard.html %}
~~~ sql
CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name}
ON 'external://{source_external_connection}'
INTO TABLE {database.public.table_name} WITH label=ldr_job;
~~~

Contributor: is this WITH option included in the setup page too?

Contributor Author: I need to add this in a few places

For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}) page.
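As a sketch of how labeled child metrics might be consumed, the following example filters Prometheus exposition-format lines by label value. The sample metric names, the dot-to-underscore conversion, and the `label` key are assumptions for illustration; check your cluster's actual scrape output for the exact names:

```python
# Illustrative scrape output; real exported names and label keys may differ.
SAMPLE_SCRAPE = """\
logical_replication_events_ingested{label="ldr_job"} 12042
logical_replication_events_dlqed{label="ldr_job"} 3
logical_replication_events_ingested{label="other_job"} 77
"""

def metrics_for_label(scrape_text: str, label_value: str) -> dict[str, float]:
    """Return {metric_name: value} for series carrying the given label value."""
    out: dict[str, float] = {}
    needle = f'label="{label_value}"'
    for line in scrape_text.splitlines():
        if needle in line:
            name = line.split("{", 1)[0]       # metric name precedes the label set
            value = float(line.rsplit(" ", 1)[1])  # value follows the final space
            out[name] = value
    return out

print(metrics_for_label(SAMPLE_SCRAPE, "ldr_job"))
# {'logical_replication_events_ingested': 12042.0, 'logical_replication_events_dlqed': 3.0}
```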

### Datadog

You can export metrics to Datadog for LDR jobs. For steps to set up metrics export, refer to the [Monitor CockroachDB Self-Hosted with Datadog]({% link {{ page.version.version }}/datadog.md %}) tutorial.

## See also

- [Set Up Logical Data Replication]({% link {{ page.version.version }}/set-up-logical-data-replication.md %})
- [Manage Logical Data Replication]({% link {{ page.version.version }}/manage-logical-data-replication.md %})
54 changes: 54 additions & 0 deletions src/current/v24.3/logical-data-replication-overview.md
@@ -0,0 +1,54 @@
---
title: Logical Data Replication
summary: An overview of CockroachDB logical data replication (LDR).
toc: true
---

{{site.data.alerts.callout_info}}
{% include feature-phases/preview.md %}
{{site.data.alerts.end}}

{% include_cached new-in.html version="v24.3" %} **Logical data replication (LDR)** continuously replicates tables from an active _source_ CockroachDB cluster to an active _destination_ CockroachDB cluster. Both source and destination can receive application reads and writes, and participate in [_bidirectional_](#use-cases) LDR replication for eventual consistency in the replicating tables. The active-active setup between clusters can provide protection against cluster, datacenter, or region failure while still achieving single-region low latency reads and writes in the individual CockroachDB clusters. Each cluster in an LDR job still benefits individually from [multi-active availability]({% link {{ page.version.version }}/multi-active-availability.md %}) with CockroachDB's built-in [Raft replication]({% link {{ page.version.version }}/demo-replication-and-rebalancing.md %}) providing data consistency across nodes, zones, and regions.

{{site.data.alerts.callout_success}}
Cockroach Labs also has a [physical cluster replication]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) tool that continuously replicates data for transactional consistency from a primary cluster to an independent standby cluster.
{{site.data.alerts.end}}

## Use cases

You can run LDR in a _unidirectional_ or _bidirectional_ setup to meet different use cases that support:

- [High availability and single-region write latency in two-datacenter deployments](#achieve-high-availability-and-single-region-write-latency-in-two-datacenter-2dc-deployments)
- [Workload isolation between clusters](#achieve-workload-isolation-between-clusters)

{{site.data.alerts.callout_info}}
For a comparison of CockroachDB high availability and resilience features and tooling, refer to the [Data Resilience]({% link {{ page.version.version }}/data-resilience.md %}) page.
{{site.data.alerts.end}}

### Achieve high availability and single-region write latency in two-datacenter (2DC) deployments

Maintain [high availability]({% link {{ page.version.version }}/data-resilience.md %}#high-availability) and resilience to region failures with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly during datacenter or region failures. If you set up two single-region clusters in LDR, both clusters can receive application reads and writes with low, single-region write latency. Then, in a datacenter, region, or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the two single-region clusters are deployed in US East and US West to provide low latency to each region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.

Contributor: @alicia-l2 are we planning to publicly document the pros and cons of using our MR feature suite vs LDR for the first use case? as well as zone cfgs + execution locality backup/cdc vs ldr for the second use case?

Contributor: We can link to blog posts describing this for the first use case. For the second use case, @kathancox maybe we can try to get more specific about hardware/cluster-specific isolation? Maybe above both use cases we should add a note saying that we consider this a tool and an alternative deployment option to our native Raft/MR architecture.

Contributor Author: Should we just link to the Data Resilience page? I think that covers some of this, particularly when it comes to MR feature comparison. I'm adding a note linking to that now. Anything further in terms of execution locality + cdc comps, we should probably create another docs issue for that.

@msbutler (Nov 13, 2024): yeah, i think we should revisit how we publicly document the pros and cons of LDR vs PCR vs CRDB replication, but i don't think we need to block this PR.

Contributor Author: OK yes, I think it's something that @alicia-l2 and I are working on gradually. We recently published this page https://www.cockroachlabs.com/docs/dev/data-resilience and created a new top-level section, so building out this area is a good idea.

kathancox marked this conversation as resolved.

<img src="{{ 'images/v24.3/east-west-region.svg' | relative_url }}" alt="Diagram showing bidirectional LDR from cluster A to B and back again from cluster B to A." style="width:60%" />

### Achieve workload isolation between clusters

Isolate critical application workloads from non-critical application workloads. For example, you may want to run jobs like [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) or [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}) from one cluster to isolate these jobs from the cluster receiving the principal application traffic.

<img src="{{ 'images/v24.3/unidirectional.svg' | relative_url }}" alt="Diagram showing unidirectional LDR from a source cluster to a destination cluster with the destination cluster supporting secondary workloads plus jobs and the source cluster accepting the main application traffic." style="width:80%" />

## Features

- **Table-level replication**: When you initiate LDR, it will replicate all of the source table's existing data to the destination table. From then on, LDR will replicate ongoing changes from the source table to the destination table to achieve eventual consistency.
- **Last write wins conflict resolution**: LDR uses [_last write wins (LWW)_ conflict resolution]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#conflict-resolution), which will use the latest [MVCC]({% link {{ page.version.version }}/architecture/storage-layer.md %}#mvcc) timestamp to resolve a conflict in row insertion.
- **Dead letter queue (DLQ)**: When LDR starts, the job will create a [DLQ table]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) with each replicating table in order to track unresolved conflicts. You can interact and manage this table like any other SQL table.
- **Replication modes**: LDR offers different _modes_ that apply data differently during replication, which allows you to optimize for throughput or for preserving constraints during replication.
- **Monitoring**: To [monitor]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}) LDR's initial progress, current status, and performance, you can view metrics available in the DB Console, Prometheus, and Metrics Export.
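The last-write-wins rule described above can be illustrated with a toy sketch. This is not CockroachDB's implementation; plain floats stand in for MVCC timestamps, and the tie-breaking behavior here is an assumption of the sketch:

```python
from dataclasses import dataclass

@dataclass
class Row:
    value: str
    mvcc_timestamp: float  # stand-in for CockroachDB's MVCC timestamp

def resolve_lww(local: Row, incoming: Row) -> Row:
    """Toy last-write-wins resolution: keep the row with the later timestamp."""
    return incoming if incoming.mvcc_timestamp > local.mvcc_timestamp else local

local = Row("promo_a", mvcc_timestamp=100.0)
incoming = Row("promo_b", mvcc_timestamp=105.0)
print(resolve_lww(local, incoming).value)  # promo_b
```

In the real system, rows that cannot be applied are routed to the DLQ table rather than silently dropped.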

## Get started

- To set up unidirectional or bidirectional LDR, follow the [Set Up Logical Data Replication]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}) tutorial.
- Once you've set up LDR, use the [Manage Logical Data Replication]({% link {{ page.version.version }}/manage-logical-data-replication.md %}) page to coordinate and manage different parts of the job.
- For an overview of metrics to track and monitoring tools, refer to the [Monitor Logical Data Replication]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}) page.

{% comment %}move known limitations to here after PR 1 merges{% endcomment %}
2 changes: 1 addition & 1 deletion src/current/v24.3/manage-logical-data-replication.md
@@ -176,7 +176,7 @@ If you have a unidirectional LDR setup, you should cancel the running LDR stream

## Jobs and LDR

You can run changefeed and backup [jobs]({% link {{ page.version.version }}/show-jobs.md %}) on any cluster that is involved in an LDR job. Both source and destination clusters in LDR are active, which means they can both serve production reads and writes as well as run [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}), [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}), and so on.
You can run changefeed and backup [jobs]({% link {{ page.version.version }}/show-jobs.md %}) on any cluster that is involved in an LDR job. Both source and destination clusters in LDR are active, which means they can both serve production reads and writes as well as run [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}) and [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}).

{{site.data.alerts.callout_success}}
You may want to run jobs like [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) from one cluster to isolate these jobs from the cluster receiving the principal application traffic. {% comment %} add link to ldr overview page that will describe this workload isolation topology {% endcomment %}