From c8bc21880a26728282744e372ed57857685e239e Mon Sep 17 00:00:00 2001
From: Edward Gao
Date: Tue, 4 Feb 2025 13:37:53 -0800
Subject: [PATCH] explain dedup

---
 docs/integrations/destinations/s3-data-lake.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/integrations/destinations/s3-data-lake.md b/docs/integrations/destinations/s3-data-lake.md
index 37ac3294969e1..e04f5bef2dfe4 100644
--- a/docs/integrations/destinations/s3-data-lake.md
+++ b/docs/integrations/destinations/s3-data-lake.md
@@ -105,6 +105,24 @@ rewrite existing data files. This means that we can only handle specific schema
 If your source goes through an unsupported schema change, you must manually edit the table schema.
 
+## Deduplication
+
+This connector uses a merge-on-read strategy to support deduplication:
+* The stream's primary keys are translated to Iceberg's [identifier columns](https://iceberg.apache.org/spec/#identifier-field-ids).
+* An "upsert" is an [equality-based delete](https://iceberg.apache.org/spec/#equality-delete-files)
+  on that row's primary key, followed by an insertion of the new data.
+
+### Assumptions
+
+The S3 Data Lake connector assumes that one of two things is true:
+* The source will never emit the same primary key twice in a single sync attempt.
+* If the source emits the same PK multiple times in a single attempt, it will always emit those records
+  in cursor order (oldest to newest).
+
+If these conditions are not met, you may see inaccurate data in the destination (i.e. older records
+taking precedence over newer records). If this happens, you should use the `append` or `overwrite`
+sync mode.
+
 ## Changelog
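To make the ordering assumption above concrete, here is a minimal, pure-Kotlin sketch of the delete-then-insert ("upsert") behavior described in the added docs. It is an illustration only, not the connector's actual code: the `Record` shape, its field names, and the in-memory map standing in for the Iceberg table (plus its equality-delete files) are all invented for this example.

```kotlin
// Toy model of the merge-on-read upsert described above (illustration only; the
// Record shape and field names are invented, and a MutableMap stands in for the
// Iceberg table plus its equality-delete files).
data class Record(val id: Long, val updatedAt: Long, val payload: String)

// Each incoming record behaves like an equality delete on its primary key followed
// by an insert of the new row: last write wins, with no cursor comparison against
// whatever is already in the table.
fun upsertAll(table: MutableMap<Long, Record>, batch: List<Record>) {
    for (record in batch) {
        table.remove(record.id)   // equality-based delete on the primary key
        table[record.id] = record // insert the new data
    }
}

fun main() {
    val table = mutableMapOf<Long, Record>()

    // Records for PK 1 arrive in cursor order (oldest to newest): the newest row wins.
    upsertAll(
        table,
        listOf(
            Record(id = 1, updatedAt = 100, payload = "older"),
            Record(id = 1, updatedAt = 200, payload = "newest"),
        )
    )
    println(table[1]?.payload) // newest

    // Records for PK 2 arrive out of cursor order: the older row is written last and
    // overwrites the newer one, i.e. the stale-data case called out in the docs.
    upsertAll(
        table,
        listOf(
            Record(id = 2, updatedAt = 200, payload = "newest"),
            Record(id = 2, updatedAt = 100, payload = "stale"),
        )
    )
    println(table[2]?.payload) // stale
}
```

Running the sketch prints `newest` for the in-order batch and `stale` for the out-of-order batch; the second case is exactly the situation where the `append` or `overwrite` sync mode is the safer choice.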