
feat(#190): version 2: add multi-db, batching and tags #192

Open
wants to merge 13 commits into main
Conversation

@witash (Contributor) commented Feb 19, 2025

This is ready for review.
It wasn't really possible to separate the issues included, so I opened one big PR instead of several small ones.

It contains:

  • adding source to couchdb (this is a breaking change for cht-sync)
  • adding dbname and instance, relying on source, to document_metadata for multi-db and multi-instance support
  • adding tags to the base and telemetry models, to allow users to separate the runs for base models and telemetry models if they want (see the sketch after this list)
  • adding a batch_incremental macro to allow for small batches for document_metadata
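
As a hedged illustration of the tags item above (not taken from this PR's diff): a model tagged in the standard dbt way can be run separately from other models. The tag and model names here are assumptions.

    -- hypothetical model header; the actual tag names in this PR may differ
    {{ config(tags=['telemetry']) }}
    SELECT * FROM {{ ref('document_metadata') }}
    -- with tags like this, `dbt run --select tag:telemetry` runs only the
    -- telemetry models, and `dbt run --select tag:base` only the base models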

@witash witash changed the title V2 feat: V2 Mar 4, 2025
@witash witash changed the title feat: V2 feat(190): V2 Mar 4, 2025
@witash witash changed the title feat(190): V2 feat(#190): V2 Mar 4, 2025
@witash witash marked this pull request as ready for review March 4, 2025 12:32
@witash witash changed the title feat(#190): V2 feat(#190): version 2: add mulit-db, batching and tags Mar 5, 2025
@witash witash requested review from mrjones-plip and jkuester March 5, 2025 07:06
@mrjones-plip left a comment

I haven't spent enough time with the code prior to v1 to be confident doing an in-depth code review. I'll defer to @jkuester for any deep code dives!

I'll be sure to do a fair amount of black-box testing in the docs PR I'm working on, and I suggest we wait to have both PRs approved before merging. For example, there's no upgrade path documented, and I'm not sure if that will impose additional changes on this v2 PR.

Otherwise, approving to unblock!

@jkuester left a comment

Very nice! Just one big question about our batching logic. At the end of the day, duplicating fewer than 1000 rows in a 100_000-row batch is probably not too big of a deal, but maybe we can avoid it....


-- union the CTE with a query getting all records with equal timestamp
SELECT * FROM {{ source_cte_name }}
WHERE saved_timestamp = {{ max_existing_timestamp('saved_timestamp') }}

question: it seems like this is going to end up sometimes selecting duplicate records (that were processed in a previous batch). Is this okay?

For example, in your test_document_metadata_batch_size_same_timestamp test case for the document_metadata model, the in_batch_4 record is the last one processed. But at the end of that batch, my understanding is that the timestamp for that row (2024-08-03 00:00:00.000) becomes the saved_timestamp for the next run. And so, the next batch will not only contain the not_in_batch_5 record, but also in_batch_4, because they have the same timestamp.

I think these processes are idempotent, so maybe the only downside is a bit of wasted performance on rows that happen to have a duplicate timestamp... except in the worst case scenario where more docs exist for the timestamp than the batch_size. However, if I am understanding this correctly, the saved_timestamp column is populated with the time that couch2pg wrote the data, and that happens in batches of up to 1000. So there is a reasonable likelihood of a good number of rows sharing the same timestamp value (but not more than 1000).


I wonder if it would be possible to reverse this logic and instead of being greedy on the low end, we are actually "stingy" on the top end of the batch. So, instead of doing the UNION of selects here, we just select a sub-set of current_batch:

    SELECT * FROM current_batch
    WHERE saved_timestamp < (SELECT MAX(saved_timestamp) FROM current_batch)

This guarantees that all of the rows with the max saved_timestamp that we return will be included in the batch. Then, the next batch will be safe to start processing at anything greater than the max saved_timestamp value we have already processed...

@witash (Contributor, Author) replied

question: it seems like this is going to end up sometimes selecting duplicate records (that were processed in a previous batch). Is this okay?

Yes, this is OK, and it currently happens regardless of any batching: rows in the source table that have the same saved_timestamp as max(saved_timestamp) in the target table can be selected multiple times. They end up being deleted from the target table and re-inserted, so they are treated like updates. It's a little messy, but there's no possibility of duplicate data.
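
For illustration only, a minimal sketch of that delete-and-re-insert behavior, using dbt's built-in delete+insert incremental strategy; the source, model, and key names are assumptions, not the exact config in this repo:

    -- hypothetical dbt model; source and unique_key are assumptions
    {{ config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='_id'
    ) }}
    SELECT * FROM {{ source('couchdb', 'medic') }}
    {% if is_incremental() %}
    -- rows sharing the current max timestamp are re-selected here,
    -- then deleted from the target and re-inserted (treated like updates)
    WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM {{ this }})
    {% endif %}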

worst case scenario where more docs exist for the timestamp than the batch_size

Yes, this is what the UNION with all rows with the same timestamp handles. If the batch size is less than the number of rows in the source table with the same saved_timestamp, then just selecting all rows where the source timestamp is greater than or equal to the max saved timestamp in the target table, and limiting the results, would select the same rows each time, ignoring new rows added to the source table; the max timestamp in the target table would never change.

It is an edge case, because if this batch_size is small compared to the batch size that couch2pg inserts, it's going to have other problems (not being able to catch up to couch2pg).
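
To make that stall concrete, a hypothetical walk-through (table names invented, not code from this PR): suppose batch_size is 2 and three unprocessed source rows share the timestamp t1.

    -- without the UNION, a plain limited select:
    SELECT * FROM source_table
    WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM target_table)
    ORDER BY saved_timestamp
    LIMIT 2
    -- once the target's max is t1, this returns the same two t1 rows every run;
    -- the third t1 row (and anything newer) is never reached, and the max never
    -- advances. Unioning in all rows WHERE saved_timestamp = t1 processes the
    -- whole group at once, so the next batch can move past t1.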

I wonder if it would be possible to reverse this logic and instead of being greedy on the low end, we are actually "stingy" on the top end of the batch. So, instead of doing the UNION of selects here, we just select a sub-set of current_batch:

This guarantees that all of the rows with the max saved_timestamp that we return will be included in the batch. Then, the next batch will be safe to start processing at anything greater than the max saved_timestamp value we have already processed...

I don't understand this... what if the batch size is less than the number of rows in the source table with the same saved_timestamp? How does it avoid the problem described above, where it always selects the same batch because the max(saved_timestamp) in the target table doesn't change?
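
A hypothetical example of the case being asked about here (values invented): if every row in current_batch carries the same timestamp, the proposed filter returns nothing.

    -- current_batch: (doc_a, t1), (doc_b, t1), (doc_c, t1), batch_size = 3
    SELECT * FROM current_batch
    WHERE saved_timestamp < (SELECT MAX(saved_timestamp) FROM current_batch)
    -- every row equals the max, so zero rows come back; nothing is inserted,
    -- max(saved_timestamp) in the target table never changes, and every
    -- subsequent run selects the same stalled batch.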

@witash witash changed the title feat(#190): version 2: add mulit-db, batching and tags feat(#190): version 2: add multi-db, batching and tags Mar 7, 2025