
feat(#190): version 2: add multi-db, batching and tags #192

Open
wants to merge 13 commits into main
Conversation

@witash (Contributor) commented Feb 19, 2025

This is ready for review.
It wasn't really possible to separate the issues included, so I opened one big PR instead of several small ones.

It contains:

  • adding source to couchdb (this is a breaking change for cht-sync)
  • adding dbname and instance, relying on source, to document_metadata for multi-db and multi-instance support
  • adding tags to the base and telemetry models, to allow users to separate the runs for base models and telemetry models if they want (see the sketch after this list)
  • adding a batch_incremental macro to allow for small batches for document_metadata
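
As a hedged illustration of the tags item above (not taken from this PR's diff): a model tagged in the standard dbt way can be run separately from other models. The tag and model names here are assumptions.

    -- hypothetical model header; the actual tag names in this PR may differ
    {{ config(tags=['telemetry']) }}
    SELECT * FROM {{ ref('document_metadata') }}
    -- with tags like this, `dbt run --select tag:telemetry` runs only the
    -- telemetry models, and `dbt run --select tag:base` only the base models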

@witash witash changed the title V2 feat: V2 Mar 4, 2025
@witash witash changed the title feat: V2 feat(190): V2 Mar 4, 2025
@witash witash changed the title feat(190): V2 feat(#190): V2 Mar 4, 2025
@witash witash marked this pull request as ready for review March 4, 2025 12:32
@witash witash changed the title feat(#190): V2 feat(#190): version 2: add mulit-db, batching and tags Mar 5, 2025
@witash witash requested review from mrjones-plip and jkuester March 5, 2025 07:06
@mrjones-plip left a comment

I haven't spent enough time with the code prior to v1 to be confident doing an in-depth code review. I'll defer to @jkuester for any deep code dives!

I'll be sure to do a fair amount of black-box testing in the docs PR I'm working on, and I suggest we wait to have both PRs approved before merging. For example, there's no upgrade path documented, and I'm not sure if that will impose additional changes on this v2 PR.

Otherwise, approving to unblock!

@jkuester left a comment

Very nice! Just one big question about our batching logic. At the end of the day, duplicating fewer than 1000 rows in a 100_000-row batch is probably not too big of a deal, but maybe we can avoid it....


-- union the CTE with a query getting all records with equal timestamp
SELECT * FROM {{ source_cte_name }}
WHERE saved_timestamp = {{ max_existing_timestamp('saved_timestamp') }}

question: it seems like this is going to end up sometimes selecting duplicate records (that were processed in a previous batch). Is this okay?

For example, in your test_document_metadata_batch_size_same_timestamp test case for the document_metadata model, the in_batch_4 record is the last one processed. But at the end of that batch, my understanding is that the timestamp for that row (2024-08-03 00:00:00.000) becomes the saved_timestamp for the next run. And so, the next batch will not only contain the not_in_batch_5 record, but also in_batch_4, because they have the same timestamp.

I think these processes are idempotent, so maybe the only downside is a bit of wasted performance on rows that happen to have a duplicate timestamp... except in the worst case scenario where more docs exist for the timestamp than the batch_size. However, if I am understanding this correctly, the saved_timestamp column is populated with the time that couch2pg wrote the data, and that happens in batches of up to 1000. So there is a reasonable likelihood of a good number of rows sharing the same timestamp value (but not more than 1000).


I wonder if it would be possible to reverse this logic and instead of being greedy on the low end, we are actually "stingy" on the top end of the batch. So, instead of doing the UNION of selects here, we just select a sub-set of current_batch:

    SELECT * FROM current_batch
    WHERE saved_timestamp < (SELECT MAX(saved_timestamp) FROM current_batch)

This guarantees that all of the rows with the max saved_timestamp that we return will be included in the batch. Then, the next batch will be safe to start processing at anything greater than the max saved_timestamp value we have already processed...

@witash (Contributor, Author) replied

question: it seems like this is going to end up sometimes selecting duplicate records (that were processed in a previous batch). Is this okay?

Yes, this is OK, and it currently happens regardless of any batching: rows in the source table that have the same saved_timestamp as max(saved_timestamp) in the target table can be selected multiple times. They end up being deleted from the target table and re-inserted, so they are treated like updates. It's a little messy, but there's no possibility of duplicate data.
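
For illustration only, a minimal sketch of that delete-and-re-insert behavior, using dbt's built-in delete+insert incremental strategy; the source, model, and key names are assumptions, not the exact config in this repo:

    -- hypothetical dbt model; source and unique_key are assumptions
    {{ config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='_id'
    ) }}
    SELECT * FROM {{ source('couchdb', 'medic') }}
    {% if is_incremental() %}
    -- rows sharing the current max timestamp are re-selected here,
    -- then deleted from the target and re-inserted (treated like updates)
    WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM {{ this }})
    {% endif %}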

worst case scenario where more docs exist for the timestamp than the batch_size

Yes, this is what the UNION with all rows with the same timestamp handles. If the batch size is less than the number of rows in the source table with the same saved_timestamp, then just selecting all rows where the source timestamp is greater than or equal to the max saved timestamp in the target table, and limiting the results, would select the same rows each time, ignoring new rows added to the source table; the max timestamp in the target table would never change.

It is an edge case, because if this batch_size is small compared to the batch size that couch2pg inserts, it's going to have other problems (not being able to catch up to couch2pg).
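
To make that stall concrete, a hypothetical walk-through (table names invented, not code from this PR): suppose batch_size is 2 and three unprocessed source rows share the timestamp t1.

    -- without the UNION, a plain limited select:
    SELECT * FROM source_table
    WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM target_table)
    ORDER BY saved_timestamp
    LIMIT 2
    -- once the target's max is t1, this returns the same two t1 rows every run;
    -- the third t1 row (and anything newer) is never reached, and the max never
    -- advances. Unioning in all rows WHERE saved_timestamp = t1 processes the
    -- whole group at once, so the next batch can move past t1.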

I wonder if it would be possible to reverse this logic and instead of being greedy on the low end, we are actually "stingy" on the top end of the batch. So, instead of doing the UNION of selects here, we just select a sub-set of current_batch:

This guarantees that all of the rows with the max saved_timestamp that we return will be included in the batch. Then, the next batch will be safe to start processing at anything greater than the max saved_timestamp value we have already processed...

I don't understand this... what if the batch size is less than the number of rows in the source table with the same saved_timestamp? How does it avoid the problem described above, where it always selects the same batch because the max(saved_timestamp) in the target table doesn't change?
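
A hypothetical example of the case being asked about here (values invented): if every row in current_batch carries the same timestamp, the proposed filter returns nothing.

    -- current_batch: (doc_a, t1), (doc_b, t1), (doc_c, t1), batch_size = 3
    SELECT * FROM current_batch
    WHERE saved_timestamp < (SELECT MAX(saved_timestamp) FROM current_batch)
    -- every row equals the max, so zero rows come back; nothing is inserted,
    -- max(saved_timestamp) in the target table never changes, and every
    -- subsequent run selects the same stalled batch.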

@witash witash changed the title feat(#190): version 2: add mulit-db, batching and tags feat(#190): version 2: add multi-db, batching and tags Mar 7, 2025