feat(#190): version 2: add multi-db, batching and tags #192
base: main
Conversation
I haven't spent enough time with the code prior to v1 to be confident doing an in-depth code review. I'll defer to @jkuester for any deep code dives!
I'll be sure to do a fair amount of black-box testing in the docs PR I'm working on, and I suggest we wait until both PRs are approved before merging. For example, there's no upgrade path documented, and I'm not sure whether that will impose additional changes on this v2 PR.
Otherwise, approving to unblock!
Very nice! Just one big question about our batching logic. At the end of the day, duplicating fewer than 1000 rows in a 100_000-row batch is probably not too big of a deal, but maybe we can avoid it....
-- union the CTE with a query getting all records with equal timestamp
SELECT * FROM {{ source_cte_name }}
WHERE saved_timestamp = {{ max_existing_timestamp('saved_timestamp') }}
question: it seems like this is going to end up sometimes selecting duplicate records (that were processed in a previous batch). Is this okay?
For example, in your test_document_metadata_batch_size_same_timestamp test case for the document_metadata model, the in_batch_4 record is the last one processed. But at the end of that batch, my understanding is that the timestamp for that row (2024-08-03 00:00:00.000) becomes the saved_timestamp for the next run. And so, the next batch will contain not only the not_in_batch_5 record, but also in_batch_4, because they have the same timestamp.
I think these processes are idempotent, so maybe the only downside is a bit of wasted performance on rows that happen to have a duplicate timestamp... apart from the worst case scenario where more docs exist for the timestamp than the batch_size. However, if I am understanding this correctly, the saved_timestamp column is populated with the time that couch2pg wrote the data, and that happens in batches of up to 1000. So, there is a reasonable likelihood of a good number of rows sharing the same timestamp value (but not more than 1000).
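Just to illustrate, something like this would show how many rows end up sharing each saved_timestamp (the table name here is only a placeholder for whatever couch2pg writes to):
-- purely illustrative; source_table stands in for the real couch2pg table
SELECT saved_timestamp, COUNT(*) AS docs_with_timestamp
FROM source_table
GROUP BY saved_timestamp
ORDER BY docs_with_timestamp DESC
LIMIT 10;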
I wonder if it would be possible to reverse this logic and, instead of being greedy on the low end, be "stingy" on the top end of the batch. So, instead of doing the UNION of selects here, we just select a sub-set of current_batch:
SELECT * FROM current_batch
WHERE saved_timestamp < (SELECT MAX(saved_timestamp) FROM current_batch)
This guarantees that all of the rows with the max saved_timestamp that we return will be included in the batch. Then, the next batch will be safe to start processing at anything greater than the max saved_timestamp value we have already processed...
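Roughly what I am picturing for the whole batch selection, as a sketch (the table names, the current_batch CTE, and the literal batch size are placeholders rather than the actual macro internals):
-- sketch only: relation names and the LIMIT value are placeholders
WITH current_batch AS (
    SELECT *
    FROM source_table
    WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM target_table)
    ORDER BY saved_timestamp
    LIMIT 100000
)
SELECT * FROM current_batch
WHERE saved_timestamp < (SELECT MAX(saved_timestamp) FROM current_batch);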
> question: it seems like this is going to end up sometimes selecting duplicate records (that were processed in a previous batch). Is this okay?
Yes, this is ok, and currently happens regardless of any batching; rows in the source table that have the same saved_timestamp as max(saved_timestamp) in the target table can be selected multiple times. It ends up deleting them from the target table and re-inserting them, so treating them like updates. It's a little messy, but there's no possibility for duplicate data.
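For the re-selected rows, the effect is roughly this (a simplified sketch; the actual dbt-generated SQL, the uuid key column, and the new_batch name are assumptions):
-- simplified sketch of the delete + re-insert behaviour for re-selected rows;
-- 'uuid' as the unique key and the relation names are assumptions
DELETE FROM document_metadata
WHERE uuid IN (SELECT uuid FROM new_batch);
INSERT INTO document_metadata
SELECT * FROM new_batch;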
> worst case scenario where more docs exist for the timestamp than the batch_size
Yes, this is what the UNION with all rows with the same timestamp handles. If the batch size is less than the number of rows in the source table with the same saved_timestamp, then just selecting all rows in the source table whose timestamp is greater than or equal to the max saved timestamp in the target table, and then limiting the results, would result in the same rows being selected each time, ignoring new rows added to the source table. The max timestamp in the target table would never change.
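Spelled out as a rough sketch (table names and the LIMIT value are placeholders, and the real macro may differ in details like > vs >= and where the limit is applied):
-- naive version: if more than batch_size rows share the max timestamp, every
-- run returns the same rows and the target's MAX(saved_timestamp) never moves
SELECT * FROM source_table
WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM target_table)
ORDER BY saved_timestamp
LIMIT 100000;

-- approach with the UNION: limit only the strictly newer rows, and always pull
-- in all rows sharing the current max timestamp so none of them can be skipped
(
    SELECT * FROM source_table
    WHERE saved_timestamp > (SELECT MAX(saved_timestamp) FROM target_table)
    ORDER BY saved_timestamp
    LIMIT 100000
)
UNION
SELECT * FROM source_table
WHERE saved_timestamp = (SELECT MAX(saved_timestamp) FROM target_table);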
It is an edge case because, if this batch_size is small compared to the batch size that couch2pg inserts, it's going to have other problems (not being able to catch up to couch2pg).
> I wonder if it would be possible to reverse this logic and, instead of being greedy on the low end, be "stingy" on the top end of the batch. So, instead of doing the UNION of selects here, we just select a sub-set of current_batch:
> This guarantees that all of the rows with the max saved_timestamp that we return will be included in the batch. Then, the next batch will be safe to start processing at anything greater than the max saved_timestamp value we have already processed...
I don't understand this... What if the batch size is less than the number of rows in the source table with the same saved_timestamp? How does it avoid the problem described above, where it always selects the same batch because the max(saved_timestamp) in the target table doesn't change?
Co-authored-by: Joshua Kuestersteffen <[email protected]>
This is ready for review.
It wasn't really possible to separate the issues included, so I added one big PR instead of several small ones.
It contains:
- a batch_incremental macro to allow for small batches for document_metadata