This repository has been archived by the owner on May 17, 2024. It is now read-only.

Support for NoSQL/document-based databases? #152

Closed
ashrielbrian opened this issue Jul 6, 2022 · 10 comments
Labels
new-db-driver Request to add a new database driver triage

Comments

@ashrielbrian

We do a lot of data ingestion/syncing from Firestore and MongoDB into BigQuery. Any plans in the roadmap for these kinds of DBs? I assume in order to support other database types, they would need to support running hashing algorithms (md5, sha) natively.

@mrn3

mrn3 commented Jul 6, 2022

So ideally, they support querying with an MD5 function. But, if they don't, you can still get chunks of data and do the MD5 on it outside of the database - it just isn't as fast/efficient. I have successfully done this with NoSQL databases, including MongoDB and Elasticsearch. I am hoping once we establish a pattern for these, we can get MongoDB and Elasticsearch on the shortlist, because I think they will be some of the most popular (as well as DynamoDB, and Firestore and BigQuery as you mention).
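A minimal sketch of the client-side approach described above, assuming the documents have already been fetched in a stable order (for example sorted by `_id`) and are plain dictionaries. The helper names `chunked`, `chunk_md5`, and `chunk_digests` are hypothetical and are not part of data-diff or of any database driver.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive lists of at most `size` documents."""
    chunk: List[Dict[str, Any]] = []
    for doc in docs:
        chunk.append(doc)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def chunk_md5(chunk: List[Dict[str, Any]]) -> str:
    """MD5 over a chunk, computed outside the database.

    Each document is serialized with sorted keys so that field order
    does not change the digest. `default=str` is a crude fallback for
    values json cannot encode (an assumption, not a recommendation).
    """
    digest = hashlib.md5()
    for doc in chunk:
        line = json.dumps(doc, sort_keys=True, separators=(",", ":"), default=str)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def chunk_digests(docs: Iterable[Dict[str, Any]], size: int = 1000) -> List[str]:
    """One digest per chunk; only these need to be compared across stores."""
    return [chunk_md5(chunk) for chunk in chunked(docs, size)]
```

Comparing the two digest lists position by position only makes sense when both sides use the same ordering and the same chunk boundaries. Chunking by key range instead of by count would be more robust against missing rows.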

@nolar
Collaborator

nolar commented Jul 6, 2022

Hello. Thanks for asking. We have this task on our roadmap, but it is not a high priority and sits closer to the end of the list: first, because we assume this is a rare use case; and second, because we lack users to test it.

Yes, you are right: the absence of hashes on the server side is the most significant challenge there. It breaks the main trick that makes data-diff fast on big data. See also #51.
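For context, the trick mentioned here can be illustrated roughly as follows: checksum a whole key range on each side, and split the range further only where the checksums disagree. This is a simplified in-memory sketch, not data-diff's actual code. The tables are stand-in dictionaries, and `segment_checksum`, `diff_segment`, and `bisection_threshold` are names assumed for the illustration. In a real run, the checksum step would be a query executed inside each database, so that only digests cross the network.

```python
import hashlib
from typing import Dict, List


def segment_checksum(table: Dict[int, str], keys: List[int]) -> str:
    """Stand-in for a server-side checksum over one key segment."""
    digest = hashlib.md5()
    for key in keys:
        if key in table:
            digest.update(f"{key}|{table[key]}\n".encode("utf-8"))
    return digest.hexdigest()


def diff_segment(
    a: Dict[int, str],
    b: Dict[int, str],
    keys: List[int],
    bisection_threshold: int = 4,
) -> List[int]:
    """Return the keys in `keys` whose rows differ between `a` and `b`."""
    # Matching checksums: the whole segment is skipped without reading rows.
    if segment_checksum(a, keys) == segment_checksum(b, keys):
        return []
    # Small segment: fall back to comparing the rows themselves.
    if len(keys) <= bisection_threshold:
        return [k for k in keys if a.get(k) != b.get(k)]
    # Otherwise split the segment and recurse into both halves.
    mid = len(keys) // 2
    return (
        diff_segment(a, b, keys[:mid], bisection_threshold)
        + diff_segment(a, b, keys[mid:], bisection_threshold)
    )


if __name__ == "__main__":
    table_a = {i: f"v{i}" for i in range(100)}
    table_b = dict(table_a)
    table_b[42] = "changed"
    del table_b[7]
    all_keys = sorted(set(table_a) | set(table_b))
    print(diff_segment(table_a, table_b, all_keys))
```

Without a hash function on the server, `segment_checksum` has to read every row in the segment over the network, which is the cost discussed in this thread.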

However, there is good news: we are planning to rework the database connection machinery sooner rather than later; this task is closer to the top of our list. With that, users will be able to write their own custom connectors that calculate the hashes "somehow" (including by downloading the full dataset locally). This might partially solve the problem.

We are discussing the roadmap and priorities right now; it will become clearer in the coming days or weeks. I will include your use case there.

@ashrielbrian
Author

ashrielbrian commented Jul 6, 2022

Thanks for the replies @mrn3 @nolar!

Yeah - I thought of having the hashes running locally, but then that defeats the purpose of data-diff's performance advantage, with only hashes being transmitted over the network, as opposed to all the rows.

Another issue I can think of is the lack of schema/enforced data types in NoSQL documents, which I understand is important when comparing hashes.
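The type problem raised above can be made concrete with a small sketch: the same logical value stored as `1`, `1.0`, or a decimal produces different hashes unless both sides first agree on a canonical text form. The `canonical` and `row_md5` helpers below are hypothetical, and the chosen representations (empty string for null, UTC ISO timestamps, naive datetimes treated as UTC) are assumptions made for illustration only.

```python
import hashlib
from datetime import datetime, timezone
from decimal import Decimal
from typing import Any, Mapping, Sequence


def canonical(value: Any) -> str:
    """Map a loosely typed value to one agreed text representation."""
    if value is None:
        return ""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float, Decimal)):
        # 1, 1.0 and Decimal("1.00") all become "1"
        return format(Decimal(str(value)).normalize(), "f")
    if isinstance(value, datetime):
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return value.astimezone(timezone.utc).isoformat()
    return str(value)


def row_md5(row: Mapping[str, Any], columns: Sequence[str]) -> str:
    """MD5 of one row over an explicit column list; missing fields hash as null."""
    # Note: "|" as a separator can collide if values themselves contain "|".
    text = "|".join(canonical(row.get(column)) for column in columns)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

With this in place, a document `{"id": 1, "amount": 10.0}` and a warehouse row `{"id": 1.0, "amount": Decimal("10")}` hash identically over the columns `["id", "amount"]`.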

As for it being a rare use-case, our application team handles millions of users on document-based DBs, and we ingest them into columnar-style data warehouses for our analytics teams. So I am convinced it's not that rare of a use-case. Hope to see it pushed higher up the list, as I'm thoroughly impressed by the cleverness of this tool. Thanks again for all your hard work and making this open source!

@erezsh
Contributor

erezsh commented Jul 6, 2022

Just a note - it may be possible to infer the schema from looking at the data, at least in some cases. We currently do this to detect UUIDs, which the schema usually reports as varchar.
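As a rough illustration of inferring a type from the data, the sketch below samples a column's values and reports `"uuid"` only if every non-null sample is a canonical hyphenated UUID string. This is not data-diff's actual detection logic; `looks_like_uuid` and `infer_column_type` are hypothetical names, and the "all samples must match" rule is an assumption.

```python
import uuid
from typing import Any, Iterable


def looks_like_uuid(value: Any) -> bool:
    """True for strings in canonical 8-4-4-4-12 hexadecimal form."""
    if not isinstance(value, str):
        return False
    try:
        # uuid.UUID also accepts braces, URNs and un-hyphenated hex,
        # so compare against the canonical rendering to be strict.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def infer_column_type(samples: Iterable[Any]) -> str:
    """Return "uuid" if all non-null samples are UUIDs, else "text"."""
    non_null = [s for s in samples if s is not None]
    if non_null and all(looks_like_uuid(s) for s in non_null):
        return "uuid"
    return "text"
```

Sampling can misclassify a column whose unsampled rows hold other values, so an inferred type is best treated as a hint that the user can override.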

But I agree, having to download all the rows would nullify the usefulness of this tool.

@mrn3

mrn3 commented Jul 6, 2022

So I have done this with MongoDB and Elasticsearch in the past using functions in these files (the Leo Platform is open source):

https://github.com/LeoPlatform/connectors/blob/master/elasticsearch/lib/checksum.js
https://github.com/LeoPlatform/connectors/blob/master/mongo/lib/checksum.js

As far as the schema goes, you can infer it, but we just provided configuration (and even transformation functions) as part of the input. All you are really trying to do is take a chunk of data from a source system, run it through the same transformations that your pipeline does, do an MD5 on it, and then compare it to the MD5 on the same data in the target data store.
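The flow described above could look something like the following sketch. It does not reproduce the linked Leo Platform code; the `TRANSFORMS` mapping, the field names, and the helper functions are all made up for illustration. The idea is that the caller supplies one transformation function per target column, mirroring what the ingestion pipeline does.

```python
import hashlib
from typing import Any, Callable, Dict, List, Mapping, Sequence

Transform = Callable[[Mapping[str, Any]], Any]

# Hypothetical configuration: target column -> function of the source document.
TRANSFORMS: Dict[str, Transform] = {
    "id": lambda doc: str(doc["_id"]),
    "email": lambda doc: doc["email"].strip().lower(),
    "age": lambda doc: int(doc["age"]),
}


def apply_transforms(
    docs: Sequence[Mapping[str, Any]], transforms: Dict[str, Transform]
) -> List[Dict[str, Any]]:
    """Reshape source documents into the rows the pipeline would produce."""
    return [{col: fn(doc) for col, fn in transforms.items()} for doc in docs]


def rows_md5(rows: Sequence[Mapping[str, Any]], columns: Sequence[str]) -> str:
    """MD5 over a chunk of rows, using a fixed column order."""
    digest = hashlib.md5()
    for row in rows:
        digest.update("|".join(str(row[c]) for c in columns).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def chunk_matches(
    source_docs: Sequence[Mapping[str, Any]],
    target_rows: Sequence[Mapping[str, Any]],
    transforms: Dict[str, Transform],
) -> bool:
    """Compare a transformed source chunk with the same chunk in the target."""
    columns = list(transforms)
    source_hash = rows_md5(apply_transforms(source_docs, transforms), columns)
    return source_hash == rows_md5(target_rows, columns)
```

When the target store supports MD5 in queries, `rows_md5` on the target side can be replaced by a server-side checksum, provided it concatenates the columns in the same way.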

While doing the MD5 outside of the database is quite a bit slower, I wouldn't say doing it "nullifies the usefulness of this tool". There is still a lot of value in the tool, even if it runs slower when using data stores that don't support an MD5 function in database queries.

I also agree that this isn't a "rare use case". Most of our primary data stores we want to compare to Snowflake and Elasticsearch are MongoDB, so we have quite a bit of "NoSQL" to deal with.

@erezsh erezsh added the new-db-driver Request to add a new database driver label Jul 20, 2022
@menzenski

Also very interested in this use case. We have MongoDB and AWS DocumentDB source systems that we would love to be able to diff against Redshift.

@mrn3

mrn3 commented Mar 10, 2023

Yeah it definitely seems like MongoDB will come up a lot since it is pretty widely used.

@github-actions
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label May 28, 2023
@github-actions
Contributor

github-actions bot commented Jun 4, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.

@github-actions github-actions bot closed this as completed Jun 4, 2023
@menzenski

Commenting to re-open this issue - still very interested in this use case

@github-actions github-actions bot added triage and removed stale Issues/PRs that have gone stale labels Mar 13, 2024

5 participants