Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

The data-diff shows incorrect row differences between Postgres and BigQuery table #532

Closed
chandu-chai opened this issue Apr 27, 2023 · 5 comments
Labels
bug Something isn't working non-dbt Use cases outside of dbt stale_immune Immunity to stale bot triage

Comments

@chandu-chai
Copy link

chandu-chai commented Apr 27, 2023

I ran the following command to use data-diff for a comparison of tables between PostgreSQL and BigQuery:
Make sure to include the following (minus sensitive information):

        "postgresql://<username>:<password>@<db_host>:<port>/<database>" \
        application \
        "bigquery://<project_name>/<dataset_name>" \
        application \
        -k id -c id \
        --debug

The primary key column in both tables is id, with a format of application-xxxxxxxxYYYYYY-zzzz (e.g., application-2gXxHwCux0HqFNsV721hvTPhJRa97A-0dJgh).

Issue:
The data-diff output showed a difference of >500 rows between the two tables, while the actual difference is only around 4 records.

Further Details:
PostgreSQL version: PostgreSQL 13.4.

After the segment split, data-diff prepares the following queries for both Postgres and BigQuery (multiple queries like this are created for each segment):

PostgreSQL query (generated by data-diff):

SELECT count(*), sum(('x' || substring(md5("id"::varchar), 18))::bit(60)::bigint) FROM "application" WHERE ("id" >= 'application-2gIxHwCux0HqFNSV727hvTPhJRa97A-AAAAA') AND ("id" < 'application-4-xuP3G1v 6g238AtkhFZAHY843gchXXXXX')

BigQuery query (generated by data-diff):

SELECT count(*), sum(cast(cast( ('0x' || substr(TO_HEX(md5(cast(`id` as string))), 18)) as int64) as numeric)) FROM `<project>`.`<dataset>`.`application` WHERE (`id` >= 'application-2gIxHwCux0HqFNSV727hvTPhJRa97A-AAAAA') AND (`id` < 'application-4-xuP3G1v 6g238AtkhFZAHY843gchXXXXX')

Upon running these queries separately on Postgres and BigQuery, there is a significant difference in row count.

@skmanchala
Copy link

skmanchala commented May 5, 2023

where can we find the code which generates this queries ?

@dlawin dlawin added bug Something isn't working non-dbt Use cases outside of dbt labels May 8, 2023
@chandu-chai
Copy link
Author

I found it when I run the data-diff command using debug mode.

@github-actions
Copy link
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label Jul 17, 2023
@dlawin dlawin added stale_immune Immunity to stale bot and removed stale Issues/PRs that have gone stale labels Jul 19, 2023
@dlawin
Copy link
Contributor

dlawin commented Jul 19, 2023

@chandu-chai thanks for reporting this, which version of data-diff are you using?

@glebmezh
Copy link
Contributor

Hi @chandu-chai,

I'm sorry for the delay in following up on this. Thank you for taking the time to raise this issue!

We made a hard decision to sunset the data-diff package and won't provide further development or support.

If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package.

-Gleb

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working non-dbt Use cases outside of dbt stale_immune Immunity to stale bot triage
Projects
None yet
Development

No branches or pull requests

4 participants