Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

seemingly incorrect "ValueError: Duplicate primary keys" error #527

Closed
leoebfolsom opened this issue Apr 24, 2023 · 6 comments · Fixed by #585
Closed

seemingly incorrect "ValueError: Duplicate primary keys" error #527

leoebfolsom opened this issue Apr 24, 2023 · 6 comments · Fixed by #585
Assignees
Labels
--dbt Issues/features related to the dbt integration bug Something isn't working enhancement New feature or request stale_immune Immunity to stale bot

Comments

@leoebfolsom
Copy link
Contributor

leoebfolsom commented Apr 24, 2023

Describe the bug

I was testing out the tool on a model that should have essentially zero diff against the production model, just one column modified slightly. It was more a POC to see the tool in action than anything. But I am getting a “ValueError: Duplicate primary keys” error. My table in both prod and in dev passes my unique_key test on the same primary key column that I specified for the data-diff configuration. Can someone help me figure out if I’m doing something wrong here?

This was reported in the dbt Slack: https://getdbt.slack.com/archives/C03D25A92UU/p1682373672200739

I'll request the user add additional info.


Make sure to include the following (minus sensitive information):

  • The command or code you used
  • The run output + error you're getting. (including tracestack)
  • Run data-diff with the -d switch for extra debug information.

If possible, please paste these as text, and not a screenshot.

Describe the environment

Describe which OS you're using, which data-diff version, and any other information that might be relevant to this bug.

@dlawin
Copy link
Contributor

dlawin commented Apr 25, 2023

This is being caused by a disconnect between dbt's definition of "unique" and data-diff's definition.

Dbt's uniqueness test (note that it excludes nulls):

select
    key as unique_field,
    count(*) as n_records

from DEV.DBT_DEV_DAN.some_table
where key is not null
group by key
having count(*) > 1

Data-diff (nulls are counted):

SELECT count(*) AS "total"
, count(distinct coalesce(cast("ID" as string), '<null>')) AS "total_distinct" 
FROM "INTEGRATION"."BEERS_TEST"."SIMPLE_EXAMPLE"

If "total" != "total_distinct" -> throw a duplicate PK error

Options I see:

  • Warn that there are null PKs and skip diffing them?
  • Throw a more verbose error "Detected null PK(s)" or similar

@dlawin dlawin added the --dbt Issues/features related to the dbt integration label Apr 25, 2023
@kylemcnair
Copy link
Contributor

@dlawin

Options I see:
Warn that there are null PKs and skip diffing them?
Throw a more verbose error "Detected null PK(s)" or similar

I might vote for something like:

"null primary keys detected, skipping records with null keys"

[run]

and include a count of records with null keys in the output

@dlawin
Copy link
Contributor

dlawin commented Apr 25, 2023

@kylemcnair

Options I see:
Warn that there are null PKs and skip diffing them?
Throw a more verbose error "Detected null PK(s)" or similar

I might vote for something like:

"null primary keys detected, skipping records with null keys"

[run]

and include a count of records with null keys in the output

My only hesitation with that is modifying the existing behavior for non --dbt joindiffs

Probably fairly edge case, but I think we should add a error_on_null boolean to joindiff so that we can leave it as is outside of --dbt

@dlawin dlawin added enhancement New feature or request bug Something isn't working labels Apr 25, 2023
@mariahjrogers
Copy link

FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!

@kylemcnair
Copy link
Contributor

FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!

@mariahjrogers That makes sense to me and I think I agree. I'd rather get some data diffed (with a warning about my nulls) than have the whole thing fail.

@dlawin I agree about making this --dbt specific

@dlawin
Copy link
Contributor

dlawin commented Aug 28, 2023

Have a couple users reporting this isn't working for compound keys

@dlawin dlawin reopened this Aug 28, 2023
@dlawin dlawin added stale_immune Immunity to stale bot and removed triage labels Aug 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
--dbt Issues/features related to the dbt integration bug Something isn't working enhancement New feature or request stale_immune Immunity to stale bot
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants