seemingly incorrect "ValueError: Duplicate primary keys" error #527

leoebfolsom · 2023-04-24T23:31:11Z

Describe the bug

I was testing out the tool on a model that should have essentially zero diff against the production model, just one column modified slightly. It was more a POC to see the tool in action than anything. But I am getting a “ValueError: Duplicate primary keys” error. My table in both prod and in dev passes my unique_key test on the same primary key column that I specified for the data-diff configuration. Can someone help me figure out if I’m doing something wrong here?

This was reported in the dbt Slack: https://getdbt.slack.com/archives/C03D25A92UU/p1682373672200739

I'll request the user add additional info.

Make sure to include the following (minus sensitive information):

The command or code you used
The run output + error you're getting. (including tracestack)
Run data-diff with the -d switch for extra debug information.

If possible, please paste these as text, and not a screenshot.

Describe the environment

Describe which OS you're using, which data-diff version, and any other information that might be relevant to this bug.


dlawin · 2023-04-25T17:51:36Z

This is being caused by a disconnect between dbt's definition of "unique" and data-diff's definition.

dbt's uniqueness test (note that it excludes nulls):

select
    key as unique_field,
    count(*) as n_records

from DEV.DBT_DEV_DAN.some_table
where key is not null
group by key
having count(*) > 1

Data-diff (nulls are counted):

SELECT count(*) AS "total"
, count(distinct coalesce(cast("ID" as string), '<null>')) AS "total_distinct" 
FROM "INTEGRATION"."BEERS_TEST"."SIMPLE_EXAMPLE"

If "total" != "total_distinct" -> throw a duplicate PK error
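To make the mismatch concrete, here is a minimal sketch that runs both checks against the same data. It uses an in-memory SQLite table rather than Snowflake, and the table name, column names, and rows are made up for illustration; the two queries mirror the ones above, with `cast(... as text)` standing in for `cast(... as string)`.

```python
import sqlite3

# Hypothetical table: two distinct non-null keys plus two rows whose key is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(1, "a"), (2, "b"), (None, "c"), (None, "d")],
)

# dbt-style uniqueness test: NULL keys are filtered out before grouping,
# so any returned row would be a failure.
dbt_failures = conn.execute(
    """
    SELECT id AS unique_field, count(*) AS n_records
    FROM some_table
    WHERE id IS NOT NULL
    GROUP BY id
    HAVING count(*) > 1
    """
).fetchall()

# data-diff-style check: NULL keys are coalesced to a single '<null>' value,
# so two NULL rows collapse into one distinct key.
total, total_distinct = conn.execute(
    """
    SELECT count(*) AS total,
           count(DISTINCT coalesce(cast(id AS TEXT), '<null>')) AS total_distinct
    FROM some_table
    """
).fetchone()

print("dbt test failures:", dbt_failures)
print("total:", total, "total_distinct:", total_distinct)
```

With this data the dbt-style test returns no failing rows, while the second query reports 4 total rows but only 3 distinct keys, which is the condition that triggers the duplicate PK error.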

Options I see:

Warn that there are null PKs and skip diffing them?
Throw a more verbose error "Detected null PK(s)" or similar

kylemcnair · 2023-04-25T18:11:30Z

@dlawin

> Options I see:
> Warn that there are null PKs and skip diffing them?
> Throw a more verbose error "Detected null PK(s)" or similar

I might vote for something like:

"null primary keys detected, skipping records with null keys"

[run]

and include a count of records with null keys in the output

dlawin · 2023-04-25T20:30:18Z

@kylemcnair

> Options I see:
> Warn that there are null PKs and skip diffing them?
> Throw a more verbose error "Detected null PK(s)" or similar
>
> I might vote for something like:
>
> "null primary keys detected, skipping records with null keys"
>
> [run]
>
> and include a count of records with null keys in the output

My only hesitation with that is modifying the existing behavior for non --dbt joindiffs

It's probably a fairly rare edge case, but I think we should add an error_on_null boolean to joindiff so that we can leave the behavior as is outside of --dbt
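A rough sketch of how such a flag could behave, not data-diff's actual implementation: the function name `check_null_keys`, its signature, and the SQLite table are all hypothetical, and only the `error_on_null` name and the message wording come from this thread.

```python
import sqlite3
import warnings


def check_null_keys(conn, table, key, error_on_null=True):
    """Count rows with a NULL key; raise or warn depending on error_on_null.

    Hypothetical helper for illustration. Table and column names are
    interpolated directly, so only pass trusted identifiers.
    """
    (n_null,) = conn.execute(
        f'SELECT count(*) FROM "{table}" WHERE "{key}" IS NULL'
    ).fetchone()
    if n_null and error_on_null:
        # Fail loudly with a more specific message than "Duplicate primary keys".
        raise ValueError(f"Detected {n_null} null primary key(s) in {table}")
    if n_null:
        # Proposed --dbt behavior: warn, report the count, and keep going.
        warnings.warn(
            f"null primary keys detected, skipping {n_null} records with null keys"
        )
    return n_null


# Made-up data: two rows have a NULL key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(1, "a"), (2, "b"), (None, "c"), (None, "d")],
)

skipped = check_null_keys(conn, "some_table", "id", error_on_null=False)
print("rows skipped because of null keys:", skipped)
```

Defaulting `error_on_null` to `True` in this sketch keeps the strict behavior unless a caller opts out, which matches the idea of only relaxing it for --dbt runs.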

mariahjrogers · 2023-04-26T23:27:51Z

FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!

kylemcnair · 2023-04-27T16:26:33Z

> FWIW in the --dbt case I would personally prefer to have it still run the diff and skip or ignore rows with null PK values, because if I'm comparing local changes against something already in production, and it is the version in production which is already messed up with null PK values, I may not have the ability to "fix" it in order to run the diff properly. Does that make sense? In any case, I think this aligns with the direction the thread seems to be going!

@mariahjrogers That makes sense to me and I think I agree. I'd rather get some data diffed (with a warning about my nulls) than have the whole thing fail.

@dlawin I agree about making this --dbt specific

dlawin · 2023-08-28T16:07:50Z

A couple of users are reporting that this isn't working for compound keys

dlawin added the --dbt label Apr 25, 2023

dlawin added the enhancement and bug labels Apr 25, 2023

dlawin mentioned this issue May 17, 2023

continue --dbt diff when null PKs exist #585

Merged

dlawin self-assigned this May 17, 2023

dlawin closed this as completed in #585 May 18, 2023

dlawin mentioned this issue May 19, 2023

Release v0.7.8 #588

Closed

dlawin reopened this Aug 28, 2023

github-actions bot added the triage label Aug 28, 2023

dlawin added the stale_immune label and removed the triage label Aug 28, 2023

glebmezh closed this as completed May 17, 2024
