Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Support key columns of any data type #539

Closed
mariahjrogers opened this issue Apr 28, 2023 · 6 comments
Closed

Support key columns of any data type #539

mariahjrogers opened this issue Apr 28, 2023 · 6 comments
Labels
--dbt Issues/features related to the dbt integration enhancement New feature or request stale_immune Immunity to stale bot

Comments

@mariahjrogers
Copy link

Is your feature request related to a problem? Please describe.
I am trying to run a diff on a table without one single primary key column, so I'm passing in multiple key columns. Some of these columns are numeric, some are timestamps, and some are VARCHARs. I get an error thrown by the tool when it encounters the TIMESTAMP column: ERROR - Cannot use a column of type Timestamp(precision=6, rounds=True) as a key. I would happily pass it in as a VARCHAR, but I can't modify the type of the field in the table directly, so I tried passing <tstamp_field>::VARCHAR as the key column value, and then got this error: ERROR - Column 'date_day::varchar' not found in table 1.

Describe the solution you'd like
Data-diff should support key columns of any type and handle different types behind the scenes.

Failing that, it would be great to be able to pass in a column with a type cast and not get a "column not found" error.

Describe alternatives you've considered
One alternative to be able to run my diff with non-VARCHAR key columns would be for me to change the types of the columns in my production table, which isn't an acceptable alternative.

Another alternative would be to generate a new column which is a composite key, but this also isn't an acceptable alternative because it requires first modifying the production version of the table before being able to run a diff against it, when the whole point is that I want to diff my changes before making any impact on the production table.

Additional context

@mariahjrogers mariahjrogers added the enhancement New feature or request label Apr 28, 2023
@dlawin dlawin added the --dbt Issues/features related to the dbt integration label May 2, 2023
@koenvb
Copy link

koenvb commented May 31, 2023

Hi I am using data-diff to mainly diff tables between a production database and a test database.
In this case the scenario is an ERP system which has tables with following structures.

objid , objkey, name (Text) , description (Text)

So ofcourse the objid values are different between the two databases but the Name column is a column which is unique for the system.

So ideally I would like to use the name column as my key column so that I can compare the tables between the two systems.

Currently this is not possible, but the system guarantees that the name column really acts like a key column even though it is just a Text() column.

So what would be a good approach to solve this and diff the tables by "forcing" data-diff to treat the column as an id so somehow overrule it to say as a user, yes, I can assure you that this column is a key column?

Is there anything particular you need with regards to key columns that you do not accept TEXT() as a key column?

Thanks!

@dlawin
Copy link
Contributor

dlawin commented May 31, 2023

@mariahjrogers @koenvb which database providers are you encountering this with?

@koenvb
Copy link

koenvb commented Jun 2, 2023

@dlawin For my use case I am using the oracle connection. I fetch it straight from the erp system.

@pthompson-qontigo
Copy link

I just started using data-diff to compare data across Postgres databases.
I have the same issue as above in that the unique key on some tables is one or more varchar columns.
data-diff gives an error "Cannot use a column of type Text() as a key"
This is not an uncommon scenario, even if the table has a surrogate id key, but the comparison needs to be on textual data. How can this be done with data-diff?

@dlawin
Copy link
Contributor

dlawin commented Oct 20, 2023

Context on why this is not supported yet:

We currently support certain strings as PKs. They need to be alphanums with a fixed length (and only these characters)
-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

The reason for this is that we need to be able to perform arithmetic operations on the keys in order to segment and bisect the tables when the database is of type ThreadedDatabase. It’s possible but not trivial to implement this for arbitrary types

For fixed length alphanums, you can see we convert them to numbers in order to support them as keys

class ArithAlphanumeric(ArithString):

@glebmezh
Copy link
Contributor

Hi all!

I'm sorry for the delay in following up on this. @mariahjrogers thank you for taking the time to raise this issue!

We made a hard decision to sunset the data-diff package and won't provide further development or support.

If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package's diffing algorithm.

-Gleb

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
--dbt Issues/features related to the dbt integration enhancement New feature or request stale_immune Immunity to stale bot
Projects
None yet
Development

No branches or pull requests

5 participants