Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Make all format options available when using --dbt #409

Closed
dbeatty10 opened this issue Feb 24, 2023 · 18 comments · Fixed by #857
Closed

Make all format options available when using --dbt #409

dbeatty10 opened this issue Feb 24, 2023 · 18 comments · Fixed by #857
Assignees
Labels
--dbt Issues/features related to the dbt integration enhancement New feature or request stale_immune Immunity to stale bot

Comments

@dbeatty10
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Currently, the --dbt flag uses a different output format than non- --dbt executions of data-diff.

Describe the solution you'd like
Ideally, regular data-diff and data-diff --dbt would have the same output format. Additionally, --json and --stats should be possible when using --dbt.

Describe alternatives you've considered
It's an option for --dbt to have completely different formatting and for the --json and --stats flags to not have any effect. But that alternative makes dolphins cry 🐬 😢

Additional context
I'll submit a PR!

@dlawin
Copy link
Contributor

dlawin commented Mar 6, 2023

Ideally, regular data-diff and data-diff --dbt would have the same output format

I'm not sure I agree as we are targeting two very diff erent uses cases in these two features, so I do think it makes sense for the output to not align by default

Migrations (primary data-diff use case) vs. dbt model development. But 100% we need parity across features around the output options (not to mention the other diff options flags)

It's an option for --dbt to have completely different formatting and for the --json and --stats flags to not have any effect. But that alternative makes dolphins cry 🐬 😢

lol true

@williebsweet curious on your thoughts here

@dlawin dlawin added the --dbt Issues/features related to the dbt integration label Mar 6, 2023
@dlawin
Copy link
Contributor

dlawin commented Mar 6, 2023

and @leoebfolsom @kylemcnair

@dbeatty10
Copy link
Contributor Author

we are targeting two very diff erent uses cases in these two features

😂

The output format of regular data-diff felt familiar and comfortable due to its similarity with the git diff (and just diff) that I do regularly in the terminal. But when I used data-diff --dbt, I didn't really know what to do with the output.

At minimum, I'd like to be able to configure the output format via the --json and --stats flags (along with a flag for the default format) when using --dbt.

But interested to hear everyone's thoughts on output formats that would be most useful during dbt model development!

In particular, how are the needs of Mary the Migrator and Dylan the dbt Developer similar and how are their needs different? Is there particular diff output one persona needs that the other doesn't? Is there any overlap?

@github-actions
Copy link
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label May 10, 2023
@dbeatty10
Copy link
Contributor Author

@dlawin could we mark this as un-stale?

My feature proposal is to achieve parity across the output options (when using the --dbt option). I'll provide examples of all three below.

data-diff --dbt --dbt-profiles-dir .
image
data-diff --dbt --dbt-profiles-dir . --json
image
data-diff --dbt --dbt-profiles-dir . --stats
image

@kylemcnair kylemcnair removed the stale Issues/PRs that have gone stale label May 10, 2023
@kylemcnair
Copy link
Contributor

@dlawin could we mark this as un-stale?

My feature proposal is to achieve parity across the output options (when using the --dbt option). I'll provide examples of all three below.

data-diff --dbt --dbt-profiles-dir .
image ```shell data-diff --dbt --dbt-profiles-dir . --json ``` image ```shell data-diff --dbt --dbt-profiles-dir . --stats ``` image

Hey @dbeatty10!
I removed the stale label - this seems like a great feature that we'd want to keep on the roadmap. We just added the stale workflow in an attempt to better wrangle older issues, please pardon the noise as we get it dialed.

@dbeatty10
Copy link
Contributor Author

No worries @kylemcnair -- stalebot is definitely handy to keep the backlog more manageable.

At the moment, the following is more of a link to interesting information rather than a feature request...

Another diff format to take a peek at is the one used by daff:

image

I believe it originally came from the coopy toolbox, and you can see it in action here:

A relevant git repo is here:

@leoebfolsom
Copy link
Contributor

@dbeatty10 I do really like the daff format, and believe it's a very good way to display row-level data diff info in CLI output, much better imo than the current "regular" (non-dbt) data-diff output. However, I echo what Kyle said above! We'll get there!

@github-actions
Copy link
Contributor

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label Jul 10, 2023
@dbeatty10
Copy link
Contributor Author

@kylemcnair would you be open to either removing stale or adding stale_immune for this one?

@github-actions github-actions bot added triage and removed stale Issues/PRs that have gone stale labels Jul 11, 2023
@dlawin dlawin added stale_immune Immunity to stale bot and removed triage labels Jul 11, 2023
@leoebfolsom
Copy link
Contributor

Hi @dbeatty10!

I want to add that the traditional git diff format gets pretty unwieldy when there are many columns and rows, which is of course common when developing in dbt.

I know this from experience when trying to use 'main' data-diff to validate dbt model (comparing prod to dev), which I used to do in the before times, prior to the dbt integration. The output of data-diff was hard to read and overwhelming.

I believe the output of 'main' data-diff was built this way because of two assumptions that don't make sense for the dbt use case:

  • that the user is primarily interested in added/removed rows based on PKs, without the intent to compare the data in every column
  • that there will not be a huge number of differences, and that most rows will remain identical between the tables being compared (which would make sense when comparing source to target in a migration).

I support the general goal of feature parity across outputs for these different use cases (Mary and Dylan), but it doesn't seem trivial or obvious to me to identify what the ideal common output structure would be. (As you mentioned above, there are distinctly different use cases.)

I just want to make sure this is taken into account before something is designed and shipped. Perhaps the best answer is high-level parity, as in: the two use cases should have both machine-readable and human-readable summarized output; and machine-readable value-level output; but, the exact structure and implementation might differ for the two use cases.

Finally, I'm curious if these recently shipped features influence your opinion on the direction and what should be prioritized:

@sungchun12 sungchun12 self-assigned this Jan 17, 2024
@sungchun12
Copy link
Contributor

sungchun12 commented Jan 17, 2024

@dbeatty10 I'm planning to take on this feature for the next couple weeks. Let me know if any of your opinions changed on the above!

Just a heads up this will likely turn into a cloud limited preview feature similar to the dbt show command as a dbt user would run this data-diff --dbt --cloud --preview and get something like the daff output above.

@sungchun12
Copy link
Contributor

@kylemcnair would love your specific input since you last commented on this thread!

@dbeatty10
Copy link
Contributor Author

@dbeatty10 I'm planning to take on this feature for the next couple weeks. Let me know if any of your opinions changed on the above!

Nice, @sungchun12 😎

The above comments still reflect my primary thoughts:

  1. When using the --dbt option, achieve parity across all the output options (--json, --stats, etc.)
  2. If adding a new output option, then a --daff option would be interesting to look at

@sungchun12
Copy link
Contributor

sungchun12 commented Jan 23, 2024

Alright, since this will be a pretty big change. I have a couple proposals in mind.

Scope:

  • This is specific to dbt use cases only and this PR is not expected to translate changes to xdb diffs
  • Value-level diffs are Datafold cloud-specific only
  • JSON outputs will store stats outputs, not value level diffs
  • Problem to solve: surface the right level of information at the right time with the most ergonomic format to discern signal vs. noise and prompt intuitive next steps for the dbt user (think: less eyeball scanning, less pattern recognition homework for the user). Match the inner dialogue of the user trying to solve problems with a data diff.

Design Notes:

  • I like the idea of color formatting to distinguish things like prod only, dev only, differences vs. no differences.
  • I played around with gridlines but it adds a bunch of noise and makes my eyeballs buzz around
  • Align numbers for a cleaner, uniform look
  • I like the feeling of sampled rows, gives a nice rule of thumb if I don't want to click a hyperlink for a full diff yet in Datafold cloud
  • For columns added (green color), columns removed (red color) similar to git diff
  • Row counts per table look like a good idea for baseline thinking
  • Should sample diff rows be opt-in or opt-out? Still weighing this as that changes the flag design pattern
  • Color code the prod vs. dev portions: DEMO.CORE.DIM_ORGS <> DEMO.DBT_SUNG.DIM_ORGS and then color code the rest of the diff output similarly. Automatic pattern recognition and less "this is prod stuff" "this is dev stuff" text needed.
  • Invoke the feelings of blue green deployments

Current --dbt output
image

Current --dbt --cloud output
image

Future diff outputs
image

Research:
Looks like SQLmesh has a thought out pattern when it comes to their table diffing feature: here
image

@leoebfolsom
Copy link
Contributor

I agree that the outputs (--dbt vs --cloud vs --json vs whatever else nifty future output) should have parity when equivalent information is available. This is table stakes.

@leoebfolsom
Copy link
Contributor

leoebfolsom commented Jan 24, 2024

I like the format Sung proposes ("Future diff outputs").

One nitpick is: I dislike how "changed rows" and "unchanged rows" are in columns specific to one vs the other side of the diff, because those are really meta statistics about the comparison of the two tables.

In Cloud, we render that as "in the middle" which I think makes sense:
Screenshot 2024-01-24 at 15 41 08

I want to scooch them, like this:

Screenshot 2024-01-24 at 15 42 40

That said, I think it's fine as-is as an MVP.

@leoebfolsom
Copy link
Contributor

I think sampling should be opt-in because I suspect it will be pretty hard to render sometimes, and I don't want people to have to run a diff, see that samples are rendered poorly, run it again, etc.

OTOH, if sampling renders nicely every time, even when there are 250 columns, then I would support opt-out.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
--dbt Issues/features related to the dbt integration enhancement New feature or request stale_immune Immunity to stale bot
Projects
None yet
5 participants