Make all format options available when using `--dbt` #409

dbeatty10 · 2023-02-24T01:19:40Z

Is your feature request related to a problem? Please describe.
Currently, the --dbt flag uses a different output format than non- --dbt executions of data-diff.

Describe the solution you'd like
Ideally, regular data-diff and data-diff --dbt would have the same output format. Additionally, --json and --stats should be possible when using --dbt.

Describe alternatives you've considered
It's an option for --dbt to have completely different formatting and for the --json and --stats flags to not have any effect. But that alternative makes dolphins cry 🐬 😢

Additional context
I'll submit a PR!

The text was updated successfully, but these errors were encountered:

dlawin · 2023-03-06T19:53:06Z

Ideally, regular data-diff and data-diff --dbt would have the same output format

I'm not sure I agree as we are targeting two very diff erent uses cases in these two features, so I do think it makes sense for the output to not align by default

Migrations (primary data-diff use case) vs. dbt model development. But 100% we need parity across features around the output options (not to mention the other diff options flags)

It's an option for --dbt to have completely different formatting and for the --json and --stats flags to not have any effect. But that alternative makes dolphins cry 🐬 😢

lol true

@williebsweet curious on your thoughts here

dlawin · 2023-03-06T20:01:08Z

and @leoebfolsom @kylemcnair

dbeatty10 · 2023-03-06T20:29:52Z

we are targeting two very diff erent uses cases in these two features

😂

The output format of regular data-diff felt familiar and comfortable due to its similarity with the git diff (and just diff) that I do regularly in the terminal. But when I used data-diff --dbt, I didn't really know what to do with the output.

At minimum, I'd like to be able to configure the output format via the --json and --stats flags (along with a flag for the default format) when using --dbt.

But interested to hear everyone's thoughts on output formats that would be most useful during dbt model development!

In particular, how are the needs of Mary the Migrator and Dylan the dbt Developer similar and how are their needs different? Is there particular diff output one persona needs that the other doesn't? Is there any overlap?

github-actions · 2023-05-10T06:31:37Z

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

dbeatty10 · 2023-05-10T12:05:17Z

@dlawin could we mark this as un-stale?

My feature proposal is to achieve parity across the output options (when using the --dbt option). I'll provide examples of all three below.

data-diff --dbt --dbt-profiles-dir .

data-diff --dbt --dbt-profiles-dir . --json

data-diff --dbt --dbt-profiles-dir . --stats

kylemcnair · 2023-05-10T14:53:11Z

@dlawin could we mark this as un-stale?

My feature proposal is to achieve parity across the output options (when using the --dbt option). I'll provide examples of all three below.
data-diff --dbt --dbt-profiles-dir .
```shell data-diff --dbt --dbt-profiles-dir . --json ``` ```shell data-diff --dbt --dbt-profiles-dir . --stats ```

Hey @dbeatty10!
I removed the stale label - this seems like a great feature that we'd want to keep on the roadmap. We just added the stale workflow in an attempt to better wrangle older issues, please pardon the noise as we get it dialed.

dbeatty10 · 2023-05-10T15:52:36Z

No worries @kylemcnair -- stalebot is definitely handy to keep the backlog more manageable.

At the moment, the following is more of a link to interesting information rather than a feature request...

Another diff format to take a peek at is the one used by daff:

http://paulfitz.github.io/daff-doc/spec.html

I believe it originally came from the coopy toolbox, and you can see it in action here:

https://paulfitz.github.io/daff/

A relevant git repo is here:

https://github.com/paulfitz/daff

leoebfolsom · 2023-05-10T20:34:46Z

@dbeatty10 I do really like the daff format, and believe it's a very good way to display row-level data diff info in CLI output, much better imo than the current "regular" (non-dbt) data-diff output. However, I echo what Kyle said above! We'll get there!

github-actions · 2023-07-10T06:34:25Z

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

dbeatty10 · 2023-07-11T01:11:04Z

@kylemcnair would you be open to either removing stale or adding stale_immune for this one?

leoebfolsom · 2023-07-12T00:00:05Z

Hi @dbeatty10!

I want to add that the traditional git diff format gets pretty unwieldy when there are many columns and rows, which is of course common when developing in dbt.

I know this from experience when trying to use 'main' data-diff to validate dbt model (comparing prod to dev), which I used to do in the before times, prior to the dbt integration. The output of data-diff was hard to read and overwhelming.

I believe the output of 'main' data-diff was built this way because of two assumptions that don't make sense for the dbt use case:

that the user is primarily interested in added/removed rows based on PKs, without the intent to compare the data in every column
that there will not be a huge number of differences, and that most rows will remain identical between the tables being compared (which would make sense when comparing source to target in a migration).

I support the general goal of feature parity across outputs for these different use cases (Mary and Dylan), but it doesn't seem trivial or obvious to me to identify what the ideal common output structure would be. (As you mentioned above, there are distinctly different use cases.)

I just want to make sure this is taken into account before something is designed and shipped. Perhaps the best answer is high-level parity, as in: the two use cases should have both machine-readable and human-readable summarized output; and machine-readable value-level output; but, the exact structure and implementation might differ for the two use cases.

Finally, I'm curious if these recently shipped features influence your opinion on the direction and what should be prioritized:

The VS Code extension (very much in beta / testing / pre-launch mode, but more than happy for you to try it out and give feedback 🙏 ): https://github.com/datafold/demo/blob/vscode-demo/VSCODE_INSTRUCTIONS.md
The --json flag in the --dbt context

sungchun12 · 2024-01-17T20:14:29Z

@dbeatty10 I'm planning to take on this feature for the next couple weeks. Let me know if any of your opinions changed on the above!

Just a heads up this will likely turn into a cloud limited preview feature similar to the dbt show command as a dbt user would run this data-diff --dbt --cloud --preview and get something like the daff output above.

sungchun12 · 2024-01-17T20:42:10Z

@kylemcnair would love your specific input since you last commented on this thread!

dbeatty10 · 2024-01-17T23:04:59Z

@dbeatty10 I'm planning to take on this feature for the next couple weeks. Let me know if any of your opinions changed on the above!

Nice, @sungchun12 😎

The above comments still reflect my primary thoughts:

When using the --dbt option, achieve parity across all the output options (--json, --stats, etc.)
If adding a new output option, then a --daff option would be interesting to look at

sungchun12 · 2024-01-23T17:31:45Z

Alright, since this will be a pretty big change. I have a couple proposals in mind.

Scope:

This is specific to dbt use cases only and this PR is not expected to translate changes to xdb diffs
Value-level diffs are Datafold cloud-specific only
JSON outputs will store stats outputs, not value level diffs
Problem to solve: surface the right level of information at the right time with the most ergonomic format to discern signal vs. noise and prompt intuitive next steps for the dbt user (think: less eyeball scanning, less pattern recognition homework for the user). Match the inner dialogue of the user trying to solve problems with a data diff.

Design Notes:

I like the idea of color formatting to distinguish things like prod only, dev only, differences vs. no differences.
I played around with gridlines but it adds a bunch of noise and makes my eyeballs buzz around
Align numbers for a cleaner, uniform look
I like the feeling of sampled rows, gives a nice rule of thumb if I don't want to click a hyperlink for a full diff yet in Datafold cloud
For columns added (green color), columns removed (red color) similar to git diff
Row counts per table look like a good idea for baseline thinking
Should sample diff rows be opt-in or opt-out? Still weighing this as that changes the flag design pattern
Color code the prod vs. dev portions: DEMO.CORE.DIM_ORGS <> DEMO.DBT_SUNG.DIM_ORGS and then color code the rest of the diff output similarly. Automatic pattern recognition and less "this is prod stuff" "this is dev stuff" text needed.
Invoke the feelings of blue green deployments

Current --dbt output

Current --dbt --cloud output

Future diff outputs

Research:
Looks like SQLmesh has a thought out pattern when it comes to their table diffing feature: here

leoebfolsom · 2024-01-24T23:37:40Z

I agree that the outputs (--dbt vs --cloud vs --json vs whatever else nifty future output) should have parity when equivalent information is available. This is table stakes.

leoebfolsom · 2024-01-24T23:42:27Z

I like the format Sung proposes ("Future diff outputs").

One nitpick is: I dislike how "changed rows" and "unchanged rows" are in columns specific to one vs the other side of the diff, because those are really meta statistics about the comparison of the two tables.

In Cloud, we render that as "in the middle" which I think makes sense:

I want to scooch them, like this:

That said, I think it's fine as-is as an MVP.

leoebfolsom · 2024-01-24T23:46:50Z

I think sampling should be opt-in because I suspect it will be pretty hard to render sometimes, and I don't want people to have to run a diff, see that samples are rendered poorly, run it again, etc.

OTOH, if sampling renders nicely every time, even when there are 250 columns, then I would support opt-out.

dbeatty10 added the enhancement New feature or request label Feb 24, 2023

dbeatty10 mentioned this issue Feb 24, 2023

Unify output formats for --dbt running locally #410

Closed

dlawin added the --dbt Issues/features related to the dbt integration label Mar 6, 2023

github-actions bot added the stale Issues/PRs that have gone stale label May 10, 2023

kylemcnair removed the stale Issues/PRs that have gone stale label May 10, 2023

github-actions bot added the stale Issues/PRs that have gone stale label Jul 10, 2023

github-actions bot added triage and removed stale Issues/PRs that have gone stale labels Jul 11, 2023

dlawin added stale_immune Immunity to stale bot and removed triage labels Jul 11, 2023

sungchun12 self-assigned this Jan 17, 2024

sungchun12 mentioned this issue Jan 24, 2024

Evolve dbt data diff output #857

Merged

sungchun12 closed this as completed in #857 Feb 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make all format options available when using `--dbt` #409

Make all format options available when using `--dbt` #409

dbeatty10 commented Feb 24, 2023

dlawin commented Mar 6, 2023

dlawin commented Mar 6, 2023

dbeatty10 commented Mar 6, 2023

github-actions bot commented May 10, 2023

dbeatty10 commented May 10, 2023

kylemcnair commented May 10, 2023

dbeatty10 commented May 10, 2023

leoebfolsom commented May 10, 2023

github-actions bot commented Jul 10, 2023

dbeatty10 commented Jul 11, 2023

leoebfolsom commented Jul 12, 2023

sungchun12 commented Jan 17, 2024 •

edited

Loading

sungchun12 commented Jan 17, 2024

dbeatty10 commented Jan 17, 2024

sungchun12 commented Jan 23, 2024 •

edited

Loading

leoebfolsom commented Jan 24, 2024

leoebfolsom commented Jan 24, 2024 •

edited

Loading

leoebfolsom commented Jan 24, 2024

Make all format options available when using --dbt #409

Make all format options available when using --dbt #409

Comments

dbeatty10 commented Feb 24, 2023

dlawin commented Mar 6, 2023

dlawin commented Mar 6, 2023

dbeatty10 commented Mar 6, 2023

github-actions bot commented May 10, 2023

dbeatty10 commented May 10, 2023

kylemcnair commented May 10, 2023

dbeatty10 commented May 10, 2023

leoebfolsom commented May 10, 2023

github-actions bot commented Jul 10, 2023

dbeatty10 commented Jul 11, 2023

leoebfolsom commented Jul 12, 2023

sungchun12 commented Jan 17, 2024 • edited Loading

sungchun12 commented Jan 17, 2024

dbeatty10 commented Jan 17, 2024

sungchun12 commented Jan 23, 2024 • edited Loading

leoebfolsom commented Jan 24, 2024

leoebfolsom commented Jan 24, 2024 • edited Loading

leoebfolsom commented Jan 24, 2024

Make all format options available when using `--dbt` #409

Make all format options available when using `--dbt` #409

sungchun12 commented Jan 17, 2024 •

edited

Loading

sungchun12 commented Jan 23, 2024 •

edited

Loading

leoebfolsom commented Jan 24, 2024 •

edited

Loading