feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265

naresh-angala · 2023-11-17T17:52:53Z

Checklist

[ X ] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

metadata-models/src/main/pegasus/com/linkedin/dataquality/DataQualityDimensionInfo.pdl

metadata-models/src/main/pegasus/com/linkedin/dataquality/DimensionScore.pdl

jjoyce0510 · 2024-06-28T22:23:41Z

Hi there! What is the goal with this PR? Adding context in the description will be quite useful! Thanks in advanced

naresh-angala · 2024-07-01T06:43:20Z

Hi there! What is the goal with this PR? Adding context in the description will be quite useful! Thanks in advanced

Hi,

PR is about adding Data Quality Metrics capability, working on changes for dynamic Data Quality metrics addition as per PR review comments.

Thanks.

rtekal · 2024-08-02T16:46:08Z

@naresh-angala I know that this PR is just the model changes. Can you attach a documentation link that conveys the big picture and where it all fits please.

metadata-models/src/main/pegasus/com/linkedin/dataquality/DataQualityDimensionInfo.pdl

metadata-models/src/main/pegasus/com/linkedin/dataquality/SchemaFieldQualityDimensionInfo.pdl

jjoyce0510

Requested changes and clarifications have not been addressed on this PR

rtekal · 2024-08-07T16:56:53Z

Okay, the team will change the PR to Draft and work on the design changes. Thanks.

sgm44 · 2024-08-07T22:02:46Z

PR is about adding Data Quality Metrics capability, working on changes for dynamic Data Quality metrics addition as per PR review comments.

This is the data model changes to support the full ability to capture and report data quality dimensions. There was a bit of back and forth in slack back in Oct 2023 on this topic which include example usage screens found here

Here was the simple Feature Goal statement:
As a data producer, I want quality metrics for my ingested datasets to display within the dataset view in the catalog so that the metrics are available to data consumers.

sgm44 · 2024-08-07T22:05:36Z

@naresh-angala and @rtekal -- where are the graphQL and UI updates related to this feature? Right now this looks like just PDL updates. Without the rest I don't see how datahub gets any value outside of the ability to ingest and store the data which IMHO is pretty basic.

naresh-angala · 2024-08-09T06:26:09Z

@naresh-angala and @rtekal -- where are the graphQL and UI updates related to this feature? Right now this looks like just PDL updates. Without the rest I don't see how datahub gets any value outside of the ability to ingest and store the data which IMHO is pretty basic.

@sgm44 -- Intial plan was to get the Data quality model changes be reviewed and accepted.
Would be updating the code with GraphQL and UI related change subsequently after the model.

naresh-angala · 2024-08-09T07:08:15Z

@jjoyce0510 -- Can you share details on below,

Can a feature be contributed with multiple PRs.
If feature need to be contributed with single PR, let us know comments on below approach
--> Marking this PR in draft mode and add the below changes incrementally,
a) Dynamic list of Dimensions with new aspect
b) Mapper changes and GraphQL changes
c) UI changes to display the quality metrics

Thanks.

naresh-angala · 2024-08-19T11:47:38Z

@jjoyce0510 -- Can you share details on below,

Can a feature be contributed with multiple PRs.

If feature need to be contributed with single PR, let us know comments on below approach
--> Marking this PR in draft mode and add the below changes incrementally,
a) Dynamic list of Dimensions with new aspect
b) Mapper changes and GraphQL changes
c) UI changes to display the quality metrics

Thanks.

@jjoyce0510 -- Please provide details on above points.

Curiosity007 · 2024-09-08T05:02:14Z

@naresh-angala Is there any tentative timeline for this feature to be fully integrated into the UI, GraphQL and backend? This is an integral part of DQ, and would like very much to see this in the newest version

rtekal · 2024-09-11T16:07:25Z

@naresh-angala told me: Targetting last week of Sep to complete

…ta into Datahub

naresh-angala · 2024-11-15T17:30:08Z

Requested changes and clarifications have not been addressed on this PR

@jjoyce0510 : Updated the PR with dynamic dimension names and UI changes. Please review.

naresh-angala · 2024-11-15T17:31:42Z

UI Screen shots: Dataset metrics: Chart view:

Schemafield metrics:

naresh-angala · 2024-11-15T17:34:06Z

UI Screenshots:
Dataset metrics: Table view:

naresh-angala · 2024-11-21T16:45:33Z

Requested changes and clarifications have not been addressed on this PR

@jjoyce0510: Have addressed all the requested changes. Please review.

metadata-models/src/main/pegasus/com/linkedin/dataquality/SchemaFieldQualityDimensionInfo.pdl

jjoyce0510

We will take another look at this.

Note that we never had a design discussion around this non-trivial feature. It would be best to have a dedicated time to chat through this together.

Either way, we'll take another look and try to reverse interpret

Cheers
John

jjoyce0510

This is quite comprehensive!

This one requires more discussion at the strategic level before admission for the broader community. An RFC document detailing the thinking here might be the best place to discuss and engage others within the community.

A few top of mind concerns:

Who is responsible for registering Dimension Names? e.g. creating the new Dimension Name Type entity
Who is responsible for producing the Dimension Name Scores for Tables & Columns?
Is there an example of a Data Quality dimension at the table and column level we could use to understand the use case a bit better?
Is there a way to use Structured Properties + a custom UI tab to achieve what you want? It feels like you simply need a way to attach strong-typed numeric properties to tables and columns (which structured properties can be used for)
What would a user-facing feature guide doc look like for this feature?

There are also lower level tactical comments on the code around variable naming, data modeling, etc, but I don't want to waste your time on those things before there is alignment at the strategic level.

Cheers
John

rtekal · 2024-12-06T19:39:50Z

cc: @naresh-angala, @mzaman

Who is responsible for registering Dimension Names? e.g. creating the new Dimension Name Type entity

Data Producer will submit an enhancement request for adding a new dimension as an Entity Type

Who is responsible for producing the Dimension Name Scores for Tables & Columns?

Data Producer will use their own algorithm for calculating the scores and will ingest them to Data Catalog.

Is there an example of a Data Quality dimension at the table and column level we could use to understand the use case a bit better?

Already provided in the PR above

Is there a way to use Structured Properties + a custom UI tab to achieve what you want? It feels like you simply need a way to attach strong-typed numeric properties to tables and columns (which structured properties can be used for)

Structured property could be used for just showing the current score. But we are also showing the historical average
We will show the time-series values in future. Not supported by Structured Properties
We will also show the information about the method of calculation and the threshold score. These are the quality attributes similar to "Assertions" and "Metadata Tests". So, it makes sense to group them together with Assertions and Tests

What would a user-facing feature guide doc look like for this feature?

User doc in Markdown

Introduction

Data Quality Dimension(DQ) is a popular industry term used to describe characteristics or attributes of data that can be measured against defined standards in order to determine the quality of data. The aggregate at dimension level can fairly indicate the fitness of data to be used for a certain business purpose.

Data Catalog offers the standard Dimensions recognized in the industry and scorecard for both Data Owners & Consumers to make informed decisions on the quality of data. This approach facilitates a common language for the users to communicate and comprehend the quality of data in a consistent way across the enterprise.

Dimensions of Data Quality

A Data Quality Dimension is typically presented as a percentage or a total count.

Dimension	Description
Accuracy	How well does a piece of information reflect reality?
Completeness	Does it fulfill your expectations of what’s comprehensive?
Consistency	Does information stored in one place match relevant data stored elsewhere?
Timeliness	Is your information available when you need it?
Validity	Is information in a specific format, does it follow business rules, or is it in an unusable format?
Uniqueness	Is this the only instance in which this information appears in the database?
Integrity	Is the data compliant to the referential integrity rules?
Semantic Correctness	Are Data values true to their meaning?
Reliability	Is the stability and reliability of data measured over time?
Duplication	Are there multiple instances in which this information appears in the dataset ?
Precision	Analyzes the content of the data to identify inconsistencies and irregularities.

Each Dimension will contain dimension scores which can be either provided in numerical or percentage format.

Current score (Required): The score as observed from the most recent data quality evaluation for a specific dimension
HistoricalWeightedScore (Optional): The score when weighted across several historical quality runs for a specific dimension
ScoreType (Required): enum field to define current and historical score format (Percentage or Numerical)
Notes (Optional): field to capture note/comment for a specific dimension

Data quality metrics will be available at Dataset and Field level under "Quality" tab in Data Catalog.

image

jjoyce0510

I am pausing on technical reviews - there are still some comments I have around naming and more but I'm going to wait - until we have strategic alignment to discuss how exactly this benefits our users. I don't have any evidence or confidence that if we roll this out, other community members will be able to immediately start to get value from it (even given the explanation).

In my view, this feature feels like too much work for our users. The burden is left on others to know how to produce data quality metrics and there is no recipe or runbook about how to use the feature. -- In other words, this is a partial feature and not a "full feature".

If there was a guide that showed users how to get to Accuracy, Completeness, Consistency for their data that would be ideal - but I do not see that yet. I do not want to accept this PR and then leave our customers / community on their own to figure out what to do with it, while adding the support burden for the engineering teams maintaining DataHub.

jjoyce0510

Upon reflecting upon this more, I think this is the main problem with this feature:

It's opinionated, but not complete: We are introducing a highly specific concept of "Data Quality Metrics", without providing users a clear guide on how to use the feature. The implementation of the feature is far more generic than the "Data Quality" framing would suggest - It's really a way to provide a named metric to a table or schema field with a highly opinionated, but vaguely defined historical and current value + score.

If we want to keep the implementation as generic as it currently is, WITHOUT requiring more information about how the metrics are actually computed, I'd argue we should generalize the feature EVEN MORE to become a general purpose way to report time-series metrics about a table or column to datahub (without the Data Quality-specific framing + scores).

Specifically, I think a viable approach could be:

Define a general purpose "DataHub Metric" concept that can be associated to Fields or Datasets.
Allow defining metric metadata like name, value type, and more.
Allow reporting values for the metric associated with a given entity URN and using time series aspects (the correct way to store historical metrics)
Allow metrics to have custom dimensions, which are all indexed so we can slice and dice using them.
Allow displaying your custom metrics in rich historical + latest graphs in a new tab called "Metrics".
Metrics could have a category: Quality, Governance, or something altogether different.
Add some APIs for basic aggregations on top of metrics. (e.g. sum, number of events, latest value in windows, etc).

These more general purpose, less opinionated foundations would enable this feature to scale to provide value beyond just Data Quality, for example broadening the scope to possibly include Data Governance and Data Discovery, or something altogether more specific to a particular organization.

We could develop the user interface layer to be less opinionated about scores, latest, and historical values than it currently is. This would reduce the collision between this feature and other very specific Data Quality features already offerred by DataHub/Acryl.

What do you think about this direction? It has been on our minds at Acryl for some time now. If you're up for it, we could attempt to pivot this feature in that more general purpose direction.

naresh-angala changed the title ~~feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics into Datahub~~ feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub Nov 17, 2023

vercel bot deployed to Preview November 17, 2023 18:29 View deployment

david-leifker assigned shirshanka and jjoyce0510 Nov 21, 2023

david-leifker added the product PR or Issue related to the DataHub UI/UX label Nov 21, 2023

vercel bot deployed to Preview November 27, 2023 08:58 View deployment

maggiehays added the community-contribution PR or Issue raised by member(s) of DataHub Community label Nov 29, 2023

chriscollins3456 added the poc-marathon-dec-2023 label Dec 6, 2023

skrydal reviewed Jan 10, 2024

View reviewed changes

metadata-models/src/main/pegasus/com/linkedin/dataquality/DataQualityDimensionInfo.pdl Outdated Show resolved Hide resolved

skrydal reviewed Jan 10, 2024

View reviewed changes

metadata-models/src/main/pegasus/com/linkedin/dataquality/DimensionScore.pdl Outdated Show resolved Hide resolved

frsann reviewed Jan 11, 2024

View reviewed changes

metadata-models/src/main/pegasus/com/linkedin/dataquality/DimensionScore.pdl Show resolved Hide resolved

jjoyce0510 reviewed Aug 6, 2024

View reviewed changes

metadata-models/src/main/pegasus/com/linkedin/dataquality/DataQualityDimensionInfo.pdl Outdated Show resolved Hide resolved

jjoyce0510 reviewed Aug 6, 2024

View reviewed changes

metadata-models/src/main/pegasus/com/linkedin/dataquality/SchemaFieldQualityDimensionInfo.pdl Outdated Show resolved Hide resolved

jjoyce0510 requested changes Aug 6, 2024

View reviewed changes

naresh-angala marked this pull request as draft August 26, 2024 08:00

naresh-angala closed this Oct 23, 2024

naresh-angala force-pushed the dataquality-metrics branch from 091db70 to eab2ac7 Compare October 23, 2024 14:31

vercel bot deployed to Preview October 23, 2024 14:49 View deployment

Added Data Quality Metrics aspect to emit data quality metrics metada…

c95e365

…ta into Datahub

naresh-angala reopened this Oct 25, 2024

github-actions bot removed the product PR or Issue related to the DataHub UI/UX label Oct 25, 2024

vercel bot had a problem deploying to Preview October 25, 2024 12:33 Failure

Adding support for custom dimension names for data quality model

bf904a4

vercel bot had a problem deploying to Preview November 5, 2024 14:02 Failure

Dataquality UI changes

e8e13a3

naresh-angala marked this pull request as ready for review November 15, 2024 17:30

vercel bot had a problem deploying to Preview November 15, 2024 17:32 Failure

datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 21, 2024

jjoyce0510 reviewed Nov 21, 2024

View reviewed changes

metadata-models/src/main/pegasus/com/linkedin/dataquality/SchemaFieldQualityDimensionInfo.pdl Outdated Show resolved Hide resolved

jjoyce0510 approved these changes Nov 21, 2024

View reviewed changes

jjoyce0510 reviewed Nov 25, 2024

View reviewed changes

schemafieldUrn field name update

fac28ad

vercel bot had a problem deploying to Preview November 26, 2024 13:59 Failure

hsheth2 added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 4, 2024

datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 6, 2024

jjoyce0510 reviewed Dec 17, 2024

View reviewed changes

datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 17, 2024

jjoyce0510 reviewed Dec 17, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265

feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265

naresh-angala commented Nov 17, 2023

jjoyce0510 commented Jun 28, 2024

naresh-angala commented Jul 1, 2024

rtekal commented Aug 2, 2024

jjoyce0510 left a comment

rtekal commented Aug 7, 2024

sgm44 commented Aug 7, 2024

sgm44 commented Aug 7, 2024

naresh-angala commented Aug 9, 2024

naresh-angala commented Aug 9, 2024

naresh-angala commented Aug 19, 2024

Curiosity007 commented Sep 8, 2024

rtekal commented Sep 11, 2024

naresh-angala commented Nov 15, 2024

naresh-angala commented Nov 15, 2024

naresh-angala commented Nov 15, 2024

naresh-angala commented Nov 21, 2024

jjoyce0510 left a comment

jjoyce0510 left a comment

rtekal commented Dec 6, 2024

jjoyce0510 left a comment

jjoyce0510 left a comment •

edited

Loading

feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265

Are you sure you want to change the base?

feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265

Conversation

naresh-angala commented Nov 17, 2023

Checklist

jjoyce0510 commented Jun 28, 2024

naresh-angala commented Jul 1, 2024

rtekal commented Aug 2, 2024

jjoyce0510 left a comment

Choose a reason for hiding this comment

rtekal commented Aug 7, 2024

sgm44 commented Aug 7, 2024

sgm44 commented Aug 7, 2024

naresh-angala commented Aug 9, 2024

naresh-angala commented Aug 9, 2024

naresh-angala commented Aug 19, 2024

Curiosity007 commented Sep 8, 2024

rtekal commented Sep 11, 2024

naresh-angala commented Nov 15, 2024

naresh-angala commented Nov 15, 2024

naresh-angala commented Nov 15, 2024

naresh-angala commented Nov 21, 2024

jjoyce0510 left a comment

Choose a reason for hiding this comment

jjoyce0510 left a comment

Choose a reason for hiding this comment

rtekal commented Dec 6, 2024

Introduction

Dimensions of Data Quality

jjoyce0510 left a comment

Choose a reason for hiding this comment

jjoyce0510 left a comment • edited Loading

Choose a reason for hiding this comment

jjoyce0510 left a comment •

edited

Loading