feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265
base: master
Conversation
Hi there! What is the goal of this PR? Adding context in the description would be quite useful. Thanks in advance!
Hi, this PR is about adding a Data Quality Metrics capability; we are working on changes for dynamic Data Quality metric addition per the PR review comments. Thanks.
@naresh-angala I know that this PR is just the model changes. Can you attach a documentation link that conveys the big picture and where it all fits please. |
Requested changes and clarifications have not been addressed on this PR
Okay, the team will change the PR to Draft and work on the design changes. Thanks.
This PR contains the data model changes to support the full ability to capture and report data quality dimensions. There was a bit of back and forth in Slack back in Oct 2023 on this topic, which included example usage screens found here. Here was the simple Feature Goal statement:
@naresh-angala and @rtekal -- where are the GraphQL and UI updates related to this feature? Right now this looks like just PDL updates. Without the rest, I don't see how DataHub gets any value beyond the ability to ingest and store the data, which IMHO is pretty basic.
@sgm44 -- Initial plan was to get the Data Quality model changes reviewed and accepted.
@jjoyce0510 -- Can you share details on the below?
Thanks. |
@jjoyce0510 -- Please provide details on above points. |
@naresh-angala Is there any tentative timeline for this feature to be fully integrated into the UI, GraphQL, and backend? This is an integral part of DQ, and I would very much like to see this in the newest version.
@naresh-angala told me: Targeting last week of Sep to complete.
Force-pushed from 091db70 to eab2ac7.
@jjoyce0510: Updated the PR with dynamic dimension names and UI changes. Please review.
@jjoyce0510: Have addressed all the requested changes. Please review.
We will take another look at this.
Note that we never had a design discussion around this non-trivial feature. It would be best to have a dedicated time to chat through this together.
Either way, we'll take another look and try to reverse-interpret the design.
Cheers
John
This is quite comprehensive!
This one requires more discussion at the strategic level before admission for the broader community. An RFC document detailing the thinking here might be the best place to discuss and engage others within the community.
A few top of mind concerns:
- Who is responsible for registering Dimension Names? e.g. creating the new Dimension Name Type entity
- Who is responsible for producing the Dimension Name Scores for Tables & Columns?
- Is there an example of a Data Quality dimension at the table and column level we could use to understand the use case a bit better?
- Is there a way to use Structured Properties + a custom UI tab to achieve what you want? It feels like you simply need a way to attach strong-typed numeric properties to tables and columns (which structured properties can be used for)
- What would a user-facing feature guide doc look like for this feature?
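To make the structured-properties suggestion above concrete: the idea is roughly to attach strong-typed numeric properties to tables and columns. The following is a minimal, self-contained sketch in plain Python; all names (`NumericProperty`, `DatasetProperties`) are hypothetical illustrations, not the actual DataHub structured-properties API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch only -- not the real DataHub structured-properties API.
@dataclass(frozen=True)
class NumericProperty:
    name: str      # e.g. "completeness"
    value: float   # strong-typed numeric value, e.g. 99.2 (a percentage)

@dataclass
class DatasetProperties:
    urn: str
    # property name -> NumericProperty, at the table level
    table_props: dict = field(default_factory=dict)
    # column name -> {property name -> NumericProperty}
    column_props: dict = field(default_factory=dict)

    def set_table_prop(self, prop: NumericProperty) -> None:
        self.table_props[prop.name] = prop

    def set_column_prop(self, column: str, prop: NumericProperty) -> None:
        self.column_props.setdefault(column, {})[prop.name] = prop
```

The point of the sketch is that a generic "named numeric property" attached to a dataset or a schema field already covers the data shape this PR models, without a Data Quality-specific aspect.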
There are also lower level tactical comments on the code around variable naming, data modeling, etc, but I don't want to waste your time on those things before there is alignment at the strategic level.
Cheers
John
cc: @naresh-angala, @mzaman
- Data Producer will submit an enhancement request for adding a new dimension as an Entity Type.
- Data Producer will use their own algorithm for calculating the scores and will ingest them into the Data Catalog.
- An example is already provided in the PR above.
- User doc in Markdown:

Introduction

Data Quality Dimension (DQ) is a popular industry term used to describe characteristics or attributes of data that can be measured against defined standards in order to determine the quality of data. The aggregate at dimension level can fairly indicate the fitness of data for a certain business purpose. Data Catalog offers the standard Dimensions recognized in the industry and a scorecard for both Data Owners and Consumers to make informed decisions on the quality of data. This approach facilitates a common language for users to communicate and comprehend the quality of data in a consistent way across the enterprise.

Dimensions of Data Quality

A Data Quality Dimension is typically presented as a percentage or a total count. Each Dimension will contain dimension scores, which can be provided in either numerical or percentage format.

Data quality metrics will be available at Dataset and Field level under the "Quality" tab in Data Catalog.
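As a hedged illustration of how one such dimension score might be produced by a Data Producer's own algorithm (this helper is hypothetical and not part of the PR), completeness for a column sample could be computed as the percentage of non-null values:

```python
def completeness_pct(values):
    """Completeness dimension score for a column sample, as a percentage.

    Hypothetical example only: counts non-null values and reports the
    ratio as a percentage rounded to two decimals.
    """
    if not values:
        return 0.0
    non_null = sum(1 for v in values if v is not None)
    return round(100.0 * non_null / len(values), 2)

# Example: 3 of 4 values present -> 75.0
score = completeness_pct([1, None, 3, 4])
```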
I am pausing on technical reviews. There are still comments I have around naming and more, but I'm going to wait until we have strategic alignment on how exactly this benefits our users. I don't have any evidence or confidence that if we roll this out, other community members will be able to immediately start getting value from it (even given the explanation).
In my view, this feature feels like too much work for our users. The burden is left on others to know how to produce data quality metrics and there is no recipe or runbook about how to use the feature. -- In other words, this is a partial feature and not a "full feature".
If there was a guide that showed users how to get to Accuracy, Completeness, Consistency for their data that would be ideal - but I do not see that yet. I do not want to accept this PR and then leave our customers / community on their own to figure out what to do with it, while adding the support burden for the engineering teams maintaining DataHub.
Upon reflecting upon this more, I think this is the main problem with this feature:
It's opinionated, but not complete: We are introducing a highly specific concept of "Data Quality Metrics", without providing users a clear guide on how to use the feature. The implementation of the feature is far more generic than the "Data Quality" framing would suggest - It's really a way to provide a named metric to a table or schema field with a highly opinionated, but vaguely defined historical and current value + score.
If we want to keep the implementation as generic as it currently is, WITHOUT requiring more information about how the metrics are actually computed, I'd argue we should generalize the feature EVEN MORE to become a general purpose way to report time-series metrics about a table or column to datahub (without the Data Quality-specific framing + scores).
Specifically, I think a viable approach could be:
- Define a general purpose "DataHub Metric" concept that can be associated to Fields or Datasets.
- Allow defining metric metadata like name, value type, and more.
- Allow reporting values for the metric associated with a given entity URN and using time series aspects (the correct way to store historical metrics)
- Allow metrics to have custom dimensions, which are all indexed so we can slice and dice using them.
- Allow displaying your custom metrics in rich historical + latest graphs in a new tab called "Metrics".
- Metrics could have a category: Quality, Governance, or something altogether different.
- Add some APIs for basic aggregations on top of metrics. (e.g. sum, number of events, latest value in windows, etc).
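To make the proposal above concrete, here is a rough in-memory sketch of a general-purpose metric event plus two of the basic aggregations mentioned (latest value, sum in a window). All names (`MetricEvent`, `MetricStore`) are assumptions for discussion, not a proposed DataHub API; a real implementation would store these as time-series aspects rather than a Python list:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MetricEvent:
    entity_urn: str   # dataset or schema-field URN the metric is attached to
    metric_name: str  # e.g. "row_count"
    value: float
    timestamp_ms: int
    # custom dimensions for slice-and-dice, e.g. {"region": "emea"}
    dimensions: dict = field(default_factory=dict)

class MetricStore:
    """Toy in-memory stand-in for a time-series metric backend."""

    def __init__(self):
        self._events = []

    def report(self, event: MetricEvent) -> None:
        self._events.append(event)

    def latest(self, urn: str, name: str):
        """Latest reported value for (entity, metric), or None."""
        evts = [e for e in self._events
                if e.entity_urn == urn and e.metric_name == name]
        return max(evts, key=lambda e: e.timestamp_ms).value if evts else None

    def sum_in_window(self, urn: str, name: str,
                      start_ms: int, end_ms: int) -> float:
        """Sum of values in the half-open window [start_ms, end_ms)."""
        return sum(e.value for e in self._events
                   if e.entity_urn == urn and e.metric_name == name
                   and start_ms <= e.timestamp_ms < end_ms)
```

Under this shape, a "Quality" score is just one metric among others, and category (Quality, Governance, etc.) could be an attribute of the metric definition rather than a dedicated aspect.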
These more general purpose, less opinionated foundations would enable this feature to scale to provide value beyond just Data Quality, for example broadening the scope to possibly include Data Governance and Data Discovery, or something altogether more specific to a particular organization.
We could develop the user interface layer to be less opinionated about scores, latest, and historical values than it currently is. This would reduce the collision between this feature and other very specific Data Quality features already offered by DataHub/Acryl.
What do you think about this direction? It has been on our minds at Acryl for some time now. If you're up for it, we could attempt to pivot this feature in that more general purpose direction.
Checklist