feat(ingest/kafka) meta-mapping for Kafka source w/ Avro schemas #7381
Conversation
Adds meta-mapping to the Kafka source for Avro schemas. Similar to dbt's implementation, it allows arbitrary metadata embedded in an Avro schema to be mapped to DataHub Owners, Tags, and Terms.
Hi there! Thanks for the contrib -- we will be looking into this shortly. Cheers
@jjoyce0510 @hsheth2 I'm also looking into whether it makes sense to add meta-mapping to the S3 recipe in the same way. This seems like a common pattern - when you already have a tagging system on the underlying tool and want a way to automatically translate those tags to DataHub metadata. Feel free to give me a ping on Slack if you guys have thoughts on this! 😃
@danielcmessias given the merge conflicts on this one, we ended up adapting + making a bunch of tweaks in #8825. I'm going to close this in favor of the new PR.
Since Avro permits non-spec field/value pairs, a common pattern is to use them to embed business metadata. Being able to map this directly to DataHub's Owners, Tags, and Terms allows metadata to be kept closer to the source, avoiding the need for Data Producers to configure transformers.
New Features
Mapping an Avro list-of-strings to DataHub Tags
Consider the following Avro schema:
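For illustration, a schema along these lines, where a list-of-strings `tags` property is attached at both the record and field level (the names here are made up rather than the PR's exact sample):

```json
{
  "type": "record",
  "name": "UserEvent",
  "namespace": "com.example",
  "tags": ["gold-tier", "user-data"],
  "fields": [
    { "name": "user_id", "type": "string", "tags": ["pii"] },
    { "name": "event_type", "type": "string" }
  ]
}
```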
The following new config allows this to be ingested as DataHub tags:
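A minimal sketch of what the recipe config could look like, using the `schema_tags_field` option discussed just below (connection details are placeholders):

```yaml
source:
  type: kafka
  config:
    connection:
      bootstrap: "localhost:9092"
      schema_registry_url: "http://localhost:8081"
    # Schema property whose list-of-strings value is turned into DataHub tags;
    # "tags" appears to be the default, so this line is shown only for clarity.
    schema_tags_field: tags
```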
These config values are currently the default (so they wouldn't actually be needed).
Maybe it would make more sense for the default value of `schema_tags_field` to be something like `_datahub_tags`?

Field-level tags do not automatically become Dataset-level tags.
Meta-mapping implementation
I've re-used the meta-mapping implementation that already exists for dbt (with a small extension to support referencing nested fields).
Consider the following Avro schema:
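For example, a schema carrying embedded business metadata, including a nested object (property names and values are illustrative):

```json
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.example",
  "owner": "@data-engineering",
  "governance": {
    "classification": "confidential",
    "has_pii": "yes"
  },
  "fields": [
    { "name": "order_id", "type": "string" },
    { "name": "amount", "type": "double" }
  ]
}
```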
This can now be mapped to DataHub metadata with the following config:
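A sketch of what such a config could look like, assuming the `meta_mapping` block uses the same `match`/`operation`/`config` shape as the dbt source and that nested properties are referenced with dot notation:

```yaml
meta_mapping:
  owner:
    match: "^@(.*)"
    operation: "add_owner"
    config:
      owner_type: group
  governance.has_pii:
    match: "yes"
    operation: "add_tag"
    config:
      tag: "pii"
  governance.classification:
    match: "confidential"
    operation: "add_term"
    config:
      term: "Classification.Confidential"
```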
Note the ability to reference nested fields.
Top-level `doc` in Avro is used as the Dataset's description

A small extra feature that seemed like a no-brainer.
Consider the schema:
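For instance (again, an illustrative record rather than the PR's exact sample):

```json
{
  "type": "record",
  "name": "UserEvent",
  "namespace": "com.example",
  "doc": "Emitted every time a user performs an action in the app.",
  "fields": [
    { "name": "user_id", "type": "string" }
  ]
}
```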
Before, this `doc` field would be ignored. Now it is used as the Dataset description.
Implementation
The implementation is mostly just re-using the `OperationProcessor` created for the dbt source. Otherwise:

- The `KafkaSchemaRegistryBase` interface is changed from `get_schema_metadata()` to `get_aspects_from_schema()`, to support returning multiple Aspects. I've kept the original method for backward compatibility, in case there are forks out there with custom schema-registry implementations.
- `schema_util.py` does involve passing around a lot of values. I was reluctant to do any refactoring, given this class is used in a bunch of places, but I suspect it does need some cleanup.
- Obviously, this is just for Avro right now. I'm not familiar with Protobuf, so I don't know if the same thing could be done there.
Testing
I've made a start on getting test coverage for the new changes. If you feel we need to substantially rework any of the testing I'll probably need some support from you guys to get it done 🙂
Docs
Definitely a bit lacking. The trouble is I don't want to rewrite (or copy/paste) all the meta-mapping examples from the dbt page over to Kafka. Open to suggestions here - if there's potential for this 'meta-mapping pattern' to be used in a few different sources, perhaps we should standardize and have a dedicated page?
Checklist