feat(ingest/kafka) meta-mapping for Kafka source w/ Avro schemas #7381

Closed


danielcmessias
Contributor

@danielcmessias danielcmessias commented Feb 20, 2023

Because Avro permits field/value pairs that are not defined by the spec, a common pattern is to use them to embed business metadata. Mapping these directly to DataHub's Owners, Tags, and Terms keeps metadata closer to the source and avoids the need for Data Producers to configure transformers.

New Features

Mapping an Avro list-of-strings to DataHub Tags

Consider the following Avro schema:

{
  "name": "sampleRecord",
  "type": "record",
  "tags": ["tag1", "tag2"],
  "fields": [{
    "name": "field_1",
    "type": "string",
    "tags": ["tag3", "tag4"]
  }]
}

The following new config allows this to be ingested as DataHub tags:

config:
  schema_tags_field: tags
  tag_prefix: kafka

These config values are currently the default (so they wouldn't actually be needed).

Maybe it would make more sense for the default value of schema_tags_field to be something like _datahub_tags?

Field-level tags do not automatically become Dataset-level tags.
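
To make the tag behaviour above concrete, here is a minimal Python sketch of how tags could be read from the sample schema. It is not the code in this PR: `extract_tags` is a hypothetical helper, and the way `tag_prefix` is joined to the tag name (plain concatenation) is an assumption made purely for illustration.

```python
from typing import Any, Dict, List


def extract_tags(
    node: Dict[str, Any], tags_field: str = "tags", tag_prefix: str = ""
) -> List[str]:
    """Hypothetical helper: read a list-of-strings attribute from a schema node."""
    raw = node.get(tags_field, [])
    if not isinstance(raw, list):
        # Ignore attributes that are not a list; only list-of-strings is handled here.
        return []
    # Assumption: the prefix is simply prepended to each tag name.
    return [f"{tag_prefix}{tag}" for tag in raw if isinstance(tag, str)]


schema = {
    "name": "sampleRecord",
    "type": "record",
    "tags": ["tag1", "tag2"],
    "fields": [{"name": "field_1", "type": "string", "tags": ["tag3", "tag4"]}],
}

# Record-level tags and field-level tags are collected separately,
# so field tags never leak up to the dataset.
dataset_tags = extract_tags(schema)
field_tags = {f["name"]: extract_tags(f) for f in schema["fields"]}
```

Collecting the two levels separately mirrors the note that field-level tags do not automatically become Dataset-level tags.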

Meta-mapping implementation

I've re-used the meta-mapping implementation that already exists for dbt (with a small extension to support referencing nested fields).

Consider the following Avro schema:

{
  "name": "sampleRecord",
  "type": "record",
  "owning_team": "@Data-Science",
  "data_tier": "Bronze",
  "fields": [{
    "name": "field_1",
    "type": "string",
    "gdpr": {
      "pii": true
    }
  }]
}

This can now be mapped to DataHub metadata with the following config:

config:
  meta_mapping:
    owning_team:
      match: "^@(.*)"
      operation: "add_owner"
      config:
        owner_type: group
    data_tier:
      match: "Bronze|Silver|Gold"
      operation: "add_term"
      config:
        term: "{{ $match }}"
  field_meta_mapping:
    gdpr.pii:
      match: true
      operation: "add_tag"
      config:
        tag: "pii"

Note the ability to reference nested fields.
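
As a rough illustration of the nested-field lookup and the `match` step, here is a small Python sketch. It does not reproduce the OperationProcessor; `get_nested` and `match_value` are hypothetical helpers, and returning the first capture group as the matched text is an assumption made for this example only.

```python
import re
from typing import Any, Dict, Optional


def get_nested(props: Dict[str, Any], dotted_key: str) -> Optional[Any]:
    """Hypothetical helper: resolve a key such as 'gdpr.pii' in a nested dict."""
    current: Any = props
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def match_value(value: Any, pattern: Any) -> Optional[str]:
    """Return the matched text, or None when the value does not match."""
    if value is None:
        return None
    if not isinstance(pattern, str):
        # Non-string patterns (e.g. `match: true`) are compared by type and value.
        same = type(value) is type(pattern) and value == pattern
        return str(value) if same else None
    m = re.match(pattern, str(value))
    if m is None:
        return None
    # Assumption: prefer the first capture group when the pattern defines one.
    return m.group(1) if m.groups() else m.group(0)


schema = {
    "name": "sampleRecord",
    "type": "record",
    "owning_team": "@Data-Science",
    "data_tier": "Bronze",
    "fields": [{"name": "field_1", "type": "string", "gdpr": {"pii": True}}],
}
field = schema["fields"][0]

owner = match_value(get_nested(schema, "owning_team"), "^@(.*)")
term = match_value(get_nested(schema, "data_tier"), "Bronze|Silver|Gold")
is_pii = match_value(get_nested(field, "gdpr.pii"), True)
```

With the sample schema, `owner` holds the team name without the leading `@`, `term` holds the tier that would be substituted for `{{ $match }}`, and `is_pii` is non-`None` only when the nested flag is set.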

Top-level doc in Avro is used as the Dataset's description

A small extra feature that seemed like a no-brainer.

Consider the schema:

{
  "name": "datahub-0",
  "type": "record",
  "doc": "# This is a header.\nAnd this goes below it.",
  "fields": ...
}

Before, this doc field would be ignored. Now, it is used as the Dataset Description:

[Image: Screenshot 2023-02-20 at 15 22 02]

Implementation

The implementation is mostly just re-using the OperationProcessor created for the dbt source. Otherwise:

  • The KafkaSchemaRegistryBase interface is changed from get_schema_metadata() to get_aspects_from_schema(), to support returning multiple Aspects. I've kept the original method for backward compatibility in case there are some forks out there with custom schema-registry implementations.
  • schema_util.py does involve passing around a lot of values. I was reluctant to do any refactoring, given this class is used in a bunch of places, but I suspect it does need some cleanup.
  • Hopefully, there should be no breaking changes (at least that's the idea 😅 )
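
One way the backward-compatible interface change described in the first bullet could look is sketched below. The class names and method signatures here are placeholders invented for illustration (the real KafkaSchemaRegistryBase signatures are not shown in this PR description); only the two method names come from the text above.

```python
from typing import Any, List, Optional


class SchemaRegistrySketch:
    """Hypothetical stand-in for a schema-registry base class."""

    def get_schema_metadata(self, topic: str) -> Optional[Any]:
        # Legacy hook: returns a single aspect, or None if there is no schema.
        return None

    def get_aspects_from_schema(self, topic: str) -> List[Any]:
        # New hook: returns any number of aspects. The default implementation
        # falls back to the legacy method so older subclasses keep working.
        legacy = self.get_schema_metadata(topic)
        return [legacy] if legacy is not None else []


class LegacyFork(SchemaRegistrySketch):
    """A custom implementation that only knows about the old method."""

    def get_schema_metadata(self, topic: str) -> Optional[Any]:
        return {"aspect": "schemaMetadata", "topic": topic}


aspects = LegacyFork().get_aspects_from_schema("my-topic")
```

The point of the design is that callers can switch to the new method everywhere while forks that override only the old one still contribute their single aspect.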

Obviously, this is just for Avro right now. I'm not familiar with Protobuf, so I don't know if the same thing could be done there.

Testing

I've made a start on getting test coverage for the new changes. If you feel we need to substantially rework any of the testing I'll probably need some support from you guys to get it done 🙂

Docs

Definitely a bit lacking. The trouble is I don't want to rewrite (or copy/paste) all the meta-mapping examples from the dbt page over to Kafka. Open to suggestions here - if there's potential for this 'meta-mapping pattern' to be used in a few different sources, perhaps we should standardize and have a dedicated page?

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 20, 2023
Adds meta-mapping to the Kafka source for Avro schemas. Like dbt's
implementation, it allows arbitrary metadata embedded in an Avro schema
to be mapped to DataHub Owners, Tags, and Terms.
@danielcmessias danielcmessias changed the title feat(ingest/kafka) meta-mapping for Avro + Kafka feat(ingest/kafka) meta-mapping for Kafka source w/ Avro schemas Feb 20, 2023
@danielcmessias danielcmessias marked this pull request as ready for review February 20, 2023 16:01
@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Feb 21, 2023
@jjoyce0510
Collaborator

Hi there!

Thanks for the contrib -- we will be looking into this shortly.

Cheers
John

@danielcmessias
Contributor Author

@jjoyce0510 @hsheth2 I'm also looking into whether it makes sense to add meta-mapping to the S3 recipe in the same way. This seems like a common pattern - when you already have a tagging system on the underlying tool and want a way to automatically translate those tags to DataHub metadata. Feel free to give me a ping on Slack if you guys have thoughts on this! 😃

@hsheth2
Collaborator

hsheth2 commented Sep 20, 2023

@danielcmessias given the merge conflicts on this one, we ended up adapting + making a bunch of tweaks in #8825

I'm going to close this in favor of the new PR

@hsheth2 hsheth2 closed this Sep 20, 2023
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata

4 participants