
feat(ingest/kafka): support metadata mapping from kafka avro schemas #8825

Merged

Conversation

@mayurinehate (Collaborator) commented on Sep 12, 2023

Adaptation of PR #7381, with a slightly different implementation and more tests. Please refer to the original PR for a detailed description of the feature.

In short, this PR allows dbt-style meta-mapping for dataset- and field-level additional properties present in the Avro schema of a Kafka topic. It also allows translating tags already listed in the Avro schema into DataHub tags.
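For readers unfamiliar with the dbt-style convention, here is a hedged sketch of what such a mapping could look like, written as the Python-dict equivalent of an ingestion recipe. The field_meta_mapping and tag_prefix keys appear in the code under review below; the meta_mapping block and the match/operation key names are assumed to follow the existing dbt meta-mapping convention and should be checked against the merged Kafka source docs.

```python
# Illustrative sketch only, not the exact merged configuration schema.
# field_meta_mapping and tag_prefix are visible in the diff below; the
# meta_mapping block mirrors the dbt-style convention the description references.
kafka_source_config = {
    "connection": {"bootstrap": "localhost:9092"},
    # Dataset-level: map custom properties found in the topic's Avro schema.
    "meta_mapping": {
        "owner": {
            "match": "^@(.*)",
            "operation": "add_owner",
            "config": {"owner_type": "user"},
        },
        "has_pii": {
            "match": True,
            "operation": "add_tag",
            "config": {"tag": "has_pii"},
        },
    },
    # Field-level: map custom properties found on individual Avro fields.
    "field_meta_mapping": {
        "deprecated": {
            "match": ".*",
            "operation": "add_tag",
            "config": {"tag": "deprecated"},
        },
    },
    # Optional prefix applied to tags pulled from the schema (see discussion below).
    "tag_prefix": "",
}
```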

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) label on Sep 12, 2023
@hsheth2 (Collaborator) commented on Sep 19, 2023

@mayurinehate fyi there's a test failing on this one

I'll try to review this in the next two days

@mayurinehate (Collaborator, Author) replied:

> @mayurinehate fyi there's a test failing on this one
>
> I'll try to review this in the next two days

I've fixed the test. Thank you for letting me know.

self.field_meta_processor = OperationProcessor(
    self.source_config.field_meta_mapping,
    self.source_config.tag_prefix,
    "SOURCE_CONTROL",
mayurinehate (Collaborator, Author):

I'm not sure what this should be. "SOURCE_CONTROL" is probably okay for dbt. Should we set the ownership source to "OTHER" or "MANUAL" here?

hsheth2 (Collaborator):

let's use OwnershipSourceTypeClass.SERVICE

mayurinehate (Collaborator, Author):

done
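As a reference, a minimal sketch of what the resolved call likely looks like after this change, assuming OperationProcessor keeps the same positional arguments as in the snippet above and that OwnershipSourceTypeClass comes from datahub.metadata.schema_classes:

```python
# Sketch only: the hard-coded "SOURCE_CONTROL" string replaced with the
# SERVICE ownership source type. Import paths are assumptions based on how
# these classes are used elsewhere in datahub's ingestion code.
from datahub.metadata.schema_classes import OwnershipSourceTypeClass
from datahub.utilities.mapping import OperationProcessor

field_meta_mapping = {}  # stands in for self.source_config.field_meta_mapping
tag_prefix = ""          # stands in for self.source_config.tag_prefix

field_meta_processor = OperationProcessor(
    field_meta_mapping,
    tag_prefix,
    OwnershipSourceTypeClass.SERVICE,
)
```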

)
tag_prefix: str = pydantic.Field(
    default="", description="Prefix added to tags during ingestion."
)
mayurinehate (Collaborator, Author):

The default is dbt: for the dbt source. Do we need to set the default to kafka: here?

hsheth2 (Collaborator):

Honestly, I'm not sure why we added tag prefix at all.

If we did want to support it, it should be implemented as a workunit helper, right?

So anyway, empty is fine for now, with the goal of replacing that field with something cleaner in the future.

mayurinehate (Collaborator, Author):

I'm not sure either. Do you mean a workunit processor? Yes, that would be better.

Also, we can remove the prefix handling from kafka if it's not required.
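For context, a small illustration of what the tag_prefix field does, assuming the prefix is simply prepended to each tag name extracted from the schema:

```python
# Illustration only: how a non-empty tag_prefix would be applied to tags read
# from the Avro schema. The exact application point inside the source is assumed.
tag_prefix = "kafka:"
schema_tags = ["pii", "deprecated"]

prefixed_tags = [f"{tag_prefix}{tag}" for tag in schema_tags]
print(prefixed_tags)  # -> ['kafka:pii', 'kafka:deprecated']
```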

):
    self._schema = schema
    self._actual_schema = actual_schema
    self._converter = converter
    self._description = description
    self._default_value = default_value
    self._meta_mapping_processor = meta_mapping_processor
    self._schema_tags_field = schema_tags_field
    self._tag_prefix = tag_prefix
hsheth2 (Collaborator):

we can get these using converter._something right? is there a reason to pass them explicitly?

@mayurinehate (Collaborator, Author) commented on Sep 21, 2023:

That makes sense; there is no reason to pass them explicitly. Removed them.
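A hypothetical sketch of the simplification discussed here: rather than passing the processor, tag field, and prefix explicitly, the wrapper can reach them through the converter it already holds. The class name and the converter's private attribute names below are assumptions for illustration only.

```python
from typing import Any, Optional


class SchemaFieldWrapper:  # hypothetical name for the class shown in the diff
    def __init__(
        self,
        schema: Any,
        actual_schema: Any,
        converter: Any,
        description: Optional[str] = None,
        default_value: Optional[Any] = None,
    ):
        self._schema = schema
        self._actual_schema = actual_schema
        self._converter = converter
        self._description = description
        self._default_value = default_value

    @property
    def _meta_mapping_processor(self) -> Any:
        # Read from the converter instead of storing a duplicate reference.
        return self._converter._meta_mapping_processor

    @property
    def _schema_tags_field(self) -> Any:
        return self._converter._schema_tags_field

    @property
    def _tag_prefix(self) -> str:
        return self._converter._tag_prefix
```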

try:
    raw_props_value = reduce(
        operator.getitem, operation_key.split("."), raw_props
    )
hsheth2 (Collaborator):

a comment would be helpful here
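For illustration, a hedged sketch of what the requested comment might explain: the dotted operation key is used to walk the nested dictionaries of the schema's raw properties. operation_key and raw_props come from the snippet above; the sample values are made up.

```python
import operator
from functools import reduce

# Sample values for illustration only.
operation_key = "owner.team"
raw_props = {"owner": {"team": "data-platform"}}

try:
    # A dotted key such as "owner.team" addresses a nested property:
    # reduce(operator.getitem, ["owner", "team"], raw_props) is equivalent
    # to raw_props["owner"]["team"]. A missing segment raises KeyError (or
    # TypeError if an intermediate value is not a mapping).
    raw_props_value = reduce(
        operator.getitem, operation_key.split("."), raw_props
    )
except (KeyError, TypeError):
    raw_props_value = None

print(raw_props_value)  # -> data-platform
```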

@hsheth2 added the merge-pending-ci (A PR that has passed review and should be merged once CI is green.) label on Sep 21, 2023
@hsheth2 (Collaborator) commented on Sep 23, 2023

CI failure is unrelated

@hsheth2 merged commit 5c40390 into datahub-project:master on Sep 23, 2023
56 of 57 checks passed