Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion/snowflake): use user email-id in urn generation for top users stat #8513

Conversation

siddiquebagwan
Copy link
Contributor

No description provided.

@siddiquebagwan siddiquebagwan changed the title feat(ingestion/snowflake): use user email-id in urn generation for top uesrs stat feat(ingestion/snowflake): use user email-id in urn generation for top users stat Jul 26, 2023
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 26, 2023
Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Just some docs and code structure questions / requests

@@ -120,6 +120,11 @@ class SnowflakeV2Config(
"upstreams_deny_pattern", "temporary_tables_pattern"
)

email_as_user_identifier: bool = Field(
default=False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this default True and add a breaking change in docs/how/updating-datahub.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -120,6 +120,11 @@ class SnowflakeV2Config(
"upstreams_deny_pattern", "temporary_tables_pattern"
)

email_as_user_identifier: bool = Field(
default=False,
description="When enabled user email id will be used in user urn generation for stat",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does stat mean here?

Suggested change
description="When enabled user email id will be used in user urn generation for stat",
description="Format user urns as an email, if the snowflake user's email is set. If `email_domain` is provided, generates email addresses for snowflake users with unset emails, based on their username.",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -247,7 +249,9 @@ def _map_top_sql_queries(self, top_sql_queries: Dict) -> List[str]:
]
)

def _map_user_counts(self, user_counts: Dict) -> List[DatasetUserUsageCounts]:
def _map_user_counts(
self, user_counts: Dict, email_as_user_identifier: bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just use self.email_as_user_identifier in here and not pass the parameter? Don't feel strongly here at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

user_urn = make_user_urn(self.get_user_identifier(user_name, user_email))
user_urn = make_user_urn(
self.get_user_identifier(
user_name, user_email, self.config.email_as_user_identifier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add logic to generate an email from self.config.email_domain if it's set, like we do in _map_user_counts? Perhaps it'd be best to put that logic in get_user_identifier instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep it as is. The current method uses the user_email to determine whether to skip the current metadata (i.e continue in for loop)

Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@asikowitz asikowitz added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Aug 2, 2023
@anshbansal anshbansal merged commit 6a36118 into datahub-project:master Aug 3, 2023
43 checks passed
yoonhyejin pushed a commit that referenced this pull request Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants