feat(ingest/unity): GE Profiling #8951
Conversation
broadly looks good - definitely don't like how much cruft/boilerplate code we have around profiling
@@ -329,6 +348,7 @@ def _should_ignore_column(self, sqlalchemy_type: sa.types.TypeEngine) -> bool:

    @_run_with_query_combiner
    def _get_column_type(self, column_spec: _SingleColumnSpec, column: str) -> None:
        # logger.info(f"{column} {self.dataset.columns} {self.dataset.columns[0]}")
let's remove this line
class UnityCatalogAnalyzeProfilerConfig(UnityCatalogConfig):
    method: Literal["analyze"] = "analyze"


# TODO: Reduce duplicate code with DataLakeProfilerConfig, GEProfilingConfig, SQLAlchemyConfig
yes please
            at=urlparse(self.workspace_url).netloc,
            db=None,
            uri_opts={
                "http_path": f"/sql/1.0/warehouses/{self.profiling.warehouse_id}"
This assumes use of a SQL warehouse. Looks like this may take different formats. Per the Databricks docs: "http-path is the HTTP Path either to a Databricks SQL endpoint (e.g. /sql/1.0/endpoints/1234567890abcdef), or to a Databricks Runtime interactive cluster (e.g. /sql/protocolv1/o/1234567890123456/1234-123456-slid123)."
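One way to accept both formats would be a small helper that prefers an explicit full path and otherwise derives the warehouse form. This is only a sketch; `build_http_path` and the `cluster_http_path` option are hypothetical names, not part of this PR:

```python
from typing import Optional

def build_http_path(warehouse_id: Optional[str] = None,
                    cluster_http_path: Optional[str] = None) -> str:
    """Hypothetical helper: pick the Databricks connection path."""
    if cluster_http_path:
        # Full path supplied directly, e.g. an interactive cluster:
        # "/sql/protocolv1/o/1234567890123456/1234-123456-slid123"
        return cluster_http_path
    if warehouse_id:
        # SQL warehouse form, as this PR currently assumes
        return f"/sql/1.0/warehouses/{warehouse_id}"
    raise ValueError("either warehouse_id or cluster_http_path is required")
```

The config would then pass whichever field the user set, instead of hard-coding the `/sql/1.0/warehouses/` prefix.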
Hmm, good point. It's going to be a bit annoying to support both profilers :|
Gonna hold off on this... want to get this in before I never get to it again
metadata-ingestion/src/datahub/ingestion/source/sql/sql_generic_profiler.py
@@ -70,7 +70,6 @@ def generate_profile_workunits(
        self,
        requests: List[TableProfilerRequest],
        max_workers: int,
we can force these to be kwargs for clarity
Suggested change:
-        max_workers: int,
+        *, max_workers: int,
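For reference, the bare `*` in the suggestion makes every parameter after it keyword-only, so call sites stay self-describing. A minimal sketch (the body and types here are placeholders, not the PR's actual implementation):

```python
from typing import List

def generate_profile_workunits(requests: List[str], *, max_workers: int) -> str:
    # Placeholder body: parameters after the bare `*` must be passed by name
    return f"profiling {len(requests)} tables with {max_workers} workers"

# OK: keyword form
generate_profile_workunits(["catalog.schema.t1"], max_workers=4)

# TypeError: passing max_workers positionally is rejected
# generate_profile_workunits(["catalog.schema.t1"], 4)
```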
        default_factory=OperationConfig,
        description="Experimental feature. To specify operation configs.",
    )
    method: str
how does this show up in the docs? does this need a Field(description="docs")?
Looks like it doesn't show up at all. I'll add a description but in general, our docs support for discriminated unions is not very good -- we don't show which type supports which options. I'll update example recipes to help here
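Attaching a `Field(description=...)` to the discriminating literal is straightforward; the class and field names below are illustrative stand-ins, not the PR's actual config classes:

```python
from typing import Literal, Union
from pydantic import BaseModel, Field

class GEProfilerConfig(BaseModel):
    method: Literal["ge"] = Field(
        "ge", description="Profile using the Great Expectations profiler."
    )

class AnalyzeProfilerConfig(BaseModel):
    method: Literal["analyze"] = Field(
        "analyze", description="Profile using ANALYZE TABLE statements."
    )

class SourceConfig(BaseModel):
    profiling: Union[GEProfilerConfig, AnalyzeProfilerConfig] = Field(
        default_factory=GEProfilerConfig,
        description="Profiling settings; `method` selects the profiler variant.",
    )
```

The `Literal` values let pydantic route the input dict to the right member of the union, though as noted above, the generated docs still won't show which options belong to which variant.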
        dataset_name,
        size_in_bytes=table.size_in_bytes,
        last_altered=table.last_altered,
        rows_count=0,  # Can't get row count ahead of time
shouldn't we pass None here?
Suggested change:
-        rows_count=0,  # Can't get row count ahead of time
+        rows_count=None,
Yeah, looks like we can do this now. We don't for some reason in sql_generic_profiler.py
        ):
            # Profile only table level if dataset is filtered from profiling
            # due to size limits alone
            if self.is_dataset_eligible_for_profiling(
isn't this the same as the call two lines above?
For this one, size_in_bytes is 0 lol
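The distinction turns on `0` being falsy in Python: a size guard written as a truthiness check lets a zero-byte (or unknown) size through, so the second call can succeed where the first, made with the real nonzero size, was filtered. A hypothetical sketch of the eligibility check (this is not the real implementation):

```python
MAX_SIZE_BYTES = 5 * 1024**3  # illustrative size limit

def is_dataset_eligible_for_profiling(size_in_bytes) -> bool:
    # Hypothetical body: an unknown (None) or zero size cannot fail the
    # size limit, so it is treated as eligible.
    if not size_in_bytes:  # catches both None and 0
        return True
    return size_in_bytes <= MAX_SIZE_BYTES
```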
        self.ddl = None


class UnityCatalogGEProfiler(GenericProfiler):
the amount of duplicated code here is not ideal - we definitely will need to revisit the GE profiler at some point
Yeah, ideally I'd merge this with Mayuri's changes but if I don't get this in now idk when it'll go in... don't really have time to iterate on this