feat(ingest/unity): GE Profiling #8951
Conversation
broadly looks good - definitely don't like how much cruft/boilerplate code we have around profiling
@@ -329,6 +348,7 @@ def _should_ignore_column(self, sqlalchemy_type: sa.types.TypeEngine) -> bool:

    @_run_with_query_combiner
    def _get_column_type(self, column_spec: _SingleColumnSpec, column: str) -> None:
        # logger.info(f"{column} {self.dataset.columns} {self.dataset.columns[0]}")
let's remove this line
class UnityCatalogAnalyzeProfilerConfig(UnityCatalogConfig):
    method: Literal["analyze"] = "analyze"


# TODO: Reduce duplicate code with DataLakeProfilerConfig, GEProfilingConfig, SQLAlchemyConfig
yes please
            at=urlparse(self.workspace_url).netloc,
            db=None,
            uri_opts={
                "http_path": f"/sql/1.0/warehouses/{self.profiling.warehouse_id}"
This assumes use of a SQL warehouse. Looks like this may take different formats. Per the Databricks docs: "http-path is the HTTP Path either to a Databricks SQL endpoint (e.g. /sql/1.0/endpoints/1234567890abcdef), or to a Databricks Runtime interactive cluster (e.g. /sql/protocolv1/o/1234567890123456/1234-123456-slid123)."
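One way to accept both formats would be a small helper that prefers an explicit full path and otherwise derives the warehouse form. This is only a sketch; `build_http_path` and the `cluster_http_path` option are hypothetical names, not part of this PR:

```python
from typing import Optional

def build_http_path(warehouse_id: Optional[str] = None,
                    cluster_http_path: Optional[str] = None) -> str:
    """Hypothetical helper: pick the Databricks connection path."""
    if cluster_http_path:
        # Full path supplied directly, e.g. an interactive cluster:
        # "/sql/protocolv1/o/1234567890123456/1234-123456-slid123"
        return cluster_http_path
    if warehouse_id:
        # SQL warehouse form, as this PR currently assumes
        return f"/sql/1.0/warehouses/{warehouse_id}"
    raise ValueError("either warehouse_id or cluster_http_path is required")
```

The config would then pass whichever field the user set, instead of hard-coding the `/sql/1.0/warehouses/` prefix.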
Hmm, good point. It's going to be a bit annoying to support both profilers :|
Gonna hold off on this... want to get this in before I never get to it again
metadata-ingestion/src/datahub/ingestion/source/sql/sql_generic_profiler.py
@@ -70,7 +70,6 @@ def generate_profile_workunits(
        self,
        requests: List[TableProfilerRequest],
        max_workers: int,
we can force these to be kwargs for clarity
Suggested change:
-        max_workers: int,
+        *, max_workers: int,
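For reference, the bare `*` in the suggestion makes every parameter after it keyword-only, so call sites stay self-describing. A minimal sketch (the body and types here are placeholders, not the PR's actual implementation):

```python
from typing import List

def generate_profile_workunits(requests: List[str], *, max_workers: int) -> str:
    # Placeholder body: parameters after the bare `*` must be passed by name
    return f"profiling {len(requests)} tables with {max_workers} workers"

# OK: keyword form
generate_profile_workunits(["catalog.schema.t1"], max_workers=4)

# TypeError: passing max_workers positionally is rejected
# generate_profile_workunits(["catalog.schema.t1"], 4)
```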
        default_factory=OperationConfig,
        description="Experimental feature. To specify operation configs.",
    )
    method: str
how does this show up in the docs? does this need a Field(description="docs")?
Looks like it doesn't show up at all. I'll add a description but in general, our docs support for discriminated unions is not very good -- we don't show which type supports which options. I'll update example recipes to help here
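Attaching a `Field(description=...)` to the discriminating literal is straightforward; the class and field names below are illustrative stand-ins, not the PR's actual config classes:

```python
from typing import Literal, Union
from pydantic import BaseModel, Field

class GEProfilerConfig(BaseModel):
    method: Literal["ge"] = Field(
        "ge", description="Profile using the Great Expectations profiler."
    )

class AnalyzeProfilerConfig(BaseModel):
    method: Literal["analyze"] = Field(
        "analyze", description="Profile using ANALYZE TABLE statements."
    )

class SourceConfig(BaseModel):
    profiling: Union[GEProfilerConfig, AnalyzeProfilerConfig] = Field(
        default_factory=GEProfilerConfig,
        description="Profiling settings; `method` selects the profiler variant.",
    )
```

The `Literal` values let pydantic route the input dict to the right member of the union, though as noted above, the generated docs still won't show which options belong to which variant.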
        dataset_name,
        size_in_bytes=table.size_in_bytes,
        last_altered=table.last_altered,
        rows_count=0,  # Can't get row count ahead of time
shouldn't we pass None here?
Suggested change:
-        rows_count=0,  # Can't get row count ahead of time
+        rows_count=None,
Yeah, looks like we can do this now. We don't for some reason in sql_generic_profiler.py
        ):
            # Profile only table level if dataset is filtered from profiling
            # due to size limits alone
            if self.is_dataset_eligible_for_profiling(
isn't this the same as the call two lines above?
For this one, size_in_bytes is 0 lol
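The distinction turns on `0` being falsy in Python: a size guard written as a truthiness check lets a zero-byte (or unknown) size through, so the second call can succeed where the first, made with the real nonzero size, was filtered. A hypothetical sketch of the eligibility check (this is not the real implementation):

```python
MAX_SIZE_BYTES = 5 * 1024**3  # illustrative size limit

def is_dataset_eligible_for_profiling(size_in_bytes) -> bool:
    # Hypothetical body: an unknown (None) or zero size cannot fail the
    # size limit, so it is treated as eligible.
    if not size_in_bytes:  # catches both None and 0
        return True
    return size_in_bytes <= MAX_SIZE_BYTES
```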
        self.ddl = None


class UnityCatalogGEProfiler(GenericProfiler):
the amount of duplicated code here is not ideal - we definitely will need to revisit the GE profiler at some point
Yeah, ideally I'd merge this with Mayuri's changes but if I don't get this in now idk when it'll go in... don't really have time to iterate on this