feat(ingest/dremio): Dremio Source Ingestion #11598

Merged
Changes from 6 commits
Commits (48)
38e5946
feat: Dremio Source Ingestion
sagar-salvi-apptware Oct 11, 2024
a5b8c45
refactor: improvement to metadata gathering
sagar-salvi-apptware Oct 11, 2024
4a81b19
fix: Update dremio_entity.py
sagar-salvi-apptware Oct 11, 2024
51c2443
fix: Update dremio_sql_queries.py
sagar-salvi-apptware Oct 11, 2024
84227b6
refactor: Update dremio_source.py
sagar-salvi-apptware Oct 11, 2024
e462be6
feat: Dremio Source Ingestion
sagar-salvi-apptware Oct 11, 2024
703a3f8
refactor: improvement to metadata gathering
sagar-salvi-apptware Oct 11, 2024
4283cb6
fix: Update dremio_entity.py
sagar-salvi-apptware Oct 11, 2024
f897b9e
fix: Update dremio_sql_queries.py
sagar-salvi-apptware Oct 11, 2024
ac56777
refactor: Update dremio_source.py
sagar-salvi-apptware Oct 11, 2024
3ce57b9
test: add integration test for dremio
sagar-salvi-apptware Oct 17, 2024
e1ee817
fix: added minor changes + fix testcase
sagar-salvi-apptware Oct 17, 2024
b23e003
fix: PR Comments
sagar-salvi-apptware Oct 21, 2024
7247ac1
Merge branch 'feat/dremio-connector-source' of https://github.com/sag…
acrylJonny Oct 21, 2024
b9d7b8a
switch to drill dialect
acrylJonny Oct 21, 2024
230fbd7
Merge branch 'master' into feat/dremio-connector-source
acrylJonny Oct 21, 2024
d979d31
Update dremio_entities.py
acrylJonny Oct 21, 2024
9244e4c
Update dremio_entities.py
acrylJonny Oct 21, 2024
44853b8
Update datahub-web-react/src/app/ingest/source/builder/sources.json
acrylJonny Oct 21, 2024
fac705c
Update metadata-ingestion/docs/sources/dremio/README.md
acrylJonny Oct 21, 2024
29ba440
refactor: Dremio Authentication
sagar-salvi-apptware Oct 21, 2024
6574b63
Cite Dremio Docs for SchemaFieldTypeMapper
acrylJonny Oct 21, 2024
9835e31
test: fix ci test
sagar-salvi-apptware Oct 21, 2024
a6a9a91
fix: PR comments
sagar-salvi-apptware Oct 22, 2024
b0c8c08
fix: dataset_pattern changes for tables and views
sagar-salvi-apptware Oct 23, 2024
ff211af
add warnings when unable to parse sql query
acrylJonny Oct 23, 2024
459cfc8
add view definition aspect
acrylJonny Oct 23, 2024
224ec10
bug fix - external urls. Improve Dremio Cloud API support for projects
acrylJonny Oct 24, 2024
0982a1b
fix: ci test
sagar-salvi-apptware Oct 24, 2024
3ce5e02
docs: minor docs changes per pr comments
sagar-salvi-apptware Oct 25, 2024
9c1e3fb
fix: PR comments
sagar-salvi-apptware Oct 25, 2024
7b938b2
docs: minor changes
sagar-salvi-apptware Oct 25, 2024
6914f75
build: added dependency for sql
sagar-salvi-apptware Oct 25, 2024
3176e71
test: fix ci test
sagar-salvi-apptware Oct 25, 2024
9c1c556
fix: add minor change in report failure
sagar-salvi-apptware Oct 25, 2024
6670a98
fix: stateful ingestion error
sagar-salvi-apptware Oct 25, 2024
3e2eae7
fix: PR Comments and added mysql as a source to test
sagar-salvi-apptware Oct 27, 2024
195f78c
fix: PR Comments
sagar-salvi-apptware Oct 28, 2024
9de8be0
test: Updated the test for platform instance
sagar-salvi-apptware Oct 28, 2024
102b55f
Merge branch 'master' into feat/dremio-connector-source
sagar-salvi-apptware Oct 28, 2024
8efa349
fix: minor comments
sagar-salvi-apptware Oct 28, 2024
82988fd
test: fix ci test
sagar-salvi-apptware Oct 28, 2024
7433b8b
fix: added fixes for dremio cloud apis
sagar-salvi-apptware Oct 28, 2024
dca1077
fix: added changes for profiling
sagar-salvi-apptware Oct 29, 2024
42e6ec2
test: updated tests for profiling
sagar-salvi-apptware Oct 30, 2024
e27d542
fix: cloud region
sagar-salvi-apptware Oct 31, 2024
2f8d07c
fix: minor changes
sagar-salvi-apptware Nov 5, 2024
ad405a9
Merge branch 'master' into feat/dremio-connector-source
mayurinehate Nov 5, 2024
datahub-web-react/src/app/ingest/source/builder/sources.json
@@ -309,6 +309,6 @@
"displayName": "Dremio",
"description": "Import Spaces, Sources, Tables and statistics from Dremio.",
"docsUrl": "https://datahubproject.io/docs/metadata-ingestion/",
"recipe": "source:\n type: dremio\n config:\n # Coordinates\n hostname: null\n port: null\n #true if https, otherwise false\n tls: true\n\n #Credentials with personal access token\n authentication_method: PAT\n password: pass\n\n #Or Credentials with basic auth\n #authentication_method: password\n #username: null\n #password: null\n\n stateful_ingestion:\n enabled: true"
"recipe": "source:\n type: dremio\n config:\n # Coordinates\n hostname: null\n port: null\n #true if https, otherwise false\n tls: true\n\n #For cloud instance\n #is_dremio_cloud: True\n #dremio_cloud_project_id: <project_id>\n\n #Credentials with personal access token\n authentication_method: PAT\n password: pass\n\n #Or Credentials with basic auth\n #authentication_method: password\n #username: null\n #password: null\n\n stateful_ingestion:\n enabled: true"
}
]
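For readability, the escaped recipe string above unpacks to roughly the following YAML; hostname, port, and the cloud project ID are placeholders the user fills in:

```yaml
source:
  type: dremio
  config:
    # Coordinates
    hostname: null
    port: null
    # true if https, otherwise false
    tls: true

    # For cloud instance
    #is_dremio_cloud: True
    #dremio_cloud_project_id: <project_id>

    # Credentials with personal access token
    authentication_method: PAT
    password: pass

    # Or credentials with basic auth
    #authentication_method: password
    #username: null
    #password: null

    stateful_ingestion:
      enabled: true
```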
4 changes: 4 additions & 0 deletions metadata-ingestion/docs/sources/dremio/dremio_recipe.yml
@@ -14,6 +14,10 @@ source:
# username: user
# password: pass

#For cloud instance
#is_dremio_cloud: True
#dremio_cloud_project_id: <project_id>

include_query_lineage: True

#Optional
@@ -266,15 +266,11 @@ def execute_query(self, query: str, timeout: int = 300) -> List[Dict[str, Any]]:
return future.result(timeout=timeout)
except concurrent.futures.TimeoutError:
self.cancel_query(job_id)
self.report.failure(
f"Query execution timed out after {timeout} seconds"
)
raise TimeoutError(
raise DremioAPIException(
f"Query execution timed out after {timeout} seconds"
)
except RuntimeError as e:
self.report.failure("Query Execution failed", exc=e)
raise DremioAPIException("Query Execution failed: {str(e)}")
raise DremioAPIException(f"{str(e)}")

except requests.RequestException as e:
raise DremioAPIException(f"Error executing query: {str(e)}")
@@ -490,16 +486,15 @@ def get_all_tables_and_columns(self, containers: Deque) -> List[Dict]:
deny_schema_pattern=deny_schema_condition,
container_name=schema.container_name.lower(),
)

all_tables_and_columns.extend(
self.execute_query(
query=formatted_query,
)
)
except Exception as exc:
logger.debug(f"Failed with {exc}")
except DremioAPIException as e:
self.report.warning(
f"{schema.subclass} {schema.container_name} had no tables or views"
title=f"{schema.subclass} {schema.container_name} had no tables or views",
message=str(e),
)

tables = []
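The second hunk applies the same idea on the calling side: per-schema metadata queries catch DremioAPIException and record a structured warning (a title plus the exception message) on the ingestion report, rather than burying the failure in a logger.debug call. A hedged sketch of that loop, reusing the DremioAPIException placeholder from the sketch above, with the report object, schema fields, and query builder simplified:

```python
from typing import Any, Dict, List


def build_columns_query(schema) -> str:
    # Placeholder for the real INFORMATION_SCHEMA query the source formats per container.
    return f"/* columns query for {schema.container_name} */"


def collect_tables_and_columns(client, report, schemas) -> List[Dict[str, Any]]:
    """Sketch of the per-schema loop; client.execute_query raises DremioAPIException on failure."""
    all_tables_and_columns: List[Dict[str, Any]] = []
    for schema in schemas:
        try:
            all_tables_and_columns.extend(
                client.execute_query(query=build_columns_query(schema))
            )
        except DremioAPIException as e:
            # Surface the failure in the source report instead of dropping it in debug logs.
            report.warning(
                title=f"{schema.subclass} {schema.container_name} had no tables or views",
                message=str(e),
            )
    return all_tables_and_columns
```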
@@ -21,7 +21,6 @@
DremioDatasetType,
DremioGlossaryTerm,
)
from datahub.ingestion.source.dremio.dremio_profiling import DremioProfiler
from datahub.metadata.schema_classes import (
ArrayTypeClass,
AuditStampClass,
@@ -141,19 +140,15 @@ class DremioAspects:
def __init__(
self,
platform: str,
profiler: DremioProfiler,
ui_url: str,
env: str,
domain: Optional[str] = None,
platform_instance: Optional[str] = None,
profiling_enabled: bool = False,
):
self.platform = platform
self.platform_instance = platform_instance
self.env = env
self.domain = domain
self.profiler = profiler
self.profiling_enabled = profiling_enabled
self.ui_url = ui_url

def get_container_key(
@@ -176,12 +171,6 @@ def get_container_urn(
container_key = self.get_container_key(name, path)
return container_key.as_urn()

def get_container_space_urn(self) -> str:
return self.get_container_urn(name="Spaces", path=[])

def get_container_source_urn(self) -> str:
return self.get_container_urn(name="Sources", path=[])

def create_domain_aspect(self) -> Optional[_Aspect]:
if self.domain:
if self.domain.startswith("urn:li:domain:"):
@@ -225,16 +214,6 @@ def populate_container_mcp(
)
yield mcp.as_workunit()

# Container Class for Spaces and Sources
if not container.path:
container_class = self._create_container_class_containers(container)
if container_class:
mcp = MetadataChangeProposalWrapper(
entityUrn=container_urn,
aspect=container_class,
)
yield mcp.as_workunit()

# Data Platform Instance
data_platform_instance = self._create_data_platform_instance()
if data_platform_instance:
@@ -332,15 +311,6 @@ def populate_dataset_mcp(
)
yield mcp.as_workunit()

if self.profiling_enabled:
profile_data = dataset.get_profile_data(self.profiler)
profile_aspect = self.populate_profile_aspect(profile_data)
if profile_aspect:
mcp = MetadataChangeProposalWrapper(
entityUrn=dataset_urn,
aspect=profile_aspect,
)
yield mcp.as_workunit()
else:
logger.warning(
f"Dataset {dataset.path}.{dataset.resource_name} has not been queried in Dremio"
@@ -409,16 +379,6 @@ def _create_container_class(
return ContainerClass(container=self.get_container_urn(path=entity.path))
return None

def _create_container_class_containers(
self, entity: DremioContainer
) -> Optional[ContainerClass]:
if entity.subclass == "Dremio Space":
return ContainerClass(container=self.get_container_space_urn())
elif entity.subclass == "Dremio Source":
return ContainerClass(container=self.get_container_source_urn())

return None

def _create_data_platform_instance(self) -> DataPlatformInstanceClass:
return DataPlatformInstanceClass(
platform=f"urn:li:dataPlatform:{self.platform}",
@@ -516,12 +476,6 @@ def _create_schema_field(self, column: DremioDatasetColumn) -> SchemaFieldClass:
nullable=column.is_nullable == "YES",
)

def _get_profile_data(self, dataset: DremioDataset) -> Dict:
return self.profiler.profile_table(
f"{'.'.join(dataset.path)}.{dataset.resource_name}",
[(col.name, col.data_type) for col in dataset.columns],
)

def _create_view_properties(
self, dataset: DremioDataset
) -> Optional[ViewPropertiesClass]:
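The aspect-generator changes above decouple profiling from DremioAspects: the constructor drops its profiler and profiling_enabled parameters, profile-aspect emission moves out of populate_dataset_mcp, and the synthetic top-level "Spaces"/"Sources" parent containers are no longer attached. After this change the aspect builder is constructed from connection metadata alone, roughly as sketched below; the module path and coordinate values are assumptions for illustration, not code from the PR.

```python
# Assumed import path for the aspects module added by this PR.
from datahub.ingestion.source.dremio.dremio_aspects import DremioAspects

aspects = DremioAspects(
    platform="dremio",
    ui_url="https://dremio.example.internal:9047",  # placeholder UI coordinates
    env="PROD",
    domain=None,             # or a domain urn/name to attach to emitted entities
    platform_instance=None,  # optional platform instance qualifier
)
```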
@@ -1,8 +1,10 @@
import datetime
import os
from typing import List, Literal, Optional

import certifi
from pydantic import Field, root_validator, validator
import pydantic
from pydantic import Field, validator

from datahub.configuration.common import AllowDenyPattern, ConfigModel
from datahub.configuration.source_common import (
@@ -17,6 +19,7 @@
StatefulIngestionConfigBase,
)
from datahub.ingestion.source.usage.usage_common import BaseUsageConfig
from datahub.ingestion.source_config.operation_config import is_profiling_enabled


class DremioConnectionConfig(ConfigModel):
@@ -65,9 +68,9 @@ class DremioConnectionConfig(ConfigModel):
description="Whether this is a Dremio Cloud instance",
)

dremio_cloud_region: Literal["US", "EMEA"] = Field(
dremio_cloud_region: Literal["US", "EU"] = Field(
default="US",
description="Dremio Cloud region ('US' or 'EMEA')",
description="Dremio Cloud region ('US' or 'EU')",
)

dremio_cloud_project_id: Optional[str] = Field(
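Note the region literal changes from "EMEA" to "EU", apparently to match Dremio Cloud's own region naming. As a purely illustrative assumption (the connector's actual endpoint handling lives in its API client, which is not part of this hunk), the region flag typically selects a region-specific control-plane base URL along these lines:

```python
# Assumption for illustration: region-specific Dremio Cloud control-plane URLs.
# Verify against Dremio Cloud documentation before relying on these values.
DREMIO_CLOUD_API_BASE = {
    "US": "https://api.dremio.cloud",
    "EU": "https://api.eu.dremio.cloud",
}


def cloud_api_base(dremio_cloud_region: str) -> str:
    return DREMIO_CLOUD_API_BASE[dremio_cloud_region]
```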
@@ -94,14 +97,7 @@ def validate_password(cls, value, values):


class ProfileConfig(GEProfilingConfig):
partition_profiling_enabled: bool = Field(
default=False,
description="Partition profiling disabled for Dremio.",
)
include_field_median_value: bool = Field(
default=False,
description="Median causes a number of issues in Dremio.",
)

query_timeout: int = Field(
default=300, description="Time before cancelling Dremio profiling query"
)
@@ -110,6 +106,70 @@ class ProfileConfig(GEProfilingConfig):
column_count: bool = True
sample_values: bool = True

# Below Configs inherited from GEProfilingConfig
# but not used in Dremio so we hide them from docs.
include_field_median_value: bool = Field(
default=False,
hidden_from_docs=True,
description="Median causes a number of issues in Dremio.",
)
partition_profiling_enabled: bool = Field(default=True, hidden_from_docs=True)
profile_table_row_count_estimate_only: bool = Field(
default=False, hidden_from_docs=True
)
query_combiner_enabled: bool = Field(default=True, hidden_from_docs=True)
max_number_of_fields_to_profile: Optional[pydantic.PositiveInt] = Field(
default=None, hidden_from_docs=True
)
profile_if_updated_since_days: Optional[pydantic.PositiveFloat] = Field(
default=None, hidden_from_docs=True
)
profile_table_size_limit: Optional[int] = Field(
default=5,
description="Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`",
hidden_from_docs=True,
)

profile_table_row_limit: Optional[int] = Field(
default=5000000,
hidden_from_docs=True,
description="Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`",
)

partition_datetime: Optional[datetime.datetime] = Field(
default=None,
hidden_from_docs=True,
description="If specified, profile only the partition which matches this datetime. "
"If not specified, profile the latest partition. Only Bigquery supports this.",
)
use_sampling: bool = Field(
default=True,
hidden_from_docs=True,
description="Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. "
"If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
)

sample_size: int = Field(
default=10000,
hidden_from_docs=True,
description="Number of rows to be sampled from table for column level profiling."
"Applicable only if `use_sampling` is set to True.",
)
profile_external_tables: bool = Field(
default=False,
hidden_from_docs=True,
description="Whether to profile external tables. Only Snowflake and Redshift supports this.",
)

tags_to_ignore_sampling: Optional[List[str]] = pydantic.Field(
default=None,
hidden_from_docs=True,
description=(
"Fixed list of tags to ignore sampling."
" If not specified, tables will be sampled based on `use_sampling`."
),
)

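The long run of field overrides in ProfileConfig above exists only to hide inherited GEProfilingConfig options that have no effect on Dremio: each one is re-declared with the same (or an adjusted) default plus hidden_from_docs=True, which DataHub's config documentation generator uses to omit the option from the connector docs. A stripped-down sketch of the pattern, assuming pydantic v1 semantics where unrecognized Field keyword arguments are kept as field extras:

```python
import pydantic
from pydantic import Field


class BaseProfilingConfig(pydantic.BaseModel):
    # An inherited option that makes sense for most warehouses...
    use_sampling: bool = Field(
        default=True, description="Profile column stats on a sample of the table."
    )


class ConnectorProfilingConfig(BaseProfilingConfig):
    # ...but not for this connector, so re-declare it with hidden_from_docs=True
    # to keep it out of the generated configuration docs.
    use_sampling: bool = Field(default=True, hidden_from_docs=True)
```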

class DremioSourceMapping(EnvConfigMixin, PlatformInstanceConfigMixin, ConfigModel):
platform: str = Field(
@@ -119,15 +179,6 @@ class DremioSourceMapping(EnvConfigMixin, PlatformInstanceConfigMixin, ConfigModel):
description="Alias of platform in Dremio connection",
)

@root_validator
def check_both_fields_present(cls, values):
platform, source_name = values.get("platform"), values.get("source_name")
if not platform or not source_name:
raise ValueError(
"Both 'platform' and 'source_name' must be provided in source_mappings."
)
return values


class DremioSourceConfig(
DremioConnectionConfig,
@@ -174,6 +225,11 @@ class DremioSourceConfig(
description="Configuration for profiling",
)

def is_profiling_enabled(self) -> bool:
return self.profiling.enabled and is_profiling_enabled(
self.profiling.operation_config
)

# Advanced Configs
max_workers: int = Field(
default=5 * (os.cpu_count() or 4),
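The new is_profiling_enabled() helper gates profiling on two things: the profiling.enabled flag and the shared schedule check imported from datahub.ingestion.source_config.operation_config. In recipe terms, that enables configurations roughly like the sketch below; the field names follow DataHub's common operation_config, and the day-of-week value is only an example.

```yaml
profiling:
  enabled: true
  operation_config:
    # Only run the (expensive) profiling pass on a reduced schedule.
    lower_freq_profile_enabled: true
    profile_day_of_week: 4   # 0 = Monday, so 4 = Friday
```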
dremio_entities.py
@@ -16,7 +16,6 @@
DremioEdition,
DremioEntityContainerType,
)
from datahub.ingestion.source.dremio.dremio_profiling import DremioProfiler

logger = logging.getLogger(__name__)

@@ -267,11 +266,6 @@ def __init__(
dataset_id=self.resource_id,
)

def get_profile_data(self, profiler: DremioProfiler) -> Dict:
full_table_name = '"' + '"."'.join(self.path) + '"."' + self.resource_name + '"'
columns = [(col.name, col.data_type) for col in self.columns]
return profiler.profile_table(full_table_name, columns)


class DremioContainer:
subclass: str = "Dremio Container"