fix(ingest/unity): Remove metastore from ingestion and urns; standardize platform instance; add notebook filter #8943
@@ -3,12 +3,19 @@
This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.

## Next

- #8943 - There is a new config param, `include_metastore`, that provides the option of not ingesting the Unity Catalog metastore associated with your Databricks workspace. We recommend setting this to `false`. However, if you have previously ingested from Unity Catalog, setting this to `false` is a breaking change; see the Breaking Changes section for details.

### Breaking Changes

- #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
- #8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
- #8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include `pip install 'acryl-datahub-airflow-plugin[plugin-v2]'`. To continue using the v1 plugin, set the `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN` environment variable to `true`.
- #8943 - All Unity Catalog urns are changed if a new config param, `include_metastore`, is set to `false`. This is set to `true` by default at the moment, but the default will change in the future. To handle the change in urns, we recommend soft deleting all Databricks data via the DataHub CLI: `datahub delete --platform databricks --soft`, and then re-ingesting from Unity Catalog with `include_metastore: false`.

Review comment: add a note that if you have stateful ingest enabled, you don't need to do anything.
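To make the urn change concrete, here is a hedged sketch. The metastore, catalog, schema, and table names are invented, and the exact urn layout is an assumption based on this PR's intent, not confirmed output:

```python
# Illustrative only: names below are made up for demonstration.

# With include_metastore: true, dataset urns embed the metastore name:
old_urn = (
    "urn:li:dataset:(urn:li:dataPlatform:databricks,"
    "my_metastore.my_catalog.my_schema.my_table,PROD)"
)

# With include_metastore: false, the metastore prefix is dropped:
new_urn = (
    "urn:li:dataset:(urn:li:dataPlatform:databricks,"
    "my_catalog.my_schema.my_table,PROD)"
)

# Because the urns differ, re-ingesting with include_metastore: false creates
# new entities; soft delete the old ones first:
#   datahub delete --platform databricks --soft
assert old_urn != new_urn
```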

### Potential Downtime

@@ -1,3 +1,4 @@
import logging
import os
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

@@ -22,6 +23,8 @@
    is_profiling_enabled,
)

logger = logging.getLogger(__name__)


class UnityCatalogProfilerConfig(ConfigModel):
    # TODO: Reduce duplicate code with DataLakeProfilerConfig, GEProfilingConfig, SQLAlchemyConfig
@@ -97,9 +100,20 @@ class UnityCatalogSourceConfig(
        description="Name of the workspace. Default to deployment name present in workspace_url",
    )

    ingest_data_platform_instance_aspect: Optional[bool] = pydantic.Field(
        default=False,
        description="Option to enable/disable ingestion of the data platform instance aspect. The default data platform instance id for a dataset is workspace_name",
    include_metastore: bool = pydantic.Field(
        default=True,
        description=(
            "Whether to ingest the workspace's metastore as a container and include it in all urns."
            " Changing this will affect the urns of all entities in the workspace."
            " This will be disabled by default in the future,"
            " so it is recommended to set this to False for new ingestions."
            " If you have an existing unity catalog ingestion, we recommend deleting existing data"
            " via the cli: `datahub delete --platform databricks` and re-ingesting."
        ),
    )

    _ingest_data_platform_instance_aspect_removed = pydantic_removed_field(
        "ingest_data_platform_instance_aspect"
    )

    _only_ingest_assigned_metastore_removed = pydantic_removed_field(
@@ -122,6 +136,16 @@ class UnityCatalogSourceConfig(
        default=AllowDenyPattern.allow_all(),
        description="Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in `catalog.schema.table` format. e.g. to match all tables starting with customer in Customer catalog and public schema, use the regex `Customer\\.public\\.customer.*`.",
    )

    notebook_pattern: AllowDenyPattern = Field(
        default=AllowDenyPattern.allow_all(),
        description=(
            "Regex patterns for notebooks to filter in ingestion, based on notebook *path*."
            " Specify regex to match the entire notebook path in `/<dir>/.../<name>` format."
            " e.g. to match all notebooks in the root Shared directory, use the regex `/Shared/.*`."
        ),
    )
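For illustration, a minimal sketch of how a notebook path filter like this behaves. The paths are invented; `AllowDenyPattern` is DataHub's standard filter helper:

```python
from datahub.configuration.common import AllowDenyPattern

# Allow everything under /Shared, except a hypothetical scratch directory.
pattern = AllowDenyPattern(allow=["/Shared/.*"], deny=["/Shared/tmp/.*"])

assert pattern.allowed("/Shared/reports/daily")       # matches an allow pattern
assert not pattern.allowed("/Users/alice/scratch")    # matches no allow pattern
assert not pattern.allowed("/Shared/tmp/experiment")  # matches a deny pattern
```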

    domain: Dict[str, AllowDenyPattern] = Field(
        default=dict(),
        description='Attach domains to catalogs, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like "Marketing". If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.',
@@ -182,3 +206,15 @@ def workspace_url_should_start_with_http_scheme(cls, workspace_url: str) -> str:
                "Workspace URL must start with http scheme. e.g. https://my-workspace.cloud.databricks.com"
            )
        return workspace_url
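As a sketch of how this validation surfaces to users, assuming `token` and `workspace_url` are the only required fields (values invented, import path assumed):

```python
import pydantic

# Assumed module path for the config class shown in this diff.
from datahub.ingestion.source.unity.config import UnityCatalogSourceConfig

try:
    UnityCatalogSourceConfig.parse_obj(
        {
            "token": "dapi-example-token",  # hypothetical credential
            "workspace_url": "my-workspace.cloud.databricks.com",  # missing scheme
        }
    )
except pydantic.ValidationError as err:
    print(err)  # includes "Workspace URL must start with http scheme. ..."
```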

    @pydantic.validator("include_metastore")
    def include_metastore_warning(cls, v: bool) -> bool:
        if v:
            msg = (
                "include_metastore is enabled."
                " This is not recommended and will be disabled by default in the future, which is a breaking change."
                " All databricks urns will change if you re-ingest with this disabled."
                " We recommend soft deleting all databricks data and re-ingesting with include_metastore set to False."
            )
            logger.warning(msg)
        return v

Review comment: let's also call
Reply: What's that do?
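A minimal sketch of this validator in action, again assuming `token` and `workspace_url` are the only other required fields (values invented, import path assumed):

```python
import logging

from datahub.ingestion.source.unity.config import UnityCatalogSourceConfig

logging.basicConfig(level=logging.WARNING)

config = UnityCatalogSourceConfig.parse_obj(
    {
        "token": "dapi-example-token",  # hypothetical credential
        "workspace_url": "https://my-workspace.cloud.databricks.com",
        "include_metastore": True,  # today's default; triggers logger.warning above
    }
)
# Logged: "include_metastore is enabled. This is not recommended ..."
```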
Review comment: The phrasing is a bit confusing, and splitting this across the two sections isn't ideal. Maybe let's do: "The unity catalog ingestion source has a new option `include_metastore`, which will cause all urns to be changed when disabled. This is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future. If stateful ingestion is enabled, simply setting `include_metastore: false` will perform all required cleanup. Otherwise, we recommend soft deleting all databricks data via the DataHub CLI: `datahub delete --platform databricks --soft` and then re-ingesting with `include_metastore: false`."