
feat(sdk): autogenerate urn types #9257

Merged: 31 commits, merged Nov 30, 2023
Changes from 29 commits

Commits
7f53d09
wip autogen urns
hsheth2 Jan 9, 2023
43bc1a8
start basic classes
hsheth2 Feb 7, 2023
1820ef3
add key aspect helpers
hsheth2 Feb 8, 2023
8f558ee
add basic init validation
hsheth2 Feb 8, 2023
ddc819b
setup coercion + defaults + codegen all entities
hsheth2 Feb 8, 2023
cc5b8b8
start replacing old classes
hsheth2 Feb 9, 2023
e424eda
start refactoring
hsheth2 Nov 15, 2023
61707cf
move urn tests
hsheth2 Nov 15, 2023
3e02a43
start refactoring
hsheth2 Nov 15, 2023
03241ac
more stuff
hsheth2 Nov 16, 2023
9883d6a
continue updating tests
hsheth2 Nov 16, 2023
c2207dd
remove dataclasses
hsheth2 Nov 16, 2023
5db44b7
update constructors
hsheth2 Nov 16, 2023
a9c1a1b
add most deprecated method shims
hsheth2 Nov 16, 2023
da8fe50
fix remaining lint issues
hsheth2 Nov 16, 2023
5b8f92a
ignore deprecation warnings
hsheth2 Nov 16, 2023
1d120a7
move urn shims to utilities.urns
hsheth2 Nov 16, 2023
81f60b7
shuffle
hsheth2 Nov 16, 2023
c9a29e7
fix compat
hsheth2 Nov 16, 2023
d6b0fef
generalized urn validation logic
hsheth2 Nov 17, 2023
7b16707
fix validation logic
hsheth2 Nov 17, 2023
f9f49d1
add note about encoding
hsheth2 Nov 17, 2023
c832a47
tweak custom package setup
hsheth2 Nov 17, 2023
5c43535
Merge branch 'master' into autogen-urns
hsheth2 Nov 17, 2023
b96d971
update changelog
hsheth2 Nov 17, 2023
3873d02
add urns to docs site
hsheth2 Nov 17, 2023
c6bf8d8
show deprecation warnings in docs
hsheth2 Nov 17, 2023
7777c18
add missing file
hsheth2 Nov 17, 2023
983259b
fix test
hsheth2 Nov 20, 2023
95a1223
review
hsheth2 Nov 30, 2023
0127a3a
Merge branch 'master' into autogen-urns
hsheth2 Nov 30, 2023
7 changes: 7 additions & 0 deletions docs-website/sphinx/apidocs/urns.rst
@@ -0,0 +1,7 @@
URNs
======

.. automodule:: datahub.metadata.urns
:exclude-members: LI_DOMAIN, URN_PREFIX, url_encode, validate, get_type, get_entity_id, get_entity_id_as_string, get_domain, underlying_key_aspect_type
:member-order: alphabetical
:inherited-members:
4 changes: 4 additions & 0 deletions docs-website/sphinx/conf.py
@@ -3,6 +3,10 @@
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# See https://stackoverflow.com/a/65147676
import builtins

builtins.__sphinx_build__ = True

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
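The `builtins.__sphinx_build__` flag set in `conf.py` above lets library code detect that it is being imported during a docs build. The flag name comes from the diff; the consuming check below is an assumption about how such a flag is typically read:

```python
import builtins

# Mirror what conf.py does before any SDK modules are imported.
builtins.__sphinx_build__ = True

# Library code can then branch on the flag, e.g. to keep deprecation
# notices visible in the rendered docs (a sketch, not the SDK's code).
IS_SPHINX_BUILD = getattr(builtins, "__sphinx_build__", False)
print(IS_SPHINX_BUILD)  # True
```

Using `getattr` with a default keeps the check safe outside of Sphinx, where the attribute does not exist.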
1 change: 1 addition & 0 deletions docs-website/sphinx/index.rst
@@ -14,6 +14,7 @@ Welcome to DataHub Python SDK's documentation!
apidocs/builder
apidocs/clients
apidocs/models
apidocs/urns


Indices and tables
2 changes: 1 addition & 1 deletion docs-website/sphinx/requirements.txt
@@ -1,4 +1,4 @@
--e ../../metadata-ingestion[datahub-rest,sql-parsing]
+-e ../../metadata-ingestion[datahub-rest,sql-parser]
beautifulsoup4==4.11.2
Sphinx==6.1.3
sphinx-click==4.4.0
51 changes: 33 additions & 18 deletions docs/how/updating-datahub.md
@@ -7,6 +7,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
### Breaking Changes

- #9244: The `redshift-legacy` and `redshift-legacy-usage` sources, which have been deprecated for >6 months, have been removed. The new `redshift` source is a superset of the functionality provided by those legacy sources.
- #9257: The Python SDK urn types are now autogenerated. The new classes are largely backwards compatible with the previous, manually written classes, but many older methods are now deprecated in favor of a more uniform interface. The only breaking change is that the signature of the direct constructor, e.g. `TagUrn("tag", ["tag_name"])`, is no longer supported; the simpler `TagUrn("tag_name")` should be used instead.
  The canonical place to import the urn classes from is `datahub.metadata.urns`. Other import paths, like `datahub.utilities.urns.corpuser_urn.CorpuserUrn`, are retained for backwards compatibility but are considered deprecated.
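As a rough illustration of the new uniform interface, the direct constructor now takes just the entity id, and the urn string is derived from it. The sketch below is hypothetical and stdlib-only; the real classes live in `datahub.metadata.urns` and may differ in detail:

```python
# Hypothetical sketch of the autogenerated urn interface; the real
# classes live in datahub.metadata.urns and may differ in detail.
class TagUrn:
    ENTITY_TYPE = "tag"

    def __init__(self, name: str) -> None:
        # New-style direct constructor: just the entity id, no
        # redundant entity-type argument.
        self.name = name

    def urn(self) -> str:
        # All urns share the same "urn:li:<entity-type>:<id>" shape.
        return f"urn:li:{self.ENTITY_TYPE}:{self.name}"

    def __str__(self) -> str:
        return self.urn()


tag = TagUrn("tag_name")
print(str(tag))  # urn:li:tag:tag_name
```

In real code you would instead `from datahub.metadata.urns import TagUrn` and construct it the same way.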

### Potential Downtime

@@ -22,18 +24,19 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
- #9044 - GraphQL APIs for adding ownership now expect either an `ownershipTypeUrn` referencing a custom ownership type or a (deprecated) `type`. Where before adding an ownership without a concrete type was allowed, this is no longer the case. For simplicity, you can use the `type` parameter, which will get translated to a custom ownership type internally if one exists for the type being added.
- #9010 - In Redshift source's config `incremental_lineage` is set default to off.
- #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
- #8942 - Removed `urn:li:corpuser:datahub` owner for the `Measure`, `Dimension` and `Temporal` tags emitted
by Looker and LookML source connectors.
- #8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
- #8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include `pip install 'acryl-datahub-airflow-plugin[plugin-v2]'`. To continue using the v1 plugin, set the `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN` environment variable to `true`.
- #8943 - The Unity Catalog ingestion source has a new option `include_metastore`, which will cause all urns to be changed when disabled.
This is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future.
If stateful ingestion is enabled, simply setting `include_metastore: false` will perform all required cleanup.
Otherwise, we recommend soft deleting all databricks data via the DataHub CLI:
`datahub delete --platform databricks --soft` and then reingesting with `include_metastore: false`.
- #8846 - Changed enum values in resource filters used by policies. `RESOURCE_TYPE` became `TYPE` and `RESOURCE_URN` became `URN`.
Any existing policies using these filters (i.e. defined for particular `urns` or `types` such as `dataset`) need to be upgraded
manually, for example by retrieving their respective `dataHubPolicyInfo` aspect and changing part using filter i.e.

```yaml
"resources": {
"filter": {
Expand All @@ -48,7 +51,9 @@ manually, for example by retrieving their respective `dataHubPolicyInfo` aspect
]
}
```

into

```yaml
"resources": {
"filter": {
Expand All @@ -63,22 +68,25 @@ into
]
}
```

for example using the `datahub put` command. Policies can also be removed and re-created via the UI.

- #9077 - The BigQuery ingestion source by default sets `match_fully_qualified_names: true`.
This means that any `dataset_pattern` or `schema_pattern` specified will be matched on the fully
qualified dataset name, i.e. `<project_name>.<dataset_name>`. We attempt to support the old
pattern format by prepending `.*\\.` to dataset patterns lacking a period, so in most cases this
should not cause any issues. However, if you have a complex dataset pattern, we recommend you
manually convert it to the fully qualified format to avoid any potential issues.
- #9110 - The Unity Catalog source will now generate urns based on `env` properly. If you have
been setting `env` in your recipe to something besides `PROD`, we will now generate urns
with that new env variable, invalidating your existing urns.
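The #9077 pattern-compatibility shim described above can be sketched roughly as follows. This is an illustration inferred from the changelog wording (prepend `.*\.` to dataset patterns lacking a period), not the actual source code:

```python
import re


def upgrade_dataset_pattern(pattern: str) -> str:
    """Sketch of the compatibility shim for old-style dataset patterns.

    An assumption based on the changelog wording, not the real
    implementation: patterns without a period matched the bare dataset
    name, so they are anchored behind any project prefix.
    """
    if "." not in pattern:
        return r".*\." + pattern
    return pattern


old = "sales_dataset"
new = upgrade_dataset_pattern(old)
print(new)  # .*\.sales_dataset
print(bool(re.match(new, "my_project.sales_dataset")))  # True
```

Patterns that already contain a period are passed through unchanged, which is why complex patterns may need manual conversion to the fully qualified format.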

### Potential Downtime

### Deprecations

### Other Notable Changes

- Session token configuration has changed, all previously created session tokens will be invalid and users will be prompted to log in. Expiration time has also been shortened which may result in more login prompts with the default settings.
There should be no other interruption due to this change.

@@ -87,13 +95,16 @@ with that new env variable, invalidating your existing urns.
### Breaking Changes

### Potential Downtime

- #8611 Search improvements require reindexing indices. A `system-update` job will run which will set indices to read-only and create a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure, this process can take anywhere from 5 minutes to several hours; as a rough estimate, expect about 1 hour for every 2.3 million entities.

### Deprecations

- #8525: In the LDAP ingestor, the `manager_pagination_enabled` option was renamed to the more general `pagination_enabled`
- MAE Events are no longer produced. MAE events have been deprecated for over a year.

### Other Notable Changes

- In this release we now enable you to create and delete pinned announcements on your DataHub homepage! If you have the “Manage Home Page Posts” platform privilege you’ll see a new section in settings called “Home Page Posts” where you can create and delete text posts and link posts that your users see on the home page.
- The new search and browse experience, which was first made available in the previous release behind a feature flag, is now on by default. Check out our release notes for v0.10.5 to get more information and documentation on this new Browse experience.
- In addition to the ranking changes mentioned above, this release includes changes to the highlighting of search entities to understand why they match your query. You can also sort your results alphabetically or by last updated times, in addition to relevance. In this release, we suggest a correction if your query has a typo in it.
@@ -120,12 +131,13 @@ with that new env variable, invalidating your existing urns.
This determines which Okta profile attribute is used for the corresponding DataHub user
and thus may change which DataHub users are generated by the Okta source. In a follow-up change, `okta_profile_to_username_regex` has been set to `.*`, which, taken together with the previous change, brings the defaults in line with OIDC.
- #8331: For all sql-based sources that support profiling, you can no longer specify
`profile_table_level_only` together with `include_field_xyz` config options to ingest
certain column-level metrics. Instead, set `profile_table_level_only` to `false` and
individually enable / disable desired field metrics.
- #8451: The `bigquery-beta` and `snowflake-beta` source aliases have been dropped. Use `bigquery` and `snowflake` as the source type instead.
- #8472: Ingestion runs created with Pipeline.create will show up in the DataHub ingestion tab as CLI-based runs. To revert to the previous behavior of not showing these runs in DataHub, pass `no_default_report=True`.
- #8513: The `snowflake` connector will now use the user's `email` attribute as-is in urns. To revert to the previous behavior, disable `email_as_user_identifier` in the recipe.

### Potential Downtime

- BrowsePathsV2 upgrade will now be handled by the `system-update` job in non-blocking mode. This process generates data needed for the new search
@@ -152,9 +164,11 @@ individually enable / disable desired field metrics.
### Potential Downtime

### Deprecations

- #8045: With the introduction of custom ownership types, the `Owner` aspect has been updated where the `type` field is deprecated in favor of a new field `typeUrn`. This latter field is an urn reference to the new OwnershipType entity. GraphQL endpoints have been updated to use the new field. For pre-existing ownership aspect records, DataHub now has logic to map the old field to the new field.

### Other notable Changes

- #8191: Updates GMS's health check endpoint to account for its dependency on external components. Notably, at this time, elasticsearch. This means that DataHub operators can now use GMS health status more reliably.

## 0.10.3
@@ -169,6 +183,7 @@ individually enable / disable desired field metrics.
### Potential Downtime

### Deprecations

- The signature of `Source.get_workunits()` is changed from `Iterable[WorkUnit]` to the more restrictive `Iterable[MetadataWorkUnit]`.
- Legacy usage creation via the `UsageAggregation` aspect, `/usageStats?action=batchIngest` GMS endpoint, and `UsageStatsWorkUnit` metadata-ingestion class are all deprecated.

@@ -14,24 +14,12 @@
EditableSchemaMetadataClass,
InstitutionalMemoryClass,
)
+from datahub.utilities.urns.field_paths import get_simple_field_path_from_v2_field_path

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


-def get_simple_field_path_from_v2_field_path(field_path: str) -> str:
-    """A helper function to extract simple . path notation from the v2 field path"""
-    if not field_path.startswith("[version=2.0]"):
-        # not a v2, we assume this is a simple path
-        return field_path
-    # this is a v2 field path
-    tokens = [
-        t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]"))
-    ]
-
-    return ".".join(tokens)


# Inputs -> owner, ownership_type, dataset
documentation_to_add = (
"Name of the user who was deleted. This description is updated via PythonSDK."
14 changes: 1 addition & 13 deletions metadata-ingestion/examples/library/dataset_add_column_tag.py
@@ -15,24 +15,12 @@
GlobalTagsClass,
TagAssociationClass,
)
+from datahub.utilities.urns.field_paths import get_simple_field_path_from_v2_field_path

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


-def get_simple_field_path_from_v2_field_path(field_path: str) -> str:
-    """A helper function to extract simple . path notation from the v2 field path"""
-    if not field_path.startswith("[version=2.0]"):
-        # not a v2, we assume this is a simple path
-        return field_path
-    # this is a v2 field path
-    tokens = [
-        t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]"))
-    ]
-
-    return ".".join(tokens)


# Inputs -> the column, dataset and the tag to set
column = "user_name"
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
14 changes: 1 addition & 13 deletions metadata-ingestion/examples/library/dataset_add_column_term.py
@@ -15,24 +15,12 @@
GlossaryTermAssociationClass,
GlossaryTermsClass,
)
+from datahub.utilities.urns.field_paths import get_simple_field_path_from_v2_field_path

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


-def get_simple_field_path_from_v2_field_path(field_path: str) -> str:
-    """A helper function to extract simple . path notation from the v2 field path"""
-    if not field_path.startswith("[version=2.0]"):
-        # not a v2, we assume this is a simple path
-        return field_path
-    # this is a v2 field path
-    tokens = [
-        t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]"))
-    ]
-
-    return ".".join(tokens)


# Inputs -> the column, dataset and the term to set
column = "address.zipcode"
dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")
8 changes: 4 additions & 4 deletions metadata-ingestion/examples/library/upsert_group.py
@@ -5,18 +5,18 @@
CorpGroupGenerationConfig,
)
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
-from datahub.utilities.urns.corpuser_urn import CorpuserUrn
+from datahub.metadata.urns import CorpUserUrn

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

group_email = "[email protected]"
group = CorpGroup(
id=group_email,
-    owners=[str(CorpuserUrn.create_from_id("datahub"))],
+    owners=[str(CorpUserUrn("datahub"))],
members=[
-        str(CorpuserUrn.create_from_id("[email protected]")),
-        str(CorpuserUrn.create_from_id("[email protected]")),
+        str(CorpUserUrn("[email protected]")),
+        str(CorpUserUrn("[email protected]")),
],
display_name="Foo Group",
email=group_email,
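The new `CorpUserUrn("datahub")` call in this diff serializes to the same urn string that the old `CorpuserUrn.create_from_id("datahub")` produced. The `urn:li:corpuser:<id>` shape is standard DataHub; the helper below is a hypothetical stand-in for illustration only:

```python
# Hypothetical helper illustrating the string that str(CorpUserUrn(...))
# produces; the real class lives in datahub.metadata.urns.
def corpuser_urn(username: str) -> str:
    return f"urn:li:corpuser:{username}"


print(corpuser_urn("datahub"))  # urn:li:corpuser:datahub
```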