diff --git a/README.md b/README.md index f0314c0ed90e1..3b381ebc8dc89 100644 --- a/README.md +++ b/README.md @@ -80,7 +80,9 @@ Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quic If you're looking to build & modify datahub please take a look at our [Development Guide](https://datahubproject.io/docs/developers). -[![DataHub Demo GIF](docs/imgs/entity.png)](https://demo.datahubproject.io/) +
+ +
## Source Code and Repositories diff --git a/docker/airflow/local_airflow.md b/docker/airflow/local_airflow.md index d0a2b18cff2d2..cbd93660468f0 100644 --- a/docker/airflow/local_airflow.md +++ b/docker/airflow/local_airflow.md @@ -138,25 +138,41 @@ Successfully added `conn_id`=datahub_rest_default : datahub_rest://:@http://data Navigate the Airflow UI to find the sample Airflow dag we just brought in -![Find the DAG](../../docs/imgs/airflow/find_the_dag.png) ++ +
By default, Airflow loads all DAG-s in paused status. Unpause the sample DAG to use it. -![Paused DAG](../../docs/imgs/airflow/paused_dag.png) -![Unpaused DAG](../../docs/imgs/airflow/unpaused_dag.png) ++ +
++ +
Then trigger the DAG to run. -![Trigger the DAG](../../docs/imgs/airflow/trigger_dag.png) ++ +
After the DAG runs successfully, go over to your DataHub instance to see the Pipeline and navigate its lineage. -![DataHub Pipeline View](../../docs/imgs/airflow/datahub_pipeline_view.png) ++ +
-![DataHub Pipeline Entity](../../docs/imgs/airflow/datahub_pipeline_entity.png) ++ +
-![DataHub Task View](../../docs/imgs/airflow/datahub_task_view.png) ++ +
-![DataHub Lineage View](../../docs/imgs/airflow/datahub_lineage_view.png) ++ +
## TroubleShooting @@ -164,9 +180,13 @@ Most issues are related to connectivity between Airflow and DataHub. Here is how you can debug them. -![Find the Task Log](../../docs/imgs/airflow/finding_failed_log.png) ++ +
-![Inspect the Log](../../docs/imgs/airflow/connection_error.png) ++ +
In this case, clearly the connection `datahub-rest` has not been registered. Looks like we forgot to register the connection with Airflow! Let's execute Step 4 to register the datahub connection with Airflow. @@ -175,4 +195,6 @@ In case the connection was registered successfully but you are still seeing `Fai After re-running the DAG, we see success! -![Pipeline Success](../../docs/imgs/airflow/successful_run.png) ++ +
diff --git a/docs/advanced/no-code-modeling.md b/docs/advanced/no-code-modeling.md index e1fadee6d371a..ceb51511ad678 100644 --- a/docs/advanced/no-code-modeling.md +++ b/docs/advanced/no-code-modeling.md @@ -1,47 +1,46 @@ -# No Code Metadata +# No Code Metadata ## Summary of changes -As part of the No Code Metadata Modeling initiative, we've made radical changes to the DataHub stack. +As part of the No Code Metadata Modeling initiative, we've made radical changes to the DataHub stack. -Specifically, we've +Specifically, we've -- Decoupled the persistence layer from Java + Rest.li specific concepts +- Decoupled the persistence layer from Java + Rest.li specific concepts - Consolidated the per-entity Rest.li resources into a single general-purpose Entity Resource -- Consolidated the per-entity Graph Index Writers + Readers into a single general-purpose Neo4J DAO -- Consolidated the per-entity Search Index Writers + Readers into a single general-purpose ES DAO. +- Consolidated the per-entity Graph Index Writers + Readers into a single general-purpose Neo4J DAO +- Consolidated the per-entity Search Index Writers + Readers into a single general-purpose ES DAO. - Developed mechanisms for declaring search indexing configurations + foreign key relationships as annotations -on PDL models themselves. -- Introduced a special "Browse Paths" aspect that allows the browse configuration to be -pushed into DataHub, as opposed to computed in a blackbox lambda sitting within DataHub + on PDL models themselves. +- Introduced a special "Browse Paths" aspect that allows the browse configuration to be + pushed into DataHub, as opposed to computed in a blackbox lambda sitting within DataHub - Introduced special "Key" aspects for conveniently representing the information that identifies a DataHub entities via -a normal struct. + a normal struct. - Removed the need for hand-written Elastic `settings.json` and `mappings.json`. (Now generated at runtime) - Removed the need for the Elastic Set Up container (indexes are not registered at runtime) - Simplified the number of models that need to be maintained for each DataHub entity. We removed the need for - 1. Relationship Models - 2. Entity Models - 3. Urn models + the associated Java container classes - 4. 'Value' models, those which are returned by the Rest.li resource + 1. Relationship Models + 2. Entity Models + 3. Urn models + the associated Java container classes + 4. 'Value' models, those which are returned by the Rest.li resource In doing so, dramatically reducing the level of effort required to add or extend an existing entity. For more on the design considerations, see the **Design** section below. - ## Engineering Spec This section will provide a more in-depth overview of the design considerations that were at play when working on the No -Code initiative. +Code initiative. # Use Cases Who needs what & why? 
-| As a | I want to | because -| ---------------- | ------------------------ | ------------------------------ -| DataHub Operator | Add new entities | The default domain model does not match my business needs -| DataHub Operator | Extend existing entities | The default domain model does not match my business needs +| As a | I want to | because | +| ---------------- | ------------------------ | --------------------------------------------------------- | +| DataHub Operator | Add new entities | The default domain model does not match my business needs | +| DataHub Operator | Extend existing entities | The default domain model does not match my business needs | What we heard from folks in the community is that adding new entities + aspects is just **too difficult**. @@ -62,15 +61,19 @@ Achieve the primary goal in a way that does not require a fork. ### Must-Haves 1. Mechanisms for **adding** a browsable, searchable, linkable GMS entity by defining one or more PDL models - - GMS Endpoint for fetching entity - - GMS Endpoint for fetching entity relationships - - GMS Endpoint for searching entity - - GMS Endpoint for browsing entity -2. Mechanisms for **extending** a ****browsable, searchable, linkable GMS ****entity by defining one or more PDL models - - GMS Endpoint for fetching entity - - GMS Endpoint for fetching entity relationships - - GMS Endpoint for searching entity - - GMS Endpoint for browsing entity + +- GMS Endpoint for fetching entity +- GMS Endpoint for fetching entity relationships +- GMS Endpoint for searching entity +- GMS Endpoint for browsing entity + +2. Mechanisms for **extending** a \***\*browsable, searchable, linkable GMS \*\***entity by defining one or more PDL models + +- GMS Endpoint for fetching entity +- GMS Endpoint for fetching entity relationships +- GMS Endpoint for searching entity +- GMS Endpoint for browsing entity + 3. Mechanisms + conventions for introducing a new **relationship** between 2 GMS entities without writing code 4. Clear documentation describing how to perform actions in #1, #2, and #3 above published on [datahubproject.io](http://datahubproject.io) @@ -78,8 +81,9 @@ Achieve the primary goal in a way that does not require a fork. 1. Mechanisms for automatically generating a working GraphQL API using the entity PDL models 2. Ability to add / extend GMS entities without a fork. - - e.g. **Register** new entity / extensions *at runtime*. (Unlikely due to code generation) - - or, **configure** new entities at *deploy time* + +- e.g. **Register** new entity / extensions _at runtime_. (Unlikely due to code generation) +- or, **configure** new entities at _deploy time_ ## What Success Looks Like @@ -88,7 +92,6 @@ Achieve the primary goal in a way that does not require a fork. 3. Adding a new relationship among 2 GMS entities takes 1 dev < 15 minutes 4. [Bonus] Implementing the `datahub-frontend` GraphQL API for a new / extended entity takes < 10 minutes - ## Design ## State of the World @@ -104,7 +107,8 @@ Currently, there are various models in GMS: 5. [Entities](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdl) - Records with fields derived from the URN. Used only in graph / relationships 6. [Relationships](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/relationship/Relationship.pdl) - Edges between 2 entities with optional edge properties 7. 
[Search Documents](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/search/ChartDocument.pdl) - Flat documents for indexing within Elastic index - - And corresponding index [mappings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/mappings.json), [settings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/settings.json) + +- And corresponding index [mappings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/mappings.json), [settings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/settings.json) Various components of GMS depend on / make assumptions about these model types: @@ -122,7 +126,7 @@ Various components of GMS depend on / make assumptions about these model types: Additionally, there are some implicit concepts that require additional caveats / logic: 1. Browse Paths - Requires defining logic in an entity-specific index builder to generate. -2. Urns - Requires defining a) an Urn PDL model and b) a hand-written Urn class +2. Urns - Requires defining a) an Urn PDL model and b) a hand-written Urn class As you can see, there are many tied up concepts. Fundamentally changing the model would require a serious amount of refactoring, as it would require new versions of numerous components. @@ -132,25 +136,25 @@ The challenge is, how can we meet the requirements without fundamentally alterin In a nutshell, the idea is to consolidate the number of models + code we need to write on a per-entity basis. We intend to achieve this by making search index + relationship configuration declarative, specified as part of the model -definition itself. +definition itself. -We will use this configuration to drive more generic versions of the index builders + rest resources, -with the intention of reducing the overall surface area of GMS. +We will use this configuration to drive more generic versions of the index builders + rest resources, +with the intention of reducing the overall surface area of GMS. During this initiative, we will also seek to make the concepts of Browse Paths and Urns declarative. Browse Paths -will be provided using a special BrowsePaths aspect. Urns will no longer be strongly typed. +will be provided using a special BrowsePaths aspect. Urns will no longer be strongly typed. -To achieve this, we will attempt to generify many components throughout the stack. Currently, many of them are defined on -a *per-entity* basis, including +To achieve this, we will attempt to generify many components throughout the stack. Currently, many of them are defined on +a _per-entity_ basis, including - Rest.li Resources - Index Builders - Graph Builders - Local, Search, Browse, Graph DAOs -- Clients +- Clients - Browse Path Logic -along with simplifying the number of raw data models that need defined, including +along with simplifying the number of raw data models that need defined, including - Rest.li Resource Models - Search Document Models @@ -159,39 +163,43 @@ along with simplifying the number of raw data models that need defined, includin From an architectural PoV, we will move from a before that looks something like this: -![no-code-before](../imgs/no-code-before.png) ++ +
to an after that looks like this -![no-code-after](../imgs/no-code-after.png) ++ +
-That is, a move away from patterns of strong-typing-everywhere to a more generic + flexible world. +That is, a move away from patterns of strong-typing-everywhere to a more generic + flexible world. ### How will we do it? We will accomplish this by building the following: 1. Set of custom annotations to permit declarative entity, search, graph configurations - - @Entity & @Aspect - - @Searchable - - @Relationship + - @Entity & @Aspect + - @Searchable + - @Relationship 2. Entity Registry: In-memory structures for representing, storing & serving metadata associated with a particular Entity, including search and relationship configurations. -3. Generic Entity, Search, Graph Service classes: Replaces traditional strongly-typed DAOs with flexible, pluggable APIs that can be used for CRUD, search, and graph across all entities. -2. Generic Rest.li Resources: - - 1 permitting reading, writing, searching, autocompleting, and browsing arbitrary entities - - 1 permitting reading of arbitrary entity-entity relationship edges -2. Generic Search Index Builder: Given a MAE and a specification of the Search Configuration for an entity, updates the search index. -3. Generic Graph Index Builder: Given a MAE and a specification of the Relationship Configuration for an entity, updates the graph index. -4. Generic Index + Mappings Builder: Dynamically generates index mappings and creates indices on the fly. -5. Introduce of special aspects to address other imperative code requirements - - BrowsePaths Aspect: Include an aspect to permit customization of the indexed browse paths. - - Key aspects: Include "virtual" aspects for representing the fields that uniquely identify an Entity for easy - reading by clients of DataHub. +3. Generic Entity, Search, Graph Service classes: Replaces traditional strongly-typed DAOs with flexible, pluggable APIs that can be used for CRUD, search, and graph across all entities. +4. Generic Rest.li Resources: + - 1 permitting reading, writing, searching, autocompleting, and browsing arbitrary entities + - 1 permitting reading of arbitrary entity-entity relationship edges +5. Generic Search Index Builder: Given a MAE and a specification of the Search Configuration for an entity, updates the search index. +6. Generic Graph Index Builder: Given a MAE and a specification of the Relationship Configuration for an entity, updates the graph index. +7. Generic Index + Mappings Builder: Dynamically generates index mappings and creates indices on the fly. +8. Introduce of special aspects to address other imperative code requirements + - BrowsePaths Aspect: Include an aspect to permit customization of the indexed browse paths. + - Key aspects: Include "virtual" aspects for representing the fields that uniquely identify an Entity for easy + reading by clients of DataHub. ### Final Developer Experience: Defining an Entity We will outline what the experience of adding a new Entity should look like. We will imagine we want to define a "Service" entity representing -online microservices. +online microservices. #### Step 1. Add aspects @@ -236,7 +244,7 @@ record ServiceInfo { /** * Description of the service */ - @Searchable = {} + @Searchable = {} description: string /** @@ -244,7 +252,7 @@ record ServiceInfo { */ @Relationship = { "name": "OwnedBy", - "entityTypes": ["corpUser"] + "entityTypes": ["corpUser"] } owner: Urn } @@ -310,7 +318,7 @@ namespace com.linkedin.metadata.snapshot * A union of all supported metadata snapshot types. */ typeref Snapshot = union[ - ... + ... 
ServiceSnapshot ] ``` @@ -321,7 +329,7 @@ typeref Snapshot = union[ ``` curl 'http://localhost:8080/entities?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '{ - "entity":{ + "entity":{ "value":{ "com.linkedin.metadata.snapshot.ServiceSnapshot":{ "urn": "urn:li:service:mydemoservice", @@ -329,7 +337,7 @@ curl 'http://localhost:8080/entities?action=ingest' -X POST -H 'X-RestLi-Protoco { "com.linkedin.service.ServiceInfo":{ "description":"My demo service", - "owner": "urn:li:corpuser:user1" + "owner": "urn:li:corpuser:user1" } }, { @@ -400,4 +408,3 @@ curl --location --request POST 'http://localhost:8080/entities?action=browse' \ curl --location --request GET 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acorpuser%3Auser1&types=OwnedBy' \ --header 'X-RestLi-Protocol-Version: 2.0.0' ``` - diff --git a/docs/api/graphql/how-to-set-up-graphql.md b/docs/api/graphql/how-to-set-up-graphql.md index 562e8edb9f5d9..e9d264c5b04bf 100644 --- a/docs/api/graphql/how-to-set-up-graphql.md +++ b/docs/api/graphql/how-to-set-up-graphql.md @@ -62,7 +62,9 @@ Postman is a popular API client that provides a graphical user interface for sen Within Postman, you can create a `POST` request and set the request URL to the `/api/graphql` endpoint. In the request body, select the `GraphQL` option and enter your GraphQL query in the request body. -![postman-graphql](../../imgs/apis/postman-graphql.png) ++ +
Please refer to [Querying with GraphQL](https://learning.postman.com/docs/sending-requests/graphql/graphql/) in the Postman documentation for more information. diff --git a/docs/api/tutorials/custom-properties.md b/docs/api/tutorials/custom-properties.md index dbc07bfaa712e..672fc568ea8b5 100644 --- a/docs/api/tutorials/custom-properties.md +++ b/docs/api/tutorials/custom-properties.md @@ -34,7 +34,9 @@ In this example, we will add some custom properties `cluster_name` and `retentio After you have ingested sample data, the dataset `fct_users_deleted` should have a custom properties section with `encoding` set to `utf-8`. -![dataset-properties-before](../../imgs/apis/tutorials/dataset-properties-before.png) ++ +
```shell datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties @@ -80,7 +82,9 @@ The following code adds custom properties `cluster_name` and `retention_time` to You can now see the two new properties are added to `fct_users_deleted` and the previous property `encoding` is unchanged. -![dataset-properties-added](../../imgs/apis/tutorials/dataset-properties-added.png) ++ +
We can also verify this operation by programmatically checking the `datasetProperties` aspect after running this code using the `datahub` cli. @@ -130,7 +134,9 @@ The following code shows you how can add and remove custom properties in the sam You can now see the `cluster_name` property is added to `fct_users_deleted` and the `retention_time` property is removed. -![dataset-properties-added-removed](../../imgs/apis/tutorials/dataset-properties-added-removed.png) ++ +
We can also verify this operation programmatically by checking the `datasetProperties` aspect using the `datahub` cli. @@ -179,7 +185,9 @@ The following code replaces the current custom properties with a new properties You can now see the `cluster_name` and `retention_time` properties are added to `fct_users_deleted` but the previous `encoding` property is no longer present. -![dataset-properties-replaced](../../imgs/apis/tutorials/dataset-properties-replaced.png) ++ +
We can also verify this operation programmatically by checking the `datasetProperties` aspect using the `datahub` cli. diff --git a/docs/api/tutorials/datasets.md b/docs/api/tutorials/datasets.md index 62b30e97c8020..0d41da741a5f5 100644 --- a/docs/api/tutorials/datasets.md +++ b/docs/api/tutorials/datasets.md @@ -42,7 +42,9 @@ For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart. You can now see `realestate_db.sales` dataset has been created. -![dataset-created](../../imgs/apis/tutorials/dataset-created.png) ++ +
## Delete Dataset @@ -110,4 +112,6 @@ Expected Response: The dataset `fct_users_deleted` has now been deleted, so if you search for a hive dataset named `fct_users_delete`, you will no longer be able to see it. -![dataset-deleted](../../imgs/apis/tutorials/dataset-deleted.png) ++ +
diff --git a/docs/api/tutorials/deprecation.md b/docs/api/tutorials/deprecation.md index 6a8f7c8a1d2be..590ad707969ca 100644 --- a/docs/api/tutorials/deprecation.md +++ b/docs/api/tutorials/deprecation.md @@ -155,4 +155,6 @@ Expected Response: You can now see the dataset `fct_users_created` has been marked as `Deprecated.` -![tag-removed](../../imgs/apis/tutorials/deprecation-updated.png) ++ +
diff --git a/docs/api/tutorials/descriptions.md b/docs/api/tutorials/descriptions.md index 46f42b7a05be6..6247b61d3b300 100644 --- a/docs/api/tutorials/descriptions.md +++ b/docs/api/tutorials/descriptions.md @@ -202,16 +202,16 @@ Expected Response: ```graphql mutation updateDataset { updateDataset( - urn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" input: { editableProperties: { - description: "## The Real Estate Sales Dataset\nThis is a really important Dataset that contains all the relevant information about sales that have happened organized by address.\n" + description: "## The Real Estate Sales Dataset\nThis is a really important Dataset that contains all the relevant information about sales that have happened organized by address.\n" } institutionalMemory: { elements: { author: "urn:li:corpuser:jdoe" - url: "https://wikipedia.com/real_estate" - description: "This is the definition of what real estate means" + url: "https://wikipedia.com/real_estate" + description: "This is the definition of what real estate means" } } } @@ -275,7 +275,9 @@ Expected Response: You can now see the description is added to `fct_users_deleted`. -![dataset-description-added](../../imgs/apis/tutorials/dataset-description-added.png) ++ +
## Add Description on Column @@ -357,4 +359,6 @@ Expected Response: You can now see column description is added to `user_name` column of `fct_users_deleted`. -![column-description-added](../../imgs/apis/tutorials/column-description-added.png) ++ +
diff --git a/docs/api/tutorials/domains.md b/docs/api/tutorials/domains.md index c8c47f85c570f..29c77a35ff0e1 100644 --- a/docs/api/tutorials/domains.md +++ b/docs/api/tutorials/domains.md @@ -74,7 +74,9 @@ Expected Response: You can now see `Marketing` domain has been created under `Govern > Domains`. -![domain-created](../../imgs/apis/tutorials/domain-created.png) ++ +
## Read Domains @@ -209,7 +211,9 @@ Expected Response: You can now see `Marketing` domain has been added to the dataset. -![domain-added](../../imgs/apis/tutorials/domain-added.png) ++ +
## Remove Domains @@ -259,4 +263,6 @@ curl --location --request POST 'http://localhost:8080/api/graphql' \ You can now see a domain `Marketing` has been removed from the `fct_users_created` dataset. -![domain-removed](../../imgs/apis/tutorials/domain-removed.png) ++ +
diff --git a/docs/api/tutorials/lineage.md b/docs/api/tutorials/lineage.md index e37986af7bbbd..80f46a3131d11 100644 --- a/docs/api/tutorials/lineage.md +++ b/docs/api/tutorials/lineage.md @@ -112,7 +112,9 @@ Expected Response: You can now see the lineage between `fct_users_deleted` and `logging_events`. -![lineage-added](../../imgs/apis/tutorials/lineage-added.png) ++ +
## Add Column-level Lineage @@ -130,7 +132,9 @@ You can now see the lineage between `fct_users_deleted` and `logging_events`. You can now see the column-level lineage between datasets. Note that you have to enable `Show Columns` to be able to see the column-level lineage. -![column-level-lineage-added](../../imgs/apis/tutorials/column-level-lineage-added.png) ++ +
## Read Lineage @@ -171,7 +175,7 @@ mutation searchAcrossLineage { } ``` -This example shows using lineage degrees as a filter, but additional search filters can be included here as well. +This example shows using lineage degrees as a filter, but additional search filters can be included here as well.+ +
-![model-group-created](../../imgs/apis/tutorials/model-group-created.png) ++ +
## Read ML Entities @@ -499,6 +503,10 @@ Expected Response: (Note that this entity does not exist in the sample ingestion You can access to `Features` or `Group` Tab of each entity to view the added entities. -![feature-added-to-model](../../imgs/apis/tutorials/feature-added-to-model.png) ++ +
-![model-group-added-to-model](../../imgs/apis/tutorials/model-group-added-to-model.png) ++ +
diff --git a/docs/api/tutorials/owners.md b/docs/api/tutorials/owners.md index 3c7a46b136d76..71f908a85a864 100644 --- a/docs/api/tutorials/owners.md +++ b/docs/api/tutorials/owners.md @@ -77,7 +77,10 @@ Update succeeded for urn urn:li:corpuser:datahub. ### Expected Outcomes of Upserting User You can see the user `The bar` has been created and the user `Datahub` has been updated under `Settings > Access > Users & Groups` -![user-upserted](../../imgs/apis/tutorials/user-upserted.png) + ++ +
## Upsert Group @@ -125,7 +128,10 @@ Update succeeded for group urn:li:corpGroup:foogroup@acryl.io. ### Expected Outcomes of Upserting Group You can see the group `Foo Group` has been created under `Settings > Access > Users & Groups` -![group-upserted](../../imgs/apis/tutorials/group-upserted.png) + ++ +
## Read Owners @@ -272,7 +278,9 @@ curl --location --request POST 'http://localhost:8080/api/graphql' \ You can now see `bfoo` has been added as an owner to the `fct_users_created` dataset. -![ownership-added](../../imgs/apis/tutorials/owner-added.png) ++ +
## Remove Owners @@ -340,4 +348,6 @@ curl --location --request POST 'http://localhost:8080/api/graphql' \ You can now see `John Doe` has been removed as an owner from the `fct_users_created` dataset. -![ownership-removed](../../imgs/apis/tutorials/owner-removed.png) ++ +
diff --git a/docs/api/tutorials/tags.md b/docs/api/tutorials/tags.md index 2f80a833136c1..fa8ea6e0c39a7 100644 --- a/docs/api/tutorials/tags.md +++ b/docs/api/tutorials/tags.md @@ -91,7 +91,9 @@ Expected Response: You can now see the new tag `Deprecated` has been created. -![tag-created](../../imgs/apis/tutorials/tag-created.png) ++ +
We can also verify this operation by programmatically searching `Deprecated` tag after running this code using the `datahub` cli. @@ -307,7 +309,9 @@ Expected Response: You can now see `Deprecated` tag has been added to `user_name` column. -![tag-added](../../imgs/apis/tutorials/tag-added.png) ++ +
We can also verify this operation programmatically by checking the `globalTags` aspect using the `datahub` cli. @@ -359,7 +363,9 @@ curl --location --request POST 'http://localhost:8080/api/graphql' \ You can now see `Deprecated` tag has been removed to `user_name` column. -![tag-removed](../../imgs/apis/tutorials/tag-removed.png) ++ +
We can also verify this operation programmatically by checking the `gloablTags` aspect using the `datahub` cli. diff --git a/docs/api/tutorials/terms.md b/docs/api/tutorials/terms.md index 207e14ea4afe8..ba2be76195c62 100644 --- a/docs/api/tutorials/terms.md +++ b/docs/api/tutorials/terms.md @@ -95,7 +95,9 @@ Expected Response: You can now see the new term `Rate of Return` has been created. -![term-created](../../imgs/apis/tutorials/term-created.png) ++ +
We can also verify this operation by programmatically searching `Rate of Return` term after running this code using the `datahub` cli. @@ -289,7 +291,9 @@ Expected Response: You can now see `Rate of Return` term has been added to `user_name` column. -![term-added](../../imgs/apis/tutorials/term-added.png) ++ +
## Remove Terms @@ -361,4 +365,6 @@ curl --location --request POST 'http://localhost:8080/api/graphql' \ You can now see `Rate of Return` term has been removed to `user_name` column. -![term-removed](../../imgs/apis/tutorials/term-removed.png) ++ +
diff --git a/docs/architecture/architecture.md b/docs/architecture/architecture.md index 6b76b995cc427..13ce50d6806e4 100644 --- a/docs/architecture/architecture.md +++ b/docs/architecture/architecture.md @@ -10,7 +10,9 @@ disparate tools & systems. The figures below describe the high-level architecture of DataHub. -![datahub-architecture](../imgs/datahub-architecture.png) ++ +
![Acryl DataHub System Architecture ](../managed-datahub/imgs/saas/DataHub-Architecture.png) For a more detailed look at the components that make up the Architecture, check out [Components](../components.md). diff --git a/docs/architecture/metadata-ingestion.md b/docs/architecture/metadata-ingestion.md index 2b60383319c68..c4b3c834937b7 100644 --- a/docs/architecture/metadata-ingestion.md +++ b/docs/architecture/metadata-ingestion.md @@ -4,14 +4,17 @@ title: "Ingestion Framework" # Metadata Ingestion Architecture -DataHub supports an extremely flexible ingestion architecture that can support push, pull, asynchronous and synchronous models. -The figure below describes all the options possible for connecting your favorite system to DataHub. -![Ingestion Architecture](../imgs/ingestion-architecture.png) +DataHub supports an extremely flexible ingestion architecture that can support push, pull, asynchronous and synchronous models. +The figure below describes all the options possible for connecting your favorite system to DataHub. -## Metadata Change Proposal: The Center Piece ++ +
-The center piece for ingestion are [Metadata Change Proposal]s which represent requests to make a metadata change to an organization's Metadata Graph. -Metadata Change Proposals can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses. +## Metadata Change Proposal: The Center Piece + +The center piece for ingestion are [Metadata Change Proposal]s which represent requests to make a metadata change to an organization's Metadata Graph. +Metadata Change Proposals can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses. ## Pull-based Integration @@ -25,7 +28,7 @@ As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or mak ### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job) -DataHub comes with a Spring job, [mce-consumer-job], which consumes the Metadata Change Proposals and writes them into the DataHub Metadata Service (datahub-gms) using the `/ingest` endpoint. +DataHub comes with a Spring job, [mce-consumer-job], which consumes the Metadata Change Proposals and writes them into the DataHub Metadata Service (datahub-gms) using the `/ingest` endpoint. [Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp [Metadata Change Proposal]: ../what/mxe.md#metadata-change-proposal-mcp @@ -33,4 +36,3 @@ DataHub comes with a Spring job, [mce-consumer-job], which consumes the Metadata [equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer [mce-consumer-job]: ../../metadata-jobs/mce-consumer-job [Python emitters]: ../../metadata-ingestion/README.md#using-as-a-library - diff --git a/docs/architecture/metadata-serving.md b/docs/architecture/metadata-serving.md index ada41179af4e0..a677ca111e71f 100644 --- a/docs/architecture/metadata-serving.md +++ b/docs/architecture/metadata-serving.md @@ -4,21 +4,23 @@ title: "Serving Tier" # DataHub Serving Architecture -The figure below shows the high-level system diagram for DataHub's Serving Tier. +The figure below shows the high-level system diagram for DataHub's Serving Tier. -![datahub-serving](../imgs/datahub-serving.png) ++ +
-The primary component is called [the Metadata Service](../../metadata-service) and exposes a REST API and a GraphQL API for performing CRUD operations on metadata. The service also exposes search and graph query API-s to support secondary-index style queries, full-text search queries as well as relationship queries like lineage. In addition, the [datahub-frontend](../../datahub-frontend) service expose a GraphQL API on top of the metadata graph. +The primary component is called [the Metadata Service](../../metadata-service) and exposes a REST API and a GraphQL API for performing CRUD operations on metadata. The service also exposes search and graph query API-s to support secondary-index style queries, full-text search queries as well as relationship queries like lineage. In addition, the [datahub-frontend](../../datahub-frontend) service expose a GraphQL API on top of the metadata graph. ## DataHub Serving Tier Components ### Metadata Storage -The DataHub Metadata Service persists metadata in a document store (an RDBMS like MySQL, Postgres, or Cassandra, etc.). +The DataHub Metadata Service persists metadata in a document store (an RDBMS like MySQL, Postgres, or Cassandra, etc.). ### Metadata Change Log Stream (MCL) -The DataHub Service Tier also emits a commit event [Metadata Change Log] when a metadata change has been successfully committed to persistent storage. This event is sent over Kafka. +The DataHub Service Tier also emits a commit event [Metadata Change Log] when a metadata change has been successfully committed to persistent storage. This event is sent over Kafka. The MCL stream is a public API and can be subscribed to by external systems (for example, the Actions Framework) providing an extremely powerful way to react in real-time to changes happening in metadata. For example, you could build an access control enforcer that reacts to change in metadata (e.g. a previously world-readable dataset now has a pii field) to immediately lock down the dataset in question. Note that not all MCP-s will result in an MCL, because the DataHub serving tier will ignore any duplicate changes to metadata. @@ -26,7 +28,7 @@ Note that not all MCP-s will result in an MCL, because the DataHub serving tier ### Metadata Index Applier (mae-consumer-job) [Metadata Change Log]s are consumed by another Spring job, [mae-consumer-job], which applies the changes to the [graph] and [search index] accordingly. -The job is entity-agnostic and will execute corresponding graph & search index builders, which will be invoked by the job when a specific metadata aspect is changed. +The job is entity-agnostic and will execute corresponding graph & search index builders, which will be invoked by the job when a specific metadata aspect is changed. The builder should instruct the job how to update the graph and search index based on the metadata change. To ensure that metadata changes are processed in the correct chronological order, MCLs are keyed by the entity [URN] — meaning all MAEs for a particular entity will be processed sequentially by a single thread. @@ -44,13 +46,10 @@ Primary-key based reads (e.g. 
getting schema metadata for a dataset based on the [GMS]: ../what/gms.md [Metadata Change Log]: ../what/mxe.md#metadata-change-log-mcl [rest.li]: https://rest.li - - [Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp [Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl [MCP]: ../what/mxe.md#metadata-change-proposal-mcp [MCL]: ../what/mxe.md#metadata-change-log-mcl - [equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer [graph]: ../what/graph.md [search index]: ../what/search-index.md diff --git a/docs/authentication/concepts.md b/docs/authentication/concepts.md index 715e94c7e0380..5ce20546ed612 100644 --- a/docs/authentication/concepts.md +++ b/docs/authentication/concepts.md @@ -11,7 +11,9 @@ We introduced a few important concepts to the Metadata Service to make authentic In following sections, we'll take a closer look at each individually. -![](../imgs/metadata-service-auth.png) ++ +
*High level overview of Metadata Service Authentication* ## What is an Actor? @@ -53,7 +55,7 @@ There can be many types of Authenticator. For example, there can be Authenticato - Verify the authenticity of access tokens (ie. issued by either DataHub itself or a 3rd-party IdP) - Authenticate username / password credentials against a remote database (ie. LDAP) -and more! A key goal of the abstraction is *extensibility*: a custom Authenticator can be developed to authenticate requests +and more! A key goal of the abstraction is _extensibility_: a custom Authenticator can be developed to authenticate requests based on an organization's unique needs. DataHub ships with 2 Authenticators by default: @@ -75,13 +77,13 @@ The Authenticator Chain can be configured in the `application.yml` file under `a ``` authentication: - .... + .... authenticators: - # Configure the Authenticators in the chain + # Configure the Authenticators in the chain - type: com.datahub.authentication.Authenticator1 ... - - type: com.datahub.authentication.Authenticator2 - .... + - type: com.datahub.authentication.Authenticator2 + .... ``` ## What is the AuthenticationFilter? @@ -91,7 +93,6 @@ It does so by constructing and invoking an **AuthenticatorChain**, described abo If an Actor is unable to be resolved by the AuthenticatorChain, then a 401 unauthorized exception will be returned by the filter. - ## What is a DataHub Token Service? What are Access Tokens? Along with Metadata Service Authentication comes an important new component called the **DataHub Token Service**. The purpose of this @@ -114,10 +115,10 @@ Today, Access Tokens are granted by the Token Service under two scenarios: 1. **UI Login**: When a user logs into the DataHub UI, for example via [JaaS](guides/jaas.md) or [OIDC](guides/sso/configure-oidc-react.md), the `datahub-frontend` service issues an - request to the Metadata Service to generate a SESSION token *on behalf of* of the user logging in. (*Only the frontend service is authorized to perform this action). + request to the Metadata Service to generate a SESSION token _on behalf of_ of the user logging in. (\*Only the frontend service is authorized to perform this action). 2. **Generating Personal Access Tokens**: When a user requests to generate a Personal Access Token (described below) from the UI. > At present, the Token Service supports the symmetric signing method `HS256` to generate and verify tokens. Now that we're familiar with the concepts, we will talk concretely about what new capabilities have been built on top -of Metadata Service Authentication. \ No newline at end of file +of Metadata Service Authentication. diff --git a/docs/authentication/guides/sso/configure-oidc-react-azure.md b/docs/authentication/guides/sso/configure-oidc-react-azure.md index d185957967882..10601769a9d3d 100644 --- a/docs/authentication/guides/sso/configure-oidc-react-azure.md +++ b/docs/authentication/guides/sso/configure-oidc-react-azure.md @@ -1,5 +1,6 @@ # Configuring Azure Authentication for React App (OIDC) -*Authored on 21/12/2021* + +_Authored on 21/12/2021_ `datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to delegate authentication responsibility to identity providers like Microsoft Azure. @@ -24,53 +25,63 @@ b. Select **App registrations**, then **New registration** to register a new app c. Name your app registration and choose who can access your application. d. 
Select `Web` as the **Redirect URI** type and enter the following: + ``` https://your-datahub-domain.com/callback/oidc ``` + If you are just testing locally, the following can be used: `http://localhost:9002/callback/oidc`. Azure supports more than one redirect URI, so both can be configured at the same time from the **Authentication** tab once the registration is complete. At this point, your app registration should look like the following: -![azure-setup-app-registration](img/azure-setup-app-registration.png) ++ +
e. Click **Register**. ### 2. Configure Authentication (optional) -Once registration is done, you will land on the app registration **Overview** tab. On the left-side navigation bar, click on **Authentication** under **Manage** and add extra redirect URIs if need be (if you want to support both local testing and Azure deployments). +Once registration is done, you will land on the app registration **Overview** tab. On the left-side navigation bar, click on **Authentication** under **Manage** and add extra redirect URIs if need be (if you want to support both local testing and Azure deployments). -![azure-setup-authentication](img/azure-setup-authentication.png) ++ +
Click **Save**. ### 3. Configure Certificates & secrets On the left-side navigation bar, click on **Certificates & secrets** under **Manage**. -Select **Client secrets**, then **New client secret**. Type in a meaningful description for your secret and select an expiry. Click the **Add** button when you are done. +Select **Client secrets**, then **New client secret**. Type in a meaningful description for your secret and select an expiry. Click the **Add** button when you are done. **IMPORTANT:** Copy the `value` of your newly create secret since Azure will never display its value afterwards. -![azure-setup-certificates-secrets](img/azure-setup-certificates-secrets.png) ++ +
### 4. Configure API permissions -On the left-side navigation bar, click on **API permissions** under **Manage**. DataHub requires the following four Microsoft Graph APIs: +On the left-side navigation bar, click on **API permissions** under **Manage**. DataHub requires the following four Microsoft Graph APIs: -1. `User.Read` *(should be already configured)* +1. `User.Read` _(should be already configured)_ 2. `profile` 3. `email` 4. `openid` -Click on **Add a permission**, then from the **Microsoft APIs** tab select **Microsoft Graph**, then **Delegated permissions**. From the **OpenId permissions** category, select `email`, `openid`, `profile` and click **Add permissions**. +Click on **Add a permission**, then from the **Microsoft APIs** tab select **Microsoft Graph**, then **Delegated permissions**. From the **OpenId permissions** category, select `email`, `openid`, `profile` and click **Add permissions**. At this point, you should be looking at a screen like the following: -![azure-setup-api-permissions](img/azure-setup-api-permissions.png) ++ +
### 5. Obtain Application (Client) ID -On the left-side navigation bar, go back to the **Overview** tab. You should see the `Application (client) ID`. Save its value for the next step. +On the left-side navigation bar, go back to the **Overview** tab. You should see the `Application (client) ID`. Save its value for the next step. ### 6. Obtain Discovery URI @@ -108,4 +119,5 @@ docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml Navigate to your DataHub domain to see SSO in action. ## Resources -- [Microsoft identity platform and OpenID Connect protocol](https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc/) \ No newline at end of file + +- [Microsoft identity platform and OpenID Connect protocol](https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc/) diff --git a/docs/authentication/guides/sso/configure-oidc-react-google.md b/docs/authentication/guides/sso/configure-oidc-react-google.md index 474538097aae2..16bf66dfc3101 100644 --- a/docs/authentication/guides/sso/configure-oidc-react-google.md +++ b/docs/authentication/guides/sso/configure-oidc-react-google.md @@ -1,8 +1,9 @@ # Configuring Google Authentication for React App (OIDC) -*Authored on 3/10/2021* -`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to delegate -authentication responsibility to identity providers like Google. +_Authored on 3/10/2021_ + +`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to delegate +authentication responsibility to identity providers like Google. This guide will provide steps for configuring DataHub authentication using Google. @@ -17,40 +18,42 @@ please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS ### 1. Create a project in the Google API Console -Using an account linked to your organization, navigate to the [Google API Console](https://console.developers.google.com/) and select **New project**. -Within this project, we will configure the OAuth2.0 screen and credentials. +Using an account linked to your organization, navigate to the [Google API Console](https://console.developers.google.com/) and select **New project**. +Within this project, we will configure the OAuth2.0 screen and credentials. ### 2. Create OAuth2.0 consent screen -a. Navigate to `OAuth consent screen`. This is where you'll configure the screen your users see when attempting to -log in to DataHub. +a. Navigate to `OAuth consent screen`. This is where you'll configure the screen your users see when attempting to +log in to DataHub. -b. Select `Internal` (if you only want your company users to have access) and then click **Create**. -Note that in order to complete this step you should be logged into a Google account associated with your organization. +b. Select `Internal` (if you only want your company users to have access) and then click **Create**. +Note that in order to complete this step you should be logged into a Google account associated with your organization. c. Fill out the details in the App Information & Domain sections. Make sure the 'Application Home Page' provided matches where DataHub is deployed -at your organization. +at your organization. -![google-setup-1](img/google-setup-1.png) ++ +
-Once you've completed this, **Save & Continue**. +Once you've completed this, **Save & Continue**. d. Configure the scopes: Next, click **Add or Remove Scopes**. Select the following scopes: - + - `.../auth/userinfo.email` - `.../auth/userinfo.profile` - `openid` -Once you've selected these, **Save & Continue**. +Once you've selected these, **Save & Continue**. ### 3. Configure client credentials -Now navigate to the **Credentials** tab. This is where you'll obtain your client id & secret, as well as configure info -like the redirect URI used after a user is authenticated. +Now navigate to the **Credentials** tab. This is where you'll obtain your client id & secret, as well as configure info +like the redirect URI used after a user is authenticated. a. Click **Create Credentials** & select `OAuth client ID` as the credential type. -b. On the following screen, select `Web application` as your Application Type. +b. On the following screen, select `Web application` as your Application Type. c. Add the domain where DataHub is hosted to your 'Authorized Javascript Origins'. @@ -58,7 +61,7 @@ c. Add the domain where DataHub is hosted to your 'Authorized Javascript Origins https://your-datahub-domain.com ``` -d. Add the domain where DataHub is hosted with the path `/callback/oidc` appended to 'Authorized Redirect URLs'. +d. Add the domain where DataHub is hosted with the path `/callback/oidc` appended to 'Authorized Redirect URLs'. ``` https://your-datahub-domain.com/callback/oidc @@ -70,7 +73,9 @@ f. You will now receive a pair of values, a client id and a client secret. Bookm At this point, you should be looking at a screen like the following: -![google-setup-2](img/google-setup-2.png) ++ +
Success! @@ -78,7 +83,7 @@ Success! a. Open the file `docker/datahub-frontend/env/docker.env` -b. Add the following configuration values to the file: +b. Add the following configuration values to the file: ``` AUTH_OIDC_ENABLED=true @@ -91,20 +96,18 @@ AUTH_OIDC_USER_NAME_CLAIM=email AUTH_OIDC_USER_NAME_CLAIM_REGEX=([^@]+) ``` -Replacing the placeholders above with the client id & client secret received from Google in Step 3f. - +Replacing the placeholders above with the client id & client secret received from Google in Step 3f. ### 5. Restart `datahub-frontend-react` docker container -Now, simply restart the `datahub-frontend-react` container to enable the integration. +Now, simply restart the `datahub-frontend-react` container to enable the integration. ``` docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react ``` -Navigate to your DataHub domain to see SSO in action. - +Navigate to your DataHub domain to see SSO in action. ## References -- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect) \ No newline at end of file +- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect) diff --git a/docs/authentication/guides/sso/configure-oidc-react-okta.md b/docs/authentication/guides/sso/configure-oidc-react-okta.md index cfede999f1e70..54d2b54357aad 100644 --- a/docs/authentication/guides/sso/configure-oidc-react-okta.md +++ b/docs/authentication/guides/sso/configure-oidc-react-okta.md @@ -1,5 +1,6 @@ # Configuring Okta Authentication for React App (OIDC) -*Authored on 3/10/2021* + +_Authored on 3/10/2021_ `datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to delegate authentication responsibility to identity providers like Okta. @@ -52,7 +53,6 @@ If you're just testing locally, this can be `http://localhost:9002`. i. Click **Save** - ### 2. Obtain Client Credentials On the subsequent screen, you should see the client credentials. Bookmark the `Client id` and `Client secret` for the next step. @@ -69,8 +69,12 @@ for example, `https://dev-33231928.okta.com/.well-known/openid-configuration`. At this point, you should be looking at a screen like the following: -![okta-setup-1](img/okta-setup-1.png) -![okta-setup-2](img/okta-setup-2.png) ++ +
++ +
Success! @@ -91,12 +95,15 @@ AUTH_OIDC_SCOPE="openid profile email groups" Replacing the placeholders above with the client id & client secret received from Okta in Step 2. -> **Pro Tip!** You can easily enable Okta to return the groups that a user is associated with, which will be provisioned in DataHub, along with the user logging in. This can be enabled by setting the `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` flag to `true`. +> **Pro Tip!** You can easily enable Okta to return the groups that a user is associated with, which will be provisioned in DataHub, along with the user logging in. This can be enabled by setting the `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` flag to `true`. > if they do not already exist in DataHub. You can enable your Okta application to return a 'groups' claim from the Okta Console at Applications > Your Application -> Sign On -> OpenID Connect ID Token Settings (Requires an edit). -> -> By default, we assume that the groups will appear in a claim named "groups". This can be customized using the `AUTH_OIDC_GROUPS_CLAIM` container configuration. -> -> ![okta-setup-2](img/okta-setup-groups-claim.png) +> +> By default, we assume that the groups will appear in a claim named "groups". This can be customized using the `AUTH_OIDC_GROUPS_CLAIM` container configuration. +> +>+> + +
### 5. Restart `datahub-frontend-react` docker container @@ -109,4 +116,5 @@ docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml Navigate to your DataHub domain to see SSO in action. ## Resources + - [OAuth 2.0 and OpenID Connect Overview](https://developer.okta.com/docs/concepts/oauth-openid/) diff --git a/docs/authentication/personal-access-tokens.md b/docs/authentication/personal-access-tokens.md index 0188aab49444e..8fdce8a83507d 100644 --- a/docs/authentication/personal-access-tokens.md +++ b/docs/authentication/personal-access-tokens.md @@ -4,15 +4,16 @@ import FeatureAvailability from '@site/src/components/FeatureAvailability';+ +
:::note @@ -87,8 +90,8 @@ is enabled. ### GraphQL - - Have a look at [Token Management in GraphQL](../api/graphql/token-management.md) to learn how to manage tokens programatically! - +- Have a look at [Token Management in GraphQL](../api/graphql/token-management.md) to learn how to manage tokens programatically! + ## FAQ and Troubleshooting **The button to create tokens is greyed out - why can’t I click on it?** @@ -104,5 +107,4 @@ A PAT represents a user in DataHub, if that user does not have permissions for a Yes, although not through the UI correctly, you will have to use the [token management graphQL API](../api/graphql/token-management.md) and the user making the request must have `Manage All Access Tokens` permissions. -*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!* - +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs/components.md b/docs/components.md index ef76729bb37fb..a425d10015102 100644 --- a/docs/components.md +++ b/docs/components.md @@ -4,56 +4,58 @@ title: "Components" # DataHub Components Overview -The DataHub platform consists of the components shown in the following diagram. +The DataHub platform consists of the components shown in the following diagram. -![DataHub Component Overview](./imgs/datahub-components.png) ++ +
-## Metadata Store +## Metadata Store The Metadata Store is responsible for storing the [Entities & Aspects](https://datahubproject.io/docs/metadata-modeling/metadata-model/) comprising the Metadata Graph. This includes exposing an API for [ingesting metadata](https://datahubproject.io/docs/metadata-service#ingesting-entities), [fetching Metadata by primary key](https://datahubproject.io/docs/metadata-service#retrieving-entities), [searching entities](https://datahubproject.io/docs/metadata-service#search-an-entity), and [fetching Relationships](https://datahubproject.io/docs/metadata-service#get-relationships-edges) between entities. It consists of a Spring Java Service hosting a set of [Rest.li](https://linkedin.github.io/rest.li/) API endpoints, along with -MySQL, Elasticsearch, & Kafka for primary storage & indexing. +MySQL, Elasticsearch, & Kafka for primary storage & indexing. -Get started with the Metadata Store by following the [Quickstart Guide](https://datahubproject.io/docs/quickstart/). +Get started with the Metadata Store by following the [Quickstart Guide](https://datahubproject.io/docs/quickstart/). ## Metadata Models Metadata Models are schemas defining the shape of the Entities & Aspects comprising the Metadata Graph, along with the relationships between them. They are defined using [PDL](https://linkedin.github.io/rest.li/pdl_schema), a modeling language quite similar in form to Protobuf while serializes to JSON. Entities represent a specific class of Metadata -Asset such as a Dataset, a Dashboard, a Data Pipeline, and beyond. Each *instance* of an Entity is identified by a unique identifier called an `urn`. Aspects represent related bundles of data attached -to an instance of an Entity such as its descriptions, tags, and more. View the current set of Entities supported [here](https://datahubproject.io/docs/metadata-modeling/metadata-model#exploring-datahubs-metadata-model). +Asset such as a Dataset, a Dashboard, a Data Pipeline, and beyond. Each _instance_ of an Entity is identified by a unique identifier called an `urn`. Aspects represent related bundles of data attached +to an instance of an Entity such as its descriptions, tags, and more. View the current set of Entities supported [here](https://datahubproject.io/docs/metadata-modeling/metadata-model#exploring-datahubs-metadata-model). -Learn more about DataHub models Metadata [here](https://datahubproject.io/docs/metadata-modeling/metadata-model/). +Learn more about DataHub models Metadata [here](https://datahubproject.io/docs/metadata-modeling/metadata-model/). ## Ingestion Framework The Ingestion Framework is a modular, extensible Python library for extracting Metadata from external source systems (e.g. Snowflake, Looker, MySQL, Kafka), transforming it into DataHub's [Metadata Model](https://datahubproject.io/docs/metadata-modeling/metadata-model/), and writing it into DataHub via either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an [extensive list of Source connectors](https://datahubproject.io/docs/metadata-ingestion/#installing-plugins) to choose from, along with -a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more. +a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more. Getting started with the Ingestion Framework is as simple: just define a YAML file and execute the `datahub ingest` command. 
-Learn more by heading over the the [Metadata Ingestion](https://datahubproject.io/docs/metadata-ingestion/) guide. +Learn more by heading over the the [Metadata Ingestion](https://datahubproject.io/docs/metadata-ingestion/) guide. ## GraphQL API The [GraphQL](https://graphql.org/) API provides a strongly-typed, entity-oriented API that makes interacting with the Entities comprising the Metadata -Graph simple, including APIs for adding and removing tags, owners, links & more to Metadata Entities! Most notably, this API is consumed by the User Interface (discussed below) for enabling Search & Discovery, Governance, Observability -and more. +Graph simple, including APIs for adding and removing tags, owners, links & more to Metadata Entities! Most notably, this API is consumed by the User Interface (discussed below) for enabling Search & Discovery, Governance, Observability +and more. -To get started using the GraphQL API, check out the [Getting Started with GraphQL](https://datahubproject.io/docs/api/graphql/getting-started) guide. +To get started using the GraphQL API, check out the [Getting Started with GraphQL](https://datahubproject.io/docs/api/graphql/getting-started) guide. ## User Interface DataHub comes with a React UI including an ever-evolving set of features to make Discovering, Governing, & Debugging your Data Assets easy & delightful. For a full overview of the capabilities currently supported, take a look at the [Features](https://datahubproject.io/docs/features/) overview. For a look at what's coming next, -head over to the [Roadmap](https://datahubproject.io/docs/roadmap/). +head over to the [Roadmap](https://datahubproject.io/docs/roadmap/). ## Learn More Learn more about the specifics of the [DataHub Architecture](./architecture/architecture.md) in the Architecture Overview. Learn about using & developing the components -of the Platform by visiting the Module READMEs. +of the Platform by visiting the Module READMEs. ## Feedback / Questions / Concerns diff --git a/docs/deploy/aws.md b/docs/deploy/aws.md index 7b01ffa02a744..399db2de8edc0 100644 --- a/docs/deploy/aws.md +++ b/docs/deploy/aws.md @@ -13,8 +13,7 @@ skip the corresponding sections. This guide requires the following tools: - [kubectl](https://kubernetes.io/docs/tasks/tools/) to manage kubernetes resources -- [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support Helm - 3. +- [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support Helm 3. - [eksctl](https://eksctl.io/introduction/#installation) to create and manage clusters on EKS - [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) to manage AWS resources @@ -63,7 +62,7 @@ steps in this [guide](kubernetes.md) Now that all the pods are up and running, you need to expose the datahub-frontend end point by setting up [ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/). To do this, you need to first set up an ingress controller. There are -many [ingress controllers](https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/) to choose +many [ingress controllers](https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/) to choose from, but here, we will follow this [guide](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html) to set up the AWS Application Load Balancer(ALB) Controller. 
@@ -94,7 +93,7 @@ eksctl create iamserviceaccount \ --name=aws-load-balancer-controller \ --attach-policy-arn=arn:aws:iam::<+ +
First, add the DB password to kubernetes by running the following. @@ -234,7 +235,9 @@ Provision an elasticsearch domain running elasticsearch version 7.10 or above th cluster or has VPC peering set up between the VPC of the kubernetes cluster. Once the domain is provisioned, you should be able to see the following page. Take a note of the endpoint marked by the red box. -![AWS Elasticsearch Service](../imgs/aws/aws-elasticsearch.png) ++ +
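Before wiring the endpoint into the Helm values in the next step, it can be worth confirming that the domain is reachable from inside the cluster. A rough sketch, with the endpoint as a placeholder:

```shell
# Sketch: check connectivity to the Elasticsearch/OpenSearch domain from inside the cluster.
# Replace the placeholder with the endpoint noted above (red box).
ES_ENDPOINT="vpc-your-domain.us-west-2.es.amazonaws.com"
kubectl run es-check --rm -it --image=curlimages/curl --restart=Never --command -- \
  curl -sS "https://${ES_ENDPOINT}:443" -o /dev/null -w "%{http_code}\n"
```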
Update the elasticsearch settings under global in the values.yaml as follows. @@ -274,11 +277,15 @@ Then use the settings below. secretRef: elasticsearch-secrets secretKey: elasticsearch-password ``` + If you have access control enabled with IAM auth, enable AWS auth signing in Datahub + ``` - OPENSEARCH_USE_AWS_IAM_AUTH=true + OPENSEARCH_USE_AWS_IAM_AUTH=true ``` + Then use the settings below. + ``` elasticsearch: host: <+ +
Update the kafka settings under global in the values.yaml as follows. @@ -421,7 +430,7 @@ The minimum permissions required looks like this } ``` -The latter part is required to have "*" as the resource because of an issue in the AWS Glue schema registry library. +The latter part is required to have "\*" as the resource because of an issue in the AWS Glue schema registry library. Refer to [this issue](https://github.com/awslabs/aws-glue-schema-registry/issues/68) for any updates. Glue currently doesn't support AWS Signature V4. As such, we cannot use service accounts to give permissions to access diff --git a/docs/deploy/confluent-cloud.md b/docs/deploy/confluent-cloud.md index d93ffcceaecee..df31df09be64c 100644 --- a/docs/deploy/confluent-cloud.md +++ b/docs/deploy/confluent-cloud.md @@ -1,4 +1,4 @@ -# Integrating with Confluent Cloud +# Integrating with Confluent Cloud DataHub provides the ability to easily leverage Confluent Cloud as your Kafka provider. To do so, you'll need to configure DataHub to talk to a broker and schema registry hosted by Confluent. @@ -8,7 +8,7 @@ Doing this is a matter of configuring the Kafka Producer and Consumers used by D First, you'll need to create following new topics in the [Confluent Control Center](https://docs.confluent.io/platform/current/control-center/index.html). By default they have the following names: -1. **MetadataChangeProposal_v1** +1. **MetadataChangeProposal_v1** 2. **FailedMetadataChangeProposal_v1** 3. **MetadataChangeLog_Versioned_v1** 4. **MetadataChangeLog_Timeseries_v1** @@ -18,13 +18,15 @@ First, you'll need to create following new topics in the [Confluent Control Cent 8. (Deprecated) **FailedMetadataChangeEvent_v4**: Failed to process #1 event The first five are the most important, and are explained in more depth in [MCP/MCL](../advanced/mcp-mcl.md). The final topics are -those which are deprecated but still used under certain circumstances. It is likely that in the future they will be completely -decommissioned. +those which are deprecated but still used under certain circumstances. It is likely that in the future they will be completely +decommissioned. To create the topics, navigate to your **Cluster** and click "Create Topic". Feel free to tweak the default topic configurations to match your preferences. -![CreateTopic](../imgs/confluent-create-topic.png) ++ +
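If you prefer the command line over the Control Center UI, the same topics can be created with the Confluent CLI. This is a sketch, assuming `confluent login` has already been run and the target environment and cluster are selected; adjust partitions and other settings to match your preferences.

```shell
# Sketch: create the core DataHub topics with the Confluent CLI instead of the UI.
# Add the remaining topics from the list above in the same way.
for topic in MetadataChangeProposal_v1 \
             FailedMetadataChangeProposal_v1 \
             MetadataChangeLog_Versioned_v1 \
             MetadataChangeLog_Timeseries_v1; do
  confluent kafka topic create "$topic" --partitions 1
done
```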
## Step 2: Configure DataHub Container to use Confluent Cloud Topics @@ -62,7 +64,7 @@ KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123Lh ``` Note that this step is only required if `DATAHUB_ANALYTICS_ENABLED` environment variable is not explicitly set to false for the datahub-frontend -container. +container. If you're deploying with Docker Compose, you do not need to deploy the Zookeeper, Kafka Broker, or Schema Registry containers that ship by default. @@ -72,16 +74,16 @@ Configuring Confluent Cloud for DataHub Actions requires some additional edits t source connection config you will need to add the Python style client connection information: ```yaml - connection: - bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} - schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} - consumer_config: - security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} - sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} - sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} - sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} - schema_registry_config: - basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} +connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + consumer_config: + security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} + sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} + sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} + schema_registry_config: + basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} ``` Specifically `sasl.username` and `sasl.password` are the differences from the base `executor.yaml` example file. @@ -115,7 +117,7 @@ First, disable the `cp-schema-registry` service: ``` cp-schema-registry: - enabled: false + enabled: false ``` Next, disable the `kafkaSetupJob` service: @@ -139,13 +141,16 @@ Next, you'll want to create 2 new Kubernetes secrets, one for the JaaS configura and another for the user info used for connecting to the schema registry. You'll find the values for each within the Confluent Control Center. Specifically, select "Clients" -> "Configure new Java Client". You should see a page like the following: - -![Config](../imgs/confluent-cloud-config.png) ++ +
You'll want to generate both a Kafka Cluster API Key & a Schema Registry key. Once you do so, you should see the config automatically populate with your new secrets: -![Config](../imgs/confluent-cloud-config-2.png) ++ +
You'll need to copy the values of `sasl.jaas.config` and `basic.auth.user.info` for the next step. @@ -165,7 +170,7 @@ kubectl create secret generic confluent-secrets --from-literal=basic_auth_user_i ``` Finally, we'll configure our containers to pick up the Confluent Kafka Configs by changing two config blocks in our `values.yaml` file. You -should see these blocks commented at the bottom of the template. You'll want to uncomment them and set them to the following values: +should see these blocks commented at the bottom of the template. You'll want to uncomment them and set them to the following values: ``` credentialsAndCertsSecrets: @@ -182,7 +187,7 @@ springKafkaConfigurationOverrides: basic.auth.credentials.source: USER_INFO ``` -Then simply apply the updated `values.yaml` to your K8s cluster via `kubectl apply`. +Then simply apply the updated `values.yaml` to your K8s cluster via `kubectl apply`. #### DataHub Actions @@ -190,16 +195,16 @@ Configuring Confluent Cloud for DataHub Actions requires some additional edits t source connection config you will need to add the Python style client connection information: ```yaml - connection: - bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} - schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} - consumer_config: - security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} - sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} - sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} - sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} - schema_registry_config: - basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} +connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + consumer_config: + security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} + sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} + sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} + schema_registry_config: + basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} ``` Specifically `sasl.username` and `sasl.password` are the differences from the base `executor.yaml` example file. @@ -222,8 +227,9 @@ credentialsAndCertsSecrets: The Actions pod will automatically pick these up in the correctly named environment variables when they are named this exact way. ## Contribution + Accepting contributions for a setup script compatible with Confluent Cloud! The kafka-setup-job container we ship with is only compatible with a distribution of Kafka wherein ZooKeeper -is exposed and available. A version of the job using the [Confluent CLI](https://docs.confluent.io/confluent-cli/current/command-reference/kafka/topic/confluent_kafka_topic_create.html) -would be very useful for the broader community. \ No newline at end of file +is exposed and available. A version of the job using the [Confluent CLI](https://docs.confluent.io/confluent-cli/current/command-reference/kafka/topic/confluent_kafka_topic_create.html) +would be very useful for the broader community. diff --git a/docs/deploy/gcp.md b/docs/deploy/gcp.md index 3713d69f90636..6b01236007408 100644 --- a/docs/deploy/gcp.md +++ b/docs/deploy/gcp.md @@ -13,8 +13,7 @@ skip the corresponding sections. This guide requires the following tools: - [kubectl](https://kubernetes.io/docs/tasks/tools/) to manage kubernetes resources -- [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support Helm - 3. 
+- [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support Helm 3. - [gcloud](https://cloud.google.com/sdk/docs/install) to manage GCP resources Follow the @@ -65,16 +64,22 @@ the GKE page on [GCP website](https://console.cloud.google.com/kubernetes/discov Once all deploy is successful, you should see a page like below in the "Services & Ingress" tab on the left. -![Services and Ingress](../imgs/gcp/services_ingress.png) ++ +
Tick the checkbox for datahub-datahub-frontend and click the "CREATE INGRESS" button. You should land on the following page. -![Ingress1](../imgs/gcp/ingress1.png) ++ +
Type in an arbitrary name for the ingress and click on the second step "Host and path rules". You should land on the following page. -![Ingress2](../imgs/gcp/ingress2.png) ++ +
Select "datahub-datahub-frontend" in the dropdown menu for backends, and then click on "ADD HOST AND PATH RULE" button. In the second row that got created, add in the host name of choice (here gcp.datahubproject.io) and select @@ -83,20 +88,25 @@ In the second row that got created, add in the host name of choice (here gcp.dat This step adds the rule allowing requests from the host name of choice to get routed to datahub-frontend service. Click on step 3 "Frontend configuration". You should land on the following page. -![Ingress3](../imgs/gcp/ingress3.png) ++ +
Choose HTTPS in the dropdown menu for protocol. To enable SSL, you need to add a certificate. If you do not have one, you can click "CREATE A NEW CERTIFICATE" and input the host name of choice. GCP will create a certificate for you. Now press "CREATE" button on the left to create ingress! After around 5 minutes, you should see the following. -![Ingress Ready](../imgs/gcp/ingress_ready.png) ++ +
In your domain provider, add an A record for the host name set above using the IP address on the ingress page (noted with the red box). Once DNS updates, you should be able to access DataHub through the host name! Note: ignore the warning icon next to the ingress. It takes about ten minutes for ingress to check that the backend service -is ready and show a check mark as follows. However, ingress is fully functional once you see the above page. - -![Ingress Final](../imgs/gcp/ingress_final.png) +is ready and show a check mark as follows. However, ingress is fully functional once you see the above page. ++ +
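Once the A record has propagated, a quick verification sketch (using the gcp.datahubproject.io host name from the example above as a stand-in for your own):

```shell
# Sketch: confirm DNS resolution and that the frontend answers over HTTPS.
# Substitute the host name you configured for the ingress.
dig +short gcp.datahubproject.io
curl -sS -o /dev/null -w "%{http_code}\n" https://gcp.datahubproject.io
```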
diff --git a/docs/dev-guides/timeline.md b/docs/dev-guides/timeline.md index 966e659b90991..2c8c5d6e875d1 100644 --- a/docs/dev-guides/timeline.md +++ b/docs/dev-guides/timeline.md @@ -12,14 +12,18 @@ The Timeline API is available in server versions `0.8.28` and higher. The `cli` # Concepts ## Entity Timeline Conceptually + For the visually inclined, here is a conceptual diagram that illustrates how to think about the entity timeline with categorical changes overlaid on it. -![../imgs/timeline/timeline-conceptually.png](../imgs/timeline/timeline-conceptually.png) ++ +
## Change Event -Each modification is modeled as a + +Each modification is modeled as a [ChangeEvent](../../metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeEvent.java) -which are grouped under [ChangeTransactions](../../metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeTransaction.java) +which are grouped under [ChangeTransactions](../../metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeTransaction.java) based on timestamp. A `ChangeEvent` consists of: - `changeType`: An operational type for the change, either `ADD`, `MODIFY`, or `REMOVE` @@ -31,10 +35,11 @@ based on timestamp. A `ChangeEvent` consists of: - `changeDetails`: A loose property map of additional details about the change ### Change Event Examples -- A tag was applied to a *field* of a dataset through the UI: + +- A tag was applied to a _field_ of a dataset through the UI: - `changeType`: `ADD` - `target`: `urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:+ +
++ +
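As a concrete illustration of pulling these change events, the timeline for a single dataset can be fetched with the DataHub CLI. This is a sketch; the URN is a placeholder and flag names may differ slightly across CLI versions, so treat the timeline guide itself as authoritative.

```shell
# Sketch: fetch the TAG change timeline for one dataset over the past week.
# The dataset URN is a placeholder; flags may vary by CLI version.
datahub timeline \
  --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)" \
  --category TAG --start 7daysago --end now
```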
# Future Work @@ -238,4 +255,3 @@ Here are a few screenshots showing how to navigate to it. You can try out the AP - Adding GraphQL API support - Supporting materialization of computed versions for entity categories (compared to the current read-time version computation) - Support in the UI to visualize the timeline in various places (e.g. schema history, etc.) - diff --git a/docs/docker/development.md b/docs/docker/development.md index 329bda2c06bac..2cb7233090809 100644 --- a/docs/docker/development.md +++ b/docs/docker/development.md @@ -11,7 +11,7 @@ The `docker-compose.dev.yml` file bypasses the need to rebuild docker images by and other data. These dev images, tagged with `debug` will use your _locally built code_ with gradle. Building locally and bypassing the need to rebuild the Docker images should be much faster. -We highly recommend you just invoke `./gradlew quickstartDebug` task. +We highly recommend you just invoke `./gradlew quickstartDebug` task. ```shell ./gradlew quickstartDebug @@ -20,12 +20,12 @@ We highly recommend you just invoke `./gradlew quickstartDebug` task. This task is defined in `docker/build.gradle` and executes the following steps: 1. Builds all required artifacts to run DataHub. This includes both application code such as the GMS war, the frontend -distribution zip which contains javascript, as wel as secondary support docker containers. - + distribution zip which contains javascript, as wel as secondary support docker containers. + 1. Locally builds Docker images with the expected `debug` tag required by the docker compose files. 1. Runs the special `docker-compose.dev.yml` and supporting docker-compose files to mount local files directly in the -containers with remote debugging ports enabled. + containers with remote debugging ports enabled. Once the `debug` docker images are constructed you'll see images similar to the following: @@ -45,6 +45,7 @@ At this point it is possible to view the DataHub UI at `http://localhost:9002` a Next, perform the desired modifications and rebuild the frontend and/or GMS components. **Builds GMS** + ```shell ./gradlew :metadata-service:war:build ``` @@ -57,7 +58,7 @@ Including javascript components. ./gradlew :datahub-frontend:build ``` -After building the artifacts only a restart of the container(s) is required to run with the updated code. +After building the artifacts only a restart of the container(s) is required to run with the updated code. The restart can be performed using a docker UI, the docker cli, or the following gradle task. ```shell @@ -69,11 +70,13 @@ The restart can be performed using a docker UI, the docker cli, or the following The following commands can pause the debugging environment to release resources when not needed. Pause containers and free resources. + ```shell docker compose -p datahub stop ``` Resume containers for further debugging. + ```shell docker compose -p datahub start ``` @@ -92,8 +95,9 @@ Environment variables control the debugging ports for GMS and the frontend. The screenshot shows an example configuration for IntelliJ using the default GMS debugging port of 5001. -![](../imgs/development/intellij-remote-debug.png) - ++ +
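Putting the steps above together, a typical edit-and-reload loop uses only the commands already shown in this guide; restarting the affected container afterwards can be done from a Docker UI, the Docker CLI, or the gradle task mentioned above. A sketch:

```shell
# Sketch: typical local development loop with the debug compose stack.
./gradlew quickstartDebug                 # one-time: build artifacts, debug images, start the stack

./gradlew :metadata-service:war:build     # rebuild GMS after backend changes
./gradlew :datahub-frontend:build         # rebuild the frontend after UI changes
# ...then restart the affected container(s) as described above.

docker compose -p datahub stop            # pause the environment to free resources
docker compose -p datahub start           # resume it later
```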
## Tips for People New To Docker @@ -125,9 +129,11 @@ running. If you, for some reason, wish to change this behavior, check out these ``` docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up datahub-gms ``` + Will only start `datahub-gms` and its dependencies. ``` docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up --no-deps datahub-gms ``` + Will only start `datahub-gms`, without dependencies. diff --git a/docs/domains.md b/docs/domains.md index c846a753417c5..f668f5da5d919 100644 --- a/docs/domains.md +++ b/docs/domains.md @@ -4,17 +4,16 @@ import FeatureAvailability from '@site/src/components/FeatureAvailability';- +
Once you're on the Domains page, you'll see a list of all the Domains that have been created on DataHub. Additionally, you can -view the number of entities inside each Domain. +view the number of entities inside each Domain.- +
To create a new Domain, click '+ New Domain'.- +
Inside the form, you can choose a name for your Domain. Most often, this will align with your business units or groups, for example @@ -45,34 +44,36 @@ Inside the form, you can choose a name for your Domain. Most often, this will al Click on 'Advanced' to show the option to set a custom Domain id. The Domain id determines what will appear in the DataHub 'urn' (primary key) for the Domain. This option is useful if you intend to refer to Domains by a common name inside your code, or you want the primary -key to be human-readable. Proceed with caution: once you select a custom id, it cannot be easily changed. +key to be human-readable. Proceed with caution: once you select a custom id, it cannot be easily changed.- +
-By default, you don't need to worry about this. DataHub will auto-generate a unique Domain id for you. +By default, you don't need to worry about this. DataHub will auto-generate a unique Domain id for you. -Once you've chosen a name and a description, click 'Create' to create the new Domain. +Once you've chosen a name and a description, click 'Create' to create the new Domain. ### Assigning an Asset to a Domain -You can assign assets to Domain using the UI or programmatically using the API or during ingestion. +You can assign assets to Domain using the UI or programmatically using the API or during ingestion. #### UI-Based Assignment -To assign an asset to a Domain, simply navigate to the asset's profile page. At the bottom left-side menu bar, you'll + +To assign an asset to a Domain, simply navigate to the asset's profile page. At the bottom left-side menu bar, you'll see a 'Domain' section. Click 'Set Domain', and then search for the Domain you'd like to add to. When you're done, click 'Add'.- +
-To remove an asset from a Domain, click the 'x' icon on the Domain tag. +To remove an asset from a Domain, click the 'x' icon on the Domain tag. > Notice: Adding or removing an asset from a Domain requires the `Edit Domain` Metadata Privilege, which can be granted > by a [Policy](authorization/policies.md). #### Ingestion-time Assignment + All SQL-based ingestion sources support assigning domains during ingestion using the `domain` configuration. Consult your source's configuration details page (e.g. [Snowflake](./generated/ingestion/sources/snowflake.md)), to verify that it supports the Domain capability. :::note @@ -81,7 +82,6 @@ Assignment of domains during ingestion will overwrite domains that you have assi ::: - Here is a quick example of a snowflake ingestion recipe that has been enhanced to attach the **Analytics** domain to all tables in the **long_tail_companions** database in the **analytics** schema, and the **Finance** domain to all tables in the **long_tail_companions** database in the **ecommerce** schema. ```yaml @@ -90,7 +90,7 @@ source: config: username: ${SNOW_USER} password: ${SNOW_PASS} - account_id: + account_id: warehouse: COMPUTE_WH role: accountadmin database_pattern: @@ -149,27 +149,27 @@ source: Once you've created a Domain, you can use the search bar to find it.- +
Clicking on the search result will take you to the Domain's profile, where you -can edit its description, add / remove owners, and view the assets inside the Domain. +can edit its description, add / remove owners, and view the assets inside the Domain.- +
Once you've added assets to a Domain, you can filter search results to limit to those Assets -within a particular Domain using the left-side search filters. +within a particular Domain using the left-side search filters.- +
On the homepage, you'll also find a list of the most popular Domains in your organization.- +
## Additional Resources @@ -184,11 +184,11 @@ On the homepage, you'll also find a list of the most popular Domains in your org ### GraphQL -* [domain](../graphql/queries.md#domain) -* [listDomains](../graphql/queries.md#listdomains) -* [createDomains](../graphql/mutations.md#createdomain) -* [setDomain](../graphql/mutations.md#setdomain) -* [unsetDomain](../graphql/mutations.md#unsetdomain) +- [domain](../graphql/queries.md#domain) +- [listDomains](../graphql/queries.md#listdomains) +- [createDomains](../graphql/mutations.md#createdomain) +- [setDomain](../graphql/mutations.md#setdomain) +- [unsetDomain](../graphql/mutations.md#unsetdomain) #### Examples @@ -196,11 +196,13 @@ On the homepage, you'll also find a list of the most popular Domains in your org ```graphql mutation createDomain { - createDomain(input: { name: "My New Domain", description: "An optional description" }) + createDomain( + input: { name: "My New Domain", description: "An optional description" } + ) } ``` -This query will return an `urn` which you can use to fetch the Domain details. +This query will return an `urn` which you can use to fetch the Domain details. **Fetching a Domain by Urn** @@ -209,11 +211,11 @@ query getDomain { domain(urn: "urn:li:domain:engineering") { urn properties { - name - description + name + description } entities { - total + total } } } @@ -223,7 +225,10 @@ query getDomain { ```graphql mutation setDomain { - setDomain(entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", domainUrn: "urn:li:domain:engineering") + setDomain( + entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" + domainUrn: "urn:li:domain:engineering" + ) } ``` @@ -231,7 +236,7 @@ mutation setDomain { ### DataHub Blog -* [Just Shipped: UI-Based Ingestion, Data Domains & Containers, Tableau support, and MORE!](https://blog.datahubproject.io/just-shipped-ui-based-ingestion-data-domains-containers-and-more-f1b1c90ed3a) +- [Just Shipped: UI-Based Ingestion, Data Domains & Containers, Tableau support, and MORE!](https://blog.datahubproject.io/just-shipped-ui-based-ingestion-data-domains-containers-and-more-f1b1c90ed3a) ## FAQ and Troubleshooting @@ -243,9 +248,9 @@ DataHub supports Tags, Glossary Terms, & Domains as distinct types of Metadata t - **Glossary Terms**: A controlled vocabulary, with optional hierarchy. Terms are typically used to standardize types of leaf-level attributes (i.e. schema fields) for governance. E.g. (EMAIL_PLAINTEXT) - **Domains**: A set of top-level categories. Usually aligned to business units / disciplines to which the assets are most relevant. Central or distributed management. Single Domain assignment per data asset. -*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!* +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ ### Related Features -* [Glossary Terms](./glossary/business-glossary.md) -* [Tags](./tags.md) +- [Glossary Terms](./glossary/business-glossary.md) +- [Tags](./tags.md) diff --git a/docs/glossary/business-glossary.md b/docs/glossary/business-glossary.md index faab6f12fc55e..6a3244cd847e1 100644 --- a/docs/glossary/business-glossary.md +++ b/docs/glossary/business-glossary.md @@ -12,7 +12,7 @@ import FeatureAvailability from '@site/src/components/FeatureAvailability'; When working in complex data ecosystems, it is very useful to organize data assets using a shared vocabulary. 
The Business Glossary feature in DataHub helps you do this, by providing a framework for defining a standardized set of data concepts and then associating them with the physical assets that exist within your data ecosystem. -Within this document, we'll introduce the core concepts comprising DataHub's Business Glossary feature and show you how to put it to work in your organization. +Within this document, we'll introduce the core concepts comprising DataHub's Business Glossary feature and show you how to put it to work in your organization. ### Terms & Term Groups @@ -31,59 +31,82 @@ In order to view a Business Glossary, users must have the Platform Privilege cal Once granted this privilege, you can access your Glossary by clicking the dropdown at the top of the page called **Govern** and then click **Glossary**: -![](../imgs/glossary/glossary-button.png) ++ +
You are now at the root of your Glossary and should see all Terms and Term Groups with no parents assigned to them. You should also notice a hierarchy navigator on the left where you can easily check out the structure of your Glossary! -![](../imgs/glossary/root-glossary.png) ++ +
## Creating a Term or Term Group There are two ways to create Terms and Term Groups through the UI. First, you can create directly from the Glossary home page by clicking the menu dots on the top right and selecting your desired option: -![](../imgs/glossary/root-glossary-create.png) ++ +
You can also create Terms or Term Groups directly from a Term Group's page. In order to do that you need to click the menu dots on the top right and select what you want: -![](../imgs/glossary/create-from-node.png) ++ +
Note that the modal that pops up will automatically set the current Term Group you are in as the **Parent**. You can easily change this by selecting the input and navigating through your Glossary to find your desired Term Group. In addition, you could start typing the name of a Term Group to see it appear by searching. You can also leave this input blank in order to create a Term or Term Group with no parent. -![](../imgs/glossary/create-modal.png) ++ +
## Editing a Term or Term Group In order to edit a Term or Term Group, you first need to go to the page of the Term or Term Group you want to edit. Then simply click the edit icon right next to the name to open up an inline editor. Change the text and it will save when you click outside or hit Enter. -![](../imgs/glossary/edit-term.png) ++ +
## Moving a Term or Term Group Once a Term or Term Group has been created, you can always move it to be under a different Term Group parent. In order to do this, click the menu dots on the top right of either entity and select **Move**. -![](../imgs/glossary/move-term-button.png) ++ +
This will open a modal where you can navigate through your Glossary to find your desired Term Group. -![](../imgs/glossary/move-term-modal.png) ++ +
## Deleting a Term or Term Group In order to delete a Term or Term Group, you need to go to the entity page of what you want to delete then click the menu dots on the top right. From here you can select **Delete** followed by confirming through a separate modal. **Note**: at the moment we only support deleting Term Groups that do not have any children. Until cascade deleting is supported, you will have to delete all children first, then delete the Term Group. -![](../imgs/glossary/delete-button.png) ++ +
## Adding a Term to an Entity Once you've defined your Glossary, you can begin attaching terms to data assets. To add a Glossary Term to an asset, go to the entity page of your asset and find the **Add Terms** button on the right sidebar. -![](../imgs/glossary/add-term-to-entity.png) ++ +
In the modal that pops up you can select the Term you care about in one of two ways: + - Search for the Term by name in the input - Navigate through the Glossary dropdown that appears after clicking into the input -![](../imgs/glossary/add-term-modal.png) ++ +
## Privileges @@ -92,7 +115,7 @@ Glossary Terms and Term Groups abide by metadata policies like other entities. H - **Manage Direct Glossary Children**: If a user has this privilege on a Glossary Term Group, they will be able to create, edit, and delete Terms and Term Groups directly underneath the Term Group they have this privilege on. - **Manage All Glossary Children**: If a user has this privilege on a Glossary Term Group, they will be able to create, edit, and delete any Term or Term Group anywhere underneath the Term Group they have this privilege on. This applies to the children of a child Term Group as well (and so on). -## Managing Glossary with Git +## Managing Glossary with Git In many cases, it may be preferable to manage the Business Glossary in a version-control system like git. This can make managing changes across teams easier, by funneling all changes through a change management and review process. @@ -103,43 +126,41 @@ the glossary file, and how to ingest it into DataHub, check out the [Business Gl ## About Glossary Term Relationships -DataHub supports 2 different kinds of relationships _between_ individual Glossary Terms: **Inherits From** and **Contains**. +DataHub supports 2 different kinds of relationships _between_ individual Glossary Terms: **Inherits From** and **Contains**. **Contains** can be used to relate two Glossary Terms when one is a _superset_ of or _consists_ of another. For example: **Address** Term _Contains_ **Zip Code** Term, **Street** Term, & **City** Term (_Has-A_ style relationship) **Inherits** can be used to relate two Glossary Terms when one is a _sub-type_ or _sub-category_ of another. -For example: **Email** Term _Inherits From_ **PII** Term (_Is-A_ style relationship) +For example: **Email** Term _Inherits From_ **PII** Term (_Is-A_ style relationship) These relationship types allow you to map the concepts existing within your organization, enabling you to change the mapping between concepts behind the scenes, without needing to change the Glossary Terms -that are attached to individual Data Assets and Columns. +that are attached to individual Data Assets and Columns. For example, you can define a very specific, concrete Glossary Term like `Email Address` to represent a physical -data type, and then associate this with a higher-level `PII` Glossary Term via an `Inheritance` relationship. -This allows you to easily maintain a set of all Data Assets that contain or process `PII`, while keeping it easy to add +data type, and then associate this with a higher-level `PII` Glossary Term via an `Inheritance` relationship. +This allows you to easily maintain a set of all Data Assets that contain or process `PII`, while keeping it easy to add and remove new Terms from the `PII` Classification, e.g. without requiring re-annotation of individual Data Assets or Columns. - - - ## Demo Check out [our demo site](https://demo.datahubproject.io/glossary) to see an example Glossary and how it works! 
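Tying back to the git-managed workflow described above, a version-controlled glossary file and its ingestion might be sketched as follows. The YAML keys shown are an assumption based on the business glossary source; the Business Glossary source guide referenced earlier documents the real schema.

```shell
# Sketch: a git-managed glossary file plus a recipe that ingests it.
# The glossary YAML keys are assumptions; consult the Business Glossary source guide.
cat > business_glossary.yml <<'EOF'
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: Classification
    description: Top-level grouping for classification terms
    terms:
      - name: PII
        description: Personally identifiable information
EOF

cat > glossary_recipe.yml <<'EOF'
source:
  type: datahub-business-glossary
  config:
    file: business_glossary.yml
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
EOF

datahub ingest -c glossary_recipe.yml
```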
### GraphQL -* [addTerm](../../graphql/mutations.md#addterm) -* [addTerms](../../graphql/mutations.md#addterms) -* [batchAddTerms](../../graphql/mutations.md#batchaddterms) -* [removeTerm](../../graphql/mutations.md#removeterm) -* [batchRemoveTerms](../../graphql/mutations.md#batchremoveterms) -* [createGlossaryTerm](../../graphql/mutations.md#createglossaryterm) -* [createGlossaryNode](../../graphql/mutations.md#createglossarynode) (Term Group) +- [addTerm](../../graphql/mutations.md#addterm) +- [addTerms](../../graphql/mutations.md#addterms) +- [batchAddTerms](../../graphql/mutations.md#batchaddterms) +- [removeTerm](../../graphql/mutations.md#removeterm) +- [batchRemoveTerms](../../graphql/mutations.md#batchremoveterms) +- [createGlossaryTerm](../../graphql/mutations.md#createglossaryterm) +- [createGlossaryNode](../../graphql/mutations.md#createglossarynode) (Term Group) You can easily fetch the Glossary Terms for an entity with a given its URN using the **glossaryTerms** property. Check out [Working with Metadata Entities](../api/graphql/how-to-set-up-graphql.md#querying-for-glossary-terms-of-an-asset) for an example. ## Resources + - [Creating a Business Glossary and Putting it to use in DataHub](https://blog.datahubproject.io/creating-a-business-glossary-and-putting-it-to-use-in-datahub-43a088323c12) - [Tags and Terms: Two Powerful DataHub Features, Used in Two Different Scenarios](https://medium.com/datahub-project/tags-and-terms-two-powerful-datahub-features-used-in-two-different-scenarios-b5b4791e892e) diff --git a/docs/how/configuring-authorization-with-apache-ranger.md b/docs/how/configuring-authorization-with-apache-ranger.md index 26d3be6d358b2..c26dd1a516445 100644 --- a/docs/how/configuring-authorization-with-apache-ranger.md +++ b/docs/how/configuring-authorization-with-apache-ranger.md @@ -2,7 +2,9 @@ title: "Configuring Authorization with Apache Ranger" hide_title: true --- + # Configuring Authorization with Apache Ranger + DataHub integration with Apache Ranger allows DataHub Authorization policies to be controlled inside Apache Ranger. Admins can create users, groups and roles on Apache Ranger, and then assign them to Ranger policies to control the authorization of requests to DataHub. @@ -13,218 +15,249 @@ We'll break down configuration of the DataHub Apache Ranger Plugin into two part > Disclaimer: All configurations shown in this documented were tested against [Privacera Platform](https://privacera.com/) v6.3.0.1. -# Prerequisites -- User identifier present in CorpRole URN should be the name of the AD/LDAP user. For example in URN **urn:li:corpuser:datahub**, the **datahub** should present as name of user in AD/LDAP +# Prerequisites + +- User identifier present in CorpRole URN should be the name of the AD/LDAP user. For example in URN **urn:li:corpuser:datahub**, the **datahub** should present as name of user in AD/LDAP - Apache Ranger and DataHub are configured for authentication via same IDP (either LDAP + JaaS or OIDC SSO) - Apache Ranger service available via HTTP - Basic authentication is enabled on Apache Ranger Service -# Configuration +# Configuration ## Configuring your Apache Ranger Deployment Perform the following steps to configure an Apache Ranger deployment to support creating access policies compatible with DataHub. For kubernetes example command, please replace the <ranger-pod-name> and <namespace> as per your environment. -1. Download the **datahub-ranger-plugin** from [Maven](https://mvnrepository.com/artifact/io.acryl/datahub-ranger-plugin) -2. 
Create a "datahub" directory inside the "ranger-plugins" directory where Apache Ranger is deployed. For example, to do this in a Privacera container +1. Download the **datahub-ranger-plugin** from [Maven](https://mvnrepository.com/artifact/io.acryl/datahub-ranger-plugin) +2. Create a "datahub" directory inside the "ranger-plugins" directory where Apache Ranger is deployed. For example, to do this in a Privacera container + + _Docker command:_ - *Docker command:* ```bash docker exec privacera_ranger_1 mkdir ews/webapp/WEB-INF/classes/ranger-plugins/datahub ``` - *Kubernetes command:* + + _Kubernetes command:_ + ```bash - kubectl exec+ +
-4. Create a new policy under service **ranger_datahub** - this will be used to control DataHub authorization. +4. Create a new policy under service **ranger_datahub** - this will be used to control DataHub authorization. 5. Create a test user & assign them to a policy. We'll use the `datahub` user, which is the default root user inside DataHub. To do this performs below steps - - Create a user **datahub** - - Create a policy under **ranger_datahub** service. To assign [Platform Privileges](../authorization/policies.md#privileges) (e.g. Admin privileges), simply use the "platform" resource type which is defined. To test the flow, we can simply assign the **datahub** user all platform privileges that are available through the Ranger UI. This will enable the "datahub" to have full platform admin privileges. - > To define fine-grained resource privileges, e.g. for DataHub Datasets, Dashboards, Charts, and more, you can simply select the appropriate Resource Type in the Ranger policy builder. You should also see a list of privileges that are supported for each resource type, which correspond to the actions that you can perform. To learn more about supported privileges, check out the DataHub [Policies Guide](../authorization/policies.md#privileges). - - DataHub platform access policy screenshot:+ +
+ +Once we've created our first policy, we can set up DataHub to start authorizing requests using Ranger policies. ## Configuring your DataHub Deployment Perform the following steps to configure DataHub to send incoming requests to Apache Ranger for authorization. 1. Download Apache Ranger security xml [ranger-datahub-security.xml](https://github.com/acryldata/datahub-ranger-auth-plugin/blob/main/datahub-ranger-plugin/conf/ranger-datahub-security.xml) -2. In **ranger-datahub-security.xml** edit the value of property *ranger.plugin.datahub.policy.rest.url*. Sample snippet is shown below - ```xml -+ +
- ```bash - curl -u+ +
Simply click "Add to Chrome" then "Add extension" on the ensuing popup. @@ -20,11 +24,15 @@ Once you have your extension installed, you'll need to configure it to work with 1. Click the extension button on the right of your browser's address bar to view all of your installed extensions. Click on the newly installed DataHub extension. -![](imgs/saas/extension_open_popup.png) ++ +
2. Fill in your DataHub domain and click "Continue" in the extension popup that appears. -![](imgs/saas/extension_enter_domain.png) ++ +
If your organization uses standard SaaS domains for Looker, you should be ready to go! @@ -34,11 +42,15 @@ Some organizations have custom SaaS domains for Looker and some Acryl DataHub de 1. Click on the extension button and select your DataHub extension to open the popup again. Now click the settings icon in order to open the configurations page. -![](imgs/saas/extension_open_options_page.png) ++ +
2. Fill out and save any custom configurations you have in the **TOOL CONFIGURATIONS** section. Here you can configure a custom domain, a Platform Instance associated with that domain, and the Environment set on your DataHub assets. If you don't have a custom domain but do have a custom Platform Instance or Environment, feel free to leave the domain field empty. -![](imgs/saas/extension_custom_configs.png) ++ +
## Using the Extension @@ -52,7 +64,9 @@ Once you have everything configured on your extension, it's time to use it! 4. Click the Acryl DataHub extension button on the bottom right of your page to open a drawer where you can now see additional information about this asset right from your DataHub instance. -![](imgs/saas/extension_view_in_looker.png) ++ +
## Advanced: Self-Hosted DataHub @@ -67,4 +81,4 @@ AUTH_COOKIE_SAME_SITE="NONE" AUTH_COOKIE_SECURE=true ``` -Once you re-deploy your `datahub-frontend` container with these values, you should be good to go! \ No newline at end of file +Once you re-deploy your `datahub-frontend` container with these values, you should be good to go! diff --git a/docs/managed-datahub/datahub-api/graphql-api/getting-started.md index 3c57b0a21d96e..881a7492f341a 100644 --- a/docs/managed-datahub/datahub-api/graphql-api/getting-started.md +++ b/docs/managed-datahub/datahub-api/graphql-api/getting-started.md @@ -10,7 +10,9 @@ For a full reference to the Queries & Mutations available for consumption, check ### Connecting to the API -![](../../imgs/saas/image-(3).png) ++ +
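For orientation, a call to the GraphQL endpoint with a personal access token can be sketched as follows. The host and the `/api/graphql` path are assumptions for illustration; prefer the `curl` example shown in the UI when you generate the token, as described below.

```shell
# Sketch: query the GraphQL API with a personal access token.
# The host and endpoint path are assumptions; use the curl example generated with your token.
DATAHUB_TOKEN="<personal-access-token>"
curl -sS "https://your-account.acryl.io/api/graphql" \
  -H "Authorization: Bearer ${DATAHUB_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"query": "{ me { corpUser { username } } }"}'
```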
When you generate the token you will see an example `curl` command which you can use to connect to the GraphQL API. diff --git a/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md index 89bacb2009e49..909fee7d6020f 100644 --- a/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md +++ b/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md @@ -1,10 +1,11 @@ --- description: This page provides an overview of working with the DataHub Incidents API. --- -import FeatureAvailability from '@site/src/components/FeatureAvailability'; +import FeatureAvailability from '@site/src/components/FeatureAvailability'; # Incidents API (Beta) ++ +
+To do so, simply follow the [Slack Integration Guide](docs/managed-datahub/saas-slack-setup.md) and contact your Acryl customer success team to enable the feature! diff --git a/docs/managed-datahub/imgs/saas/DataHub-Architecture.png b/docs/managed-datahub/imgs/saas/DataHub-Architecture.png deleted file mode 100644 index 95b3ab0b06ad6..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/DataHub-Architecture.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-13-at-7.45.56-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-13-at-7.45.56-PM.png deleted file mode 100644 index 721989a6c37e1..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-13-at-7.45.56-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.35.17-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.35.17-PM.png deleted file mode 100644 index dffac92f257c7..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.35.17-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.37.22-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.37.22-PM.png deleted file mode 100644 index ff0c29de1fbad..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.37.22-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png deleted file mode 100644 index 070bfd9f6b897..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.43.25-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.43.25-PM.png deleted file mode 100644 index b4bb4e2ba60ed..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.43.25-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.44.15-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.44.15-PM.png deleted file mode 100644 index b0397afd1b3a4..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.44.15-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png deleted file mode 100644 index 9258badb6f088..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.52.55-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.52.55-PM.png deleted file mode 100644 index 386b4cdcd9911..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.52.55-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.56.50-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.56.50-PM.png deleted file mode 100644 index a129f5eba4271..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.56.50-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.58.46-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.58.46-PM.png 
deleted file mode 100644 index 96ae48318a35a..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.58.46-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.01.16-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.01.16-PM.png deleted file mode 100644 index b6fd273389c90..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.01.16-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.03.36-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.03.36-PM.png deleted file mode 100644 index 0acd4e75bc6d2..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.03.36-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-13-at-2.34.24-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-13-at-2.34.24-PM.png deleted file mode 100644 index 364b9292cfaab..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-13-at-2.34.24-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM-(1).png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM-(1).png deleted file mode 100644 index 6a12dc545ec62..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM-(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM.png deleted file mode 100644 index 6a12dc545ec62..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-8.02.55-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-8.02.55-AM.png deleted file mode 100644 index 83645e00d724a..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-8.02.55-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-11.02.47-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-11.02.47-AM.png deleted file mode 100644 index a2f239ce847e0..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-11.02.47-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-12.59.38-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-12.59.38-PM.png deleted file mode 100644 index e31d4b089d929..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-12.59.38-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.21.42-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.21.42-AM.png deleted file mode 100644 index c003581c9d1b6..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.21.42-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.22.23-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.22.23-AM.png deleted file mode 100644 index 660dd121dd0a4..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.22.23-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.23.08-AM.png 
b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.23.08-AM.png deleted file mode 100644 index 07e3c71dba262..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.23.08-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.47.57-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.47.57-AM.png deleted file mode 100644 index 579e7f62af708..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.47.57-AM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM-(1).png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM-(1).png deleted file mode 100644 index f85f4d5c79bfb..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM-(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM.png deleted file mode 100644 index f85f4d5c79bfb..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png deleted file mode 100644 index cb8b7470cd957..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png deleted file mode 100644 index 1de51e33d87c2..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png deleted file mode 100644 index df687dabe345c..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM-(1).png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM-(1).png deleted file mode 100644 index a8d9ee37c7a55..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM-(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png deleted file mode 100644 index a8d9ee37c7a55..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Untitled(1).png b/docs/managed-datahub/imgs/saas/Untitled(1).png deleted file mode 100644 index 87846e7897f6e..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Untitled(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Untitled-(2)-(1).png b/docs/managed-datahub/imgs/saas/Untitled-(2)-(1).png deleted file mode 100644 index 7715bf4a51fbe..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Untitled-(2)-(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Untitled-(2).png b/docs/managed-datahub/imgs/saas/Untitled-(2).png deleted file mode 100644 index 
a01a1af370442..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Untitled-(2).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Untitled-(3).png b/docs/managed-datahub/imgs/saas/Untitled-(3).png deleted file mode 100644 index 02d84b326896c..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Untitled-(3).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Untitled-(4).png b/docs/managed-datahub/imgs/saas/Untitled-(4).png deleted file mode 100644 index a01a1af370442..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Untitled-(4).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/Untitled.png b/docs/managed-datahub/imgs/saas/Untitled.png deleted file mode 100644 index a01a1af370442..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/Untitled.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/chrome-store-extension-screenshot.png b/docs/managed-datahub/imgs/saas/chrome-store-extension-screenshot.png deleted file mode 100644 index e00a4d57f32dd..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/chrome-store-extension-screenshot.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_custom_configs.png b/docs/managed-datahub/imgs/saas/extension_custom_configs.png deleted file mode 100644 index b3d70dfac00ff..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_custom_configs.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_developer_mode.png b/docs/managed-datahub/imgs/saas/extension_developer_mode.png deleted file mode 100644 index e740d15912e17..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_developer_mode.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_enter_domain.png b/docs/managed-datahub/imgs/saas/extension_enter_domain.png deleted file mode 100644 index 3304fa168beaf..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_enter_domain.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_load_unpacked.png b/docs/managed-datahub/imgs/saas/extension_load_unpacked.png deleted file mode 100644 index 8f56705cd9176..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_load_unpacked.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_open_options_page.png b/docs/managed-datahub/imgs/saas/extension_open_options_page.png deleted file mode 100644 index c1366d5673b59..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_open_options_page.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_open_popup.png b/docs/managed-datahub/imgs/saas/extension_open_popup.png deleted file mode 100644 index 216056b847fb5..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_open_popup.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/extension_view_in_looker.png b/docs/managed-datahub/imgs/saas/extension_view_in_looker.png deleted file mode 100644 index bf854b3e840f7..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/extension_view_in_looker.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/home-(1).png b/docs/managed-datahub/imgs/saas/home-(1).png deleted file mode 100644 index 88cf2017dd7e7..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/home-(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/home.png b/docs/managed-datahub/imgs/saas/home.png deleted 
file mode 100644 index 8ad63deec75c9..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/home.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(1).png b/docs/managed-datahub/imgs/saas/image-(1).png deleted file mode 100644 index c1a249125fcf7..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(1).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(10).png b/docs/managed-datahub/imgs/saas/image-(10).png deleted file mode 100644 index a580fdc3d6730..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(10).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(11).png b/docs/managed-datahub/imgs/saas/image-(11).png deleted file mode 100644 index ee95eb4384272..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(11).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(12).png b/docs/managed-datahub/imgs/saas/image-(12).png deleted file mode 100644 index bbd8e6a66cf85..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(12).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(13).png b/docs/managed-datahub/imgs/saas/image-(13).png deleted file mode 100644 index bbd8e6a66cf85..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(13).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(14).png b/docs/managed-datahub/imgs/saas/image-(14).png deleted file mode 100644 index a580fdc3d6730..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(14).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(15).png b/docs/managed-datahub/imgs/saas/image-(15).png deleted file mode 100644 index f282e2d92c1a1..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(15).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(16).png b/docs/managed-datahub/imgs/saas/image-(16).png deleted file mode 100644 index 1340c77bd648c..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(16).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(17).png b/docs/managed-datahub/imgs/saas/image-(17).png deleted file mode 100644 index 6eee2fb2d821f..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(17).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(2).png b/docs/managed-datahub/imgs/saas/image-(2).png deleted file mode 100644 index cf475edd7b95d..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(2).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(3).png b/docs/managed-datahub/imgs/saas/image-(3).png deleted file mode 100644 index b08818ff3e97c..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(3).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(4).png b/docs/managed-datahub/imgs/saas/image-(4).png deleted file mode 100644 index a580fdc3d6730..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(4).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(5).png b/docs/managed-datahub/imgs/saas/image-(5).png deleted file mode 100644 index 48438c6001e4f..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(5).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(6).png b/docs/managed-datahub/imgs/saas/image-(6).png deleted file mode 100644 index 54e569e853f24..0000000000000 Binary files 
a/docs/managed-datahub/imgs/saas/image-(6).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(7).png b/docs/managed-datahub/imgs/saas/image-(7).png deleted file mode 100644 index 6e89e5881cfa7..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(7).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(8).png b/docs/managed-datahub/imgs/saas/image-(8).png deleted file mode 100644 index ee0a3c89d58fa..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(8).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image-(9).png b/docs/managed-datahub/imgs/saas/image-(9).png deleted file mode 100644 index 301ca98593ef9..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image-(9).png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/image.png b/docs/managed-datahub/imgs/saas/image.png deleted file mode 100644 index a1cfc3e74c5dd..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/image.png and /dev/null differ diff --git a/docs/managed-datahub/imgs/saas/settings.png b/docs/managed-datahub/imgs/saas/settings.png deleted file mode 100644 index ca99984abbbc9..0000000000000 Binary files a/docs/managed-datahub/imgs/saas/settings.png and /dev/null differ diff --git a/docs/managed-datahub/integrations/oidc-sso-integration.md b/docs/managed-datahub/integrations/oidc-sso-integration.md index 6a9e085186b44..81a5cc4a557ce 100644 --- a/docs/managed-datahub/integrations/oidc-sso-integration.md +++ b/docs/managed-datahub/integrations/oidc-sso-integration.md @@ -3,12 +3,12 @@ description: >- This page will help you set up OIDC SSO with your identity provider to log into Acryl Data --- -import FeatureAvailability from '@site/src/components/FeatureAvailability'; +import FeatureAvailability from '@site/src/components/FeatureAvailability'; # OIDC SSO Integration -+ +
diff --git a/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md b/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md index 95ca6e5e33e16..62d466467327e 100644 --- a/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md +++ b/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md @@ -56,9 +56,14 @@ In Acryl DataHub deployments, you _must_ use a sink of type `datahub-rest`, whic 2. **token**: a unique API key used to authenticate requests to your instance's REST API The token can be retrieved by logging in as admin. You can go to Settings page and generate a Personal Access Token with your desired expiration date. -![](../imgs/saas/home-(1).png) -![](../imgs/saas/settings.png) ++ +
+ ++ +
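Once generated, the token is supplied via the `token` field of the `datahub-rest` sink. A sketch, with placeholder values for the domain and token:

```yaml
sink:
  type: datahub-rest
  config:
    # Your Acryl instance's domain, suffixed with /gms (placeholder shown; see below)
    server: "https://your-company.acryl.io/gms"
    # Personal Access Token generated from the Settings page (placeholder shown)
    token: "<your-personal-access-token>"
```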
To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below. A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance: diff --git a/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md b/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md index d389ec97d0550..c4056a04e9ab9 100644 --- a/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md +++ b/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md @@ -4,9 +4,11 @@ description: >- executor, which allows you to ingest metadata from private metadata sources using private credentials via the DataHub UI. --- + import FeatureAvailability from '@site/src/components/FeatureAvailability'; # Setting up Remote Ingestion Executor on AWS ++ +
To accommodate these cases, Acryl supports configuring a remote ingestion executor which can be deployed inside of your AWS account. This setup allows you to continue leveraging the Acryl DataHub console to create, schedule, and run metadata ingestion, all while retaining network and credential isolation. -![Option 2: Customer-hosted ingestion runner](../imgs/saas/image-(6).png) ++ +
## Deploying a Remote Ingestion Executor -1. **Provide AWS Account Id**: Provide Acryl Team with the id of the AWS in which the remote executor will be hosted. This will be used to grant access to private Acryl containers and create a unique SQS queue which your remote agent will subscribe to. The account id can be provided to your Acryl representative via Email or [One Time Secret](https://onetimesecret.com/). - -2. **Provision an Acryl Executor** (ECS)**:** Acryl team will provide a [Cloudformation Template](https://github.com/acryldata/datahub-cloudformation/blob/master/Ingestion/templates/python.ecs.template.yaml) that you can run to provision an ECS cluster with a single remote ingestion task. It will also provision an AWS role for the task which grants the permissions necessary to read and delete from the private SQS queue created for you, along with reading the secrets you've specified. At minimum, the template requires the following parameters: - 1. **Deployment Location:** The AWS VPC + subnet in which the Acryl Executor task is to be provisioned. - 2. **SQS Queue ARN**: Reference to your private SQS command queue. This is provided by Acryl and is used to configure IAM policies enabling the Task role to read from the shared queue. - 3. **SQS Queue URL**: The URL referring to your private SQS command queue. This is provided by Acryl and is used to read messages. - 4. **DataHub Personal Access Token**: A valid DataHub PAT. This can be generated inside of **Settings > Access Tokens** of DataHub web application. You can alternatively create a secret in AWS Secrets Manager and refer to that by ARN. - 5. **Acryl DataHub URL**: The URL for your DataHub instance, e.g. `+ +
+ ++ +
3. **Test the Executor:** To test your remote executor: 1. Create a new Ingestion Source by clicking '**Create new Source**' the '**Ingestion**' tab of the DataHub console. Configure your Ingestion Recipe as though you were running it from inside of your environment. - 2. When working with "secret" fields (passwords, keys, etc), you can refer to any "self-managed" secrets by name: `${SECRET_NAME}:` + 2. When working with "secret" fields (passwords, keys, etc), you can refer to any "self-managed" secrets by name: `${SECRET_NAME}:` + ++ +
- ![Using a secret called BQ_DEPLOY_KEY which is managed in AWS secrets manager](../imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png) 3. In the 'Finish Up' step, click '**Advanced'**. - 4. Update the '**Executor Id**' form field to be '**remote**'. This indicates that you'd like to use the remote executor. + 4. Update the '**Executor Id**' form field to be '**remote**'. This indicates that you'd like to use the remote executor. 5. Click '**Done**'. Now, simple click '**Execute**' to test out the remote executor. If your remote executor is configured properly, you should promptly see the ingestion task state change to 'Running'. -![](../imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png) ++ +
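To make the secret-reference step above concrete, here is a sketch of a recipe fragment a remote executor could run. The coordinates are placeholders, and `MYSQL_PASSWORD` stands in for whatever secret name you wired into the executor (for example via AWS Secrets Manager) at deployment time:

```yaml
source:
  type: mysql
  config:
    # A source that is only reachable from inside your VPC (placeholder coordinates)
    host_port: mysql.internal.example:3306
    database: dbname
    username: ingestion_user
    # Resolved by the remote executor from the self-managed secret of the same name
    password: "${MYSQL_PASSWORD}"

sink:
  # sink configs -- a datahub-rest sink pointing at your Acryl instance,
  # as described in the metadata ingestion guide
```

Remember to set the Executor Id to `remote` in the 'Advanced' step so the recipe is routed to your executor rather than Acryl's hosted one.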
+ ## Updating a Remote Ingestion Executor + In order to update the executor, ie. to deploy a new container version, you'll need to update the CloudFormation Stack to re-deploy the CloudFormation template with a new set of parameters. + ### Steps - AWS Console + 1. Navigate to CloudFormation in AWS Console 2. Select the stack dedicated to the remote executor 3. Click **Update** 4. Select **Replace Current Template** 5. Select **Upload a template file** 6. Upload a copy of the Acryl Remote Executor [CloudFormation Template](https://raw.githubusercontent.com/acryldata/datahub-cloudformation/master/Ingestion/templates/python.ecs.template.yaml) -![](../imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png) + ++ +
+ 7. Click **Next** 8. Change parameters based on your modifications (e.g. ImageTag, etc) 9. Click **Next** @@ -74,26 +100,30 @@ In order to update the executor, ie. to deploy a new container version, you'll n ## FAQ -### If I need to change (or add) a secret that is stored in AWS Secrets Manager, e.g. for rotation, will the new secret automatically get picked up by Acryl's executor?** +### If I need to change (or add) a secret that is stored in AWS Secrets Manager, e.g. for rotation, will the new secret automatically get picked up by Acryl's executor?\*\* Unfortunately, no. Secrets are wired into the executor container at deployment time, via environment variables. Therefore, the ECS Task will need to be restarted (either manually or via a stack parameter update) whenever your secrets change. -### I want to deploy multiple Acryl Executors. Is this currently possible?** +### I want to deploy multiple Acryl Executors. Is this currently possible?\*\* This is possible, but requires a new SQS queue is maintained (on per executor). Please contact your Acryl representative for more information. -### I've run the CloudFormation Template, how can I tell that the container was successfully deployed?** +### I've run the CloudFormation Template, how can I tell that the container was successfully deployed?\*\* We recommend verifying in AWS Console by navigating to **ECS > Cluster > Stack Name > Services > Logs.** When you first deploy the executor, you should a single log line to indicate success: + ``` Starting AWS executor consumer.. ``` + This indicates that the remote executor has established a successful connection to your DataHub instance and is ready to execute ingestion runs. If you DO NOT see this log line, but instead see something else, please contact your Acryl representative for support. ## Release Notes + This is where release notes for the Acryl Remote Executor Container will live. ### v0.0.3.9 + Bumping to the latest version of acryl-executor, which includes smarter messaging around OOM errors. diff --git a/docs/modeling/extending-the-metadata-model.md b/docs/modeling/extending-the-metadata-model.md index 32951ab2e41eb..41d58eb626b69 100644 --- a/docs/modeling/extending-the-metadata-model.md +++ b/docs/modeling/extending-the-metadata-model.md @@ -11,7 +11,10 @@ these two concepts prior to making changes. ## To fork or not to fork? An important question that will arise once you've decided to extend the metadata model is whether you need to fork the main repo or not. Use the diagram below to understand how to make this decision. -![Metadata Model To Fork or Not](../imgs/metadata-model-to-fork-or-not-to.png) + ++ +
The green lines represent pathways that will lead to lesser friction for you to maintain your code long term. The red lines represent higher risk of conflicts in the future. We are working hard to move the majority of model extension use-cases to no-code / low-code pathways to ensure that you can extend the core metadata model without having to maintain a custom fork of DataHub. @@ -88,10 +91,11 @@ the annotation model. Define the entity within an `entity-registry.yml` file. Depending on your approach, the location of this file may vary. More on that in steps [4](#step_4) and [5](#step_5). Example: + ```yaml - - name: dashboard - doc: A container of related data assets. - keyAspect: dashboardKey +- name: dashboard + doc: A container of related data assets. + keyAspect: dashboardKey ``` - name: The entity name/type, this will be present as a part of the Urn. @@ -192,8 +196,8 @@ The Aspect has four key components: its properties, the @Aspect annotation, the can be defined as PDL primitives, enums, records, or collections ( see [pdl schema documentation](https://linkedin.github.io/rest.li/pdl_schema)) references to other entities, of type Urn or optionally `+ +
## The Core Entities -DataHub's "core" Entity types model the Data Assets that comprise the Modern Data Stack. They include +DataHub's "core" Entity types model the Data Assets that comprise the Modern Data Stack. They include -1. **[Data Platform](docs/generated/metamodel/entities/dataPlatform.md)**: A type of Data "Platform". That is, an external system that is involved in processing, storing, or visualizing Data Assets. Examples include MySQL, Snowflake, Redshift, and S3. +1. **[Data Platform](docs/generated/metamodel/entities/dataPlatform.md)**: A type of Data "Platform". That is, an external system that is involved in processing, storing, or visualizing Data Assets. Examples include MySQL, Snowflake, Redshift, and S3. 2. **[Dataset](docs/generated/metamodel/entities/dataset.md)**: A collection of data. Tables, Views, Streams, Document Collections, and Files are all modeled as "Datasets" on DataHub. Datasets can have tags, owners, links, glossary terms, and descriptions attached to them. They can also have specific sub-types, such as "View", "Collection", "Stream", "Explore", and more. Examples include Postgres Tables, MongoDB Collections, or S3 files. 3. **[Chart](docs/generated/metamodel/entities/chart.md)**: A single data vizualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart. 4. **[Dashboard](docs/generated/metamodel/entities/dashboard.md)**: A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard. @@ -47,54 +46,57 @@ See the **Metadata Modeling/Entities** section on the left to explore the entire ## The Entity Registry -Where are Entities and their aspects defined in DataHub? Where does the Metadata Model "live"? The Metadata Model is stitched together by means +Where are Entities and their aspects defined in DataHub? Where does the Metadata Model "live"? The Metadata Model is stitched together by means of an **Entity Registry**, a catalog of Entities that comprise the Metadata Graph along with the aspects associated with each. Put -simply, this is where the "schema" of the model is defined. +simply, this is where the "schema" of the model is defined. -Traditionally, the Entity Registry was constructed using [Snapshot](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot) models, which are schemas that explicitly tie -an Entity to the Aspects associated with it. An example is [DatasetSnapshot](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdl), which defines the core `Dataset` Entity. -The Aspects of the Dataset entity are captured via a union field inside a special "Aspect" schema. An example is +Traditionally, the Entity Registry was constructed using [Snapshot](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot) models, which are schemas that explicitly tie +an Entity to the Aspects associated with it. An example is [DatasetSnapshot](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdl), which defines the core `Dataset` Entity. 
+The Aspects of the Dataset entity are captured via a union field inside a special "Aspect" schema. An example is [DatasetAspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdl). -This file associates dataset-specific aspects (like [DatasetProperties](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProperties.pdl)) and common aspects (like [Ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl), -[InstitutionalMemory](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/InstitutionalMemory.pdl), -and [Status](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdl)) -to the Dataset Entity. This approach to defining Entities will soon be deprecated in favor of a new approach. +This file associates dataset-specific aspects (like [DatasetProperties](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProperties.pdl)) and common aspects (like [Ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl), +[InstitutionalMemory](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/InstitutionalMemory.pdl), +and [Status](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdl)) +to the Dataset Entity. This approach to defining Entities will soon be deprecated in favor of a new approach. As of January 2022, DataHub has deprecated support for Snapshot models as a means of adding new entities. Instead, the Entity Registry is defined inside a YAML configuration file called [entity-registry.yml](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml), -which is provided to DataHub's Metadata Service at start up. This file declares Entities and Aspects by referring to their [names](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L7). -At boot time, DataHub validates the structure of the registry file and ensures that it can find PDL schemas associated with -each aspect name provided by configuration (via the [@Aspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L6) annotation). - -By moving to this format, evolving the Metadata Model becomes much easier. Adding Entities & Aspects becomes a matter of adding a -to the YAML configuration, instead of creating new Snapshot / Aspect files. +which is provided to DataHub's Metadata Service at start up. This file declares Entities and Aspects by referring to their [names](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L7). +At boot time, DataHub validates the structure of the registry file and ensures that it can find PDL schemas associated with +each aspect name provided by configuration (via the [@Aspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L6) annotation). 
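For illustration, a registry entry has roughly the following shape; the aspect names listed under an entity must match the `name` declared in each aspect's `@Aspect` annotation. (The real file contains many more entities and aspects — this is a trimmed sketch, not the full registry.)

```yaml
entities:
  - name: dataset
    keyAspect: datasetKey
    aspects:
      - datasetProperties
      - ownership
      - institutionalMemory
      - status
```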
+By moving to this format, evolving the Metadata Model becomes much easier. Adding Entities & Aspects becomes a matter of adding a +to the YAML configuration, instead of creating new Snapshot / Aspect files. ## Exploring DataHub's Metadata Model -To explore the current DataHub metadata model, you can inspect this high-level picture that shows the different entities and edges between them showing the relationships between them. -![Metadata Model Graph](../imgs/datahub-metadata-model.png) +To explore the current DataHub metadata model, you can inspect this high-level picture that shows the different entities and edges between them showing the relationships between them. + ++ +
To navigate the aspect model for specific entities and explore relationships using the `foreign-key` concept, you can view them in our demo environment or navigate the auto-generated docs in the **Metadata Modeling/Entities** section on the left. -For example, here are helpful links to the most popular entities in DataHub's metadata model: -* [Dataset](docs/generated/metamodel/entities/dataset.md): [Profile](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Dataset,PROD)/Schema?is_lineage_mode=false) [Documentation](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Dataset,PROD)/Documentation?is_lineage_mode=false) -* [Dashboard](docs/generated/metamodel/entities/dashboard.md): [Profile](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Dashboard,PROD)/Schema?is_lineage_mode=false) [Documentation](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Dashboard,PROD)/Documentation?is_lineage_mode=false) -* [User (a.k.a CorpUser)](docs/generated/metamodel/entities/corpuser.md): [Profile](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Corpuser,PROD)/Schema?is_lineage_mode=false) [Documentation](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Corpuser,PROD)/Documentation?is_lineage_mode=false) -* [Pipeline (a.k.a DataFlow)](docs/generated/metamodel/entities/dataFlow.md): [Profile](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,DataFlow,PROD)/Schema?is_lineage_mode=false) [Documentation](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,DataFlow,PROD)/Documentation?is_lineage_mode=false) -* [Feature Table (a.k.a. MLFeatureTable)](docs/generated/metamodel/entities/mlFeatureTable.md): [Profile](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,MlFeatureTable,PROD)/Schema?is_lineage_mode=false) [Documentation](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,MlFeatureTable,PROD)/Documentation?is_lineage_mode=false) -* For the full list of entities in the metadata model, browse them [here](https://demo.datahubproject.io/browse/dataset/prod/datahub/entities) or use the **Metadata Modeling/Entities** section on the left. +For example, here are helpful links to the most popular entities in DataHub's metadata model: + +- [Dataset](docs/generated/metamodel/entities/dataset.md): [Profile](+ +
+ +## Naming Platform Instances + +When configuring a platform instance, choose an instance name that is understandable and will be stable for the foreseeable future. e.g. `core_warehouse` or `finance_redshift` are allowed names, as are pure guids like `a37dc708-c512-4fe4-9829-401cd60ed789`. Remember that whatever instance name you choose, you will need to specify it in more than one recipe to ensure that the identifiers produced by different sources will line up. + +## Enabling Platform Instances + +Read the Ingestion source specific guides for how to enable platform instances in each of them. +The general pattern is to add an additional optional configuration parameter called `platform_instance`. + +e.g. here is how you would configure a recipe to ingest a mysql instance that you want to call `core_finance` + +```yaml +source: + type: mysql + config: + # Coordinates + host_port: localhost:3306 + platform_instance: core_finance + database: dbname + + # Credentials + username: root + password: example + +sink: + # sink configs +``` + +## diff --git a/docs/schema-history.md b/docs/schema-history.md index 9fc9ec1af52bb..2c92e7a8f29e9 100644 --- a/docs/schema-history.md +++ b/docs/schema-history.md @@ -13,7 +13,7 @@ along with informing Data Practitioners when these changes happened. Schema History uses DataHub's [Timeline API](https://datahubproject.io/docs/dev-guides/timeline/) to compute schema changes. -## Schema History Setup, Prerequisites, and Permissions +## Schema History Setup, Prerequisites, and Permissions Schema History is viewable in the DataHub UI for any Dataset that has had at least one schema change. To view a Dataset, a user must have the **View Entity Page** privilege, or be assigned to **any** DataHub Role. @@ -23,33 +23,40 @@ must have the **View Entity Page** privilege, or be assigned to **any** DataHub You can view the Schema History for a Dataset by navigating to that Dataset's Schema Tab. As long as that Dataset has more than one version, you can view what a Dataset looked like at any given version by using the version selector. Here's an example from DataHub's official Demo environment with the -[Snowflake pets dataset](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.adoption.pets,PROD)/Schema?is_lineage_mode=false). +[Snowflake pets dataset](+ +
If you click on an older version in the selector, you'll be able to see what the schema looked like back then. Notice the changes here to the glossary terms for the `status` field, and to the descriptions for the `created_at` and `updated_at` fields. -![](./imgs/schema-history-older-version.png) ++ +
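The same history is also available programmatically. Since Schema History is built on the Timeline API, you can pull the list of schema changes with the DataHub CLI — a sketch using the URN of the demo dataset above; exact flags may differ between CLI versions:

```shell
# List recent technical-schema changes for a dataset via the Timeline API
datahub timeline \
  --urn "urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.adoption.pets,PROD)" \
  --category TECHNICAL_SCHEMA \
  --start 10daysago
```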
In addition to this, you can also toggle the Audit view that shows you when the most recent changes were made to each field. You can active this by clicking on the Audit icon you see above the top right of the table. -![](./imgs/schema-history-audit-activated.png) ++ +
You can see here that some of these fields were added at the oldest dataset version, while some were added only at this latest version. Some fields were even modified and had a type change at the latest version! ### GraphQL -* [getSchemaBlame](../graphql/queries.md#getSchemaBlame) -* [getSchemaVersionList](../graphql/queries.md#getSchemaVersionList) +- [getSchemaBlame](../graphql/queries.md#getSchemaBlame) +- [getSchemaVersionList](../graphql/queries.md#getSchemaVersionList) ## FAQ and Troubleshooting **What updates are planned for the Schema History feature?** In the future, we plan on adding the following features + - Supporting a linear timeline view where you can see what changes were made to various schema fields over time - Adding a diff viewer that highlights the differences between two versions of a Dataset diff --git a/docs/tags.md b/docs/tags.md index 945b514dc7b47..720e56b9a4a75 100644 --- a/docs/tags.md +++ b/docs/tags.md @@ -8,15 +8,15 @@ Tags are informal, loosely controlled labels that help in search & discovery. Th Tags can help help you in: -* Querying: Tagging a dataset with a phrase that a co-worker can use to query the same dataset -* Mapping assets to a category or group of your choice +- Querying: Tagging a dataset with a phrase that a co-worker can use to query the same dataset +- Mapping assets to a category or group of your choice ## Tags Setup, Prerequisites, and Permissions What you need to add tags: -* **Edit Tags** metadata privilege to add tags at the entity level -* **Edit Dataset Column Tags** to edit tags at the column level +- **Edit Tags** metadata privilege to add tags at the entity level +- **Edit Dataset Column Tags** to edit tags at the column level You can create these privileges by creating a new [Metadata Policy](./authorization/policies.md). @@ -27,25 +27,25 @@ You can create these privileges by creating a new [Metadata Policy](./authorizat To add a tag at the dataset or container level, simply navigate to the page for that entity and click on the **Add Tag** button.- +
Type in the name of the tag you want to add. You can add a new tag, or add a tag that already exists (the autocomplete will pull up the tag if it already exists).- +
Click on the "Add" button and you'll see the tag has been added!- +
If you would like to add a tag at the schema level, hover over the "Tags" column for a schema until the "Add Tag" button shows up, and then follow the same flow as above.- +
### Removing a Tag @@ -57,7 +57,7 @@ To remove a tag, simply click on the "X" button in the tag. Then click "Yes" whe You can search for a tag in the search bar, and even filter entities by the presence of a specific tag.- +
## Additional Resources @@ -72,21 +72,21 @@ You can search for a tag in the search bar, and even filter entities by the pres ### GraphQL -* [addTag](../graphql/mutations.md#addtag) -* [addTags](../graphql/mutations.md#addtags) -* [batchAddTags](../graphql/mutations.md#batchaddtags) -* [removeTag](../graphql/mutations.md#removetag) -* [batchRemoveTags](../graphql/mutations.md#batchremovetags) -* [createTag](../graphql/mutations.md#createtag) -* [updateTag](../graphql/mutations.md#updatetag) -* [deleteTag](../graphql/mutations.md#deletetag) +- [addTag](../graphql/mutations.md#addtag) +- [addTags](../graphql/mutations.md#addtags) +- [batchAddTags](../graphql/mutations.md#batchaddtags) +- [removeTag](../graphql/mutations.md#removetag) +- [batchRemoveTags](../graphql/mutations.md#batchremovetags) +- [createTag](../graphql/mutations.md#createtag) +- [updateTag](../graphql/mutations.md#updatetag) +- [deleteTag](../graphql/mutations.md#deletetag) -You can easily fetch the Tags for an entity with a given its URN using the **tags** property. Check out [Working with Metadata Entities](./api/graphql/how-to-set-up-graphql.md#querying-for-tags-of-an-asset) for an example. +You can easily fetch the Tags for an entity with a given its URN using the **tags** property. Check out [Working with Metadata Entities](./api/graphql/how-to-set-up-graphql.md#querying-for-tags-of-an-asset) for an example. ### DataHub Blog -* [Tags and Terms: Two Powerful DataHub Features, Used in Two Different Scenarios -Managing PII in DataHub: A Practitioner’s Guide](https://blog.datahubproject.io/tags-and-terms-two-powerful-datahub-features-used-in-two-different-scenarios-b5b4791e892e) +- [Tags and Terms: Two Powerful DataHub Features, Used in Two Different Scenarios + Managing PII in DataHub: A Practitioner’s Guide](https://blog.datahubproject.io/tags-and-terms-two-powerful-datahub-features-used-in-two-different-scenarios-b5b4791e892e) ## FAQ and Troubleshooting @@ -96,16 +96,16 @@ DataHub Tags are informal, loosely controlled labels while Terms are part of a c Usage and applications: -* An asset may have multiple tags. -* Tags serve as a tool for search & discovery while Terms are typically used to standardize types of leaf-level attributes (i.e. schema fields) for governance. E.g. (EMAIL_PLAINTEXT) +- An asset may have multiple tags. +- Tags serve as a tool for search & discovery while Terms are typically used to standardize types of leaf-level attributes (i.e. schema fields) for governance. E.g. (EMAIL_PLAINTEXT) **How are DataHub Tags different from Domains?** Domains are a set of top-level categories usually aligned to business units/disciplines to which the assets are most relevant. They rely on central or distributed management. A single domain is assigned per data asset. -*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!* +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ ### Related Features -* [Glossary Terms](./glossary/business-glossary.md) -* [Domains](./domains.md) +- [Glossary Terms](./glossary/business-glossary.md) +- [Domains](./domains.md) diff --git a/docs/ui-ingestion.md b/docs/ui-ingestion.md index 235f1521c070a..50cc8e5207686 100644 --- a/docs/ui-ingestion.md +++ b/docs/ui-ingestion.md @@ -1,44 +1,52 @@ -# UI Ingestion Guide +# UI Ingestion Guide -## Introduction +## Introduction Starting in version `0.8.25`, DataHub supports creating, configuring, scheduling, & executing batch metadata ingestion using the DataHub user interface. 
This makes -getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines. +getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines. -This document will describe the steps required to configure, schedule, and execute metadata ingestion inside the UI. +This document will describe the steps required to configure, schedule, and execute metadata ingestion inside the UI. ## Running Metadata Ingestion ### Prerequisites To view & manage UI-based metadata ingestion, you must have the `Manage Metadata Ingestion` & `Manage Secrets` - privileges assigned to your account. These can be granted by a [Platform Policy](authorization/policies.md). +privileges assigned to your account. These can be granted by a [Platform Policy](authorization/policies.md). -![](./imgs/ingestion-privileges.png) ++ +
-Once you have these privileges, you can begin to manage ingestion by navigating to the 'Ingestion' tab in DataHub. +Once you have these privileges, you can begin to manage ingestion by navigating to the 'Ingestion' tab in DataHub. -![](./imgs/ingestion-tab.png) ++ +
On this page, you'll see a list of active **Ingestion Sources**. An Ingestion Sources is a unique source of metadata ingested into DataHub from an external source like Snowflake, Redshift, or BigQuery. If you're just getting started, you won't have any sources. In the following sections, we'll describe how to create -your first **Ingestion Source**. +your first **Ingestion Source**. ### Creating an Ingestion Source Before ingesting any metadata, you need to create a new Ingestion Source. Start by clicking **+ Create new source**. -![](./imgs/create-new-ingestion-source-button.png) ++ +
#### Step 1: Select a Platform Template In the first step, select a **Recipe Template** corresponding to the source type that you'd like to extract metadata from. Choose among a variety of natively supported integrations, from Snowflake to Postgres to Kafka. -Select `Custom` to construct an ingestion recipe from scratch. +Select `Custom` to construct an ingestion recipe from scratch. -![](./imgs/select-platform-template.png) ++ +
Next, you'll configure an ingestion **Recipe**, which defines _how_ and _what_ to extract from the source system. @@ -49,23 +57,23 @@ used by DataHub to extract metadata from a 3rd party system. It most often consi 1. A source **type**: The type of system you'd like to extract metadata from (e.g. snowflake, mysql, postgres). If you've chosen a native template, this will already be populated for you. To view a full list of currently supported **types**, check out [this list](https://datahubproject.io/docs/metadata-ingestion/#installing-plugins). - -2. A source **config**: A set of configurations specific to the source **type**. Most sources support the following types of configuration values: - - **Coordinates**: The location of the system you want to extract metadata from - - **Credentials**: Authorized credentials for accessing the system you want to extract metadata from - - **Customizations**: Customizations regarding the metadata that will be extracted, e.g. which databases or tables to scan in a relational DB +2. A source **config**: A set of configurations specific to the source **type**. Most sources support the following types of configuration values: + - **Coordinates**: The location of the system you want to extract metadata from + - **Credentials**: Authorized credentials for accessing the system you want to extract metadata from + - **Customizations**: Customizations regarding the metadata that will be extracted, e.g. which databases or tables to scan in a relational DB 3. A sink **type**: A type of sink to route the metadata extracted from the source type. The officially supported DataHub sink - types are `datahub-rest` and `datahub-kafka`. - + types are `datahub-rest` and `datahub-kafka`. 4. A sink **config**: Configuration required to send metadata to the provided sink type. For example, DataHub coordinates and credentials. - + A sample of a full recipe configured to ingest metadata from MySQL can be found in the image below. -![](./imgs/example-mysql-recipe.png) ++ +
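In text form, a recipe along the lines of the screenshot above might look like the following — host, database, and credentials are placeholders, and the next section shows how to replace the plaintext credentials with Secrets:

```yaml
source:
  type: mysql
  config:
    # Coordinates
    host_port: localhost:3306
    database: my_db
    # Credentials (placeholders -- prefer Secrets for production use)
    username: datahub
    password: datahub

sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
```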
Detailed configuration examples & documentation for each source type can be found on the [DataHub Docs](https://datahubproject.io/docs/metadata-ingestion/) website. @@ -75,15 +83,16 @@ For production use cases, sensitive configuration values, such as database usern should be hidden from plain view within your ingestion recipe. To accomplish this, you can create & embed **Secrets**. Secrets are named values that are encrypted and stored within DataHub's storage layer. -To create a secret, first navigate to the 'Secrets' tab. Then click `+ Create new secret`. +To create a secret, first navigate to the 'Secrets' tab. Then click `+ Create new secret`. -![](./imgs/create-secret.png) ++ +
_Creating a Secret to store the username for a MySQL database_ Inside the form, provide a unique name for the secret along with the value to be encrypted, and an optional description. Click **Create** when you are done. -This will create a Secret which can be referenced inside your ingestion recipe using its name. - +This will create a Secret which can be referenced inside your ingestion recipe using its name. ##### Referencing a Secret @@ -92,53 +101,56 @@ to substitute secrets for a MySQL username and password into a Recipe, your Reci ```yaml source: - type: mysql - config: - host_port: 'localhost:3306' - database: my_db - username: ${MYSQL_USERNAME} - password: ${MYSQL_PASSWORD} - include_tables: true - include_views: true - profiling: - enabled: true + type: mysql + config: + host_port: "localhost:3306" + database: my_db + username: ${MYSQL_USERNAME} + password: ${MYSQL_PASSWORD} + include_tables: true + include_views: true + profiling: + enabled: true sink: - type: datahub-rest - config: - server: 'http://datahub-gms:8080' + type: datahub-rest + config: + server: "http://datahub-gms:8080" ``` + _Referencing DataHub Secrets from a Recipe definition_ When the Ingestion Source with this Recipe executes, DataHub will attempt to 'resolve' Secrets found within the YAML. If a secret can be resolved, the reference is substituted for its decrypted value prior to execution. Secret values are not persisted to disk beyond execution time, and are never transmitted outside DataHub. -> **Attention**: Any DataHub users who have been granted the `Manage Secrets` [Platform Privilege](authorization/policies.md) will be able to retrieve plaintext secret values using the GraphQL API. - +> **Attention**: Any DataHub users who have been granted the `Manage Secrets` [Platform Privilege](authorization/policies.md) will be able to retrieve plaintext secret values using the GraphQL API. -#### Step 3: Schedule Execution +#### Step 3: Schedule Execution Next, you can optionally configure a schedule on which to execute your new Ingestion Source. This enables to schedule metadata extraction on a monthly, weekly, daily, or hourly cadence depending on the needs of your organization. -Schedules are defined using CRON format. +Schedules are defined using CRON format. -![](./imgs/schedule-ingestion.png) ++ +
_An Ingestion Source that is executed at 9:15am every day, Los Angeles time_ To learn more about the CRON scheduling format, check out the [Wikipedia](https://en.wikipedia.org/wiki/Cron) overview. -If you plan to execute ingestion on an ad-hoc basis, you can click **Skip** to skip the scheduling step entirely. Don't worry - -you can always come back and change this. +If you plan to execute ingestion on an ad-hoc basis, you can click **Skip** to skip the scheduling step entirely. Don't worry - +you can always come back and change this. #### Step 4: Finishing Up -Finally, give your Ingestion Source a name. +Finally, give your Ingestion Source a name. -![](./imgs/name-ingestion-source.png) ++ +
Once you're happy with your configurations, click 'Done' to save your changes. - -##### Advanced: Running with a specific CLI version +##### Advanced: Running with a specific CLI version DataHub comes pre-configured to use the latest version of the DataHub CLI ([acryl-datahub](https://pypi.org/project/acryl-datahub/)) that is compatible with the server. However, you can override the default package version using the 'Advanced' source configurations. @@ -146,11 +158,13 @@ with the server. However, you can override the default package version using the To do so, simply click 'Advanced', then change the 'CLI Version' text box to contain the exact version of the DataHub CLI you'd like to use. -![](./imgs/custom-ingestion-cli-version.png) -_Pinning the CLI version to version `0.8.23.2`_ ++ +
-Once you're happy with your changes, simply click 'Done' to save. +_Pinning the CLI version to version `0.8.23.2`_ +Once you're happy with your changes, simply click 'Done' to save. ### Running an Ingestion Source @@ -158,47 +172,54 @@ Once you've created your Ingestion Source, you can run it by clicking 'Execute'. you should see the 'Last Status' column of the ingestion source change from `N/A` to `Running`. This means that the request to execute ingestion has been successfully picked up by the DataHub ingestion executor. -![](./imgs/running-ingestion.png) ++ +
-If ingestion has executed successfully, you should see it's state shown in green as `Succeeded`. - -![](./imgs/successful-ingestion.png) +If ingestion has executed successfully, you should see it's state shown in green as `Succeeded`. ++ +
### Cancelling an Ingestion Run -If your ingestion run is hanging, there may a bug in the ingestion source, or another persistent issue like exponential timeouts. If these situations, +If your ingestion run is hanging, there may a bug in the ingestion source, or another persistent issue like exponential timeouts. If these situations, you can cancel ingestion by clicking **Cancel** on the problematic run. -![](./imgs/cancelled-ingestion.png) - -Once cancelled, you can view the output of the ingestion run by clicking **Details**. ++ +
+Once cancelled, you can view the output of the ingestion run by clicking **Details**. ### Debugging a Failed Ingestion Run -![](./imgs/failed-ingestion.png) ++ +
-A variety of things can cause an ingestion run to fail. Common reasons for failure include: +A variety of things can cause an ingestion run to fail. Common reasons for failure include: 1. **Recipe Misconfiguration**: A recipe has not provided the required or expected configurations for the ingestion source. You can refer to the [Metadata Ingestion Framework](https://datahubproject.io/docs/metadata-ingestion) source docs to learn more about the configurations required for your source type. - -2. **Failure to resolve Secrets**: If DataHub is unable to find secrets that were referenced by your Recipe configuration, the ingestion run will fail. - Verify that the names of the secrets referenced in your recipe match those which have been created. - +2. **Failure to resolve Secrets**: If DataHub is unable to find secrets that were referenced by your Recipe configuration, the ingestion run will fail. + Verify that the names of the secrets referenced in your recipe match those which have been created. 3. **Connectivity / Network Reachability**: If DataHub is unable to reach a data source, for example due to DNS resolution failures, metadata ingestion will fail. Ensure that the network where DataHub is deployed has access to the data source which - you are trying to reach. - -4. **Authentication**: If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token - in your Recipe Configuration. To so this, set the 'token' field of the sink configuration to contain a Personal Access Token: - ![](./imgs/ingestion-with-token.png) + you are trying to reach. +4. **Authentication**: If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token +in your Recipe Configuration. To so this, set the 'token' field of the sink configuration to contain a Personal Access Token: ++ +
-The output of each run is captured and available to view in the UI for easier debugging. To view output logs, click **DETAILS** -on the corresponding ingestion run. +The output of each run is captured and available to view in the UI for easier debugging. To view output logs, click **DETAILS** +on the corresponding ingestion run. -![](./imgs/ingestion-logs.png) ++ +
## FAQ @@ -206,14 +227,16 @@ on the corresponding ingestion run. If not due to one of the reasons outlined above, this may be because the executor running ingestion is unable to reach DataHub's backend using the default configurations. Try changing your ingestion recipe to make the `sink.config.server` variable point to the Docker -DNS name for the `datahub-gms` pod: +DNS name for the `datahub-gms` pod: -![](./imgs/quickstart-ingestion-config.png) ++ +
### I see 'N/A' when I try to run ingestion. What do I do? -If you see 'N/A', and the ingestion run state never changes to 'Running', this may mean -that your executor (`datahub-actions`) container is down. +If you see 'N/A', and the ingestion run state never changes to 'Running', this may mean +that your executor (`datahub-actions`) container is down. This container is responsible for executing requests to run ingestion when they come in, either on demand on a particular schedule. You can verify the health of the container using `docker ps`. Moreover, you can inspect the container logs using by finding the container id diff --git a/docs/what/relationship.md b/docs/what/relationship.md index 1908bbd6ce75f..c52aa05ee489d 100644 --- a/docs/what/relationship.md +++ b/docs/what/relationship.md @@ -1,19 +1,21 @@ # What is a relationship? -A relationship is a named associate between exactly two [entities](entity.md), a source and a destination. - -![metadata-modeling](../imgs/metadata-modeling.png) - -From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship. -Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`. -This is due to the fact that the actual metadata aspect holding this information is associated with `Group`, rather than User. -Had the direction been reversed, the relationship would have been named `IsMemberOf` instead. -See [Direction of Relationships](#direction-of-relationships) for more discussions on relationship directionality. -A specific instance of a relationship, e.g. `urn:li:corpGroup:group1` has a member `urn:li:corpuser:user1`, +A relationship is a named associate between exactly two [entities](entity.md), a source and a destination. + ++ +
+ +From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship. +Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`. +This is due to the fact that the actual metadata aspect holding this information is associated with `Group`, rather than User. +Had the direction been reversed, the relationship would have been named `IsMemberOf` instead. +See [Direction of Relationships](#direction-of-relationships) for more discussions on relationship directionality. +A specific instance of a relationship, e.g. `urn:li:corpGroup:group1` has a member `urn:li:corpuser:user1`, corresponds to an edge in the metadata graph. -Similar to an entity, a relationship can also be associated with optional attributes that are derived from the metadata. -For example, from the `Membership` metadata aspect shown below, we’re able to derive the `HasMember` relationship that links a specific `Group` to a specific `User`. We can also include additional attribute to the relationship, e.g. importance, which corresponds to the position of the specific member in the original membership array. This allows complex graph query that travel only relationships that match certain criteria, e.g. "returns only the top-5 most important members of this group." +Similar to an entity, a relationship can also be associated with optional attributes that are derived from the metadata. +For example, from the `Membership` metadata aspect shown below, we’re able to derive the `HasMember` relationship that links a specific `Group` to a specific `User`. We can also include additional attribute to the relationship, e.g. importance, which corresponds to the position of the specific member in the original membership array. This allows complex graph query that travel only relationships that match certain criteria, e.g. "returns only the top-5 most important members of this group." Similar to the entity attributes, relationship attributes should only be added based on the expected query patterns to reduce the indexing cost. ``` @@ -38,12 +40,13 @@ record Membership { } ``` -Relationships are meant to be "entity-neutral". In other words, one would expect to use the same `OwnedBy` relationship to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`. As Pegasus doesn’t allow typing a field using multiple URNs (because they’re all essentially strings), we resort to using generic URN type for the source and destination. +Relationships are meant to be "entity-neutral". In other words, one would expect to use the same `OwnedBy` relationship to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`. As Pegasus doesn’t allow typing a field using multiple URNs (because they’re all essentially strings), we resort to using generic URN type for the source and destination. We also introduce a `@pairings` [annotation](https://linkedin.github.io/rest.li/pdl_migration#shorthand-for-custom-properties) to limit the allowed source and destination URN types. While it’s possible to model relationships in rest.li as [association resources](https://linkedin.github.io/rest.li/modeling/modeling#association), which often get stored as mapping tables, it is far more common to model them as "foreign keys" field in a metadata aspect. For instance, the `Ownership` aspect is likely to contain an array of owner’s corpuser URNs. Below is an example of how a relationship is modeled in PDL. Note that: + 1. 
As the `source` and `destination` are of generic URN type, we’re able to factor them out to a common `BaseRelationship` model. 2. Each model is expected to have a `@pairings` annotation that is an array of all allowed source-destination URN pairs. 3. Unlike entity attributes, there’s no requirement on making all relationship attributes optional since relationships do not support partial updates. @@ -85,20 +88,20 @@ record HasMembership includes BaseRelationship /** * The importance of the membership */ - importance: int + importance: int } ``` ## Direction of Relationships -As relationships are modeled as directed edges between nodes, it’s natural to ask which way should it be pointing, -or should there be edges going both ways? The answer is, "doesn’t really matter." It’s rather an aesthetic choice than technical one. +As relationships are modeled as directed edges between nodes, it’s natural to ask which way should it be pointing, +or should there be edges going both ways? The answer is, "doesn’t really matter." It’s rather an aesthetic choice than technical one. For one, the actual direction doesn’t really impact the execution of graph queries. Most graph DBs are fully capable of traversing edges in reverse direction efficiently. That being said, generally there’s a more "natural way" to specify the direction of a relationship, which closely relate to how the metadata is stored. For example, the membership information for an LDAP group is generally stored as a list in group’s metadata. As a result, it’s more natural to model a `HasMember` relationship that points from a group to a member, instead of a `IsMemberOf` relationship pointing from member to group. -Since all relationships are explicitly declared, it’s fairly easy for a user to discover what relationships are available and their directionality by inspecting +Since all relationships are explicitly declared, it’s fairly easy for a user to discover what relationships are available and their directionality by inspecting the [relationships directory](../../metadata-models/src/main/pegasus/com/linkedin/metadata/relationship). It’s also possible to provide a UI for the catalog of entities and relationships for analysts who are interested in building complex graph queries to gain insights into the metadata. ## High Cardinality Relationships diff --git a/metadata-ingestion/developing.md b/metadata-ingestion/developing.md index 67041d23a21b1..9b79ee31e31bf 100644 --- a/metadata-ingestion/developing.md +++ b/metadata-ingestion/developing.md @@ -74,7 +74,9 @@ The syntax for installing plugins is slightly different in development. For exam ## Architecture -![metadata ingestion framework layout](../docs/imgs/datahub-metadata-ingestion-framework.png) ++ +
The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub. @@ -99,6 +101,7 @@ mypy src/ tests/ ``` or you can run from root of the repository + ```shell ./gradlew :metadata-ingestion:lintFix ``` diff --git a/perf-test/README.md b/perf-test/README.md index 24fb064d3e28a..7b1f962e208f3 100644 --- a/perf-test/README.md +++ b/perf-test/README.md @@ -58,7 +58,9 @@ locust -f perf-test/locustfiles/ingest.py This will set up the web interface in http://localhost:8089 (unless the port is already taken). Once you click into it, you should see the following -![Locust Example](../docs/imgs/locust-example.png) ++ +
Input the number of users you would like to spawn and the spawn rate. Point the host to the deployed DataHub GMS ( locally, it should be http://localhost:8080). Click on the "Start swarming" button to start the load test.
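If you prefer to drive the test without the web UI, the same parameters can be supplied on the command line — a sketch that assumes a reasonably recent Locust release (flag names may differ in older versions):

```shell
# Headless equivalent of the web UI flow above:
# 100 simulated users, spawned 10 per second, for 5 minutes, against a local GMS
locust -f perf-test/locustfiles/ingest.py \
  --headless \
  --host http://localhost:8080 \
  --users 100 \
  --spawn-rate 10 \
  --run-time 5m
```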