diff --git a/README.md b/README.md index 3b381ebc8dc89..3498eb595d3a6 100644 --- a/README.md +++ b/README.md @@ -81,7 +81,9 @@ Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quic If you're looking to build & modify datahub please take a look at our [Development Guide](https://datahubproject.io/docs/developers).
## Source Code and Repositories diff --git a/datahub-web-react/README.md b/datahub-web-react/README.md index 6c91b169af858..8d11389ee29d2 100644 --- a/datahub-web-react/README.md +++ b/datahub-web-react/README.md @@ -126,7 +126,9 @@ for functional configurability should reside. to render a view associated with a particular entity type (user, dataset, etc.). -![entity-registry](./entity-registry.png) ++ +
**graphql** - The React App talks to the `dathub-frontend` server using GraphQL. This module is where the *queries* issued against the server are defined. Once defined, running `yarn run generate` will code-gen TypeScript objects to make invoking diff --git a/docs-website/versioned_docs/version-0.10.4/README.md b/docs-website/versioned_docs/version-0.10.4/README.md new file mode 100644 index 0000000000000..6b9450294a7f0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/README.md @@ -0,0 +1,185 @@ +--- +description: >- + DataHub is a data discovery application built on an extensible metadata + platform that helps you tame the complexity of diverse data ecosystems. +hide_title: true +title: Introduction +slug: /introduction +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/README.md" +--- + +import useBaseUrl from '@docusaurus/useBaseUrl'; + +export const Logo = (props) => { +return ( + ++ +
+ +## Source Code and Repositories + +- [datahub-project/datahub](https://github.com/datahub-project/datahub): This repository contains the complete source code for DataHub's metadata model, metadata services, integration connectors and the web application. +- [acryldata/datahub-actions](https://github.com/acryldata/datahub-actions): DataHub Actions is a framework for responding to changes to your DataHub Metadata Graph in real time. +- [acryldata/datahub-helm](https://github.com/acryldata/datahub-helm): Repository of helm charts for deploying DataHub on a Kubernetes cluster +- [acryldata/meta-world](https://github.com/acryldata/meta-world): A repository to store recipes, custom sources, transformations and other things to make your DataHub experience magical + +## Releases + +See [Releases](https://github.com/datahub-project/datahub/releases) page for more details. We follow the [SemVer Specification](https://semver.org) when versioning the releases and adopt the [Keep a Changelog convention](https://keepachangelog.com/) for the changelog format. + +## Contributing + +We welcome contributions from the community. Please refer to our [Contributing Guidelines](docs/CONTRIBUTING.md) for more details. We also have a [contrib](https://github.com/datahub-project/datahub/blob/master/contrib) directory for incubating experimental features. + +## Community + +Join our [Slack workspace](https://slack.datahubproject.io) for discussions and important announcements. You can also find out more about our upcoming [town hall meetings](docs/townhalls.md) and view past recordings. + +## Adoption + +Here are the companies that have officially adopted DataHub. Please feel free to add yours to the list if we missed it. + +- [ABLY](https://ably.team/) +- [Adevinta](https://www.adevinta.com/) +- [Banksalad](https://www.banksalad.com) +- [Cabify](https://cabify.tech/) +- [ClassDojo](https://www.classdojo.com/) +- [Coursera](https://www.coursera.org/) +- [DefinedCrowd](http://www.definedcrowd.com) +- [DFDS](https://www.dfds.com/) +- [Digital Turbine](https://www.digitalturbine.com/) +- [Expedia Group](http://expedia.com) +- [Experius](https://www.experius.nl) +- [Geotab](https://www.geotab.com) +- [Grofers](https://grofers.com) +- [Haibo Technology](https://www.botech.com.cn) +- [hipages](https://hipages.com.au/) +- [inovex](https://www.inovex.de/) +- [IOMED](https://iomed.health) +- [Klarna](https://www.klarna.com) +- [LinkedIn](http://linkedin.com) +- [Moloco](https://www.moloco.com/en) +- [N26](https://n26brasil.com/) +- [Optum](https://www.optum.com/) +- [Peloton](https://www.onepeloton.com) +- [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) +- [Razer](https://www.razer.com) +- [Saxo Bank](https://www.home.saxo) +- [Showroomprive](https://www.showroomprive.com/) +- [SpotHero](https://spothero.com) +- [Stash](https://www.stash.com) +- [Shanghai HuaRui Bank](https://www.shrbank.com) +- [ThoughtWorks](https://www.thoughtworks.com) +- [TypeForm](http://typeform.com) +- [Udemy](https://www.udemy.com/) +- [Uphold](https://uphold.com) +- [Viasat](https://viasat.com) +- [Wikimedia](https://www.wikimedia.org) +- [Wolt](https://wolt.com) +- [Zynga](https://www.zynga.com) + +## Select Articles & Talks + +- [DataHub Blog](https://blog.datahubproject.io/) +- [DataHub YouTube Channel](https://www.youtube.com/channel/UC3qFQC5IiwR5fvWEqi_tJ5w) +- [Optum: Data Mesh via DataHub](https://optum.github.io/blog/2022/03/23/data-mesh-via-datahub/) +- [Saxo Bank: Enabling Data Discovery in Data 
Mesh](https://medium.com/datahub-project/enabling-data-discovery-in-a-data-mesh-the-saxo-journey-451b06969c8f) +- [Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/) +- [DataHub: Popular Metadata Architectures Explained](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained) +- [Driving DataOps Culture with LinkedIn DataHub](https://www.youtube.com/watch?v=ccsIKK9nVxk) @ [DataOps Unleashed 2021](https://dataopsunleashed.com/#shirshanka-session) +- [The evolution of metadata: LinkedIn’s story](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019) @ [Strata Data Conference 2019](https://conferences.oreilly.com/strata/strata-ny-2019.html) +- [Journey of metadata at LinkedIn](https://www.youtube.com/watch?v=OB-O0Y6OYDE) @ [Crunch Data Conference 2019](https://crunchconf.com/2019) +- [DataHub Journey with Expedia Group](https://www.youtube.com/watch?v=ajcRdB22s5o) +- [Data Discoverability at SpotHero](https://www.slideshare.net/MaggieHays/data-discoverability-at-spothero) +- [Data Catalogue — Knowing your data](https://medium.com/albert-franzi/data-catalogue-knowing-your-data-15f7d0724900) +- [DataHub: A Generalized Metadata Search & Discovery Tool](https://engineering.linkedin.com/blog/2019/data-hub) +- [Open sourcing DataHub: LinkedIn’s metadata search and discovery platform](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) +- [Emerging Architectures for Modern Data Infrastructure](https://future.com/emerging-architectures-for-modern-data-infrastructure-2020/) + +See the full list [here](docs/links.md). + +## License + +[Apache License 2.0](https://github.com/datahub-project/datahub/blob/master/LICENSE). diff --git a/docs-website/versioned_docs/version-0.10.4/datahub-frontend/README.md b/docs-website/versioned_docs/version-0.10.4/datahub-frontend/README.md new file mode 100644 index 0000000000000..7f181861f6a43 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/datahub-frontend/README.md @@ -0,0 +1,97 @@ +--- +title: datahub-frontend +sidebar_label: datahub-frontend +slug: /datahub-frontend +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/datahub-frontend/README.md +--- + +# DataHub Frontend Proxy + +DataHub frontend is a [Play](https://www.playframework.com/) service written in Java. It is served as a mid-tier +between [DataHub GMS](https://github.com/datahub-project/datahub/blob/master/metadata-service) which is the backend service and [DataHub Web](../datahub-web-react/README.md). + +## Pre-requisites + +- You need to have [JDK11](https://openjdk.org/projects/jdk/11/) + installed on your machine to be able to build `DataHub Frontend`. +- You need to have [Chrome](https://www.google.com/chrome/) web browser + installed to be able to build because UI tests have a dependency on `Google Chrome`. + +## Build + +`DataHub Frontend` is already built as part of top level build: + +``` +./gradlew build +``` + +However, if you only want to build `DataHub Frontend` specifically: + +``` +./gradlew :datahub-frontend:dist +``` + +## Dependencies + +Before starting `DataHub Frontend`, you need to make sure that [DataHub GMS](https://github.com/datahub-project/datahub/blob/master/metadata-service) and +all its dependencies have already started and running. 
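Before moving on, it can save time to confirm that GMS actually responds. Below is a minimal sketch of such a check, assuming a local GMS listening on `http://localhost:8080` and serving a `/config` endpoint; adjust the URL and path for your deployment.

```python
# Quick sanity check that DataHub GMS is reachable before starting the frontend.
# Assumes GMS listens on http://localhost:8080 and serves a /config endpoint;
# adjust both values for your deployment.
import sys

import requests

GMS_URL = "http://localhost:8080"

try:
    resp = requests.get(f"{GMS_URL}/config", timeout=5)
    resp.raise_for_status()
    print(f"GMS is reachable at {GMS_URL} (HTTP {resp.status_code})")
except requests.RequestException as exc:
    sys.exit(f"GMS does not appear to be running at {GMS_URL}: {exc}")
```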
+ +## Start via Docker image + +Quickest way to try out `DataHub Frontend` is running the [Docker image](https://github.com/datahub-project/datahub/blob/master/docker/datahub-frontend). + +## Start via command line + +If you do modify things and want to try it out quickly without building the Docker image, you can also run +the application directly from command line after a successful [build](#build): + +``` +cd datahub-frontend/run && ./run-local-frontend +``` + +## Checking out DataHub UI + +After starting your application in one of the two ways mentioned above, you can connect to it by typing below +into your favorite web browser: + +``` +http://localhost:9002 +``` + +To be able to sign in, you need to provide your user name. The default account is `datahub`, password `datahub`. + +## Authentication + +DataHub frontend leverages [Java Authentication and Authorization Service (JAAS)](https://docs.oracle.com/javase/7/docs/technotes/guides/security/jaas/JAASRefGuide.html) to perform the authentication. By default we provided a [DummyLoginModule](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/app/security/DummyLoginModule.java) which will accept any username/password combination. You can update [jaas.conf](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/conf/jaas.conf) to match your authentication requirement. For example, use the following config for LDAP-based authentication, + +``` +WHZ-Authentication { + com.sun.security.auth.module.LdapLoginModule sufficient + userProvider="ldaps://+ +
+ +**graphql** - The React App talks to the `datahub-frontend` server using GraphQL. This module is where the _queries_ issued +against the server are defined. Once defined, running `yarn run generate` will code-gen TypeScript objects to make invoking +these queries extremely easy. An example can be found at the top of `SearchPage.tsx`. + +**images** - Images to be displayed within the app. This is where one would place a custom logo image. + +## Adding an Entity + +The following outlines the steps required to introduce a new entity into the React app: + +1. Declare the GraphQL Queries required to display the new entity + + - If search functionality should be supported, extend the "search" query within `search.graphql` to fetch the new + entity data. + - If browse functionality should be supported, extend the "browse" query within `browse.graphql` to fetch the new + entity data. + - If displaying a 'profile' should be supported (most often), introduce a new `+ +
+ +By default, Airflow loads all DAGs in paused status. Unpause the sample DAG to use it. + ++ +
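If you prefer to script this instead of using the Airflow UI, something along the following lines should work against Airflow 2.x's stable REST API. The webserver URL, credentials, and DAG id below are placeholders, and the basic-auth API backend must be enabled in your Airflow deployment.

```python
# Unpause the sample DAG via Airflow's stable REST API (Airflow 2.x).
# The webserver URL, credentials, and DAG id are placeholders; substitute your own.
import requests

AIRFLOW_URL = "http://localhost:8080"  # your Airflow webserver
DAG_ID = "my_sample_lineage_dag"       # replace with the sample DAG's actual id

resp = requests.patch(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}",
    json={"is_paused": False},
    auth=("airflow", "airflow"),       # assumes the basic-auth API backend is enabled
)
resp.raise_for_status()
print(resp.json())
```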
++ +
+ +Then trigger the DAG to run. + ++ +
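The same REST API can trigger the run, again assuming Airflow 2.x with the basic-auth backend enabled and substituting your own webserver URL and DAG id.

```python
# Trigger a run of the sample DAG via Airflow's stable REST API (Airflow 2.x).
# The webserver URL, credentials, and DAG id are placeholders; substitute your own.
import requests

AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "my_sample_lineage_dag"

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    json={"conf": {}},
    auth=("airflow", "airflow"),
)
resp.raise_for_status()
print(resp.json())
```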
+ +After the DAG runs successfully, go over to your DataHub instance to see the Pipeline and navigate its lineage. + ++ +
+ +## Troubleshooting + +Most issues are related to connectivity between Airflow and DataHub. + +Here is how you can debug them. + ++ +
+ +If the task logs show an error indicating that the `datahub-rest` connection is not defined, the connection has not been registered with Airflow. +Execute Step 4 to register the datahub connection with Airflow. + +If the connection was registered successfully but you are still seeing `Failed to establish a new connection`, check that the connection host is `http://datahub-gms:8080` and not `http://localhost:8080`. + +After re-running the DAG, we see success! + ++ +
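For future debugging, a quick way to see which GMS host is reachable from the environment where your tasks run is a check like the one below. It assumes the `requests` library is available there and uses `/config` only as an endpoint that responds when GMS is up.

```python
# Run this from wherever the Airflow tasks execute (e.g. inside the scheduler or
# worker container) to see which GMS host is reachable from that environment.
import requests

for host in ("http://datahub-gms:8080", "http://localhost:8080"):
    try:
        resp = requests.get(f"{host}/config", timeout=5)
        print(f"{host} -> HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(f"{host} -> unreachable ({exc})")
```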
diff --git a/docs-website/versioned_docs/version-0.10.4/docker/datahub-upgrade/README.md b/docs-website/versioned_docs/version-0.10.4/docker/datahub-upgrade/README.md new file mode 100644 index 0000000000000..687f3b3eae4a7 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docker/datahub-upgrade/README.md @@ -0,0 +1,121 @@ +--- +title: DataHub Upgrade Docker Image +sidebar_label: Upgrade Docker Image +slug: /docker/datahub-upgrade +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/README.md +--- + +# DataHub Upgrade Docker Image + +This container is used to automatically apply upgrades from one version of DataHub to another. + +## Supported Upgrades + +As of today, there are 2 supported upgrades: + +1. **NoCodeDataMigration**: Performs a series of pre-flight qualification checks and then migrates metadata*aspect table data + to metadata_aspect_v2 table. Arguments: - \_batchSize* (Optional): The number of rows to migrate at a time. Defaults to 1000. - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250. - _dbType_ (optional): The target DB type. Valid values are `MYSQL`, `MARIA`, `POSTGRES`. Defaults to `MYSQL`. +2. **NoCodeDataMigrationCleanup**: Cleanses graph index, search index, and key-value store of legacy DataHub data (metadata_aspect table) once + the No Code Data Migration has completed successfully. No arguments. + +3. **RestoreIndices**: Restores indices by fetching the latest version of each aspect and producing MAE + +4. **RestoreBackup**: Restores the storage stack from a backup of the local database + +## Environment Variables + +To run the `datahub-upgrade` container, some environment variables must be provided in order to tell the upgrade CLI +where the running DataHub containers reside. + +Below details the required configurations. By default, these configs are provided for local docker-compose deployments of +DataHub within `docker/datahub-upgrade/env/docker.env`. They assume that there is a Docker network called datahub_network +where the DataHub containers can be found. + +These are also the variables used when the provided `datahub-upgrade.sh` script is executed. To run the upgrade CLI for non-local deployments, +follow these steps: + +1. Define new ".env" variable to hold your environment variables. + +The following variables may be provided: + +```aidl +# Required Environment Variables +EBEAN_DATASOURCE_USERNAME=datahub +EBEAN_DATASOURCE_PASSWORD=datahub +EBEAN_DATASOURCE_HOST=+ +
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata/impact-analysis.md b/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata/impact-analysis.md new file mode 100644 index 0000000000000..92c307e4a82da --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata/impact-analysis.md @@ -0,0 +1,102 @@ +--- +title: About DataHub Lineage Impact Analysis +sidebar_label: Lineage Impact Analysis +slug: /act-on-metadata/impact-analysis +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/act-on-metadata/impact-analysis.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Lineage Impact Analysis + ++ +
+ +2. Easily toggle between **Upstream** and **Downstream** dependencies + ++ +
+ +3. Choose the **Degree of Dependencies** you are interested in. The default filter is “1 Degree of Dependency” to minimize processor-intensive queries. + ++ +
+ +4. Slice and dice the result list by Entity Type, Platform, Owner, and more to isolate the relevant dependencies + ++ +
+ +5. Export the full list of dependencies to CSV + ++ +
+ +6. View the filtered set of dependencies via CSV, with details about assigned ownership, domain, tags, terms, and quick links back to those entities within DataHub + ++ +
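The same dependency list can also be pulled programmatically by posting a `searchAcrossLineage` query to DataHub's `/api/graphql` endpoint. The sketch below is illustrative only: the server URL, access token, dataset URN, and the exact input/response fields are assumptions, so check the GraphQL reference linked below for the authoritative schema.

```python
# Sketch: fetch downstream dependencies of a dataset via the GraphQL API.
# The URL, token, URN, and field selection are illustrative; consult the
# searchAcrossLineage entries in the GraphQL reference for the exact schema.
import requests

DATAHUB_URL = "http://localhost:9002"  # datahub-frontend; adjust for your deployment
TOKEN = "<personal-access-token>"      # needed if Metadata Service Authentication is enabled

QUERY = """
query impactAnalysis($urn: String!) {
  searchAcrossLineage(
    input: { urn: $urn, direction: DOWNSTREAM, query: "*", start: 0, count: 100 }
  ) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""

resp = requests.post(
    f"{DATAHUB_URL}/api/graphql",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"query": QUERY, "variables": {"urn": "urn:li:dataset:(...)"}},  # placeholder URN
)
print(resp.json())
```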
+ +## Additional Resources + +### Videos + +**DataHub 201: Impact Analysis** + ++ +
+ +### GraphQL + +- [searchAcrossLineage](../../graphql/queries.md#searchacrosslineage) +- [searchAcrossLineageInput](../../graphql/inputObjects.md#searchacrosslineageinput) + +Looking for an example of how to use `searchAcrossLineage` to read lineage? Look [here](../api/tutorials/lineage.md#read-lineage) + +### DataHub Blog + +- [Dependency Impact Analysis, Data Validation Outcomes, and MORE! - Highlights from DataHub v0.8.27 & v.0.8.28](https://blog.datahubproject.io/dependency-impact-analysis-data-validation-outcomes-and-more-1302604da233) + +### FAQ and Troubleshooting + +**The Lineage Tab is greyed out - why can’t I click on it?** + +This means you have not yet ingested Lineage metadata for that entity. Please see the Lineage Guide to get started. + +**Why is my list of exported dependencies incomplete?** + +We currently limit the list of dependencies to 10,000 records; we suggest applying filters to narrow the result set if you hit that limit. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [DataHub Lineage](../lineage/lineage-feature-guide.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/README.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/README.md new file mode 100644 index 0000000000000..10040bd1e45dd --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/README.md @@ -0,0 +1,250 @@ +--- +title: Introduction +slug: /actions +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/actions/README.md" +--- + +# ⚡ DataHub Actions Framework + +Welcome to DataHub Actions! The Actions framework makes responding to realtime changes in your Metadata Graph easy, enabling you to seamlessly integrate [DataHub](https://github.com/datahub-project/datahub) into a broader events-based architecture. + +For a detailed introduction, check out the [original announcement](https://www.youtube.com/watch?v=7iwNxHgqxtg&t=2189s) of the DataHub Actions Framework at the DataHub April 2022 Town Hall. For a more in-depth look at use cases and concepts, check out [DataHub Actions Concepts](concepts.md). + +## Quickstart + +To get started right away, check out the [DataHub Actions Quickstart](quickstart.md) Guide. + +## Prerequisites + +The DataHub Actions CLI commands are an extension of the base `datahub` CLI commands. We recommend +first installing the `datahub` CLI: + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +datahub --version +``` + +> Note that the Actions Framework requires a version of `acryl-datahub` >= v0.8.34 + +## Installation + +Next, simply install the `acryl-datahub-actions` package from PyPi: + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub-actions +datahub actions version +``` + +## Configuring an Action + +Actions are configured using a YAML file, much in the same way DataHub ingestion sources are. An action configuration file consists of the following + +1. Action Pipeline Name (Should be unique and static) +2. Source Configurations +3. Transform + Filter Configurations +4. Action Configuration +5. Pipeline Options (Optional) +6. DataHub API configs (Optional - required for select actions) + +With each component being independently pluggable and configurable. + +```yml +# 1. Required: Action Pipeline Name +name:+ +
+ +**In the Actions Framework, Events flow continuously from left-to-right.** + +### Pipelines + +A **Pipeline** is a continuously running process which performs the following functions: + +1. Polls events from a configured Event Source (described below) +2. Applies configured Transformation + Filtering to the Event +3. Executes the configured Action on the resulting Event + +in addition to handling initialization, errors, retries, logging, and more. + +Each Action Configuration file corresponds to a unique Pipeline. In practice, +each Pipeline has its very own Event Source, Transforms, and Actions. This makes it easy to maintain state for mission-critical Actions independently. + +Importantly, each Action must have a unique name. This serves as a stable identifier across Pipeline run which can be useful in saving the Pipeline's consumer state (ie. resiliency + reliability). For example, the Kafka Event Source (default) uses the pipeline name as the Kafka Consumer Group id. This enables you to easily scale-out your Actions by running multiple processes with the same exact configuration file. Each will simply become different consumers in the same consumer group, sharing traffic of the DataHub Events stream. + +### Events + +**Events** are data objects representing changes that have occurred on DataHub. Strictly speaking, the only requirement that the Actions framework imposes is that these objects must be + +a. Convertible to JSON +b. Convertible from JSON + +So that in the event of processing failures, events can be written and read from a failed events file. + +#### Event Types + +Each Event instance inside the framework corresponds to a single **Event Type**, which is common name (e.g. "EntityChangeEvent_v1") which can be used to understand the shape of the Event. This can be thought of as a "topic" or "stream" name. That being said, Events associated with a single type are not expected to change in backwards-breaking ways across versons. + +### Event Sources + +Events are produced to the framework by **Event Sources**. Event Sources may include their own guarantees, configurations, behaviors, and semantics. They usually produce a fixed set of Event Types. + +In addition to sourcing events, Event Sources are also responsible for acking the succesful processing of an event by implementing the `ack` method. This is invoked by the framework once the Event is guaranteed to have reached the configured Action successfully. + +### Transformers + +**Transformers** are pluggable components which take an Event as input, and produce an Event (or nothing) as output. This can be used to enrich the information of an Event prior to sending it to an Action. + +Multiple Transformers can be configured to run in sequence, filtering and transforming an event in multiple steps. + +Transformers can also be used to generate a completely new type of Event (i.e. registered at runtime via the Event Registry) which can subsequently serve as input to an Action. + +Transformers can be easily customized and plugged in to meet an organization's unqique requirements. For more information on developing a Transformer, check out [Developing a Transformer](guides/developing-a-transformer.md) + +### Action + +**Actions** are pluggable components which take an Event as input and perform some business logic. Examples may be sending a Slack notification, logging to a file, +or creating a Jira ticket, etc. 
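An Action's logic runs at the end of the flow described above. Conceptually, the loop a Pipeline executes looks roughly like the sketch below; the `source`, `transforms`, and `action` objects are stand-ins for the pluggable components described in this document, not the framework's actual implementation.

```python
# Conceptual sketch of a Pipeline's event loop (not the framework's real code).
# `source`, `transforms`, and `action` stand in for the pluggable components
# described above; each Event flows source -> transforms/filters -> action.
def run_pipeline(source, transforms, action):
    for event in source.events():          # 1. poll Events from the Event Source
        for transform in transforms:       # 2. apply Transformers / filters in order
            event = transform.transform(event)
            if event is None:              # a None result means the Event was filtered out
                break
        else:
            action.act(event)              # 3. invoke the configured Action
            source.ack(event)              # acknowledge successful processing
```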
+ +Each Pipeline can be configured to have a single Action which runs after the filtering and transformations have occurred. + +Actions can be easily customized and plugged in to meet an organization's unqique requirements. For more information on developing a Action, check out [Developing a Action](guides/developing-an-action.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/events/entity-change-event.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/entity-change-event.md new file mode 100644 index 0000000000000..3fa382ee660b2 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/entity-change-event.md @@ -0,0 +1,355 @@ +--- +title: Entity Change Event V1 +slug: /actions/events/entity-change-event +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/events/entity-change-event.md +--- + +# Entity Change Event V1 + +## Event Type + +`EntityChangeEvent_v1` + +## Overview + +This Event is emitted when certain changes are made to an entity (dataset, dashboard, chart, etc) on DataHub. + +## Event Structure + +Entity Change Events are generated in a variety of circumstances, but share a common set of fields. + +### Common Fields + +| Name | Type | Description | Optional | +| ---------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- | +| entityUrn | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | False | +| entityType | String | The type of the entity being changed. Supported values include dataset, chart, dashboard, dataFlow (Pipeline), dataJob (Task), domain, tag, glossaryTerm, corpGroup, & corpUser. | False | +| category | String | The category of the change, related to the kind of operation that was performed. Examples include TAG, GLOSSARY_TERM, DOMAIN, LIFECYCLE, and more. | False | +| operation | String | The operation being performed on the entity given the category. For example, ADD ,REMOVE, MODIFY. For the set of valid operations, see the full catalog below. | False | +| modifier | String | The modifier that has been applied to the entity. The value depends on the category. An example includes the URN of a tag being applied to a Dataset or Schema Field. | True | +| parameters | Dict | Additional key-value parameters used to provide specific context. The precise contents depends on the category + operation of the event. See the catalog below for a full summary of the combinations. | True | +| auditStamp.actor | String | The urn of the actor who triggered the change. | False | +| auditStamp.time | Number | The timestamp in milliseconds corresponding to the event. | False | + +In following sections, we will provide sample events for each scenario in which Entity Change Events are fired. + +### Add Tag Event + +This event is emitted when a Tag has been added to an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "ADD", + "modifier": "urn:li:tag:PII", + "parameters": { + "tagUrn": "urn:li:tag:PII" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Tag Event + +This event is emitted when a Tag has been removed from an entity on DataHub. 
+Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "REMOVE", + "modifier": "urn:li:tag:PII", + "parameters": { + "tagUrn": "urn:li:tag:PII" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Glossary Term Event + +This event is emitted when a Glossary Term has been added to an entity on DataHub. +Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "GLOSSARY_TERM", + "operation": "ADD", + "modifier": "urn:li:glossaryTerm:ExampleNode.ExampleTerm", + "parameters": { + "termUrn": "urn:li:glossaryTerm:ExampleNode.ExampleTerm" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Glossary Term Event + +This event is emitted when a Glossary Term has been removed from an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "GLOSSARY_TERM", + "operation": "REMOVE", + "modifier": "urn:li:glossaryTerm:ExampleNode.ExampleTerm", + "parameters": { + "termUrn": "urn:li:glossaryTerm:ExampleNode.ExampleTerm" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Domain Event + +This event is emitted when Domain has been added to an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DOMAIN", + "operation": "ADD", + "modifier": "urn:li:domain:ExampleDomain", + "parameters": { + "domainUrn": "urn:li:domain:ExampleDomain" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Domain Event + +This event is emitted when Domain has been removed from an entity on DataHub. +Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DOMAIN", + "operation": "REMOVE", + "modifier": "urn:li:domain:ExampleDomain", + "parameters": { + "domainUrn": "urn:li:domain:ExampleDomain" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Owner Event + +This event is emitted when a new owner has been assigned to an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "OWNER", + "operation": "ADD", + "modifier": "urn:li:corpuser:jdoe", + "parameters": { + "ownerUrn": "urn:li:corpuser:jdoe", + "ownerType": "BUSINESS_OWNER" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Owner Event + +This event is emitted when an existing owner has been removed from an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "OWNER", + "operation": "REMOVE", + "modifier": "urn:li:corpuser:jdoe", + "parameters": { + "ownerUrn": "urn:li:corpuser:jdoe", + "ownerType": "BUSINESS_OWNER" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Modify Deprecation Event + +This event is emitted when the deprecation status of an entity has been modified on DataHub. 
+ +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DEPRECATION", + "operation": "MODIFY", + "modifier": "DEPRECATED", + "parameters": { + "status": "DEPRECATED" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Dataset Schema Field Event + +This event is emitted when a new field has been added to a Dataset Schema. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TECHNICAL_SCHEMA", + "operation": "ADD", + "modifier": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "parameters": { + "fieldUrn": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "fieldPath": "newFieldName", + "nullable": false + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Dataset Schema Field Event + +This event is emitted when a new field has been remove from a Dataset Schema. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TECHNICAL_SCHEMA", + "operation": "REMOVE", + "modifier": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "parameters": { + "fieldUrn": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "fieldPath": "newFieldName", + "nullable": false + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Create Event + +This event is emitted when a new entity has been created on DataHub. +Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "CREATE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Soft-Delete Event + +This event is emitted when a new entity has been soft-deleted on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "SOFT_DELETE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Hard-Delete Event + +This event is emitted when a new entity has been hard-deleted on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "HARD_DELETE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/events/metadata-change-log-event.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/metadata-change-log-event.md new file mode 100644 index 0000000000000..11db1bdfb4718 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/metadata-change-log-event.md @@ -0,0 +1,155 @@ +--- +title: Metadata Change Log Event V1 +slug: /actions/events/metadata-change-log-event +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/events/metadata-change-log-event.md +--- + +# Metadata Change Log Event V1 + +## Event Type + +`MetadataChangeLog_v1` + +## Overview + +This event is emitted when any aspect on DataHub Metadata Graph is changed. This includes creates, updates, and removals of both "versioned" aspects and "time-series" aspects. + +> Disclaimer: This event is quite powerful, but also quite low-level. 
Because it exposes the underlying metadata model directly, it is subject to more frequent structural and semantic changes than the higher level [Entity Change Event](entity-change-event.md). We recommend using that event instead to achieve your use case when possible. + +## Event Structure + +The fields include + +| Name | Type | Description | Optional | +| ------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- | +| entityUrn | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | False | +| entityType | String | The type of the entity being changed. Supported values include dataset, chart, dashboard, dataFlow (Pipeline), dataJob (Task), domain, tag, glossaryTerm, corpGroup, & corpUser. | False | +| entityKeyAspect | Object | The key struct of the entity that was changed. Only present if the Metadata Change Proposal contained the raw key struct. | True | +| changeType | String | The change type. UPSERT or DELETE are currently supported. | False | +| aspectName | String | The entity aspect which was changed. | False | +| aspect | Object | The new aspect value. Null if the aspect was deleted. | True | +| aspect.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json`. | False | +| aspect.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| previousAspectValue | Object | The previous aspect value. Null if the aspect did not exist previously. | True | +| previousAspectValue.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json` | False | +| previousAspectValue.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| systemMetadata | Object | The new system metadata. This includes the the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | +| previousSystemMetadata | Object | The previous system metadata. This includes the the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | +| created | Object | Audit stamp about who triggered the Metadata Change and when. | False | +| created.time | Number | The timestamp in milliseconds when the aspect change occurred. | False | +| created.actor | String | The URN of the actor (e.g. corpuser) that triggered the change. 
| + +### Sample Events + +#### Tag Change Event + +```json +{ + "entityType": "container", + "entityUrn": "urn:li:container:DATABASE", + "entityKeyAspect": null, + "changeType": "UPSERT", + "aspectName": "globalTags", + "aspect": { + "value": "{\"tags\":[{\"tag\":\"urn:li:tag:pii\"}]}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516475595, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousAspectValue": null, + "previousSystemMetadata": null, + "created": { + "time": 1651516475594, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +#### Glossary Term Change Event + +```json +{ + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "entityKeyAspect": null, + "changeType": "UPSERT", + "aspectName": "glossaryTerms", + "aspect": { + "value": "{\"auditStamp\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516599479},\"terms\":[{\"urn\":\"urn:li:glossaryTerm:CustomerAccount\"}]}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516599486, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousAspectValue": null, + "previousSystemMetadata": null, + "created": { + "time": 1651516599480, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +#### Owner Change Event + +```json +{ + "auditHeader": null, + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "entityKeyAspect": null, + "changeType": "UPSERT", + "aspectName": "ownership", + "aspect": { + "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516640493, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousAspectValue": { + "value": "{\"owners\":[{\"owner\":\"urn:li:corpuser:jdoe\",\"type\":\"DATAOWNER\"},{\"owner\":\"urn:li:corpuser:datahub\",\"type\":\"DATAOWNER\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:jdoe\",\"time\":1581407189000}}", + "contentType": "application/json" + }, + "previousSystemMetadata": { + "lastObserved": 1651516415088, + "runId": "file-2022_05_02-11_33_35", + "registryName": null, + "registryVersion": null, + "properties": null + }, + "created": { + "time": 1651516640490, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +## FAQ + +### Where can I find all the aspects and their schemas? + +Great Question! All MetadataChangeLog events are based on the Metadata Model which is comprised of Entities, +Aspects, and Relationships which make up an enterprise Metadata Graph. We recommend checking out the following +resources to learn more about this: + +- [Intro to Metadata Model](/docs/metadata-modeling/metadata-model) + +You can also find a comprehensive list of Entities + Aspects of the Metadata Model under the **Metadata Modeling > Entities** section of the [official DataHub docs](/docs/). 
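### How do I inspect the aspect payload inside an Action?

Because `aspect.value` is itself a JSON-encoded string, it has to be decoded before it can be inspected. Below is a minimal sketch that extracts tag URNs from the Tag Change sample above; it uses plain `json` handling and assumes the event has already been parsed into a dictionary, with no DataHub-specific classes involved.

```python
# Decode the serialized aspect carried by a MetadataChangeLog event.
# `mcl` is assumed to be the event already parsed into a dict (see the
# Tag Change sample above); aspect.value is a JSON-encoded string.
import json
from typing import List


def extract_tags(mcl: dict) -> List[str]:
    aspect = mcl.get("aspect")
    if not aspect or mcl.get("aspectName") != "globalTags":
        return []
    payload = json.loads(aspect["value"])  # e.g. {"tags": [{"tag": "urn:li:tag:pii"}]}
    return [entry["tag"] for entry in payload.get("tags", [])]
```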
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-a-transformer.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-a-transformer.md new file mode 100644 index 0000000000000..5ee579175a58f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-a-transformer.md @@ -0,0 +1,136 @@ +--- +title: Developing a Transformer +slug: /actions/guides/developing-a-transformer +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/guides/developing-a-transformer.md +--- + +# Developing a Transformer + +In this guide, we will outline each step to developing a custom Transformer for the DataHub Actions Framework. + +## Overview + +Developing a DataHub Actions Transformer is a matter of extending the `Transformer` base class in Python, installing your +Transformer to make it visible to the framework, and then configuring the framework to use the new Transformer. + +## Step 1: Defining a Transformer + +To implement an Transformer, we'll need to extend the `Transformer` base class and override the following functions: + +- `create()` - This function is invoked to instantiate the action, with a free-form configuration dictionary + extracted from the Actions configuration file as input. +- `transform()` - This function is invoked when an Event is received. It should contain the core logic of the Transformer. + and will return the transformed Event, or `None` if the Event should be filtered. + +Let's start by defining a new implementation of Transformer called `CustomTransformer`. We'll keep it simple-- this Transformer will +print the configuration that is provided when it is created, and print any Events that it receives. + +```python +# custom_transformer.py +from datahub_actions.transform.transformer import Transformer +from datahub_actions.event.event import EventEnvelope +from datahub_actions.pipeline.pipeline_context import PipelineContext +from typing import Optional + +class CustomTransformer(Transformer): + @classmethod + def create(cls, config_dict: dict, ctx: PipelineContext) -> "Transformer": + # Simply print the config_dict. + print(config_dict) + return cls(config_dict, ctx) + + def __init__(self, ctx: PipelineContext): + self.ctx = ctx + + def transform(self, event: EventEnvelope) -> Optional[EventEnvelope]: + # Simply print the received event. + print(event) + # And return the original event (no-op) + return event +``` + +## Step 2: Installing the Transformer + +Now that we've defined the Transformer, we need to make it visible to the framework by making +it available in the Python runtime environment. + +The easiest way to do this is to just place it in the same directory as your configuration file, in which case the module name is the same as the file +name - in this case it will be `custom_transformer`. + +### Advanced: Installing as a Package + +Alternatively, create a `setup.py` file in the same directory as the new Transformer to convert it into a package that pip can understand. + +``` +from setuptools import find_packages, setup + +setup( + name="custom_transformer_example", + version="1.0", + packages=find_packages(), + # if you don't already have DataHub Actions installed, add it under install_requires + # install_requires=["acryl-datahub-actions"] +) +``` + +Next, install the package + +```shell +pip install -e . +``` + +inside the module. (alt.`python setup.py`). 
+ +Once we have done this, our class will be referencable via `custom_transformer_example.custom_transformer:CustomTransformer`. + +## Step 3: Running the Action + +Now that we've defined our Transformer, we can create an Action configuration file that refers to the new Transformer. +We will need to provide the fully-qualified Python module & class name when doing so. + +_Example Configuration_ + +```yaml +# custom_transformer_action.yaml +name: "custom_transformer_test" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +transform: + - type: "custom_transformer_example.custom_transformer:CustomTransformer" + config: + # Some sample configuration which should be printed on create. + config1: value1 +action: + # Simply reuse the default hello_world action + type: "hello_world" +``` + +Next, run the `datahub actions` command as usual: + +```shell +datahub actions -c custom_transformer_action.yaml +``` + +If all is well, your Transformer should now be receiving & printing Events. + +### (Optional) Step 4: Contributing the Transformer + +If your Transformer is generally applicable, you can raise a PR to include it in the core Transformer library +provided by DataHub. All Transformers will live under the `datahub_actions/plugin/transform` directory inside the +[datahub-actions](https://github.com/acryldata/datahub-actions) repository. + +Once you've added your new Transformer there, make sure that you make it discoverable by updating the `entry_points` section +of the `setup.py` file. This allows you to assign a globally unique name for you Transformer, so that people can use +it without defining the full module path. + +#### Prerequisites: + +Prerequisites to consideration for inclusion in the core Transformer library include + +- **Testing** Define unit tests for your Transformer +- **Deduplication** Confirm that no existing Transformer serves the same purpose, or can be easily extended to serve the same purpose diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-an-action.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-an-action.md new file mode 100644 index 0000000000000..2a392a696b0fa --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-an-action.md @@ -0,0 +1,135 @@ +--- +title: Developing an Action +slug: /actions/guides/developing-an-action +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/guides/developing-an-action.md +--- + +# Developing an Action + +In this guide, we will outline each step to developing a Action for the DataHub Actions Framework. + +## Overview + +Developing a DataHub Action is a matter of extending the `Action` base class in Python, installing your +Action to make it visible to the framework, and then configuring the framework to use the new Action. + +## Step 1: Defining an Action + +To implement an Action, we'll need to extend the `Action` base class and override the following functions: + +- `create()` - This function is invoked to instantiate the action, with a free-form configuration dictionary + extracted from the Actions configuration file as input. +- `act()` - This function is invoked when an Action is received. It should contain the core logic of the Action. +- `close()` - This function is invoked when the framework has issued a shutdown of the pipeline. 
It should be used + to cleanup any processes happening inside the Action. + +Let's start by defining a new implementation of Action called `CustomAction`. We'll keep it simple-- this Action will +print the configuration that is provided when it is created, and print any Events that it receives. + +```python +# custom_action.py +from datahub_actions.action.action import Action +from datahub_actions.event.event_envelope import EventEnvelope +from datahub_actions.pipeline.pipeline_context import PipelineContext + +class CustomAction(Action): + @classmethod + def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action": + # Simply print the config_dict. + print(config_dict) + return cls(ctx) + + def __init__(self, ctx: PipelineContext): + self.ctx = ctx + + def act(self, event: EventEnvelope) -> None: + # Do something super important. + # For now, just print. :) + print(event) + + def close(self) -> None: + pass +``` + +## Step 2: Installing the Action + +Now that we've defined the Action, we need to make it visible to the framework by making it +available in the Python runtime environment. + +The easiest way to do this is to just place it in the same directory as your configuration file, in which case the module name is the same as the file +name - in this case it will be `custom_action`. + +### Advanced: Installing as a Package + +Alternatively, create a `setup.py` file in the same directory as the new Action to convert it into a package that pip can understand. + +``` +from setuptools import find_packages, setup + +setup( + name="custom_action_example", + version="1.0", + packages=find_packages(), + # if you don't already have DataHub Actions installed, add it under install_requires + # install_requires=["acryl-datahub-actions"] +) +``` + +Next, install the package + +```shell +pip install -e . +``` + +inside the module. (alt.`python setup.py`). + +Once we have done this, our class will be referencable via `custom_action_example.custom_action:CustomAction`. + +## Step 3: Running the Action + +Now that we've defined our Action, we can create an Action configuration file that refers to the new Action. +We will need to provide the fully-qualified Python module & class name when doing so. + +_Example Configuration_ + +```yaml +# custom_action.yaml +name: "custom_action_test" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +action: + type: "custom_action_example.custom_action:CustomAction" + config: + # Some sample configuration which should be printed on create. + config1: value1 +``` + +Next, run the `datahub actions` command as usual: + +```shell +datahub actions -c custom_action.yaml +``` + +If all is well, your Action should now be receiving & printing Events. + +## (Optional) Step 4: Contributing the Action + +If your Action is generally applicable, you can raise a PR to include it in the core Action library +provided by DataHub. All Actions will live under the `datahub_actions/plugin/action` directory inside the +[datahub-actions](https://github.com/acryldata/datahub-actions) repository. + +Once you've added your new Action there, make sure that you make it discoverable by updating the `entry_points` section +of the `setup.py` file. This allows you to assign a globally unique name for you Action, so that people can use +it without defining the full module path. 
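For reference, registering the Action this way amounts to adding an `entry_points` stanza to the `setup.py` shown earlier. The sketch below is a guess at what that looks like: the entry-point group name is an assumption, so mirror the group used by the existing Actions in the [datahub-actions](https://github.com/acryldata/datahub-actions) repository's `setup.py`.

```python
# setup.py -- sketch of registering the Action under a short, globally unique name.
# The entry-point group below is an assumption; copy the exact group name used by
# the existing Actions in the datahub-actions repository.
from setuptools import find_packages, setup

setup(
    name="custom_action_example",
    version="1.0",
    packages=find_packages(),
    entry_points={
        "datahub_actions.action.plugins": [
            "my_custom_action = custom_action_example.custom_action:CustomAction",
        ],
    },
)
```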
+ +### Prerequisites: + +Prerequisites to consideration for inclusion in the core Actions library include + +- **Testing** Define unit tests for your Action +- **Deduplication** Confirm that no existing Action serves the same purpose, or can be easily extended to serve the same purpose diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/quickstart.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/quickstart.md new file mode 100644 index 0000000000000..2a7b563f60157 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/quickstart.md @@ -0,0 +1,176 @@ +--- +title: Quickstart +slug: /actions/quickstart +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/quickstart.md +--- + +# DataHub Actions Quickstart + +## Prerequisites + +The DataHub Actions CLI commands are an extension of the base `datahub` CLI commands. We recommend +first installing the `datahub` CLI: + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +datahub --version +``` + +> Note that the Actions Framework requires a version of `acryl-datahub` >= v0.8.34 + +## Installation + +To install DataHub Actions, you need to install the `acryl-datahub-actions` package from PyPi + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub-actions + +# Verify the installation by checking the version. +datahub actions version +``` + +### Hello World + +DataHub ships with a "Hello World" Action which logs all events it receives to the console. +To run this action, simply create a new Action configuration file: + +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +action: + type: "hello_world" +``` + +and then run it using the `datahub actions` command: + +```shell +datahub actions -c hello_world.yaml +``` + +You should the see the following output if the Action has been started successfully: + +```shell +Action Pipeline with name 'hello_world' is now running. +``` + +Now, navigate to the instance of DataHub that you've connected to and perform an Action such as + +- Adding / removing a Tag +- Adding / removing a Glossary Term +- Adding / removing a Domain + +If all is well, you should see some events being logged to the console + +```shell +Hello world! Received event: +{ + "event_type": "EntityChangeEvent_v1", + "event": { + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "category": "TAG", + "operation": "ADD", + "modifier": "urn:li:tag:pii", + "parameters": {}, + "auditStamp": { + "time": 1651082697703, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + }, + "version": 0, + "source": null + }, + "meta": { + "kafka": { + "topic": "PlatformEvent_v1", + "offset": 1262, + "partition": 0 + } + } +} +``` + +_An example of an event emitted when a 'pii' tag has been added to a Dataset._ + +Woohoo! You've successfully started using the Actions framework. Now, let's see how we can get fancy. + +#### Filtering events + +If we know which Event types we'd like to consume, we can optionally add a `filter` configuration, which +will prevent events that do not match the filter from being forwarded to the action. 
+ +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +filter: + event_type: "EntityChangeEvent_v1" +action: + type: "hello_world" +``` + +_Filtering for events of type EntityChangeEvent_v1 only_ + +#### Advanced Filtering + +Beyond simply filtering by event type, we can also filter events by matching against the values of their fields. To do so, +use the `event` block. Each field provided will be compared against the real event's value. An event that matches +**all** of the fields will be forwarded to the action. + +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +filter: + event_type: "EntityChangeEvent_v1" + event: + category: "TAG" + operation: "ADD" + modifier: "urn:li:tag:pii" +action: + type: "hello_world" +``` + +_This filter only matches events representing "PII" tag additions to an entity._ + +And more, we can achieve "OR" semantics on a particular field by providing an array of values. + +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +filter: + event_type: "EntityChangeEvent_v1" + event: + category: "TAG" + operation: ["ADD", "REMOVE"] + modifier: "urn:li:tag:pii" +action: + type: "hello_world" +``` + +_This filter only matches events representing "PII" tag additions to OR removals from an entity. How fancy!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/sources/kafka-event-source.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/sources/kafka-event-source.md new file mode 100644 index 0000000000000..3584ced1d61c6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/sources/kafka-event-source.md @@ -0,0 +1,97 @@ +--- +title: Kafka Event Source +slug: /actions/sources/kafka-event-source +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/sources/kafka-event-source.md +--- + +# Kafka Event Source + +## Overview + +The Kafka Event Source is the default Event Source used within the DataHub Actions Framework. + +Under the hood, the Kafka Event Source uses a Kafka Consumer to subscribe to the topics streaming +out of DataHub (MetadataChangeLog_v1, PlatformEvent_v1). Each Action is automatically placed into a unique +[consumer group](https://docs.confluent.io/platform/current/clients/consumer.html#consumer-groups) based on +the unique `name` provided inside the Action configuration file. + +This means that you can easily scale-out Actions processing by sharing the same Action configuration file across +multiple nodes or processes. As long as the `name` of the Action is the same, each instance of the Actions framework will subscribe as a member in the same Kafka Consumer Group, which allows for load balancing the +topic traffic across consumers which each consume independent [partitions](https://developer.confluent.io/learn-kafka/apache-kafka/partitions/#kafka-partitioning). + +Because the Kafka Event Source uses consumer groups by default, actions using this source will be **stateful**. 
+This means that Actions will keep track of their processing offsets of the upstream Kafka topics. If you +stop an Action and restart it sometime later, it will first "catch up" by processing the messages that the topic +has received since the Action last ran. Be mindful of this - if your Action is computationally expensive, it may be preferable to start consuming from the end of the log, instead of playing catch up. The easiest way to achieve this is to simply rename the Action inside the Action configuration file - this will create a new Kafka Consumer Group which will begin processing new messages at the end of the log (latest policy). + +### Processing Guarantees + +This event source implements an "ack" function which is invoked if and only if an event is successfully processed +by the Actions framework, meaning that the event made it through the Transformers and into the Action without +any errors. Under the hood, the "ack" method synchronously commits Kafka Consumer Offsets on behalf of the Action. This means that by default, the framework provides _at-least once_ processing semantics. That is, in the unusual case that a failure occurs when attempting to commit offsets back to Kafka, that event may be replayed on restart of the Action. + +If you've configured your Action pipeline `failure_mode` to be `CONTINUE` (the default), then events which +fail to be processed will simply be logged to a `failed_events.log` file for further investigation (dead letter queue). The Kafka Event Source will continue to make progress against the underlying topics and continue to commit offsets even in the case of failed messages. + +If you've configured your Action pipeline `failure_mode` to be `THROW`, then events which fail to be processed result in an Action Pipeline error. This in turn terminates the pipeline before committing offsets back to Kafka. Thus the message will not be marked as "processed" by the Action consumer. + +## Supported Events + +The Kafka Event Source produces + +- [Entity Change Event V1](../events/entity-change-event.md) +- [Metadata Change Log V1](../events/metadata-change-log-event.md) + +## Configure the Event Source + +Use the following config(s) to get started with the Kafka Event Source. + +```yml +name: "pipeline-name" +source: + type: "kafka" + config: + # Connection-related configuration + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + # Dictionary of freeform consumer configs propagated to underlying Kafka Consumer + consumer_config: + #security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + #ssl.keystore.location: ${KAFKA_PROPERTIES_SSL_KEYSTORE_LOCATION:-/mnt/certs/keystore} + #ssl.truststore.location: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_LOCATION:-/mnt/certs/truststore} + #ssl.keystore.password: ${KAFKA_PROPERTIES_SSL_KEYSTORE_PASSWORD:-keystore_password} + #ssl.key.password: ${KAFKA_PROPERTIES_SSL_KEY_PASSWORD:-keystore_password} + #ssl.truststore.password: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_PASSWORD:-truststore_password} + # Topic Routing - which topics to read from. + topic_routes: + mcl: ${METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME:-MetadataChangeLog_Versioned_v1} # Topic name for MetadataChangeLog_v1 events. + pe: ${PLATFORM_EVENT_TOPIC_NAME:-PlatformEvent_v1} # Topic name for PlatformEvent_v1 events. +action: + # action configs +``` + ++ +
+ +to an after that looks like this + ++ +
+ +That is, a move away from patterns of strong-typing-everywhere to a more generic + flexible world. + +### How will we do it? + +We will accomplish this by building the following: + +1. Set of custom annotations to permit declarative entity, search, graph configurations + - @Entity & @Aspect + - @Searchable + - @Relationship +2. Entity Registry: In-memory structures for representing, storing & serving metadata associated with a particular Entity, including search and relationship configurations. +3. Generic Entity, Search, Graph Service classes: Replaces traditional strongly-typed DAOs with flexible, pluggable APIs that can be used for CRUD, search, and graph across all entities. +4. Generic Rest.li Resources: + - 1 permitting reading, writing, searching, autocompleting, and browsing arbitrary entities + - 1 permitting reading of arbitrary entity-entity relationship edges +5. Generic Search Index Builder: Given a MAE and a specification of the Search Configuration for an entity, updates the search index. +6. Generic Graph Index Builder: Given a MAE and a specification of the Relationship Configuration for an entity, updates the graph index. +7. Generic Index + Mappings Builder: Dynamically generates index mappings and creates indices on the fly. +8. Introduce of special aspects to address other imperative code requirements + - BrowsePaths Aspect: Include an aspect to permit customization of the indexed browse paths. + - Key aspects: Include "virtual" aspects for representing the fields that uniquely identify an Entity for easy + reading by clients of DataHub. + +### Final Developer Experience: Defining an Entity + +We will outline what the experience of adding a new Entity should look like. We will imagine we want to define a "Service" entity representing +online microservices. + +#### Step 1. Add aspects + +ServiceKey.pdl + +``` +namespace com.linkedin.metadata.key + +/** + * Key for a Service + */ +@Aspect = { + "name": "serviceKey" +} +record ServiceKey { + /** + * Name of the service + */ + @Searchable = { + "fieldType": "TEXT_PARTIAL", + "enableAutocomplete": true + } + name: string +} +``` + +ServiceInfo.pdl + +``` +namespace com.linkedin.service + +import com.linkedin.common.Urn + +/** + * Properties associated with a Tag + */ +@Aspect = { + "name": "serviceInfo" +} +record ServiceInfo { + + /** + * Description of the service + */ + @Searchable = {} + description: string + + /** + * The owners of the + */ + @Relationship = { + "name": "OwnedBy", + "entityTypes": ["corpUser"] + } + owner: Urn +} +``` + +#### Step 2. Add aspect union. + +ServiceAspect.pdl + +``` +namespace com.linkedin.metadata.aspect + +import com.linkedin.metadata.key.ServiceKey +import com.linkedin.service.ServiceInfo +import com.linkedin.common.BrowsePaths + +/** + * Service Info + */ +typeref ServiceAspect = union[ + ServiceKey, + ServiceInfo, + BrowsePaths +] +``` + +#### Step 3. Add Snapshot model. + +ServiceSnapshot.pdl + +``` +namespace com.linkedin.metadata.snapshot + +import com.linkedin.common.Urn +import com.linkedin.metadata.aspect.ServiceAspect + +@Entity = { + "name": "service", + "keyAspect": "serviceKey" +} +record ServiceSnapshot { + + /** + * Urn for the service + */ + urn: Urn + + /** + * The list of service aspects + */ + aspects: array[ServiceAspect] +} +``` + +#### Step 4. Update Snapshot union. + +Snapshot.pdl + +``` +namespace com.linkedin.metadata.snapshot + +/** + * A union of all supported metadata snapshot types. + */ +typeref Snapshot = union[ + ... 
+ ServiceSnapshot +] +``` + +### Interacting with New Entity + +1. Write Entity + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.ServiceSnapshot":{ + "urn": "urn:li:service:mydemoservice", + "aspects":[ + { + "com.linkedin.service.ServiceInfo":{ + "description":"My demo service", + "owner": "urn:li:corpuser:user1" + } + }, + { + "com.linkedin.common.BrowsePaths":{ + "paths":[ + "/my/custom/browse/path1", + "/my/custom/browse/path2" + ] + } + } + ] + } + } + } +}' +``` + +2. Read Entity + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3Aservice%3Amydemoservice' -H 'X-RestLi-Protocol-Version:2.0.0' +``` + +3. Search Entity + +``` +curl --location --request POST 'http://localhost:8080/entities?action=search' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "input": "My demo", + "entity": "service", + "start": 0, + "count": 10 +}' +``` + +4. Autocomplete + +``` +curl --location --request POST 'http://localhost:8080/entities?action=autocomplete' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "mydem", + "entity": "service", + "limit": 10 +}' +``` + +5. Browse + +``` +curl --location --request POST 'http://localhost:8080/entities?action=browse' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "path": "/my/custom/browse", + "entity": "service", + "start": 0, + "limit": 10 +}' +``` + +6. Relationships + +``` +curl --location --request GET 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acorpuser%3Auser1&types=OwnedBy' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-upgrade.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-upgrade.md new file mode 100644 index 0000000000000..091a877b62911 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-upgrade.md @@ -0,0 +1,212 @@ +--- +title: No Code Upgrade (In-Place Migration Guide) +slug: /advanced/no-code-upgrade +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/no-code-upgrade.md +--- + +# No Code Upgrade (In-Place Migration Guide) + +## Summary of changes + +With the No Code metadata initiative, we've introduced various major changes: + +1. New Ebean Aspect table (metadata_aspect_v2) +2. New Elastic Indices (*entityName*index_v2) +3. New edge triples. (Remove fully qualified classpaths from nodes & edges) +4. Dynamic DataPlatform entities (no more hardcoded DataPlatformInfo.json) +5. Dynamic Browse Paths (no more hardcoded browse path creation logic) +6. Addition of Entity Key aspects, dropped requirement for strongly-typed Urns. +7. Addition of @Entity, @Aspect, @Searchable, @Relationship annotations to existing models. + +Because of these changes, it is required that your persistence layer be migrated after the NoCode containers have been +deployed. + +For more information about the No Code Update, please see [no-code-modeling](./no-code-modeling.md). + +## Migration strategy + +We are merging these breaking changes into the main branch upfront because we feel they are fundamental to subsequent +changes, providing a more solid foundation upon which exciting new features will be built upon. 
We will continue to
+offer limited support for previous versions of DataHub.
+
+This approach means that companies that actively deploy the latest version of DataHub will need to perform an upgrade to
+continue operating DataHub smoothly.
+
+## Upgrade Steps
+
+### Step 1: Pull & deploy latest container images
+
+It is important that the following containers are pulled and deployed simultaneously:
+
+- datahub-frontend-react
+- datahub-gms
+- datahub-mae-consumer
+- datahub-mce-consumer
+
+#### Docker Compose Deployments
+
+From the `docker` directory:
+
+```shell
+docker-compose down --remove-orphans && docker-compose pull && docker-compose -p datahub up --force-recreate
+```
+
+#### Helm
+
+Deploying the latest helm charts will upgrade all components to version 0.8.0. Once all the pods are up and running, it will
+run the datahub-upgrade job, which will run the above docker container to migrate to the new sources.
+
+### Step 2: Execute Migration Job
+
+#### Docker Compose Deployments - Preserve Data
+
+If you do not care about migrating your data, you can refer to the Docker Compose Deployments - Lose All Existing Data
+section below.
+
+To migrate existing data, the easiest option is to execute the `run_upgrade.sh` script located under `docker/datahub-upgrade/nocode`.
+
+```
+cd docker/datahub-upgrade/nocode
+./run_upgrade.sh
+```
+
+Using this command, the default environment variables will be used (`docker/datahub-upgrade/env/docker.env`). These assume
+that your deployment is local & that you are running MySQL. If this is not the case, you'll need to define your own environment variables to tell the
+upgrade system where your DataHub containers reside and run.
+
+To update the default environment variables, you can either
+
+1. Change `docker/datahub-upgrade/env/docker.env` in place and then run one of the above commands OR
+2. Define a new ".env" file containing your variables and execute `docker pull acryldata/datahub-upgrade && docker run acryldata/datahub-upgrade:latest -u NoCodeDataMigration`
+
+To see the required environment variables, see the [datahub-upgrade](../../docker/datahub-upgrade/README.md)
+documentation.
+
+To run the upgrade against a database other than MySQL, you can use the `-a dbType=
+ +Please refer to [Querying with GraphQL](https://learning.postman.com/docs/sending-requests/graphql/graphql/) in the Postman documentation for more information. + +### Authentication + Authorization + +In general, you'll need to provide an [Access Token](../../authentication/personal-access-tokens.md) when querying the GraphQL by +providing an `Authorization` header containing a `Bearer` token. The header should take the following format: + +```bash +Authorization: Bearer+ +
+ +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties +{ + "datasetProperties": { + "customProperties": { + "encoding": "utf-8" + }, + "description": "table containing all the users deleted on a single day", + "tags": [] + } +} +``` + +## Add Custom Properties programmatically + +The following code adds custom properties `cluster_name` and `retention_time` to a dataset named `fct_users_deleted` without affecting existing properties. + ++ +
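+Below is a minimal Python sketch of one way to do this with the `acryl-datahub` SDK (a local GMS at `http://localhost:8080` is assumed): read the existing `datasetProperties` aspect, merge in the new keys, and emit the aspect back so that existing properties such as `encoding` are preserved.
+
+```python
+# Sketch: read-modify-write the datasetProperties aspect to add custom properties.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
+from datahub.metadata.schema_classes import DatasetPropertiesClass
+
+graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+# Fetch the current aspect (may be None if it has never been written).
+properties = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DatasetPropertiesClass)
+properties = properties or DatasetPropertiesClass()
+
+# Merge in the new custom properties without touching anything else.
+properties.customProperties.update(
+    {"cluster_name": "datahubproject.acryl.io", "retention_time": "2 years"}
+)
+
+graph.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
+```
+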
+
+We can also verify this operation by programmatically checking the `datasetProperties` aspect after running this code using the `datahub` cli.
+
+```shell
+datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties
+{
+  "datasetProperties": {
+    "customProperties": {
+      "encoding": "utf-8",
+      "cluster_name": "datahubproject.acryl.io",
+      "retention_time": "2 years"
+    },
+    "description": "table containing all the users deleted on a single day",
+    "tags": []
+  }
+}
+```
+
+## Add and Remove Custom Properties programmatically
+
+The following code shows you how you can add and remove custom properties in the same call: it adds the custom property `cluster_name` and removes the property `retention_time` from a dataset named `fct_users_deleted`, without affecting its existing properties.
+
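+As a sketch, the same change can be expressed as a single patch using the SDK's dataset patch builder (the `DatasetPatchBuilder` helper and its `add_custom_property` / `remove_custom_property` methods are assumed here), which leaves all other properties untouched.
+
+```python
+# Sketch: add one custom property and remove another in a single patch.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.specific.dataset import DatasetPatchBuilder
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+patch_mcps = (
+    DatasetPatchBuilder(dataset_urn)
+    .add_custom_property("cluster_name", "datahubproject.acryl.io")
+    .remove_custom_property("retention_time")
+    .build()
+)
+for mcp in patch_mcps:
+    emitter.emit(mcp)
+```
+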
+ +We can also verify this operation programmatically by checking the `datasetProperties` aspect using the `datahub` cli. + +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties +{ + "datasetProperties": { + "customProperties": { + "encoding": "utf-8", + "cluster_name": "datahubproject.acryl.io" + }, + "description": "table containing all the users deleted on a single day", + "tags": [] + } +} +``` + +## Replace Custom Properties programmatically + +The following code replaces the current custom properties with a new properties map that includes only the properties `cluster_name` and `retention_time`. After running this code, the previous `encoding` property will be removed. + ++ +
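+A sketch of the replacement using the same assumed patch builder (`set_custom_properties` swaps out the whole `customProperties` map while leaving the rest of the `datasetProperties` aspect, such as the description, alone):
+
+```python
+# Sketch: replace the entire customProperties map on the dataset.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.specific.dataset import DatasetPatchBuilder
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+new_properties = {"cluster_name": "datahubproject.acryl.io", "retention_time": "2 years"}
+for mcp in DatasetPatchBuilder(dataset_urn).set_custom_properties(new_properties).build():
+    emitter.emit(mcp)
+```
+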
+
+We can also verify this operation programmatically by checking the `datasetProperties` aspect using the `datahub` cli.
+
+```shell
+datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties
+{
+  "datasetProperties": {
+    "customProperties": {
+      "cluster_name": "datahubproject.acryl.io",
+      "retention_time": "2 years"
+    },
+    "description": "table containing all the users deleted on a single day",
+    "tags": []
+  }
+}
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/datasets.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/datasets.md
new file mode 100644
index 0000000000000..30a732f36f38b
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/datasets.md
@@ -0,0 +1,211 @@
+---
+title: Dataset
+slug: /api/tutorials/datasets
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/datasets.md
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+# Dataset
+
+## Why Would You Use Datasets?
+
+The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift etc.), Streams in a stream-processing environment (Kafka, Pulsar etc.), or bundles of data found as Files or Folders in data lake systems (S3, ADLS, etc.).
+For more information about datasets, refer to [Dataset](/docs/generated/metamodel/entities/dataset.md).
+
+### Goal Of This Guide
+
+This guide will show you how to
+
+- Create: create a dataset with three columns.
+- Delete: delete a dataset.
+
+## Prerequisites
+
+For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
+For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md).
+
+## Create Dataset
+
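+A sketch of creating a dataset with a three-column schema by emitting a `schemaMetadata` aspect (the platform, dataset, and field names below are illustrative, and a local GMS at `http://localhost:8080` is assumed):
+
+```python
+# Sketch: create a dataset by emitting a schemaMetadata aspect with three columns.
+from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import (
+    AuditStampClass,
+    OtherSchemaClass,
+    SchemaFieldClass,
+    SchemaFieldDataTypeClass,
+    SchemaMetadataClass,
+    StringTypeClass,
+)
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")
+
+
+def string_field(path: str, native_type: str, description: str) -> SchemaFieldClass:
+    # Helper to keep the three column definitions short.
+    return SchemaFieldClass(
+        fieldPath=path,
+        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
+        nativeDataType=native_type,  # the type as it appears in the source system
+        description=description,
+    )
+
+
+schema = SchemaMetadataClass(
+    schemaName="customer",
+    platform=make_data_platform_urn("hive"),
+    version=0,
+    hash="",
+    platformSchema=OtherSchemaClass(rawSchema=""),
+    lastModified=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
+    fields=[
+        string_field("address.zipcode", "VARCHAR(50)", "Zip code of the customer"),
+        string_field("address.street", "VARCHAR(100)", "Street address of the customer"),
+        string_field("last_sold_date", "Date", "Date of the last sale for this property"),
+    ],
+)
+
+emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema))
+```
+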
+ +## Delete Dataset + +You may want to delete a dataset if it is no longer needed, contains incorrect or sensitive information, or if it was created for testing purposes and is no longer necessary in production. +It is possible to [delete entities via CLI](/docs/how/delete-metadata.md), but a programmatic approach is necessary for scalability. + +There are two methods of deletion: soft delete and hard delete. +**Soft delete** sets the Status aspect of the entity to Removed, which hides the entity and all its aspects from being returned by the UI. +**Hard delete** physically deletes all rows for all aspects of the entity. + +For more information about soft delete and hard delete, please refer to [Removing Metadata from DataHub](/docs/how/delete-metadata.md#delete-by-urn). + ++ +
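+As a sketch, a soft delete can be performed programmatically by setting the entity's `Status` aspect to removed; hard deletes are typically issued via the `datahub delete` CLI command instead.
+
+```python
+# Sketch: soft-delete a dataset by marking its Status aspect as removed.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import StatusClass
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+# removed=True hides the entity in the UI; emitting removed=False reverses the soft delete.
+emitter.emit(
+    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=StatusClass(removed=True))
+)
+```
+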
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/deprecation.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/deprecation.md new file mode 100644 index 0000000000000..6731a99714619 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/deprecation.md @@ -0,0 +1,189 @@ +--- +title: Deprecation +slug: /api/tutorials/deprecation +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/deprecation.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Deprecation + +## Why Would You Deprecate Datasets? + +The Deprecation feature on DataHub indicates the status of an entity. For datasets, keeping the deprecation status up-to-date is important to inform users and downstream systems of changes to the dataset's availability or reliability. By updating the status, you can communicate changes proactively, prevent issues and ensure users are always using highly trusted data assets. + +### Goal Of This Guide + +This guide will show you how to read or update deprecation status of a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before updating deprecation, you need to ensure the targeted dataset is already present in your datahub. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from a sample ingestion. +::: + +## Read Deprecation + ++ +
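+A sketch of reading the `deprecation` aspect with the Python SDK, assuming a local GMS at `http://localhost:8080`:
+
+```python
+# Sketch: read the deprecation status of a dataset via the DataHub graph client.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
+from datahub.metadata.schema_classes import DeprecationClass
+
+graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
+
+deprecation = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DeprecationClass)
+if deprecation is None:
+    print("No deprecation status has been set for this dataset")
+else:
+    print(f"deprecated={deprecation.deprecated}, note={deprecation.note!r}")
+```
+
+Updating the status works the same way in reverse: emit a `DeprecationClass` aspect with the desired `deprecated` flag and note.
+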
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/descriptions.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/descriptions.md new file mode 100644 index 0000000000000..9a3e80e3e1fb0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/descriptions.md @@ -0,0 +1,613 @@ +--- +title: Description +slug: /api/tutorials/descriptions +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/descriptions.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Description + +## Why Would You Use Description on Dataset? + +Adding a description and related link to a dataset can provide important information about the data, such as its source, collection methods, and potential uses. This can help others understand the context of the data and how it may be relevant to their own work or research. Including a related link can also provide access to additional resources or related datasets, further enriching the information available to users. + +### Goal Of This Guide + +This guide will show you how to + +- Read dataset description: read a description of a dataset. +- Read column description: read a description of columns of a dataset`. +- Add dataset description: add a description and a link to dataset. +- Add column description: add a description to a column of a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before adding a description, you need to ensure the targeted dataset is already present in your datahub. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +In this example, we will add a description to `user_name `column of a dataset `fct_users_deleted`. + +## Read Description on Dataset + ++ +
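+A sketch of reading the dataset-level description from the `datasetProperties` aspect (descriptions edited in the UI land in the `editableDatasetProperties` aspect, which can be read the same way):
+
+```python
+# Sketch: read a dataset's description from its datasetProperties aspect.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
+from datahub.metadata.schema_classes import DatasetPropertiesClass
+
+graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+properties = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DatasetPropertiesClass)
+print(properties.description if properties else None)
+```
+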
+ +## Add Description on Column + ++ +
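+A sketch of adding a description to the `user_name` column by emitting an `editableSchemaMetadata` aspect (note that this overwrites existing UI-added field documentation unless you first merge with the current aspect):
+
+```python
+# Sketch: document the user_name column via the editableSchemaMetadata aspect.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import (
+    EditableSchemaFieldInfoClass,
+    EditableSchemaMetadataClass,
+)
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+field_doc = EditableSchemaFieldInfoClass(
+    fieldPath="user_name",
+    description="Name of the user who was deleted",
+)
+emitter.emit(
+    MetadataChangeProposalWrapper(
+        entityUrn=dataset_urn,
+        aspect=EditableSchemaMetadataClass(editableSchemaFieldInfo=[field_doc]),
+    )
+)
+```
+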
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/domains.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/domains.md
new file mode 100644
index 0000000000000..a42f8d3307ef6
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/domains.md
@@ -0,0 +1,357 @@
+---
+title: Domains
+slug: /api/tutorials/domains
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/domains.md
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+# Domains
+
+## Why Would You Use Domains?
+
+Domains are curated, top-level folders or categories where related assets can be explicitly grouped. Management of Domains can be centralized, or distributed out to Domain owners. Currently, an asset can belong to only one Domain at a time.
+For more information about domains, refer to [About DataHub Domains](/docs/domains.md).
+
+### Goal Of This Guide
+
+This guide will show you how to
+
+- Create a domain.
+- Read domains attached to a dataset.
+- Add a dataset to a domain.
+- Remove the domain from a dataset.
+
+## Prerequisites
+
+For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
+For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md).
+
+## Create Domain
+
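+A sketch of creating a `Marketing` domain by emitting its `domainProperties` aspect (the GraphQL API offers an equivalent mutation):
+
+```python
+# Sketch: create a new Domain entity by emitting a domainProperties aspect.
+from datahub.emitter.mce_builder import make_domain_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import DomainPropertiesClass
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+domain_urn = make_domain_urn("marketing")  # urn:li:domain:marketing
+
+domain_properties = DomainPropertiesClass(
+    name="Marketing",
+    description="Entities related to the marketing department",
+)
+emitter.emit(MetadataChangeProposalWrapper(entityUrn=domain_urn, aspect=domain_properties))
+```
+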
+ +## Read Domains + ++ +
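+A sketch of reading the domain attached to a dataset from its `domains` aspect:
+
+```python
+# Sketch: read which domain(s) a dataset belongs to.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
+from datahub.metadata.schema_classes import DomainsClass
+
+graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
+
+domains = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DomainsClass)
+print(domains.domains if domains else [])  # e.g. ['urn:li:domain:marketing']
+```
+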
+ +## Remove Domains + ++ +
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/lineage.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/lineage.md new file mode 100644 index 0000000000000..45a6f56b11316 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/lineage.md @@ -0,0 +1,322 @@ +--- +title: Lineage +slug: /api/tutorials/lineage +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/lineage.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Lineage + +## Why Would You Use Lineage? + +Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream. +For more information about lineage, refer to [About DataHub Lineage](/docs/lineage/lineage-feature-guide.md). + +### Goal Of This Guide + +This guide will show you how to + +- Add lineage between datasets. +- Add column-level lineage between datasets. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before adding lineage, you need to ensure the targeted dataset is already present in your datahub. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +## Add Lineage + ++ +
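+A sketch of adding table-level lineage by emitting an `upstreamLineage` aspect on the downstream dataset (the dataset names here are illustrative):
+
+```python
+# Sketch: declare that fct_users_deleted is derived from logging_events.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import (
+    DatasetLineageTypeClass,
+    UpstreamClass,
+    UpstreamLineageClass,
+)
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+upstream_urn = make_dataset_urn(platform="hive", name="logging_events", env="PROD")
+downstream_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+lineage = UpstreamLineageClass(
+    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
+)
+emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage))
+```
+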
+ +## Add Column-level Lineage + ++ +
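+A sketch of column-level lineage, expressed as fine-grained lineage records inside the same `upstreamLineage` aspect (class names as found in `datahub.metadata.schema_classes`; the specific column mapping is illustrative):
+
+```python
+# Sketch: map logging_events.browser_id onto fct_users_deleted.browser_id.
+from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import (
+    DatasetLineageTypeClass,
+    FineGrainedLineageClass,
+    FineGrainedLineageDownstreamTypeClass,
+    FineGrainedLineageUpstreamTypeClass,
+    UpstreamClass,
+    UpstreamLineageClass,
+)
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+upstream = make_dataset_urn(platform="hive", name="logging_events", env="PROD")
+downstream = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+column_lineage = FineGrainedLineageClass(
+    upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
+    upstreams=[make_schema_field_urn(upstream, "browser_id")],
+    downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
+    downstreams=[make_schema_field_urn(downstream, "browser_id")],
+)
+lineage = UpstreamLineageClass(
+    upstreams=[UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)],
+    fineGrainedLineages=[column_lineage],
+)
+emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream, aspect=lineage))
+```
+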
+ +## Read Lineage + ++ +
+ ++ +
+ +## Read ML Entities + +### Read MLFeature + ++ +
+ ++ +
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/owners.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/owners.md new file mode 100644 index 0000000000000..361bdd9546ea3 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/owners.md @@ -0,0 +1,536 @@ +--- +title: Ownership +slug: /api/tutorials/owners +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/owners.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Ownership + +## Why Would You Use Users and Groups? + +Users and groups are essential for managing ownership of data. +By creating or updating user accounts and assigning them to appropriate groups, administrators can ensure that the right people can access the data they need to do their jobs. +This helps to avoid confusion or conflicts over who is responsible for specific datasets and can improve the overall effectiveness. + +### Goal Of This Guide + +This guide will show you how to + +- Create: create or update users and groups. +- Read: read owners attached to a dataset. +- Add: add user group as an owner to a dataset. +- Remove: remove the owner from a dataset. + +## Pre-requisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +In this guide, ingesting sample data is optional. +::: + +## Upsert Users + ++ +
+ +## Upsert Group + ++ +
+ +## Read Owners + ++ +
+ +## Remove Owners + ++ +
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/tags.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/tags.md new file mode 100644 index 0000000000000..7ae66b5f89626 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/tags.md @@ -0,0 +1,613 @@ +--- +title: Tags +slug: /api/tutorials/tags +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/tags.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Tags + +## Why Would You Use Tags on Datasets? + +Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary. +For more information about tags, refer to [About DataHub Tags](/docs/tags.md). + +### Goal Of This Guide + +This guide will show you how to + +- Create: create a tag. +- Read : read tags attached to a dataset. +- Add: add a tag to a column of a dataset or a dataset itself. +- Remove: remove a tag from a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before modifying tags, you need to ensure the target dataset is already present in your DataHub instance. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +For more information on how to set up for GraphQL, please refer to [How To Set Up GraphQL](/docs/api/graphql/how-to-set-up-graphql.md). + +## Create Tags + +The following code creates a tag `Deprecated`. + ++ +
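+As a sketch, the tag can be created by emitting a `tagProperties` aspect for `urn:li:tag:deprecated` (a local GMS at `http://localhost:8080` is assumed):
+
+```python
+# Sketch: create the Deprecated tag by emitting its tagProperties aspect.
+from datahub.emitter.mce_builder import make_tag_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import TagPropertiesClass
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+tag_urn = make_tag_urn("deprecated")  # urn:li:tag:deprecated
+
+tag_properties = TagPropertiesClass(
+    name="Deprecated",
+    description="Having this tag means this column or table is deprecated.",
+)
+emitter.emit(MetadataChangeProposalWrapper(entityUrn=tag_urn, aspect=tag_properties))
+```
+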
+ +We can also verify this operation by programmatically searching `Deprecated` tag after running this code using the `datahub` cli. + +```shell +datahub get --urn "urn:li:tag:deprecated" --aspect tagProperties + +{ + "tagProperties": { + "description": "Having this tag means this column or table is deprecated.", + "name": "Deprecated" + } +} +``` + +## Read Tags + ++ +
+
+We can also verify this operation programmatically by checking the `globalTags` aspect using the `datahub` cli.
+
+```shell
+datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" --aspect globalTags
+
+```
+
+## Remove Tags
+
+The following code removes a tag from a dataset.
+After running this code, the `Deprecated` tag will be removed from the `user_name` column.
+
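+A sketch of removing a column-level tag: fetch the `editableSchemaMetadata` aspect, drop the tag association from the `user_name` field, and write the aspect back (dataset-level tags live in the `globalTags` aspect and can be edited the same way):
+
+```python
+# Sketch: remove the Deprecated tag from the user_name column.
+from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
+from datahub.metadata.schema_classes import EditableSchemaMetadataClass
+
+graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
+tag_to_remove = make_tag_urn("deprecated")
+
+current = graph.get_aspect(entity_urn=dataset_urn, aspect_type=EditableSchemaMetadataClass)
+if current:
+    for field_info in current.editableSchemaFieldInfo:
+        if field_info.fieldPath == "user_name" and field_info.globalTags:
+            # Keep every tag association except the one we want to remove.
+            field_info.globalTags.tags = [
+                t for t in field_info.globalTags.tags if t.tag != tag_to_remove
+            ]
+    graph.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=current))
+```
+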
+
+We can also verify this operation programmatically by checking the `globalTags` aspect using the `datahub` cli.
+
+```shell
+datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" --aspect globalTags
+
+{
+  "globalTags": {
+    "tags": []
+  }
+}
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/terms.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/terms.md
new file mode 100644
index 0000000000000..d2d0e715a4ca8
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/terms.md
@@ -0,0 +1,613 @@
+---
+title: Terms
+slug: /api/tutorials/terms
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/terms.md
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+# Terms
+
+## Why Would You Use Terms on Datasets?
+
+The Business Glossary (Term) feature in DataHub helps you use a shared vocabulary within the organization, by providing a framework for defining a standardized set of data concepts and then associating them with the physical assets that exist within your data ecosystem.
+
+For more information about terms, refer to [About DataHub Business Glossary](/docs/glossary/business-glossary.md).
+
+### Goal Of This Guide
+
+This guide will show you how to
+
+- Create: create a term.
+- Read: read terms attached to a dataset.
+- Add: add a term to a column of a dataset or a dataset itself.
+- Remove: remove a term from a dataset.
+
+## Prerequisites
+
+For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
+For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md).
+
+:::note
+Before modifying terms, you need to ensure the target dataset is already present in your DataHub instance.
+If you attempt to manipulate entities that do not exist, your operation will fail.
+In this guide, we will be using data from sample ingestion.
+:::
+
+For more information on how to set up for GraphQL, please refer to [How To Set Up GraphQL](/docs/api/graphql/how-to-set-up-graphql.md).
+
+## Create Terms
+
+The following code creates a term `Rate of Return`.
+
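+As a sketch, the term can be created by emitting a `glossaryTermInfo` aspect for `urn:li:glossaryTerm:rateofreturn`:
+
+```python
+# Sketch: create the Rate of Return glossary term via its glossaryTermInfo aspect.
+from datahub.emitter.mce_builder import make_term_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import GlossaryTermInfoClass
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+term_urn = make_term_urn("rateofreturn")  # urn:li:glossaryTerm:rateofreturn
+
+term_info = GlossaryTermInfoClass(
+    name="Rate of Return",
+    definition="A rate of return (RoR) is the net gain or loss of an investment over a specified time period.",
+    termSource="INTERNAL",
+)
+emitter.emit(MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=term_info))
+```
+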
+ +We can also verify this operation by programmatically searching `Rate of Return` term after running this code using the `datahub` cli. + +```shell +datahub get --urn "urn:li:glossaryTerm:rateofreturn" --aspect glossaryTermInfo + +{ + "glossaryTermInfo": { + "definition": "A rate of return (RoR) is the net gain or loss of an investment over a specified time period.", + "name": "Rate of Return", + "termSource": "INTERNAL" + } +} +``` + +## Read Terms + ++ +
+
+## Remove Terms
+
+The following code removes a term from a dataset.
+After running this code, the `Rate of Return` term will be removed from the `user_name` column.
+
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/architecture.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/architecture.md new file mode 100644 index 0000000000000..4e7a8e081cf63 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/architecture.md @@ -0,0 +1,45 @@ +--- +title: Overview +sidebar_label: Overview +slug: /architecture/architecture +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/architecture.md +--- + +# DataHub Architecture Overview + +DataHub is a [3rd generation](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained) Metadata Platform that enables Data Discovery, Collaboration, Governance, and end-to-end Observability +that is built for the Modern Data Stack. DataHub employs a model-first philosophy, with a focus on unlocking interoperability between +disparate tools & systems. + +The figures below describe the high-level architecture of DataHub. + ++ +
+![Acryl DataHub System Architecture ](../../../../docs/managed-datahub/imgs/saas/DataHub-Architecture.png) + +For a more detailed look at the components that make up the Architecture, check out [Components](../components.md). + +## Architecture Highlights + +There are three main highlights of DataHub's architecture. + +### Schema-first approach to Metadata Modeling + +DataHub's metadata model is described using a [serialization agnostic language](https://linkedin.github.io/rest.li/pdl_schema). Both [REST](https://github.com/datahub-project/datahub/blob/master/metadata-service) as well as [GraphQL API-s](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/graphql) are supported. In addition, DataHub supports an [AVRO-based API](https://github.com/datahub-project/datahub/blob/master/metadata-events) over Kafka to communicate metadata changes and subscribe to them. Our [roadmap](../roadmap.md) includes a milestone to support no-code metadata model edits very soon, which will allow for even more ease of use, while retaining all the benefits of a typed API. Read about metadata modeling at [metadata modeling]. + +### Stream-based Real-time Metadata Platform + +DataHub's metadata infrastructure is stream-oriented, which allows for changes in metadata to be communicated and reflected within the platform within seconds. You can also subscribe to changes happening in DataHub's metadata, allowing you to build real-time metadata-driven systems. For example, you can build an access-control system that can observe a previously world-readable dataset adding a new schema field which contains PII, and locks down that dataset for access control reviews. + +### Federated Metadata Serving + +DataHub comes with a single [metadata service (gms)](https://github.com/datahub-project/datahub/blob/master/metadata-service) as part of the open source repository. However, it also supports federated metadata services which can be owned and operated by different teams –– in fact, that is how LinkedIn runs DataHub internally. The federated services communicate with the central search index and graph using Kafka, to support global search and discovery while still enabling decoupled ownership of metadata. This kind of architecture is very amenable for companies who are implementing [data mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html). + +[metadata modeling]: ../modeling/metadata-model.md +[PDL]: https://linkedin.github.io/rest.li/pdl_schema +[metadata architectures blog post]: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained +[datahub-serving]: metadata-serving.md +[datahub-ingestion]: metadata-ingestion.md +[react-frontend]: ../../datahub-web-react/README.md diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-ingestion.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-ingestion.md new file mode 100644 index 0000000000000..16e270a304c3c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-ingestion.md @@ -0,0 +1,42 @@ +--- +title: Ingestion Framework +sidebar_label: Ingestion Framework +slug: /architecture/metadata-ingestion +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/metadata-ingestion.md +--- + +# Metadata Ingestion Architecture + +DataHub supports an extremely flexible ingestion architecture that can support push, pull, asynchronous and synchronous models. 
+The figure below describes all the options possible for connecting your favorite system to DataHub. + ++ +
+ +## Metadata Change Proposal: The Center Piece + +The center piece for ingestion are [Metadata Change Proposal]s which represent requests to make a metadata change to an organization's Metadata Graph. +Metadata Change Proposals can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses. + +## Pull-based Integration + +DataHub ships with a Python based [metadata-ingestion system](../../metadata-ingestion/README.md) that can connect to different sources to pull metadata from them. This metadata is then pushed via Kafka or HTTP to the DataHub storage tier. Metadata ingestion pipelines can be [integrated with Airflow](../../metadata-ingestion/README.md#lineage-with-airflow) to set up scheduled ingestion or capture lineage. If you don't find a source already supported, it is very easy to [write your own](../../metadata-ingestion/README.md#contributing). + +## Push-based Integration + +As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCP-s) at the point of origin. + +## Internal Components + +### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job) + +DataHub comes with a Spring job, [mce-consumer-job], which consumes the Metadata Change Proposals and writes them into the DataHub Metadata Service (datahub-gms) using the `/ingest` endpoint. + +[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp +[Metadata Change Proposal]: ../what/mxe.md#metadata-change-proposal-mcp +[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl +[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer +[mce-consumer-job]: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mce-consumer-job +[Python emitters]: ../../metadata-ingestion/README.md#using-as-a-library diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-serving.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-serving.md new file mode 100644 index 0000000000000..e46fe3ffc91ea --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-serving.md @@ -0,0 +1,69 @@ +--- +title: Serving Tier +sidebar_label: Serving Tier +slug: /architecture/metadata-serving +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/metadata-serving.md +--- + +# DataHub Serving Architecture + +The figure below shows the high-level system diagram for DataHub's Serving Tier. + ++ +
+ +The primary component is called [the Metadata Service](https://github.com/datahub-project/datahub/blob/master/metadata-service) and exposes a REST API and a GraphQL API for performing CRUD operations on metadata. The service also exposes search and graph query API-s to support secondary-index style queries, full-text search queries as well as relationship queries like lineage. In addition, the [datahub-frontend](https://github.com/datahub-project/datahub/blob/master/datahub-frontend) service expose a GraphQL API on top of the metadata graph. + +## DataHub Serving Tier Components + +### Metadata Storage + +The DataHub Metadata Service persists metadata in a document store (an RDBMS like MySQL, Postgres, or Cassandra, etc.). + +### Metadata Change Log Stream (MCL) + +The DataHub Service Tier also emits a commit event [Metadata Change Log] when a metadata change has been successfully committed to persistent storage. This event is sent over Kafka. + +The MCL stream is a public API and can be subscribed to by external systems (for example, the Actions Framework) providing an extremely powerful way to react in real-time to changes happening in metadata. For example, you could build an access control enforcer that reacts to change in metadata (e.g. a previously world-readable dataset now has a pii field) to immediately lock down the dataset in question. +Note that not all MCP-s will result in an MCL, because the DataHub serving tier will ignore any duplicate changes to metadata. + +### Metadata Index Applier (mae-consumer-job) + +[Metadata Change Log]s are consumed by another Spring job, [mae-consumer-job], which applies the changes to the [graph] and [search index] accordingly. +The job is entity-agnostic and will execute corresponding graph & search index builders, which will be invoked by the job when a specific metadata aspect is changed. +The builder should instruct the job how to update the graph and search index based on the metadata change. + +To ensure that metadata changes are processed in the correct chronological order, MCLs are keyed by the entity [URN] — meaning all MAEs for a particular entity will be processed sequentially by a single thread. + +### Metadata Query Serving + +Primary-key based reads (e.g. getting schema metadata for a dataset based on the `dataset-urn`) on metadata are routed to the document store. Secondary index based reads on metadata are routed to the search index (or alternately can use the strongly consistent secondary index support described [here](https://github.com/datahub-project/datahub/blob/master/docs/architecture/)). Full-text and advanced search queries are routed to the search index. Complex graph queries such as lineage are routed to the graph index. 
+ +[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java +[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java +[Pegasus]: https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates +[relationship]: ../what/relationship.md +[entity]: ../what/entity.md +[aspect]: ../what/aspect.md +[GMS]: ../what/gms.md +[Metadata Change Log]: ../what/mxe.md#metadata-change-log-mcl +[rest.li]: https://rest.li +[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp +[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl +[MCP]: ../what/mxe.md#metadata-change-proposal-mcp +[MCL]: ../what/mxe.md#metadata-change-log-mcl +[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer +[graph]: ../what/graph.md +[search index]: ../what/search-index.md +[mce-consumer-job]: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mce-consumer-job +[mae-consumer-job]: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mae-consumer-job +[Remote DAO]: ../architecture/metadata-serving.md#remote-dao +[URN]: ../what/urn.md +[Metadata Modelling]: ../modeling/metadata-model.md +[Entity]: ../what/entity.md +[Relationship]: ../what/relationship.md +[Search Document]: ../what/search-document.md +[metadata aspect]: ../what/aspect.md +[Python emitters]: /docs/metadata-ingestion/#using-as-a-library diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/stemming_and_synonyms.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/stemming_and_synonyms.md new file mode 100644 index 0000000000000..1d6fb95b4993c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/stemming_and_synonyms.md @@ -0,0 +1,163 @@ +--- +title: "About DataHub [Stemming and Synonyms Support]" +sidebar_label: "[Stemming and Synonyms Support]" +slug: /architecture/stemming_and_synonyms +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/stemming_and_synonyms.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub [Stemming and Synonyms Support] + ++ +
+
+In this first image, stemming is shown in the results. Even though the query is "event", the results contain instances with "events."
+
+
+The second image exemplifies stemming on a query. The query is for "events", but the results show resources containing "event" as well.
+
+### Urn Matching
+
+Previously, queries did not properly parse out and tokenize the expected portions of Urn types. Changes have been made on the index mapping and query side to support various forms of partial and full Urn matching.
+
+ ++ +
+ ++ +
+
+### Synonyms
+
+Synonym support includes a static list of equivalent terms that is baked into the index at index creation time. This allows for efficient indexing of related terms. It is possible to add these to the query side as well to
+allow for dynamic synonyms, but this is unsupported at this time and has performance implications.
+
+ ++ +
+ +### Autocomplete improvements + +Improvements were made to autocomplete handling around special characters like underscores and spaces. + ++ +
+ ++ +
+ ++ +
+ +## Additional Resources + +### Videos + +**DataHub TownHall: Search Improvements Preview** + ++ +
+ +## FAQ and Troubleshooting + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/README.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/README.md new file mode 100644 index 0000000000000..e6a79405a1222 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/README.md @@ -0,0 +1,62 @@ +--- +title: Overview +slug: /authentication +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/README.md +--- + +# Overview + +Authentication is the process of verifying the identity of a user or service. There are two +places where Authentication occurs inside DataHub: + +1. DataHub frontend service when a user attempts to log in to the DataHub application. +2. DataHub backend service when making API requests to DataHub. + +In this document, we'll tak a closer look at both. + +### Authentication in the Frontend + +Authentication of normal users of DataHub takes place in two phases. + +At login time, authentication is performed by either DataHub itself (via username / password entry) or a third-party Identity Provider. Once the identity +of the user has been established, and credentials validated, a persistent session token is generated for the user and stored +in a browser-side session cookie. + +DataHub provides 3 mechanisms for authentication at login time: + +- **Native Authentication** which uses username and password combinations natively stored and managed by DataHub, with users invited via an invite link. +- [Single Sign-On with OpenID Connect](guides/sso/configure-oidc-react.md) to delegate authentication responsibility to third party systems like Okta or Google/Azure Authentication. This is the recommended approach for production systems. +- [JaaS Authentication](guides/jaas.md) for simple deployments where authenticated users are part of some known list or invited as a [Native DataHub User](guides/add-users.md). + +In subsequent requests, the session token is used to represent the authenticated identity of the user, and is validated by DataHub's backend service (discussed below). +Eventually, the session token is expired (24 hours by default), at which point the end user is required to log in again. + +### Authentication in the Backend (Metadata Service) + +When a user makes a request for Data within DataHub, the request is authenticated by DataHub's Backend (Metadata Service) via a JSON Web Token. This applies to both requests originating from the DataHub application, +and programmatic calls to DataHub APIs. There are two types of tokens that are important: + +1. **Session Tokens**: Generated for users of the DataHub web application. By default, having a duration of 24 hours. + These tokens are encoded and stored inside browser-side session cookies. +2. **Personal Access Tokens**: These are tokens generated via the DataHub settings panel useful for interacting + with DataHub APIs. They can be used to automate processes like enriching documentation, ownership, tags, and more on DataHub. Learn + more about Personal Access Tokens [here](personal-access-tokens.md). + +To learn more about DataHub's backend authentication, check out [Introducing Metadata Service Authentication](introducing-metadata-service-authentication.md). + +Credentials must be provided as Bearer Tokens inside of the **Authorization** header in any request made to DataHub's API layer. To learn + +```shell +Authorization: Bearer+ +
+*High level overview of Metadata Service Authentication* + +## What is an Actor? + +An **Actor** is a concept within the new Authentication subsystem to represent a unique identity / principal that is initiating actions (e.g. read & write requests) +on the platform. + +An actor can be characterized by 2 attributes: + +1. **Type**: The "type" of the actor making a request. The purpose is to for example distinguish between a "user" & "service" actor. Currently, the "user" actor type is the only one + formally supported. +2. **Id**: A unique identifier for the actor within DataHub. This is commonly known as a "principal" in other systems. In the case of users, this + represents a unique "username". This username is in turn used when converting from the "Actor" concept into a Metadata Entity Urn (e.g. CorpUserUrn). + +For example, the root "datahub" super user would have the following attributes: + +``` +{ + "type": "USER", + "id": "datahub" +} +``` + +Which is mapped to the CorpUser urn: + +``` +urn:li:corpuser:datahub +``` + +for Metadata retrieval. + +## What is an Authenticator? + +An **Authenticator** is a pluggable component inside the Metadata Service that is responsible for authenticating an inbound request provided context about the request (currently, the request headers). +Authentication boils down to successfully resolving an **Actor** to associate with the inbound request. + +There can be many types of Authenticator. For example, there can be Authenticators that + +- Verify the authenticity of access tokens (ie. issued by either DataHub itself or a 3rd-party IdP) +- Authenticate username / password credentials against a remote database (ie. LDAP) + +and more! A key goal of the abstraction is _extensibility_: a custom Authenticator can be developed to authenticate requests +based on an organization's unique needs. + +DataHub ships with 2 Authenticators by default: + +- **DataHubSystemAuthenticator**: Verifies that inbound requests have originated from inside DataHub itself using a shared system identifier + and secret. This authenticator is always present. + +- **DataHubTokenAuthenticator**: Verifies that inbound requests contain a DataHub-issued Access Token (discussed further in the "DataHub Access Token" section below) in their + 'Authorization' header. This authenticator is required if Metadata Service Authentication is enabled. + +## What is an AuthenticatorChain? + +An **AuthenticatorChain** is a series of **Authenticators** that are configured to run one-after-another. This allows +for configuring multiple ways to authenticate a given request, for example via LDAP OR via local key file. + +Only if each Authenticator within the chain fails to authenticate a request will it be rejected. + +The Authenticator Chain can be configured in the `application.yml` file under `authentication.authenticators`: + +``` +authentication: + .... + authenticators: + # Configure the Authenticators in the chain + - type: com.datahub.authentication.Authenticator1 + ... + - type: com.datahub.authentication.Authenticator2 + .... +``` + +## What is the AuthenticationFilter? + +The **AuthenticationFilter** is a [servlet filter](http://tutorials.jenkov.com/java-servlets/servlet-filters.html) that authenticates each and requests to the Metadata Service. +It does so by constructing and invoking an **AuthenticatorChain**, described above. + +If an Actor is unable to be resolved by the AuthenticatorChain, then a 401 unauthorized exception will be returned by the filter. 
+ +## What is a DataHub Token Service? What are Access Tokens? + +Along with Metadata Service Authentication comes an important new component called the **DataHub Token Service**. The purpose of this +component is twofold: + +1. Generate Access Tokens that grant access to the Metadata Service +2. Verify the validity of Access Tokens presented to the Metadata Service + +**Access Tokens** granted by the Token Service take the form of [Json Web Tokens](https://jwt.io/introduction), a type of stateless token which +has a finite lifespan & is verified using a unique signature. JWTs can also contain a set of claims embedded within them. Tokens issued by the Token +Service contain the following claims: + +- exp: the expiration time of the token +- version: version of the DataHub Access Token for purposes of evolvability (currently 1) +- type: The type of token, currently SESSION (used for UI-based sessions) or PERSONAL (used for personal access tokens) +- actorType: The type of the **Actor** associated with the token. Currently, USER is the only type supported. +- actorId: The id of the **Actor** associated with the token. + +Today, Access Tokens are granted by the Token Service under two scenarios: + +1. **UI Login**: When a user logs into the DataHub UI, for example via [JaaS](guides/jaas.md) or + [OIDC](guides/sso/configure-oidc-react.md), the `datahub-frontend` service issues an + request to the Metadata Service to generate a SESSION token _on behalf of_ of the user logging in. (\*Only the frontend service is authorized to perform this action). +2. **Generating Personal Access Tokens**: When a user requests to generate a Personal Access Token (described below) from the UI. + +> At present, the Token Service supports the symmetric signing method `HS256` to generate and verify tokens. + +Now that we're familiar with the concepts, we will talk concretely about what new capabilities have been built on top +of Metadata Service Authentication. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/add-users.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/add-users.md new file mode 100644 index 0000000000000..bcf33122aaf3a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/add-users.md @@ -0,0 +1,217 @@ +--- +title: Onboarding Users to DataHub +slug: /authentication/guides/add-users +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/add-users.md +--- + +# Onboarding Users to DataHub + +New user accounts can be provisioned on DataHub in 3 ways: + +1. Shared Invite Links +2. Single Sign-On using [OpenID Connect](https://www.google.com/search?q=openid+connect&oq=openid+connect&aqs=chrome.0.0i131i433i512j0i512l4j69i60l2j69i61.1468j0j7&sourceid=chrome&ie=UTF-8) +3. Static Credential Configuration File (Self-Hosted Only) + +The first option is the easiest to get started with. The second is recommended for deploying DataHub in production. The third should +be reserved for special circumstances where access must be closely monitored and controlled, and is only relevant for Self-Hosted instances. + +# Shared Invite Links + +### Generating an Invite Link + +If you have the `Manage User Credentials` [Platform Privilege](../../authorization/access-policies-guide.md), you can invite new users to DataHub by sharing an invite link. + +To do so, navigate to the **Users & Groups** section inside of Settings page. 
Here you can generate a shareable invite link by clicking the `Invite Users` button. If you +do not have the correct privileges to invite users, this button will be disabled. + ++ +
+ +To invite new users, simply share the link with others inside your organization. + ++ +
+ +When a new user visits the link, they will be directed to a sign up screen where they can create their DataHub account. + +### Resetting User Passwords + +To reset a user's password, navigate to the Users & Groups tab, find the user who needs their password reset, +and click **Reset user password** inside the menu dropdown on the right hand side. Note that a user must have the +`Manage User Credentials` [Platform Privilege](../../authorization/access-policies-guide.md) in order to reset passwords. + ++ +
+ +To reset the password, simply share the password reset link with the user who needs to change their password. Password reset links expire after 24 hours. + ++ +
+ +# Configuring Single Sign-On with OpenID Connect + +Setting up Single Sign-On via OpenID Connect enables your organization's users to login to DataHub via a central Identity Provider such as + +- Azure AD +- Okta +- Keycloak +- Ping! +- Google Identity + +and many more. + +This option is strongly recommended for production deployments of DataHub. + +### Managed DataHub + +Single Sign-On can be configured and enabled by navigating to **Settings** > **SSO** > **OIDC**. Note +that a user must have the **Manage Platform Settings** [Platform Privilege](../../authorization/access-policies-guide.md) +in order to configure SSO settings. + +To complete the integration, you'll need the following: + +1. **Client ID** - A unique identifier for your application with the identity provider +2. **Client Secret** - A shared secret to use for exchange between you and your identity provider +3. **Discovery URL** - A URL where the OpenID settings for your identity provider can be discovered. + +These values can be obtained from your Identity Provider by following Step 1 on the [OpenID Connect Authentication](sso/configure-oidc-react.md)) Guide. + +### Self-Hosted DataHub + +For information about configuring Self-Hosted DataHub to use OpenID Connect (OIDC) to +perform authentication, check out [OIDC Authentication](sso/configure-oidc-react.md). + +> **A note about user URNs**: User URNs are unique identifiers for users on DataHub. The username received from an Identity Provider +> when a user logs into DataHub via OIDC is used to construct a unique identifier for the user on DataHub. The urn is computed as: +> `urn:li:corpuser:+ +
+ +e. Click **Register**. + +### 2. Configure Authentication (optional) + +Once registration is done, you will land on the app registration **Overview** tab. On the left-side navigation bar, click on **Authentication** under **Manage** and add extra redirect URIs if need be (if you want to support both local testing and Azure deployments). + ++ +
+
+Click **Save**.
+
+### 3. Configure Certificates & secrets
+
+On the left-side navigation bar, click on **Certificates & secrets** under **Manage**.
+Select **Client secrets**, then **New client secret**. Type in a meaningful description for your secret and select an expiry. Click the **Add** button when you are done.
+
+**IMPORTANT:** Copy the `value` of your newly created secret, since Azure will never display its value afterwards.
+
+<p align="center">
+  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/azure-setup-certificates-secrets.png"/>
+</p>
+ +### 4. Configure API permissions + +On the left-side navigation bar, click on **API permissions** under **Manage**. DataHub requires the following four Microsoft Graph APIs: + +1. `User.Read` _(should be already configured)_ +2. `profile` +3. `email` +4. `openid` + +Click on **Add a permission**, then from the **Microsoft APIs** tab select **Microsoft Graph**, then **Delegated permissions**. From the **OpenId permissions** category, select `email`, `openid`, `profile` and click **Add permissions**. + +At this point, you should be looking at a screen like the following: + ++ +
+ +### 5. Obtain Application (Client) ID + +On the left-side navigation bar, go back to the **Overview** tab. You should see the `Application (client) ID`. Save its value for the next step. + +### 6. Obtain Discovery URI + +On the same page, you should see a `Directory (tenant) ID`. Your OIDC discovery URI will be formatted as follows: + +``` +https://login.microsoftonline.com/{tenant ID}/v2.0/.well-known/openid-configuration +``` + +### 7. Configure `datahub-frontend` to enable OIDC authentication + +a. Open the file `docker/datahub-frontend/env/docker.env` + +b. Add the following configuration values to the file: + +``` +AUTH_OIDC_ENABLED=true +AUTH_OIDC_CLIENT_ID=your-client-id +AUTH_OIDC_CLIENT_SECRET=your-client-secret +AUTH_OIDC_DISCOVERY_URI=https://login.microsoftonline.com/{tenant ID}/v2.0/.well-known/openid-configuration +AUTH_OIDC_BASE_URL=your-datahub-url +AUTH_OIDC_SCOPE="openid profile email" +``` + +Replacing the placeholders above with the client id (step 5), client secret (step 3) and tenant ID (step 6) received from Microsoft Azure. + +### 9. Restart `datahub-frontend-react` docker container + +Now, simply restart the `datahub-frontend-react` container to enable the integration. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react +``` + +Navigate to your DataHub domain to see SSO in action. + +## Resources + +- [Microsoft identity platform and OpenID Connect protocol](https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc/) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-google.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-google.md new file mode 100644 index 0000000000000..581c0130e27f9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-google.md @@ -0,0 +1,120 @@ +--- +title: Configuring Google Authentication for React App (OIDC) +slug: /authentication/guides/sso/configure-oidc-react-google +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react-google.md +--- + +# Configuring Google Authentication for React App (OIDC) + +_Authored on 3/10/2021_ + +`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to delegate +authentication responsibility to identity providers like Google. + +This guide will provide steps for configuring DataHub authentication using Google. + +:::caution +Even when OIDC is configured, the root user can still login without OIDC by going +to `/login` URL endpoint. It is recommended that you don't use the default +credentials by mounting a different file in the front end container. To do this +please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment. +::: + +## Steps + +### 1. Create a project in the Google API Console + +Using an account linked to your organization, navigate to the [Google API Console](https://console.developers.google.com/) and select **New project**. +Within this project, we will configure the OAuth2.0 screen and credentials. + +### 2. Create OAuth2.0 consent screen + +a. Navigate to `OAuth consent screen`. This is where you'll configure the screen your users see when attempting to +log in to DataHub. + +b. 
Select `Internal` (if you only want your company users to have access) and then click **Create**. +Note that in order to complete this step you should be logged into a Google account associated with your organization. + +c. Fill out the details in the App Information & Domain sections. Make sure the 'Application Home Page' provided matches where DataHub is deployed +at your organization. + ++ +
+ +Once you've completed this, **Save & Continue**. + +d. Configure the scopes: Next, click **Add or Remove Scopes**. Select the following scopes: + + - `.../auth/userinfo.email` + - `.../auth/userinfo.profile` + - `openid` + +Once you've selected these, **Save & Continue**. + +### 3. Configure client credentials + +Now navigate to the **Credentials** tab. This is where you'll obtain your client id & secret, as well as configure info +like the redirect URI used after a user is authenticated. + +a. Click **Create Credentials** & select `OAuth client ID` as the credential type. + +b. On the following screen, select `Web application` as your Application Type. + +c. Add the domain where DataHub is hosted to your 'Authorized Javascript Origins'. + +``` +https://your-datahub-domain.com +``` + +d. Add the domain where DataHub is hosted with the path `/callback/oidc` appended to 'Authorized Redirect URLs'. + +``` +https://your-datahub-domain.com/callback/oidc +``` + +e. Click **Create** + +f. You will now receive a pair of values, a client id and a client secret. Bookmark these for the next step. + +At this point, you should be looking at a screen like the following: + ++ +
+ +Success! + +### 4. Configure `datahub-frontend` to enable OIDC authentication + +a. Open the file `docker/datahub-frontend/env/docker.env` + +b. Add the following configuration values to the file: + +``` +AUTH_OIDC_ENABLED=true +AUTH_OIDC_CLIENT_ID=your-client-id +AUTH_OIDC_CLIENT_SECRET=your-client-secret +AUTH_OIDC_DISCOVERY_URI=https://accounts.google.com/.well-known/openid-configuration +AUTH_OIDC_BASE_URL=your-datahub-url +AUTH_OIDC_SCOPE="openid profile email" +AUTH_OIDC_USER_NAME_CLAIM=email +AUTH_OIDC_USER_NAME_CLAIM_REGEX=([^@]+) +``` + +Replacing the placeholders above with the client id & client secret received from Google in Step 3f. + +### 5. Restart `datahub-frontend-react` docker container + +Now, simply restart the `datahub-frontend-react` container to enable the integration. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react +``` + +Navigate to your DataHub domain to see SSO in action. + +## References + +- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-okta.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-okta.md new file mode 100644 index 0000000000000..a2816cf79de1c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-okta.md @@ -0,0 +1,127 @@ +--- +title: Configuring Okta Authentication for React App (OIDC) +slug: /authentication/guides/sso/configure-oidc-react-okta +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react-okta.md +--- + +# Configuring Okta Authentication for React App (OIDC) + +_Authored on 3/10/2021_ + +`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to +delegate authentication responsibility to identity providers like Okta. + +This guide will provide steps for configuring DataHub authentication using Okta. + +:::caution +Even when OIDC is configured, the root user can still login without OIDC by going +to `/login` URL endpoint. It is recommended that you don't use the default +credentials by mounting a different file in the front end container. To do this +please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment. +::: + +## Steps + +### 1. Create an application in Okta Developer Console + +a. Log in to your Okta admin account & navigate to the developer console + +b. Select **Applications**, then **Add Application**, the **Create New App** to create a new app. + +c. Select `Web` as the **Platform**, and `OpenID Connect` as the **Sign on method** + +d. Click **Create** + +e. Under 'General Settings', name your application + +f. Below, add a **Login Redirect URI**. This should be formatted as + +``` +https://your-datahub-domain.com/callback/oidc +``` + +If you're just testing locally, this can be `http://localhost:9002/callback/oidc`. + +g. Below, add a **Logout Redirect URI**. This should be formatted as + +``` +https://your-datahub-domain.com +``` + +h. [Optional] If you're enabling DataHub login as an Okta tile, you'll need to provide the **Initiate Login URI**. You +can set if to + +``` +https://your-datahub-domain.com/authenticate +``` + +If you're just testing locally, this can be `http://localhost:9002`. + +i. 
Click **Save** + +### 2. Obtain Client Credentials + +On the subsequent screen, you should see the client credentials. Bookmark the `Client id` and `Client secret` for the next step. + +### 3. Obtain Discovery URI + +On the same page, you should see an `Okta Domain`. Your OIDC discovery URI will be formatted as follows: + +``` +https://your-okta-domain.com/.well-known/openid-configuration +``` + +for example, `https://dev-33231928.okta.com/.well-known/openid-configuration`. + +At this point, you should be looking at a screen like the following: + ++ +
++ +
+
+Success!
+
+### 4. Configure `datahub-frontend` to enable OIDC authentication
+
+a. Open the file `docker/datahub-frontend/env/docker.env`
+
+b. Add the following configuration values to the file:
+
+```
+AUTH_OIDC_ENABLED=true
+AUTH_OIDC_CLIENT_ID=your-client-id
+AUTH_OIDC_CLIENT_SECRET=your-client-secret
+AUTH_OIDC_DISCOVERY_URI=https://your-okta-domain.com/.well-known/openid-configuration
+AUTH_OIDC_BASE_URL=your-datahub-url
+AUTH_OIDC_SCOPE="openid profile email groups"
+```
+
+Replace the placeholders above with the client id & client secret received from Okta in Step 2.
+
+> **Pro Tip!** You can easily enable Okta to return the groups that a user is associated with. These groups will be provisioned in DataHub, along with the user logging in, if they do not already exist in DataHub. This can be enabled by setting the `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` flag to `true`.
+> You can enable your Okta application to return a 'groups' claim from the Okta Console at Applications > Your Application -> Sign On -> OpenID Connect ID Token Settings (Requires an edit).
+>
+> By default, we assume that the groups will appear in a claim named "groups". This can be customized using the `AUTH_OIDC_GROUPS_CLAIM` container configuration.
+>
+> <p align="center">
+>   <img width="75%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/okta_idp_1.png"/>
+>   <img width="75%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/okta_idp_2.png"/>
+> </p>
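+
+Before restarting, you can optionally sanity-check that the Discovery URI from Step 3 resolves to a valid OIDC configuration document. The snippet below is a minimal, illustrative sketch (not part of the official setup) that uses only the Python standard library; the fields it reads (`issuer`, `authorization_endpoint`, `scopes_supported`) are standard OIDC discovery metadata.
+
+```python
+import json
+import urllib.request
+
+# Assumption: substitute the Discovery URI you derived in Step 3.
+DISCOVERY_URI = "https://your-okta-domain.com/.well-known/openid-configuration"
+
+with urllib.request.urlopen(DISCOVERY_URI) as resp:
+    config = json.load(resp)
+
+# Standard OIDC discovery metadata fields.
+print("issuer:", config["issuer"])
+print("authorization_endpoint:", config["authorization_endpoint"])
+print("groups scope advertised:", "groups" in config.get("scopes_supported", []))
+```
+
+If `groups` is not listed under `scopes_supported`, revisit the Pro Tip above before relying on `AUTH_OIDC_EXTRACT_GROUPS_ENABLED`.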
+ +### 5. Restart `datahub-frontend-react` docker container + +Now, simply restart the `datahub-frontend-react` container to enable the integration. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react +``` + +Navigate to your DataHub domain to see SSO in action. + +## Resources + +- [OAuth 2.0 and OpenID Connect Overview](https://developer.okta.com/docs/concepts/oauth-openid/) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react.md new file mode 100644 index 0000000000000..fa4f33929c8df --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react.md @@ -0,0 +1,240 @@ +--- +title: Overview +slug: /authentication/guides/sso/configure-oidc-react +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react.md +--- + +# Overview + +The DataHub React application supports OIDC authentication built on top of the [Pac4j Play](https://github.com/pac4j/play-pac4j) library. +This enables operators of DataHub to integrate with 3rd party identity providers like Okta, Google, Keycloak, & more to authenticate their users. + +When configured, OIDC auth will be enabled between clients of the DataHub UI & `datahub-frontend` server. Beyond this point is considered +to be a secure environment and as such authentication is validated & enforced only at the "front door" inside datahub-frontend. + +:::caution +Even if OIDC is configured the root user can still login without OIDC by going +to `/login` URL endpoint. It is recommended that you don't use the default +credentials by mounting a different file in the front end container. To do this +please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment. +::: + +## Provider-Specific Guides + +1. [Configuring OIDC using Google](configure-oidc-react-google.md) +2. [Configuring OIDC using Okta](configure-oidc-react-okta.md) +3. [Configuring OIDC using Azure](configure-oidc-react-azure.md) + +## Configuring OIDC in React + +### 1. Register an app with your Identity Provider + +To configure OIDC in React, you will most often need to register yourself as a client with your identity provider (Google, Okta, etc). Each provider may +have their own instructions. Provided below are links to examples for Okta, Google, Azure AD, & Keycloak. + +- [Registering an App in Okta](https://developer.okta.com/docs/guides/add-an-external-idp/apple/register-app-in-okta/) +- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect) +- [OpenID Connect authentication with Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/auth-oidc) +- [Keycloak - Securing Applications and Services Guide](https://www.keycloak.org/docs/latest/securing_apps/) + +During the registration process, you'll need to provide a login redirect URI to the identity provider. This tells the identity provider +where to redirect to once they've authenticated the end user. + +By default, the URL will be constructed as follows: + +> "http://your-datahub-domain.com/callback/oidc" + +For example, if you're hosted DataHub at `datahub.myorg.com`, this +value would be `http://datahub.myorg.com/callback/oidc`. 
For testing purposes you can also specify localhost as the domain name +directly: `http://localhost:9002/callback/oidc` + +The goal of this step should be to obtain the following values, which will need to be configured before deploying DataHub: + +1. **Client ID** - A unique identifier for your application with the identity provider +2. **Client Secret** - A shared secret to use for exchange between you and your identity provider +3. **Discovery URL** - A URL where the OIDC API of your identity provider can be discovered. This should suffixed by + `.well-known/openid-configuration`. Sometimes, identity providers will not explicitly include this URL in their setup guides, though + this endpoint _will_ exist as per the OIDC specification. For more info see http://openid.net/specs/openid-connect-discovery-1_0.html. + +### 2. Configure DataHub Frontend Server + +The second step to enabling OIDC involves configuring `datahub-frontend` to enable OIDC authentication with your Identity Provider. + +To do so, you must update the `datahub-frontend` [docker.env](https://github.com/datahub-project/datahub/blob/master/docker/datahub-frontend/env/docker.env) file with the +values received from your identity provider: + +``` +# Required Configuration Values: +AUTH_OIDC_ENABLED=true +AUTH_OIDC_CLIENT_ID=your-client-id +AUTH_OIDC_CLIENT_SECRET=your-client-secret +AUTH_OIDC_DISCOVERY_URI=your-provider-discovery-url +AUTH_OIDC_BASE_URL=your-datahub-url +``` + +- `AUTH_OIDC_ENABLED`: Enable delegating authentication to OIDC identity provider +- `AUTH_OIDC_CLIENT_ID`: Unique client id received from identity provider +- `AUTH_OIDC_CLIENT_SECRET`: Unique client secret received from identity provider +- `AUTH_OIDC_DISCOVERY_URI`: Location of the identity provider OIDC discovery API. Suffixed with `.well-known/openid-configuration` +- `AUTH_OIDC_BASE_URL`: The base URL of your DataHub deployment, e.g. https://yourorgdatahub.com (prod) or http://localhost:9002 (testing) + +Providing these configs will cause DataHub to delegate authentication to your identity +provider, requesting the "oidc email profile" scopes and parsing the "preferred_username" claim from +the authenticated profile as the DataHub CorpUser identity. + +> By default, the login callback endpoint exposed by DataHub will be located at `${AUTH_OIDC_BASE_URL}/callback/oidc`. This must **exactly** match the login redirect URL you've registered with your identity provider in step 1. + +In kubernetes, you can add the above env variables in the values.yaml as follows. + +```yaml +datahub-frontend: + ... + extraEnvs: + - name: AUTH_OIDC_ENABLED + value: "true" + - name: AUTH_OIDC_CLIENT_ID + value: your-client-id + - name: AUTH_OIDC_CLIENT_SECRET + value: your-client-secret + - name: AUTH_OIDC_DISCOVERY_URI + value: your-provider-discovery-url + - name: AUTH_OIDC_BASE_URL + value: your-datahub-url +``` + +You can also package OIDC client secrets into a k8s secret by running + +`kubectl create secret generic datahub-oidc-secret --from-literal=secret=<+ +
+ +If you have configured permissions correctly the `Generate new token` should be clickable. + +:::note + +If you see `Token based authentication is currently disabled. Contact your DataHub administrator to enable this feature.` then you must enable authentication in the metadata service (step 1 of the prerequisites). + +::: + +## Creating Personal Access Tokens + +Once in the Manage Access Tokens Settings Tab: + +1. Click `Generate new token` where a form should appear. + ++ +
+ +2. Fill out the information as needed and click `Create`. ++ +
+ +3. Save the token text somewhere secure! This is what will be used later on! ++ +
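+
+Since the token you just saved is a JSON Web Token, you can inspect (though not verify) its claims locally. The following is a purely illustrative sketch using only the Python standard library: it base64-decodes the payload segment without any signature verification, and the claim names mentioned in the comment are the ones described in the Metadata Service Authentication docs.
+
+```python
+import base64
+import json
+import sys
+
+
+def unverified_claims(token: str) -> dict:
+    """Decode a JWT's payload segment WITHOUT verifying its signature."""
+    payload = token.split(".")[1]
+    payload += "=" * (-len(payload) % 4)  # restore base64url padding
+    return json.loads(base64.urlsafe_b64decode(payload))
+
+
+if __name__ == "__main__":
+    # Usage: python inspect_token.py <your-personal-access-token>
+    claims = unverified_claims(sys.argv[1])
+    # Expect claims such as exp, version, type (PERSONAL), actorType, and actorId.
+    print(json.dumps(claims, indent=2))
+```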
+ +## Using Personal Access Tokens + +Once a token has been generated, the user that created it will subsequently be able to make authenticated HTTP requests, assuming he/she has permissions to do so, to DataHub frontend proxy or DataHub GMS directly by providing +the generated Access Token as a Bearer token in the `Authorization` header: + +``` +Authorization: Bearer+ +
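+
+As a concrete illustration of the header above, here is a hedged sketch in Python that issues a GraphQL request through the frontend proxy with the token attached. The `requests` dependency, the `/api/graphql` path, and the example `me` query are assumptions based on the GraphQL resources linked below — adjust them for your deployment.
+
+```python
+import os
+
+import requests
+
+DATAHUB_URL = "http://localhost:9002"       # frontend proxy; use your DataHub domain in production
+TOKEN = os.environ["DATAHUB_ACCESS_TOKEN"]  # the Personal Access Token saved earlier
+
+# Any HTTP request works the same way: the token travels in the Authorization header.
+response = requests.post(
+    f"{DATAHUB_URL}/api/graphql",
+    headers={"Authorization": f"Bearer {TOKEN}"},
+    json={"query": "{ me { corpUser { username } } }"},
+)
+response.raise_for_status()
+print(response.json())
+```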
+
+:::note
+
+Without an access token, making programmatic requests will result in a 401 response from the server if Metadata Service Authentication
+is enabled.
+
+:::
+
+## Additional Resources
+
+- Learn more about how this feature works in [Metadata Service Authentication](introducing-metadata-service-authentication.md).
+- Check out our [Authorization Policies](../authorization/policies.md) to see what permissions can be programmatically used.
+
+### GraphQL
+
+- Have a look at [Token Management in GraphQL](../api/graphql/token-management.md) to learn how to manage tokens programmatically!
+
+## FAQ and Troubleshooting
+
+**The button to create tokens is greyed out - why can’t I click on it?**
+
+This means that the user currently logged in to DataHub does not have either `Generate Personal Access Tokens` or `Manage All Access Tokens` permissions.
+Please ask your DataHub administrator to grant you those permissions.
+
+**When using a token, I get 401 unauthorized - why?**
+
+A PAT represents a user in DataHub; if that user does not have permissions for a given action, neither will the token.
+
+**Can I create a PAT that represents some other user?**
+
+Yes, although not through the UI. To do this, you will have to use the [token management GraphQL API](../api/graphql/token-management.md), and the user making the request must have `Manage All Access Tokens` permissions.
+
+_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/README.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/README.md
new file mode 100644
index 0000000000000..7d592e285716a
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/README.md
@@ -0,0 +1,25 @@
+---
+title: Overview
+slug: /authorization
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/authorization/README.md
+---
+
+# Overview
+
+Authorization specifies _what_ accesses an _authenticated_ user has within a system.
+This section is all about how DataHub authorizes a given user/service that wants to interact with the system.
+
+:::note
+
+Authorization only makes sense in the context of an **Authenticated** DataHub deployment. To use DataHub's authorization features
+please first make sure that the system has been configured from an authentication perspective as you intend.
+
+:::
+
+Once the identity of a user or service has been established, DataHub determines what accesses the authenticated request has.
+
+This is done by checking what operation a given user/service wants to perform within DataHub & whether it is allowed to do so.
+The set of operations that are allowed in DataHub are what we call **Policies**.
+
+Policies specify fine-grained access control for _who_ can do _what_ to _which_ resources. For more details on the set of Policies that DataHub provides, please see the [Policies Guide](../authorization/policies.md).
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/access-policies-guide.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/access-policies-guide.md new file mode 100644 index 0000000000000..d0aeaec60ecd7 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/access-policies-guide.md @@ -0,0 +1,344 @@ +--- +title: About DataHub Access Policies +sidebar_label: Access Policies +slug: /authorization/access-policies-guide +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/access-policies-guide.md +--- + +# About DataHub Access Policies + ++ +
+ +**Platform** Policies determine who has platform-level Privileges on DataHub. These include: + +- Managing Users & Groups +- Viewing the DataHub Analytics Page +- Managing Policies themselves + +Platform policies can be broken down into 2 parts: + +1. **Privileges**: Which privileges should be assigned to the Actors (e.g. "View Analytics") +2. **Actors**: Who the should be granted the privileges (Users, or Groups) + +A few Platform Policies in plain English include: + +- The Data Platform team should be allowed to manage users & groups, view platform analytics, & manage policies themselves +- John from IT should be able to invite new users + +**Metadata** policies determine who can do what to which Metadata Entities. For example: + +- Who can edit Dataset Documentation & Links? +- Who can add Owners to a Chart? +- Who can add Tags to a Dashboard? + +Metadata policies can be broken down into 3 parts: + +1. **Privileges**: The 'what'. What actions are being permitted by a Policy, e.g. "Add Tags". +2. **Resources**: The 'which'. Resources that the Policy applies to, e.g. "All Datasets". +3. **Actors**: The 'who'. Specific users, groups, & roles that the Policy applies to. + +A few **Metadata** Policies in plain English include: + +- Dataset Owners should be allowed to edit documentation, but not Tags. +- Jenny, our Data Steward, should be allowed to edit Tags for any Dashboard, but no other metadata. +- James, a Data Analyst, should be allowed to edit the Links for a specific Data Pipeline he is a downstream consumer of. + +Each of these can be implemented by constructing DataHub Access Policies. + +## Access Policies Setup, Prerequisites, and Permissions + +What you need to manage Access Policies on DataHub: + +- **Manage Policies** Privilege + +This Platform Privilege allows users to create, edit, and remove all Access Policies on DataHub. Therefore, it should only be +given to those users who will be serving as Admins of the platform. The default `Admin` role has this Privilege. + +## Using Access Policies + +Policies can be created by first navigating to **Settings > Permissions > Policies**. + +To begin building a new Policy, click **Create new Policy**. + ++ +
+ +### Creating a Platform Policy + +#### Step 1. Provide a Name & Description + +In the first step, we select the **Platform** Policy type, and define a name and description for the new Policy. + +Good Policy names describe the high-level purpose of the Policy. For example, a Policy named +"View DataHub Analytics - Data Governance Team" would be a great way to describe a Platform +Policy which grants abilities to view DataHub's Analytics view to anyone on the Data Governance team. + +You can optionally provide a text description to add richer details about the purpose of the Policy. + +#### Step 2: Configure Privileges + +In the second step, we can simply select the Privileges that this Platform Policy will grant. + ++ +
+ +**Platform** Privileges most often provide access to perform administrative functions on the Platform. These include: + +| Platform Privileges | Description | +| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| Manage Policies | Allow actor to create and remove access control policies. Be careful - Actors with this Privilege are effectively super users. | +| Manage Metadata Ingestion | Allow actor to create, remove, and update Metadata Ingestion sources. | +| Manage Secrets | Allow actor to create & remove secrets stored inside DataHub. | +| Manage Users & Groups | Allow actor to create, remove, and update users and groups on DataHub. | +| Manage All Access Tokens | Allow actor to create, remove, and list access tokens for all users on DataHub. | +| Create Domains | Allow the actor to create new Domains | +| Manage Domains | Allow actor to create and remove any Domains. | +| View Analytics | Allow the actor access to the DataHub analytics dashboard. | +| Generate Personal Access Tokens | Allow the actor to generate access tokens for personal use with DataHub APIs. | +| Manage User Credentials | Allow the actor to generate invite links for new native DataHub users, and password reset links for existing native users. | +| Manage Glossaries | Allow the actor to create, edit, move, and delete Glossary Terms and Term Groups | +| Create Tags | Allow the actor to create new Tags | +| Manage Tags | Allow the actor to create and remove any Tags | +| Manage Public Views | Allow the actor to create, edit, and remove any public (shared) Views. | +| Manage Ownership Types | Allow the actor to create, edit, and remove any Ownership Types. | +| Restore Indices API[^1] | Allow the actor to restore indices for a set of entities via API | +| Enable/Disable Writeability API[^1] | Allow the actor to enable or disable GMS writeability for use in data migrations | +| Apply Retention API[^1] | Allow the actor to apply aspect retention via API | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED environment flag is enabled + +#### Step 3: Choose Policy Actors + +In this step, we can select the actors who should be granted Privileges appearing on this Policy. + +To do so, simply search and select the Users or Groups that the Policy should apply to. + +**Assigning a Policy to a User** + ++ +
+ +**Assigning a Policy to a Group** + ++ +
+ +### Creating a Metadata Policy + +#### Step 1. Provide a Name & Description + +In the first step, we select the **Metadata** Policy, and define a name and description for the new Policy. + +Good Policy names describe the high-level purpose of the Policy. For example, a Policy named +"Full Dataset Edit Privileges - Data Platform Engineering" would be a great way to describe a Metadata +Policy which grants all abilities to edit Dataset Metadata to anyone in the "Data Platform" group. + +You can optionally provide a text description to add richer detail about the purpose of the Policy. + +#### Step 2: Configure Privileges + +In the second step, we can simply select the Privileges that this Metadata Policy will grant. +To begin, we should first determine which assets that the Privileges should be granted for (i.e. the _scope_), then +select the appropriate Privileges to grant. + +Using the `Resource Type` selector, we can narrow down the _type_ of the assets that the Policy applies to. If left blank, +all entity types will be in scope. + +For example, if we only want to grant access for `Datasets` on DataHub, we can select +`Datasets`. + ++ +
+ +Next, we can search for specific Entities of the that the Policy should grant privileges on. +If left blank, all entities of the selected types are in scope. + +For example, if we only want to grant access for a specific sample dataset, we can search and +select it directly. + ++ +
+ +We can also limit the scope of the Policy to assets that live in a specific **Domain**. If left blank, +entities from all Domains will be in scope. + +For example, if we only want to grant access for assets part of a "Marketing" Domain, we can search and +select it. + ++ +
+ +Finally, we will choose the Privileges to grant when the selected entities fall into the defined +scope. + ++ +
+ +**Metadata** Privileges grant access to change specific _entities_ (i.e. data assets) on DataHub. + +The common Metadata Privileges, which span across entity types, include: + +| Common Privileges | Description | +| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| View Entity Page | Allow actor to access the entity page for the resource in the UI. If not granted, it will redirect them to an unauthorized page. | +| Edit Tags | Allow actor to add and remove tags to an asset. | +| Edit Glossary Terms | Allow actor to add and remove glossary terms to an asset. | +| Edit Owners | Allow actor to add and remove owners of an entity. | +| Edit Description | Allow actor to edit the description (documentation) of an entity. | +| Edit Links | Allow actor to edit links associated with an entity. | +| Edit Status | Allow actor to edit the status of an entity (soft deleted or not). | +| Edit Domain | Allow actor to edit the Domain of an entity. | +| Edit Deprecation | Allow actor to edit the Deprecation status of an entity. | +| Edit Assertions | Allow actor to add and remove assertions from an entity. | +| Edit All | Allow actor to edit any information about an entity. Super user privileges. Controls the ability to ingest using API when REST API Authorization is enabled. | +| Get Timeline API[^1] | Allow actor to get the timeline of an entity via API. | +| Get Entity API[^1] | Allow actor to get an entity via API. | +| Get Timeseries Aspect API[^1] | Allow actor to get a timeseries aspect via API. | +| Get Aspect/Entity Count APIs[^1] | Allow actor to get aspect and entity counts via API. | +| Search API | Allow actor to search for entities via API. | +| Produce Platform Event API | Allow actor to ingest a platform event via API. | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED is true + +**Specific Metadata Privileges** include + +| Entity | Privilege | Description | +| ------------ | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Dataset | Edit Dataset Column Tags | Allow actor to edit the column (field) tags associated with a dataset schema. | +| Dataset | Edit Dataset Column Glossary Terms | Allow actor to edit the column (field) glossary terms associated with a dataset schema. | +| Dataset | Edit Dataset Column Descriptions | Allow actor to edit the column (field) descriptions associated with a dataset schema. | +| Dataset | Edit Dataset Queries | Allow actor to edit the Highlighted Queries on the Queries tab of the dataset. | +| Dataset | View Dataset Usage | Allow actor to access usage metadata about a dataset both in the UI and in the GraphQL API. This includes example queries, number of queries, etc. Also applies to REST APIs when REST API Authorization is enabled. | +| Dataset | View Dataset Profile | Allow actor to access a dataset's profile both in the UI and in the GraphQL API. This includes snapshot statistics like #rows, #columns, null percentage per field, etc. | +| Tag | Edit Tag Color | Allow actor to change the color of a Tag. | +| Group | Edit Group Members | Allow actor to add and remove members to a group. 
| +| User | Edit User Profile | Allow actor to change the user's profile including display name, bio, title, profile image, etc. | +| User + Group | Edit Contact Information | Allow actor to change the contact information such as email & chat handles. | + +> **Still have questions about Privileges?** Let us know in [Slack](https://slack.datahubproject.io)! + +#### Step 3: Choose Policy Actors + +In this step, we can select the actors who should be granted the Privileges on this Policy. Metadata Policies +can target specific Users & Groups, or the _owners_ of the Entities that are included in the scope of the Policy. + +To do so, simply search and select the Users or Groups that the Policy should apply to. + ++ +
+ ++ +
+ +We can also grant the Privileges to the _owners_ of Entities (or _Resources_) that are in scope for the Policy. +This advanced functionality allows of Admins of DataHub to closely control which actions can or cannot be performed by owners. + ++ +
+ +### Updating an Existing Policy + +To update an existing Policy, simply click the **Edit** on the Policy you wish to change. + ++ +
+ +Then, make the changes required and click **Save**. When you save a Policy, it may take up to 2 minutes for changes +to be reflected. + +### Removing a Policy + +To remove a Policy, simply click on the trashcan icon located on the Policies list. This will remove the Policy and +deactivate it so that it no longer applies. + +When you delete a Policy, it may take up to 2 minutes for changes to be reflected. + +### Deactivating a Policy + +In addition to deletion, DataHub also supports "deactivating" a Policy. This is useful if you need to temporarily disable +a particular Policy, but do not want to remove it altogether. + +To deactivate a Policy, simply click the **Deactivate** button on the Policy you wish to deactivate. When you change +the state of a Policy, it may take up to 2 minutes for the changes to be reflected. + ++ +
+ +After deactivating, you can re-enable a Policy by clicking **Activate**. + +### Default Policies + +Out of the box, DataHub is deployed with a set of pre-baked Policies. This set of policies serves the +following purposes: + +1. Assigns immutable super-user privileges for the root `datahub` user account (Immutable) +2. Assigns all Platform Privileges for all Users by default (Editable) + +The reason for #1 is to prevent people from accidentally deleting all policies and getting locked out (`datahub` super user account can be a backup) +The reason for #2 is to permit administrators to log in via OIDC or another means outside of the `datahub` root account +when they are bootstrapping with DataHub. This way, those setting up DataHub can start managing Access Policies without friction. +Note that these Privileges _can_ and likely _should_ be changed inside the **Policies** page before onboarding +your company's users. + +### REST API Authorization + +Policies only affect REST APIs when the environment variable `REST_API_AUTHORIZATION` is set to `true` for GMS. Some policies only apply when this setting is enabled, marked above, and other Metadata and Platform policies apply to the APIs where relevant, also specified in the table above. + +## Additional Resources + +- [Authorization Overview](./README.md) +- [Roles Overview](./roles.md) +- [Authorization using Groups](./groups.md) + +### Videos + +- [Introducing DataHub Access Policies](https://youtu.be/19zQCznqhMI?t=282) + +### GraphQL + +- [listPolicies](../../graphql/queries.md#listPolicies) +- [createPolicy](../../graphql/mutations.md#createPolicy) +- [updatePolicy](../../graphql/mutations.md#updatePolicy) +- [deletePolicy](../../graphql/mutations.md#deletePolicy) + +## FAQ and Troubleshooting + +**How do Policies relate to Roles?** + +Policies are the lowest level primitive for granting Privileges to users on DataHub. + +Roles are built for convenience on top of Policies. Roles grant Privileges to actors indirectly, driven by Policies +behind the scenes. Both can be used in conjunction to grant Privileges to end users. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Roles](./roles.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/groups.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/groups.md new file mode 100644 index 0000000000000..3882317dd7f81 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/groups.md @@ -0,0 +1,39 @@ +--- +title: Authorization using Groups +slug: /authorization/groups +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/groups.md +--- + +# Authorization using Groups + +## Introduction + +DataHub provides the ability to use **Groups** to manage policies. + +## Why do we need groups for authorization? + +### Easily Applying Access Privileges + +Groups are useful for managing user privileges in DataHub. If you want a set of Admin users, +or you want to define a set of users that are only able to view metadata assets but not make changes to them, you could +create groups for each of these use cases and apply the appropriate policies at the group-level rather than the +user-level. + +### Syncing with Existing Enterprise Groups (via IdP) + +If you work with an Identity Provider like Okta or Azure AD, it's likely you already have groups defined there. 
DataHub +allows you to import the groups you have from OIDC for [Okta](../generated/ingestion/sources/okta.md) and +[Azure AD](../generated/ingestion/sources/azure-ad.md) using the DataHub ingestion framework. + +If you routinely ingest groups from these providers, you will also be able to keep groups synced. New groups will +be created in DataHub, stale groups will be deleted, and group membership will be updated! + +## Custom Groups + +DataHub admins can create custom groups by going to the **Settings > Users & Groups > Groups > Create Group**. +Members can be added to Groups via the Group profile page. + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on Slack! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/policies.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/policies.md new file mode 100644 index 0000000000000..b6cf1cd5a4e55 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/policies.md @@ -0,0 +1,233 @@ +--- +title: Policies Guide +slug: /authorization/policies +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/policies.md +--- + +# Policies Guide + +## Introduction + +DataHub provides the ability to declare fine-grained access control Policies via the UI & GraphQL API. +Access policies in DataHub define _who_ can _do what_ to _which resources_. A few policies in plain English include + +- Dataset Owners should be allowed to edit documentation, but not Tags. +- Jenny, our Data Steward, should be allowed to edit Tags for any Dashboard, but no other metadata. +- James, a Data Analyst, should be allowed to edit the Links for a specific Data Pipeline he is a downstream consumer of. +- The Data Platform team should be allowed to manage users & groups, view platform analytics, & manage policies themselves. + +In this document, we'll take a deeper look at DataHub Policies & how to use them effectively. + +## What is a Policy? + +There are 2 types of Policy within DataHub: + +1. Platform Policies +2. Metadata Policies + +We'll briefly describe each. + +### Platform Policies + +**Platform** policies determine who has platform-level privileges on DataHub. These privileges include + +- Managing Users & Groups +- Viewing the DataHub Analytics Page +- Managing Policies themselves + +Platform policies can be broken down into 2 parts: + +1. **Actors**: Who the policy applies to (Users or Groups) +2. **Privileges**: Which privileges should be assigned to the Actors (e.g. "View Analytics") + +Note that platform policies do not include a specific "target resource" against which the Policies apply. Instead, +they simply serve to assign specific privileges to DataHub users and groups. + +### Metadata Policies + +**Metadata** policies determine who can do what to which Metadata Entities. For example, + +- Who can edit Dataset Documentation & Links? +- Who can add Owners to a Chart? +- Who can add Tags to a Dashboard? + +and so on. + +A Metadata Policy can be broken down into 3 parts: + +1. **Actors**: The 'who'. Specific users, groups that the policy applies to. +2. **Privileges**: The 'what'. What actions are being permitted by a policy, e.g. "Add Tags". +3. **Resources**: The 'which'. Resources that the policy applies to, e.g. "All Datasets". 
+ +#### Actors + +We currently support 3 ways to define the set of actors the policy applies to: a) list of users b) list of groups, and +c) owners of the entity. You also have the option to apply the policy to all users or groups. + +#### Privileges + +Check out the list of +privileges [here](https://github.com/datahub-project/datahub/blob/master/metadata-utils/src/main/java/com/linkedin/metadata/authorization/PoliciesConfig.java) +. Note, the privileges are semantic by nature, and does not tie in 1-to-1 with the aspect model. + +All edits on the UI are covered by a privilege, to make sure we have the ability to restrict write access. + +We currently support the following: + +**Platform-level** privileges for DataHub operators to access & manage the administrative functionality of the system. + +| Platform Privileges | Description | +| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| Manage Policies | Allow actor to create and remove access control policies. Be careful - Actors with this privilege are effectively super users. | +| Manage Metadata Ingestion | Allow actor to create, remove, and update Metadata Ingestion sources. | +| Manage Secrets | Allow actor to create & remove secrets stored inside DataHub. | +| Manage Users & Groups | Allow actor to create, remove, and update users and groups on DataHub. | +| Manage All Access Tokens | Allow actor to create, remove, and list access tokens for all users on DataHub. | +| Create Domains | Allow the actor to create new Domains | +| Manage Domains | Allow actor to create and remove any Domains. | +| View Analytics | Allow the actor access to the DataHub analytics dashboard. | +| Generate Personal Access Tokens | Allow the actor to generate access tokens for personal use with DataHub APIs. | +| Manage User Credentials | Allow the actor to generate invite links for new native DataHub users, and password reset links for existing native users. | +| Manage Glossaries | Allow the actor to create, edit, move, and delete Glossary Terms and Term Groups | +| Create Tags | Allow the actor to create new Tags | +| Manage Tags | Allow the actor to create and remove any Tags | +| Manage Public Views | Allow the actor to create, edit, and remove any public (shared) Views. | +| Restore Indices API[^1] | Allow the actor to restore indices for a set of entities via API | +| Enable/Disable Writeability API[^1] | Allow the actor to enable or disable GMS writeability for use in data migrations | +| Apply Retention API[^1] | Allow the actor to apply aspect retention via API | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED is true + +**Common metadata privileges** to view & modify any entity within DataHub. + +| Common Privileges | Description | +| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- | +| View Entity Page | Allow actor to access the entity page for the resource in the UI. If not granted, it will redirect them to an unauthorized page. | +| Edit Tags | Allow actor to add and remove tags to an asset. | +| Edit Glossary Terms | Allow actor to add and remove glossary terms to an asset. | +| Edit Owners | Allow actor to add and remove owners of an entity. | +| Edit Description | Allow actor to edit the description (documentation) of an entity. 
| +| Edit Links | Allow actor to edit links associated with an entity. | +| Edit Status | Allow actor to edit the status of an entity (soft deleted or not). | +| Edit Domain | Allow actor to edit the Domain of an entity. | +| Edit Deprecation | Allow actor to edit the Deprecation status of an entity. | +| Edit Assertions | Allow actor to add and remove assertions from an entity. | +| Edit All | Allow actor to edit any information about an entity. Super user privileges. Controls the ability to ingest using API when REST API Authorization is enabled. | | +| Get Timeline API[^1] | Allow actor to get the timeline of an entity via API. | +| Get Entity API[^1] | Allow actor to get an entity via API. | +| Get Timeseries Aspect API[^1] | Allow actor to get a timeseries aspect via API. | +| Get Aspect/Entity Count APIs[^1] | Allow actor to get aspect and entity counts via API. | +| Search API[^1] | Allow actor to search for entities via API. | +| Produce Platform Event API[^1] | Allow actor to ingest a platform event via API. | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED is true + +**Specific entity-level privileges** that are not generalizable. + +| Entity | Privilege | Description | +| ------------ | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Dataset | Edit Dataset Column Tags | Allow actor to edit the column (field) tags associated with a dataset schema. | +| Dataset | Edit Dataset Column Glossary Terms | Allow actor to edit the column (field) glossary terms associated with a dataset schema. | +| Dataset | Edit Dataset Column Descriptions | Allow actor to edit the column (field) descriptions associated with a dataset schema. | +| Dataset | View Dataset Usage | Allow actor to access usage metadata about a dataset both in the UI and in the GraphQL API. This includes example queries, number of queries, etc. Also applies to REST APIs when REST API Authorization is enabled. | +| Dataset | View Dataset Profile | Allow actor to access a dataset's profile both in the UI and in the GraphQL API. This includes snapshot statistics like #rows, #columns, null percentage per field, etc. | +| Tag | Edit Tag Color | Allow actor to change the color of a Tag. | +| Group | Edit Group Members | Allow actor to add and remove members to a group. | +| User | Edit User Profile | Allow actor to change the user's profile including display name, bio, title, profile image, etc. | +| User + Group | Edit Contact Information | Allow actor to change the contact information such as email & chat handles. | +| GlossaryNode | Manage Direct Glossary Children | Allow the actor to create, edit, and delete the direct children of the selected entities. | +| GlossaryNode | Manage All Glossary Children | Allow the actor to create, edit, and delete everything underneath the selected entities. | + +#### Resources + +Resource filter defines the set of resources that the policy applies to is defined using a list of criteria. Each +criterion defines a field type (like resource_type, resource_urn, domain), a list of field values to compare, and a +condition (like EQUALS). It essentially checks whether the field of a certain resource matches any of the input values. +Note, that if there are no criteria or resource is not set, policy is applied to ALL resources. 
+ +For example, the following resource filter will apply the policy to datasets, charts, and dashboards under domain 1. + +```json +{ + "resource": { + "criteria": [ + { + "field": "resource_type", + "values": ["dataset", "chart", "dashboard"], + "condition": "EQUALS" + }, + { + "field": "domain", + "values": ["urn:li:domain:domain1"], + "condition": "EQUALS" + } + ] + } +} +``` + +Supported fields are as follows + +| Field Type | Description | Example | +| ------------- | ---------------------- | ----------------------- | +| resource_type | Type of the resource | dataset, chart, dataJob | +| resource_urn | Urn of the resource | urn:li:dataset:... | +| domain | Domain of the resource | urn:li:domain:domainX | + +## Managing Policies + +Policies can be managed on the page **Settings > Permissions > Policies** page. The `Policies` tab will only +be visible to those users having the `Manage Policies` privilege. + +Out of the box, DataHub is deployed with a set of pre-baked Policies. The set of default policies are created at deploy +time and can be found inside the `policies.json` file within `metadata-service/war/src/main/resources/boot`. This set of policies serves the +following purposes: + +1. Assigns immutable super-user privileges for the root `datahub` user account (Immutable) +2. Assigns all Platform privileges for all Users by default (Editable) + +The reason for #1 is to prevent people from accidentally deleting all policies and getting locked out (`datahub` super user account can be a backup) +The reason for #2 is to permit administrators to log in via OIDC or another means outside of the `datahub` root account +when they are bootstrapping with DataHub. This way, those setting up DataHub can start managing policies without friction. +Note that these privilege _can_ and likely _should_ be altered inside the **Policies** page of the UI. + +> Pro-Tip: To login using the `datahub` account, simply navigate to `+ +
+ +### Assigning Roles + +Roles can be assigned in two different ways. + +#### Assigning a New Role to a Single User + +If you go to **Settings > Users & Groups > Users**, you will be able to view your full list of users, as well as which Role they are currently +assigned to, including if they don't have a Role. + ++ +
+ +You can simply assign a new Role to a user by clicking on the drop-down that appears on their row and selecting the desired Role. + ++ +
+ +#### Batch Assigning a Role + +When viewing the full list of roles at **Settings > Permissions > Roles**, you will notice that each role has an `Add Users` button next to it. Clicking this button will +lead you to a search box where you can search through your users, and select which users you would like to assign this role to. + ++ +
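+
+For scripted onboarding, the same batch assignment is also exposed through the `batchAssignRole` GraphQL mutation listed under Additional Resources below. The sketch below is a non-authoritative example: the exact input shape (`roleUrn` plus a list of actor URNs) and the example role URN are assumptions, so confirm them against the linked mutation reference, and authenticate with a Personal Access Token as described in the authentication guides.
+
+```python
+import os
+
+import requests
+
+MUTATION = """
+mutation batchAssignRole($input: BatchAssignRoleInput!) {
+  batchAssignRole(input: $input)
+}
+"""
+
+# Assumed input shape, role URN, and placeholder user URNs -- verify against the
+# batchAssignRole reference linked below.
+variables = {
+    "input": {
+        "roleUrn": "urn:li:dataHubRole:Reader",
+        "actors": ["urn:li:corpuser:jdoe", "urn:li:corpuser:asmith"],
+    }
+}
+
+response = requests.post(
+    "http://localhost:9002/api/graphql",
+    headers={"Authorization": f"Bearer {os.environ['DATAHUB_ACCESS_TOKEN']}"},
+    json={"query": MUTATION, "variables": variables},
+)
+response.raise_for_status()
+print(response.json())
+```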
+ +### How do Roles interact with Policies? + +Roles actually use Policies under-the-hood, and come prepackaged with corresponding policies to control what a Role can do, which you can view in the +Policies tab. Note that these Role-specific policies **cannot** be changed. You can find the full list of policies corresponding to each Role at the bottom of this +[file](https://github.com/datahub-project/datahub/blob/master/metadata-service/war/src/main/resources/boot/policies.json). + +If you would like to have finer control over what a user on your DataHub instance can do, the Roles system interfaces cleanly +with the Policies system. For example, if you would like to give a user a **Reader** role, but also allow them to edit metadata +for certain domains, you can add a policy that will allow them to do. Note that adding a policy like this will only add to what a user can do +in DataHub. + +### Role Privileges + +#### Self-Hosted DataHub and Managed DataHub + +These privileges are common to both Self-Hosted DataHub and Managed DataHub. + +##### Platform Privileges + +| Privilege | Admin | Editor | Reader | +| ------------------------------- | ------------------ | ------------------ | ------ | +| Generate Personal Access Tokens | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Domains | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Glossaries | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Policies | :heavy_check_mark: | :x: | :x: | +| Manage Ingestion | :heavy_check_mark: | :x: | :x: | +| Manage Secrets | :heavy_check_mark: | :x: | :x: | +| Manage Users and Groups | :heavy_check_mark: | :x: | :x: | +| Manage Access Tokens | :heavy_check_mark: | :x: | :x: | +| Manage User Credentials | :heavy_check_mark: | :x: | :x: | +| Manage Public Views | :heavy_check_mark: | :x: | :x: | +| View Analytics | :heavy_check_mark: | :x: | :x: | + +##### Metadata Privileges + +| Privilege | Admin | Editor | Reader | +| ------------------------------------ | ------------------ | ------------------ | ------------------ | +| View Entity Page | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| View Dataset Usage | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| View Dataset Profile | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Edit Entity | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Owners | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Docs | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Doc Links | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Status | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Assertions | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Entity Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Entity Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Dataset Column Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Dataset Column Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Dataset Column Descriptions | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Dataset Column Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Dataset Column Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| 
Edit Tag Color | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit User Profile | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Contact Info | :heavy_check_mark: | :heavy_check_mark: | :x: | + +#### Managed DataHub + +These privileges are only relevant to Managed DataHub. + +##### Platform Privileges + +| Privilege | Admin | Editor | Reader | +| ----------------------- | ------------------ | ------------------ | ------ | +| Create Constraints | :heavy_check_mark: | :heavy_check_mark: | :x: | +| View Metadata Proposals | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Tests | :heavy_check_mark: | :x: | :x: | +| Manage Global Settings | :heavy_check_mark: | :x: | :x: | + +##### Metadata Privileges + +| Privilege | Admin | Editor | Reader | +| ------------------------------------- | ------------------ | ------------------ | ------------------ | +| Propose Entity Tags | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Propose Entity Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Propose Dataset Column Tags | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Propose Dataset Column Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Edit Entity Operations | :heavy_check_mark: | :heavy_check_mark: | :x: | + +## Additional Resources + +### GraphQL + +- [acceptRole](../../graphql/mutations.md#acceptrole) +- [batchAssignRole](../../graphql/mutations.md#batchassignrole) +- [listRoles](../../graphql/queries.md#listroles) + +## FAQ and Troubleshooting + +## What updates are planned for Roles? + +In the future, the DataHub team is looking into adding the following features to Roles. + +- Defining a role mapping from OIDC identity providers to DataHub that will grant users a DataHub role based on their IdP role +- Allowing Admins to set a default role on DataHub so all users are assigned a role +- Building custom roles diff --git a/docs-website/versioned_docs/version-0.10.4/docs/browse.md b/docs-website/versioned_docs/version-0.10.4/docs/browse.md new file mode 100644 index 0000000000000..691ecb392de01 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/browse.md @@ -0,0 +1,65 @@ +--- +title: About DataHub Browse +sidebar_label: Browse +slug: /browse +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/browse.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Browse + ++ +
+ +## Using Browse + +Browse is accessible by clicking on an Entity Type on the front page of the DataHub UI. + ++ +
+ +This will take you into the folder explorer view for browse in which you can drill down to your desired sub categories to find the data you are looking for. + ++ +
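+
+The same folder-style navigation is available programmatically through the `browse` GraphQL query listed under Additional Resources below. The sketch that follows is illustrative only; the endpoint URL, authentication, and the exact shape of `BrowseInput` are assumptions to confirm against the linked GraphQL reference.
+
+```python
+# Minimal sketch: walk a browse path over the GraphQL API instead of the UI.
+# Endpoint, auth, and the BrowseInput field names are assumptions; confirm them
+# against the `browse` query in the GraphQL reference linked below.
+import requests
+
+GRAPHQL_ENDPOINT = "http://localhost:8080/api/graphql"  # assumed GraphQL endpoint
+TOKEN = "<personal-access-token>"                       # placeholder
+
+QUERY = """
+query browse($input: BrowseInput!) {
+  browse(input: $input) {
+    total
+    groups { name count }
+    entities { urn }
+  }
+}
+"""
+
+variables = {
+    "input": {
+        "type": "DATASET",              # entity type to browse
+        "path": ["prod", "snowflake"],  # drill-down path, one entry per folder
+        "start": 0,
+        "count": 25,
+    }
+}
+
+response = requests.post(
+    GRAPHQL_ENDPOINT,
+    json={"query": QUERY, "variables": variables},
+    headers={"Authorization": f"Bearer {TOKEN}"},
+)
+response.raise_for_status()
+for group in response.json()["data"]["browse"]["groups"]:
+    print(group["name"], group["count"])
+```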
+ +## Additional Resources + +### GraphQL + +- [browse](../graphql/queries.md#browse) +- [browsePaths](../graphql/queries.md#browsePaths) + +## FAQ and Troubleshooting + +**How are BrowsePaths created?** + +BrowsePaths are automatically created for ingested entities based on separator characters that appear within an Urn. + +**How can I customize browse paths?** + +BrowsePaths are an Aspect similar to other components of an Entity. They can be customized by ingesting custom paths for specified Urns. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Search](./how/search.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/browseV2/browse-paths-v2.md b/docs-website/versioned_docs/version-0.10.4/docs/browseV2/browse-paths-v2.md new file mode 100644 index 0000000000000..664b3208c9649 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/browseV2/browse-paths-v2.md @@ -0,0 +1,58 @@ +--- +title: Generating Browse Paths (V2) +slug: /browsev2/browse-paths-v2 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/browseV2/browse-paths-v2.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Generating Browse Paths (V2) + ++ +
+
+This new browse sidebar always starts with Entity Type, then optionally shows Environment (PROD, DEV, etc.) if there are 2 or more Environments, then Platform. Below the Platform level, we render out folders that come directly from an entity's [browsePathsV2](/docs/generated/metamodel/entities/dataset#browsepathsv2) aspects.
+
+## Generating Custom Browse Paths
+
+A `browsePathsV2` aspect has a field called `path` which contains a list of `BrowsePathEntry` objects. Each object in the path represents one level of the entity's browse path where the first entry is the highest level and the last entry is the lowest level.
+
+If an entity has this aspect filled out, its browse path will show up in the browse sidebar so that you can navigate its folders and select one to filter search results down.
+
+For example, in the browse sidebar on the left of the image above, there are 10 Dataset entities from the BigQuery Platform that have `browsePathsV2` aspects that look like the following:
+
+```
+[ { id: "bigquery-public-data" }, { id: "covid19_public_forecasts" } ]
+```
+
+The `id` in a `BrowsePathEntry` is required and is what will be shown in the UI unless the optional `urn` field is populated. If the `urn` field is populated, we will try to resolve this path entry into an entity object and display that entity's name. We will also show a link to allow you to open up the entity profile.
+
+The `urn` field should only be populated if there is an entity in your DataHub instance that belongs in that entity's browse path. This makes the most sense for Datasets, which can have Container entities in their browse paths, as well as some other cases such as a DataFlow being part of a DataJob's browse path. For any other situation, feel free to leave `urn` empty and populate `id` with the text you want to be shown in the UI for your entity's path.
+
+## Additional Resources
+
+### GraphQL
+
+- [browseV2](../../graphql/queries.md#browsev2)
+
+## FAQ and Troubleshooting
+
+**How are browsePathsV2 aspects created?**
+
+By default, we create `browsePathsV2` aspects for all entities that should have one when you ingest your data, if this aspect is not already provided. This happens based on separator characters that appear within an Urn.
+
+Our ingestion sources have also been producing `browsePathsV2` aspects since CLI version v0.10.5.
+
+### Related Features
+
+- [Search](../how/search.md)
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/cli.md b/docs-website/versioned_docs/version-0.10.4/docs/cli.md
new file mode 100644
index 0000000000000..d925c933e12ff
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/cli.md
@@ -0,0 +1,626 @@
+---
+toc_max_heading_level: 4
+title: DataHub CLI
+sidebar_label: CLI
+slug: /cli
+custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/cli.md"
+---
+
+# DataHub CLI
+
+DataHub comes with a friendly cli called `datahub` that allows you to perform a lot of common operations using just the command line. [Acryl Data](https://acryldata.io) maintains the [pypi package](https://pypi.org/project/acryl-datahub/) for `datahub`.
+
+## Installation
+
+### Using pip
+
+We recommend Python virtual environments (venv-s) to namespace pip modules. Here's an example setup:
+
+```shell
+python3 -m venv venv # create the environment
+source venv/bin/activate # activate the environment
+```
+
+**_NOTE:_** If you install `datahub` in a virtual environment, that same virtual environment must be re-activated each time a shell window or session is created. 
+ +Once inside the virtual environment, install `datahub` using the following commands + +```shell +# Requires Python 3.7+ +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +# validate that the install was successful +datahub version +# If you see "command not found", try running this instead: python3 -m datahub version +``` + +If you run into an error, try checking the [_common setup issues_](../metadata-ingestion/developing.md#Common-setup-issues). + +Other installation options such as installation from source and running the cli inside a container are available further below in the guide [here](#alternate-installation-options) + +## Starter Commands + +The `datahub` cli allows you to do many things, such as quick-starting a DataHub docker instance locally, ingesting metadata from your sources into a DataHub server or a DataHub lite instance, as well as retrieving, modifying and exploring metadata. +Like most command line tools, `--help` is your best friend. Use it to discover the capabilities of the cli and the different commands and sub-commands that are supported. + +```console +datahub --help +Usage: datahub [OPTIONS] COMMAND [ARGS]... + +Options: + --debug / --no-debug Enable debug logging. + --log-file FILE Enable debug logging. + --debug-vars / --no-debug-vars Show variable values in stack traces. Implies --debug. While we try to avoid + printing sensitive information like passwords, this may still happen. + --version Show the version and exit. + -dl, --detect-memory-leaks Run memory leak detection. + --help Show this message and exit. + +Commands: + actions+ +
+
+## Metadata Store
+
+The Metadata Store is responsible for storing the [Entities & Aspects](/docs/metadata-modeling/metadata-model/) comprising the Metadata Graph. This includes
+exposing an API for [ingesting metadata](/docs/metadata-service#ingesting-entities), [fetching Metadata by primary key](/docs/metadata-service#retrieving-entities), [searching entities](/docs/metadata-service#search-an-entity), and [fetching Relationships](/docs/metadata-service#get-relationships-edges) between
+entities. It consists of a Spring Java Service hosting a set of [Rest.li](https://linkedin.github.io/rest.li/) API endpoints, along with
+MySQL, Elasticsearch, & Kafka for primary storage & indexing.
+
+Get started with the Metadata Store by following the [Quickstart Guide](/docs/quickstart/).
+
+## Metadata Models
+
+Metadata Models are schemas defining the shape of the Entities & Aspects comprising the Metadata Graph, along with the relationships between them. They are defined
+using [PDL](https://linkedin.github.io/rest.li/pdl_schema), a modeling language quite similar in form to Protobuf that serializes to JSON. Entities represent a specific class of Metadata
+Asset such as a Dataset, a Dashboard, a Data Pipeline, and beyond. Each _instance_ of an Entity is identified by a unique identifier called an `urn`. Aspects represent related bundles of data attached
+to an instance of an Entity such as its descriptions, tags, and more. View the current set of Entities supported [here](/docs/metadata-modeling/metadata-model#exploring-datahubs-metadata-model).
+
+Learn more about how DataHub models metadata [here](/docs/metadata-modeling/metadata-model/).
+
+## Ingestion Framework
+
+The Ingestion Framework is a modular, extensible Python library for extracting Metadata from external source systems (e.g.
+Snowflake, Looker, MySQL, Kafka), transforming it into DataHub's [Metadata Model](/docs/metadata-modeling/metadata-model/), and writing it into DataHub via
+either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an [extensive list of Source connectors](/docs/metadata-ingestion/#installing-plugins) to choose from, along with
+a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more.
+
+Getting started with the Ingestion Framework is as simple as defining a YAML file and executing the `datahub ingest` command.
+Learn more by heading over to the [Metadata Ingestion](/docs/metadata-ingestion/) guide.
+
+## GraphQL API
+
+The [GraphQL](https://graphql.org/) API provides a strongly-typed, entity-oriented API that makes interacting with the Entities comprising the Metadata
+Graph simple, including APIs for adding and removing tags, owners, links & more to Metadata Entities! Most notably, this API is consumed by the User Interface (discussed below) for enabling Search & Discovery, Governance, Observability
+and more.
+
+To get started using the GraphQL API, check out the [Getting Started with GraphQL](/docs/api/graphql/getting-started) guide.
+
+## User Interface
+
+DataHub comes with a React UI including an ever-evolving set of features to make Discovering, Governing, & Debugging your Data Assets easy & delightful.
+For a full overview of the capabilities currently supported, take a look at the [Features](/docs/features/) overview. For a look at what's coming next,
+head over to the [Roadmap](/docs/roadmap/).
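+
+To make the write path above concrete, here is a minimal sketch of pushing a single aspect into the Metadata Store using the Python emitter that ships with `acryl-datahub`. Treat it as a sketch rather than the authoritative example: the server address and dataset name are placeholders, and the class and parameter names should be verified against the Python emitter documentation.
+
+```python
+# Minimal, illustrative sketch of writing one aspect into the Metadata Store over
+# REST with the acryl-datahub Python library. The gms_server address and the
+# dataset name below are placeholders.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import DatasetPropertiesClass
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+
+# Describe a (hypothetical) Hive table and attach a human-readable description.
+dataset_urn = make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD")
+properties = DatasetPropertiesClass(description="An example table emitted from a script.")
+
+emitter.emit(
+    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)
+)
+```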
+ +## Learn More + +Learn more about the specifics of the [DataHub Architecture](./architecture/architecture.md) in the Architecture Overview. Learn about using & developing the components +of the Platform by visiting the Module READMEs. + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email)! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/datahub_lite.md b/docs-website/versioned_docs/version-0.10.4/docs/datahub_lite.md new file mode 100644 index 0000000000000..0bc627a141868 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/datahub_lite.md @@ -0,0 +1,622 @@ +--- +title: DataHub Lite (Experimental) +sidebar_label: Lite (Experimental) +slug: /datahub_lite +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/datahub_lite.md" +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# DataHub Lite (Experimental) + +## What is it? + +DataHub Lite is a lightweight embeddable version of DataHub with no external dependencies. It is intended to enable local developer tooling use-cases such as simple access to metadata for scripts and other tools. +DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports. +It was built as a reaction to [recap](https://github.com/recap-cloud/recap) to prove that a similar lightweight system could be built within DataHub quite easily. +Currently DataHub Lite uses DuckDB under the covers as its default storage layer, but that might change in the future. + +## Features + +- Designed for the CLI +- Available as a Python library or REST API +- Ingest metadata from all DataHub ingestion sources +- Metadata Reads + - navigate metadata using a hierarchy + - get metadata for an entity + - search / query metadata across all entities +- Forward metadata automatically to a central GMS or Kafka instance + +## Architecture + +![architecture](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lite/lite_architecture.png) + +## What is it not? + +DataHub Lite is NOT meant to be a replacement for the production Java DataHub server ([datahub-gms](./architecture/metadata-serving.md)). It does not offer the full set of API-s that the DataHub GMS server does. +The following features are **NOT** supported: + +- Full-text search with ranking and relevance features +- Graph traversal of relationships (e.g. lineage) +- Metadata change stream over Kafka (only forwarding of writes is supported) +- GraphQL API + +## Prerequisites + +To use `datahub lite` commands, you need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) > 0.9.6 ([install instructions](./cli.md#using-pip)) and the `datahub-lite` plugin. + +```shell +pip install acryl-datahub[datahub-lite] +``` + +## Importing Metadata + +To ingest metadata into DataHub Lite, all you have to do is change the `sink:` in your recipe to be a `datahub-lite` instance. See the detailed sink docs [here](../metadata-ingestion/sink_docs/datahub.md#datahub-lite-experimental). +For example, here is a simple recipe file that ingests mysql metadata into datahub-lite. 
+ +```yaml +# mysql.in.dhub.yaml +source: + type: mysql + config: + host_port: localhost:3306 + username: datahub + password: datahub + +sink: + type: datahub-lite +``` + +By default, `lite` will create a local instance under `~/.datahub/lite/`. + +Ingesting metadata into DataHub Lite is as simple as running ingestion: +`datahub ingest -c mysql.in.dhub.yaml` + +:::note + +DataHub Lite currently doesn't support stateful ingestion, so you'll have to turn off stateful ingestion in your recipe to use it. This will be fixed shortly. + +::: + +### Forwarding to a central DataHub GMS over REST or Kafka + +DataHub Lite can be configured to forward all writes to a central DataHub GMS using either the REST API or the Kafka API. +To configure forwarding, add a **forward_to** section to your DataHub Lite config that conforms to a valid DataHub Sink configuration. Here is an example: + +```yaml +# mysql.in.dhub.yaml with forwarding to datahub-gms over REST +source: + type: mysql + config: + host_port: localhost:3306 + username: datahub + password: datahub + +sink: + type: datahub-lite + forward_to: + type: datahub-rest + config: + server: "http://datahub-gms:8080" +``` + +:::note + +Forwarding is currently best-effort, so there can be losses in metadata if the remote server is down. For a reliable sync mechanism, look at the [exporting metadata](#export-metadata-export) section to generate a standard metadata file. + +::: + +### Importing from a file + +As a convenient short-cut, you can import metadata from any standard DataHub metadata json file (e.g. generated via using a file sink) by issuing a _datahub lite import_ command. + +```shell +> datahub lite import --file metadata_events.json + +``` + +## Exploring Metadata + +The `datahub lite` group of commands provides a set of capabilities for you to explore the metadata you just ingested. + +### List (ls) + +Listing functions like a directory structure that is customized based on the kind of system being explored. DataHub's metadata is automatically organized into databases, tables, views, dashboards, charts, etc. + +:::note + +Using the `ls` command below is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide. + +::: + +```shell +> datahub lite ls / +databases +bi_tools +tags +# Stepping down one level +> datahub lite ls /databases +mysql +# Stepping down another level +> datahub lite ls /databases/mysql +instances +... +# At the final level +> datahub lite ls /databases/mysql/instances/default/databases/datahub/tables/ +metadata_aspect_v2 + +# Listing a leaf entity functions just like the unix ls command +> datahub lite ls /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 +metadata_aspect_v2 +``` + +### Read (get) + +Once you have located a path of interest, you can read metadata at that entity, by issuing a **get**. You can additionally filter the metadata retrieved from an entity by the aspect type of the metadata (e.g. to request the schema, filter by the **schemaMetadata** aspect). + +Aside: If you are curious what all types of entities and aspects DataHub supports, check out the metadata model of entities like [Dataset](./generated/metamodel/entities/dataset.md), [Dashboard](./generated/metamodel/entities/dashboard.md) etc. + +The general template for the get responses looks like: + +``` +{ + "urn":+ +
+ +Then navigate to the Data Products tab on the Domain's home page, and click '+ New Data Product'. +This will open a new modal where you can configure the settings for your data product. Inside the form, you can choose a name for your Data Product. Most often, this will align with the logical purpose of the Data Product, for example +'Customer Orders' or 'Revenue Attribution'. You can also add documentation for your product to help other users easily discover it. Don't worry, this can be changed later. + ++ +
+ +Once you've chosen a name and a description, click 'Create' to create the new Data Product. Once you've created the Data Product, you can click on it to continue on to the next step, adding assets to it. + +### Assigning an Asset to a Data Product (UI) + +You can assign an asset to a Data Product either using the Data Product page as the starting point or the Asset's page as the starting point. +On a Data Product page, click the 'Add Assets' button on the top right corner to add assets to the Data Product. + ++ +
+ +On an Asset's profile page, use the right sidebar to locate the Data Product section. Click 'Set Data Product', and then search for the Data Product you'd like to add this asset to. When you're done, click 'Add'. + ++ +
+ +To remove an asset from a Data Product, click the 'x' icon on the Data Product label. + +> Notice: Adding or removing an asset from a Data Product requires the `Edit Data Product` Metadata Privilege, which can be granted +> by a [Policy](authorization/policies.md). + +### Creating a Data Product (YAML + git) + +DataHub ships with a YAML-based Data Product spec for defining and managing Data Products as code. + +Here is an example of a Data Product named "Pet of the Week" which belongs to the **Marketing** domain and contains three data assets. The **Spec** tab describes the JSON Schema spec for a DataHub data product file. + ++ +
+ +First, add the DB password to kubernetes by running the following. + +``` +kubectl delete secret mysql-secrets +kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=<+ +
+ +Update the elasticsearch settings under global in the values.yaml as follows. + +``` + elasticsearch: + host: <+ +
+ +Update the kafka settings under global in the values.yaml as follows. + +``` +kafka: + bootstrap: + server: "<+ +
+ +## Step 2: Configure DataHub Container to use Confluent Cloud Topics + +### Docker Compose + +If you are deploying DataHub via docker compose, enabling connection to Confluent is a matter of a) creating topics in the Confluent Control Center and b) changing the default container environment variables. + +First, configure GMS to connect to Confluent Cloud by changing `docker/gms/env/docker.env`: + +``` +KAFKA_BOOTSTRAP_SERVER=pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 +KAFKA_SCHEMAREGISTRY_URL=https://plrm-qwlpp.us-east-2.aws.confluent.cloud + +# Confluent Cloud Configs +SPRING_KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL +SPRING_KAFKA_PROPERTIES_SASL_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required username='XFA45EL1QFUQP4PA' password='ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU'; +SPRING_KAFKA_PROPERTIES_SASL_MECHANISM=PLAIN +SPRING_KAFKA_PROPERTIES_CLIENT_DNS_LOOKUP=use_all_dns_ips +SPRING_KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO +SPRING_KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23 +``` + +Next, configure datahub-frontend to connect to Confluent Cloud by changing `docker/datahub-frontend/env/docker.env`: + +``` +KAFKA_BOOTSTRAP_SERVER=pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 + +# Confluent Cloud Configs +KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL +KAFKA_PROPERTIES_SASL_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required username='XFA45EL1QFUQP4PA' password='ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU'; +KAFKA_PROPERTIES_SASL_MECHANISM=PLAIN +KAFKA_PROPERTIES_CLIENT_DNS_LOOKUP=use_all_dns_ips +KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO +KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23 +``` + +Note that this step is only required if `DATAHUB_ANALYTICS_ENABLED` environment variable is not explicitly set to false for the datahub-frontend +container. + +If you're deploying with Docker Compose, you do not need to deploy the Zookeeper, Kafka Broker, or Schema Registry containers that ship by default. + +#### DataHub Actions + +Configuring Confluent Cloud for DataHub Actions requires some additional edits to your `executor.yaml`. Under the Kafka +source connection config you will need to add the Python style client connection information: + +```yaml +connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + consumer_config: + security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} + sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} + sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} + schema_registry_config: + basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} +``` + +Specifically `sasl.username` and `sasl.password` are the differences from the base `executor.yaml` example file. + +Additionally, you will need to set up environment variables for `KAFKA_PROPERTIES_SASL_USERNAME` and `KAFKA_PROPERTIES_SASL_PASSWORD` +which will use the same username and API Key you generated for the JAAS config. + +See [Overwriting a System Action Config](https://github.com/acryldata/datahub-actions/blob/main/docker/README.md#overwriting-a-system-action-config) for detailed reflection procedures. 
+ +Next, configure datahub-actions to connect to Confluent Cloud by changing `docker/datahub-actions/env/docker.env`: + +``` +KAFKA_BOOTSTRAP_SERVER=pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 +SCHEMA_REGISTRY_URL=https://plrm-qwlpp.us-east-2.aws.confluent.cloud + +# Confluent Cloud Configs +KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL +KAFKA_PROPERTIES_SASL_USERNAME=XFA45EL1QFUQP4PA +KAFKA_PROPERTIES_SASL_PASSWORD=ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU +KAFKA_PROPERTIES_SASL_MECHANISM=PLAIN +KAFKA_PROPERTIES_CLIENT_DNS_LOOKUP=use_all_dns_ips +KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO +KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23 +``` + +### Helm + +If you're deploying on K8s using Helm, you can simply change the **datahub-helm** `values.yml` to point to Confluent Cloud and disable some default containers: + +First, disable the `cp-schema-registry` service: + +``` +cp-schema-registry: + enabled: false +``` + +Next, disable the `kafkaSetupJob` service: + +``` +kafkaSetupJob: + enabled: false +``` + +Then, update the `kafka` configurations to point to your Confluent Cloud broker and schema registry instance, along with the topics you've created in Step 1: + +``` +kafka: + bootstrap: + server: pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 + schemaregistry: + url: https://plrm-qwlpp.us-east-2.aws.confluent.cloud +``` + +Next, you'll want to create 2 new Kubernetes secrets, one for the JaaS configuration which contains the username and password for Confluent, +and another for the user info used for connecting to the schema registry. You'll find the values for each within the Confluent Control Center. Specifically, +select "Clients" -> "Configure new Java Client". You should see a page like the following: + ++ +
+ +You'll want to generate both a Kafka Cluster API Key & a Schema Registry key. Once you do so,you should see the config +automatically populate with your new secrets: + ++ +
+ +You'll need to copy the values of `sasl.jaas.config` and `basic.auth.user.info` +for the next step. + +The next step is to create K8s secrets containing the config values you've just generated. Specifically, you can run the following commands: + +```shell +kubectl create secret generic confluent-secrets --from-literal=sasl_jaas_config="+ +
+ +Tick the checkbox for datahub-datahub-frontend and click "CREATE INGRESS" button. You should land on the following page. + ++ +
+ +Type in an arbitrary name for the ingress and click on the second step "Host and path rules". You should land on the +following page. + ++ +
+ +Select "datahub-datahub-frontend" in the dropdown menu for backends, and then click on "ADD HOST AND PATH RULE" button. +In the second row that got created, add in the host name of choice (here gcp.datahubproject.io) and select +"datahub-datahub-frontend" in the backends dropdown. + +This step adds the rule allowing requests from the host name of choice to get routed to datahub-frontend service. Click +on step 3 "Frontend configuration". You should land on the following page. + ++ +
+ +Choose HTTPS in the dropdown menu for protocol. To enable SSL, you need to add a certificate. If you do not have one, +you can click "CREATE A NEW CERTIFICATE" and input the host name of choice. GCP will create a certificate for you. + +Now press "CREATE" button on the left to create ingress! After around 5 minutes, you should see the following. + ++ +
+ +In your domain provider, add an A record for the host name set above using the IP address on the ingress page (noted +with the red box). Once DNS updates, you should be able to access DataHub through the host name!! + +Note, ignore the warning icon next to ingress. It takes about ten minutes for ingress to check that the backend service +is ready and show a check mark as follows. However, ingress is fully functional once you see the above page. + ++ +
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/kubernetes.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/kubernetes.md new file mode 100644 index 0000000000000..fb4bc086a7fee --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/kubernetes.md @@ -0,0 +1,158 @@ +--- +title: Deploying with Kubernetes +sidebar_label: Deploying with Kubernetes +slug: /deploy/kubernetes +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/deploy/kubernetes.md +--- + +# Deploying DataHub with Kubernetes + +## Introduction + +Helm charts for deploying DataHub on a kubernetes cluster is located in +this [repository](https://github.com/acryldata/datahub-helm). We provide charts for +deploying [Datahub](https://github.com/acryldata/datahub-helm/tree/master/charts/datahub) and +it's [dependencies](https://github.com/acryldata/datahub-helm/tree/master/charts/prerequisites) +(Elasticsearch, optionally Neo4j, MySQL, and Kafka) on a Kubernetes cluster. + +This doc is a guide to deploy an instance of DataHub on a kubernetes cluster using the above charts from scratch. + +## Setup + +1. Set up a kubernetes cluster + - In a cloud platform of choice like [Amazon EKS](https://aws.amazon.com/eks), + [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine), + and [Azure Kubernetes Service](https://azure.microsoft.com/en-us/services/kubernetes-service/) OR + - In local environment using [Minikube](https://minikube.sigs.k8s.io/docs/). Note, more than 7GB of RAM is required + to run Datahub and it's dependencies +2. Install the following tools: + - [kubectl](https://kubernetes.io/docs/tasks/tools/) to manage kubernetes resources + - [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support + Helm 3. + +## Components + +Datahub consists of 4 main components: [GMS](/docs/metadata-service), +[MAE Consumer](/docs/metadata-jobs/mae-consumer-job) (optional), +[MCE Consumer](/docs/metadata-jobs/mce-consumer-job) (optional), and +[Frontend](/docs/datahub-frontend). Kubernetes deployment for each of the components are +defined as subcharts under the main +[Datahub](https://github.com/acryldata/datahub-helm/tree/master/charts/datahub) +helm chart. + +The main components are powered by 4 external dependencies: + +- Kafka +- Local DB (MySQL, Postgres, MariaDB) +- Search Index (Elasticsearch) +- Graph Index (Supports either Neo4j or Elasticsearch) + +The dependencies must be deployed before deploying Datahub. We created a separate +[chart](https://github.com/acryldata/datahub-helm/tree/master/charts/prerequisites) +for deploying the dependencies with example configuration. They could also be deployed separately on-prem or leveraged +as managed services. To remove your dependency on Neo4j, set enabled to false in +the [values.yaml](https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/values.yaml#L54) for +prerequisites. Then, override the `graph_service_impl` field in +the [values.yaml](https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L63) of datahub +instead of `neo4j`. + +## Quickstart + +Assuming kubectl context points to the correct kubernetes cluster, first create kubernetes secrets that contain MySQL +and Neo4j passwords. 
+ +```(shell) +kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=datahub +kubectl create secret generic neo4j-secrets --from-literal=neo4j-password=datahub +``` + +The above commands sets the passwords to "datahub" as an example. Change to any password of choice. + +Add datahub helm repo by running the following + +```(shell) +helm repo add datahub https://helm.datahubproject.io/ +``` + +Then, deploy the dependencies by running the following + +```(shell) +helm install prerequisites datahub/datahub-prerequisites +``` + +Note, the above uses the default configuration +defined [here](https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/values.yaml). You can change +any of the configuration and deploy by running the following command. + +```(shell) +helm install prerequisites datahub/datahub-prerequisites --values <+ +
+ +## Change Event + +Each modification is modeled as a +[ChangeEvent](https://github.com/datahub-project/datahub/blob/master/metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeEvent.java) +which are grouped under [ChangeTransactions](https://github.com/datahub-project/datahub/blob/master/metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeTransaction.java) +based on timestamp. A `ChangeEvent` consists of: + +- `changeType`: An operational type for the change, either `ADD`, `MODIFY`, or `REMOVE` +- `semVerChange`: A [semver](https://semver.org/) change type based on the compatibility of the change. This gets utilized in the computation of the transaction level version. Options are `NONE`, `PATCH`, `MINOR`, `MAJOR`, and `EXCEPTIONAL` for cases where an exception occurred during processing, but we do not fail the entire change calculation +- `target`: The high level target of the change. This is usually an `urn`, but can differ depending on the type of change. +- `category`: The category a change falls under, specific aspects are mapped to each category depending on the entity +- `elementId`: Optional, the ID of the element being applied to the target +- `description`: A human readable description of the change produced by the `Differ` type computing the diff +- `changeDetails`: A loose property map of additional details about the change + +### Change Event Examples + +- A tag was applied to a _field_ of a dataset through the UI: + - `changeType`: `ADD` + - `target`: `urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:+ +
++ +
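+
+To make the structure described above concrete, here is a small, self-contained sketch (not DataHub source code) that models a `ChangeEvent` with the fields listed earlier and derives a transaction-level bump from the `semVerChange` values of a transaction's member events.
+
+```python
+# Illustrative sketch only: models the ChangeEvent fields documented above and
+# shows one way a transaction-level version bump could be derived from the
+# semVerChange of its member events. This is not the DataHub implementation.
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional
+
+# Ordered from least to most significant; treating EXCEPTIONAL as most severe is
+# an assumption made for this sketch.
+_SEVERITY = ["NONE", "PATCH", "MINOR", "MAJOR", "EXCEPTIONAL"]
+
+
+@dataclass
+class ChangeEvent:
+    changeType: str                    # ADD, MODIFY, or REMOVE
+    semVerChange: str                  # NONE, PATCH, MINOR, MAJOR, or EXCEPTIONAL
+    target: str                        # usually an urn
+    category: str                      # e.g. TAG, TECHNICAL_SCHEMA, ...
+    description: str = ""
+    elementId: Optional[str] = None
+    changeDetails: Dict[str, str] = field(default_factory=dict)
+
+
+def transaction_semver_change(events: List[ChangeEvent]) -> str:
+    """Return the most significant semVerChange across a transaction's events."""
+    if not events:
+        return "NONE"
+    return max(events, key=lambda e: _SEVERITY.index(e.semVerChange)).semVerChange
+
+
+if __name__ == "__main__":
+    events = [
+        ChangeEvent("ADD", "MINOR", "urn:li:dataset:(...)", "TECHNICAL_SCHEMA",
+                    description="A new nullable field was added"),
+        ChangeEvent("ADD", "NONE", "urn:li:dataset:(...)", "TAG",
+                    description="A tag was applied"),
+    ]
+    print(transaction_semver_change(events))  # -> MINOR
+```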
+ +# Future Work + +- Supporting versions as start and end parameters as part of the call to the timeline API +- Supporting entities beyond Datasets +- Adding GraphQL API support +- Supporting materialization of computed versions for entity categories (compared to the current read-time version computation) +- Support in the UI to visualize the timeline in various places (e.g. schema history, etc.) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/developers.md b/docs-website/versioned_docs/version-0.10.4/docs/developers.md new file mode 100644 index 0000000000000..bbb4ea5ed0df0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/developers.md @@ -0,0 +1,165 @@ +--- +title: Local Development +sidebar_label: Local Development +slug: /developers +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/developers.md" +--- + +# DataHub Developer's Guide + +## Pre-requirements + +- [Java 11 SDK](https://openjdk.org/projects/jdk/11/) +- [Docker](https://www.docker.com/) +- [Docker Compose](https://docs.docker.com/compose/) +- Docker engine with at least 8GB of memory to run tests. + +:::note + +Do not try to use a JDK newer than JDK 11. The build process does not work with newer JDKs currently. + +::: + +## Building the Project + +Fork and clone the repository if haven't done so already + +``` +git clone https://github.com/{username}/datahub.git +``` + +Change into the repository's root directory + +``` +cd datahub +``` + +Use [gradle wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.html) to build the project + +``` +./gradlew build +``` + +Note that the above will also run run tests and a number of validations which makes the process considerably slower. + +We suggest partially compiling DataHub according to your needs: + +- Build Datahub's backend GMS (Generalized metadata service): + +``` +./gradlew :metadata-service:war:build +``` + +- Build Datahub's frontend: + +``` +./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint +``` + +- Build DataHub's command line tool: + +``` +./gradlew :metadata-ingestion:installDev +``` + +- Build DataHub's documentation: + +``` +./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript +# To preview the documentation +./gradlew :docs-website:serve +``` + +## Deploying local versions + +Run just once to have the local `datahub` cli tool installed in your $PATH + +``` +cd smoke-test/ +python3 -m venv venv +source venv/bin/activate +pip install --upgrade pip wheel setuptools +pip install -r requirements.txt +cd ../ +``` + +Once you have compiled & packaged the project or appropriate module you can deploy the entire system via docker-compose by running: + +``` +./gradlew quickstart +``` + +Replace whatever container you want in the existing deployment. 
+For example, to replace DataHub's backend (GMS):
+
+```
+(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate --build datahub-gms)
+```
+
+To run the local version of the frontend:
+
+```
+(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate --build datahub-frontend-react)
+```
+
+## IDE Support
+
+The recommended IDE for DataHub development is [IntelliJ IDEA](https://www.jetbrains.com/idea/).
+You can run the following command to generate or update the IntelliJ project file:
+
+```
+./gradlew idea
+```
+
+Open `datahub.ipr` in IntelliJ to start developing!
+
+For consistency, please import and auto-format the code using [LinkedIn IntelliJ Java style](https://github.com/datahub-project/datahub/blob/master/gradle/idea/LinkedIn%20Style.xml).
+
+## Windows Compatibility
+
+For optimal performance and compatibility, we strongly recommend building on a Mac or Linux system.
+Please note that we do not actively support Windows in a non-virtualized environment.
+
+If you must use Windows, one workaround is to build within a virtualized environment, such as a VM (Virtual Machine) or [WSL (Windows Subsystem for Linux)](https://learn.microsoft.com/en-us/windows/wsl).
+This approach can help ensure that your build environment remains isolated and stable, and that your code is compiled correctly.
+
+## Common Build Issues
+
+### Getting `Unsupported class file major version 57`
+
+You're probably using a Java version that's too new for Gradle. Run the following command to check your Java version:
+
+```
+java --version
+```
+
+While it may be possible to build and run DataHub using newer versions of Java, we currently only support [Java 11](https://openjdk.org/projects/jdk/11/).
+
+### Getting `cannot find symbol` error for `javax.annotation.Generated`
+
+Similar to the previous issue, please use Java 11 to build the project.
+You can install multiple versions of Java on a single machine and switch between them using the `JAVA_HOME` environment variable. See [this document](https://docs.oracle.com/cd/E21454_01/html/821-2531/inst_jdk_javahome_t.html) for more details.
+
+### `:metadata-models:generateDataTemplate` task fails with `java.nio.file.InvalidPathException: Illegal char <:> at index XX` or `Caused by: java.lang.IllegalArgumentException: 'other' has different root` error
+
+This is a [known issue](https://github.com/linkedin/rest.li/issues/287) when building the project on Windows due to a bug in the Pegasus plugin. Please refer to [Windows Compatibility](/docs/developers.md#windows-compatibility).
+
+### Various errors related to `generateDataTemplate` or other `generate` tasks
+
+As we generate quite a few files from the models, it is possible that old generated files may conflict with new model changes. When this happens, a simple `./gradlew clean` should resolve the issue.
+
+### `Execution failed for task ':metadata-service:restli-servlet-impl:checkRestModel'`
+
+This generally means that an [incompatible change](https://linkedin.github.io/rest.li/modeling/compatibility_check) was introduced to the rest.li API in GMS. 
You'll need to rebuild the snapshots/IDL by running the following command once + +``` +./gradlew :metadata-service:restli-servlet-impl:build -Prest.model.compatibility=ignore +``` + +### `java.io.IOException: No space left on device` + +This means you're running out of space on your disk to build. Please free up some space or try a different disk. + +### `Build failed` for task `./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint` + +This could mean that you need to update your [Yarn](https://yarnpkg.com/getting-started/install) version diff --git a/docs-website/versioned_docs/version-0.10.4/docs/docker/development.md b/docs-website/versioned_docs/version-0.10.4/docs/docker/development.md new file mode 100644 index 0000000000000..22bd9510c409b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/docker/development.md @@ -0,0 +1,146 @@ +--- +title: Using Docker Images During Development +slug: /docker/development +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/docker/development.md +--- + +# Using Docker Images During Development + +We've created a special `docker-compose.dev.yml` override file that should configure docker images to be easier to use +during development. + +Normally, you'd rebuild your images from scratch with a combination of gradle and docker compose commands. However, +this takes way too long for development and requires reasoning about several layers of docker compose configuration +yaml files which can depend on your hardware (Apple M1). + +The `docker-compose.dev.yml` file bypasses the need to rebuild docker images by mounting binaries, startup scripts, +and other data. These dev images, tagged with `debug` will use your _locally built code_ with gradle. +Building locally and bypassing the need to rebuild the Docker images should be much faster. + +We highly recommend you just invoke `./gradlew quickstartDebug` task. + +```shell +./gradlew quickstartDebug +``` + +This task is defined in `docker/build.gradle` and executes the following steps: + +1. Builds all required artifacts to run DataHub. This includes both application code such as the GMS war, the frontend + distribution zip which contains javascript, as wel as secondary support docker containers. + +1. Locally builds Docker images with the expected `debug` tag required by the docker compose files. + +1. Runs the special `docker-compose.dev.yml` and supporting docker-compose files to mount local files directly in the + containers with remote debugging ports enabled. + +Once the `debug` docker images are constructed you'll see images similar to the following: + +```shell +linkedin/datahub-frontend-react debug e52fef698025 28 minutes ago 763MB +linkedin/datahub-kafka-setup debug 3375aaa2b12d 55 minutes ago 659MB +linkedin/datahub-gms debug ea2b0a8ea115 56 minutes ago 408MB +acryldata/datahub-upgrade debug 322377a7a21d 56 minutes ago 463MB +acryldata/datahub-mysql-setup debug 17768edcc3e5 2 hours ago 58.2MB +linkedin/datahub-elasticsearch-setup debug 4d935be7c62c 2 hours ago 26.1MB +``` + +At this point it is possible to view the DataHub UI at `http://localhost:9002` as you normally would with quickstart. + +## Reloading + +Next, perform the desired modifications and rebuild the frontend and/or GMS components. + +**Builds GMS** + +```shell +./gradlew :metadata-service:war:build +``` + +**Builds the frontend** + +Including javascript components. 
+ +```shell +./gradlew :datahub-frontend:build +``` + +After building the artifacts only a restart of the container(s) is required to run with the updated code. +The restart can be performed using a docker UI, the docker cli, or the following gradle task. + +```shell +./gradlew :docker:debugReload +``` + +## Start/Stop + +The following commands can pause the debugging environment to release resources when not needed. + +Pause containers and free resources. + +```shell +docker compose -p datahub stop +``` + +Resume containers for further debugging. + +```shell +docker compose -p datahub start +``` + +## Debugging + +The default debugging process uses your local code and enables debugging by default for both GMS and the frontend. Attach +to the instance using your IDE by using its Remote Java Debugging features. + +Environment variables control the debugging ports for GMS and the frontend. + +- `DATAHUB_MAPPED_GMS_DEBUG_PORT` - Default: 5001 +- `DATAHUB_MAPPED_FRONTEND_DEBUG_PORT` - Default: 5002 + +### IntelliJ Remote Debug Configuration + +The screenshot shows an example configuration for IntelliJ using the default GMS debugging port of 5001. + ++ +
+ +## Tips for People New To Docker + +### Accessing Logs + +It is highly recommended you use [Docker Desktop's dashboard](https://www.docker.com/products/docker-desktop) to access service logs. If you double click an image it will pull up the logs for you. + +### Quickstart Conflicts + +If you run quickstart, use `./gradlew quickstartDebug` to return to using the debugging containers. + +### Docker Prune + +If you run into disk space issues and prune the images & containers you will need to execute the `./gradlew quickstartDebug` +again. + +### System Update + +The `datahub-upgrade` job will not block the startup of the other containers as it normally +does in a quickstart or production environment. Normally this is process is required when making updates which +require Elasticsearch reindexing. If reindexing is required, the UI will render but may temporarily return errors +until this job finishes. + +### Running a specific service + +`docker-compose up` will launch all services in the configuration, including dependencies, unless they're already +running. If you, for some reason, wish to change this behavior, check out these example commands. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up datahub-gms +``` + +Will only start `datahub-gms` and its dependencies. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up --no-deps datahub-gms +``` + +Will only start `datahub-gms`, without dependencies. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/domains.md b/docs-website/versioned_docs/version-0.10.4/docs/domains.md new file mode 100644 index 0000000000000..9a707588e6c8a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/domains.md @@ -0,0 +1,263 @@ +--- +title: About DataHub Domains +sidebar_label: Domains +slug: /domains +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/domains.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Domains + ++ +
+ +Once you're on the Domains page, you'll see a list of all the Domains that have been created on DataHub. Additionally, you can +view the number of entities inside each Domain. + ++ +
+ +To create a new Domain, click '+ New Domain'. + ++ +
+ +Inside the form, you can choose a name for your Domain. Most often, this will align with your business units or groups, for example +'Platform Engineering' or 'Social Marketing'. You can also add an optional description. Don't worry, this can be changed later. + +#### Advanced: Setting a Custom Domain id + +Click on 'Advanced' to show the option to set a custom Domain id. The Domain id determines what will appear in the DataHub 'urn' (primary key) +for the Domain. This option is useful if you intend to refer to Domains by a common name inside your code, or you want the primary +key to be human-readable. Proceed with caution: once you select a custom id, it cannot be easily changed. + ++ +
+ +By default, you don't need to worry about this. DataHub will auto-generate a unique Domain id for you. + +Once you've chosen a name and a description, click 'Create' to create the new Domain. + +### Assigning an Asset to a Domain + +You can assign assets to Domain using the UI or programmatically using the API or during ingestion. + +#### UI-Based Assignment + +To assign an asset to a Domain, simply navigate to the asset's profile page. At the bottom left-side menu bar, you'll +see a 'Domain' section. Click 'Set Domain', and then search for the Domain you'd like to add to. When you're done, click 'Add'. + ++ +
+ +To remove an asset from a Domain, click the 'x' icon on the Domain tag. + +> Notice: Adding or removing an asset from a Domain requires the `Edit Domain` Metadata Privilege, which can be granted +> by a [Policy](authorization/policies.md). + +#### Ingestion-time Assignment + +All SQL-based ingestion sources support assigning domains during ingestion using the `domain` configuration. Consult your source's configuration details page (e.g. [Snowflake](./generated/ingestion/sources/snowflake.md)), to verify that it supports the Domain capability. + +:::note + +Assignment of domains during ingestion will overwrite domains that you have assigned in the UI. A single table can only belong to one domain. + +::: + +Here is a quick example of a snowflake ingestion recipe that has been enhanced to attach the **Analytics** domain to all tables in the **long_tail_companions** database in the **analytics** schema, and the **Finance** domain to all tables in the **long_tail_companions** database in the **ecommerce** schema. + +```yaml +source: + type: snowflake + config: + username: ${SNOW_USER} + password: ${SNOW_PASS} + account_id: + warehouse: COMPUTE_WH + role: accountadmin + database_pattern: + allow: + - "long_tail_companions" + schema_pattern: + deny: + - information_schema + profiling: + enabled: False + domain: + Analytics: + allow: + - "long_tail_companions.analytics.*" + Finance: + allow: + - "long_tail_companions.ecommerce.*" +``` + +:::note + +When bare domain names like `Analytics` is used, the ingestion system will first check if a domain like `urn:li:domain:Analytics` is provisioned, failing that; it will check for a provisioned domain that has the same name. If we are unable to resolve bare domain names to provisioned domains, then ingestion will refuse to proceeed until the domain is provisioned on DataHub. + +::: + +You can also provide fully-qualified domain names to ensure that no ingestion-time domain resolution is needed. For example, the following recipe shows an example using fully qualified domain names: + +```yaml +source: + type: snowflake + config: + username: ${SNOW_USER} + password: ${SNOW_PASS} + account_id: + warehouse: COMPUTE_WH + role: accountadmin + database_pattern: + allow: + - "long_tail_companions" + schema_pattern: + deny: + - information_schema + profiling: + enabled: False + domain: + "urn:li:domain:6289fccc-4af2-4cbb-96ed-051e7d1de93c": + allow: + - "long_tail_companions.analytics.*" + "urn:li:domain:07155b15-cee6-4fda-b1c1-5a19a6b74c3a": + allow: + - "long_tail_companions.ecommerce.*" +``` + +### Searching by Domain + +Once you've created a Domain, you can use the search bar to find it. + ++ +
+ +Clicking on the search result will take you to the Domain's profile, where you +can edit its description, add / remove owners, and view the assets inside the Domain. + ++ +
+ +Once you've added assets to a Domain, you can filter search results to limit to those Assets +within a particular Domain using the left-side search filters. + ++ +
+ +On the homepage, you'll also find a list of the most popular Domains in your organization. + ++ +
+ +## Additional Resources + +### Videos + +**Supercharge Data Mesh with Domains in DataHub** + ++ +
+ +### GraphQL + +- [domain](../graphql/queries.md#domain) +- [listDomains](../graphql/queries.md#listdomains) +- [createDomains](../graphql/mutations.md#createdomain) +- [setDomain](../graphql/mutations.md#setdomain) +- [unsetDomain](../graphql/mutations.md#unsetdomain) + +#### Examples + +**Creating a Domain** + +```graphql +mutation createDomain { + createDomain( + input: { name: "My New Domain", description: "An optional description" } + ) +} +``` + +This query will return an `urn` which you can use to fetch the Domain details. + +**Fetching a Domain by Urn** + +```graphql +query getDomain { + domain(urn: "urn:li:domain:engineering") { + urn + properties { + name + description + } + entities { + total + } + } +} +``` + +**Adding a Dataset to a Domain** + +```graphql +mutation setDomain { + setDomain( + entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" + domainUrn: "urn:li:domain:engineering" + ) +} +``` + +> Pro Tip! You can try out the sample queries by visiting `+ +
+ +### **Trace End-to-End Lineage** + +Quickly understand the end-to-end journey of data by tracing lineage across platforms, datasets, ETL/ELT pipelines, charts, dashboards, and beyond. + ++ +
+ +### **Understand the Impact of Breaking Changes on Downstream Dependencies** + +Proactively identify which entities may be impacted by a breaking change using Impact Analysis. + ++ +
+ +### **View Metadata 360 at a Glance** + +Combine _technical_ and _logical_ metadata to provide a 360º view of your data entities. + +Generate **Dataset Stats** to understand the shape & distribution of the data + ++ +
+ +Capture historical **Data Validation Outcomes** from tools like Great Expectations + ++ +
+ +Leverage DataHub's **Schema Version History** to track changes to the physical structure of data over time + ++ +
+ +--- + +## Modern Data Governance + +### **Govern in Real Time** + +[The Actions Framework](./actions/README.md) powers the following real-time use cases: + +- **Notifications:** Generate organization-specific notifications when a change is made on DataHub. For example, send an email to the governance team when a "PII" tag is added to any data asset. +- **Workflow Integration:** Integrate DataHub into your organization's internal workflows. For example, create a Jira ticket when specific Tags or Terms are proposed on a Dataset. +- **Synchronization:** Sync changes made in DataHub into a 3rd party system. For example, reflect Tag additions in DataHub into Snowflake. +- **Auditing:** Audit who is making what changes on DataHub through time. + ++ +
+ +### **Manage Entity Ownership** + +Quickly and easily assign entity ownership to users and user groups. + ++ +
+ +### **Govern with Tags, Glossary Terms, and Domains** + +Empower data owners to govern their data entities with: + +1. **Tags:** Informal, loosely controlled labels that serve as a tool for search & discovery. No formal, central management. +2. **Glossary Terms:** A controlled vocabulary with optional hierarchy, commonly used to describe core business concepts and measurements. +3. **Domains:** Curated, top-level folders or categories, widely used in Data Mesh to organize entities by department (i.e., Finance, Marketing) or Data Products. + ++ +
+ +--- + +## DataHub Administration + +### **Create Users, Groups, & Access Policies** + +DataHub admins can create Policies to define who can perform what action against which resource(s). When you create a new Policy, you will be able to define the following: + +- **Policy Type** - Platform (top-level DataHub Platform privileges, i.e., managing users, groups, and policies) or Metadata (ability to manipulate ownership, tags, documentation, and more) +- **Resource Type** - Specify the type of resources, such as Datasets, Dashboards, Pipelines, and beyond +- **Privileges** - Choose the set of permissions, such as Edit Owners, Edit Documentation, Edit Links +- **Users and/or Groups** - Assign relevant Users and Groups; you can also assign the Policy to Resource Owners, regardless of which Group they belong + ++ +
+ +### **Ingest Metadata from the UI** + +Create, configure, schedule, & execute batch metadata ingestion using the DataHub user interface. This makes getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines. + ++ +
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/features/dataset-usage-and-query-history.md b/docs-website/versioned_docs/version-0.10.4/docs/features/dataset-usage-and-query-history.md new file mode 100644 index 0000000000000..9a8b70912f4c7 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/features/dataset-usage-and-query-history.md @@ -0,0 +1,92 @@ +--- +title: About DataHub Dataset Usage & Query History +sidebar_label: Dataset Usage & Query History +slug: /features/dataset-usage-and-query-history +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/features/dataset-usage-and-query-history.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Dataset Usage & Query History + ++ +
+ +Some sources require a separate, usage-specific recipe to ingest Usage and Query History metadata. In this case, it is noted in the capabilities summary, like so: + ++ +
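As a rough sketch of what such a usage-specific recipe looks like, the example below uses `clickhouse-usage`; the exact config keys differ per source, so treat the fields (endpoint, credentials, `email_domain`) as assumptions and consult the source's own page.

```yaml
# usage-only recipe sketch; run it alongside the source's regular metadata recipe
source:
  type: clickhouse-usage
  config:
    host_port: "localhost:8123"        # assumed ClickHouse endpoint
    username: datahub
    password: "${CLICKHOUSE_PASSWORD}"
    email_domain: example.com          # assumed: used to map query users to DataHub users
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"    # assumed local DataHub GMS endpoint
```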
+ +Please always check the source's usage prerequisites page, if it has one, as you may need to add additional +permissions that are only required for usage extraction. + +## Using Dataset Usage & Query History + +After successful ingestion, the Queries and Stats tabs will be enabled on datasets with any usage.
+ +On the Queries tab, you can see the top 5 most frequently run queries that referenced this dataset.
+ +On the Stats tab, you can see the top 5 users who ran the most queries referencing this dataset.
+ +With the collected usage data, you can even see column-level usage statistics (the Redshift usage source doesn't support this yet):
+ +## Additional Resources + +### Videos + +**DataHub 101: Data Profiling and Usage Stats 101** + ++ +
+ +### GraphQL + +-+ +
+ +You can view the necessary endpoints to configure by clicking on the Endpoints button in the Overview tab. + ++ +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.identity.azure_ad.AzureADSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/identity/azure_ad.py) + ++ +
+ +There are two important concepts to understand and identify: + +- _Extractor Project_: This is the project associated with a service-account, whose credentials you will be configuring in the connector. The connector uses this service-account to run jobs (including queries) within the project. +- _BigQuery Projects_: These are the projects from which table metadata, lineage, usage, and profiling data need to be collected. By default, the extractor project is included in the list of projects that DataHub collects metadata from, but you can control that by passing in a specific list of project ids that you want to collect metadata from. Read the configuration section below to understand how to limit the list of projects that DataHub extracts metadata from. + +#### Create a datahub profile in GCP + +1. Create a custom role for datahub as per [BigQuery docs](https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role). +2. Follow the sections below to grant permissions to this role on this project and other projects. + +##### Basic Requirements (needed for metadata ingestion) + +1. Identify your Extractor Project where the service account will run queries to extract metadata. + +| Permission | Description | Capability | +| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ---------- | +| `bigquery.jobs.create` | Run jobs (e.g. queries) within the project. _This is only required for the extractor project where the service account belongs._ | | +| `bigquery.jobs.list` | Manage the queries that the service account has sent. _This is only required for the extractor project where the service account belongs._ | | +| `bigquery.readsessions.create` | Create a session for streaming large results. _This is only required for the extractor project where the service account belongs._ | | +| `bigquery.readsessions.getData` | Get data from the read session. _This is only required for the extractor project where the service account belongs._ | | + +2. Grant the following permissions to the Service Account on every project from which you would like to extract metadata. + +:::info + +If you have multiple projects in your BigQuery setup, the role should be granted these permissions in each of the projects. + +::: +| Permission | Description | Capability | Default GCP role which contains this permission | +|----------------------------------|--------------------------------------------------------------------------------------------------------------|-------------------------------------|-----------------------------------------------------------------------------------------------------------------| +| `bigquery.datasets.get` | Retrieve metadata about a dataset. | Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.datasets.getIamPolicy` | Read a dataset's IAM permissions. | Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.tables.list` | List BigQuery tables. | Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.tables.get` | Retrieve metadata for a table. 
| Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.routines.get` | Get Routines. Needs to retrieve metadata for a table from system table. | Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.routines.list` | List Routines. Needs to retrieve metadata for a table from system table | Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `resourcemanager.projects.get` | Retrieve project names and metadata. | Table Metadata Extraction | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.jobs.listAll` | List all jobs (queries) submitted by any user. Needs for Lineage extraction. | Lineage Extraction/Usage extraction | [roles/bigquery.resourceViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.resourceViewer) | +| `logging.logEntries.list` | Fetch log entries for lineage/usage data. Not required if `use_exported_bigquery_audit_metadata` is enabled. | Lineage Extraction/Usage extraction | [roles/logging.privateLogViewer](https://cloud.google.com/logging/docs/access-control#logging.privateLogViewer) | +| `logging.privateLogEntries.list` | Fetch log entries for lineage/usage data. Not required if `use_exported_bigquery_audit_metadata` is enabled. | Lineage Extraction/Usage extraction | [roles/logging.privateLogViewer](https://cloud.google.com/logging/docs/access-control#logging.privateLogViewer) | +| `bigquery.tables.getData` | Access table data to extract storage size, last updated at, data profiles etc. | Profiling | | + +#### Create a service account in the Extractor Project + +1. Setup a ServiceAccount as per [BigQuery docs](https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console) + and assign the previously created role to this service account. +2. Download a service account JSON keyfile. + Example credential file: + +```json +{ + "type": "service_account", + "project_id": "project-id-1234567", + "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0", + "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----", + "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com", + "client_id": "113545814931671546333", + "auth_uri": "https://accounts.google.com/o/oauth2/auth", + "token_uri": "https://oauth2.googleapis.com/token", + "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", + "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com" +} +``` + +3. To provide credentials to the source, you can either: + + Set an environment variable: + + ```sh + $ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json" + ``` + + _or_ + + Set credential config in your source based on the credential json file. 
For example: + + ```yml + credential: + project_id: project-id-1234567 + private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0" + private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n" + client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com" + client_id: "123456678890" + ``` + +### Lineage Computation Details + +When `use_exported_bigquery_audit_metadata` is set to `true`, lineage information will be computed using exported bigquery logs. On how to setup exported bigquery audit logs, refer to the following [docs](https://cloud.google.com/bigquery/docs/reference/auditlogs#defining_a_bigquery_log_sink_using_gcloud) on BigQuery audit logs. Note that only protoPayloads with "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" are supported by the current ingestion version. The `bigquery_audit_metadata_datasets` parameter will be used only if `use_exported_bigquery_audit_metadat` is set to `true`. + +Note: the `bigquery_audit_metadata_datasets` parameter receives a list of datasets, in the format $PROJECT.$DATASET. This way queries from a multiple number of projects can be used to compute lineage information. + +Note: Since bigquery source also supports dataset level lineage, the auth client will require additional permissions to be able to access the google audit logs. Refer the permissions section in bigquery-usage section below which also accesses the audit logs. + +### Profiling Details + +For performance reasons, we only profile the latest partition for partitioned tables and the latest shard for sharded tables. +You can set partition explicitly with `partition.partition_datetime` property if you want, though note that partition config will be applied to all partitioned tables. + +### Caveats + +- For materialized views, lineage is dependent on logs being retained. If your GCP logging is retained for 30 days (default) and 30 days have passed since the creation of the materialized view we won't be able to get lineage for them. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[bigquery]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: bigquery + config: + # `schema_pattern` for BQ Datasets + schema_pattern: + allow: + - finance_bq_dataset + table_pattern: + deny: + # The exact name of the table is revenue_table_name + # The reason we have this `.*` at the beginning is because the current implmenetation of table_pattern is testing + # project_id.dataset_name.table_name + # We will improve this in the future + - .*revenue_table_name + include_table_lineage: true + include_usage_statistics: true + profiling: + enabled: true + profile_table_level_only: true + +sink: + # sink configs +``` + +### Config Details + +Source Module | Documentation |
+ +`clickhouse` + + | ++ +This plugin extracts the following: + +- Metadata for tables, views, materialized views and dictionaries +- Column types associated with each table(except \*AggregateFunction and DateTime with timezone) +- Table, row, and column statistics via optional SQL profiling. +- Table, view, materialized view and dictionary(with CLICKHOUSE source_type) lineage + +:::tip + +You can also get fine-grained usage statistics for ClickHouse using the `clickhouse-usage` source described below. + +::: + +[Read more...](#module-clickhouse) + + | +
+ +`clickhouse-usage` + + | ++ +This plugin has the below functionalities - + +1. For a specific dataset this plugin ingests the following statistics - + 1. top n queries. + 2. top users. + 3. usage of each column in the dataset. +2. Aggregation of these statistics into buckets, by day or hour granularity. + +Usage information is computed by querying the system.query_log table. In case you have a cluster or need to apply additional transformation/filters you can create a view and put to the `query_log_table` setting. + +:::note + +This source only does usage statistics. To get the tables, views, and schemas in your ClickHouse warehouse, ingest using the `clickhouse` source described above. + +::: + +[Read more...](#module-clickhouse-usage) + + | +
+ + + +
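To round out the ClickHouse modules above: since `clickhouse-usage` only emits usage statistics, it is normally paired with the plain `clickhouse` source for tables, views, and schemas. A minimal sketch of that companion metadata recipe follows; the connection fields and `include_views` flag are assumptions, so verify them against the module's config details.

```yaml
# companion metadata recipe sketch for ClickHouse (field names are illustrative)
source:
  type: clickhouse
  config:
    host_port: "localhost:8123"        # assumed ClickHouse endpoint
    username: datahub
    password: "${CLICKHOUSE_PASSWORD}"
    include_views: true                # assumption: view/materialized-view extraction toggle
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```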
+ +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ----------------------------------------------------------------- | +| Asset Containers | ✅ | Enabled by default | +| Column-level Lineage | ✅ | Enabled by default | +| Dataset Usage | ✅ | Enabled by default | +| Descriptions | ✅ | Enabled by default | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Optionally enabled via `stateful_ingestion.remove_stale_metadata` | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| Extract Ownership | ✅ | Supported via the `include_ownership` config | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | +| Schema Metadata | ✅ | Enabled by default | +| Table-Level Lineage | ✅ | Enabled by default | + +This plugin extracts the following metadata from Databricks Unity Catalog: + +- metastores +- schemas +- tables and column lineage + +### Prerequisities + +- Get your Databricks instance's [workspace url](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids) +- Create a [Databricks Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#what-is-a-service-principal) + - You can skip this step and use your own account to get things running quickly, + but we strongly recommend creating a dedicated service principal for production use. +- Generate a Databricks Personal Access token following the following guides: + - [Service Principals](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#personal-access-tokens) + - [Personal Access Tokens](https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens) +- Provision your service account: + - To ingest your workspace's metadata and lineage, your service principal must have all of the following: + - One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest + - One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest + - Ownership of or `SELECT` privilege on any tables and views you want to ingest + - [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html) + - [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html) + - To `include_usage_statistics` (enabled by default), your service principal must have `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html). + - To ingest `profiling` information with `call_analyze` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile. + - Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`. + You will still need `SELECT` privilege on those tables to fetch the results. +- Check the starter recipe below and replace `workspace_url` and `token` with your information from the previous steps. 
+ +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[unity-catalog]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: unity-catalog + config: + workspace_url: https://my-workspace.cloud.databricks.com + token: "mygenerated_databricks_token" + #metastore_id_pattern: + # deny: + # - 11111-2222-33333-44-555555 + #catalog_pattern: + # allow: + # - my-catalog + #schema_pattern: + # deny: + # - information_schema + #table_pattern: + # allow: + # - test.lineagedemo.dinner + # First you have to create domains on Datahub by following this guide -> https://datahubproject.io/docs/domains/#domains-setup-prerequisites-and-permissions + #domain: + # urn:li:domain:1111-222-333-444-555: + # allow: + # - main.* + + stateful_ingestion: + enabled: true + +pipeline_name: acme-corp-unity +# sink configs if needed +``` + +### Config Details + +Source Module | Documentation |
+ +`dbt` + + | ++ +The artifacts used by this source are: + +- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json) + - This file contains model, source, tests and lineage data. +- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json) + - This file contains schema data. + - dbt does not record schema data for Ephemeral models, as such datahub will show Ephemeral models in the lineage, however there will be no associated schema for Ephemeral models +- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json) + - This file contains metadata for sources with freshness checks. + - We transfer dbt's freshness checks to DataHub's last-modified fields. + - Note that this file is optional – if not specified, we'll use time of ingestion instead as a proxy for time last-modified. +- [dbt run_results file](https://docs.getdbt.com/reference/artifacts/run-results-json) + - This file contains metadata from the result of a dbt run, e.g. dbt test + - When provided, we transfer dbt test run results into assertion run events to see a timeline of test runs on the dataset + [Read more...](#module-dbt) + + | +
+ +`dbt-cloud` + + | ++ +This source pulls dbt metadata directly from the dbt Cloud APIs. + +You'll need to have a dbt Cloud job set up to run your dbt project, and "Generate docs on run" should be enabled. + +The token should have the "read metadata" permission. + +To get the required IDs, go to the job details page (this is the one with the "Run History" table), and look at the URL. +It should look something like this: https://cloud.getdbt.com/next/deploy/107298/projects/175705/jobs/148094. +In this example, the account ID is 107298, the project ID is 175705, and the job ID is 148094. +[Read more...](#module-dbt-cloud) + + | +
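For the file-based `dbt` module described above, a minimal recipe sketch might look like the following; the artifact paths are placeholders, `sources_path` is optional, and `target_platform` is an assumption about the warehouse behind your dbt project.

```yaml
# dbt recipe sketch built from the artifacts listed above (paths are placeholders)
source:
  type: dbt
  config:
    manifest_path: "./target/manifest.json"
    catalog_path: "./target/catalog.json"
    sources_path: "./target/sources.json"  # optional; enables freshness-based last-modified
    target_platform: postgres              # assumed platform of the warehouse dbt runs against
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

The dbt Cloud variant instead takes the account, project, and job IDs described above; its config keys are not shown here.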
Source Module | Documentation |
+ +`looker` + + | ++ +This plugin extracts the following: + +- Looker dashboards, dashboard elements (charts) and explores +- Names, descriptions, URLs, chart types, input explores for the charts +- Schemas and input views for explores +- Owners of dashboards + +:::note +To get complete Looker metadata integration (including Looker views and lineage to the underlying warehouse tables), you must ALSO use the `lookml` module. +::: +[Read more...](#module-looker) + + | +
+ +`lookml` + + | ++ +This plugin extracts the following: + +- LookML views from model files in a project +- Name, upstream table names, metadata for dimensions, measures, and dimension groups attached as tags +- If API integration is enabled (recommended), resolves table and view names by calling the Looker API, otherwise supports offline resolution of these names. + +:::note +To get complete Looker metadata integration (including Looker dashboards and charts and lineage to the underlying Looker views, you must ALSO use the `looker` source module. +::: +[Read more...](#module-lookml) + + | +
+ +
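For orientation, here is a minimal `looker` recipe sketch using the API3 credentials covered in the next section; the base URL and credential placeholders are assumptions, and the `lookml` module is configured separately.

```yaml
# looker recipe sketch (values are placeholders; lookml runs as its own source)
source:
  type: looker
  config:
    base_url: "https://yourcompany.cloud.looker.com"  # assumed Looker instance URL
    client_id: "${LOOKER_CLIENT_ID}"
    client_secret: "${LOOKER_CLIENT_SECRET}"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```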
+ +#### Get an API key + +You need to get an API key for the account with the above privileges to perform ingestion. See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret. + +### Ingestion through UI + +The following video shows you how to get started with ingesting Looker metadata through the UI. + +:::note + +You will need to run `lookml` ingestion through the CLI after you have ingested Looker metadata through the UI. Otherwise you will not be able to see Looker Views and their lineage to your warehouse tables. + +::: + +Source Module | Documentation |
+ +`powerbi` + + | ++ +This plugin extracts the following: + +- Power BI dashboards, tiles and datasets +- Names, descriptions and URLs of dashboard and tile +- Owners of dashboards + [Read more...](#module-powerbi) + + | +
+ +`powerbi-report-server` + + | ++ +Use this plugin to connect to [PowerBI Report Server](https://powerbi.microsoft.com/en-us/report-server/). +It extracts the following: + +Metadata that can be ingested: + +- report name +- report description +- ownership(can add existing users in DataHub as owners) +- transfer folders structure to DataHub as it is in Report Server +- webUrl to report in Report Server + +Due to limits of PBIRS REST API, it's impossible to ingest next data for now: + +- tiles info +- datasource of report +- dataset of report + +Next types of report can be ingested: + +- PowerBI report(.pbix) +- Paginated report(.rdl) +- Linked report + [Read more...](#module-powerbi-report-server) + + | +
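A minimal `powerbi` recipe sketch is shown below for reference; the Azure AD app registration values are placeholders, and workspace/dataset filtering options exist but are omitted here as they vary by version.

```yaml
# powerbi recipe sketch (tenant and app credentials are placeholders)
source:
  type: powerbi
  config:
    tenant_id: "00000000-0000-0000-0000-000000000000"  # assumed Azure AD tenant id
    client_id: "${POWERBI_CLIENT_ID}"
    client_secret: "${POWERBI_CLIENT_SECRET}"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```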
Source Module | Documentation |
+ +`redshift` + + | ++ +This plugin extracts the following: + +- Metadata for databases, schemas, views and tables +- Column types associated with each table +- Table, row, and column statistics via optional SQL profiling +- Table lineage +- Usage statistics + +### Prerequisites + +This source needs to access system tables that require extra permissions. +To grant these permissions, please alter your datahub Redshift user the following way: + +```sql +ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED; +GRANT SELECT ON pg_catalog.svv_table_info to datahub_user; +GRANT SELECT ON pg_catalog.svl_user_info to datahub_user; +``` + +:::note + +Giving a user unrestricted access to system tables gives the user visibility to data generated by other users. For example, STL_QUERY and STL_QUERYTEXT contain the full text of INSERT, UPDATE, and DELETE statements. + +::: + +### Lineage + +There are multiple lineage collector implementations as Redshift does not support table lineage out of the box. + +#### stl_scan_based + +The stl_scan based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) and [stl_scan](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_SCAN.html) system tables to +discover lineage between tables. +Pros: + +- Fast +- Reliable + +Cons: + +- Does not work with Spectrum/external tables because those scans do not show up in stl_scan table. +- If a table is depending on a view then the view won't be listed as dependency. Instead the table will be connected with the view's dependencies. + +#### sql_based + +The sql_based based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) to discover all the insert queries +and uses sql parsing to discover the dependecies. + +Pros: + +- Works with Spectrum tables +- Views are connected properly if a table depends on it + +Cons: + +- Slow. +- Less reliable as the query parser can fail on certain queries + +#### mixed + +Using both collector above and first applying the sql based and then the stl_scan based one. + +Pros: + +- Works with Spectrum tables +- Views are connected properly if a table depends on it +- A bit more reliable than the sql_based one only + +Cons: + +- Slow +- May be incorrect at times as the query parser can fail on certain queries + +:::note + +The redshift stl redshift tables which are used for getting data lineage only retain approximately two to five days of log history. This means you cannot extract lineage from queries issued outside that window. + +::: + +### Profiling + +Profiling runs sql queries on the redshift cluster to get statistics about the tables. To be able to do that, the user needs to have read access to the tables that should be profiled. + +If you don't want to grant read access to the tables you can enable table level profiling which will get table statistics without reading the data. + +```yaml +profiling: + profile_table_level_only: true +``` + +[Read more...](#module-redshift) + + | +
+ +`redshift-legacy` + + | ++ +This plugin extracts the following: + +- Metadata for databases, schemas, views and tables +- Column types associated with each table +- Also supports PostGIS extensions +- Table, row, and column statistics via optional SQL profiling +- Table lineage + +:::tip + +You can also get fine-grained usage statistics for Redshift using the `redshift-usage` source described below. + +::: + +### Prerequisites + +This source needs to access system tables that require extra permissions. +To grant these permissions, please alter your datahub Redshift user the following way: + +```sql +ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED; +GRANT SELECT ON pg_catalog.svv_table_info to datahub_user; +GRANT SELECT ON pg_catalog.svl_user_info to datahub_user; +``` + +:::note + +Giving a user unrestricted access to system tables gives the user visibility to data generated by other users. For example, STL_QUERY and STL_QUERYTEXT contain the full text of INSERT, UPDATE, and DELETE statements. + +::: + +### Lineage + +There are multiple lineage collector implementations as Redshift does not support table lineage out of the box. + +#### stl_scan_based + +The stl_scan based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) and [stl_scan](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_SCAN.html) system tables to +discover lineage between tables. +Pros: + +- Fast +- Reliable + +Cons: + +- Does not work with Spectrum/external tables because those scans do not show up in stl_scan table. +- If a table is depending on a view then the view won't be listed as dependency. Instead the table will be connected with the view's dependencies. + +#### sql_based + +The sql_based based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) to discover all the insert queries +and uses sql parsing to discover the dependecies. + +Pros: + +- Works with Spectrum tables +- Views are connected properly if a table depends on it + +Cons: + +- Slow. +- Less reliable as the query parser can fail on certain queries + +#### mixed + +Using both collector above and first applying the sql based and then the stl_scan based one. + +Pros: + +- Works with Spectrum tables +- Views are connected properly if a table depends on it +- A bit more reliable than the sql_based one only + +Cons: + +- Slow +- May be incorrect at times as the query parser can fail on certain queries + +:::note + +The redshift stl redshift tables which are used for getting data lineage only retain approximately two to five days of log history. This means you cannot extract lineage from queries issued outside that window. + +::: + +[Read more...](#module-redshift-legacy) + + | +
+ +`redshift-usage-legacy` + + | ++ +This plugin extracts usage statistics for datasets in Amazon Redshift. + +Note: Usage information is computed by querying the following system tables - + +1. stl_scan +2. svv_table_info +3. stl_query +4. svl_user_info + +To grant access this plugin for all system tables, please alter your datahub Redshift user the following way: + +```sql +ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED; +``` + +This plugin has the below functionalities - + +1. For a specific dataset this plugin ingests the following statistics - + 1. top n queries. + 2. top users. +2. Aggregation of these statistics into buckets, by day or hour granularity. + +:::note + +This source only does usage statistics. To get the tables, views, and schemas in your Redshift warehouse, ingest using the `redshift` source described above. + +::: + +:::note + +Redshift system tables have some latency in getting data from queries. In addition, these tables only maintain logs for 2-5 days. You can find more information from the official documentation [here](https://aws.amazon.com/premiumsupport/knowledge-center/logs-redshift-database-cluster/). + +::: + +[Read more...](#module-redshift-usage-legacy) + + | +
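Pulling the Redshift notes above together, here is a hedged recipe sketch with lineage and table-level-only profiling enabled. The cluster endpoint and credentials are placeholders, and `include_table_lineage` is an assumption about the flag name in your version; the `profiling` block mirrors the snippet shown earlier.

```yaml
# redshift recipe sketch (endpoint, credentials, and lineage flag are assumptions)
source:
  type: redshift
  config:
    host_port: "my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439"
    database: dev
    username: datahub_user
    password: "${REDSHIFT_PASSWORD}"
    include_table_lineage: true      # assumed toggle for the lineage collectors described above
    profiling:
      enabled: true
      profile_table_level_only: true # table statistics without reading row data
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```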
Source Module | Documentation |
+ +`trino` + + | ++ +This plugin extracts the following: + +- Metadata for databases, schemas, and tables +- Column types and schema associated with each table +- Table, row, and column statistics via optional SQL profiling + +[Read more...](#module-trino) + + | +
+ +`starburst-trino-usage` + + | ++ +If you are using Starburst Trino you can collect usage stats the following way. + +#### Prerequsities + +1. You need to setup Event Logger which saves audit logs into a Postgres db and setup this db as a catalog in Trino + Here you can find more info about how to setup: + https://docs.starburst.io/354-e/security/event-logger.html#security-event-logger--page-root + https://docs.starburst.io/354-e/security/event-logger.html#analyzing-the-event-log + +2. Install starbust-trino-usage plugin + Run pip install 'acryl-datahub[starburst-trino-usage]'. + +[Read more...](#module-starburst-trino-usage) + + | +
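As a sketch of how the event-logger setup above feeds a recipe, here is an illustrative `starburst-trino-usage` configuration. The `audit_catalog`, `audit_schema`, and `email_domain` keys, along with the endpoint and credentials, are assumptions; verify them against the module's config details before use.

```yaml
# starburst-trino-usage recipe sketch (all field values and several key names are assumptions)
source:
  type: starburst-trino-usage
  config:
    host_port: "trino.mycompany.com:443"
    database: my_catalog            # catalog whose usage you want to attribute
    username: datahub
    password: "${TRINO_PASSWORD}"
    email_domain: mycompany.com     # assumed: used to build user emails for usage attribution
    audit_catalog: postgresql       # assumed: catalog backed by the event-logger Postgres database
    audit_schema: public            # assumed: schema containing the event-logger tables
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```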