diff --git a/README.md b/README.md index 3b381ebc8dc89..3498eb595d3a6 100644 --- a/README.md +++ b/README.md @@ -81,7 +81,9 @@ Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quic If you're looking to build & modify datahub please take a look at our [Development Guide](https://datahubproject.io/docs/developers).

- + + +

## Source Code and Repositories diff --git a/datahub-web-react/README.md b/datahub-web-react/README.md index 6c91b169af858..8d11389ee29d2 100644 --- a/datahub-web-react/README.md +++ b/datahub-web-react/README.md @@ -126,7 +126,9 @@ for functional configurability should reside. to render a view associated with a particular entity type (user, dataset, etc.). -![entity-registry](./entity-registry.png) +

+ +

**graphql** - The React App talks to the `dathub-frontend` server using GraphQL. This module is where the *queries* issued against the server are defined. Once defined, running `yarn run generate` will code-gen TypeScript objects to make invoking diff --git a/docs-website/versioned_docs/version-0.10.4/README.md b/docs-website/versioned_docs/version-0.10.4/README.md new file mode 100644 index 0000000000000..6b9450294a7f0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/README.md @@ -0,0 +1,185 @@ +--- +description: >- + DataHub is a data discovery application built on an extensible metadata + platform that helps you tame the complexity of diverse data ecosystems. +hide_title: true +title: Introduction +slug: /introduction +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/README.md" +--- + +import useBaseUrl from '@docusaurus/useBaseUrl'; + +export const Logo = (props) => { +return ( + +
+ +
+); +}; + + + + + +# DataHub: The Metadata Platform for the Modern Data Stack + +## Built with ❤️ by [Acryl Data](https://acryldata.io) and [LinkedIn](https://engineering.linkedin.com) + +[![Version](https://img.shields.io/github/v/release/datahub-project/datahub?include_prereleases)](https://github.com/datahub-project/datahub/releases/latest) +[![PyPI version](https://badge.fury.io/py/acryl-datahub.svg)](https://badge.fury.io/py/acryl-datahub) +[![build & test](https://github.com/datahub-project/datahub/workflows/build%20&%20test/badge.svg?branch=master&event=push)](https://github.com/datahub-project/datahub/actions?query=workflow%3A%22build+%26+test%22+branch%3Amaster+event%3Apush) +[![Docker Pulls](https://img.shields.io/docker/pulls/linkedin/datahub-gms.svg)](https://hub.docker.com/r/linkedin/datahub-gms) +[![Slack](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://slack.datahubproject.io) +[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/datahub-project/datahub/blob/master/docs/CONTRIBUTING.md) +[![GitHub commit activity](https://img.shields.io/github/commit-activity/m/datahub-project/datahub)](https://github.com/datahub-project/datahub/pulls?q=is%3Apr) +[![License](https://img.shields.io/github/license/datahub-project/datahub)](https://github.com/datahub-project/datahub/blob/master/LICENSE) +[![YouTube](https://img.shields.io/youtube/channel/subscribers/UC3qFQC5IiwR5fvWEqi_tJ5w?style=social)](https://www.youtube.com/channel/UC3qFQC5IiwR5fvWEqi_tJ5w) +[![Medium](https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white)](https://medium.com/datahub-project) +[![Follow](https://img.shields.io/twitter/follow/datahubproject?label=Follow&style=social)](https://twitter.com/datahubproject) + +### 🏠 Hosted DataHub Docs (Courtesy of Acryl Data): [datahubproject.io](/docs) + +--- + +[Quickstart](/docs/quickstart) | +[Features](/docs/features) | +[Roadmap](https://feature-requests.datahubproject.io/roadmap) | +[Adoption](#adoption) | +[Demo](https://demo.datahubproject.io/) | +[Town Hall](/docs/townhalls) + +--- + +> 📣 DataHub Town Hall is the 4th Thursday at 9am US PT of every month - [add it to your calendar!](https://rsvp.datahubproject.io/) +> +> - Town-hall Zoom link: [zoom.datahubproject.io](https://zoom.datahubproject.io) +> - [Meeting details](docs/townhalls.md) & [past recordings](docs/townhall-history.md) + +> ✨ DataHub Community Highlights: +> +> - Read our Monthly Project Updates [here](https://blog.datahubproject.io/tagged/project-updates). +> - Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data: [Data Engineering Podcast](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/) +> - Check out our most-read blog post, [DataHub: Popular Metadata Architectures Explained](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained) @ LinkedIn Engineering Blog. +> - Join us on [Slack](docs/slack.md)! Ask questions and keep up with the latest announcements. + +## Introduction + +DataHub is an open-source metadata platform for the modern data stack. Read about the architectures of different metadata systems and why DataHub excels [here](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained). 
Also read our +[LinkedIn Engineering blog post](https://engineering.linkedin.com/blog/2019/data-hub), check out our [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019) and watch our [Crunch Conference Talk](https://www.youtube.com/watch?v=OB-O0Y6OYDE). You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented. + +## Features & Roadmap + +Check out DataHub's [Features](docs/features.md) & [Roadmap](https://feature-requests.datahubproject.io/roadmap). + +## Demo and Screenshots + +There's a [hosted demo environment](https://demo.datahubproject.io/) courtesy of [Acryl Data](https://acryldata.io) where you can explore DataHub without installing it locally + +## Quickstart + +Please follow the [DataHub Quickstart Guide](/docs/quickstart) to get a copy of DataHub up & running locally using [Docker](https://docker.com). As the guide assumes some basic knowledge of Docker, we'd recommend you to go through the "Hello World" example of [A Docker Tutorial for Beginners](https://docker-curriculum.com) if Docker is completely foreign to you. + +## Development + +If you're looking to build & modify datahub please take a look at our [Development Guide](/docs/developers). + +

+ +

+ +## Source Code and Repositories + +- [datahub-project/datahub](https://github.com/datahub-project/datahub): This repository contains the complete source code for DataHub's metadata model, metadata services, integration connectors and the web application. +- [acryldata/datahub-actions](https://github.com/acryldata/datahub-actions): DataHub Actions is a framework for responding to changes to your DataHub Metadata Graph in real time. +- [acryldata/datahub-helm](https://github.com/acryldata/datahub-helm): Repository of helm charts for deploying DataHub on a Kubernetes cluster +- [acryldata/meta-world](https://github.com/acryldata/meta-world): A repository to store recipes, custom sources, transformations and other things to make your DataHub experience magical + +## Releases + +See [Releases](https://github.com/datahub-project/datahub/releases) page for more details. We follow the [SemVer Specification](https://semver.org) when versioning the releases and adopt the [Keep a Changelog convention](https://keepachangelog.com/) for the changelog format. + +## Contributing + +We welcome contributions from the community. Please refer to our [Contributing Guidelines](docs/CONTRIBUTING.md) for more details. We also have a [contrib](https://github.com/datahub-project/datahub/blob/master/contrib) directory for incubating experimental features. + +## Community + +Join our [Slack workspace](https://slack.datahubproject.io) for discussions and important announcements. You can also find out more about our upcoming [town hall meetings](docs/townhalls.md) and view past recordings. + +## Adoption + +Here are the companies that have officially adopted DataHub. Please feel free to add yours to the list if we missed it. + +- [ABLY](https://ably.team/) +- [Adevinta](https://www.adevinta.com/) +- [Banksalad](https://www.banksalad.com) +- [Cabify](https://cabify.tech/) +- [ClassDojo](https://www.classdojo.com/) +- [Coursera](https://www.coursera.org/) +- [DefinedCrowd](http://www.definedcrowd.com) +- [DFDS](https://www.dfds.com/) +- [Digital Turbine](https://www.digitalturbine.com/) +- [Expedia Group](http://expedia.com) +- [Experius](https://www.experius.nl) +- [Geotab](https://www.geotab.com) +- [Grofers](https://grofers.com) +- [Haibo Technology](https://www.botech.com.cn) +- [hipages](https://hipages.com.au/) +- [inovex](https://www.inovex.de/) +- [IOMED](https://iomed.health) +- [Klarna](https://www.klarna.com) +- [LinkedIn](http://linkedin.com) +- [Moloco](https://www.moloco.com/en) +- [N26](https://n26brasil.com/) +- [Optum](https://www.optum.com/) +- [Peloton](https://www.onepeloton.com) +- [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) +- [Razer](https://www.razer.com) +- [Saxo Bank](https://www.home.saxo) +- [Showroomprive](https://www.showroomprive.com/) +- [SpotHero](https://spothero.com) +- [Stash](https://www.stash.com) +- [Shanghai HuaRui Bank](https://www.shrbank.com) +- [ThoughtWorks](https://www.thoughtworks.com) +- [TypeForm](http://typeform.com) +- [Udemy](https://www.udemy.com/) +- [Uphold](https://uphold.com) +- [Viasat](https://viasat.com) +- [Wikimedia](https://www.wikimedia.org) +- [Wolt](https://wolt.com) +- [Zynga](https://www.zynga.com) + +## Select Articles & Talks + +- [DataHub Blog](https://blog.datahubproject.io/) +- [DataHub YouTube Channel](https://www.youtube.com/channel/UC3qFQC5IiwR5fvWEqi_tJ5w) +- [Optum: Data Mesh via DataHub](https://optum.github.io/blog/2022/03/23/data-mesh-via-datahub/) +- [Saxo Bank: Enabling Data Discovery in Data 
Mesh](https://medium.com/datahub-project/enabling-data-discovery-in-a-data-mesh-the-saxo-journey-451b06969c8f) +- [Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/) +- [DataHub: Popular Metadata Architectures Explained](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained) +- [Driving DataOps Culture with LinkedIn DataHub](https://www.youtube.com/watch?v=ccsIKK9nVxk) @ [DataOps Unleashed 2021](https://dataopsunleashed.com/#shirshanka-session) +- [The evolution of metadata: LinkedIn’s story](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019) @ [Strata Data Conference 2019](https://conferences.oreilly.com/strata/strata-ny-2019.html) +- [Journey of metadata at LinkedIn](https://www.youtube.com/watch?v=OB-O0Y6OYDE) @ [Crunch Data Conference 2019](https://crunchconf.com/2019) +- [DataHub Journey with Expedia Group](https://www.youtube.com/watch?v=ajcRdB22s5o) +- [Data Discoverability at SpotHero](https://www.slideshare.net/MaggieHays/data-discoverability-at-spothero) +- [Data Catalogue — Knowing your data](https://medium.com/albert-franzi/data-catalogue-knowing-your-data-15f7d0724900) +- [DataHub: A Generalized Metadata Search & Discovery Tool](https://engineering.linkedin.com/blog/2019/data-hub) +- [Open sourcing DataHub: LinkedIn’s metadata search and discovery platform](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) +- [Emerging Architectures for Modern Data Infrastructure](https://future.com/emerging-architectures-for-modern-data-infrastructure-2020/) + +See the full list [here](docs/links.md). + +## License + +[Apache License 2.0](https://github.com/datahub-project/datahub/blob/master/LICENSE). diff --git a/docs-website/versioned_docs/version-0.10.4/datahub-frontend/README.md b/docs-website/versioned_docs/version-0.10.4/datahub-frontend/README.md new file mode 100644 index 0000000000000..7f181861f6a43 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/datahub-frontend/README.md @@ -0,0 +1,97 @@ +--- +title: datahub-frontend +sidebar_label: datahub-frontend +slug: /datahub-frontend +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/datahub-frontend/README.md +--- + +# DataHub Frontend Proxy + +DataHub frontend is a [Play](https://www.playframework.com/) service written in Java. It is served as a mid-tier +between [DataHub GMS](https://github.com/datahub-project/datahub/blob/master/metadata-service) which is the backend service and [DataHub Web](../datahub-web-react/README.md). + +## Pre-requisites + +- You need to have [JDK11](https://openjdk.org/projects/jdk/11/) + installed on your machine to be able to build `DataHub Frontend`. +- You need to have [Chrome](https://www.google.com/chrome/) web browser + installed to be able to build because UI tests have a dependency on `Google Chrome`. + +## Build + +`DataHub Frontend` is already built as part of top level build: + +``` +./gradlew build +``` + +However, if you only want to build `DataHub Frontend` specifically: + +``` +./gradlew :datahub-frontend:dist +``` + +## Dependencies + +Before starting `DataHub Frontend`, you need to make sure that [DataHub GMS](https://github.com/datahub-project/datahub/blob/master/metadata-service) and +all its dependencies have already started and running. 
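For example, a quick way to confirm GMS is reachable before starting the frontend is to hit its health endpoint (a sketch assuming the default local port of 8080; adjust the host and port for your deployment):

```
# Should return a successful response once GMS has finished starting up
curl http://localhost:8080/health
```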
+ +## Start via Docker image + +Quickest way to try out `DataHub Frontend` is running the [Docker image](https://github.com/datahub-project/datahub/blob/master/docker/datahub-frontend). + +## Start via command line + +If you do modify things and want to try it out quickly without building the Docker image, you can also run +the application directly from command line after a successful [build](#build): + +``` +cd datahub-frontend/run && ./run-local-frontend +``` + +## Checking out DataHub UI + +After starting your application in one of the two ways mentioned above, you can connect to it by typing below +into your favorite web browser: + +``` +http://localhost:9002 +``` + +To be able to sign in, you need to provide your user name. The default account is `datahub`, password `datahub`. + +## Authentication + +DataHub frontend leverages [Java Authentication and Authorization Service (JAAS)](https://docs.oracle.com/javase/7/docs/technotes/guides/security/jaas/JAASRefGuide.html) to perform the authentication. By default we provided a [DummyLoginModule](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/app/security/DummyLoginModule.java) which will accept any username/password combination. You can update [jaas.conf](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/conf/jaas.conf) to match your authentication requirement. For example, use the following config for LDAP-based authentication, + +``` +WHZ-Authentication { +  com.sun.security.auth.module.LdapLoginModule sufficient +  userProvider="ldaps://:636/dc=" +  authIdentity="{USERNAME}" +  userFilter="(&(objectClass=person)(uid={USERNAME}))" +  java.naming.security.authentication="simple" +  debug="false" +  useSSL="true"; +}; +``` + +### Authentication in React + +The React app supports both JAAS as described above and separately OIDC authentication. To learn about configuring OIDC for React, +see the [OIDC in React](../docs/authentication/guides/sso/configure-oidc-react.md) document. + +### API Debugging + +Most DataHub frontend API endpoints are protected using [Play Authentication](https://www.playframework.com/documentation/2.1.0/JavaGuide4), which means it requires authentication information stored in the cookie for the request to go through. This makes debugging using curl difficult. One option is to first make a curl call against the `/authenticate` endpoint and stores the authentication info in a cookie file like this + +``` +curl -c cookie.txt -d '{"username":"datahub", "password":"datahub"}' -H 'Content-Type: application/json' http://localhost:9002/authenticate +``` + +You can then make all subsequent calls using the same cookie file to pass the authentication check. + +``` +curl -b cookie.txt "http://localhost:9001/api/v2/search?type=dataset&input=page" +``` diff --git a/docs-website/versioned_docs/version-0.10.4/datahub-graphql-core/README.md b/docs-website/versioned_docs/version-0.10.4/datahub-graphql-core/README.md new file mode 100644 index 0000000000000..26ebc50731c68 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/datahub-graphql-core/README.md @@ -0,0 +1,93 @@ +--- +title: datahub-graphql-core +sidebar_label: datahub-graphql-core +slug: /datahub-graphql-core +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/README.md +--- + +# DataHub GraphQL Core + +DataHub GraphQL API is a shared lib module containing a GraphQL API on top of the GMS service layer. 
It exposes a graph-based representation +permitting reads and writes against the entities and aspects on the Metadata Graph, including Datasets, CorpUsers, & more. + +Contained within this module are + +1. **GMS Schema**: A GQL schema based on GMS models, located under [resources](https://github.com/datahub-project/datahub/tree/master/datahub-graphql-core/src/main/resources) folder. +2. **GMS Data Fetchers** (Resolvers): Components used by the GraphQL engine to resolve individual fields in the GQL schema. +3. **GMS Data Loaders**: Components used by the GraphQL engine to fetch data from downstream sources efficiently (by batching). +4. **GraphQLEngine**: A wrapper on top of the default `GraphQL` object provided by `graphql-java`. Provides a way to configure all of the important stuff using a simple `Builder API`. +5. **GMSGraphQLEngine**: An engine capable of resolving the GMS schema using the data fetchers + loaders mentioned above (with no additional configuration required). + +We've chosen to place these components in a library module so that GraphQL servers can be deployed in multiple "modes": + +1. **Standalone**: GraphQL facade, mainly used for programmatic access to the GMS graph from a non-Java environment +2. **Embedded**: Leverageable within another Java server to surface an extended GraphQL schema. For example, we use this to extend the GMS GraphQL schema in `datahub-frontend` + +## Extending the Graph + +### Adding an Entity + +When adding an entity to the GMS graph, the following steps should be followed: + +1. Extend [entity.graphql](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/resources/entity.graphql) schema with new `types` (Queries) or `inputs` (Mutations) required for fetching & updating your Entity. + +These models should generally mirror the GMS models exactly, with notable exceptions: + +- **Maps**: the GQL model must instead contain a list of { key, value } objects (e.g. Dataset.pdl 'properties' field) +- **Foreign-Keys**: Foreign-key references embedded in GMS models should be resolved if the referenced entity exists in the GQL schema, + replacing the key with the actual entity model. (Example: replacing the 'owner' urn field in 'Ownership' with an actual `CorpUser` type) + +In GraphQL, the new Entity should extend the `Entity` interface. Additionally, you will need to add a new symbol to the standard +`EntityType` enum. + +The convention we follow is to have a top-level Query for each entity that takes a single "urn" parameter. This is for primary key lookups. +See all the existing entity Query types [here](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/resources/entity.graphql#L19). + +On rebuilding the module (`./gradlew datahub-graphql-core:build`) you'll find newly generated classes corresponding to +the types you've defined inside the GraphQL schema inside the `mainGeneratedGraphQL` folder. These classes will be used in the next step. + +2. Implement `EntityType` classes for any new entities + +- These 'type' classes define how to load entities from GMS, and map them to the GraphQL data model. See [DatasetType.java](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/types/dataset/DatasetType.java) as an example. + +3. Implement `Mappers` to transform Pegasus model returned by GMS to an auto-generated GQL POJO. 
(under `/mainGeneratedGraphQL`, generated on `./gradlew datahub-graphql-core:build`) These mappers + will be used inside the type class defined in step 2. + +- If you've followed the guidance above, these mappers should be simple, mainly + providing identity mappings for fields that exist in both the GQL + Pegasus POJOs. +- In some cases, you'll need to perform small lambdas (unions, maps) to materialize the GQL object. + +4. Wire up your `EntityType` to the GraphQL schema. + +We use [GmsGraphQLEngine.java](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/GmsGraphQLEngine.java) to +configure the wiring for the GraphQL schema. This means associating "resolvers" to specific fields present in the GraphQL schema file. + +Inside of this file, you need to register your new `Type` object to be used in resolving primary-key entity queries. +To do so, simply follow the examples for other entities. + +5. Implement `EntityType` test for the new type defined in Step 2. See [ContainerTypeTest](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/test/java/com/linkedin/datahub/graphql/types/container/ContainerTypeTest.java) as an example. + +6. Implement `Resolver` tests for any new `DataFetchers` that you needed to add. See [SetDomainResolverTest](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/test/java/com/linkedin/datahub/graphql/resolvers/domain/SetDomainResolverTest.java) as an example. + +7. [Optional] Sometimes, your new entity will have relationships to other entities, or fields that require specific business logic + as opposed to basic mapping from the GMS model. In such cases, we tend to create an entity-specific configuration method in [GmsGraphQLEngine.java](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/GmsGraphQLEngine.java) + which allows you to wire custom resolvers (DataFetchers) to the fields in your Entity type. You also may need to do this, depending + on the complexity of the new entity. See [here](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/GmsGraphQLEngine.java#L438) for reference. + +> Note: If you want your new Entity to be "browsable" (folder navigation) via the UI, make sure you implement the `BrowsableEntityType` interface. + +#### Enabling Search for a new Entity + +In order to enable searching an Entity, you'll need to modify the [SearchAcrossEntities.java](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/resolvers/search/SearchAcrossEntitiesResolver.java) resolver, which enables unified search +across all DataHub entities. + +Steps: + +1. Add your new Entity type to [this list](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/resolvers/search/SearchAcrossEntitiesResolver.java#L32). +2. Add a new statement to [UrnToEntityMapper.java](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/types/common/mappers/UrnToEntityMapper.java#L35). This maps + an URN to a "placeholder" GraphQL entity which is subsequently resolved by the GraphQL engine. + +That should be it! 
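As a quick sanity check, you can query the GraphQL endpoint directly. The sketch below assumes a local GMS at the default port 8080 and uses the `searchAcrossEntities` query mentioned above; `MY_NEW_ENTITY` is a placeholder for the enum symbol you added, and you may need an `Authorization` header if metadata service authentication is enabled:

```
# Sketch only: searches for instances of the new entity type via GraphQL
curl -X POST http://localhost:8080/api/graphql \
  -H 'Content-Type: application/json' \
  --data '{"query":"{ searchAcrossEntities(input: { types: [MY_NEW_ENTITY], query: \"*\", start: 0, count: 10 }) { total searchResults { entity { urn type } } } }"}'
```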
+ +Now, you can try to issue a search for the new entities you've ingested diff --git a/docs-website/versioned_docs/version-0.10.4/datahub-web-react/README.md b/docs-website/versioned_docs/version-0.10.4/datahub-web-react/README.md new file mode 100644 index 0000000000000..9eccc7551e84f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/datahub-web-react/README.md @@ -0,0 +1,178 @@ +--- +title: datahub-web-react +sidebar_label: datahub-web-react +slug: /datahub-web-react +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/datahub-web-react/README.md +--- + +# DataHub React App + +## About + +This module contains a React application that serves as the DataHub UI. + +Feel free to take a look around, deploy, and contribute. + +## Functional Goals + +The initial milestone for the app was to achieve functional parity with the previous Ember app. This meant supporting + +- Dataset Profiles, Search, Browse Experience +- User Profiles, Search +- LDAP Authentication Flow + +This has since been achieved. The new set of functional goals are reflected in the latest version of the [DataHub Roadmap](../docs/roadmap.md). + +## Design Goals + +In building out the client experience, we intend to leverage learnings from the previous Ember-based app and incorporate feedback gathered +from organizations operating DataHub. Two themes have emerged to serve as guideposts: + +1. **Configurability**: The client experience should be configurable, such that deploying organizations can tailor certain + aspects to their needs. This includes theme / styling configurability, showing and hiding specific functionality, + customizing copy & logos, etc. +2. **Extensibility**: Extending the _functionality_ of DataHub should be as simple as possible. Making changes like + extending an existing entity & adding a new entity should require minimal effort and should be well covered in detailed + documentation. + +## Starting the Application + +### Quick Start + +Navigate to the `docker` directory and run the following to spin up the react app: + +``` +./quickstart.sh +``` + +at `http://localhost:9002`. + +If you want to make changes to the UI see them live without having to rebuild the `datahub-frontend-react` docker image, you +can run the following in this directory: + +`yarn install && yarn run start` + +which will start a forwarding server at `localhost:3000`. Note that to fetch real data, `datahub-frontend` server will also +need to be deployed, still at `http://localhost:9002`, to service GraphQL API requests. + +Optionally you could also start the app with the mock server without running the docker containers by executing `yarn start:mock`. See [here](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/graphql-mock/fixtures/searchResult/userSearchResult.ts#L6) for available login users. + +### Functional testing + +In order to start a server and run frontend unit tests using react-testing-framework, run: + +`yarn test :e2e` + +There are also more automated tests using Cypress in the `smoke-test` folder of the repository root. + +#### Troubleshooting + +`Error: error:0308010C:digital envelope routines::unsupported`: This error message shows up when using Node 17, due to an OpenSSL update related to md5. +The best workaround is to revert to the Active LTS version of Node, 16.13.0 with the command `nvm install 16.13.0` and if necessary reinstall yarn `npm install --global yarn`. 
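Put together, the workaround looks roughly like this (assuming `nvm` is already installed):

```
nvm install 16.13.0
nvm use 16.13.0
# Reinstall yarn if needed after switching Node versions
npm install --global yarn
```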
+ +### Theming + +#### Customizing your App without rebuilding assets + +To see the results of any change to a theme, you will need to rebuild your datahub-frontend-react container. While this may work for some users, if you don't want to rebuild your container +you can change two things without rebuilding. + +1. You customize the logo on the homepage & the search bar header by setting the `REACT_APP_LOGO_URL` env variable when deploying GMS. +2. You can customize the favicon (the icon on your browser tab) by setting the `REACT_APP_FAVICON_URL` env var when deploying GMS. + +#### Selecting a theme + +Theme configurations are stored in `./src/conf/theme`. To select a theme, choose one and update the `REACT_APP_THEME_CONFIG` env variable stored in `.env`. +To change the selected theme, update the `.env` file and re-run `yarn start` from `datahub/datahub-web-react`. + +#### Editing a theme + +To edit an existing theme, the recommendation is to clone one of the existing themes into a new file with the name `.config.json`, +and then update the env variable as descibed above. The theme files have three sections, `styles`, `assets` and `content`. The type of the theme configs is specified +in `./src/conf/theme/types.ts`. + +`styles` configure overrides for the apps theming variables. + +`assets` configures the logo url. + +`content` specifies customizable text fields. + +While developing on your theme, all changes to assets and content are seen immediately in your local app. However, changes to styles require +you to terminate and re-run `yarn start` to see updated styles. + +## Design Details + +### Package Organization + +The `src` dir of the app is broken down into the following modules + +**conf** - Stores global configuration flags that can be referenced across the app. For example, the number of +search results shown per page, or the placeholder text in the search bar box. It serves as a location where levels +for functional configurability should reside. + +**app** - Contains all important components of the app. It has a few sub-modules: + +- `auth`: Components used to render the user authentication experience. +- `browse`: Shared components used to render the 'browse-by-path' experience. The experience is akin to navigating a filesystem hierarchy. +- `preview`: Shared components used to render Entity 'preview' views. These can appear in search results, browse results, + and within entity profile pages. +- `search`: Shared components used to render the full-text search experience. +- `shared`: Misc. shared components +- `entity`: Contains Entity definitions, where entity-specific functionality resides. + Configuration is provided by implementing the 'Entity' interface. (See DatasetEntity.tsx for example) + There are 2 visual components each entity should supply: + + - `profiles`: display relevant details about an individual entity. This serves as the entity's 'profile'. + - `previews`: provide a 'preview', or a smaller details card, containing the most important information about an entity instance. + When rendering a preview, the entity's data and the type of preview (SEARCH, BROWSE, PREVIEW) are provided. This + allows you to optionally customize the way an entities preview is rendered in different views. + - `entity registry`: There's another very important piece of code living within this module: the **EntityRegistry**. This is a layer + of abstraction over the intimate details of rendering a particular entity. 
It is used + to render a view associated with a particular entity type (user, dataset, etc.). + +

+ +

+ +**graphql** - The React App talks to the `dathub-frontend` server using GraphQL. This module is where the _queries_ issued +against the server are defined. Once defined, running `yarn run generate` will code-gen TypeScript objects to make invoking +these queries extremely easy. An example can be found at the top of `SearchPage.tsx.` + +**images** - Images to be displayed within the app. This is where one would place a custom logo image. + +## Adding an Entity + +The following outlines a series of steps required to introduce a new entity into the React app: + +1. Declare the GraphQL Queries required to display the new entity + + - If search functionality should be supported, extend the "search" query within `search.graphql` to fetch the new + entity data. + - If browse functionality should be supported, extend the "browse" query within `browse.graphql` to fetch the new + entity data. + - If display a 'profile' should be supported (most often), introduce a new `.graphql` file that contains a + `get` query to fetch the entity by primary key (urn). + + Note that your new entity _must_ implement the `Entity` GraphQL type interface, and thus must have a corresponding + `EntityType`. + +2. Implement the `Entity` interface + + - Create a new folder under `src/components/entity` corresponding to your entity + - Create a class that implements the `Entity` interface (example: `DatasetEntity.tsx`) + - Provide an implementation each method defined on the interface. + - This class specifies whether your new entity should be searchable & browsable, defines the names used to + identify your entity when instances are rendered in collection / when entity appears + in the URL path, and provides the ability to render your entity given data returned by the GQL API. + +3. Register the new entity in the `EntityRegistry` + - Update `App.tsx` to register an instance of your new entity. Now your entity will be accessible via the registry + and appear in the UI. To manually retrieve the info about your entity or others, simply use an instance + of the `EntityRegistry`, which is provided via `ReactContext` to _all_ components in the hierarchy. + For example + ``` + entityRegistry.getCollectionName(EntityType.YOUR_NEW_ENTITY) + ``` + +That's it! For any questions, do not hesitate to reach out on the DataHub Slack community in #datahub-react. diff --git a/docs-website/versioned_docs/version-0.10.4/datahub-web-react/src/app/analytics/README.md b/docs-website/versioned_docs/version-0.10.4/datahub-web-react/src/app/analytics/README.md new file mode 100644 index 0000000000000..a8731a893c8cf --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/datahub-web-react/src/app/analytics/README.md @@ -0,0 +1,165 @@ +--- +title: DataHub React Analytics +sidebar_label: React Analytics +slug: /datahub-web-react/src/app/analytics +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/app/analytics/README.md +--- + +# DataHub React Analytics + +## About + +The DataHub React application can be configured to emit a set of standardized product analytics events to multiple backend providers including + +- Mixpanel +- Amplitude +- Google Analytics + +This provides operators of DataHub with visibility into how their users are engaging with the platform, allowing them to answer questions around weekly active users, the most used features, the least used features, and more. 
+ +To accomplish this, we have built a small extension on top of the popular [Analytics](https://www.npmjs.com/package/analytics) npm package. This package was chosen because it offers a clear pathway to extending support to many other providers, all of which you can find listed [here](https://github.com/DavidWells/analytics#analytic-plugins). + +## Configuring an Analytics Provider + +Currently, configuring an analytics provider requires that you fork DataHub & modify code. As described in 'Coming Soon', we intend to improve this process by implementing no-code configuration. + +### Mixpanel + +1. Open `datahub-web-react/src/conf/analytics.ts` +2. Uncomment the `mixpanel` field within the `config` object. +3. Replace the sample `token` with the API token provided by Mixpanel. +4. Rebuild & redeploy `datahub-frontend-react` to start tracking. + +```typescript +const config: any = { + mixpanel: { + token: "fad1285da4e618b618973cacf6565e61", + }, +}; +``` + +### Amplitude + +1. Open `datahub-web-react/src/conf/analytics.ts` +2. Uncomment the `amplitude` field within the `config` object. +3. Replace the sample `apiKey` with the key provided by Amplitude. +4. Rebuild & redeploy `datahub-frontend-react` to start tracking. + +```typescript +const config: any = { + amplitude: { + apiKey: "c5c212632315d19c752ab083bc7c92ff", + }, +}; +``` + +### Google Analytics + +**Disclaimers** + +- This plugin requires use of Universal Analytics and does not yet support GA4. To create a Universal Analytics Property, follow [this guide](https://www.analyticsmania.com/other-posts/how-to-create-a-universal-analytics-property/). +- Google Analytics lacks robust support for custom event properties. For that reason many of the DataHub events discussed above will not be fully populated. Instead, we map certain fields of the DataHub event to the standard `category`, `action`, `label` fields required by GA. + +1. Open `datahub-web-react/src/conf/analytics.ts` +2. Uncomment the `googleAnalytics` field within the `config` object. +3. Replace the sample `trackingId` with the one provided by Google Analytics. +4. Rebuild & redeploy `datahub-frontend-react` to start tracking. + +```typescript +const config: any = { + googleAnalytics: { + trackingId: "UA-24123123-01", + }, +}; +``` + +## Verifying your Analytics Setup + +To verify that analytics are being sent to your provider, you can inspect the networking tab of a Google Chrome inspector window: + +With DataHub open on Google Chrome + +1. Right click, then Inspect +2. Click 'Network' +3. Issue a search in DataHub +4. Inspect the outbound traffic for requests routed to your analytics provider. + +## Development + +### Adding a plugin + +To add a new plugin from the [Analytics](https://www.npmjs.com/package/analytics) library: + +1. Add a new file under `src/app/analytics/plugin` named based on the plugin +2. Extract configs from the analytics config object required to instantiate the plugin +3. Instantiate the plugin +4. Export a default object with 'isEnabled' and 'plugin' fields +5. Import / Export the new plugin module from `src/app/analytics/plugin/index.js` + +If you're unsure, check out the existing plugin implements as examples. Before contributing a plugin, please be sure to verify the integration by viewing the product metrics in the new analytics provider. + +### Adding an event + +To add a new DataHub analytics event, make the following changes to `src/app/analytics/event.ts`: + +1. 
Add a new value to the `EventType` enum + +```typescript + export enum EventType { + LogInEvent, + LogOutEvent, + ..., + MyNewEvent +} +``` + +2. Create a new interface extending `BaseEvent` + +```typescript +export interface MyNewEvent extends BaseEvent { + type: EventType.MyNewEvent; // must be the type you just added + ... your event's custom fields +} +``` + +3. Add the interface to the exported `Event` type. + +```typescript +export type Event = + | LogInEvent + | LogOutEvent + .... + | MyNewEvent +``` + +### Emitting an event + +Emitting a tracking DataHub analytics event is a 2-step process: + +1. Import relevant items from `analytics` module + +```typescript +import analytics, { EventType } from "../analytics"; +``` + +2. Call the `event` method, passing in an event object of the appropriate type + +```typescript +analytics.event({ type: EventType.MyNewEvent, ...my event fields }); +``` + +### Debugging: Enabling Event Logging + +To log events to the console for debugging / verification purposes + +1. Open `datahub-web-react/src/conf/analytics.ts` +2. Uncomment `logging: true` within the `config` object. +3. Rebuild & redeploy `datahub-frontend-react` to start logging all events to your browser's console. + +## Coming Soon + +In the near future, we intend to + +1. Send product analytics events back to DataHub itself, using them as feedback to improve the product experience. +2. No-code configuration of Analytics plugins. This will be achieved using server driven configuration for the React app. diff --git a/docs-website/versioned_docs/version-0.10.4/docker/README.md b/docs-website/versioned_docs/version-0.10.4/docker/README.md new file mode 100644 index 0000000000000..8e169fa3d326c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docker/README.md @@ -0,0 +1,80 @@ +--- +title: Deploying with Docker +hide_title: true +sidebar_label: Deploying with Docker +slug: /docker +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docker/README.md" +--- + +# Docker Images + +## Prerequisites + +You need to install [docker](https://docs.docker.com/install/) and +[docker-compose](https://docs.docker.com/compose/install/) (if using Linux; on Windows and Mac compose is included with +Docker Desktop). + +Make sure to allocate enough hardware resources for Docker engine. Tested & confirmed config: 2 CPUs, 8GB RAM, 2GB Swap +area. + +## Quickstart + +The easiest way to bring up and test DataHub is using DataHub [Docker](https://www.docker.com) images +which are continuously deployed to [Docker Hub](https://hub.docker.com/u/linkedin) with every commit to repository. + +You can easily download and run all these images and their dependencies with our +[quick start guide](../docs/quickstart.md). + +DataHub Docker Images: + +Do not use `latest` or `debug` tags for any of the image as those are not supported and present only due to legacy reasons. Please use `head` or tags specific for versions like `v0.8.40`. For production we recommend using version specific tags not `head`. 
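For example, pinning to a published release tag instead of `head` might look like the following (the image and tag shown are illustrative; use the release you actually deploy):

```
docker pull linkedin/datahub-gms:v0.8.40
```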
+ +- [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/) +- [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/) +- [linkedin/datahub-frontend-react](https://hub.docker.com/repository/docker/linkedin/datahub-frontend-react/) +- [linkedin/datahub-mae-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mae-consumer/) +- [linkedin/datahub-mce-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mce-consumer/) +- [acryldata/datahub-upgrade](https://hub.docker.com/r/acryldata/datahub-upgrade/) +- [linkedin/datahub-kafka-setup](https://hub.docker.com/r/acryldata/datahub-kafka-setup/) +- [linkedin/datahub-elasticsearch-setup](https://hub.docker.com/r/linkedin/datahub-elasticsearch-setup/) +- [acryldata/datahub-mysql-setup](https://hub.docker.com/r/acryldata/datahub-mysql-setup/) +- [acryldata/datahub-postgres-setup](https://hub.docker.com/r/acryldata/datahub-postgres-setup/) +- [acryldata/datahub-actions](https://hub.docker.com/r/acryldata/datahub-actions). Do not use `acryldata/acryl-datahub-actions` as that is deprecated and no longer used. + +Dependencies: + +- [Kafka, Zookeeper, and Schema Registry](https://github.com/datahub-project/datahub/blob/master/docker/kafka-setup) +- [Elasticsearch](https://github.com/datahub-project/datahub/blob/master/docker/elasticsearch-setup) +- [MySQL](https://github.com/datahub-project/datahub/blob/master/docker/mysql) +- [(Optional) Neo4j](https://github.com/datahub-project/datahub/blob/master/docker/neo4j) + +### Ingesting demo data. + +If you want to test ingesting some data once DataHub is up, use the `./docker/ingestion/ingestion.sh` script or `datahub docker ingest-sample-data`. See the [quickstart guide](../docs/quickstart.md) for more details. + +## Using Docker Images During Development + +See [Using Docker Images During Development](../docs/docker/development.md). + +## Building And Deploying Docker Images + +We use GitHub Actions to build and continuously deploy our images. There should be no need to do this manually; a +successful release on Github will automatically publish the images. + +### Building images + +> This is **not** our recommended development flow and most developers should be following the +> [Using Docker Images During Development](../docs/docker/development.md) guide. + +To build the full images (that we are going to publish), you need to run the following: + +``` +COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub build +``` + +This is because we're relying on builtkit for multistage builds. It does not hurt also set `DATAHUB_VERSION` to +something unique. + +### Community Built Images + +As the open source project grows, community members would like to contribute additions to the docker images. Not all contributions to the images can be accepted because those changes are not useful for all community members, it will increase build times, add dependencies and possible security vulns. In those cases this section can be used to point to `Dockerfiles` hosted by the community which build on top of the images published by the DataHub core team along with any container registry links where the result of those images are maintained. 
diff --git a/docs-website/versioned_docs/version-0.10.4/docker/airflow/local_airflow.md b/docs-website/versioned_docs/version-0.10.4/docker/airflow/local_airflow.md new file mode 100644 index 0000000000000..30fc24b22bee1 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docker/airflow/local_airflow.md @@ -0,0 +1,208 @@ +--- +title: Running Airflow locally with DataHub +slug: /docker/airflow/local_airflow +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docker/airflow/local_airflow.md +--- + +:::caution + +This feature is currently unmaintained. As of 0.10.0 the container described is not published alongside the DataHub CLI. If you'd like to use it, please reach out to us on the [community slack.](docs/slack.md) + +::: + +# Running Airflow locally with DataHub + +## Introduction + +This document describes how you can run Airflow side-by-side with DataHub's quickstart docker images to test out Airflow lineage with DataHub. +This offers a much easier way to try out Airflow with DataHub, compared to configuring containers by hand, setting up configurations and networking connectivity between the two systems. + +## Prerequisites + +- Docker: ensure that you have a working Docker installation and you have at least 8GB of memory to allocate to both Airflow and DataHub combined. + +``` +docker info | grep Memory + +> Total Memory: 7.775GiB +``` + +- Quickstart: Ensure that you followed [quickstart](../../docs/quickstart.md) to get DataHub up and running. + +## Step 1: Set up your Airflow area + +- Create an area to host your airflow installation +- Download the docker-compose file hosted in DataHub's repo in that directory +- Download a sample dag to use for testing Airflow lineage + +``` +mkdir -p airflow_install +cd airflow_install +# Download docker-compose file +curl -L 'https://raw.githubusercontent.com/datahub-project/datahub/master/docker/airflow/docker-compose.yaml' -o docker-compose.yaml +# Create dags directory +mkdir -p dags +# Download a sample DAG +curl -L 'https://raw.githubusercontent.com/datahub-project/datahub/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py' -o dags/lineage_backend_demo.py +``` + +### What is different between this docker-compose file and the official Apache Airflow docker compose file? + +- This docker-compose file is derived from the [official Airflow docker-compose file](https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#docker-compose-yaml) but makes a few critical changes to make interoperability with DataHub seamless. +- The Airflow image in this docker compose file extends the [base Apache Airflow docker image](https://airflow.apache.org/docs/docker-stack/index.html) and is published [here](https://hub.docker.com/r/acryldata/airflow-datahub). It includes the latest `acryl-datahub` pip package installed by default so you don't need to install it yourself. +- This docker-compose file sets up the networking so that + - the Airflow containers can talk to the DataHub containers through the `datahub_network` bridge interface. + - Modifies the port-forwarding to map the Airflow Webserver port `8080` to port `58080` on the localhost (to avoid conflicts with DataHub metadata-service, which is mapped to 8080 by default) +- This docker-compose file also sets up the ENV variables to configure Airflow's Lineage Backend to talk to DataHub. 
(Look for the `AIRFLOW__LINEAGE__BACKEND` and `AIRFLOW__LINEAGE__DATAHUB_KWARGS` variables) + +## Step 2: Bring up Airflow + +First you need to initialize airflow in order to create initial database tables and the initial airflow user. + +``` +docker-compose up airflow-init +``` + +You should see the following final initialization message + +``` +airflow-init_1 | Admin user airflow created +airflow-init_1 | 2.1.3 +airflow_install_airflow-init_1 exited with code 0 + +``` + +Afterwards you need to start the airflow docker-compose + +``` +docker-compose up +``` + +You should see a host of messages as Airflow starts up. + +``` +Container airflow_deploy_airflow-scheduler_1 Started 15.7s +Attaching to airflow-init_1, airflow-scheduler_1, airflow-webserver_1, airflow-worker_1, flower_1, postgres_1, redis_1 +airflow-worker_1 | BACKEND=redis +airflow-worker_1 | DB_HOST=redis +airflow-worker_1 | DB_PORT=6379 +airflow-worker_1 | +airflow-webserver_1 | +airflow-init_1 | DB: postgresql+psycopg2://airflow:***@postgres/airflow +airflow-init_1 | [2021-08-31 20:02:07,534] {db.py:702} INFO - Creating tables +airflow-init_1 | INFO [alembic.runtime.migration] Context impl PostgresqlImpl. +airflow-init_1 | INFO [alembic.runtime.migration] Will assume transactional DDL. +airflow-scheduler_1 | ____________ _____________ +airflow-scheduler_1 | ____ |__( )_________ __/__ /________ __ +airflow-scheduler_1 | ____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / / +airflow-scheduler_1 | ___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ / +airflow-scheduler_1 | _/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/ +airflow-scheduler_1 | [2021-08-31 20:02:07,736] {scheduler_job.py:661} INFO - Starting the scheduler +airflow-scheduler_1 | [2021-08-31 20:02:07,736] {scheduler_job.py:666} INFO - Processing each file at most -1 times +airflow-scheduler_1 | [2021-08-31 20:02:07,915] {manager.py:254} INFO - Launched DagFileProcessorManager with pid: 25 +airflow-scheduler_1 | [2021-08-31 20:02:07,918] {scheduler_job.py:1197} INFO - Resetting orphaned tasks for active dag runs +airflow-scheduler_1 | [2021-08-31 20:02:07,923] {settings.py:51} INFO - Configured default timezone Timezone('UTC') +flower_1 | +airflow-worker_1 | * Serving Flask app "airflow.utils.serve_logs" (lazy loading) +airflow-worker_1 | * Environment: production +airflow-worker_1 | WARNING: This is a development server. Do not use it in a production deployment. +airflow-worker_1 | Use a production WSGI server instead. +airflow-worker_1 | * Debug mode: off +airflow-worker_1 | [2021-08-31 20:02:09,283] {_internal.py:113} INFO - * Running on http://0.0.0.0:8793/ (Press CTRL+C to quit) +flower_1 | BACKEND=redis +flower_1 | DB_HOST=redis +flower_1 | DB_PORT=6379 +flower_1 | +``` + +Finally, Airflow should be healthy and up on port 58080. Navigate to [http://localhost:58080](http://localhost:58080) to confirm and find your Airflow webserver. +The default username and password is: + +``` +airflow:airflow +``` + +## Step 3: Register DataHub connection (hook) to Airflow + +``` +docker exec -it `docker ps | grep webserver | cut -d " " -f 1` airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' +``` + +### Result + +``` +Successfully added `conn_id`=datahub_rest_default : datahub_rest://:@http://datahub-gms:8080: +``` + +### What is the above command doing? 
+ +- It finds the container running the Airflow webserver: `docker ps | grep webserver | cut -d " " -f 1` +- It then runs the `airflow connections add ...` command inside that container to register a connection of type `datahub_rest` pointing at the `datahub-gms` host on port 8080. +- Note: this registration requires Airflow to be able to reach the `datahub-gms` host (the container running the datahub-gms image), which is why we needed to connect the Airflow containers to the `datahub_network` using our custom docker-compose file. + +## Step 4: Find the DAGs and run it + +Navigate the Airflow UI to find the sample Airflow DAG we just brought in +

+ +

+ +By default, Airflow loads all DAGs in paused status. Unpause the sample DAG to use it. +
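If you prefer the command line to the UI, the DAG can also be unpaused with the Airflow CLI inside the webserver container (a sketch; the DAG id is assumed from the sample file and may differ, so check the Airflow UI for the exact name):

```
docker exec -it `docker ps | grep webserver | cut -d " " -f 1` airflow dags unpause datahub_lineage_backend_demo
```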

+ +

+

+ +

+ +Then trigger the DAG to run. + +
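Triggering can likewise be done from the CLI (same assumption about the DAG id as above):

```
docker exec -it `docker ps | grep webserver | cut -d " " -f 1` airflow dags trigger datahub_lineage_backend_demo
```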

+ +

+ +After the DAG runs successfully, go over to your DataHub instance to see the Pipeline and navigate its lineage. + +

+ +

+ +

+ +

+ +

+ +

+ +

+ +

+ +## Troubleshooting + +Most issues are related to connectivity between Airflow and DataHub. + +Here is how you can debug them. +
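A couple of hedged starting points (the container lookup and connection id are the defaults used in this guide; adjust them if yours differ):

```
# Confirm the DataHub connection from Step 3 is registered
docker exec -it `docker ps | grep webserver | cut -d " " -f 1` airflow connections get datahub_rest_default
# Check that the Airflow containers can reach datahub-gms over the shared docker network
docker exec -it `docker ps | grep webserver | cut -d " " -f 1` curl -s http://datahub-gms:8080/health
```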

+ +

+ +

+ +

+ +In this case, clearly the connection `datahub-rest` has not been registered. Looks like we forgot to register the connection with Airflow! +Let's execute Step 3 to register the DataHub connection with Airflow. + +In case the connection was registered successfully but you are still seeing `Failed to establish a new connection`, check if the connection is `http://datahub-gms:8080` and not `http://localhost:8080`. + +After re-running the DAG, we see success! +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docker/datahub-upgrade/README.md b/docs-website/versioned_docs/version-0.10.4/docker/datahub-upgrade/README.md new file mode 100644 index 0000000000000..687f3b3eae4a7 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docker/datahub-upgrade/README.md @@ -0,0 +1,121 @@ +--- +title: DataHub Upgrade Docker Image +sidebar_label: Upgrade Docker Image +slug: /docker/datahub-upgrade +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/README.md +--- + +# DataHub Upgrade Docker Image + +This container is used to automatically apply upgrades from one version of DataHub to another. + +## Supported Upgrades + +As of today, there are 2 supported upgrades: + +1. **NoCodeDataMigration**: Performs a series of pre-flight qualification checks and then migrates metadata*aspect table data + to metadata_aspect_v2 table. Arguments: - \_batchSize* (Optional): The number of rows to migrate at a time. Defaults to 1000. - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250. - _dbType_ (optional): The target DB type. Valid values are `MYSQL`, `MARIA`, `POSTGRES`. Defaults to `MYSQL`. +2. **NoCodeDataMigrationCleanup**: Cleanses graph index, search index, and key-value store of legacy DataHub data (metadata_aspect table) once + the No Code Data Migration has completed successfully. No arguments. + +3. **RestoreIndices**: Restores indices by fetching the latest version of each aspect and producing MAE + +4. **RestoreBackup**: Restores the storage stack from a backup of the local database + +## Environment Variables + +To run the `datahub-upgrade` container, some environment variables must be provided in order to tell the upgrade CLI +where the running DataHub containers reside. + +Below details the required configurations. By default, these configs are provided for local docker-compose deployments of +DataHub within `docker/datahub-upgrade/env/docker.env`. They assume that there is a Docker network called datahub_network +where the DataHub containers can be found. + +These are also the variables used when the provided `datahub-upgrade.sh` script is executed. To run the upgrade CLI for non-local deployments, +follow these steps: + +1. Define new ".env" variable to hold your environment variables. 
+ +The following variables may be provided: + +```aidl +# Required Environment Variables +EBEAN_DATASOURCE_USERNAME=datahub +EBEAN_DATASOURCE_PASSWORD=datahub +EBEAN_DATASOURCE_HOST=:3306 +EBEAN_DATASOURCE_URL=jdbc:mysql://:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8 +EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver + +KAFKA_BOOTSTRAP_SERVER=:29092 +KAFKA_SCHEMAREGISTRY_URL=http://:8081 + +ELASTICSEARCH_HOST= +ELASTICSEARCH_PORT=9200 + +NEO4J_HOST=http://:7474 +NEO4J_URI=bolt:// +NEO4J_USERNAME=neo4j +NEO4J_PASSWORD=datahub + +DATAHUB_GMS_HOST=> +DATAHUB_GMS_PORT=8080 + +# Datahub protocol (default http) +# DATAHUB_GMS_PROTOCOL=http + +DATAHUB_MAE_CONSUMER_HOST= +DATAHUB_MAE_CONSUMER_PORT=9091 + +# Optional Arguments + +# Uncomment and set these to support SSL connection to Elasticsearch +# ELASTICSEARCH_USE_SSL= +# ELASTICSEARCH_SSL_PROTOCOL= +# ELASTICSEARCH_SSL_SECURE_RANDOM_IMPL= +# ELASTICSEARCH_SSL_TRUSTSTORE_FILE= +# ELASTICSEARCH_SSL_TRUSTSTORE_TYPE= +# ELASTICSEARCH_SSL_TRUSTSTORE_PASSWORD= +# ELASTICSEARCH_SSL_KEYSTORE_FILE= +# ELASTICSEARCH_SSL_KEYSTORE_TYPE= +# ELASTICSEARCH_SSL_KEYSTORE_PASSWORD= +``` + +2. Pull (or build) & execute the `datahub-upgrade` container: + +```aidl +docker pull acryldata/datahub-upgrade:head && docker run --env-file *path-to-custom-env-file.env* acryldata/datahub-upgrade:head -u NoCodeDataMigration +``` + +## Arguments + +The primary argument required by the datahub-upgrade container is the name of the upgrade to perform. This argument +can be specified using the `-u` flag when running the `datahub-upgrade` container. + +For example, to run the migration named "NoCodeDataMigration", you would do execute the following: + +```aidl +./datahub-upgrade.sh -u NoCodeDataMigration +``` + +OR + +```aidl +docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration +``` + +In addition to the required `-u` argument, each upgrade may require specific arguments. You can provide arguments to individual +upgrades using multiple `-a` arguments. + +For example, the NoCodeDataMigration upgrade provides 2 optional arguments detailed above: _batchSize_ and _batchDelayMs_. +To specify these, you can use a combination of `-a` arguments and of the form _argumentName=argumentValue_ as follows: + +```aidl +./datahub-upgrade.sh -u NoCodeDataMigration -a batchSize=500 -a batchDelayMs=1000 // Small batches with 1 second delay. 
+``` + +OR + +```aidl +docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration -a batchSize=500 -a batchDelayMs=1000 +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/CODE_OF_CONDUCT.md b/docs-website/versioned_docs/version-0.10.4/docs/CODE_OF_CONDUCT.md new file mode 100644 index 0000000000000..086bcf205c740 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/CODE_OF_CONDUCT.md @@ -0,0 +1,83 @@ +--- +title: Contributor Covenant Code of Conduct +slug: /code_of_conduct +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/CODE_OF_CONDUCT.md" +--- + +# Contributor Covenant Code of Conduct + +## Our Pledge + +In the interest of fostering an open and welcoming environment, we as +contributors and maintainers pledge to making participation in our project and +our community a harassment-free experience for everyone, regardless of age, body +size, disability, ethnicity, sex characteristics, gender identity and expression, +level of experience, education, socio-economic status, nationality, personal +appearance, race, religion, or sexual identity and orientation. + +## Our Standards + +Examples of behavior that contributes to creating a positive environment +include: + +- Using welcoming and inclusive language +- Being respectful of differing viewpoints and experiences +- Gracefully accepting constructive criticism +- Focusing on what is best for the community +- Showing empathy towards other community members + +Examples of unacceptable behavior by participants include: + +- The use of sexualized language or imagery and unwelcome sexual attention or + advances +- Trolling, insulting/derogatory comments, and personal or political attacks +- Public or private harassment +- Publishing others' private information, such as a physical or electronic + address, without explicit permission +- Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Our Responsibilities + +Project maintainers are responsible for clarifying the standards of acceptable +behavior and are expected to take appropriate and fair corrective action in +response to any instances of unacceptable behavior. + +Project maintainers have the right and responsibility to remove, edit, or +reject comments, commits, code, wiki edits, issues, and other contributions +that are not aligned to this Code of Conduct, or to ban temporarily or +permanently any contributor for other behaviors that they deem inappropriate, +threatening, offensive, or harmful. + +## Scope + +This Code of Conduct applies both within project spaces and in public spaces +when an individual is representing the project or its community. Examples of +representing a project or community include using an official project e-mail +address, posting via an official social media account, or acting as an appointed +representative at an online or offline event. Representation of a project may be +further defined and clarified by project maintainers. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported by direct messaging the project team on [Slack]. All +complaints will be reviewed and investigated and will result in a response that +is deemed necessary and appropriate to the circumstances. The project team is +obligated to maintain confidentiality with regard to the reporter of an incident. 
+Further details of specific enforcement policies may be posted separately. + +Project maintainers who do not follow or enforce the Code of Conduct in good +faith may face temporary or permanent repercussions as determined by other +members of the project's leadership. + +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, +available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html + +[Slack]: https://slack.datahubproject.io +[homepage]: https://www.contributor-covenant.org + +For answers to common questions about this code of conduct, see +https://www.contributor-covenant.org/faq diff --git a/docs-website/versioned_docs/version-0.10.4/docs/CONTRIBUTING.md b/docs-website/versioned_docs/version-0.10.4/docs/CONTRIBUTING.md new file mode 100644 index 0000000000000..90148ef26751a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/CONTRIBUTING.md @@ -0,0 +1,102 @@ +--- +title: Contributing +slug: /contributing +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/CONTRIBUTING.md" +--- + +# Contributing + +We always welcome contributions to help make DataHub better. Take a moment to read this document if you would like to contribute. + +## Provide Feedback + +Have ideas about how to make DataHub better? Head over to [DataHub Feature Requests](https://feature-requests.datahubproject.io/) and tell us all about it! + +Show your support for other requests by upvoting; stay up to date on progress by subscribing for updates via email. + +## Reporting Issues + +We use GitHub issues to track bug reports and pull requests. + +If you find a bug: + +1. Use the GitHub issue search to check whether the bug has already been reported. + +1. If the issue has been fixed, try to reproduce the issue using the latest master branch of the repository. + +1. If the issue still reproduces or has not yet been reported, try to isolate the problem before opening an issue. + +## Submitting a Request For Comment (RFC) + +If you have a substantial feature or a design discussion that you'd like to have with the community, follow the RFC process outlined [here](./rfc.md). + +## Submitting a Pull Request (PR) + +Before you submit your Pull Request (PR), consider the following guidelines: + +- Search GitHub for an open or closed PR that relates to your submission. You don't want to duplicate effort. +- Follow the [standard GitHub approach](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request-from-a-fork) to create the PR. Please also follow our [commit message format](#commit-message-format). +- If there are any breaking changes, potential downtime, deprecations, or big features, please add an update in [Updating DataHub under Next](how/updating-datahub.md). +- That's it! Thank you for your contribution! + +## Commit Message Format + +Please follow the [Conventional Commits](https://www.conventionalcommits.org/) specification for the commit message format. In summary, each commit message consists of a _header_, a _body_ and a _footer_, separated by a single blank line. + +``` +<type>[optional scope]: <description> + +[optional body] + +[optional footer(s)] +``` + +No line of the commit message can be longer than 88 characters! This keeps the message easier to read on GitHub as well as in various Git tools.
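For illustration, here is a hypothetical commit message that follows this structure (the scope, description, body, and issue reference below are invented for the example):

```
fix(ingestion): handle empty schema fields in the kafka source

Previously, ingestion failed when a schema field had no type declared.
Skip such fields and log a warning instead of raising an exception.

Closes ISSUE-NUMBER
```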
+ +### Type + +Must be one of the following (based on the [Angular convention](https://github.com/angular/angular/blob/22b96b9/CONTRIBUTING.md#-commit-message-guidelines)): + +- _feat_: A new feature +- _fix_: A bug fix +- _refactor_: A code change that neither fixes a bug nor adds a feature +- _docs_: Documentation only changes +- _test_: Adding missing tests or correcting existing tests +- _perf_: A code change that improves performance +- _style_: Changes that do not affect the meaning of the code (whitespace, formatting, missing semicolons, etc.) +- _build_: Changes that affect the build system or external dependencies +- _ci_: Changes to our CI configuration files and scripts + +A scope may be provided to a commit’s type, to provide additional contextual information and is contained within parenthesis, e.g., + +``` +feat(parser): add ability to parse arrays +``` + +### Description + +Each commit must contain a succinct description of the change: + +- use the imperative, present tense: "change" not "changed" nor "changes" +- don't capitalize the first letter +- no dot(.) at the end + +### Body + +Just as in the description, use the imperative, present tense: "change" not "changed" nor "changes". The body should include the motivation for the change and contrast this with previous behavior. + +### Footer + +The footer should contain any information about _Breaking Changes_, and is also the place to reference GitHub issues that this commit _Closes_. + +_Breaking Changes_ should start with the words `BREAKING CHANGE:` with a space or two new lines. The rest of the commit message is then used for this. + +### Revert + +If the commit reverts a previous commit, it should begin with `revert:`, followed by the description. In the body it should say: `Refs: ...`, where the hashs are the SHA of the commits being reverted, e.g. + +``` +revert: let us never again speak of the noodle incident + +Refs: 676104e, a215868 +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/_feature-guide-template.md b/docs-website/versioned_docs/version-0.10.4/docs/_feature-guide-template.md new file mode 100644 index 0000000000000..da573168bd3f7 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/_feature-guide-template.md @@ -0,0 +1,91 @@ +--- +title: "About DataHub [Feature Name]" +sidebar_label: "[Feature Name]" +slug: /_feature-guide-template +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/_feature-guide-template.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub [Feature Name] + + + + + + + + + +## [Feature Name] Setup, Prerequisites, and Permissions + + + +## Using [Feature Name] + + + +## Additional Resources + + + +### Videos + + + + + +### GraphQL + + + +### DataHub Blog + + + +## FAQ and Troubleshooting + + + +_Need more help? 
Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + + diff --git a/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata.md b/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata.md new file mode 100644 index 0000000000000..4af987af263fb --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata.md @@ -0,0 +1,21 @@ +--- +title: Act on Metadata Overview +slug: /act-on-metadata +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/act-on-metadata.md" +--- + +# Act on Metadata Overview + +DataHub's metadata infrastructure is stream-oriented, meaning that all changes in metadata are communicated and reflected within the platform within seconds. + +This unlocks endless opportunities to automate data governance and data management workflows, such as: + +- Automatically enrich or annotate existing data entities within DataHub, i.e., apply Tags, Terms, Owners, etc. +- Leverage the [Actions Framework](actions/README.md) to trigger external workflows or send alerts to external systems, i.e., send a message to a team channel when there's a schema change +- Proactively identify what business-critical data resources will be impacted by a breaking schema change + +This section contains resources to help you take real-time action on your rapidly evolving data stack. + +
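As a rough sketch of how such automation is wired up (using the Actions Framework linked above; the pipeline name, filter values, and `hello_world` action are illustrative placeholders, and the full set of configuration options is covered in the Actions docs), an action that reacts to schema changes might be configured like this:

```yml
# Illustrative sketch only: react to technical schema changes (field additions/removals)
name: "schema_change_alerts"
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
filter:
  event_type: "EntityChangeEvent_v1"
  event:
    category: "TECHNICAL_SCHEMA"
action:
  type: "hello_world" # swap in a real notification action (e.g. Slack) for alerting
```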

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata/impact-analysis.md b/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata/impact-analysis.md new file mode 100644 index 0000000000000..92c307e4a82da --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/act-on-metadata/impact-analysis.md @@ -0,0 +1,102 @@ +--- +title: About DataHub Lineage Impact Analysis +sidebar_label: Lineage Impact Analysis +slug: /act-on-metadata/impact-analysis +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/act-on-metadata/impact-analysis.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Lineage Impact Analysis + + + +Lineage Impact Analysis is a powerful workflow for understanding the complete set of upstream and downstream dependencies of a Dataset, Dashboard, Chart, and many other DataHub Entities. + +This allows Data Practitioners to proactively identify the impact of breaking schema changes or failed data pipelines on downstream dependencies, rapidly discover which upstream dependencies may have caused unexpected data quality issues, and more. + +Lineage Impact Analysis is available via the DataHub UI and GraphQL endpoints, supporting manual and automated workflows. + +## Lineage Impact Analysis Setup, Prerequisites, and Permissions + +Lineage Impact Analysis is enabled for any Entity that has associated Lineage relationships with other Entities and does not require any additional configuration. + +Any DataHub user with “View Entity Page” permissions is able to view the full set of upstream or downstream Entities and export results to CSV from the DataHub UI. + +## Using Lineage Impact Analysis + +Follow these simple steps to understand the full dependency chain of your data entities. + +1. On a given Entity Page, select the **Lineage** tab + +

+ +

+ +2. Easily toggle between **Upstream** and **Downstream** dependencies + +

+ +

+ +3. Choose the **Degree of Dependencies** you are interested in. The default filter is “1 Degree of Dependency” to minimize processor-intensive queries. + +

+ +

+ +4. Slice and dice the result list by Entity Type, Platform, Owner, and more to isolate the relevant dependencies + +

+ +

+ +5. Export the full list of dependencies to CSV + +

+ +

+ +6. View the filtered set of dependencies via CSV, with details about assigned ownership, domain, tags, terms, and quick links back to those entities within DataHub + +

+ +

+ +## Additional Resources + +### Videos + +**DataHub 201: Impact Analysis** + +

+ +

+ +### GraphQL + +- [searchAcrossLineage](../../graphql/queries.md#searchacrosslineage) +- [searchAcrossLineageInput](../../graphql/inputObjects.md#searchacrosslineageinput) + +Looking for an example of how to use `searchAcrossLineage` to read lineage? Look [here](../api/tutorials/lineage.md#read-lineage) + +### DataHub Blog + +- [Dependency Impact Analysis, Data Validation Outcomes, and MORE! - Highlights from DataHub v0.8.27 & v.0.8.28](https://blog.datahubproject.io/dependency-impact-analysis-data-validation-outcomes-and-more-1302604da233) + +### FAQ and Troubleshooting + +**The Lineage Tab is greyed out - why can’t I click on it?** + +This means you have not yet ingested Lineage metadata for that entity. Please see the Lineage Guide to get started. + +**Why is my list of exported dependencies incomplete?** + +We currently limit the list of dependencies to 10,000 records; we suggest applying filters to narrow the result set if you hit that limit. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [DataHub Lineage](../lineage/lineage-feature-guide.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/README.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/README.md new file mode 100644 index 0000000000000..10040bd1e45dd --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/README.md @@ -0,0 +1,250 @@ +--- +title: Introduction +slug: /actions +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/actions/README.md" +--- + +# ⚡ DataHub Actions Framework + +Welcome to DataHub Actions! The Actions framework makes responding to realtime changes in your Metadata Graph easy, enabling you to seamlessly integrate [DataHub](https://github.com/datahub-project/datahub) into a broader events-based architecture. + +For a detailed introduction, check out the [original announcement](https://www.youtube.com/watch?v=7iwNxHgqxtg&t=2189s) of the DataHub Actions Framework at the DataHub April 2022 Town Hall. For a more in-depth look at use cases and concepts, check out [DataHub Actions Concepts](concepts.md). + +## Quickstart + +To get started right away, check out the [DataHub Actions Quickstart](quickstart.md) Guide. + +## Prerequisites + +The DataHub Actions CLI commands are an extension of the base `datahub` CLI commands. We recommend +first installing the `datahub` CLI: + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +datahub --version +``` + +> Note that the Actions Framework requires a version of `acryl-datahub` >= v0.8.34 + +## Installation + +Next, simply install the `acryl-datahub-actions` package from PyPi: + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub-actions +datahub actions version +``` + +## Configuring an Action + +Actions are configured using a YAML file, much in the same way DataHub ingestion sources are. An action configuration file consists of the following + +1. Action Pipeline Name (Should be unique and static) +2. Source Configurations +3. Transform + Filter Configurations +4. Action Configuration +5. Pipeline Options (Optional) +6. DataHub API configs (Optional - required for select actions) + +With each component being independently pluggable and configurable. + +```yml +# 1. Required: Action Pipeline Name +name: + +# 2. Required: Event Source - Where to source event from. 
+source: + type: + config: + # Event Source specific configs (map) + +# 3a. Optional: Filter to run on events (map) +filter: + event_type: + event: + # Filter event fields by exact-match + + +# 3b. Optional: Custom Transformers to run on events (array) +transform: + - type: + config: + # Transformer-specific configs (map) + +# 4. Required: Action - What action to take on events. +action: + type: + config: + # Action-specific configs (map) + +# 5. Optional: Additional pipeline options (error handling, etc) +options: + retry_count: 0 # The number of times to retry an Action with the same event. (If an exception is thrown). 0 by default. + failure_mode: "CONTINUE" # What to do when an event fails to be processed. Either 'CONTINUE' to make progress or 'THROW' to stop the pipeline. Either way, the failed event will be logged to a failed_events.log file. + failed_events_dir: "/tmp/datahub/actions" # The directory in which to write a failed_events.log file that tracks events which fail to be processed. Defaults to "/tmp/logs/datahub/actions". + +# 6. Optional: DataHub API configuration +datahub: + server: "http://localhost:8080" # Location of DataHub API + # token: # Required if Metadata Service Auth enabled +``` + +### Example: Hello World + +An simple configuration file for a "Hello World" action, which simply prints all events it receives, is + +```yml +# 1. Action Pipeline Name +name: "hello_world" +# 2. Event Source: Where to source event from. +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +# 3. Action: What action to take on events. +action: + type: "hello_world" +``` + +We can modify this configuration further to filter for specific events, by adding a "filter" block. + +```yml +# 1. Action Pipeline Name +name: "hello_world" + +# 2. Event Source - Where to source event from. +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + +# 3. Filter - Filter events that reach the Action +filter: + event_type: "EntityChangeEvent_v1" + event: + category: "TAG" + operation: "ADD" + modifier: "urn:li:tag:pii" + +# 4. Action - What action to take on events. +action: + type: "hello_world" +``` + +## Running an Action + +To run a new Action, just use the `actions` CLI command + +``` +datahub actions -c +``` + +Once the Action is running, you will see + +``` +Action Pipeline with name '' is now running. +``` + +### Running multiple Actions + +You can run multiple actions pipeline within the same command. Simply provide multiple +config files by restating the "-c" command line argument. + +For example, + +``` +datahub actions -c -c +``` + +### Running in debug mode + +Simply append the `--debug` flag to the CLI to run your action in debug mode. + +``` +datahub actions -c --debug +``` + +### Stopping an Action + +Just issue a Control-C as usual. You should see the Actions Pipeline shut down gracefully, with a small +summary of processing results. + +``` +Actions Pipeline with name '- + https://github.com/datahub-project/datahub/blob/master/docs/actions/actions/executor.md +--- + +# Ingestion Executor + + + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +## Overview + +This Action executes ingestion recipes that are configured via the UI. 
+ +### Capabilities + +- Executing `datahub ingest` command in a sub-process when an Execution Request command is received from DataHub. (Scheduled or manual ingestion run) +- Resolving secrets within an ingestion recipe from DataHub +- Reporting ingestion execution status to DataHub + +### Supported Events + +- `MetadataChangeLog_v1` + +Specifically, changes to the `dataHubExecutionRequestInput` and `dataHubExecutionRequestSignal` aspects of the `dataHubExecutionRequest` entity are required. + +## Action Quickstart + +### Prerequisites + +#### DataHub Privileges + +This action must be executed as a privileged DataHub user (e.g. using Personal Access Tokens). Specifically, the user must have the `Manage Secrets` Platform Privilege, which allows for retrieval +of decrypted secrets for injection into an ingestion recipe. + +An access token generated from a privileged account must be configured in the `datahub` configuration +block of the YAML configuration, as shown in the example below. + +#### Connecting to Ingestion Sources + +In order for ingestion to run successfully, the process running the Actions must have +network connectivity to any source systems that are required for ingestion. + +For example, if the ingestion recipe is pulling from an internal DBMS, the actions container +must be able to resolve & connect to that DBMS system for the ingestion command to run successfully. + +### Install the Plugin(s) + +Run the following commands to install the relevant action plugin(s): + +`pip install 'acryl-datahub-actions[executor]'` + +### Configure the Action Config + +Use the following config(s) to get started with this Action. + +```yml +name: "pipeline-name" +source: + # source configs +action: + type: "executor" +# Requires DataHub API configurations to report to DataHub +datahub: + server: "http://${DATAHUB_GMS_HOST:-localhost}:${DATAHUB_GMS_PORT:-8080}" + # token: # Must have "Manage Secrets" privilege +``` + +
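Assuming the configuration above is saved to a local file (the filename `executor.yaml` here is just an example), the action can then be started with the standard Actions CLI command described in the Actions introduction:

```shell
datahub actions -c executor.yaml
```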
+ View All Configuration Options + + | Field | Required | Default | Description | + | --- | :-: | :-: | --- | + | `executor_id` | ❌ | `default` | An executor ID assigned to the executor. This can be used to manage multiple distinct executors. | +
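For example, a sketch of running a second, separately identified executor (the id value is arbitrary, and this assumes `executor_id` is set under the action's `config` block like the other options):

```yml
action:
  type: "executor"
  config:
    executor_id: "remote-executor-1" # arbitrary example id
```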
+ +## Troubleshooting + +### Quitting the Actions Framework + +Currently, when you quit the Actions framework, any in-flight ingestion processing will continue to execute as a subprocess on your system. This means that there may be "orphaned" processes which +are never marked as "Succeeded" or "Failed" in the UI, even though they may have completed. + +To address this, simply "Cancel" the ingestion source on the UI once you've restarted the Ingestion Executor action. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/hello_world.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/hello_world.md new file mode 100644 index 0000000000000..05687334c921f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/hello_world.md @@ -0,0 +1,63 @@ +--- +title: Hello World +slug: /actions/actions/hello_world +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/actions/hello_world.md +--- + +# Hello World + + + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +## Overview + +This Action is an example action which simply prints all Events it receives as JSON. + +### Capabilities + +- Printing events that are received by the Action to the console. + +### Supported Events + +All event types, including + +- `EntityChangeEvent_v1` +- `MetadataChangeLog_v1` + +## Action Quickstart + +### Prerequisites + +No prerequisites. This action comes pre-loaded with `acryl-datahub-actions`. + +### Install the Plugin(s) + +This action comes with the Actions Framework by default: + +`pip install 'acryl-datahub-actions'` + +### Configure the Action Config + +Use the following config(s) to get started with this Action. + +```yml +name: "pipeline-name" +source: + # source configs +action: + type: "hello_world" +``` + +
+ View All Configuration Options + + | Field | Required | Default | Description | + | --- | :-: | :-: | --- | + | `to_upper` | ❌| `False` | Whether to print events in upper case. | +
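For example, a variant of the configuration above that enables the single documented option (assuming `to_upper` sits under the action's `config` block) might look like:

```yml
name: "hello_world_upper"
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
action:
  type: "hello_world"
  config:
    to_upper: true # print received events in upper case
```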
+ +## Troubleshooting + +N/A diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/slack.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/slack.md new file mode 100644 index 0000000000000..6dd8a5de00ffe --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/slack.md @@ -0,0 +1,281 @@ +--- +title: Slack +slug: /actions/actions/slack +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/actions/slack.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Slack + + + +# Slack + +| | | +| ------------------------ | ----------------------------------------------------------------------------------------------------- | +| **Status** | ![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) | +| **Version Requirements** | ![Minimum Version Requirements](https://img.shields.io/badge/acryl_datahub_actions-v0.0.9+-green.svg) | + +## Overview + +This Action integrates DataHub with Slack to send notifications to a configured Slack channel in your workspace. + +### Capabilities + +- Sending notifications of important events to a Slack channel + - Adding or Removing a tag from an entity (dataset, dashboard etc.) + - Updating documentation at the entity or field (column) level. + - Adding or Removing ownership from an entity (dataset, dashboard, etc.) + - Creating a Domain + - and many more. + +### User Experience + +On startup, the action will produce a welcome message that looks like the one below. +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_welcome_message.png) + +On each event, the action will produce a notification message that looks like the one below. +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_notification_message.png) + +Watch the townhall demo to see this in action: +[![Slack Action Demo](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_demo_image.png)](https://www.youtube.com/watch?v=BlCLhG8lGoY&t=2998s) + +### Supported Events + +- `EntityChangeEvent_v1` +- Currently, the `MetadataChangeLog_v1` event is **not** processed by the Action. + +## Action Quickstart + +### Prerequisites + +Ensure that you have configured the Slack App in your Slack workspace. + +#### Install the DataHub Slack App into your Slack workspace + +The following steps should be performed by a Slack Workspace Admin. + +- Navigate to https://api.slack.com/apps/ +- Click Create New App +- Use “From an app manifest” option +- Select your workspace +- Paste this Manifest in YAML. We suggest changing the name and `display_name` to be `DataHub App YOUR_TEAM_NAME` but this is not required. This name will show up in your Slack workspace. 
+ +```yml +display_information: + name: DataHub App + description: An app to integrate DataHub with Slack + background_color: "#000000" +features: + bot_user: + display_name: DataHub App + always_online: false +oauth_config: + scopes: + bot: + - channels:history + - channels:read + - chat:write + - commands + - groups:read + - im:read + - mpim:read + - team:read + - users:read + - users:read.email +settings: + org_deploy_enabled: false + socket_mode_enabled: false + token_rotation_enabled: false +``` + +- Confirm you see the Basic Information Tab + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_basic_info.png) + +- Click **Install to Workspace** +- It will show you permissions the Slack App is asking for, what they mean and a default channel in which you want to add the slack app + - Note that the Slack App will only be able to post in channels that the app has been added to. This is made clear by slack’s Authentication screen also. +- Select the channel you'd like notifications to go to and click **Allow** +- Go to the DataHub App page + - You can find your workspace's list of apps at https://api.slack.com/apps/ + +#### Getting Credentials and Configuration + +Now that you've created your app and installed it in your workspace, you need a few pieces of information before you can activate your Slack action. + +#### 1. The Signing Secret + +On your app's Basic Information page, you will see a App Credentials area. Take note of the Signing Secret information, you will need it later. + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_app_credentials.png) + +#### 2. The Bot Token + +Navigate to the **OAuth & Permissions** Tab + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_oauth_and_permissions.png) + +Here you'll find a “Bot User OAuth Token” which DataHub will need to communicate with your Slack workspace through the bot. + +#### 3. The Slack Channel + +Finally, you need to figure out which Slack channel you will send notifications to. Perhaps it should be called #datahub-notifications or maybe, #data-notifications or maybe you already have a channel where important notifications about datasets and pipelines are already being routed to. Once you have decided what channel to send notifications to, make sure to add the app to the channel. + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_channel_add_app.png) + +Next, figure out the channel id for this Slack channel. You can find it in the About section for the channel if you scroll to the very bottom of the app. +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_channel_id.png) + +Alternately, if you are on the browser, you can figure it out from the URL. e.g. for the troubleshoot channel in OSS DataHub slack + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_channel_url.png) + +- Notice `TUMKD5EGJ/C029A3M079U` in the URL + - Channel ID = `C029A3M079U` from above + +In the next steps, we'll show you how to configure the Slack Action based on the credentials and configuration values that you have collected. 
+ +### Installation Instructions (Deployment specific) + +#### Managed DataHub + +Head over to the [Configuring Notifications](../../managed-datahub/saas-slack-setup.md#configuring-notifications) section in the Managed DataHub guide to configure Slack notifications for your Managed DataHub instance. + +#### Quickstart + +If you are running DataHub using the docker quickstart option, there are no additional software installation steps. The `datahub-actions` container comes pre-installed with the Slack action. + +All you need to do is export a few environment variables to activate and configure the integration. See below for the list of environment variables to export. + +| Env Variable | Required for Integration | Purpose | +| -------------------------------------- | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| DATAHUB_ACTIONS_SLACK_ENABLED | ✅ | Set to "true" to enable the Slack action | +| DATAHUB_ACTIONS_SLACK_SIGNING_SECRET | ✅ | Set to the [Slack Signing Secret](#1-the-signing-secret) that you configured in the pre-requisites step above | +| DATAHUB_ACTIONS_SLACK_BOT_TOKEN | ✅ | Set to the [Bot User OAuth Token](#2-the-bot-token) that you configured in the pre-requisites step above | +| DATAHUB_ACTIONS_SLACK_CHANNEL | ✅ | Set to the [Slack Channel ID](#3-the-slack-channel) that you want the action to send messages to | +| DATAHUB_ACTIONS_SLACK_DATAHUB_BASE_URL | ❌ | Defaults to "http://localhost:9002". Set to the location where your DataHub UI is running. On a local quickstart this is usually "http://localhost:9002", so you shouldn't need to modify this | + +:::note + +You will have to restart the `datahub-actions` docker container after you have exported these environment variables if this is the first time. The simplest way to do it is via the Docker Desktop UI, or by just issuing a `datahub docker quickstart --stop && datahub docker quickstart` command to restart the whole instance. + +::: + +For example: + +```shell +export DATAHUB_ACTIONS_SLACK_ENABLED=true +export DATAHUB_ACTIONS_SLACK_SIGNING_SECRET= +.... +export DATAHUB_ACTIONS_SLACK_CHANNEL= + +datahub docker quickstart --stop && datahub docker quickstart +``` + +#### k8s / helm + +Similar to the quickstart scenario, there are no specific software installation steps. The `datahub-actions` container comes pre-installed with the Slack action. You just need to export a few environment variables and make them available to the `datahub-actions` container to activate and configure the integration. See below for the list of environment variables to export. 
+ +| Env Variable | Required for Integration | Purpose | +| ------------------------------------ | ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| DATAHUB_ACTIONS_SLACK_ENABLED | ✅ | Set to "true" to enable the Slack action | +| DATAHUB_ACTIONS_SLACK_SIGNING_SECRET | ✅ | Set to the [Slack Signing Secret](#1-the-signing-secret) that you configured in the pre-requisites step above | +| DATAHUB_ACTIONS_SLACK_BOT_TOKEN | ✅ | Set to the [Bot User OAuth Token](#2-the-bot-token) that you configured in the pre-requisites step above | +| DATAHUB_ACTIONS_SLACK_CHANNEL | ✅ | Set to the [Slack Channel ID](#3-the-slack-channel) that you want the action to send messages to | +| DATAHUB_ACTIONS_DATAHUB_BASE_URL | ✅ | Set to the location where your DataHub UI is running. For example, if your DataHub UI is hosted at "https://datahub.my-company.biz", set this to "https://datahub.my-company.biz" | + +#### Bare Metal - CLI or Python-based + +If you are using the `datahub-actions` library directly from Python, or the `datahub-actions` cli directly, then you need to first install the `slack` action plugin in your Python virtualenv. + +``` +pip install "datahub-actions[slack]" +``` + +Then run the action with a configuration file that you have modified to capture your credentials and configuration. + +##### Sample Slack Action Configuration File + +```yml +name: datahub_slack_action +enabled: true +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + topic_routes: + mcl: ${METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME:-MetadataChangeLog_Versioned_v1} + pe: ${PLATFORM_EVENT_TOPIC_NAME:-PlatformEvent_v1} + +## 3a. Optional: Filter to run on events (map) +# filter: +# event_type: +# event: +# # Filter event fields by exact-match +# + +# 3b. Optional: Custom Transformers to run on events (array) +# transform: +# - type: +# config: +# # Transformer-specific configs (map) + +action: + type: slack + config: + # Action-specific configs (map) + base_url: ${DATAHUB_ACTIONS_SLACK_DATAHUB_BASE_URL:-http://localhost:9002} + bot_token: ${DATAHUB_ACTIONS_SLACK_BOT_TOKEN} + signing_secret: ${DATAHUB_ACTIONS_SLACK_SIGNING_SECRET} + default_channel: ${DATAHUB_ACTIONS_SLACK_CHANNEL} + suppress_system_activity: ${DATAHUB_ACTIONS_SLACK_SUPPRESS_SYSTEM_ACTIVITY:-true} + +datahub: + server: "http://${DATAHUB_GMS_HOST:-localhost}:${DATAHUB_GMS_PORT:-8080}" +``` + +##### Slack Action Configuration Parameters + +| Field | Required | Default | Description | +| -------------------------- | -------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `base_url` | ❌ | `False` | Whether to print events in upper case. 
| +| `signing_secret` | ✅ | | Set to the [Slack Signing Secret](#1-the-signing-secret) that you configured in the pre-requisites step above | +| `bot_token` | ✅ | | Set to the [Bot User OAuth Token](#2-the-bot-token) that you configured in the pre-requisites step above | +| `default_channel` | ✅ | | Set to the [Slack Channel ID](#3-the-slack-channel) that you want the action to send messages to | +| `suppress_system_activity` | ❌ | `True` | Set to `False` if you want to get low level system activity events, e.g. when datasets are ingested, etc. Note: this will currently result in a very spammy Slack notifications experience, so this is not recommended to be changed. | + +## Troubleshooting + +If things are configured correctly, you should see logs on the `datahub-actions` container that indicate success in enabling and running the Slack action. + +```shell +docker logs datahub-datahub-actions-1 + +... +[2022-12-04 07:07:53,804] INFO {datahub_actions.plugin.action.slack.slack:96} - Slack notification action configured with bot_token=SecretStr('**********') signing_secret=SecretStr('**********') default_channel='C04CZUSSR5X' base_url='http://localhost:9002' suppress_system_activity=True +[2022-12-04 07:07:54,506] WARNING {datahub_actions.cli.actions:103} - Skipping pipeline datahub_teams_action as it is not enabled +[2022-12-04 07:07:54,506] INFO {datahub_actions.cli.actions:119} - Action Pipeline with name 'ingestion_executor' is now running. +[2022-12-04 07:07:54,507] INFO {datahub_actions.cli.actions:119} - Action Pipeline with name 'datahub_slack_action' is now running. +... +``` + +If the Slack action was not enabled, you would see messages indicating that. +e.g. the following logs below show that neither the Slack or Teams action were enabled. + +```shell +docker logs datahub-datahub-actions-1 + +.... +No user action configurations found. Not starting user actions. +[2022-12-04 06:45:27,509] INFO {datahub_actions.cli.actions:76} - DataHub Actions version: unavailable (installed editable via git) +[2022-12-04 06:45:27,647] WARNING {datahub_actions.cli.actions:103} - Skipping pipeline datahub_slack_action as it is not enabled +[2022-12-04 06:45:27,649] WARNING {datahub_actions.cli.actions:103} - Skipping pipeline datahub_teams_action as it is not enabled +[2022-12-04 06:45:27,649] INFO {datahub_actions.cli.actions:119} - Action Pipeline with name 'ingestion_executor' is now running. +... 
+ +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/teams.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/teams.md new file mode 100644 index 0000000000000..8fda995c7c825 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/actions/teams.md @@ -0,0 +1,185 @@ +--- +title: Microsoft Teams +slug: /actions/actions/teams +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/actions/teams.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Microsoft Teams + + + +| | | +| ------------------------ | ----------------------------------------------------------------------------------------------------- | +| **Status** | ![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) | +| **Version Requirements** | ![Minimum Version Requirements](https://img.shields.io/badge/acryl_datahub_actions-v0.0.9+-green.svg) | + +## Overview + +This Action integrates DataHub with Microsoft Teams to send notifications to a configured Teams channel in your workspace. + +### Capabilities + +- Sending notifications of important events to a Teams channel + - Adding or Removing a tag from an entity (dataset, dashboard etc.) + - Updating documentation at the entity or field (column) level. + - Adding or Removing ownership from an entity (dataset, dashboard, etc.) + - Creating a Domain + - and many more. + +### User Experience + +On startup, the action will produce a welcome message that looks like the one below. +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/teams/teams_welcome_message.png) + +On each event, the action will produce a notification message that looks like the one below. +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/teams/teams_notification_message.png) + +Watch the townhall demo to see this in action: +[![Teams Action Demo](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/teams/teams_demo_image.png)](https://www.youtube.com/watch?v=BlCLhG8lGoY&t=2998s) + +### Supported Events + +- `EntityChangeEvent_v1` +- Currently, the `MetadataChangeLog_v1` event is **not** processed by the Action. + +## Action Quickstart + +### Prerequisites + +Ensure that you have configured an incoming webhook in your Teams channel. + +Follow the guide [here](https://learn.microsoft.com/en-us/microsoftteams/platform/webhooks-and-connectors/how-to/add-incoming-webhook) to set it up. + +Take note of the incoming webhook url as you will need to use that to configure the Team action. + +### Installation Instructions (Deployment specific) + +#### Quickstart + +If you are running DataHub using the docker quickstart option, there are no additional software installation steps. The `datahub-actions` container comes pre-installed with the Teams action. + +All you need to do is export a few environment variables to activate and configure the integration. See below for the list of environment variables to export. 
+ +| Env Variable | Required for Integration | Purpose | +| --------------------------------- | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| DATAHUB_ACTIONS_TEAMS_ENABLED | ✅ | Set to "true" to enable the Teams action | +| DATAHUB_ACTIONS_TEAMS_WEBHOOK_URL | ✅ | Set to the incoming webhook url that you configured in the [pre-requisites step](#prerequisites) above | +| DATAHUB_ACTIONS_DATAHUB_BASE_URL | ❌ | Defaults to "http://localhost:9002". Set to the location where your DataHub UI is running. On a local quickstart this is usually "http://localhost:9002", so you shouldn't need to modify this | + +:::note + +You will have to restart the `datahub-actions` docker container after you have exported these environment variables if this is the first time. The simplest way to do it is via the Docker Desktop UI, or by just issuing a `datahub docker quickstart --stop && datahub docker quickstart` command to restart the whole instance. + +::: + +For example: + +```shell +export DATAHUB_ACTIONS_TEAMS_ENABLED=true +export DATAHUB_ACTIONS_TEAMS_WEBHOOK_URL= + +datahub docker quickstart --stop && datahub docker quickstart +``` + +#### k8s / helm + +Similar to the quickstart scenario, there are no specific software installation steps. The `datahub-actions` container comes pre-installed with the Teams action. You just need to export a few environment variables and make them available to the `datahub-actions` container to activate and configure the integration. See below for the list of environment variables to export. + +| Env Variable | Required for Integration | Purpose | +| -------------------------------------- | ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| DATAHUB_ACTIONS_TEAMS_ENABLED | ✅ | Set to "true" to enable the Teams action | +| DATAHUB_ACTIONS_TEAMS_WEBHOOK_URL | ✅ | Set to the incoming webhook url that you configured in the [pre-requisites step](#prerequisites) above | +| DATAHUB_ACTIONS_TEAMS_DATAHUB_BASE_URL | ✅ | Set to the location where your DataHub UI is running. For example, if your DataHub UI is hosted at "https://datahub.my-company.biz", set this to "https://datahub.my-company.biz" | + +#### Bare Metal - CLI or Python-based + +If you are using the `datahub-actions` library directly from Python, or the `datahub-actions` cli directly, then you need to first install the `teams` action plugin in your Python virtualenv. + +``` +pip install "datahub-actions[teams]" +``` + +Then run the action with a configuration file that you have modified to capture your credentials and configuration. + +##### Sample Teams Action Configuration File + +```yml +name: datahub_teams_action +enabled: true +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + topic_routes: + mcl: ${METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME:-MetadataChangeLog_Versioned_v1} + pe: ${PLATFORM_EVENT_TOPIC_NAME:-PlatformEvent_v1} + +## 3a. Optional: Filter to run on events (map) +# filter: +# event_type: +# event: +# # Filter event fields by exact-match +# + +# 3b. 
Optional: Custom Transformers to run on events (array) +# transform: +# - type: +# config: +# # Transformer-specific configs (map) + +action: + type: teams + config: + # Action-specific configs (map) + base_url: ${DATAHUB_ACTIONS_TEAMS_DATAHUB_BASE_URL:-http://localhost:9002} + webhook_url: ${DATAHUB_ACTIONS_TEAMS_WEBHOOK_URL} + suppress_system_activity: ${DATAHUB_ACTIONS_TEAMS_SUPPRESS_SYSTEM_ACTIVITY:-true} + +datahub: + server: "http://${DATAHUB_GMS_HOST:-localhost}:${DATAHUB_GMS_PORT:-8080}" +``` + +##### Teams Action Configuration Parameters + +| Field | Required | Default | Description | +| -------------------------- | -------- | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `base_url` | ❌ | `False` | Whether to print events in upper case. | +| `webhook_url` | ✅ | Set to the incoming webhook url that you configured in the [pre-requisites step](#prerequisites) above | +| `suppress_system_activity` | ❌ | `True` | Set to `False` if you want to get low level system activity events, e.g. when datasets are ingested, etc. Note: this will currently result in a very spammy Teams notifications experience, so this is not recommended to be changed. | + +## Troubleshooting + +If things are configured correctly, you should see logs on the `datahub-actions` container that indicate success in enabling and running the Teams action. + +```shell +docker logs datahub-datahub-actions-1 + +... +[2022-12-04 16:47:44,536] INFO {datahub_actions.cli.actions:76} - DataHub Actions version: unavailable (installed editable via git) +[2022-12-04 16:47:44,565] WARNING {datahub_actions.cli.actions:103} - Skipping pipeline datahub_slack_action as it is not enabled +[2022-12-04 16:47:44,581] INFO {datahub_actions.plugin.action.teams.teams:60} - Teams notification action configured with webhook_url=SecretStr('**********') base_url='http://localhost:9002' suppress_system_activity=True +[2022-12-04 16:47:46,393] INFO {datahub_actions.cli.actions:119} - Action Pipeline with name 'ingestion_executor' is now running. +[2022-12-04 16:47:46,393] INFO {datahub_actions.cli.actions:119} - Action Pipeline with name 'datahub_teams_action' is now running. +... +``` + +If the Teams action was not enabled, you would see messages indicating that. +e.g. the following logs below show that neither the Teams or Slack action were enabled. + +```shell +docker logs datahub-datahub-actions-1 + +.... +No user action configurations found. Not starting user actions. +[2022-12-04 06:45:27,509] INFO {datahub_actions.cli.actions:76} - DataHub Actions version: unavailable (installed editable via git) +[2022-12-04 06:45:27,647] WARNING {datahub_actions.cli.actions:103} - Skipping pipeline datahub_slack_action as it is not enabled +[2022-12-04 06:45:27,649] WARNING {datahub_actions.cli.actions:103} - Skipping pipeline datahub_teams_action as it is not enabled +[2022-12-04 06:45:27,649] INFO {datahub_actions.cli.actions:119} - Action Pipeline with name 'ingestion_executor' is now running. +... 
+ +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/concepts.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/concepts.md new file mode 100644 index 0000000000000..0a1fc4c0f0de4 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/concepts.md @@ -0,0 +1,106 @@ +--- +title: Concepts +slug: /actions/concepts +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/concepts.md +--- + +# DataHub Actions Concepts + +The Actions framework includes pluggable components for filtering, transforming, and reacting to important DataHub, such as + +- Tag Additions / Removals +- Glossary Term Additions / Removals +- Schema Field Additions / Removals +- Owner Additions / Removals + +& more, in real time. + +DataHub Actions comes with open library of freely available Transformers, Actions, Events, and more. + +Finally, the framework is highly configurable & scalable. Notable highlights include: + +- **Distributed Actions**: Ability to scale-out processing for a single action. Support for running the same Action configuration across multiple nodes to load balance the traffic from the event stream. +- **At-least Once Delivery**: Native support for independent processing state for each Action via post-processing acking to achieve at-least once semantics. +- **Robust Error Handling**: Configurable failure policies featuring event-retry, dead letter queue, and failed-event continuation policy to achieve the guarantees required by your organization. + +### Use Cases + +Real-time use cases broadly fall into the following categories: + +- **Notifications**: Generate organization-specific notifications when a change is made on DataHub. For example, send an email to the governance team when a "PII" tag is added to any data asset. +- **Workflow Integration**: Integrate DataHub into your organization's internal workflows. For example, create a Jira ticket when specific Tags or Terms are proposed on a Dataset. +- **Synchronization**: Syncing changes made in DataHub into a 3rd party system. For example, reflecting Tag additions in DataHub into Snowflake. +- **Auditing**: Audit who is making what changes on DataHub through time. + +and more! + +## Concepts + +The Actions Framework consists of a few core concepts-- + +- **Pipelines** +- **Events** and **Event Sources** +- **Transformers** +- **Actions** + +Each of these will be described in detail below. + +

+ +

+ +**In the Actions Framework, Events flow continuously from left-to-right.** + +### Pipelines + +A **Pipeline** is a continuously running process which performs the following functions: + +1. Polls events from a configured Event Source (described below) +2. Applies configured Transformation + Filtering to the Event +3. Executes the configured Action on the resulting Event + +in addition to handling initialization, errors, retries, logging, and more. + +Each Action Configuration file corresponds to a unique Pipeline. In practice, +each Pipeline has its very own Event Source, Transforms, and Actions. This makes it easy to maintain state for mission-critical Actions independently. + +Importantly, each Action must have a unique name. This serves as a stable identifier across Pipeline runs which can be useful in saving the Pipeline's consumer state (i.e. resiliency + reliability). For example, the Kafka Event Source (default) uses the pipeline name as the Kafka Consumer Group id. This enables you to easily scale-out your Actions by running multiple processes with the same exact configuration file. Each will simply become a different consumer in the same consumer group, sharing traffic of the DataHub Events stream. + +### Events + +**Events** are data objects representing changes that have occurred on DataHub. Strictly speaking, the only requirement that the Actions framework imposes is that these objects must be + +a. Convertible to JSON +b. Convertible from JSON + +So that in the event of processing failures, events can be written to and read from a failed events file. + +#### Event Types + +Each Event instance inside the framework corresponds to a single **Event Type**, which is a common name (e.g. "EntityChangeEvent_v1") that can be used to understand the shape of the Event. This can be thought of as a "topic" or "stream" name. That being said, Events associated with a single type are not expected to change in backwards-breaking ways across versions. + +### Event Sources + +Events are produced to the framework by **Event Sources**. Event Sources may include their own guarantees, configurations, behaviors, and semantics. They usually produce a fixed set of Event Types. + +In addition to sourcing events, Event Sources are also responsible for acking the successful processing of an event by implementing the `ack` method. This is invoked by the framework once the Event is guaranteed to have reached the configured Action successfully. + +### Transformers + +**Transformers** are pluggable components which take an Event as input, and produce an Event (or nothing) as output. This can be used to enrich the information of an Event prior to sending it to an Action. + +Multiple Transformers can be configured to run in sequence, filtering and transforming an event in multiple steps. + +Transformers can also be used to generate a completely new type of Event (i.e. registered at runtime via the Event Registry) which can subsequently serve as input to an Action. + +Transformers can be easily customized and plugged in to meet an organization's unique requirements. For more information on developing a Transformer, check out [Developing a Transformer](guides/developing-a-transformer.md) + +### Action + +**Actions** are pluggable components which take an Event as input and perform some business logic. Examples may be sending a Slack notification, logging to a file, +or creating a Jira ticket, etc.
+ +Each Pipeline can be configured to have a single Action which runs after the filtering and transformations have occurred. + +Actions can be easily customized and plugged in to meet an organization's unqique requirements. For more information on developing a Action, check out [Developing a Action](guides/developing-an-action.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/events/entity-change-event.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/entity-change-event.md new file mode 100644 index 0000000000000..3fa382ee660b2 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/entity-change-event.md @@ -0,0 +1,355 @@ +--- +title: Entity Change Event V1 +slug: /actions/events/entity-change-event +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/events/entity-change-event.md +--- + +# Entity Change Event V1 + +## Event Type + +`EntityChangeEvent_v1` + +## Overview + +This Event is emitted when certain changes are made to an entity (dataset, dashboard, chart, etc) on DataHub. + +## Event Structure + +Entity Change Events are generated in a variety of circumstances, but share a common set of fields. + +### Common Fields + +| Name | Type | Description | Optional | +| ---------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- | +| entityUrn | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | False | +| entityType | String | The type of the entity being changed. Supported values include dataset, chart, dashboard, dataFlow (Pipeline), dataJob (Task), domain, tag, glossaryTerm, corpGroup, & corpUser. | False | +| category | String | The category of the change, related to the kind of operation that was performed. Examples include TAG, GLOSSARY_TERM, DOMAIN, LIFECYCLE, and more. | False | +| operation | String | The operation being performed on the entity given the category. For example, ADD ,REMOVE, MODIFY. For the set of valid operations, see the full catalog below. | False | +| modifier | String | The modifier that has been applied to the entity. The value depends on the category. An example includes the URN of a tag being applied to a Dataset or Schema Field. | True | +| parameters | Dict | Additional key-value parameters used to provide specific context. The precise contents depends on the category + operation of the event. See the catalog below for a full summary of the combinations. | True | +| auditStamp.actor | String | The urn of the actor who triggered the change. | False | +| auditStamp.time | Number | The timestamp in milliseconds corresponding to the event. | False | + +In following sections, we will provide sample events for each scenario in which Entity Change Events are fired. + +### Add Tag Event + +This event is emitted when a Tag has been added to an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "ADD", + "modifier": "urn:li:tag:PII", + "parameters": { + "tagUrn": "urn:li:tag:PII" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Tag Event + +This event is emitted when a Tag has been removed from an entity on DataHub. 
+Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "REMOVE", + "modifier": "urn:li:tag:PII", + "parameters": { + "tagUrn": "urn:li:tag:PII" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Glossary Term Event + +This event is emitted when a Glossary Term has been added to an entity on DataHub. +Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "GLOSSARY_TERM", + "operation": "ADD", + "modifier": "urn:li:glossaryTerm:ExampleNode.ExampleTerm", + "parameters": { + "termUrn": "urn:li:glossaryTerm:ExampleNode.ExampleTerm" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Glossary Term Event + +This event is emitted when a Glossary Term has been removed from an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "GLOSSARY_TERM", + "operation": "REMOVE", + "modifier": "urn:li:glossaryTerm:ExampleNode.ExampleTerm", + "parameters": { + "termUrn": "urn:li:glossaryTerm:ExampleNode.ExampleTerm" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Domain Event + +This event is emitted when Domain has been added to an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DOMAIN", + "operation": "ADD", + "modifier": "urn:li:domain:ExampleDomain", + "parameters": { + "domainUrn": "urn:li:domain:ExampleDomain" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Domain Event + +This event is emitted when Domain has been removed from an entity on DataHub. +Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DOMAIN", + "operation": "REMOVE", + "modifier": "urn:li:domain:ExampleDomain", + "parameters": { + "domainUrn": "urn:li:domain:ExampleDomain" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Owner Event + +This event is emitted when a new owner has been assigned to an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "OWNER", + "operation": "ADD", + "modifier": "urn:li:corpuser:jdoe", + "parameters": { + "ownerUrn": "urn:li:corpuser:jdoe", + "ownerType": "BUSINESS_OWNER" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Owner Event + +This event is emitted when an existing owner has been removed from an entity on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "OWNER", + "operation": "REMOVE", + "modifier": "urn:li:corpuser:jdoe", + "parameters": { + "ownerUrn": "urn:li:corpuser:jdoe", + "ownerType": "BUSINESS_OWNER" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Modify Deprecation Event + +This event is emitted when the deprecation status of an entity has been modified on DataHub. 
+ +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DEPRECATION", + "operation": "MODIFY", + "modifier": "DEPRECATED", + "parameters": { + "status": "DEPRECATED" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Dataset Schema Field Event + +This event is emitted when a new field has been added to a Dataset Schema. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TECHNICAL_SCHEMA", + "operation": "ADD", + "modifier": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "parameters": { + "fieldUrn": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "fieldPath": "newFieldName", + "nullable": false + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Dataset Schema Field Event + +This event is emitted when a new field has been remove from a Dataset Schema. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TECHNICAL_SCHEMA", + "operation": "REMOVE", + "modifier": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "parameters": { + "fieldUrn": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "fieldPath": "newFieldName", + "nullable": false + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Create Event + +This event is emitted when a new entity has been created on DataHub. +Header + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "CREATE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Soft-Delete Event + +This event is emitted when a new entity has been soft-deleted on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "SOFT_DELETE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Hard-Delete Event + +This event is emitted when a new entity has been hard-deleted on DataHub. + +#### Sample Event + +```json +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "HARD_DELETE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/events/metadata-change-log-event.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/metadata-change-log-event.md new file mode 100644 index 0000000000000..11db1bdfb4718 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/events/metadata-change-log-event.md @@ -0,0 +1,155 @@ +--- +title: Metadata Change Log Event V1 +slug: /actions/events/metadata-change-log-event +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/events/metadata-change-log-event.md +--- + +# Metadata Change Log Event V1 + +## Event Type + +`MetadataChangeLog_v1` + +## Overview + +This event is emitted when any aspect on DataHub Metadata Graph is changed. This includes creates, updates, and removals of both "versioned" aspects and "time-series" aspects. + +> Disclaimer: This event is quite powerful, but also quite low-level. 
Because it exposes the underlying metadata model directly, it is subject to more frequent structural and semantic changes than the higher level [Entity Change Event](entity-change-event.md). We recommend using that event instead to achieve your use case when possible. + +## Event Structure + +The fields include + +| Name | Type | Description | Optional | +| ------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- | +| entityUrn | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | False | +| entityType | String | The type of the entity being changed. Supported values include dataset, chart, dashboard, dataFlow (Pipeline), dataJob (Task), domain, tag, glossaryTerm, corpGroup, & corpUser. | False | +| entityKeyAspect | Object | The key struct of the entity that was changed. Only present if the Metadata Change Proposal contained the raw key struct. | True | +| changeType | String | The change type. UPSERT or DELETE are currently supported. | False | +| aspectName | String | The entity aspect which was changed. | False | +| aspect | Object | The new aspect value. Null if the aspect was deleted. | True | +| aspect.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json`. | False | +| aspect.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| previousAspectValue | Object | The previous aspect value. Null if the aspect did not exist previously. | True | +| previousAspectValue.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json` | False | +| previousAspectValue.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| systemMetadata | Object | The new system metadata. This includes the the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | +| previousSystemMetadata | Object | The previous system metadata. This includes the the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | +| created | Object | Audit stamp about who triggered the Metadata Change and when. | False | +| created.time | Number | The timestamp in milliseconds when the aspect change occurred. | False | +| created.actor | String | The URN of the actor (e.g. corpuser) that triggered the change. 
| + +### Sample Events + +#### Tag Change Event + +```json +{ + "entityType": "container", + "entityUrn": "urn:li:container:DATABASE", + "entityKeyAspect": null, + "changeType": "UPSERT", + "aspectName": "globalTags", + "aspect": { + "value": "{\"tags\":[{\"tag\":\"urn:li:tag:pii\"}]}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516475595, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousAspectValue": null, + "previousSystemMetadata": null, + "created": { + "time": 1651516475594, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +#### Glossary Term Change Event + +```json +{ + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "entityKeyAspect": null, + "changeType": "UPSERT", + "aspectName": "glossaryTerms", + "aspect": { + "value": "{\"auditStamp\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516599479},\"terms\":[{\"urn\":\"urn:li:glossaryTerm:CustomerAccount\"}]}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516599486, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousAspectValue": null, + "previousSystemMetadata": null, + "created": { + "time": 1651516599480, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +#### Owner Change Event + +```json +{ + "auditHeader": null, + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "entityKeyAspect": null, + "changeType": "UPSERT", + "aspectName": "ownership", + "aspect": { + "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516640493, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousAspectValue": { + "value": "{\"owners\":[{\"owner\":\"urn:li:corpuser:jdoe\",\"type\":\"DATAOWNER\"},{\"owner\":\"urn:li:corpuser:datahub\",\"type\":\"DATAOWNER\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:jdoe\",\"time\":1581407189000}}", + "contentType": "application/json" + }, + "previousSystemMetadata": { + "lastObserved": 1651516415088, + "runId": "file-2022_05_02-11_33_35", + "registryName": null, + "registryVersion": null, + "properties": null + }, + "created": { + "time": 1651516640490, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +## FAQ + +### Where can I find all the aspects and their schemas? + +Great Question! All MetadataChangeLog events are based on the Metadata Model which is comprised of Entities, +Aspects, and Relationships which make up an enterprise Metadata Graph. We recommend checking out the following +resources to learn more about this: + +- [Intro to Metadata Model](/docs/metadata-modeling/metadata-model) + +You can also find a comprehensive list of Entities + Aspects of the Metadata Model under the **Metadata Modeling > Entities** section of the [official DataHub docs](/docs/). 
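+
+As a practical note on consuming these events: the `aspect.value` (and `previousAspectValue.value`) field is a JSON-encoded string rather than a nested object, so it generally needs to be deserialized a second time. Below is a minimal sketch using only the Python standard library, with a trimmed-down version of the Owner Change Event sample above (field values are illustrative):
+
+```python
+import json
+
+# A trimmed-down Metadata Change Log event, modeled on the Owner Change Event sample above.
+mcl_event = {
+    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
+    "aspectName": "ownership",
+    "aspect": {
+        "contentType": "application/json",
+        "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}]}",
+    },
+}
+
+# The aspect payload itself is serialized JSON, so parse it a second time.
+ownership = json.loads(mcl_event["aspect"]["value"])
+for owner in ownership["owners"]:
+    print(owner["owner"], owner["type"])  # -> urn:li:corpuser:datahub DATAOWNER
+```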
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-a-transformer.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-a-transformer.md
new file mode 100644
index 0000000000000..5ee579175a58f
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-a-transformer.md
@@ -0,0 +1,136 @@
+---
+title: Developing a Transformer
+slug: /actions/guides/developing-a-transformer
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/actions/guides/developing-a-transformer.md
+---
+
+# Developing a Transformer
+
+In this guide, we will outline each step to developing a custom Transformer for the DataHub Actions Framework.
+
+## Overview
+
+Developing a DataHub Actions Transformer is a matter of extending the `Transformer` base class in Python, installing your
+Transformer to make it visible to the framework, and then configuring the framework to use the new Transformer.
+
+## Step 1: Defining a Transformer
+
+To implement a Transformer, we'll need to extend the `Transformer` base class and override the following functions:
+
+- `create()` - This function is invoked to instantiate the Transformer, with a free-form configuration dictionary
+  extracted from the Actions configuration file as input.
+- `transform()` - This function is invoked when an Event is received. It should contain the core logic of the Transformer
+  and return the transformed Event, or `None` if the Event should be filtered.
+
+Let's start by defining a new implementation of Transformer called `CustomTransformer`. We'll keep it simple: this Transformer will
+print the configuration that is provided when it is created, and print any Events that it receives.
+
+```python
+# custom_transformer.py
+from datahub_actions.transform.transformer import Transformer
+from datahub_actions.event.event import EventEnvelope
+from datahub_actions.pipeline.pipeline_context import PipelineContext
+from typing import Optional
+
+class CustomTransformer(Transformer):
+    @classmethod
+    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Transformer":
+        # Simply print the config_dict.
+        print(config_dict)
+        return cls(ctx)
+
+    def __init__(self, ctx: PipelineContext):
+        self.ctx = ctx
+
+    def transform(self, event: EventEnvelope) -> Optional[EventEnvelope]:
+        # Simply print the received event.
+        print(event)
+        # And return the original event (no-op).
+        return event
+```
+
+## Step 2: Installing the Transformer
+
+Now that we've defined the Transformer, we need to make it visible to the framework by making
+it available in the Python runtime environment.
+
+The easiest way to do this is to just place it in the same directory as your configuration file, in which case the module name is the same as the file
+name - in this case it will be `custom_transformer`.
+
+### Advanced: Installing as a Package
+
+Alternatively, create a `setup.py` file in the same directory as the new Transformer to convert it into a package that pip can understand.
+
+```python
+from setuptools import find_packages, setup
+
+setup(
+    name="custom_transformer_example",
+    version="1.0",
+    packages=find_packages(),
+    # if you don't already have DataHub Actions installed, add it under install_requires
+    # install_requires=["acryl-datahub-actions"]
+)
+```
+
+Next, install the package from inside the module directory (alternatively, via `python setup.py`):
+
+```shell
+pip install -e .
+```
+
+Once we have done this, our class will be referenceable via `custom_transformer_example.custom_transformer:CustomTransformer`.
+
+## Step 3: Running the Action
+
+Now that we've defined our Transformer, we can create an Action configuration file that refers to the new Transformer.
+We will need to provide the fully-qualified Python module & class name when doing so.
+
+_Example Configuration_
+
+```yaml
+# custom_transformer_action.yaml
+name: "custom_transformer_test"
+source:
+  type: "kafka"
+  config:
+    connection:
+      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
+      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
+transform:
+  - type: "custom_transformer_example.custom_transformer:CustomTransformer"
+    config:
+      # Some sample configuration which should be printed on create.
+      config1: value1
+action:
+  # Simply reuse the default hello_world action
+  type: "hello_world"
+```
+
+Next, run the `datahub actions` command as usual:
+
+```shell
+datahub actions -c custom_transformer_action.yaml
+```
+
+If all is well, your Transformer should now be receiving & printing Events.
+
+### (Optional) Step 4: Contributing the Transformer
+
+If your Transformer is generally applicable, you can raise a PR to include it in the core Transformer library
+provided by DataHub. All Transformers will live under the `datahub_actions/plugin/transform` directory inside the
+[datahub-actions](https://github.com/acryldata/datahub-actions) repository.
+
+Once you've added your new Transformer there, make sure that you make it discoverable by updating the `entry_points` section
+of the `setup.py` file. This allows you to assign a globally unique name to your Transformer, so that people can use
+it without defining the full module path.
+
+#### Prerequisites:
+
+Prerequisites for inclusion in the core Transformer library include
+
+- **Testing** Define unit tests for your Transformer
+- **Deduplication** Confirm that no existing Transformer serves the same purpose, or can be easily extended to serve the same purpose
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-an-action.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-an-action.md
new file mode 100644
index 0000000000000..2a392a696b0fa
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/guides/developing-an-action.md
@@ -0,0 +1,135 @@
+---
+title: Developing an Action
+slug: /actions/guides/developing-an-action
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/actions/guides/developing-an-action.md
+---
+
+# Developing an Action
+
+In this guide, we will outline each step to developing an Action for the DataHub Actions Framework.
+
+## Overview
+
+Developing a DataHub Action is a matter of extending the `Action` base class in Python, installing your
+Action to make it visible to the framework, and then configuring the framework to use the new Action.
+
+## Step 1: Defining an Action
+
+To implement an Action, we'll need to extend the `Action` base class and override the following functions:
+
+- `create()` - This function is invoked to instantiate the action, with a free-form configuration dictionary
+  extracted from the Actions configuration file as input.
+- `act()` - This function is invoked when an Event is received. It should contain the core logic of the Action.
+- `close()` - This function is invoked when the framework has issued a shutdown of the pipeline.
It should be used + to cleanup any processes happening inside the Action. + +Let's start by defining a new implementation of Action called `CustomAction`. We'll keep it simple-- this Action will +print the configuration that is provided when it is created, and print any Events that it receives. + +```python +# custom_action.py +from datahub_actions.action.action import Action +from datahub_actions.event.event_envelope import EventEnvelope +from datahub_actions.pipeline.pipeline_context import PipelineContext + +class CustomAction(Action): + @classmethod + def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action": + # Simply print the config_dict. + print(config_dict) + return cls(ctx) + + def __init__(self, ctx: PipelineContext): + self.ctx = ctx + + def act(self, event: EventEnvelope) -> None: + # Do something super important. + # For now, just print. :) + print(event) + + def close(self) -> None: + pass +``` + +## Step 2: Installing the Action + +Now that we've defined the Action, we need to make it visible to the framework by making it +available in the Python runtime environment. + +The easiest way to do this is to just place it in the same directory as your configuration file, in which case the module name is the same as the file +name - in this case it will be `custom_action`. + +### Advanced: Installing as a Package + +Alternatively, create a `setup.py` file in the same directory as the new Action to convert it into a package that pip can understand. + +``` +from setuptools import find_packages, setup + +setup( + name="custom_action_example", + version="1.0", + packages=find_packages(), + # if you don't already have DataHub Actions installed, add it under install_requires + # install_requires=["acryl-datahub-actions"] +) +``` + +Next, install the package + +```shell +pip install -e . +``` + +inside the module. (alt.`python setup.py`). + +Once we have done this, our class will be referencable via `custom_action_example.custom_action:CustomAction`. + +## Step 3: Running the Action + +Now that we've defined our Action, we can create an Action configuration file that refers to the new Action. +We will need to provide the fully-qualified Python module & class name when doing so. + +_Example Configuration_ + +```yaml +# custom_action.yaml +name: "custom_action_test" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +action: + type: "custom_action_example.custom_action:CustomAction" + config: + # Some sample configuration which should be printed on create. + config1: value1 +``` + +Next, run the `datahub actions` command as usual: + +```shell +datahub actions -c custom_action.yaml +``` + +If all is well, your Action should now be receiving & printing Events. + +## (Optional) Step 4: Contributing the Action + +If your Action is generally applicable, you can raise a PR to include it in the core Action library +provided by DataHub. All Actions will live under the `datahub_actions/plugin/action` directory inside the +[datahub-actions](https://github.com/acryldata/datahub-actions) repository. + +Once you've added your new Action there, make sure that you make it discoverable by updating the `entry_points` section +of the `setup.py` file. This allows you to assign a globally unique name for you Action, so that people can use +it without defining the full module path. 
+ +### Prerequisites: + +Prerequisites to consideration for inclusion in the core Actions library include + +- **Testing** Define unit tests for your Action +- **Deduplication** Confirm that no existing Action serves the same purpose, or can be easily extended to serve the same purpose diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/quickstart.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/quickstart.md new file mode 100644 index 0000000000000..2a7b563f60157 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/quickstart.md @@ -0,0 +1,176 @@ +--- +title: Quickstart +slug: /actions/quickstart +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/quickstart.md +--- + +# DataHub Actions Quickstart + +## Prerequisites + +The DataHub Actions CLI commands are an extension of the base `datahub` CLI commands. We recommend +first installing the `datahub` CLI: + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +datahub --version +``` + +> Note that the Actions Framework requires a version of `acryl-datahub` >= v0.8.34 + +## Installation + +To install DataHub Actions, you need to install the `acryl-datahub-actions` package from PyPi + +```shell +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub-actions + +# Verify the installation by checking the version. +datahub actions version +``` + +### Hello World + +DataHub ships with a "Hello World" Action which logs all events it receives to the console. +To run this action, simply create a new Action configuration file: + +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +action: + type: "hello_world" +``` + +and then run it using the `datahub actions` command: + +```shell +datahub actions -c hello_world.yaml +``` + +You should the see the following output if the Action has been started successfully: + +```shell +Action Pipeline with name 'hello_world' is now running. +``` + +Now, navigate to the instance of DataHub that you've connected to and perform an Action such as + +- Adding / removing a Tag +- Adding / removing a Glossary Term +- Adding / removing a Domain + +If all is well, you should see some events being logged to the console + +```shell +Hello world! Received event: +{ + "event_type": "EntityChangeEvent_v1", + "event": { + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "category": "TAG", + "operation": "ADD", + "modifier": "urn:li:tag:pii", + "parameters": {}, + "auditStamp": { + "time": 1651082697703, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + }, + "version": 0, + "source": null + }, + "meta": { + "kafka": { + "topic": "PlatformEvent_v1", + "offset": 1262, + "partition": 0 + } + } +} +``` + +_An example of an event emitted when a 'pii' tag has been added to a Dataset._ + +Woohoo! You've successfully started using the Actions framework. Now, let's see how we can get fancy. + +#### Filtering events + +If we know which Event types we'd like to consume, we can optionally add a `filter` configuration, which +will prevent events that do not match the filter from being forwarded to the action. 
+ +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +filter: + event_type: "EntityChangeEvent_v1" +action: + type: "hello_world" +``` + +_Filtering for events of type EntityChangeEvent_v1 only_ + +#### Advanced Filtering + +Beyond simply filtering by event type, we can also filter events by matching against the values of their fields. To do so, +use the `event` block. Each field provided will be compared against the real event's value. An event that matches +**all** of the fields will be forwarded to the action. + +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +filter: + event_type: "EntityChangeEvent_v1" + event: + category: "TAG" + operation: "ADD" + modifier: "urn:li:tag:pii" +action: + type: "hello_world" +``` + +_This filter only matches events representing "PII" tag additions to an entity._ + +And more, we can achieve "OR" semantics on a particular field by providing an array of values. + +```yaml +# hello_world.yaml +name: "hello_world" +source: + type: "kafka" + config: + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} +filter: + event_type: "EntityChangeEvent_v1" + event: + category: "TAG" + operation: ["ADD", "REMOVE"] + modifier: "urn:li:tag:pii" +action: + type: "hello_world" +``` + +_This filter only matches events representing "PII" tag additions to OR removals from an entity. How fancy!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/actions/sources/kafka-event-source.md b/docs-website/versioned_docs/version-0.10.4/docs/actions/sources/kafka-event-source.md new file mode 100644 index 0000000000000..3584ced1d61c6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/actions/sources/kafka-event-source.md @@ -0,0 +1,97 @@ +--- +title: Kafka Event Source +slug: /actions/sources/kafka-event-source +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/actions/sources/kafka-event-source.md +--- + +# Kafka Event Source + +## Overview + +The Kafka Event Source is the default Event Source used within the DataHub Actions Framework. + +Under the hood, the Kafka Event Source uses a Kafka Consumer to subscribe to the topics streaming +out of DataHub (MetadataChangeLog_v1, PlatformEvent_v1). Each Action is automatically placed into a unique +[consumer group](https://docs.confluent.io/platform/current/clients/consumer.html#consumer-groups) based on +the unique `name` provided inside the Action configuration file. + +This means that you can easily scale-out Actions processing by sharing the same Action configuration file across +multiple nodes or processes. As long as the `name` of the Action is the same, each instance of the Actions framework will subscribe as a member in the same Kafka Consumer Group, which allows for load balancing the +topic traffic across consumers which each consume independent [partitions](https://developer.confluent.io/learn-kafka/apache-kafka/partitions/#kafka-partitioning). + +Because the Kafka Event Source uses consumer groups by default, actions using this source will be **stateful**. 
+This means that Actions will keep track of their processing offsets of the upstream Kafka topics. If you +stop an Action and restart it sometime later, it will first "catch up" by processing the messages that the topic +has received since the Action last ran. Be mindful of this - if your Action is computationally expensive, it may be preferable to start consuming from the end of the log, instead of playing catch up. The easiest way to achieve this is to simply rename the Action inside the Action configuration file - this will create a new Kafka Consumer Group which will begin processing new messages at the end of the log (latest policy). + +### Processing Guarantees + +This event source implements an "ack" function which is invoked if and only if an event is successfully processed +by the Actions framework, meaning that the event made it through the Transformers and into the Action without +any errors. Under the hood, the "ack" method synchronously commits Kafka Consumer Offsets on behalf of the Action. This means that by default, the framework provides _at-least once_ processing semantics. That is, in the unusual case that a failure occurs when attempting to commit offsets back to Kafka, that event may be replayed on restart of the Action. + +If you've configured your Action pipeline `failure_mode` to be `CONTINUE` (the default), then events which +fail to be processed will simply be logged to a `failed_events.log` file for further investigation (dead letter queue). The Kafka Event Source will continue to make progress against the underlying topics and continue to commit offsets even in the case of failed messages. + +If you've configured your Action pipeline `failure_mode` to be `THROW`, then events which fail to be processed result in an Action Pipeline error. This in turn terminates the pipeline before committing offsets back to Kafka. Thus the message will not be marked as "processed" by the Action consumer. + +## Supported Events + +The Kafka Event Source produces + +- [Entity Change Event V1](../events/entity-change-event.md) +- [Metadata Change Log V1](../events/metadata-change-log-event.md) + +## Configure the Event Source + +Use the following config(s) to get started with the Kafka Event Source. + +```yml +name: "pipeline-name" +source: + type: "kafka" + config: + # Connection-related configuration + connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + # Dictionary of freeform consumer configs propagated to underlying Kafka Consumer + consumer_config: + #security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + #ssl.keystore.location: ${KAFKA_PROPERTIES_SSL_KEYSTORE_LOCATION:-/mnt/certs/keystore} + #ssl.truststore.location: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_LOCATION:-/mnt/certs/truststore} + #ssl.keystore.password: ${KAFKA_PROPERTIES_SSL_KEYSTORE_PASSWORD:-keystore_password} + #ssl.key.password: ${KAFKA_PROPERTIES_SSL_KEY_PASSWORD:-keystore_password} + #ssl.truststore.password: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_PASSWORD:-truststore_password} + # Topic Routing - which topics to read from. + topic_routes: + mcl: ${METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME:-MetadataChangeLog_Versioned_v1} # Topic name for MetadataChangeLog_v1 events. + pe: ${PLATFORM_EVENT_TOPIC_NAME:-PlatformEvent_v1} # Topic name for PlatformEvent_v1 events. +action: + # action configs +``` + +
+ View All Configuration Options + + | Field | Required | Default | Description | + | --- | :-: | :-: | --- | + | `connection.bootstrap` | ✅ | N/A | The Kafka bootstrap URI, e.g. `localhost:9092`. | + | `connection.schema_registry_url` | ✅ | N/A | The URL for the Kafka schema registry, e.g. `http://localhost:8081` | + | `connection.consumer_config` | ❌ | {} | A set of key-value pairs that represents arbitrary Kafka Consumer configs | + | `topic_routes.mcl` | ❌ | `MetadataChangeLog_v1` | The name of the topic containing MetadataChangeLog events | + | `topic_routes.pe` | ❌ | `PlatformEvent_v1` | The name of the topic containing PlatformEvent events | +
+ +## FAQ + +1. Is there a way to always start processing from the end of the topics on Actions start? + +Currently, the only way is to change the `name` of the Action in its configuration file. In the future, +we are hoping to add first-class support for configuring the action to be "stateless", ie only process +messages that are received while the Action is running. + +2. Is there a way to asynchronously commit offsets back to Kafka? + +Currently, all consumer offset commits are made synchronously for each message received. For now we've optimized for correctness over performance. If this commit policy does not accommodate your organization's needs, certainly reach out on [Slack](https://slack.datahubproject.io/). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/aspect-versioning.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/aspect-versioning.md new file mode 100644 index 0000000000000..22861e3cf9fda --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/aspect-versioning.md @@ -0,0 +1,64 @@ +--- +title: Aspect Versioning +slug: /advanced/aspect-versioning +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/aspect-versioning.md +--- + +# Aspect Versioning + +As each version of [metadata aspect](../what/aspect.md) is immutable, any update to an existing aspect results in the creation of a new version. Typically one would expect the version number increases sequentially with the largest version number being the latest version, i.e. `v1` (oldest), `v2` (second oldest), ..., `vN` (latest). However, this approach results in major challenges in both rest.li modeling & transaction isolation and therefore requires a rethinking. + +## Rest.li Modeling + +As it's common to create dedicated rest.li sub-resources for a specific aspect, e.g. `/datasets/{datasetKey}/ownership`, the concept of versions become an interesting modeling question. Should the sub-resource be a [Simple](https://linkedin.github.io/rest.li/modeling/modeling#simple) or a [Collection](https://linkedin.github.io/rest.li/modeling/modeling#collection) type? + +If Simple, the [GET](https://linkedin.github.io/rest.li/user_guide/restli_server#get) method is expected to return the latest version, and the only way to retrieve non-latest versions is through a custom [ACTION](https://linkedin.github.io/rest.li/user_guide/restli_server#action) method, which is going against the [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) principle. As a result, a Simple sub-resource doesn't seem to a be a good fit. + +If Collection, the version number naturally becomes the key so it's easy to retrieve specific version number using the typical GET method. It's also easy to list all versions using the standard [GET_ALL](https://linkedin.github.io/rest.li/user_guide/restli_server#get_all) method or get a set of versions via [BATCH_GET](https://linkedin.github.io/rest.li/user_guide/restli_server#batch_get). However, Collection resources don't support a simple way to get the latest/largest key directly. To achieve that, one must do one of the following + +- a GET_ALL (assuming descending key order) with a page size of 1 +- a [FINDER](https://linkedin.github.io/rest.li/user_guide/restli_server#finder) with special parameters and a page size of 1 +- a custom ACTION method again + +None of these options seems like a natural way to ask for the latest version of an aspect, which is one of the most common use cases. 
+
+## Transaction Isolation
+
+Transaction isolation is a complex topic, so make sure to familiarize yourself with the basics first.
+
+To support concurrent updates of a metadata aspect, the following pseudo DB operations must be run in a single transaction:
+
+```
+1. Retrieve the current max version (Vmax)
+2. Write the new value as (Vmax + 1)
+```
+
+Operation 1 above can easily suffer from Phantom Reads. This subsequently leads to Operation 2 computing the incorrect version and thus overwriting an existing version instead of creating a new one.
+
+One way to solve this is by enforcing the Serializable isolation level in the DB, at the [cost of performance](https://logicalread.com/optimize-mysql-perf-part-2-mc13/#.XjxSRSlKh1N). In reality, very few DBs even support this level of isolation, especially distributed document stores. It's more common to support Repeatable Reads or Read Committed isolation levels, and sadly neither would help in this case.
+
+Another possible solution is to transactionally keep track of `Vmax` directly in a separate table, to avoid the need to compute it through a `select` (thus preventing Phantom Reads). However, cross-table/document/entity transactions are not a feature supported by all distributed document stores, which precludes this as a generalized solution.
+
+## Solution: Version 0
+
+The solution to both challenges turns out to be surprisingly simple. Instead of using a "floating" version number to represent the latest version, one can use a "fixed/sentinel" version number instead. In this case we choose Version 0, as we want all non-latest versions to still keep increasing sequentially. In other words, it'd be `v0` (latest), `v1` (oldest), `v2` (second oldest), etc. Alternatively, you can simply view all the non-zero versions as an audit trail.
+
+Let's examine how Version 0 can solve the aforementioned challenges.
+
+### Rest.li Modeling
+
+With Version 0, getting the latest version becomes calling the GET method of a Collection aspect-specific sub-resource with a deterministic key, e.g. `/datasets/{datasetkey}/ownership/0`, which is a lot more natural than using GET_ALL or FINDER.
+
+### Transaction Isolation
+
+With Version 0, the pseudo DB operations change to the following transaction block:
+
+```
+1. Retrieve v0 of the aspect
+2. Retrieve the current max version (Vmax)
+3. Write the old value back as (Vmax + 1)
+4. Write the new value back as v0
+```
+
+While Operation 2 still suffers from potential Phantom Reads, and could thus corrupt an existing version in Operation 3, the Repeatable Reads isolation level will ensure that the transaction fails due to the [Lost Update](https://codingsight.com/the-lost-update-problem-in-concurrent-transactions/) detected in Operation 4. Note that this happens to also be the [default isolation level](https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html) for InnoDB in MySQL.
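+
+To make the transaction block above concrete, here is a toy sketch of the Version 0 write path. It uses an in-memory SQLite table purely for illustration (DataHub's actual storage schema and data-access layer differ), and it only demonstrates the read/write sequence, not the isolation-level behavior discussed above.
+
+```python
+import sqlite3
+
+# Hypothetical, simplified aspect table used only to illustrate the Version 0 scheme.
+conn = sqlite3.connect(":memory:")
+conn.execute(
+    "CREATE TABLE aspect (urn TEXT, name TEXT, version INTEGER, value TEXT, "
+    "PRIMARY KEY (urn, name, version))"
+)
+
+
+def write_aspect(urn: str, name: str, new_value: str) -> None:
+    with conn:  # one transaction per write
+        # 1. Retrieve v0 (the latest value) of the aspect, if it exists.
+        row = conn.execute(
+            "SELECT value FROM aspect WHERE urn=? AND name=? AND version=0",
+            (urn, name),
+        ).fetchone()
+        if row is None:
+            conn.execute("INSERT INTO aspect VALUES (?, ?, 0, ?)", (urn, name, new_value))
+            return
+        # 2. Retrieve the current max version (Vmax).
+        (vmax,) = conn.execute(
+            "SELECT MAX(version) FROM aspect WHERE urn=? AND name=?", (urn, name)
+        ).fetchone()
+        # 3. Write the old value back as (Vmax + 1).
+        conn.execute("INSERT INTO aspect VALUES (?, ?, ?, ?)", (urn, name, vmax + 1, row[0]))
+        # 4. Write the new value back as v0.
+        conn.execute(
+            "UPDATE aspect SET value=? WHERE urn=? AND name=? AND version=0",
+            (new_value, urn, name),
+        )
+
+
+write_aspect("urn:li:dataset:abc", "ownership", '{"owners": []}')
+write_aspect("urn:li:dataset:abc", "ownership", '{"owners": ["urn:li:corpuser:jdoe"]}')
+print(conn.execute("SELECT version, value FROM aspect ORDER BY version").fetchall())
+# -> [(0, '{"owners": ["urn:li:corpuser:jdoe"]}'), (1, '{"owners": []}')]
+```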
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/backfilling.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/backfilling.md new file mode 100644 index 0000000000000..b732d39a389f6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/backfilling.md @@ -0,0 +1,10 @@ +--- +title: Backfilling Search Index & Graph DB +slug: /advanced/backfilling +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/backfilling.md +--- + +# Backfilling Search Index & Graph DB + +WIP diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/browse-paths-upgrade.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/browse-paths-upgrade.md new file mode 100644 index 0000000000000..e9ce62353e540 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/browse-paths-upgrade.md @@ -0,0 +1,144 @@ +--- +title: Browse Paths Upgrade (August 2022) +slug: /advanced/browse-paths-upgrade +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/browse-paths-upgrade.md +--- + +# Browse Paths Upgrade (August 2022) + +## Background + +Up to this point, there's been a historical constraint on all entity browse paths. Namely, each browse path has been +required to end with a path component that represents "simple name" for an entity. For example, a Browse Path for a +Snowflake Table called "test_table" may look something like this: + +``` +/prod/snowflake/warehouse1/db1/test_table +``` + +In the UI, we artificially truncate the final path component when you are browsing the Entity hierarchy, so your browse experience +would be: + +`prod` > `snowflake` > `warehouse1`> `db1` > `Click Entity` + +As you can see, the final path component `test_table` is effectively ignored. It could have any value, and we would still ignore +it in the UI. This behavior serves as a workaround to the historical requirement that all browse paths end with a simple name. + +This data constraint stands in opposition the original intention of Browse Paths: to provide a simple mechanism for organizing +assets into a hierarchical folder structure. For this reason, we've changed the semantics of Browse Paths to better align with the original intention. +Going forward, you will not be required to provide a final component detailing the "name". Instead, you will be able to provide a simpler path that +omits this final component: + +``` +/prod/snowflake/warehouse1/db1 +``` + +and the browse experience from the UI will continue to work as you would expect: + +`prod` > `snowflake` > `warehouse1`> `db1` > `Click Entity`. + +With this change comes a fix to a longstanding bug where multiple browse paths could not be attached to a single URN. Going forward, +we will support producing multiple browse paths for the same entity, and allow you to traverse via multiple paths. 
For example:
+
+```python
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.metadata.schema_classes import BrowsePathsClass
+
+browse_path = BrowsePathsClass(
+    paths=["/powerbi/my/custom/path", "/my/other/custom/path"]
+)
+mcp = MetadataChangeProposalWrapper(
+    entityType="dataset",
+    changeType="UPSERT",
+    entityUrn="urn:li:dataset:(urn:li:dataPlatform:custom,MyFileName,PROD)",
+    aspectName="browsePaths",
+    aspect=browse_path,
+)
+```
+
+_Using the Python Emitter SDK to produce multiple Browse Paths for the same entity_
+
+We've received multiple bug reports, such as [this issue](https://github.com/datahub-project/datahub/issues/5525), and requests to address these issues with Browse, and thus are deciding
+to do it now before more workarounds are created.
+
+## What this means for you
+
+Once you upgrade to DataHub `v0.8.45`, you will immediately notice that traversing your Browse Path hierarchy will require
+one extra click to find the entity. This is because we are correctly displaying the FULL browse path, including the simple name mentioned above.
+
+There will be two ways to upgrade to the new browse path format. Depending on your ingestion sources, you may want to use one or both:
+
+1. Migrate default browse paths to the new format by restarting DataHub
+2. Upgrade your version of the `datahub` CLI to push the new browse path format (version `v0.8.45`)
+
+Each step will be discussed in detail below.
+
+### 1. Migrating default browse paths to the new format
+
+To migrate those Browse Paths that are generated by DataHub by default (when no path is provided), simply restart the `datahub-gms` container / pod with a single
+additional environment variable:
+
+```
+UPGRADE_DEFAULT_BROWSE_PATHS_ENABLED=true
+```
+
+This will cause GMS to perform a boot-time migration of all your existing Browse Paths
+to the new format, removing the unnecessary name component at the very end.
+
+If the migration is successful, you'll see the following in your GMS logs:
+
+```
+18:58:17.414 [main] INFO c.l.m.b.s.UpgradeDefaultBrowsePathsStep:60 - Successfully upgraded all browse paths!
+```
+
+After this one-time migration is complete, you should be able to navigate the Browse hierarchy exactly as you did previously.
+
+> Note that certain ingestion sources actively produce their own Browse Paths, which overrides the default path
+> computed by DataHub.
+>
+> In these cases, getting the updated Browse Path will require re-running your ingestion process with the updated
+> version of the connector. This is discussed in more detail in the next section.
+
+### 2. Upgrading the `datahub` CLI to push new browse paths
+
+If you are actively ingesting metadata from one or more of the following sources:
+
+1. Sagemaker
+2. Looker / LookML
+3. Feast
+4. Kafka
+5. Mode
+6. PowerBi
+7. Pulsar
+8. Tableau
+9. Business Glossary
+
+You will need to upgrade the DataHub CLI to >= `v0.8.45` and re-run metadata ingestion. This will generate the new browse path format
+and overwrite the existing paths for entities that were extracted from these sources.
+
+### If you are producing custom Browse Paths
+
+If you've decided to produce your own custom Browse Paths to organize your assets (e.g. via the Python Emitter SDK), you'll want to change the code to produce those paths
+to truncate the final path component.
For example, if you were previously emitting a browse path like this: + +``` +"my/custom/browse/path/suffix" +``` + +You can simply remove the final "suffix" piece: + +``` +"my/custom/browse/path" +``` + +Your users will be able to find the entity by traversing through these folders in the UI: + +`my` > `custom` > `browse`> `path` > `Click Entity`. + +> Note that if you are using the Browse Path Transformer you _will_ be impacted in the same way. It is recommended that you revisit the +> paths that you are producing, and update them to the new format. + +## Support + +The Acryl team will be on standby to assist you in your migration. Please +join [#release-0_8_0](https://datahubspace.slack.com/archives/C0244FHMHJQ) channel and reach out to us if you find +trouble with the upgrade or have feedback on the process. We will work closely to make sure you can continue to operate +DataHub smoothly. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/db-retention.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/db-retention.md new file mode 100644 index 0000000000000..f8c242b9e03a3 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/db-retention.md @@ -0,0 +1,89 @@ +--- +title: Configuring Database Retention +slug: /advanced/db-retention +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/db-retention.md +--- + +# Configuring Database Retention + +## Goal + +DataHub stores different versions of [metadata aspects](/docs/what/aspect) as they are ingested +using a database (or key-value store). These multiple versions allow us to look at an aspect's historical changes and +rollback to a previous version if incorrect metadata is ingested. However, every stored version takes additional storage +space, while possibly bringing less value to the system. We need to be able to impose a **retention** policy on these +records to keep the size of the DB in check. + +Goal of the retention system is to be able to **configure and enforce retention policies** on documents at each of these +various levels: + +- global +- entity-level +- aspect-level + +## What type of retention policies are supported? + +We support 3 types of retention policies for aspects: + +| Policy | Versions Kept | +| :-----------: | :-----------------------------------: | +| Indefinite | All versions | +| Version-based | Latest _N_ versions | +| Time-based | Versions ingested in last _N_ seconds | + +**Note:** The latest version (version 0) is never deleted. This ensures core functionality of DataHub is not impacted while applying retention. + +## When is the retention policy applied? + +As of now, retention policies are applied in two places: + +1. **GMS boot-up**: A bootstrap step ingests the predefined set of retention policies. If no policy existed before or the existing policy + was updated, an asynchronous call will be triggered. It will apply the retention policy (or policies) to **all** records in the database. +2. **Ingest**: On every ingest, if an existing aspect got updated, it applies the retention policy to the urn-aspect pair being ingested. + +We are planning to support a cron-based application of retention in the near future to ensure that the time-based retention is applied correctly. + +## How to configure? + +For the initial iteration, we have made this feature opt-in. Please set **ENTITY_SERVICE_ENABLE_RETENTION=true** when +creating the datahub-gms container/k8s pod. 
+ +On GMS start up, retention policies are initialized with: + +1. First, the default provided **version-based** retention to keep **20 latest aspects** for all entity-aspect pairs. +2. Second, we read YAML files from the `/etc/datahub/plugins/retention` directory and overlay them on the default set of policies we provide. + +For docker, we set docker-compose to mount `${HOME}/.datahub` directory to `/etc/datahub` directory +within the containers, so you can customize the initial set of retention policies by creating +a `${HOME}/.datahub/plugins/retention/retention.yaml` file. + +We will support a standardized way to do this in Kubernetes setup in the near future. + +The format for the YAML file is as follows: + +```yaml +- entity: "*" # denotes that policy will be applied to all entities + aspect: "*" # denotes that policy will be applied to all aspects + config: + retention: + version: + maxVersions: 20 +- entity: "dataset" + aspect: "datasetProperties" + config: + retention: + version: + maxVersions: 20 + time: + maxAgeInSeconds: 2592000 # 30 days +``` + +Note, it searches for the policies corresponding to the entity, aspect pair in the following order: + +1. entity, aspect +2. \*, aspect +3. entity, \* +4. _, _ + +By restarting datahub-gms after creating the plugin yaml file, the new set of retention policies will be applied. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/derived-aspects.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/derived-aspects.md new file mode 100644 index 0000000000000..6eefe87564651 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/derived-aspects.md @@ -0,0 +1,10 @@ +--- +title: Derived Aspects +slug: /advanced/derived-aspects +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/derived-aspects.md +--- + +# Derived Aspects + +WIP diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/entity-hierarchy.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/entity-hierarchy.md new file mode 100644 index 0000000000000..61fda719bfd31 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/entity-hierarchy.md @@ -0,0 +1,10 @@ +--- +title: Entity Hierarchy +slug: /advanced/entity-hierarchy +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/entity-hierarchy.md +--- + +# Entity Hierarchy + +WIP diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/field-path-spec-v2.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/field-path-spec-v2.md new file mode 100644 index 0000000000000..d61a76f1da53a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/field-path-spec-v2.md @@ -0,0 +1,373 @@ +--- +title: SchemaFieldPath Specification (Version 2) +slug: /advanced/field-path-spec-v2 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/field-path-spec-v2.md +--- + +# SchemaFieldPath Specification (Version 2) + +This document outlines the formal specification for the fieldPath member of +the [SchemaField](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/SchemaField.pdl) +model. This specification (version 2) takes into account the unique requirements of supporting a wide variety of nested +types, unions and optional fields and is a substantial improvement over the current implementation (version 1). 
+ +## Requirements + +The `fieldPath` field is currently used by datahub for not just rendering the schema fields in the UI, but also as a +primary identifier of a field in other places such +as [EditableSchemaFieldInfo](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/EditableSchemaFieldInfo.pdl#L12), +usage stats and data profiles. Therefore, it must satisfy the following requirements. + +- must be unique across all fields within a schema. +- make schema navigation in the UI more intuitive. +- allow for identifying the type of schema the field is part of, such as a `key-schema` or a `value-schema`. +- allow for future-evolution + +## Existing Convention(v1) + +The existing convention is to simply use the field's name as the `fieldPath` for simple fields, and use the `dot` +delimited names for nested fields. This scheme does not satisfy the [requirements](#requirements) stated above. The +following example illustrates where the `uniqueness` requirement is not satisfied. + +### Example: Ambiguous field path + +Consider the following `Avro` schema which is a `union` of two record types `A` and `B`, each having a simple field with +the same name `f` that is of type `string`. The v1 naming scheme cannot differentiate if a `fieldPath=f` is referring to +the record type `A` or `B`. + +``` +[ + { + "type": "record", + "name": "A", + "fields": [{ "name": "f", "type": "string" } ] + }, { + "type": "record", + "name": "B", + "fields": [{ "name": "f", "type": "string" } ] + } +] +``` + +## The FieldPath encoding scheme(v2) + +The syntax for V2 encoding of the `fieldPath` is captured in the following grammar. The `FieldPathSpec` is essentially +the type annotated path of the member, with each token along the path representing one level of nested member, +starting from the most-enclosing type, leading up to the member. In the case of `unions` that have `one-of` semantics, +the corresponding field will be emitted once for each `member` of the union as its `type`, along with one path +corresponding to the `union` itself. + +### Formal Spec: + +``` + := .. // when part of a key-schema + | . // when part of a value schema + := [version=] // [version=2.0] for v2 + := [key=True] // when part of a key schema + := + // this is the type prefixed path field (nested if repeats). + := . // type prefixed path of a field. + := . | + := [type=] + := [type=] + := | union | array | map + := int | float | double | string | fixed | enum +``` + +For the [example above](#example-ambiguous-field-path), this encoding would produce the following 2 unique paths +corresponding to the `A.f` and `B.f` fields. + +```python +unique_v2_field_paths = [ + "[version=2.0].[type=union].[type=A].[type=string].f", + "[version=2.0].[type=union].[type=B].[type=string].f" +] +``` + +NOTE: + +- this encoding always ensures uniqueness within a schema since the full type annotation leading to a field is encoded + in the fieldPath itself. +- processing a fieldPath, such as from UI, gets simplified simply by walking each token along the path from + left-to-right. +- adding PartOfKeySchemaToken allows for identifying if the field is part of key-schema. +- adding VersionToken allows for future evolvability. +- to represent `optional` fields, which sometimes are modeled as `unions` in formats like `Avro`, instead of treating it + as a `union` member, set the `nullable` member of `SchemaField` to `True`. 
+ +## Examples + +### Primitive types + +```python +avro_schema = """ +{ + "type": "string" +} +""" +unique_v2_field_paths = [ + "[version=2.0].[type=string]" +] +``` + +### Records + +**Simple Record** + +```python +avro_schema = """ +{ + "type": "record", + "name": "some.event.E", + "namespace": "some.event.N", + "doc": "this is the event record E" + "fields": [ + { + "name": "a", + "type": "string", + "doc": "this is string field a of E" + }, + { + "name": "b", + "type": "string", + "doc": "this is string field b of E" + } + ] +} +""" + +unique_v2_field_paths = [ + "[version=2.0].[type=E].[type=string].a", + "[version=2.0].[type=E].[type=string].b", +] +``` + +**Nested Record** + +```python +avro_schema = """ +{ + "type": "record", + "name": "SimpleNested", + "namespace": "com.linkedin", + "fields": [{ + "name": "nestedRcd", + "type": { + "type": "record", + "name": "InnerRcd", + "fields": [{ + "name": "aStringField", + "type": "string" + } ] + } + }] +} +""" + +unique_v2_field_paths = [ + "[version=2.0].[key=True].[type=SimpleNested].[type=InnerRcd].nestedRcd", + "[version=2.0].[key=True].[type=SimpleNested].[type=InnerRcd].nestedRcd.[type=string].aStringField", +] +``` + +**Recursive Record** + +```python +avro_schema = """ +{ + "type": "record", + "name": "Recursive", + "namespace": "com.linkedin", + "fields": [{ + "name": "r", + "type": { + "type": "record", + "name": "R", + "fields": [ + { "name" : "anIntegerField", "type" : "int" }, + { "name": "aRecursiveField", "type": "com.linkedin.R"} + ] + } + }] +} +""" + +unique_v2_field_paths = [ + "[version=2.0].[type=Recursive].[type=R].r", + "[version=2.0].[type=Recursive].[type=R].r.[type=int].anIntegerField", + "[version=2.0].[type=Recursive].[type=R].r.[type=R].aRecursiveField" +] +``` + +```python +avro_schema =""" +{ + "type": "record", + "name": "TreeNode", + "fields": [ + { + "name": "value", + "type": "long" + }, + { + "name": "children", + "type": { "type": "array", "items": "TreeNode" } + } + ] +} +""" +unique_v2_field_paths = [ + "[version=2.0].[type=TreeNode].[type=long].value", + "[version=2.0].[type=TreeNode].[type=array].[type=TreeNode].children", +] +``` + +### Unions + +```python +avro_schema = """ +{ + "type": "record", + "name": "ABUnion", + "namespace": "com.linkedin", + "fields": [{ + "name": "a", + "type": [{ + "type": "record", + "name": "A", + "fields": [{ "name": "f", "type": "string" } ] + }, { + "type": "record", + "name": "B", + "fields": [{ "name": "f", "type": "string" } ] + } + ] + }] +} +""" +unique_v2_field_paths: List[str] = [ + "[version=2.0].[key=True].[type=ABUnion].[type=union].a", + "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=A].a", + "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=A].a.[type=string].f", + "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=B].a", + "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=B].a.[type=string].f", +] +``` + +### Arrays + +```python +avro_schema = """ +{ + "type": "record", + "name": "NestedArray", + "namespace": "com.linkedin", + "fields": [{ + "name": "ar", + "type": { + "type": "array", + "items": { + "type": "array", + "items": [ + "null", + { + "type": "record", + "name": "Foo", + "fields": [ { + "name": "a", + "type": "long" + } ] + } + ] + } + } + }] +} +""" +unique_v2_field_paths: List[str] = [ + "[version=2.0].[type=NestedArray].[type=array].[type=array].[type=Foo].ar", + "[version=2.0].[type=NestedArray].[type=array].[type=array].[type=Foo].ar.[type=long].a", +] +``` + +### Maps + +```python +avro_schema 
= """ +{ + "type": "record", + "name": "R", + "namespace": "some.namespace", + "fields": [ + { + "name": "a_map_of_longs_field", + "type": { + "type": "map", + "values": "long" + } + } + ] +} +""" +unique_v2_field_paths = [ + "[version=2.0].[type=R].[type=map].[type=long].a_map_of_longs_field", +] + + +``` + +### Mixed Complex Type Examples + +```python +# Combines arrays, unions and records. +avro_schema = """ +{ + "type": "record", + "name": "ABFooUnion", + "namespace": "com.linkedin", + "fields": [{ + "name": "a", + "type": [ { + "type": "record", + "name": "A", + "fields": [{ "name": "f", "type": "string" } ] + }, { + "type": "record", + "name": "B", + "fields": [{ "name": "f", "type": "string" } ] + }, { + "type": "array", + "items": { + "type": "array", + "items": [ + "null", + { + "type": "record", + "name": "Foo", + "fields": [{ "name": "f", "type": "long" }] + } + ] + } + }] + }] +} +""" + +unique_v2_field_paths: List[str] = [ + "[version=2.0].[type=ABFooUnion].[type=union].a", + "[version=2.0].[type=ABFooUnion].[type=union].[type=A].a", + "[version=2.0].[type=ABFooUnion].[type=union].[type=A].a.[type=string].f", + "[version=2.0].[type=ABFooUnion].[type=union].[type=B].a", + "[version=2.0].[type=ABFooUnion].[type=union].[type=B].a.[type=string].f", + "[version=2.0].[type=ABFooUnion].[type=union].[type=array].[type=array].[type=Foo].a", + "[version=2.0].[type=ABFooUnion].[type=union].[type=array].[type=array].[type=Foo].a.[type=long].f", +] +``` + +For more examples, see +the [unit-tests for AvroToMceSchemaConverter](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests/unit/test_schema_util.py). + +### Backward-compatibility + +While this format is not directly compatible with the v1 format, the v1 equivalent can easily be constructed from the v2 +encoding by stripping away all the v2 tokens enclosed in the square-brackets `[]`. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/high-cardinality.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/high-cardinality.md new file mode 100644 index 0000000000000..275f5b566a572 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/high-cardinality.md @@ -0,0 +1,53 @@ +--- +title: High Cardinality Relationships +slug: /advanced/high-cardinality +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/high-cardinality.md +--- + +# High Cardinality Relationships + +As explained in [What is a Relationship](../what/relationship.md), the raw metadata for forming relationships is captured directly inside of a [Metadata Aspect](../what/aspect.md). The most natural way to model this is using an array, e.g. a group membership aspect contains an array of user [URNs](../what/urn.md). However, this poses some challenges when the cardinality of the relationship is expected to be large (say, greater than 10,000). The aspect becomes large in size, which leads to slow update and retrieval. It may even exceed the underlying limit of the document store, which is often in the range of a few MBs. Furthermore, sending large messages (> 1MB) over Kafka requires special tuning and is generally discouraged. + +Depending on the type of relationships, there are different strategies for dealing with high cardinality. + +### 1:N Relationships + +When `N` is large, simply store the relationship as a reverse pointer on the `N` side, instead of an `N`-element array on the `1` side. 
In other words, instead of doing this
+
+```
+record MemberList {
+  members: array[UserUrn]
+}
+```
+
+do this
+
+```
+record Membership {
+  group: GroupUrn
+}
+```
+
+One drawback of this approach is that batch updating the member list turns into multiple DB operations and is no longer atomic. If the list is provided by an external metadata provider via [MCEs](../what/mxe.md), this also means that multiple MCEs will be required to update the list, instead of having one giant array in a single MCE.
+
+### M:N Relationships
+
+When one side of the relation (`M` or `N`) has low cardinality, you can apply the same trick described in the 1:N Relationships section above by creating the array on the side with low cardinality. For example, assuming a user can only be part of a small number of groups but each group can have a large number of users, the following model will be more efficient than the reverse.
+
+```
+record Membership {
+  groups: array[GroupUrn]
+}
+```
+
+When both `M` and `N` are of high cardinality (e.g. millions of users, each belonging to millions of groups), the only way to store such relationships efficiently is by creating a new "Mapping Entity" with a single aspect like this
+
+```
+record UserGroupMap {
+  user: UserUrn
+  group: GroupUrn
+}
+```
+
+This means that the relationship can now only be created & updated at the granularity of a single source-destination pair.
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/mcp-mcl.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/mcp-mcl.md
new file mode 100644
index 0000000000000..45e513634fed3
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/mcp-mcl.md
@@ -0,0 +1,164 @@
+---
+title: MetadataChangeProposal & MetadataChangeLog Events
+slug: /advanced/mcp-mcl
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/advanced/mcp-mcl.md
+---
+
+# MetadataChangeProposal & MetadataChangeLog Events
+
+## Overview & Vision
+
+As of release v0.8.7, two new important event streams have been introduced: MetadataChangeProposal & MetadataChangeLog. These topics serve as more generic (and more appropriately named) versions of the classic MetadataChangeEvent and MetadataAuditEvent events, used for a) proposing and b) logging changes to the DataHub Metadata Graph.
+
+With these events, we move towards a more generic world, in which Metadata models are not strongly-typed parts of the event schemas themselves. This provides flexibility, allowing for the core models comprising the Metadata Graph to be added and changed dynamically, without requiring structural updates to Kafka or REST API schemas used for ingesting and serving Metadata.
+
+Moreover, we've focused on the "aspect" as the atomic unit of write in DataHub. MetadataChangeProposal & MetadataChangeLog will carry only a single aspect in their payload, as opposed to the list of aspects carried by today's MCE & MAE. This more accurately reflects the atomicity contract of the metadata model, hopefully lessening confusion about transactional guarantees for multi-aspect writes in addition to making it simpler to tune into the metadata changes a consumer cares about.
+
+Making these events more generic does not come for free; we give up Rest.li- and Kafka-native schema validation and defer this responsibility to DataHub itself, which is the sole enforcer of the graph model contracts.
Additionally, we add an extra step to unbundling the actual metadata by requiring a double-deserialization: that of the event / response body itself and another of the nested Metadata aspect. + +To mitigate these downsides, we are committed to providing cross-language client libraries capable of doing the hard work for you. We intend to publish these as strongly-typed artifacts generated from the "default" model set DataHub ships with. This stands in addition to an initiative to introduce an OpenAPI layer in DataHub's backend (gms) which would provide a strongly typed model. + +Ultimately, we intend to realize a state in which the Entities and Aspect schemas can be altered without requiring generated code and without maintaining a single mega-model schema (looking at you, Snapshot.pdl). The intention is that changes to the metadata model become even easier than they are today. + +## Modeling + +A Metadata Change Proposal is defined (in PDL) as follows + +```protobuf +record MetadataChangeProposal { + + /** + * Kafka audit header. See go/kafkaauditheader for more info. + */ + auditHeader: optional KafkaAuditHeader + + /** + * Type of the entity being written to + */ + entityType: string + + /** + * Urn of the entity being written + **/ + entityUrn: optional Urn, + + /** + * Key aspect of the entity being written + */ + entityKeyAspect: optional GenericAspect + + /** + * Type of change being proposed + */ + changeType: ChangeType + + /** + * Aspect of the entity being written to + * Not filling this out implies that the writer wants to affect the entire entity + * Note: This is only valid for CREATE and DELETE operations. + **/ + aspectName: optional string + + aspect: optional GenericAspect + + /** + * A string->string map of custom properties that one might want to attach to an event + **/ + systemMetadata: optional SystemMetadata + +} +``` + +Each proposal comprises of the following: + +1. entityType + + Refers to the type of the entity e.g. dataset, chart + +2. entityUrn + + Urn of the entity being updated. Note, **exactly one** of entityUrn or entityKeyAspect must be filled out to correctly identify an entity. + +3. entityKeyAspect + + Key aspect of the entity. Instead of having a string URN, we will support identifying entities by their key aspect structs. Note, this is not supported as of now. + +4. changeType + + Type of change you are proposing: one of + + - UPSERT: Insert if not exists, update otherwise + - CREATE: Insert if not exists, fail otherwise + - UPDATE: Update if exists, fail otherwise + - DELETE: Delete + - PATCH: Patch the aspect instead of doing a full replace + + Only UPSERT is supported as of now. + +5. aspectName + + Name of the aspect. Must match the name in the "@Aspect" annotation. + +6. aspect + + To support strongly typed aspects, without having to keep track of a union of all existing aspects, we introduced a new object called GenericAspect. + + ```xml + record GenericAspect { + value: bytes + contentType: string + } + ``` + + It contains the type of serialization and the serialized value. Note, currently we only support "application/json" as contentType but will be adding more forms of serialization in the future. Validation of the serialized object happens in GMS against the schema matching the aspectName. + +7. systemMetadata + + Extra metadata about the proposal like run_id or updated timestamp. + +GMS processes the proposal and produces the Metadata Change Log, which looks like this. 
+
+```protobuf
+record MetadataChangeLog includes MetadataChangeProposal {
+
+  previousAspectValue: optional GenericAspect
+
+  previousSystemMetadata: optional SystemMetadata
+
+}
+```
+
+It includes all fields in the proposal, but also carries the previous version of the aspect value and system metadata. This allows the MCL processor to know the previous value before deciding to update all indices.
+
+## Topics
+
+Following the change in our event models, we introduced 4 new topics. The old topics will get deprecated as we fully migrate to this model.
+
+1. **MetadataChangeProposal_v1, FailedMetadataChangeProposal_v1**
+
+   Analogous to the MCE topic, proposals produced to the MetadataChangeProposal_v1 topic are ingested into GMS asynchronously, and any failed ingestion produces a failed MCP in the FailedMetadataChangeProposal_v1 topic.
+
+2. **MetadataChangeLog_Versioned_v1**
+
+   Analogous to the MAE topic, MCLs for versioned aspects are produced into this topic. Since versioned aspects have a source of truth that can be separately backed up, the retention of this topic is short (by default 7 days). Note that both this and the next topic are consumed by the same MCL processor.
+
+3. **MetadataChangeLog_Timeseries_v1**
+
+   Analogous to the MAE topic, MCLs for timeseries aspects are produced into this topic. Since timeseries aspects do not have a source of truth and are instead ingested straight into Elasticsearch, we set the retention of this topic to be longer (90 days). You can back up timeseries aspects by replaying this topic.
+
+## Configuration
+
+With MetadataChangeProposal and MetadataChangeLog, we will introduce a new mechanism for configuring the association between Metadata Entities & Aspects. Specifically, the Snapshot.pdl model will no longer encode this information by way of a [Rest.li](http://rest.li) union. Instead, a more explicit YAML file will provide these links. This file will be leveraged at runtime to construct the in-memory Entity Registry, which contains the global Metadata schema along with some additional metadata.
+
+Here is an example of the configuration file that will be used for MCP & MCL, which defines a "dataset" entity that is associated with two aspects: "datasetKey" and "datasetProfile".
+
+```
+# entity-registry.yml
+
+entities:
+  - name: dataset
+    keyAspect: datasetKey
+    aspects:
+      - datasetProfile
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/monitoring.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/monitoring.md
new file mode 100644
index 0000000000000..b6130fd59c014
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/monitoring.md
@@ -0,0 +1,104 @@
+---
+title: Monitoring DataHub
+slug: /advanced/monitoring
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/advanced/monitoring.md
+---
+
+# Monitoring DataHub
+
+Monitoring DataHub's system components is critical for operating and improving DataHub. This doc explains how to add tracing and metrics measurements in the DataHub containers.
+
+## Tracing
+
+Traces let us track the life of a request across multiple components. Each trace consists of multiple spans, which are units of work, containing various context about the work being done as well as the time taken to finish the work. By looking at the trace, we can more easily identify performance bottlenecks.
+ +We enable tracing by using +the [OpenTelemetry java instrumentation library](https://github.com/open-telemetry/opentelemetry-java-instrumentation). +This project provides a Java agent JAR that is attached to java applications. The agent injects bytecode to capture +telemetry from popular libraries. + +Using the agent we are able to + +1. Plug and play different tracing tools based on the user's setup: Jaeger, Zipkin, or other tools +2. Get traces for Kafka, JDBC, and Elasticsearch without any additional code +3. Track traces of any function with a simple `@WithSpan` annotation + +You can enable the agent by setting env variable `ENABLE_OTEL` to `true` for GMS and MAE/MCE consumers. In our +example [docker-compose](https://github.com/datahub-project/datahub/blob/master/docker/monitoring/docker-compose.monitoring.yml), we export metrics to a local Jaeger +instance by setting env variable `OTEL_TRACES_EXPORTER` to `jaeger` +and `OTEL_EXPORTER_JAEGER_ENDPOINT` to `http://jaeger-all-in-one:14250`, but you can easily change this behavior by +setting the correct env variables. Refer to +this [doc](https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md) for +all configs. + +Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added +the `@WithSpan` annotation in various places to make the trace more readable. You should start to see traces in the +tracing collector of choice. Our example [docker-compose](https://github.com/datahub-project/datahub/blob/master/docker/monitoring/docker-compose.monitoring.yml) deploys +an instance of Jaeger with port 16686. The traces should be available at http://localhost:16686. + +## Metrics + +With tracing, we can observe how a request flows through our system into the persistence layer. However, for a more +holistic picture, we need to be able to export metrics and measure them across time. Unfortunately, OpenTelemetry's java +metrics library is still in active development. + +As such, we decided to use [Dropwizard Metrics](https://metrics.dropwizard.io/4.2.0/) to export custom metrics to JMX, +and then use [Prometheus-JMX exporter](https://github.com/prometheus/jmx_exporter) to export all JMX metrics to +Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use +their tool of choice. You can enable the agent by setting env variable `ENABLE_PROMETHEUS` to `true` for GMS and MAE/MCE +consumers. Refer to this example [docker-compose](https://github.com/datahub-project/datahub/blob/master/docker/monitoring/docker-compose.monitoring.yml) for setting the +variables. + +In our example [docker-compose](https://github.com/datahub-project/datahub/blob/master/docker/monitoring/docker-compose.monitoring.yml), we have configured prometheus to +scrape from 4318 ports of each container used by the JMX exporter to export metrics. We also configured grafana to +listen to prometheus and create useful dashboards. By default, we provide two +dashboards: [JVM dashboard](https://grafana.com/grafana/dashboards/14845) and DataHub dashboard. + +In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU/memory/disk usage. In the DataHub +dashboard, you can find charts to monitor each endpoint and the kafka topics. Using the example implementation, go +to http://localhost:3001 to find the grafana dashboards! 
(Username: admin, PW: admin) + +To make it easy to track various metrics within the code base, we created MetricUtils class. This util class creates a +central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers. +You can run the following to create a counter and increment. + +```java +MetricUtils.counter(this.getClass(),"metricName").increment(); +``` + +You can run the following to time a block of code. + +```java +try(Timer.Context ignored=MetricUtils.timer(this.getClass(),"timerName").timer()){ + ...block of code + } +``` + +## Enable monitoring through docker-compose + +We provide some example configuration for enabling monitoring in +this [directory](https://github.com/datahub-project/datahub/tree/master/docker/monitoring). Take a look at the docker-compose +files, which adds necessary env variables to existing containers, and spawns new containers (Jaeger, Prometheus, +Grafana). + +You can add in the above docker-compose using the `-f <>` when running docker-compose commands. +For instance, + +```shell +docker-compose \ + -f quickstart/docker-compose.quickstart.yml \ + -f monitoring/docker-compose.monitoring.yml \ + pull && \ +docker-compose -p datahub \ + -f quickstart/docker-compose.quickstart.yml \ + -f monitoring/docker-compose.monitoring.yml \ + up +``` + +We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to add the above docker-compose when MONITORING=true. For +instance `MONITORING=true ./docker/quickstart.sh` will add the correct env variables to start collecting traces and +metrics, and also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-modeling.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-modeling.md new file mode 100644 index 0000000000000..d92245944016a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-modeling.md @@ -0,0 +1,417 @@ +--- +title: No Code Metadata +slug: /advanced/no-code-modeling +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/no-code-modeling.md +--- + +# No Code Metadata + +## Summary of changes + +As part of the No Code Metadata Modeling initiative, we've made radical changes to the DataHub stack. + +Specifically, we've + +- Decoupled the persistence layer from Java + Rest.li specific concepts +- Consolidated the per-entity Rest.li resources into a single general-purpose Entity Resource +- Consolidated the per-entity Graph Index Writers + Readers into a single general-purpose Neo4J DAO +- Consolidated the per-entity Search Index Writers + Readers into a single general-purpose ES DAO. +- Developed mechanisms for declaring search indexing configurations + foreign key relationships as annotations + on PDL models themselves. +- Introduced a special "Browse Paths" aspect that allows the browse configuration to be + pushed into DataHub, as opposed to computed in a blackbox lambda sitting within DataHub +- Introduced special "Key" aspects for conveniently representing the information that identifies a DataHub entities via + a normal struct. +- Removed the need for hand-written Elastic `settings.json` and `mappings.json`. (Now generated at runtime) +- Removed the need for the Elastic Set Up container (indexes are not registered at runtime) +- Simplified the number of models that need to be maintained for each DataHub entity. We removed the need for + 1. 
Relationship Models + 2. Entity Models + 3. Urn models + the associated Java container classes + 4. 'Value' models, those which are returned by the Rest.li resource + +In doing so, dramatically reducing the level of effort required to add or extend an existing entity. + +For more on the design considerations, see the **Design** section below. + +## Engineering Spec + +This section will provide a more in-depth overview of the design considerations that were at play when working on the No +Code initiative. + +# Use Cases + +Who needs what & why? + +| As a | I want to | because | +| ---------------- | ------------------------ | --------------------------------------------------------- | +| DataHub Operator | Add new entities | The default domain model does not match my business needs | +| DataHub Operator | Extend existing entities | The default domain model does not match my business needs | + +What we heard from folks in the community is that adding new entities + aspects is just **too difficult**. + +They'd be happy if this process was streamlined and simple. **Extra** happy if there was no chance of merge conflicts in the future. (no fork necessary) + +# Goals + +### Primary Goal + +**Reduce the friction** of adding new entities, aspects, and relationships. + +### Secondary Goal + +Achieve the primary goal in a way that does not require a fork. + +# Requirements + +### Must-Haves + +1. Mechanisms for **adding** a browsable, searchable, linkable GMS entity by defining one or more PDL models + +- GMS Endpoint for fetching entity +- GMS Endpoint for fetching entity relationships +- GMS Endpoint for searching entity +- GMS Endpoint for browsing entity + +2. Mechanisms for **extending** a \***\*browsable, searchable, linkable GMS \*\***entity by defining one or more PDL models + +- GMS Endpoint for fetching entity +- GMS Endpoint for fetching entity relationships +- GMS Endpoint for searching entity +- GMS Endpoint for browsing entity + +3. Mechanisms + conventions for introducing a new **relationship** between 2 GMS entities without writing code +4. Clear documentation describing how to perform actions in #1, #2, and #3 above published on [datahubproject.io](http://datahubproject.io) + +## Nice-to-haves + +1. Mechanisms for automatically generating a working GraphQL API using the entity PDL models +2. Ability to add / extend GMS entities without a fork. + +- e.g. **Register** new entity / extensions _at runtime_. (Unlikely due to code generation) +- or, **configure** new entities at _deploy time_ + +## What Success Looks Like + +1. Adding a new browsable, searchable entity to GMS (not DataHub UI / frontend) takes 1 dev < 15 minutes. +2. Extending an existing browsable, searchable entity in GMS takes 1 dev < 15 minutes +3. Adding a new relationship among 2 GMS entities takes 1 dev < 15 minutes +4. [Bonus] Implementing the `datahub-frontend` GraphQL API for a new / extended entity takes < 10 minutes + +## Design + +## State of the World + +### Modeling + +Currently, there are various models in GMS: + +1. [Urn](https://github.com/datahub-project/datahub/blob/master/li-utils/src/main/pegasus/com/linkedin/common/DatasetUrn.pdl) - Structs composing primary keys +2. [Root] [Snapshots](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdl) - Container of aspects +3. 
[Aspects](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DashboardAspect.pdl) - Optional container of fields +4. [Values](https://github.com/datahub-project/datahub/blob/master/gms/api/src/main/pegasus/com/linkedin/dataset/Dataset.pdl), [Keys](https://github.com/datahub-project/datahub/blob/master/gms/api/src/main/pegasus/com/linkedin/dataset/DatasetKey.pdl) - Model returned by GMS [Rest.li](http://rest.li) API (public facing) +5. [Entities](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdl) - Records with fields derived from the URN. Used only in graph / relationships +6. [Relationships](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/relationship/Relationship.pdl) - Edges between 2 entities with optional edge properties +7. [Search Documents](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/search/ChartDocument.pdl) - Flat documents for indexing within Elastic index + +- And corresponding index [mappings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/mappings.json), [settings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/settings.json) + +Various components of GMS depend on / make assumptions about these model types: + +1. IndexBuilders depend on **Documents** +2. GraphBuilders depend on **Snapshots** +3. RelationshipBuilders depend on **Aspects** +4. Mae Processor depend on **Snapshots, Documents, Relationships** +5. Mce Processor depend on **Snapshots, Urns** +6. [Rest.li](http://rest.li) Resources on **Documents, Snapshots, Aspects, Values, Urns** +7. Graph Reader Dao (BaseQueryDao) depends on **Relationships, Entity** +8. Graph Writer Dao (BaseGraphWriterDAO) depends on **Relationships, Entity** +9. Local Dao Depends on **aspects, urns** +10. Search Dao depends on **Documents** + +Additionally, there are some implicit concepts that require additional caveats / logic: + +1. Browse Paths - Requires defining logic in an entity-specific index builder to generate. +2. Urns - Requires defining a) an Urn PDL model and b) a hand-written Urn class + +As you can see, there are many tied up concepts. Fundamentally changing the model would require a serious amount of refactoring, as it would require new versions of numerous components. + +The challenge is, how can we meet the requirements without fundamentally altering the model? + +## Proposed Solution + +In a nutshell, the idea is to consolidate the number of models + code we need to write on a per-entity basis. +We intend to achieve this by making search index + relationship configuration declarative, specified as part of the model +definition itself. + +We will use this configuration to drive more generic versions of the index builders + rest resources, +with the intention of reducing the overall surface area of GMS. + +During this initiative, we will also seek to make the concepts of Browse Paths and Urns declarative. Browse Paths +will be provided using a special BrowsePaths aspect. Urns will no longer be strongly typed. + +To achieve this, we will attempt to generify many components throughout the stack. 
Currently, many of them are defined on +a _per-entity_ basis, including + +- Rest.li Resources +- Index Builders +- Graph Builders +- Local, Search, Browse, Graph DAOs +- Clients +- Browse Path Logic + +along with simplifying the number of raw data models that need defined, including + +- Rest.li Resource Models +- Search Document Models +- Relationship Models +- Urns + their java classes + +From an architectural PoV, we will move from a before that looks something like this: + +

+ +

+ +to an after that looks like this + +

+ +

+ +That is, a move away from patterns of strong-typing-everywhere to a more generic + flexible world. + +### How will we do it? + +We will accomplish this by building the following: + +1. Set of custom annotations to permit declarative entity, search, graph configurations + - @Entity & @Aspect + - @Searchable + - @Relationship +2. Entity Registry: In-memory structures for representing, storing & serving metadata associated with a particular Entity, including search and relationship configurations. +3. Generic Entity, Search, Graph Service classes: Replaces traditional strongly-typed DAOs with flexible, pluggable APIs that can be used for CRUD, search, and graph across all entities. +4. Generic Rest.li Resources: + - 1 permitting reading, writing, searching, autocompleting, and browsing arbitrary entities + - 1 permitting reading of arbitrary entity-entity relationship edges +5. Generic Search Index Builder: Given a MAE and a specification of the Search Configuration for an entity, updates the search index. +6. Generic Graph Index Builder: Given a MAE and a specification of the Relationship Configuration for an entity, updates the graph index. +7. Generic Index + Mappings Builder: Dynamically generates index mappings and creates indices on the fly. +8. Introduce of special aspects to address other imperative code requirements + - BrowsePaths Aspect: Include an aspect to permit customization of the indexed browse paths. + - Key aspects: Include "virtual" aspects for representing the fields that uniquely identify an Entity for easy + reading by clients of DataHub. + +### Final Developer Experience: Defining an Entity + +We will outline what the experience of adding a new Entity should look like. We will imagine we want to define a "Service" entity representing +online microservices. + +#### Step 1. Add aspects + +ServiceKey.pdl + +``` +namespace com.linkedin.metadata.key + +/** + * Key for a Service + */ +@Aspect = { + "name": "serviceKey" +} +record ServiceKey { + /** + * Name of the service + */ + @Searchable = { + "fieldType": "TEXT_PARTIAL", + "enableAutocomplete": true + } + name: string +} +``` + +ServiceInfo.pdl + +``` +namespace com.linkedin.service + +import com.linkedin.common.Urn + +/** + * Properties associated with a Tag + */ +@Aspect = { + "name": "serviceInfo" +} +record ServiceInfo { + + /** + * Description of the service + */ + @Searchable = {} + description: string + + /** + * The owners of the + */ + @Relationship = { + "name": "OwnedBy", + "entityTypes": ["corpUser"] + } + owner: Urn +} +``` + +#### Step 2. Add aspect union. + +ServiceAspect.pdl + +``` +namespace com.linkedin.metadata.aspect + +import com.linkedin.metadata.key.ServiceKey +import com.linkedin.service.ServiceInfo +import com.linkedin.common.BrowsePaths + +/** + * Service Info + */ +typeref ServiceAspect = union[ + ServiceKey, + ServiceInfo, + BrowsePaths +] +``` + +#### Step 3. Add Snapshot model. + +ServiceSnapshot.pdl + +``` +namespace com.linkedin.metadata.snapshot + +import com.linkedin.common.Urn +import com.linkedin.metadata.aspect.ServiceAspect + +@Entity = { + "name": "service", + "keyAspect": "serviceKey" +} +record ServiceSnapshot { + + /** + * Urn for the service + */ + urn: Urn + + /** + * The list of service aspects + */ + aspects: array[ServiceAspect] +} +``` + +#### Step 4. Update Snapshot union. + +Snapshot.pdl + +``` +namespace com.linkedin.metadata.snapshot + +/** + * A union of all supported metadata snapshot types. + */ +typeref Snapshot = union[ + ... 
+ ServiceSnapshot +] +``` + +### Interacting with New Entity + +1. Write Entity + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.ServiceSnapshot":{ + "urn": "urn:li:service:mydemoservice", + "aspects":[ + { + "com.linkedin.service.ServiceInfo":{ + "description":"My demo service", + "owner": "urn:li:corpuser:user1" + } + }, + { + "com.linkedin.common.BrowsePaths":{ + "paths":[ + "/my/custom/browse/path1", + "/my/custom/browse/path2" + ] + } + } + ] + } + } + } +}' +``` + +2. Read Entity + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3Aservice%3Amydemoservice' -H 'X-RestLi-Protocol-Version:2.0.0' +``` + +3. Search Entity + +``` +curl --location --request POST 'http://localhost:8080/entities?action=search' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "input": "My demo", + "entity": "service", + "start": 0, + "count": 10 +}' +``` + +4. Autocomplete + +``` +curl --location --request POST 'http://localhost:8080/entities?action=autocomplete' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "mydem", + "entity": "service", + "limit": 10 +}' +``` + +5. Browse + +``` +curl --location --request POST 'http://localhost:8080/entities?action=browse' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "path": "/my/custom/browse", + "entity": "service", + "start": 0, + "limit": 10 +}' +``` + +6. Relationships + +``` +curl --location --request GET 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acorpuser%3Auser1&types=OwnedBy' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-upgrade.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-upgrade.md new file mode 100644 index 0000000000000..091a877b62911 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/no-code-upgrade.md @@ -0,0 +1,212 @@ +--- +title: No Code Upgrade (In-Place Migration Guide) +slug: /advanced/no-code-upgrade +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/no-code-upgrade.md +--- + +# No Code Upgrade (In-Place Migration Guide) + +## Summary of changes + +With the No Code metadata initiative, we've introduced various major changes: + +1. New Ebean Aspect table (metadata_aspect_v2) +2. New Elastic Indices (*entityName*index_v2) +3. New edge triples. (Remove fully qualified classpaths from nodes & edges) +4. Dynamic DataPlatform entities (no more hardcoded DataPlatformInfo.json) +5. Dynamic Browse Paths (no more hardcoded browse path creation logic) +6. Addition of Entity Key aspects, dropped requirement for strongly-typed Urns. +7. Addition of @Entity, @Aspect, @Searchable, @Relationship annotations to existing models. + +Because of these changes, it is required that your persistence layer be migrated after the NoCode containers have been +deployed. + +For more information about the No Code Update, please see [no-code-modeling](./no-code-modeling.md). + +## Migration strategy + +We are merging these breaking changes into the main branch upfront because we feel they are fundamental to subsequent +changes, providing a more solid foundation upon which exciting new features will be built upon. 
We will continue to +offer limited support for previous verions of DataHub. + +This approach means that companies who actively deploy the latest version of DataHub will need to perform an upgrade to +continue operating DataHub smoothly. + +## Upgrade Steps + +### Step 1: Pull & deploy latest container images + +It is important that the following containers are pulled and deployed simultaneously: + +- datahub-frontend-react +- datahub-gms +- datahub-mae-consumer +- datahub-mce-consumer + +#### Docker Compose Deployments + +From the `docker` directory: + +```aidl +docker-compose down --remove-orphans && docker-compose pull && docker-compose -p datahub up --force-recreate +``` + +#### Helm + +Deploying latest helm charts will upgrade all components to version 0.8.0. Once all the pods are up and running, it will +run the datahub-upgrade job, which will run the above docker container to migrate to the new sources. + +### Step 2: Execute Migration Job + +#### Docker Compose Deployments - Preserve Data + +If you do not care about migrating your data, you can refer to the Docker Compose Deployments - Lose All Existing Data +section below. + +To migrate existing data, the easiest option is to execute the `run_upgrade.sh` script located under `docker/datahub-upgrade/nocode`. + +``` +cd docker/datahub-upgrade/nocode +./run_upgrade.sh +``` + +Using this command, the default environment variables will be used (`docker/datahub-upgrade/env/docker.env`). These assume +that your deployment is local & that you are running MySQL. If this is not the case, you'll need to define your own environment variables to tell the +upgrade system where your DataHub containers reside and run + +To update the default environment variables, you can either + +1. Change `docker/datahub-upgrade/env/docker.env` in place and then run one of the above commands OR +2. Define a new ".env" file containing your variables and execute `docker pull acryldata/datahub-upgrade && docker run acryldata/datahub-upgrade:latest -u NoCodeDataMigration` + +To see the required environment variables, see the [datahub-upgrade](../../docker/datahub-upgrade/README.md) +documentation. + +To run the upgrade against a database other than MySQL, you can use the `-a dbType=` argument. + +Execute + +``` +./docker/datahub-upgrade.sh -u NoCodeDataMigration -a dbType=POSTGRES +``` + +where dbType can be either `MYSQL`, `MARIA`, `POSTGRES`. + +#### Docker Compose Deployments - Lose All Existing Data + +This path is quickest but will wipe your DataHub's database. + +If you want to make sure your current data is migrated, refer to the Docker Compose Deployments - Preserve Data section above. +If you are ok losing your data and re-ingesting, this approach is simplest. + +``` +# make sure you are on the latest +git checkout master +git pull origin master + +# wipe all your existing data and turn off all processes +./docker/nuke.sh + +# spin up latest datahub +./docker/quickstart.sh + +# re-ingest data, for example, to ingest sample data: +./docker/ingestion/ingestion.sh +``` + +After that, you will be ready to go. + +##### How to fix the "listening to port 5005" issue + +Fix for this issue have been published to the acryldata/datahub-upgrade:head tag. Please pull latest master and rerun +the upgrade script. + +However, we have seen cases where the problematic docker image is cached and docker does not pull the latest version. If +the script fails with the same error after pulling latest master, please run the following command to clear the docker +image cache. 
+ +``` +docker images -a | grep acryldata/datahub-upgrade | awk '{print $3}' | xargs docker rmi -f +``` + +#### Helm Deployments + +Upgrade to latest helm charts by running the following after pulling latest master. + +```(shell) +helm upgrade datahub datahub/ +``` + +In the latest helm charts, we added a datahub-upgrade-job, which runs the above mentioned docker container to migrate to +the new storage layer. Note, the job will fail in the beginning as it waits for GMS and MAE consumer to be deployed with +the NoCode code. It will rerun until it runs successfully. + +Once the storage layer has been migrated, subsequent runs of this job will be a noop. + +### Step 3 (Optional): Cleaning Up + +Warning: This step clears all legacy metadata. If something is wrong with the upgraded metadata, there will no easy way to +re-run the migration. + +This step involves removing data from previous versions of DataHub. This step should only be performed once you've +validated that your DataHub deployment is healthy after performing the upgrade. If you're able to search, browse, and +view your Metadata after the upgrade steps have been completed, you should be in good shape. + +In advanced DataHub deployments, or cases in which you cannot easily rebuild the state stored in DataHub, it is strongly +advised that you do due diligence prior to running cleanup. This may involve manually inspecting the relational +tables (metadata_aspect_v2), search indices, and graph topology. + +#### Docker Compose Deployments + +The easiest option is to execute the `run_clean.sh` script located under `docker/datahub-upgrade/nocode`. + +``` +cd docker/datahub-upgrade/nocode +./run_clean.sh +``` + +Using this command, the default environment variables will be used (`docker/datahub-upgrade/env/docker.env`). These assume +that your deployment is local. If this is not the case, you'll need to define your own environment variables to tell the +upgrade system where your DataHub containers reside. + +To update the default environment variables, you can either + +1. Change `docker/datahub-upgrade/env/docker.env` in place and then run one of the above commands OR +2. Define a new ".env" file containing your variables and execute + `docker pull acryldata/datahub-upgrade && docker run acryldata/datahub-upgrade:latest -u NoCodeDataMigrationCleanup` + +To see the required environment variables, see the [datahub-upgrade](../../docker/datahub-upgrade/README.md) +documentation + +#### Helm Deployments + +Assuming the latest helm chart has been deployed in the previous step, datahub-cleanup-job-template cronJob should have +been created. You can check by running the following: + +``` +kubectl get cronjobs +``` + +You should see an output like below: + +``` +NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE +datahub-datahub-cleanup-job-template * * * * * True 0 12m +``` + +Note that the cronJob has been suspended. It is intended to be run in an adhoc fashion when ready to clean up. Make sure +the migration was successful and DataHub is working as expected. Then run the following command to run the clean up job: + +``` +kubectl create job --from=cronjob/<>-datahub-cleanup-job-template datahub-cleanup-job +``` + +Replace release-name with the name of the helm release. If you followed the kubernetes guide, it should be "datahub". + +## Support + +The Acryl team will be on standby to assist you in your migration. 
Please +join [#release-0_8_0](https://datahubspace.slack.com/archives/C0244FHMHJQ) channel and reach out to us if you find +trouble with the upgrade or have feedback on the process. We will work closely to make sure you can continue to operate +DataHub smoothly. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/partial-update.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/partial-update.md new file mode 100644 index 0000000000000..c276024dbd09c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/partial-update.md @@ -0,0 +1,10 @@ +--- +title: Supporting Partial Aspect Update +slug: /advanced/partial-update +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/partial-update.md +--- + +# Supporting Partial Aspect Update + +WIP diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/patch.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/patch.md new file mode 100644 index 0000000000000..b6ad784c3e25d --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/patch.md @@ -0,0 +1,285 @@ +--- +title: "But First, Semantics: Upsert versus Patch" +slug: /advanced/patch +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/advanced/patch.md" +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# But First, Semantics: Upsert versus Patch + +## Why Would You Use Patch + +By default, most of the SDK tutorials and API-s involve applying full upserts at the aspect level. This means that typically, when you want to change one field within an aspect without modifying others, you need to do a read-modify-write to not overwrite existing fields. +To support these scenarios, DataHub supports PATCH based operations so that targeted changes to single fields or values within arrays of fields are possible without impacting other existing metadata. + +:::note + +Currently, PATCH support is only available for a selected set of aspects, so before pinning your hopes on using PATCH as a way to make modifications to aspect values, confirm whether your aspect supports PATCH semantics. The complete list of Aspects that are supported are maintained [here](https://github.com/datahub-project/datahub/blob/9588440549f3d99965085e97b214a7dabc181ed2/entity-registry/src/main/java/com/linkedin/metadata/models/registry/template/AspectTemplateEngine.java#L24). In the near future, we do have plans to automatically support PATCH semantics for aspects by default. + +::: + +## How To Use Patch + +Examples for using Patch are sprinkled throughout the API guides. +Here's how to find the appropriate classes for the language for your choice. + + + + +The Java Patch builders are aspect-oriented and located in the [datahub-client](https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/datahub-client/src/main/java/datahub/client/patch) module under the `datahub.client.patch` namespace. 
+ +Here are a few illustrative examples using the Java Patch builders: + +### Add Custom Properties + +```java +# Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAdd.java +package io.datahubproject.examples; + +import com.linkedin.common.urn.UrnUtils; +import datahub.client.MetadataWriteResponse; +import datahub.client.patch.dataset.DatasetPropertiesPatchBuilder; +import datahub.client.rest.RestEmitter; +import java.io.IOException; +import com.linkedin.mxe.MetadataChangeProposal; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Future; +import lombok.extern.slf4j.Slf4j; + + +@Slf4j +class DatasetCustomPropertiesAdd { + + private DatasetCustomPropertiesAdd() { + + } + + /** + * Adds properties to an existing custom properties aspect without affecting any existing properties + * @param args + * @throws IOException + * @throws ExecutionException + * @throws InterruptedException + */ + public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { + MetadataChangeProposal datasetPropertiesProposal = new DatasetPropertiesPatchBuilder() + .urn(UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD")) + .addCustomProperty("cluster_name", "datahubproject.acryl.io") + .addCustomProperty("retention_time", "2 years") + .build(); + + String token = ""; + RestEmitter emitter = RestEmitter.create( + b -> b.server("http://localhost:8080") + .token(token) + ); + try { + Future response = emitter.emit(datasetPropertiesProposal); + + System.out.println(response.get().getResponseContent()); + } catch (Exception e) { + log.error("Failed to emit metadata to DataHub", e); + throw e; + } finally { + emitter.close(); + } + + } + +} + + + +``` + +### Add and Remove Custom Properties + +```java +# Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAddRemove.java +package io.datahubproject.examples; + +import com.linkedin.common.urn.UrnUtils; +import com.linkedin.mxe.MetadataChangeProposal; +import datahub.client.MetadataWriteResponse; +import datahub.client.patch.dataset.DatasetPropertiesPatchBuilder; +import datahub.client.rest.RestEmitter; +import java.io.IOException; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Future; +import lombok.extern.slf4j.Slf4j; + + +@Slf4j +class DatasetCustomPropertiesAddRemove { + + private DatasetCustomPropertiesAddRemove() { + + } + + /** + * Applies Add and Remove property operations on an existing custom properties aspect without + * affecting any other properties + * @param args + * @throws IOException + * @throws ExecutionException + * @throws InterruptedException + */ + public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { + MetadataChangeProposal datasetPropertiesProposal = new DatasetPropertiesPatchBuilder() + .urn(UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD")) + .addCustomProperty("cluster_name", "datahubproject.acryl.io") + .removeCustomProperty("retention_time") + .build(); + + String token = ""; + RestEmitter emitter = RestEmitter.create( + b -> b.server("http://localhost:8080") + .token(token) + ); + try { + Future response = emitter.emit(datasetPropertiesProposal); + + System.out.println(response.get().getResponseContent()); + } catch (Exception e) { + log.error("Failed to emit metadata to DataHub", e); + throw e; + } finally { + emitter.close(); + } + + } + +} + + + +``` + +### Add 
Data Job Lineage + +```java +# Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DataJobLineageAdd.java +package io.datahubproject.examples; + +import com.linkedin.common.urn.DataJobUrn; +import com.linkedin.common.urn.DatasetUrn; +import com.linkedin.common.urn.UrnUtils; +import datahub.client.MetadataWriteResponse; +import datahub.client.patch.datajob.DataJobInputOutputPatchBuilder; +import datahub.client.rest.RestEmitter; +import java.io.IOException; +import com.linkedin.mxe.MetadataChangeProposal; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Future; +import lombok.extern.slf4j.Slf4j; + + +@Slf4j +class DataJobLineageAdd { + + private DataJobLineageAdd() { + + } + + /** + * Adds lineage to an existing DataJob without affecting any lineage + * @param args + * @throws IOException + * @throws ExecutionException + * @throws InterruptedException + */ + public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { + String token = ""; + try (RestEmitter emitter = RestEmitter.create( + b -> b.server("http://localhost:8080") + .token(token) + )) { + MetadataChangeProposal dataJobIOPatch = new DataJobInputOutputPatchBuilder().urn(UrnUtils + .getUrn("urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_456)")) + .addInputDatasetEdge(DatasetUrn.createFromString("urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)")) + .addOutputDatasetEdge(DatasetUrn.createFromString("urn:li:dataset:(urn:li:dataPlatform:kafka,SampleHiveDataset,PROD)")) + .addInputDatajobEdge(DataJobUrn.createFromString("urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)")) + .addInputDatasetField(UrnUtils.getUrn( + "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD),user_id)")) + .addOutputDatasetField(UrnUtils.getUrn( + "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD),user_id)")) + .build(); + + Future response = emitter.emit(dataJobIOPatch); + + System.out.println(response.get().getResponseContent()); + } catch (Exception e) { + log.error("Failed to emit metadata to DataHub", e); + throw new RuntimeException(e); + } + + } + +} + + + +``` + + + + +The Python Patch builders are entity-oriented and located in the [metadata-ingestion](https://github.com/datahub-project/datahub/tree/9588440549f3d99965085e97b214a7dabc181ed2/metadata-ingestion/src/datahub/specific) module and located in the `datahub.specific` module. 
+ +Here are a few illustrative examples using the Python Patch builders: + +### Add Properties to Dataset + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_properties.py +import logging +from typing import Union + +from datahub.configuration.kafka import KafkaProducerConnectionConfig +from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig +from datahub.emitter.mce_builder import make_dataset_urn +from datahub.emitter.rest_emitter import DataHubRestEmitter +from datahub.specific.dataset import DatasetPatchBuilder + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +# Get an emitter, either REST or Kafka, this example shows you both +def get_emitter() -> Union[DataHubRestEmitter, DatahubKafkaEmitter]: + USE_REST_EMITTER = True + if USE_REST_EMITTER: + gms_endpoint = "http://localhost:8080" + return DataHubRestEmitter(gms_server=gms_endpoint) + else: + kafka_server = "localhost:9092" + schema_registry_url = "http://localhost:8081" + return DatahubKafkaEmitter( + config=KafkaEmitterConfig( + connection=KafkaProducerConnectionConfig( + bootstrap=kafka_server, schema_registry_url=schema_registry_url + ) + ) + ) + + +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") + +with get_emitter() as emitter: + for patch_mcp in ( + DatasetPatchBuilder(dataset_urn) + .add_custom_property("cluster_name", "datahubproject.acryl.io") + .add_custom_property("retention_time", "2 years") + .build() + ): + emitter.emit(patch_mcp) + + +log.info(f"Added cluster_name, retention_time properties to dataset {dataset_urn}") + +``` + + + diff --git a/docs-website/versioned_docs/version-0.10.4/docs/advanced/pdl-best-practices.md b/docs-website/versioned_docs/version-0.10.4/docs/advanced/pdl-best-practices.md new file mode 100644 index 0000000000000..77c5dc7a5485e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/advanced/pdl-best-practices.md @@ -0,0 +1,10 @@ +--- +title: PDL Best Practices +slug: /advanced/pdl-best-practices +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/advanced/pdl-best-practices.md +--- + +# PDL Best Practices + +WIP diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/datahub-apis.md b/docs-website/versioned_docs/version-0.10.4/docs/api/datahub-apis.md new file mode 100644 index 0000000000000..5dcb47a3f9d3e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/datahub-apis.md @@ -0,0 +1,98 @@ +--- +title: Which DataHub API is for me? +slug: /api/datahub-apis +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/datahub-apis.md +--- + +# Which DataHub API is for me? + +DataHub supplys several APIs to manipulate metadata on the platform. These are our most-to-least recommended approaches: + +- Our most recommended tools for extending and customizing the behavior of your DataHub instance are our SDKs in [Python](metadata-ingestion/as-a-library.md) and [Java](metadata-integration/java/as-a-library.md). +- If you'd like to customize the DataHub client or roll your own; the [GraphQL API](docs/api/graphql/getting-started.md) is our what powers our frontend. We figure if it's good enough for us, it's good enough for everyone! If `graphql` doesn't cover everything in your usecase, drop into [our slack](docs/slack.md) and let us know how we can improve it! 
- If you are less familiar with `graphql` and would rather use OpenAPI, we offer [OpenAPI](docs/api/openapi/openapi-usage-guide.md) endpoints that allow you to produce metadata events and query metadata.
- Finally, if you're a brave soul and know exactly what you are doing... are you sure you don't just want to use the SDK directly? If you insist, the [Rest.li API](docs/api/restli/restli-overview.md) is a much more powerful, low-level API intended only for advanced users.

## Python and Java SDK

We offer SDKs for both Python and Java that provide full functionality when it comes to CRUD operations and any complex functionality you may want to build into DataHub.

Get started with the Python SDK

Get started with the Java SDK

## GraphQL API

The `graphql` API serves as the primary public API for the platform. It can be used to fetch and update metadata programmatically in the language of your choice. It is intended as a higher-level API that simplifies the most common operations.

Get started with the GraphQL API

## OpenAPI

For developers who prefer OpenAPI to GraphQL for programmatic operations, OpenAPI provides lower-level access to the entire DataHub metadata model for writes, reads, and queries.

Get started with OpenAPI

## Rest.li API

:::caution
The Rest.li API is intended only for advanced users. If you're just getting started with DataHub, we recommend the GraphQL API.
:::

The Rest.li API represents the underlying persistence layer and exposes the raw PDL models used in storage. Under the hood, it powers the GraphQL API. It is also used for system-specific ingestion of metadata, being used by the Metadata Ingestion Framework for pushing metadata into DataHub directly. For all intents and purposes, the Rest.li API is considered system-internal, meaning DataHub components are the only ones expected to consume this API directly.

Get started with our Rest.li API

## DataHub API Comparison

DataHub supports several APIs, each with its own unique usage and format.
Here's an overview of what each API can do.
> Last Updated: Apr 8, 2023

| Feature | GraphQL | Python SDK | OpenAPI |
| --- | --- | --- | --- |
| Create a dataset | 🚫 | ✅ [[Guide]](/docs/api/tutorials/datasets.md) | ✅ |
| Delete a dataset (Soft delete) | ✅ [[Guide]](/docs/api/tutorials/datasets.md#delete-dataset) | ✅ [[Guide]](/docs/api/tutorials/datasets.md#delete-dataset) | ✅ |
| Delete a dataset (Hard delete) | 🚫 | ✅ [[Guide]](/docs/api/tutorials/datasets.md#delete-dataset) | ✅ |
| Search a dataset | ✅ | ✅ | ✅ |
| Create a tag | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ |
| Read a tag | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ |
| Add tags to a dataset | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ |
| Add tags to a column of a dataset | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ |
| Remove tags from a dataset | ✅ [[Guide]](/docs/api/tutorials/tags.md) | ✅ [[Guide]](/docs/api/tutorials/tags.md#add-tags) | ✅ |
| Create glossary terms | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ |
| Read terms from a dataset | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ |
| Add terms to a column of a dataset | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ |
| Add terms to a dataset | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ [[Guide]](/docs/api/tutorials/terms.md) | ✅ |
| Create domains | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ |
| Read domains | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ |
| Add domains to a dataset | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ |
| Remove domains from a dataset | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ [[Guide]](/docs/api/tutorials/domains.md) | ✅ |
| Create users and groups | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ |
| Read owners of a dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ |
| Add owner to a dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ |
| Remove owner from a dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ [[Guide]](/docs/api/tutorials/owners.md) | ✅ |
| Add lineage | ✅ [[Guide]](/docs/api/tutorials/lineage.md) | ✅ [[Guide]](/docs/api/tutorials/lineage.md) | ✅ |
| Add column-level (fine-grained) lineage | 🚫 | ✅ | ✅ |
| Add documentation (description) to a column of a dataset | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-column) | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-column) | ✅ |
| Add documentation (description) to a dataset | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-dataset) | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-dataset) | ✅ |
| Add / Remove / Replace custom properties on a dataset | 🚫 [[Guide]](/docs/api/tutorials/custom-properties.md) | ✅ [[Guide]](/docs/api/tutorials/custom-properties.md) | ✅ |

diff --git
a/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/getting-started.md b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/getting-started.md new file mode 100644 index 0000000000000..42c41c62fdea4 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/getting-started.md @@ -0,0 +1,164 @@ +--- +title: Getting Started With GraphQL +slug: /api/graphql/getting-started +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/graphql/getting-started.md +--- + +# Getting Started With GraphQL + +## Reading an Entity: Queries + +DataHub provides the following `graphql` queries for retrieving entities in your Metadata Graph. + +### Query + +The following `graphql` query retrieves the `urn` and `name` of the `properties` of a specific dataset + +```json +{ + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)") { + urn + properties { + name + } + } +} +``` + +In addition to the URN and properties, you can also fetch other types of metadata for an asset, such as owners, tags, domains, and terms of an entity. +For more information on, please refer to the following links." + +- [Querying for Owners of a Dataset](/docs/api/tutorials/owners.md#read-owners) +- [Querying for Tags of a Dataset](/docs/api/tutorials/tags.md#read-tags) +- [Querying for Domain of a Dataset](/docs/api/tutorials/domains.md#read-domains) +- [Querying for Glossary Terms of a Dataset](/docs/api/tutorials/terms.md#read-terms) +- [Querying for Deprecation of a dataset](/docs/api/tutorials/deprecation.md#read-deprecation) + +### Search + +To perform full-text search against an Entity of a particular type, use the search(input: `SearchInput!`) `graphql` Query. +The following `graphql` query searches for datasets that match a specific query term. + +```json +{ + search(input: { type: DATASET, query: "my sql dataset", start: 0, count: 10 }) { + start + count + total + searchResults { + entity { + urn + type + ...on Dataset { + name + } + } + } + } +} +``` + +The `search` field is used to indicate that we want to perform a search. +The `input` argument specifies the search criteria, including the type of entity being searched, the search query term, the start index of the search results, and the count of results to return. + +The `query` term is used to specify the search term. +The search term can be a simple string, or it can be a more complex query using patterns. + +- `*` : Search for all entities. +- `*[string]` : Search for all entities that contain aspects **starting with** the specified \[string\]. +- `[string]*` : Search for all entities that contain aspects **ending with** the specified \[string\]. +- `*[string]*` : Search for all entities that **match** aspects named \[string\]. +- `[string]` : Search for all entities that **contain** the specified \[string\]. + +:::note +Note that by default Elasticsearch only allows pagination through 10,000 entities via the search API. +If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or using the scroll API to read from the index directly. +::: + +## Modifying an Entity: Mutations + +:::note +Mutations which change Entity metadata are subject to [DataHub Access Policies](../../authorization/policies.md). +This means that DataHub's server will check whether the requesting actor is authorized to perform the action. 
:::

To update an existing Metadata Entity, simply use the `update(urn: String!, input: EntityUpdateInput!)` GraphQL mutation.
For example, to update a Dashboard entity, you can issue the following GraphQL mutation:

```json
mutation updateDashboard {
  updateDashboard(
    urn: "urn:li:dashboard:(looker,baz)",
    input: {
      editableProperties: {
        description: "My new description"
      }
    }
  ) {
    urn
  }
}
```

For more information, please refer to the following links.

- [Adding Tags](/docs/api/tutorials/tags.md#add-tags)
- [Adding Glossary Terms](/docs/api/tutorials/terms.md#add-terms)
- [Adding Domain](/docs/api/tutorials/domains.md#add-domains)
- [Adding Owners](/docs/api/tutorials/owners.md#add-owners)
- [Removing Tags](/docs/api/tutorials/tags.md#remove-tags)
- [Removing Glossary Terms](/docs/api/tutorials/terms.md#remove-terms)
- [Removing Domain](/docs/api/tutorials/domains.md#remove-domains)
- [Removing Owners](/docs/api/tutorials/owners.md#remove-owners)
- [Updating Deprecation](/docs/api/tutorials/deprecation.md#update-deprecation)
- [Editing Description (i.e. Documentation) on Datasets](/docs/api/tutorials/descriptions.md#add-description-on-dataset)
- [Editing Description (i.e. Documentation) on Columns](/docs/api/tutorials/descriptions.md#add-description-on-column)
- [Soft Deleting](/docs/api/tutorials/datasets.md#delete-dataset)

Please refer to [Datahub API Comparison](/docs/api/datahub-apis.md#datahub-api-comparison) to navigate to the use-case oriented guide.

## Handling Errors

In GraphQL, requests that fail do not always result in a non-200 HTTP response. Instead, errors will be
present in the response body inside a top-level `errors` field.

This enables situations in which the client can deal gracefully with partial data returned by the application server.
To verify that no error was returned after making a GraphQL request, make sure you check _both_ the `data` and `errors` fields in the response.

To catch a GraphQL error, simply check the `errors` field inside the GraphQL response. It will contain a message, a path, and a set of extensions
which contain a standard error code.

```json
{
  "errors": [
    {
      "message": "Failed to change ownership for resource urn:li:dataFlow:(airflow,dag_abc,PROD). Expected a corp user urn.",
      "locations": [
        {
          "line": 1,
          "column": 22
        }
      ],
      "path": ["addOwners"],
      "extensions": {
        "code": 400,
        "type": "BAD_REQUEST",
        "classification": "DataFetchingException"
      }
    }
  ]
}
```

With the following error codes officially supported:

| Code | Type | Description |
| ---- | ------------ | ----------- |
| 400 | BAD_REQUEST | The query or mutation was malformed. |
| 403 | UNAUTHORIZED | The current actor is not authorized to perform the requested action. |
| 404 | NOT_FOUND | The resource is not found. |
| 500 | SERVER_ERROR | An internal error has occurred. Check your server logs or contact your DataHub administrator. |

> Visit our [Slack channel](https://slack.datahubproject.io) to ask questions, tell us what we can do better, & make requests for what you'd like to see in the future. Or just
> stop by to say 'Hi'.
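To make the error-handling guidance above concrete, here is a minimal Python sketch that issues a query over HTTP and checks both the `data` and `errors` fields. It assumes a local quickstart at `localhost:8080` and a personal access token; the `requests` library simply stands in for whatever HTTP client you use.

```python
# Minimal sketch: issue a GraphQL query and inspect both `data` and `errors`.
# Assumes a local quickstart at localhost:8080 and a personal access token.
import requests

GRAPHQL_URL = "http://localhost:8080/api/graphql"
TOKEN = "<personal-access-token>"  # placeholder, not a real token

query = """
{
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)") {
    urn
    properties {
      name
    }
  }
}
"""

response = requests.post(
    GRAPHQL_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"query": query, "variables": {}},
)
response.raise_for_status()  # transport-level failures

body = response.json()
if body.get("errors"):
    # GraphQL-level failures usually arrive with a 200 status; surface them explicitly.
    for err in body["errors"]:
        code = err.get("extensions", {}).get("code")
        print(f"GraphQL error ({code}): {err['message']}")
else:
    print(body["data"]["dataset"])
```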
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/graphql-endpoint-development.md b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/graphql-endpoint-development.md new file mode 100644 index 0000000000000..3630da67c370e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/graphql-endpoint-development.md @@ -0,0 +1,66 @@ +--- +title: Creating a New GraphQL Endpoint in GMS +slug: /api/graphql/graphql-endpoint-development +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/graphql/graphql-endpoint-development.md +--- + +# Creating a New GraphQL Endpoint in GMS + +This guide will walk you through how to add a new GraphQL endpoint in GMS. + +> **listOwnershipTypes example:** The `listOwnershipTypes` endpoint will be used as an example. This endpoint was added in [this commit](https://github.com/datahub-project/datahub/commit/ea92b86e6ab4cbb18742fb8db6bc11fae8970cdb#diff-df9c96427d45d7af6d92dd6caa23a349357dbc4bdb915768ab4ce000a4286964) which can be used as reference. + +## GraphQL API changes + +### Adding an endpoint definition + +GraphQL endpoint definitions for GMS are located in the `datahub-graphql-core/src/main/resources/` directory. New endpoints can be added to the relevant file, e.g. `entity.graphql` for entity management endpoints, `search.graphql` for search-related endpoints, etc. Or, for totally new features, new files can be added to this directory. + +> **listOwnershipTypes example:** The endpoint was added in the [`entity.graphql`](https://github.com/datahub-project/datahub/commit/ea92b86e6ab4cbb18742fb8db6bc11fae8970cdb#diff-df9c96427d45d7af6d92dd6caa23a349357dbc4bdb915768ab4ce000a4286964) file since ownership types are being added as an entity. + +#### Query or Mutation? + +Read-only functionality can go in the `Query` section, while mutations go in the `Mutation` section. The definition for new functionality can go in the appropriate section depending on the use case. + +> **listOwnershipTypes example:** The endpoint was added in the `type Query` section because it is read-only functionality. In the same commit, `createOwnershipType`, `updateOwnershipType`, and `deleteOwnershipType` were added in the `type Mutation` section as these are operations that perform writes. + +#### Input and Output Types + +If the new endpoint requires more than a few inputs or outputs, a struct can be created in the same file to collect these fields. + +> **listOwnershipTypes example:** Since this functionality takes and returns quite a few parameters, `input ListOwnershipTypesInput` and `type ListOwnershipTypesResult` were added to represent the input and output structs. In the same PR, no input and output structs were added for `deleteOwnershipType` since the inputs and output are primitive types. + +### Building your changes + +After adding the new endpoint, and new structs if necessary, building the project will generate the Java classes for the new code that can be used in making the server changes. Build the datahub project to make the new symbols available. + +> **listOwnershipTypes example:** The build step will make the new types `ListOwnershipTypesInput` and `ListOwnershipTypesResult` available in a Java IDE. + +## Java Server changes + +We turn now to developing the server-side functionality for the new endpoint. + +### Adding a resolver + +GraphQL queries are handled by `Resolver` classes located in the `datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/resolvers/` directory. 
Resolvers are classes that implement the `DataFetcher<T>` interface, where `T` is a `CompletableFuture` wrapping the endpoint's return type. This interface provides a `get` method that takes in a `DataFetchingEnvironment` and returns a `CompletableFuture` of the endpoint return type. The resolver can contain any services needed to resolve the endpoint, and use them to compute the result.

> **listOwnershipTypes example:** The [`ListOwnershipTypesResolver`](https://github.com/datahub-project/datahub/commit/ea92b86e6ab4cbb18742fb8db6bc11fae8970cdb#diff-d2ad02d0ec286017d032640cfdb289fbdad554ef5f439355104766fa068513ac) class implements `DataFetcher<CompletableFuture<ListOwnershipTypesResult>>` since this is the return type of the endpoint. It contains an `EntityClient` instance variable to handle the ownership type fetching.

Often the structure of the `Resolver` classes is to call a service to receive a response, then use a method to transform the result from the service into the GraphQL type returned.

> **listOwnershipTypes example:** The [`ListOwnershipTypesResolver`](https://github.com/datahub-project/datahub/commit/ea92b86e6ab4cbb18742fb8db6bc11fae8970cdb#diff-d2ad02d0ec286017d032640cfdb289fbdad554ef5f439355104766fa068513ac) calls the `search` method in its `EntityClient` to get the ownership types, then calls the defined `mapUnresolvedOwnershipTypes` function to transform the response into a `ListOwnershipTypesResult`.

Tip: Resolver classes can be tested with unit tests!

> **listOwnershipTypes example:** The reference commit adds the [`ListOwnershipTypeResolverTest` class](https://github.com/datahub-project/datahub/commit/ea92b86e6ab4cbb18742fb8db6bc11fae8970cdb#diff-9443d70b221e36e9d47bfa9244673d1cd553a92ae496d03622932ad0a4832045).

### Adding the resolver to the GMS server

The main GMS server is located in [`GmsGraphQLEngine.java`](https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/GmsGraphQLEngine.java). To hook up the resolver to handle the endpoint, find the relevant section based on whether the new endpoint is a `Query` or a `Mutation` and add the resolver as the `dataFetcher` for the name of the endpoint.

> **listOwnershipTypes example:** The following line of code is added in [`GmsGraphQLEngine`](https://github.com/datahub-project/datahub/commit/ea92b86e6ab4cbb18742fb8db6bc11fae8970cdb#diff-e04c9c2d80cbfd7aa7e3e0f867248464db0f6497684661132d6ead81ded21856): `.dataFetcher("listOwnershipTypes", new ListOwnershipTypesResolver(this.entityClient))`. This uses the `ListOwnershipTypesResolver` to handle queries for the `listOwnershipTypes` endpoint.

## Testing your change

In addition to unit tests for your resolver mentioned above, GraphQL functionality in DataHub can be tested using the built-in [GraphiQL](https://www.gatsbyjs.com/docs/how-to/querying-data/running-queries-with-graphiql/) endpoint. The endpoint is located at `localhost:8080/api/graphiql` on Quickstart and at the equivalent URL for a production instance. This provides fast debuggability for querying GraphQL.
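Beyond GraphiQL, a quick scripted check against the plain `/api/graphql` endpoint can be handy while iterating. The sketch below is illustrative and assumes a local quickstart plus a personal access token; the query shape follows the `listOwnershipTypes` example, so swap in whatever input and result fields your new endpoint actually exposes.

```python
# Illustrative smoke test for a newly added GraphQL endpoint.
# Assumes a local quickstart at localhost:8080 and a personal access token.
import requests

GMS_GRAPHQL = "http://localhost:8080/api/graphql"
TOKEN = "<personal-access-token>"  # placeholder

# Field names below follow the listOwnershipTypes example; adjust them to
# whatever your new endpoint actually returns.
query = """
{
  listOwnershipTypes(input: { start: 0, count: 10 }) {
    start
    count
    total
  }
}
"""

resp = requests.post(
    GMS_GRAPHQL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"query": query, "variables": {}},
)
resp.raise_for_status()
body = resp.json()

# A schema or resolver mistake usually shows up here rather than as a non-200 status.
assert "errors" not in body, body["errors"]
print(body["data"])
```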
See [How to Set Up GraphQL](./how-to-set-up-graphql.md#graphql-explorer-graphiql) for more information diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/how-to-set-up-graphql.md b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/how-to-set-up-graphql.md new file mode 100644 index 0000000000000..79b88e8d87801 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/how-to-set-up-graphql.md @@ -0,0 +1,93 @@ +--- +title: How To Set Up GraphQL +slug: /api/graphql/how-to-set-up-graphql +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/graphql/how-to-set-up-graphql.md +--- + +# How To Set Up GraphQL + +## Preparing Local Datahub Deployment + +The first thing you'll need to use the GraphQL API is a deployed instance of DataHub with some metadata ingested. +For more information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +## Querying the GraphQL API + +DataHub's GraphQL endpoint is served at the path `/api/graphql`, e.g. `https://my-company.datahub.com/api/graphql`. +There are a few options when it comes to querying the GraphQL endpoint. + +For **Testing**: + +- GraphQL Explorer (GraphiQL) +- CURL +- Postman + +For **Production**: + +- GraphQL [Client SDK](https://graphql.org/code/) for the language of your choice +- Basic HTTP client + +> Notice: The DataHub GraphQL endpoint only supports POST requests at this time. + +### GraphQL Explorer (GraphiQL) + +DataHub provides a browser-based GraphQL Explorer Tool ([GraphiQL](https://github.com/graphql/graphiql)) for live interaction with the GraphQL API. This tool is available at the path `/api/graphiql` (e.g. `https://my-company.datahub.com/api/graphiql`) +This interface allows you to easily craft queries and mutations against real metadata stored in your live DataHub deployment. + +To experiment with GraphiQL before deploying it in your live DataHub deployment, you can access a demo site provided by DataHub at https://demo.datahubproject.io/api/graphiql. +For instance, you can create a tag by posting the following query: + +```json +mutation createTag { + createTag(input: + { + name: "Deprecated", + description: "Having this tag means this column or table is deprecated." + }) +} +``` + +For a detailed usage guide, check out [How to use GraphiQL](https://www.gatsbyjs.com/docs/how-to/querying-data/running-queries-with-graphiql/). + +### CURL + +CURL is a command-line tool used for transferring data using various protocols including HTTP, HTTPS, and others. +To query the DataHub GraphQL API using CURL, you can send a `POST` request to the `/api/graphql` endpoint with the GraphQL query in the request body. +Here is an example CURL command to create a tag via GraphQL API: + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation createTag { createTag(input: { name: \"Deprecated\", description: \"Having this tag means this column or table is deprecated.\" }) }", "variables":{}}' +``` + +### Postman + +Postman is a popular API client that provides a graphical user interface for sending requests and viewing responses. +Within Postman, you can create a `POST` request and set the request URL to the `/api/graphql` endpoint. +In the request body, select the `GraphQL` option and enter your GraphQL query in the request body. + +

+ +

+ +Please refer to [Querying with GraphQL](https://learning.postman.com/docs/sending-requests/graphql/graphql/) in the Postman documentation for more information. + +### Authentication + Authorization + +In general, you'll need to provide an [Access Token](../../authentication/personal-access-tokens.md) when querying the GraphQL by +providing an `Authorization` header containing a `Bearer` token. The header should take the following format: + +```bash +Authorization: Bearer +``` + +Authorization for actions exposed by the GraphQL endpoint will be performed based on the actor making the request. +For Personal Access Tokens, the token will carry the user's privileges. Please refer to [Access Token Management](/docs/api/graphql/token-management.md) for more information. + +## What's Next? + +Now that you are ready with GraphQL, how about browsing through some use cases? +Please refer to [Getting Started With GraphQL](/docs/api/graphql/getting-started.md) for more information. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/overview.md b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/overview.md new file mode 100644 index 0000000000000..10e4a2e1be442 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/overview.md @@ -0,0 +1,49 @@ +--- +slug: /api/graphql/overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/graphql/overview.md +--- + +# DataHub GraphQL API + +DataHub provides a rich [`graphql`](https://graphql.org/) API for programmatically interacting with the Entities & Relationships comprising your organization's Metadata Graph. + +## Getting Started + +To begin using the DataHub `graphql` API, please consult the [Getting Started](/docs/api/graphql/getting-started.md). + +For detailed guidance on using `graphql` for specific use cases, please refer to [Datahub API Comparison](/docs/api/datahub-apis.md#datahub-api-comparison). + +> **Pro Tip!** Throughout our API guides, we have examples of using GraphQL API. +> Lookout for the `| GraphQL |` tab within our tutorials. + +## About GraphQL + +[`graphql`](https://graphql.org/) provides a data query language and API with the following characteristics: + +- A **validated specification**: The `graphql` spec verifies a _schema_ on the API server. The server in turn is responsible + for validating incoming queries from the clients against that schema. +- **Strongly typed**: A GraphQL schema declares the universe of types and relationships composing the interface. +- **Document-oriented & hierarchical**: GraphQL makes it eay to ask for related entities using a familiar JSON document + structure. This minimizes the number of round-trip API requests a client must make to answer a particular question. +- **Flexible & efficient**: GraphQL provides a way to ask for only the data you want, and that's it. Ignore all + the rest. It allows you to replace multiple REST calls with one GraphQL call. +- **Large Open Source Ecosystem**: Open source GraphQL projects have been developed for [virtually every programming language](https://graphql.org/code/). With a thriving + community, it offers a sturdy foundation to build upon. + +For these reasons among others DataHub provides a GraphQL API on top of the Metadata Graph, +permitting easy exploration of the Entities & Relationships composing it. + +For more information about the GraphQL specification, check out [Introduction to GraphQL](https://graphql.org/learn/). 
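As a small illustration of the "ask only for the data you want" point, the sketch below fetches a dataset's name, owners, and tags in a single round trip using the Python SDK's `DataHubGraph` client. Treat it as a sketch: it assumes a local quickstart and a personal access token, and the exact field selection should be verified against the schema reference below.

```python
# Illustrative sketch: one GraphQL round trip instead of several REST calls.
# Assumes a local quickstart at localhost:8080 and a personal access token.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(
    DatahubClientConfig(server="http://localhost:8080", token="<personal-access-token>")
)

# Field names follow the GraphQL schema reference; adjust the selection as needed.
query = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    properties { name description }
    ownership { owners { type } }
    tags { tags { tag { name } } }
  }
}
"""

result = graph.execute_graphql(
    query=query,
    variables={"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"},
)
print(result["dataset"])
```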
+ +## GraphQL Schema Reference + +The Reference docs in the sidebar are generated from the DataHub GraphQL schema. Each call to the `/api/graphql` endpoint is +validated against this schema. You can use these docs to understand data that is available for retrieval and operations +that may be performed using the API. + +- Available Operations: [Queries](/graphql/queries.md) (Reads) & [Mutations](/graphql/mutations.md) (Writes) +- Schema Types: [Objects](/graphql/objects.md), [Input Objects](/graphql/inputObjects.md), [Interfaces](/graphql/interfaces.md), [Unions](/graphql/unions.md), [Enums](/graphql/enums.md), [Scalars](/graphql/scalars.md) + +> Visit our [Slack channel](https://slack.datahubproject.io) to ask questions, tell us what we can do better, & make requests for what you'd like to see in the future. Or just +> stop by to say 'Hi'. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/token-management.md b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/token-management.md new file mode 100644 index 0000000000000..986ab38de880d --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/graphql/token-management.md @@ -0,0 +1,143 @@ +--- +title: Access Token Management +slug: /api/graphql/token-management +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/graphql/token-management.md +--- + +# Access Token Management + +DataHub provides the following `graphql` endpoints for managing Access Tokens. In this page you will see examples as well +as explanations as to how to administrate access tokens within the project whether for yourself or others, depending on the caller's privileges. + +_Note_: This API makes use of DataHub Policies to safeguard against improper use. By default, a user will not be able to interact with it at all unless they have at least `Generate Personal Access Tokens` privileges. + +### Generating Access Tokens + +To generate an access token, simply use the `createAccessToken(input: GetAccessTokenInput!)` `graphql` Query. +This endpoint will return an `AccessToken` object, containing the access token string itself alongside with metadata +which will allow you to identify said access token later on. + +For example, to generate an access token for the `datahub` corp user, you can issue the following `graphql` Query: + +_As GraphQL_ + +```graphql +mutation { + createAccessToken( + input: { + type: PERSONAL + actorUrn: "urn:li:corpuser:datahub" + duration: ONE_HOUR + name: "my personal token" + } + ) { + accessToken + metadata { + id + name + description + } + } +} +``` + +_As CURL_ + +```curl +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'X-DataHub-Actor: urn:li:corpuser:datahub' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query":"mutation { createAccessToken(input: { type: PERSONAL, actorUrn: \"urn:li:corpuser:datahub\", duration: ONE_HOUR, name: \"my personal token\" } ) { accessToken metadata { id name description} } }", "variables":{}}' +``` + +### Listing Access Tokens + +Listing tokens is a powerful endpoint that allows you to list the tokens owned by a particular user (ie. YOU). +To list all tokens that you own, you must specify a filter with: `{field: "actorUrn", value: ""}` configuration. 
+ +_As GraphQL_ + +```graphql +{ + listAccessTokens( + input: { + start: 0 + count: 100 + filters: [{ field: "ownerUrn", value: "urn:li:corpuser:datahub" }] + } + ) { + start + count + total + tokens { + urn + id + actorUrn + } + } +} +``` + +_As CURL_ + +```curl +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'X-DataHub-Actor: urn:li:corpuser:datahub' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query":"{ listAccessTokens(input: {start: 0, count: 100, filters: [{field: \"ownerUrn\", value: \"urn:li:corpuser:datahub\"}]}) { start count total tokens {urn id actorUrn} } }", "variables":{}}' +``` + +Admin users can also list tokens owned by other users of the platform. To list tokens belonging to other users, you must have the `Manage All Access Tokens` Platform privilege. + +_As GraphQL_ + +```graphql +{ + listAccessTokens(input: { start: 0, count: 100, filters: [] }) { + start + count + total + tokens { + urn + id + actorUrn + } + } +} +``` + +_As CURL_ + +```curl +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'X-DataHub-Actor: urn:li:corpuser:datahub' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query":"{ listAccessTokens(input: {start: 0, count: 100, filters: []}) { start count total tokens {urn id actorUrn} } }", "variables":{}}' +``` + +Other filters besides `actorUrn=` are possible. You can filter by property in the `DataHubAccessTokenInfo` aspect which you can find in the Entities documentation. + +### Revoking Access Tokens + +To revoke an existing access token, you can use the `revokeAccessToken` mutation. + +_As GraphQL_ + +```graphql +mutation { + revokeAccessToken(tokenId: "HnMJylxuowJ1FKN74BbGogLvXCS4w+fsd3MZdI35+8A=") +} +``` + +```curl +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'X-DataHub-Actor: urn:li:corpuser:datahub' \ +--header 'Content-Type: application/json' \ +--data-raw '{"query":"mutation {revokeAccessToken(tokenId: \"HnMJylxuowJ1FKN74BbGogLvXCS4w+fsd3MZdI35+8A=\")}","variables":{}}}' +``` + +This endpoint will return a boolean detailing whether the operation was successful. In case of failure, an error message will appear explaining what went wrong. + +> Visit our [Slack channel](https://slack.datahubproject.io) to ask questions, tell us what we can do better, & make requests for what you'd like to see in the future. Or just +> stop by to say 'Hi'. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/openapi/openapi-usage-guide.md b/docs-website/versioned_docs/version-0.10.4/docs/api/openapi/openapi-usage-guide.md new file mode 100644 index 0000000000000..44a9fd0d9db43 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/openapi/openapi-usage-guide.md @@ -0,0 +1,520 @@ +--- +title: DataHub OpenAPI Guide +sidebar_label: OpenAPI Guide +slug: /api/openapi/openapi-usage-guide +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/openapi/openapi-usage-guide.md +--- + +# DataHub OpenAPI Guide + +## Why OpenAPI + +The OpenAPI standard is a widely used documentation and design approach for REST-ful APIs. +To make it easier to integrate with DataHub, we are publishing an OpenAPI based set of endpoints. + +Read [the DataHub API overview](../datahub-apis.md) to understand the rationale behind the different API-s and when to use each one. 
+ +## Locating the OpenAPI endpoints + +Currently, the OpenAPI endpoints are isolated to a servlet on GMS and are automatically deployed with a GMS server. +The servlet includes auto-generation of an OpenAPI UI, also known as Swagger, which is available at **GMS_SERVER_HOST:GMS_PORT/openapi/swagger-ui/index.html**. For example, the Quickstart running locally exposes this at http://localhost:8080/openapi/swagger-ui/index.html. + +This is also exposed through DataHub frontend as a proxy with the same endpoint, but GMS host and port replaced with DataHub frontend's url ([Local Quickstart link](http://localhost:9002/openapi/swagger-ui/index.html)) and is available in the top right dropdown under the user profile picture as a link. +![image](https://github.com/datahub-project/static-assets/blob/main/imgs/api/openapi/openapi_dropdown.png?raw=true) + +Note that it is possible to get the raw JSON or YAML formats of the OpenAPI spec by navigating to [**BASE_URL/openapi/v3/api-docs**](http://localhost:9002/openapi/v3/api-docs) or [**BASE_URL/openapi/v3/api-docs.yaml**](http://localhost:9002/openapi/v3/api-docs.yaml). +The raw forms can be fed into codegen systems to generate client side code in the language of your choice that support the OpenAPI format. We have noticed varying degrees of maturity with different languages in these codegen systems so some may require customizations to be fully compatible. + +The OpenAPI UI includes explorable schemas for request and response objects that are fully documented. The models used +in the OpenAPI UI are all autogenerated at build time from the PDL models to JSON Schema compatible Java Models. + +## Understanding the OpenAPI endpoints + +While the full OpenAPI spec is always available at [**GMS_SERVER_HOST:GMS_PORT/openapi/swagger-ui/index.html**](http://localhost:8080/openapi/swagger-ui/index.html), here's a quick overview of the main OpenAPI endpoints and their purpose. + +### Entities (/entities) + +The entities endpoints are intended for reads and writes to the metadata graph. The entire DataHub metadata model is available for you to write to (as entity, aspect pairs) or to read an individual entity's metadata from. See [examples](#entities-entities-endpoint) below. + +### Relationships (/relationships) + +The relationships endpoints are intended for you to query the graph, to navigate relationships from one entity to others. See [examples](#relationships-relationships-endpoint) below. + +### Timeline (/timeline) + +The timeline endpoints are intended for querying the versioned history of a given entity over time. For example, you can query a dataset for all schema changes that have happened to it over time, or all documentation changes that have happened to it. See [this](../../dev-guides/timeline.md) guide for more details. + +### Platform (/platform) + +Even lower-level API-s that allow you to write metadata events into the DataHub platform using a standard format. 
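Before moving on to example requests: since the raw spec can be fed into codegen systems, a short script is often enough to pull it down for your generator of choice. The sketch below is illustrative and uses the quickstart URLs mentioned above; the generator invocation in the trailing comment is just one possible choice.

```python
# Illustrative sketch: download the raw OpenAPI spec from a local quickstart
# so it can be handed to an OpenAPI code generator.
import json

import requests

BASE_URL = "http://localhost:8080"  # GMS; use the frontend URL (port 9002) to go through the proxy

resp = requests.get(f"{BASE_URL}/openapi/v3/api-docs")
resp.raise_for_status()

spec = resp.json()
print(f"OpenAPI version: {spec.get('openapi')}, paths: {len(spec.get('paths', {}))}")

with open("datahub-openapi.json", "w") as f:
    json.dump(spec, f, indent=2)

# The saved file can then be passed to a generator of your choice, e.g.:
#   openapi-generator-cli generate -i datahub-openapi.json -g python -o ./datahub-openapi-client
```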
+ +### Example Requests + +#### Entities (/entities) endpoint + +##### POST + +```shell +curl --location --request POST 'localhost:8080/openapi/entities/v1/' \ +--header 'Content-Type: application/json' \ +--header 'Accept: application/json' \ +--header 'Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6ImRhdGFodWIiLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMSIsImV4cCI6MTY1MDY2MDY1NSwianRpIjoiM2E4ZDY3ZTItOTM5Yi00NTY3LWE0MjYtZDdlMDA1ZGU3NjJjIiwic3ViIjoiZGF0YWh1YiIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.pp_vW2u1tiiTT7U0nDF2EQdcayOMB8jatiOA8Je4JJA' \ +--data-raw '[ + { + "aspect": { + "__type": "SchemaMetadata", + "schemaName": "SampleHdfsSchema", + "platform": "urn:li:dataPlatform:platform", + "platformSchema": { + "__type": "MySqlDDL", + "tableSchema": "schema" + }, + "version": 0, + "created": { + "time": 1621882982738, + "actor": "urn:li:corpuser:etl", + "impersonator": "urn:li:corpuser:jdoe" + }, + "lastModified": { + "time": 1621882982738, + "actor": "urn:li:corpuser:etl", + "impersonator": "urn:li:corpuser:jdoe" + }, + "hash": "", + "fields": [ + { + "fieldPath": "county_fips_codefg", + "jsonPath": "null", + "nullable": true, + "description": "null", + "type": { + "type": { + "__type": "StringType" + } + }, + "nativeDataType": "String()", + "recursive": false + }, + { + "fieldPath": "county_name", + "jsonPath": "null", + "nullable": true, + "description": "null", + "type": { + "type": { + "__type": "StringType" + } + }, + "nativeDataType": "String()", + "recursive": false + } + ] + }, + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)" + } +]' +``` + +##### GET + +```shell +curl --location --request GET 'localhost:8080/openapi/entities/v1/latest?urns=urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)&aspectNames=schemaMetadata' \ +--header 'Accept: application/json' \ +--header 'Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6ImRhdGFodWIiLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMSIsImV4cCI6MTY1MDY2MDY1NSwianRpIjoiM2E4ZDY3ZTItOTM5Yi00NTY3LWE0MjYtZDdlMDA1ZGU3NjJjIiwic3ViIjoiZGF0YWh1YiIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.pp_vW2u1tiiTT7U0nDF2EQdcayOMB8jatiOA8Je4JJA' +``` + +##### DELETE + +```shell +curl --location --request DELETE 'localhost:8080/openapi/entities/v1/?urns=urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)&soft=true' \ +--header 'Accept: application/json' \ +--header 'Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6ImRhdGFodWIiLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMSIsImV4cCI6MTY1MDY2MDY1NSwianRpIjoiM2E4ZDY3ZTItOTM5Yi00NTY3LWE0MjYtZDdlMDA1ZGU3NjJjIiwic3ViIjoiZGF0YWh1YiIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.pp_vW2u1tiiTT7U0nDF2EQdcayOMB8jatiOA8Je4JJA' +``` + +#### Postman Collection + +Collection includes a POST, GET, and DELETE for a single entity with a SchemaMetadata aspect + +```json +{ + "info": { + "_postman_id": "87b7401c-a5dc-47e4-90b4-90fe876d6c28", + "name": "DataHub OpenAPI", + "description": "A description", + "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json" + }, + "item": [ + { + "name": "entities/v1", + "item": [ + { + "name": "post Entities 1", + "request": { + "method": "POST", + "header": [ + { + "key": "Content-Type", + "value": "application/json" + }, + { + "key": "Accept", + "value": "application/json" + } + ], + "body": { + "mode": "raw", + "raw": "[\n {\n \"aspect\": {\n \"__type\": \"SchemaMetadata\",\n 
\"schemaName\": \"SampleHdfsSchema\",\n \"platform\": \"urn:li:dataPlatform:platform\",\n \"platformSchema\": {\n \"__type\": \"MySqlDDL\",\n \"tableSchema\": \"schema\"\n },\n \"version\": 0,\n \"created\": {\n \"time\": 1621882982738,\n \"actor\": \"urn:li:corpuser:etl\",\n \"impersonator\": \"urn:li:corpuser:jdoe\"\n },\n \"lastModified\": {\n \"time\": 1621882982738,\n \"actor\": \"urn:li:corpuser:etl\",\n \"impersonator\": \"urn:li:corpuser:jdoe\"\n },\n \"hash\": \"\",\n \"fields\": [\n {\n \"fieldPath\": \"county_fips_codefg\",\n \"jsonPath\": \"null\",\n \"nullable\": true,\n \"description\": \"null\",\n \"type\": {\n \"type\": {\n \"__type\": \"StringType\"\n }\n },\n \"nativeDataType\": \"String()\",\n \"recursive\": false\n },\n {\n \"fieldPath\": \"county_name\",\n \"jsonPath\": \"null\",\n \"nullable\": true,\n \"description\": \"null\",\n \"type\": {\n \"type\": {\n \"__type\": \"StringType\"\n }\n },\n \"nativeDataType\": \"String()\",\n \"recursive\": false\n }\n ]\n },\n \"aspectName\": \"schemaMetadata\",\n \"entityType\": \"dataset\",\n \"entityUrn\": \"urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)\"\n }\n]", + "options": { + "raw": { + "language": "json" + } + } + }, + "url": { + "raw": "{{baseUrl}}/openapi/entities/v1/", + "host": ["{{baseUrl}}"], + "path": ["openapi", "entities", "v1", ""] + } + }, + "response": [ + { + "name": "OK", + "originalRequest": { + "method": "POST", + "header": [], + "body": { + "mode": "raw", + "raw": "[\n {\n \"aspect\": {\n \"value\": \"\"\n },\n \"aspectName\": \"aliquip ipsum tempor\",\n \"entityType\": \"ut est\",\n \"entityUrn\": \"enim in nulla\",\n \"entityKeyAspect\": {\n \"value\": \"\"\n }\n },\n {\n \"aspect\": {\n \"value\": \"\"\n },\n \"aspectName\": \"ipsum id\",\n \"entityType\": \"deser\",\n \"entityUrn\": \"aliqua sit\",\n \"entityKeyAspect\": {\n \"value\": \"\"\n }\n }\n]", + "options": { + "raw": { + "language": "json" + } + } + }, + "url": { + "raw": "{{baseUrl}}/entities/v1/", + "host": ["{{baseUrl}}"], + "path": ["entities", "v1", ""] + } + }, + "status": "OK", + "code": 200, + "_postman_previewlanguage": "json", + "header": [ + { + "key": "Content-Type", + "value": "application/json" + } + ], + "cookie": [], + "body": "[\n \"c\",\n \"labore dolor exercitation in\"\n]" + } + ] + }, + { + "name": "delete Entities", + "request": { + "method": "DELETE", + "header": [ + { + "key": "Accept", + "value": "application/json" + } + ], + "url": { + "raw": "{{baseUrl}}/openapi/entities/v1/?urns=urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)&soft=true", + "host": ["{{baseUrl}}"], + "path": ["openapi", "entities", "v1", ""], + "query": [ + { + "key": "urns", + "value": "urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)", + "description": "(Required) A list of raw urn strings, only supports a single entity type per request." 
+ }, + { + "key": "urns", + "value": "labore dolor exercitation in", + "description": "(Required) A list of raw urn strings, only supports a single entity type per request.", + "disabled": true + }, + { + "key": "soft", + "value": "true", + "description": "Determines whether the delete will be soft or hard, defaults to true for soft delete" + } + ] + } + }, + "response": [ + { + "name": "OK", + "originalRequest": { + "method": "DELETE", + "header": [], + "url": { + "raw": "{{baseUrl}}/entities/v1/?urns=urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)&soft=true", + "host": ["{{baseUrl}}"], + "path": ["entities", "v1", ""], + "query": [ + { + "key": "urns", + "value": "urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)" + }, + { + "key": "urns", + "value": "officia occaecat elit dolor", + "disabled": true + }, + { + "key": "soft", + "value": "true" + } + ] + } + }, + "status": "OK", + "code": 200, + "_postman_previewlanguage": "json", + "header": [ + { + "key": "Content-Type", + "value": "application/json" + } + ], + "cookie": [], + "body": "[\n {\n \"rowsRolledBack\": [\n {\n \"urn\": \"urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)\"\n }\n ],\n \"rowsDeletedFromEntityDeletion\": 1\n }\n]" + } + ] + }, + { + "name": "get Entities", + "protocolProfileBehavior": { + "disableUrlEncoding": false + }, + "request": { + "method": "GET", + "header": [ + { + "key": "Accept", + "value": "application/json" + } + ], + "url": { + "raw": "{{baseUrl}}/openapi/entities/v1/latest?urns=urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)&aspectNames=schemaMetadata", + "host": ["{{baseUrl}}"], + "path": ["openapi", "entities", "v1", "latest"], + "query": [ + { + "key": "urns", + "value": "urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)", + "description": "(Required) A list of raw urn strings, only supports a single entity type per request." 
+ }, + { + "key": "urns", + "value": "labore dolor exercitation in", + "description": "(Required) A list of raw urn strings, only supports a single entity type per request.", + "disabled": true + }, + { + "key": "aspectNames", + "value": "schemaMetadata", + "description": "The list of aspect names to retrieve" + }, + { + "key": "aspectNames", + "value": "labore dolor exercitation in", + "description": "The list of aspect names to retrieve", + "disabled": true + } + ] + } + }, + "response": [ + { + "name": "OK", + "originalRequest": { + "method": "GET", + "header": [], + "url": { + "raw": "{{baseUrl}}/entities/v1/latest?urns=urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)&aspectNames=schemaMetadata", + "host": ["{{baseUrl}}"], + "path": ["entities", "v1", "latest"], + "query": [ + { + "key": "urns", + "value": "non exercitation occaecat", + "disabled": true + }, + { + "key": "urns", + "value": "urn:li:dataset:(urn:li:dataPlatform:platform,testSchemaIngest,PROD)" + }, + { + "key": "aspectNames", + "value": "non exercitation occaecat", + "disabled": true + }, + { + "key": "aspectNames", + "value": "schemaMetadata" + } + ] + } + }, + "status": "OK", + "code": 200, + "_postman_previewlanguage": "json", + "header": [ + { + "key": "Content-Type", + "value": "application/json" + } + ], + "cookie": [], + "body": "{\n \"responses\": {\n \"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)\": {\n \"entityName\": \"dataset\",\n \"urn\": \"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)\",\n \"aspects\": {\n \"datasetKey\": {\n \"name\": \"datasetKey\",\n \"type\": \"VERSIONED\",\n \"version\": 0,\n \"value\": {\n \"__type\": \"DatasetKey\",\n \"platform\": \"urn:li:dataPlatform:hive\",\n \"name\": \"SampleHiveDataset\",\n \"origin\": \"PROD\"\n },\n \"created\": {\n \"time\": 1650657843351,\n \"actor\": \"urn:li:corpuser:__datahub_system\"\n }\n },\n \"schemaMetadata\": {\n \"name\": \"schemaMetadata\",\n \"type\": \"VERSIONED\",\n \"version\": 0,\n \"value\": {\n \"__type\": \"SchemaMetadata\",\n \"schemaName\": \"SampleHiveSchema\",\n \"platform\": \"urn:li:dataPlatform:hive\",\n \"version\": 0,\n \"created\": {\n \"time\": 1581407189000,\n \"actor\": \"urn:li:corpuser:jdoe\"\n },\n \"lastModified\": {\n \"time\": 1581407189000,\n \"actor\": \"urn:li:corpuser:jdoe\"\n },\n \"hash\": \"\",\n \"platformSchema\": {\n \"__type\": \"KafkaSchema\",\n \"documentSchema\": \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"SampleHiveSchema\\\",\\\"namespace\\\":\\\"com.linkedin.dataset\\\",\\\"doc\\\":\\\"Sample Hive dataset\\\",\\\"fields\\\":[{\\\"name\\\":\\\"field_foo\\\",\\\"type\\\":[\\\"string\\\"]},{\\\"name\\\":\\\"field_bar\\\",\\\"type\\\":[\\\"boolean\\\"]}]}\"\n },\n \"fields\": [\n {\n \"fieldPath\": \"field_foo\",\n \"nullable\": false,\n \"description\": \"Foo field description\",\n \"type\": {\n \"type\": {\n \"__type\": \"BooleanType\"\n }\n },\n \"nativeDataType\": \"varchar(100)\",\n \"recursive\": false,\n \"isPartOfKey\": true\n },\n {\n \"fieldPath\": \"field_bar\",\n \"nullable\": false,\n \"description\": \"Bar field description\",\n \"type\": {\n \"type\": {\n \"__type\": \"BooleanType\"\n }\n },\n \"nativeDataType\": \"boolean\",\n \"recursive\": false,\n \"isPartOfKey\": false\n }\n ]\n },\n \"created\": {\n \"time\": 1650610810000,\n \"actor\": \"urn:li:corpuser:UNKNOWN\"\n }\n }\n }\n }\n }\n}" + } + ] + } + ], + "auth": { + "type": "bearer", + "bearer": [ + { + "key": "token", + "value": "{{token}}", + "type": "string" + } + ] + 
}, + "event": [ + { + "listen": "prerequest", + "script": { + "type": "text/javascript", + "exec": [""] + } + }, + { + "listen": "test", + "script": { + "type": "text/javascript", + "exec": [""] + } + } + ] + } + ], + "event": [ + { + "listen": "prerequest", + "script": { + "type": "text/javascript", + "exec": [""] + } + }, + { + "listen": "test", + "script": { + "type": "text/javascript", + "exec": [""] + } + } + ], + "variable": [ + { + "key": "baseUrl", + "value": "localhost:8080", + "type": "string" + }, + { + "key": "token", + "value": "eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6ImRhdGFodWIiLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMSIsImV4cCI6MTY1MDY2MDY1NSwianRpIjoiM2E4ZDY3ZTItOTM5Yi00NTY3LWE0MjYtZDdlMDA1ZGU3NjJjIiwic3ViIjoiZGF0YWh1YiIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.pp_vW2u1tiiTT7U0nDF2EQdcayOMB8jatiOA8Je4JJA", + "type": "default" + } + ] +} +``` + +#### Relationships (/relationships) endpoint + +##### GET + +**Sample Request** + +```shell +curl -X 'GET' \ + 'http://localhost:8080/openapi/relationships/v1/?urn=urn%3Ali%3Acorpuser%3Adatahub&relationshipTypes=IsPartOf&direction=INCOMING&start=0&count=200' \ + -H 'accept: application/json' +``` + +**Sample Response** + +```json +{ + "start": 0, + "count": 2, + "total": 2, + "entities": [ + { + "relationshipType": "IsPartOf", + "urn": "urn:li:corpGroup:bfoo" + }, + { + "relationshipType": "IsPartOf", + "urn": "urn:li:corpGroup:jdoe" + } + ] +} +``` + +## Programmatic Usage + +Programmatic usage of the models can be done through the Java Rest Emitter which includes the generated models. A minimal +Java project for emitting to the OpenAPI endpoints would need the following dependencies (gradle format): + +```groovy +dependencies { + implementation 'io.acryl:datahub-client:' + implementation 'org.apache.httpcomponents:httpclient:' + implementation 'org.apache.httpcomponents:httpasyncclient:' +} +``` + +### Writing metadata events to the /platform endpoints + +The following code emits metadata events through OpenAPI by constructing a list of `UpsertAspectRequest`s. Behind the scenes, this is using the **/platform/entities/v1** endpoint to send metadata to GMS. 
+ +```java +import io.datahubproject.openapi.generated.DatasetProperties; +import datahub.client.rest.RestEmitter; +import datahub.event.UpsertAspectRequest; +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.ExecutionException; + + +public class Main { + public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { + RestEmitter emitter = RestEmitter.createWithDefaults(); + + List requests = new ArrayList<>(); + UpsertAspectRequest upsertAspectRequest = UpsertAspectRequest.builder() + .entityType("dataset") + .entityUrn("urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my-other-dataset.user-table,PROD)") + .aspect(new DatasetProperties().description("This is the canonical User profile dataset")) + .build(); + UpsertAspectRequest upsertAspectRequest2 = UpsertAspectRequest.builder() + .entityType("dataset") + .entityUrn("urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.another-dataset.user-table,PROD)") + .aspect(new DatasetProperties().description("This is the canonical User profile dataset 2")) + .build(); + requests.add(upsertAspectRequest); + requests.add(upsertAspectRequest2); + System.out.println(emitter.emit(requests, null).get()); + System.exit(0); + } +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/restli/evaluate-tests.md b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/evaluate-tests.md new file mode 100644 index 0000000000000..f33898faf040a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/evaluate-tests.md @@ -0,0 +1,28 @@ +--- +title: Evaluate Tests Endpoint +slug: /api/restli/evaluate-tests +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/restli/evaluate-tests.md +--- + +# Evaluate Tests Endpoint + + + +You can do a HTTP POST request to `/gms/test?action=evaluate` endpoint with the `urn` as part of JSON Payload to run metadata tests for the particular URN. + +``` +curl --location --request POST 'https://DOMAIN.acryl.io/gms/test?action=evaluate' \ +--header 'Authorization: Bearer TOKEN' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "urn": "YOUR_URN" +}' +``` + +w +The supported parameters are + +- `urn` - Required URN string +- `shouldPush` - Optional Boolean - whether or not to push the results to persist them +- `testUrns` - Optional List of string - If you wish to get specific test URNs evaluated diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/restli/get-elastic-task-status.md b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/get-elastic-task-status.md new file mode 100644 index 0000000000000..7cb67e42b5b2c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/get-elastic-task-status.md @@ -0,0 +1,59 @@ +--- +title: Get ElasticSearch Task Status Endpoint +slug: /api/restli/get-elastic-task-status +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/restli/get-elastic-task-status.md +--- + +# Get ElasticSearch Task Status Endpoint + +You can do a HTTP POST request to `/gms/operations?action=getEsTaskStatus` endpoint to see the status of the input task running in ElasticSearch. For example, the task ID given by the [`truncateTimeseriesAspect` endpoint](./truncate-time-series-aspect.md). 
The task ID can be passed in as a string with node name and task ID separated by a colon (as is output by the previous API), or the node name and task ID parameters separately. + +``` +curl --location --request POST 'https://demo.datahubproject.io/api/gms/operations?action=getEsTaskStatus' \ +--header 'Authorization: Bearer TOKEN' +--header 'Content-Type: application/json' \ +--data-raw '{ + "task": "aB1cdEf2GHIJKLMnoPQr3S:123456" +}' + +curl --location --request POST http://localhost:8080/operations\?action\=getEsTaskStatus \ +--header 'Authorization: Bearer TOKEN' +--header 'Content-Type: application/json' \ +--data-raw '{ + "nodeId": "aB1cdEf2GHIJKLMnoPQr3S", + taskId: 12345 +}' +``` + +The output will be a string representing a JSON object with the task status. + +``` +{ + "value": "{\"error\":\"Could not get task status for XIAMx5WySACgg9XxBgaKmw:12587\"}" +} +``` + +``` +"{ + "completed": true, + "taskId": "qhxGdzytQS-pQek8CwBCZg:54654", + "runTimeNanos": 1179458, + "status": "{ + "total": 0, + "updated": 0, + "created": 0, + "deleted": 0, + "batches": 0, + "version_conflicts": 0, + "noops": 0, + "retries": { + "bulk": 0, + "search": 0 + }, + "throttled_millis": 0, + "requests_per_second": -1.0, + "throttled_until_millis": 0 + } +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/restli/get-index-sizes.md b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/get-index-sizes.md new file mode 100644 index 0000000000000..d6db969348879 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/get-index-sizes.md @@ -0,0 +1,26 @@ +--- +title: Get Index Sizes Endpoint +slug: /api/restli/get-index-sizes +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/restli/get-index-sizes.md +--- + +# Get Index Sizes Endpoint + +You can do a HTTP POST request to `/gms/operations?action=getIndexSizes` endpoint with no parameters to see the size of indices in ElasticSearch. For now, only timeseries indices are supported, as they can grow indefinitely, and the `truncateTimeseriesAspect` endpoint is provided to clean up old entries. This endpoint can be used in conjunction with the cleanup endpoint to see which indices are the largest before truncation. + +``` +curl --location --request POST 'https://demo.datahubproject.io/api/gms/operations?action=getIndexSizes' \ +--header 'Authorization: Bearer TOKEN' +``` + +The endpoint takes no parameters, and the output will be a string representing a JSON object containing the following information about each index: + +``` + { + "aspectName": "datasetusagestatistics", + "sizeMb": 0.208, + "indexName": "dataset_datasetusagestatisticsaspect_v1", + "entityName": "dataset" + } +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/restli/restli-overview.md b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/restli-overview.md new file mode 100644 index 0000000000000..04dc344366de2 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/restli-overview.md @@ -0,0 +1,1443 @@ +--- +title: Rest.li API +slug: /api/restli/restli-overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/restli/restli-overview.md +--- + +# Rest.li API + +You can access basic documentation on the API endpoints by opening the `/restli/docs` endpoint in the browser. 
+ +``` +python -c "import webbrowser; webbrowser.open('http://localhost:8080/restli/docs', new=2)" +``` + +\*Please note that because DataHub is in a period of rapid development, the APIs below are subject to change. + +#### Sample API Calls + +#### Ingesting Aspects + +To ingest individual aspects into DataHub, you can use the following CURL: + +```shell +curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "proposal" : { + "entityType": "dataset", + "entityUrn" : "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)", + "changeType" : "UPSERT", + "aspectName" : "datasetUsageStatistics", + "aspect" : { + "value" : "{ \"timestampMillis\":1629840771000,\"uniqueUserCount\" : 10, \"totalSqlQueries\": 20, \"fieldCounts\": [ {\"fieldPath\": \"col1\", \"count\": 20}, {\"fieldPath\" : \"col2\", \"count\": 5} ]}", + "contentType": "application/json" + } + } +}' +``` + +Notice that you need to provide the target entity urn, the entity type, a change type (`UPSERT` + `DELETE` supported), +the aspect name, and a JSON-serialized aspect, which corresponds to the PDL schema defined for the aspect. + +For more examples of serialized aspect payloads, see [bootstrap_mce.json](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/mce_files/bootstrap_mce.json). + +#### Ingesting Entities (Legacy) + +> Note - we are deprecating support for ingesting Entities via Snapshots. Please see **Ingesting Aspects** above for the latest +> guidance around ingesting metadata into DataHub without defining or changing the legacy snapshot models. (e.g. using ConfigEntityRegistry) + +The Entity Snapshot Ingest endpoints allow you to ingest multiple aspects about a particular entity at the same time. 
+ +##### Create a user + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.CorpUserSnapshot":{ + "urn":"urn:li:corpuser:footbarusername", + "aspects":[ + { + "com.linkedin.identity.CorpUserInfo":{ + "active":true, + "displayName":"Foo Bar", + "fullName":"Foo Bar", + "email":"fbar@linkedin.com" + } + } + ] + } + } + } +}' +``` + +##### Create a group + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.CorpGroupSnapshot":{ + "urn":"urn:li:corpGroup:dev", + "aspects":[ + { + "com.linkedin.identity.CorpGroupInfo":{ + "email":"dev@linkedin.com", + "admins":[ + "urn:li:corpUser:jdoe" + ], + "members":[ + "urn:li:corpUser:datahub", + "urn:li:corpUser:jdoe" + ], + "groups":[ + + ] + } + } + ] + } + } + } +}' +``` + +##### Create a dataset + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.DatasetSnapshot":{ + "urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)", + "aspects":[ + { + "com.linkedin.common.Ownership":{ + "owners":[ + { + "owner":"urn:li:corpuser:fbar", + "type":"DATAOWNER" + } + ], + "lastModified":{ + "time":0, + "actor":"urn:li:corpuser:fbar" + } + } + }, + { + "com.linkedin.common.InstitutionalMemory":{ + "elements":[ + { + "url":"https://www.linkedin.com", + "description":"Sample doc", + "createStamp":{ + "time":0, + "actor":"urn:li:corpuser:fbar" + } + } + ] + } + }, + { + "com.linkedin.schema.SchemaMetadata":{ + "schemaName":"FooEvent", + "platform":"urn:li:dataPlatform:foo", + "version":0, + "created":{ + "time":0, + "actor":"urn:li:corpuser:fbar" + }, + "lastModified":{ + "time":0, + "actor":"urn:li:corpuser:fbar" + }, + "hash":"", + "platformSchema":{ + "com.linkedin.schema.KafkaSchema":{ + "documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}" + } + }, + "fields":[ + { + "fieldPath":"foo", + "description":"Bar", + "nativeDataType":"string", + "type":{ + "type":{ + "com.linkedin.schema.StringType":{ + + } + } + } + } + ] + } + } + ] + } + } + } +}' +``` + +##### Create a chart + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.ChartSnapshot":{ + "urn":"urn:li:chart:(looker,baz1)", + "aspects":[ + { + "com.linkedin.chart.ChartInfo":{ + "title":"Baz Chart 1", + "description":"Baz Chart 1", + "inputs":[ + { + "string":"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" + } + ], + "lastModified":{ + "created":{ + "time":0, + "actor":"urn:li:corpuser:jdoe" + }, + "lastModified":{ + "time":0, + "actor":"urn:li:corpuser:datahub" + } + } + } + } + ] + } + } + } +}' +``` + +##### Create a dashboard + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.DashboardSnapshot":{ + "urn":"urn:li:dashboard:(looker,baz)", + "aspects":[ + { + "com.linkedin.dashboard.DashboardInfo":{ + "title":"Baz Dashboard", + "description":"Baz Dashboard", + "charts":[ + "urn:li:chart:(looker,baz1)", + "urn:li:chart:(looker,baz2)" + ], + "lastModified":{ + "created":{ + "time":0, + 
"actor":"urn:li:corpuser:jdoe" + }, + "lastModified":{ + "time":0, + "actor":"urn:li:corpuser:datahub" + } + } + } + } + ] + } + } + } +}' +``` + +##### Create Tags + +To create a new tag called "Engineering", we can use the following curl. + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.TagSnapshot":{ + "urn":"urn:li:tag:Engineering", + "aspects":[ + { + "com.linkedin.tag.TagProperties":{ + "name":"Engineering", + "description":"The tag will be assigned to all assets owned by the Eng org." + } + } + ] + } + } + } +}' +``` + +This tag can subsequently be associated with a Data Asset using the "Global Tags" aspect associated with each. For example, +to add a tag to a Dataset, you can use the following CURL: + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.DatasetSnapshot":{ + "urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)", + "aspects":[ + { + "com.linkedin.common.GlobalTags":{ + "tags":[ + { + "tag":"urn:li:tag:Engineering" + } + ] + } + } + ] + } + } + } +}' +``` + +And to add the tag to a field in a particular Dataset's schema, you can use a CURL to update the EditableSchemaMetadata Aspect: + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.DatasetSnapshot":{ + "urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)", + "aspects":[ + { + "com.linkedin.schema.EditableSchemaMetadata": { + "editableSchemaFieldInfo":[ + { + "fieldPath":"myFieldName", + "globalTags": { + "tags":[ + { + "tag":"urn:li:tag:Engineering" + } + ] + } + } + ] + } + } + ] + } + } + } +}' +``` + +##### Soft Deleting an Entity + +DataHub uses a special "Status" aspect associated with each entity to represent the lifecycle state of an Entity. +To soft delete an entire Entity, you can use the special "Status" aspect. Note that soft deletion means that +an entity will not be discoverable via Search or Browse, but its entity page will still be viewable. + +For example, to delete a particular chart: + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.ChartSnapshot":{ + "aspects":[ + { + "com.linkedin.common.Status":{ + "removed": true + } + } + ], + "urn":"urn:li:chart:(looker,baz1)" + } + } + } +}' +``` + +To re-enable the Entity, you can make a similar request: + +``` +curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{ + "entity":{ + "value":{ + "com.linkedin.metadata.snapshot.ChartSnapshot":{ + "aspects":[ + { + "com.linkedin.common.Status":{ + "removed": false + } + } + ], + "urn":"urn:li:chart:(looker,baz1)" + } + } + } +}' +``` + +To issue a hard delete or soft-delete, or undo a particular ingestion run, you can use the [DataHub CLI](docs/how/delete-metadata.md). 
+ +#### Retrieving Entity Aspects + +Simply curl the `entitiesV2` endpoint of GMS: + +``` +curl 'http://localhost:8080/entitiesV2/' +``` + +For example, to retrieve the latest aspects associated with the "SampleHdfsDataset" `Dataset`: + +``` +curl --header 'X-RestLi-Protocol-Version: 2.0.0' 'http://localhost:8080/entitiesV2/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahdfs%2CSampleHdfsDataset%2CPROD%29' +``` + +**Example Response** + +```json +{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "aspects": { + "editableSchemaMetadata": { + "name": "editableSchemaMetadata", + "version": 0, + "value": { + "created": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + }, + "editableSchemaFieldInfo": [ + { + "fieldPath": "shipment_info", + "globalTags": { + "tags": [ + { + "tag": "urn:li:tag:Legacy" + } + ] + } + } + ], + "lastModified": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + } + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "browsePaths": { + "name": "browsePaths", + "version": 0, + "value": { + "paths": ["/prod/hdfs/SampleHdfsDataset"] + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "datasetKey": { + "name": "datasetKey", + "version": 0, + "value": { + "name": "SampleHdfsDataset", + "platform": "urn:li:dataPlatform:hdfs", + "origin": "PROD" + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "ownership": { + "name": "ownership", + "version": 0, + "value": { + "owners": [ + { + "owner": "urn:li:corpuser:jdoe", + "type": "DATAOWNER" + }, + { + "owner": "urn:li:corpuser:datahub", + "type": "DATAOWNER" + } + ], + "lastModified": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + } + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "dataPlatformInstance": { + "name": "dataPlatformInstance", + "version": 0, + "value": { + "platform": "urn:li:dataPlatform:hdfs" + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "institutionalMemory": { + "name": "institutionalMemory", + "version": 0, + "value": { + "elements": [ + { + "createStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + }, + "description": "Sample doc", + "url": "https://www.linkedin.com" + } + ] + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "schemaMetadata": { + "name": "schemaMetadata", + "version": 0, + "value": { + "created": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + }, + "platformSchema": { + "com.linkedin.schema.KafkaSchema": { + "documentSchema": "{\"type\":\"record\",\"name\":\"SampleHdfsSchema\",\"namespace\":\"com.linkedin.dataset\",\"doc\":\"Sample HDFS dataset\",\"fields\":[{\"name\":\"field_foo\",\"type\":[\"string\"]},{\"name\":\"field_bar\",\"type\":[\"boolean\"]}]}" + } + }, + "lastModified": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + }, + "schemaName": "SampleHdfsSchema", + "fields": [ + { + "nullable": false, + "fieldPath": "shipment_info", + "description": "Shipment info description", + "isPartOfKey": false, + "type": { + "type": { + "com.linkedin.schema.RecordType": {} + } + }, + "nativeDataType": "varchar(100)", + "recursive": false + }, + { + "nullable": false, + "fieldPath": "shipment_info.date", + "description": "Shipment info date description", + "isPartOfKey": false, + "type": { + "type": { + 
"com.linkedin.schema.DateType": {} + } + }, + "nativeDataType": "Date", + "recursive": false + }, + { + "nullable": false, + "fieldPath": "shipment_info.target", + "description": "Shipment info target description", + "isPartOfKey": false, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "text", + "recursive": false + }, + { + "nullable": false, + "fieldPath": "shipment_info.destination", + "description": "Shipment info destination description", + "isPartOfKey": false, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "varchar(100)", + "recursive": false + }, + { + "nullable": false, + "fieldPath": "shipment_info.geo_info", + "description": "Shipment info geo_info description", + "isPartOfKey": false, + "type": { + "type": { + "com.linkedin.schema.RecordType": {} + } + }, + "nativeDataType": "varchar(100)", + "recursive": false + }, + { + "nullable": false, + "fieldPath": "shipment_info.geo_info.lat", + "description": "Shipment info geo_info lat", + "isPartOfKey": false, + "type": { + "type": { + "com.linkedin.schema.NumberType": {} + } + }, + "nativeDataType": "float", + "recursive": false + }, + { + "nullable": false, + "fieldPath": "shipment_info.geo_info.lng", + "description": "Shipment info geo_info lng", + "isPartOfKey": false, + "type": { + "type": { + "com.linkedin.schema.NumberType": {} + } + }, + "nativeDataType": "float", + "recursive": false + } + ], + "version": 0, + "hash": "", + "platform": "urn:li:dataPlatform:hdfs" + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + }, + "upstreamLineage": { + "name": "upstreamLineage", + "version": 0, + "value": { + "upstreams": [ + { + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1581407189000 + }, + "type": "TRANSFORMED", + "dataset": "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)" + } + ] + }, + "created": { + "actor": "urn:li:corpuser:UNKNOWN", + "time": 1646245614843 + } + } + }, + "entityName": "dataset" +} +``` + +You can also optionally limit to specific aspects using the `aspects` query parameter: + +``` +curl 'http://localhost:8080/entitiesV2/?aspects=List(upstreamLineage)' +``` + +#### Retrieving Entities (Legacy) + +> Note that this method of retrieving entities is deprecated, as it uses the legacy Snapshot models. Please refer to the **Retriving Entity Aspects** section above for the +> latest guidance. + +The Entity Snapshot Get APIs allow to retrieve the latest version of each aspect associated with an Entity. + +In general, when reading entities by primary key (urn), you will use the general-purpose `entities` endpoints. 
To fetch by primary key (urn), you'll +issue a query of the following form: + +``` +curl 'http://localhost:8080/entities/' +``` + +##### Get a CorpUser + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3Acorpuser%3Afbar' + +{ + "value":{ + "com.linkedin.metadata.snapshot.CorpUserSnapshot":{ + "urn":"urn:li:corpuser:fbar", + "aspects":[ + { + "com.linkedin.metadata.key.CorpUserKey":{ + "username":"fbar" + } + }, + { + "com.linkedin.identity.CorpUserInfo":{ + "active":true, + "fullName":"Foo Bar", + "displayName":"Foo Bar", + "email":"fbar@linkedin.com" + } + }, + { + "com.linkedin.identity.CorpUserEditableInfo":{ + + } + } + ] + } + } +} +``` + +##### Get a CorpGroup + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3AcorpGroup%3Adev' + +{ + "value":{ + "com.linkedin.metadata.snapshot.CorpGroupSnapshot":{ + "urn":"urn:li:corpGroup:dev", + "aspects":[ + { + "com.linkedin.metadata.key.CorpGroupKey":{ + "name":"dev" + } + }, + { + "com.linkedin.identity.CorpGroupInfo":{ + "groups":[ + + ], + "email":"dev@linkedin.com", + "admins":[ + "urn:li:corpUser:jdoe" + ], + "members":[ + "urn:li:corpUser:datahub", + "urn:li:corpUser:jdoe" + ] + } + } + ] + } + } +} +``` + +##### Get a Dataset + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Afoo,bar,PROD)' + +{ + "value":{ + "com.linkedin.metadata.snapshot.DatasetSnapshot":{ + "urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)", + "aspects":[ + { + "com.linkedin.metadata.key.DatasetKey":{ + "origin":"PROD", + "name":"bar", + "platform":"urn:li:dataPlatform:foo" + } + }, + { + "com.linkedin.common.InstitutionalMemory":{ + "elements":[ + { + "createStamp":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "description":"Sample doc", + "url":"https://www.linkedin.com" + } + ] + } + }, + { + "com.linkedin.common.Ownership":{ + "owners":[ + { + "owner":"urn:li:corpuser:fbar", + "type":"DATAOWNER" + } + ], + "lastModified":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + } + } + }, + { + "com.linkedin.schema.SchemaMetadata":{ + "created":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "platformSchema":{ + "com.linkedin.schema.KafkaSchema":{ + "documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}" + } + }, + "lastModified":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "schemaName":"FooEvent", + "fields":[ + { + "fieldPath":"foo", + "description":"Bar", + "type":{ + "type":{ + "com.linkedin.schema.StringType":{ + + } + } + }, + "nativeDataType":"string" + } + ], + "version":0, + "hash":"", + "platform":"urn:li:dataPlatform:foo" + } + }, + { + "com.linkedin.common.BrowsePaths":{ + "paths":[ + "/prod/foo/bar" + ] + } + }, + { + "com.linkedin.dataset.UpstreamLineage":{ + "upstreams":[ + { + "auditStamp":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "type":"TRANSFORMED", + "dataset":"urn:li:dataset:(urn:li:dataPlatform:foo,barUp,PROD)" + } + ] + } + } + ] + } + } +} +``` + +##### Get a Chart + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3Achart%3A(looker,baz1)' + +{ + "value":{ + "com.linkedin.metadata.snapshot.ChartSnapshot":{ + "urn":"urn:li:chart:(looker,baz1)", + "aspects":[ + { + "com.linkedin.metadata.key.ChartKey":{ + "chartId":"baz1", + "dashboardTool":"looker" + } + }, + { + 
"com.linkedin.common.BrowsePaths":{ + "paths":[ + "/looker/baz1" + ] + } + }, + { + "com.linkedin.chart.ChartInfo":{ + "description":"Baz Chart 1", + "lastModified":{ + "created":{ + "actor":"urn:li:corpuser:jdoe", + "time":0 + }, + "lastModified":{ + "actor":"urn:li:corpuser:datahub", + "time":0 + } + }, + "title":"Baz Chart 1", + "inputs":[ + { + "string":"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" + } + ] + } + } + ] + } + } +} +``` + +##### Get a Dashboard + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3Adashboard%3A(looker,foo)' + +{ + "value":{ + "com.linkedin.metadata.snapshot.DashboardSnapshot":{ + "urn":"urn:li:dashboard:(looker,foo)", + "aspects":[ + { + "com.linkedin.metadata.key.DashboardKey":{ + "dashboardId":"foo", + "dashboardTool":"looker" + } + } + ] + } + } +} +``` + +##### Get a GlossaryTerm + +``` +curl 'http://localhost:8080/entities/urn%3Ali%3AglossaryTerm%3A(instruments,instruments.FinancialInstrument_v1)' +{ + "value":{ + "com.linkedin.metadata.snapshot.GlossaryTermSnapshot":{ + "urn":"urn:li:glossaryTerm:instruments.FinancialInstrument_v1", + "ownership":{ + "owners":[ + { + "owner":"urn:li:corpuser:jdoe", + "type":"DATAOWNER" + } + ], + "lastModified":{ + "actor":"urn:li:corpuser:jdoe", + "time":1581407189000 + } + }, + "glossaryTermInfo":{ + "definition":"written contract that gives rise to both a financial asset of one entity and a financial liability of another entity", + "customProperties":{ + "FQDN":"full" + }, + "sourceRef":"FIBO", + "sourceUrl":"https://spec.edmcouncil.org/fibo/ontology/FBC/FinancialInstruments/FinancialInstruments/FinancialInstrument", + "termSource":"EXTERNAL" + } + } + } +} +``` + +##### Browse an Entity + +To browse (explore) for an Entity of a particular type (e.g. dataset, chart, etc), you can use the following query format: + +``` +curl -X POST 'http://localhost:8080/entities?action=browse' \ +--data '{ + "path": "", + "entity": "", + "start": 0, + "limit": 10 +}' +``` + +For example, to browse the "charts" entity, you could use the following query: + +``` +curl -X POST 'http://localhost:8080/entities?action=browse' \ +--data '{ + "path": "/looker", + "entity": "chart", + "start": 0, + "limit": 10 +}' + +{ + "value":{ + "numEntities":1, + "pageSize":1, + "metadata":{ + "totalNumEntities":1, + "groups":[ + + ], + "path":"/looker" + }, + "from":0, + "entities":[ + { + "name":"baz1", + "urn":"urn:li:chart:(looker,baz1)" + } + ] + } +} +``` + +##### Search an Entity + +To search for an Entity of a particular type (e.g. dataset, chart, etc), you can use the following query format: + +``` +curl -X POST 'http://localhost:8080/entities?action=search' \ +--data '{ + "input": "", + "entity": "", + "start": 0, + "count": 10 +}' +``` + +The API will return a list of URNs that matched your search query. 
+ +For example, to search the "charts" entity, you could use the following query: + +``` +curl -X POST 'http://localhost:8080/entities?action=search' \ +--data '{ + "input": "looker", + "entity": "chart", + "start": 0, + "count": 10 +}' + +{ + "value":{ + "numEntities":1, + "pageSize":10, + "metadata":{ + "urns":[ + "urn:li:chart:(looker,baz1)" + ], + "matches":[ + { + "matchedFields":[ + { + "name":"tool", + "value":"looker" + } + ] + } + ], + "searchResultMetadatas":[ + { + "name":"tool", + "aggregations":{ + "looker":1 + } + } + ] + }, + "from":0, + "entities":[ + "urn:li:chart:(looker,baz1)" + ] + } +} +``` + +###### Exact Match Search + +You can use colon search for exact match searching on particular @Searchable fields of an Entity. + +###### Example: Find assets by Tag + +For example, to fetch all Datasets having a particular tag (Engineering), we can use the following query: + +``` +curl -X POST 'http://localhost:8080/entities?action=search' \ +--data '{ + "input": "tags:Engineering", + "entity": "dataset", + "start": 0, + "count": 10 +}' + +{ + "value":{ + "numEntities":1, + "pageSize":10, + "metadata":{ + "urns":[ + "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)" + ], + "matches":[ + { + "matchedFields":[ + { + "name":"tags", + "value":"urn:li:tag:Engineering" + } + ] + } + ], + "searchResultMetadatas":[ + { + "name":"platform", + "aggregations":{ + "foo":1 + } + }, + { + "name":"origin", + "aggregations":{ + "PROD":1 + } + } + ] + }, + "from":0, + "entities":[ + "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)" + ] + } +} +``` + +###### Filtering + +In addition to performing full-text search, you can also filter explicitly against fields marked as @Searchable in the corresponding aspect PDLs. + +For example, to perform filtering for a chart with title "Baz Chart 1", you could issue the following query: + +``` +curl -X POST 'http://localhost:8080/entities?action=search' \ +--data '{ + "input": "looker", + "entity": "chart", + "start": 0, + "count": 10, + "filter": { + "or": [{ + "and": [ + { + "field": "title", + "value": "Baz Chart 1", + "condition": "EQUAL" + } + ] + }] + } +}' + +{ + "value":{ + "numEntities":1, + "pageSize":10, + "metadata":{ + "urns":[ + "urn:li:chart:(looker,baz1)" + ], + "matches":[ + { + "matchedFields":[ + { + "name":"tool", + "value":"looker" + } + ] + } + ], + "searchResultMetadatas":[ + { + "name":"tool", + "aggregations":{ + "looker":1 + } + } + ] + }, + "from":0, + "entities":[ + "urn:li:chart:(looker,baz1)" + ] + } +} +``` + +where valid conditions include - CONTAIN - END_WITH - EQUAL - GREATER_THAN - GREATER_THAN_OR_EQUAL_TO - LESS_THAN - LESS_THAN_OR_EQUAL_TO - START_WITH + +\*Note that the search API only includes data corresponding to the latest snapshots of a particular Entity. 
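+
+Conditions can also be combined: every condition inside an `and` array must match, and the top-level `or` array unions those groups. As a sketch that reuses the fields from the examples above, the following searches for charts built in `looker` whose title is exactly "Baz Chart 1":
+
+```
+curl -X POST 'http://localhost:8080/entities?action=search' \
+--data '{
+    "input": "looker",
+    "entity": "chart",
+    "start": 0,
+    "count": 10,
+    "filter": {
+        "or": [{
+            "and": [
+                {
+                    "field": "tool",
+                    "value": "looker",
+                    "condition": "EQUAL"
+                },
+                {
+                    "field": "title",
+                    "value": "Baz Chart 1",
+                    "condition": "EQUAL"
+                }
+            ]
+        }]
+    }
+}'
+```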
+ +##### Autocomplete against fields of an entity + +To autocomplete a query for a particular entity type, you can use a query of the following form: + +``` +curl -X POST 'http://localhost:8080/entities?action=autocomplete' \ +--data '{ + "query": "", + "entity": "", + "limit": 10 +}' +``` + +For example, to autocomplete a query against all Dataset entities, you could issue the following: + +``` +curl -X POST 'http://localhost:8080/entities?action=autocomplete' \ +--data '{ + "query": "Baz Ch", + "entity": "chart", + "start": 0, + "limit": 10 +}' + +{ + "value":{ + "suggestions":[ + "Baz Chart 1" + ], + "query":"Baz Ch" + } +} +``` + +Note that you can also provide a `Filter` to the autocomplete endpoint: + +``` +curl -X POST 'http://localhost:8080/entities?action=autocomplete' \ +--data '{ + "query": "Baz C", + "entity": "chart", + "start": 0, + "limit": 10, + "filter": { + "or": [{ + "and": [ + { + "field": "tool", + "value": "looker", + "condition": "EQUAL" + } + ] + }] + } +}' + +{ + "value":{ + "suggestions":[ + "Baz Chart 1" + ], + "query":"Baz Ch" + } +} +``` + +\*Note that the autocomplete API only includes data corresponding to the latest snapshots of a particular Entity. + +##### Get a Versioned Aspect + +In addition to fetching the set of latest Snapshot aspects for an entity, we also support doing a point lookup of an entity at a particular version. + +To do so, you can use the following query template: + +``` +curl 'http://localhost:8080/aspects/?aspect=&version= +``` + +Which will return a VersionedAspect, which is a record containing a version and an aspect inside a Rest.li Union, wherein the fully-qualified record name of the +aspect is the key for the union. + +For example, to fetch the latest version of a Dataset's "schemaMetadata" aspect, you could issue the following query: + +``` +curl 'http://localhost:8080/aspects/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Afoo%2Cbar%2CPROD)?aspect=schemaMetadata&version=0' + +{ + "version":0, + "aspect":{ + "com.linkedin.schema.SchemaMetadata":{ + "created":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "platformSchema":{ + "com.linkedin.schema.KafkaSchema":{ + "documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}" + } + }, + "lastModified":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "schemaName":"FooEvent", + "fields":[ + { + "fieldPath":"foo", + "description":"Bar", + "type":{ + "type":{ + "com.linkedin.schema.StringType":{ + + } + } + }, + "nativeDataType":"string" + } + ], + "version":0, + "hash":"", + "platform":"urn:li:dataPlatform:foo" + } + } +} +``` + +Keep in mind that versions increase monotonically _after_ version 0, which represents the latest. + +Note that this API will soon be deprecated and replaced by the V2 Aspect API, discussed below. + +##### Get a range of Versioned Aspects + +_Coming Soon_! + +##### Get a range of Timeseries Aspects + +With the introduction of Timeseries Aspects, we've introduced a new API for fetching a series of aspects falling into a particular time range. For this, you'll +use the `/aspects` endpoint. The V2 APIs are unique in that they return a new type of payload: an "Enveloped Aspect". This is essentially a serialized aspect along with +some system metadata. 
The serialized aspect can be in any form, though we currently default to escaped Rest.li-compatible JSON. + +Callers of the V2 Aspect APIs will be expected to deserialize the aspect payload in the way they see fit. For example, they may bind the deserialized JSON object +into a strongly typed Rest.li RecordTemplate class (which is what datahub-frontend does). The benefit of doing it this way is thaet we remove the necessity to +use Rest.li Unions to represent an object which can take on multiple payload forms. It also makes adding and removing aspects from the model easier, a process +which could theoretically be done at runtime as opposed to at deploy time. + +To fetch a set of Timeseries Aspects that fall into a particular time range, you can use the following query template: + +``` +curl -X POST 'http://localhost:8080/aspects?action=getTimeseriesAspectValues' \ +--data '{ + "urn": "", + "entity": "", + "aspect": "", + "startTimeMillis": "", + "endTimeMillis": "" +}' +``` + +For example, to fetch "datasetProfile" timeseries aspects for a dataset with urn `urn:li:dataset:(urn:li:dataPlatform:foo,barUp,PROD)` +that were reported after July 26, 2021 and before July 28, 2021, you could issue the following query: + +``` +curl -X POST 'http://localhost:8080/aspects?action=getTimeseriesAspectValues' \ +--data '{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:redshift,global_dev.larxynx_carcinoma_data_2020,PROD)", + "entity": "dataset", + "aspect": "datasetProfile", + "startTimeMillis": 1625122800000, + "endTimeMillis": 1627455600000 +}' + +{ + "value":{ + "limit":10000, + "aspectName":"datasetProfile", + "endTimeMillis":1627455600000, + "startTimeMillis":1625122800000, + "entityName":"dataset", + "values":[ + { + "aspect":{ + "value":"{\"timestampMillis\":1626912000000,\"fieldProfiles\":[{\"uniqueProportion\":1.0,\"sampleValues\":[\"123MMKK12\",\"13KDFMKML\",\"123NNJJJL\"],\"fieldPath\":\"id\",\"nullCount\":0,\"nullProportion\":0.0,\"uniqueCount\":3742},{\"uniqueProportion\":1.0,\"min\":\"1524406400000\",\"max\":\"1624406400000\",\"sampleValues\":[\"1640023230002\",\"1640343012207\",\"16303412330117\"],\"mean\":\"1555406400000\",\"fieldPath\":\"date\",\"nullCount\":0,\"nullProportion\":0.0,\"uniqueCount\":3742},{\"uniqueProportion\":0.037,\"min\":\"21\",\"median\":\"68\",\"max\":\"92\",\"sampleValues\":[\"45\",\"65\",\"81\"],\"mean\":\"65\",\"distinctValueFrequencies\":[{\"value\":\"12\",\"frequency\":103},{\"value\":\"54\",\"frequency\":12}],\"fieldPath\":\"patient_age\",\"nullCount\":0,\"nullProportion\":0.0,\"uniqueCount\":79},{\"uniqueProportion\":0.00820873786407767,\"sampleValues\":[\"male\",\"female\"],\"fieldPath\":\"patient_gender\",\"nullCount\":120,\"nullProportion\":0.03,\"uniqueCount\":2}],\"rowCount\":3742,\"columnCount\":4}", + "contentType":"application/json" + } + }, + ] + } +} +``` + +You'll notice that in this API (V2), we return a generic serialized aspect string as opposed to an inlined Rest.li-serialized Snapshot Model. + +This is part of an initiative to move from MCE + MAE to MetadataChangeProposal and MetadataChangeLog. For more information, see [this doc](docs/advanced/mcp-mcl.md). + +##### Get Relationships (Edges) + +To get relationships between entities, you can use the `/relationships` API. Do do so, you must provide the following inputs: + +1. Urn of the source node +2. Direction of the edge (INCOMING, OUTGOING) +3. 
The name of the Relationship (This can be found in Aspect PDLs within the @Relationship annotation) + +For example, to get all entities owned by `urn:li:corpuser:fbar`, we could issue the following query: + +``` +curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acorpuser%3Auser1&types=OwnedBy' +``` + +which will return a list of urns, representing entities on the other side of the relationship: + +``` +{ + "entities":[ + urn:li:dataset:(urn:li:dataPlatform:foo,barUp,PROD) + ] +} +``` + +## FAQ + +_1. How do I find the valid set of Entity names?_ + +Entities are named inside of PDL schemas. Each entity will be annotated with the @Entity annotation, which will include a "name" field inside. +This represents the "common name" for the entity which can be used in browsing, searching, and more. By default, DataHub ships with the following entities: + +By convention, all entity PDLs live under `metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot` + +_2. How do I find the valid set of Aspect names?_ + +Aspects are named inside of PDL schemas. Each aspect will be annotated with the @Aspect annotation, which will include a "name" field inside. +This represents the "common name" for the entity which can be used in browsing, searching, and more. + +By convention, all entity PDLs live under `metadata-models/src/main/pegasus/com/linkedin/metadata/common` or `metadata-models/src/main/pegasus/com/linkedin/metadata/`. For example, +the dataset-specific aspects are located under `metadata-models/src/main/pegasus/com/linkedin/metadata/dataset`. + +_3. How do I find the valid set of Relationship names?_ + +All relationships are defined on foreign-key fields inside Aspect PDLs. They are reflected by fields bearing the @Relationship annotation. Inside this annotation +is a "name" field that defines the standardized name of the Relationship to be used when querying. + +By convention, all entity PDLs live under `metadata-models/src/main/pegasus/com/linkedin/metadata/common` or `metadata-models/src/main/pegasus/com/linkedin/metadata/`. For example, +the dataset-specific aspects are located under `metadata-models/src/main/pegasus/com/linkedin/metadata/dataset`. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/restli/restore-indices.md b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/restore-indices.md new file mode 100644 index 0000000000000..85ea2546e2348 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/restore-indices.md @@ -0,0 +1,34 @@ +--- +title: Restore Indices Endpoint +slug: /api/restli/restore-indices +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/restli/restore-indices.md +--- + +# Restore Indices Endpoint + +You can do a HTTP POST request to `/gms/operations?action=restoreIndices` endpoint with the `urn` as part of JSON Payload to restore indices for the particular URN, or with the `urnLike` regex to restore for `batchSize` URNs matching the pattern starting from `start`. 
+ +``` +curl --location --request POST 'https://demo.datahubproject.io/api/gms/operations?action=restoreIndices' \ +--header 'Authorization: Bearer TOKEN' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "urn": "YOUR_URN" +}' + +curl --location --request POST 'https://demo.datahubproject.io/api/gms/operations?action=restoreIndices' \ +--header 'Authorization: Bearer TOKEN' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "urnLike": "urn:dataPlatform:%" +}' +``` + +The supported parameters are + +- `urn` - Optional URN string +- `aspect` - Optional Aspect string +- `urnLike` - Optional string regex to match URNs +- `start` - Optional integer to decide which rows number of sql store to restore. Default: 0 +- `batchSize` - Optional integer to decide how many rows to restore. Default: 10 diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/restli/truncate-time-series-aspect.md b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/truncate-time-series-aspect.md new file mode 100644 index 0000000000000..5a1beab78f1ab --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/restli/truncate-time-series-aspect.md @@ -0,0 +1,55 @@ +--- +title: Truncate Timeseries Index Endpoint +slug: /api/restli/truncate-time-series-aspect +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/restli/truncate-time-series-aspect.md +--- + +# Truncate Timeseries Index Endpoint + +You can do a HTTP POST request to `/gms/operations?action=truncateTimeseriesAspect` endpoint to manage the size of a time series index by removing entries older than a certain timestamp, thereby truncating the table to only the entries needed, to save space. The `getIndexSizes` endpoint can be used to identify the largest indices. The output includes the index parameters needed for this function. + +``` +curl --location --request POST 'https://demo.datahubproject.io/api/gms/operations?action=truncateTimeseriesAspect' \ +--header 'Authorization: Bearer TOKEN' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "entityType": "YOUR_ENTITY_TYPE", + "aspect": "YOUR_ASPECT_NAME", + "endTimeMillis": 1000000000000 +}' + +curl --location --request POST 'https://demo.datahubproject.io/api/gms/operations?action=truncateTimeseriesAspect' \ +--header 'Authorization: Bearer TOKEN' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "entityType": "YOUR_ENTITY_TYPE", + "aspect": "YOUR_ASPECT_NAME", + "endTimeMillis": 1000000000000, + "dryRun": false, + "batchSize": 100, + "timeoutSeconds": 3600 +}' +``` + +The supported parameters are + +- `entityType` - Required type of the entity to truncate the index of, for example, `dataset`. +- `aspect` - Required name of the aspect to truncate the index of, for example, `datasetusagestatistics`. A call to `getIndexSizes` shows the `entityType` and `aspect` parameters for each index along with its size. +- `endTimeMillis` - Required timestamp to truncate the index to. Entities with timestamps older than this time will be deleted. +- `dryRun` - Optional boolean to enable/disable dry run functionality. Default: true. In a dry run, the following information will be printed: + +``` +{"value":"Delete 0 out of 201 rows (0.00%). Reindexing the aspect without the deleted records. This was a dry run. Run with dryRun = false to execute."} +``` + +- `batchSize` - Optional integer to control the batch size for the deletion. 
Default: 10000 +- `timeoutSeconds` - Optional integer to set a timeout for the delete operation. Default: No timeout set + +The output to the call will be information about how many rows would be deleted and how to proceed for a dry run: + +``` +{"value":"Delete 0 out of 201 rows (0.00%). Reindexing the aspect without the deleted records. This was a dry run. Run with dryRun = false to execute."} +``` + +For a non-dry-run, the output will be the Task ID of the asynchronous delete operation. This task ID can be used to monitor the status of the operation. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/custom-properties.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/custom-properties.md new file mode 100644 index 0000000000000..2481ba6a7406e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/custom-properties.md @@ -0,0 +1,526 @@ +--- +title: Custom Properties +slug: /api/tutorials/custom-properties +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/custom-properties.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Custom Properties + +## Why Would You Use Custom Properties on Datasets? + +Custom properties to datasets can help to provide additional information about the data that is not readily available in the standard metadata fields. Custom properties can be used to describe specific attributes of the data, such as the units of measurement used, the date range covered, or the geographical region the data pertains to. This can be particularly helpful when working with large and complex datasets, where additional context can help to ensure that the data is being used correctly and effectively. + +DataHub models custom properties of a Dataset as a map of key-value pairs of strings. + +Custom properties can also be used to enable advanced search and discovery capabilities, by allowing users to filter and sort datasets based on specific attributes. This can help users to quickly find and access the data they need, without having to manually review large numbers of datasets. + +### Goal Of This Guide + +This guide will show you how to add, remove or replace custom properties on a dataset `fct_users_deleted`. Here's what each operation means: + +- Add: Add custom properties to a dataset without affecting existing properties +- Remove: Removing specific properties from the dataset without affecting other properties +- Replace: Completely replacing the entire property map without affecting other fields that are located in the same aspect. e.g. `DatasetProperties` aspect contains `customProperties` as well as other fields like `name` and `description`. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before adding custom properties, you need to ensure the target dataset is already present in your DataHub instance. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +In this example, we will add some custom properties `cluster_name` and `retention_time` to the dataset `fct_users_deleted`. + +After you have ingested sample data, the dataset `fct_users_deleted` should have a custom properties section with `encoding` set to `utf-8`. + +

+ +

+ +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties +{ + "datasetProperties": { + "customProperties": { + "encoding": "utf-8" + }, + "description": "table containing all the users deleted on a single day", + "tags": [] + } +} +``` + +## Add Custom Properties programmatically + +The following code adds custom properties `cluster_name` and `retention_time` to a dataset named `fct_users_deleted` without affecting existing properties. + + + + +> 🚫 Adding Custom Properties on Dataset via GraphQL is currently not supported. +> Please check out [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information, + + + + +```java +# Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAdd.java +package io.datahubproject.examples; + +import com.linkedin.common.urn.UrnUtils; +import datahub.client.MetadataWriteResponse; +import datahub.client.patch.dataset.DatasetPropertiesPatchBuilder; +import datahub.client.rest.RestEmitter; +import java.io.IOException; +import com.linkedin.mxe.MetadataChangeProposal; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Future; +import lombok.extern.slf4j.Slf4j; + + +@Slf4j +class DatasetCustomPropertiesAdd { + + private DatasetCustomPropertiesAdd() { + + } + + /** + * Adds properties to an existing custom properties aspect without affecting any existing properties + * @param args + * @throws IOException + * @throws ExecutionException + * @throws InterruptedException + */ + public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { + MetadataChangeProposal datasetPropertiesProposal = new DatasetPropertiesPatchBuilder() + .urn(UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD")) + .addCustomProperty("cluster_name", "datahubproject.acryl.io") + .addCustomProperty("retention_time", "2 years") + .build(); + + String token = ""; + RestEmitter emitter = RestEmitter.create( + b -> b.server("http://localhost:8080") + .token(token) + ); + try { + Future response = emitter.emit(datasetPropertiesProposal); + + System.out.println(response.get().getResponseContent()); + } catch (Exception e) { + log.error("Failed to emit metadata to DataHub", e); + throw e; + } finally { + emitter.close(); + } + + } + +} + + + +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_properties.py +import logging +from typing import Union + +from datahub.configuration.kafka import KafkaProducerConnectionConfig +from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig +from datahub.emitter.mce_builder import make_dataset_urn +from datahub.emitter.rest_emitter import DataHubRestEmitter +from datahub.specific.dataset import DatasetPatchBuilder + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +# Get an emitter, either REST or Kafka, this example shows you both +def get_emitter() -> Union[DataHubRestEmitter, DatahubKafkaEmitter]: + USE_REST_EMITTER = True + if USE_REST_EMITTER: + gms_endpoint = "http://localhost:8080" + return DataHubRestEmitter(gms_server=gms_endpoint) + else: + kafka_server = "localhost:9092" + schema_registry_url = "http://localhost:8081" + return DatahubKafkaEmitter( + config=KafkaEmitterConfig( + connection=KafkaProducerConnectionConfig( + bootstrap=kafka_server, schema_registry_url=schema_registry_url + ) + ) + ) + + +dataset_urn = 
make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+with get_emitter() as emitter:
+    for patch_mcp in (
+        DatasetPatchBuilder(dataset_urn)
+        .add_custom_property("cluster_name", "datahubproject.acryl.io")
+        .add_custom_property("retention_time", "2 years")
+        .build()
+    ):
+        emitter.emit(patch_mcp)
+
+
+log.info(f"Added cluster_name, retention_time properties to dataset {dataset_urn}")
+
+```
+
+
+### Expected Outcome of Adding Custom Properties
+
+You can now see the two new properties are added to `fct_users_deleted` and the previous property `encoding` is unchanged.
+

+ +

+ +We can also verify this operation by programmatically checking the `datasetProperties` aspect after running this code using the `datahub` cli. + +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties +{ + "datasetProperties": { + "customProperties": { + "encoding": "utf-8", + "cluster_name": "datahubproject.acryl.io", + "retention_time": "2 years" + }, + "description": "table containing all the users deleted on a single day", + "tags": [] + } +} +``` + +## Add and Remove Custom Properties programmatically + +The following code shows you how can add and remove custom properties in the same call. In the following code, we add custom property `cluster_name` and remove property `retention_time` from a dataset named `fct_users_deleted` without affecting existing properties. + + + + +> 🚫 Adding and Removing Custom Properties on Dataset via GraphQL is currently not supported. +> Please check out [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information, + + + + +```java +# Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAddRemove.java +package io.datahubproject.examples; + +import com.linkedin.common.urn.UrnUtils; +import com.linkedin.mxe.MetadataChangeProposal; +import datahub.client.MetadataWriteResponse; +import datahub.client.patch.dataset.DatasetPropertiesPatchBuilder; +import datahub.client.rest.RestEmitter; +import java.io.IOException; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Future; +import lombok.extern.slf4j.Slf4j; + + +@Slf4j +class DatasetCustomPropertiesAddRemove { + + private DatasetCustomPropertiesAddRemove() { + + } + + /** + * Applies Add and Remove property operations on an existing custom properties aspect without + * affecting any other properties + * @param args + * @throws IOException + * @throws ExecutionException + * @throws InterruptedException + */ + public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { + MetadataChangeProposal datasetPropertiesProposal = new DatasetPropertiesPatchBuilder() + .urn(UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD")) + .addCustomProperty("cluster_name", "datahubproject.acryl.io") + .removeCustomProperty("retention_time") + .build(); + + String token = ""; + RestEmitter emitter = RestEmitter.create( + b -> b.server("http://localhost:8080") + .token(token) + ); + try { + Future response = emitter.emit(datasetPropertiesProposal); + + System.out.println(response.get().getResponseContent()); + } catch (Exception e) { + log.error("Failed to emit metadata to DataHub", e); + throw e; + } finally { + emitter.close(); + } + + } + +} + + + +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_remove_properties.py +import logging +from typing import Union + +from datahub.configuration.kafka import KafkaProducerConnectionConfig +from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig +from datahub.emitter.mce_builder import make_dataset_urn +from datahub.emitter.rest_emitter import DataHubRestEmitter +from datahub.specific.dataset import DatasetPatchBuilder + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +# Get an emitter, either REST or Kafka, this example shows you both +def get_emitter() -> Union[DataHubRestEmitter, DatahubKafkaEmitter]: + USE_REST_EMITTER = True + if USE_REST_EMITTER: + 
gms_endpoint = "http://localhost:8080"
+        return DataHubRestEmitter(gms_server=gms_endpoint)
+    else:
+        kafka_server = "localhost:9092"
+        schema_registry_url = "http://localhost:8081"
+        return DatahubKafkaEmitter(
+            config=KafkaEmitterConfig(
+                connection=KafkaProducerConnectionConfig(
+                    bootstrap=kafka_server, schema_registry_url=schema_registry_url
+                )
+            )
+        )
+
+
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+with get_emitter() as emitter:
+    for patch_mcp in (
+        DatasetPatchBuilder(dataset_urn)
+        .add_custom_property("cluster_name", "datahubproject.acryl.io")
+        .remove_custom_property("retention_time")
+        .build()
+    ):
+        emitter.emit(patch_mcp)
+
+
+log.info(
+    f"Added cluster_name property, removed retention_time property from dataset {dataset_urn}"
+)
+
+```
+
+
+### Expected Outcome of Add and Remove Operations on Custom Properties
+
+You can now see the `cluster_name` property is added to `fct_users_deleted` and the `retention_time` property is removed.
+

+ +

+ +We can also verify this operation programmatically by checking the `datasetProperties` aspect using the `datahub` cli. + +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties +{ + "datasetProperties": { + "customProperties": { + "encoding": "utf-8", + "cluster_name": "datahubproject.acryl.io" + }, + "description": "table containing all the users deleted on a single day", + "tags": [] + } +} +``` + +## Replace Custom Properties programmatically + +The following code replaces the current custom properties with a new properties map that includes only the properties `cluster_name` and `retention_time`. After running this code, the previous `encoding` property will be removed. + + + + +> 🚫 Replacing Custom Properties on Dataset via GraphQL is currently not supported. +> Please check out [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information, + + + + +```java +# Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesReplace.java +package io.datahubproject.examples; + +import com.linkedin.common.urn.UrnUtils; +import com.linkedin.mxe.MetadataChangeProposal; +import datahub.client.MetadataWriteResponse; +import datahub.client.patch.dataset.DatasetPropertiesPatchBuilder; +import datahub.client.rest.RestEmitter; +import java.io.IOException; +import java.util.HashMap; +import java.util.Map; +import java.util.concurrent.Future; +import lombok.extern.slf4j.Slf4j; + + +@Slf4j +class DatasetCustomPropertiesReplace { + + private DatasetCustomPropertiesReplace() { + + } + + /** + * Replaces the existing custom properties map with a new map. + * Fields like dataset name, description etc remain unchanged. 
+   * @param args
+   * @throws IOException
+   */
+  public static void main(String[] args) throws IOException {
+    Map<String, String> customPropsMap = new HashMap<>();
+    customPropsMap.put("cluster_name", "datahubproject.acryl.io");
+    customPropsMap.put("retention_time", "2 years");
+    MetadataChangeProposal datasetPropertiesProposal = new DatasetPropertiesPatchBuilder()
+        .urn(UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD"))
+        .setCustomProperties(customPropsMap)
+        .build();
+
+    String token = "";
+    RestEmitter emitter = RestEmitter.create(
+        b -> b.server("http://localhost:8080")
+            .token(token)
+    );
+
+    try {
+      Future<MetadataWriteResponse> response = emitter.emit(datasetPropertiesProposal);
+      System.out.println(response.get().getResponseContent());
+    } catch (Exception e) {
+      log.error("Failed to emit metadata to DataHub", e);
+    } finally {
+      emitter.close();
+    }
+
+  }
+
+}
+
+
+```
+
+
+```python
+# Inlined from /metadata-ingestion/examples/library/dataset_replace_properties.py
+import logging
+from typing import Union
+
+from datahub.configuration.kafka import KafkaProducerConnectionConfig
+from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.emitter.rest_emitter import DataHubRestEmitter
+from datahub.specific.dataset import DatasetPatchBuilder
+
+log = logging.getLogger(__name__)
+logging.basicConfig(level=logging.INFO)
+
+
+# Get an emitter, either REST or Kafka, this example shows you both
+def get_emitter() -> Union[DataHubRestEmitter, DatahubKafkaEmitter]:
+    USE_REST_EMITTER = True
+    if USE_REST_EMITTER:
+        gms_endpoint = "http://localhost:8080"
+        return DataHubRestEmitter(gms_server=gms_endpoint)
+    else:
+        kafka_server = "localhost:9092"
+        schema_registry_url = "http://localhost:8081"
+        return DatahubKafkaEmitter(
+            config=KafkaEmitterConfig(
+                connection=KafkaProducerConnectionConfig(
+                    bootstrap=kafka_server, schema_registry_url=schema_registry_url
+                )
+            )
+        )
+
+
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD")
+
+property_map_to_set = {
+    "cluster_name": "datahubproject.acryl.io",
+    "retention_time": "2 years",
+}
+
+with get_emitter() as emitter:
+    for patch_mcp in (
+        DatasetPatchBuilder(dataset_urn)
+        .set_custom_properties(property_map_to_set)
+        .build()
+    ):
+        emitter.emit(patch_mcp)
+
+
+log.info(
+    f"Replaced custom properties on dataset {dataset_urn} as {property_map_to_set}"
+)
+
+```
+
+
+### Expected Outcome of Replacing Custom Properties
+
+You can now see the `cluster_name` and `retention_time` properties are added to `fct_users_deleted` but the previous `encoding` property is no longer present.
+

+ +

+
+We can also verify this operation programmatically by checking the `datasetProperties` aspect using the `datahub` cli.
+
+```shell
+datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --aspect datasetProperties
+{
+  "datasetProperties": {
+    "customProperties": {
+      "cluster_name": "datahubproject.acryl.io",
+      "retention_time": "2 years"
+    },
+    "description": "table containing all the users deleted on a single day",
+    "tags": []
+  }
+}
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/datasets.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/datasets.md
new file mode 100644
index 0000000000000..30a732f36f38b
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/datasets.md
@@ -0,0 +1,211 @@
+---
+title: Dataset
+slug: /api/tutorials/datasets
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/datasets.md
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+# Dataset
+
+## Why Would You Use Datasets?
+
+The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (S3, ADLS, etc.).
+For more information about datasets, refer to [Dataset](/docs/generated/metamodel/entities/dataset.md).
+
+### Goal Of This Guide
+
+This guide will show you how to
+
+- Create: create a dataset with three columns.
+- Delete: delete a dataset.
+
+## Prerequisites
+
+For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
+For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md).
+
+## Create Dataset
+
+
+
+> 🚫 Creating a dataset via `graphql` is currently not supported.
+> Please check out [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information.
+ + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_schema.py +# Imports for urn construction utility methods +from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + AuditStampClass, + DateTypeClass, + OtherSchemaClass, + SchemaFieldClass, + SchemaFieldDataTypeClass, + SchemaMetadataClass, + StringTypeClass, +) + +event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD"), + aspect=SchemaMetadataClass( + schemaName="customer", # not used + platform=make_data_platform_urn("hive"), # important <- platform must be an urn + version=0, # when the source system has a notion of versioning of schemas, insert this in, otherwise leave as 0 + hash="", # when the source system has a notion of unique schemas identified via hash, include a hash, else leave it as empty string + platformSchema=OtherSchemaClass(rawSchema="__insert raw schema here__"), + lastModified=AuditStampClass( + time=1640692800000, actor="urn:li:corpuser:ingestion" + ), + fields=[ + SchemaFieldClass( + fieldPath="address.zipcode", + type=SchemaFieldDataTypeClass(type=StringTypeClass()), + nativeDataType="VARCHAR(50)", # use this to provide the type of the field in the source system's vernacular + description="This is the zipcode of the address. Specified using extended form and limited to addresses in the United States", + lastModified=AuditStampClass( + time=1640692800000, actor="urn:li:corpuser:ingestion" + ), + ), + SchemaFieldClass( + fieldPath="address.street", + type=SchemaFieldDataTypeClass(type=StringTypeClass()), + nativeDataType="VARCHAR(100)", + description="Street corresponding to the address", + lastModified=AuditStampClass( + time=1640692800000, actor="urn:li:corpuser:ingestion" + ), + ), + SchemaFieldClass( + fieldPath="last_sold_date", + type=SchemaFieldDataTypeClass(type=DateTypeClass()), + nativeDataType="Date", + description="Date of the last sale date for this property", + created=AuditStampClass( + time=1640692800000, actor="urn:li:corpuser:ingestion" + ), + lastModified=AuditStampClass( + time=1640692800000, actor="urn:li:corpuser:ingestion" + ), + ), + ], + ), +) + +# Create rest emitter +rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080") +rest_emitter.emit(event) + +``` + + + + +### Expected Outcomes of Creating Dataset + +You can now see `realestate_db.sales` dataset has been created. + +

+ +

+ +## Delete Dataset + +You may want to delete a dataset if it is no longer needed, contains incorrect or sensitive information, or if it was created for testing purposes and is no longer necessary in production. +It is possible to [delete entities via CLI](/docs/how/delete-metadata.md), but a programmatic approach is necessary for scalability. + +There are two methods of deletion: soft delete and hard delete. +**Soft delete** sets the Status aspect of the entity to Removed, which hides the entity and all its aspects from being returned by the UI. +**Hard delete** physically deletes all rows for all aspects of the entity. + +For more information about soft delete and hard delete, please refer to [Removing Metadata from DataHub](/docs/how/delete-metadata.md#delete-by-urn). + + + + +> 🚫 Hard delete with `graphql` is currently not supported. +> Please check out [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information. + +```json +mutation batchUpdateSoftDeleted { + batchUpdateSoftDeleted(input: + { urns: ["urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"], + deleted: true }) +} +``` + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "batchUpdateSoftDeleted": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation batchUpdateSoftDeleted { batchUpdateSoftDeleted(input: { deleted: true, urns: [\"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\"] }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "batchUpdateSoftDeleted": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/delete_dataset.py +import logging + +from datahub.emitter.mce_builder import make_dataset_urn +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + +graph = DataHubGraph( + config=DatahubClientConfig( + server="http://localhost:8080", + ) +) + +dataset_urn = make_dataset_urn(name="fct_users_created", platform="hive") + +# Soft-delete the dataset. +graph.delete_entity(urn=dataset_urn, hard=False) + +log.info(f"Deleted dataset {dataset_urn}") + +``` + + + + +### Expected Outcomes of Deleting Dataset + +The dataset `fct_users_deleted` has now been deleted, so if you search for a hive dataset named `fct_users_delete`, you will no longer be able to see it. + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/deprecation.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/deprecation.md new file mode 100644 index 0000000000000..6731a99714619 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/deprecation.md @@ -0,0 +1,189 @@ +--- +title: Deprecation +slug: /api/tutorials/deprecation +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/deprecation.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Deprecation + +## Why Would You Deprecate Datasets? + +The Deprecation feature on DataHub indicates the status of an entity. For datasets, keeping the deprecation status up-to-date is important to inform users and downstream systems of changes to the dataset's availability or reliability. By updating the status, you can communicate changes proactively, prevent issues and ensure users are always using highly trusted data assets. + +### Goal Of This Guide + +This guide will show you how to read or update deprecation status of a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before updating deprecation, you need to ensure the targeted dataset is already present in your datahub. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from a sample ingestion. +::: + +## Read Deprecation + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)") { + deprecation { + deprecated + decommissionTime + } + } +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "dataset": { + "deprecation": { + "deprecated": false, + "decommissionTime": null + } + } + }, + "extensions": {} +} +``` + + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "{ dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\") { deprecation { deprecated decommissionTime } } }", "variables":{} }' +``` + +Expected Response: + +```json +{ + "data": { + "dataset": { + "deprecation": { "deprecated": false, "decommissionTime": null } + } + }, + "extensions": {} +} +``` + + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_deprecation.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import DeprecationClass + +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["deprecation"], + aspect_types=[DeprecationClass], +) + +print(result) + +``` + + + + +## Update Deprecation + + + + +```json +mutation updateDeprecation { + updateDeprecation(input: { urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", deprecated: true }) 
+} +``` + +Also note that you can update deprecation status of multiple entities or subresource using `batchUpdateDeprecation`. + +```json +mutation batchUpdateDeprecation { + batchUpdateDeprecation( + input: { + deprecated: true, + resources: [ + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"} , + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"} ,] + } + ) +} + +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "updateDeprecation": true + }, + "extensions": {} +} +``` + + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation updateDeprecation { updateDeprecation(input: { deprecated: true, urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "removeTag": true }, "extensions": {} } +``` + + + + + + + + +### Expected Outcomes of Updating Deprecation + +You can now see the dataset `fct_users_created` has been marked as `Deprecated.` + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/descriptions.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/descriptions.md new file mode 100644 index 0000000000000..9a3e80e3e1fb0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/descriptions.md @@ -0,0 +1,613 @@ +--- +title: Description +slug: /api/tutorials/descriptions +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/descriptions.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Description + +## Why Would You Use Description on Dataset? + +Adding a description and related link to a dataset can provide important information about the data, such as its source, collection methods, and potential uses. This can help others understand the context of the data and how it may be relevant to their own work or research. Including a related link can also provide access to additional resources or related datasets, further enriching the information available to users. + +### Goal Of This Guide + +This guide will show you how to + +- Read dataset description: read a description of a dataset. +- Read column description: read a description of columns of a dataset`. +- Add dataset description: add a description and a link to dataset. +- Add column description: add a description to a column of a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before adding a description, you need to ensure the targeted dataset is already present in your datahub. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +In this example, we will add a description to `user_name `column of a dataset `fct_users_deleted`. 
+ +## Read Description on Dataset + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)") { + properties { + description + } + } +} +``` + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "dataset": { + "properties": { + "description": "table containing all the users deleted on a single day" + } + } + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\") { properties { description } } }", "variables":{}}' +``` + +Expected Response: + +```json +{ + "data": { + "dataset": { + "properties": { + "description": "table containing all the users deleted on a single day" + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_description.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import DatasetPropertiesClass + +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["datasetProperties"], + aspect_types=[DatasetPropertiesClass], +)["datasetProperties"] + +if result: + print(result.description) + +``` + + + + +## Read Description on Columns + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)") { + schemaMetadata { + fields { + fieldPath + description + } + } + } +} +``` + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "dataset": { + "schemaMetadata": { + "fields": [ + { + "fieldPath": "user_name", + "description": "Name of the user who was deleted" + }, + ... 
+ { + "fieldPath": "deletion_reason", + "description": "Why the user chose to deactivate" + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\") { schemaMetadata { fields { fieldPath description } } } }", "variables":{}}' +``` + +Expected Response: + +```json +{ + "data": { + "dataset": { + "schemaMetadata": { + "fields": [ + { + "fieldPath": "user_name", + "description": "Name of the user who was deleted" + }, + { + "fieldPath": "timestamp", + "description": "Timestamp user was deleted at" + }, + { "fieldPath": "user_id", "description": "Id of the user deleted" }, + { + "fieldPath": "browser_id", + "description": "Cookie attached to identify the browser" + }, + { + "fieldPath": "session_id", + "description": "Cookie attached to identify the session" + }, + { + "fieldPath": "deletion_reason", + "description": "Why the user chose to deactivate" + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_description_on_columns.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import SchemaMetadataClass + +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["schemaMetadata"], + aspect_types=[SchemaMetadataClass], +)["schemaMetadata"] + +if result: + column_descriptions = [ + {field.fieldPath: field.description} for field in result.fields + ] + print(column_descriptions) + +``` + + + + +## Add Description on Dataset + + + + +```graphql +mutation updateDataset { + updateDataset( + urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" + input: { + editableProperties: { + description: "## The Real Estate Sales Dataset\nThis is a really important Dataset that contains all the relevant information about sales that have happened organized by address.\n" + } + institutionalMemory: { + elements: { + author: "urn:li:corpuser:jdoe" + url: "https://wikipedia.com/real_estate" + description: "This is the definition of what real estate means" + } + } + } + ) { + urn + } +} +``` + +Expected Response: + +```json +{ + "data": { + "updateDataset": { + "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" + } + }, + "extensions": {} +} +``` + + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "mutation updateDataset { updateDataset( urn:\"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", input: { editableProperties: { description: \"## The Real Estate Sales Dataset\nThis is a really important Dataset that contains all the relevant information about sales that have happened organized by address.\n\" } institutionalMemory: { elements: { author: \"urn:li:corpuser:jdoe\", 
url: \"https://wikipedia.com/real_estate\", description: \"This is the definition of what real estate means\" } } } ) { urn } }", + "variables": {} +}' +``` + +Expected Response: + +```json +{ + "data": { + "updateDataset": { + "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_documentation.py +import logging +import time + +from datahub.emitter.mce_builder import make_dataset_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + AuditStampClass, + EditableDatasetPropertiesClass, + InstitutionalMemoryClass, + InstitutionalMemoryMetadataClass, +) + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +# Inputs -> owner, ownership_type, dataset +documentation_to_add = "## The Real Estate Sales Dataset\nThis is a really important Dataset that contains all the relevant information about sales that have happened organized by address.\n" +link_to_add = "https://wikipedia.com/real_estate" +link_description = "This is the definition of what real estate means" +dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD") + +# Some helpful variables to fill out objects later +now = int(time.time() * 1000) # milliseconds since epoch +current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion") +institutional_memory_element = InstitutionalMemoryMetadataClass( + url=link_to_add, + description=link_description, + createStamp=current_timestamp, +) + + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint)) + +current_editable_properties = graph.get_aspect( + entity_urn=dataset_urn, aspect_type=EditableDatasetPropertiesClass +) + +need_write = False +if current_editable_properties: + if documentation_to_add != current_editable_properties.description: + current_editable_properties.description = documentation_to_add + need_write = True +else: + # create a brand new editable dataset properties aspect + current_editable_properties = EditableDatasetPropertiesClass( + created=current_timestamp, description=documentation_to_add + ) + need_write = True + +if need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_editable_properties, + ) + graph.emit(event) + log.info(f"Documentation added to dataset {dataset_urn}") + +else: + log.info("Documentation already exists and is identical, omitting write") + + +current_institutional_memory = graph.get_aspect( + entity_urn=dataset_urn, aspect_type=InstitutionalMemoryClass +) + +need_write = False + +if current_institutional_memory: + if link_to_add not in [x.url for x in current_institutional_memory.elements]: + current_institutional_memory.elements.append(institutional_memory_element) + need_write = True +else: + # create a brand new institutional memory aspect + current_institutional_memory = InstitutionalMemoryClass( + elements=[institutional_memory_element] + ) + need_write = True + +if need_write: + event = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_institutional_memory, + ) + graph.emit(event) + 
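    # this upserts the full institutionalMemory aspect, so the element list assembled above (any existing links plus the new one) is written as-is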
log.info(f"Link {link_to_add} added to dataset {dataset_urn}") + +else: + log.info(f"Link {link_to_add} already exists and is identical, omitting write") + +``` + + + + +### Expected Outcomes of Adding Description on Dataset + +You can now see the description is added to `fct_users_deleted`. + +

+ +

+ +## Add Description on Column + + + + +```json +mutation updateDescription { + updateDescription( + input: { + description: "Name of the user who was deleted. This description is updated via GrpahQL.", + resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)", + subResource: "user_name", + subResourceType:DATASET_FIELD + } + ) +} +``` + +Note that you can use general markdown in `description`. For example, you can do the following. + +```json +mutation updateDescription { + updateDescription( + input: { + description: """ + ### User Name + The `user_name` column is a primary key column that contains the name of the user who was deleted. + """, + resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)", + subResource: "user_name", + subResourceType:DATASET_FIELD + } + ) +} +``` + +`updateDescription` currently only supports Dataset Schema Fields, Containers. +For more information about the `updateDescription` mutation, please refer to [updateLineage](/docs/graphql/mutations/#updateDescription). + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "updateDescription": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation updateDescription { updateDescription ( input: { description: \"Name of the user who was deleted. This description is updated via GrpahQL.\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\", subResource: \"user_name\", subResourceType:DATASET_FIELD }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "updateDescription": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_column_documentation.py +import logging +import time + +from datahub.emitter.mce_builder import make_dataset_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + AuditStampClass, + EditableSchemaFieldInfoClass, + EditableSchemaMetadataClass, + InstitutionalMemoryClass, +) + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +def get_simple_field_path_from_v2_field_path(field_path: str) -> str: + """A helper function to extract simple . path notation from the v2 field path""" + if not field_path.startswith("[version=2.0]"): + # not a v2, we assume this is a simple path + return field_path + # this is a v2 field path + tokens = [ + t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]")) + ] + + return ".".join(tokens) + + +# Inputs -> owner, ownership_type, dataset +documentation_to_add = ( + "Name of the user who was deleted. This description is updated via PythonSDK." 
+) +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_deleted", env="PROD") +column = "user_name" +field_info_to_set = EditableSchemaFieldInfoClass( + fieldPath=column, description=documentation_to_add +) + + +# Some helpful variables to fill out objects later +now = int(time.time() * 1000) # milliseconds since epoch +current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion") + + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint)) + +current_editable_schema_metadata = graph.get_aspect( + entity_urn=dataset_urn, + aspect_type=EditableSchemaMetadataClass, +) + + +need_write = False + +if current_editable_schema_metadata: + for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo: + if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column: + # we have some editable schema metadata for this field + field_match = True + if documentation_to_add != fieldInfo.description: + fieldInfo.description = documentation_to_add + need_write = True +else: + # create a brand new editable dataset properties aspect + current_editable_schema_metadata = EditableSchemaMetadataClass( + editableSchemaFieldInfo=[field_info_to_set], + created=current_timestamp, + ) + need_write = True + +if need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_editable_schema_metadata, + ) + graph.emit(event) + log.info(f"Documentation added to dataset {dataset_urn}") + +else: + log.info("Documentation already exists and is identical, omitting write") + + +current_institutional_memory = graph.get_aspect( + entity_urn=dataset_urn, aspect_type=InstitutionalMemoryClass +) + +need_write = False + +``` + + + + +### Expected Outcomes of Adding Description on Column + +You can now see column description is added to `user_name` column of `fct_users_deleted`. + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/domains.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/domains.md new file mode 100644 index 0000000000000..a42f8d3307ef6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/domains.md @@ -0,0 +1,357 @@ +--- +title: Domains +slug: /api/tutorials/domains +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/domains.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Domains + +## Why Would You Use Domains? + +Domains are curated, top-level folders or categories where related assets can be explicitly grouped. Management of Domains can be centralized, or distributed out to Domain owners Currently, an asset can belong to only one Domain at a time. +For more information about domains, refer to [About DataHub Domains](/docs/domains.md). + +### Goal Of This Guide + +This guide will show you how to + +- Create a domain. +- Read domains attached to a dataset. +- Add a dataset to a domain +- Remove the domain from a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +## Create Domain + + + + +```json +mutation createDomain { + createDomain(input: { name: "Marketing", description: "Entities related to the marketing department" }) +} +``` + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "createDomain": "" + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation createDomain { createDomain(input: { name: \"Marketing\", description: \"Entities related to the marketing department.\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "createDomain": "" }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_domain.py +import logging + +from datahub.emitter.mce_builder import make_domain_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter +from datahub.metadata.schema_classes import ChangeTypeClass, DomainPropertiesClass + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + +domain_urn = make_domain_urn("marketing") +domain_properties_aspect = DomainPropertiesClass( + name="Marketing", description="Entities related to the marketing department" +) + +event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityType="domain", + changeType=ChangeTypeClass.UPSERT, + entityUrn=domain_urn, + aspect=domain_properties_aspect, +) + +rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080") +rest_emitter.emit(event) +log.info(f"Created domain {domain_urn}") + +``` + + + + +### Expected Outcomes of Creating Domain + +You can now see `Marketing` domain has been created under `Govern > Domains`. + +

+ +

+ +## Read Domains + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)") { + domain { + associatedUrn + domain { + urn + properties { + name + } + } + } + } +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "dataset": { + "domain": { + "associatedUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + "domain": { + "urn": "urn:li:domain:71b3bf7b-2e3f-4686-bfe1-93172c8c4e10", + "properties": { + "name": "Marketing" + } + } + } + } + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "{ dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\") { domain { associatedUrn domain { urn properties { name } } } } }", "variables":{}}' +``` + +Expected Response: + +```json +{ + "data": { + "dataset": { + "domain": { + "associatedUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + "domain": { + "urn": "urn:li:domain:71b3bf7b-2e3f-4686-bfe1-93172c8c4e10", + "properties": { "name": "Marketing" } + } + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_domain.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import DomainsClass + +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["domains"], + aspect_types=[DomainsClass], +) + +print(result) + +``` + + + + +## Add Domains + + + + +```json +mutation setDomain { + setDomain(domainUrn: "urn:li:domain:marketing", entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)") +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "setDomain": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation setDomain { setDomain(entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", domainUrn: "urn:li:domain:marketing")) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "setDomain": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_domain_execute_graphql.py +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +query = """ +mutation setDomain { + setDomain(domainUrn: "urn:li:domain:marketing", entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)") +} +""" +result = graph.execute_graphql(query=query) + +print(result) + 
+``` + + + + +### Expected Outcomes of Adding Domain + +You can now see `Marketing` domain has been added to the dataset. + +

+ +

+ +## Remove Domains + + + + +```json +mutation unsetDomain { + unsetDomain( + entityUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" + ) +} +``` + +Expected Response: + +```python +{ + "data": { + "removeDomain": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation unsetDomain { unsetDomain(entityUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\") }", "variables":{}}' +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_remove_domain_execute_graphql.py +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +query = """ +mutation unsetDomain { + unsetDomain( + entityUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" + ) +} +""" +result = graph.execute_graphql(query=query) + +print(result) + +``` + + + + +### Expected Outcomes of Removing Domain + +You can now see a domain `Marketing` has been removed from the `fct_users_created` dataset. + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/lineage.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/lineage.md new file mode 100644 index 0000000000000..45a6f56b11316 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/lineage.md @@ -0,0 +1,322 @@ +--- +title: Lineage +slug: /api/tutorials/lineage +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/lineage.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Lineage + +## Why Would You Use Lineage? + +Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream. +For more information about lineage, refer to [About DataHub Lineage](/docs/lineage/lineage-feature-guide.md). + +### Goal Of This Guide + +This guide will show you how to + +- Add lineage between datasets. +- Add column-level lineage between datasets. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before adding lineage, you need to ensure the targeted dataset is already present in your datahub. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +## Add Lineage + + + + +```json +mutation updateLineage { + updateLineage( + input: { + edgesToAdd: [ + { + downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)" + upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" + } + ] + edgesToRemove: [] + } + ) +} +``` + +Note that you can create a list of edges. For example, if you want to assign multiple upstream entities to a downstream entity, you can do the following. + +```json +mutation updateLineage { + updateLineage( + input: { + edgesToAdd: [ + { + downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)" + upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" + } + { + downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)" + upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" + } + ] + edgesToRemove: [] + } + ) +} + +``` + +For more information about the `updateLineage` mutation, please refer to [updateLineage](/docs/graphql/mutations/#updatelineage). 
+ +If you see the following response, the operation was successful: + +```python +{ + "data": { + "updateLineage": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' --data-raw '{ "query": "mutation updateLineage { updateLineage( input:{ edgesToAdd : { downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\", upstreamUrn : \"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)\"}, edgesToRemove :{downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\",upstreamUrn : \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\" } })}", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "updateLineage": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/lineage_emitter_rest.py +import datahub.emitter.mce_builder as builder +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Construct a lineage object. +lineage_mce = builder.make_lineage_mce( + [ + builder.make_dataset_urn("hive", "fct_users_deleted"), # Upstream + ], + builder.make_dataset_urn("hive", "logging_events"), # Downstream +) + +# Create an emitter to the GMS REST API. +emitter = DatahubRestEmitter("http://localhost:8080") + +# Emit metadata! +emitter.emit_mce(lineage_mce) + +``` + + + + +### Expected Outcomes of Adding Lineage + +You can now see the lineage between `fct_users_deleted` and `logging_events`. + +

+ +

+ +## Add Column-level Lineage + + + + +```python +# Inlined from /metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained_sample.py +import datahub.emitter.mce_builder as builder +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter +from datahub.metadata.com.linkedin.pegasus2avro.dataset import ( + DatasetLineageType, + FineGrainedLineage, + FineGrainedLineageDownstreamType, + FineGrainedLineageUpstreamType, + Upstream, + UpstreamLineage, +) + + +def datasetUrn(tbl): + return builder.make_dataset_urn("hive", tbl) + + +def fldUrn(tbl, fld): + return builder.make_schema_field_urn(datasetUrn(tbl), fld) + + +fineGrainedLineages = [ + FineGrainedLineage( + upstreamType=FineGrainedLineageUpstreamType.FIELD_SET, + upstreams=[ + fldUrn("fct_users_deleted", "browser_id"), + fldUrn("fct_users_created", "user_id"), + ], + downstreamType=FineGrainedLineageDownstreamType.FIELD, + downstreams=[fldUrn("logging_events", "browser")], + ), +] + + +# this is just to check if any conflicts with existing Upstream, particularly the DownstreamOf relationship +upstream = Upstream( + dataset=datasetUrn("fct_users_deleted"), type=DatasetLineageType.TRANSFORMED +) + +fieldLineages = UpstreamLineage( + upstreams=[upstream], fineGrainedLineages=fineGrainedLineages +) + +lineageMcp = MetadataChangeProposalWrapper( + entityUrn=datasetUrn("logging_events"), + aspect=fieldLineages, +) + +# Create an emitter to the GMS REST API. +emitter = DatahubRestEmitter("http://localhost:8080") + +# Emit metadata! +emitter.emit_mcp(lineageMcp) + +``` + + + + +### Expected Outcome of Adding Column Level Lineage + +You can now see the column-level lineage between datasets. Note that you have to enable `Show Columns` to be able to see the column-level lineage. + +

+ +

+ +## Read Lineage + + + + +```json +mutation searchAcrossLineage { + searchAcrossLineage( + input: { + query: "*" + urn: "urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD)" + start: 0 + count: 10 + direction: DOWNSTREAM + orFilters: [ + { + and: [ + { + condition: EQUAL + negated: false + field: "degree" + values: ["1", "2", "3+"] + } + ] + } + ] + } + ) { + searchResults { + degree + entity { + urn + type + } + } + } +} +``` + +This example shows using lineage degrees as a filter, but additional search filters can be included here as well. + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' --data-raw '{ { "query": "mutation searchAcrossLineage { searchAcrossLineage( input: { query: \"*\" urn: \"urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD)\" start: 0 count: 10 direction: DOWNSTREAM orFilters: [ { and: [ { condition: EQUAL negated: false field: \"degree\" values: [\"1\", \"2\", \"3+\"] } ] } ] } ) { searchResults { degree entity { urn type } } }}" +}}' +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/read_lineage_rest.py +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +query = """ +mutation searchAcrossLineage { + searchAcrossLineage( + input: { + query: "*" + urn: "urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD)" + start: 0 + count: 10 + direction: DOWNSTREAM + orFilters: [ + { + and: [ + { + condition: EQUAL + negated: false + field: "degree" + values: ["1", "2", "3+"] + } + ] # Additional search filters can be included here as well + } + ] + } + ) { + searchResults { + degree + entity { + urn + type + } + } + } +} +""" +result = graph.execute_graphql(query=query) + +print(result) + +``` + + + + +This will perform a multi-hop lineage search on the urn specified. For more information about the `searchAcrossLineage` mutation, please refer to [searchAcrossLineage](/docs/graphql/queries/#searchacrosslineage). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/ml.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/ml.md new file mode 100644 index 0000000000000..16904a7202f18 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/ml.md @@ -0,0 +1,847 @@ +--- +title: ML System +slug: /api/tutorials/ml +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/ml.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# ML System + +## Why Would You Integrate ML System with DataHub? + +Machine learning systems have become a crucial feature in modern data stacks. +However, the relationships between the different components of a machine learning system, such as features, models, and feature tables, can be complex. +Thus, it is essential for these systems to be discoverable to facilitate easy access and utilization by other members of the organization. 
+ +For more information on ML entities, please refer to the following docs: + +- [MlFeature](/docs/generated/metamodel/entities/mlFeature.md) +- [MlFeatureTable](/docs/generated/metamodel/entities/mlFeatureTable.md) +- [MlModel](/docs/generated/metamodel/entities/mlModel.md) +- [MlModelGroup](/docs/generated/metamodel/entities/mlModelGroup.md) + +### Goal Of This Guide + +This guide will show you how to + +- Create ML entities: MlFeature, MlFeatureTable, MlModel, MlModelGroup +- Read ML entities: MlFeature, MlFeatureTable, MlModel, MlModelGroup +- Attach MlFeatureTable or MlModel to MlFeature + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +## Create ML Entities + +### Create MlFeature + + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_mlfeature.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={}) + +dataset_urn = builder.make_dataset_urn( + name="fct_users_deleted", platform="hive", env="PROD" +) +feature_urn = builder.make_ml_feature_urn( + feature_table_name="my-feature-table", + feature_name="my-feature", +) + +# Create feature +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlFeature", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=feature_urn, + aspectName="mlFeatureProperties", + aspect=models.MLFeaturePropertiesClass( + description="my feature", sources=[dataset_urn], dataType="TEXT" + ), +) + +# Emit metadata! +emitter.emit(metadata_change_proposal) + +``` + +Note that when creating a feature, you can access a list of data sources using `sources`. + + + + +### Create MlFeatureTable + + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_mlfeature_table.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={}) + +feature_table_urn = builder.make_ml_feature_table_urn( + feature_table_name="my-feature-table", platform="feast" +) +feature_urns = [ + builder.make_ml_feature_urn( + feature_name="my-feature", feature_table_name="my-feature-table" + ), + builder.make_ml_feature_urn( + feature_name="my-feature2", feature_table_name="my-feature-table" + ), +] +feature_table_properties = models.MLFeatureTablePropertiesClass( + description="Test description", mlFeatures=feature_urns +) + +# MCP creation +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlFeatureTable", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=feature_table_urn, + aspect=feature_table_properties, +) + +# Emit metadata! +emitter.emit(metadata_change_proposal) + +``` + +Note that when creating a feature table, you can access a list of features using `mlFeatures`. + + + + +### Create MlModel + +Please note that an MlModel represents the outcome of a single training run for a model, not the collective results of all model runs. 
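Put differently, each training run typically gets its own MlModel URN, while all runs share a single MlModelGroup URN. A small sketch of the two URN builders used in the examples below (the per-run model names here are hypothetical):

```python
import datahub.emitter.mce_builder as builder

# One group that represents the model as a whole...
group_urn = builder.make_ml_model_group_urn(
    group_name="my-model-group", platform="science", env="PROD"
)

# ...and one MlModel per training run, each linked back to the group
# via MLModelPropertiesClass(groups=[group_urn]) as shown below.
run_urns = [
    builder.make_ml_model_urn(
        model_name=f"my-test-model-run-{i}", platform="science", env="PROD"
    )
    for i in range(3)
]
```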
+ + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_mlmodel.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={}) +model_urn = builder.make_ml_model_urn( + model_name="my-test-model", platform="science", env="PROD" +) +model_group_urns = [ + builder.make_ml_model_group_urn( + group_name="my-model-group", platform="science", env="PROD" + ) +] +feature_urns = [ + builder.make_ml_feature_urn( + feature_name="my-feature", feature_table_name="my-feature-table" + ), + builder.make_ml_feature_urn( + feature_name="my-feature2", feature_table_name="my-feature-table" + ), +] + +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlModel", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=model_urn, + aspectName="mlModelProperties", + aspect=models.MLModelPropertiesClass( + description="my feature", + groups=model_group_urns, + mlFeatures=feature_urns, + trainingMetrics=[ + models.MLMetricClass( + name="accuracy", description="accuracy of the model", value="1.0" + ) + ], + hyperParams=[ + models.MLHyperParamClass( + name="hyper_1", description="hyper_1", value="0.102" + ) + ], + ), +) + +# Emit metadata! +emitter.emit(metadata_change_proposal) + +``` + +Note that when creating a model, you can access a list of features using `mlFeatures`. +Additionally, you can access the relationship to model groups with `groups`. + + + + +### Create MlModelGroup + +Please note that an MlModelGroup serves as a container for all the runs of a single ML model. + + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_mlmodel_group.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={}) +model_group_urn = builder.make_ml_model_group_urn( + group_name="my-model-group", platform="science", env="PROD" +) + + +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlModelGroup", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=model_group_urn, + aspectName="mlModelGroupProperties", + aspect=models.MLModelGroupPropertiesClass( + description="my model group", + ), +) + + +# Emit metadata! +emitter.emit(metadata_change_proposal) + +``` + + + + +### Expected Outcome of creating entities + +You can search the entities in DataHub UI. + +

+ +

+ +

+ +

+ +## Read ML Entities + +### Read MLFeature + + + + +```json +query { + mlFeature(urn: "urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_BOOL_LIST_feature)"){ + name + featureNamespace + description + properties { + description + dataType + version { + versionTag + } + } + } +} +``` + +Expected response: + +```json +{ + "data": { + "mlFeature": { + "name": "test_BOOL_LIST_feature", + "featureNamespace": "test_feature_table_all_feature_dtypes", + "description": null, + "properties": { + "description": null, + "dataType": "SEQUENCE", + "version": null + } + } + }, + "extensions": {} +} +``` + + + + +```json +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "{ mlFeature(urn: \"urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_BOOL_LIST_feature)\") { name featureNamespace description properties { description dataType version { versionTag } } } }" +}' +``` + +Expected response: + +```json +{ + "data": { + "mlFeature": { + "name": "test_BOOL_LIST_feature", + "featureNamespace": "test_feature_table_all_feature_dtypes", + "description": null, + "properties": { + "description": null, + "dataType": "SEQUENCE", + "version": null + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/read_mlfeature.py +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import MLFeaturePropertiesClass + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +urn = "urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_BOOL_feature)" +result = graph.get_aspect(entity_urn=urn, aspect_type=MLFeaturePropertiesClass) + +print(result) + +``` + + + + +### Read MLFeatureTable + + + + +```json +query { + mlFeatureTable(urn: "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_all_feature_dtypes)"){ + name + description + platform { + name + } + properties { + description + mlFeatures { + name + } + } + } +} +``` + +Expected Response: + +```json +{ + "data": { + "mlFeatureTable": { + "name": "test_feature_table_all_feature_dtypes", + "description": null, + "platform": { + "name": "feast" + }, + "properties": { + "description": null, + "mlFeatures": [ + { + "name": "test_BOOL_LIST_feature" + }, + ... + { + "name": "test_STRING_feature" + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```json +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "{ mlFeatureTable(urn: \"urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_all_feature_dtypes)\") { name description platform { name } properties { description mlFeatures { name } } } }" +}' +``` + +Expected Response: + +```json +{ + "data": { + "mlFeatureTable": { + "name": "test_feature_table_all_feature_dtypes", + "description": null, + "platform": { + "name": "feast" + }, + "properties": { + "description": null, + "mlFeatures": [ + { + "name": "test_BOOL_LIST_feature" + }, + ... 
+ { + "name": "test_STRING_feature" + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/read_mlfeature_table.py +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import MLFeatureTablePropertiesClass + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +urn = "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_all_feature_dtypes)" +result = graph.get_aspect(entity_urn=urn, aspect_type=MLFeatureTablePropertiesClass) + +print(result) + +``` + + + + +### Read MLModel + + + + +```json +query { + mlModel(urn: "urn:li:mlModel:(urn:li:dataPlatform:science,scienceModel,PROD)"){ + name + description + properties { + description + version + type + mlFeatures + groups { + urn + name + } + } + } +} +``` + +Expected Response: + +```json +{ + "data": { + "mlModel": { + "name": "scienceModel", + "description": "A sample model for predicting some outcome.", + "properties": { + "description": "A sample model for predicting some outcome.", + "version": null, + "type": "Naive Bayes classifier", + "mlFeatures": null, + "groups": [] + } + } + }, + "extensions": {} +} +``` + + + + +```json +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "{ mlModel(urn: \"urn:li:mlModel:(urn:li:dataPlatform:science,scienceModel,PROD)\") { name description properties { description version type mlFeatures groups { urn name } } } }" +}' +``` + +Expected Response: + +```json +{ + "data": { + "mlModel": { + "name": "scienceModel", + "description": "A sample model for predicting some outcome.", + "properties": { + "description": "A sample model for predicting some outcome.", + "version": null, + "type": "Naive Bayes classifier", + "mlFeatures": null, + "groups": [] + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/read_mlmodel.py +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import MLModelPropertiesClass + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +urn = "urn:li:mlModel:(urn:li:dataPlatform:science,scienceModel,PROD)" +result = graph.get_aspect(entity_urn=urn, aspect_type=MLModelPropertiesClass) + +print(result) + +``` + + + + +### Read MLModelGroup + + + + +```json +query { + mlModelGroup(urn: "urn:li:mlModelGroup:(urn:li:dataPlatform:science,my-model-group,PROD)"){ + name + description + platform { + name + } + properties { + description + } + } +} +``` + +Expected Response: (Note that this entity does not exist in the sample ingestion and you might want to create this entity first.) 
+ +```json +{ + "data": { + "mlModelGroup": { + "name": "my-model-group", + "description": "my model group", + "platform": { + "name": "science" + }, + "properties": { + "description": "my model group" + } + } + }, + "extensions": {} +} +``` + + + + +```json +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "query": "{ mlModelGroup(urn: \"urn:li:mlModelGroup:(urn:li:dataPlatform:science,my-model-group,PROD)\") { name description platform { name } properties { description } } }" +}' +``` + +Expected Response: (Note that this entity does not exist in the sample ingestion and you might want to create this entity first.) + +```json +{ + "data": { + "mlModelGroup": { + "name": "my-model-group", + "description": "my model group", + "platform": { + "name": "science" + }, + "properties": { + "description": "my model group" + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/read_mlmodel_group.py +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import MLModelGroupPropertiesClass + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +urn = "urn:li:mlModelGroup:(urn:li:dataPlatform:science,my-model-group,PROD)" +result = graph.get_aspect(entity_urn=urn, aspect_type=MLModelGroupPropertiesClass) + +print(result) + +``` + + + + +## Add ML Entities + +### Add MlFeature to MlFeatureTable + + + + +```python +# Inlined from /metadata-ingestion/examples/library/add_mlfeature_to_mlfeature_table.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph +from datahub.metadata.schema_classes import MLFeatureTablePropertiesClass + +gms_endpoint = "http://localhost:8080" +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={}) + +feature_table_urn = builder.make_ml_feature_table_urn( + feature_table_name="my-feature-table", platform="feast" +) +feature_urns = [ + builder.make_ml_feature_urn( + feature_name="my-feature2", feature_table_name="my-feature-table" + ), +] + +# This code concatenates the new features with the existing features in the feature table. +# If you want to replace all existing features with only the new ones, you can comment out this line. +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) +feature_table_properties = graph.get_aspect( + entity_urn=feature_table_urn, aspect_type=MLFeatureTablePropertiesClass +) +if feature_table_properties: + current_features = feature_table_properties.mlFeatures + print("current_features:", current_features) + if current_features: + feature_urns += current_features + +feature_table_properties = models.MLFeatureTablePropertiesClass(mlFeatures=feature_urns) +# MCP createion +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlFeatureTable", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=feature_table_urn, + aspect=feature_table_properties, +) + +# Emit metadata! 
This is a blocking call +emitter.emit(metadata_change_proposal) + +``` + + + + +### Add MlFeature to MLModel + + + + +```python +# Inlined from /metadata-ingestion/examples/library/add_mlfeature_to_mlmodel.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph +from datahub.metadata.schema_classes import MLModelPropertiesClass + +gms_endpoint = "http://localhost:8080" +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={}) + +model_urn = builder.make_ml_model_urn( + model_name="my-test-model", platform="science", env="PROD" +) +feature_urns = [ + builder.make_ml_feature_urn( + feature_name="my-feature3", feature_table_name="my-feature-table" + ), +] + +# This code concatenates the new features with the existing features in the model +# If you want to replace all existing features with only the new ones, you can comment out this line. +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) +model_properties = graph.get_aspect( + entity_urn=model_urn, aspect_type=MLModelPropertiesClass +) +if model_properties: + current_features = model_properties.mlFeatures + print("current_features:", current_features) + if current_features: + feature_urns += current_features + +model_properties = models.MLModelPropertiesClass(mlFeatures=feature_urns) + +# MCP creation +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlModel", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=model_urn, + aspect=model_properties, +) + +# Emit metadata! +emitter.emit(metadata_change_proposal) + +``` + + + + +### Add MLGroup To MLModel + + + + +```python +# Inlined from /metadata-ingestion/examples/library/add_mlgroup_to_mlmodel.py +import datahub.emitter.mce_builder as builder +import datahub.metadata.schema_classes as models +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +# Create an emitter to DataHub over REST +emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={}) + +model_group_urns = [ + builder.make_ml_model_group_urn( + group_name="my-model-group", platform="science", env="PROD" + ) +] +model_urn = builder.make_ml_model_urn( + model_name="science-model", platform="science", env="PROD" +) + +# This code concatenates the new features with the existing features in the feature table. +# If you want to replace all existing features with only the new ones, you can comment out this line. +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +target_model_properties = graph.get_aspect( + entity_urn=model_urn, aspect_type=models.MLModelPropertiesClass +) +if target_model_properties: + current_model_groups = target_model_properties.groups + print("current_model_groups:", current_model_groups) + if current_model_groups: + model_group_urns += current_model_groups + +model_properties = models.MLModelPropertiesClass(groups=model_group_urns) +# MCP createion +metadata_change_proposal = MetadataChangeProposalWrapper( + entityType="mlModel", + changeType=models.ChangeTypeClass.UPSERT, + entityUrn=model_urn, + aspect=model_properties, +) + +# Emit metadata! 
This is a blocking call +emitter.emit(metadata_change_proposal) + +``` + + + + +### Expected Outcome of Adding ML Entities + +You can access to `Features` or `Group` Tab of each entity to view the added entities. + +

+ +

+ +

+ +

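+
+If you prefer to verify this programmatically rather than through the UI, a minimal sketch along the lines of the following reads the aspects back with the same `DataHubGraph` client used above. It is illustrative rather than one of the inlined examples: it assumes the local `gms_endpoint` and the URNs from the earlier snippets, and that the emitted changes have already been processed by the server.
+
+```python
+# Hedged verification sketch (not one of the inlined examples above).
+# Re-uses the endpoint and URNs from the snippets in this guide.
+import datahub.emitter.mce_builder as builder
+from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
+from datahub.metadata.schema_classes import (
+    MLFeatureTablePropertiesClass,
+    MLModelPropertiesClass,
+)
+
+gms_endpoint = "http://localhost:8080"
+graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
+
+feature_table_urn = builder.make_ml_feature_table_urn(
+    feature_table_name="my-feature-table", platform="feast"
+)
+feature_urn = builder.make_ml_feature_urn(
+    feature_name="my-feature2", feature_table_name="my-feature-table"
+)
+model_urn = builder.make_ml_model_urn(
+    model_name="science-model", platform="science", env="PROD"
+)
+
+# Read the properties aspects back and check that the new references are present.
+table_props = graph.get_aspect(
+    entity_urn=feature_table_urn, aspect_type=MLFeatureTablePropertiesClass
+)
+model_props = graph.get_aspect(entity_urn=model_urn, aspect_type=MLModelPropertiesClass)
+
+print(table_props is not None and feature_urn in (table_props.mlFeatures or []))
+print(model_props is not None and bool(model_props.groups or []))
+```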
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/owners.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/owners.md new file mode 100644 index 0000000000000..361bdd9546ea3 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/owners.md @@ -0,0 +1,536 @@ +--- +title: Ownership +slug: /api/tutorials/owners +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/owners.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Ownership + +## Why Would You Use Users and Groups? + +Users and groups are essential for managing ownership of data. +By creating or updating user accounts and assigning them to appropriate groups, administrators can ensure that the right people can access the data they need to do their jobs. +This helps to avoid confusion or conflicts over who is responsible for specific datasets and can improve the overall effectiveness. + +### Goal Of This Guide + +This guide will show you how to + +- Create: create or update users and groups. +- Read: read owners attached to a dataset. +- Add: add user group as an owner to a dataset. +- Remove: remove the owner from a dataset. + +## Pre-requisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +In this guide, ingesting sample data is optional. +::: + +## Upsert Users + + + + +Save this `user.yaml` as a local file. + +```yaml +- id: bar@acryl.io + first_name: The + last_name: Bar + email: bar@acryl.io + slack: "@the_bar_raiser" + description: "I like raising the bar higher" + groups: + - foogroup@acryl.io +- id: datahub + slack: "@datahubproject" + phone: "1-800-GOT-META" + description: "The DataHub Project" + picture_link: "https://raw.githubusercontent.com/datahub-project/datahub/master/datahub-web-react/src/images/datahub-logo-color-stable.svg" +``` + +Execute the following CLI command to ingest user data. +Since the user datahub already exists in the sample data, any updates made to the user information will overwrite the existing data. + +``` +datahub user upsert -f user.yaml +``` + +If you see the following logs, the operation was successful: + +```shell +Update succeeded for urn urn:li:corpuser:bar@acryl.io. +Update succeeded for urn urn:li:corpuser:datahub. +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/upsert_user.py +import logging + +from datahub.api.entities.corpuser.corpuser import CorpUser, CorpUserGenerationConfig +from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + +user_email = "bar@acryl.io" + +user: CorpUser = CorpUser( + id=user_email, + display_name="The Bar", + email=user_email, + title="Software Engineer", + first_name="The", + last_name="Bar", + full_name="The Bar", +) + +# Create graph client +datahub_graph = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080")) +for event in user.generate_mcp( + generation_config=CorpUserGenerationConfig(override_editable=False) +): + datahub_graph.emit(event) +log.info(f"Upserted user {user.urn}") + +``` + + + + +### Expected Outcomes of Upserting User + +You can see the user `The bar` has been created and the user `Datahub` has been updated under `Settings > Access > Users & Groups` + +

+ +

+ +## Upsert Group + + + + +Save this `group.yaml` as a local file. Note that the group includes a list of users who are owners and members. +Within these lists, you can refer to the users by their ids or their urns, and can additionally specify their metadata inline within the group description itself. See the example below to understand how this works and feel free to make modifications to this file locally to see the effects of your changes in your local DataHub instance. + +```yaml +id: foogroup@acryl.io +display_name: Foo Group +owners: + - datahub +members: + - bar@acryl.io # refer to a user either by id or by urn + - id: joe@acryl.io # inline specification of user + slack: "@joe_shmoe" + display_name: "Joe's Hub" +``` + +Execute the following CLI command to ingest this group's information. + +``` +datahub group upsert -f group.yaml +``` + +If you see the following logs, the operation was successful: + +```shell +Update succeeded for group urn:li:corpGroup:foogroup@acryl.io. +``` + + + + + +```python +# Inlined from /metadata-ingestion/examples/library/upsert_group.py +import logging + +from datahub.api.entities.corpgroup.corpgroup import ( + CorpGroup, + CorpGroupGenerationConfig, +) +from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig +from datahub.utilities.urns.corpuser_urn import CorpuserUrn + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + +group_email = "foogroup@acryl.io" +group = CorpGroup( + id=group_email, + owners=[str(CorpuserUrn.create_from_id("datahub"))], + members=[ + str(CorpuserUrn.create_from_id("bar@acryl.io")), + str(CorpuserUrn.create_from_id("joe@acryl.io")), + ], + display_name="Foo Group", + email=group_email, + description="Software engineering team", + slack="@foogroup", +) + +# Create graph client +datahub_graph = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080")) + +for event in group.generate_mcp( + generation_config=CorpGroupGenerationConfig( + override_editable=False, datahub_graph=datahub_graph + ) +): + datahub_graph.emit(event) +log.info(f"Upserted group {group.urn}") + +``` + + + + +### Expected Outcomes of Upserting Group + +You can see the group `Foo Group` has been created under `Settings > Access > Users & Groups` + +

+ +

+ +## Read Owners + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)") { + ownership { + owners { + owner { + ... on CorpUser { + urn + type + } + ... on CorpGroup { + urn + type + } + } + } + } + } +} +``` + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "dataset": { + "ownership": { + "owners": [ + { + "owner": { + "urn": "urn:li:corpuser:jdoe", + "type": "CORP_USER" + } + }, + { + "owner": { + "urn": "urn:li:corpuser:datahub", + "type": "CORP_USER" + } + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "{ dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)\") { ownership { owners { owner { ... on CorpUser { urn type } ... on CorpGroup { urn type } } } } } }", "variables":{}}' +``` + +Expected Response: + +```json +{ + "data": { + "dataset": { + "ownership": { + "owners": [ + { "owner": { "urn": "urn:li:corpuser:jdoe", "type": "CORP_USER" } }, + { "owner": { "urn": "urn:li:corpuser:datahub", "type": "CORP_USER" } } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_owners.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import OwnershipClass + +dataset_urn = make_dataset_urn(platform="hive", name="SampleHiveDataset", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["ownership"], + aspect_types=[OwnershipClass], +) + +print(result) + +``` + + + + +## Add Owners + + + + +```python +mutation addOwners { + addOwner( + input: { + ownerUrn: "urn:li:corpGroup:bfoo", + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + ownerEntityType: CORP_GROUP, + type: TECHNICAL_OWNER + } + ) +} +``` + +Expected Response: + +```python +{ + "data": { + "addOwner": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation addOwners { addOwner(input: { ownerUrn: \"urn:li:corpGroup:bfoo\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", ownerEntityType: CORP_GROUP, type: TECHNICAL_OWNER }) }", "variables":{}}' +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_owner.py +import logging +from typing import Optional + +from datahub.emitter.mce_builder import make_dataset_urn, make_user_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + OwnerClass, + OwnershipClass, + OwnershipTypeClass, +) + +log = logging.getLogger(__name__) 
+logging.basicConfig(level=logging.INFO) + + +# Inputs -> owner, ownership_type, dataset +owner_to_add = make_user_urn("jdoe") +ownership_type = OwnershipTypeClass.TECHNICAL_OWNER +dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD") + +# Some objects to help with conditional pathways later +owner_class_to_add = OwnerClass(owner=owner_to_add, type=ownership_type) +ownership_to_add = OwnershipClass(owners=[owner_class_to_add]) + + +# First we get the current owners +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + + +current_owners: Optional[OwnershipClass] = graph.get_aspect( + entity_urn=dataset_urn, aspect_type=OwnershipClass +) + + +need_write = False +if current_owners: + if (owner_to_add, ownership_type) not in [ + (x.owner, x.type) for x in current_owners.owners + ]: + # owners exist, but this owner is not present in the current owners + current_owners.owners.append(owner_class_to_add) + need_write = True +else: + # create a brand new ownership aspect + current_owners = ownership_to_add + need_write = True + +if need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_owners, + ) + graph.emit(event) + log.info( + f"Owner {owner_to_add}, type {ownership_type} added to dataset {dataset_urn}" + ) + +else: + log.info(f"Owner {owner_to_add} already exists, omitting write") + +``` + + + + +## Expected Outcomes of Adding Owner + +You can now see `bfoo` has been added as an owner to the `fct_users_created` dataset. + +

+ +

+ +## Remove Owners + + + + +```json +mutation removeOwners { + removeOwner( + input: { + ownerUrn: "urn:li:corpuser:jdoe", + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + } + ) +} +``` + +Note that you can also remove owners from multiple entities or subresource using `batchRemoveOwners`. + +```json +mutation batchRemoveOwners { + batchRemoveOwners( + input: { + ownerUrns: ["urn:li:corpuser:jdoe"], + resources: [ + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"} , + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"} ,] + } + ) +} +``` + +Expected Response: + +```python +{ + "data": { + "removeOwner": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation removeOwner { removeOwner(input: { ownerUrn: \"urn:li:corpuser:jdoe\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)\" }) }", "variables":{}}' +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_remove_owner_execute_graphql.py +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +query = """ +mutation batchRemoveOwners { + batchRemoveOwners( + input: { + ownerUrns: ["urn:li:corpuser:jdoe"], + resources: [ + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"} , + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"} ,] + } + ) +} +""" +result = graph.execute_graphql(query=query) + +print(result) + +``` + + + + +### Expected Outcomes of Removing Owners + +You can now see `John Doe` has been removed as an owner from the `fct_users_created` dataset. + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/tags.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/tags.md new file mode 100644 index 0000000000000..7ae66b5f89626 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/tags.md @@ -0,0 +1,613 @@ +--- +title: Tags +slug: /api/tutorials/tags +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/tags.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Tags + +## Why Would You Use Tags on Datasets? + +Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary. +For more information about tags, refer to [About DataHub Tags](/docs/tags.md). + +### Goal Of This Guide + +This guide will show you how to + +- Create: create a tag. +- Read : read tags attached to a dataset. +- Add: add a tag to a column of a dataset or a dataset itself. +- Remove: remove a tag from a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before modifying tags, you need to ensure the target dataset is already present in your DataHub instance. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +For more information on how to set up for GraphQL, please refer to [How To Set Up GraphQL](/docs/api/graphql/how-to-set-up-graphql.md). + +## Create Tags + +The following code creates a tag `Deprecated`. + + + + +```json +mutation createTag { + createTag(input: + { + name: "Deprecated", + id: "deprecated", + description: "Having this tag means this column or table is deprecated." 
+ }) +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "createTag": "urn:li:tag:deprecated" + }, + "extensions": {} +} +``` + + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation createTag { createTag(input: { name: \"Deprecated\", id: \"deprecated\",description: \"Having this tag means this column or table is deprecated.\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "createTag": "urn:li:tag:deprecated" }, "extensions": {} } +``` + + + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_tag.py +import logging + +from datahub.emitter.mce_builder import make_tag_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Imports for metadata model classes +from datahub.metadata.schema_classes import TagPropertiesClass + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + +tag_urn = make_tag_urn("deprecated") +tag_properties_aspect = TagPropertiesClass( + name="Deprecated", + description="Having this tag means this column or table is deprecated.", +) + +event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=tag_urn, + aspect=tag_properties_aspect, +) + +# Create rest emitter +rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080") +rest_emitter.emit(event) +log.info(f"Created tag {tag_urn}") + +``` + + + + +### Expected Outcome of Creating Tags + +You can now see the new tag `Deprecated` has been created. + +

+ +

+ +We can also verify this operation by programmatically searching `Deprecated` tag after running this code using the `datahub` cli. + +```shell +datahub get --urn "urn:li:tag:deprecated" --aspect tagProperties + +{ + "tagProperties": { + "description": "Having this tag means this column or table is deprecated.", + "name": "Deprecated" + } +} +``` + +## Read Tags + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)") { + tags { + tags { + tag { + name + urn + properties { + description + colorHex + } + } + } + } + } +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "dataset": { + "tags": { + "tags": [ + { + "tag": { + "name": "Legacy", + "urn": "urn:li:tag:Legacy", + "properties": { + "description": "Indicates the dataset is no longer supported", + "colorHex": null, + "name": "Legacy" + } + } + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "{dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)\") {tags {tags {tag {name urn properties { description colorHex } } } } } }", "variables":{}}' +``` + +Expected Response: + +```json +{ + "data": { + "dataset": { + "tags": { + "tags": [ + { + "tag": { + "name": "Legacy", + "urn": "urn:li:tag:Legacy", + "properties": { + "description": "Indicates the dataset is no longer supported", + "colorHex": null + } + } + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_tags.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import GlobalTagsClass + +dataset_urn = make_dataset_urn(platform="hive", name="SampleHiveDataset", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["globalTags"], + aspect_types=[GlobalTagsClass], +) + +print(result) + +``` + + + + +## Add Tags + +### Add Tags to a dataset + +The following code shows you how can add tags to a dataset. +In the following code, we add a tag `Deprecated` to a dataset named `fct_users_created`. 
+ + + + +```json +mutation addTags { + addTags( + input: { + tagUrns: ["urn:li:tag:deprecated"], + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + } + ) +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "addTags": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation addTags { addTags(input: { tagUrns: [\"urn:li:tag:deprecated\"], resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "addTags": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_tag.py +import logging +from typing import Optional + +from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +# First we get the current tags +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD") + +current_tags: Optional[GlobalTagsClass] = graph.get_aspect( + entity_urn=dataset_urn, + aspect_type=GlobalTagsClass, +) + +tag_to_add = make_tag_urn("purchase") +tag_association_to_add = TagAssociationClass(tag=tag_to_add) + +need_write = False +if current_tags: + if tag_to_add not in [x.tag for x in current_tags.tags]: + # tags exist, but this tag is not present in the current tags + current_tags.tags.append(TagAssociationClass(tag_to_add)) + need_write = True +else: + # create a brand new tags aspect + current_tags = GlobalTagsClass(tags=[tag_association_to_add]) + need_write = True + +if need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_tags, + ) + graph.emit(event) + log.info(f"Tag {tag_to_add} added to dataset {dataset_urn}") + +else: + log.info(f"Tag {tag_to_add} already exists, omitting write") + +``` + + + + +### Add Tags to a Column of a dataset + + + + +```json +mutation addTags { + addTags( + input: { + tagUrns: ["urn:li:tag:deprecated"], + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + subResourceType:DATASET_FIELD, + subResource:"user_name"}) +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation addTags { addTags(input: { tagUrns: [\"urn:li:tag:deprecated\"], resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", subResourceType: DATASET_FIELD, subResource: \"user_name\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "addTags": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_column_tag.py +import logging +import time + +from 
datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + AuditStampClass, + EditableSchemaFieldInfoClass, + EditableSchemaMetadataClass, + GlobalTagsClass, + TagAssociationClass, +) + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +def get_simple_field_path_from_v2_field_path(field_path: str) -> str: + """A helper function to extract simple . path notation from the v2 field path""" + if not field_path.startswith("[version=2.0]"): + # not a v2, we assume this is a simple path + return field_path + # this is a v2 field path + tokens = [ + t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]")) + ] + + return ".".join(tokens) + + +# Inputs -> the column, dataset and the tag to set +column = "user_name" +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") +tag_to_add = make_tag_urn("deprecated") + + +# First we get the current editable schema metadata +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + + +current_editable_schema_metadata = graph.get_aspect( + entity_urn=dataset_urn, + aspect_type=EditableSchemaMetadataClass, +) + + +# Some pre-built objects to help all the conditional pathways +tag_association_to_add = TagAssociationClass(tag=tag_to_add) +tags_aspect_to_set = GlobalTagsClass(tags=[tag_association_to_add]) +field_info_to_set = EditableSchemaFieldInfoClass( + fieldPath=column, globalTags=tags_aspect_to_set +) + + +need_write = False +field_match = False +if current_editable_schema_metadata: + for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo: + if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column: + # we have some editable schema metadata for this field + field_match = True + if fieldInfo.globalTags: + if tag_to_add not in [x.tag for x in fieldInfo.globalTags.tags]: + # this tag is not present + fieldInfo.globalTags.tags.append(tag_association_to_add) + need_write = True + else: + fieldInfo.globalTags = tags_aspect_to_set + need_write = True + + if not field_match: + # this field isn't present in the editable schema metadata aspect, add it + field_info = field_info_to_set + current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info) + need_write = True + +else: + # create a brand new editable schema metadata aspect + now = int(time.time() * 1000) # milliseconds since epoch + current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion") + current_editable_schema_metadata = EditableSchemaMetadataClass( + editableSchemaFieldInfo=[field_info_to_set], + created=current_timestamp, + ) + need_write = True + +if need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_editable_schema_metadata, + ) + graph.emit(event) + log.info(f"Tag {tag_to_add} added to column {column} of dataset {dataset_urn}") + +else: + log.info(f"Tag {tag_to_add} already attached to column {column}, omitting write") + +``` + + + + +### Expected Outcome of Adding Tags + +You can now see `Deprecated` tag has been added to `user_name` column. + +

+ +

+ +We can also verify this operation programmatically by checking the `globalTags` aspect using the `datahub` cli. + +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" --aspect globalTags + +``` + +## Remove Tags + +The following code remove a tag from a dataset. +After running this code, `Deprecated` tag will be removed from a `user_name` column. + + + + +```json +mutation removeTag { + removeTag( + input: { + tagUrn: "urn:li:tag:deprecated", + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + subResourceType:DATASET_FIELD, + subResource:"user_name"}) +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation removeTag { removeTag(input: { tagUrn: \"urn:li:tag:deprecated\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\" }) }", "variables":{}}' +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_remove_tag_execute_graphql.py +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +query = """ +mutation removeTag { + removeTag( + input: { + tagUrn: "urn:li:tag:deprecated", + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + subResourceType:DATASET_FIELD, + subResource:"user_name"}) +} +""" +result = graph.execute_graphql(query=query) + +print(result) + +``` + + + + +### Expected Outcome of Removing Tags + +You can now see `Deprecated` tag has been removed to `user_name` column. + +

+ +

+ +We can also verify this operation programmatically by checking the `gloablTags` aspect using the `datahub` cli. + +```shell +datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" --aspect globalTags + +{ + "globalTags": { + "tags": [] + } +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/terms.md b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/terms.md new file mode 100644 index 0000000000000..d2d0e715a4ca8 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/api/tutorials/terms.md @@ -0,0 +1,613 @@ +--- +title: Terms +slug: /api/tutorials/terms +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/terms.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Terms + +## Why Would You Use Terms on Datasets? + +The Business Glossary(Term) feature in DataHub helps you use a shared vocabulary within the orgarnization, by providing a framework for defining a standardized set of data concepts and then associating them with the physical assets that exist within your data ecosystem. + +For more information about terms, refer to [About DataHub Business Glossary](/docs/glossary/business-glossary.md). + +### Goal Of This Guide + +This guide will show you how to + +- Create: create a term. +- Read : read terms attached to a dataset. +- Add: add a term to a column of a dataset or a dataset itself. +- Remove: remove a term from a dataset. + +## Prerequisites + +For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. +For detailed information, please refer to [Datahub Quickstart Guide](/docs/quickstart.md). + +:::note +Before modifying terms, you need to ensure the target dataset is already present in your DataHub instance. +If you attempt to manipulate entities that do not exist, your operation will fail. +In this guide, we will be using data from sample ingestion. +::: + +For more information on how to set up for GraphQL, please refer to [How To Set Up GraphQL](/docs/api/graphql/how-to-set-up-graphql.md). + +## Create Terms + +The following code creates a term `Rate of Return`. + + + + +```json +mutation createGlossaryTerm { + createGlossaryTerm(input: { + name: "Rate of Return", + id: "rateofreturn", + description: "A rate of return (RoR) is the net gain or loss of an investment over a specified time period." 
+ }, + ) +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "createGlossaryTerm": "urn:li:glossaryTerm:rateofreturn" + }, + "extensions": {} +} +``` + + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation createGlossaryTerm { createGlossaryTerm(input: { name: \"Rate of Return\", id:\"rateofreturn\", description: \"A rate of return (RoR) is the net gain or loss of an investment over a specified time period.\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ + "data": { "createGlossaryTerm": "urn:li:glossaryTerm:rateofreturn" }, + "extensions": {} +} +``` + + + + + +```python +# Inlined from /metadata-ingestion/examples/library/create_term.py +import logging + +from datahub.emitter.mce_builder import make_term_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +# Imports for metadata model classes +from datahub.metadata.schema_classes import GlossaryTermInfoClass + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + +term_urn = make_term_urn("rateofreturn") +term_properties_aspect = GlossaryTermInfoClass( + definition="A rate of return (RoR) is the net gain or loss of an investment over a specified time period.", + name="Rate of Return", + termSource="", +) + +event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=term_urn, + aspect=term_properties_aspect, +) + +# Create rest emitter +rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080") +rest_emitter.emit(event) +log.info(f"Created term {term_urn}") + +``` + + + + +### Expected Outcome of Creating Terms + +You can now see the new term `Rate of Return` has been created. + +

+ +

+ +We can also verify this operation by programmatically searching `Rate of Return` term after running this code using the `datahub` cli. + +```shell +datahub get --urn "urn:li:glossaryTerm:rateofreturn" --aspect glossaryTermInfo + +{ + "glossaryTermInfo": { + "definition": "A rate of return (RoR) is the net gain or loss of an investment over a specified time period.", + "name": "Rate of Return", + "termSource": "INTERNAL" + } +} +``` + +## Read Terms + + + + +```json +query { + dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)") { + glossaryTerms { + terms { + term { + urn + glossaryTermInfo { + name + description + } + } + } + } + } +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "dataset": { + "glossaryTerms": { + "terms": [ + { + "term": { + "urn": "urn:li:glossaryTerm:CustomerAccount", + "glossaryTermInfo": { + "name": "CustomerAccount", + "description": "account that represents an identified, named collection of balances and cumulative totals used to summarize customer transaction-related activity over a designated period of time" + } + } + } + ] + } + } + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "{dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\") {glossaryTerms {terms {term {urn glossaryTermInfo { name description } } } } } }", "variables":{}}' +``` + +Expected Response: + +````json +{"data":{"dataset":{"glossaryTerms":{"terms":[{"term":{"urn":"urn:li:glossaryTerm:CustomerAccount","glossaryTermInfo":{"name":"CustomerAccount","description":"account that represents an identified, named collection of balances and cumulative totals used to summarize customer transaction-related activity over a designated period of time"}}}]}}},"extensions":{}}``` +```` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_query_terms.py +from datahub.emitter.mce_builder import make_dataset_urn + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import GlossaryTermsClass + +dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD") + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +result = graph.get_aspects_for_entity( + entity_urn=dataset_urn, + aspects=["glossaryTerms"], + aspect_types=[GlossaryTermsClass], +) + +print(result) + +``` + + + + +## Add Terms + +### Add Terms to a dataset + +The following code shows you how can add terms to a dataset. +In the following code, we add a term `Rate of Return` to a dataset named `fct_users_created`. 
+ + + + +```json +mutation addTerms { + addTerms( + input: { + termUrns: ["urn:li:glossaryTerm:rateofreturn"], + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + } + ) +} +``` + +If you see the following response, the operation was successful: + +```python +{ + "data": { + "addTerms": true + }, + "extensions": {} +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation addTerm { addTerms(input: { termUrns: [\"urn:li:glossaryTerm:rateofreturn\"], resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "addTerms": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_term.py +import logging +from typing import Optional + +from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + AuditStampClass, + GlossaryTermAssociationClass, + GlossaryTermsClass, +) + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +# First we get the current terms +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD") + +current_terms: Optional[GlossaryTermsClass] = graph.get_aspect( + entity_urn=dataset_urn, aspect_type=GlossaryTermsClass +) + +term_to_add = make_term_urn("Classification.HighlyConfidential") +term_association_to_add = GlossaryTermAssociationClass(urn=term_to_add) +# an audit stamp that basically says we have no idea when these terms were added to this dataset +# change the time value to (time.time() * 1000) if you want to specify the current time of running this code as the time +unknown_audit_stamp = AuditStampClass(time=0, actor="urn:li:corpuser:ingestion") +need_write = False +if current_terms: + if term_to_add not in [x.urn for x in current_terms.terms]: + # terms exist, but this term is not present in the current terms + current_terms.terms.append(term_association_to_add) + need_write = True +else: + # create a brand new terms aspect + current_terms = GlossaryTermsClass( + terms=[term_association_to_add], + auditStamp=unknown_audit_stamp, + ) + need_write = True + +if need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_terms, + ) + graph.emit(event) +else: + log.info(f"Term {term_to_add} already exists, omitting write") + +``` + + + + +### Add Terms to a Column of a Dataset + + + + +```json +mutation addTerms { + addTerms( + input: { + termUrns: ["urn:li:glossaryTerm:rateofreturn"], + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + subResourceType:DATASET_FIELD, + subResource:"user_name"}) +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation addTerms { addTerms(input: { termUrns: 
[\"urn:li:glossaryTerm:rateofreturn\"], resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", subResourceType: DATASET_FIELD, subResource: \"user_name\" }) }", "variables":{}}' +``` + +Expected Response: + +```json +{ "data": { "addTerms": true }, "extensions": {} } +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_add_column_term.py +import logging +import time + +from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn +from datahub.emitter.mcp import MetadataChangeProposalWrapper + +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +# Imports for metadata model classes +from datahub.metadata.schema_classes import ( + AuditStampClass, + EditableSchemaFieldInfoClass, + EditableSchemaMetadataClass, + GlossaryTermAssociationClass, + GlossaryTermsClass, +) + +log = logging.getLogger(__name__) +logging.basicConfig(level=logging.INFO) + + +def get_simple_field_path_from_v2_field_path(field_path: str) -> str: + """A helper function to extract simple . path notation from the v2 field path""" + if not field_path.startswith("[version=2.0]"): + # not a v2, we assume this is a simple path + return field_path + # this is a v2 field path + tokens = [ + t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]")) + ] + + return ".".join(tokens) + + +# Inputs -> the column, dataset and the term to set +column = "address.zipcode" +dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD") +term_to_add = make_term_urn("Classification.Location") + + +# First we get the current editable schema metadata +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + + +current_editable_schema_metadata = graph.get_aspect( + entity_urn=dataset_urn, aspect_type=EditableSchemaMetadataClass +) + + +# Some pre-built objects to help all the conditional pathways +now = int(time.time() * 1000) # milliseconds since epoch +current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion") + +term_association_to_add = GlossaryTermAssociationClass(urn=term_to_add) +term_aspect_to_set = GlossaryTermsClass( + terms=[term_association_to_add], auditStamp=current_timestamp +) +field_info_to_set = EditableSchemaFieldInfoClass( + fieldPath=column, glossaryTerms=term_aspect_to_set +) + +need_write = False +field_match = False +if current_editable_schema_metadata: + for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo: + if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column: + # we have some editable schema metadata for this field + field_match = True + if fieldInfo.glossaryTerms: + if term_to_add not in [x.urn for x in fieldInfo.glossaryTerms.terms]: + # this tag is not present + fieldInfo.glossaryTerms.terms.append(term_association_to_add) + need_write = True + else: + fieldInfo.glossaryTerms = term_aspect_to_set + need_write = True + + if not field_match: + # this field isn't present in the editable schema metadata aspect, add it + field_info = field_info_to_set + current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info) + need_write = True + +else: + # create a brand new editable schema metadata aspect + current_editable_schema_metadata = EditableSchemaMetadataClass( + editableSchemaFieldInfo=[field_info_to_set], + created=current_timestamp, + ) + need_write = True + +if 
need_write: + event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper( + entityUrn=dataset_urn, + aspect=current_editable_schema_metadata, + ) + graph.emit(event) + log.info(f"Term {term_to_add} added to column {column} of dataset {dataset_urn}") + +else: + log.info(f"Term {term_to_add} already attached to column {column}, omitting write") + +``` + + + + +### Expected Outcome of Adding Terms + +You can now see `Rate of Return` term has been added to `user_name` column. + +

+ +

+ +## Remove Terms + +The following code remove a term from a dataset. +After running this code, `Rate of Return` term will be removed from a `user_name` column. + + + + +```json +mutation removeTerm { + removeTerm( + input: { + termUrn: "urn:li:glossaryTerm:rateofreturn", + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + subResourceType:DATASET_FIELD, + subResource:"user_name"}) +} +``` + +Note that you can also remove a term from a dataset if you don't specify `subResourceType` and `subResource`. + +```json +mutation removeTerm { + removeTerm( + input: { + termUrn: "urn:li:glossaryTerm:rateofreturn", + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)", + }) +} +``` + +Also note that you can remove terms from multiple entities or subresource using `batchRemoveTerms`. + +```json +mutation batchRemoveTerms { + batchRemoveTerms( + input: { + termUrns: ["urn:li:glossaryTerm:rateofreturn"], + resources: [ + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"} , + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"} ,] + } + ) +} +``` + + + + +```shell +curl --location --request POST 'http://localhost:8080/api/graphql' \ +--header 'Authorization: Bearer ' \ +--header 'Content-Type: application/json' \ +--data-raw '{ "query": "mutation removeTerm { removeTerm(input: { termUrn: \"urn:li:glossaryTerm:rateofreturn\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)\" }) }", "variables":{}}' +``` + + + + +```python +# Inlined from /metadata-ingestion/examples/library/dataset_remove_term_execute_graphql.py +# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough) +from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph + +gms_endpoint = "http://localhost:8080" +graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint)) + +# Query multiple aspects from entity +query = """ +mutation batchRemoveTerms { + batchRemoveTerms( + input: { + termUrns: ["urn:li:glossaryTerm:rateofreturn"], + resources: [ + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"} , + { resourceUrn:"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"} ,] + } + ) +} +""" +result = graph.execute_graphql(query=query) + +print(result) + +``` + + + + +### Expected Outcome of Removing Terms + +You can now see `Rate of Return` term has been removed to `user_name` column. + +

+ +

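+
+We can also verify the removal programmatically by reading the dataset-level `glossaryTerms` aspect back. The sketch below is illustrative rather than one of the inlined examples; it assumes the local endpoint used throughout this guide and a dataset-level removal (for example via `batchRemoveTerms` above).
+
+```python
+# Hedged verification sketch: confirm the term is gone from the dataset-level aspect.
+from datahub.emitter.mce_builder import make_dataset_urn
+from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
+from datahub.metadata.schema_classes import GlossaryTermsClass
+
+gms_endpoint = "http://localhost:8080"
+graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
+
+dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
+
+current_terms = graph.get_aspect(entity_urn=dataset_urn, aspect_type=GlossaryTermsClass)
+remaining = [term.urn for term in current_terms.terms] if current_terms else []
+
+# Expect False once the removal has been processed.
+print("urn:li:glossaryTerm:rateofreturn" in remaining)
+```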
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/architecture.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/architecture.md new file mode 100644 index 0000000000000..4e7a8e081cf63 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/architecture.md @@ -0,0 +1,45 @@ +--- +title: Overview +sidebar_label: Overview +slug: /architecture/architecture +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/architecture.md +--- + +# DataHub Architecture Overview + +DataHub is a [3rd generation](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained) Metadata Platform that enables Data Discovery, Collaboration, Governance, and end-to-end Observability +that is built for the Modern Data Stack. DataHub employs a model-first philosophy, with a focus on unlocking interoperability between +disparate tools & systems. + +The figures below describe the high-level architecture of DataHub. + +

+ +

+![Acryl DataHub System Architecture ](../../../../docs/managed-datahub/imgs/saas/DataHub-Architecture.png) + +For a more detailed look at the components that make up the Architecture, check out [Components](../components.md). + +## Architecture Highlights + +There are three main highlights of DataHub's architecture. + +### Schema-first approach to Metadata Modeling + +DataHub's metadata model is described using a [serialization agnostic language](https://linkedin.github.io/rest.li/pdl_schema). Both [REST](https://github.com/datahub-project/datahub/blob/master/metadata-service) as well as [GraphQL API-s](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/graphql) are supported. In addition, DataHub supports an [AVRO-based API](https://github.com/datahub-project/datahub/blob/master/metadata-events) over Kafka to communicate metadata changes and subscribe to them. Our [roadmap](../roadmap.md) includes a milestone to support no-code metadata model edits very soon, which will allow for even more ease of use, while retaining all the benefits of a typed API. Read about metadata modeling at [metadata modeling]. + +### Stream-based Real-time Metadata Platform + +DataHub's metadata infrastructure is stream-oriented, which allows for changes in metadata to be communicated and reflected within the platform within seconds. You can also subscribe to changes happening in DataHub's metadata, allowing you to build real-time metadata-driven systems. For example, you can build an access-control system that can observe a previously world-readable dataset adding a new schema field which contains PII, and locks down that dataset for access control reviews. + +### Federated Metadata Serving + +DataHub comes with a single [metadata service (gms)](https://github.com/datahub-project/datahub/blob/master/metadata-service) as part of the open source repository. However, it also supports federated metadata services which can be owned and operated by different teams –– in fact, that is how LinkedIn runs DataHub internally. The federated services communicate with the central search index and graph using Kafka, to support global search and discovery while still enabling decoupled ownership of metadata. This kind of architecture is very amenable for companies who are implementing [data mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html). + +[metadata modeling]: ../modeling/metadata-model.md +[PDL]: https://linkedin.github.io/rest.li/pdl_schema +[metadata architectures blog post]: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained +[datahub-serving]: metadata-serving.md +[datahub-ingestion]: metadata-ingestion.md +[react-frontend]: ../../datahub-web-react/README.md diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-ingestion.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-ingestion.md new file mode 100644 index 0000000000000..16e270a304c3c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-ingestion.md @@ -0,0 +1,42 @@ +--- +title: Ingestion Framework +sidebar_label: Ingestion Framework +slug: /architecture/metadata-ingestion +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/metadata-ingestion.md +--- + +# Metadata Ingestion Architecture + +DataHub supports an extremely flexible ingestion architecture that can support push, pull, asynchronous and synchronous models. 
+The figure below describes all the options possible for connecting your favorite system to DataHub. + +

+ +

+ +## Metadata Change Proposal: The Center Piece + +The center piece for ingestion are [Metadata Change Proposal]s which represent requests to make a metadata change to an organization's Metadata Graph. +Metadata Change Proposals can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses. + +## Pull-based Integration + +DataHub ships with a Python based [metadata-ingestion system](../../metadata-ingestion/README.md) that can connect to different sources to pull metadata from them. This metadata is then pushed via Kafka or HTTP to the DataHub storage tier. Metadata ingestion pipelines can be [integrated with Airflow](../../metadata-ingestion/README.md#lineage-with-airflow) to set up scheduled ingestion or capture lineage. If you don't find a source already supported, it is very easy to [write your own](../../metadata-ingestion/README.md#contributing). + +## Push-based Integration + +As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCP-s) at the point of origin. + +## Internal Components + +### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job) + +DataHub comes with a Spring job, [mce-consumer-job], which consumes the Metadata Change Proposals and writes them into the DataHub Metadata Service (datahub-gms) using the `/ingest` endpoint. + +[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp +[Metadata Change Proposal]: ../what/mxe.md#metadata-change-proposal-mcp +[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl +[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer +[mce-consumer-job]: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mce-consumer-job +[Python emitters]: ../../metadata-ingestion/README.md#using-as-a-library diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-serving.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-serving.md new file mode 100644 index 0000000000000..e46fe3ffc91ea --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/metadata-serving.md @@ -0,0 +1,69 @@ +--- +title: Serving Tier +sidebar_label: Serving Tier +slug: /architecture/metadata-serving +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/architecture/metadata-serving.md +--- + +# DataHub Serving Architecture + +The figure below shows the high-level system diagram for DataHub's Serving Tier. + +

+ +

+
+The primary component is called [the Metadata Service](https://github.com/datahub-project/datahub/blob/master/metadata-service) and exposes a REST API and a GraphQL API for performing CRUD operations on metadata. The service also exposes search and graph query APIs to support secondary-index style queries and full-text search queries, as well as relationship queries like lineage. In addition, the [datahub-frontend](https://github.com/datahub-project/datahub/blob/master/datahub-frontend) service exposes a GraphQL API on top of the metadata graph.
+
+## DataHub Serving Tier Components
+
+### Metadata Storage
+
+The DataHub Metadata Service persists metadata in a document store (an RDBMS such as MySQL or Postgres, or alternatively Cassandra).
+
+### Metadata Change Log Stream (MCL)
+
+The DataHub Service Tier also emits a commit event, the [Metadata Change Log], when a metadata change has been successfully committed to persistent storage. This event is sent over Kafka.
+
+The MCL stream is a public API and can be subscribed to by external systems (for example, the Actions Framework), providing an extremely powerful way to react in real-time to changes happening in metadata. For example, you could build an access control enforcer that reacts to changes in metadata (e.g. a previously world-readable dataset now has a PII field) to immediately lock down the dataset in question.
+Note that not all MCPs will result in an MCL, because the DataHub serving tier will ignore any duplicate changes to metadata.
+
+### Metadata Index Applier (mae-consumer-job)
+
+[Metadata Change Log]s are consumed by another Spring job, [mae-consumer-job], which applies the changes to the [graph] and [search index] accordingly.
+The job is entity-agnostic: when a specific metadata aspect changes, it invokes the corresponding graph & search index builders, and each builder instructs the job how to update the graph and search index based on the metadata change.
+
+To ensure that metadata changes are processed in the correct chronological order, MCLs are keyed by the entity [URN], meaning all MCLs for a particular entity will be processed sequentially by a single thread.
+
+### Metadata Query Serving
+
+Primary-key based reads (e.g. getting schema metadata for a dataset based on the `dataset-urn`) on metadata are routed to the document store. Secondary index based reads on metadata are routed to the search index (or alternatively can use the strongly consistent secondary index support described [here](https://github.com/datahub-project/datahub/blob/master/docs/architecture/)). Full-text and advanced search queries are routed to the search index. Complex graph queries such as lineage are routed to the graph index.
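+
+To make the routing above concrete, here is a hedged sketch of a full-text search issued through the GraphQL API served at `/api/graphql` by `datahub-frontend`. The host, access token, and query text are placeholders, and the exact field selection may differ slightly between DataHub versions.
+
+```python
+import requests
+
+# GraphQL search: this request is ultimately routed to the search index.
+query = """
+query search($input: SearchInput!) {
+  search(input: $input) {
+    total
+    searchResults { entity { urn type } }
+  }
+}
+"""
+variables = {"input": {"type": "DATASET", "query": "user events", "start": 0, "count": 10}}
+
+resp = requests.post(
+    "http://localhost:9002/api/graphql",
+    headers={"Authorization": "Bearer <personal-access-token>"},
+    json={"query": query, "variables": variables},
+)
+resp.raise_for_status()
+for result in resp.json()["data"]["search"]["searchResults"]:
+    print(result["entity"]["urn"])
+```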
+
+[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
+[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
+[Pegasus]: https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates
+[relationship]: ../what/relationship.md
+[entity]: ../what/entity.md
+[aspect]: ../what/aspect.md
+[GMS]: ../what/gms.md
+[Metadata Change Log]: ../what/mxe.md#metadata-change-log-mcl
+[rest.li]: https://rest.li
+[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp
+[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl
+[MCP]: ../what/mxe.md#metadata-change-proposal-mcp
+[MCL]: ../what/mxe.md#metadata-change-log-mcl
+[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
+[graph]: ../what/graph.md
+[search index]: ../what/search-index.md
+[mce-consumer-job]: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mce-consumer-job
+[mae-consumer-job]: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mae-consumer-job
+[Remote DAO]: ../architecture/metadata-serving.md#remote-dao
+[URN]: ../what/urn.md
+[Metadata Modelling]: ../modeling/metadata-model.md
+[Entity]: ../what/entity.md
+[Relationship]: ../what/relationship.md
+[Search Document]: ../what/search-document.md
+[metadata aspect]: ../what/aspect.md
+[Python emitters]: /docs/metadata-ingestion/#using-as-a-library
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/architecture/stemming_and_synonyms.md b/docs-website/versioned_docs/version-0.10.4/docs/architecture/stemming_and_synonyms.md
new file mode 100644
index 0000000000000..1d6fb95b4993c
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/architecture/stemming_and_synonyms.md
@@ -0,0 +1,163 @@
+---
+title: "About DataHub [Stemming and Synonyms Support]"
+sidebar_label: "[Stemming and Synonyms Support]"
+slug: /architecture/stemming_and_synonyms
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/architecture/stemming_and_synonyms.md
+---
+
+import FeatureAvailability from '@site/src/components/FeatureAvailability';
+
+# About DataHub [Stemming and Synonyms Support]
+
+<FeatureAvailability/>
+
+This feature set extends our current search implementation to make search results more relevant. The included improvements are:
+
+- Stemming - Using a multi-language stemmer to allow better partial matching based on lexicographical roots, e.g. "logs", "logging", and "logger" all resolve to the root "log".
+- Urn matching - Partial and full Urns previously did not give desirable behavior in search results; they are now properly indexed and queried to give better matches.
+- Word breaks across special characters - Previously, when typing a query like "logging_events", autocomplete would fail to resolve results after the underscore was typed until at least "logging_eve" had been entered; the same occurred with spaces. This has been resolved.
+- Synonyms - A currently static list of synonyms that will match across search results has been added. We will evolve this list over time to better match jargon terms to their full-word equivalents. For example, typing "staging" in a query can resolve datasets with "stg" in their name.
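+
+If you want to see how these analyzer-level changes behave on a running deployment, the rough sketch below asks the underlying search cluster to tokenize a query string directly via the `_analyze` API. It assumes network access to DataHub's Elasticsearch/OpenSearch instance; the index and analyzer names shown are illustrative and may differ between DataHub versions.
+
+```python
+import requests
+
+# Ask Elasticsearch to show how one of the dataset index analyzers tokenizes a query.
+resp = requests.post(
+    "http://localhost:9200/datasetindex_v2/_analyze",
+    json={"analyzer": "word_delimited", "text": "logging_events"},
+)
+resp.raise_for_status()
+print([t["token"] for t in resp.json()["tokens"]])
+```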
+
+
+## [Stemming and Synonyms Support] Setup, Prerequisites, and Permissions
+
+A reindex is required for this feature to work, as it modifies non-dynamic mappings and settings in the index. This reindex is carried out as part of the bootstrapping process by
+DataHub Upgrade, which has been added to the helm charts and docker-compose files as a required component, with default configurations that should work for most deployments.
+The job uses existing credentials and permissions for ElasticSearch to achieve this. During the reindex, writes to ElasticSearch will fail, so it is recommended to schedule an outage during this time. If doing a rolling update, old versions of GMS should still be able to serve queries, but at minimum ingestion traffic needs to be stopped. Estimated downtime for instances on the order of a few million records is ~30 minutes. Larger instances may require several hours, though.
+Once the reindex has succeeded, a message will be sent to new GMS and MCL/MAE Consumer instances that the state is ready for them to start up. Until then, they will hold off on starting, using an exponential backoff to check for readiness.
+
+Relevant configuration for the Upgrade Job:
+
+### Helm Values
+
+```yaml
+global:
+  elasticsearch:
+    ## The following section controls when and how reindexing of elasticsearch indices is performed
+    index:
+      ## Enable reindexing when mappings change based on the data model annotations
+      enableMappingsReindex: false
+
+      ## Enable reindexing when static index settings change.
+      ## Dynamic settings which do not require reindexing are not affected
+      ## Primarily this should be enabled when re-sharding is necessary for scaling/performance.
+      enableSettingsReindex: false
+
+      ## Index settings can be overridden for entity indices or other indices on an index by index basis
+      ## Some index settings, such as # of shards, require reindexing while others, i.e. replicas, do not
+      ## Non-Entity indices do not require the prefix
+      # settingsOverrides: '{"graph_service_v1":{"number_of_shards":"5"},"system_metadata_service_v1":{"number_of_shards":"5"}}'
+      ## Entity indices do not require the prefix or suffix
+      # entitySettingsOverrides: '{"dataset":{"number_of_shards":"10"}}'
+
+      ## The amount of delay between indexing a document and having it returned in queries
+      ## Increasing this value can improve performance when ingesting large amounts of data
+      # refreshIntervalSeconds: 1
+
+      ## The following options control settings for the datahub-upgrade job when creating or reindexing indices
+      upgrade:
+        enabled: true
+
+        ## When reindexing is required, this option will clone the existing index as a backup
+        ## The clone indices are not currently managed.
+        # cloneIndices: true
+
+        ## Typically when reindexing, the document counts between the original and destination indices should match.
+        ## In some cases reindexing might not be able to proceed due to incompatibilities between a document in the
+        ## original index and the new index's mappings. This document could be dropped and re-ingested or restored from
+        ## the SQL database.
+        ##
+        ## This setting allows continuing if and only if the cloneIndices setting is also enabled, which
+        ## ensures a complete backup of the original index is preserved.
+        # allowDocCountMismatch: false
+```
+
+### Docker Environment Variables
+
+- ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX - Controls whether to perform a reindex for mappings mismatches
+- ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX - Controls whether to perform a reindex for settings mismatches
+- ELASTICSEARCH_BUILD_INDICES_ALLOW_DOC_COUNT_MISMATCH - Used in conjunction with ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES to allow users to skip past document count mismatches when reindexing. Count mismatches may indicate dropped records during the reindex, so to prevent data loss this is only allowed if cloning is enabled.
+- ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES - Enables creating a clone of the current index to prevent data loss; defaults to true
+- ELASTICSEARCH_BUILD_INDICES_INITIAL_BACK_OFF_MILLIS - Controls the GMS and MCL Consumer backoff for checking if the reindex process has completed during start up. It is recommended to leave the defaults, which will result in waiting up to ~5 minutes before killing the start-up process, allowing a new pod to attempt to start up in orchestrated deployments.
+- ELASTICSEARCH_BUILD_INDICES_MAX_BACK_OFFS
+- ELASTICSEARCH_BUILD_INDICES_BACK_OFF_FACTOR
+- ELASTICSEARCH_BUILD_INDICES_WAIT_FOR_BUILD_INDICES - Controls whether to require waiting for the Build Indices job to finish. Defaults to true. It is not recommended to change this, as it will allow GMS and MCL Consumers to start up in an error state.
+
+## Using [Stemming and Synonyms Support]
+
+### Stemming
+
+Stemming uses the root of a word, without suffixes, to match on the intent of a search when a user is not quite sure of the precise name of a resource.
+
+
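+
+As a purely conceptual illustration of stemming (using NLTK's Snowball stemmer rather than DataHub's actual analyzer), note how different inflections reduce to a shared root, which is what lets them match the same query:
+
+```python
+from nltk.stem.snowball import SnowballStemmer
+
+stemmer = SnowballStemmer("english")
+for word in ["log", "logs", "logging", "event", "events"]:
+    # e.g. "logs" and "logging" both reduce to "log"; "events" reduces to "event"
+    print(word, "->", stemmer.stem(word))
+```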

+ +

+ +In this first image stemming is shown in the results. Even though the query is "event", the results contain instances with "events." + +

+ +

+ +The second image exemplifies stemming on a query. The query is for "events", but the results show resources containing "event" as well. + +### Urn Matching + +Previously queries were not properly parsing out and tokenizing the expected portions of Urn types. Changes have been made on the index mapping and query side to support various partial and full Urn matching. + +

+ +

+ +

+ +

+ +

+ +

+ +### Synonyms + +Synonyms includes a static list of equivalent terms that are baked into the index at index creation time. This allows for efficient indexing of related terms. It is possible to add these to the query side as well to +allow for dynamic synonyms, but this is unsupported at this time and has performance implications. + +

+ +

+ +

+ +

+ +### Autocomplete improvements + +Improvements were made to autocomplete handling around special characters like underscores and spaces. + +

+ +

+ +

+ +

+ +

+ +

+ +## Additional Resources + +### Videos + +**DataHub TownHall: Search Improvements Preview** + +

+ +

+
+## FAQ and Troubleshooting
+
+_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/README.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/README.md
new file mode 100644
index 0000000000000..e6a79405a1222
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/README.md
@@ -0,0 +1,62 @@
+---
+title: Overview
+slug: /authentication
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/authentication/README.md
+---
+
+# Overview
+
+Authentication is the process of verifying the identity of a user or service. There are two
+places where Authentication occurs inside DataHub:
+
+1. DataHub frontend service when a user attempts to log in to the DataHub application.
+2. DataHub backend service when making API requests to DataHub.
+
+In this document, we'll take a closer look at both.
+
+### Authentication in the Frontend
+
+Authentication of normal users of DataHub takes place in two phases.
+
+At login time, authentication is performed by either DataHub itself (via username / password entry) or a third-party Identity Provider. Once the identity
+of the user has been established, and credentials validated, a persistent session token is generated for the user and stored
+in a browser-side session cookie.
+
+DataHub provides 3 mechanisms for authentication at login time:
+
+- **Native Authentication** which uses username and password combinations natively stored and managed by DataHub, with users invited via an invite link.
+- [Single Sign-On with OpenID Connect](guides/sso/configure-oidc-react.md) to delegate authentication responsibility to third party systems like Okta or Google/Azure Authentication. This is the recommended approach for production systems.
+- [JaaS Authentication](guides/jaas.md) for simple deployments where authenticated users are part of some known list or invited as a [Native DataHub User](guides/add-users.md).
+
+In subsequent requests, the session token is used to represent the authenticated identity of the user, and is validated by DataHub's backend service (discussed below).
+Eventually, the session token expires (after 24 hours by default), at which point the end user is required to log in again.
+
+### Authentication in the Backend (Metadata Service)
+
+When a user makes a request for data within DataHub, the request is authenticated by DataHub's Backend (Metadata Service) via a JSON Web Token. This applies to both requests originating from the DataHub application,
+and programmatic calls to DataHub APIs. There are two types of tokens that are important:
+
+1. **Session Tokens**: Generated for users of the DataHub web application. By default, these have a duration of 24 hours.
+   These tokens are encoded and stored inside browser-side session cookies.
+2. **Personal Access Tokens**: These are tokens generated via the DataHub settings panel, useful for interacting
+   with DataHub APIs. They can be used to automate processes like enriching documentation, ownership, tags, and more on DataHub. Learn
+   more about Personal Access Tokens [here](personal-access-tokens.md).
+
+To learn more about DataHub's backend authentication, check out [Introducing Metadata Service Authentication](introducing-metadata-service-authentication.md).
+
+Credentials must be provided as Bearer Tokens inside of the **Authorization** header in any request made to DataHub's API layer.
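+
+For instance, here is a hedged Python sketch of attaching a Personal Access Token as a Bearer credential when calling the Metadata Service programmatically (the host and token values are placeholders, and `/config` is simply a convenient read-only endpoint to test against):
+
+```python
+import requests
+
+DATAHUB_GMS = "http://localhost:8080"
+TOKEN = "<your-personal-access-token>"  # generated from the DataHub settings panel
+
+resp = requests.get(
+    f"{DATAHUB_GMS}/config",
+    headers={"Authorization": f"Bearer {TOKEN}"},
+)
+resp.raise_for_status()
+print(resp.json())
+```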
+The header itself takes the following form:
+
+```shell
+Authorization: Bearer 
+```
+
+Note that in DataHub local quickstarts, Authentication at the backend layer is disabled for convenience. This leaves the backend
+vulnerable to unauthenticated requests and should not be used in production. To enable
+backend (token-based) authentication, simply set the `METADATA_SERVICE_AUTH_ENABLED=true` environment variable
+for the datahub-gms container or pod.
+
+### References
+
+For a quick video on the topic of users and groups within DataHub, have a look at [DataHub Basics — Users, Groups, & Authentication 101](https://youtu.be/8Osw6p9vDYY)
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/changing-default-credentials.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/changing-default-credentials.md
new file mode 100644
index 0000000000000..39b15444958bf
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/changing-default-credentials.md
@@ -0,0 +1,162 @@
+---
+title: Changing the default user credentials
+slug: /authentication/changing-default-credentials
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/authentication/changing-default-credentials.md
+---
+
+# Changing the default user credentials
+
+## Default User Credential
+
+The 'datahub' root user is created for you by default. This user is controlled via a [user.props](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/conf/user.props) file which [JaaS Authentication](./guides/jaas.md) is configured to use.
+
+By default, the credential file looks like this for each and every self-hosted DataHub deployment:
+
+```
+// default user.props
+datahub:datahub
+```
+
+Obviously, this is not ideal from a security perspective. It is highly recommended that this file
+is changed _prior_ to deploying DataHub to production at your organization.
+
+:::warning
+Please note that deleting the `Data Hub` user in the UI **WILL NOT** disable the default user.
+You will still be able to log in using the default 'datahub:datahub' credentials.
+To safely delete the default credentials, please follow the guide provided below.
+
+:::
+
+## Changing the default user `datahub`
+
+The method for changing the default user depends on how DataHub is deployed.
+
+- [Helm chart](#helm-chart)
+  - [Deployment Guide](/docs/deploy/kubernetes.md)
+- [Docker-compose](#docker-compose)
+  - [Deployment Guide](../../docker/README.md)
+- [Quickstart](#quickstart)
+  - [Deployment Guide](/docs/quickstart.md)
+
+### Helm chart
+
+You'll need to create a Kubernetes secret, then mount the file as a volume to the datahub-frontend pod.
+
+#### 1. Create a new config file
+
+Create a new version of [user.props](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/conf/user.props) which defines the updated password for the datahub user.
+
+To remove the user 'datahub' from the new file, simply omit the username. Please note that you can also choose to leave the file empty.
+For example, to change the password for the DataHub root user to 'newpassword', your file would contain the following:
+
+```
+// new user.props
+datahub:newpassword
+```
+
+#### 2. Create a Kubernetes secret
+
+Create a secret from your local `user.props` file.
+
+```shell
+kubectl create secret generic datahub-users-secret --from-file=user.props=./
+```
+
+#### 3.
Mount the config file + +Configure your [values.yaml](https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#LL22C1-L22C1) to add the volume to the datahub-frontend container. + +```yaml +datahub-frontend: + ... + extraVolumes: + - name: datahub-users + secret: + defaultMode: 0444 + secretName: datahub-users-secret + extraVolumeMounts: + - name: datahub-users + mountPath: /datahub-frontend/conf/user.props + subPath: user.props +``` + +#### 4. Restart Datahub + +Restart the DataHub containers or pods to pick up the new configs. +For example, you could run the following command to upgrade the current helm deployment. + +```shell +helm upgrade datahub datahub/datahub --values +``` + +Note that if you update the secret you will need to restart the datahub-frontend pods so the changes are reflected. To update the secret in-place you can run something like this. + +``` +kubectl create secret generic datahub-users-secret --from-file=user.props=./ -o yaml --dry-run=client | kubectl apply -f - +``` + +### Docker-compose + +#### 1. Modify a config file + +Modify [user.props](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/conf/user.props) which defines the updated password for the datahub user. + +To remove the user 'datahub' from the new file, simply omit the username. Please note that you can also choose to leave the file empty. +For example, to change the password for the DataHub root user to 'newpassword', your file would contain the following: + +``` +// new user.props +datahub:newpassword +``` + +#### 2. Mount the updated config file + +Change the [docker-compose.yaml](https://github.com/datahub-project/datahub/blob/master/docker/docker-compose.yml) to mount an updated user.props file to the following location inside the `datahub-frontend-react` container using a volume:`/datahub-frontend/conf/user.props` + +```yaml + datahub-frontend-react: + ... + volumes: + ... + - :/datahub-frontend/conf/user.props +``` + +#### 3. Restart DataHub + +Restart the DataHub containers or pods to pick up the new configs. + +### Quickstart + +#### 1. Modify a config file + +Modify [user.props](https://github.com/datahub-project/datahub/blob/master/datahub-frontend/conf/user.props) which defines the updated password for the datahub user. + +To remove the user 'datahub' from the new file, simply omit the username. Please note that you can also choose to leave the file empty. +For example, to change the password for the DataHub root user to 'newpassword', your file would contain the following: + +``` +// new user.props +datahub:newpassword +``` + +#### 2. Mount the updated config file + +In [docker-compose file used in quickstart](https://github.com/datahub-project/datahub/blob/master/docker/quickstart/docker-compose.quickstart.yml). +Modify the [datahub-frontend-react block](https://github.com/datahub-project/datahub/blob/master/docker/quickstart/docker-compose.quickstart.yml#L116) to contain the extra volume mount. + +```yaml + datahub-frontend-react: + ... + volumes: + ... + - :/datahub-frontend/conf/user.props +``` + +#### 3. Restart Datahub + +Run the following command. 
+
+```
+datahub docker quickstart --quickstart-compose-file .yml
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/concepts.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/concepts.md
new file mode 100644
index 0000000000000..fbb32e487721f
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/concepts.md
@@ -0,0 +1,131 @@
+---
+title: Concepts & Key Components
+slug: /authentication/concepts
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/authentication/concepts.md
+---
+
+# Concepts & Key Components
+
+We introduced a few important concepts to the Metadata Service to make authentication work:
+
+1. Actor
+2. Authenticator
+3. AuthenticatorChain
+4. AuthenticationFilter
+5. DataHub Access Token
+6. DataHub Token Service
+
+In the following sections, we'll take a closer look at each individually.
+
+

+ +

+
+*High level overview of Metadata Service Authentication*
+
+## What is an Actor?
+
+An **Actor** is a concept within the new Authentication subsystem to represent a unique identity / principal that is initiating actions (e.g. read & write requests)
+on the platform.
+
+An actor can be characterized by 2 attributes:
+
+1. **Type**: The "type" of the actor making a request. The purpose is, for example, to distinguish between a "user" actor and a "service" actor. Currently, the "user" actor type is the only one
+   formally supported.
+2. **Id**: A unique identifier for the actor within DataHub. This is commonly known as a "principal" in other systems. In the case of users, this
+   represents a unique "username". This username is in turn used when converting from the "Actor" concept into a Metadata Entity Urn (e.g. CorpUserUrn).
+
+For example, the root "datahub" super user would have the following attributes:
+
+```
+{
+  "type": "USER",
+  "id": "datahub"
+}
+```
+
+This is mapped to the CorpUser urn:
+
+```
+urn:li:corpuser:datahub
+```
+
+for Metadata retrieval.
+
+## What is an Authenticator?
+
+An **Authenticator** is a pluggable component inside the Metadata Service that is responsible for authenticating an inbound request provided context about the request (currently, the request headers).
+Authentication boils down to successfully resolving an **Actor** to associate with the inbound request.
+
+There can be many types of Authenticator. For example, there can be Authenticators that
+
+- Verify the authenticity of access tokens (i.e. issued by either DataHub itself or a 3rd-party IdP)
+- Authenticate username / password credentials against a remote database (i.e. LDAP)
+
+and more! A key goal of the abstraction is _extensibility_: a custom Authenticator can be developed to authenticate requests
+based on an organization's unique needs.
+
+DataHub ships with 2 Authenticators by default:
+
+- **DataHubSystemAuthenticator**: Verifies that inbound requests have originated from inside DataHub itself using a shared system identifier
+  and secret. This authenticator is always present.
+
+- **DataHubTokenAuthenticator**: Verifies that inbound requests contain a DataHub-issued Access Token (discussed further in the "DataHub Access Token" section below) in their
+  'Authorization' header. This authenticator is required if Metadata Service Authentication is enabled.
+
+## What is an AuthenticatorChain?
+
+An **AuthenticatorChain** is a series of **Authenticators** that are configured to run one after another. This allows
+for configuring multiple ways to authenticate a given request, for example via LDAP OR via local key file.
+
+Only if each Authenticator within the chain fails to authenticate a request will it be rejected.
+
+The Authenticator Chain can be configured in the `application.yml` file under `authentication.authenticators`:
+
+```
+authentication:
+  ....
+  authenticators:
+    # Configure the Authenticators in the chain
+    - type: com.datahub.authentication.Authenticator1
+      ...
+    - type: com.datahub.authentication.Authenticator2
+    ....
+```
+
+## What is the AuthenticationFilter?
+
+The **AuthenticationFilter** is a [servlet filter](http://tutorials.jenkov.com/java-servlets/servlet-filters.html) that authenticates each and every request to the Metadata Service.
+It does so by constructing and invoking an **AuthenticatorChain**, described above.
+
+If an Actor is unable to be resolved by the AuthenticatorChain, then a 401 unauthorized exception will be returned by the filter.
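+
+From a client's point of view, the filter's behavior looks roughly like the hedged sketch below: with Metadata Service Authentication enabled, a request whose Actor cannot be resolved is rejected with a 401, while the same request with a valid DataHub Access Token succeeds. The endpoint and token are placeholders, not an exhaustive contract.
+
+```python
+import requests
+
+graphql = {"query": "{ me { corpUser { urn } } }"}
+
+# No credentials: the AuthenticatorChain cannot resolve an Actor.
+anonymous = requests.post("http://localhost:8080/api/graphql", json=graphql)
+print(anonymous.status_code)  # typically 401 when authentication is enabled
+
+# With a DataHub-issued Access Token, the DataHubTokenAuthenticator resolves the Actor.
+authed = requests.post(
+    "http://localhost:8080/api/graphql",
+    json=graphql,
+    headers={"Authorization": "Bearer <access-token>"},
+)
+print(authed.status_code)
+```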
+
+## What is a DataHub Token Service? What are Access Tokens?
+
+Along with Metadata Service Authentication comes an important new component called the **DataHub Token Service**. The purpose of this
+component is twofold:
+
+1. Generate Access Tokens that grant access to the Metadata Service
+2. Verify the validity of Access Tokens presented to the Metadata Service
+
+**Access Tokens** granted by the Token Service take the form of [JSON Web Tokens](https://jwt.io/introduction), a type of stateless token which
+has a finite lifespan & is verified using a unique signature. JWTs can also contain a set of claims embedded within them. Tokens issued by the Token
+Service contain the following claims:
+
+- exp: the expiration time of the token
+- version: version of the DataHub Access Token for purposes of evolvability (currently 1)
+- type: The type of token, currently SESSION (used for UI-based sessions) or PERSONAL (used for personal access tokens)
+- actorType: The type of the **Actor** associated with the token. Currently, USER is the only type supported.
+- actorId: The id of the **Actor** associated with the token.
+
+Today, Access Tokens are granted by the Token Service under two scenarios:
+
+1. **UI Login**: When a user logs into the DataHub UI, for example via [JaaS](guides/jaas.md) or
+   [OIDC](guides/sso/configure-oidc-react.md), the `datahub-frontend` service issues a
+   request to the Metadata Service to generate a SESSION token _on behalf of_ the user logging in. (Only the frontend service is authorized to perform this action.)
+2. **Generating Personal Access Tokens**: When a user requests to generate a Personal Access Token (described below) from the UI.
+
+> At present, the Token Service supports the symmetric signing method `HS256` to generate and verify tokens.
+
+Now that we're familiar with the concepts, we will talk concretely about what new capabilities have been built on top
+of Metadata Service Authentication.
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/add-users.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/add-users.md
new file mode 100644
index 0000000000000..bcf33122aaf3a
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/add-users.md
@@ -0,0 +1,217 @@
+---
+title: Onboarding Users to DataHub
+slug: /authentication/guides/add-users
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/add-users.md
+---
+
+# Onboarding Users to DataHub
+
+New user accounts can be provisioned on DataHub in 3 ways:
+
+1. Shared Invite Links
+2. Single Sign-On using [OpenID Connect](https://www.google.com/search?q=openid+connect&oq=openid+connect&aqs=chrome.0.0i131i433i512j0i512l4j69i60l2j69i61.1468j0j7&sourceid=chrome&ie=UTF-8)
+3. Static Credential Configuration File (Self-Hosted Only)
+
+The first option is the easiest to get started with. The second is recommended for deploying DataHub in production. The third should
+be reserved for special circumstances where access must be closely monitored and controlled, and is only relevant for Self-Hosted instances.
+
+# Shared Invite Links
+
+### Generating an Invite Link
+
+If you have the `Manage User Credentials` [Platform Privilege](../../authorization/access-policies-guide.md), you can invite new users to DataHub by sharing an invite link.
+
+To do so, navigate to the **Users & Groups** section inside the Settings page.
Here you can generate a shareable invite link by clicking the `Invite Users` button. If you +do not have the correct privileges to invite users, this button will be disabled. + +

+ +

+ +To invite new users, simply share the link with others inside your organization. + +

+ +

+ +When a new user visits the link, they will be directed to a sign up screen where they can create their DataHub account. + +### Resetting User Passwords + +To reset a user's password, navigate to the Users & Groups tab, find the user who needs their password reset, +and click **Reset user password** inside the menu dropdown on the right hand side. Note that a user must have the +`Manage User Credentials` [Platform Privilege](../../authorization/access-policies-guide.md) in order to reset passwords. + +

+ +

+ +To reset the password, simply share the password reset link with the user who needs to change their password. Password reset links expire after 24 hours. + +

+ +

+ +# Configuring Single Sign-On with OpenID Connect + +Setting up Single Sign-On via OpenID Connect enables your organization's users to login to DataHub via a central Identity Provider such as + +- Azure AD +- Okta +- Keycloak +- Ping! +- Google Identity + +and many more. + +This option is strongly recommended for production deployments of DataHub. + +### Managed DataHub + +Single Sign-On can be configured and enabled by navigating to **Settings** > **SSO** > **OIDC**. Note +that a user must have the **Manage Platform Settings** [Platform Privilege](../../authorization/access-policies-guide.md) +in order to configure SSO settings. + +To complete the integration, you'll need the following: + +1. **Client ID** - A unique identifier for your application with the identity provider +2. **Client Secret** - A shared secret to use for exchange between you and your identity provider +3. **Discovery URL** - A URL where the OpenID settings for your identity provider can be discovered. + +These values can be obtained from your Identity Provider by following Step 1 on the [OpenID Connect Authentication](sso/configure-oidc-react.md)) Guide. + +### Self-Hosted DataHub + +For information about configuring Self-Hosted DataHub to use OpenID Connect (OIDC) to +perform authentication, check out [OIDC Authentication](sso/configure-oidc-react.md). + +> **A note about user URNs**: User URNs are unique identifiers for users on DataHub. The username received from an Identity Provider +> when a user logs into DataHub via OIDC is used to construct a unique identifier for the user on DataHub. The urn is computed as: +> `urn:li:corpuser:` +> +> By default, the email address will be the username extracted from the Identity Provider. For information about customizing +> the claim should be treated as the username in Datahub, check out the [OIDC Authentication](sso/configure-oidc-react.md) documentation. + +# Static Credential Configuration File (Self-Hosted Only) + +User credentials can be managed via a [JaaS Authentication](./jaas.md) configuration file containing +static username and password combinations. By default, the credentials for the root 'datahub' users are configured +using this mechanism. It is highly recommended that admins change or remove the default credentials for this user + +## Adding new users using a user.props file + +To define a set of username / password combinations that should be allowed to log in to DataHub (in addition to the root 'datahub' user), +create a new file called `user.props` at the file path `${HOME}/.datahub/plugins/frontend/auth/user.props` within the `datahub-frontend-react` container +or pod. + +This file should contain username:password specifications, with one on each line. For example, to create 2 new users, +with usernames "janesmith" and "johndoe", we would define the following file: + +``` +// custom user.props +janesmith:janespassword +johndoe:johnspassword +``` + +Once you've saved the file, simply start the DataHub containers & navigate to `http://localhost:9002/login` +to verify that your new credentials work. + +To change or remove existing login credentials, edit and save the `user.props` file. Then restart DataHub containers. + +If you want to customize the location of the `user.props` file, or if you're deploying DataHub via Helm, proceed to Step 2. + +### (Advanced) Mount custom user.props file to container + +This step is only required when mounting custom credentials into a Kubernetes pod (e.g. 
Helm) **or** if you want to change +the default filesystem location from which DataHub mounts a custom `user.props` file (`${HOME}/.datahub/plugins/frontend/auth/user.props)`. + +If you are deploying with `datahub docker quickstart`, or running using Docker Compose, you can most likely skip this step. + +#### Docker Compose + +You'll need to modify the `docker-compose.yml` file to mount a container volume mapping your custom user.props to the standard location inside the container +(`/etc/datahub/plugins/frontend/auth/user.props`). + +For example, to mount a user.props file that is stored on my local filesystem at `/tmp/datahub/user.props`, we'd modify the YAML for the +`datahub-web-react` config to look like the following: + +```aidl + datahub-frontend-react: + build: + context: ../ + dockerfile: docker/datahub-frontend/Dockerfile + image: linkedin/datahub-frontend-react:${DATAHUB_VERSION:-head} + ..... + # The new stuff + volumes: + - ${HOME}/.datahub/plugins:/etc/datahub/plugins + - /tmp/datahub:/etc/datahub/plugins/frontend/auth +``` + +Once you've made this change, restarting DataHub enable authentication for the configured users. + +#### Helm + +You'll need to create a Kubernetes secret, then mount the file as a volume to the `datahub-frontend` pod. + +First, create a secret from your local `user.props` file + +```shell +kubectl create secret generic datahub-users-secret --from-file=user.props=./ +``` + +Then, configure your `values.yaml` to add the volume to the `datahub-frontend` container. + +```YAML +datahub-frontend: + ... + extraVolumes: + - name: datahub-users + secret: + defaultMode: 0444 + secretName: datahub-users-secret + extraVolumeMounts: + - name: datahub-users + mountPath: /etc/datahub/plugins/frontend/auth/user.props + subPath: user.props +``` + +Note that if you update the secret you will need to restart the `datahub-frontend` pods so the changes are reflected. To update the secret in-place you can run something like this. + +```shell +kubectl create secret generic datahub-users-secret --from-file=user.props=./ -o yaml --dry-run=client | kubectl apply -f - +``` + +> A note on user URNs: User URNs are unique identifiers for users of DataHub. The usernames defined in the `user.props` file will be used to generate the DataHub user "urn", which uniquely identifies +> the user on DataHub. The urn is computed as `urn:li:corpuser:{username}`, where "username is defined inside your user.props file." + +## Changing the default 'datahub' user credentials (Recommended) + +Please refer to [Changing the default user credentials](../changing-default-credentials.md). + +## Caveats + +### Adding User Details + +If you add a new username / password to the `user.props` file, no other information about the user will exist +about the user in DataHub (full name, email, bio, etc). This means that you will not be able to search to find the user. + +In order for the user to become searchable, simply navigate to the new user's profile page (top-right corner) and click +**Edit Profile**. Add some details like a display name, an email, and more. Then click **Save**. Now you should be able +to find the user via search. + +> You can also use our Python Emitter SDK to produce custom information about the new user via the CorpUser metadata entity. + +For a more comprehensive overview of how users & groups are managed within DataHub, check out [this video](https://www.youtube.com/watch?v=8Osw6p9vDYY). + +## FAQ + +1. Can I enable OIDC and username / password (JaaS) authentication at the same time? 
+ +YES! If you have not explicitly disabled JaaS via an environment variable on the datahub-frontend container (AUTH_JAAS_ENABLED), +then you can always access the standard login flow at `http://your-datahub-url.com/login`. + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on Slack! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/jaas.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/jaas.md new file mode 100644 index 0000000000000..5ff10f4eb8745 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/jaas.md @@ -0,0 +1,77 @@ +--- +title: JaaS Authentication +slug: /authentication/guides/jaas +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/jaas.md +--- + +# JaaS Authentication + +## Overview + +The DataHub frontend server comes with support for plugging in [JaaS](https://docs.oracle.com/javase/7/docs/technotes/guides/security/jaas/JAASRefGuide.html) modules. +This allows you to use a custom authentication protocol to log your users into DataHub. + +By default, we in include sample configuration of a file-based username / password authentication module ([PropertyFileLoginModule](http://archive.eclipse.org/jetty/8.0.0.M3/apidocs/org/eclipse/jetty/plus/jaas/spi/PropertyFileLoginModule.html)) +that is configured with a single username / password combination: datahub - datahub. + +To change or extend the default behavior, you have multiple options, each dependent on which deployment environment you're operating in. + +### Modify user.props file directly (Local Testing) + +The first option for customizing file-based users is to modify the file `datahub-frontend/app/conf/user.props` directly. +Once you've added your desired users, you can simply run `./dev.sh` or `./datahub-frontend/run-local-frontend` to validate your +new users can log in. + +### Mount a custom user.props file (Docker Compose) + +By default, the `datahub-frontend` container will look for a file called `user.props` mounted at the container path +`/datahub-frontend/conf/user.props`. If you wish to launch this container with a custom set of users, you'll need to override the default +file mounting when running using `docker-compose`. + +To do so, change the `datahub-frontend-react` service in the docker-compose.yml file containing it to include the custom file: + +``` +datahub-frontend-react: + build: + context: ../ + dockerfile: docker/datahub-frontend/Dockerfile + image: linkedin/datahub-frontend-react:${DATAHUB_VERSION:-head} + env_file: datahub-frontend/env/docker.env + hostname: datahub-frontend-react + container_name: datahub-frontend-react + ports: + - "9002:9002" + depends_on: + - datahub-gms + volumes: + - ./my-custom-dir/user.props:/datahub-frontend/conf/user.props +``` + +And then run `docker-compose up` against your compose file. + +## Custom JaaS Configuration + +In order to change the default JaaS module configuration, you will have to launch the `datahub-frontend-react` container with the custom `jaas.conf` file mounted as a volume +at the location `/datahub-frontend/conf/jaas.conf`. 
+ +To do so, change the `datahub-frontend-react` service in the docker-compose.yml file containing it to include the custom file: + +``` +datahub-frontend-react: + build: + context: ../ + dockerfile: docker/datahub-frontend/Dockerfile + image: linkedin/datahub-frontend-react:${DATAHUB_VERSION:-head} + env_file: datahub-frontend/env/docker.env + hostname: datahub-frontend-react + container_name: datahub-frontend-react + ports: + - "9002:9002" + depends_on: + - datahub-gms + volumes: + - ./my-custom-dir/jaas.conf:/datahub-frontend/conf/jaas.conf +``` + +And then run `docker-compose up` against your compose file. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-azure.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-azure.md new file mode 100644 index 0000000000000..ee975691223fe --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-azure.md @@ -0,0 +1,130 @@ +--- +title: Configuring Azure Authentication for React App (OIDC) +slug: /authentication/guides/sso/configure-oidc-react-azure +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react-azure.md +--- + +# Configuring Azure Authentication for React App (OIDC) + +_Authored on 21/12/2021_ + +`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to +delegate authentication responsibility to identity providers like Microsoft Azure. + +This guide will provide steps for configuring DataHub authentication using Microsoft Azure. + +:::caution +Even when OIDC is configured, the root user can still login without OIDC by going +to `/login` URL endpoint. It is recommended that you don't use the default +credentials by mounting a different file in the front end container. To do this +please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment. +::: + +## Steps + +### 1. Create an application registration in Microsoft Azure portal + +a. Using an account linked to your organization, navigate to the [Microsoft Azure Portal](https://portal.azure.com). + +b. Select **App registrations**, then **New registration** to register a new app. + +c. Name your app registration and choose who can access your application. + +d. Select `Web` as the **Redirect URI** type and enter the following: + +``` +https://your-datahub-domain.com/callback/oidc +``` + +If you are just testing locally, the following can be used: `http://localhost:9002/callback/oidc`. +Azure supports more than one redirect URI, so both can be configured at the same time from the **Authentication** tab once the registration is complete. + +At this point, your app registration should look like the following: + +

+ +

+ +e. Click **Register**. + +### 2. Configure Authentication (optional) + +Once registration is done, you will land on the app registration **Overview** tab. On the left-side navigation bar, click on **Authentication** under **Manage** and add extra redirect URIs if need be (if you want to support both local testing and Azure deployments). + +

+ +

+
+Click **Save**.
+
+### 3. Configure Certificates & secrets
+
+On the left-side navigation bar, click on **Certificates & secrets** under **Manage**.
+Select **Client secrets**, then **New client secret**. Type in a meaningful description for your secret and select an expiry. Click the **Add** button when you are done.
+
+**IMPORTANT:** Copy the `value` of your newly created secret, since Azure will never display its value afterwards.
+
+

+ +

+ +### 4. Configure API permissions + +On the left-side navigation bar, click on **API permissions** under **Manage**. DataHub requires the following four Microsoft Graph APIs: + +1. `User.Read` _(should be already configured)_ +2. `profile` +3. `email` +4. `openid` + +Click on **Add a permission**, then from the **Microsoft APIs** tab select **Microsoft Graph**, then **Delegated permissions**. From the **OpenId permissions** category, select `email`, `openid`, `profile` and click **Add permissions**. + +At this point, you should be looking at a screen like the following: + +

+ +

+ +### 5. Obtain Application (Client) ID + +On the left-side navigation bar, go back to the **Overview** tab. You should see the `Application (client) ID`. Save its value for the next step. + +### 6. Obtain Discovery URI + +On the same page, you should see a `Directory (tenant) ID`. Your OIDC discovery URI will be formatted as follows: + +``` +https://login.microsoftonline.com/{tenant ID}/v2.0/.well-known/openid-configuration +``` + +### 7. Configure `datahub-frontend` to enable OIDC authentication + +a. Open the file `docker/datahub-frontend/env/docker.env` + +b. Add the following configuration values to the file: + +``` +AUTH_OIDC_ENABLED=true +AUTH_OIDC_CLIENT_ID=your-client-id +AUTH_OIDC_CLIENT_SECRET=your-client-secret +AUTH_OIDC_DISCOVERY_URI=https://login.microsoftonline.com/{tenant ID}/v2.0/.well-known/openid-configuration +AUTH_OIDC_BASE_URL=your-datahub-url +AUTH_OIDC_SCOPE="openid profile email" +``` + +Replacing the placeholders above with the client id (step 5), client secret (step 3) and tenant ID (step 6) received from Microsoft Azure. + +### 9. Restart `datahub-frontend-react` docker container + +Now, simply restart the `datahub-frontend-react` container to enable the integration. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react +``` + +Navigate to your DataHub domain to see SSO in action. + +## Resources + +- [Microsoft identity platform and OpenID Connect protocol](https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc/) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-google.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-google.md new file mode 100644 index 0000000000000..581c0130e27f9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-google.md @@ -0,0 +1,120 @@ +--- +title: Configuring Google Authentication for React App (OIDC) +slug: /authentication/guides/sso/configure-oidc-react-google +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react-google.md +--- + +# Configuring Google Authentication for React App (OIDC) + +_Authored on 3/10/2021_ + +`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to delegate +authentication responsibility to identity providers like Google. + +This guide will provide steps for configuring DataHub authentication using Google. + +:::caution +Even when OIDC is configured, the root user can still login without OIDC by going +to `/login` URL endpoint. It is recommended that you don't use the default +credentials by mounting a different file in the front end container. To do this +please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment. +::: + +## Steps + +### 1. Create a project in the Google API Console + +Using an account linked to your organization, navigate to the [Google API Console](https://console.developers.google.com/) and select **New project**. +Within this project, we will configure the OAuth2.0 screen and credentials. + +### 2. Create OAuth2.0 consent screen + +a. Navigate to `OAuth consent screen`. This is where you'll configure the screen your users see when attempting to +log in to DataHub. + +b. 
Select `Internal` (if you only want your company users to have access) and then click **Create**. +Note that in order to complete this step you should be logged into a Google account associated with your organization. + +c. Fill out the details in the App Information & Domain sections. Make sure the 'Application Home Page' provided matches where DataHub is deployed +at your organization. + +

+ +

+ +Once you've completed this, **Save & Continue**. + +d. Configure the scopes: Next, click **Add or Remove Scopes**. Select the following scopes: + + - `.../auth/userinfo.email` + - `.../auth/userinfo.profile` + - `openid` + +Once you've selected these, **Save & Continue**. + +### 3. Configure client credentials + +Now navigate to the **Credentials** tab. This is where you'll obtain your client id & secret, as well as configure info +like the redirect URI used after a user is authenticated. + +a. Click **Create Credentials** & select `OAuth client ID` as the credential type. + +b. On the following screen, select `Web application` as your Application Type. + +c. Add the domain where DataHub is hosted to your 'Authorized Javascript Origins'. + +``` +https://your-datahub-domain.com +``` + +d. Add the domain where DataHub is hosted with the path `/callback/oidc` appended to 'Authorized Redirect URLs'. + +``` +https://your-datahub-domain.com/callback/oidc +``` + +e. Click **Create** + +f. You will now receive a pair of values, a client id and a client secret. Bookmark these for the next step. + +At this point, you should be looking at a screen like the following: + +

+ +

+
+Success!
+
+### 4. Configure `datahub-frontend` to enable OIDC authentication
+
+a. Open the file `docker/datahub-frontend/env/docker.env`
+
+b. Add the following configuration values to the file:
+
+```
+AUTH_OIDC_ENABLED=true
+AUTH_OIDC_CLIENT_ID=your-client-id
+AUTH_OIDC_CLIENT_SECRET=your-client-secret
+AUTH_OIDC_DISCOVERY_URI=https://accounts.google.com/.well-known/openid-configuration
+AUTH_OIDC_BASE_URL=your-datahub-url
+AUTH_OIDC_SCOPE="openid profile email"
+AUTH_OIDC_USER_NAME_CLAIM=email
+AUTH_OIDC_USER_NAME_CLAIM_REGEX=([^@]+)
+```
+
+Replacing the placeholders above with the client id & client secret received from Google in Step 3f.
+
+### 5. Restart `datahub-frontend-react` docker container
+
+Now, simply restart the `datahub-frontend-react` container to enable the integration.
+
+```
+docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react
+```
+
+Navigate to your DataHub domain to see SSO in action.
+
+## References
+
+- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect)
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-okta.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-okta.md
new file mode 100644
index 0000000000000..a2816cf79de1c
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react-okta.md
@@ -0,0 +1,127 @@
+---
+title: Configuring Okta Authentication for React App (OIDC)
+slug: /authentication/guides/sso/configure-oidc-react-okta
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react-okta.md
+---
+
+# Configuring Okta Authentication for React App (OIDC)
+
+_Authored on 3/10/2021_
+
+`datahub-frontend` server can be configured to authenticate users over OpenID Connect (OIDC). As such, it can be configured to
+delegate authentication responsibility to identity providers like Okta.
+
+This guide will provide steps for configuring DataHub authentication using Okta.
+
+:::caution
+Even when OIDC is configured, the root user can still login without OIDC by going
+to `/login` URL endpoint. It is recommended that you don't use the default
+credentials by mounting a different file in the front end container. To do this
+please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment.
+:::
+
+## Steps
+
+### 1. Create an application in Okta Developer Console
+
+a. Log in to your Okta admin account & navigate to the developer console.
+
+b. Select **Applications**, then **Add Application**, then **Create New App** to create a new app.
+
+c. Select `Web` as the **Platform**, and `OpenID Connect` as the **Sign on method**.
+
+d. Click **Create**.
+
+e. Under 'General Settings', name your application.
+
+f. Below, add a **Login Redirect URI**. This should be formatted as
+
+```
+https://your-datahub-domain.com/callback/oidc
+```
+
+If you're just testing locally, this can be `http://localhost:9002/callback/oidc`.
+
+g. Below, add a **Logout Redirect URI**. This should be formatted as
+
+```
+https://your-datahub-domain.com
+```
+
+h. [Optional] If you're enabling DataHub login as an Okta tile, you'll need to provide the **Initiate Login URI**. You
+can set it to
+
+```
+https://your-datahub-domain.com/authenticate
+```
+
+If you're just testing locally, this can be `http://localhost:9002`.
+
+i.
Click **Save** + +### 2. Obtain Client Credentials + +On the subsequent screen, you should see the client credentials. Bookmark the `Client id` and `Client secret` for the next step. + +### 3. Obtain Discovery URI + +On the same page, you should see an `Okta Domain`. Your OIDC discovery URI will be formatted as follows: + +``` +https://your-okta-domain.com/.well-known/openid-configuration +``` + +for example, `https://dev-33231928.okta.com/.well-known/openid-configuration`. + +At this point, you should be looking at a screen like the following: + +

+ +

+

+ +

+
+Success!
+
+### 4. Configure `datahub-frontend` to enable OIDC authentication
+
+a. Open the file `docker/datahub-frontend/env/docker.env`
+
+b. Add the following configuration values to the file:
+
+```
+AUTH_OIDC_ENABLED=true
+AUTH_OIDC_CLIENT_ID=your-client-id
+AUTH_OIDC_CLIENT_SECRET=your-client-secret
+AUTH_OIDC_DISCOVERY_URI=https://your-okta-domain.com/.well-known/openid-configuration
+AUTH_OIDC_BASE_URL=your-datahub-url
+AUTH_OIDC_SCOPE="openid profile email groups"
+```
+
+Replacing the placeholders above with the client id & client secret received from Okta in Step 2.
+
+> **Pro Tip!** You can easily enable Okta to return the groups that a user is associated with. These groups will be provisioned in DataHub, along with the user logging in, if they do not already exist in DataHub. This can be enabled by setting the `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` flag to `true`.
+> You can enable your Okta application to return a 'groups' claim from the Okta Console at Applications > Your Application -> Sign On -> OpenID Connect ID Token Settings (Requires an edit).
+>
+> By default, we assume that the groups will appear in a claim named "groups". This can be customized using the `AUTH_OIDC_GROUPS_CLAIM` container configuration.
+>
+>

+> + +

+ +### 5. Restart `datahub-frontend-react` docker container + +Now, simply restart the `datahub-frontend-react` container to enable the integration. + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react +``` + +Navigate to your DataHub domain to see SSO in action. + +## Resources + +- [OAuth 2.0 and OpenID Connect Overview](https://developer.okta.com/docs/concepts/oauth-openid/) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react.md new file mode 100644 index 0000000000000..fa4f33929c8df --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/guides/sso/configure-oidc-react.md @@ -0,0 +1,240 @@ +--- +title: Overview +slug: /authentication/guides/sso/configure-oidc-react +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react.md +--- + +# Overview + +The DataHub React application supports OIDC authentication built on top of the [Pac4j Play](https://github.com/pac4j/play-pac4j) library. +This enables operators of DataHub to integrate with 3rd party identity providers like Okta, Google, Keycloak, & more to authenticate their users. + +When configured, OIDC auth will be enabled between clients of the DataHub UI & `datahub-frontend` server. Beyond this point is considered +to be a secure environment and as such authentication is validated & enforced only at the "front door" inside datahub-frontend. + +:::caution +Even if OIDC is configured the root user can still login without OIDC by going +to `/login` URL endpoint. It is recommended that you don't use the default +credentials by mounting a different file in the front end container. To do this +please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS authenticated deployment. +::: + +## Provider-Specific Guides + +1. [Configuring OIDC using Google](configure-oidc-react-google.md) +2. [Configuring OIDC using Okta](configure-oidc-react-okta.md) +3. [Configuring OIDC using Azure](configure-oidc-react-azure.md) + +## Configuring OIDC in React + +### 1. Register an app with your Identity Provider + +To configure OIDC in React, you will most often need to register yourself as a client with your identity provider (Google, Okta, etc). Each provider may +have their own instructions. Provided below are links to examples for Okta, Google, Azure AD, & Keycloak. + +- [Registering an App in Okta](https://developer.okta.com/docs/guides/add-an-external-idp/apple/register-app-in-okta/) +- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect) +- [OpenID Connect authentication with Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/auth-oidc) +- [Keycloak - Securing Applications and Services Guide](https://www.keycloak.org/docs/latest/securing_apps/) + +During the registration process, you'll need to provide a login redirect URI to the identity provider. This tells the identity provider +where to redirect to once they've authenticated the end user. + +By default, the URL will be constructed as follows: + +> "http://your-datahub-domain.com/callback/oidc" + +For example, if you're hosted DataHub at `datahub.myorg.com`, this +value would be `http://datahub.myorg.com/callback/oidc`. 
For testing purposes you can also specify localhost as the domain name +directly: `http://localhost:9002/callback/oidc` + +The goal of this step should be to obtain the following values, which will need to be configured before deploying DataHub: + +1. **Client ID** - A unique identifier for your application with the identity provider +2. **Client Secret** - A shared secret to use for exchange between you and your identity provider +3. **Discovery URL** - A URL where the OIDC API of your identity provider can be discovered. This should suffixed by + `.well-known/openid-configuration`. Sometimes, identity providers will not explicitly include this URL in their setup guides, though + this endpoint _will_ exist as per the OIDC specification. For more info see http://openid.net/specs/openid-connect-discovery-1_0.html. + +### 2. Configure DataHub Frontend Server + +The second step to enabling OIDC involves configuring `datahub-frontend` to enable OIDC authentication with your Identity Provider. + +To do so, you must update the `datahub-frontend` [docker.env](https://github.com/datahub-project/datahub/blob/master/docker/datahub-frontend/env/docker.env) file with the +values received from your identity provider: + +``` +# Required Configuration Values: +AUTH_OIDC_ENABLED=true +AUTH_OIDC_CLIENT_ID=your-client-id +AUTH_OIDC_CLIENT_SECRET=your-client-secret +AUTH_OIDC_DISCOVERY_URI=your-provider-discovery-url +AUTH_OIDC_BASE_URL=your-datahub-url +``` + +- `AUTH_OIDC_ENABLED`: Enable delegating authentication to OIDC identity provider +- `AUTH_OIDC_CLIENT_ID`: Unique client id received from identity provider +- `AUTH_OIDC_CLIENT_SECRET`: Unique client secret received from identity provider +- `AUTH_OIDC_DISCOVERY_URI`: Location of the identity provider OIDC discovery API. Suffixed with `.well-known/openid-configuration` +- `AUTH_OIDC_BASE_URL`: The base URL of your DataHub deployment, e.g. https://yourorgdatahub.com (prod) or http://localhost:9002 (testing) + +Providing these configs will cause DataHub to delegate authentication to your identity +provider, requesting the "oidc email profile" scopes and parsing the "preferred_username" claim from +the authenticated profile as the DataHub CorpUser identity. + +> By default, the login callback endpoint exposed by DataHub will be located at `${AUTH_OIDC_BASE_URL}/callback/oidc`. This must **exactly** match the login redirect URL you've registered with your identity provider in step 1. + +In kubernetes, you can add the above env variables in the values.yaml as follows. + +```yaml +datahub-frontend: + ... + extraEnvs: + - name: AUTH_OIDC_ENABLED + value: "true" + - name: AUTH_OIDC_CLIENT_ID + value: your-client-id + - name: AUTH_OIDC_CLIENT_SECRET + value: your-client-secret + - name: AUTH_OIDC_DISCOVERY_URI + value: your-provider-discovery-url + - name: AUTH_OIDC_BASE_URL + value: your-datahub-url +``` + +You can also package OIDC client secrets into a k8s secret by running + +`kubectl create secret generic datahub-oidc-secret --from-literal=secret=<>` + +Then set the secret env as follows. + +```yaml +- name: AUTH_OIDC_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: datahub-oidc-secret + key: secret +``` + +#### Advanced + +You can optionally customize the flow further using advanced configurations. These allow +you to specify the OIDC scopes requested, how the DataHub username is parsed from the claims returned by the identity provider, and how users and groups are extracted and provisioned from the OIDC claim set. 
+ +``` +# Optional Configuration Values: +AUTH_OIDC_USER_NAME_CLAIM=your-custom-claim +AUTH_OIDC_USER_NAME_CLAIM_REGEX=your-custom-regex +AUTH_OIDC_SCOPE=your-custom-scope +AUTH_OIDC_CLIENT_AUTHENTICATION_METHOD=authentication-method +``` + +- `AUTH_OIDC_USER_NAME_CLAIM`: The attribute that will contain the username used on the DataHub platform. By default, this is "email" provided + as part of the standard `email` scope. +- `AUTH_OIDC_USER_NAME_CLAIM_REGEX`: A regex string used for extracting the username from the userNameClaim attribute. For example, if + the userNameClaim field will contain an email address, and we want to omit the domain name suffix of the email, we can specify a custom + regex to do so. (e.g. `([^@]+)`) +- `AUTH_OIDC_SCOPE`: a string representing the scopes to be requested from the identity provider, granted by the end user. For more info, + see [OpenID Connect Scopes](https://auth0.com/docs/scopes/openid-connect-scopes). +- `AUTH_OIDC_CLIENT_AUTHENTICATION_METHOD`: a string representing the token authentication method to use with the identity provider. Default value + is `client_secret_basic`, which uses HTTP Basic authentication. Another option is `client_secret_post`, which includes the client_id and secret_id + as form parameters in the HTTP POST request. For more info, see [OAuth 2.0 Client Authentication](https://darutk.medium.com/oauth-2-0-client-authentication-4b5f929305d4) + +Additional OIDC Options: + +- `AUTH_OIDC_PREFERRED_JWS_ALGORITHM` - Can be used to select a preferred signing algorithm for id tokens. Examples include: `RS256` or `HS256`. If + your IdP includes `none` before `RS256`/`HS256` in the list of signing algorithms, then this value **MUST** be set. + +##### User & Group Provisioning (JIT Provisioning) + +By default, DataHub will optimistically attempt to provision users and groups that do not already exist at the time of login. +For users, we extract information like first name, last name, display name, & email to construct a basic user profile. If a groups claim is present, +we simply extract their names. + +The default provisioning behavior can be customized using the following configs. + +``` +# User and groups provisioning +AUTH_OIDC_JIT_PROVISIONING_ENABLED=true +AUTH_OIDC_PRE_PROVISIONING_REQUIRED=false +AUTH_OIDC_EXTRACT_GROUPS_ENABLED=false +AUTH_OIDC_GROUPS_CLAIM= +``` + +- `AUTH_OIDC_JIT_PROVISIONING_ENABLED`: Whether DataHub users & groups should be provisioned on login if they do not exist. Defaults to true. +- `AUTH_OIDC_PRE_PROVISIONING_REQUIRED`: Whether the user should already exist in DataHub when they login, failing login if they are not. This is appropriate for situations in which users and groups are batch ingested and tightly controlled inside your environment. Defaults to false. +- `AUTH_OIDC_EXTRACT_GROUPS_ENABLED`: Only applies if `AUTH_OIDC_JIT_PROVISIONING_ENABLED` is set to true. This determines whether we should attempt to extract a list of group names from a particular claim in the OIDC attributes. Note that if this is enabled, each login will re-sync group membership with the groups in your Identity Provider, clearing the group membership that has been assigned through the DataHub UI. Enable with care! Defaults to false. +- `AUTH_OIDC_GROUPS_CLAIM`: Only applies if `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` is set to true. This determines which OIDC claims will contain a list of string group names. Accepts multiple claim names with comma-separated values. I.e: `groups, teams, departments`. Defaults to 'groups'. 
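
As an illustration, a hypothetical `docker.env` fragment that keeps the default `email` claim but strips the domain suffix from usernames, and also provisions groups from a `groups` claim, could look like the sketch below (the values are placeholders, not a recommended configuration):

```
# Illustrative advanced configuration (placeholder values)
AUTH_OIDC_USER_NAME_CLAIM=email
AUTH_OIDC_USER_NAME_CLAIM_REGEX=([^@]+)
AUTH_OIDC_SCOPE="openid profile email groups"
AUTH_OIDC_EXTRACT_GROUPS_ENABLED=true
AUTH_OIDC_GROUPS_CLAIM=groups
```

With settings like these, a user authenticating as, say, `jane.doe@example.com` would be provisioned as the DataHub user `jane.doe`, and the groups listed in the `groups` claim would be re-synced on each login (see the caveat on `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` above).
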
+ +Once configuration has been updated, `datahub-frontend-react` will need to be restarted to pick up the new environment variables: + +``` +docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml up datahub-frontend-react +``` + +> Note that by default, enabling OIDC will _not_ disable the dummy JAAS authentication path, which can be reached at the `/login` +> route of the React app. To disable this authentication path, additionally specify the following config: +> `AUTH_JAAS_ENABLED=false` + +### Summary + +Once configured, deploying the `datahub-frontend-react` container will enable an indirect authentication flow in which DataHub delegates +authentication to the specified identity provider. + +Once a user is authenticated by the identity provider, DataHub will extract a username from the provided claims +and grant DataHub access to the user by setting a pair of session cookies. + +A brief summary of the steps that occur when the user navigates to the React app are as follows: + +1. A `GET` to the `/authenticate` endpoint in `datahub-frontend` server is initiated +2. The `/authenticate` attempts to authenticate the request via session cookies +3. If auth fails, the server issues a redirect to the Identity Provider's login experience +4. The user logs in with the Identity Provider +5. The Identity Provider authenticates the user and redirects back to DataHub's registered login redirect URL, providing an authorization code which + can be used to retrieve information on behalf of the authenticated user +6. DataHub fetches the authenticated user's profile and extracts a username to identify the user on DataHub (eg. urn:li:corpuser:username) +7. DataHub sets session cookies for the newly authenticated user +8. DataHub redirects the user to the homepage ("/") + +## FAQ + +**No users can log in. Instead, I get redirected to the login page with an error. What do I do?** + +This can occur for a variety of reasons, but most often it is due to misconfiguration of Single-Sign On, either on the DataHub +side or on the Identity Provider side. + +First, verify that all values are consistent across them (e.g. the host URL where DataHub is deployed), and that no values +are misspelled (client id, client secret). + +Next, verify that the scopes requested are supported by your Identity Provider +and that the claim (i.e. attribute) DataHub uses for uniquely identifying the user is supported by your Identity Provider (refer to Identity Provider OpenID Connect documentation). By default, this claim is `email`. + +Then, make sure the Discovery URI you've configured (`AUTH_OIDC_DISCOVERY_URI`) is accessible where the datahub-frontend container is running. You +can do this by issuing a basic CURL to the address (**Pro-Tip**: you may also visit the address in your browser to check more specific details about your Identity Provider). + +Finally, check the container logs for the `datahub-frontend` container. This should hopefully provide some additional context +around why exactly the login handoff is not working. + +If all else fails, feel free to reach out to the DataHub Community on Slack for +real-time support + +**I'm seeing an error in the `datahub-frontend` logs when a user tries to login** + +```shell +Caused by: java.lang.RuntimeException: Failed to resolve user name claim from profile provided by Identity Provider. Missing attribute. Attribute: 'email', Regex: '(.*)', Profile: { ... 
+``` + +**what do I do?** + +This indicates that your Identity Provider does not provide the claim with name 'email', which DataHub +uses by default to uniquely identify users within your organization. + +To fix this, you may need to + +1. Change the claim that is used as the unique user identifier to something else by changing the `AUTH_OIDC_USER_NAME_CLAIM` (e.g. to "name" or "preferred*username") \_OR* +2. Change the environment variable `AUTH_OIDC_SCOPE` to include the scope required to retrieve the claim with name "email" + +For the `datahub-frontend` container / pod. + +**Pro-Tip**: Check the documentation for your Identity Provider to learn more about the scope claims supported. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/introducing-metadata-service-authentication.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/introducing-metadata-service-authentication.md new file mode 100644 index 0000000000000..d565e03d19470 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/introducing-metadata-service-authentication.md @@ -0,0 +1,197 @@ +--- +title: Metadata Service Authentication +slug: /authentication/introducing-metadata-service-authentication +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/introducing-metadata-service-authentication.md +--- + +# Metadata Service Authentication + +## Introduction + +This document provides a technical overview of the how authentication works in DataHub's backend aimed at developers evaluating or operating DataHub. +It includes a characterization of the motivations for the feature, the key components in its design, the new capabilities it provides, & configuration instructions. + +## Background + +Let's recall 2 critical components of DataHub's architecture: + +- **DataHub Frontend Proxy** (datahub-frontend) - Resource server that routes requests to downstream Metadata Service +- **DataHub Metadata Service** (datahub-gms) - Source of truth for storing and serving DataHub Metadata Graph. + +Previously, Authentication was exclusively handled by the Frontend Proxy. This service would perform the following steps +when a user navigated to `http://localhost:9002/`: + +a. Check for the presence of a special `PLAY_SESSION` cookie. + +b. If cookie was present + valid, redirect to the home page + +c. If cookie was invalid, redirect to either a) the DataHub login screen (for [JAAS authentication](guides/jaas.md) or b) a [configured OIDC Identity Provider](guides/sso/configure-oidc-react.md) to perform authentication. + +Once authentication had succeeded at the frontend proxy layer, a stateless (token-based) session cookie (PLAY_SESSION) would be set in the users browser. +All subsequent requests, including the GraphQL requests issued by the React UI, would be authenticated using this session cookie. Once a request had made it beyond +the frontend service layer, it was assumed to have been already authenticated. Hence, there was **no native authentication inside of the Metadata Service**. + +### Problems with this approach + +The major challenge with this situation is that requests to the backend Metadata Service were completely unauthenticated. There were 2 options for folks who required authentication at the Metadata Service layer: + +1. Set up a proxy in front of Metadata Service that performed authentication +2. 
[A more recent possibility] Route requests to Metadata Service through DataHub Frontend Proxy, including the PLAY_SESSION + Cookie with every request. + +Neither of which are ideal. Setting up a proxy to do authentication takes time & expertise. Extracting and setting a session cookie from the browser for programmatic is +clunky & unscalable. On top of that, extending the authentication system was difficult, requiring implementing a new [Play module](https://www.playframework.com/documentation/2.8.8/api/java/play/mvc/Security.Authenticator.html) within DataHub Frontend. + +## Introducing Authentication in DataHub Metadata Service + +To address these problems, we introduced configurable Authentication inside the **Metadata Service** itself, +meaning that requests are no longer considered trusted until they are authenticated by the Metadata Service. + +Why push Authentication down? In addition to the problems described above, we wanted to plan for a future +where Authentication of Kafka-based-writes could be performed in the same manner as Rest writes. + +## Configuring Metadata Service Authentication + +Metadata Service Authentication is currently **opt-in**. This means that you may continue to use DataHub without Metadata Service Authentication without interruption. +To enable Metadata Service Authentication: + +- set the `METADATA_SERVICE_AUTH_ENABLED` environment variable to "true" for the `datahub-gms` AND `datahub-frontend` containers / pods. + +OR + +- change the Metadata Service `application.yml` configuration file to set `authentication.enabled` to "true" AND +- change the Frontend Proxy Service `application.config` configuration file to set `metadataService.auth.enabled` to "true" + +After setting the configuration flag, simply restart the Metadata Service to start enforcing Authentication. + +Once enabled, all requests to the Metadata Service will need to be authenticated; if you're using the default Authenticators +that ship with DataHub, this means that all requests will need to present an Access Token in the Authorization Header as follows: + +``` +Authorization: Bearer +``` + +For users logging into the UI, this process will be handled for you. When logging in, a cookie will be set in your browser that internally +contains a valid Access Token for the Metadata Service. When browsing the UI, this token will be extracted and sent to the Metadata Service +to authenticate each request. + +For users who want to access the Metadata Service programmatically, i.e. for running ingestion, the current recommendation is to generate +a **Personal Access Token** (described above) from the root "datahub" user account, and using this token when configuring your [Ingestion Recipes](../../metadata-ingestion/README.md#recipes). +To configure the token for use in ingestion, simply populate the "token" configuration for the `datahub-rest` sink: + +``` +source: + # source configs +sink: + type: "datahub-rest" + config: + ... + token: +``` + +> Note that ingestion occurring via `datahub-kafka` sink will continue to be Unauthenticated _for now_. Soon, we will be introducing +> support for providing an access token in the event payload itself to authenticate ingestion requests over Kafka. + +### The Role of DataHub Frontend Proxy Going Forward + +With these changes, DataHub Frontend Proxy will continue to play a vital part in the complex dance of Authentication. 
It will serve as the place +where UI-based session authentication originates and will continue to support 3rd Party SSO configuration (OIDC) +and JAAS configuration as it does today. + +The major improvement is that the Frontend Service will validate credentials provided at UI login time +and generate a DataHub **Access Token**, embedding it into traditional session cookie (which will continue to work). + +In summary, DataHub Frontend Service will continue to play a vital role to Authentication. It's scope, however, will likely +remain limited to concerns specific to the React UI. + +## Where to go from here + +These changes represent the first milestone in Metadata Service Authentication. They will serve as a foundation upon which we can build new features, prioritized based on Community demand: + +1. **Dynamic Authenticator Plugins**: Configure + register custom Authenticator implementations, without forking DataHub. +2. **Service Accounts**: Create service accounts and generate Access tokens on their behalf. +3. **Kafka Ingestion Authentication**: Authenticate ingestion requests coming from the Kafka ingestion sink inside the Metadata Service. +4. **Access Token Management**: Ability to view, manage, and revoke access tokens that have been generated. (Currently, access tokens inlcude no server side state, and thus cannot be revoked once granted) + +...and more! To advocate for these features or others, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email). + +## Q&As + +### What if I don't want to use Metadata Service Authentication? + +That's perfectly fine, for now. Metadata Service Authentication is disabled by default, only enabled if you provide the +environment variable `METADATA_SERVICE_AUTH_ENABLED` to the `datahub-gms` container or change the `authentication.enabled` to "true" +inside your DataHub Metadata Service configuration (`application.yml`). + +That being said, we will be recommending that you enable Authentication for production use cases, to prevent +arbitrary actors from ingesting metadata into DataHub. + +### If I enable Metadata Service Authentication, will ingestion stop working? + +If you enable Metadata Service Authentication, you will want to provide a value for the "token" configuration value +when using the `datahub-rest` sink in your [Ingestion Recipes](/docs/metadata-ingestion/#recipes). See +the [Rest Sink Docs](/docs/metadata-ingestion/sink_docs/datahub#config-details) for configuration details. + +We'd recommend generating a Personal Access Token (described above) from a trusted DataHub Account (e.g. root 'datahub' user) when configuring +your Ingestion sources. + +Note that you can also provide the "extraHeaders" configuration in `datahub-rest` sink to specify a custom header to +pass with each request. This can be used in conjunction to authenticate using a custom Authenticator, for example. + +### How do I generate an Access Token for a service account? + +There is no formal concept of "service account" or "bot" on DataHub (yet). For now, we recommend you configure any +programmatic clients of DataHub to use a Personal Access Token generated from a user with the correct privileges, for example +the root "datahub" user account. + +### I want to authenticate requests using a custom Authenticator? How do I do this? 
+ +You can configure DataHub to add your custom **Authenticator** to the **Authentication Chain** by changing the `application.yml` configuration file for the Metadata Service: + +```yml +authentication: + enabled: true # Enable Metadata Service Authentication + .... + authenticators: # Configure an Authenticator Chain + - type: # E.g. com.linkedin.datahub.authentication.CustomAuthenticator + configs: # Specific configs that should be passed into 'init' method of Authenticator + customConfig1: +``` + +Notice that you will need to have a class that implements the `Authenticator` interface with a zero-argument constructor available on the classpath +of the Metadata Service java process. + +We love contributions! Feel free to raise a PR to contribute an Authenticator back if it's generally useful. + +### Now that I can make authenticated requests to either DataHub Proxy Service and DataHub Metadata Service, which should I use? + +Previously, we were recommending that folks contact the Metadata Service directly when doing things like + +- ingesting Metadata via recipes +- issuing programmatic requests to the Rest.li APIs +- issuing programmatic requests to the GraphQL APIs + +With these changes, we will be shifting to the recommendation that folks direct all traffic, whether it's programmatic or not, +to the **DataHub Frontend Proxy**, as routing to Metadata Service endpoints is currently available at the path `/api/gms`. +This recommendation is in effort to minimize the exposed surface area of DataHub to make securing, operating, maintaining, and developing +the platform simpler. + +In practice, this will require migrating Metadata [Ingestion Recipes](../../metadata-ingestion/README.md#recipes) use the `datahub-rest` sink to pointing at a slightly different +host + path. + +Example recipe that proxies through DataHub Frontend + +```yml +source: + # source configs +sink: + type: "datahub-rest" + config: + ... + token: +``` + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email)! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authentication/personal-access-tokens.md b/docs-website/versioned_docs/version-0.10.4/docs/authentication/personal-access-tokens.md new file mode 100644 index 0000000000000..bb08c8bd4db83 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authentication/personal-access-tokens.md @@ -0,0 +1,118 @@ +--- +title: About DataHub Personal Access Tokens +sidebar_label: Personal Access Tokens +slug: /authentication/personal-access-tokens +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authentication/personal-access-tokens.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Personal Access Tokens + + + +Personal Access Tokens, or PATs for short, allow users to represent themselves in code and programmatically use DataHub's APIs in deployments where security is a concern. + +Used along-side with [authentication-enabled metadata service](introducing-metadata-service-authentication.md), PATs add a layer of protection to DataHub where only authorized users are able to perform actions in an automated way. + +## Personal Access Tokens Setup, Prerequisites, and Permissions + +To use PATs, two things are required: + +1. Metadata Authentication must have been enabled in GMS. 
See `Configuring Metadata Service Authentication` in [authentication-enabled metadata service](introducing-metadata-service-authentication.md). +2. Users must have been granted the `Generate Personal Access Tokens` or `Manage All Access Tokens` Privilege via a [DataHub Policy](../authorization/policies.md). + +Once configured, users should be able to navigate to **'Settings'** > **'Access Tokens'** > **'Generate Personal Access Token'** to generate a token: + +

+ +

+
+If you have configured permissions correctly, the `Generate new token` button should be clickable.
+
+:::note
+
+If you see `Token based authentication is currently disabled. Contact your DataHub administrator to enable this feature.`, then you must enable authentication in the metadata service (step 1 of the prerequisites).
+
+:::
+
+## Creating Personal Access Tokens
+
+Once in the Manage Access Tokens Settings Tab:
+
+1. Click `Generate new token`. A form should appear.
+

+ +

+ +2. Fill out the information as needed and click `Create`. +

+ +

+ +3. Save the token text somewhere secure! This is what will be used later on! +

+ +

+
+## Using Personal Access Tokens
+
+Once a token has been generated, the user that created it will subsequently be able to make authenticated HTTP requests, assuming they have permission to do so, to the DataHub frontend proxy or to DataHub GMS directly by providing
+the generated Access Token as a Bearer token in the `Authorization` header:
+
+```
+Authorization: Bearer <access-token>
+```
+
+For example, using curl against the frontend proxy (preferred in production):
+
+```bash
+curl 'http://localhost:9002/api/gms/entities/urn:li:corpuser:datahub' -H 'Authorization: Bearer <access-token>'
+```
+
+or against the Metadata Service directly:
+
+```bash
+curl 'http://localhost:8080/entities/urn:li:corpuser:datahub' -H 'Authorization: Bearer <access-token>'
+```
+
+Since authorization happens at the GMS level, ingestion is also protected behind access tokens. To use them, simply add a `token` to the sink config property as seen below:
+
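
For instance, a minimal `datahub-rest` sink configuration carrying a Personal Access Token might look like the sketch below; the `server` address and the token value are placeholders you would replace for your own deployment:

```yaml
source:
  # your source configs go here
sink:
  type: "datahub-rest"
  config:
    # GMS directly, or the frontend proxy route, e.g. http://localhost:9002/api/gms
    server: "http://localhost:8080"
    # the Personal Access Token generated above
    token: "<your-personal-access-token>"
```
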

+ +

+ +:::note + +Without an access token, making programmatic requests will result in a 401 result from the server if Metadata Service Authentication +is enabled. + +::: + +## Additional Resources + +- Learn more about how this feature is by DataHub [Authentication Metadata Service](introducing-metadata-service-authentication.md). +- Check out our [Authorization Policies](../authorization/policies.md) to see what permissions can be programatically used. + +### GraphQL + +- Have a look at [Token Management in GraphQL](../api/graphql/token-management.md) to learn how to manage tokens programatically! + +## FAQ and Troubleshooting + +**The button to create tokens is greyed out - why can’t I click on it?** + +This means that the user currently logged in DataHub does not have either `Generate Personal Access Tokens` or `Manage All Access Tokens` permissions. +Please ask your DataHub administrator to grant you those permissions. + +**When using a token, I get 401 unauthorized - why?** + +A PAT represents a user in DataHub, if that user does not have permissions for a given action, neither will the token. + +**Can I create a PAT that represents some other user?** + +Yes, although not through the UI correctly, you will have to use the [token management graphQL API](../api/graphql/token-management.md) and the user making the request must have `Manage All Access Tokens` permissions. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/README.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/README.md new file mode 100644 index 0000000000000..7d592e285716a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/README.md @@ -0,0 +1,25 @@ +--- +title: Overview +slug: /authorization +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/README.md +--- + +# Overview + +Authorization specifies _what_ accesses an _authenticated_ user has within a system. +This section is all about how DataHub authorizes a given user/service that wants to interact with the system. + +:::note + +Authorization only makes sense in the context of an **Authenticated** DataHub deployment. To use DataHub's authorization features +please first make sure that the system has been configured from an authentication perspective as you intend. + +::: + +Once the identity of a user or service has been established, DataHub determines what accesses the authenticated request has. + +This is done by checking what operation a given user/service wants to perform within DataHub & whether it is allowed to do so. +The set of operations that are allowed in DataHub are what we call **Policies**. + +Policies specify fine-grain access control for _who_ can do _what_ to _which_ resources, for more details on the set of Policies that DataHub provides please see the [Policies Guide](../authorization/policies.md). 
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/access-policies-guide.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/access-policies-guide.md new file mode 100644 index 0000000000000..d0aeaec60ecd7 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/access-policies-guide.md @@ -0,0 +1,344 @@ +--- +title: About DataHub Access Policies +sidebar_label: Access Policies +slug: /authorization/access-policies-guide +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/access-policies-guide.md +--- + +# About DataHub Access Policies + + + +Access Policies define who can do what to which resources. In conjunction with [Roles](./roles.md), Access Policies determine what users are allowed to do on DataHub. + +## Policy Types + +There are 2 types of Access Policy within DataHub: + +1. **Platform** Policies +2. **Metadata** Policies + +

+ +

+ +**Platform** Policies determine who has platform-level Privileges on DataHub. These include: + +- Managing Users & Groups +- Viewing the DataHub Analytics Page +- Managing Policies themselves + +Platform policies can be broken down into 2 parts: + +1. **Privileges**: Which privileges should be assigned to the Actors (e.g. "View Analytics") +2. **Actors**: Who the should be granted the privileges (Users, or Groups) + +A few Platform Policies in plain English include: + +- The Data Platform team should be allowed to manage users & groups, view platform analytics, & manage policies themselves +- John from IT should be able to invite new users + +**Metadata** policies determine who can do what to which Metadata Entities. For example: + +- Who can edit Dataset Documentation & Links? +- Who can add Owners to a Chart? +- Who can add Tags to a Dashboard? + +Metadata policies can be broken down into 3 parts: + +1. **Privileges**: The 'what'. What actions are being permitted by a Policy, e.g. "Add Tags". +2. **Resources**: The 'which'. Resources that the Policy applies to, e.g. "All Datasets". +3. **Actors**: The 'who'. Specific users, groups, & roles that the Policy applies to. + +A few **Metadata** Policies in plain English include: + +- Dataset Owners should be allowed to edit documentation, but not Tags. +- Jenny, our Data Steward, should be allowed to edit Tags for any Dashboard, but no other metadata. +- James, a Data Analyst, should be allowed to edit the Links for a specific Data Pipeline he is a downstream consumer of. + +Each of these can be implemented by constructing DataHub Access Policies. + +## Access Policies Setup, Prerequisites, and Permissions + +What you need to manage Access Policies on DataHub: + +- **Manage Policies** Privilege + +This Platform Privilege allows users to create, edit, and remove all Access Policies on DataHub. Therefore, it should only be +given to those users who will be serving as Admins of the platform. The default `Admin` role has this Privilege. + +## Using Access Policies + +Policies can be created by first navigating to **Settings > Permissions > Policies**. + +To begin building a new Policy, click **Create new Policy**. + +

+ +

+ +### Creating a Platform Policy + +#### Step 1. Provide a Name & Description + +In the first step, we select the **Platform** Policy type, and define a name and description for the new Policy. + +Good Policy names describe the high-level purpose of the Policy. For example, a Policy named +"View DataHub Analytics - Data Governance Team" would be a great way to describe a Platform +Policy which grants abilities to view DataHub's Analytics view to anyone on the Data Governance team. + +You can optionally provide a text description to add richer details about the purpose of the Policy. + +#### Step 2: Configure Privileges + +In the second step, we can simply select the Privileges that this Platform Policy will grant. + +

+ +

+ +**Platform** Privileges most often provide access to perform administrative functions on the Platform. These include: + +| Platform Privileges | Description | +| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| Manage Policies | Allow actor to create and remove access control policies. Be careful - Actors with this Privilege are effectively super users. | +| Manage Metadata Ingestion | Allow actor to create, remove, and update Metadata Ingestion sources. | +| Manage Secrets | Allow actor to create & remove secrets stored inside DataHub. | +| Manage Users & Groups | Allow actor to create, remove, and update users and groups on DataHub. | +| Manage All Access Tokens | Allow actor to create, remove, and list access tokens for all users on DataHub. | +| Create Domains | Allow the actor to create new Domains | +| Manage Domains | Allow actor to create and remove any Domains. | +| View Analytics | Allow the actor access to the DataHub analytics dashboard. | +| Generate Personal Access Tokens | Allow the actor to generate access tokens for personal use with DataHub APIs. | +| Manage User Credentials | Allow the actor to generate invite links for new native DataHub users, and password reset links for existing native users. | +| Manage Glossaries | Allow the actor to create, edit, move, and delete Glossary Terms and Term Groups | +| Create Tags | Allow the actor to create new Tags | +| Manage Tags | Allow the actor to create and remove any Tags | +| Manage Public Views | Allow the actor to create, edit, and remove any public (shared) Views. | +| Manage Ownership Types | Allow the actor to create, edit, and remove any Ownership Types. | +| Restore Indices API[^1] | Allow the actor to restore indices for a set of entities via API | +| Enable/Disable Writeability API[^1] | Allow the actor to enable or disable GMS writeability for use in data migrations | +| Apply Retention API[^1] | Allow the actor to apply aspect retention via API | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED environment flag is enabled + +#### Step 3: Choose Policy Actors + +In this step, we can select the actors who should be granted Privileges appearing on this Policy. + +To do so, simply search and select the Users or Groups that the Policy should apply to. + +**Assigning a Policy to a User** + +

+ +

+ +**Assigning a Policy to a Group** + +

+ +

+ +### Creating a Metadata Policy + +#### Step 1. Provide a Name & Description + +In the first step, we select the **Metadata** Policy, and define a name and description for the new Policy. + +Good Policy names describe the high-level purpose of the Policy. For example, a Policy named +"Full Dataset Edit Privileges - Data Platform Engineering" would be a great way to describe a Metadata +Policy which grants all abilities to edit Dataset Metadata to anyone in the "Data Platform" group. + +You can optionally provide a text description to add richer detail about the purpose of the Policy. + +#### Step 2: Configure Privileges + +In the second step, we can simply select the Privileges that this Metadata Policy will grant. +To begin, we should first determine which assets that the Privileges should be granted for (i.e. the _scope_), then +select the appropriate Privileges to grant. + +Using the `Resource Type` selector, we can narrow down the _type_ of the assets that the Policy applies to. If left blank, +all entity types will be in scope. + +For example, if we only want to grant access for `Datasets` on DataHub, we can select +`Datasets`. + +

+ +

+
+Next, we can search for specific Entities of the selected type(s) that the Policy should grant privileges on.
+If left blank, all entities of the selected types are in scope.
+
+For example, if we only want to grant access for a specific sample dataset, we can search and
+select it directly.
+

+ +

+ +We can also limit the scope of the Policy to assets that live in a specific **Domain**. If left blank, +entities from all Domains will be in scope. + +For example, if we only want to grant access for assets part of a "Marketing" Domain, we can search and +select it. + +

+ +

+ +Finally, we will choose the Privileges to grant when the selected entities fall into the defined +scope. + +

+ +

+ +**Metadata** Privileges grant access to change specific _entities_ (i.e. data assets) on DataHub. + +The common Metadata Privileges, which span across entity types, include: + +| Common Privileges | Description | +| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| View Entity Page | Allow actor to access the entity page for the resource in the UI. If not granted, it will redirect them to an unauthorized page. | +| Edit Tags | Allow actor to add and remove tags to an asset. | +| Edit Glossary Terms | Allow actor to add and remove glossary terms to an asset. | +| Edit Owners | Allow actor to add and remove owners of an entity. | +| Edit Description | Allow actor to edit the description (documentation) of an entity. | +| Edit Links | Allow actor to edit links associated with an entity. | +| Edit Status | Allow actor to edit the status of an entity (soft deleted or not). | +| Edit Domain | Allow actor to edit the Domain of an entity. | +| Edit Deprecation | Allow actor to edit the Deprecation status of an entity. | +| Edit Assertions | Allow actor to add and remove assertions from an entity. | +| Edit All | Allow actor to edit any information about an entity. Super user privileges. Controls the ability to ingest using API when REST API Authorization is enabled. | +| Get Timeline API[^1] | Allow actor to get the timeline of an entity via API. | +| Get Entity API[^1] | Allow actor to get an entity via API. | +| Get Timeseries Aspect API[^1] | Allow actor to get a timeseries aspect via API. | +| Get Aspect/Entity Count APIs[^1] | Allow actor to get aspect and entity counts via API. | +| Search API | Allow actor to search for entities via API. | +| Produce Platform Event API | Allow actor to ingest a platform event via API. | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED is true + +**Specific Metadata Privileges** include + +| Entity | Privilege | Description | +| ------------ | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Dataset | Edit Dataset Column Tags | Allow actor to edit the column (field) tags associated with a dataset schema. | +| Dataset | Edit Dataset Column Glossary Terms | Allow actor to edit the column (field) glossary terms associated with a dataset schema. | +| Dataset | Edit Dataset Column Descriptions | Allow actor to edit the column (field) descriptions associated with a dataset schema. | +| Dataset | Edit Dataset Queries | Allow actor to edit the Highlighted Queries on the Queries tab of the dataset. | +| Dataset | View Dataset Usage | Allow actor to access usage metadata about a dataset both in the UI and in the GraphQL API. This includes example queries, number of queries, etc. Also applies to REST APIs when REST API Authorization is enabled. | +| Dataset | View Dataset Profile | Allow actor to access a dataset's profile both in the UI and in the GraphQL API. This includes snapshot statistics like #rows, #columns, null percentage per field, etc. | +| Tag | Edit Tag Color | Allow actor to change the color of a Tag. | +| Group | Edit Group Members | Allow actor to add and remove members to a group. 
| +| User | Edit User Profile | Allow actor to change the user's profile including display name, bio, title, profile image, etc. | +| User + Group | Edit Contact Information | Allow actor to change the contact information such as email & chat handles. | + +> **Still have questions about Privileges?** Let us know in [Slack](https://slack.datahubproject.io)! + +#### Step 3: Choose Policy Actors + +In this step, we can select the actors who should be granted the Privileges on this Policy. Metadata Policies +can target specific Users & Groups, or the _owners_ of the Entities that are included in the scope of the Policy. + +To do so, simply search and select the Users or Groups that the Policy should apply to. + +

+ +

+ +

+ +

+
+We can also grant the Privileges to the _owners_ of Entities (or _Resources_) that are in scope for the Policy.
+This advanced functionality allows Admins of DataHub to closely control which actions can or cannot be performed by owners.
+

+ +

+
+### Updating an Existing Policy
+
+To update an existing Policy, simply click **Edit** on the Policy you wish to change.
+

+ +

+ +Then, make the changes required and click **Save**. When you save a Policy, it may take up to 2 minutes for changes +to be reflected. + +### Removing a Policy + +To remove a Policy, simply click on the trashcan icon located on the Policies list. This will remove the Policy and +deactivate it so that it no longer applies. + +When you delete a Policy, it may take up to 2 minutes for changes to be reflected. + +### Deactivating a Policy + +In addition to deletion, DataHub also supports "deactivating" a Policy. This is useful if you need to temporarily disable +a particular Policy, but do not want to remove it altogether. + +To deactivate a Policy, simply click the **Deactivate** button on the Policy you wish to deactivate. When you change +the state of a Policy, it may take up to 2 minutes for the changes to be reflected. + +

+ +

+ +After deactivating, you can re-enable a Policy by clicking **Activate**. + +### Default Policies + +Out of the box, DataHub is deployed with a set of pre-baked Policies. This set of policies serves the +following purposes: + +1. Assigns immutable super-user privileges for the root `datahub` user account (Immutable) +2. Assigns all Platform Privileges for all Users by default (Editable) + +The reason for #1 is to prevent people from accidentally deleting all policies and getting locked out (`datahub` super user account can be a backup) +The reason for #2 is to permit administrators to log in via OIDC or another means outside of the `datahub` root account +when they are bootstrapping with DataHub. This way, those setting up DataHub can start managing Access Policies without friction. +Note that these Privileges _can_ and likely _should_ be changed inside the **Policies** page before onboarding +your company's users. + +### REST API Authorization + +Policies only affect REST APIs when the environment variable `REST_API_AUTHORIZATION` is set to `true` for GMS. Some policies only apply when this setting is enabled, marked above, and other Metadata and Platform policies apply to the APIs where relevant, also specified in the table above. + +## Additional Resources + +- [Authorization Overview](./README.md) +- [Roles Overview](./roles.md) +- [Authorization using Groups](./groups.md) + +### Videos + +- [Introducing DataHub Access Policies](https://youtu.be/19zQCznqhMI?t=282) + +### GraphQL + +- [listPolicies](../../graphql/queries.md#listPolicies) +- [createPolicy](../../graphql/mutations.md#createPolicy) +- [updatePolicy](../../graphql/mutations.md#updatePolicy) +- [deletePolicy](../../graphql/mutations.md#deletePolicy) + +## FAQ and Troubleshooting + +**How do Policies relate to Roles?** + +Policies are the lowest level primitive for granting Privileges to users on DataHub. + +Roles are built for convenience on top of Policies. Roles grant Privileges to actors indirectly, driven by Policies +behind the scenes. Both can be used in conjunction to grant Privileges to end users. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Roles](./roles.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/groups.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/groups.md new file mode 100644 index 0000000000000..3882317dd7f81 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/groups.md @@ -0,0 +1,39 @@ +--- +title: Authorization using Groups +slug: /authorization/groups +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/groups.md +--- + +# Authorization using Groups + +## Introduction + +DataHub provides the ability to use **Groups** to manage policies. + +## Why do we need groups for authorization? + +### Easily Applying Access Privileges + +Groups are useful for managing user privileges in DataHub. If you want a set of Admin users, +or you want to define a set of users that are only able to view metadata assets but not make changes to them, you could +create groups for each of these use cases and apply the appropriate policies at the group-level rather than the +user-level. + +### Syncing with Existing Enterprise Groups (via IdP) + +If you work with an Identity Provider like Okta or Azure AD, it's likely you already have groups defined there. 
DataHub +allows you to import the groups you have from OIDC for [Okta](../generated/ingestion/sources/okta.md) and +[Azure AD](../generated/ingestion/sources/azure-ad.md) using the DataHub ingestion framework. + +If you routinely ingest groups from these providers, you will also be able to keep groups synced. New groups will +be created in DataHub, stale groups will be deleted, and group membership will be updated! + +## Custom Groups + +DataHub admins can create custom groups by going to the **Settings > Users & Groups > Groups > Create Group**. +Members can be added to Groups via the Group profile page. + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on Slack! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/policies.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/policies.md new file mode 100644 index 0000000000000..b6cf1cd5a4e55 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/policies.md @@ -0,0 +1,233 @@ +--- +title: Policies Guide +slug: /authorization/policies +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/policies.md +--- + +# Policies Guide + +## Introduction + +DataHub provides the ability to declare fine-grained access control Policies via the UI & GraphQL API. +Access policies in DataHub define _who_ can _do what_ to _which resources_. A few policies in plain English include + +- Dataset Owners should be allowed to edit documentation, but not Tags. +- Jenny, our Data Steward, should be allowed to edit Tags for any Dashboard, but no other metadata. +- James, a Data Analyst, should be allowed to edit the Links for a specific Data Pipeline he is a downstream consumer of. +- The Data Platform team should be allowed to manage users & groups, view platform analytics, & manage policies themselves. + +In this document, we'll take a deeper look at DataHub Policies & how to use them effectively. + +## What is a Policy? + +There are 2 types of Policy within DataHub: + +1. Platform Policies +2. Metadata Policies + +We'll briefly describe each. + +### Platform Policies + +**Platform** policies determine who has platform-level privileges on DataHub. These privileges include + +- Managing Users & Groups +- Viewing the DataHub Analytics Page +- Managing Policies themselves + +Platform policies can be broken down into 2 parts: + +1. **Actors**: Who the policy applies to (Users or Groups) +2. **Privileges**: Which privileges should be assigned to the Actors (e.g. "View Analytics") + +Note that platform policies do not include a specific "target resource" against which the Policies apply. Instead, +they simply serve to assign specific privileges to DataHub users and groups. + +### Metadata Policies + +**Metadata** policies determine who can do what to which Metadata Entities. For example, + +- Who can edit Dataset Documentation & Links? +- Who can add Owners to a Chart? +- Who can add Tags to a Dashboard? + +and so on. + +A Metadata Policy can be broken down into 3 parts: + +1. **Actors**: The 'who'. Specific users, groups that the policy applies to. +2. **Privileges**: The 'what'. What actions are being permitted by a policy, e.g. "Add Tags". +3. **Resources**: The 'which'. Resources that the policy applies to, e.g. "All Datasets". 
+ +#### Actors + +We currently support 3 ways to define the set of actors the policy applies to: a) list of users b) list of groups, and +c) owners of the entity. You also have the option to apply the policy to all users or groups. + +#### Privileges + +Check out the list of +privileges [here](https://github.com/datahub-project/datahub/blob/master/metadata-utils/src/main/java/com/linkedin/metadata/authorization/PoliciesConfig.java) +. Note, the privileges are semantic by nature, and does not tie in 1-to-1 with the aspect model. + +All edits on the UI are covered by a privilege, to make sure we have the ability to restrict write access. + +We currently support the following: + +**Platform-level** privileges for DataHub operators to access & manage the administrative functionality of the system. + +| Platform Privileges | Description | +| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| Manage Policies | Allow actor to create and remove access control policies. Be careful - Actors with this privilege are effectively super users. | +| Manage Metadata Ingestion | Allow actor to create, remove, and update Metadata Ingestion sources. | +| Manage Secrets | Allow actor to create & remove secrets stored inside DataHub. | +| Manage Users & Groups | Allow actor to create, remove, and update users and groups on DataHub. | +| Manage All Access Tokens | Allow actor to create, remove, and list access tokens for all users on DataHub. | +| Create Domains | Allow the actor to create new Domains | +| Manage Domains | Allow actor to create and remove any Domains. | +| View Analytics | Allow the actor access to the DataHub analytics dashboard. | +| Generate Personal Access Tokens | Allow the actor to generate access tokens for personal use with DataHub APIs. | +| Manage User Credentials | Allow the actor to generate invite links for new native DataHub users, and password reset links for existing native users. | +| Manage Glossaries | Allow the actor to create, edit, move, and delete Glossary Terms and Term Groups | +| Create Tags | Allow the actor to create new Tags | +| Manage Tags | Allow the actor to create and remove any Tags | +| Manage Public Views | Allow the actor to create, edit, and remove any public (shared) Views. | +| Restore Indices API[^1] | Allow the actor to restore indices for a set of entities via API | +| Enable/Disable Writeability API[^1] | Allow the actor to enable or disable GMS writeability for use in data migrations | +| Apply Retention API[^1] | Allow the actor to apply aspect retention via API | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED is true + +**Common metadata privileges** to view & modify any entity within DataHub. + +| Common Privileges | Description | +| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- | +| View Entity Page | Allow actor to access the entity page for the resource in the UI. If not granted, it will redirect them to an unauthorized page. | +| Edit Tags | Allow actor to add and remove tags to an asset. | +| Edit Glossary Terms | Allow actor to add and remove glossary terms to an asset. | +| Edit Owners | Allow actor to add and remove owners of an entity. | +| Edit Description | Allow actor to edit the description (documentation) of an entity. 
| +| Edit Links | Allow actor to edit links associated with an entity. | +| Edit Status | Allow actor to edit the status of an entity (soft deleted or not). | +| Edit Domain | Allow actor to edit the Domain of an entity. | +| Edit Deprecation | Allow actor to edit the Deprecation status of an entity. | +| Edit Assertions | Allow actor to add and remove assertions from an entity. | +| Edit All | Allow actor to edit any information about an entity. Super user privileges. Controls the ability to ingest using API when REST API Authorization is enabled. | | +| Get Timeline API[^1] | Allow actor to get the timeline of an entity via API. | +| Get Entity API[^1] | Allow actor to get an entity via API. | +| Get Timeseries Aspect API[^1] | Allow actor to get a timeseries aspect via API. | +| Get Aspect/Entity Count APIs[^1] | Allow actor to get aspect and entity counts via API. | +| Search API[^1] | Allow actor to search for entities via API. | +| Produce Platform Event API[^1] | Allow actor to ingest a platform event via API. | + +[^1]: Only active if REST_API_AUTHORIZATION_ENABLED is true + +**Specific entity-level privileges** that are not generalizable. + +| Entity | Privilege | Description | +| ------------ | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Dataset | Edit Dataset Column Tags | Allow actor to edit the column (field) tags associated with a dataset schema. | +| Dataset | Edit Dataset Column Glossary Terms | Allow actor to edit the column (field) glossary terms associated with a dataset schema. | +| Dataset | Edit Dataset Column Descriptions | Allow actor to edit the column (field) descriptions associated with a dataset schema. | +| Dataset | View Dataset Usage | Allow actor to access usage metadata about a dataset both in the UI and in the GraphQL API. This includes example queries, number of queries, etc. Also applies to REST APIs when REST API Authorization is enabled. | +| Dataset | View Dataset Profile | Allow actor to access a dataset's profile both in the UI and in the GraphQL API. This includes snapshot statistics like #rows, #columns, null percentage per field, etc. | +| Tag | Edit Tag Color | Allow actor to change the color of a Tag. | +| Group | Edit Group Members | Allow actor to add and remove members to a group. | +| User | Edit User Profile | Allow actor to change the user's profile including display name, bio, title, profile image, etc. | +| User + Group | Edit Contact Information | Allow actor to change the contact information such as email & chat handles. | +| GlossaryNode | Manage Direct Glossary Children | Allow the actor to create, edit, and delete the direct children of the selected entities. | +| GlossaryNode | Manage All Glossary Children | Allow the actor to create, edit, and delete everything underneath the selected entities. | + +#### Resources + +Resource filter defines the set of resources that the policy applies to is defined using a list of criteria. Each +criterion defines a field type (like resource_type, resource_urn, domain), a list of field values to compare, and a +condition (like EQUALS). It essentially checks whether the field of a certain resource matches any of the input values. +Note, that if there are no criteria or resource is not set, policy is applied to ALL resources. 
+ +For example, the following resource filter will apply the policy to datasets, charts, and dashboards under domain 1. + +```json +{ + "resource": { + "criteria": [ + { + "field": "resource_type", + "values": ["dataset", "chart", "dashboard"], + "condition": "EQUALS" + }, + { + "field": "domain", + "values": ["urn:li:domain:domain1"], + "condition": "EQUALS" + } + ] + } +} +``` + +Supported fields are as follows + +| Field Type | Description | Example | +| ------------- | ---------------------- | ----------------------- | +| resource_type | Type of the resource | dataset, chart, dataJob | +| resource_urn | Urn of the resource | urn:li:dataset:... | +| domain | Domain of the resource | urn:li:domain:domainX | + +## Managing Policies + +Policies can be managed on the page **Settings > Permissions > Policies** page. The `Policies` tab will only +be visible to those users having the `Manage Policies` privilege. + +Out of the box, DataHub is deployed with a set of pre-baked Policies. The set of default policies are created at deploy +time and can be found inside the `policies.json` file within `metadata-service/war/src/main/resources/boot`. This set of policies serves the +following purposes: + +1. Assigns immutable super-user privileges for the root `datahub` user account (Immutable) +2. Assigns all Platform privileges for all Users by default (Editable) + +The reason for #1 is to prevent people from accidentally deleting all policies and getting locked out (`datahub` super user account can be a backup) +The reason for #2 is to permit administrators to log in via OIDC or another means outside of the `datahub` root account +when they are bootstrapping with DataHub. This way, those setting up DataHub can start managing policies without friction. +Note that these privilege _can_ and likely _should_ be altered inside the **Policies** page of the UI. + +> Pro-Tip: To login using the `datahub` account, simply navigate to `/login` and enter `datahub`, `datahub`. Note that the password can be customized for your +> deployment by changing the `user.props` file within the `datahub-frontend` module. Notice that JaaS authentication must be enabled. + +## Configuration + +By default, the Policies feature is _enabled_. This means that the deployment will support creating, editing, removing, and +most importantly enforcing fine-grained access policies. + +In some cases, these capabilities are not desirable. For example, if your company's users are already used to having free reign, you +may want to keep it that way. Or perhaps it is only your Data Platform team who actively uses DataHub, in which case Policies may be overkill. + +For these scenarios, we've provided a back door to disable Policies in your deployment of DataHub. This will completely hide +the policies management UI and by default will allow all actions on the platform. It will be as though +each user has _all_ privileges, both of the **Platform** & **Metadata** flavor. + +To disable Policies, you can simply set the `AUTH_POLICIES_ENABLED` environment variable for the `datahub-gms` service container +to `false`. For example in your `docker/datahub-gms/docker.env`, you'd place + +``` +AUTH_POLICIES_ENABLED=false +``` + +### REST API Authorization + +Policies only affect REST APIs when the environment variable `REST_API_AUTHORIZATION` is set to `true` for GMS. Some policies only apply when this setting is enabled, marked above, and other Metadata and Platform policies apply to the APIs where relevant, also specified in the table above. 
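
As a sketch, enabling this for a Docker-based deployment could look like the following addition to `docker/datahub-gms/docker.env`, using the `REST_API_AUTHORIZATION_ENABLED` spelling that appears in the footnotes above (confirm the exact flag name for your DataHub version):

```
# Enforce access policies on REST API calls made directly to GMS
REST_API_AUTHORIZATION_ENABLED=true
```
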
+ +## Coming Soon + +The DataHub team is hard at work trying to improve the Policies feature. We are planning on building out the following: + +- Hide edit action buttons on Entity pages to reflect user privileges + +Under consideration + +- Ability to define Metadata Policies against multiple reosurces scoped to particular "Containers" (e.g. A "schema", "database", or "collection") + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on Slack! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/authorization/roles.md b/docs-website/versioned_docs/version-0.10.4/docs/authorization/roles.md new file mode 100644 index 0000000000000..e562d0be1a943 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/authorization/roles.md @@ -0,0 +1,169 @@ +--- +title: About DataHub Roles +sidebar_label: Roles +slug: /authorization/roles +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/authorization/roles.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Roles + + + +DataHub provides the ability to use **Roles** to manage permissions. + +:::tip **Roles** are the recommended way to manage permissions on DataHub. This should suffice for most use cases, but advanced users can use **Policies** if needed. + +## Roles Setup, Prerequisites, and Permissions + +The out-of-the-box Roles represent the most common types of DataHub users. Currently, the supported Roles are **Admin**, **Editor** and **Reader**. + +| Role Name | Description | +| --------- | --------------------------------------------------------------------------------------- | +| Admin | Can do everything on the platform. | +| Editor | Can read and edit all metadata. Cannot take administrative actions. | +| Reader | Can read all metadata. Cannot edit anything by default, or take administrative actions. | + +:::note To manage roles, including viewing roles, or editing a user's role, you must either be an **Admin**, or have the **Manage Policies** privilege. + +## Using Roles + +### Viewing Roles + +You can view the list of existing Roles under **Settings > Permissions > Roles**. You can click into a Role to see details about +it, like which users have that Role, and which Policies correspond to that Role. + +

+ +

+ +### Assigning Roles + +Roles can be assigned in two different ways. + +#### Assigning a New Role to a Single User + +If you go to **Settings > Users & Groups > Users**, you will be able to view your full list of users, as well as which Role they are currently +assigned to, including if they don't have a Role. + +

+ +

+ +You can simply assign a new Role to a user by clicking on the drop-down that appears on their row and selecting the desired Role. + +

+ +

+ +#### Batch Assigning a Role + +When viewing the full list of roles at **Settings > Permissions > Roles**, you will notice that each role has an `Add Users` button next to it. Clicking this button will +lead you to a search box where you can search through your users, and select which users you would like to assign this role to. + +

+ +

+ +### How do Roles interact with Policies? + +Roles actually use Policies under-the-hood, and come prepackaged with corresponding policies to control what a Role can do, which you can view in the +Policies tab. Note that these Role-specific policies **cannot** be changed. You can find the full list of policies corresponding to each Role at the bottom of this +[file](https://github.com/datahub-project/datahub/blob/master/metadata-service/war/src/main/resources/boot/policies.json). + +If you would like to have finer control over what a user on your DataHub instance can do, the Roles system interfaces cleanly +with the Policies system. For example, if you would like to give a user a **Reader** role, but also allow them to edit metadata +for certain domains, you can add a policy that will allow them to do. Note that adding a policy like this will only add to what a user can do +in DataHub. + +### Role Privileges + +#### Self-Hosted DataHub and Managed DataHub + +These privileges are common to both Self-Hosted DataHub and Managed DataHub. + +##### Platform Privileges + +| Privilege | Admin | Editor | Reader | +| ------------------------------- | ------------------ | ------------------ | ------ | +| Generate Personal Access Tokens | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Domains | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Glossaries | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Policies | :heavy_check_mark: | :x: | :x: | +| Manage Ingestion | :heavy_check_mark: | :x: | :x: | +| Manage Secrets | :heavy_check_mark: | :x: | :x: | +| Manage Users and Groups | :heavy_check_mark: | :x: | :x: | +| Manage Access Tokens | :heavy_check_mark: | :x: | :x: | +| Manage User Credentials | :heavy_check_mark: | :x: | :x: | +| Manage Public Views | :heavy_check_mark: | :x: | :x: | +| View Analytics | :heavy_check_mark: | :x: | :x: | + +##### Metadata Privileges + +| Privilege | Admin | Editor | Reader | +| ------------------------------------ | ------------------ | ------------------ | ------------------ | +| View Entity Page | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| View Dataset Usage | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| View Dataset Profile | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Edit Entity | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Owners | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Docs | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Doc Links | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Status | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Entity Assertions | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Entity Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Entity Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Dataset Column Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Dataset Column Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Dataset Column Descriptions | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Dataset Column Tags | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Dataset Column Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :x: | +| 
Edit Tag Color | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit User Profile | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Edit Contact Info | :heavy_check_mark: | :heavy_check_mark: | :x: | + +#### Managed DataHub + +These privileges are only relevant to Managed DataHub. + +##### Platform Privileges + +| Privilege | Admin | Editor | Reader | +| ----------------------- | ------------------ | ------------------ | ------ | +| Create Constraints | :heavy_check_mark: | :heavy_check_mark: | :x: | +| View Metadata Proposals | :heavy_check_mark: | :heavy_check_mark: | :x: | +| Manage Tests | :heavy_check_mark: | :x: | :x: | +| Manage Global Settings | :heavy_check_mark: | :x: | :x: | + +##### Metadata Privileges + +| Privilege | Admin | Editor | Reader | +| ------------------------------------- | ------------------ | ------------------ | ------------------ | +| Propose Entity Tags | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Propose Entity Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Propose Dataset Column Tags | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Propose Dataset Column Glossary Terms | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | +| Edit Entity Operations | :heavy_check_mark: | :heavy_check_mark: | :x: | + +## Additional Resources + +### GraphQL + +- [acceptRole](../../graphql/mutations.md#acceptrole) +- [batchAssignRole](../../graphql/mutations.md#batchassignrole) +- [listRoles](../../graphql/queries.md#listroles) + +## FAQ and Troubleshooting + +## What updates are planned for Roles? + +In the future, the DataHub team is looking into adding the following features to Roles. + +- Defining a role mapping from OIDC identity providers to DataHub that will grant users a DataHub role based on their IdP role +- Allowing Admins to set a default role on DataHub so all users are assigned a role +- Building custom roles diff --git a/docs-website/versioned_docs/version-0.10.4/docs/browse.md b/docs-website/versioned_docs/version-0.10.4/docs/browse.md new file mode 100644 index 0000000000000..691ecb392de01 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/browse.md @@ -0,0 +1,65 @@ +--- +title: About DataHub Browse +sidebar_label: Browse +slug: /browse +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/browse.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Browse + + + +Browse is one of the primary entrypoints for discovering different Datasets, Dashboards, Charts and other DataHub Entities. + +Browsing is useful for finding data entities based on a hierarchical structure set in the source system. Generally speaking, that hierarchy will contain the following levels: + +- Entity Type (Dataset, Dashboard, Chart, etc.) +- Environment (prod vs. dev) +- Platform Type (Snowflake, dbt, Looker, etc.) +- Container (Warehouse, Schema, Folder, etc.) +- Entity Name + +For example, a user can easily browse for Datasets within the PROD Snowflake environment, the long_tail_companions warehouse, and the analytics schema: + +

+ +

+ +## Using Browse + +Browse is accessible by clicking on an Entity Type on the front page of the DataHub UI. + +

+ +

+ +This will take you into the folder explorer view for browse in which you can drill down to your desired sub categories to find the data you are looking for. + +

+ +

+ +## Additional Resources + +### GraphQL + +- [browse](../graphql/queries.md#browse) +- [browsePaths](../graphql/queries.md#browsePaths) + +## FAQ and Troubleshooting + +**How are BrowsePaths created?** + +BrowsePaths are automatically created for ingested entities based on separator characters that appear within an Urn. + +**How can I customize browse paths?** + +BrowsePaths are an Aspect similar to other components of an Entity. They can be customized by ingesting custom paths for specified Urns. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Search](./how/search.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/browseV2/browse-paths-v2.md b/docs-website/versioned_docs/version-0.10.4/docs/browseV2/browse-paths-v2.md new file mode 100644 index 0000000000000..664b3208c9649 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/browseV2/browse-paths-v2.md @@ -0,0 +1,58 @@ +--- +title: Generating Browse Paths (V2) +slug: /browsev2/browse-paths-v2 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/browseV2/browse-paths-v2.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Generating Browse Paths (V2) + + + +## Introduction + +Browse (V2) is a way for users to explore and dive deeper into their data. Its integration with the search experience allows users to combine search queries and filters with entity type and platform nested folders. + +Most entities should have a browse path that allows users to navigate the left side panel on the search page to find groups of entities under different folders that come from these browse paths. Below, you can see an example of the sidebar with some new browse paths. + +

+ +

+ +This new browse sidebar always starts with Entity Type, then optionally shows Environment (PROD, DEV, etc.) if there are 2 or more Environments, then Platform. Below the Platform level, we render out folders that come directly from entity's [browsePathsV2](/docs/generated/metamodel/entities/dataset#browsepathsv2) aspects. + +## Generating Custom Browse Paths + +A `browsePathsV2` aspect has a field called `path` which contains a list of `BrowsePathEntry` objects. Each object in the path represents one level of the entity's browse path where the first entry is the highest level and the last entry is the lowest level. + +If an entity has this aspect filled out, their browse path will show up in the browse sidebar so that you can navigate its folders and select one to filter search results down. + +For example, in the browse sidebar on the left of the image above, there are 10 Dataset entities from the BigQuery Platform that have `browsePathsV2` aspects that look like the following: + +``` +[ { id: "bigquery-public-data" }, { id: "covid19_public_forecasts" } ] +``` + +The `id` in a `BrowsePathEntry` is required and is what will be shown in the UI unless the optional `urn` field is populated. If the `urn` field is populated, we will try to resolve this path entry into an entity object and display that entity's name. We will also show a link to allow you to open up the entity profile. + +The `urn` field should only be populated if there is an entity in your DataHub instance that belongs in that entity's browse path. This makes most sense for Datasets to have Container entities in the browse paths as well as some other cases such as a DataFlow being part of a DataJob's browse path. For any other situation, feel free to leave `urn` empty and populate `id` with the text you want to be shown in the UI for your entity's path. + +## Additional Resources + +### GraphQL + +- [browseV2](../../graphql/queries.md#browsev2) + +## FAQ and Troubleshooting + +**How are browsePathsV2 aspects created?** + +We create `browsePathsV2` aspects for all entities that should have one by default when you ingest your data if this aspect is not already provided. This happens based on separator characters that appear within an Urn. + +Our ingestion sources are also producing `browsePathsV2` aspects since CLI version v0.10.5. + +### Related Features + +- [Search](../how/search.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/cli.md b/docs-website/versioned_docs/version-0.10.4/docs/cli.md new file mode 100644 index 0000000000000..d925c933e12ff --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/cli.md @@ -0,0 +1,626 @@ +--- +toc_max_heading_level: 4 +title: DataHub CLI +sidebar_label: CLI +slug: /cli +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/cli.md" +--- + +# DataHub CLI + +DataHub comes with a friendly cli called `datahub` that allows you to perform a lot of common operations using just the command line. [Acryl Data](https://acryldata.io) maintains the [pypi package](https://pypi.org/project/acryl-datahub/) for `datahub`. + +## Installation + +### Using pip + +We recommend Python virtual environments (venv-s) to namespace pip modules. Here's an example setup: + +```shell +python3 -m venv venv # create the environment +source venv/bin/activate # activate the environment +``` + +**_NOTE:_** If you install `datahub` in a virtual environment, that same virtual environment must be re-activated each time a shell window or session is created. 
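+
+For example, in a new shell window or session you would re-run the activation step from the setup above before invoking the CLI again:
+
+```shell
+# re-activate the previously created virtual environment in the new shell
+source venv/bin/activate
+```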
+ +Once inside the virtual environment, install `datahub` using the following commands + +```shell +# Requires Python 3.7+ +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +# validate that the install was successful +datahub version +# If you see "command not found", try running this instead: python3 -m datahub version +``` + +If you run into an error, try checking the [_common setup issues_](../metadata-ingestion/developing.md#Common-setup-issues). + +Other installation options such as installation from source and running the cli inside a container are available further below in the guide [here](#alternate-installation-options) + +## Starter Commands + +The `datahub` cli allows you to do many things, such as quick-starting a DataHub docker instance locally, ingesting metadata from your sources into a DataHub server or a DataHub lite instance, as well as retrieving, modifying and exploring metadata. +Like most command line tools, `--help` is your best friend. Use it to discover the capabilities of the cli and the different commands and sub-commands that are supported. + +```console +datahub --help +Usage: datahub [OPTIONS] COMMAND [ARGS]... + +Options: + --debug / --no-debug Enable debug logging. + --log-file FILE Enable debug logging. + --debug-vars / --no-debug-vars Show variable values in stack traces. Implies --debug. While we try to avoid + printing sensitive information like passwords, this may still happen. + --version Show the version and exit. + -dl, --detect-memory-leaks Run memory leak detection. + --help Show this message and exit. + +Commands: + actions + check Helper commands for checking various aspects of DataHub. + dataproduct A group of commands to interact with the DataProduct entity in DataHub. + delete Delete metadata from datahub using a single urn or a combination of filters + docker Helper commands for setting up and interacting with a local DataHub instance using Docker. + exists A group of commands to check existence of entities in DataHub. + get A group of commands to get metadata from DataHub. + group A group of commands to interact with the Group entity in DataHub. + ingest Ingest metadata into DataHub. + init Configure which datahub instance to connect to + lite A group of commands to work with a DataHub Lite instance + migrate Helper commands for migrating metadata within DataHub. + put A group of commands to put metadata in DataHub. + state Managed state stored in DataHub by stateful ingestion. + telemetry Toggle telemetry. + timeline Get timeline for an entity based on certain categories + user A group of commands to interact with the User entity in DataHub. + version Print version number and exit. +``` + +The following top-level commands listed below are here mainly to give the reader a high-level picture of what are the kinds of things you can accomplish with the cli. +We've ordered them roughly in the order we expect you to interact with these commands as you get deeper into the `datahub`-verse. + +### docker + +The `docker` command allows you to start up a local DataHub instance using `datahub docker quickstart`. You can also check if the docker cluster is healthy using `datahub docker check`. + +### ingest + +The `ingest` command allows you to ingest metadata from your sources using ingestion configuration files, which we call recipes. +Source specific crawlers are provided by plugins and might sometimes need additional extras to be installed. 
See [installing plugins](#installing-plugins) for more information. +[Removing Metadata from DataHub](./how/delete-metadata.md) contains detailed instructions about how you can use the ingest command to perform operations like rolling-back previously ingested metadata through the `rollback` sub-command and listing all runs that happened through `list-runs` sub-command. + +```console +Usage: datahub [datahub-options] ingest [command-options] + +Command Options: + -c / --config Config file in .toml or .yaml format + -n / --dry-run Perform a dry run of the ingestion, essentially skipping writing to sink + --preview Perform limited ingestion from the source to the sink to get a quick preview + --preview-workunits The number of workunits to produce for preview + --strict-warnings If enabled, ingestion runs with warnings will yield a non-zero error code +``` + +### init + +The init command is used to tell `datahub` about where your DataHub instance is located. The CLI will point to localhost DataHub by default. +Running `datahub init` will allow you to customize the datahub instance you are communicating with. + +**_Note_**: Provide your GMS instance's host when the prompt asks you for the DataHub host. + +#### Environment variables supported + +The environment variables listed below take precedence over the DataHub CLI config created through the `init` command. + +- `DATAHUB_SKIP_CONFIG` (default `false`) - Set to `true` to skip creating the configuration file. +- `DATAHUB_GMS_URL` (default `http://localhost:8080`) - Set to a URL of GMS instance +- `DATAHUB_GMS_HOST` (default `localhost`) - Set to a host of GMS instance. Prefer using `DATAHUB_GMS_URL` to set the URL. +- `DATAHUB_GMS_PORT` (default `8080`) - Set to a port of GMS instance. Prefer using `DATAHUB_GMS_URL` to set the URL. +- `DATAHUB_GMS_PROTOCOL` (default `http`) - Set to a protocol like `http` or `https`. Prefer using `DATAHUB_GMS_URL` to set the URL. +- `DATAHUB_GMS_TOKEN` (default `None`) - Used for communicating with DataHub Cloud. +- `DATAHUB_TELEMETRY_ENABLED` (default `true`) - Set to `false` to disable telemetry. If CLI is being run in an environment with no access to public internet then this should be disabled. +- `DATAHUB_TELEMETRY_TIMEOUT` (default `10`) - Set to a custom integer value to specify timeout in secs when sending telemetry. +- `DATAHUB_DEBUG` (default `false`) - Set to `true` to enable debug logging for CLI. Can also be achieved through `--debug` option of the CLI. +- `DATAHUB_VERSION` (default `head`) - Set to a specific version to run quickstart with the particular version of docker images. +- `ACTIONS_VERSION` (default `head`) - Set to a specific version to run quickstart with that image tag of `datahub-actions` container. + +```shell +DATAHUB_SKIP_CONFIG=false +DATAHUB_GMS_URL=http://localhost:8080 +DATAHUB_GMS_TOKEN= +DATAHUB_TELEMETRY_ENABLED=true +DATAHUB_TELEMETRY_TIMEOUT=10 +DATAHUB_DEBUG=false +``` + +### check + +The datahub package is composed of different plugins that allow you to connect to different metadata sources and ingest metadata from them. +The `check` command allows you to check if all plugins are loaded correctly as well as validate an individual MCE-file. + +### delete + +The `delete` command allows you to delete metadata from DataHub. + +The [metadata deletion guide](./how/delete-metadata.md) covers the various options for the delete command. 
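+
+As a quick illustration (a sketch of the most common form; see the deletion guide linked above for the full set of flags and safety options), a single entity can be deleted by URN, with `--hard` switching from the default soft delete to a permanent one:
+
+```shell
+# soft delete (the default) marks the entity as removed but keeps its metadata recoverable
+datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
+
+# hard delete permanently removes the metadata for this urn
+datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --hard
+```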
+ +### exists + +**🤝 Version compatibility** : `acryl-datahub>=0.10.2.4` + +The exists command can be used to check if an entity exists in DataHub. + +```shell +> datahub exists --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" +true +> datahub exists --urn "urn:li:dataset:(urn:li:dataPlatform:hive,NonExistentHiveDataset,PROD)" +false +``` + +### get + +The `get` command allows you to easily retrieve metadata from DataHub, by using the REST API. This works for both versioned aspects and timeseries aspects. For timeseries aspects, it fetches the latest value. +For example the following command gets the ownership aspect from the dataset `urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)` + +```shell-session +$ datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --aspect ownership +{ + "ownership": { + "lastModified": { + "actor": "urn:li:corpuser:jdoe", + "time": 1680210917580 + }, + "owners": [ + { + "owner": "urn:li:corpuser:jdoe", + "source": { + "type": "SERVICE" + }, + "type": "NONE" + } + ] + } +} +``` + +### put + +The `put` group of commands allows you to write metadata into DataHub. This is a flexible way for you to issue edits to metadata from the command line. + +#### put aspect + +The **put aspect** (also the default `put`) command instructs `datahub` to set a specific aspect for an entity to a specified value. +For example, the command shown below sets the `ownership` aspect of the dataset `urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)` to the value in the file `ownership.json`. +The JSON in the `ownership.json` file needs to conform to the [`Ownership`](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) Aspect model as shown below. + +```json +{ + "owners": [ + { + "owner": "urn:li:corpuser:jdoe", + "type": "DEVELOPER" + }, + { + "owner": "urn:li:corpuser:jdub", + "type": "DATAOWNER" + } + ] +} +``` + +```console +datahub --debug put --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --aspect ownership -d ownership.json + +[DATE_TIMESTAMP] DEBUG {datahub.cli.cli_utils:340} - Attempting to emit to DataHub GMS; using curl equivalent to: +curl -X POST -H 'User-Agent: python-requests/2.26.0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' --data '{"proposal": {"entityType": "dataset", "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)", "aspectName": "ownership", "changeType": "UPSERT", "aspect": {"contentType": "application/json", "value": "{\"owners\": [{\"owner\": \"urn:li:corpuser:jdoe\", \"type\": \"DEVELOPER\"}, {\"owner\": \"urn:li:corpuser:jdub\", \"type\": \"DATAOWNER\"}]}"}}}' 'http://localhost:8080/aspects/?action=ingestProposal' +Update succeeded with status 200 +``` + +#### put platform + +**🤝 Version Compatibility:** `acryl-datahub>0.8.44.4` + +The **put platform** command instructs `datahub` to create or update metadata about a data platform. This is very useful if you are using a custom data platform, to set up its logo and display name for a native UI experience. 
+ +```shell +datahub put platform --name longtail_schemas --display_name "Long Tail Schemas" --logo "https://flink.apache.org/img/logo/png/50/color_50.png" +✅ Successfully wrote data platform metadata for urn:li:dataPlatform:longtail_schemas to DataHub (DataHubRestEmitter: configured to talk to https://longtailcompanions.acryl.io/api/gms with token: eyJh**********Cics) +``` + +### timeline + +The `timeline` command allows you to view a version history for entities. Currently only supported for Datasets. For example, +the following command will show you the modifications to tags for a dataset for the past week. The output includes a computed semantic version, +relevant for schema changes only currently, the target of the modification, and a description of the change including a timestamp. +The default output is sanitized to be more readable, but the full API output can be obtained by passing the `--verbose` flag and +to get the raw JSON difference in addition to the API output you can add the `--raw` flag. For more details about the feature please see [the main feature page](dev-guides/timeline.md) + +```console +datahub timeline --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)" --category TAG --start 7daysago +2022-02-17 14:03:42 - 0.0.0-computed + MODIFY TAG dataset:mysql:User.UserAccount : A change in aspect editableSchemaMetadata happened at time 2022-02-17 20:03:42.0 +2022-02-17 14:17:30 - 0.0.0-computed + MODIFY TAG dataset:mysql:User.UserAccount : A change in aspect editableSchemaMetadata happened at time 2022-02-17 20:17:30.118 +``` + +## Entity Specific Commands + +### user (User Entity) + +The `user` command allows you to interact with the User entity. +It currently supports the `upsert` operation, which can be used to create a new user or update an existing one. +For detailed information, please refer to [Creating Users and Groups with Datahub CLI](/docs/api/tutorials/owners.md#upsert-users). + +```shell +datahub user upsert -f users.yaml +``` + +An example of `users.yaml` would look like the following. You can refer to the [bar.user.dhub.yaml](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/cli_usage/user/bar.user.dhub.yaml) file for the complete code. + +```yaml +- id: bar@acryl.io + first_name: The + last_name: Bar + email: bar@acryl.io + slack: "@the_bar_raiser" + description: "I like raising the bar higher" + groups: + - foogroup@acryl.io +- id: datahub + slack: "@datahubproject" + phone: "1-800-GOT-META" + description: "The DataHub Project" + picture_link: "https://raw.githubusercontent.com/datahub-project/datahub/master/datahub-web-react/src/images/datahub-logo-color-stable.svg" +``` + +### group (Group Entity) + +The `group` command allows you to interact with the Group entity. +It currently supports the `upsert` operation, which can be used to create a new group or update an existing one with embedded Users. +For more information, please refer to [Creating Users and Groups with Datahub CLI](/docs/api/tutorials/owners.md#upsert-users). + +```shell +datahub group upsert -f group.yaml +``` + +An example of `group.yaml` would look like the following. You can refer to the [foo.group.dhub.yaml](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/cli_usage/group/foo.group.dhub.yaml) file for the complete code. 
+
+```yaml
+id: foogroup@acryl.io
+display_name: Foo Group
+admins:
+  - datahub
+members:
+  - bar@acryl.io # refer to a user either by id or by urn
+  - id: joe@acryl.io # inline specification of user
+    slack: "@joe_shmoe"
+    display_name: "Joe's Hub"
+```
+
+### dataproduct (Data Product Entity)
+
+**🤝 Version Compatibility:** `acryl-datahub>=0.10.2.4`
+
+The dataproduct group of commands allows you to manage the lifecycle of a DataProduct entity on DataHub.
+See the [Data Products](./dataproducts.md) page for more details on what a Data Product is and how DataHub represents it.
+
+```shell
+datahub dataproduct --help
+Commands:
+  upsert*          Upsert attributes to a Data Product in DataHub
+  update           Create or Update a Data Product in DataHub.
+  add_asset        Add an asset to a Data Product
+  add_owner        Add an owner to a Data Product
+  delete           Delete a Data Product in DataHub.
+  diff             Diff a Data Product file with its twin in DataHub
+  get              Get a Data Product from DataHub
+  remove_asset     Remove an asset from a Data Product
+  remove_owner     Remove an owner from a Data Product
+  set_description  Set description for a Data Product in DataHub
+```
+
+Here we detail the sub-commands available under the dataproduct group of commands:
+
+#### upsert
+
+Use this to upsert a data product yaml file into DataHub. This will create the data product if it doesn't exist already. Remember, this will upsert all the fields that are specified in the yaml file and will not touch the fields that are not specified. For example, if you do not specify the `description` field in the yaml file, then `upsert` will not modify the description field on the Data Product entity in DataHub. To keep this file synced with the metadata on DataHub use the [diff](#diff) command. The format of the yaml file is available [here](./dataproducts.md#creating-a-data-product-yaml--git).
+
+```shell
+# Usage
+> datahub dataproduct upsert -f data_product.yaml
+
+```
+
+#### update
+
+Use this to fully replace a data product's metadata in DataHub from a yaml file. This will create the data product if it doesn't exist already. Remember, this will update all the fields including ones that are not specified in the yaml file. For example, if you do not specify the `description` field in the yaml file, then `update` will set the description field on the Data Product entity in DataHub to empty. To keep this file synced with the metadata on DataHub use the [diff](#diff) command. The format of the yaml file is available [here](./dataproducts.md#creating-a-data-product-yaml--git).
+
+```shell
+# Usage
+> datahub dataproduct update -f data_product.yaml
+
+```
+
+:::note
+
+❗**Pro-Tip: upsert versus update**
+
+Wondering which command is right for you? Use `upsert` if there are certain elements of metadata that you don't want to manage using the yaml file (e.g. owners, assets or description). Use `update` if you want to manage the entire data product's metadata using the yaml file.
+
+:::
+
+#### diff
+
+Use this to keep a data product yaml file updated from its server-side version in DataHub.
+ +```shell +# Usage +> datahub dataproduct diff -f data_product.yaml --update +``` + +#### get + +Use this to get a data product entity from DataHub and optionally write it to a yaml file + +```shell +# Usage +> datahub dataproduct get --urn urn:li:dataProduct:pet_of_the_week --to-file pet_of_the_week_dataproduct.yaml +{ + "id": "urn:li:dataProduct:pet_of_the_week", + "domain": "urn:li:domain:dcadded3-2b70-4679-8b28-02ac9abc92eb", + "assets": [ + "urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.analytics.pet_details,PROD)", + "urn:li:dashboard:(looker,dashboards.19)", + "urn:li:dataFlow:(airflow,snowflake_load,prod)" + ], + "display_name": "Pet of the Week Campaign", + "owners": [ + { + "id": "urn:li:corpuser:jdoe", + "type": "BUSINESS_OWNER" + } + ], + "description": "This campaign includes Pet of the Week data.", + "tags": [ + "urn:li:tag:adoption" + ], + "terms": [ + "urn:li:glossaryTerm:ClientsAndAccounts.AccountBalance" + ], + "properties": { + "lifecycle": "production", + "sla": "7am every day" + } +} +Data Product yaml written to pet_of_the_week_dataproduct.yaml +``` + +#### add_asset + +Use this to add a data asset to a Data Product. + +```shell +# Usage +> datahub dataproduct add_asset --urn "urn:li:dataProduct:pet_of_the_week" --asset "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" +``` + +#### remove_asset + +Use this to remove a data asset from a Data Product. + +```shell +# Usage +> datahub dataproduct remove_asset --urn "urn:li:dataProduct:pet_of_the_week" --asset "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" +``` + +#### add_owner + +Use this to add an owner to a Data Product. + +```shell +# Usage +> datahub dataproduct add_owner --urn "urn:li:dataProduct:pet_of_the_week" --owner "jdoe@longtail.com" --owner-type BUSINESS_OWNER +``` + +#### remove_owner + +Use this to remove an owner from a Data Product. + +```shell +# Usage +> datahub dataproduct remove_owner --urn "urn:li:dataProduct:pet_of_the_week" --owner "urn:li:corpUser:jdoe@longtail.com" +``` + +#### set_description + +Use this to attach rich documentation for a Data Product in DataHub. + +```shell +> datahub dataproduct set_description --urn "urn:li:dataProduct:pet_of_the_week" --description "This is the pet dataset" +# For uploading rich documentation from a markdown file, use the --md-file option +# > datahub dataproduct set_description --urn "urn:li:dataProduct:pet_of_the_week" --md-file ./pet_of_the_week.md +``` + +#### delete + +Use this to delete a Data Product from DataHub. Default to `--soft` which preserves metadata, use `--hard` to erase all metadata associated with this Data Product. + +```shell +> datahub dataproduct delete --urn "urn:li:dataProduct:pet_of_the_week" +# For Hard Delete see below: +# > datahub dataproduct delete --urn "urn:li:dataProduct:pet_of_the_week" --hard +``` + +## Miscellaneous Admin Commands + +### lite (experimental) + +The lite group of commands allow you to run an embedded, lightweight DataHub instance for command line exploration of your metadata. This is intended more for developer tool oriented usage rather than as a production server instance for DataHub. See [DataHub Lite](./datahub_lite.md) for more information about how you can ingest metadata into DataHub Lite and explore your metadata easily. + +### telemetry + +To help us understand how people are using DataHub, we collect anonymous usage statistics on actions such as command invocations via Mixpanel. 
+We do not collect private information such as IP addresses, contents of ingestions, or credentials. +The code responsible for collecting and broadcasting these events is open-source and can be found [within our GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/telemetry/telemetry.py). + +Telemetry is enabled by default, and the `telemetry` command lets you toggle the sending of these statistics via `telemetry enable/disable`. + +### migrate + +The `migrate` group of commands allows you to perform certain kinds of migrations. + +#### dataplatform2instance + +The `dataplatform2instance` migration command allows you to migrate your entities from an instance-agnostic platform identifier to an instance-specific platform identifier. If you have ingested metadata in the past for this platform and would like to transfer any important metadata over to the new instance-specific entities, then you should use this command. For example, if your users have added documentation or added tags or terms to your datasets, then you should run this command to transfer this metadata over to the new entities. For further context, read the Platform Instance Guide [here](./platform-instances.md). + +A few important options worth calling out: + +- --dry-run / -n : Use this to get a report for what will be migrated before running +- --force / -F : Use this if you know what you are doing and do not want to get a confirmation prompt before migration is started +- --keep : When enabled, will preserve the old entities and not delete them. Default behavior is to soft-delete old entities. +- --hard : When enabled, will hard-delete the old entities. + +**_Note_**: Timeseries aspects such as Usage Statistics and Dataset Profiles are not migrated over to the new entity instances, you will get new data points created when you re-run ingestion using the `usage` or sources with profiling turned on. 
+ +##### Dry Run + +```console +datahub migrate dataplatform2instance --platform elasticsearch --instance prod_index --dry-run +Starting migration: platform:elasticsearch, instance=prod_index, force=False, dry-run=True +100% (25 of 25) |####################################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00 +[Dry Run] Migration Report: +-------------- +[Dry Run] Migration Run Id: migrate-5710349c-1ec7-4b83-a7d3-47d71b7e972e +[Dry Run] Num entities created = 25 +[Dry Run] Num entities affected = 0 +[Dry Run] Num entities migrated = 25 +[Dry Run] Details: +[Dry Run] New Entities Created: {'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.datahubretentionindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.schemafieldindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.system_metadata_service_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.tagindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dataset_datasetprofileaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.mlmodelindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.mlfeaturetableindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.datajob_datahubingestioncheckpointaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.datahub_usage_event,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dataset_operationaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.datajobindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dataprocessindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.glossarytermindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dataplatformindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.mlmodeldeploymentindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.datajob_datahubingestionrunsummaryaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.graph_service_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.datahubpolicyindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dataset_datasetusagestatisticsaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dashboardindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.glossarynodeindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.mlfeatureindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.dataflowindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.mlprimarykeyindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,prod_index.chartindex_v2,PROD)'} +[Dry Run] External Entities Affected: None +[Dry Run] Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dataset_datasetusagestatisticsaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,mlmodelindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,mlmodeldeploymentindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,datajob_datahubingestionrunsummaryaspect_v1,PROD)', 
'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,datahubretentionindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,datahubpolicyindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dataset_datasetprofileaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,glossarynodeindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dataset_operationaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,graph_service_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,datajobindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,mlprimarykeyindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dashboardindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,datajob_datahubingestioncheckpointaspect_v1,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,tagindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,datahub_usage_event,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,schemafieldindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,mlfeatureindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dataprocessindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dataplatformindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,mlfeaturetableindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,glossarytermindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,dataflowindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,chartindex_v2,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:elasticsearch,system_metadata_service_v1,PROD)'} +``` + +##### Real Migration (with soft-delete) + +``` +> datahub migrate dataplatform2instance --platform hive --instance +datahub migrate dataplatform2instance --platform hive --instance warehouse +Starting migration: platform:hive, instance=warehouse, force=False, dry-run=False +Will migrate 4 urns such as ['urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)'] +New urns will look like ['urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.logging_events,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.fct_users_created,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.SampleHiveDataset,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.fct_users_deleted,PROD)'] + +Ok to proceed? [y/N]: +... 
+Migration Report: +-------------- +Migration Run Id: migrate-f5ae7201-4548-4bee-aed4-35758bb78c89 +Num entities created = 4 +Num entities affected = 0 +Num entities migrated = 4 +Details: +New Entities Created: {'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.SampleHiveDataset,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.fct_users_deleted,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.logging_events,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,warehouse.fct_users_created,PROD)'} +External Entities Affected: None +Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)'} +``` + +## Alternate Installation Options + +### Using docker + +[![Docker Hub](https://img.shields.io/docker/pulls/acryldata/datahub-ingestion?style=plastic)](https://hub.docker.com/r/acryldata/datahub-ingestion) +[![datahub-ingestion docker](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml) + +If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container. +We have prebuilt images available on [Docker hub](https://hub.docker.com/r/acryldata/datahub-ingestion). All plugins will be installed and enabled automatically. + +You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). In case you are using Kubernetes you can start a pod with the `datahub-ingestion` docker image, log onto a shell on the pod and you should have the access to datahub CLI in your kubernetes cluster. + +_Limitation: the datahub_docker.sh convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly._ + +```shell +# Assumes the DataHub repo is cloned locally. +./metadata-ingestion/scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml +``` + +### Install from source + +If you'd like to install from source, see the [developer guide](../metadata-ingestion/developing.md). + +## Installing Plugins + +We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs! + +### Sources + +Please see our [Integrations page](/integrations) if you want to filter on the features offered by each source. 
+ +| Plugin Name | Install Command | Provides | +| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | --------------------------------------- | +| [file](./generated/ingestion/sources/file.md) | _included by default_ | File source and sink | +| [athena](./generated/ingestion/sources/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source | +| [bigquery](./generated/ingestion/sources/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source | +| [datahub-lineage-file](./generated/ingestion/sources/file-based-lineage.md) | _no additional dependencies_ | Lineage File source | +| [datahub-business-glossary](./generated/ingestion/sources/business-glossary.md) | _no additional dependencies_ | Business Glossary File source | +| [dbt](./generated/ingestion/sources/dbt.md) | _no additional dependencies_ | dbt source | +| [druid](./generated/ingestion/sources/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid Source | +| [feast](./generated/ingestion/sources/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source (0.26.0) | +| [glue](./generated/ingestion/sources/glue.md) | `pip install 'acryl-datahub[glue]'` | AWS Glue source | +| [hana](./generated/ingestion/sources/hana.md) | `pip install 'acryl-datahub[hana]'` | SAP HANA source | +| [hive](./generated/ingestion/sources/hive.md) | `pip install 'acryl-datahub[hive]'` | Hive source | +| [kafka](./generated/ingestion/sources/kafka.md) | `pip install 'acryl-datahub[kafka]'` | Kafka source | +| [kafka-connect](./generated/ingestion/sources/kafka-connect.md) | `pip install 'acryl-datahub[kafka-connect]'` | Kafka connect source | +| [ldap](./generated/ingestion/sources/ldap.md) | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source | +| [looker](./generated/ingestion/sources/looker.md) | `pip install 'acryl-datahub[looker]'` | Looker source | +| [lookml](./generated/ingestion/sources/looker.md#module-lookml) | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ | +| [metabase](./generated/ingestion/sources/metabase.md) | `pip install 'acryl-datahub[metabase]'` | Metabase source | +| [mode](./generated/ingestion/sources/mode.md) | `pip install 'acryl-datahub[mode]'` | Mode Analytics source | +| [mongodb](./generated/ingestion/sources/mongodb.md) | `pip install 'acryl-datahub[mongodb]'` | MongoDB source | +| [mssql](./generated/ingestion/sources/mssql.md) | `pip install 'acryl-datahub[mssql]'` | SQL Server source | +| [mysql](./generated/ingestion/sources/mysql.md) | `pip install 'acryl-datahub[mysql]'` | MySQL source | +| [mariadb](./generated/ingestion/sources/mariadb.md) | `pip install 'acryl-datahub[mariadb]'` | MariaDB source | +| [openapi](./generated/ingestion/sources/openapi.md) | `pip install 'acryl-datahub[openapi]'` | OpenApi Source | +| [oracle](./generated/ingestion/sources/oracle.md) | `pip install 'acryl-datahub[oracle]'` | Oracle source | +| [postgres](./generated/ingestion/sources/postgres.md) | `pip install 'acryl-datahub[postgres]'` | Postgres source | +| [redash](./generated/ingestion/sources/redash.md) | `pip install 'acryl-datahub[redash]'` | Redash source | +| [redshift](./generated/ingestion/sources/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source | +| [sagemaker](./generated/ingestion/sources/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source | +| 
[snowflake](./generated/ingestion/sources/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
+| [sqlalchemy](./generated/ingestion/sources/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
+| [superset](./generated/ingestion/sources/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
+| [tableau](./generated/ingestion/sources/tableau.md) | `pip install 'acryl-datahub[tableau]'` | Tableau source |
+| [trino](./generated/ingestion/sources/trino.md) | `pip install 'acryl-datahub[trino]'` | Trino source |
+| [starburst-trino-usage](./generated/ingestion/sources/trino.md) | `pip install 'acryl-datahub[starburst-trino-usage]'` | Starburst Trino usage statistics source |
+| [nifi](./generated/ingestion/sources/nifi.md) | `pip install 'acryl-datahub[nifi]'` | NiFi source |
+| [powerbi](./generated/ingestion/sources/powerbi.md#module-powerbi) | `pip install 'acryl-datahub[powerbi]'` | Microsoft Power BI source |
+| [powerbi-report-server](./generated/ingestion/sources/powerbi.md#module-powerbi-report-server) | `pip install 'acryl-datahub[powerbi-report-server]'` | Microsoft Power BI Report Server source |
+
+### Sinks
+
+| Plugin Name | Install Command | Provides |
+| ------------------------------------------------------------ | -------------------------------------------- | -------------------------- |
+| [file](../metadata-ingestion/sink_docs/file.md) | _included by default_ | File source and sink |
+| [console](../metadata-ingestion/sink_docs/console.md) | _included by default_ | Console sink |
+| [datahub-rest](../metadata-ingestion/sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
+| [datahub-kafka](../metadata-ingestion/sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |
+
+These plugins can be mixed and matched as desired. For example:
+
+```shell
+pip install 'acryl-datahub[bigquery,datahub-rest]'
+```
+
+### Check the active plugins
+
+```shell
+datahub check plugins
+```
+
+[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites
+
+## Release Notes and CLI versions
+
+The server release notes can be found in [github releases](https://github.com/datahub-project/datahub/releases). These releases are done approximately every week on a regular cadence unless a blocking issue or regression is discovered.
+
+CLI releases are made through a different repository and release notes can be found in [acryldata releases](https://github.com/acryldata/datahub/releases). At least one release, tied to the server release, is always made along with the server release. Multiple other bugfix releases are made in between, depending on the number of fixes merged since the server release mentioned above.
+
+If a server with version `0.8.28` is being used, then the CLI used to connect to it should be `0.8.28.x`. Tests of the new CLI are not run against older server versions, so it is not recommended to update the CLI if the server is not updated.
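+
+For example (a sketch; the patch number below is illustrative, so substitute whichever `0.8.28.x` build matches your server), you can check the installed CLI and pin it to a compatible release:
+
+```shell
+# check which CLI version is currently installed
+datahub version
+
+# pin the CLI to a release line that matches the server, e.g. a 0.8.28.x build
+python3 -m pip install --upgrade "acryl-datahub==0.8.28.1"
+```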
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/components.md b/docs-website/versioned_docs/version-0.10.4/docs/components.md new file mode 100644 index 0000000000000..93ce0890ca948 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/components.md @@ -0,0 +1,65 @@ +--- +title: Components +sidebar_label: Components +slug: /components +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/components.md" +--- + +# DataHub Components Overview + +The DataHub platform consists of the components shown in the following diagram. + +

+ +

+
+## Metadata Store
+
+The Metadata Store is responsible for storing the [Entities & Aspects](/docs/metadata-modeling/metadata-model/) comprising the Metadata Graph. This includes
+exposing an API for [ingesting metadata](/docs/metadata-service#ingesting-entities), [fetching Metadata by primary key](/docs/metadata-service#retrieving-entities), [searching entities](/docs/metadata-service#search-an-entity), and [fetching Relationships](/docs/metadata-service#get-relationships-edges) between
+entities. It consists of a Spring Java Service hosting a set of [Rest.li](https://linkedin.github.io/rest.li/) API endpoints, along with
+MySQL, Elasticsearch, & Kafka for primary storage & indexing.
+
+Get started with the Metadata Store by following the [Quickstart Guide](/docs/quickstart/).
+
+## Metadata Models
+
+Metadata Models are schemas defining the shape of the Entities & Aspects comprising the Metadata Graph, along with the relationships between them. They are defined
+using [PDL](https://linkedin.github.io/rest.li/pdl_schema), a modeling language quite similar in form to Protobuf that serializes to JSON. Entities represent a specific class of Metadata
+Asset such as a Dataset, a Dashboard, a Data Pipeline, and beyond. Each _instance_ of an Entity is identified by a unique identifier called an `urn`. Aspects represent related bundles of data attached
+to an instance of an Entity such as its descriptions, tags, and more. View the current set of Entities supported [here](/docs/metadata-modeling/metadata-model#exploring-datahubs-metadata-model).
+
+Learn more about how DataHub models Metadata [here](/docs/metadata-modeling/metadata-model/).
+
+## Ingestion Framework
+
+The Ingestion Framework is a modular, extensible Python library for extracting Metadata from external source systems (e.g.
+Snowflake, Looker, MySQL, Kafka), transforming it into DataHub's [Metadata Model](/docs/metadata-modeling/metadata-model/), and writing it into DataHub via
+either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an [extensive list of Source connectors](/docs/metadata-ingestion/#installing-plugins) to choose from, along with
+a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more.
+
+Getting started with the Ingestion Framework is as simple as defining a YAML file and executing the `datahub ingest` command (see the minimal sketch further down this page).
+Learn more by heading over to the [Metadata Ingestion](/docs/metadata-ingestion/) guide.
+
+## GraphQL API
+
+The [GraphQL](https://graphql.org/) API provides a strongly-typed, entity-oriented API that makes interacting with the Entities comprising the Metadata
+Graph simple, including APIs for adding and removing tags, owners, links & more to Metadata Entities! Most notably, this API is consumed by the User Interface (discussed below) for enabling Search & Discovery, Governance, Observability
+and more.
+
+To get started using the GraphQL API, check out the [Getting Started with GraphQL](/docs/api/graphql/getting-started) guide.
+
+## User Interface
+
+DataHub comes with a React UI including an ever-evolving set of features to make Discovering, Governing, & Debugging your Data Assets easy & delightful.
+For a full overview of the capabilities currently supported, take a look at the [Features](/docs/features/) overview. For a look at what's coming next,
+head over to the [Roadmap](/docs/roadmap/).
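+
+As a minimal sketch of the ingestion flow mentioned in the Ingestion Framework section above (the recipe filename here is a placeholder; see the Metadata Ingestion guide for real recipe syntax):
+
+```shell
+# point the CLI at a recipe file describing a source and a sink, then run ingestion
+datahub ingest -c my_recipe.dhub.yaml
+```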
+ +## Learn More + +Learn more about the specifics of the [DataHub Architecture](./architecture/architecture.md) in the Architecture Overview. Learn about using & developing the components +of the Platform by visiting the Module READMEs. + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email)! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/datahub_lite.md b/docs-website/versioned_docs/version-0.10.4/docs/datahub_lite.md new file mode 100644 index 0000000000000..0bc627a141868 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/datahub_lite.md @@ -0,0 +1,622 @@ +--- +title: DataHub Lite (Experimental) +sidebar_label: Lite (Experimental) +slug: /datahub_lite +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/datahub_lite.md" +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# DataHub Lite (Experimental) + +## What is it? + +DataHub Lite is a lightweight embeddable version of DataHub with no external dependencies. It is intended to enable local developer tooling use-cases such as simple access to metadata for scripts and other tools. +DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports. +It was built as a reaction to [recap](https://github.com/recap-cloud/recap) to prove that a similar lightweight system could be built within DataHub quite easily. +Currently DataHub Lite uses DuckDB under the covers as its default storage layer, but that might change in the future. + +## Features + +- Designed for the CLI +- Available as a Python library or REST API +- Ingest metadata from all DataHub ingestion sources +- Metadata Reads + - navigate metadata using a hierarchy + - get metadata for an entity + - search / query metadata across all entities +- Forward metadata automatically to a central GMS or Kafka instance + +## Architecture + +![architecture](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lite/lite_architecture.png) + +## What is it not? + +DataHub Lite is NOT meant to be a replacement for the production Java DataHub server ([datahub-gms](./architecture/metadata-serving.md)). It does not offer the full set of API-s that the DataHub GMS server does. +The following features are **NOT** supported: + +- Full-text search with ranking and relevance features +- Graph traversal of relationships (e.g. lineage) +- Metadata change stream over Kafka (only forwarding of writes is supported) +- GraphQL API + +## Prerequisites + +To use `datahub lite` commands, you need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) > 0.9.6 ([install instructions](./cli.md#using-pip)) and the `datahub-lite` plugin. + +```shell +pip install acryl-datahub[datahub-lite] +``` + +## Importing Metadata + +To ingest metadata into DataHub Lite, all you have to do is change the `sink:` in your recipe to be a `datahub-lite` instance. See the detailed sink docs [here](../metadata-ingestion/sink_docs/datahub.md#datahub-lite-experimental). +For example, here is a simple recipe file that ingests mysql metadata into datahub-lite. 
+ +```yaml +# mysql.in.dhub.yaml +source: + type: mysql + config: + host_port: localhost:3306 + username: datahub + password: datahub + +sink: + type: datahub-lite +``` + +By default, `lite` will create a local instance under `~/.datahub/lite/`. + +Ingesting metadata into DataHub Lite is as simple as running ingestion: +`datahub ingest -c mysql.in.dhub.yaml` + +:::note + +DataHub Lite currently doesn't support stateful ingestion, so you'll have to turn off stateful ingestion in your recipe to use it. This will be fixed shortly. + +::: + +### Forwarding to a central DataHub GMS over REST or Kafka + +DataHub Lite can be configured to forward all writes to a central DataHub GMS using either the REST API or the Kafka API. +To configure forwarding, add a **forward_to** section to your DataHub Lite config that conforms to a valid DataHub Sink configuration. Here is an example: + +```yaml +# mysql.in.dhub.yaml with forwarding to datahub-gms over REST +source: + type: mysql + config: + host_port: localhost:3306 + username: datahub + password: datahub + +sink: + type: datahub-lite + forward_to: + type: datahub-rest + config: + server: "http://datahub-gms:8080" +``` + +:::note + +Forwarding is currently best-effort, so there can be losses in metadata if the remote server is down. For a reliable sync mechanism, look at the [exporting metadata](#export-metadata-export) section to generate a standard metadata file. + +::: + +### Importing from a file + +As a convenient short-cut, you can import metadata from any standard DataHub metadata json file (e.g. generated via using a file sink) by issuing a _datahub lite import_ command. + +```shell +> datahub lite import --file metadata_events.json + +``` + +## Exploring Metadata + +The `datahub lite` group of commands provides a set of capabilities for you to explore the metadata you just ingested. + +### List (ls) + +Listing functions like a directory structure that is customized based on the kind of system being explored. DataHub's metadata is automatically organized into databases, tables, views, dashboards, charts, etc. + +:::note + +Using the `ls` command below is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide. + +::: + +```shell +> datahub lite ls / +databases +bi_tools +tags +# Stepping down one level +> datahub lite ls /databases +mysql +# Stepping down another level +> datahub lite ls /databases/mysql +instances +... +# At the final level +> datahub lite ls /databases/mysql/instances/default/databases/datahub/tables/ +metadata_aspect_v2 + +# Listing a leaf entity functions just like the unix ls command +> datahub lite ls /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 +metadata_aspect_v2 +``` + +### Read (get) + +Once you have located a path of interest, you can read metadata at that entity, by issuing a **get**. You can additionally filter the metadata retrieved from an entity by the aspect type of the metadata (e.g. to request the schema, filter by the **schemaMetadata** aspect). + +Aside: If you are curious what all types of entities and aspects DataHub supports, check out the metadata model of entities like [Dataset](./generated/metamodel/entities/dataset.md), [Dashboard](./generated/metamodel/entities/dashboard.md) etc. 
+
+The general template for the get responses looks like:
+
+```
+{
+    "urn": <urn of the entity>,
+    <aspect name>: {
+        "value": <aspect value>, # aspect value as a dictionary
+        "systemMetadata": <system metadata> # present if details are requested
+    }
+}
+```
+
+Here is what executing a _get_ command looks like:
+
+ +Get metadata for an entity by path + + +```json +> datahub lite get --path /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 +{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)", + "container": { + "value": { + "container": "urn:li:container:21d4204e13d5b984c58acad468ecdbdd" + } + }, + "status": { + "value": { + "removed": false + } + }, + "datasetProperties": { + "value": { + "customProperties": {}, + "name": "metadata_aspect_v2", + "tags": [] + } + }, + "schemaMetadata": { + "value": { + "schemaName": "datahub.metadata_aspect_v2", + "platform": "urn:li:dataPlatform:mysql", + "version": 0, + "created": { + "time": 0, + "actor": "urn:li:corpuser:unknown" + }, + "lastModified": { + "time": 0, + "actor": "urn:li:corpuser:unknown" + }, + "hash": "", + "platformSchema": { + "com.linkedin.schema.MySqlDDL": { + "tableSchema": "" + } + }, + "fields": [ + { + "fieldPath": "urn", + "nullable": false, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "VARCHAR(collation='utf8mb4_bin', length=500)", + "recursive": false, + "isPartOfKey": true + }, + { + "fieldPath": "aspect", + "nullable": false, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "VARCHAR(collation='utf8mb4_bin', length=200)", + "recursive": false, + "isPartOfKey": true + }, + { + "fieldPath": "version", + "nullable": false, + "type": { + "type": { + "com.linkedin.schema.NumberType": {} + } + }, + "nativeDataType": "BIGINT(display_width=20)", + "recursive": false, + "isPartOfKey": true + }, + { + "fieldPath": "metadata", + "nullable": false, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "LONGTEXT(collation='utf8mb4_bin')", + "recursive": false, + "isPartOfKey": false + }, + { + "fieldPath": "systemmetadata", + "nullable": true, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "LONGTEXT(collation='utf8mb4_bin')", + "recursive": false, + "isPartOfKey": false + }, + { + "fieldPath": "createdon", + "nullable": false, + "type": { + "type": { + "com.linkedin.schema.TimeType": {} + } + }, + "nativeDataType": "DATETIME(fsp=6)", + "recursive": false, + "isPartOfKey": false + }, + { + "fieldPath": "createdby", + "nullable": false, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "VARCHAR(collation='utf8mb4_bin', length=255)", + "recursive": false, + "isPartOfKey": false + }, + { + "fieldPath": "createdfor", + "nullable": true, + "type": { + "type": { + "com.linkedin.schema.StringType": {} + } + }, + "nativeDataType": "VARCHAR(collation='utf8mb4_bin', length=255)", + "recursive": false, + "isPartOfKey": false + } + ] + } + }, + "subTypes": { + "value": { + "typeNames": [ + "table" + ] + } + } +} +``` + +
+ +#### Get metadata for an entity filtered by specific aspect + +```json +> datahub lite get --path /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 --aspect status +{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)", + "status": { + "value": { + "removed": false + } + } +} +``` + +:::note + +Using the `get` command by path is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide. + +::: + +#### Get metadata using the urn of the entity + +```json +> datahub lite get --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)" --aspect status +{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)", + "status": { + "value": { + "removed": false + } + } +} +``` + +
+ +Get metadata with additional details (systemMetadata) + + +```json +> datahub lite get --path /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 --aspect status --verbose +{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)", + "status": { + "value": { + "removed": false + }, + "systemMetadata": { + "lastObserved": 1673982834666, + "runId": "mysql-2023_01_17-11_13_12", + "properties": { + "sysVersion": 1 + } + } + } +} +``` + +
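Because every `get` response is plain JSON, the output is easy to post-process from scripts, which is exactly the local developer tooling use-case DataHub Lite targets. Below is a hedged sketch that shells out to the CLI shown above and prints the field paths from the `schemaMetadata` aspect; the path is the same illustrative MySQL table used throughout this guide.

```python
# Hedged sketch: extract schema field paths from `datahub lite get` output.
# Assumes the MySQL metadata ingested earlier in this guide; the path below is
# purely illustrative.
import json
import subprocess

PATH = "/databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2"

raw = subprocess.run(
    ["datahub", "lite", "get", "--path", PATH, "--aspect", "schemaMetadata"],
    check=True,
    capture_output=True,
    text=True,
).stdout

doc = json.loads(raw)
schema = doc["schemaMetadata"]["value"]
print(f"Schema for {doc['urn']}:")
for field in schema["fields"]:
    print(f"  {field['fieldPath']}: {field['nativeDataType']}")
```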
+ +#### Point-in-time Queries + +DataHub Lite preserves every version of metadata ingested, just like DataHub GMS. You can also query the metadata as of a specific point in time by adding the _--asof_ parameter to your _get_ command. + +```shell +> datahub lite get "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)" --aspect status --asof 2020-01-01 +null + +> datahub lite get "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)" --aspect status --asof 2023-01-16 +{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)", + "status": { + "removed": false + } +} +``` + +### Search (search) + +DataHub Lite also allows you to search using queries within the metadata using the `datahub lite search` command. +You can provide a free form search query like: "customer" and DataHub Lite will attempt to find entities that match the name customer either in the id of the entity or within the name fields of aspects in the entities. + +```shell +> datahub lite search pet +{"id": "urn:li:dataset:(urn:li:dataPlatform:looker,long_tail_companions.explore.long_tail_pets,PROD)", "aspect": "urn", "snippet": null} +{"id": "urn:li:dataset:(urn:li:dataPlatform:looker,long_tail_companions.explore.long_tail_pets,PROD)", "aspect": "datasetProperties", "snippet": "{\"customProperties\": {\"looker.explore.label\": \"Long Tail Pets\", \"looker.explore.file\": \"long_tail_companions.model.lkml\"}, \"externalUrl\": \"https://acryl.cloud.looker.com/explore/long_tail_companions/long_tail_pets\", \"name\": \"Long Tail Pets\", \"tags\": []}"} +``` + +You can also query the metadata precisely using DuckDB's [JSON](https://duckdb.org/docs/extensions/json.html) extract functions. +Writing these functions requires that you understand the DataHub metadata model and how the data is laid out in DataHub Lite. 
+ +For example, to find all entities whose _datasetProperties_ aspect includes the _view_definition_ in its _customProperties_ sub-field, we can issue the following command: + +```shell +> datahub lite search --aspect datasetProperties --flavor exact "metadata -> '$.customProperties' ->> '$.view_definition' IS NOT NULL" +``` + +```json +{"id": "urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_MUTEXES,PROD)", "aspect": "datasetProperties", "snippet": "{\"customProperties\": {\"view_definition\": \"CREATE TEMPORARY TABLE `INNODB_MUTEXES` (\\n `NAME` varchar(4000) NOT NULL DEFAULT '',\\n `CREATE_FILE` varchar(4000) NOT NULL DEFAULT '',\\n `CREATE_LINE` int(11) unsigned NOT NULL DEFAULT 0,\\n `OS_WAITS` bigint(21) unsigned NOT NULL DEFAULT 0\\n) ENGINE=MEMORY DEFAULT CHARSET=utf8\", \"is_view\": \"True\"}, \"name\": \"INNODB_MUTEXES\", \"tags\": []}"} +{"id": "urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.user_variables,PROD)", "aspect": "datasetProperties", "snippet": "{\"customProperties\": {\"view_definition\": \"CREATE TEMPORARY TABLE `user_variables` (\\n `VARIABLE_NAME` varchar(64) NOT NULL DEFAULT '',\\n `VARIABLE_VALUE` varchar(2048) DEFAULT NULL,\\n `VARIABLE_TYPE` varchar(64) NOT NULL DEFAULT '',\\n `CHARACTER_SET_NAME` varchar(32) DEFAULT NULL\\n) ENGINE=MEMORY DEFAULT CHARSET=utf8\", \"is_view\": \"True\"}, \"name\": \"user_variables\", \"tags\": []}"} +{"id": "urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_TABLESPACES_ENCRYPTION,PROD)", "aspect": "datasetProperties", "snippet": "{\"customProperties\": {\"view_definition\": \"CREATE TEMPORARY TABLE `INNODB_TABLESPACES_ENCRYPTION` (\\n `SPACE` int(11) unsigned NOT NULL DEFAULT 0,\\n `NAME` varchar(655) DEFAULT NULL,\\n `ENCRYPTION_SCHEME` int(11) unsigned NOT NULL DEFAULT 0,\\n `KEYSERVER_REQUESTS` int(11) unsigned NOT NULL DEFAULT 0,\\n `MIN_KEY_VERSION` int(11) unsigned NOT NULL DEFAULT 0,\\n `CURRENT_KEY_VERSION` int(11) unsigned NOT NULL DEFAULT 0,\\n `KEY_ROTATION_PAGE_NUMBER` bigint(21) unsigned DEFAULT NULL,\\n `KEY_ROTATION_MAX_PAGE_NUMBER` bigint(21) unsigned DEFAULT NULL,\\n `CURRENT_KEY_ID` int(11) unsigned NOT NULL DEFAULT 0,\\n `ROTATING_OR_FLUSHING` int(1) NOT NULL DEFAULT 0\\n) ENGINE=MEMORY DEFAULT CHARSET=utf8\", \"is_view\": \"True\"}, \"name\": \"INNODB_TABLESPACES_ENCRYPTION\", \"tags\": []}"} +``` + +Search will return results that include the _id_ of the entity that matched along with the _aspect_ and the content of the aspect as part of the _snippet_ field. If you just want the _id_ of the entity to be returned, use the _--no-details_ flag. + +```shell +> datahub lite search --aspect datasetProperties --flavor exact "metadata -> '$.customProperties' ->> '$.view_definition' IS NOT NULL" --no-details +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_SYS_FOREIGN,PROD) +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_CMPMEM_RESET,PROD) +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_FT_DEFAULT_STOPWORD,PROD) +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_SYS_TABLES,PROD) +... +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_SYS_COLUMNS,PROD) +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.INNODB_FT_CONFIG,PROD) +urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.USER_STATISTICS,PROD) +``` + +### List Urns (list-urns) + +List all the ids in the DataHub Lite instance. 
+
+```shell
+> datahub lite list-urns
+urn:li:container:21d4204e13d5b984c58acad468ecdbdd
+urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)
+
+urn:li:container:aa82e8309ce84acc350640647a54ca3b
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.ALL_PLUGINS,PROD)
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.APPLICABLE_ROLES,PROD)
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.CHARACTER_SETS,PROD)
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.CHECK_CONSTRAINTS,PROD)
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.COLLATIONS,PROD)
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.COLLATION_CHARACTER_SET_APPLICABILITY,PROD)
+urn:li:dataset:(urn:li:dataPlatform:mysql,information_schema.COLUMNS,PROD)
+...
+
+```
+
+### HTTP Server (serve)
+
+DataHub Lite can be run as a lightweight HTTP server, exposing an OpenAPI spec over FastAPI.
+
+```shell
+> datahub lite serve
+INFO: Started server process [33364]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Uvicorn running on http://127.0.0.1:8979 (Press CTRL+C to quit)
+```
+
+OpenAPI docs are available via your browser at the same port: http://localhost:8979
+
+The server exposes similar commands as the **lite** cli commands over HTTP:
+
+- entities: list of all entity ids and get metadata for an entity
+- browse: traverse the entity hierarchy in a path based way
+- search: execute search against the metadata
+
+#### Server Configuration
+
+Configuration for the server is picked up from the standard location for the **datahub** cli: **~/.datahubenv** and can be created using **datahub lite init**.
+
+Here is a sample config file with the **lite** section filled out:
+
+```yaml
+gms:
+  server: http://localhost:8080
+  token: ''
+lite:
+  config:
+    file: /Users//.datahub/lite/datahub.duckdb
+  type: duckdb
+  forward_to:
+    type: datahub-rest
+    server: "http://datahub-gms:8080"
+```
+
+## Admin Commands
+
+### Export Metadata (export)
+
+The _export_ command allows you to export the contents of DataHub Lite into a metadata events file that you can then send to another DataHub instance (e.g. over REST).
+
+```shell
+> datahub lite export --file datahub_lite_export.json
+Successfully exported 1785 events to datahub_lite_export.json
+```
+
+### Clear (nuke)
+
+If you want to clear your DataHub Lite instance, you can just issue the `nuke` command.
+
+```shell
+> datahub lite nuke
+DataHub Lite destroyed at
+```
+
+### Use a different file (init)
+
+By default, DataHub Lite will create and use a local duckdb instance located at `~/.datahub/lite/datahub.duckdb`.
+If you want to use a different location, you can configure it using the `datahub lite init` command.
+
+```shell
+> datahub lite init --type duckdb --file my_local_datahub.duckdb
+Will replace datahub lite config type='duckdb' config={'file': '/Users//.datahub/lite/datahub.duckdb', 'options': {}} with type='duckdb' config={'file': 'my_local_datahub.duckdb', 'options': {}} [y/N]: y
+DataHub Lite inited at my_local_datahub.duckdb
+```
+
+### Reindex
+
+DataHub Lite maintains a few derived tables to make access possible via both the native id (urn) as well as the logical path of the entity. The `reindex` command recomputes these indexes.
+
+## Caveat Emptor!
+
+DataHub Lite is a very new project. Do not use it for production use-cases. The APIs and storage formats are subject to change as we get feedback from early adopters.
That said, we are really interested in accepting PR-s and suggestions for improvements to this fledgling project. + +## Advanced Options + +### Tab Completion + +Using the datahub lite commands like `ls` or `get` is much more pleasant when you have tab completion enabled on your shell. Tab completion is supported on the command line through the [Click Shell completion](https://click.palletsprojects.com/en/8.1.x/shell-completion/) module. +To set up shell completion for your shell, follow the instructions below based on your shell variant: + +#### Option 1: Inline eval (easy, less performant) + + + + +Add this to ~/.zshrc: + +```shell +eval "$(_DATAHUB_COMPLETE=zsh_source datahub)" +``` + + + + +Add this to ~/.bashrc: + +```shell +eval "$(_DATAHUB_COMPLETE=bash_source datahub)" +``` + + + + + +#### Option 2: External completion script (recommended, better performance) + +Using eval means that the command is invoked and evaluated every time a shell is started, which can delay shell responsiveness. To speed it up, write the generated script to a file, then source that. + + + + +Save the script somewhere. + +```shell +# the additional sed patches completion to be path oriented and not add spaces between each completed token +_DATAHUB_COMPLETE=zsh_source datahub | sed 's;compadd;compadd -S /;' > ~/.datahub-complete.zsh +``` + +Source the file in ~/.zshrc. + +```shell +. ~/.datahub-complete.zsh +``` + + + + +```shell +_DATAHUB_COMPLETE=bash_source datahub > ~/.datahub-complete.bash +``` + +Source the file in ~/.bashrc. + +```shell +. ~/.datahub-complete.bash +``` + + + + + +Save the script to ~/.config/fish/completions/datahub.fish: + +```shell +_DATAHUB_COMPLETE=fish_source datahub > ~/.config/fish/completions/datahub.fish +``` + + + diff --git a/docs-website/versioned_docs/version-0.10.4/docs/dataproducts.md b/docs-website/versioned_docs/version-0.10.4/docs/dataproducts.md new file mode 100644 index 0000000000000..b0e6ed29163a1 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/dataproducts.md @@ -0,0 +1,271 @@ +--- +title: Data Products +slug: /dataproducts +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/dataproducts.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Data Products + + + +**🤝 Version compatibility** + +> Open Source DataHub: **0.10.3** | Acryl: **0.2.8** + +## What are Data Products? + +Data Products are an innovative way to organize and manage your Data Assets, such as Tables, Topics, Views, Pipelines, Charts, Dashboards, etc., within DataHub. These Data Products belong to a specific Domain and can be easily accessed by various teams or stakeholders within your organization. + +## Why Data Products? + +A key concept in data mesh architecture, Data Products are independent units of data managed by a specific domain team. They are responsible for defining, publishing, and maintaining their data assets while ensuring high-quality data that meets the needs of its consumers. + +## Benefits of Data Products + +Data Products help in curating a coherent set of logical entities, simplifying data discovery and governance. By grouping related Data Assets into a Data Product, it allows stakeholders to discover and understand available data easily, supporting data governance efforts by managing and controlling access to Data Products. + +## How Can You Use Data Products? 
+
+Data Products can be easily published to the DataHub catalog, allowing other teams to discover and consume them. By doing this, data teams can streamline the process of sharing data, making data-driven decisions faster and more efficient.
+
+## Data Products Setup, Prerequisites, and Permissions
+
+What you need to create and add data products:
+
+- **Manage Data Product** metadata privilege for Domains to create/delete Data Products at the entity level. If a user has this privilege for a given Domain, they will be able to create and delete Data Products underneath it.
+- **Edit Data Product** metadata privilege to add or remove the Data Product for a given entity.
+
+You can grant these privileges by creating a new [Metadata Policy](./authorization/policies.md).
+
+## Using Data Products
+
+Data Products can be created using the UI or via a YAML file that is managed using software engineering (GitOps) practices.
+
+### Creating a Data Product (UI)
+
+To create a Data Product, first navigate to the Domain that will contain this Data Product.
+
+

+ +

+ +Then navigate to the Data Products tab on the Domain's home page, and click '+ New Data Product'. +This will open a new modal where you can configure the settings for your data product. Inside the form, you can choose a name for your Data Product. Most often, this will align with the logical purpose of the Data Product, for example +'Customer Orders' or 'Revenue Attribution'. You can also add documentation for your product to help other users easily discover it. Don't worry, this can be changed later. + +

+ +

+ +Once you've chosen a name and a description, click 'Create' to create the new Data Product. Once you've created the Data Product, you can click on it to continue on to the next step, adding assets to it. + +### Assigning an Asset to a Data Product (UI) + +You can assign an asset to a Data Product either using the Data Product page as the starting point or the Asset's page as the starting point. +On a Data Product page, click the 'Add Assets' button on the top right corner to add assets to the Data Product. + +

+ +

+ +On an Asset's profile page, use the right sidebar to locate the Data Product section. Click 'Set Data Product', and then search for the Data Product you'd like to add this asset to. When you're done, click 'Add'. + +

+ +

+
+To remove an asset from a Data Product, click the 'x' icon on the Data Product label.
+
+> Notice: Adding or removing an asset from a Data Product requires the `Edit Data Product` Metadata Privilege, which can be granted
+> by a [Policy](authorization/policies.md).
+
+### Creating a Data Product (YAML + git)
+
+DataHub ships with a YAML-based Data Product spec for defining and managing Data Products as code.
+
+Here is an example of a Data Product named "Pet of the Week" which belongs to the **Marketing** domain and contains three data assets. The **Spec** tab describes the JSON Schema spec for a DataHub data product file.
+
+
+
+
+```yaml
+# Inlined from /metadata-ingestion/examples/data_product/dataproduct.yaml
+id: pet_of_the_week
+domain: Marketing
+display_name: Pet of the Week Campaign
+description: |-
+  This campaign includes Pet of the Week data.
+
+# List of assets that belong to this Data Product
+assets:
+  - urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.analytics.pet_details,PROD)
+  - urn:li:dashboard:(looker,dashboards.19)
+  - urn:li:dataFlow:(airflow,snowflake_load,prod)
+
+owners:
+  - id: urn:li:corpuser:jdoe
+    type: BUSINESS_OWNER
+
+# Tags associated with this Data Product
+tags:
+  - urn:li:tag:adoption
+
+# Glossary Terms associated with this Data Product
+terms:
+  - urn:li:glossaryTerm:ClientsAndAccounts.AccountBalance
+
+# Custom Properties
+properties:
+  lifecycle: production
+  sla: 7am every day
+```
+
+:::note
+
+When bare domain names like `Marketing` are used, `datahub` will first check if a domain like `urn:li:domain:Marketing` is provisioned; failing that, it will check for a provisioned domain that has the same name. If we are unable to resolve bare domain names to provisioned domains, then yaml-based ingestion will refuse to proceed until the domain is provisioned on DataHub.
+
+:::
+
+You can also provide fully-qualified domain names (e.g. `urn:li:domain:dcadded3-2b70-4679-8b28-02ac9abc92eb`) to ensure that no ingestion-time domain resolution is needed.
+
+
+
+
+```json
+{
+  "title": "DataProduct",
+  "description": "This is a DataProduct class which represents a DataProduct\n\nArgs:\n id (str): The id of the Data Product\n domain (str): The domain that the Data Product belongs to.
Either as a name or a fully-qualified urn.\n owners (Optional[List[str, Ownership]]): A list of owners and their types.\n display_name (Optional[str]): The name of the Data Product to display in the UI\n description (Optional[str]): A documentation string for the Data Product\n tags (Optional[List[str]]): An array of tags (either bare ids or urns) for the Data Product\n terms (Optional[List[str]]): An array of terms (either bare ids or urns) for the Data Product\n assets (List[str]): An array of entity urns that are part of the Data Product", + "type": "object", + "properties": { + "id": { + "title": "Id", + "type": "string" + }, + "domain": { + "title": "Domain", + "type": "string" + }, + "assets": { + "title": "Assets", + "type": "array", + "items": { + "type": "string" + } + }, + "display_name": { + "title": "Display Name", + "type": "string" + }, + "owners": { + "title": "Owners", + "type": "array", + "items": { + "anyOf": [ + { + "type": "string" + }, + { + "$ref": "#/definitions/Ownership" + } + ] + } + }, + "description": { + "title": "Description", + "type": "string" + }, + "tags": { + "title": "Tags", + "type": "array", + "items": { + "type": "string" + } + }, + "terms": { + "title": "Terms", + "type": "array", + "items": { + "type": "string" + } + }, + "properties": { + "title": "Properties", + "type": "object", + "additionalProperties": { + "type": "string" + } + }, + "external_url": { + "title": "External Url", + "type": "string" + } + }, + "required": ["id", "domain"], + "additionalProperties": false, + "definitions": { + "Ownership": { + "title": "Ownership", + "type": "object", + "properties": { + "id": { + "title": "Id", + "type": "string" + }, + "type": { + "title": "Type", + "type": "string" + } + }, + "required": ["id", "type"], + "additionalProperties": false + } + } +} +``` + + + + +To sync this yaml file to DataHub, use the `datahub` cli via the `dataproduct` group of commands. + +```shell +datahub dataproduct upsert -f user_dataproduct.yaml +``` + +### Keeping the YAML file sync-ed with changes in UI + +The `datahub` cli allows you to keep this YAML file synced with changes happening in the UI. All you have to do is run the `datahub dataproduct diff` command. + +Here is an example invocation that checks if there is any diff and updates the file in place: + +```shell +datahub dataproduct diff -f user_dataproduct.yaml --update +``` + +This allows you to manage your data product definition in git while still allowing for edits in the UI. Business Users and Developers can both collaborate on the definition of a data product with ease using this workflow. + +### Advanced cli commands for managing Data Products + +There are many more advanced cli commands for managing Data Products as code. Take a look at the [Data Products section](./cli.md#dataproduct-data-product-entity) on the CLI reference guide for more details. + +### What updates are planned for the Data Products feature? + +The following features are next on the roadmap for Data Products + +- Support for marking data assets in a Data Product as private versus shareable for other teams to consume +- Support for declaring lineage manually to upstream and downstream data products +- Support for declaring logical schema for Data Products +- Support for associating data contracts with Data Products +- Support for semantic versioning of the Data Product entity + +### Related Features + +- [Domains](./domains.md) +- [Glossary Terms](./glossary/business-glossary.md) +- [Tags](./tags.md) + +_Need more help? 
Join the conversation in [Slack](http://slack.datahubproject.io)!_
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/aws.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/aws.md
new file mode 100644
index 0000000000000..ee08f095590d0
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/aws.md
@@ -0,0 +1,502 @@
+---
+title: Deploying to AWS
+sidebar_label: Deploying to AWS
+slug: /deploy/aws
+custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/deploy/aws.md"
+---
+
+# AWS setup guide
+
+The following is a set of instructions to quickstart DataHub on AWS Elastic Kubernetes Service (EKS). Note, the guide
+assumes that you do not have a kubernetes cluster set up. If you are deploying DataHub to an existing cluster, please
+skip the corresponding sections.
+
+## Prerequisites
+
+This guide requires the following tools:
+
+- [kubectl](https://kubernetes.io/docs/tasks/tools/) to manage kubernetes resources
+- [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support Helm 3.
+- [eksctl](https://eksctl.io/introduction/#installation) to create and manage clusters on EKS
+- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) to manage AWS resources
+
+To use the above tools, you need to set up AWS credentials by following
+this [guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html).
+
+## Start up a kubernetes cluster on AWS EKS
+
+Let's follow this [guide](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html) to create a new
+cluster using eksctl. Run the following command with cluster-name set to the cluster name of choice, and region set to
+the AWS region you are operating on.
+
+```
+eksctl create cluster \
+    --name <> \
+    --region <> \
+    --with-oidc \
+    --nodes=3
+```
+
+The command will provision an EKS cluster powered by 3 EC2 m3.large nodes and provision a VPC based networking layer.
+
+If you are planning to run the storage layer (MySQL, Elasticsearch, Kafka) as pods in the cluster, you need at least 3
+nodes. If you decide to use managed storage services, you can reduce the number of nodes or use m3.medium nodes to save
+cost. Refer to this [guide](https://eksctl.io/usage/creating-and-managing-clusters/) to further customize the cluster
+before provisioning.
+
+Note, OIDC setup is required for following this guide when setting up the load balancer.
+
+Run `kubectl get nodes` to confirm that the cluster has been set up correctly. You should get results like the ones below:
+
+```
+NAME STATUS ROLES AGE VERSION
+ip-192-168-49-49.us-west-2.compute.internal Ready 3h v1.18.9-eks-d1db3c
+ip-192-168-64-56.us-west-2.compute.internal Ready 3h v1.18.9-eks-d1db3c
+ip-192-168-8-126.us-west-2.compute.internal Ready 3h v1.18.9-eks-d1db3c
+```
+
+## Setup DataHub using Helm
+
+Once the kubernetes cluster has been set up, you can deploy DataHub and its prerequisites using helm. Please follow the
+steps in this [guide](kubernetes.md).
+
+## Expose endpoints using a load balancer
+
+Now that all the pods are up and running, you need to expose the datahub-frontend endpoint by setting
+up [ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/). To do this, you need to first set up an
+ingress controller.
There are +many [ingress controllers](https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/) to choose +from, but here, we will follow +this [guide](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html) to set up the AWS +Application Load Balancer(ALB) Controller. + +First, if you did not use eksctl to setup the kubernetes cluster, make sure to go through the prerequisites listed +[here](https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html). + +Download the IAM policy document for allowing the controller to make calls to AWS APIs on your behalf. + +``` +curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.2.0/docs/install/iam_policy.json +``` + +Create an IAM policy based on the policy document by running the following. + +``` +aws iam create-policy \ + --policy-name AWSLoadBalancerControllerIAMPolicy \ + --policy-document file://iam_policy.json +``` + +Use eksctl to create a service account that allows us to attach the above policy to kubernetes pods. + +``` +eksctl create iamserviceaccount \ + --cluster=<> \ + --namespace=kube-system \ + --name=aws-load-balancer-controller \ + --attach-policy-arn=arn:aws:iam::<>:policy/AWSLoadBalancerControllerIAMPolicy \ + --override-existing-serviceaccounts \ + --approve +``` + +Install the TargetGroupBinding custom resource definition by running the following. + +``` +kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master" +``` + +Add the helm chart repository containing the latest version of the ALB controller. + +``` +helm repo add eks https://aws.github.io/eks-charts +helm repo update +``` + +Install the controller into the kubernetes cluster by running the following. + +``` +helm upgrade -i aws-load-balancer-controller eks/aws-load-balancer-controller \ + --set clusterName=<> \ + --set serviceAccount.create=false \ + --set serviceAccount.name=aws-load-balancer-controller \ + -n kube-system +``` + +Verify the install completed by running `kubectl get deployment -n kube-system aws-load-balancer-controller`. It should +return a result like the following. + +``` +NAME READY UP-TO-DATE AVAILABLE AGE +aws-load-balancer-controller 2/2 2 2 142m +``` + +Now that the controller has been set up, we can enable ingress by updating the values.yaml (or any other values.yaml +file used to deploy datahub). Change datahub-frontend values to the following. + +``` +datahub-frontend: + enabled: true + image: + repository: linkedin/datahub-frontend-react + tag: "latest" + ingress: + enabled: true + annotations: + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: instance + alb.ingress.kubernetes.io/certificate-arn: <> + alb.ingress.kubernetes.io/inbound-cidrs: 0.0.0.0/0 + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]' + alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}' + hosts: + - host: <> + redirectPaths: + - path: /* + name: ssl-redirect + port: use-annotation + paths: + - /* +``` + +You need to request a certificate in the AWS Certificate Manager by following this +[guide](https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html), and replace certificate-arn with +the ARN of the new certificate. You also need to replace host-name with the hostname of choice like +demo.datahubproject.io. 
+ +To have the metadata [authentication service](../authentication/introducing-metadata-service-authentication.md#configuring-metadata-service-authentication) enabled and use [API tokens](../authentication/personal-access-tokens.md#creating-personal-access-tokens) from the UI you will need to set the configuration in the values.yaml for the `gms` and the `frontend` deployments. This could be done by enabling the `metadata_service_authentication`: + +``` +datahub: + metadata_service_authentication: + enabled: true +``` + +After updating the yaml file, run the following to apply the updates. + +``` +helm upgrade --install datahub datahub/datahub --values values.yaml +``` + +Once the upgrade completes, run `kubectl get ingress` to verify the ingress setup. You should see a result like the +following. + +``` +NAME CLASS HOSTS ADDRESS PORTS AGE +datahub-datahub-frontend demo.datahubproject.io k8s-default-datahubd-80b034d83e-904097062.us-west-2.elb.amazonaws.com 80 3h5m +``` + +Note down the elb address in the address column. Add the DNS CNAME record to the host domain pointing the host-name ( +from above) to the elb address. DNS updates generally take a few minutes to an hour. Once that is done, you should be +able to access datahub-frontend through the host-name. + +## Use AWS managed services for the storage layer + +Managing the storage services like MySQL, Elasticsearch, and Kafka as kubernetes pods requires a great deal of +maintenance workload. To reduce the workload, you can use managed services like AWS [RDS](https://aws.amazon.com/rds), +[Elasticsearch Service](https://aws.amazon.com/elasticsearch-service/), and [Managed Kafka](https://aws.amazon.com/msk/) +as the storage layer for DataHub. Support for using AWS Neptune as graph DB is coming soon. + +### RDS + +Provision a MySQL database in AWS RDS that shares the VPC with the kubernetes cluster or has VPC peering set up between +the VPC of the kubernetes cluster. Once the database is provisioned, you should be able to see the following page. Take +a note of the endpoint marked by the red box. + +

+ +

+ +First, add the DB password to kubernetes by running the following. + +``` +kubectl delete secret mysql-secrets +kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=<> +``` + +Update the sql settings under global in the values.yaml as follows. + +``` + sql: + datasource: + host: "<>:3306" + hostForMysqlClient: "<>" + port: "3306" + url: "jdbc:mysql://<>:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8" + driver: "com.mysql.jdbc.Driver" + username: "root" + password: + secretRef: mysql-secrets + secretKey: mysql-root-password +``` + +Run `helm upgrade --install datahub datahub/datahub --values values.yaml` to apply the changes. + +### Elasticsearch Service + +Provision an elasticsearch domain running elasticsearch version 7.10 or above that shares the VPC with the kubernetes +cluster or has VPC peering set up between the VPC of the kubernetes cluster. Once the domain is provisioned, you should +be able to see the following page. Take a note of the endpoint marked by the red box. + +

+ +

+ +Update the elasticsearch settings under global in the values.yaml as follows. + +``` + elasticsearch: + host: <> + port: "443" + useSSL: "true" +``` + +You can also allow communication via HTTP (without SSL) by using the settings below. + +``` + elasticsearch: + host: <> + port: "80" +``` + +If you have fine-grained access control enabled with basic authentication, first run the following to create a k8s +secret with the password. + +``` +kubectl delete secret elasticsearch-secrets +kubectl create secret generic elasticsearch-secrets --from-literal=elasticsearch-password=<> +``` + +Then use the settings below. + +``` + elasticsearch: + host: <> + port: "443" + useSSL: "true" + auth: + username: <> + password: + secretRef: elasticsearch-secrets + secretKey: elasticsearch-password +``` + +If you have access control enabled with IAM auth, enable AWS auth signing in Datahub + +``` + OPENSEARCH_USE_AWS_IAM_AUTH=true +``` + +Then use the settings below. + +``` + elasticsearch: + host: <> + port: "443" + useSSL: "true" + region: <> +``` + +Lastly, you **NEED** to set the following env variable for **elasticsearchSetupJob**. AWS Elasticsearch/Opensearch +service uses OpenDistro version of Elasticsearch, which does not support the "datastream" functionality. As such, we use +a different way of creating time based indices. + +``` + elasticsearchSetupJob: + enabled: true + image: + repository: linkedin/datahub-elasticsearch-setup + tag: "***" + extraEnvs: + - name: USE_AWS_ELASTICSEARCH + value: "true" +``` + +Run `helm upgrade --install datahub datahub/datahub --values values.yaml` to apply the changes. + +**Note:** +If you have a custom setup of elastic search cluster and are deploying through docker, you can modify the configurations +in datahub to point to the specific ES instance - + +1. If you are using `docker quickstart` you can modify the hostname and port of the ES instance in docker compose + quickstart files located [here](https://github.com/datahub-project/datahub/blob/master/docker/quickstart/). + 1. Once you have modified the quickstart recipes you can run the quickstart command using a specific docker compose + file. Sample command for that is + - `datahub docker quickstart --quickstart-compose-file docker/quickstart/docker-compose-without-neo4j.quickstart.yml` +2. If you are not using quickstart recipes, you can modify environment variable in GMS to point to the ES instance. The + env files for datahub-gms are located [here](https://github.com/datahub-project/datahub/blob/master/docker/datahub-gms/env/). + +Further, you can find a list of properties supported to work with a custom ES +instance [here](https://github.com/datahub-project/datahub/blob/master/metadata-service/factories/src/main/java/com/linkedin/gms/factory/common/ElasticsearchSSLContextFactory.java) +and [here](https://github.com/datahub-project/datahub/blob/master/metadata-service/factories/src/main/java/com/linkedin/gms/factory/common/RestHighLevelClientFactory.java) +. + +A mapping between the property name used in the above two files and the name used in docker/env file can be +found [here](https://github.com/datahub-project/datahub/blob/master/metadata-service/configuration/src/main/resources/application.yml). + +### Managed Streaming for Apache Kafka (MSK) + +Provision an MSK cluster that shares the VPC with the kubernetes cluster or has VPC peering set up between the VPC of +the kubernetes cluster. Once the domain is provisioned, click on the “View client information” button in the ‘Cluster +Summary” section. 
You should see a page like the one below. Take a note of the endpoints marked by the red boxes.
+
+

+ +

+ +Update the kafka settings under global in the values.yaml as follows. + +``` +kafka: + bootstrap: + server: "<>" + zookeeper: + server: "<>" + schemaregistry: + url: "http://prerequisites-cp-schema-registry:8081" + partitions: 3 + replicationFactor: 3 +``` + +Note, the number of partitions and replicationFactor should match the number of bootstrap servers. This is by default 3 +for AWS MSK. + +Run `helm upgrade --install datahub datahub/datahub --values values.yaml` to apply the changes. + +### AWS Glue Schema Registry + +> **WARNING**: AWS Glue Schema Registry DOES NOT have a python SDK. As such, python based libraries like ingestion or datahub-actions (UI ingestion) is not supported when using AWS Glue Schema Registry + +You can use AWS Glue schema registry instead of the kafka schema registry. To do so, first provision an AWS Glue schema +registry in the "Schema Registry" tab in the AWS Glue console page. + +Once the registry is provisioned, you can change helm chart as follows. + +``` +kafka: + bootstrap: + ... + zookeeper: + ... + schemaregistry: + type: AWS_GLUE + glue: + region: <> + registry: <> +``` + +Note, it will use the name of the topic as the schema name in the registry. + +Before you update the pods, you need to give the k8s worker nodes the correct permissions to access the schema registry. + +The minimum permissions required looks like this + +``` +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "VisualEditor0", + "Effect": "Allow", + "Action": [ + "glue:GetRegistry", + "glue:ListRegistries", + "glue:CreateSchema", + "glue:UpdateSchema", + "glue:GetSchema", + "glue:ListSchemas", + "glue:RegisterSchemaVersion", + "glue:GetSchemaByDefinition", + "glue:GetSchemaVersion", + "glue:GetSchemaVersionsDiff", + "glue:ListSchemaVersions", + "glue:CheckSchemaVersionValidity", + "glue:PutSchemaVersionMetadata", + "glue:QuerySchemaVersionMetadata" + ], + "Resource": [ + "arn:aws:glue:*:795586375822:schema/*", + "arn:aws:glue:us-west-2:795586375822:registry/demo-shared" + ] + }, + { + "Sid": "VisualEditor1", + "Effect": "Allow", + "Action": [ + "glue:GetSchemaVersion" + ], + "Resource": [ + "*" + ] + } + ] +} +``` + +The latter part is required to have "\*" as the resource because of an issue in the AWS Glue schema registry library. +Refer to [this issue](https://github.com/awslabs/aws-glue-schema-registry/issues/68) for any updates. + +Glue currently doesn't support AWS Signature V4. As such, we cannot use service accounts to give permissions to access +the schema registry. The workaround is to give the above permission to the EKS worker node's IAM role. Refer +to [this issue](https://github.com/awslabs/aws-glue-schema-registry/issues/69) for any updates. + +Run `helm upgrade --install datahub datahub/datahub --values values.yaml` to apply the changes. + +Note, you will be seeing log "Schema Version Id is null. Trying to register the schema" on every request. This log is +misleading, so should be ignored. Schemas are cached, so it does not register a new version on every request (aka no +performance issues). This has been fixed by [this PR](https://github.com/awslabs/aws-glue-schema-registry/pull/64) but +the code has not been released yet. We will update version once a new release is out. + +### IAM policies for UI-based ingestion + +This section details how to attach policies to the acryl-datahub-actions pod that powers UI-based ingestion. 
For some of
+the ingestion recipes, you specify login creds in the recipe itself, making it easy to set up auth to grab metadata
+from the data source. However, for AWS resources, the recommendation is to use IAM roles and policies to gate requests
+to access metadata on these resources.
+
+To do this, let's follow
+this [guide](https://docs.aws.amazon.com/eks/latest/userguide/create-service-account-iam-policy-and-role.html) to
+associate a kubernetes service account with an IAM role. Then we can attach this IAM role to the acryl-datahub-actions
+pod to let the pod assume the specified role.
+
+First, you must create an IAM policy with all the permissions needed to run ingestion. This is specific to each
+connector and the set of metadata you are trying to pull. For example, profiling requires more permissions, since it needs
+access to the data, not just the metadata. Let's assume the ARN of that policy
+is `arn:aws:iam::<>:policy/policy1`.
+
+Then, the easiest way to create a service account with the policy attached is to use [eksctl](https://eksctl.io/). You can run the
+following command to do so.
+
+```
+eksctl create iamserviceaccount \
+    --name <> \
+    --namespace <> \
+    --cluster <> \
+    --attach-policy-arn <> \
+    --approve \
+    --override-existing-serviceaccounts
+```
+
+For example, running the following will create a service account "acryl-datahub-actions" in the datahub namespace of
+the datahub EKS cluster with `arn:aws:iam::<>:policy/policy1` attached.
+
+```
+eksctl create iamserviceaccount \
+    --name acryl-datahub-actions \
+    --namespace datahub \
+    --cluster datahub \
+    --attach-policy-arn arn:aws:iam::<>:policy/policy1 \
+    --approve \
+    --override-existing-serviceaccounts
+```
+
+Lastly, in the helm values.yaml, you can add the following under acryl-datahub-actions to attach the service account to
+the acryl-datahub-actions pod.
+
+```yaml
+acryl-datahub-actions:
+  enabled: true
+  serviceAccount:
+    name: <>
+  ...
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/confluent-cloud.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/confluent-cloud.md
new file mode 100644
index 0000000000000..a950ddcc3da3a
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/confluent-cloud.md
@@ -0,0 +1,242 @@
+---
+title: Integrating with Confluent Cloud
+slug: /deploy/confluent-cloud
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/deploy/confluent-cloud.md
+---
+
+# Integrating with Confluent Cloud
+
+DataHub provides the ability to easily leverage Confluent Cloud as your Kafka provider. To do so, you'll need to configure DataHub to talk to a broker and schema registry hosted by Confluent.
+
+Doing this is a matter of configuring the Kafka Producer and Consumers used by DataHub correctly. There are 2 places where Kafka configuration should be provided: the metadata service (GMS) and the frontend server (datahub-frontend). Follow the steps below to configure these components for your deployment.
+
+## **Step 1: Create topics in Confluent Control Center**
+
+First, you'll need to create the following new topics in the [Confluent Control Center](https://docs.confluent.io/platform/current/control-center/index.html). By default they have the following names:
+
+1. **MetadataChangeProposal_v1**
+2. **FailedMetadataChangeProposal_v1**
+3. **MetadataChangeLog_Versioned_v1**
+4. **MetadataChangeLog_Timeseries_v1**
+5. **DataHubUsageEvent_v1**: User behavior tracking event for UI
+6.
(Deprecated) **MetadataChangeEvent_v4**: Metadata change proposal messages +7. (Deprecated) **MetadataAuditEvent_v4**: Metadata change log messages +8. (Deprecated) **FailedMetadataChangeEvent_v4**: Failed to process #1 event + +The first five are the most important, and are explained in more depth in [MCP/MCL](../advanced/mcp-mcl.md). The final topics are +those which are deprecated but still used under certain circumstances. It is likely that in the future they will be completely +decommissioned. + +To create the topics, navigate to your **Cluster** and click "Create Topic". Feel free to tweak the default topic configurations to +match your preferences. + +

+ +

+ +## Step 2: Configure DataHub Container to use Confluent Cloud Topics + +### Docker Compose + +If you are deploying DataHub via docker compose, enabling connection to Confluent is a matter of a) creating topics in the Confluent Control Center and b) changing the default container environment variables. + +First, configure GMS to connect to Confluent Cloud by changing `docker/gms/env/docker.env`: + +``` +KAFKA_BOOTSTRAP_SERVER=pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 +KAFKA_SCHEMAREGISTRY_URL=https://plrm-qwlpp.us-east-2.aws.confluent.cloud + +# Confluent Cloud Configs +SPRING_KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL +SPRING_KAFKA_PROPERTIES_SASL_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required username='XFA45EL1QFUQP4PA' password='ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU'; +SPRING_KAFKA_PROPERTIES_SASL_MECHANISM=PLAIN +SPRING_KAFKA_PROPERTIES_CLIENT_DNS_LOOKUP=use_all_dns_ips +SPRING_KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO +SPRING_KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23 +``` + +Next, configure datahub-frontend to connect to Confluent Cloud by changing `docker/datahub-frontend/env/docker.env`: + +``` +KAFKA_BOOTSTRAP_SERVER=pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 + +# Confluent Cloud Configs +KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL +KAFKA_PROPERTIES_SASL_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required username='XFA45EL1QFUQP4PA' password='ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU'; +KAFKA_PROPERTIES_SASL_MECHANISM=PLAIN +KAFKA_PROPERTIES_CLIENT_DNS_LOOKUP=use_all_dns_ips +KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO +KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23 +``` + +Note that this step is only required if `DATAHUB_ANALYTICS_ENABLED` environment variable is not explicitly set to false for the datahub-frontend +container. + +If you're deploying with Docker Compose, you do not need to deploy the Zookeeper, Kafka Broker, or Schema Registry containers that ship by default. + +#### DataHub Actions + +Configuring Confluent Cloud for DataHub Actions requires some additional edits to your `executor.yaml`. Under the Kafka +source connection config you will need to add the Python style client connection information: + +```yaml +connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + consumer_config: + security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} + sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} + sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} + schema_registry_config: + basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} +``` + +Specifically `sasl.username` and `sasl.password` are the differences from the base `executor.yaml` example file. + +Additionally, you will need to set up environment variables for `KAFKA_PROPERTIES_SASL_USERNAME` and `KAFKA_PROPERTIES_SASL_PASSWORD` +which will use the same username and API Key you generated for the JAAS config. + +See [Overwriting a System Action Config](https://github.com/acryldata/datahub-actions/blob/main/docker/README.md#overwriting-a-system-action-config) for detailed reflection procedures. 
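Before wiring these values into DataHub, it can help to confirm that the SASL credentials work at all. The snippet below is a small sketch, assuming the `confluent-kafka` Python client is installed and reusing the illustrative bootstrap server and API key from the examples above; it simply lists the topics visible to those credentials, so a failure here points at the credentials rather than at DataHub.

```python
# Hedged sketch: verify Confluent Cloud SASL credentials by listing topics.
# Requires `pip install confluent-kafka`; the endpoint and key/secret below are
# illustrative and should be replaced with your own values.
from confluent_kafka.admin import AdminClient

conf = {
    "bootstrap.servers": "pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XFA45EL1QFUQP4PA",  # Kafka cluster API key
    "sasl.password": "<api-secret>",       # Kafka cluster API secret
}

admin = AdminClient(conf)
metadata = admin.list_topics(timeout=10)
for topic in sorted(metadata.topics):
    print(topic)
```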
+ +Next, configure datahub-actions to connect to Confluent Cloud by changing `docker/datahub-actions/env/docker.env`: + +``` +KAFKA_BOOTSTRAP_SERVER=pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 +SCHEMA_REGISTRY_URL=https://plrm-qwlpp.us-east-2.aws.confluent.cloud + +# Confluent Cloud Configs +KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL +KAFKA_PROPERTIES_SASL_USERNAME=XFA45EL1QFUQP4PA +KAFKA_PROPERTIES_SASL_PASSWORD=ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU +KAFKA_PROPERTIES_SASL_MECHANISM=PLAIN +KAFKA_PROPERTIES_CLIENT_DNS_LOOKUP=use_all_dns_ips +KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE=USER_INFO +KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO=P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23 +``` + +### Helm + +If you're deploying on K8s using Helm, you can simply change the **datahub-helm** `values.yml` to point to Confluent Cloud and disable some default containers: + +First, disable the `cp-schema-registry` service: + +``` +cp-schema-registry: + enabled: false +``` + +Next, disable the `kafkaSetupJob` service: + +``` +kafkaSetupJob: + enabled: false +``` + +Then, update the `kafka` configurations to point to your Confluent Cloud broker and schema registry instance, along with the topics you've created in Step 1: + +``` +kafka: + bootstrap: + server: pkc-g4ml2.eu-west-2.aws.confluent.cloud:9092 + schemaregistry: + url: https://plrm-qwlpp.us-east-2.aws.confluent.cloud +``` + +Next, you'll want to create 2 new Kubernetes secrets, one for the JaaS configuration which contains the username and password for Confluent, +and another for the user info used for connecting to the schema registry. You'll find the values for each within the Confluent Control Center. Specifically, +select "Clients" -> "Configure new Java Client". You should see a page like the following: + +

+ +

+ +You'll want to generate both a Kafka Cluster API Key & a Schema Registry key. Once you do so,you should see the config +automatically populate with your new secrets: + +

+ +

+ +You'll need to copy the values of `sasl.jaas.config` and `basic.auth.user.info` +for the next step. + +The next step is to create K8s secrets containing the config values you've just generated. Specifically, you can run the following commands: + +```shell +kubectl create secret generic confluent-secrets --from-literal=sasl_jaas_config="" +kubectl create secret generic confluent-secrets --from-literal=basic_auth_user_info="" +``` + +With your config values substituted as appropriate. For example, in our case we'd run: + +```shell +kubectl create secret generic confluent-secrets --from-literal=sasl_jaas_config="org.apache.kafka.common.security.plain.PlainLoginModule required username='XFA45EL1QFUQP4PA' password='ltyf96EvR1YYutsjLB3ZYfrk+yfCXD8sQHCE3EMp57A2jNs4RR7J1bU9k6lM6rU';" +kubectl create secret generic confluent-secrets --from-literal=basic_auth_user_info="P2ETAN5QR2LCWL14:RTjqw7AfETDl0RZo/7R0123LhPYs2TGjFKmvMWUFnlJ3uKubFbB1Sfs7aOjjNi1m23" +``` + +Finally, we'll configure our containers to pick up the Confluent Kafka Configs by changing two config blocks in our `values.yaml` file. You +should see these blocks commented at the bottom of the template. You'll want to uncomment them and set them to the following values: + +``` +credentialsAndCertsSecrets: + name: confluent-secrets + secureEnv: + sasl.jaas.config: sasl_jaas_config + basic.auth.user.info: basic_auth_user_info + + +springKafkaConfigurationOverrides: + security.protocol: SASL_SSL + sasl.mechanism: PLAIN + client.dns.lookup: use_all_dns_ips + basic.auth.credentials.source: USER_INFO +``` + +Then simply apply the updated `values.yaml` to your K8s cluster via `kubectl apply`. + +#### DataHub Actions + +Configuring Confluent Cloud for DataHub Actions requires some additional edits to your `executor.yaml`. Under the Kafka +source connection config you will need to add the Python style client connection information: + +```yaml +connection: + bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092} + schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081} + consumer_config: + security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT} + sasl.mechanism: ${KAFKA_PROPERTIES_SASL_MECHANISM:-PLAIN} + sasl.username: ${KAFKA_PROPERTIES_SASL_USERNAME} + sasl.password: ${KAFKA_PROPERTIES_SASL_PASSWORD} + schema_registry_config: + basic.auth.user.info: ${KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO} +``` + +Specifically `sasl.username` and `sasl.password` are the differences from the base `executor.yaml` example file. + +Additionally, you will need to set up secrets for `KAFKA_PROPERTIES_SASL_USERNAME` and `KAFKA_PROPERTIES_SASL_PASSWORD` +which will use the same username and API Key you generated for the JAAS config. + +See [Overwriting a System Action Config](https://github.com/acryldata/datahub-actions/blob/main/docker/README.md#overwriting-a-system-action-config) for detailed reflection procedures. + +```yaml +credentialsAndCertsSecrets: + name: confluent-secrets + secureEnv: + sasl.jaas.config: sasl_jaas_config + basic.auth.user.info: basic_auth_user_info + sasl.username: sasl_username + sasl.password: sasl_password +``` + +The Actions pod will automatically pick these up in the correctly named environment variables when they are named this exact way. + +## Contribution + +Accepting contributions for a setup script compatible with Confluent Cloud! + +The kafka-setup-job container we ship with is only compatible with a distribution of Kafka wherein ZooKeeper +is exposed and available. 
A version of the job using the [Confluent CLI](https://docs.confluent.io/confluent-cli/current/command-reference/kafka/topic/confluent_kafka_topic_create.html) +would be very useful for the broader community. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/environment-vars.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/environment-vars.md new file mode 100644 index 0000000000000..84390fee56d9b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/environment-vars.md @@ -0,0 +1,90 @@ +--- +title: Deployment Environment Variables +sidebar_label: Deployment Environment Variables +slug: /deploy/environment-vars +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/deploy/environment-vars.md +--- + +# Environment Variables + +The following is a summary of a few important environment variables which expose various levers which control how +DataHub works. + +## Feature Flags + +| Variable | Default | Unit/Type | Components | Description | +| ------------------------------------------------ | ------- | --------- | --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | --- | +| `UI_INGESTION_ENABLED` | `true` | boolean | [`GMS`, `MCE Consumer`] | Enable UI based ingestion. | +| `DATAHUB_ANALYTICS_ENABLED` | `true` | boolean | [`Frontend`, `GMS`] | Collect DataHub usage to populate the analytics dashboard. | | +| `BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE` | `true` | boolean | [`GMS`, `MCE Consumer`, `MAE Consumer`] | Do not wait for the `system-update` to complete before starting. This should typically only be disabled during development. | + +## Ingestion + +| Variable | Default | Unit/Type | Components | Description | +| --------------------------------- | ------- | --------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | +| `ASYNC_INGESTION_DEFAULT` | `false` | boolean | [`GMS`] | Asynchronously process ingestProposals by writing the ingestion MCP to Kafka. Typically enabled with standalone consumers. | +| `MCP_CONSUMER_ENABLED` | `true` | boolean | [`GMS`, `MCE Consumer`] | When running in standalone mode, disabled on `GMS` and enabled on separate `MCE Consumer`. | +| `MCL_CONSUMER_ENABLED` | `true` | boolean | [`GMS`, `MAE Consumer`] | When running in standalone mode, disabled on `GMS` and enabled on separate `MAE Consumer`. | +| `PE_CONSUMER_ENABLED` | `true` | boolean | [`GMS`, `MAE Consumer`] | When running in standalone mode, disabled on `GMS` and enabled on separate `MAE Consumer`. | +| `ES_BULK_REQUESTS_LIMIT` | 1000 | docs | [`GMS`, `MAE Consumer`] | Number of bulk documents to index. `MAE Consumer` if standalone. | +| `ES_BULK_FLUSH_PERIOD` | 1 | seconds | [`GMS`, `MAE Consumer`] | How frequently indexed documents are made available for query. | +| `ALWAYS_EMIT_CHANGE_LOG` | `false` | boolean | [`GMS`] | Enables always emitting a MCL even when no changes are detected. Used for Time Based Lineage when no changes occur. | | +| `GRAPH_SERVICE_DIFF_MODE_ENABLED` | `true` | boolean | [`GMS`] | Enables diff mode for graph writes, uses a different code path that produces a diff from previous to next to write relationships instead of wholesale deleting edges and reading. 
| + +## Caching + +| Variable | Default | Unit/Type | Components | Description | +| ------------------------------------------ | -------- | --------- | ---------- | ------------------------------------------------------------------------------------ | +| `SEARCH_SERVICE_ENABLE_CACHE` | `false` | boolean | [`GMS`] | Enable caching of search results. | +| `SEARCH_SERVICE_CACHE_IMPLEMENTATION` | caffeine | string | [`GMS`] | Set to `hazelcast` if the number of GMS replicas > 1 for enabling distributed cache. | +| `CACHE_TTL_SECONDS` | 600 | seconds | [`GMS`] | Default cache time to live. | +| `CACHE_MAX_SIZE` | 10000 | objects | [`GMS`] | Maximum number of items to cache. | +| `LINEAGE_SEARCH_CACHE_ENABLED` | `true` | boolean | [`GMS`] | Enables in-memory cache for searchAcrossLineage query. | +| `CACHE_ENTITY_COUNTS_TTL_SECONDS` | 600 | seconds | [`GMS`] | Homepage entity count time to live. | +| `CACHE_SEARCH_LINEAGE_TTL_SECONDS` | 86400 | seconds | [`GMS`] | Search lineage cache time to live. | +| `CACHE_SEARCH_LINEAGE_LIGHTNING_THRESHOLD` | 300 | objects | [`GMS`] | Lineage graphs exceeding this limit will use a local cache. | + +## Search + +| Variable | Default | Unit/Type | Components | Description | +| --------------------------------------------------- | ------------------- | --------- | --------------------------------------------------------------- | ------------------------------------------------------------------------ | +| `INDEX_PREFIX` | `` | string | [`GMS`, `MAE Consumer`, `Elasticsearch Setup`, `System Update`] | Prefix Elasticsearch indices with the given string. | +| `ELASTICSEARCH_NUM_SHARDS_PER_INDEX` | 1 | integer | [`System Update`] | Default number of shards per Elasticsearch index. | +| `ELASTICSEARCH_NUM_REPLICAS_PER_INDEX` | 1 | integer | [`System Update`] | Default number of replica per Elasticsearch index. | +| `ELASTICSEARCH_BUILD_INDICES_RETENTION_VALUE` | 60 | integer | [`System Update`] | Number of units for the retention of Elasticsearch clone/backup indices. | +| `ELASTICSEARCH_BUILD_INDICES_RETENTION_UNIT` | DAYS | string | [`System Update`] | Unit for the retention of Elasticsearch clone/backup indices. | +| `ELASTICSEARCH_QUERY_EXACT_MATCH_EXCLUSIVE` | `false` | boolean | [`GMS`] | Only return exact matches when using quotes. | +| `ELASTICSEARCH_QUERY_EXACT_MATCH_WITH_PREFIX` | `true` | boolean | [`GMS`] | Include prefix match in exact match results. | +| `ELASTICSEARCH_QUERY_EXACT_MATCH_FACTOR` | 10.0 | float | [`GMS`] | Multiply by this number on true exact match. | +| `ELASTICSEARCH_QUERY_EXACT_MATCH_PREFIX_FACTOR` | 1.6 | float | [`GMS`] | Multiply by this number when prefix match. | +| `ELASTICSEARCH_QUERY_EXACT_MATCH_CASE_FACTOR` | 0.7 | float | [`GMS`] | Multiply by this number when case insensitive match. | +| `ELASTICSEARCH_QUERY_EXACT_MATCH_ENABLE_STRUCTURED` | `true` | boolean | [`GMS`] | When using structured query, also include exact matches. | +| `ELASTICSEARCH_QUERY_PARTIAL_URN_FACTOR` | 0.5 | float | [`GMS`] | Multiply by this number when partial token match on URN) | +| `ELASTICSEARCH_QUERY_PARTIAL_FACTOR` | 0.4 | float | [`GMS`] | Multiply by this number when partial token match on non-URN field. | +| `ELASTICSEARCH_QUERY_CUSTOM_CONFIG_ENABLED` | `false` | boolean | [`GMS`] | Enable search query and ranking customization configuration. | +| `ELASTICSEARCH_QUERY_CUSTOM_CONFIG_FILE` | `search_config.yml` | string | [`GMS`] | The location of the search customization configuration. 
| + +## Kafka + +In general, there are **lots** of Kafka configuration environment variables for both the producer and consumers defined in the official Spring Kafka documentation [here](https://docs.spring.io/spring-boot/docs/2.7.10/reference/html/application-properties.html#appendix.application-properties.integration). +These environment variables follow the standard Spring representation of properties as environment variables. +Simply replace the dot, `.`, with an underscore, `_`, and convert to uppercase. + +| Variable | Default | Unit/Type | Components | Description | +| --------------------------------------------------- | -------------------------------------------- | --------- | --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `KAFKA_LISTENER_CONCURRENCY` | 1 | integer | [`GMS`, `MCE Consumer`, `MAE Consumer`] | Number of Kafka consumer threads. Optimize throughput by matching to topic partitions. | +| `SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE` | 1048576 | bytes | [`GMS`, `MCE Consumer`, `MAE Consumer`] | Max produced message size. Note that the topic configuration is not controlled by this variable. | +| `SCHEMA_REGISTRY_TYPE` | `INTERNAL` | string | [`GMS`, `MCE Consumer`, `MAE Consumer`] | Schema registry implementation. One of `INTERNAL` or `KAFKA` or `AWS_GLUE` | +| `KAFKA_SCHEMAREGISTRY_URL` | `http://localhost:8080/schema-registry/api/` | string | [`GMS`, `MCE Consumer`, `MAE Consumer`] | Schema registry url. Used for `INTERNAL` and `KAFKA`. The default value is for the `GMS` component. The `MCE Consumer` and `MAE Consumer` should be the `GMS` hostname and port. | +| `AWS_GLUE_SCHEMA_REGISTRY_REGION` | `us-east-1` | string | [`GMS`, `MCE Consumer`, `MAE Consumer`] | If using `AWS_GLUE` in the `SCHEMA_REGISTRY_TYPE` variable for the schema registry implementation. | +| `AWS_GLUE_SCHEMA_REGISTRY_NAME` | `` | string | [`GMS`, `MCE Consumer`, `MAE Consumer`] | If using `AWS_GLUE` in the `SCHEMA_REGISTRY_TYPE` variable for the schema registry. | +| `USE_CONFLUENT_SCHEMA_REGISTRY` | `true` | boolean | [`kafka-setup`] | Enable Confluent schema registry configuration. | + +## Frontend + +| Variable | Default | Unit/Type | Components | Description | +| ---------------------------------- | -------- | --------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------- | +| `AUTH_VERBOSE_LOGGING` | `false` | boolean | [`Frontend`] | Enable verbose authentication logging. Enabling this will leak sensisitve information in the logs. Disable when finished debugging. | +| `AUTH_OIDC_GROUPS_CLAIM` | `groups` | string | [`Frontend`] | Claim to use as the user's group. | +| `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` | `false` | boolean | [`Frontend`] | Auto-provision the group from the user's group claim. 
| diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/gcp.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/gcp.md new file mode 100644 index 0000000000000..d70696c61eeef --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/gcp.md @@ -0,0 +1,115 @@ +--- +title: Deploying to GCP +sidebar_label: Deploying to GCP +slug: /deploy/gcp +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/deploy/gcp.md" +--- + +# GCP setup guide + +The following is a set of instructions to quickstart DataHub on GCP Google Kubernetes Engine (GKE). Note, the guide +assumes that you do not have a kubernetes cluster set up. If you are deploying DataHub to an existing cluster, please +skip the corresponding sections. + +## Prerequisites + +This guide requires the following tools: + +- [kubectl](https://kubernetes.io/docs/tasks/tools/) to manage kubernetes resources +- [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support Helm 3. +- [gcloud](https://cloud.google.com/sdk/docs/install) to manage GCP resources + +Follow the +following [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-zonal-cluster#before_you_begin) to +correctly set up Google Cloud SDK. + +After setting up, run `gcloud services enable container.googleapis.com` to make sure GKE service is enabled. + +## Start up a kubernetes cluster on GKE + +Let’s follow this [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-zonal-cluster) to create a +new cluster using gcloud. Run the following command with cluster-name set to the cluster name of choice, and zone set to +the GCP zone you are operating on. + +``` +gcloud container clusters create <> \ + --zone <> \ + -m e2-standard-2 +``` + +The command will provision a GKE cluster powered by 3 e2-standard-2 (2 CPU, 8GB RAM) nodes. + +If you are planning to run the storage layer (MySQL, Elasticsearch, Kafka) as pods in the cluster, you need at least 3 +nodes with the above specs. If you decide to use managed storage services, you can reduce the number of nodes or use +m3.medium nodes to save cost. Refer to +this [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-regional-cluster) for creating a regional +cluster for better robustness. + +Run `kubectl get nodes` to confirm that the cluster has been setup correctly. You should get results like below + +``` +NAME STATUS ROLES AGE VERSION +gke-datahub-default-pool-e5be7c4f-8s97 Ready 34h v1.19.10-gke.1600 +gke-datahub-default-pool-e5be7c4f-d68l Ready 34h v1.19.10-gke.1600 +gke-datahub-default-pool-e5be7c4f-rksj Ready 34h v1.19.10-gke.1600 +``` + +## Setup DataHub using Helm + +Once the kubernetes cluster has been set up, you can deploy DataHub and it’s prerequisites using helm. Please follow the +steps in this [guide](kubernetes.md) + +## Expose endpoints using GKE ingress controller + +Now that all the pods are up and running, you need to expose the datahub-frontend end point by setting +up [ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/). Easiest way to set up ingress is to use +the GKE page on [GCP website](https://console.cloud.google.com/kubernetes/discovery). + +Once all deploy is successful, you should see a page like below in the "Services & Ingress" tab on the left. + +

+ +

+ +Tick the checkbox for datahub-datahub-frontend and click "CREATE INGRESS" button. You should land on the following page. + +

+ +

+ +Type in an arbitrary name for the ingress and click on the second step "Host and path rules". You should land on the +following page. + +

+ +

+ +Select "datahub-datahub-frontend" in the dropdown menu for backends, and then click on "ADD HOST AND PATH RULE" button. +In the second row that got created, add in the host name of choice (here gcp.datahubproject.io) and select +"datahub-datahub-frontend" in the backends dropdown. + +This step adds the rule allowing requests from the host name of choice to get routed to datahub-frontend service. Click +on step 3 "Frontend configuration". You should land on the following page. + +

+ +

+ +Choose HTTPS in the dropdown menu for protocol. To enable SSL, you need to add a certificate. If you do not have one, +you can click "CREATE A NEW CERTIFICATE" and input the host name of choice. GCP will create a certificate for you. + +Now press "CREATE" button on the left to create ingress! After around 5 minutes, you should see the following. + +

+ +

+ +In your domain provider, add an A record for the host name set above using the IP address on the ingress page (noted +with the red box). Once DNS updates, you should be able to access DataHub through the host name!! + +Note, ignore the warning icon next to ingress. It takes about ten minutes for ingress to check that the backend service +is ready and show a check mark as follows. However, ingress is fully functional once you see the above page. + +

+ +

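If you prefer the command line to the console view shown above, the external IP for the A record can also be read directly off the Ingress resource, assuming `kubectl` is still pointed at the GKE cluster:

```shell
# The ADDRESS column shows the external IP assigned to the ingress; use it for
# the A record in your domain provider.
kubectl get ingress
```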
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/kubernetes.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/kubernetes.md new file mode 100644 index 0000000000000..fb4bc086a7fee --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/kubernetes.md @@ -0,0 +1,158 @@ +--- +title: Deploying with Kubernetes +sidebar_label: Deploying with Kubernetes +slug: /deploy/kubernetes +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/deploy/kubernetes.md +--- + +# Deploying DataHub with Kubernetes + +## Introduction + +Helm charts for deploying DataHub on a kubernetes cluster is located in +this [repository](https://github.com/acryldata/datahub-helm). We provide charts for +deploying [Datahub](https://github.com/acryldata/datahub-helm/tree/master/charts/datahub) and +it's [dependencies](https://github.com/acryldata/datahub-helm/tree/master/charts/prerequisites) +(Elasticsearch, optionally Neo4j, MySQL, and Kafka) on a Kubernetes cluster. + +This doc is a guide to deploy an instance of DataHub on a kubernetes cluster using the above charts from scratch. + +## Setup + +1. Set up a kubernetes cluster + - In a cloud platform of choice like [Amazon EKS](https://aws.amazon.com/eks), + [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine), + and [Azure Kubernetes Service](https://azure.microsoft.com/en-us/services/kubernetes-service/) OR + - In local environment using [Minikube](https://minikube.sigs.k8s.io/docs/). Note, more than 7GB of RAM is required + to run Datahub and it's dependencies +2. Install the following tools: + - [kubectl](https://kubernetes.io/docs/tasks/tools/) to manage kubernetes resources + - [helm](https://helm.sh/docs/intro/install/) to deploy the resources based on helm charts. Note, we only support + Helm 3. + +## Components + +Datahub consists of 4 main components: [GMS](/docs/metadata-service), +[MAE Consumer](/docs/metadata-jobs/mae-consumer-job) (optional), +[MCE Consumer](/docs/metadata-jobs/mce-consumer-job) (optional), and +[Frontend](/docs/datahub-frontend). Kubernetes deployment for each of the components are +defined as subcharts under the main +[Datahub](https://github.com/acryldata/datahub-helm/tree/master/charts/datahub) +helm chart. + +The main components are powered by 4 external dependencies: + +- Kafka +- Local DB (MySQL, Postgres, MariaDB) +- Search Index (Elasticsearch) +- Graph Index (Supports either Neo4j or Elasticsearch) + +The dependencies must be deployed before deploying Datahub. We created a separate +[chart](https://github.com/acryldata/datahub-helm/tree/master/charts/prerequisites) +for deploying the dependencies with example configuration. They could also be deployed separately on-prem or leveraged +as managed services. To remove your dependency on Neo4j, set enabled to false in +the [values.yaml](https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/values.yaml#L54) for +prerequisites. Then, override the `graph_service_impl` field in +the [values.yaml](https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L63) of datahub +instead of `neo4j`. + +## Quickstart + +Assuming kubectl context points to the correct kubernetes cluster, first create kubernetes secrets that contain MySQL +and Neo4j passwords. 
+ +```(shell) +kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=datahub +kubectl create secret generic neo4j-secrets --from-literal=neo4j-password=datahub +``` + +The above commands sets the passwords to "datahub" as an example. Change to any password of choice. + +Add datahub helm repo by running the following + +```(shell) +helm repo add datahub https://helm.datahubproject.io/ +``` + +Then, deploy the dependencies by running the following + +```(shell) +helm install prerequisites datahub/datahub-prerequisites +``` + +Note, the above uses the default configuration +defined [here](https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/values.yaml). You can change +any of the configuration and deploy by running the following command. + +```(shell) +helm install prerequisites datahub/datahub-prerequisites --values <> +``` + +Run `kubectl get pods` to check whether all the pods for the dependencies are running. You should get a result similar +to below. + +``` +NAME READY STATUS RESTARTS AGE +elasticsearch-master-0 1/1 Running 0 62m +elasticsearch-master-1 1/1 Running 0 62m +elasticsearch-master-2 1/1 Running 0 62m +prerequisites-cp-schema-registry-cf79bfccf-kvjtv 2/2 Running 1 63m +prerequisites-kafka-0 1/1 Running 2 62m +prerequisites-mysql-0 1/1 Running 1 62m +prerequisites-neo4j-community-0 1/1 Running 0 52m +prerequisites-zookeeper-0 1/1 Running 0 62m +``` + +deploy Datahub by running the following + +```(shell) +helm install datahub datahub/datahub +``` + +Values in [values.yaml](https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml) +have been preset to point to the dependencies deployed using +the [prerequisites](https://github.com/acryldata/datahub-helm/tree/master/charts/prerequisites) +chart with release name "prerequisites". If you deployed the helm chart using a different release name, update the +quickstart-values.yaml file accordingly before installing. + +Run `kubectl get pods` to check whether all the datahub pods are running. You should get a result similar to below. + +``` +NAME READY STATUS RESTARTS AGE +datahub-datahub-frontend-84c58df9f7-5bgwx 1/1 Running 0 4m2s +datahub-datahub-gms-58b676f77c-c6pfx 1/1 Running 0 4m2s +datahub-datahub-mae-consumer-7b98bf65d-tjbwx 1/1 Running 0 4m3s +datahub-datahub-mce-consumer-8c57d8587-vjv9m 1/1 Running 0 4m2s +datahub-elasticsearch-setup-job-8dz6b 0/1 Completed 0 4m50s +datahub-kafka-setup-job-6blcj 0/1 Completed 0 4m40s +datahub-mysql-setup-job-b57kc 0/1 Completed 0 4m7s +elasticsearch-master-0 1/1 Running 0 97m +elasticsearch-master-1 1/1 Running 0 97m +elasticsearch-master-2 1/1 Running 0 97m +prerequisites-cp-schema-registry-cf79bfccf-kvjtv 2/2 Running 1 99m +prerequisites-kafka-0 1/1 Running 2 97m +prerequisites-mysql-0 1/1 Running 1 97m +prerequisites-neo4j-community-0 1/1 Running 0 88m +prerequisites-zookeeper-0 1/1 Running 0 97m +``` + +You can run the following to expose the frontend locally. Note, you can find the pod name using the command above. In +this case, the datahub-frontend pod name was `datahub-datahub-frontend-84c58df9f7-5bgwx`. + +```(shell) +kubectl port-forward 9002:9002 +``` + +You should be able to access the frontend via http://localhost:9002. + +Once you confirm that the pods are running well, you can set up ingress for datahub-frontend to expose the 9002 port to +the public. 
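As a minimal illustration of that last step, an Ingress along the following lines would route traffic to the frontend, assuming an NGINX ingress controller is already installed in the cluster. The host name is a placeholder, while the service name and port match the default `datahub` release installed above; adapt it to your own controller and domain (or use a managed load balancer as described in the cloud-specific guides).

```shell
# Sketch only: expose datahub-frontend through an existing NGINX ingress controller.
# Replace datahub.example.com with your own host name.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: datahub-frontend
spec:
  ingressClassName: nginx
  rules:
    - host: datahub.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: datahub-datahub-frontend
                port:
                  number: 9002
EOF
```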
+ +## Other useful commands + +| Command | Description | +| ---------------------- | ----------------------- | +| helm uninstall datahub | Remove DataHub | +| helm ls | List of Helm charts | +| helm history | Fetch a release history | diff --git a/docs-website/versioned_docs/version-0.10.4/docs/deploy/telemetry.md b/docs-website/versioned_docs/version-0.10.4/docs/deploy/telemetry.md new file mode 100644 index 0000000000000..756b0ac1e2c64 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/deploy/telemetry.md @@ -0,0 +1,17 @@ +--- +title: DataHub Telemetry +sidebar_label: Telemetry +slug: /deploy/telemetry +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/deploy/telemetry.md +--- + +# DataHub Telemetry + +## Overview of DataHub Telemetry + +To effectively build and maintain the DataHub Project, we must understand how end-users work within DataHub. Beginning in version 0.8.35, DataHub collects anonymous usage statistics and errors to inform our roadmap priorities and to enable us to proactively address errors. + +Deployments are assigned a UUID which is sent along with event details, Java version, OS, and timestamp; telemetry collection is enabled by default and can be disabled by setting `DATAHUB_TELEMETRY_ENABLED=false` in your Docker Compose config. + +The source code is available [here.](https://github.com/datahub-project/datahub/blob/master/metadata-service/factories/src/main/java/com/linkedin/gms/factory/telemetry/TelemetryUtils.java) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/dev-guides/timeline.md b/docs-website/versioned_docs/version-0.10.4/docs/dev-guides/timeline.md new file mode 100644 index 0000000000000..2d8b2aabc59ca --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/dev-guides/timeline.md @@ -0,0 +1,261 @@ +--- +title: Timeline API +sidebar_label: Timeline API +slug: /dev-guides/timeline +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/dev-guides/timeline.md +--- + +The Timeline API supports viewing version history of schemas, documentation, tags, glossary terms, and other updates +to entities. At present, the API only supports Datasets and Glossary Terms. + +## Compatibility + +The Timeline API is available in server versions `0.8.28` and higher. The `cli` timeline command is available in [pypi](https://pypi.org/project/acryl-datahub/) versions `0.8.27.1` onwards. + +# Concepts + +## Entity Timeline Conceptually + +For the visually inclined, here is a conceptual diagram that illustrates how to think about the entity timeline with categorical changes overlaid on it. + +

+ +

+ +## Change Event + +Each modification is modeled as a +[ChangeEvent](https://github.com/datahub-project/datahub/blob/master/metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeEvent.java) +which are grouped under [ChangeTransactions](https://github.com/datahub-project/datahub/blob/master/metadata-service/services/src/main/java/com/linkedin/metadata/timeline/data/ChangeTransaction.java) +based on timestamp. A `ChangeEvent` consists of: + +- `changeType`: An operational type for the change, either `ADD`, `MODIFY`, or `REMOVE` +- `semVerChange`: A [semver](https://semver.org/) change type based on the compatibility of the change. This gets utilized in the computation of the transaction level version. Options are `NONE`, `PATCH`, `MINOR`, `MAJOR`, and `EXCEPTIONAL` for cases where an exception occurred during processing, but we do not fail the entire change calculation +- `target`: The high level target of the change. This is usually an `urn`, but can differ depending on the type of change. +- `category`: The category a change falls under, specific aspects are mapped to each category depending on the entity +- `elementId`: Optional, the ID of the element being applied to the target +- `description`: A human readable description of the change produced by the `Differ` type computing the diff +- `changeDetails`: A loose property map of additional details about the change + +### Change Event Examples + +- A tag was applied to a _field_ of a dataset through the UI: + - `changeType`: `ADD` + - `target`: `urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:,,),)` -> The field the tag is being added to + - `category`: `TAG` + - `elementId`: `urn:li:tag:` -> The ID of the tag being added + - `semVerChange`: `MINOR` +- A tag was added directly at the top-level to a dataset through the UI: + - `changeType`: `ADD` + - `target`: `urn:li:dataset:(urn:li:dataPlatform:,,)` -> The dataset the tag is being added to + - `category`: `TAG` + - `elementId`: `urn:li:tag:` -> The ID of the tag being added + - `semVerChange`: `MINOR` + +Note the `target` and `elementId` fields in the examples above to familiarize yourself with the semantics. + +## Change Transaction + +Each `ChangeTransaction` is assigned a computed semantic version based on the `ChangeEvents` that occurred within it, +starting at `0.0.0` and updating based on whether the most significant change in the transaction is a `MAJOR`, `MINOR`, or +`PATCH` change. The logic for what changes constitute a Major, Minor or Patch change are encoded in the category specific `Differ` implementation. +For example, the [SchemaMetadataDiffer](https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/timeline/eventgenerator/SchemaMetadataChangeEventGenerator.java) has baked-in logic for determining what level of semantic change an event is based on backwards and forwards incompatibility. Read on to learn about the different categories of changes, and how semantic changes are interpreted in each. + +# Categories + +ChangeTransactions contain a `category` that represents a kind of change that happened. The `Timeline API` allows the caller to specify which categories of changes they are interested in. Categories allow us to abstract away the low-level technical change that happened in the metadata (e.g. the `schemaMetadata` aspect changed) to a high-level semantic change that happened in the metadata (e.g. the `Technical Schema` of the dataset changed). 
Read on to learn about the different categories that are supported today. + +The Dataset entity currently supports the following categories: + +## Technical Schema + +- Any structural changes in the technical schema of the dataset, such as adding, dropping, renaming columns. +- Driven by the `schemaMetadata` aspect. +- Changes are marked with the appropriate semantic version marker based on well-understood rules for backwards and forwards compatibility. + +**_NOTE_**: Changes in field descriptions are not communicated via this category, use the Documentation category for that. + +### Example Usage + +We have provided some example scripts that demonstrate making changes to an aspect within each category and use then use the Timeline API to query the result. +All examples can be found in [smoke-test/test_resources/timeline](https://github.com/datahub-project/datahub/blob/master/smoke-test/test_resources/timeline) and should be executed from that directory. + +```console +% ./test_timeline_schema.sh +[2022-02-24 15:31:52,617] INFO {datahub.cli.delete_cli:130} - DataHub configured with http://localhost:8080 +Successfully deleted urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD). 6 rows deleted +Took 1.077 seconds to hard delete 6 rows for 1 entities +Update succeeded with status 200 +Update succeeded with status 200 +Update succeeded with status 200 +http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahive%2CtestTimelineDataset%2CPROD%29?categories=TECHNICAL_SCHEMA&start=1644874316591&end=2682397800000 +2022-02-24 15:31:53 - 0.0.0-computed + ADD TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:property_id): A forwards & backwards compatible change due to the newly added field 'property_id'. + ADD TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service): A forwards & backwards compatible change due to the newly added field 'service'. + ADD TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.type): A forwards & backwards compatible change due to the newly added field 'service.type'. + ADD TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider): A forwards & backwards compatible change due to the newly added field 'service.provider'. + ADD TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider.name): A forwards & backwards compatible change due to the newly added field 'service.provider.name'. + ADD TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider.id): A forwards & backwards compatible change due to the newly added field 'service.provider.id'. +2022-02-24 15:31:55 - 1.0.0-computed + MODIFY TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider.name): A backwards incompatible change due to native datatype of the field 'service.provider.id' changed from 'varchar(50)' to 'tinyint'. + MODIFY TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider.id): A forwards compatible change due to field name changed from 'service.provider.id' to 'service.provider.id2' +2022-02-24 15:31:55 - 2.0.0-computed + MODIFY TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider.id): A backwards incompatible change due to native datatype of the field 'service.provider.name' changed from 'tinyint' to 'varchar(50)'. 
+ MODIFY TECHNICAL_SCHEMA dataset:hive:testTimelineDataset (field:service.provider.id2): A forwards compatible change due to field name changed from 'service.provider.id2' to 'service.provider.id' +``` + +## Ownership + +- Any changes in ownership of the dataset, adding an owner, or changing the type of the owner. +- Driven by the `ownership` aspect. +- All changes are currently marked as `MINOR`. + +### Example Usage + +We have provided some example scripts that demonstrate making changes to an aspect within each category and use then use the Timeline API to query the result. +All examples can be found in [smoke-test/test_resources/timeline](https://github.com/datahub-project/datahub/blob/master/smoke-test/test_resources/timeline) and should be executed from that directory. + +```console +% ./test_timeline_ownership.sh +[2022-02-24 15:40:25,367] INFO {datahub.cli.delete_cli:130} - DataHub configured with http://localhost:8080 +Successfully deleted urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD). 6 rows deleted +Took 1.087 seconds to hard delete 6 rows for 1 entities +Update succeeded with status 200 +Update succeeded with status 200 +Update succeeded with status 200 +http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahive%2CtestTimelineDataset%2CPROD%29?categories=OWNER&start=1644874829027&end=2682397800000 +2022-02-24 15:40:26 - 0.0.0-computed + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:datahub): A new owner 'datahub' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:jdoe): A new owner 'jdoe' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:40:27 - 0.1.0-computed + REMOVE OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:datahub): Owner 'datahub' of the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. +2022-02-24 15:40:28 - 0.2.0-computed + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:datahub): A new owner 'datahub' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. + REMOVE OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:jdoe): Owner 'jdoe' of the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. +Update succeeded with status 200 +Update succeeded with status 200 +Update succeeded with status 200 +http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahive%2CtestTimelineDataset%2CPROD%29?categories=OWNER&start=1644874831456&end=2682397800000 +2022-02-24 15:40:26 - 0.0.0-computed + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:datahub): A new owner 'datahub' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:jdoe): A new owner 'jdoe' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:40:27 - 0.1.0-computed + REMOVE OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:datahub): Owner 'datahub' of the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. 
+2022-02-24 15:40:28 - 0.2.0-computed + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:datahub): A new owner 'datahub' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. + REMOVE OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:jdoe): Owner 'jdoe' of the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. +2022-02-24 15:40:29 - 0.2.0-computed +2022-02-24 15:40:30 - 0.3.0-computed + ADD OWNERSHIP dataset:hive:testTimelineDataset (urn:li:corpuser:jdoe): A new owner 'jdoe' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:40:30 - 0.4.0-computed + MODIFY OWNERSHIP urn:li:corpuser:jdoe (DEVELOPER): The ownership type of the owner 'jdoe' changed from 'DATAOWNER' to 'DEVELOPER'. +``` + +## Tags + +- Any changes in tags applied to the dataset or to fields of the dataset. +- Driven by the `schemaMetadata`, `editableSchemaMetadata` and `globalTags` aspects. +- All changes are currently marked as `MINOR`. + +### Example Usage + +We have provided some example scripts that demonstrate making changes to an aspect within each category and use then use the Timeline API to query the result. +All examples can be found in [smoke-test/test_resources/timeline](https://github.com/datahub-project/datahub/blob/master/smoke-test/test_resources/timeline) and should be executed from that directory. + +```console +% ./test_timeline_tags.sh +[2022-02-24 15:44:04,279] INFO {datahub.cli.delete_cli:130} - DataHub configured with http://localhost:8080 +Successfully deleted urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD). 9 rows deleted +Took 0.626 seconds to hard delete 9 rows for 1 entities +Update succeeded with status 200 +Update succeeded with status 200 +Update succeeded with status 200 +http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahive%2CtestTimelineDataset%2CPROD%29?categories=TAG&start=1644875047911&end=2682397800000 +2022-02-24 15:44:05 - 0.0.0-computed + ADD TAG dataset:hive:testTimelineDataset (urn:li:tag:Legacy): A new tag 'Legacy' for the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:44:06 - 0.1.0-computed + ADD TAG dataset:hive:testTimelineDataset (urn:li:tag:NeedsDocumentation): A new tag 'NeedsDocumentation' for the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:44:07 - 0.2.0-computed + REMOVE TAG dataset:hive:testTimelineDataset (urn:li:tag:Legacy): Tag 'Legacy' of the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. + REMOVE TAG dataset:hive:testTimelineDataset (urn:li:tag:NeedsDocumentation): Tag 'NeedsDocumentation' of the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. +``` + +## Documentation + +- Any changes to documentation at the dataset level or at the field level. +- Driven by the `datasetProperties`, `institutionalMemory`, `schemaMetadata` and `editableSchemaMetadata`. +- Addition or removal of documentation or links is marked as `MINOR` while edits to existing documentation are marked as `PATCH` changes. + +### Example Usage + +We have provided some example scripts that demonstrate making changes to an aspect within each category and use then use the Timeline API to query the result. 
+All examples can be found in [smoke-test/test_resources/timeline](https://github.com/datahub-project/datahub/blob/master/smoke-test/test_resources/timeline) and should be executed from that directory. + +```console +% ./test_timeline_documentation.sh +[2022-02-24 15:45:53,950] INFO {datahub.cli.delete_cli:130} - DataHub configured with http://localhost:8080 +Successfully deleted urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD). 6 rows deleted +Took 0.578 seconds to hard delete 6 rows for 1 entities +Update succeeded with status 200 +Update succeeded with status 200 +Update succeeded with status 200 +http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahive%2CtestTimelineDataset%2CPROD%29?categories=DOCUMENTATION&start=1644875157616&end=2682397800000 +2022-02-24 15:45:55 - 0.0.0-computed + ADD DOCUMENTATION dataset:hive:testTimelineDataset (https://www.linkedin.com): The institutionalMemory 'https://www.linkedin.com' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:45:56 - 0.1.0-computed + ADD DOCUMENTATION dataset:hive:testTimelineDataset (https://www.google.com): The institutionalMemory 'https://www.google.com' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:45:56 - 0.2.0-computed + ADD DOCUMENTATION dataset:hive:testTimelineDataset (https://datahubproject.io/docs): The institutionalMemory 'https://datahubproject.io/docs' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. + ADD DOCUMENTATION dataset:hive:testTimelineDataset (https://datahubproject.io/docs): The institutionalMemory 'https://datahubproject.io/docs' for the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. + REMOVE DOCUMENTATION dataset:hive:testTimelineDataset (https://www.linkedin.com): The institutionalMemory 'https://www.linkedin.com' of the dataset 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. +``` + +## Glossary Terms + +- Any changes to applied glossary terms to the dataset or to fields in the dataset. +- Driven by the `schemaMetadata`, `editableSchemaMetadata`, `glossaryTerms` aspects. +- All changes are currently marked as `MINOR`. + +### Example Usage + +We have provided some example scripts that demonstrate making changes to an aspect within each category and use then use the Timeline API to query the result. +All examples can be found in [smoke-test/test_resources/timeline](https://github.com/datahub-project/datahub/blob/master/smoke-test/test_resources/timeline) and should be executed from that directory. + +```console +% ./test_timeline_glossary.sh +[2022-02-24 15:44:56,152] INFO {datahub.cli.delete_cli:130} - DataHub configured with http://localhost:8080 +Successfully deleted urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD). 
6 rows deleted +Took 0.443 seconds to hard delete 6 rows for 1 entities +Update succeeded with status 200 +Update succeeded with status 200 +Update succeeded with status 200 +http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Ahive%2CtestTimelineDataset%2CPROD%29?categories=GLOSSARY_TERM&start=1644875100605&end=2682397800000 +1969-12-31 18:00:00 - 0.0.0-computed + None None : java.lang.NullPointerException:null +2022-02-24 15:44:58 - 0.1.0-computed + ADD GLOSSARY_TERM dataset:hive:testTimelineDataset (urn:li:glossaryTerm:SavingsAccount): The GlossaryTerm 'SavingsAccount' for the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been added. +2022-02-24 15:44:59 - 0.2.0-computed + REMOVE GLOSSARY_TERM dataset:hive:testTimelineDataset (urn:li:glossaryTerm:CustomerAccount): The GlossaryTerm 'CustomerAccount' for the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. + REMOVE GLOSSARY_TERM dataset:hive:testTimelineDataset (urn:li:glossaryTerm:SavingsAccount): The GlossaryTerm 'SavingsAccount' for the entity 'urn:li:dataset:(urn:li:dataPlatform:hive,testTimelineDataset,PROD)' has been removed. +``` + +# Explore the API + +The API is browse-able via the UI through through the dropdown. +Here are a few screenshots showing how to navigate to it. You can try out the API and send example requests. + +

+ +

+

+ +

+ +# Future Work + +- Supporting versions as start and end parameters as part of the call to the timeline API +- Supporting entities beyond Datasets +- Adding GraphQL API support +- Supporting materialization of computed versions for entity categories (compared to the current read-time version computation) +- Support in the UI to visualize the timeline in various places (e.g. schema history, etc.) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/developers.md b/docs-website/versioned_docs/version-0.10.4/docs/developers.md new file mode 100644 index 0000000000000..bbb4ea5ed0df0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/developers.md @@ -0,0 +1,165 @@ +--- +title: Local Development +sidebar_label: Local Development +slug: /developers +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/developers.md" +--- + +# DataHub Developer's Guide + +## Pre-requirements + +- [Java 11 SDK](https://openjdk.org/projects/jdk/11/) +- [Docker](https://www.docker.com/) +- [Docker Compose](https://docs.docker.com/compose/) +- Docker engine with at least 8GB of memory to run tests. + +:::note + +Do not try to use a JDK newer than JDK 11. The build process does not work with newer JDKs currently. + +::: + +## Building the Project + +Fork and clone the repository if haven't done so already + +``` +git clone https://github.com/{username}/datahub.git +``` + +Change into the repository's root directory + +``` +cd datahub +``` + +Use [gradle wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.html) to build the project + +``` +./gradlew build +``` + +Note that the above will also run run tests and a number of validations which makes the process considerably slower. + +We suggest partially compiling DataHub according to your needs: + +- Build Datahub's backend GMS (Generalized metadata service): + +``` +./gradlew :metadata-service:war:build +``` + +- Build Datahub's frontend: + +``` +./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint +``` + +- Build DataHub's command line tool: + +``` +./gradlew :metadata-ingestion:installDev +``` + +- Build DataHub's documentation: + +``` +./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript +# To preview the documentation +./gradlew :docs-website:serve +``` + +## Deploying local versions + +Run just once to have the local `datahub` cli tool installed in your $PATH + +``` +cd smoke-test/ +python3 -m venv venv +source venv/bin/activate +pip install --upgrade pip wheel setuptools +pip install -r requirements.txt +cd ../ +``` + +Once you have compiled & packaged the project or appropriate module you can deploy the entire system via docker-compose by running: + +``` +./gradlew quickstart +``` + +Replace whatever container you want in the existing deployment. 
+I.e, replacing datahub's backend (GMS): + +``` +(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate --build datahub-gms) +``` + +Running the local version of the frontend + +``` +(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate --build datahub-frontend-react) +``` + +## IDE Support + +The recommended IDE for DataHub development is [IntelliJ IDEA](https://www.jetbrains.com/idea/). +You can run the following command to generate or update the IntelliJ project file + +``` +./gradlew idea +``` + +Open `datahub.ipr` in IntelliJ to start developing! + +For consistency please import and auto format the code using [LinkedIn IntelliJ Java style](https://github.com/datahub-project/datahub/blob/master/gradle/idea/LinkedIn%20Style.xml). + +## Windows Compatibility + +For optimal performance and compatibility, we strongly recommend building on a Mac or Linux system. +Please note that we do not actively support Windows in a non-virtualized environment. + +If you must use Windows, one workaround is to build within a virtualized environment, such as a VM(Virtual Machine) or [WSL(Windows Subsystem for Linux)](https://learn.microsoft.com/en-us/windows/wsl). +This approach can help ensure that your build environment remains isolated and stable, and that your code is compiled correctly. + +## Common Build Issues + +### Getting `Unsupported class file major version 57` + +You're probably using a Java version that's too new for gradle. Run the following command to check your Java version + +``` +java --version +``` + +While it may be possible to build and run DataHub using newer versions of Java, we currently only support [Java 11](https://openjdk.org/projects/jdk/11/) (aka Java 11). + +### Getting `cannot find symbol` error for `javax.annotation.Generated` + +Similar to the previous issue, please use Java 1.8 to build the project. +You can install multiple version of Java on a single machine and switch between them using the `JAVA_HOME` environment variable. See [this document](https://docs.oracle.com/cd/E21454_01/html/821-2531/inst_jdk_javahome_t.html) for more details. + +### `:metadata-models:generateDataTemplate` task fails with `java.nio.file.InvalidPathException: Illegal char <:> at index XX` or `Caused by: java.lang.IllegalArgumentException: 'other' has different root` error + +This is a [known issue](https://github.com/linkedin/rest.li/issues/287) when building the project on Windows due a bug in the Pegasus plugin. Please refer to [Windows Compatibility](/docs/developers.md#windows-compatibility). + +### Various errors related to `generateDataTemplate` or other `generate` tasks + +As we generate quite a few files from the models, it is possible that old generated files may conflict with new model changes. When this happens, a simple `./gradlew clean` should reosolve the issue. + +### `Execution failed for task ':metadata-service:restli-servlet-impl:checkRestModel'` + +This generally means that an [incompatible change](https://linkedin.github.io/rest.li/modeling/compatibility_check) was introduced to the rest.li API in GMS. 
You'll need to rebuild the snapshots/IDL by running the following command once + +``` +./gradlew :metadata-service:restli-servlet-impl:build -Prest.model.compatibility=ignore +``` + +### `java.io.IOException: No space left on device` + +This means you're running out of space on your disk to build. Please free up some space or try a different disk. + +### `Build failed` for task `./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint` + +This could mean that you need to update your [Yarn](https://yarnpkg.com/getting-started/install) version diff --git a/docs-website/versioned_docs/version-0.10.4/docs/docker/development.md b/docs-website/versioned_docs/version-0.10.4/docs/docker/development.md new file mode 100644 index 0000000000000..22bd9510c409b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/docker/development.md @@ -0,0 +1,146 @@ +--- +title: Using Docker Images During Development +slug: /docker/development +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/docker/development.md +--- + +# Using Docker Images During Development + +We've created a special `docker-compose.dev.yml` override file that should configure docker images to be easier to use +during development. + +Normally, you'd rebuild your images from scratch with a combination of gradle and docker compose commands. However, +this takes way too long for development and requires reasoning about several layers of docker compose configuration +yaml files which can depend on your hardware (Apple M1). + +The `docker-compose.dev.yml` file bypasses the need to rebuild docker images by mounting binaries, startup scripts, +and other data. These dev images, tagged with `debug` will use your _locally built code_ with gradle. +Building locally and bypassing the need to rebuild the Docker images should be much faster. + +We highly recommend you just invoke `./gradlew quickstartDebug` task. + +```shell +./gradlew quickstartDebug +``` + +This task is defined in `docker/build.gradle` and executes the following steps: + +1. Builds all required artifacts to run DataHub. This includes both application code such as the GMS war, the frontend + distribution zip which contains javascript, as wel as secondary support docker containers. + +1. Locally builds Docker images with the expected `debug` tag required by the docker compose files. + +1. Runs the special `docker-compose.dev.yml` and supporting docker-compose files to mount local files directly in the + containers with remote debugging ports enabled. + +Once the `debug` docker images are constructed you'll see images similar to the following: + +```shell +linkedin/datahub-frontend-react debug e52fef698025 28 minutes ago 763MB +linkedin/datahub-kafka-setup debug 3375aaa2b12d 55 minutes ago 659MB +linkedin/datahub-gms debug ea2b0a8ea115 56 minutes ago 408MB +acryldata/datahub-upgrade debug 322377a7a21d 56 minutes ago 463MB +acryldata/datahub-mysql-setup debug 17768edcc3e5 2 hours ago 58.2MB +linkedin/datahub-elasticsearch-setup debug 4d935be7c62c 2 hours ago 26.1MB +``` + +At this point it is possible to view the DataHub UI at `http://localhost:9002` as you normally would with quickstart. + +## Reloading + +Next, perform the desired modifications and rebuild the frontend and/or GMS components. + +**Builds GMS** + +```shell +./gradlew :metadata-service:war:build +``` + +**Builds the frontend** + +Including javascript components. 
+ +```shell +./gradlew :datahub-frontend:build +``` + +After building the artifacts only a restart of the container(s) is required to run with the updated code. +The restart can be performed using a docker UI, the docker cli, or the following gradle task. + +```shell +./gradlew :docker:debugReload +``` + +## Start/Stop + +The following commands can pause the debugging environment to release resources when not needed. + +Pause containers and free resources. + +```shell +docker compose -p datahub stop +``` + +Resume containers for further debugging. + +```shell +docker compose -p datahub start +``` + +## Debugging + +The default debugging process uses your local code and enables debugging by default for both GMS and the frontend. Attach +to the instance using your IDE by using its Remote Java Debugging features. + +Environment variables control the debugging ports for GMS and the frontend. + +- `DATAHUB_MAPPED_GMS_DEBUG_PORT` - Default: 5001 +- `DATAHUB_MAPPED_FRONTEND_DEBUG_PORT` - Default: 5002 + +### IntelliJ Remote Debug Configuration + +The screenshot shows an example configuration for IntelliJ using the default GMS debugging port of 5001. + +

+ +

+
+## Tips for People New To Docker
+
+### Accessing Logs
+
+It is highly recommended you use [Docker Desktop's dashboard](https://www.docker.com/products/docker-desktop) to access service logs. If you double-click an image, it will pull up the logs for you.
+
+### Quickstart Conflicts
+
+If you run quickstart, use `./gradlew quickstartDebug` to return to using the debugging containers.
+
+### Docker Prune
+
+If you run into disk space issues and prune the images & containers, you will need to execute `./gradlew quickstartDebug`
+again.
+
+### System Update
+
+The `datahub-upgrade` job will not block the startup of the other containers as it normally
+does in a quickstart or production environment. Normally this process is required when making updates which
+require Elasticsearch reindexing. If reindexing is required, the UI will render but may temporarily return errors
+until this job finishes.
+
+### Running a specific service
+
+`docker-compose up` will launch all services in the configuration, including dependencies, unless they're already
+running. If you, for some reason, wish to change this behavior, check out these example commands.
+
+```
+docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up datahub-gms
+```
+
+This will only start `datahub-gms` and its dependencies.
+
+```
+docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up --no-deps datahub-gms
+```
+
+This will only start `datahub-gms`, without dependencies.
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/domains.md b/docs-website/versioned_docs/version-0.10.4/docs/domains.md
new file mode 100644
index 0000000000000..9a707588e6c8a
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/domains.md
@@ -0,0 +1,263 @@
+---
+title: About DataHub Domains
+sidebar_label: Domains
+slug: /domains
+custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/domains.md"
+---
+
+import FeatureAvailability from '@site/src/components/FeatureAvailability';
+
+# About DataHub Domains
+
+<FeatureAvailability/>
+
+Starting in version `0.8.25`, DataHub supports grouping data assets into logical collections called **Domains**. Domains are curated, top-level folders or categories where related assets can be explicitly grouped. Management of Domains can be centralized, or distributed out to Domain owners. Currently, an asset can belong to only one Domain at a time.
+
+## Domains Setup, Prerequisites, and Permissions
+
+What you need to create and add domains:
+
+- the **Manage Domains** platform privilege to add domains at the entity level
+
+You can grant this privilege by creating a new [Metadata Policy](./authorization/policies.md).
+
+## Using Domains
+
+### Creating a Domain
+
+To create a Domain, first navigate to the **Domains** tab in the top-right menu of DataHub.
+

+ +

+ +Once you're on the Domains page, you'll see a list of all the Domains that have been created on DataHub. Additionally, you can +view the number of entities inside each Domain. + +

+ +

+ +To create a new Domain, click '+ New Domain'. + +

+ +

+ +Inside the form, you can choose a name for your Domain. Most often, this will align with your business units or groups, for example +'Platform Engineering' or 'Social Marketing'. You can also add an optional description. Don't worry, this can be changed later. + +#### Advanced: Setting a Custom Domain id + +Click on 'Advanced' to show the option to set a custom Domain id. The Domain id determines what will appear in the DataHub 'urn' (primary key) +for the Domain. This option is useful if you intend to refer to Domains by a common name inside your code, or you want the primary +key to be human-readable. Proceed with caution: once you select a custom id, it cannot be easily changed. + +

+ +

+
+By default, you don't need to worry about this. DataHub will auto-generate a unique Domain id for you.
+
+Once you've chosen a name and a description, click 'Create' to create the new Domain.
+
+### Assigning an Asset to a Domain
+
+You can assign assets to a Domain using the UI, programmatically via the API (see the GraphQL examples later on this page), or during ingestion.
+
+#### UI-Based Assignment
+
+To assign an asset to a Domain, simply navigate to the asset's profile page. In the menu bar on the bottom left-hand side, you'll
+see a 'Domain' section. Click 'Set Domain', and then search for the Domain you'd like to add the asset to. When you're done, click 'Add'.
+

+ +

+
+To remove an asset from a Domain, click the 'x' icon on the Domain tag.
+
+> Notice: Adding or removing an asset from a Domain requires the `Edit Domain` Metadata Privilege, which can be granted
+> by a [Policy](authorization/policies.md).
+
+#### Ingestion-time Assignment
+
+All SQL-based ingestion sources support assigning domains during ingestion using the `domain` configuration. Consult your source's configuration details page (e.g. [Snowflake](./generated/ingestion/sources/snowflake.md)) to verify that it supports the Domain capability.
+
+:::note
+
+Assignment of domains during ingestion will overwrite domains that you have assigned in the UI. A single table can only belong to one domain.
+
+:::
+
+Here is a quick example of a Snowflake ingestion recipe that has been enhanced to attach the **Analytics** domain to all tables in the **long_tail_companions** database in the **analytics** schema, and the **Finance** domain to all tables in the **long_tail_companions** database in the **ecommerce** schema.
+
+```yaml
+source:
+  type: snowflake
+  config:
+    username: ${SNOW_USER}
+    password: ${SNOW_PASS}
+    account_id:
+    warehouse: COMPUTE_WH
+    role: accountadmin
+    database_pattern:
+      allow:
+        - "long_tail_companions"
+    schema_pattern:
+      deny:
+        - information_schema
+    profiling:
+      enabled: False
+    domain:
+      Analytics:
+        allow:
+          - "long_tail_companions.analytics.*"
+      Finance:
+        allow:
+          - "long_tail_companions.ecommerce.*"
+```
+
+:::note
+
+When bare domain names like `Analytics` are used, the ingestion system will first check whether a domain like `urn:li:domain:Analytics` is provisioned; failing that, it will check for a provisioned domain that has the same name. If it is unable to resolve a bare domain name to a provisioned domain, ingestion will refuse to proceed until the domain is provisioned on DataHub.
+
+:::
+
+You can also provide fully-qualified domain names to ensure that no ingestion-time domain resolution is needed. For example, the following recipe shows an example using fully qualified domain names:
+
+```yaml
+source:
+  type: snowflake
+  config:
+    username: ${SNOW_USER}
+    password: ${SNOW_PASS}
+    account_id:
+    warehouse: COMPUTE_WH
+    role: accountadmin
+    database_pattern:
+      allow:
+        - "long_tail_companions"
+    schema_pattern:
+      deny:
+        - information_schema
+    profiling:
+      enabled: False
+    domain:
+      "urn:li:domain:6289fccc-4af2-4cbb-96ed-051e7d1de93c":
+        allow:
+          - "long_tail_companions.analytics.*"
+      "urn:li:domain:07155b15-cee6-4fda-b1c1-5a19a6b74c3a":
+        allow:
+          - "long_tail_companions.ecommerce.*"
+```
+
+### Searching by Domain
+
+Once you've created a Domain, you can use the search bar to find it.
+

+ +

+ +Clicking on the search result will take you to the Domain's profile, where you +can edit its description, add / remove owners, and view the assets inside the Domain. + +

+ +

+ +Once you've added assets to a Domain, you can filter search results to limit to those Assets +within a particular Domain using the left-side search filters. + +

+ +

+ +On the homepage, you'll also find a list of the most popular Domains in your organization. + +

+ +

+ +## Additional Resources + +### Videos + +**Supercharge Data Mesh with Domains in DataHub** + +

+ +

+ +### GraphQL + +- [domain](../graphql/queries.md#domain) +- [listDomains](../graphql/queries.md#listdomains) +- [createDomains](../graphql/mutations.md#createdomain) +- [setDomain](../graphql/mutations.md#setdomain) +- [unsetDomain](../graphql/mutations.md#unsetdomain) + +#### Examples + +**Creating a Domain** + +```graphql +mutation createDomain { + createDomain( + input: { name: "My New Domain", description: "An optional description" } + ) +} +``` + +This query will return an `urn` which you can use to fetch the Domain details. + +**Fetching a Domain by Urn** + +```graphql +query getDomain { + domain(urn: "urn:li:domain:engineering") { + urn + properties { + name + description + } + entities { + total + } + } +} +``` + +**Adding a Dataset to a Domain** + +```graphql +mutation setDomain { + setDomain( + entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" + domainUrn: "urn:li:domain:engineering" + ) +} +``` + +> Pro Tip! You can try out the sample queries by visiting `/api/graphiql`. + +### DataHub Blog + +- [Just Shipped: UI-Based Ingestion, Data Domains & Containers, Tableau support, and MORE!](https://blog.datahubproject.io/just-shipped-ui-based-ingestion-data-domains-containers-and-more-f1b1c90ed3a) + +## FAQ and Troubleshooting + +**What is the difference between DataHub Domains, Tags, and Glossary Terms?** + +DataHub supports Tags, Glossary Terms, & Domains as distinct types of Metadata that are suited for specific purposes: + +- **Tags**: Informal, loosely controlled labels that serve as a tool for search & discovery. Assets may have multiple tags. No formal, central management. +- **Glossary Terms**: A controlled vocabulary, with optional hierarchy. Terms are typically used to standardize types of leaf-level attributes (i.e. schema fields) for governance. E.g. (EMAIL_PLAINTEXT) +- **Domains**: A set of top-level categories. Usually aligned to business units / disciplines to which the assets are most relevant. Central or distributed management. Single Domain assignment per data asset. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Glossary Terms](./glossary/business-glossary.md) +- [Tags](./tags.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/features.md b/docs-website/versioned_docs/version-0.10.4/docs/features.md new file mode 100644 index 0000000000000..c530d2c0f286f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/features.md @@ -0,0 +1,124 @@ +--- +title: Features +sidebar_label: Features +slug: /features +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/features.md" +--- + +# DataHub Features Overview + +DataHub is a modern data catalog built to enable end-to-end data discovery, data observability, and data governance. This extensible metadata platform is built for developers to tame the complexity of their rapidly evolving data ecosystems and for data practitioners to leverage the total value of data within their organization. + +Here’s an overview of DataHub’s current functionality. Check out our [roadmap](https://feature-requests.datahubproject.io/roadmap) to see what's to come. + +--- + +## Search and Discovery + +### **Search All Corners of Your Data Stack** + +DataHub's unified search experience surfaces results across databases, data lakes, BI platforms, ML feature stores, orchestration tools, and more. + +

+ +

+ +### **Trace End-to-End Lineage** + +Quickly understand the end-to-end journey of data by tracing lineage across platforms, datasets, ETL/ELT pipelines, charts, dashboards, and beyond. + +

+ +

+ +### **Understand the Impact of Breaking Changes on Downstream Dependencies** + +Proactively identify which entities may be impacted by a breaking change using Impact Analysis. + +

+ +

+ +### **View Metadata 360 at a Glance** + +Combine _technical_ and _logical_ metadata to provide a 360º view of your data entities. + +Generate **Dataset Stats** to understand the shape & distribution of the data + +

+ +

+ +Capture historical **Data Validation Outcomes** from tools like Great Expectations + +

+ +

+ +Leverage DataHub's **Schema Version History** to track changes to the physical structure of data over time + +

+ +

+ +--- + +## Modern Data Governance + +### **Govern in Real Time** + +[The Actions Framework](./actions/README.md) powers the following real-time use cases: + +- **Notifications:** Generate organization-specific notifications when a change is made on DataHub. For example, send an email to the governance team when a "PII" tag is added to any data asset. +- **Workflow Integration:** Integrate DataHub into your organization's internal workflows. For example, create a Jira ticket when specific Tags or Terms are proposed on a Dataset. +- **Synchronization:** Sync changes made in DataHub into a 3rd party system. For example, reflect Tag additions in DataHub into Snowflake. +- **Auditing:** Audit who is making what changes on DataHub through time. + +
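To make the list above concrete, here is a minimal, hedged sketch of an Actions Framework pipeline configuration, modeled on the framework's "hello world" quickstart. It assumes the separate `acryl-datahub-actions` package is installed and that the quickstart Kafka broker and schema registry are reachable at their default local addresses; treat the option names as illustrative and confirm them against the Actions Framework docs linked above.

```yaml
# hello_world.yaml - illustrative file name for this sketch
name: "hello_world"
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# Optionally narrow the stream, e.g. to entity change events only.
filter:
  event_type: "EntityChangeEvent_v1"
action:
  type: "hello_world"
```

A pipeline like this would then be started with something like `datahub actions -c hello_world.yaml`, and the `action` block is where a notification, workflow, synchronization, or auditing integration would plug in.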

+ +

+ +### **Manage Entity Ownership** + +Quickly and easily assign entity ownership to users and user groups. + +

+ +

+ +### **Govern with Tags, Glossary Terms, and Domains** + +Empower data owners to govern their data entities with: + +1. **Tags:** Informal, loosely controlled labels that serve as a tool for search & discovery. No formal, central management. +2. **Glossary Terms:** A controlled vocabulary with optional hierarchy, commonly used to describe core business concepts and measurements. +3. **Domains:** Curated, top-level folders or categories, widely used in Data Mesh to organize entities by department (i.e., Finance, Marketing) or Data Products. + +

+ +

+
+---
+
+## DataHub Administration
+
+### **Create Users, Groups, & Access Policies**
+
+DataHub admins can create Policies to define who can perform what action against which resource(s). When you create a new Policy, you will be able to define the following:
+
+- **Policy Type** - Platform (top-level DataHub Platform privileges, i.e., managing users, groups, and policies) or Metadata (ability to manipulate ownership, tags, documentation, and more)
+- **Resource Type** - Specify the type of resources, such as Datasets, Dashboards, Pipelines, and beyond
+- **Privileges** - Choose the set of permissions, such as Edit Owners, Edit Documentation, Edit Links
+- **Users and/or Groups** - Assign relevant Users and Groups; you can also assign the Policy to Resource Owners, regardless of which Group they belong to
+

+ +

+ +### **Ingest Metadata from the UI** + +Create, configure, schedule, & execute batch metadata ingestion using the DataHub user interface. This makes getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines. + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/features/dataset-usage-and-query-history.md b/docs-website/versioned_docs/version-0.10.4/docs/features/dataset-usage-and-query-history.md
new file mode 100644
index 0000000000000..9a8b70912f4c7
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/features/dataset-usage-and-query-history.md
@@ -0,0 +1,92 @@
+---
+title: About DataHub Dataset Usage & Query History
+sidebar_label: Dataset Usage & Query History
+slug: /features/dataset-usage-and-query-history
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/features/dataset-usage-and-query-history.md
+---
+
+import FeatureAvailability from '@site/src/components/FeatureAvailability';
+
+# About DataHub Dataset Usage & Query History
+
+<FeatureAvailability/>
+
+Dataset Usage & Query History gives dataset-level information about the top queries that referenced a dataset.
+
+Usage data can help identify the top users, who probably know the most about the dataset, as well as the top queries referencing it.
+You can also get an overview of the overall number of queries and distinct users.
+In some sources, column-level usage is also calculated, which can help identify frequently used columns.
+
+With sources that support usage statistics, you can collect Dataset, Dashboard, and Chart usage.
+
+## Dataset Usage & Query History Setup, Prerequisites, and Permissions
+
+To ingest Dataset Usage & Query History data, first check the specific source's documentation to confirm
+that the DataHub source supports it and to see how to enable it.
+
+You can validate this in the DataHub source's capabilities section:
+

+ +

+ +Some sources require a separate, usage-specific recipe to ingest Usage and Query History metadata. In this case, it is noted in the capabilities summary, like so: + +

+ +

+
+If the source has a usage prerequisites page, always check it, as you may need to grant additional
+permissions that are only required for usage extraction.
+
+## Using Dataset Usage & Query History
+
+After successful ingestion, the Queries and Stats tabs will be enabled on datasets with any usage.
+
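For the tabs to populate, usage extraction also has to be enabled in your ingestion recipe. As a hedged sketch, assuming the `snowflake` source (where usage collection is controlled by a flag along the lines of `include_usage_stats`; verify the exact option name on the Snowflake source page):

```yaml
source:
  type: snowflake
  config:
    account_id: my_account    # illustrative connection details only
    username: ${SNOW_USER}
    password: ${SNOW_PASS}
    # Assumed flag name - confirm it on the source's configuration page.
    include_usage_stats: true
```

If the tabs remain greyed out after an ingestion run, see the FAQ at the bottom of this page.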

+ +

+
+On the Queries tab, you can see the top 5 most frequently run queries that referenced this dataset.
+

+ +

+
+On the Stats tab, you can see the top 5 users who ran the most queries referencing this dataset.
+

+ +

+
+With the collected usage data, you can even see column-level usage statistics (the Redshift usage source doesn't support this yet):
+

+ +

+ +## Additional Resources + +### Videos + +**DataHub 101: Data Profiling and Usage Stats 101** + +

+ +

+ +### GraphQL + +- +- +- +- + +## FAQ and Troubleshooting + +### Why is my Queries/Stats tab greyed out? + +Queries/Stats tab is greyed out if there is no usage statistics for that dataset or there were no ingestion with usage extraction run before. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/athena.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/athena.md new file mode 100644 index 0000000000000..257cdff003c4d --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/athena.md @@ -0,0 +1,628 @@ +--- +sidebar_position: 1 +title: Athena +slug: /generated/ingestion/sources/athena +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/athena.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Athena + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| -------------------------------------------------------------------------------- | ------ | ----------------------------------------------------------------------------------------------------------------- | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ✅ | Optionally enabled via configuration. Profiling uses sql queries on whole table which can be expensive operation. | +| Descriptions | ✅ | Enabled by default | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | +| Table-Level Lineage | ✅ | Supported for S3 tables | + +This plugin supports extracting the following metadata from Athena + +- Tables, schemas etc. +- Lineage for S3 tables. +- Profiling when enabled. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[athena]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: athena + config: + # Coordinates + aws_region: my_aws_region + work_group: primary + + # Options + query_result_location: "s3://my_staging_athena_results_bucket/results/" + +sink: +# sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
aws_region 
string
| Aws region where your Athena database is located | +|
query_result_location 
string
| S3 path to the [query result bucket](https://docs.aws.amazon.com/athena/latest/ug/querying.html#query-results-specify-location) which should be used by AWS Athena to store results of thequeries executed by DataHub. | +|
work_group 
string
| The name of your Amazon Athena Workgroups | +|
aws_role_arn
string
| AWS Role arn for Pyathena to assume in its connection | +|
aws_role_assumption_duration
integer
| Duration to assume the AWS Role for. Maximum of 43200 (12 hours)
Default: 3600
| +|
catalog_name
string
| Athena Catalog Name
Default: awsdatacatalog
| +|
database
string
| The athena database to ingest from. If not set it will be autodetected | +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
include_views
boolean
| Whether views should be ingested.
Default: False
| +|
options
object
| Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. | +|
password
string(password)
| Same detection scheme as username | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
s3_staging_dir
string
| [deprecated in favor of `query_result_location`] S3 query location | +|
scheme
string
|
Default: awsathena+rest
| +|
username
string
| Username credential. If not specified, detected with boto3 rules. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`. | +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "AthenaConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs.", + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. 
There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_views": { + "title": "Include Views", + "description": "Whether views should be ingested.", + "default": false, + "type": "boolean" + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "scheme": { + "title": "Scheme", + "default": "awsathena+rest", + "type": "string" + }, + "username": { + "title": "Username", + "description": "Username credential. If not specified, detected with boto3 rules. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html", + "type": "string" + }, + "password": { + "title": "Password", + "description": "Same detection scheme as username", + "type": "string", + "writeOnly": true, + "format": "password" + }, + "database": { + "title": "Database", + "description": "The athena database to ingest from. If not set it will be autodetected", + "type": "string" + }, + "aws_region": { + "title": "Aws Region", + "description": "Aws region where your Athena database is located", + "type": "string" + }, + "aws_role_arn": { + "title": "Aws Role Arn", + "description": "AWS Role arn for Pyathena to assume in its connection", + "type": "string" + }, + "aws_role_assumption_duration": { + "title": "Aws Role Assumption Duration", + "description": "Duration to assume the AWS Role for. 
Maximum of 43200 (12 hours)", + "default": 3600, + "type": "integer" + }, + "s3_staging_dir": { + "title": "S3 Staging Dir", + "description": "[deprecated in favor of `query_result_location`] S3 query location", + "deprecated": true, + "type": "string" + }, + "work_group": { + "title": "Work Group", + "description": "The name of your Amazon Athena Workgroups", + "type": "string" + }, + "catalog_name": { + "title": "Catalog Name", + "description": "Athena Catalog Name", + "default": "awsdatacatalog", + "type": "string" + }, + "query_result_location": { + "title": "Query Result Location", + "description": "S3 path to the [query result bucket](https://docs.aws.amazon.com/athena/latest/ug/querying.html#query-results-specify-location) which should be used by AWS Athena to store results of thequeries executed by DataHub.", + "type": "string" + } + }, + "required": [ + "aws_region", + "work_group", + "query_result_location" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. 
By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer 
that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + } + } +} +``` + + +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.sql.athena.AthenaSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/athena.py) + +
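Once the plugin is installed and the starter recipe above has been filled in, the recipe can be run with the standard CLI ingest command. A minimal sketch, assuming the recipe was saved as `athena.dhub.yaml` (an illustrative file name):

```shell
# Run the Athena recipe with the DataHub CLI ('acryl-datahub[athena]' installed above).
datahub ingest -c athena.dhub.yaml
```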

Questions

+ +If you've got any questions on configuring ingestion for Athena, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/azure-ad.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/azure-ad.md new file mode 100644 index 0000000000000..e49793c7fca9e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/azure-ad.md @@ -0,0 +1,507 @@ +--- +sidebar_position: 2 +title: Azure AD +slug: /generated/ingestion/sources/azure-ad +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/azure-ad.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Azure AD + +### Extracting DataHub Users + +#### Usernames + +Usernames serve as unique identifiers for users on DataHub. This connector extracts usernames using the +"userPrincipalName" field of an [Azure AD User Response](https://docs.microsoft.com/en-us/graph/api/user-list?view=graph-rest-1.0&tabs=http#response-1), +which is the unique identifier for your Azure AD users. + +If this is not how you wish to map to DataHub usernames, you can provide a custom mapping using the configurations options detailed below. Namely, `azure_ad_response_to_username_attr` +and `azure_ad_response_to_username_regex`. + +#### Responses + +This connector also extracts basic user response information from Azure. The following fields of the Azure User Response are extracted +and mapped to the DataHub `CorpUserInfo` aspect: + +- display name +- first name +- last name +- email +- title +- country + +### Extracting DataHub Groups + +#### Group Names + +Group names serve as unique identifiers for groups on DataHub. This connector extracts group names using the "name" attribute of an Azure Group Response. +By default, a URL-encoded version of the full group name is used as the unique identifier (CorpGroupKey) and the raw "name" attribute is mapped +as the display name that will appear in DataHub's UI. + +If this is not how you wish to map to DataHub group names, you can provide a custom mapping using the configurations options detailed below. Namely, `azure_ad_response_to_groupname_attr` +and `azure_ad_response_to_groupname_regex`. + +#### Responses + +This connector also extracts basic group information from Azure. The following fields of the [Azure AD Group Response](https://docs.microsoft.com/en-us/graph/api/group-list?view=graph-rest-1.0&tabs=http#response-1) are extracted and mapped to the +DataHub `CorpGroupInfo` aspect: + +- name +- description + +### Extracting Group Membership + +This connector additional extracts the edges between Users and Groups that are stored in [Azure AD](https://docs.microsoft.com/en-us/graph/api/group-list-members?view=graph-rest-1.0&tabs=http#response-1). It maps them to the `GroupMembership` aspect +associated with DataHub users (CorpUsers). Today this has the unfortunate side effect of **overwriting** any Group Membership information that +was created outside of the connector. That means if you've used the DataHub REST API to assign users to groups, this information will be overridden +when the Azure AD Source is executed. If you intend to _always_ pull users, groups, and their relationships from your Identity Provider, then +this should not matter. 
+ +This is a known limitation in our data model that is being tracked by [this ticket](https://github.com/datahub-project/datahub/issues/3065).![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ----------------------------------------- | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Optionally enabled via stateful_ingestion | + +This plugin extracts the following: + +- Users +- Groups +- Group Membership + +from your Azure AD instance. + +Note that any users ingested from this connector will not be able to log into DataHub unless you have Azure AD OIDC +SSO enabled. You can, however, have these users ingested into DataHub before they log in for the first time if you +would like to take actions like adding them to a group or assigning them a role. + +For instructions on how to do configure Azure AD OIDC SSO, please read the documentation +[here](/docs/authentication/guides/sso/configure-oidc-react-azure). + +### Extracting DataHub Users + +#### Usernames + +Usernames serve as unique identifiers for users on DataHub. This connector extracts usernames using the +"userPrincipalName" field of an [Azure AD User Response](https://docs.microsoft.com/en-us/graph/api/user-list?view=graph-rest-1.0&tabs=http#response-1), +which is the unique identifier for your Azure AD users. + +If this is not how you wish to map to DataHub usernames, you can provide a custom mapping using the configurations options detailed below. Namely, `azure_ad_response_to_username_attr` +and `azure_ad_response_to_username_regex`. + +#### Responses + +This connector also extracts basic user response information from Azure. The following fields of the Azure User Response are extracted +and mapped to the DataHub `CorpUserInfo` aspect: + +- display name +- first name +- last name +- email +- title +- country + +### Extracting DataHub Groups + +#### Group Names + +Group names serve as unique identifiers for groups on DataHub. This connector extracts group names using the "name" attribute of an Azure Group Response. +By default, a URL-encoded version of the full group name is used as the unique identifier (CorpGroupKey) and the raw "name" attribute is mapped +as the display name that will appear in DataHub's UI. + +If this is not how you wish to map to DataHub group names, you can provide a custom mapping using the configurations options detailed below. Namely, `azure_ad_response_to_groupname_attr` +and `azure_ad_response_to_groupname_regex`. + +#### Responses + +This connector also extracts basic group information from Azure. The following fields of the [Azure AD Group Response](https://docs.microsoft.com/en-us/graph/api/group-list?view=graph-rest-1.0&tabs=http#response-1) are extracted and mapped to the +DataHub `CorpGroupInfo` aspect: + +- name +- description + +### Extracting Group Membership + +This connector additional extracts the edges between Users and Groups that are stored in [Azure AD](https://docs.microsoft.com/en-us/graph/api/group-list-members?view=graph-rest-1.0&tabs=http#response-1). It maps them to the `GroupMembership` aspect +associated with DataHub users (CorpUsers). 
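As a concrete illustration of the username mapping described above, the sketch below derives DataHub usernames from the local part of `userPrincipalName` (everything before the `@`) instead of the full UPN. The two option names come from the config reference further down this page; the regex itself is only an assumption about how the capture group is applied, so verify it against that reference before relying on it.

```yaml
source:
  type: "azure-ad"
  config:
    # ...connection settings as in the starter recipe below...
    azure_ad_response_to_username_attr: "userPrincipalName"
    # Assumed behavior: the first capture group becomes the DataHub username,
    # e.g. "jdoe@example.com" -> "jdoe".
    azure_ad_response_to_username_regex: "([^@]+)@.*"
```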
+ +### Prerequisite + +[Create a DataHub Application](https://docs.microsoft.com/en-us/graph/toolkit/get-started/add-aad-app-registration) within the Azure AD Portal with the permissions +to read your organization's Users and Groups. The following permissions are required, with the `Application` permission type: + +- `Group.Read.All` +- `GroupMember.Read.All` +- `User.Read.All` + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[azure-ad]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "azure-ad" + config: + client_id: "00000000-0000-0000-0000-000000000000" + tenant_id: "00000000-0000-0000-0000-000000000000" + client_secret: "xxxxx" + redirect: "https://login.microsoftonline.com/common/oauth2/nativeclient" + authority: "https://login.microsoftonline.com/00000000-0000-0000-0000-000000000000" + token_url: "https://login.microsoftonline.com/00000000-0000-0000-0000-000000000000/oauth2/token" + graph_url: "https://graph.microsoft.com/v1.0" + ingest_users: True + ingest_groups: True + groups_pattern: + allow: + - ".*" + users_pattern: + allow: + - ".*" + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
authority 
string
| The authority (https://docs.microsoft.com/en-us/azure/active-directory/develop/msal-client-application-configuration) is a URL that indicates a directory that MSAL can request tokens from. | +|
client_id 
string
| Application ID. Found in your app registration on Azure AD Portal | +|
client_secret 
string
| Client secret. Found in your app registration on Azure AD Portal | +|
tenant_id 
string
| Directory ID. Found in your app registration on Azure AD Portal | +|
token_url 
string
| The token URL that acquires a token from Azure AD for authorizing requests. This source will only work with v1.0 endpoint. | +|
azure_ad_response_to_groupname_attr
string
| Which Azure AD Group Response attribute to use as input to DataHub group name mapping.
Default: displayName
| +|
azure_ad_response_to_groupname_regex
string
| A regex used to parse the DataHub group name from the attribute specified in `azure_ad_response_to_groupname_attr`.
Default: (.\*)
| +|
azure_ad_response_to_username_attr
string
| Which Azure AD User Response attribute to use as input to DataHub username mapping.
Default: userPrincipalName
| +|
azure_ad_response_to_username_regex
string
| A regex used to parse the DataHub username from the attribute specified in `azure_ad_response_to_username_attr`.
Default: (.\*)
| +|
filtered_tracking
boolean
| If enabled, report will contain names of filtered users and groups.
Default: True
| +|
graph_url
string
| [Microsoft Graph API endpoint](https://docs.microsoft.com/en-us/graph/use-the-api)
Default: https://graph.microsoft.com/v1.0
| +|
ingest_group_membership
boolean
| Whether group membership should be ingested into DataHub. ingest_groups must be True if this is True.
Default: True
| +|
ingest_groups
boolean
| Whether groups should be ingested into DataHub.
Default: True
| +|
ingest_groups_users
boolean
| This option is useful only when `ingest_users` is set to False and `ingest_group_membership` to True. As effect, only the users which belongs to the selected groups will be ingested.
Default: True
| +|
ingest_users
boolean
| Whether users should be ingested into DataHub.
Default: True
| +|
mask_group_id
boolean
| Whether workunit ID's for groups should be masked to avoid leaking sensitive information.
Default: True
| +|
mask_user_id
boolean
| Whether workunit ID's for users should be masked to avoid leaking sensitive information.
Default: True
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
redirect
string
| Redirect URI. Found in your app registration on Azure AD Portal.
Default: https://login.microsoftonline.com/common/oauth2/na...
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
groups_pattern
AllowDenyPattern
| regex patterns for groups to include in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
groups_pattern.allow
array(string)
| | +|
groups_pattern.deny
array(string)
| | +|
groups_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
users_pattern
AllowDenyPattern
| regex patterns for users to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
users_pattern.allow
array(string)
| | +|
users_pattern.deny
array(string)
| | +|
users_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Azure AD Stateful Ingestion Config. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
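As an illustration of how the pattern fields above compose, the following fragment ingests only the groups whose names start with `Data`, along with the users that belong to them. This is a minimal sketch assembled from the fields documented above, not a tested configuration; the regexes and the choice to turn off `ingest_users` are assumptions made for the example.

```yaml
source:
  type: "azure-ad"
  config:
    # client_id, tenant_id, client_secret, authority and token_url as in the starter recipe above
    ingest_users: False
    ingest_groups: True
    ingest_group_membership: True
    ingest_groups_users: True # only users belonging to the allowed groups are ingested
    groups_pattern:
      allow:
        - "Data.*"
      deny:
        - ".*Archived.*"
```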
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "AzureADConfig", + "description": "Config to create a token and connect to Azure AD instance", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "title": "Stateful Ingestion", + "description": "Azure AD Stateful Ingestion Config.", + "allOf": [ + { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + } + ] + }, + "client_id": { + "title": "Client Id", + "description": "Application ID. Found in your app registration on Azure AD Portal", + "type": "string" + }, + "tenant_id": { + "title": "Tenant Id", + "description": "Directory ID. Found in your app registration on Azure AD Portal", + "type": "string" + }, + "client_secret": { + "title": "Client Secret", + "description": "Client secret. Found in your app registration on Azure AD Portal", + "type": "string" + }, + "authority": { + "title": "Authority", + "description": "The authority (https://docs.microsoft.com/en-us/azure/active-directory/develop/msal-client-application-configuration) is a URL that indicates a directory that MSAL can request tokens from.", + "type": "string" + }, + "token_url": { + "title": "Token Url", + "description": "The token URL that acquires a token from Azure AD for authorizing requests. This source will only work with v1.0 endpoint.", + "type": "string" + }, + "redirect": { + "title": "Redirect", + "description": "Redirect URI. 
Found in your app registration on Azure AD Portal.", + "default": "https://login.microsoftonline.com/common/oauth2/nativeclient", + "type": "string" + }, + "graph_url": { + "title": "Graph Url", + "description": "[Microsoft Graph API endpoint](https://docs.microsoft.com/en-us/graph/use-the-api)", + "default": "https://graph.microsoft.com/v1.0", + "type": "string" + }, + "azure_ad_response_to_username_attr": { + "title": "Azure Ad Response To Username Attr", + "description": "Which Azure AD User Response attribute to use as input to DataHub username mapping.", + "default": "userPrincipalName", + "type": "string" + }, + "azure_ad_response_to_username_regex": { + "title": "Azure Ad Response To Username Regex", + "description": "A regex used to parse the DataHub username from the attribute specified in `azure_ad_response_to_username_attr`.", + "default": "(.*)", + "type": "string" + }, + "azure_ad_response_to_groupname_attr": { + "title": "Azure Ad Response To Groupname Attr", + "description": "Which Azure AD Group Response attribute to use as input to DataHub group name mapping.", + "default": "displayName", + "type": "string" + }, + "azure_ad_response_to_groupname_regex": { + "title": "Azure Ad Response To Groupname Regex", + "description": "A regex used to parse the DataHub group name from the attribute specified in `azure_ad_response_to_groupname_attr`.", + "default": "(.*)", + "type": "string" + }, + "ingest_users": { + "title": "Ingest Users", + "description": "Whether users should be ingested into DataHub.", + "default": true, + "type": "boolean" + }, + "ingest_groups": { + "title": "Ingest Groups", + "description": "Whether groups should be ingested into DataHub.", + "default": true, + "type": "boolean" + }, + "ingest_group_membership": { + "title": "Ingest Group Membership", + "description": "Whether group membership should be ingested into DataHub. ingest_groups must be True if this is True.", + "default": true, + "type": "boolean" + }, + "ingest_groups_users": { + "title": "Ingest Groups Users", + "description": "This option is useful only when `ingest_users` is set to False and `ingest_group_membership` to True. 
As effect, only the users which belongs to the selected groups will be ingested.", + "default": true, + "type": "boolean" + }, + "users_pattern": { + "title": "Users Pattern", + "description": "regex patterns for users to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "groups_pattern": { + "title": "Groups Pattern", + "description": "regex patterns for groups to include in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "filtered_tracking": { + "title": "Filtered Tracking", + "description": "If enabled, report will contain names of filtered users and groups.", + "default": true, + "type": "boolean" + }, + "mask_group_id": { + "title": "Mask Group Id", + "description": "Whether workunit ID's for groups should be masked to avoid leaking sensitive information.", + "default": true, + "type": "boolean" + }, + "mask_user_id": { + "title": "Mask User Id", + "description": "Whether workunit ID's for users should be masked to avoid leaking sensitive information.", + "default": true, + "type": "boolean" + } + }, + "required": [ + "client_id", + "tenant_id", + "client_secret", + "authority", + "token_url" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + } + } +} +``` + + +
+ +As a prerequisite, you should [create a DataHub Application](https://docs.microsoft.com/en-us/graph/toolkit/get-started/add-aad-app-registration) within the Azure AD Portal with the permissions +to read your organization's Users and Groups. The following permissions are required, with the `Application` permission type: + +- `Group.Read.All` +- `GroupMember.Read.All` +- `User.Read.All` + +You can add a permission by navigating to the permissions tab in your DataHub application on the Azure AD portal. + +

+ +

+ +You can view the necessary endpoints to configure by clicking on the Endpoints button in the Overview tab. + +

+ +

+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.identity.azure_ad.AzureADSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/identity/azure_ad.py) + +
### Questions
+ +If you've got any questions on configuring ingestion for Azure AD, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/bigquery.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/bigquery.md new file mode 100644 index 0000000000000..5eb200a4115de --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/bigquery.md @@ -0,0 +1,1219 @@ +--- +sidebar_position: 3 +title: BigQuery +slug: /generated/ingestion/sources/bigquery +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/bigquery.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# BigQuery + +Ingesting metadata from BigQuery requires using the **bigquery** module. +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | -------------------------------------------------------------------------------- | +| Asset Containers | ✅ | Enabled by default | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ✅ | Optionally enabled via configuration | +| Dataset Usage | ✅ | Enabled by default, can be disabled via configuration `include_usage_statistics` | +| Descriptions | ✅ | Enabled by default | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Optionally enabled via `stateful_ingestion.remove_stale_metadata` | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| [Platform Instance](../../../platform-instances.md) | ❌ | Platform instance is pre-set to the BigQuery project id | +| Schema Metadata | ✅ | Enabled by default | +| Table-Level Lineage | ✅ | Optionally enabled via configuration | + +### Prerequisites + +To understand how BigQuery ingestion needs to be set up, first familiarize yourself with the concepts in the diagram below: + +

+ +

+ +There are two important concepts to understand and identify: + +- _Extractor Project_: This is the project associated with a service-account, whose credentials you will be configuring in the connector. The connector uses this service-account to run jobs (including queries) within the project. +- _Bigquery Projects_ are the projects from which table metadata, lineage, usage, and profiling data need to be collected. By default, the extractor project is included in the list of projects that DataHub collects metadata from, but you can control that by passing in a specific list of project ids that you want to collect metadata from. Read the configuration section below to understand how to limit the list of projects that DataHub extracts metadata from. + +#### Create a datahub profile in GCP + +1. Create a custom role for datahub as per [BigQuery docs](https://cloud.google.com/iam/docs/creating-custom-roles#creating_a_custom_role). +2. Follow the sections below to grant permissions to this role on this project and other projects. + +##### Basic Requirements (needed for metadata ingestion) + +1. Identify your Extractor Project where the service account will run queries to extract metadata. + +| permission                       | Description                                                                                                                         | Capability                                                               | +| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | +| `bigquery.jobs.create`           | Run jobs (e.g. queries) within the project. _This only needs for the extractor project where the service account belongs_           |                                                                                                               | +| `bigquery.jobs.list`             | Manage the queries that the service account has sent. _This only needs for the extractor project where the service account belongs_ |                                                                                                               | +| `bigquery.readsessions.create`   | Create a session for streaming large results. _This only needs for the extractor project where the service account belongs_         |                                                                                                               | +| `bigquery.readsessions.getData` | Get data from the read session. _This only needs for the extractor project where the service account belongs_                       | + +2. Grant the following permissions to the Service Account on every project where you would like to extract metadata from + +:::info + +If you have multiple projects in your BigQuery setup, the role should be granted these permissions in each of the projects. 
+ +::: +| permission                       | Description                                                                                                 | Capability               | Default GCP role which contains this permission                                                                 | +|----------------------------------|--------------------------------------------------------------------------------------------------------------|-------------------------------------|-----------------------------------------------------------------------------------------------------------------| +| `bigquery.datasets.get`         | Retrieve metadata about a dataset.                                                                           | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.datasets.getIamPolicy` | Read a dataset's IAM permissions.                                                                           | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.tables.list`           | List BigQuery tables.                                                                                       | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.tables.get`           | Retrieve metadata for a table.                                                                               | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.routines.get`           | Get Routines. Needs to retrieve metadata for a table from system table.                                                                                       | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.routines.list`           | List Routines. Needs to retrieve metadata for a table from system table                                                                               | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `resourcemanager.projects.get`   | Retrieve project names and metadata.                                                                         | Table Metadata Extraction           | [roles/bigquery.metadataViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) | +| `bigquery.jobs.listAll`         | List all jobs (queries) submitted by any user. Needs for Lineage extraction.                                 | Lineage Extraction/Usage extraction | [roles/bigquery.resourceViewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.resourceViewer) | +| `logging.logEntries.list`       | Fetch log entries for lineage/usage data. Not required if `use_exported_bigquery_audit_metadata` is enabled. | Lineage Extraction/Usage extraction | [roles/logging.privateLogViewer](https://cloud.google.com/logging/docs/access-control#logging.privateLogViewer) | +| `logging.privateLogEntries.list` | Fetch log entries for lineage/usage data. Not required if `use_exported_bigquery_audit_metadata` is enabled. 
| Lineage Extraction/Usage extraction | [roles/logging.privateLogViewer](https://cloud.google.com/logging/docs/access-control#logging.privateLogViewer) | +| `bigquery.tables.getData`       | Access table data to extract storage size, last updated at, data profiles etc. | Profiling                           |                                                                                                                 | + +#### Create a service account in the Extractor Project + +1. Setup a ServiceAccount as per [BigQuery docs](https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console) + and assign the previously created role to this service account. +2. Download a service account JSON keyfile. + Example credential file: + +```json +{ + "type": "service_account", + "project_id": "project-id-1234567", + "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0", + "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----", + "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com", + "client_id": "113545814931671546333", + "auth_uri": "https://accounts.google.com/o/oauth2/auth", + "token_uri": "https://oauth2.googleapis.com/token", + "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", + "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com" +} +``` + +3. To provide credentials to the source, you can either: + + Set an environment variable: + + ```sh + $ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json" + ``` + + _or_ + + Set credential config in your source based on the credential json file. For example: + + ```yml + credential: + project_id: project-id-1234567 + private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0" + private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n" + client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com" + client_id: "123456678890" + ``` + +### Lineage Computation Details + +When `use_exported_bigquery_audit_metadata` is set to `true`, lineage information will be computed using exported bigquery logs. On how to setup exported bigquery audit logs, refer to the following [docs](https://cloud.google.com/bigquery/docs/reference/auditlogs#defining_a_bigquery_log_sink_using_gcloud) on BigQuery audit logs. Note that only protoPayloads with "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" are supported by the current ingestion version. The `bigquery_audit_metadata_datasets` parameter will be used only if `use_exported_bigquery_audit_metadat` is set to `true`. + +Note: the `bigquery_audit_metadata_datasets` parameter receives a list of datasets, in the format $PROJECT.$DATASET. This way queries from a multiple number of projects can be used to compute lineage information. + +Note: Since bigquery source also supports dataset level lineage, the auth client will require additional permissions to be able to access the google audit logs. Refer the permissions section in bigquery-usage section below which also accesses the audit logs. + +### Profiling Details + +For performance reasons, we only profile the latest partition for partitioned tables and the latest shard for sharded tables. +You can set partition explicitly with `partition.partition_datetime` property if you want, though note that partition config will be applied to all partitioned tables. 
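To make the two sections above concrete, the fragment below sketches a recipe that computes lineage from exported audit logs and pins profiling to an explicit partition. It only uses fields documented in the Config Details section below; the dataset name and the timestamp are illustrative placeholders, and the rest of the source configuration is omitted.

```yaml
source:
  type: bigquery
  config:
    # Lineage from exported BigQueryAuditMetadata (see "Lineage Computation Details")
    include_table_lineage: true
    use_exported_bigquery_audit_metadata: true
    bigquery_audit_metadata_datasets:
      - my-project.bigquery_audit_logs # $PROJECT.$DATASET format, placeholder name
    # Profile a specific partition instead of the latest one (see "Profiling Details")
    profiling:
      enabled: true
      partition_datetime: "2023-07-01 00:00:00" # illustrative; applied to all partitioned tables
```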
+ +### Caveats + +- For materialized views, lineage is dependent on logs being retained. If your GCP logging is retained for 30 days (default) and 30 days have passed since the creation of the materialized view we won't be able to get lineage for them. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[bigquery]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: bigquery + config: + # `schema_pattern` for BQ Datasets + schema_pattern: + allow: + - finance_bq_dataset + table_pattern: + deny: + # The exact name of the table is revenue_table_name + # The reason we have this `.*` at the beginning is because the current implmenetation of table_pattern is testing + # project_id.dataset_name.table_name + # We will improve this in the future + - .*revenue_table_name + include_table_lineage: true + include_usage_statistics: true + profiling: + enabled: true + profile_table_level_only: true + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
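For example, a field listed as `profiling.profile_table_level_only` in the table below corresponds to the following nesting in a recipe (an illustrative fragment only):

```yaml
source:
  type: bigquery
  config:
    profiling:
      profile_table_level_only: true # appears as `profiling.profile_table_level_only` below
```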
+ +| Field | Description | +| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
bigquery_audit_metadata_datasets
array(string)
| | +|
bucket_duration
Enum
| Size of the time window to aggregate usage stats.
Default: DAY
| +|
capture_dataset_label_as_tag
boolean
| Capture BigQuery dataset labels as DataHub tag
Default: False
| +|
capture_table_label_as_tag
boolean
| Capture BigQuery table labels as DataHub tag
Default: False
| +|
column_limit
integer
| Maximum number of columns to process in a table. This is a low level config property which should be touched with care. This restriction is needed because excessively wide tables can result in failure to ingest the schema.
Default: 300
| +|
convert_urns_to_lowercase
boolean
| Convert urns to lowercase.
Default: False
| +|
debug_include_full_payloads
boolean
| Include full payload into events. It is only for debugging and internal use.
Default: False
| +|
enable_legacy_sharded_table_support
boolean
| Use the legacy sharded table urn suffix added.
Default: True
| +|
end_time
string(date-time)
| Latest date of usage to consider. Default: Current time in UTC | +|
extra_client_options
object
| Additional options to pass to google.cloud.logging_v2.client.Client.
Default: {}
| +|
extract_column_lineage
boolean
| If enabled, generate column level lineage. Requires lineage_use_sql_parser to be enabled. This and `incremental_lineage` cannot both be enabled.
Default: False
| +|
extract_lineage_from_catalog
boolean
| This flag enables data lineage extraction from the Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build view lineage; it's recommended to enable view DDL parsing instead. Read the docs for more information: https://cloud.google.com/data-catalog/docs/concepts/about-data-lineage
Default: False
| +|
include_external_url
boolean
| Whether to populate BigQuery Console url to Datasets/Tables
Default: True
| +|
include_table_lineage
boolean
| Option to enable/disable lineage generation. Is enabled by default.
Default: True
| +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
include_usage_statistics
boolean
| Generate usage statistics.
Default: True
| +|
include_views
boolean
| Whether views should be ingested.
Default: True
| +|
incremental_lineage
boolean
| When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: True
| +|
lineage_parse_view_ddl
boolean
| Sql parse view ddl to get lineage.
Default: True
| +|
lineage_sql_parser_use_raw_names
boolean
| This parameter ignores the lowercase pattern stipulated in the SQLParser. NOTE: Ignored if lineage_use_sql_parser is False.
Default: False
| +|
lineage_use_sql_parser
boolean
| Use sql parser to resolve view/table lineage.
Default: True
| +|
log_page_size
integer
| The number of log items queried per page for lineage collection.
Default: 1000
| +|
match_fully_qualified_names
boolean
| Whether `dataset_pattern` is matched against the fully qualified dataset name `<project_id>.<dataset_name>`.
Default: False
| +|
max_query_duration
number(time-delta)
| Correction to pad start_time and end_time with. For handling the case where the read happens within our time range but the query completion event is delayed and happens after the configured end time.
Default: 900.0
| +|
number_of_datasets_process_in_batch_if_profiling_enabled
integer
| Number of partitioned tables queried in a batch when getting metadata. This is a low level config property which should be touched with care. This restriction is needed because we query the partitions system view, which throws an error if we try to touch too many tables.
Default: 200
| +|
options
object
| Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
project_id
string
| [deprecated] Use project_id_pattern or project_ids instead. | +|
project_ids
array(string)
| | +|
project_on_behalf
string
| [Advanced] The BigQuery project in which queries are executed. Will be passed when creating a job. If not passed, falls back to the project associated with the service account. | +|
rate_limit
boolean
| Whether to rate limit requests made to the API.
Default: False
| +|
requests_per_min
integer
| Used to control number of API calls made per min. Only used when `rate_limit` is set to `True`.
Default: 60
| +|
scheme
string
|
Default: bigquery
| +|
sharded_table_pattern
string
| The regex pattern used to match sharded tables and group them as one table. This is a very low level config parameter; only change it if you know what you are doing.
Default: ((.+)[\_$])?(\d{8})$
| +|
sql_parser_use_external_process
boolean
| When enabled, the SQL parser will run isolated in a separate process. This can affect processing time but can protect against memory leaks in the SQL parser.
Default: False
| +|
start_time
string(date-time)
| Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`) | +|
store_last_lineage_extraction_timestamp
boolean
| Enable checking last lineage extraction date in store.
Default: False
| +|
store_last_profiling_timestamps
boolean
| Enable storing last profile timestamp in store.
Default: False
| +|
store_last_usage_extraction_timestamp
boolean
| Enable checking last usage timestamp in store.
Default: True
| +|
temp_table_dataset_prefix
string
| If you are creating temp tables in a dataset with a particular prefix, you can use this config to set the prefix for the dataset. This is to support workflows from before BigQuery's introduction of temp tables. By default we use `_` because datasets that begin with an underscore are hidden by default: https://cloud.google.com/bigquery/docs/datasets#dataset-naming.
Default: \_
| +|
upstream_lineage_in_report
boolean
| Useful for debugging lineage information. Set to True to see the raw lineage created internally.
Default: False
| +|
use_date_sharded_audit_log_tables
boolean
| Whether to read date sharded tables or time partitioned tables when extracting usage from exported audit logs.
Default: False
| +|
use_exported_bigquery_audit_metadata
boolean
| When configured, use BigQueryAuditMetadata in bigquery_audit_metadata_datasets to compute lineage information.
Default: False
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
credential
BigQueryCredential
| BigQuery credential informations | +|
credential.client_email 
string
| Client email | +|
credential.client_id 
string
| Client Id | +|
credential.private_key 
string
| Private key in a form of '-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n' | +|
credential.private_key_id 
string
| Private key id | +|
credential.project_id 
string
| Project id to set the credentials | +|
credential.auth_provider_x509_cert_url
string
| Auth provider x509 certificate url
Default: https://www.googleapis.com/oauth2/v1/certs
| +|
credential.auth_uri
string
| Authentication uri
Default: https://accounts.google.com/o/oauth2/auth
| +|
credential.client_x509_cert_url
string
| If not set it will be default to https://www.googleapis.com/robot/v1/metadata/x509/client_email | +|
credential.token_uri
string
| Token uri
Default: https://oauth2.googleapis.com/token
| +|
credential.type
string
| Authentication type
Default: service_account
| +|
dataset_pattern
AllowDenyPattern
| Regex patterns for dataset to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
dataset_pattern.allow
array(string)
| | +|
dataset_pattern.deny
array(string)
| | +|
dataset_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
project_id_pattern
AllowDenyPattern
| Regex patterns for project_id to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
project_id_pattern.allow
array(string)
| | +|
project_id_pattern.deny
array(string)
| | +|
project_id_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
usage
BigQueryUsageConfig
| Usage related configs
Default: {'bucket_duration': 'DAY', 'end_time': '2023-07-26...
| +|
usage.apply_view_usage_to_tables
boolean
| Whether to apply view's usage to its base tables. If set to False, uses sql parser and applies usage to views / tables mentioned in the query. If set to True, usage is applied to base tables only.
Default: False
| +|
usage.bucket_duration
Enum
| Size of the time window to aggregate usage stats.
Default: DAY
| +|
usage.end_time
string(date-time)
| Latest date of usage to consider. Default: Current time in UTC | +|
usage.format_sql_queries
boolean
| Whether to format sql queries
Default: False
| +|
usage.include_operational_stats
boolean
| Whether to display operational stats.
Default: True
| +|
usage.include_read_operational_stats
boolean
| Whether to report read operational stats. Experimental.
Default: False
| +|
usage.include_top_n_queries
boolean
| Whether to ingest the top_n_queries.
Default: True
| +|
usage.max_query_duration
number(time-delta)
| Correction to pad start_time and end_time with. For handling the case where the read happens within our time range but the query completion event is delayed and happens after the configured end time.
Default: 900.0
| +|
usage.start_time
string(date-time)
| Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`) | +|
usage.top_n_queries
integer
| Number of top queries to save to each table.
Default: 10
| +|
usage.user_email_pattern
AllowDenyPattern
| regex patterns for user emails to filter in usage.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
usage.user_email_pattern.allow
array(string)
| | +|
usage.user_email_pattern.deny
array(string)
| | +|
usage.user_email_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile tables only if they have been updated within the specified number of days. If set to `null`, there is no last-modified-time constraint on which tables to profile. Supported only in `snowflake` and `BigQuery`. | +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less than the specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`.
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less than the specified number of GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`.
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
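Building on the fields above, the fragment below shows how dataset filtering and stateful ingestion might fit together, matching datasets against their fully qualified `project.dataset` names. Treat it as a sketch under assumed project and dataset names rather than a recommended setup; stateful ingestion typically also requires a `pipeline_name` at the pipeline level of the recipe.

```yaml
source:
  type: bigquery
  config:
    match_fully_qualified_names: true
    dataset_pattern:
      allow:
        - "my-project\\.analytics_.*" # fully qualified dataset name, placeholder project/dataset
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
```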
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "BigQueryV2Config", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "store_last_profiling_timestamps": { + "title": "Store Last Profiling Timestamps", + "description": "Enable storing last profile timestamp in store.", + "default": false, + "type": "boolean" + }, + "incremental_lineage": { + "title": "Incremental Lineage", + "description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.", + "default": true, + "type": "boolean" + }, + "sql_parser_use_external_process": { + "title": "Sql Parser Use External Process", + "description": "When enabled, sql parser will run in isolated in a separate process. This can affect processing time but can protect from sql parser's mem leak.", + "default": false, + "type": "boolean" + }, + "store_last_lineage_extraction_timestamp": { + "title": "Store Last Lineage Extraction Timestamp", + "description": "Enable checking last lineage extraction date in store.", + "default": false, + "type": "boolean" + }, + "bucket_duration": { + "description": "Size of the time window to aggregate usage stats.", + "default": "DAY", + "allOf": [ + { + "$ref": "#/definitions/BucketDuration" + } + ] + }, + "end_time": { + "title": "End Time", + "description": "Latest date of usage to consider. Default: Current time in UTC", + "type": "string", + "format": "date-time" + }, + "start_time": { + "title": "Start Time", + "description": "Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`)", + "type": "string", + "format": "date-time" + }, + "store_last_usage_extraction_timestamp": { + "title": "Store Last Usage Extraction Timestamp", + "description": "Enable checking last usage timestamp in store.", + "default": true, + "type": "boolean" + }, + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs.", + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. 
to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_views": { + "title": "Include Views", + "description": "Whether views should be ingested.", + "default": true, + "type": "boolean" + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "rate_limit": { + "title": "Rate Limit", + "description": "Should we rate limit requests made to API.", + "default": false, + "type": "boolean" + }, + "requests_per_min": { 
+ "title": "Requests Per Min", + "description": "Used to control number of API calls made per min. Only used when `rate_limit` is set to `True`.", + "default": 60, + "type": "integer" + }, + "temp_table_dataset_prefix": { + "title": "Temp Table Dataset Prefix", + "description": "If you are creating temp tables in a dataset with a particular prefix you can use this config to set the prefix for the dataset. This is to support workflows from before bigquery's introduction of temp tables. By default we use `_` because of datasets that begin with an underscore are hidden by default https://cloud.google.com/bigquery/docs/datasets#dataset-naming.", + "default": "_", + "type": "string" + }, + "sharded_table_pattern": { + "title": "Sharded Table Pattern", + "description": "The regex pattern to match sharded tables and group as one table. This is a very low level config parameter, only change if you know what you are doing, ", + "default": "((.+)[_$])?(\\d{8})$", + "deprecated": true, + "type": "string" + }, + "project_id_pattern": { + "title": "Project Id Pattern", + "description": "Regex patterns for project_id to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "usage": { + "title": "Usage", + "description": "Usage related configs", + "default": { + "bucket_duration": "DAY", + "end_time": "2023-07-26T06:31:16.841609+00:00", + "start_time": "2023-07-25T00:00:00+00:00", + "top_n_queries": 10, + "user_email_pattern": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "include_operational_stats": true, + "include_read_operational_stats": false, + "format_sql_queries": false, + "include_top_n_queries": true, + "max_query_duration": 900.0, + "apply_view_usage_to_tables": false + }, + "allOf": [ + { + "$ref": "#/definitions/BigQueryUsageConfig" + } + ] + }, + "include_usage_statistics": { + "title": "Include Usage Statistics", + "description": "Generate usage statistic", + "default": true, + "type": "boolean" + }, + "capture_table_label_as_tag": { + "title": "Capture Table Label As Tag", + "description": "Capture BigQuery table labels as DataHub tag", + "default": false, + "type": "boolean" + }, + "capture_dataset_label_as_tag": { + "title": "Capture Dataset Label As Tag", + "description": "Capture BigQuery dataset labels as DataHub tag", + "default": false, + "type": "boolean" + }, + "dataset_pattern": { + "title": "Dataset Pattern", + "description": "Regex patterns for dataset to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "match_fully_qualified_names": { + "title": "Match Fully Qualified Names", + "description": "Whether `dataset_pattern` is matched against fully qualified dataset name `.`.", + "default": false, + "type": "boolean" + }, + "include_external_url": { + "title": "Include External Url", + "description": "Whether to populate BigQuery Console url to Datasets/Tables", + "default": true, + "type": "boolean" + }, + "debug_include_full_payloads": { + "title": "Debug Include Full Payloads", + "description": "Include full payload into events. 
It is only for debugging and internal use.", + "default": false, + "type": "boolean" + }, + "number_of_datasets_process_in_batch_if_profiling_enabled": { + "title": "Number Of Datasets Process In Batch If Profiling Enabled", + "description": "Number of partitioned table queried in batch when getting metadata. This is a low level config property which should be touched with care. This restriction is needed because we query partitions system view which throws error if we try to touch too many tables.", + "default": 200, + "type": "integer" + }, + "column_limit": { + "title": "Column Limit", + "description": "Maximum number of columns to process in a table. This is a low level config property which should be touched with care. This restriction is needed because excessively wide tables can result in failure to ingest the schema.", + "default": 300, + "type": "integer" + }, + "project_id": { + "title": "Project Id", + "description": "[deprecated] Use project_id_pattern or project_ids instead.", + "type": "string" + }, + "project_ids": { + "title": "Project Ids", + "description": "Ingests specified project_ids. Use this property if you want to specify what projects to ingest or don't want to give project resourcemanager.projects.list to your service account. Overrides `project_id_pattern`.", + "type": "array", + "items": { + "type": "string" + } + }, + "project_on_behalf": { + "title": "Project On Behalf", + "description": "[Advanced] The BigQuery project in which queries are executed. Will be passed when creating a job. If not passed, falls back to the project associated with the service account.", + "type": "string" + }, + "lineage_use_sql_parser": { + "title": "Lineage Use Sql Parser", + "description": "Use sql parser to resolve view/table lineage.", + "default": true, + "type": "boolean" + }, + "lineage_parse_view_ddl": { + "title": "Lineage Parse View Ddl", + "description": "Sql parse view ddl to get lineage.", + "default": true, + "type": "boolean" + }, + "lineage_sql_parser_use_raw_names": { + "title": "Lineage Sql Parser Use Raw Names", + "description": "This parameter ignores the lowercase pattern stipulated in the SQLParser. NOTE: Ignored if lineage_use_sql_parser is False.", + "default": false, + "type": "boolean" + }, + "extract_column_lineage": { + "title": "Extract Column Lineage", + "description": "If enabled, generate column level lineage. Requires lineage_use_sql_parser to be enabled. This and `incremental_lineage` cannot both be enabled.", + "default": false, + "type": "boolean" + }, + "extract_lineage_from_catalog": { + "title": "Extract Lineage From Catalog", + "description": "This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. 
Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/concepts/about-data-lineage", + "default": false, + "type": "boolean" + }, + "convert_urns_to_lowercase": { + "title": "Convert Urns To Lowercase", + "description": "Convert urns to lowercase.", + "default": false, + "type": "boolean" + }, + "enable_legacy_sharded_table_support": { + "title": "Enable Legacy Sharded Table Support", + "description": "Use the legacy sharded table urn suffix added.", + "default": true, + "type": "boolean" + }, + "scheme": { + "title": "Scheme", + "default": "bigquery", + "type": "string" + }, + "log_page_size": { + "title": "Log Page Size", + "description": "The number of log item will be queried per page for lineage collection", + "default": 1000, + "exclusiveMinimum": 0, + "type": "integer" + }, + "credential": { + "title": "Credential", + "description": "BigQuery credential informations", + "allOf": [ + { + "$ref": "#/definitions/BigQueryCredential" + } + ] + }, + "extra_client_options": { + "title": "Extra Client Options", + "description": "Additional options to pass to google.cloud.logging_v2.client.Client.", + "default": {}, + "type": "object" + }, + "include_table_lineage": { + "title": "Include Table Lineage", + "description": "Option to enable/disable lineage generation. Is enabled by default.", + "default": true, + "type": "boolean" + }, + "max_query_duration": { + "title": "Max Query Duration", + "description": "Correction to pad start_time and end_time with. For handling the case where the read happens within our time range but the query completion event is delayed and happens after the configured end time.", + "default": 900.0, + "type": "number", + "format": "time-delta" + }, + "bigquery_audit_metadata_datasets": { + "title": "Bigquery Audit Metadata Datasets", + "description": "A list of datasets that contain a table named cloudaudit_googleapis_com_data_access which contain BigQuery audit logs, specifically, those containing BigQueryAuditMetadata. It is recommended that the project of the dataset is also specified, for example, projectA.datasetB.", + "type": "array", + "items": { + "type": "string" + } + }, + "use_exported_bigquery_audit_metadata": { + "title": "Use Exported Bigquery Audit Metadata", + "description": "When configured, use BigQueryAuditMetadata in bigquery_audit_metadata_datasets to compute lineage information.", + "default": false, + "type": "boolean" + }, + "use_date_sharded_audit_log_tables": { + "title": "Use Date Sharded Audit Log Tables", + "description": "Whether to read date sharded tables or time partitioned tables when extracting usage from exported audit logs.", + "default": false, + "type": "boolean" + }, + "upstream_lineage_in_report": { + "title": "Upstream Lineage In Report", + "description": "Useful for debugging lineage information. Set to True to see the raw lineage created internally.", + "default": false, + "type": "boolean" + } + }, + "additionalProperties": false, + "definitions": { + "BucketDuration": { + "title": "BucketDuration", + "description": "An enumeration.", + "enum": [ + "DAY", + "HOUR" + ], + "type": "string" + }, + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. 
Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. 
This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. 
Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + }, + "BigQueryUsageConfig": { + "title": "BigQueryUsageConfig", + "type": "object", + "properties": { + "bucket_duration": { + "description": "Size of the time window to aggregate usage stats.", + "default": "DAY", + "allOf": [ + { + "$ref": "#/definitions/BucketDuration" + } + ] + }, + "end_time": { + "title": "End Time", + "description": "Latest date of usage to consider. Default: Current time in UTC", + "type": "string", + "format": "date-time" + }, + "start_time": { + "title": "Start Time", + "description": "Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`)", + "type": "string", + "format": "date-time" + }, + "top_n_queries": { + "title": "Top N Queries", + "description": "Number of top queries to save to each table.", + "default": 10, + "exclusiveMinimum": 0, + "type": "integer" + }, + "user_email_pattern": { + "title": "User Email Pattern", + "description": "regex patterns for user emails to filter in usage.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "include_operational_stats": { + "title": "Include Operational Stats", + "description": "Whether to display operational stats.", + "default": true, + "type": "boolean" + }, + "include_read_operational_stats": { + "title": "Include Read Operational Stats", + "description": "Whether to report read operational stats. 
Experimental.", + "default": false, + "type": "boolean" + }, + "format_sql_queries": { + "title": "Format Sql Queries", + "description": "Whether to format sql queries", + "default": false, + "type": "boolean" + }, + "include_top_n_queries": { + "title": "Include Top N Queries", + "description": "Whether to ingest the top_n_queries.", + "default": true, + "type": "boolean" + }, + "max_query_duration": { + "title": "Max Query Duration", + "description": "Correction to pad start_time and end_time with. For handling the case where the read happens within our time range but the query completion event is delayed and happens after the configured end time.", + "default": 900.0, + "type": "number", + "format": "time-delta" + }, + "apply_view_usage_to_tables": { + "title": "Apply View Usage To Tables", + "description": "Whether to apply view's usage to its base tables. If set to False, uses sql parser and applies usage to views / tables mentioned in the query. If set to True, usage is applied to base tables only.", + "default": false, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "BigQueryCredential": { + "title": "BigQueryCredential", + "type": "object", + "properties": { + "project_id": { + "title": "Project Id", + "description": "Project id to set the credentials", + "type": "string" + }, + "private_key_id": { + "title": "Private Key Id", + "description": "Private key id", + "type": "string" + }, + "private_key": { + "title": "Private Key", + "description": "Private key in a form of '-----BEGIN PRIVATE KEY-----\\nprivate-key\\n-----END PRIVATE KEY-----\\n'", + "type": "string" + }, + "client_email": { + "title": "Client Email", + "description": "Client email", + "type": "string" + }, + "client_id": { + "title": "Client Id", + "description": "Client Id", + "type": "string" + }, + "auth_uri": { + "title": "Auth Uri", + "description": "Authentication uri", + "default": "https://accounts.google.com/o/oauth2/auth", + "type": "string" + }, + "token_uri": { + "title": "Token Uri", + "description": "Token uri", + "default": "https://oauth2.googleapis.com/token", + "type": "string" + }, + "auth_provider_x509_cert_url": { + "title": "Auth Provider X509 Cert Url", + "description": "Auth provider x509 certificate url", + "default": "https://www.googleapis.com/oauth2/v1/certs", + "type": "string" + }, + "type": { + "title": "Type", + "description": "Authentication type", + "default": "service_account", + "type": "string" + }, + "client_x509_cert_url": { + "title": "Client X509 Cert Url", + "description": "If not set it will be default to https://www.googleapis.com/robot/v1/metadata/x509/client_email", + "type": "string" + } + }, + "required": [ + "project_id", + "private_key_id", + "private_key", + "client_email", + "client_id" + ], + "additionalProperties": false + } + } +} +``` + + +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.bigquery_v2.bigquery.BigqueryV2Source` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py) + +
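+For quick reference, below is a minimal, illustrative sketch of a recipe that combines a few of the options documented above. It is not the canonical configuration: the project IDs, credential values, and sink address are placeholders, and the `credential` block can be omitted if Google credentials are supplied to the environment some other way.
+
+```yaml
+source:
+  type: bigquery
+  config:
+    # Ingest only these projects (overrides `project_id_pattern`).
+    project_ids:
+      - my-gcp-project              # placeholder
+    # Lineage options (both default to true; shown for clarity).
+    include_table_lineage: true
+    lineage_use_sql_parser: true
+    # Optional SQL profiling.
+    profiling:
+      enabled: true
+      profile_table_level_only: true
+    # Service-account credential; these five fields are required whenever `credential` is set.
+    credential:
+      project_id: my-gcp-project    # placeholder
+      private_key_id: "<private-key-id>"
+      private_key: "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n"
+      client_email: "service-account@my-gcp-project.iam.gserviceaccount.com"
+      client_id: "<client-id>"
+
+sink:
+  type: datahub-rest
+  config:
+    server: http://localhost:8080   # placeholder DataHub endpoint
+```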

+### Questions

+ +If you've got any questions on configuring ingestion for BigQuery, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/business-glossary.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/business-glossary.md new file mode 100644 index 0000000000000..7aaba3ec0096f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/business-glossary.md @@ -0,0 +1,358 @@ +--- +sidebar_position: 4 +title: Business Glossary +slug: /generated/ingestion/sources/business-glossary +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/business-glossary.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Business Glossary + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +This plugin pulls business glossary metadata from a yaml-formatted file. An example of one such file is located in the examples directory [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml). + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[datahub-business-glossary]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: datahub-business-glossary + config: + # Coordinates + file: /path/to/business_glossary_yaml + enable_auto_id: true # recommended to set to true so datahub will auto-generate guids from your term names + +# sink configs if needed +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+| Field | Description |
+| :--- | :--- |
+| `file` <br/> One of string, string(path) | File path or URL to business glossary file to ingest. |
+| `enable_auto_id` <br/> boolean | Generate guid urns instead of a plaintext path urn with the node/term's hierarchy. <br/> Default: False |
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "BusinessGlossarySourceConfig", + "type": "object", + "properties": { + "file": { + "title": "File", + "description": "File path or URL to business glossary file to ingest.", + "anyOf": [ + { + "type": "string" + }, + { + "type": "string", + "format": "path" + } + ] + }, + "enable_auto_id": { + "title": "Enable Auto Id", + "description": "Generate guid urns instead of a plaintext path urn with the node/term's hierarchy.", + "default": false, + "type": "boolean" + } + }, + "required": [ + "file" + ], + "additionalProperties": false +} +``` + + +
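+As a quick reference, the sketch below repeats the starter recipe with an explicit sink block filled in. The file path and the DataHub endpoint are placeholders, and any supported sink can be used in place of `datahub-rest`.
+
+```yaml
+source:
+  type: datahub-business-glossary
+  config:
+    file: /path/to/business_glossary.yml  # placeholder path to your glossary file
+    enable_auto_id: true                  # auto-generate guid urns from term names
+
+sink:
+  type: datahub-rest
+  config:
+    server: http://localhost:8080         # placeholder DataHub endpoint
+```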
+ +### Business Glossary File Format + +The business glossary source file should be a .yml file with the following top-level keys: + +**Glossary**: the top level keys of the business glossary file + +Example **Glossary**: + +```yaml +version: 1 # the version of business glossary file config the config conforms to. Currently the only version released is `1`. +source: DataHub # the source format of the terms. Currently only supports `DataHub` +owners: # owners contains two nested fields + users: # (optional) a list of user IDs + - njones + groups: # (optional) a list of group IDs + - logistics +url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable +nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below + ... +``` + +**GlossaryNode**: a container of **GlossaryNode** and **GlossaryTerm** objects + +Example **GlossaryNode**: + +```yaml +- name: Shipping # name of the node + description: Provides terms related to the shipping domain # description of the node + owners: # (optional) owners contains 2 nested fields + users: # (optional) a list of user IDs + - njones + groups: # (optional) a list of group IDs + - logistics + nodes: # list of child **GlossaryNode** objects + ... + knowledge_links: # (optional) list of **KnowledgeCard** objects + - label: Wiki link for shipping + url: "https://en.wikipedia.org/wiki/Freight_transport" +``` + +**GlossaryTerm**: a term in your business glossary + +Example **GlossaryTerm**: + +```yaml +- name: FullAddress # name of the term + description: A collection of information to give the location of a building or plot of land. # description of the term + owners: # (optional) owners contains 2 nested fields + users: # (optional) a list of user IDs + - njones + groups: # (optional) a list of group IDs + - logistics + term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization. + source_ref: FIBO # (optional) if external, what is the name of the source the glossary term is coming from? + source_url: "https://www.google.com" # (optional) if external, what is the url of the source definition? + inherits: # (optional) list of **GlossaryTerm** that this term inherits from + - Privacy.PII + contains: # (optional) a list of **GlossaryTerm** that this term contains + - Shipping.ZipCode + - Shipping.CountryCode + - Shipping.StreetAddress + custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties + - is_used_for_compliance_tracking: true + knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary node's page + - url: "https://en.wikipedia.org/wiki/Address" + label: Wiki link + domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn +``` + +To see how these all work together, check out this comprehensive example business glossary file below: + +
+Example business glossary file + +```yaml +version: 1 +source: DataHub +owners: + users: + - mjames +url: "https://github.com/datahub-project/datahub/" +nodes: + - name: Classification + description: A set of terms related to Data Classification + knowledge_links: + - label: Wiki link for classification + url: "https://en.wikipedia.org/wiki/Classification" + terms: + - name: Sensitive + description: Sensitive Data + custom_properties: + is_confidential: false + - name: Confidential + description: Confidential Data + custom_properties: + is_confidential: true + - name: HighlyConfidential + description: Highly Confidential Data + custom_properties: + is_confidential: true + domain: Marketing + - name: PersonalInformation + description: All terms related to personal information + owners: + users: + - mjames + terms: + - name: Email + ## An example of using an id to pin a term to a specific guid + ## See "how to generate custom IDs for your terms" section below + # id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3" + description: An individual's email address + inherits: + - Classification.Confidential + owners: + groups: + - Trust and Safety + - name: Address + description: A physical address + - name: Gender + description: The gender identity of the individual + inherits: + - Classification.Sensitive + - name: Shipping + description: Provides terms related to the shipping domain + owners: + users: + - njones + groups: + - logistics + terms: + - name: FullAddress + description: A collection of information to give the location of a building or plot of land. + owners: + users: + - njones + groups: + - logistics + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://www.google.com" + inherits: + - Privacy.PII + contains: + - Shipping.ZipCode + - Shipping.CountryCode + - Shipping.StreetAddress + related_terms: + - Housing.Kitchen.Cutlery + custom_properties: + - is_used_for_compliance_tracking: true + knowledge_links: + - url: "https://en.wikipedia.org/wiki/Address" + label: Wiki link + domain: "urn:li:domain:Logistics" + knowledge_links: + - label: Wiki link for shipping + url: "https://en.wikipedia.org/wiki/Freight_transport" + - name: ClientsAndAccounts + description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities + owners: + groups: + - finance + terms: + - name: Account + description: Container for records associated with a business arrangement for regular transactions and services + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + inherits: + - Classification.HighlyConfidential + contains: + - ClientsAndAccounts.Balance + - name: Balance + description: Amount of money available or owed + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance" + - name: Housing + description: Provides terms related to the housing domain + owners: + users: + - mjames + groups: + - interior + nodes: + - name: Colors + description: "Colors that are used in Housing construction" + terms: + - name: Red + description: "red color" + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + + - name: Green + description: "green color" + 
term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + + - name: Pink + description: pink color + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + terms: + - name: WindowColor + description: Supported window colors + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + values: + - Housing.Colors.Red + - Housing.Colors.Pink + + - name: Kitchen + description: a room or area where food is prepared and cooked. + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + + - name: Spoon + description: an implement consisting of a small, shallow oval or round bowl on a long handle, used for eating, stirring, and serving food. + term_source: "EXTERNAL" + source_ref: FIBO + source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account" + related_terms: + - Housing.Kitchen + knowledge_links: + - url: "https://en.wikipedia.org/wiki/Spoon" + label: Wiki link +``` + +
+ +Source file linked [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml). + +## Generating custom IDs for your terms + +IDs are normally inferred from the glossary term/node's name, see the `enable_auto_id` config. But, if you need a stable +identifier, you can generate a custom ID for your term. It should be unique across the entire Glossary. + +Here's an example ID: +`id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"` + +A note of caution: once you select a custom ID, it cannot be easily changed. + +## Compatibility + +Compatible with version 1 of business glossary format. +The source will be evolved as we publish newer versions of this format. + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.metadata.business_glossary.BusinessGlossaryFileSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/metadata/business_glossary.py) + +
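+To make the custom-ID guidance above concrete, here is a sketch of how the example urn would be pinned to a term inside a glossary file. The node and term mirror the `PersonalInformation.Email` entry from the comprehensive example above; the `id` line is simply uncommented relative to that example.
+
+```yaml
+nodes:
+  - name: PersonalInformation
+    description: All terms related to personal information
+    terms:
+      - name: Email
+        # Pin this term to a stable custom ID; once chosen, it cannot be easily changed.
+        id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"
+        description: An individual's email address
+```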

+### Questions

+ +If you've got any questions on configuring ingestion for Business Glossary, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/clickhouse.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/clickhouse.md new file mode 100644 index 0000000000000..01ff577032d98 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/clickhouse.md @@ -0,0 +1,1469 @@ +--- +sidebar_position: 5 +title: ClickHouse +slug: /generated/ingestion/sources/clickhouse +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/clickhouse.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# ClickHouse + +There are 2 sources that provide integration with ClickHouse + + + + + + + + + + +
Source Module | Documentation
+ +`clickhouse` + + + +This plugin extracts the following: + +- Metadata for tables, views, materialized views and dictionaries +- Column types associated with each table(except \*AggregateFunction and DateTime with timezone) +- Table, row, and column statistics via optional SQL profiling. +- Table, view, materialized view and dictionary(with CLICKHOUSE source_type) lineage + +:::tip + +You can also get fine-grained usage statistics for ClickHouse using the `clickhouse-usage` source described below. + +::: + +[Read more...](#module-clickhouse) + +
+ +`clickhouse-usage` + + + +This plugin has the below functionalities - + +1. For a specific dataset this plugin ingests the following statistics - + 1. top n queries. + 2. top users. + 3. usage of each column in the dataset. +2. Aggregation of these statistics into buckets, by day or hour granularity. + +Usage information is computed by querying the system.query_log table. In case you have a cluster or need to apply additional transformation/filters you can create a view and put to the `query_log_table` setting. + +:::note + +This source only does usage statistics. To get the tables, views, and schemas in your ClickHouse warehouse, ingest using the `clickhouse` source described above. + +::: + +[Read more...](#module-clickhouse-usage) + +
+ +## Module `clickhouse` + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ------------------------------------ | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ✅ | Optionally enabled via configuration | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled via stateful ingestion | + +This plugin extracts the following: + +- Metadata for tables, views, materialized views and dictionaries +- Column types associated with each table(except \*AggregateFunction and DateTime with timezone) +- Table, row, and column statistics via optional SQL profiling. +- Table, view, materialized view and dictionary(with CLICKHOUSE source_type) lineage + +:::tip + +You can also get fine-grained usage statistics for ClickHouse using the `clickhouse-usage` source described below. + +::: + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[clickhouse]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: clickhouse + config: + # Coordinates + host_port: localhost:9000 + + # Credentials + username: user + password: pass + + # Options + platform_instance: DatabaseNameToBeIngested + + include_views: True # whether to include views, defaults to True + include_tables: True # whether to include views, defaults to True + +sink: + # sink configs + +#--------------------------------------------------------------------------- +# For the HTTP interface: +#--------------------------------------------------------------------------- +source: + type: clickhouse + config: + host_port: localhost:8443 + protocol: https + +#--------------------------------------------------------------------------- +# For the Native interface: +#--------------------------------------------------------------------------- + +source: + type: clickhouse + config: + host_port: localhost:9440 + scheme: clickhouse+native + secure: True +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
bucket_duration
Enum
| Size of the time window to aggregate usage stats.
Default: DAY
| +|
database
string
| database (catalog) | +|
database_alias
string
| [Deprecated] Alias to apply to database when ingesting. | +|
end_time
string(date-time)
| Latest date of usage to consider. Default: Current time in UTC | +|
host_port
string
| ClickHouse host URL.
Default: localhost:8123
| +|
include_materialized_views
boolean
|
Default: True
| +|
include_table_lineage
boolean
| Whether table lineage should be ingested.
Default: True
| +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
include_views
boolean
| Whether views should be ingested.
Default: True
| +|
options
object
| Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. | +|
password
string(password)
| password
Default:
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
platform_instance_map
map(str,string)
| | +|
protocol
string
| | +|
secure
boolean
| | +|
sqlalchemy_uri
string
| URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. | +|
start_time
string(date-time)
| Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`) | +|
username
string
| username | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile a table only if it has been updated within this many days. If set to `null`, no last-modified-time constraint is applied to the tables selected for profiling. Supported only in `snowflake` and `BigQuery`. | +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less than the specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less than the specified number of GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| Whether stateful ingestion is enabled.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
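+The dotted names in the table above denote nesting in the YAML recipe. For example, the `profiling.*` and `stateful_ingestion.*` rows map onto nested blocks like the following sketch (values are illustrative only):
+
+```yaml
+source:
+  type: clickhouse
+  config:
+    host_port: localhost:9000       # as in the starter recipe above
+    profiling:
+      enabled: true                 # table row: profiling.enabled
+      profile_table_level_only: true
+    stateful_ingestion:
+      enabled: true                 # table row: stateful_ingestion.enabled
+      remove_stale_metadata: true
+```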
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "ClickHouseConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance_map": { + "title": "Platform Instance Map", + "description": "A holder for platform -> platform_instance mappings to generate correct dataset urns", + "type": "object", + "additionalProperties": { + "type": "string" + } + }, + "bucket_duration": { + "description": "Size of the time window to aggregate usage stats.", + "default": "DAY", + "allOf": [ + { + "$ref": "#/definitions/BucketDuration" + } + ] + }, + "end_time": { + "title": "End Time", + "description": "Latest date of usage to consider. Default: Current time in UTC", + "type": "string", + "format": "date-time" + }, + "start_time": { + "title": "Start Time", + "description": "Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`)", + "type": "string", + "format": "date-time" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs.", + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. 
Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_views": { + "title": "Include Views", + "description": "Whether views should be ingested.", + "default": true, + "type": "boolean" + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "username": { + "title": "Username", + "description": "username", + "type": "string" + }, + "password": { + "title": "Password", + "description": "password", + "default": "", + "type": "string", + "writeOnly": true, + "format": "password" + }, + "host_port": { + "title": "Host Port", + "description": "ClickHouse host URL.", + "default": "localhost:8123", + "type": "string" + }, + "database": { + "title": "Database", + "description": "database (catalog)", + "type": "string" + }, + "database_alias": { + "title": "Database Alias", + "description": "[Deprecated] Alias to apply to database when ingesting.", + "type": "string" + }, + "sqlalchemy_uri": { + "title": "Sqlalchemy Uri", + "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. 
Takes precedence over other connection parameters.", + "type": "string" + }, + "secure": { + "title": "Secure", + "type": "boolean" + }, + "protocol": { + "title": "Protocol", + "type": "string" + }, + "include_table_lineage": { + "title": "Include Table Lineage", + "description": "Whether table lineage should be ingested.", + "default": true, + "type": "boolean" + }, + "include_materialized_views": { + "title": "Include Materialized Views", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false, + "definitions": { + "BucketDuration": { + "title": "BucketDuration", + "description": "An enumeration.", + "enum": [ + "DAY", + "HOUR" + ], + "type": "string" + }, + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. 
Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. 
The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + } + } +} +``` + + +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.sql.clickhouse.ClickHouseSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/clickhouse.py) + +## Module `clickhouse-usage` + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ------------------------------------ | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ✅ | Optionally enabled via configuration | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled via stateful ingestion | + +This plugin has the below functionalities - + +1. For a specific dataset this plugin ingests the following statistics - + 1. top n queries. + 2. top users. + 3. usage of each column in the dataset. +2. Aggregation of these statistics into buckets, by day or hour granularity. + +Usage information is computed by querying the system.query_log table. In case you have a cluster or need to apply additional transformation/filters you can create a view and put to the `query_log_table` setting. + +:::note + +This source only does usage statistics. To get the tables, views, and schemas in your ClickHouse warehouse, ingest using the `clickhouse` source described above. + +::: + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[clickhouse-usage]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: clickhouse-usage + config: + # Coordinates + host_port: db_host:port + platform_instance: dev_cluster + email_domain: acryl.io + + # Credentials + username: username + password: "password" + +sink: +# sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
email_domain 
string
| | +|
bucket_duration
Enum
| Size of the time window to aggregate usage stats.
Default: DAY
| +|
database
string
| database (catalog) | +|
database_alias
string
| [Deprecated] Alias to apply to database when ingesting. | +|
end_time
string(date-time)
| Latest date of usage to consider. Default: Current time in UTC | +|
format_sql_queries
boolean
| Whether to format sql queries
Default: False
| +|
host_port
string
| ClickHouse host URL.
Default: localhost:8123
| +|
include_materialized_views
boolean
|
Default: True
| +|
include_operational_stats
boolean
| Whether to display operational stats.
Default: True
| +|
include_read_operational_stats
boolean
| Whether to report read operational stats. Experimental.
Default: False
| +|
include_table_lineage
boolean
| Whether table lineage should be ingested.
Default: True
| +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
include_top_n_queries
boolean
| Whether to ingest the top_n_queries.
Default: True
| +|
include_views
boolean
| Whether views should be ingested.
Default: True
| +|
options
object
|
Default: {}
| +|
password
string(password)
| password
Default:
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
platform_instance_map
map(str,string)
| | +|
protocol
string
| | +|
query_log_table
string
|
Default: system.query_log
| +|
secure
boolean
| | +|
sqlalchemy_uri
string
| URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. | +|
start_time
string(date-time)
| Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`) | +|
top_n_queries
integer
| Number of top queries to save to each table.
Default: 10
| +|
username
string
| username | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
user_email_pattern
AllowDenyPattern
| Regex patterns for user emails to filter in usage.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
user_email_pattern.allow
array(string)
| | +|
user_email_pattern.deny
array(string)
| | +|
user_email_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile a table only if it has been updated within the given number of days. If set to `null`, no last-modified-time constraint is applied when selecting tables to profile. Supported only in `snowflake` and `BigQuery`. | +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less than the specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less than the specified size in GB. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling metrics. This disables profiling for quantiles, distinct_value_frequencies, histogram & sample_values, and also limits the maximum number of fields profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| Whether stateful ingestion is enabled for this source.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "ClickHouseUsageConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "bucket_duration": { + "description": "Size of the time window to aggregate usage stats.", + "default": "DAY", + "allOf": [ + { + "$ref": "#/definitions/BucketDuration" + } + ] + }, + "end_time": { + "title": "End Time", + "description": "Latest date of usage to consider. Default: Current time in UTC", + "type": "string", + "format": "date-time" + }, + "start_time": { + "title": "Start Time", + "description": "Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`)", + "type": "string", + "format": "date-time" + }, + "top_n_queries": { + "title": "Top N Queries", + "description": "Number of top queries to save to each table.", + "default": 10, + "exclusiveMinimum": 0, + "type": "integer" + }, + "user_email_pattern": { + "title": "User Email Pattern", + "description": "regex patterns for user emails to filter in usage.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "include_operational_stats": { + "title": "Include Operational Stats", + "description": "Whether to display operational stats.", + "default": true, + "type": "boolean" + }, + "include_read_operational_stats": { + "title": "Include Read Operational Stats", + "description": "Whether to report read operational stats. Experimental.", + "default": false, + "type": "boolean" + }, + "format_sql_queries": { + "title": "Format Sql Queries", + "description": "Whether to format sql queries", + "default": false, + "type": "boolean" + }, + "include_top_n_queries": { + "title": "Include Top N Queries", + "description": "Whether to ingest the top_n_queries.", + "default": true, + "type": "boolean" + }, + "platform_instance_map": { + "title": "Platform Instance Map", + "description": "A holder for platform -> platform_instance mappings to generate correct dataset urns", + "type": "object", + "additionalProperties": { + "type": "string" + } + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "default": {}, + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. 
to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_views": { + "title": "Include Views", + "description": "Whether views should be ingested.", + "default": true, + "type": "boolean" + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "username": { + "title": "Username", + "description": "username", + "type": "string" + }, + "password": { + "title": "Password", + "description": "password", + "default": 
"", + "type": "string", + "writeOnly": true, + "format": "password" + }, + "host_port": { + "title": "Host Port", + "description": "ClickHouse host URL.", + "default": "localhost:8123", + "type": "string" + }, + "database": { + "title": "Database", + "description": "database (catalog)", + "type": "string" + }, + "database_alias": { + "title": "Database Alias", + "description": "[Deprecated] Alias to apply to database when ingesting.", + "type": "string" + }, + "sqlalchemy_uri": { + "title": "Sqlalchemy Uri", + "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.", + "type": "string" + }, + "secure": { + "title": "Secure", + "type": "boolean" + }, + "protocol": { + "title": "Protocol", + "type": "string" + }, + "include_table_lineage": { + "title": "Include Table Lineage", + "description": "Whether table lineage should be ingested.", + "default": true, + "type": "boolean" + }, + "include_materialized_views": { + "title": "Include Materialized Views", + "default": true, + "type": "boolean" + }, + "email_domain": { + "title": "Email Domain", + "type": "string" + }, + "query_log_table": { + "title": "Query Log Table", + "default": "system.query_log", + "type": "string" + } + }, + "required": [ + "email_domain" + ], + "additionalProperties": false, + "definitions": { + "BucketDuration": { + "title": "BucketDuration", + "description": "An enumeration.", + "enum": [ + "DAY", + "HOUR" + ], + "type": "string" + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. 
This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. 
Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + } + } +} +``` + + +
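If your deployment is clustered, or you want to filter or transform the raw query log before ingestion, the note above suggests creating a view and pointing `query_log_table` at it. Below is a hedged sketch of such a recipe; the `usage_meta.query_log_view` name and the deny pattern are hypothetical, and every key used is documented in the options table above.

```yaml
source:
  type: clickhouse-usage
  config:
    # Coordinates (placeholder values)
    host_port: db_host:8123
    email_domain: acryl.io
    username: username
    password: "password"

    # Hypothetical view you maintain over system.query_log
    # (e.g. filtered to successful SELECTs, or unioned across replicas)
    query_log_table: usage_meta.query_log_view

    # Exclude service accounts from usage statistics (hypothetical pattern)
    user_email_pattern:
      deny:
        - 'etl_bot@.*'

    # Keep only the top five queries per table
    top_n_queries: 5
```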
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.usage.clickhouse_usage.ClickHouseUsageSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/usage/clickhouse_usage.py) + +

Questions

+ +If you've got any questions on configuring ingestion for ClickHouse, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/csv.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/csv.md new file mode 100644 index 0000000000000..10bd92e71d986 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/csv.md @@ -0,0 +1,129 @@ +--- +sidebar_position: 6 +title: CSV +slug: /generated/ingestion/sources/csv +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/csv.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# CSV + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +This plugin is used to bulk upload metadata to Datahub. +It will apply glossary terms, tags, decription, owners and domain at the entity level. It can also be used to apply tags, +glossary terms, and documentation at the column level. These values are read from a CSV file. You have the option to either overwrite +or append existing values. + +The format of the CSV is demonstrated below. The header is required and URNs should be surrounded by quotes when they contains commas (most URNs contains commas). + +``` +resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain +"urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe;urn:li:corpuser:jdoe],TECHNICAL_OWNER,"description for users table",urn:li:domain:Engineering +"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD",first_name,[urn:li:glossaryTerm:FirstName],,,,"first_name description" +"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD",last_name,[urn:li:glossaryTerm:LastName],,,,"last_name description" +``` + +Note that the first row does not have a subresource populated. That means any glossary terms, tags, and owners will +be applied at the entity field. If a subresource is populated (as it is for the second and third rows), glossary +terms and tags will be applied on the column. Every row MUST have a resource. Also note that owners can only +be applied at the resource level. + +:::note +This source will not work on very large csv files that do not fit in memory. +::: + +### CLI based Ingestion + +#### Install the Plugin + +The `csv-enricher` source works out of the box with `acryl-datahub`. + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: csv-enricher + config: + # relative path to your csv file to ingest + filename: ./path/to/your/file.csv +# Default sink is datahub-rest and doesn't need to be configured +# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
filename 
string
| File path or URL of CSV file to ingest. | +|
array_delimiter
string
| Delimiter to use when parsing array fields (tags, terms and owners)
Default: |
| +|
delimiter
string
| Delimiter to use when parsing CSV
Default: ,
| +|
write_semantics
string
| Whether newly added tags, terms, and owners should override the existing ones previously added by this source. Allowed values for this config are "PATCH" and "OVERRIDE"
Default: PATCH
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "CSVEnricherConfig", + "type": "object", + "properties": { + "filename": { + "title": "Filename", + "description": "File path or URL of CSV file to ingest.", + "type": "string" + }, + "write_semantics": { + "title": "Write Semantics", + "description": "Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be \"PATCH\" or \"OVERRIDE\"", + "default": "PATCH", + "type": "string" + }, + "delimiter": { + "title": "Delimiter", + "description": "Delimiter to use when parsing CSV", + "default": ",", + "type": "string" + }, + "array_delimiter": { + "title": "Array Delimiter", + "description": "Delimiter to use when parsing array fields (tags, terms and owners)", + "default": "|", + "type": "string" + } + }, + "required": [ + "filename" + ], + "additionalProperties": false +} +``` + + +
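As a hedged illustration of the options above, the sketch below switches from the default patch-style behaviour to overriding and changes the array delimiter; the file path is hypothetical, and the CSV itself must follow the format described earlier.

```yaml
source:
  type: csv-enricher
  config:
    # Hypothetical local path; a URL is also accepted per `filename` above
    filename: ./bulk_glossary_and_owners.csv
    # Replace values previously added by this source instead of patching them
    write_semantics: OVERRIDE
    # Use semicolons between entries inside array cells (tags, terms, owners),
    # matching the semicolon-separated owners in the example CSV shown earlier
    array_delimiter: ";"
# Default sink is datahub-rest and doesn't need to be configured
```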
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.csv_enricher.CSVEnricherSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/csv_enricher.py) + +

Questions

+ +If you've got any questions on configuring ingestion for CSV, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/databricks.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/databricks.md new file mode 100644 index 0000000000000..f696ceeb1ccf9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/databricks.md @@ -0,0 +1,591 @@ +--- +sidebar_position: 7 +title: Databricks +slug: /generated/ingestion/sources/databricks +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/databricks.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Databricks + +DataHub supports integration with Databricks ecosystem using a multitude of connectors, depending on your exact setup. + +## Databricks Hive + +The simplest way to integrate is usually via the Hive connector. The [Hive starter recipe](http://datahubproject.io/docs/generated/ingestion/sources/hive#starter-recipe) has a section describing how to connect to your Databricks workspace. + +## Databricks Unity Catalog (new) + +The recently introduced [Unity Catalog](https://www.databricks.com/product/unity-catalog) provides a new way to govern your assets within the Databricks lakehouse. If you have enabled Unity Catalog, you can use the `unity-catalog` source (see below) to integrate your metadata into DataHub as an alternate to the Hive pathway. + +## Databricks Spark + +To complete the picture, we recommend adding push-based ingestion from your Spark jobs to see real-time activity and lineage between your Databricks tables and your Spark jobs. Use the Spark agent to push metadata to DataHub using the instructions [here](../../../../metadata-integration/java/spark-lineage/README.md#configuration-instructions-databricks). + +## Watch the DataHub Talk at the Data and AI Summit 2022 + +For a deeper look at how to think about DataHub within and across your Databricks ecosystem, watch the recording of our talk at the Data and AI Summit 2022. + +

+ + + +

+ +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ----------------------------------------------------------------- | +| Asset Containers | ✅ | Enabled by default | +| Column-level Lineage | ✅ | Enabled by default | +| Dataset Usage | ✅ | Enabled by default | +| Descriptions | ✅ | Enabled by default | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Optionally enabled via `stateful_ingestion.remove_stale_metadata` | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| Extract Ownership | ✅ | Supported via the `include_ownership` config | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | +| Schema Metadata | ✅ | Enabled by default | +| Table-Level Lineage | ✅ | Enabled by default | + +This plugin extracts the following metadata from Databricks Unity Catalog: + +- metastores +- schemas +- tables and column lineage + +### Prerequisities + +- Get your Databricks instance's [workspace url](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids) +- Create a [Databricks Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#what-is-a-service-principal) + - You can skip this step and use your own account to get things running quickly, + but we strongly recommend creating a dedicated service principal for production use. +- Generate a Databricks Personal Access token following the following guides: + - [Service Principals](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#personal-access-tokens) + - [Personal Access Tokens](https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens) +- Provision your service account: + - To ingest your workspace's metadata and lineage, your service principal must have all of the following: + - One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest + - One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest + - Ownership of or `SELECT` privilege on any tables and views you want to ingest + - [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html) + - [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html) + - To `include_usage_statistics` (enabled by default), your service principal must have `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html). + - To ingest `profiling` information with `call_analyze` (enabled by default), your service principal must have ownership or `MODIFY` privilege on any tables you want to profile. + - Alternatively, you can run [ANALYZE TABLE](https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html) yourself on any tables you want to profile, then set `call_analyze` to `false`. + You will still need `SELECT` privilege on those tables to fetch the results. +- Check the starter recipe below and replace `workspace_url` and `token` with your information from the previous steps. 
+ +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[unity-catalog]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: unity-catalog + config: + workspace_url: https://my-workspace.cloud.databricks.com + token: "mygenerated_databricks_token" + #metastore_id_pattern: + # deny: + # - 11111-2222-33333-44-555555 + #catalog_pattern: + # allow: + # - my-catalog + #schema_pattern: + # deny: + # - information_schema + #table_pattern: + # allow: + # - test.lineagedemo.dinner + # First you have to create domains on Datahub by following this guide -> https://datahubproject.io/docs/domains/#domains-setup-prerequisites-and-permissions + #domain: + # urn:li:domain:1111-222-333-444-555: + # allow: + # - main.* + + stateful_ingestion: + enabled: true + +pipeline_name: acme-corp-unity +# sink configs if needed +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
token 
string
| Databricks personal access token | +|
workspace_url 
string
| Databricks workspace url. e.g. https://my-workspace.cloud.databricks.com | +|
bucket_duration
Enum
| Size of the time window to aggregate usage stats.
Default: DAY
| +|
end_time
string(date-time)
| Latest date of usage to consider. Default: Current time in UTC | +|
format_sql_queries
boolean
| Whether to format sql queries
Default: False
| +|
include_column_lineage
boolean
| Option to enable/disable column-level lineage generation. Currently one REST call per column is required to retrieve column-level lineage from the Databricks API, which can slow down ingestion.
Default: True
| +|
include_operational_stats
boolean
| Whether to display operational stats.
Default: True
| +|
include_ownership
boolean
| Option to enable/disable ownership generation for metastores, catalogs, schemas, and tables.
Default: False
| +|
include_read_operational_stats
boolean
| Whether to report read operational stats. Experimental.
Default: False
| +|
include_table_lineage
boolean
| Option to enable/disable lineage generation.
Default: True
| +|
include_top_n_queries
boolean
| Whether to ingest the top_n_queries.
Default: True
| +|
include_usage_statistics
boolean
| Generate usage statistics.
Default: True
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
start_time
string(date-time)
| Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`) | +|
store_last_profiling_timestamps
boolean
| Enable storing last profile timestamp in store.
Default: False
| +|
top_n_queries
integer
| Number of top queries to save to each table.
Default: 10
| +|
workspace_name
string
| Name of the workspace. Defaults to the deployment name present in the workspace_url | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
catalog_pattern
AllowDenyPattern
| Regex patterns for catalogs to filter in ingestion. Specify regex to match the full `metastore.catalog` name.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
catalog_pattern.allow
array(string)
| | +|
catalog_pattern.deny
array(string)
| | +|
catalog_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Regex patterns for schemas to filter in ingestion. Specify regex to match the full `metastore.catalog.schema` name. e.g. to match all tables in schema analytics, use the regex `^mymetastore\.mycatalog\.analytics$`.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in `catalog.schema.table` format. e.g. to match all tables starting with customer in Customer catalog and public schema, use the regex `Customer\.public\.customer.*`.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
user_email_pattern
AllowDenyPattern
| Regex patterns for user emails to filter in usage.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
user_email_pattern.allow
array(string)
| | +|
user_email_pattern.deny
array(string)
| | +|
user_email_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
UnityCatalogProfilerConfig
| Data profiling configuration
Default: {'enabled': False, 'warehouse_id': None, 'profile\_...
| +|
profiling.call_analyze
boolean
| Whether to call ANALYZE TABLE as part of profile ingestion. If false, will ingest the results of the most recent ANALYZE TABLE call, if any.
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.max_wait_secs
integer
| Maximum time to wait for an ANALYZE TABLE query to complete.
Default: 3600
| +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only or include column-level profiling as well.
Default: False
| +|
profiling.warehouse_id
string
| SQL Warehouse id, for running profiling queries. | +|
profiling.pattern
AllowDenyPattern
| Regex patterns to filter tables for profiling during ingestion. Specify regex to match the `catalog.schema.table` format. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profiling.pattern.allow
array(string)
| | +|
profiling.pattern.deny
array(string)
| | +|
profiling.pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Unity Catalog Stateful Ingestion Config. | +|
stateful_ingestion.enabled
boolean
| Whether stateful ingestion is enabled for this source.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "UnityCatalogSourceConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "store_last_profiling_timestamps": { + "title": "Store Last Profiling Timestamps", + "description": "Enable storing last profile timestamp in store.", + "default": false, + "type": "boolean" + }, + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "bucket_duration": { + "description": "Size of the time window to aggregate usage stats.", + "default": "DAY", + "allOf": [ + { + "$ref": "#/definitions/BucketDuration" + } + ] + }, + "end_time": { + "title": "End Time", + "description": "Latest date of usage to consider. Default: Current time in UTC", + "type": "string", + "format": "date-time" + }, + "start_time": { + "title": "Start Time", + "description": "Earliest date of usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`)", + "type": "string", + "format": "date-time" + }, + "top_n_queries": { + "title": "Top N Queries", + "description": "Number of top queries to save to each table.", + "default": 10, + "exclusiveMinimum": 0, + "type": "integer" + }, + "user_email_pattern": { + "title": "User Email Pattern", + "description": "regex patterns for user emails to filter in usage.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "include_operational_stats": { + "title": "Include Operational Stats", + "description": "Whether to display operational stats.", + "default": true, + "type": "boolean" + }, + "include_read_operational_stats": { + "title": "Include Read Operational Stats", + "description": "Whether to report read operational stats. Experimental.", + "default": false, + "type": "boolean" + }, + "format_sql_queries": { + "title": "Format Sql Queries", + "description": "Whether to format sql queries", + "default": false, + "type": "boolean" + }, + "include_top_n_queries": { + "title": "Include Top N Queries", + "description": "Whether to ingest the top_n_queries.", + "default": true, + "type": "boolean" + }, + "stateful_ingestion": { + "title": "Stateful Ingestion", + "description": "Unity Catalog Stateful Ingestion Config.", + "allOf": [ + { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + } + ] + }, + "token": { + "title": "Token", + "description": "Databricks personal access token", + "type": "string" + }, + "workspace_url": { + "title": "Workspace Url", + "description": "Databricks workspace url. e.g. https://my-workspace.cloud.databricks.com", + "type": "string" + }, + "workspace_name": { + "title": "Workspace Name", + "description": "Name of the workspace. Default to deployment name present in workspace_url", + "type": "string" + }, + "catalog_pattern": { + "title": "Catalog Pattern", + "description": "Regex patterns for catalogs to filter in ingestion. 
Specify regex to match the full `metastore.catalog` name.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Regex patterns for schemas to filter in ingestion. Specify regex to the full `metastore.catalog.schema` name. e.g. to match all tables in schema analytics, use the regex `^mymetastore\\.mycatalog\\.analytics$`.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in `catalog.schema.table` format. e.g. to match all tables starting with customer in Customer catalog and public schema, use the regex `Customer\\.public\\.customer.*`.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to catalogs, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_table_lineage": { + "title": "Include Table Lineage", + "description": "Option to enable/disable lineage generation.", + "default": true, + "type": "boolean" + }, + "include_ownership": { + "title": "Include Ownership", + "description": "Option to enable/disable ownership generation for metastores, catalogs, schemas, and tables.", + "default": false, + "type": "boolean" + }, + "include_column_lineage": { + "title": "Include Column Lineage", + "description": "Option to enable/disable lineage generation. Currently we have to call a rest call per column to get column level lineage due to the Databrick api which can slow down ingestion. 
", + "default": true, + "type": "boolean" + }, + "include_usage_statistics": { + "title": "Include Usage Statistics", + "description": "Generate usage statistics.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "description": "Data profiling configuration", + "default": { + "enabled": false, + "warehouse_id": null, + "profile_table_level_only": false, + "pattern": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "call_analyze": true, + "max_wait_secs": 3600, + "max_workers": 80 + }, + "allOf": [ + { + "$ref": "#/definitions/UnityCatalogProfilerConfig" + } + ] + } + }, + "required": [ + "token", + "workspace_url" + ], + "additionalProperties": false, + "definitions": { + "BucketDuration": { + "title": "BucketDuration", + "description": "An enumeration.", + "enum": [ + "DAY", + "HOUR" + ], + "type": "string" + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "UnityCatalogProfilerConfig": { + "title": "UnityCatalogProfilerConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "warehouse_id": { + "title": "Warehouse Id", + "description": "SQL Warehouse id, for running profiling queries.", + "type": "string" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "pattern": { + "title": "Pattern", + "description": "Regex patterns to filter tables for profiling during ingestion. Specify regex to match the `catalog.schema.table` format. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "call_analyze": { + "title": "Call Analyze", + "description": "Whether to call ANALYZE TABLE as part of profile ingestion.If false, will ingest the results of the most recent ANALYZE TABLE call, if any.", + "default": true, + "type": "boolean" + }, + "max_wait_secs": { + "title": "Max Wait Secs", + "description": "Maximum time to wait for an ANALYZE TABLE query to complete.", + "default": 3600, + "type": "integer" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + } + }, + "additionalProperties": false + } + } +} +``` + + +
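Tying a few of the options above together, here is a hedged recipe sketch that narrows the ingestion scope and skips the per-column lineage calls (which, as noted in the Troubleshooting section below, are the slowest part of extraction). The workspace URL, token, and patterns are placeholder values.

```yaml
source:
  type: unity-catalog
  config:
    # Placeholder coordinates
    workspace_url: https://my-workspace.cloud.databricks.com
    token: "mygenerated_databricks_token"

    # Skip the per-column lineage REST calls to speed up extraction
    include_column_lineage: false

    # Only ingest the catalogs and schemas you care about (placeholder patterns)
    catalog_pattern:
      allow:
        - 'mymetastore\.analytics'
    schema_pattern:
      deny:
        - '.*\.information_schema'

    stateful_ingestion:
      enabled: true

pipeline_name: acme-corp-unity-trimmed
```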
+ +#### Troubleshooting + +##### No data lineage captured or missing lineage + +Check that you meet the [Unity Catalog lineage requirements](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html#requirements). + +Also check the [Unity Catalog limitations](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html#limitations) to make sure that lineage would be expected to exist in this case. + +##### Lineage extraction is too slow + +Currently, there is no way to get table or column lineage in bulk from the Databricks Unity Catalog REST api. Table lineage calls require one API call per table, and column lineage calls require one API call per column. If you find metadata extraction taking too long, you can turn off column level lineage extraction via the `include_column_lineage` config flag. + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.unity.source.UnityCatalogSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/unity/source.py) + +
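+If lineage extraction is slow, the troubleshooting note above suggests disabling column-level lineage via the `include_column_lineage` flag. As a minimal sketch, the recipe can also be run programmatically with DataHub's `Pipeline` API; the source `type`, credential handling, and sink shown below are assumptions and should be adjusted to match the starter recipe on this page.
+
+```python
+import os
+
+from datahub.ingestion.run.pipeline import Pipeline
+
+# Sketch only: run Unity Catalog ingestion with column-level lineage disabled
+# to cut down on per-column lineage API calls.
+pipeline = Pipeline.create(
+    {
+        "source": {
+            "type": "unity-catalog",  # assumed source type; match your recipe
+            "config": {
+                "workspace_url": "https://my-workspace.cloud.databricks.com",
+                "token": os.environ["DATABRICKS_TOKEN"],
+                "include_column_lineage": False,
+            },
+        },
+        "sink": {
+            "type": "datahub-rest",
+            "config": {"server": "http://localhost:8080"},
+        },
+    }
+)
+pipeline.run()
+pipeline.raise_from_status()
+```
+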

Questions

+ +If you've got any questions on configuring ingestion for Databricks, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/dbt.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/dbt.md new file mode 100644 index 0000000000000..888c1141cc213 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/dbt.md @@ -0,0 +1,1379 @@ +--- +sidebar_position: 8 +title: dbt +slug: /generated/ingestion/sources/dbt +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/dbt.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# dbt + +There are 2 sources that provide integration with dbt + + + + + + + + + + +
Source ModuleDocumentation
+ +`dbt` + + + +The artifacts used by this source are: + +- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json) + - This file contains model, source, tests and lineage data. +- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json) + - This file contains schema data. + - dbt does not record schema data for Ephemeral models, as such datahub will show Ephemeral models in the lineage, however there will be no associated schema for Ephemeral models +- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json) + - This file contains metadata for sources with freshness checks. + - We transfer dbt's freshness checks to DataHub's last-modified fields. + - Note that this file is optional – if not specified, we'll use time of ingestion instead as a proxy for time last-modified. +- [dbt run_results file](https://docs.getdbt.com/reference/artifacts/run-results-json) + - This file contains metadata from the result of a dbt run, e.g. dbt test + - When provided, we transfer dbt test run results into assertion run events to see a timeline of test runs on the dataset + [Read more...](#module-dbt) + +
+ +`dbt-cloud` + + + +This source pulls dbt metadata directly from the dbt Cloud APIs. + +You'll need to have a dbt Cloud job set up to run your dbt project, and "Generate docs on run" should be enabled. + +The token should have the "read metadata" permission. + +To get the required IDs, go to the job details page (this is the one with the "Run History" table), and look at the URL. +It should look something like this: https://cloud.getdbt.com/next/deploy/107298/projects/175705/jobs/148094. +In this example, the account ID is 107298, the project ID is 175705, and the job ID is 148094. +[Read more...](#module-dbt-cloud) + +
+ +Ingesting metadata from dbt requires either using the **dbt** module or the **dbt-cloud** module. + +### Concept Mapping + +| Source Concept | DataHub Concept | Notes | +| --------------- | ------------------------------------------------------------- | ------------------ | +| `"dbt"` | [Data Platform](../../metamodel/entities/dataPlatform.md) | | +| dbt Source | [Dataset](../../metamodel/entities/dataset.md) | Subtype `source` | +| dbt Seed | [Dataset](../../metamodel/entities/dataset.md) | Subtype `seed` | +| dbt Model | [Dataset](../../metamodel/entities/dataset.md) | Subtype `model` | +| dbt Snapshot | [Dataset](../../metamodel/entities/dataset.md) | Subtype `snapshot` | +| dbt Test | [Assertion](../../metamodel/entities/assertion.md) | | +| dbt Test Result | [Assertion Run Result](../../metamodel/entities/assertion.md) | | + +Note: + +1. It also generates lineage between the `dbt` nodes (e.g. ephemeral nodes that depend on other dbt sources) as well as lineage between the `dbt` nodes and the underlying (target) platform nodes (e.g. BigQuery Table -> dbt Source, dbt View -> BigQuery View). +2. We also support automated actions (like add a tag, term or owner) based on properties defined in dbt meta. + +## Module `dbt` + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ------------------------------ | +| Dataset Usage | ❌ | | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled via stateful ingestion | +| Table-Level Lineage | ✅ | Enabled by default | + +The artifacts used by this source are: + +- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json) + - This file contains model, source, tests and lineage data. +- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json) + - This file contains schema data. + - dbt does not record schema data for Ephemeral models, as such datahub will show Ephemeral models in the lineage, however there will be no associated schema for Ephemeral models +- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json) + - This file contains metadata for sources with freshness checks. + - We transfer dbt's freshness checks to DataHub's last-modified fields. + - Note that this file is optional – if not specified, we'll use time of ingestion instead as a proxy for time last-modified. +- [dbt run_results file](https://docs.getdbt.com/reference/artifacts/run-results-json) + - This file contains metadata from the result of a dbt run, e.g. dbt test + - When provided, we transfer dbt test run results into assertion run events to see a timeline of test runs on the dataset + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[dbt]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). 
+ +```yaml +source: + type: "dbt" + config: + # Coordinates + # To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project + manifest_path: "${DBT_PROJECT_ROOT}/target/manifest_file.json" + catalog_path: "${DBT_PROJECT_ROOT}/target/catalog_file.json" + sources_path: "${DBT_PROJECT_ROOT}/target/sources_file.json" # optional for freshness + test_results_path: "${DBT_PROJECT_ROOT}/target/run_results.json" # optional for recording dbt test results after running dbt test + + # Options + target_platform: "my_target_platform_id" # e.g. bigquery/postgres/etc. + +# sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
catalog_path 
string
| Path to dbt catalog JSON. See https://docs.getdbt.com/reference/artifacts/catalog-json Note this can be a local file or a URI. | +|
manifest_path 
string
| Path to dbt manifest JSON. See https://docs.getdbt.com/reference/artifacts/manifest-json Note this can be a local file or a URI. | +|
target_platform 
string
| The platform that dbt is loading onto. (e.g. bigquery / redshift / postgres etc.) | +|
column_meta_mapping
object
| mapping rules that will be executed against dbt column meta properties. Refer to the section below on dbt meta automated mappings.
Default: {}
| +|
convert_column_urns_to_lowercase
boolean
| When enabled, converts column URNs to lowercase to ensure cross-platform compatibility. If `target_platform` is Snowflake, the default is True.
Default: False
| +|
enable_meta_mapping
boolean
| When enabled, applies the mappings that are defined through the meta_mapping directives.
Default: True
| +|
enable_owner_extraction
boolean
| When enabled, ownership info will be extracted from the dbt meta
Default: True
| +|
enable_query_tag_mapping
boolean
| When enabled, applies the mappings that are defined through the `query_tag_mapping` directives.
Default: True
| +|
include_env_in_assertion_guid
boolean
| Prior to version 0.9.4.2, the assertion GUIDs did not include the environment. If you're using multiple dbt ingestion that are only distinguished by env, then you should set this flag to True.
Default: False
| +|
incremental_lineage
boolean
| When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: True
| +|
meta_mapping
object
| mapping rules that will be executed against dbt meta properties. Refer to the section below on dbt meta automated mappings.
Default: {}
| +|
owner_extraction_pattern
string
| Regex string to extract owner from the dbt node using the `(?P...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r"(?P(.*)): (\w+) (\w+)"` will extract `jdoe` as the owner from `"jdoe: John Doe"` (2) `r"@(?P(.*))"` will extract `alice` as the owner from `"@alice"`. | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
query_tag_mapping
object
| mapping rules that will be executed against dbt query_tag meta properties. Refer to the section below on dbt meta automated mappings.
Default: {}
| +|
sources_path
string
| Path to dbt sources JSON. See https://docs.getdbt.com/reference/artifacts/sources-json. If not specified, last-modified fields will not be populated. Note this can be a local file or a URI. | +|
sql_parser_use_external_process
boolean
| When enabled, sql parser will run in isolated in a separate process. This can affect processing time but can protect from sql parser's mem leak.
Default: False
| +|
strip_user_ids_from_email
boolean
| Whether or not to strip email id while adding owners using dbt meta actions.
Default: False
| +|
tag_prefix
string
| Prefix added to tags during ingestion.
Default: dbt:
| +|
target_platform_instance
string
| The platform instance for the platform that dbt is operating on. Use this if you have multiple instances of the same platform (e.g. redshift) and need to distinguish between them. | +|
test_results_path
string
| Path to output of dbt test run as run_results file in JSON format. See https://docs.getdbt.com/reference/artifacts/run-results-json. If not specified, test execution results will not be populated in DataHub. | +|
use_identifiers
boolean
| Use model identifier instead of model name if defined (if not, default to model name).
Default: False
| +|
write_semantics
string
| Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be "PATCH" or "OVERRIDE"
Default: PATCH
| +|
env
string
| Environment to use in namespace when constructing URNs.
Default: PROD
| +|
aws_connection
AwsConnectionConfig
| When fetching manifest files from s3, configuration for aws connection details | +|
aws_connection.aws_region 
string
| AWS region code. | +|
aws_connection.aws_access_key_id
string
| AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
aws_connection.aws_endpoint_url
string
| The AWS service endpoint. This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here. | +|
aws_connection.aws_profile
string
| Named AWS profile to use. Only used if access key / secret are unset. If not set the default will be used | +|
aws_connection.aws_proxy
map(str,string)
| | +|
aws_connection.aws_secret_access_key
string
| AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
aws_connection.aws_session_token
string
| AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
aws_connection.aws_role
One of string, union(anyOf), string, AwsAssumeRoleConfig
| AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role | +|
aws_connection.aws_role.RoleArn 
string
| ARN of the role to assume. | +|
aws_connection.aws_role.ExternalId
string
| External ID to use when assuming the role. | +|
entities_enabled
DBTEntitiesEnabled
| Controls for enabling / disabling metadata emission for different dbt entities (models, test definitions, test results, etc.)
Default: {'models': 'YES', 'sources': 'YES', 'seeds': 'YES'...
| +|
entities_enabled.models
Enum
| Emit metadata for dbt models when set to Yes or Only
Default: YES
| +|
entities_enabled.seeds
Enum
| Emit metadata for dbt seeds when set to Yes or Only
Default: YES
| +|
entities_enabled.snapshots
Enum
| Emit metadata for dbt snapshots when set to Yes or Only
Default: YES
| +|
entities_enabled.sources
Enum
| Emit metadata for dbt sources when set to Yes or Only
Default: YES
| +|
entities_enabled.test_definitions
Enum
| Emit metadata for test definitions when enabled when set to Yes or Only
Default: YES
| +|
entities_enabled.test_results
Enum
| Emit metadata for test results when set to Yes or Only
Default: YES
| +|
git_info
GitReference
| Reference to your git location to enable easy navigation from DataHub to your dbt files. | +|
git_info.repo 
string
| Name of your Git repo e.g. https://github.com/datahub-project/datahub or https://gitlab.com/gitlab-org/gitlab. If organization/repo is provided, we assume it is a GitHub repo. | +|
git_info.branch
string
| Branch on which your files live by default. Typically main or master. This can also be a commit hash.
Default: main
| +|
git_info.url_template
string
| Template for generating a URL to a file in the repo e.g. '{repo_url}/blob/{branch}/{file_path}'. We can infer this for GitHub and GitLab repos, and it is otherwise required.It supports the following variables: {repo_url}, {branch}, {file_path} | +|
node_name_pattern
AllowDenyPattern
| regex patterns for dbt model names to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
node_name_pattern.allow
array(string)
| | +|
node_name_pattern.deny
array(string)
| | +|
node_name_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| DBT Stateful Ingestion Config. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "DBTCoreConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "incremental_lineage": { + "title": "Incremental Lineage", + "description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.", + "default": true, + "type": "boolean" + }, + "sql_parser_use_external_process": { + "title": "Sql Parser Use External Process", + "description": "When enabled, sql parser will run in isolated in a separate process. This can affect processing time but can protect from sql parser's mem leak.", + "default": false, + "type": "boolean" + }, + "env": { + "title": "Env", + "description": "Environment to use in namespace when constructing URNs.", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "title": "Stateful Ingestion", + "description": "DBT Stateful Ingestion Config.", + "allOf": [ + { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + } + ] + }, + "target_platform": { + "title": "Target Platform", + "description": "The platform that dbt is loading onto. (e.g. bigquery / redshift / postgres etc.)", + "type": "string" + }, + "target_platform_instance": { + "title": "Target Platform Instance", + "description": "The platform instance for the platform that dbt is operating on. Use this if you have multiple instances of the same platform (e.g. redshift) and need to distinguish between them.", + "type": "string" + }, + "use_identifiers": { + "title": "Use Identifiers", + "description": "Use model identifier instead of model name if defined (if not, default to model name).", + "default": false, + "type": "boolean" + }, + "entities_enabled": { + "title": "Entities Enabled", + "description": "Controls for enabling / disabling metadata emission for different dbt entities (models, test definitions, test results, etc.)", + "default": { + "models": "YES", + "sources": "YES", + "seeds": "YES", + "snapshots": "YES", + "test_definitions": "YES", + "test_results": "YES" + }, + "allOf": [ + { + "$ref": "#/definitions/DBTEntitiesEnabled" + } + ] + }, + "tag_prefix": { + "title": "Tag Prefix", + "description": "Prefix added to tags during ingestion.", + "default": "dbt:", + "type": "string" + }, + "node_name_pattern": { + "title": "Node Name Pattern", + "description": "regex patterns for dbt model names to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "meta_mapping": { + "title": "Meta Mapping", + "description": "mapping rules that will be executed against dbt meta properties. Refer to the section below on dbt meta automated mappings.", + "default": {}, + "type": "object" + }, + "column_meta_mapping": { + "title": "Column Meta Mapping", + "description": "mapping rules that will be executed against dbt column meta properties. Refer to the section below on dbt meta automated mappings.", + "default": {}, + "type": "object" + }, + "query_tag_mapping": { + "title": "Query Tag Mapping", + "description": "mapping rules that will be executed against dbt query_tag meta properties. 
Refer to the section below on dbt meta automated mappings.", + "default": {}, + "type": "object" + }, + "write_semantics": { + "title": "Write Semantics", + "description": "Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be \"PATCH\" or \"OVERRIDE\"", + "default": "PATCH", + "type": "string" + }, + "strip_user_ids_from_email": { + "title": "Strip User Ids From Email", + "description": "Whether or not to strip email id while adding owners using dbt meta actions.", + "default": false, + "type": "boolean" + }, + "enable_owner_extraction": { + "title": "Enable Owner Extraction", + "description": "When enabled, ownership info will be extracted from the dbt meta", + "default": true, + "type": "boolean" + }, + "owner_extraction_pattern": { + "title": "Owner Extraction Pattern", + "description": "Regex string to extract owner from the dbt node using the `(?P...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r\"(?P(.*)): (\\w+) (\\w+)\"` will extract `jdoe` as the owner from `\"jdoe: John Doe\"` (2) `r\"@(?P(.*))\"` will extract `alice` as the owner from `\"@alice\"`.", + "type": "string" + }, + "include_env_in_assertion_guid": { + "title": "Include Env In Assertion Guid", + "description": "Prior to version 0.9.4.2, the assertion GUIDs did not include the environment. If you're using multiple dbt ingestion that are only distinguished by env, then you should set this flag to True.", + "default": false, + "type": "boolean" + }, + "convert_column_urns_to_lowercase": { + "title": "Convert Column Urns To Lowercase", + "description": "When enabled, converts column URNs to lowercase to ensure cross-platform compatibility. If `target_platform` is Snowflake, the default is True.", + "default": false, + "type": "boolean" + }, + "enable_meta_mapping": { + "title": "Enable Meta Mapping", + "description": "When enabled, applies the mappings that are defined through the meta_mapping directives.", + "default": true, + "type": "boolean" + }, + "enable_query_tag_mapping": { + "title": "Enable Query Tag Mapping", + "description": "When enabled, applies the mappings that are defined through the `query_tag_mapping` directives.", + "default": true, + "type": "boolean" + }, + "manifest_path": { + "title": "Manifest Path", + "description": "Path to dbt manifest JSON. See https://docs.getdbt.com/reference/artifacts/manifest-json Note this can be a local file or a URI.", + "type": "string" + }, + "catalog_path": { + "title": "Catalog Path", + "description": "Path to dbt catalog JSON. See https://docs.getdbt.com/reference/artifacts/catalog-json Note this can be a local file or a URI.", + "type": "string" + }, + "sources_path": { + "title": "Sources Path", + "description": "Path to dbt sources JSON. See https://docs.getdbt.com/reference/artifacts/sources-json. If not specified, last-modified fields will not be populated. Note this can be a local file or a URI.", + "type": "string" + }, + "test_results_path": { + "title": "Test Results Path", + "description": "Path to output of dbt test run as run_results file in JSON format. See https://docs.getdbt.com/reference/artifacts/run-results-json. 
If not specified, test execution results will not be populated in DataHub.", + "type": "string" + }, + "aws_connection": { + "title": "Aws Connection", + "description": "When fetching manifest files from s3, configuration for aws connection details", + "allOf": [ + { + "$ref": "#/definitions/AwsConnectionConfig" + } + ] + }, + "git_info": { + "title": "Git Info", + "description": "Reference to your git location to enable easy navigation from DataHub to your dbt files.", + "allOf": [ + { + "$ref": "#/definitions/GitReference" + } + ] + } + }, + "required": [ + "target_platform", + "manifest_path", + "catalog_path" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "EmitDirective": { + "title": "EmitDirective", + "description": "A holder for directives for emission for specific types of entities", + "enum": [ + "YES", + "NO", + "ONLY" + ] + }, + "DBTEntitiesEnabled": { + "title": "DBTEntitiesEnabled", + "description": "Controls which dbt entities are going to be emitted by this source", + "type": "object", + "properties": { + "models": { + "description": "Emit metadata for dbt models when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "sources": { + "description": "Emit metadata for dbt sources when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "seeds": { + "description": "Emit metadata for dbt seeds when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "snapshots": { + "description": "Emit metadata for dbt snapshots when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "test_definitions": { + "description": "Emit metadata for test definitions when enabled when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "test_results": { + "description": "Emit metadata for test results when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + } + }, + 
"additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AwsAssumeRoleConfig": { + "title": "AwsAssumeRoleConfig", + "type": "object", + "properties": { + "RoleArn": { + "title": "Rolearn", + "description": "ARN of the role to assume.", + "type": "string" + }, + "ExternalId": { + "title": "Externalid", + "description": "External ID to use when assuming the role.", + "type": "string" + } + }, + "required": [ + "RoleArn" + ] + }, + "AwsConnectionConfig": { + "title": "AwsConnectionConfig", + "description": "Common AWS credentials config.\n\nCurrently used by:\n - Glue source\n - SageMaker source\n - dbt source", + "type": "object", + "properties": { + "aws_access_key_id": { + "title": "Aws Access Key Id", + "description": "AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_secret_access_key": { + "title": "Aws Secret Access Key", + "description": "AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_session_token": { + "title": "Aws Session Token", + "description": "AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_role": { + "title": "Aws Role", + "description": "AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role", + "anyOf": [ + { + "type": "string" + }, + { + "type": "array", + "items": { + "anyOf": [ + { + "type": "string" + }, + { + "$ref": "#/definitions/AwsAssumeRoleConfig" + } + ] + } + } + ] + }, + "aws_profile": { + "title": "Aws Profile", + "description": "Named AWS profile to use. Only used if access key / secret are unset. If not set the default will be used", + "type": "string" + }, + "aws_region": { + "title": "Aws Region", + "description": "AWS region code.", + "type": "string" + }, + "aws_endpoint_url": { + "title": "Aws Endpoint Url", + "description": "The AWS service endpoint. This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here.", + "type": "string" + }, + "aws_proxy": { + "title": "Aws Proxy", + "description": "A set of proxy configs to use with AWS. 
See the [botocore.config](https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html) docs for details.", + "type": "object", + "additionalProperties": { + "type": "string" + } + } + }, + "required": [ + "aws_region" + ], + "additionalProperties": false + }, + "GitReference": { + "title": "GitReference", + "description": "Reference to a hosted Git repository. Used to generate \"view source\" links.", + "type": "object", + "properties": { + "repo": { + "title": "Repo", + "description": "Name of your Git repo e.g. https://github.com/datahub-project/datahub or https://gitlab.com/gitlab-org/gitlab. If organization/repo is provided, we assume it is a GitHub repo.", + "type": "string" + }, + "branch": { + "title": "Branch", + "description": "Branch on which your files live by default. Typically main or master. This can also be a commit hash.", + "default": "main", + "type": "string" + }, + "url_template": { + "title": "Url Template", + "description": "Template for generating a URL to a file in the repo e.g. '{repo_url}/blob/{branch}/{file_path}'. We can infer this for GitHub and GitLab repos, and it is otherwise required.It supports the following variables: {repo_url}, {branch}, {file_path}", + "type": "string" + } + }, + "required": [ + "repo" + ], + "additionalProperties": false + } + } +} +``` + + +
+ +### dbt meta automated mappings + +dbt allows authors to define meta properties for datasets. Checkout this link to know more - [dbt meta](https://docs.getdbt.com/reference/resource-configs/meta). Our dbt source allows users to define +actions such as add a tag, term or owner. For example if a dbt model has a meta config `"has_pii": True`, we can define an action +that evaluates if the property is set to true and add, lets say, a `pii` tag. +To leverage this feature we require users to define mappings as part of the recipe. The following section describes how you can build these mappings. Listed below is a `meta_mapping` and `column_meta_mapping` section that among other things, looks for keys like `business_owner` and adds owners that are listed there. + +```yaml +meta_mapping: + business_owner: + match: ".*" + operation: "add_owner" + config: + owner_type: user + owner_category: BUSINESS_OWNER + has_pii: + match: True + operation: "add_tag" + config: + tag: "has_pii_test" + int_property: + match: 1 + operation: "add_tag" + config: + tag: "int_meta_property" + double_property: + match: 2.5 + operation: "add_term" + config: + term: "double_meta_property" + data_governance.team_owner: + match: "Finance" + operation: "add_term" + config: + term: "Finance_test" + terms_list: + match: ".*" + operation: "add_terms" + config: + separator: "," +column_meta_mapping: + terms_list: + match: ".*" + operation: "add_terms" + config: + separator: "," + is_sensitive: + match: True + operation: "add_tag" + config: + tag: "sensitive" +``` + +We support the following operations: + +1. add_tag - Requires `tag` property in config. +2. add_term - Requires `term` property in config. +3. add_terms - Accepts an optional `separator` property in config. +4. add_owner - Requires `owner_type` property in config which can be either user or group. Optionally accepts the `owner_category` config property which you can set to one of `['TECHNICAL_OWNER', 'BUSINESS_OWNER', 'DATA_STEWARD', 'DATAOWNER'` (defaults to `DATAOWNER`). + +Note: + +1. The dbt `meta_mapping` config works at the model level, while the `column_meta_mapping` config works at the column level. The `add_owner` operation is not supported at the column level. +2. For string meta properties we support regex matching. + +With regex matching, you can also use the matched value to customize how you populate the tag, term or owner fields. Here are a few advanced examples: + +#### Data Tier - Bronze, Silver, Gold + +If your meta section looks like this: + +```yaml +meta: + data_tier: Bronze # chosen from [Bronze,Gold,Silver] +``` + +and you wanted to attach a glossary term like `urn:li:glossaryTerm:Bronze` for all the models that have this value in the meta section attached to them, the following meta_mapping section would achieve that outcome: + +```yaml +meta_mapping: + data_tier: + match: "Bronze|Silver|Gold" + operation: "add_term" + config: + term: "{{ $match }}" +``` + +to match any data_tier of Bronze, Silver or Gold and maps it to a glossary term with the same name. 
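+
+To make the substitution explicit, here is a minimal Python sketch of the matching semantics described above. It is only an illustration under the assumptions shown, not DataHub's actual implementation: the rule's regex is matched against the meta value, and the matched text replaces the `{{ $match }}` placeholder.
+
+```python
+import re
+
+# Illustrative only: evaluate a single meta_mapping rule of the shape shown above.
+rule = {
+    "match": "Bronze|Silver|Gold",
+    "operation": "add_term",
+    "config": {"term": "{{ $match }}"},
+}
+
+meta = {"data_tier": "Bronze"}
+
+value = str(meta["data_tier"])
+match = re.match(rule["match"], value)
+if match:
+    # Substitute the matched text into the template to get the glossary term name.
+    term = rule["config"]["term"].replace("{{ $match }}", match.group(0))
+    print(f"urn:li:glossaryTerm:{term}")  # -> urn:li:glossaryTerm:Bronze
+```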
+
+#### Case Numbers - create tags
+
+If your meta section looks like this:
+
+```yaml
+meta:
+  case: PLT-4678 # internal Case Number
+```
+
+and you want to generate tags that look like `case_4678` from this, you can use the following meta_mapping section:
+
+```yaml
+meta_mapping:
+  case:
+    match: "PLT-(.*)"
+    operation: "add_tag"
+    config:
+      tag: "case_{{ $match }}"
+```
+
+#### Stripping out leading @ sign
+
+You can also match specific groups within the value to extract subsets of the matched value. For example, if you have a meta section that looks like this:
+
+```yaml
+meta:
+  owner: "@finance-team"
+  business_owner: "@janet"
+```
+
+and you want to mark the finance-team as a group that owns the dataset (skipping the leading @ sign), while marking janet as an individual user (again, skipping the leading @ sign) that owns the dataset, you can use the following meta_mapping section:
+
+```yaml
+meta_mapping:
+  owner:
+    match: "^@(.*)"
+    operation: "add_owner"
+    config:
+      owner_type: group
+  business_owner:
+    match: "^@(?P<owner>(.*))"
+    operation: "add_owner"
+    config:
+      owner_type: user
+      owner_category: BUSINESS_OWNER
+```
+
+In the examples above, we show two ways of writing the matching regexes. In the first one, `^@(.*)`, the first matching group (a.k.a. `match.group(1)`) is automatically inferred. In the second example, `^@(?P<owner>(.*))`, we use a named matching group (called `owner`, since we are matching an owner) to capture the string we want to provide to the ownership URN.
+
+### dbt query_tag automated mappings
+
+This works similarly to the dbt meta mapping, but for dbt query tags.
+
+We support the following operations:
+
+1. add_tag - Requires `tag` property in config.
+
+The example below adds the value of the query tag's `tag` key as a global tag.
+
+```yaml
+query_tag_mapping:
+  tag:
+    match: ".*"
+    operation: "add_tag"
+    config:
+      tag: "{{ $match }}"
+```
+
+### Integrating with dbt test
+
+To integrate with dbt tests, the `dbt` source needs access to the `run_results.json` file generated after a `dbt test` execution. Typically, this is written to the `target` directory. A common pattern you can follow is:
+
+1. Run `dbt docs generate` and upload `manifest.json` and `catalog.json` to a location accessible to the `dbt` source (e.g. s3 or local file system)
+2. Run `dbt test` and upload `run_results.json` to a location accessible to the `dbt` source (e.g. s3 or local file system)
+3. Run `datahub ingest -c dbt_recipe.dhub.yaml` with the following config parameter specified:
+   - `test_results_path`: pointing to the `run_results.json` file that you just created
+
+The connector will produce the following:
+
+- Assertion definitions that are attached to the dataset (or datasets)
+- Results from running the tests, attached to the timeline of the dataset
+
+#### View of dbt tests for a dataset
+
+![test view](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/dbt-tests-view.png)
+
+#### Viewing the SQL for a dbt test
+
+![test logic view](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/dbt-test-logic-view.png)
+
+#### Viewing timeline for a failed dbt test
+
+![test view](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/dbt-tests-failure-view.png)
+
+#### Separating test result emission from other metadata emission
+
+You can separate the emission of test results from the emission of other dbt metadata using the `entities_enabled` config flag.
+The following recipe shows you how to emit only test results.
+ +```yaml +source: + type: dbt + config: + manifest_path: _path_to_manifest_json + catalog_path: _path_to_catalog_json + test_results_path: _path_to_run_results_json + target_platform: postgres + entities_enabled: + test_results: Only +``` + +Similarly, the following recipe shows you how to emit everything (i.e. models, sources, seeds, test definitions) but not test results: + +```yaml +source: + type: dbt + config: + manifest_path: _path_to_manifest_json + catalog_path: _path_to_catalog_json + run_results_path: _path_to_run_results_json + target_platform: postgres + entities_enabled: + test_results: No +``` + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.dbt.dbt_core.DBTCoreSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_core.py) + +## Module `dbt-cloud` + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ------------------------------ | +| Dataset Usage | ❌ | | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled via stateful ingestion | +| Table-Level Lineage | ✅ | Enabled by default | + +This source pulls dbt metadata directly from the dbt Cloud APIs. + +You'll need to have a dbt Cloud job set up to run your dbt project, and "Generate docs on run" should be enabled. + +The token should have the "read metadata" permission. + +To get the required IDs, go to the job details page (this is the one with the "Run History" table), and look at the URL. +It should look something like this: https://cloud.getdbt.com/next/deploy/107298/projects/175705/jobs/148094. +In this example, the account ID is 107298, the project ID is 175705, and the job ID is 148094. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[dbt-cloud]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "dbt-cloud" + config: + token: ${DBT_CLOUD_TOKEN} + + # In the URL https://cloud.getdbt.com/next/deploy/107298/projects/175705/jobs/148094, + # 107298 is the account_id, 175705 is the project_id, and 148094 is the job_id + + account_id: # set to your dbt cloud account id + project_id: # set to your dbt cloud project id + job_id: # set to your dbt cloud job id + run_id: # set to your dbt cloud run id. This is optional, and defaults to the latest run + + target_platform: postgres + + # Options + target_platform: "my_target_platform_id" # e.g. bigquery/postgres/etc. + +# sink configs + +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
account_id 
integer
| The DBT Cloud account ID to use. | +|
job_id 
integer
| The ID of the job to ingest metadata from. | +|
project_id 
integer
| The dbt Cloud project ID to use. | +|
target_platform 
string
| The platform that dbt is loading onto. (e.g. bigquery / redshift / postgres etc.) | +|
token 
string
| The API token to use to authenticate with DBT Cloud. | +|
column_meta_mapping
object
| mapping rules that will be executed against dbt column meta properties. Refer to the section below on dbt meta automated mappings.
Default: {}
| +|
convert_column_urns_to_lowercase
boolean
| When enabled, converts column URNs to lowercase to ensure cross-platform compatibility. If `target_platform` is Snowflake, the default is True.
Default: False
| +|
enable_meta_mapping
boolean
| When enabled, applies the mappings that are defined through the meta_mapping directives.
Default: True
| +|
enable_owner_extraction
boolean
| When enabled, ownership info will be extracted from the dbt meta
Default: True
| +|
enable_query_tag_mapping
boolean
| When enabled, applies the mappings that are defined through the `query_tag_mapping` directives.
Default: True
| +|
include_env_in_assertion_guid
boolean
| Prior to version 0.9.4.2, the assertion GUIDs did not include the environment. If you're using multiple dbt ingestion that are only distinguished by env, then you should set this flag to True.
Default: False
| +|
incremental_lineage
boolean
| When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: True
| +|
meta_mapping
object
| mapping rules that will be executed against dbt meta properties. Refer to the section below on dbt meta automated mappings.
Default: {}
| +|
metadata_endpoint
string
| The dbt Cloud metadata API endpoint.
Default: https://metadata.cloud.getdbt.com/graphql
| +|
owner_extraction_pattern
string
| Regex string to extract owner from the dbt node using the `(?P...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r"(?P(.*)): (\w+) (\w+)"` will extract `jdoe` as the owner from `"jdoe: John Doe"` (2) `r"@(?P(.*))"` will extract `alice` as the owner from `"@alice"`. | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
query_tag_mapping
object
| mapping rules that will be executed against dbt query_tag meta properties. Refer to the section below on dbt meta automated mappings.
Default: {}
| +|
run_id
integer
| The ID of the run to ingest metadata from. If not specified, we'll default to the latest run. | +|
sql_parser_use_external_process
boolean
| When enabled, sql parser will run in isolated in a separate process. This can affect processing time but can protect from sql parser's mem leak.
Default: False
| +|
strip_user_ids_from_email
boolean
| Whether or not to strip email id while adding owners using dbt meta actions.
Default: False
| +|
tag_prefix
string
| Prefix added to tags during ingestion.
Default: dbt:
| +|
target_platform_instance
string
| The platform instance for the platform that dbt is operating on. Use this if you have multiple instances of the same platform (e.g. redshift) and need to distinguish between them. | +|
use_identifiers
boolean
| Use model identifier instead of model name if defined (if not, default to model name).
Default: False
| +|
write_semantics
string
| Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be "PATCH" or "OVERRIDE"
Default: PATCH
| +|
env
string
| Environment to use in namespace when constructing URNs.
Default: PROD
| +|
entities_enabled
DBTEntitiesEnabled
| Controls for enabling / disabling metadata emission for different dbt entities (models, test definitions, test results, etc.)
Default: {'models': 'YES', 'sources': 'YES', 'seeds': 'YES'...
| +|
entities_enabled.models
Enum
| Emit metadata for dbt models when set to Yes or Only
Default: YES
| +|
entities_enabled.seeds
Enum
| Emit metadata for dbt seeds when set to Yes or Only
Default: YES
| +|
entities_enabled.snapshots
Enum
| Emit metadata for dbt snapshots when set to Yes or Only
Default: YES
| +|
entities_enabled.sources
Enum
| Emit metadata for dbt sources when set to Yes or Only
Default: YES
| +|
entities_enabled.test_definitions
Enum
| Emit metadata for test definitions when enabled when set to Yes or Only
Default: YES
| +|
entities_enabled.test_results
Enum
| Emit metadata for test results when set to Yes or Only
Default: YES
| +|
node_name_pattern
AllowDenyPattern
| regex patterns for dbt model names to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
node_name_pattern.allow
array(string)
| | +|
node_name_pattern.deny
array(string)
| | +|
node_name_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| DBT Stateful Ingestion Config. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "DBTCloudConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "incremental_lineage": { + "title": "Incremental Lineage", + "description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.", + "default": true, + "type": "boolean" + }, + "sql_parser_use_external_process": { + "title": "Sql Parser Use External Process", + "description": "When enabled, sql parser will run in isolated in a separate process. This can affect processing time but can protect from sql parser's mem leak.", + "default": false, + "type": "boolean" + }, + "env": { + "title": "Env", + "description": "Environment to use in namespace when constructing URNs.", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "title": "Stateful Ingestion", + "description": "DBT Stateful Ingestion Config.", + "allOf": [ + { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + } + ] + }, + "target_platform": { + "title": "Target Platform", + "description": "The platform that dbt is loading onto. (e.g. bigquery / redshift / postgres etc.)", + "type": "string" + }, + "target_platform_instance": { + "title": "Target Platform Instance", + "description": "The platform instance for the platform that dbt is operating on. Use this if you have multiple instances of the same platform (e.g. redshift) and need to distinguish between them.", + "type": "string" + }, + "use_identifiers": { + "title": "Use Identifiers", + "description": "Use model identifier instead of model name if defined (if not, default to model name).", + "default": false, + "type": "boolean" + }, + "entities_enabled": { + "title": "Entities Enabled", + "description": "Controls for enabling / disabling metadata emission for different dbt entities (models, test definitions, test results, etc.)", + "default": { + "models": "YES", + "sources": "YES", + "seeds": "YES", + "snapshots": "YES", + "test_definitions": "YES", + "test_results": "YES" + }, + "allOf": [ + { + "$ref": "#/definitions/DBTEntitiesEnabled" + } + ] + }, + "tag_prefix": { + "title": "Tag Prefix", + "description": "Prefix added to tags during ingestion.", + "default": "dbt:", + "type": "string" + }, + "node_name_pattern": { + "title": "Node Name Pattern", + "description": "regex patterns for dbt model names to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "meta_mapping": { + "title": "Meta Mapping", + "description": "mapping rules that will be executed against dbt meta properties. Refer to the section below on dbt meta automated mappings.", + "default": {}, + "type": "object" + }, + "column_meta_mapping": { + "title": "Column Meta Mapping", + "description": "mapping rules that will be executed against dbt column meta properties. Refer to the section below on dbt meta automated mappings.", + "default": {}, + "type": "object" + }, + "query_tag_mapping": { + "title": "Query Tag Mapping", + "description": "mapping rules that will be executed against dbt query_tag meta properties. 
Refer to the section below on dbt meta automated mappings.", + "default": {}, + "type": "object" + }, + "write_semantics": { + "title": "Write Semantics", + "description": "Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be \"PATCH\" or \"OVERRIDE\"", + "default": "PATCH", + "type": "string" + }, + "strip_user_ids_from_email": { + "title": "Strip User Ids From Email", + "description": "Whether or not to strip email id while adding owners using dbt meta actions.", + "default": false, + "type": "boolean" + }, + "enable_owner_extraction": { + "title": "Enable Owner Extraction", + "description": "When enabled, ownership info will be extracted from the dbt meta", + "default": true, + "type": "boolean" + }, + "owner_extraction_pattern": { + "title": "Owner Extraction Pattern", + "description": "Regex string to extract owner from the dbt node using the `(?P...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r\"(?P(.*)): (\\w+) (\\w+)\"` will extract `jdoe` as the owner from `\"jdoe: John Doe\"` (2) `r\"@(?P(.*))\"` will extract `alice` as the owner from `\"@alice\"`.", + "type": "string" + }, + "include_env_in_assertion_guid": { + "title": "Include Env In Assertion Guid", + "description": "Prior to version 0.9.4.2, the assertion GUIDs did not include the environment. If you're using multiple dbt ingestion that are only distinguished by env, then you should set this flag to True.", + "default": false, + "type": "boolean" + }, + "convert_column_urns_to_lowercase": { + "title": "Convert Column Urns To Lowercase", + "description": "When enabled, converts column URNs to lowercase to ensure cross-platform compatibility. If `target_platform` is Snowflake, the default is True.", + "default": false, + "type": "boolean" + }, + "enable_meta_mapping": { + "title": "Enable Meta Mapping", + "description": "When enabled, applies the mappings that are defined through the meta_mapping directives.", + "default": true, + "type": "boolean" + }, + "enable_query_tag_mapping": { + "title": "Enable Query Tag Mapping", + "description": "When enabled, applies the mappings that are defined through the `query_tag_mapping` directives.", + "default": true, + "type": "boolean" + }, + "metadata_endpoint": { + "title": "Metadata Endpoint", + "description": "The dbt Cloud metadata API endpoint.", + "default": "https://metadata.cloud.getdbt.com/graphql", + "type": "string" + }, + "token": { + "title": "Token", + "description": "The API token to use to authenticate with DBT Cloud.", + "type": "string" + }, + "account_id": { + "title": "Account Id", + "description": "The DBT Cloud account ID to use.", + "type": "integer" + }, + "project_id": { + "title": "Project Id", + "description": "The dbt Cloud project ID to use.", + "type": "integer" + }, + "job_id": { + "title": "Job Id", + "description": "The ID of the job to ingest metadata from.", + "type": "integer" + }, + "run_id": { + "title": "Run Id", + "description": "The ID of the run to ingest metadata from. 
If not specified, we'll default to the latest run.", + "type": "integer" + } + }, + "required": [ + "target_platform", + "token", + "account_id", + "project_id", + "job_id" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "EmitDirective": { + "title": "EmitDirective", + "description": "A holder for directives for emission for specific types of entities", + "enum": [ + "YES", + "NO", + "ONLY" + ] + }, + "DBTEntitiesEnabled": { + "title": "DBTEntitiesEnabled", + "description": "Controls which dbt entities are going to be emitted by this source", + "type": "object", + "properties": { + "models": { + "description": "Emit metadata for dbt models when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "sources": { + "description": "Emit metadata for dbt sources when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "seeds": { + "description": "Emit metadata for dbt seeds when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "snapshots": { + "description": "Emit metadata for dbt snapshots when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "test_definitions": { + "description": "Emit metadata for test definitions when enabled when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + }, + "test_results": { + "description": "Emit metadata for test results when set to Yes or Only", + "default": "YES", + "allOf": [ + { + "$ref": "#/definitions/EmitDirective" + } + ] + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to 
exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + } + } +} +``` + + +
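+
+To make the required fields in the schema above concrete, here is a minimal sketch of a dbt Cloud recipe. The token reference, account/project/job IDs, target platform, and the owner-extraction regex are illustrative placeholders, not values taken from this page.
+
+```yaml
+source:
+  type: dbt-cloud
+  config:
+    token: "${DBT_CLOUD_TOKEN}" # placeholder: pass the API token via an env var or secret store
+    account_id: 11111 # placeholder dbt Cloud account ID
+    project_id: 22222 # placeholder dbt Cloud project ID
+    job_id: 33333 # placeholder dbt Cloud job ID
+    target_platform: snowflake # platform the dbt models are materialized on
+    # Optional: pull owners out of dbt meta, e.g. "@alice" -> alice
+    enable_owner_extraction: true
+    owner_extraction_pattern: '@(?P<owner>(.*))'
+
+sink:
+  # sink configs
+```
+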
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.dbt.dbt_cloud.DBTCloudSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_cloud.py) + +

Questions

+ +If you've got any questions on configuring ingestion for dbt, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/delta-lake.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/delta-lake.md new file mode 100644 index 0000000000000..6ae42fcd5dc2b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/delta-lake.md @@ -0,0 +1,493 @@ +--- +sidebar_position: 9 +title: Delta Lake +slug: /generated/ingestion/sources/delta-lake +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/delta-lake.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Delta Lake + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| ------------ | ------ | -------------------------------------------- | +| Extract Tags | ✅ | Can extract S3 object/bucket tags if enabled | + +This plugin extracts: + +- Column types and schema associated with each delta table +- Custom properties: number_of_files, partition_columns, table_creation_time, location, version etc. + +:::caution + +If you are ingesting datasets from AWS S3, we recommend running the ingestion on a server in the same region to avoid high egress costs. + +::: + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[delta-lake]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: delta-lake + config: + env: "PROD" + platform_instance: "my-delta-lake" + base_path: "/path/to/data/folder" + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
base_path 
string
| Path to table (s3 or local file system). If path is not a delta table path then all subfolders will be scanned to detect and ingest delta tables. | +|
platform
string
| The platform that this source connects to
Default: delta-lake
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
relative_path
string
| If set, delta-tables will be searched at location '/' and URNs will be created using relative_path only. | +|
require_files
boolean
| Whether DeltaTable should track files. Consider setting this to `False` for large delta tables, which significantly reduces memory use during ingestion. When set to `False`, number_of_files in the delta table cannot be reported.
Default: True
| +|
version_history_lookback
integer
| Number of previous version histories to be ingested. Defaults to 1. If set to -1 all version history will be ingested.
Default: 1
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
s3
S3
| | +|
s3.use_s3_bucket_tags
boolean
| Whether or not to create tags in datahub from the s3 bucket
Default: False
| +|
s3.use_s3_object_tags
boolean
| Whether or not to create tags in datahub from the s3 object
Default: False
| +|
s3.aws_config
AwsConnectionConfig
| AWS configuration | +|
s3.aws_config.aws_region 
string
| AWS region code. | +|
s3.aws_config.aws_access_key_id
string
| AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
s3.aws_config.aws_endpoint_url
string
| The AWS service endpoint. This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here. | +|
s3.aws_config.aws_profile
string
| Named AWS profile to use. Only used if access key / secret are unset. If not set the default will be used | +|
s3.aws_config.aws_proxy
map(str,string)
| | +|
s3.aws_config.aws_secret_access_key
string
| AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
s3.aws_config.aws_session_token
string
| AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
s3.aws_config.aws_role
One of string, union(anyOf), string, AwsAssumeRoleConfig
| AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role | +|
s3.aws_config.aws_role.RoleArn 
string
| ARN of the role to assume. | +|
s3.aws_config.aws_role.ExternalId
string
| External ID to use when assuming the role. | +|
table_pattern
AllowDenyPattern
| regex patterns for tables to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "DeltaLakeSourceConfig", + "description": "Any source that connects to a platform should inherit this class", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "base_path": { + "title": "Base Path", + "description": "Path to table (s3 or local file system). If path is not a delta table path then all subfolders will be scanned to detect and ingest delta tables.", + "type": "string" + }, + "relative_path": { + "title": "Relative Path", + "description": "If set, delta-tables will be searched at location '/' and URNs will be created using relative_path only.", + "type": "string" + }, + "platform": { + "title": "Platform", + "description": "The platform that this source connects to", + "default": "delta-lake", + "const": "delta-lake", + "type": "string" + }, + "table_pattern": { + "title": "Table Pattern", + "description": "regex patterns for tables to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "version_history_lookback": { + "title": "Version History Lookback", + "description": "Number of previous version histories to be ingested. Defaults to 1. If set to -1 all version history will be ingested.", + "default": 1, + "type": "integer" + }, + "require_files": { + "title": "Require Files", + "description": "Whether DeltaTable should track files. 
Consider setting this to `False` for large delta tables, resulting in significant memory reduction for ingestion process.When set to `False`, number_of_files in delta table can not be reported.", + "default": true, + "type": "boolean" + }, + "s3": { + "$ref": "#/definitions/S3" + } + }, + "required": [ + "base_path" + ], + "additionalProperties": false, + "definitions": { + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AwsAssumeRoleConfig": { + "title": "AwsAssumeRoleConfig", + "type": "object", + "properties": { + "RoleArn": { + "title": "Rolearn", + "description": "ARN of the role to assume.", + "type": "string" + }, + "ExternalId": { + "title": "Externalid", + "description": "External ID to use when assuming the role.", + "type": "string" + } + }, + "required": [ + "RoleArn" + ] + }, + "AwsConnectionConfig": { + "title": "AwsConnectionConfig", + "description": "Common AWS credentials config.\n\nCurrently used by:\n - Glue source\n - SageMaker source\n - dbt source", + "type": "object", + "properties": { + "aws_access_key_id": { + "title": "Aws Access Key Id", + "description": "AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_secret_access_key": { + "title": "Aws Secret Access Key", + "description": "AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_session_token": { + "title": "Aws Session Token", + "description": "AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_role": { + "title": "Aws Role", + "description": "AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role", + "anyOf": [ + { + "type": "string" + }, + { + "type": "array", + "items": { + "anyOf": [ + { + "type": "string" + }, + { + "$ref": "#/definitions/AwsAssumeRoleConfig" + } + ] + } + } + ] + }, + "aws_profile": { + "title": "Aws Profile", + "description": "Named AWS profile to use. Only used if access key / secret are unset. If not set the default will be used", + "type": "string" + }, + "aws_region": { + "title": "Aws Region", + "description": "AWS region code.", + "type": "string" + }, + "aws_endpoint_url": { + "title": "Aws Endpoint Url", + "description": "The AWS service endpoint. 
This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here.", + "type": "string" + }, + "aws_proxy": { + "title": "Aws Proxy", + "description": "A set of proxy configs to use with AWS. See the [botocore.config](https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html) docs for details.", + "type": "object", + "additionalProperties": { + "type": "string" + } + } + }, + "required": [ + "aws_region" + ], + "additionalProperties": false + }, + "S3": { + "title": "S3", + "type": "object", + "properties": { + "aws_config": { + "title": "Aws Config", + "description": "AWS configuration", + "allOf": [ + { + "$ref": "#/definitions/AwsConnectionConfig" + } + ] + }, + "use_s3_bucket_tags": { + "title": "Use S3 Bucket Tags", + "description": "Whether or not to create tags in datahub from the s3 bucket", + "default": false, + "type": "boolean" + }, + "use_s3_object_tags": { + "title": "Use S3 Object Tags", + "description": "# Whether or not to create tags in datahub from the s3 object", + "default": false, + "type": "boolean" + } + }, + "additionalProperties": false + } + } +} +``` + + +
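+
+As a companion to the schema above, the sketch below shows how the nested `s3.aws_config` options can be written in a recipe using the `.`-style nesting described earlier. The bucket path, region, role ARN, and external ID are placeholder assumptions.
+
+```yaml
+source:
+  type: delta-lake
+  config:
+    env: "PROD"
+    base_path: "s3://my-bucket/path/to/delta-tables" # placeholder bucket/prefix
+    require_files: false # reduces memory use for very large delta tables
+    version_history_lookback: 5
+    s3:
+      use_s3_bucket_tags: true
+      use_s3_object_tags: true
+      aws_config:
+        aws_region: "us-east-1" # placeholder region
+        # aws_role accepts a plain role ARN string, or a list mixing ARN strings
+        # and objects with RoleArn/ExternalId, as in the schema above.
+        aws_role:
+          - RoleArn: "arn:aws:iam::123456789012:role/datahub-read" # placeholder ARN
+            ExternalId: "datahub" # placeholder external ID
+
+sink:
+  # sink configs
+```
+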
+ +## Usage Guide + +If you are new to [Delta Lake](https://delta.io/) and want to test out a simple integration with Delta Lake and DataHub, you can follow this guide. + +### Delta Table on Local File System + +#### Step 1 + +Create a delta table using the sample PySpark code below if you don't have a delta table you can point to. + +```python +import uuid +import random +from pyspark.sql import SparkSession +from delta.tables import DeltaTable + +def generate_data(): + return [(y, m, d, str(uuid.uuid4()), str(random.randrange(10000) % 26 + 65) * 3, random.random()*10000) + for d in range(1, 29) + for m in range(1, 13) + for y in range(2000, 2021)] + +jar_packages = ["org.apache.hadoop:hadoop-aws:3.2.3", "io.delta:delta-core_2.12:1.2.1"] +spark = SparkSession.builder \ + .appName("quickstart") \ + .master("local[*]") \ + .config("spark.jars.packages", ",".join(jar_packages)) \ + .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \ + .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \ + .getOrCreate() + +table_path = "quickstart/my-table" +columns = ["year", "month", "day", "sale_id", "customer", "total_cost"] +spark.sparkContext.parallelize(generate_data()).toDF(columns).repartition(1).write.format("delta").save(table_path) + +df = spark.read.format("delta").load(table_path) +df.show() + +``` + +#### Step 2 + +Create a datahub ingestion yaml file (delta.dhub.yaml) to ingest metadata from the delta table you just created. + +```yaml +source: + type: "delta-lake" + config: + base_path: "quickstart/my-table" + +sink: + type: "datahub-rest" + config: + server: "http://localhost:8080" +``` + +Note: Make sure you run the Spark code as well as recipe from same folder otherwise use absolute paths. + +#### Step 3 + +Execute the ingestion recipe: + +```shell +datahub ingest -c delta.dhub.yaml +``` + +### Delta Table on S3 + +#### Step 1 + +Set up your AWS credentials by creating an AWS credentials config file; typically in '$HOME/.aws/credentials'. + +``` +[my-creds] +aws_access_key_id: ###### +aws_secret_access_key: ###### +``` + +Step 2: Create a Delta Table using the PySpark sample code below unless you already have Delta Tables on your S3. 
+ +```python +from pyspark.sql import SparkSession +from delta.tables import DeltaTable +from configparser import ConfigParser +import uuid +import random +def generate_data(): + return [(y, m, d, str(uuid.uuid4()), str(random.randrange(10000) % 26 + 65) * 3, random.random()*10000) + for d in range(1, 29) + for m in range(1, 13) + for y in range(2000, 2021)] + +jar_packages = ["org.apache.hadoop:hadoop-aws:3.2.3", "io.delta:delta-core_2.12:1.2.1"] +spark = SparkSession.builder \ + .appName("quickstart") \ + .master("local[*]") \ + .config("spark.jars.packages", ",".join(jar_packages)) \ + .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \ + .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \ + .getOrCreate() + + +config_object = ConfigParser() +config_object.read("$HOME/.aws/credentials") +profile_info = config_object["my-creds"] +access_id = profile_info["aws_access_key_id"] +access_key = profile_info["aws_secret_access_key"] + +hadoop_conf = spark._jsc.hadoopConfiguration() +hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") +hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") +hadoop_conf.set("fs.s3a.access.key", access_id) +hadoop_conf.set("fs.s3a.secret.key", access_key) + +table_path = "s3a://my-bucket/my-folder/sales-table" +columns = ["year", "month", "day", "sale_id", "customer", "total_cost"] +spark.sparkContext.parallelize(generate_data()).toDF(columns).repartition(1).write.format("delta").save(table_path) +df = spark.read.format("delta").load(table_path) +df.show() + +``` + +#### Step 3 + +Create a datahub ingestion yaml file (delta.s3.dhub.yaml) to ingest metadata from the delta table you just created. + +```yml +source: + type: "delta-lake" + config: + base_path: "s3://my-bucket/my-folder/sales-table" + s3: + aws_config: + aws_access_key_id: <> + aws_secret_access_key: <> + +sink: + type: "datahub-rest" + config: + server: "http://localhost:8080" +``` + +#### Step 4 + +Execute the ingestion recipe: + +```shell +datahub ingest -c delta.s3.dhub.yaml +``` + +### Note + +The above recipes are minimal recipes. Please refer to [Config Details](#config-details) section for the full configuration. + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.delta_lake.source.DeltaLakeSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/delta_lake/source.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Delta Lake, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/demo-data.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/demo-data.md new file mode 100644 index 0000000000000..b70166405458b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/demo-data.md @@ -0,0 +1,71 @@ +--- +sidebar_position: 10 +title: Demo Data +slug: /generated/ingestion/sources/demo-data +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/demo-data.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Demo Data + +This source loads sample data into DataHub. It is intended for demo and testing purposes only. + +### CLI based Ingestion + +#### Install the Plugin + +The `demo-data` source works out of the box with `acryl-datahub`. + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: demo-data + config: {} +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :---- | :---------- | + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "DemoDataConfig", + "type": "object", + "properties": {}, + "additionalProperties": false +} +``` + + +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.demo_data.DemoDataSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/demo_data.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Demo Data, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/druid.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/druid.md new file mode 100644 index 0000000000000..95f1e0980b281 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/druid.md @@ -0,0 +1,599 @@ +--- +sidebar_position: 11 +title: Druid +slug: /generated/ingestion/sources/druid +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/druid.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Druid + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| --------------------------------------------------- | ------ | ------------------ | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | + +This plugin extracts the following: + +- Metadata for databases, schemas, and tables +- Column types associated with each table +- Table, row, and column statistics via optional SQL profiling. + +**Note**: It is important to explicitly define the deny schema pattern for internal Druid databases (lookup & sys) if adding a schema pattern. Otherwise, the crawler may crash before processing relevant databases. This deny pattern is defined by default but is overriden by user-submitted configurations. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[druid]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: druid + config: + # Coordinates + host_port: "localhost:8082" + + # Credentials + username: admin + password: password + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
host_port 
string
| host URL | +|
database
string
| database (catalog) | +|
database_alias
string
| [Deprecated] Alias to apply to database when ingesting. | +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
include_views
boolean
| Whether views should be ingested.
Default: True
| +|
options
object
| Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. | +|
password
string(password)
| password | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
scheme
string
|
Default: druid
| +|
sqlalchemy_uri
string
| URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. | +|
username
string
| username | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| regex patterns for schemas to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': ['^(lookup|sysgit|view)....
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile a table only if it has been updated within this many days. If set to `null`, there is no last-modified-time constraint on which tables are profiled. Supported only in `snowflake` and `BigQuery`. |
+|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less than the specified count. If set to `null`, there is no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less than the specified number of GBs. If set to `null`, there is no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "DruidConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs.", + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "regex patterns for schemas to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [ + "^(lookup|sysgit|view).*" + ], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. 
There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_views": { + "title": "Include Views", + "description": "Whether views should be ingested.", + "default": true, + "type": "boolean" + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "username": { + "title": "Username", + "description": "username", + "type": "string" + }, + "password": { + "title": "Password", + "description": "password", + "type": "string", + "writeOnly": true, + "format": "password" + }, + "host_port": { + "title": "Host Port", + "description": "host URL", + "type": "string" + }, + "database": { + "title": "Database", + "description": "database (catalog)", + "type": "string" + }, + "database_alias": { + "title": "Database Alias", + "description": "[Deprecated] Alias to apply to database when ingesting.", + "type": "string" + }, + "scheme": { + "title": "Scheme", + "default": "druid", + "type": "string" + }, + "sqlalchemy_uri": { + "title": "Sqlalchemy Uri", + "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.", + "type": "string" + } + }, + "required": [ + "host_port" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. 
This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. 
Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + } + } +} +``` + + +
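+
+The note at the top of this page recommends keeping the internal Druid databases denied whenever you supply your own `schema_pattern`, because a user-provided pattern replaces the default deny list. A minimal sketch (credentials and the allow pattern are placeholders):
+
+```yaml
+source:
+  type: druid
+  config:
+    host_port: "localhost:8082"
+    username: admin # placeholder credentials
+    password: password
+    schema_pattern:
+      allow:
+        - "druid.*" # placeholder: schemas you actually want to ingest
+      deny:
+        - "^(lookup|sysgit|view).*" # re-state the default deny so internal databases stay excluded
+    profiling:
+      enabled: true
+      profile_table_level_only: true # start with table-level stats only
+
+sink:
+  # sink configs
+```
+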
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.sql.druid.DruidSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/druid.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Druid, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/elasticsearch.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/elasticsearch.md new file mode 100644 index 0000000000000..58af9e4f49ad0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/elasticsearch.md @@ -0,0 +1,324 @@ +--- +sidebar_position: 12 +title: Elasticsearch +slug: /generated/ingestion/sources/elasticsearch +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/elasticsearch.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Elasticsearch + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| --------------------------------------------------- | ------ | ------------------ | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | + +This plugin extracts the following: + +- Metadata for indexes +- Column types associated with each index field + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[elasticsearch]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "elasticsearch" + config: + # Coordinates + host: "localhost:9200" + + # Credentials + username: user # optional + password: pass # optional + + # SSL support + use_ssl: False + verify_certs: False + ca_certs: "./path/ca.cert" + client_cert: "./path/client.cert" + client_key: "./path/client.key" + ssl_assert_hostname: False + ssl_assert_fingerprint: "./path/cert.fingerprint" + + # Options + url_prefix: "" # optional url_prefix + env: "PROD" + index_pattern: + allow: [".*some_index_name_pattern*"] + deny: [".*skip_index_name_pattern*"] + ingest_index_templates: False + index_template_pattern: + allow: [".*some_index_template_name_pattern*"] + +sink: +# sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
ca_certs
string
| Path to a certificate authority (CA) certificate. | +|
client_cert
string
| Path to the file containing the private key and the certificate, or cert only if using client_key. | +|
client_key
string
| Path to the file containing the private key if using separate cert and key files. | +|
host
string
| The elastic search host URI.
Default: localhost:9200
| +|
ingest_index_templates
boolean
| Ingests ES index templates if enabled.
Default: False
| +|
password
string
| The password credential. | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
ssl_assert_fingerprint
string
| Verify the supplied certificate fingerprint if not None. | +|
ssl_assert_hostname
boolean
| Use hostname verification if not False.
Default: False
| +|
url_prefix
string
| There are cases where an enterprise has multiple Elasticsearch clusters. One way to manage them is to expose a single endpoint for all clusters and use url_prefix to route requests to the correct cluster.
Default:
| +|
use_ssl
boolean
| Whether to use SSL for the connection or not.
Default: False
| +|
username
string
| The username credential. | +|
verify_certs
boolean
| Whether to verify SSL certificates.
Default: False
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
collapse_urns
CollapseUrns
| | +|
collapse_urns.urns_suffix_regex
array(string)
| | +|
index_pattern
AllowDenyPattern
| regex patterns for indexes to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': ['^\_.\*', '^ilm-history.\*...
| +|
index_pattern.allow
array(string)
| | +|
index_pattern.deny
array(string)
| | +|
index_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
index_template_pattern
AllowDenyPattern
| The regex patterns for filtering index templates to ingest.
Default: {'allow': ['.\*'], 'deny': ['^\_.\*'], 'ignoreCase': ...
| +|
index_template_pattern.allow
array(string)
| | +|
index_template_pattern.deny
array(string)
| | +|
index_template_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
ElasticProfiling
| | +|
profiling.enabled
boolean
| Whether to enable profiling for the elastic search source.
Default: False
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "ElasticsearchSourceConfig", + "description": "Any source that connects to a platform should inherit this class", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "host": { + "title": "Host", + "description": "The elastic search host URI.", + "default": "localhost:9200", + "type": "string" + }, + "username": { + "title": "Username", + "description": "The username credential.", + "type": "string" + }, + "password": { + "title": "Password", + "description": "The password credential.", + "type": "string" + }, + "use_ssl": { + "title": "Use Ssl", + "description": "Whether to use SSL for the connection or not.", + "default": false, + "type": "boolean" + }, + "verify_certs": { + "title": "Verify Certs", + "description": "Whether to verify SSL certificates.", + "default": false, + "type": "boolean" + }, + "ca_certs": { + "title": "Ca Certs", + "description": "Path to a certificate authority (CA) certificate.", + "type": "string" + }, + "client_cert": { + "title": "Client Cert", + "description": "Path to the file containing the private key and the certificate, or cert only if using client_key.", + "type": "string" + }, + "client_key": { + "title": "Client Key", + "description": "Path to the file containing the private key if using separate cert and key files.", + "type": "string" + }, + "ssl_assert_hostname": { + "title": "Ssl Assert Hostname", + "description": "Use hostname verification if not False.", + "default": false, + "type": "boolean" + }, + "ssl_assert_fingerprint": { + "title": "Ssl Assert Fingerprint", + "description": "Verify the supplied certificate fingerprint if not None.", + "type": "string" + }, + "url_prefix": { + "title": "Url Prefix", + "description": "There are cases where an enterprise would have multiple elastic search clusters. 
One way for them to manage is to have a single endpoint for all the elastic search clusters and use url_prefix for routing requests to different clusters.", + "default": "", + "type": "string" + }, + "index_pattern": { + "title": "Index Pattern", + "description": "regex patterns for indexes to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [ + "^_.*", + "^ilm-history.*" + ], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "ingest_index_templates": { + "title": "Ingest Index Templates", + "description": "Ingests ES index templates if enabled.", + "default": false, + "type": "boolean" + }, + "index_template_pattern": { + "title": "Index Template Pattern", + "description": "The regex patterns for filtering index templates to ingest.", + "default": { + "allow": [ + ".*" + ], + "deny": [ + "^_.*" + ], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profiling": { + "$ref": "#/definitions/ElasticProfiling" + }, + "collapse_urns": { + "$ref": "#/definitions/CollapseUrns" + } + }, + "additionalProperties": false, + "definitions": { + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "ElasticProfiling": { + "title": "ElasticProfiling", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether to enable profiling for the elastic search source.", + "default": false, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "CollapseUrns": { + "title": "CollapseUrns", + "type": "object", + "properties": { + "urns_suffix_regex": { + "title": "Urns Suffix Regex", + "description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs.", + "type": "array", + "items": { + "type": "string" + } + } + }, + "additionalProperties": false + } + } +} +``` + + +
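+
+To illustrate the `index_pattern` and `collapse_urns` options described above, here is a hedged sketch; the index names and suffix regexes are assumptions chosen to match the "-YYYY-MM-DD" and epoch-timestamp examples in the field description.
+
+```yaml
+source:
+  type: "elasticsearch"
+  config:
+    host: "localhost:9200"
+    index_pattern:
+      allow: ["logs-.*"] # placeholder: only ingest matching indices
+      deny: ["^_.*", "^ilm-history.*"] # keep the default system-index denies
+    collapse_urns:
+      # Strip rolling-index suffixes so all rollovers map to one dataset URN.
+      urns_suffix_regex:
+        - '-\d{4}-\d{2}-\d{2}$' # e.g. logs-app-2023-07-01
+        - '-\d{10,13}$' # e.g. logs-app-1688169600
+
+sink:
+  # sink configs
+```
+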
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.elastic_search.ElasticsearchSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py) + +

### Questions
+ +If you've got any questions on configuring ingestion for Elasticsearch, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/feast.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/feast.md new file mode 100644 index 0000000000000..be0818f1ae885 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/feast.md @@ -0,0 +1,117 @@ +--- +sidebar_position: 13 +title: Feast +slug: /generated/ingestion/sources/feast +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/feast.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Feast + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ------------------- | ------ | ------------------ | +| Table-Level Lineage | ✅ | Enabled by default | + +This plugin extracts: + +- Entities as [`MLPrimaryKey`](/docs/graphql/objects#mlprimarykey) +- Fields as [`MLFeature`](/docs/graphql/objects#mlfeature) +- Feature views and on-demand feature views as [`MLFeatureTable`](/docs/graphql/objects#mlfeaturetable) +- Batch and stream source details as [`Dataset`](/docs/graphql/objects#dataset) +- Column types associated with each entity and feature + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[feast]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: feast + config: + # Coordinates + path: "/path/to/repository/" + # Options + environment: "PROD" + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
path 
string
| Path to Feast repository | +|
environment
string
| Environment to use when constructing URNs
Default: PROD
| +|
fs_yaml_file
string
| Path to the `feature_store.yaml` file used to configure the feature store | + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "FeastRepositorySourceConfig", + "type": "object", + "properties": { + "path": { + "title": "Path", + "description": "Path to Feast repository", + "type": "string" + }, + "fs_yaml_file": { + "title": "Fs Yaml File", + "description": "Path to the `feature_store.yaml` file used to configure the feature store", + "type": "string" + }, + "environment": { + "title": "Environment", + "description": "Environment to use when constructing URNs", + "default": "PROD", + "type": "string" + } + }, + "required": [ + "path" + ], + "additionalProperties": false +} +``` + + +
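As a variation on the starter recipe above, the sketch below also points the source at an explicit `feature_store.yaml` and pins the environment; both paths are placeholders.

```yaml
source:
  type: feast
  config:
    path: "/path/to/repository/"
    # Optional: explicit feature_store.yaml location (placeholder path)
    fs_yaml_file: "/path/to/repository/feature_store.yaml"
    environment: "PROD"

sink:
  # sink configs
```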
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.feast.FeastRepositorySource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/feast.py) + +

### Questions
+ +If you've got any questions on configuring ingestion for Feast, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/file-based-lineage.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/file-based-lineage.md new file mode 100644 index 0000000000000..fb075f99c8f21 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/file-based-lineage.md @@ -0,0 +1,146 @@ +--- +sidebar_position: 15 +title: File Based Lineage +slug: /generated/ingestion/sources/file-based-lineage +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/file-based-lineage.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# File Based Lineage + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +This plugin pulls lineage metadata from a yaml-formatted file. An example of one such file is located in the examples directory [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/file_lineage.yml). + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[datahub-lineage-file]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: datahub-lineage-file + config: + # Coordinates + file: /path/to/file_lineage.yml + # Whether we want to query datahub-gms for upstream data + preserve_upstream: False + +sink: +# sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
file 
string
| File path or URL to lineage file to ingest. | +|
preserve_upstream
boolean
| Whether we want to query datahub-gms for upstream data. False means it will hard replace upstream data for a given entity. True means it will query the backend for existing upstreams and include it in the ingestion run
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "LineageFileSourceConfig", + "type": "object", + "properties": { + "file": { + "title": "File", + "description": "File path or URL to lineage file to ingest.", + "type": "string" + }, + "preserve_upstream": { + "title": "Preserve Upstream", + "description": "Whether we want to query datahub-gms for upstream data. False means it will hard replace upstream data for a given entity. True means it will query the backend for existing upstreams and include it in the ingestion run", + "default": true, + "type": "boolean" + } + }, + "required": [ + "file" + ], + "additionalProperties": false +} +``` + + +
+ +### Lineage File Format + +The lineage source file should be a `.yml` file with the following top-level keys: + +**version**: the version of lineage file config the config conforms to. Currently, the only version released +is `1`. + +**lineage**: the top level key of the lineage file containing a list of **EntityNodeConfig** objects + +**EntityNodeConfig**: + +- **entity**: **EntityConfig** object +- **upstream**: (optional) list of child **EntityNodeConfig** objects +- **fineGrainedLineages**: (optional) list of **FineGrainedLineageConfig** objects + +**EntityConfig**: + +- **name**: identifier of the entity. Typically name or guid, as used in constructing entity urn. +- **type**: type of the entity (only `dataset` is supported as of now) +- **env**: the environment of this entity. Should match the values in the + table [here](/docs/graphql/enums/#fabrictype) +- **platform**: a valid platform like kafka, snowflake, etc.. +- **platform_instance**: optional string specifying the platform instance of this entity + +For example if dataset URN is `urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,DEV)` then **EntityConfig** will look like: + +```yml +name: userdb.public.customer_table +type: dataset +env: DEV +platform: redshift +``` + +**FineGrainedLineageConfig**: + +- **upstreamType**: type of upstream entity in a fine-grained lineage; default = "FIELD_SET" +- **upstreams**: (optional) list of upstream schema field urns +- **downstreamType**: type of downstream entity in a fine-grained lineage; default = "FIELD_SET" +- **downstreams**: (optional) list of downstream schema field urns +- **transformOperation**: (optional) transform operation applied to the upstream entities to produce the downstream field(s) +- **confidenceScore**: (optional) the confidence in this lineage between 0 (low confidence) and 1 (high confidence); default = 1.0 + +**FineGrainedLineageConfig** can be used to display fine grained lineage, also referred to as column-level lineage, +for custom sources. + +You can also view an example lineage file checked in [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/file_lineage.yml) + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.metadata.lineage.LineageFileSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/metadata/lineage.py) + +
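Putting the file format described above together, a minimal lineage file might look like the following sketch; the dataset names and platforms are hypothetical and only meant to show the shape of the YAML.

```yaml
# hypothetical file_lineage.yml
version: 1
lineage:
  - entity:
      name: userdb.public.customer_table
      type: dataset
      env: DEV
      platform: redshift
    upstream:
      - entity:
          name: customer_events_topic
          type: dataset
          env: DEV
          platform: kafka
```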

### Questions
+ +If you've got any questions on configuring ingestion for File Based Lineage, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/file.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/file.md new file mode 100644 index 0000000000000..d54b5c0b14561 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/file.md @@ -0,0 +1,128 @@ +--- +sidebar_position: 14 +title: File +slug: /generated/ingestion/sources/file +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/file.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# File + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +This plugin pulls metadata from a previously generated file. The [file sink](../../../../metadata-ingestion/sink_docs/file.md) can produce such files, and a number of samples are included in the [examples/mce_files](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/mce_files) directory. + +### CLI based Ingestion + +#### Install the Plugin + +The `file` source works out of the box with `acryl-datahub`. + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: file + config: + # Coordinates + filename: ./path/to/mce/file.json + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
path 
string
| File path to folder or file to ingest, or URL to a remote file. If pointed to a folder, all files with extension {file_extension} (default json) within that folder will be processed. | +|
aspect
string
| Set to an aspect to only read this aspect for ingestion. | +|
count_all_before_starting
boolean
| When enabled, counts total number of records in the file before starting. Used for accurate estimation of completion time. Turn it off if startup time is too high.
Default: True
| +|
file_extension
string
| When providing a folder to use to read files, set this field to control file extensions that you want the source to process. \* is a special value that means process every file regardless of extension
Default: .json
| +|
read_mode
Enum
|
Default: AUTO
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "FileSourceConfig", + "type": "object", + "properties": { + "path": { + "title": "Path", + "description": "File path to folder or file to ingest, or URL to a remote file. If pointed to a folder, all files with extension {file_extension} (default json) within that folder will be processed.", + "type": "string" + }, + "file_extension": { + "title": "File Extension", + "description": "When providing a folder to use to read files, set this field to control file extensions that you want the source to process. * is a special value that means process every file regardless of extension", + "default": ".json", + "type": "string" + }, + "read_mode": { + "default": "AUTO", + "allOf": [ + { + "$ref": "#/definitions/FileReadMode" + } + ] + }, + "aspect": { + "title": "Aspect", + "description": "Set to an aspect to only read this aspect for ingestion.", + "type": "string" + }, + "count_all_before_starting": { + "title": "Count All Before Starting", + "description": "When enabled, counts total number of records in the file before starting. Used for accurate estimation of completion time. Turn it off if startup time is too high.", + "default": true, + "type": "boolean" + } + }, + "required": [ + "path" + ], + "additionalProperties": false, + "definitions": { + "FileReadMode": { + "title": "FileReadMode", + "description": "An enumeration.", + "enum": [ + "STREAM", + "BATCH", + "AUTO" + ] + } + } +} +``` + + +
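For example, a recipe along the following lines (the directory path is a placeholder) would process every `.json` file in a folder in batch mode and skip the upfront record count:

```yaml
source:
  type: file
  config:
    path: "./path/to/mce/directory/" # placeholder folder; all matching files inside are processed
    file_extension: ".json"
    read_mode: BATCH
    count_all_before_starting: false

sink:
  # sink configs
```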
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.file.GenericFileSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/file.py) + +

### Questions
+ +If you've got any questions on configuring ingestion for File, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/gcs.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/gcs.md new file mode 100644 index 0000000000000..e2b953d606e9f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/gcs.md @@ -0,0 +1,460 @@ +--- +sidebar_position: 17 +title: Google Cloud Storage +slug: /generated/ingestion/sources/gcs +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/gcs.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Google Cloud Storage + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| -------------------------------------------------------------------------------- | ------ | ------------------ | +| Asset Containers | ✅ | Enabled by default | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ❌ | Not supported | +| Schema Metadata | ✅ | Enabled by default | + +This connector extracting datasets located on Google Cloud Storage. Supported file types are as follows: + +- CSV +- TSV +- JSON +- Parquet +- Apache Avro + +Schemas for Parquet and Avro files are extracted as provided. + +Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)) +JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. + +This source leverages [Interoperability of GCS with S3](https://cloud.google.com/storage/docs/interoperability) +and uses DataHub S3 Data Lake integration source under the hood. + +### Prerequisites + +1. Create a service account with "Storage Object Viewer" Role - https://cloud.google.com/iam/docs/service-accounts-create +2. Make sure you meet following requirements to generate HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin +3. Create an HMAC key for service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create . + +To ingest datasets from your data lake, you need to provide the dataset path format specifications using `path_specs` configuration in ingestion recipe. +Refer section [Path Specs](/docs/generated/ingestion/sources/gcs/#path-specs) for examples. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[gcs]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: gcs + config: + path_specs: + - include: gs://gcs-ingestion-bucket/parquet_example/{table}/year={partition[0]}/*.parquet + credential: + hmac_access_id: + hmac_access_secret: +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
credential 
HMACKey
| Google cloud storage [HMAC keys](https://cloud.google.com/storage/docs/authentication/hmackeys) | +|
credential.hmac_access_id 
string
| Access ID | +|
credential.hmac_access_secret 
string(password)
| Secret | +|
max_rows
integer
| Maximum number of rows to use when inferring schemas for TSV and CSV files.
Default: 100
| +|
number_of_files_to_sample
integer
| Number of files to list to sample for schema inference. This will be ignored if sample_files is set to False in the pathspec.
Default: 100
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
path_specs
array(object)
| | +|
path_specs.include 
string
| Path to table. Name variable `{table}` is used to mark the folder with dataset. In absence of `{table}`, file level dataset will be created. Check below examples for more details. | +|
path_specs.default_extension
string
| For files without extension it will assume the specified file type. If it is not set the files without extensions will be skipped. | +|
path_specs.enable_compression
boolean
| Enable or disable processing compressed files. Currently .gz and .bz files are supported.
Default: True
| +|
path_specs.exclude
array(string)
| | +|
path_specs.file_types
array(string)
| | +|
path_specs.sample_files
boolean
| Not listing all the files but only taking a handful amount of sample file to infer the schema. File count and file size calculation will be disabled. This can affect performance significantly if enabled
Default: True
| +|
path_specs.table_name
string
| Display name of the dataset.Combination of named variables from include path and strings | +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "GCSSourceConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "path_specs": { + "title": "Path Specs", + "description": "List of PathSpec. See [below](#path-spec) the details about PathSpec", + "type": "array", + "items": { + "$ref": "#/definitions/PathSpec" + } + }, + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "credential": { + "title": "Credential", + "description": "Google cloud storage [HMAC keys](https://cloud.google.com/storage/docs/authentication/hmackeys)", + "allOf": [ + { + "$ref": "#/definitions/HMACKey" + } + ] + }, + "max_rows": { + "title": "Max Rows", + "description": "Maximum number of rows to use when inferring schemas for TSV and CSV files.", + "default": 100, + "type": "integer" + }, + "number_of_files_to_sample": { + "title": "Number Of Files To Sample", + "description": "Number of files to list to sample for schema inference. This will be ignored if sample_files is set to False in the pathspec.", + "default": 100, + "type": "integer" + } + }, + "required": [ + "path_specs", + "credential" + ], + "additionalProperties": false, + "definitions": { + "PathSpec": { + "title": "PathSpec", + "type": "object", + "properties": { + "include": { + "title": "Include", + "description": "Path to table. Name variable `{table}` is used to mark the folder with dataset. In absence of `{table}`, file level dataset will be created. Check below examples for more details.", + "type": "string" + }, + "exclude": { + "title": "Exclude", + "description": "list of paths in glob pattern which will be excluded while scanning for the datasets", + "type": "array", + "items": { + "type": "string" + } + }, + "file_types": { + "title": "File Types", + "description": "Files with extenstions specified here (subset of default value) only will be scanned to create dataset. Other files will be omitted.", + "default": [ + "csv", + "tsv", + "json", + "parquet", + "avro" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "default_extension": { + "title": "Default Extension", + "description": "For files without extension it will assume the specified file type. If it is not set the files without extensions will be skipped.", + "type": "string" + }, + "table_name": { + "title": "Table Name", + "description": "Display name of the dataset.Combination of named variables from include path and strings", + "type": "string" + }, + "enable_compression": { + "title": "Enable Compression", + "description": "Enable or disable processing compressed files. Currently .gz and .bz files are supported.", + "default": true, + "type": "boolean" + }, + "sample_files": { + "title": "Sample Files", + "description": "Not listing all the files but only taking a handful amount of sample file to infer the schema. File count and file size calculation will be disabled. 
This can affect performance significantly if enabled", + "default": true, + "type": "boolean" + } + }, + "required": [ + "include" + ], + "additionalProperties": false + }, + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "HMACKey": { + "title": "HMACKey", + "type": "object", + "properties": { + "hmac_access_id": { + "title": "Hmac Access Id", + "description": "Access ID", + "type": "string" + }, + "hmac_access_secret": { + "title": "Hmac Access Secret", + "description": "Secret", + "type": "string", + "writeOnly": true, + "format": "password" + } + }, + "required": [ + "hmac_access_id", + "hmac_access_secret" + ], + "additionalProperties": false + } + } +} +``` + + +
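To show how these options fit together, the sketch below extends the starter recipe with a partitioned path spec, an exclude pattern, and stateful ingestion; the bucket name, patterns, and secret references are placeholders.

```yaml
source:
  type: gcs
  config:
    path_specs:
      - include: gs://my-gcs-bucket/{table}/{partition_key[0]}={partition[0]}/*.parquet
        exclude:
          - "**/tmp_*/**" # placeholder exclude glob
    credential:
      hmac_access_id: ${GCS_HMAC_ACCESS_ID}
      hmac_access_secret: ${GCS_HMAC_ACCESS_SECRET}
    env: PROD
    stateful_ingestion:
      enabled: true

sink:
  # sink configs
```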
+ +### Path Specs + +**Example - Dataset per file** + +Bucket structure: + +``` +test-gs-bucket +├── employees.csv +└── food_items.csv +``` + +Path specs config + +``` +path_specs: + - include: gs://test-gs-bucket/*.csv + +``` + +**Example - Datasets with partitions** + +Bucket structure: + +``` +test-gs-bucket +├── orders +│   └── year=2022 +│   └── month=2 +│   ├── 1.parquet +│   └── 2.parquet +└── returns + └── year=2021 + └── month=2 + └── 1.parquet + +``` + +Path specs config: + +``` +path_specs: + - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet +``` + +**Example - Datasets with partition and exclude** + +Bucket structure: + +``` +test-gs-bucket +├── orders +│   └── year=2022 +│   └── month=2 +│   ├── 1.parquet +│   └── 2.parquet +└── tmp_orders + └── year=2021 + └── month=2 + └── 1.parquet + + +``` + +Path specs config: + +``` +path_specs: + - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet + exclude: + - **/tmp_orders/** +``` + +**Example - Datasets of mixed nature** + +Bucket structure: + +``` +test-gs-bucket +├── customers +│   ├── part1.json +│   ├── part2.json +│   ├── part3.json +│   └── part4.json +├── employees.csv +├── food_items.csv +├── tmp_10101000.csv +└── orders +    └── year=2022 +    └── month=2 +   ├── 1.parquet +   ├── 2.parquet +   └── 3.parquet + +``` + +Path specs config: + +``` +path_specs: + - include: gs://test-gs-bucket/*.csv + exclude: + - **/tmp_10101000.csv + - include: gs://test-gs-bucket/{table}/*.json + - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet +``` + +**Valid path_specs.include** + +```python +gs://my-bucket/foo/tests/bar.avro # single file table +gs://my-bucket/foo/tests/*.* # mulitple file level tables +gs://my-bucket/foo/tests/{table}/*.avro #table without partition +gs://my-bucket/foo/tests/{table}/*/*.avro #table where partitions are not specified +gs://my-bucket/foo/tests/{table}/*.* # table where no partitions as well as data type specified +gs://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name +gs://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format +gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format +gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions +gs://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in bucket +gs://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in bucket +``` + +**Valid path_specs.exclude** + +- \*\*/tests/\*\* +- gs://my-bucket/hr/\*\* +- \*_/tests/_.csv +- gs://my-bucket/foo/\*/my_table/\*\* + +**Notes** + +- {table} represents folder for which dataset will be created. +- include path must end with (_._ or \*.[ext]) to represent leaf level. +- if \*.[ext] is provided then only files with specified type will be scanned. +- /\*/ represents single folder. +- {partition[i]} represents value of partition. +- {partition_key[i]} represents name of the partition. +- While extracting, “i” will be used to match partition_key to partition. +- all folder levels need to be specified in include. Only exclude path can have \*\* like matching. 
+- exclude path cannot have named variables ( {} ). +- Folder names should not contain {, }, \*, / in their names. +- {folder} is reserved for internal working. please do not use in named variables. + +If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit. + +:::caution + +Specify as long fixed prefix ( with out /\*/ ) as possible in `path_specs.include`. This will reduce the scanning time and cost, specifically on Google Cloud Storage. + +::: + +:::caution + +If you are ingesting datasets from Google Cloud Storage, we recommend running the ingestion on a server in the same region to avoid high egress costs. + +::: + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.gcs.gcs_source.GCSSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/gcs/gcs_source.py) + +

### Questions
+ +If you've got any questions on configuring ingestion for Google Cloud Storage, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/glue.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/glue.md new file mode 100644 index 0000000000000..96122a12fdaef --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/glue.md @@ -0,0 +1,552 @@ +--- +sidebar_position: 16 +title: Glue +slug: /generated/ingestion/sources/glue +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/glue.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Glue + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | -------------------------------------------------------- | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled by default when stateful ingestion is turned on. | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | + +Note: if you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See [here](../../../../docs/generated/ingestion/sources/s3.md) for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub. + +This plugin extracts the following: + +- Tables in the Glue catalog +- Column types associated with each table +- Table metadata, such as owner, description and parameters +- Jobs and their component transformations, data sources, and data sinks + +### IAM permissions + +For ingesting datasets, the following IAM permissions are required: + +```json +{ + "Effect": "Allow", + "Action": ["glue:GetDatabases", "glue:GetTables"], + "Resource": [ + "arn:aws:glue:$region-id:$account-id:catalog", + "arn:aws:glue:$region-id:$account-id:database/*", + "arn:aws:glue:$region-id:$account-id:table/*" + ] +} +``` + +For ingesting jobs (`extract_transforms: True`), the following additional permissions are required: + +```json +{ + "Effect": "Allow", + "Action": ["glue:GetDataflowGraph", "glue:GetJobs"], + "Resource": "*" +} +``` + +plus `s3:GetObject` for the job script locations. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[glue]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: glue + config: + # Coordinates + aws_region: "my-aws-region" + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
aws_region 
string
| AWS region code. | +|
aws_access_key_id
string
| AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
aws_endpoint_url
string
| The AWS service endpoint. This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here. | +|
aws_profile
string
| Named AWS profile to use. Only used if access key / secret are unset. If not set the default will be used | +|
aws_proxy
map(str,string)
| | +|
aws_secret_access_key
string
| AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
aws_session_token
string
| AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details. | +|
catalog_id
string
| The aws account id where the target glue catalog lives. If None, datahub will ingest glue in aws caller's account. | +|
emit_s3_lineage
boolean
| Whether to emit S3-to-Glue lineage.
Default: False
| +|
extract_owners
boolean
| When enabled, extracts ownership from Glue directly and overwrites existing owners. When disabled, ownership is left empty for datasets.
Default: True
| +|
extract_transforms
boolean
| Whether to extract Glue transform jobs.
Default: True
| +|
glue_s3_lineage_direction
string
| If `upstream`, S3 is upstream to Glue. If `downstream` S3 is downstream to Glue.
Default: upstream
| +|
ignore_resource_links
boolean
| If set to True, ignore database resource links.
Default: False
| +|
ignore_unsupported_connectors
boolean
| Whether to ignore unsupported connectors. If disabled, an error will be raised.
Default: True
| +|
platform
string
| The platform to use for the dataset URNs. Must be one of ['glue', 'athena'].
Default: glue
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
use_s3_bucket_tags
boolean
| If an S3 Buckets Tags should be created for the Tables ingested by Glue. Please Note that this will not apply tags to any folders ingested, only the files.
Default: False
| +|
use_s3_object_tags
boolean
| If an S3 Objects Tags should be created for the Tables ingested by Glue.
Default: False
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
aws_role
One of string, union(anyOf), string, AwsAssumeRoleConfig
| AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role | +|
aws_role.RoleArn 
string
| ARN of the role to assume. | +|
aws_role.ExternalId
string
| External ID to use when assuming the role. | +|
database_pattern
AllowDenyPattern
| regex patterns for databases to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
database_pattern.allow
array(string)
| | +|
database_pattern.deny
array(string)
| | +|
database_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| regex patterns for tables to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GlueProfilingConfig
| Configs to ingest data profiles from glue table | +|
profiling.column_count
string
| The parameter name for column count in glue table. | +|
profiling.max
string
| The parameter name for the max value of a column. | +|
profiling.mean
string
| The parameter name for the mean value of a column. | +|
profiling.median
string
| The parameter name for the median value of a column. | +|
profiling.min
string
| The parameter name for the min value of a column. | +|
profiling.null_count
string
| The parameter name for the count of null values in a column. | +|
profiling.null_proportion
string
| The parameter name for the proportion of null values in a column. | +|
profiling.row_count
string
| The parameter name for row count in glue table. | +|
profiling.stdev
string
| The parameter name for the standard deviation of a column. | +|
profiling.unique_count
string
| The parameter name for the count of unique value in a column. | +|
profiling.unique_proportion
string
| The parameter name for the proportion of unique values in a column. | +|
profiling.partition_patterns
AllowDenyPattern
| Regex patterns for filtering partitions for profile. The pattern should be a string like: "{'key':'value'}".
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profiling.partition_patterns.allow
array(string)
| | +|
profiling.partition_patterns.deny
array(string)
| | +|
profiling.partition_patterns.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "GlueSourceConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "aws_access_key_id": { + "title": "Aws Access Key Id", + "description": "AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_secret_access_key": { + "title": "Aws Secret Access Key", + "description": "AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_session_token": { + "title": "Aws Session Token", + "description": "AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.", + "type": "string" + }, + "aws_role": { + "title": "Aws Role", + "description": "AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role", + "anyOf": [ + { + "type": "string" + }, + { + "type": "array", + "items": { + "anyOf": [ + { + "type": "string" + }, + { + "$ref": "#/definitions/AwsAssumeRoleConfig" + } + ] + } + } + ] + }, + "aws_profile": { + "title": "Aws Profile", + "description": "Named AWS profile to use. Only used if access key / secret are unset. If not set the default will be used", + "type": "string" + }, + "aws_region": { + "title": "Aws Region", + "description": "AWS region code.", + "type": "string" + }, + "aws_endpoint_url": { + "title": "Aws Endpoint Url", + "description": "The AWS service endpoint. This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here.", + "type": "string" + }, + "aws_proxy": { + "title": "Aws Proxy", + "description": "A set of proxy configs to use with AWS. 
See the [botocore.config](https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html) docs for details.", + "type": "object", + "additionalProperties": { + "type": "string" + } + }, + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "database_pattern": { + "title": "Database Pattern", + "description": "regex patterns for databases to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "regex patterns for tables to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "platform": { + "title": "Platform", + "description": "The platform to use for the dataset URNs. Must be one of ['glue', 'athena'].", + "default": "glue", + "type": "string" + }, + "extract_owners": { + "title": "Extract Owners", + "description": "When enabled, extracts ownership from Glue directly and overwrites existing owners. When disabled, ownership is left empty for datasets.", + "default": true, + "type": "boolean" + }, + "extract_transforms": { + "title": "Extract Transforms", + "description": "Whether to extract Glue transform jobs.", + "default": true, + "type": "boolean" + }, + "ignore_unsupported_connectors": { + "title": "Ignore Unsupported Connectors", + "description": "Whether to ignore unsupported connectors. If disabled, an error will be raised.", + "default": true, + "type": "boolean" + }, + "emit_s3_lineage": { + "title": "Emit S3 Lineage", + "description": "Whether to emit S3-to-Glue lineage.", + "default": false, + "type": "boolean" + }, + "glue_s3_lineage_direction": { + "title": "Glue S3 Lineage Direction", + "description": "If `upstream`, S3 is upstream to Glue. If `downstream` S3 is downstream to Glue.", + "default": "upstream", + "type": "string" + }, + "domain": { + "title": "Domain", + "description": "regex patterns for tables to filter to assign domain_key. ", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "catalog_id": { + "title": "Catalog Id", + "description": "The aws account id where the target glue catalog lives. If None, datahub will ingest glue in aws caller's account.", + "type": "string" + }, + "ignore_resource_links": { + "title": "Ignore Resource Links", + "description": "If set to True, ignore database resource links.", + "default": false, + "type": "boolean" + }, + "use_s3_bucket_tags": { + "title": "Use S3 Bucket Tags", + "description": "If an S3 Buckets Tags should be created for the Tables ingested by Glue. 
Please Note that this will not apply tags to any folders ingested, only the files.", + "default": false, + "type": "boolean" + }, + "use_s3_object_tags": { + "title": "Use S3 Object Tags", + "description": "If an S3 Objects Tags should be created for the Tables ingested by Glue.", + "default": false, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "description": "Configs to ingest data profiles from glue table", + "allOf": [ + { + "$ref": "#/definitions/GlueProfilingConfig" + } + ] + } + }, + "required": [ + "aws_region" + ], + "additionalProperties": false, + "definitions": { + "AwsAssumeRoleConfig": { + "title": "AwsAssumeRoleConfig", + "type": "object", + "properties": { + "RoleArn": { + "title": "Rolearn", + "description": "ARN of the role to assume.", + "type": "string" + }, + "ExternalId": { + "title": "Externalid", + "description": "External ID to use when assuming the role.", + "type": "string" + } + }, + "required": [ + "RoleArn" + ] + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GlueProfilingConfig": { + "title": "GlueProfilingConfig", + "type": "object", + "properties": { + "row_count": { + "title": "Row Count", + "description": "The parameter name for row count in glue table.", + "type": "string" + }, + "column_count": { + "title": "Column Count", + "description": "The parameter name for column count in glue table.", + "type": "string" + }, + "unique_count": { + "title": "Unique Count", + "description": "The parameter name for the count of unique value in a column.", + "type": "string" + }, + "unique_proportion": { + "title": "Unique Proportion", + "description": "The parameter name for the proportion of unique values in a column.", + "type": "string" + }, + "null_count": { + "title": "Null Count", + "description": "The parameter name for the count of null values in a column.", + "type": "string" + }, + "null_proportion": { + "title": "Null Proportion", + "description": "The parameter name for the proportion of null values in a column.", + "type": "string" + }, + "min": { + "title": "Min", + "description": "The parameter name for the min value of a column.", + "type": "string" + }, + "max": { + "title": "Max", + "description": "The parameter name for the max value of a column.", + "type": "string" + }, + "mean": { + "title": "Mean", + "description": "The parameter name for the mean value of a column.", + "type": "string" + }, + "median": { + "title": "Median", + "description": "The parameter name for the median value of a column.", + "type": "string" + }, + "stdev": { + "title": "Stdev", + "description": "The parameter name for the standard deviation of a column.", + "type": "string" + }, + "partition_patterns": { + "title": "Partition Patterns", + "description": "Regex patterns for filtering partitions for profile. The pattern should be a string like: \"{'key':'value'}\".", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + } + }, + "additionalProperties": false + } + } +} +``` + + +
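As a fuller illustration of the options above, a recipe might combine role assumption, S3 lineage, database filtering, profiling parameter names, and stateful ingestion roughly as follows; the ARN, regexes, and parameter names are placeholders:

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"
    aws_role: "arn:aws:iam::123456789012:role/datahub-glue-reader" # placeholder ARN
    extract_transforms: true
    emit_s3_lineage: true
    glue_s3_lineage_direction: upstream
    database_pattern:
      deny:
        - "^temp_.*"
    profiling:
      # Glue table parameter names, if your crawlers publish these statistics
      row_count: row_count
      column_count: column_count
    stateful_ingestion:
      enabled: true

sink:
  # sink configs
```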
+ +### Concept Mapping + +| Source Concept | DataHub Concept | Notes | +| -------------------- | --------------------------------------------------------- | ------------------ | +| `"glue"` | [Data Platform](../../metamodel/entities/dataPlatform.md) | | +| Glue Database | [Container](../../metamodel/entities/container.md) | Subtype `Database` | +| Glue Table | [Dataset](../../metamodel/entities/dataset.md) | Subtype `Table` | +| Glue Job | [Data Flow](../../metamodel/entities/dataFlow.md) | | +| Glue Job Transform | [Data Job](../../metamodel/entities/dataJob.md) | | +| Glue Job Data source | [Dataset](../../metamodel/entities/dataset.md) | | +| Glue Job Data sink | [Dataset](../../metamodel/entities/dataset.md) | | + +### Compatibility + +To capture lineage across Glue jobs and databases, a requirements must be met – otherwise the AWS API is unable to report any lineage. The job must be created in Glue Studio with the "Generate classic script" option turned on (this option can be accessed in the "Script" tab). Any custom scripts that do not have the proper annotations will not have reported lineage. + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.aws.glue.GlueSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/aws/glue.py) + +

### Questions
+ +If you've got any questions on configuring ingestion for Glue, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/hana.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/hana.md new file mode 100644 index 0000000000000..34f550d04df2f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/hana.md @@ -0,0 +1,602 @@ +--- +sidebar_position: 45 +title: SAP HANA +slug: /generated/ingestion/sources/hana +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/hana.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# SAP HANA + +![Testing](https://img.shields.io/badge/support%20status-testing-lightgrey) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | --------------------------------------- | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ✅ | Optionally enabled via configuration | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled via stateful ingestion | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[hana]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: hana + config: + # Coordinates + host_port: localhost:39041 + database: dbname + + # Credentials + username: ${HANA_USER} + password: ${HANA_PASS} + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
database
string
| database (catalog) | +|
database_alias
string
| [Deprecated] Alias to apply to database when ingesting. | +|
host_port
string
|
Default: localhost:39041
| +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
include_views
boolean
| Whether views should be ingested.
Default: True
| +|
options
object
| Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. | +|
password
string(password)
| password | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
scheme
string
|
Default: hana+hdbcli
| +|
sqlalchemy_uri
string
| URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. | +|
username
string
| username | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
domain
map(str,AllowDenyPattern)
| Attach domains to databases, schemas or tables during ingestion using regex patterns. The domain key can be a domain guid or a name like "Marketing". | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets, profile only the partition that matches the given datetime, or the latest partition if not set. Only BigQuery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile a table only if it has been updated within the specified number of days. If set to `null`, no last-modified-time constraint is applied when selecting tables to profile. Supported only in `snowflake` and `BigQuery`. | +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less than the specified count. If set to `null`, no limit is placed on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less than the specified number of GBs. If set to `null`, no limit is placed on the size of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| Whether or not to enable stateful ingestion.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "HanaConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs.", + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. 
There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_views": { + "title": "Include Views", + "description": "Whether views should be ingested.", + "default": true, + "type": "boolean" + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "username": { + "title": "Username", + "description": "username", + "type": "string" + }, + "password": { + "title": "Password", + "description": "password", + "type": "string", + "writeOnly": true, + "format": "password" + }, + "host_port": { + "title": "Host Port", + "default": "localhost:39041", + "type": "string" + }, + "database": { + "title": "Database", + "description": "database (catalog)", + "type": "string" + }, + "database_alias": { + "title": "Database Alias", + "description": "[Deprecated] Alias to apply to database when ingesting.", + "type": "string" + }, + "scheme": { + "title": "Scheme", + "default": "hana+hdbcli", + "type": "string" + }, + "sqlalchemy_uri": { + "title": "Sqlalchemy Uri", + "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.", + "type": "string" + } + }, + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. 
This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. 
Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + } + } +} +``` + + +
+ +## Integration Details + +The implementation uses the [SQLAlchemy Dialect for SAP HANA](https://github.com/SAP/sqlalchemy-hana). The SQLAlchemy Dialect for SAP HANA is an open-source project hosted at GitHub that is actively maintained by SAP SE, and is not part of a licensed SAP HANA edition or option. It is provided under the terms of the project license. Please notice that sqlalchemy-hana isn't an official SAP product and isn't covered by SAP support. + +## Compatibility + +Under the hood, [SQLAlchemy Dialect for SAP HANA](https://github.com/SAP/sqlalchemy-hana) uses the SAP HANA Python Driver hdbcli. Therefore it is compatible with HANA or HANA express versions since HANA SPS 2. + +## Questions + +If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)! + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.sql.hana.HanaSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/hana.py) + +
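Because the source connects through the sqlalchemy-hana dialect described above, the connection can also be supplied as a single SQLAlchemy URI instead of separate coordinates. A minimal sketch using the default `hana+hdbcli` scheme (host, credentials, and database name are placeholders):

```yaml
source:
  type: hana
  config:
    # Takes precedence over host_port/username/password if set
    sqlalchemy_uri: "hana+hdbcli://${HANA_USER}:${HANA_PASS}@localhost:39041/dbname"

sink:
  # sink configs
```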

Questions

+ +If you've got any questions on configuring ingestion for SAP HANA, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/hive.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/hive.md new file mode 100644 index 0000000000000..462fbc1e99d2f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/hive.md @@ -0,0 +1,663 @@ +--- +sidebar_position: 18 +title: Hive +slug: /generated/ingestion/sources/hive +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/hive.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Hive + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| --------------------------------------------------- | ------ | --------------------------------------- | +| [Domains](../../../domains.md) | ✅ | Supported via the `domain` config field | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | + +This plugin extracts the following: + +- Metadata for databases, schemas, and tables +- Column types associated with each table +- Detailed table and storage information +- Table, row, and column statistics via optional SQL profiling. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[hive]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +````yaml +source: + type: hive + config: + # Coordinates + host_port: localhost:10000 + database: DemoDatabase # optional, if not specified, ingests from all databases + + # Credentials + username: user # optional + password: pass # optional + + # For more details on authentication, see the PyHive docs: + # https://github.com/dropbox/PyHive#passing-session-configuration. + # LDAP, Kerberos, etc. are supported using connect_args, which can be + # added under the `options` config parameter. + #options: + # connect_args: + # auth: KERBEROS + # kerberos_service_name: hive + #scheme: 'hive+http' # set this if Thrift should use the HTTP transport + #scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport + #scheme: 'sparksql' # set this for Spark Thrift Server + +sink: + # sink configs + +# --------------------------------------------------------- +# Recipe (Azure HDInsight) +# Connecting to Microsoft Azure HDInsight using TLS. +# --------------------------------------------------------- + +source: + type: hive + config: + # Coordinates + host_port: .azurehdinsight.net:443 + + # Credentials + username: admin + password: password + + # Options + options: + connect_args: + http_path: "/hive2" + auth: BASIC + +sink: + # sink configs + +# --------------------------------------------------------- +# Recipe (Databricks) +# Ensure that databricks-dbapi is installed. If not, use ```pip install databricks-dbapi``` to install. +# Use the ```http_path``` from your Databricks cluster in the following recipe. +# See (https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url) for instructions to find ```http_path```. 
+# --------------------------------------------------------- + +source: + type: hive + config: + host_port: :443 + username: token / username + password: / password + scheme: 'databricks+pyhive' + + options: + connect_args: + http_path: 'sql/protocolv1/o/xxxyyyzzzaaasa/1234-567890-hello123' + +sink: + # sink configs +```` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
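Nested fields such as `options.connect_args` in the table below are written as nested YAML. For example, the Kerberos settings mentioned in the recipe comments above would be spelled out roughly as follows (the service name depends on your cluster and is only an assumption here):

```yaml
source:
  type: hive
  config:
    host_port: localhost:10000
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive

sink:
  # sink configs
```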
+ +| Field | Description | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +|
host_port 
string
| host URL | +|
database
string
| database (catalog) | +|
database_alias
string
| [Deprecated] Alias to apply to database when ingesting. | +|
include_table_location_lineage
boolean
| If the source supports it, include table lineage to the underlying storage location.
Default: True
| +|
include_tables
boolean
| Whether tables should be ingested.
Default: True
| +|
options
object
| Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. | +|
password
string(password)
| password | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
sqlalchemy_uri
string
| URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. | +|
username
string
| username | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
database_pattern
AllowDenyPattern
| Regex patterns for databases to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
database_pattern.allow
array(string)
| | +|
database_pattern.deny
array(string)
| | +|
database_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
domain
map(str,AllowDenyPattern)
| Attach domains to databases, schemas or tables during ingestion using regex patterns. The domain key can be a domain guid or a name like "Marketing". | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profile_pattern
AllowDenyPattern
| Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
profile_pattern.allow
array(string)
| | +|
profile_pattern.deny
array(string)
| | +|
profile_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
schema_pattern
AllowDenyPattern
| Deprecated in favour of database_pattern.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
schema_pattern.allow
array(string)
| | +|
schema_pattern.deny
array(string)
| | +|
schema_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
view_pattern
AllowDenyPattern
| Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*'
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
view_pattern.allow
array(string)
| | +|
view_pattern.deny
array(string)
| | +|
view_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
GEProfilingConfig
|
Default: {'enabled': False, 'limit': None, 'offset': None, ...
| +|
profiling.catch_exceptions
boolean
|
Default: True
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.field_sample_values_limit
integer
| Upper limit for number of sample values to collect for all columns.
Default: 20
| +|
profiling.include_field_distinct_count
boolean
| Whether to profile for the number of distinct values for each column.
Default: True
| +|
profiling.include_field_distinct_value_frequencies
boolean
| Whether to profile for distinct value frequencies.
Default: False
| +|
profiling.include_field_histogram
boolean
| Whether to profile for the histogram for numeric fields.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_mean_value
boolean
| Whether to profile for the mean value of numeric columns.
Default: True
| +|
profiling.include_field_median_value
boolean
| Whether to profile for the median value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
profiling.include_field_quantiles
boolean
| Whether to profile for the quantiles of numeric columns.
Default: False
| +|
profiling.include_field_sample_values
boolean
| Whether to profile for the sample values for all columns.
Default: True
| +|
profiling.include_field_stddev_value
boolean
| Whether to profile for the standard deviation of numeric columns.
Default: True
| +|
profiling.limit
integer
| Max number of documents to profile. By default, profiles all documents. | +|
profiling.max_number_of_fields_to_profile
integer
| A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | +|
profiling.max_workers
integer
| Number of worker threads to use for profiling. Set to 1 to disable.
Default: 80
| +|
profiling.offset
integer
| Offset in documents to profile. By default, uses no offset. | +|
profiling.partition_datetime
string(date-time)
| For partitioned datasets, profile only the partition that matches the given datetime, or the latest partition if not set. Only BigQuery supports this. | +|
profiling.partition_profiling_enabled
boolean
|
Default: True
| +|
profiling.profile_if_updated_since_days
number
| Profile a table only if it has been updated within the specified number of days. If set to `null`, no last-modified-time constraint is applied when selecting tables to profile. Supported only in `snowflake` and `BigQuery`. | +|
profiling.profile_table_level_only
boolean
| Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
| +|
profiling.profile_table_row_count_estimate_only
boolean
| Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
| +|
profiling.profile_table_row_limit
integer
| Profile tables only if their row count is less than the specified count. If set to `null`, no limit is placed on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5000000
| +|
profiling.profile_table_size_limit
integer
| Profile tables only if their size is less than the specified number of GBs. If set to `null`, no limit is placed on the size of tables to profile. Supported only in `snowflake` and `BigQuery`
Default: 5
| +|
profiling.query_combiner_enabled
boolean
| _This feature is still experimental and can be disabled if it causes issues._ Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
| +|
profiling.report_dropped_profiles
boolean
| Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.
Default: False
| +|
profiling.turn_off_expensive_profiling_metrics
boolean
| Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| Whether or not to enable stateful ingestion.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "HiveConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "options": { + "title": "Options", + "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs.", + "type": "object" + }, + "schema_pattern": { + "title": "Schema Pattern", + "description": "Deprecated in favour of database_pattern.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "view_pattern": { + "title": "View Pattern", + "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "profile_pattern": { + "title": "Profile Pattern", + "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. 
There can be multiple domain keys specified.", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "include_tables": { + "title": "Include Tables", + "description": "Whether tables should be ingested.", + "default": true, + "type": "boolean" + }, + "include_table_location_lineage": { + "title": "Include Table Location Lineage", + "description": "If the source supports it, include table lineage to the underlying storage location.", + "default": true, + "type": "boolean" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "limit": null, + "offset": null, + "report_dropped_profiles": false, + "turn_off_expensive_profiling_metrics": false, + "profile_table_level_only": false, + "include_field_null_count": true, + "include_field_distinct_count": true, + "include_field_min_value": true, + "include_field_max_value": true, + "include_field_mean_value": true, + "include_field_median_value": true, + "include_field_stddev_value": true, + "include_field_quantiles": false, + "include_field_distinct_value_frequencies": false, + "include_field_histogram": false, + "include_field_sample_values": true, + "field_sample_values_limit": 20, + "max_number_of_fields_to_profile": null, + "profile_if_updated_since_days": null, + "profile_table_size_limit": 5, + "profile_table_row_limit": 5000000, + "profile_table_row_count_estimate_only": false, + "max_workers": 80, + "query_combiner_enabled": true, + "catch_exceptions": true, + "partition_profiling_enabled": true, + "partition_datetime": null + }, + "allOf": [ + { + "$ref": "#/definitions/GEProfilingConfig" + } + ] + }, + "username": { + "title": "Username", + "description": "username", + "type": "string" + }, + "password": { + "title": "Password", + "description": "password", + "type": "string", + "writeOnly": true, + "format": "password" + }, + "host_port": { + "title": "Host Port", + "description": "host URL", + "type": "string" + }, + "database": { + "title": "Database", + "description": "database (catalog)", + "type": "string" + }, + "database_alias": { + "title": "Database Alias", + "description": "[Deprecated] Alias to apply to database when ingesting.", + "type": "string" + }, + "sqlalchemy_uri": { + "title": "Sqlalchemy Uri", + "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.", + "type": "string" + }, + "database_pattern": { + "title": "Database Pattern", + "description": "Regex patterns for databases to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + } + }, + "required": [ + "host_port" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "GEProfilingConfig": { + "title": "GEProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "limit": { + "title": "Limit", + "description": "Max number of documents to profile. By default, profiles all documents.", + "type": "integer" + }, + "offset": { + "title": "Offset", + "description": "Offset in documents to profile. By default, uses no offset.", + "type": "integer" + }, + "report_dropped_profiles": { + "title": "Report Dropped Profiles", + "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.", + "default": false, + "type": "boolean" + }, + "turn_off_expensive_profiling_metrics": { + "title": "Turn Off Expensive Profiling Metrics", + "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. 
This also limits maximum number of fields being profiled to 10.", + "default": false, + "type": "boolean" + }, + "profile_table_level_only": { + "title": "Profile Table Level Only", + "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_distinct_count": { + "title": "Include Field Distinct Count", + "description": "Whether to profile for the number of distinct values for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_mean_value": { + "title": "Include Field Mean Value", + "description": "Whether to profile for the mean value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_median_value": { + "title": "Include Field Median Value", + "description": "Whether to profile for the median value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_stddev_value": { + "title": "Include Field Stddev Value", + "description": "Whether to profile for the standard deviation of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_quantiles": { + "title": "Include Field Quantiles", + "description": "Whether to profile for the quantiles of numeric columns.", + "default": false, + "type": "boolean" + }, + "include_field_distinct_value_frequencies": { + "title": "Include Field Distinct Value Frequencies", + "description": "Whether to profile for distinct value frequencies.", + "default": false, + "type": "boolean" + }, + "include_field_histogram": { + "title": "Include Field Histogram", + "description": "Whether to profile for the histogram for numeric fields.", + "default": false, + "type": "boolean" + }, + "include_field_sample_values": { + "title": "Include Field Sample Values", + "description": "Whether to profile for the sample values for all columns.", + "default": true, + "type": "boolean" + }, + "field_sample_values_limit": { + "title": "Field Sample Values Limit", + "description": "Upper limit for number of sample values to collect for all columns.", + "default": 20, + "type": "integer" + }, + "max_number_of_fields_to_profile": { + "title": "Max Number Of Fields To Profile", + "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.", + "exclusiveMinimum": 0, + "type": "integer" + }, + "profile_if_updated_since_days": { + "title": "Profile If Updated Since Days", + "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. 
Supported only in `snowflake` and `BigQuery`.", + "exclusiveMinimum": 0, + "type": "number" + }, + "profile_table_size_limit": { + "title": "Profile Table Size Limit", + "description": "Profile tables only if their size is less then specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5, + "type": "integer" + }, + "profile_table_row_limit": { + "title": "Profile Table Row Limit", + "description": "Profile tables only if their row count is less then specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `snowflake` and `BigQuery`", + "default": 5000000, + "type": "integer" + }, + "profile_table_row_count_estimate_only": { + "title": "Profile Table Row Count Estimate Only", + "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ", + "default": false, + "type": "boolean" + }, + "max_workers": { + "title": "Max Workers", + "description": "Number of worker threads to use for profiling. Set to 1 to disable.", + "default": 80, + "type": "integer" + }, + "query_combiner_enabled": { + "title": "Query Combiner Enabled", + "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.", + "default": true, + "type": "boolean" + }, + "catch_exceptions": { + "title": "Catch Exceptions", + "default": true, + "type": "boolean" + }, + "partition_profiling_enabled": { + "title": "Partition Profiling Enabled", + "default": true, + "type": "boolean" + }, + "partition_datetime": { + "title": "Partition Datetime", + "description": "For partitioned datasets profile only the partition which matches the datetime or profile the latest one if not set. Only Bigquery supports this.", + "type": "string", + "format": "date-time" + } + }, + "additionalProperties": false + } + } +} +``` + + +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.sql.hive.HiveSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/hive.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Hive, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/iceberg.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/iceberg.md new file mode 100644 index 0000000000000..cde6f983fbebb --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/iceberg.md @@ -0,0 +1,413 @@ +--- +sidebar_position: 19 +title: Iceberg +slug: /generated/ingestion/sources/iceberg +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/iceberg.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Iceberg + +![Testing](https://img.shields.io/badge/support%20status-testing-lightgrey) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | ----------------------------------------------------------------------------------------------------------------- | +| [Data Profiling](../../../../metadata-ingestion/docs/dev_guides/sql_profiles.md) | ✅ | Optionally enabled via configuration. | +| Descriptions | ✅ | Enabled by default. | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | Enabled via stateful ingestion | +| [Domains](../../../domains.md) | ❌ | Currently not supported. | +| Extract Ownership | ✅ | Optionally enabled via configuration by specifying which Iceberg table property holds user or group ownership. | +| Partition Support | ❌ | Currently not supported. | +| [Platform Instance](../../../platform-instances.md) | ✅ | Optionally enabled via configuration, an Iceberg instance represents the datalake name where the table is stored. | + +## Integration Details + +The DataHub Iceberg source plugin extracts metadata from [Iceberg tables](https://iceberg.apache.org/spec/) stored in a distributed or local file system. +Typically, Iceberg tables are stored in a distributed file system like S3 or Azure Data Lake Storage (ADLS) and registered in a catalog. There are various catalog +implementations like Filesystem-based, RDBMS-based or even REST-based catalogs. This Iceberg source plugin relies on the +[Iceberg python_legacy library](https://github.com/apache/iceberg/tree/master/python_legacy) and its support for catalogs is limited at the moment. +A new version of the [Iceberg Python library](https://github.com/apache/iceberg/tree/master/python) is currently in development and should fix this. +Because of this limitation, this source plugin **will only ingest HadoopCatalog-based tables that have a `version-hint.text` metadata file**. + +Ingestion of tables happens in 2 steps: + +1. Discover Iceberg tables stored in file system. +2. Load discovered tables using Iceberg python_legacy library + +The current implementation of the Iceberg source plugin will only discover tables stored in a local file system or in ADLS. Support for S3 could +be added fairly easily. + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[iceberg]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. 
+ +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "iceberg" + config: + env: PROD + adls: + # Will be translated to https://{account_name}.dfs.core.windows.net + account_name: my_adls_account + # Can use sas_token or account_key + sas_token: "${SAS_TOKEN}" + # account_key: "${ACCOUNT_KEY}" + container_name: warehouse + base_path: iceberg + platform_instance: my_iceberg_catalog + table_pattern: + allow: + - marketing.* + profiling: + enabled: true + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
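In addition to ADLS, the source can crawl a local file system via the `localfs` setting documented below. A hypothetical recipe for a local warehouse path, including the optional ownership properties (the path and the group property name are placeholders):

```yaml
source:
  type: "iceberg"
  config:
    env: PROD
    localfs: /data/warehouse/iceberg        # placeholder path to crawl
    max_path_depth: 2
    platform_instance: my_local_catalog
    user_ownership_property: owner          # Iceberg table property holding the owning user
    group_ownership_property: owning_group  # placeholder property holding the owning group
    table_pattern:
      allow:
        - marketing.*

sink:
  # sink configs
```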
+ +| Field | Description | +| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
group_ownership_property
string
| Iceberg table property to look for a `CorpGroup` owner. Can only hold a single group value. If property has no value, no owner information will be emitted. | +|
localfs
string
| Local path to crawl for Iceberg tables. This is one filesystem type supported by this source and **only one can be configured**. | +|
max_path_depth
integer
| Maximum folder depth to crawl for Iceberg tables. Folders deeper than this value will be silently ignored.
Default: 2
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
user_ownership_property
string
| Iceberg table property to look for a `CorpUser` owner. Can only hold a single user value. If property has no value, no owner information will be emitted.
Default: owner
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
adls
AdlsSourceConfig
| [Azure Data Lake Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) to crawl for Iceberg tables. This is one filesystem type supported by this source and **only one can be configured**. | +|
adls.account_name 
string
| Name of the Azure storage account. See [Microsoft official documentation on how to create a storage account.](https://docs.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account) | +|
adls.container_name 
string
| Azure storage account container name. | +|
adls.account_key
string
| Azure storage account access key that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.** | +|
adls.base_path
string
| Base folder in hierarchical namespaces to start from.
Default: /
| +|
adls.client_id
string
| Azure client (Application) ID required when a `client_secret` is used as a credential. | +|
adls.client_secret
string
| Azure client secret that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.** | +|
adls.sas_token
string
| Azure storage account Shared Access Signature (SAS) token that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.** | +|
adls.tenant_id
string
| Azure tenant (Directory) ID required when a `client_secret` is used as a credential. | +|
table_pattern
AllowDenyPattern
| Regex patterns for tables to filter in ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
table_pattern.allow
array(string)
| | +|
table_pattern.deny
array(string)
| | +|
table_pattern.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
profiling
IcebergProfilingConfig
|
Default: {'enabled': False, 'include_field_null_count': Tru...
| +|
profiling.enabled
boolean
| Whether profiling should be done.
Default: False
| +|
profiling.include_field_max_value
boolean
| Whether to profile for the max value of numeric columns.
Default: True
| +|
profiling.include_field_min_value
boolean
| Whether to profile for the min value of numeric columns.
Default: True
| +|
profiling.include_field_null_count
boolean
| Whether to profile for the number of nulls for each column.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Iceberg Stateful Ingestion Config. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "IcebergSourceConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "title": "Stateful Ingestion", + "description": "Iceberg Stateful Ingestion Config.", + "allOf": [ + { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + } + ] + }, + "adls": { + "title": "Adls", + "description": "[Azure Data Lake Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) to crawl for Iceberg tables. This is one filesystem type supported by this source and **only one can be configured**.", + "allOf": [ + { + "$ref": "#/definitions/AdlsSourceConfig" + } + ] + }, + "localfs": { + "title": "Localfs", + "description": "Local path to crawl for Iceberg tables. This is one filesystem type supported by this source and **only one can be configured**.", + "type": "string" + }, + "max_path_depth": { + "title": "Max Path Depth", + "description": "Maximum folder depth to crawl for Iceberg tables. Folders deeper than this value will be silently ignored.", + "default": 2, + "type": "integer" + }, + "table_pattern": { + "title": "Table Pattern", + "description": "Regex patterns for tables to filter in ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "user_ownership_property": { + "title": "User Ownership Property", + "description": "Iceberg table property to look for a `CorpUser` owner. Can only hold a single user value. If property has no value, no owner information will be emitted.", + "default": "owner", + "type": "string" + }, + "group_ownership_property": { + "title": "Group Ownership Property", + "description": "Iceberg table property to look for a `CorpGroup` owner. Can only hold a single group value. If property has no value, no owner information will be emitted.", + "type": "string" + }, + "profiling": { + "title": "Profiling", + "default": { + "enabled": false, + "include_field_null_count": true, + "include_field_min_value": true, + "include_field_max_value": true + }, + "allOf": [ + { + "$ref": "#/definitions/IcebergProfilingConfig" + } + ] + } + }, + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AdlsSourceConfig": { + "title": "AdlsSourceConfig", + "description": "Common Azure credentials config.\n\nhttps://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python", + "type": "object", + "properties": { + "base_path": { + "title": "Base Path", + "description": "Base folder in hierarchical namespaces to start from.", + "default": "/", + "type": "string" + }, + "container_name": { + "title": "Container Name", + "description": "Azure storage account container name.", + "type": "string" + }, + "account_name": { + "title": "Account Name", + "description": "Name of the Azure storage account. See [Microsoft official documentation on how to create a storage account.](https://docs.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account)", + "type": "string" + }, + "account_key": { + "title": "Account Key", + "description": "Azure storage account access key that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**", + "type": "string" + }, + "sas_token": { + "title": "Sas Token", + "description": "Azure storage account Shared Access Signature (SAS) token that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**", + "type": "string" + }, + "client_secret": { + "title": "Client Secret", + "description": "Azure client secret that can be used as a credential. 
**An account key, a SAS token or a client secret is required for authentication.**", + "type": "string" + }, + "client_id": { + "title": "Client Id", + "description": "Azure client (Application) ID required when a `client_secret` is used as a credential.", + "type": "string" + }, + "tenant_id": { + "title": "Tenant Id", + "description": "Azure tenant (Directory) ID required when a `client_secret` is used as a credential.", + "type": "string" + } + }, + "required": [ + "container_name", + "account_name" + ], + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "IcebergProfilingConfig": { + "title": "IcebergProfilingConfig", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "Whether profiling should be done.", + "default": false, + "type": "boolean" + }, + "include_field_null_count": { + "title": "Include Field Null Count", + "description": "Whether to profile for the number of nulls for each column.", + "default": true, + "type": "boolean" + }, + "include_field_min_value": { + "title": "Include Field Min Value", + "description": "Whether to profile for the min value of numeric columns.", + "default": true, + "type": "boolean" + }, + "include_field_max_value": { + "title": "Include Field Max Value", + "description": "Whether to profile for the max value of numeric columns.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + } + } +} +``` + + +
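
As a complement to the ADLS-based starter recipe above, the sketch below shows a recipe that crawls a local filesystem path instead and reads ownership from Iceberg table properties. It only uses options documented in the config details above (`localfs`, `max_path_depth`, `platform_instance`, `user_ownership_property`, `group_ownership_property`, `table_pattern`); the path, property names, and instance name are illustrative placeholders, not required values.

```yaml
source:
  type: "iceberg"
  config:
    env: PROD
    # Crawl a local warehouse directory instead of ADLS
    # (only one filesystem type can be configured per recipe).
    localfs: /data/warehouse/iceberg
    # Folders deeper than this are silently ignored during discovery.
    max_path_depth: 3
    platform_instance: local_iceberg_warehouse
    # Read owners from these Iceberg table properties, if present.
    user_ownership_property: owner
    group_ownership_property: owning_team
    table_pattern:
      allow:
        - marketing.*

sink:
  # sink configs
```
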
+ +### Concept Mapping + + + + +This ingestion source maps the following Source System Concepts to DataHub Concepts: + + + +| Source Concept | DataHub Concept | Notes | +| --------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `iceberg` | [Data Platform](docs/generated/metamodel/entities/dataPlatform.md) | | +| Table | [Dataset](docs/generated/metamodel/entities/dataset.md) | Each Iceberg table maps to a Dataset named using the parent folders. If a table is stored under `my/namespace/table`, the dataset name will be `my.namespace.table`. If a [Platform Instance](/docs/platform-instances/) is configured, it will be used as a prefix: `.my.namespace.table`. | +| [Table property](https://iceberg.apache.org/docs/latest/configuration/#table-properties) | [User (a.k.a CorpUser)](docs/generated/metamodel/entities/corpuser.md) | The value of a table property can be used as the name of a CorpUser owner. This table property name can be configured with the source option `user_ownership_property`. | +| [Table property](https://iceberg.apache.org/docs/latest/configuration/#table-properties) | CorpGroup | The value of a table property can be used as the name of a CorpGroup owner. This table property name can be configured with the source option `group_ownership_property`. | +| Table parent folders (excluding [warehouse catalog location](https://iceberg.apache.org/docs/latest/configuration/#catalog-properties)) | Container | Available in a future release | +| [Table schema](https://iceberg.apache.org/spec/#schemas-and-data-types) | SchemaField | Maps to the fields defined within the Iceberg table schema definition. | + +## Troubleshooting + +### [Common Issue] + +[Provide description of common issues with this integration and steps to resolve] + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.iceberg.iceberg.IcebergSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Iceberg, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/json-schema.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/json-schema.md new file mode 100644 index 0000000000000..06a5514d7cb32 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/json-schema.md @@ -0,0 +1,238 @@ +--- +sidebar_position: 20 +title: JSON Schemas +slug: /generated/ingestion/sources/json-schema +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/json-schema.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# JSON Schemas + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue) + +### Important Capabilities + +| Capability | Status | Notes | +| ---------------------------------------------------------------------------------------------------------- | ------ | -------------------------------------------------------------------------------------------------------------- | +| Descriptions | ✅ | Extracts descriptions at top level and field level | +| [Detect Deleted Entities](../../../../metadata-ingestion/docs/dev_guides/stateful.md#stale-entity-removal) | ✅ | With stateful ingestion enabled, will remove entities from DataHub if they are no longer present in the source | +| Extract Ownership | ❌ | Does not currently support extracting ownership | +| Extract Tags | ❌ | Does not currently support extracting tags | +| [Platform Instance](../../../platform-instances.md) | ✅ | Supports platform instance via config | +| Schema Metadata | ✅ | Extracts schemas, following references | + +This source extracts metadata from a single JSON Schema or multiple JSON Schemas rooted at a particular path. +It performs reference resolution based on the `$ref` keyword. + +Metadata mapping: + +- Schemas are mapped to Datasets with sub-type Schema +- The name of the Schema (Dataset) is inferred from the `$id` property and if that is missing, the file name. +- Browse paths are minted based on the path + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[json-schema]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +pipeline_name: json_schema_ingestion +source: + type: json-schema + config: + path: # e.g. https://json.schemastore.org/petstore-v1.0.json + platform: # e.g. schemaregistry + # platform_instance: + stateful_ingestion: + enabled: true # recommended to have this turned on + +# sink configs if needed +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
path 
One of string(file-path), string(directory-path), string(uri)
| Set this to a single file-path or a directory-path (for recursive traversal) or a remote url. e.g. https://json.schemastore.org/petstore-v1.0.json | +|
platform 
string
| Set this to a platform that you want all schemas to live under. e.g. schemaregistry / schemarepo etc. | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
use_id_as_base_uri
boolean
| When enabled, uses the `$id` field in the json schema as the base uri for following references.
Default: False
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
uri_replace_pattern
URIReplacePattern
| Use this if URI-s need to be modified during reference resolution. Simple string match - replace capabilities are supported. | +|
uri_replace_pattern.match 
string
| Pattern to match on uri-s as part of reference resolution. See replace field | +|
uri_replace_pattern.replace 
string
| Pattern to replace with as part of reference resolution. See match field | +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "JsonSchemaSourceConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "path": { + "title": "Path", + "description": "Set this to a single file-path or a directory-path (for recursive traversal) or a remote url. e.g. https://json.schemastore.org/petstore-v1.0.json", + "anyOf": [ + { + "type": "string", + "format": "file-path" + }, + { + "type": "string", + "format": "directory-path" + }, + { + "type": "string", + "minLength": 1, + "maxLength": 65536, + "format": "uri" + } + ] + }, + "platform": { + "title": "Platform", + "description": "Set this to a platform that you want all schemas to live under. e.g. schemaregistry / schemarepo etc.", + "type": "string" + }, + "use_id_as_base_uri": { + "title": "Use Id As Base Uri", + "description": "When enabled, uses the `$id` field in the json schema as the base uri for following references.", + "default": false, + "type": "boolean" + }, + "uri_replace_pattern": { + "title": "Uri Replace Pattern", + "description": "Use this if URI-s need to be modified during reference resolution. Simple string match - replace capabilities are supported.", + "allOf": [ + { + "$ref": "#/definitions/URIReplacePattern" + } + ] + } + }, + "required": [ + "path", + "platform" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "URIReplacePattern": { + "title": "URIReplacePattern", + "type": "object", + "properties": { + "match": { + "title": "Match", + "description": "Pattern to match on uri-s as part of reference resolution. 
See replace field", + "type": "string" + }, + "replace": { + "title": "Replace", + "description": "Pattern to replace with as part of reference resolution. See match field", + "type": "string" + } + }, + "required": [ + "match", + "replace" + ], + "additionalProperties": false + } + } +} +``` + + +
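
The starter recipe above covers the single-file case. When schemas reference each other through URIs that differ from where the files actually live, the `uri_replace_pattern` option documented above can rewrite URIs during `$ref` resolution using a simple string match/replace. A minimal sketch follows; the directory path, platform name, and match/replace strings are illustrative placeholders.

```yaml
pipeline_name: json_schema_ingestion
source:
  type: json-schema
  config:
    # Recursively traverse a local directory of schemas.
    path: /schemas/registry
    platform: schemaregistry
    # Rewrite URIs before resolving $ref targets (simple string match/replace).
    uri_replace_pattern:
      match: "https://schemas.internal.example.com/"
      replace: "file:///schemas/registry/"
    stateful_ingestion:
      enabled: true

sink:
  # sink configs
```
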
+ +#### Configuration Notes + +- You must provide a `platform` field. Most organizations have custom project names for their schema repositories, so you can pick whatever name makes sense. For example, you might want to call your schema platform **schemaregistry**. After picking a custom platform, you can use the [put platform](../../../../docs/cli.md#put-platform) command to register your custom platform into DataHub. + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.schema.json_schema.JsonSchemaSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/schema/json_schema.py) + +

Questions

+ +If you've got any questions on configuring ingestion for JSON Schemas, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/kafka-connect.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/kafka-connect.md new file mode 100644 index 0000000000000..7f4ffd732b1e6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/kafka-connect.md @@ -0,0 +1,371 @@ +--- +sidebar_position: 22 +title: Kafka Connect +slug: /generated/ingestion/sources/kafka-connect +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/kafka-connect.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Kafka Connect + +## Integration Details + +This plugin extracts the following: + +- Source and Sink Connectors in Kafka Connect as Data Pipelines +- For Source connectors - Data Jobs to represent lineage information between source dataset to Kafka topic per `{connector_name}:{source_dataset}` combination +- For Sink connectors - Data Jobs to represent lineage information between Kafka topic to destination dataset per `{connector_name}:{topic}` combination + +### Concept Mapping + +This ingestion source maps the following Source System Concepts to DataHub Concepts: + +| Source Concept | DataHub Concept | Notes | +| ------------------------------------------------------------------------------- | ----------------------------------------------------------------- | ----- | +| `"kafka-connect"` | [Data Platform](/docs/generated/metamodel/entities/dataPlatform/) | | +| [Connector](https://kafka.apache.org/documentation/#connect_connectorsandtasks) | [DataFlow](/docs/generated/metamodel/entities/dataflow/) | | +| Kafka Topic | [Dataset](/docs/generated/metamodel/entities/dataset/) | | + +## Current limitations + +Works only for + +- Source connectors: JDBC, Debezium, Mongo and Generic connectors with user-defined lineage graph +- Sink connectors: BigQuery + ![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| --------------------------------------------------- | ------ | ------------------ | +| [Platform Instance](../../../platform-instances.md) | ✅ | Enabled by default | + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[kafka-connect]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "kafka-connect" + config: + # Coordinates + connect_uri: "http://localhost:8083" + + # Credentials + username: admin + password: password + + # Optional + platform_instance_map: + bigquery: bigquery_platform_instance_id + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
cluster_name
string
| Cluster to ingest from.
Default: connect-cluster
| +|
connect_to_platform_map
map(str,map)
| | +|
connect_uri
string
| URI to connect to.
Default: http://localhost:8083/
| +|
convert_lineage_urns_to_lowercase
boolean
| Whether to convert the urns of ingested lineage dataset to lowercase
Default: False
| +|
password
string
| Kafka Connect password. | +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
platform_instance_map
map(str,string)
| | +|
username
string
| Kafka Connect username. | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
connector_patterns
AllowDenyPattern
| regex patterns for connectors to filter for ingestion.
Default: {'allow': ['.\*'], 'deny': [], 'ignoreCase': True}
| +|
connector_patterns.allow
array(string)
| | +|
connector_patterns.deny
array(string)
| | +|
connector_patterns.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
generic_connectors
array(object)
| | +|
generic_connectors.connector_name 
string
| | +|
generic_connectors.source_dataset 
string
| | +|
generic_connectors.source_platform 
string
| | +|
provided_configs
array(object)
| | +|
provided_configs.path_key 
string
| | +|
provided_configs.provider 
string
| | +|
provided_configs.value 
string
| | +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "KafkaConnectSourceConfig", + "description": "Any source that connects to a platform should inherit this class", + "type": "object", + "properties": { + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance_map": { + "title": "Platform Instance Map", + "description": "Platform instance mapping to use when constructing URNs. e.g.`platform_instance_map: { \"hive\": \"warehouse\" }`", + "type": "object", + "additionalProperties": { + "type": "string" + } + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "connect_uri": { + "title": "Connect Uri", + "description": "URI to connect to.", + "default": "http://localhost:8083/", + "type": "string" + }, + "username": { + "title": "Username", + "description": "Kafka Connect username.", + "type": "string" + }, + "password": { + "title": "Password", + "description": "Kafka Connect password.", + "type": "string" + }, + "cluster_name": { + "title": "Cluster Name", + "description": "Cluster to ingest from.", + "default": "connect-cluster", + "type": "string" + }, + "convert_lineage_urns_to_lowercase": { + "title": "Convert Lineage Urns To Lowercase", + "description": "Whether to convert the urns of ingested lineage dataset to lowercase", + "default": false, + "type": "boolean" + }, + "connector_patterns": { + "title": "Connector Patterns", + "description": "regex patterns for connectors to filter for ingestion.", + "default": { + "allow": [ + ".*" + ], + "deny": [], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "provided_configs": { + "title": "Provided Configs", + "description": "Provided Configurations", + "type": "array", + "items": { + "$ref": "#/definitions/ProvidedConfig" + } + }, + "connect_to_platform_map": { + "title": "Connect To Platform Map", + "description": "Platform instance mapping when multiple instances for a platform is available. Entry for a platform should be in either `platform_instance_map` or `connect_to_platform_map`. e.g.`connect_to_platform_map: { \"postgres-connector-finance-db\": \"postgres\": \"core_finance_instance\" }`", + "type": "object", + "additionalProperties": { + "type": "object", + "additionalProperties": { + "type": "string" + } + } + }, + "generic_connectors": { + "title": "Generic Connectors", + "description": "Provide lineage graph for sources connectors other than Confluent JDBC Source Connector, Debezium Source Connector, and Mongo Source Connector", + "default": [], + "type": "array", + "items": { + "$ref": "#/definitions/GenericConnectorConfig" + } + } + }, + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. 
See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "ProvidedConfig": { + "title": "ProvidedConfig", + "type": "object", + "properties": { + "provider": { + "title": "Provider", + "type": "string" + }, + "path_key": { + "title": "Path Key", + "type": "string" + }, + "value": { + "title": "Value", + "type": "string" + } + }, + "required": [ + "provider", + "path_key", + "value" + ], + "additionalProperties": false + }, + "GenericConnectorConfig": { + "title": "GenericConnectorConfig", + "type": "object", + "properties": { + "connector_name": { + "title": "Connector Name", + "type": "string" + }, + "source_dataset": { + "title": "Source Dataset", + "type": "string" + }, + "source_platform": { + "title": "Source Platform", + "type": "string" + } + }, + "required": [ + "connector_name", + "source_dataset", + "source_platform" + ], + "additionalProperties": false + } + } +} +``` + + +
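
When a platform has multiple instances, or when you use a connector type that this source cannot infer lineage for automatically, the `connect_to_platform_map` and `generic_connectors` options documented above can be combined in the recipe. The sketch below uses illustrative connector, database, and instance names.

```yaml
source:
  type: "kafka-connect"
  config:
    connect_uri: "http://localhost:8083"
    # Map a specific connector to the platform instance its datasets belong to.
    connect_to_platform_map:
      postgres-connector-finance-db:
        postgres: core_finance_instance
    # Provide lineage by hand for connectors other than the supported
    # JDBC, Debezium, and Mongo source connectors.
    generic_connectors:
      - connector_name: my_custom_source_connector
        source_platform: postgres
        source_dataset: mydb.public.customers

sink:
  # sink configs
```
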
+ +## Advanced Configurations + +Kafka Connect supports pluggable configuration providers which can load configuration data from external sources at runtime. These values are not available to DataHub ingestion source through Kafka Connect APIs. If you are using such provided configurations to specify connection url (database, etc) in Kafka Connect connector configuration then you will need also add these in `provided_configs` section in recipe for DataHub to generate correct lineage. + +```yml +# Optional mapping of provider configurations if using +provided_configs: + - provider: env + path_key: MYSQL_CONNECTION_URL + value: jdbc:mysql://test_mysql:3306/librarydb +``` + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.kafka_connect.KafkaConnectSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/kafka_connect.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Kafka Connect, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/kafka.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/kafka.md new file mode 100644 index 0000000000000..4013ee8302bad --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/kafka.md @@ -0,0 +1,445 @@ +--- +sidebar_position: 21 +title: Kafka +slug: /generated/ingestion/sources/kafka +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/kafka.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Kafka + +Extract Topics & Schemas from Apache Kafka or Confluent Cloud. +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| --------------------------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Descriptions | ✅ | Set dataset description to top level doc field for Avro schema | +| [Platform Instance](../../../platform-instances.md) | ✅ | For multiple Kafka clusters, use the platform_instance configuration | +| Schema Metadata | ✅ | Schemas associated with each topic are extracted from the schema registry. Avro and Protobuf (certified), JSON (incubating). Schema references are supported. | + +This plugin extracts the following: + +- Topics from the Kafka broker +- Schemas associated with each topic from the schema registry (Avro, Protobuf and JSON schemas are supported) + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[kafka]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "kafka" + config: + platform_instance: "YOUR_CLUSTER_ID" + connection: + bootstrap: "broker:9092" + schema_registry_url: http://localhost:8081 + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
disable_topic_record_naming_strategy
boolean
| Disables the utilization of the TopicRecordNameStrategy for Schema Registry subjects. For more information, visit: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#handling-differences-between-preregistered-and-client-derived-schemas:~:text=io.confluent.kafka.serializers.subject.TopicRecordNameStrategy
Default: False
| +|
ignore_warnings_on_schema_type
boolean
| Disables warnings reported for non-AVRO/Protobuf value or key schemas if set.
Default: False
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
schema_registry_class
string
| The fully qualified implementation class(custom) that implements the KafkaSchemaRegistryBase interface.
Default: datahub.ingestion.source.confluent_schema_registry...
| +|
topic_subject_map
map(str,string)
| | +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
connection
KafkaConsumerConnectionConfig
|
Default: {'bootstrap': 'localhost:9092', 'schema_registry_u...
| +|
connection.bootstrap
string
|
Default: localhost:9092
| +|
connection.client_timeout_seconds
integer
| The request timeout used when interacting with the Kafka APIs.
Default: 60
| +|
connection.consumer_config
object
| Extra consumer config serialized as JSON. These options will be passed into Kafka's DeserializingConsumer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md . | +|
connection.schema_registry_config
object
| Extra schema registry config serialized as JSON. These options will be passed into Kafka's SchemaRegistryClient. https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient | +|
connection.schema_registry_url
string
|
Default: http://localhost:8080/schema-registry/api/
| +|
domain
map(str,AllowDenyPattern)
| A class to store allow deny regexes | +|
domain.`key`.allow
array(string)
| | +|
domain.`key`.deny
array(string)
| | +|
domain.`key`.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
topic_patterns
AllowDenyPattern
|
Default: {'allow': ['.\*'], 'deny': ['^\_.\*'], 'ignoreCase': ...
| +|
topic_patterns.allow
array(string)
| | +|
topic_patterns.deny
array(string)
| | +|
topic_patterns.ignoreCase
boolean
| Whether to ignore case sensitivity during pattern matching.
Default: True
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "KafkaSourceConfig", + "description": "Base configuration class for stateful ingestion for source configs to inherit from.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "connection": { + "title": "Connection", + "default": { + "bootstrap": "localhost:9092", + "schema_registry_url": "http://localhost:8080/schema-registry/api/", + "schema_registry_config": {}, + "client_timeout_seconds": 60, + "consumer_config": {} + }, + "allOf": [ + { + "$ref": "#/definitions/KafkaConsumerConnectionConfig" + } + ] + }, + "topic_patterns": { + "title": "Topic Patterns", + "default": { + "allow": [ + ".*" + ], + "deny": [ + "^_.*" + ], + "ignoreCase": true + }, + "allOf": [ + { + "$ref": "#/definitions/AllowDenyPattern" + } + ] + }, + "domain": { + "title": "Domain", + "description": "A map of domain names to allow deny patterns. Domains can be urn-based (`urn:li:domain:13ae4d85-d955-49fc-8474-9004c663a810`) or bare (`13ae4d85-d955-49fc-8474-9004c663a810`).", + "default": {}, + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/AllowDenyPattern" + } + }, + "topic_subject_map": { + "title": "Topic Subject Map", + "description": "Provides the mapping for the `key` and the `value` schemas of a topic to the corresponding schema registry subject name. Each entry of this map has the form `-key`:`` and `-value`:`` for the key and the value schemas associated with the topic, respectively. This parameter is mandatory when the [RecordNameStrategy](https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#how-the-naming-strategies-work) is used as the subject naming strategy in the kafka schema registry. NOTE: When provided, this overrides the default subject name resolution even when the `TopicNameStrategy` or the `TopicRecordNameStrategy` are used.", + "default": {}, + "type": "object", + "additionalProperties": { + "type": "string" + } + }, + "schema_registry_class": { + "title": "Schema Registry Class", + "description": "The fully qualified implementation class(custom) that implements the KafkaSchemaRegistryBase interface.", + "default": "datahub.ingestion.source.confluent_schema_registry.ConfluentSchemaRegistry", + "type": "string" + }, + "ignore_warnings_on_schema_type": { + "title": "Ignore Warnings On Schema Type", + "description": "Disables warnings reported for non-AVRO/Protobuf value or key schemas if set.", + "default": false, + "type": "boolean" + }, + "disable_topic_record_naming_strategy": { + "title": "Disable Topic Record Naming Strategy", + "description": "Disables the utilization of the TopicRecordNameStrategy for Schema Registry subjects. 
For more information, visit: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#handling-differences-between-preregistered-and-client-derived-schemas:~:text=io.confluent.kafka.serializers.subject.TopicRecordNameStrategy", + "default": false, + "type": "boolean" + } + }, + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." + } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + }, + "KafkaConsumerConnectionConfig": { + "title": "KafkaConsumerConnectionConfig", + "description": "Configuration class for holding connectivity information for Kafka consumers", + "type": "object", + "properties": { + "bootstrap": { + "title": "Bootstrap", + "default": "localhost:9092", + "type": "string" + }, + "schema_registry_url": { + "title": "Schema Registry Url", + "default": "http://localhost:8080/schema-registry/api/", + "type": "string" + }, + "schema_registry_config": { + "title": "Schema Registry Config", + "description": "Extra schema registry config serialized as JSON. These options will be passed into Kafka's SchemaRegistryClient. https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient", + "type": "object" + }, + "client_timeout_seconds": { + "title": "Client Timeout Seconds", + "description": "The request timeout used when interacting with the Kafka APIs.", + "default": 60, + "type": "integer" + }, + "consumer_config": { + "title": "Consumer Config", + "description": "Extra consumer config serialized as JSON. These options will be passed into Kafka's DeserializingConsumer. 
See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md .", + "type": "object" + } + }, + "additionalProperties": false + }, + "AllowDenyPattern": { + "title": "AllowDenyPattern", + "description": "A class to store allow deny regexes", + "type": "object", + "properties": { + "allow": { + "title": "Allow", + "description": "List of regex patterns to include in ingestion", + "default": [ + ".*" + ], + "type": "array", + "items": { + "type": "string" + } + }, + "deny": { + "title": "Deny", + "description": "List of regex patterns to exclude from ingestion.", + "default": [], + "type": "array", + "items": { + "type": "string" + } + }, + "ignoreCase": { + "title": "Ignorecase", + "description": "Whether to ignore case sensitivity during pattern matching.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + } + } +} +``` + + +
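
Topic filtering is controlled by the `topic_patterns` option described above; by default every topic is allowed and internal topics starting with `_` are denied. The sketch below narrows ingestion to a couple of prefixes while keeping the internal-topic exclusion; the topic name patterns are illustrative placeholders.

```yaml
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: http://localhost:8081
    # Regex-based allow/deny filtering of topics.
    topic_patterns:
      allow:
        - 'orders\..*'
        - 'payments\..*'
      deny:
        - '^_.*' # keep excluding internal topics

sink:
  # sink configs
```
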
+ +:::note +Stateful Ingestion is available only when a Platform Instance is assigned to this source. +::: + +### Connecting to Confluent Cloud + +If using Confluent Cloud you can use a recipe like this. In this `consumer_config.sasl.username` and `consumer_config.sasl.password` are the API credentials that you get (in the Confluent UI) from your cluster -> Data Integration -> API Keys. `schema_registry_config.basic.auth.user.info` has API credentials for Confluent schema registry which you get (in Confluent UI) from Schema Registry -> API credentials. + +When creating API Key for the cluster ensure that the ACLs associated with the key are set like below. This is required for DataHub to read topic metadata from topics in Confluent Cloud. + +``` +Topic Name = * +Permission = ALLOW +Operation = DESCRIBE +Pattern Type = LITERAL +``` + +```yml +source: + type: "kafka" + config: + platform_instance: "YOUR_CLUSTER_ID" + connection: + bootstrap: "abc-defg.eu-west-1.aws.confluent.cloud:9092" + consumer_config: + security.protocol: "SASL_SSL" + sasl.mechanism: "PLAIN" + sasl.username: "${CLUSTER_API_KEY_ID}" + sasl.password: "${CLUSTER_API_KEY_SECRET}" + schema_registry_url: "https://abc-defgh.us-east-2.aws.confluent.cloud" + schema_registry_config: + basic.auth.user.info: "${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}" + +sink: + # sink configs +``` + +If you are trying to add domains to your topics you can use a configuration like below. + +```yml +source: + type: "kafka" + config: + # ...connection block + domain: + "urn:li:domain:13ae4d85-d955-49fc-8474-9004c663a810": + allow: + - ".*" + "urn:li:domain:d6ec9868-6736-4b1f-8aa6-fee4c5948f17": + deny: + - ".*" +``` + +Note that the `domain` in config above can be either an _urn_ or a domain _id_ (i.e. `urn:li:domain:13ae4d85-d955-49fc-8474-9004c663a810` or simply `13ae4d85-d955-49fc-8474-9004c663a810`). The Domain should exist in your DataHub instance before ingesting data into the Domain. To create a Domain on DataHub, check out the [Domains User Guide](/docs/domains/). + +If you are using a non-default subject naming strategy in the schema registry, such as [RecordNameStrategy](https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#how-the-naming-strategies-work), the mapping for the topic's key and value schemas to the schema registry subject names should be provided via `topic_subject_map` as shown in the configuration below. + +```yml +source: + type: "kafka" + config: + # ...connection block + # Defines the mapping for the key & value schemas associated with a topic & the subject name registered with the + # kafka schema registry. + topic_subject_map: + # Defines both key & value schema for topic 'my_topic_1' + "my_topic_1-key": "io.acryl.Schema1" + "my_topic_1-value": "io.acryl.Schema2" + # Defines only the value schema for topic 'my_topic_2' (the topic doesn't have a key schema). + "my_topic_2-value": "io.acryl.Schema3" +``` + +### Custom Schema Registry + +The Kafka Source uses the schema registry to figure out the schema associated with both `key` and `value` for the topic. +By default it uses the [Confluent's Kafka Schema registry](https://docs.confluent.io/platform/current/schema-registry/index.html) +and supports the `AVRO` and `PROTOBUF` schema types. 
+ +If you're using a custom schema registry, or you are using schema type other than `AVRO` or `PROTOBUF`, then you can provide your own +custom implementation of the `KafkaSchemaRegistryBase` class, and implement the `get_schema_metadata(topic, platform_urn)` method that +given a topic name would return object of `SchemaMetadata` containing schema for that topic. Please refer +`datahub.ingestion.source.confluent_schema_registry::ConfluentSchemaRegistry` for sample implementation of this class. + +```python +class KafkaSchemaRegistryBase(ABC): + @abstractmethod + def get_schema_metadata( + self, topic: str, platform_urn: str + ) -> Optional[SchemaMetadata]: + pass +``` + +The custom schema registry class can be configured using the `schema_registry_class` config param of the `kafka` source as shown below. + +```YAML +source: + type: "kafka" + config: + # Set the custom schema registry implementation class + schema_registry_class: "datahub.ingestion.source.confluent_schema_registry.ConfluentSchemaRegistry" + # Coordinates + connection: + bootstrap: "broker:9092" + schema_registry_url: http://localhost:8081 + +# sink configs +``` + +### Limitations of `PROTOBUF` schema types implementation + +The current implementation of the support for `PROTOBUF` schema type has the following limitations: + +- Recursive types are not supported. +- If the schemas of different topics define a type in the same package, the source would raise an exception. + +In addition to this, maps are represented as arrays of messages. The following message, + +``` +message MessageWithMap { + map map_1 = 1; +} +``` + +becomes: + +``` +message Map1Entry { + int key = 1; + string value = 2/ +} +message MessageWithMap { + repeated Map1Entry map_1 = 1; +} +``` + +### Code Coordinates + +- Class Name: `datahub.ingestion.source.kafka.KafkaSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/kafka.py) + +

Questions

+ +If you've got any questions on configuring ingestion for Kafka, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/ldap.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/ldap.md new file mode 100644 index 0000000000000..89e230f1724cf --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/ldap.md @@ -0,0 +1,248 @@ +--- +sidebar_position: 23 +title: LDAP +slug: /generated/ingestion/sources/ldap +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/ldap.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# LDAP + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +This plugin extracts the following: + +- People +- Names, emails, titles, and manager information for each person +- List of groups + +### CLI based Ingestion + +#### Install the Plugin + +```shell +pip install 'acryl-datahub[ldap]' +``` + +### Starter Recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../../../../metadata-ingestion/README.md#recipes). + +```yaml +source: + type: "ldap" + config: + # Coordinates + ldap_server: ldap://localhost + + # Credentials + ldap_user: "cn=admin,dc=example,dc=org" + ldap_password: "admin" + + # Options + base_dn: "dc=example,dc=org" + +sink: + # sink configs +``` + +### Config Details + + + + +Note that a `.` is used to denote nested fields in the YAML recipe. + +
+ +| Field | Description | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +|
base_dn 
string
| LDAP DN. | +|
ldap_password 
string
| LDAP password. | +|
ldap_server 
string
| LDAP server URL. | +|
ldap_user 
string
| LDAP user. | +|
attrs_list
array(string)
| | +|
custom_props_list
array(string)
| | +|
drop_missing_first_last_name
boolean
| If set to true, any users without first and last names will be dropped.
Default: True
| +|
filter
string
| LDAP extractor filter.
Default: (objectClass=\*)
| +|
group_attrs_map
object
|
Default: {}
| +|
manager_filter_enabled
boolean
| Use LDAP extractor filter to search managers.
Default: True
| +|
manager_pagination_enabled
boolean
| Use pagination while search for managers (enabled by default).
Default: True
| +|
page_size
integer
| Size of each page to fetch when extracting metadata.
Default: 20
| +|
platform_instance
string
| The instance of the platform that all assets produced by this recipe belong to | +|
user_attrs_map
object
|
Default: {}
| +|
env
string
| The environment that all assets produced by this connector belong to
Default: PROD
| +|
stateful_ingestion
StatefulStaleMetadataRemovalConfig
| Base specialized config for Stateful Ingestion with stale metadata removal capability. | +|
stateful_ingestion.enabled
boolean
| The type of the ingestion state provider registered with datahub.
Default: False
| +|
stateful_ingestion.remove_stale_metadata
boolean
| Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
| + +
+
+ + +The [JSONSchema](https://json-schema.org/) for this configuration is inlined below. + +```javascript +{ + "title": "LDAPSourceConfig", + "description": "Config used by the LDAP Source.", + "type": "object", + "properties": { + "env": { + "title": "Env", + "description": "The environment that all assets produced by this connector belong to", + "default": "PROD", + "type": "string" + }, + "platform_instance": { + "title": "Platform Instance", + "description": "The instance of the platform that all assets produced by this recipe belong to", + "type": "string" + }, + "stateful_ingestion": { + "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig" + }, + "ldap_server": { + "title": "Ldap Server", + "description": "LDAP server URL.", + "type": "string" + }, + "ldap_user": { + "title": "Ldap User", + "description": "LDAP user.", + "type": "string" + }, + "ldap_password": { + "title": "Ldap Password", + "description": "LDAP password.", + "type": "string" + }, + "base_dn": { + "title": "Base Dn", + "description": "LDAP DN.", + "type": "string" + }, + "filter": { + "title": "Filter", + "description": "LDAP extractor filter.", + "default": "(objectClass=*)", + "type": "string" + }, + "attrs_list": { + "title": "Attrs List", + "description": "Retrieved attributes list", + "type": "array", + "items": { + "type": "string" + } + }, + "custom_props_list": { + "title": "Custom Props List", + "description": "A list of custom attributes to extract from the LDAP provider.", + "type": "array", + "items": { + "type": "string" + } + }, + "drop_missing_first_last_name": { + "title": "Drop Missing First Last Name", + "description": "If set to true, any users without first and last names will be dropped.", + "default": true, + "type": "boolean" + }, + "page_size": { + "title": "Page Size", + "description": "Size of each page to fetch when extracting metadata.", + "default": 20, + "type": "integer" + }, + "manager_filter_enabled": { + "title": "Manager Filter Enabled", + "description": "Use LDAP extractor filter to search managers.", + "default": true, + "type": "boolean" + }, + "manager_pagination_enabled": { + "title": "Manager Pagination Enabled", + "description": "Use pagination while search for managers (enabled by default).", + "default": true, + "type": "boolean" + }, + "user_attrs_map": { + "title": "User Attrs Map", + "default": {}, + "type": "object" + }, + "group_attrs_map": { + "title": "Group Attrs Map", + "default": {}, + "type": "object" + } + }, + "required": [ + "ldap_server", + "ldap_user", + "ldap_password", + "base_dn" + ], + "additionalProperties": false, + "definitions": { + "DynamicTypedStateProviderConfig": { + "title": "DynamicTypedStateProviderConfig", + "type": "object", + "properties": { + "type": { + "title": "Type", + "description": "The type of the state provider to use. For DataHub use `datahub`", + "type": "string" + }, + "config": { + "title": "Config", + "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)." 
+ } + }, + "required": [ + "type" + ], + "additionalProperties": false + }, + "StatefulStaleMetadataRemovalConfig": { + "title": "StatefulStaleMetadataRemovalConfig", + "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.", + "type": "object", + "properties": { + "enabled": { + "title": "Enabled", + "description": "The type of the ingestion state provider registered with datahub.", + "default": false, + "type": "boolean" + }, + "remove_stale_metadata": { + "title": "Remove Stale Metadata", + "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.", + "default": true, + "type": "boolean" + } + }, + "additionalProperties": false + } + } +} +``` + + +
+ +### Code Coordinates + +- Class Name: `datahub.ingestion.source.ldap.LDAPSource` +- Browse on [GitHub](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/ldap.py) + +
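If you prefer to run the starter recipe shown earlier on this page from Python rather than the `datahub` CLI, a minimal sketch using the ingestion `Pipeline` API is below. The connection values are the placeholders from the example recipe, and the `datahub-rest` sink address is an assumption; point it at your own GMS instance.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Programmatic equivalent of the LDAP starter recipe above.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "ldap",
            "config": {
                "ldap_server": "ldap://localhost",
                "ldap_user": "cn=admin,dc=example,dc=org",
                "ldap_password": "admin",
                "base_dn": "dc=example,dc=org",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # assumed local GMS
        },
    }
)

pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run reported errors
```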

Questions

+ +If you've got any questions on configuring ingestion for LDAP, feel free to ping us on [our Slack](https://slack.datahubproject.io). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/looker.md b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/looker.md new file mode 100644 index 0000000000000..9243e37607377 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/generated/ingestion/sources/looker.md @@ -0,0 +1,1395 @@ +--- +sidebar_position: 24 +title: Looker +slug: /generated/ingestion/sources/looker +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/generated/ingestion/sources/looker.md +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Looker + +There are 2 sources that provide integration with Looker + + + + + + + + + + +
Source ModuleDocumentation
+ +`looker` + + + +This plugin extracts the following: + +- Looker dashboards, dashboard elements (charts) and explores +- Names, descriptions, URLs, chart types, input explores for the charts +- Schemas and input views for explores +- Owners of dashboards + +:::note +To get complete Looker metadata integration (including Looker views and lineage to the underlying warehouse tables), you must ALSO use the `lookml` module. +::: +[Read more...](#module-looker) + +
+ +`lookml` + + + +This plugin extracts the following: + +- LookML views from model files in a project +- Name, upstream table names, metadata for dimensions, measures, and dimension groups attached as tags +- If API integration is enabled (recommended), resolves table and view names by calling the Looker API, otherwise supports offline resolution of these names. + +:::note +To get complete Looker metadata integration (including Looker dashboards and charts and lineage to the underlying Looker views, you must ALSO use the `looker` source module. +::: +[Read more...](#module-lookml) + +
+ +## Module `looker` + +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen) + +### Important Capabilities + +| Capability | Status | Notes | +| --------------------------------------------------- | ------ | ------------------------------------------------------------ | +| Dataset Usage | ✅ | Enabled by default, configured using `extract_usage_history` | +| Descriptions | ✅ | Enabled by default | +| Extract Ownership | ✅ | Enabled by default, configured using `extract_owners` | +| [Platform Instance](../../../platform-instances.md) | ❌ | Not supported | + +This plugin extracts the following: + +- Looker dashboards, dashboard elements (charts) and explores +- Names, descriptions, URLs, chart types, input explores for the charts +- Schemas and input views for explores +- Owners of dashboards + +:::note +To get complete Looker metadata integration (including Looker views and lineage to the underlying warehouse tables), you must ALSO use the `lookml` module. +::: + +### Prerequisites + +#### Set up the right permissions + +You need to provide the following permissions for ingestion to work correctly. + +``` +access_data +explore +manage_models +see_datagroups +see_lookml +see_lookml_dashboards +see_looks +see_pdts +see_queries +see_schedules +see_sql +see_system_activity +see_user_dashboards +see_users +``` + +Here is an example permission set after configuration. + +

+ +

+ +#### Get an API key + +You need to get an API key for the account with the above privileges to perform ingestion. See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret. + +### Ingestion through UI + +The following video shows you how to get started with ingesting Looker metadata through the UI. + +:::note + +You will need to run `lookml` ingestion through the CLI after you have ingested Looker metadata through the UI. Otherwise you will not be able to see Looker Views and their lineage to your warehouse tables. + +::: + +
+ +

+ +### GraphQL + +- [searchAcrossEntities](/docs/graphql/queries/#searchacrossentities) +- You can try out the API on the demo instance's public GraphQL interface: [here](https://demo.datahubproject.io/api/graphiql) + +The same GraphQL API that powers the Search UI can be used +for integrations and programmatic use-cases. + +``` +# Example query +{ + searchAcrossEntities( + input: {types: [], query: "*", start: 0, count: 10, filters: [{field: "fieldTags", value: "urn:li:tag:Dimension"}]} + ) { + start + count + total + searchResults { + entity { + type + ... on Dataset { + urn + type + platform { + name + } + name + } + } + } + } +} +``` + +### Searching at Scale + +For queries that return more than 10k entities we recommend using the [scrollAcrossEntities](/docs/graphql/queries/#scrollacrossentities) GraphQL API: + +``` +# Example query +{ + scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10}) { + nextScrollId + count + searchResults { + entity { + type + ... on Dataset { + urn + type + platform { + name + } + name + } + } + } + } +} +``` + +This will return a response containing a `nextScrollId` value which must be used in subsequent queries to retrieve more data, i.e: + +``` +{ + scrollAcrossEntities(input: + { types: [DATASET], query: "*", count: 10, + scrollId: "eyJzb3J0IjpbMy4wLCJ1cm46bGk6ZGF0YXNldDoodXJuOmxpOmRhdGFQbGF0Zm9ybTpiaWdxdWVyeSxiaWdxdWVyeS1wdWJsaWMtZGF0YS5jb3ZpZDE5X2dlb3RhYl9tb2JpbGl0eV9pbXBhY3QucG9ydF90cmFmZmljLFBST0QpIl0sInBpdElkIjpudWxsLCJleHBpcmF0aW9uVGltZSI6MH0="} + ) { + nextScrollId + count + searchResults { + entity { + type + ... on Dataset { + urn + type + platform { + name + } + name + } + } + } + } +} +``` + +In order to complete scrolling through all of the results, continue to request data in batches until the `nextScrollId` returned is null or undefined. + +### DataHub Blog + +- [Using DataHub for Search & Discovery](https://blog.datahubproject.io/using-datahub-for-search-discovery-fa309089be22) + +## Customizing Search + +It is possible to completely customize search ranking, filtering, and queries using a search configuration yaml file. +This no-code solution provides the ability to extend, or replace, the Elasticsearch-based search functionality. The +only limitation is that the information used in the query/ranking/filtering must be present in the entities' document, +however this does include `customProperties`, `tags`, `terms`, `domain`, as well as many additional fields. + +Additionally, multiple customizations can be applied to different query strings. A regex is applied to the search query +to determine which customized search profile to use. This means a different query/ranking/filtering can be applied to +a `select all`/`*` query or one that contains an actual query. + +Search results (excluding select `*`) are a balance between relevancy and the scoring function. In +general, when trying to improve relevancy, focus on changing the query in the `boolQuery` section and rely on the +`functionScore` for surfacing the _importance_ in the case of a relevancy tie. Consider the scenario +where a dataset named `orders` exists in multiple places. The relevancy between the dataset with the **name** `orders` and +the **term** `orders` is the same, however one location may be more important and thus the function score preferred. + +**Note:** The customized query is a pass-through to Elasticsearch and must comply with their API, syntax errors are possible. 
+It is recommended to test the customized queries prior to production deployment and knowledge of the Elasticsearch query +language is required. + +### Enable Custom Search + +The following environment variables on GMS control whether a search configuration is enabled and the location of the +configuration file. + +Enable Custom Search: + +```shell +ELASTICSEARCH_QUERY_CUSTOM_CONFIG_ENABLED=true +``` + +Custom Search File Location: + +```shell +ELASTICSEARCH_QUERY_CUSTOM_CONFIG_FILE=search_config.yml +``` + +The location of the configuration file can be on the Java classpath or the local filesystem. A default configuration +file is included with the GMS jar with the name `search_config.yml`. + +### Search Configuration + +The search configuration yaml contains a simple list of configuration profiles selected using the `queryRegex`. If a +single profile is desired, a catch-all regex of `.*` can be used. + +The list of search configurations can be grouped into 4 general sections. + +1. `queryRegex` - Responsible for selecting the search customization based on the [regex matching](https://www.w3schools.com/java/java_regex.asp) the search query string. + _The first match is applied._ +2. Built-in query booleans - There are 3 built-in queries which can be individually enabled/disabled. These include + the `simple query string`[[1]](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-simple-query-string-query.html), + `match phrase prefix`[[2]](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-match-query-phrase-prefix.html), and + `exact match`[[3]](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-term-query.html) queries, + enabled with the following booleans + respectively [`simpleQuery`, `prefixMatchQuery`, `exactMatchQuery`] +3. `boolQuery` - The base Elasticsearch `boolean query`[[4](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-bool-query.html)]. + If enabled in #2 above, those queries will + appear in the `should` section of the `boolean query`[[4](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-bool-query.html)]. +4. `functionScore` - The Elasticsearch `function score`[[5](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-function-score-query.html#score-functions)] section of the overall query. + +### Examples + +These examples assume a match-all `queryRegex` of `.*` so that it would impact any search query for simplicity. + +#### Example 1: Ranking By Tags/Terms + +Boost entities with tags of `primary` or `gold` and an example glossary term's uuid. + +```yaml +queryConfigurations: + - queryRegex: .* + + simpleQuery: true + prefixMatchQuery: true + exactMatchQuery: true + + functionScore: + functions: + - filter: + terms: + tags.keyword: + - urn:li:tag:primary + - urn:li:tag:gold + weight: 3.0 + + - filter: + terms: + glossaryTerms.keyword: + - urn:li:glossaryTerm:9afa9a59-93b2-47cb-9094-aa342eec24ad + weight: 3.0 + + score_mode: multiply + boost_mode: multiply +``` + +#### Example 2: Preferred Data Platform + +Boost the `urn:li:dataPlatform:hive` platform. 
+ +```yaml +queryConfigurations: + - queryRegex: .* + + simpleQuery: true + prefixMatchQuery: true + exactMatchQuery: true + + functionScore: + functions: + - filter: + terms: + platform.keyword: + - urn:li:dataPlatform:hive + weight: 3.0 + score_mode: multiply + boost_mode: multiply +``` + +#### Example 3: Exclusion & Bury + +This configuration extends the 3 built-in queries with a rule to exclude `deprecated` entities from search results +because they are not generally relevant as well as reduces the score of `materialized`. + +```yaml +queryConfigurations: + - queryRegex: .* + + simpleQuery: true + prefixMatchQuery: true + exactMatchQuery: true + + boolQuery: + must_not: + term: + deprecated: + value: true + + functionScore: + functions: + - filter: + term: + materialized: + value: true + weight: 0.5 + score_mode: multiply + boost_mode: multiply +``` + +## FAQ and Troubleshooting + +**How are the results ordered?** + +The order of the search results is based on the weight what Datahub gives them based on our search algorithm. The current algorithm in OSS DataHub is based on a text-match score from Elasticsearch. + +**Where to find more information?** + +The sample queries here are non exhaustive. [The link here](https://demo.datahubproject.io/tag/urn:li:tag:Searchable) shows the current list of indexed fields for each entity inside Datahub. Click on the fields inside each entity and see which field has the tag `Searchable`. +However, it does not tell you the specific attribute name to use for specialized searches. One way to do so is to inspect the ElasticSearch indices, for example: +`curl http://localhost:9200/_cat/indices` returns all the ES indices in the ElasticSearch container. + +``` +yellow open chartindex_v2_1643510690325 bQO_RSiCSUiKJYsmJClsew 1 1 2 0 8.5kb 8.5kb +yellow open mlmodelgroupindex_v2_1643510678529 OjIy0wb7RyKqLz3uTENRHQ 1 1 0 0 208b 208b +yellow open dataprocessindex_v2_1643510676831 2w-IHpuiTUCs6e6gumpYHA 1 1 0 0 208b 208b +yellow open corpgroupindex_v2_1643510673894 O7myCFlqQWKNtgsldzBS6g 1 1 3 0 16.8kb 16.8kb +yellow open corpuserindex_v2_1643510672335 0rIe_uIQTjme5Wy61MFbaw 1 1 6 2 32.4kb 32.4kb +yellow open datasetindex_v2_1643510688970 bjBfUEswSoSqPi3BP4iqjw 1 1 15 0 29.2kb 29.2kb +yellow open dataflowindex_v2_1643510681607 N8CMlRFvQ42rnYMVDaQJ2g 1 1 1 0 10.2kb 10.2kb +yellow open dataset_datasetusagestatisticsaspect_v1_1643510694706 kdqvqMYLRWq1oZt1pcAsXQ 1 1 4 0 8.9kb 8.9kb +yellow open .ds-datahub_usage_event-000003 YMVcU8sHTFilUwyI4CWJJg 1 1 186 0 203.9kb 203.9kb +yellow open datajob_datahubingestioncheckpointaspect_v1 nTXJf7C1Q3GoaIJ71gONxw 1 1 0 0 208b 208b +yellow open .ds-datahub_usage_event-000004 XRFwisRPSJuSr6UVmmsCsg 1 1 196 0 165.5kb 165.5kb +yellow open .ds-datahub_usage_event-000005 d0O6l5wIRLOyG6iIfAISGw 1 1 77 0 108.1kb 108.1kb +yellow open dataplatformindex_v2_1643510671426 _4SIIhfAT8yq_WROufunXA 1 1 0 0 208b 208b +yellow open mlmodeldeploymentindex_v2_1643510670629 n81eJIypSp2Qx-fpjZHgRw 1 1 0 0 208b 208b +yellow open .ds-datahub_usage_event-000006 oyrWKndjQ-a8Rt1IMD9aSA 1 1 143 0 127.1kb 127.1kb +yellow open mlfeaturetableindex_v2_1643510677164 iEXPt637S1OcilXpxPNYHw 1 1 5 0 8.9kb 8.9kb +yellow open .ds-datahub_usage_event-000001 S9EnGj64TEW8O3sLUb9I2Q 1 1 257 0 163.9kb 163.9kb +yellow open .ds-datahub_usage_event-000002 2xJyvKG_RYGwJOG9yq8pJw 1 1 44 0 155.4kb 155.4kb +yellow open dataset_datasetprofileaspect_v1_1643510693373 uahwTHGRRAC7w1c2VqVy8g 1 1 31 0 18.9kb 18.9kb +yellow open mlprimarykeyindex_v2_1643510687579 
MUcmT8ASSASzEpLL98vrWg 1 1 7 0 9.5kb 9.5kb +yellow open glossarytermindex_v2_1643510686127 cQL8Pg6uQeKfMly9GPhgFQ 1 1 3 0 10kb 10kb +yellow open datajob_datahubingestionrunsummaryaspect_v1 rk22mIsDQ02-52MpWLm1DA 1 1 0 0 208b 208b +yellow open mlmodelindex_v2_1643510675399 gk-WSTVjRZmkDU5ggeFSqg 1 1 1 0 10.3kb 10.3kb +yellow open dashboardindex_v2_1643510691686 PQjSaGhTRqWW6zYjcqXo6Q 1 1 1 0 8.7kb 8.7kb +yellow open datahubpolicyindex_v2_1643510671774 ZyTrYx3-Q1e-7dYq1kn5Gg 1 1 0 0 208b 208b +yellow open datajobindex_v2_1643510682977 K-rbEyjBS6ew5uOQQS4sPw 1 1 2 0 11.3kb 11.3kb +yellow open datahubretentionindex_v2 8XrQTPwRTX278mx1SrNwZA 1 1 0 0 208b 208b +yellow open glossarynodeindex_v2_1643510678826 Y3_bCz0YR2KPwCrrVngDdA 1 1 1 0 7.4kb 7.4kb +yellow open system_metadata_service_v1 36spEDbDTdKgVlSjE8t-Jw 1 1 387 8 63.2kb 63.2kb +yellow open schemafieldindex_v2_1643510684410 tZ1gC3haTReRLmpCxirVxQ 1 1 0 0 208b 208b +yellow open mlfeatureindex_v2_1643510680246 aQO5HF0mT62Znn-oIWBC8A 1 1 20 0 17.4kb 17.4kb +yellow open tagindex_v2_1643510684785 PfnUdCUORY2fnF3I3W7HwA 1 1 3 1 18.6kb 18.6kb +``` + +The index name will vary from instance to instance. Indexed information about Datasets can be found in: +`curl http://localhost:9200/datasetindex_v2_1643510688970/_search?=pretty` + +example information of a dataset: + +``` +{ + "_index" : "datasetindex_v2_1643510688970", + "_type" : "_doc", + "_id" : "urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Akafka%2CSampleKafkaDataset%2CPROD%29", + "_score" : 1.0, + "_source" : { + "urn" : "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)", + "name" : "SampleKafkaDataset", + "browsePaths" : [ + "/prod/kafka/SampleKafkaDataset" + ], + "origin" : "PROD", + "customProperties" : [ + "prop2=pikachu", + "prop1=fakeprop" + ], + "hasDescription" : false, + "hasOwners" : true, + "owners" : [ + "urn:li:corpuser:jdoe", + "urn:li:corpuser:datahub" + ], + "fieldPaths" : [ + "[version=2.0].[type=boolean].field_foo_2", + "[version=2.0].[type=boolean].field_bar", + "[version=2.0].[key=True].[type=int].id" + ], + "fieldGlossaryTerms" : [ ], + "fieldDescriptions" : [ + "Foo field description", + "Bar field description", + "Id specifying which partition the message should go to" + ], + "fieldTags" : [ + "urn:li:tag:NeedsDocumentation" + ], + "platform" : "urn:li:dataPlatform:kafka" + } + }, +``` + + + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Metadata ingestion framework](../../metadata-ingestion/README.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/how/ui-tabs-guide.md b/docs-website/versioned_docs/version-0.10.4/docs/how/ui-tabs-guide.md new file mode 100644 index 0000000000000..a1a88751a67a2 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/how/ui-tabs-guide.md @@ -0,0 +1,26 @@ +--- +title: UI Tabs Guide +slug: /how/ui-tabs-guide +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/how/ui-tabs-guide.md +--- + +# UI Tabs Guide + +Some of the tabs in the UI might not be enabled by default. This guide is supposed to tell Admins of DataHub how to enable those UI tabs. + +## Datasets + +### Stats and Queries Tab + +To enable these tabs you need to use one of the usage sources which gets the relevant metadata from your sources and ingests them into DataHub. These usage sources are listed under other sources which support them e.g. 
[Snowflake source](../../docs/generated/ingestion/sources/snowflake.md), [BigQuery source](../../docs/generated/ingestion/sources/bigquery.md) + +### Validation Tab + +This tab is enabled if you use [Data Quality Integration with Great Expectations](../../metadata-ingestion/integration_docs/great-expectations.md). + +## Common to multiple entities + +### Properties Tab + +Properties are a catch-all bag for metadata not captured in other aspects stored for a Dataset. These are populated via the various source connectors when [metadata is ingested](../../metadata-ingestion/README.md). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/how/updating-datahub.md b/docs-website/versioned_docs/version-0.10.4/docs/how/updating-datahub.md new file mode 100644 index 0000000000000..84eb111e23703 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/how/updating-datahub.md @@ -0,0 +1,359 @@ +--- +title: Updating DataHub +slug: /how/updating-datahub +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/how/updating-datahub.md +--- + +# Updating DataHub + +This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version. + +## Next + +### Breaking Changes + +- #8201: Python SDK: In the DataFlow class, the `cluster` argument is deprecated in favor of `env`. +- #8263: Okta source config option `okta_profile_to_username_attr` default changed from `login` to `email`. + This determines which Okta profile attribute is used for the corresponding DataHub user + and thus may change what DataHub users are generated by the Okta source. And in a follow up `okta_profile_to_username_regex` has been set to `.*` which taken together with previous change brings the defaults in line with OIDC. +- #8331: For all sql-based sources that support profiling, you can no longer specify + `profile_table_level_only` together with `include_field_xyz` config options to ingest + certain column-level metrics. Instead, set `profile_table_level_only` to `false` and + individually enable / disable desired field metrics. +- #8451: The `bigquery-beta` and `snowflake-beta` source aliases have been dropped. Use `bigquery` and `snowflake` as the source type instead. +- #8472: Ingestion runs created with Pipeline.create will show up in the DataHub ingestion tab as CLI-based runs. To revert to the previous behavior of not showing these runs in DataHub, pass `no_default_report=True`. + +### Potential Downtime + +### Deprecations + +- #8198: In the Python SDK, the `PlatformKey` class has been renamed to `ContainerKey`. + +### Other notable Changes + +## 0.10.4 + +### Breaking Changes + +### Potential Downtime + +### Deprecations + +- #8045: With the introduction of custom ownership types, the `Owner` aspect has been updated where the `type` field is deprecated in favor of a new field `typeUrn`. This latter field is an urn reference to the new OwnershipType entity. GraphQL endpoints have been updated to use the new field. For pre-existing ownership aspect records, DataHub now has logic to map the old field to the new field. + +### Other notable Changes + +- #8191: Updates GMS's health check endpoint to account for its dependency on external components. Notably, at this time, elasticsearch. This means that DataHub operators can now use GMS health status more reliably. 
+ +## 0.10.3 + +### Breaking Changes + +- #7900: The `catalog_pattern` and `schema_pattern` options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using regex `^` in your patterns, this should not affect you. +- #7942: Renaming the `containerPath` aspect to `browsePathsV2`. This means any data with the aspect name `containerPath` will be invalid. We had not exposed this in the UI or used it anywhere, but it was a model we recently merged to open up other work. This should not affect many people if anyone at all unless you were manually creating `containerPath` data through ingestion on your instance. +- #8068: In the `datahub delete` CLI, if an `--entity-type` filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of dataset. +- #8068: In the `datahub delete` CLI, the `--start-time` and `--end-time` parameters are not required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use `--start-time min --end-time max`. + +### Potential Downtime + +### Deprecations + +- The signature of `Source.get_workunits()` is changed from `Iterable[WorkUnit]` to the more restrictive `Iterable[MetadataWorkUnit]`. +- Legacy usage creation via the `UsageAggregation` aspect, `/usageStats?action=batchIngest` GMS endpoint, and `UsageStatsWorkUnit` metadata-ingestion class are all deprecated. + +### Other notable Changes + +## 0.10.2 + +### Breaking Changes + +- #7016 Add `add_database_name_to_urn` flag to Oracle source which ensure that Dataset urns have the DB name as a prefix to prevent collision (.e.g. {database}.{schema}.{table}). ONLY breaking if you set this flag to true, otherwise behavior remains the same. +- The Airflow plugin no longer includes the DataHub Kafka emitter by default. Use `pip install acryl-datahub-airflow-plugin[datahub-kafka]` for Kafka support. +- The Airflow lineage backend no longer includes the DataHub Kafka emitter by default. Use `pip install acryl-datahub[airflow,datahub-kafka]` for Kafka support. +- Java SDK PatchBuilders have been modified in a backwards incompatible way to align more with the Python SDK and support more use cases. Any application utilizing the Java SDK for patch building may be affected on upgrading this dependency. + +### Deprecations + +- The docker image and script for updating from Elasticsearch 6 to 7 is no longer being maintained and will be removed from the `/contrib` section of + the repository. Please refer to older releases if needed. + +## 0.10.0 + +### Breaking Changes + +- #7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics for DataHub used in the `kafka-setup` docker image have been updated to be in-line with other DataHub components, for more info see our docs on [Configuring Kafka in DataHub + ](https://datahubproject.io/docs/how/kafka-config). They have been suffixed with `_TOPIC` where as now the correct suffix is `_TOPIC_NAME`. This change should not affect any user who is using default Kafka names. +- #6906 The Redshift source has been reworked and now also includes usage capabilities. The old Redshift source was renamed to `redshift-legacy`. The `redshift-usage` source has also been renamed to `redshift-usage-legacy` will be removed in the future. 
+ +### Potential Downtime + +- #6894 Search improvements requires reindexing indices. A `system-update` job will run which will set indices to read-only and create a backup/clone of each index. During the reindexing new components will be prevented from start-up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure this process can take 5 minutes to hours however as a rough estimate 1 hour for every 2.3 million entities. + +#### Helm Notes + +Helm without `--atomic`: The default timeout for an upgrade command is 5 minutes. If the reindex takes longer (depending on data size) it will continue to run in the background even though helm will report a failure. Allow this job to finish and then re-run the helm upgrade command. + +Helm with `--atomic`: In general, it is recommended to not use the `--atomic` setting for this particular upgrade since the system update job will be terminated before completion. If `--atomic` is preferred, then increase the timeout using the `--timeout` flag to account for the reindexing time (see note above for estimating this value). + +### Deprecations + +## 0.9.6 + +### Breaking Changes + +- #6742 The metadata file sink's output format no longer contains nested JSON strings for MCP aspects, but instead unpacks the stringified JSON into a real JSON object. The previous sink behavior can be recovered using the `legacy_nested_json_string` option. The file source is backwards compatible and supports both formats. +- #6901 The `env` and `database_alias` fields have been marked deprecated across all sources. We recommend using `platform_instance` where possible instead. + +### Potential Downtime + +### Deprecations + +- #6851 - Sources bigquery-legacy and bigquery-usage-legacy have been removed + +### Other notable Changes + +- If anyone faces issues with login please clear your cookies. Some security updates are part of this release. That may cause login issues until cookies are cleared. + +## 0.9.4 / 0.9.5 + +### Breaking Changes + +- #6243 apache-ranger authorizer is no longer the core part of DataHub GMS, and it is shifted as plugin. Please refer updated documentation [Configuring Authorization with Apache Ranger](./configuring-authorization-with-apache-ranger.md#configuring-your-datahub-deployment) for configuring `apache-ranger-plugin` in DataHub GMS. +- #6243 apache-ranger authorizer as plugin is not supported in DataHub Kubernetes deployment. +- #6243 Authentication and Authorization plugins configuration are removed from [application.yml](https://github.com/datahub-project/datahub/blob/master/metadata-service/configuration/src/main/resources/application.yml). Refer documentation [Migration Of Plugins From application.yml](../plugins.md#migration-of-plugins-from-applicationyml) for migrating any existing custom plugins. +- `datahub check graph-consistency` command has been removed. It was a beta API that we had considered but decided there are better solutions for this. So removing this. +- `graphql_url` option of `powerbi-report-server` source deprecated as the options is not used. +- #6789 BigQuery ingestion: If `enable_legacy_sharded_table_support` is set to False, sharded table names will be suffixed with \_yyyymmdd to make sure they don't clash with non-sharded tables. This means if stateful ingestion is enabled then old sharded tables will be recreated with a new id and attached tags/glossary terms/etc will need to be added again. 
_This behavior is not enabled by default yet, but will be enabled by default in a future release._ + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +- #6611 - Snowflake `schema_pattern` now accepts pattern for fully qualified schema name in format `.` by setting config `match_fully_qualified_names : True`. Current default `match_fully_qualified_names: False` is only to maintain backward compatibility. The config option `match_fully_qualified_names` will be deprecated in future and the default behavior will assume `match_fully_qualified_names: True`." +- #6636 - Sources `snowflake-legacy` and `snowflake-usage-legacy` have been removed. + +## 0.9.3 + +### Breaking Changes + +- The beta `datahub check graph-consistency` command has been removed. + +### Potential Downtime + +### Deprecations + +- PowerBI source: `workspace_id_pattern` is introduced in place of `workspace_id`. `workspace_id` is now deprecated and set for removal in a future version. + +### Other notable Changes + +## 0.9.2 + +- LookML source will only emit views that are reachable from explores while scanning your git repo. Previous behavior can be achieved by setting `emit_reachable_views_only` to False. +- LookML source will always lowercase urns for lineage edges from views to upstream tables. There is no fallback provided to previous behavior because it was inconsistent in application of lower-casing earlier. +- dbt config `node_type_pattern` which was previously deprecated has been removed. Use `entities_enabled` instead to control whether to emit metadata for sources, models, seeds, tests, etc. +- The dbt source will always lowercase urns for lineage edges to the underlying data platform. +- The DataHub Airflow lineage backend and plugin no longer support Airflow 1.x. You can still run DataHub ingestion in Airflow 1.x using the [PythonVirtualenvOperator](https://airflow.apache.org/docs/apache-airflow/1.10.15/_api/airflow/operators/python_operator/index.html?highlight=pythonvirtualenvoperator#airflow.operators.python_operator.PythonVirtualenvOperator). + +### Breaking Changes + +- #6570 `snowflake` connector now populates created and last modified timestamps for snowflake datasets and containers. This version of snowflake connector will not work with **datahub-gms** version older than `v0.9.3` + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## 0.9.1 + +### Breaking Changes + +- We have promoted `bigquery-beta` to `bigquery`. If you are using `bigquery-beta` then change your recipes to use the type `bigquery`. + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## 0.9.0 + +### Breaking Changes + +- Java version 11 or greater is required. +- For any of the GraphQL search queries, the input no longer supports value but instead now accepts a list of values. These values represent an OR relationship where the field value must match any of the values. + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## `v0.8.45` + +### Breaking Changes + +- The `getNativeUserInviteToken` and `createNativeUserInviteToken` GraphQL endpoints have been renamed to + `getInviteToken` and `createInviteToken` respectively. Additionally, both now accept an optional `roleUrn` parameter. + Both endpoints also now require the `MANAGE_POLICIES` privilege to execute, rather than `MANAGE_USER_CREDENTIALS` + privilege. 
+- One of the default policies shipped with DataHub (`urn:li:dataHubPolicy:7`, or `All Users - All Platform Privileges`) + has been edited to no longer include `MANAGE_POLICIES`. Its name has consequently been changed to + `All Users - All Platform Privileges (EXCEPT MANAGE POLICIES)`. This change was made to prevent all users from + effectively acting as superusers by default. + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## `v0.8.44` + +### Breaking Changes + +- Browse Paths have been upgraded to a new format to align more closely with the intention of the feature. + Learn more about the changes, including steps on upgrading, here: +- The dbt ingestion source's `disable_dbt_node_creation` and `load_schema` options have been removed. They were no longer necessary due to the recently added sibling entities functionality. +- The `snowflake` source now uses newer faster implementation (earlier `snowflake-beta`). Config properties `provision_role` and `check_role_grants` are not supported. Older `snowflake` and `snowflake-usage` are available as `snowflake-legacy` and `snowflake-usage-legacy` sources respectively. + +### Potential Downtime + +- [Helm] If you're using Helm, please ensure that your version of the `datahub-actions` container is bumped to `v0.0.7` or `head`. + This version contains changes to support running ingestion in debug mode. Previous versions are not compatible with this release. + Upgrading to helm chart version `0.2.103` will ensure that you have the compatible versions by default. + +### Deprecations + +### Other notable Changes + +## `v0.8.42` + +### Breaking Changes + +- Python 3.6 is no longer supported for metadata ingestion +- #5451 `GMS_HOST` and `GMS_PORT` environment variables deprecated in `v0.8.39` have been removed. Use `DATAHUB_GMS_HOST` and `DATAHUB_GMS_PORT` instead. +- #5478 DataHub CLI `delete` command when used with `--hard` option will delete soft-deleted entities which match the other filters given. +- #5471 Looker now populates `userEmail` in dashboard user usage stats. This version of looker connnector will not work with older version of **datahub-gms** if you have `extract_usage_history` looker config enabled. +- #5529 - `ANALYTICS_ENABLED` environment variable in **datahub-gms** is now deprecated. Use `DATAHUB_ANALYTICS_ENABLED` instead. +- #5485 `--include-removed` option was removed from delete CLI + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## `v0.8.41` + +### Breaking Changes + +- The `should_overwrite` flag in `csv-enricher` has been replaced with `write_semantics` to match the format used for other sources. See the [documentation](/docs/generated/ingestion/sources/csv/) for more details +- Closing an authorization hole in creating tags adding a Platform Privilege called `Create Tags` for creating tags. This is assigned to `datahub` root user, along + with default All Users policy. Notice: You may need to add this privilege (or `Manage Tags`) to existing users that need the ability to create tags on the platform. +- #5329 Below profiling config parameters are now supported in `BigQuery`: + + - profiling.profile_if_updated_since_days (default=1) + - profiling.profile_table_size_limit (default=1GB) + - profiling.profile_table_row_limit (default=50000) + + Set above parameters to `null` if you want older behaviour. 
+ +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## `v0.8.40` + +### Breaking Changes + +- #5240 `lineage_client_project_id` in `bigquery` source is removed. Use `storage_project_id` instead. + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +## `v0.8.39` + +### Breaking Changes + +- Refactored the `health` field of the `Dataset` GraphQL Type to be of type **list of HealthStatus** (was type **HealthStatus**). See [this PR](https://github.com/datahub-project/datahub/pull/5222/files) for more details. + +### Potential Downtime + +### Deprecations + +- #4875 Lookml view file contents will no longer be populated in custom_properties, instead view definitions will be always available in the View Definitions tab. +- #5208 `GMS_HOST` and `GMS_PORT` environment variables being set in various containers are deprecated in favour of `DATAHUB_GMS_HOST` and `DATAHUB_GMS_PORT`. +- `KAFKA_TOPIC_NAME` environment variable in **datahub-mae-consumer** and **datahub-gms** is now deprecated. Use `METADATA_AUDIT_EVENT_NAME` instead. +- `KAFKA_MCE_TOPIC_NAME` environment variable in **datahub-mce-consumer** and **datahub-gms** is now deprecated. Use `METADATA_CHANGE_EVENT_NAME` instead. +- `KAFKA_FMCE_TOPIC_NAME` environment variable in **datahub-mce-consumer** and **datahub-gms** is now deprecated. Use `FAILED_METADATA_CHANGE_EVENT_NAME` instead. + +### Other notable Changes + +- #5132 Profile tables in `snowflake` source only if they have been updated since configured (default: `1`) number of day(s). Update the config `profiling.profile_if_updated_since_days` as per your profiling schedule or set it to `None` if you want older behaviour. + +## `v0.8.38` + +### Breaking Changes + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +- Create & Revoke Access Tokens via the UI +- Create and Manage new users via the UI +- Improvements to Business Glossary UI +- FIX - Do not require reindexing to migrate to using the UI business glossary + +## `v0.8.36` + +### Breaking Changes + +- In this release we introduce a brand new Business Glossary experience. With this new experience comes some new ways of indexing data in order to make viewing and traversing the different levels of your Glossary possible. Therefore, you will have to [restore your indices](/docs/how/restore-indices/) in order for the new Glossary experience to work for users that already have existing Glossaries. If this is your first time using DataHub Glossaries, you're all set! + +### Potential Downtime + +### Deprecations + +### Other notable Changes + +- #4961 Dropped profiling is not reported by default as that caused a lot of spurious logging in some cases. Set `profiling.report_dropped_profiles` to `True` if you want older behaviour. + +## `v0.8.35` + +### Breaking Changes + +### Potential Downtime + +### Deprecations + +- #4875 Lookml view file contents will no longer be populated in custom_properties, instead view definitions will be always available in the View Definitions tab. + +### Other notable Changes + +## `v0.8.34` + +### Breaking Changes + +- #4644 Remove `database` option from `snowflake` source which was deprecated since `v0.8.5` +- #4595 Rename confusing config `report_upstream_lineage` to `upstream_lineage_in_report` in `snowflake` connector which was added in `0.8.32` + +### Potential Downtime + +### Deprecations + +- #4644 `host_port` option of `snowflake` and `snowflake-usage` sources deprecated as the name was confusing. 
Use `account_id` option instead. + +### Other notable Changes + +- #4760 `check_role_grants` option was added in `snowflake` to disable checking roles in `snowflake` as some people were reporting long run times when checking roles. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/lineage/airflow.md b/docs-website/versioned_docs/version-0.10.4/docs/lineage/airflow.md new file mode 100644 index 0000000000000..1ac84ab393f4e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/lineage/airflow.md @@ -0,0 +1,195 @@ +--- +title: Airflow Integration +slug: /lineage/airflow +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/lineage/airflow.md" +--- + +# Airflow Integration + +DataHub supports integration of + +- Airflow Pipeline (DAG) metadata +- DAG and Task run information as well as +- Lineage information when present + +You can use either the DataHub Airflow lineage plugin (recommended) or the Airflow lineage backend (deprecated). + +## Using Datahub's Airflow lineage plugin + +:::note + +The Airflow lineage plugin is only supported with Airflow version >= 2.0.2 or on MWAA with an Airflow version >= 2.0.2. + +If you're using Airflow 1.x, use the Airflow lineage plugin with acryl-datahub-airflow-plugin <= 0.9.1.0. + +::: + +This plugin registers a task success/failure callback on every task with a cluster policy and emits DataHub events from that. This allows this plugin to be able to register both task success as well as failures compared to the older Airflow Lineage Backend which could only support emitting task success. + +### Setup + +1. You need to install the required dependency in your airflow. + +```shell +pip install acryl-datahub-airflow-plugin +``` + +:::note + +The [DataHub Rest](../../metadata-ingestion/sink_docs/datahub.md#datahub-rest) emitter is included in the plugin package by default. To use [DataHub Kafka](../../metadata-ingestion/sink_docs/datahub.md#datahub-kafka) install `pip install acryl-datahub-airflow-plugin[datahub-kafka]`. + +::: + +2. Disable lazy plugin loading in your airflow.cfg. + On MWAA you should add this config to your [Apache Airflow configuration options](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html#configuring-2.0-airflow-override). + +```ini title="airflow.cfg" +[core] +lazy_load_plugins = False +``` + +3. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one. + + ```shell + # For REST-based: + airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password '' + # For Kafka-based (standard Kafka sink config can be passed via extras): + airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}' + ``` + +4. Add your `datahub_conn_id` and/or `cluster` to your `airflow.cfg` file if it is not align with the default values. See configuration parameters below + + **Configuration options:** + + | Name | Default value | Description | + | ------------------------------ | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | + | datahub.enabled | true | If the plugin should be enabled. | + | datahub.conn_id | datahub_rest_default | The name of the datahub connection you set in step 1. 
| + | datahub.cluster | prod | name of the airflow cluster | + | datahub.capture_ownership_info | true | If true, the owners field of the DAG will be capture as a DataHub corpuser. | + | datahub.capture_tags_info | true | If true, the tags field of the DAG will be captured as DataHub tags. | + | datahub.graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. | + +5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html). +6. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation. + +### How to validate installation + +1. Go and check in Airflow at Admin -> Plugins menu if you can see the DataHub plugin +2. Run an Airflow DAG. In the task logs, you should see Datahub related log messages like: + +``` +Emitting DataHub ... +``` + +### Emitting lineage via a custom operator to the Airflow Plugin + +If you have created a custom Airflow operator [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html) that inherits from the BaseOperator class, +when overriding the `execute` function, set inlets and outlets via `context['ti'].task.inlets` and `context['ti'].task.outlets`. +The DataHub Airflow plugin will then pick up those inlets and outlets after the task runs. + +```python +class DbtOperator(BaseOperator): + ... + + def execute(self, context): + # do something + inlets, outlets = self._get_lineage() + # inlets/outlets are lists of either datahub_provider.entities.Dataset or datahub_provider.entities.Urn + context['ti'].task.inlets = self.inlets + context['ti'].task.outlets = self.outlets + + def _get_lineage(self): + # Do some processing to get inlets/outlets + + return inlets, outlets +``` + +If you override the `pre_execute` and `post_execute` function, ensure they include the `@prepare_lineage` and `@apply_lineage` decorators respectively. [source](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/lineage.html#lineage) + +## Using DataHub's Airflow lineage backend (deprecated) + +:::caution + +The DataHub Airflow plugin (above) is the recommended way to integrate Airflow with DataHub. For managed services like MWAA, the lineage backend is not supported and so you must use the Airflow plugin. + +If you're using Airflow 1.x, we recommend using the Airflow lineage backend with acryl-datahub <= 0.9.1.0. + +::: + +:::note + +If you are looking to run Airflow and DataHub using docker locally, follow the guide [here](../../docker/airflow/local_airflow.md). Otherwise proceed to follow the instructions below. +::: + +### Setting up Airflow to use DataHub as Lineage Backend + +1. You need to install the required dependency in your airflow. 
See + +```shell +pip install acryl-datahub[airflow] +# If you need the Kafka-based emitter/hook: +pip install acryl-datahub[airflow,datahub-kafka] +``` + +2. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one. + + ```shell + # For REST-based: + airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password '' + # For Kafka-based (standard Kafka sink config can be passed via extras): + airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}' + ``` + +3. Add the following lines to your `airflow.cfg` file. + + ```ini title="airflow.cfg" + [lineage] + backend = datahub_provider.lineage.datahub.DatahubLineageBackend + datahub_kwargs = { + "enabled": true, + "datahub_conn_id": "datahub_rest_default", + "cluster": "prod", + "capture_ownership_info": true, + "capture_tags_info": true, + "graceful_exceptions": true } + # The above indentation is important! + ``` + + **Configuration options:** + + - `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 1. + - `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with. + - `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be capture as a DataHub corpuser. + - `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags. + - `capture_executions` (defaults to false): If true, it captures task runs as DataHub DataProcessInstances. + - `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. + +4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html). +5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation. + +## Emitting lineage via a separate operator + +Take a look at this sample DAG: + +- [`lineage_emission_dag.py`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator. + +In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 1 above for details. + +## Debugging + +### Incorrect URLs + +If your URLs aren't being generated correctly (usually they'll start with `http://localhost:8080` instead of the correct hostname), you may need to set the webserver `base_url` config. 
+ +```ini title="airflow.cfg" +[webserver] +base_url = http://airflow.example.com +``` + +## Additional references + +Related Datahub videos: + +- [Airflow Lineage](https://www.youtube.com/watch?v=3wiaqhb8UR0) +- [Airflow Run History in DataHub](https://www.youtube.com/watch?v=YpUOqDU5ZYg) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/lineage/lineage-feature-guide.md b/docs-website/versioned_docs/version-0.10.4/docs/lineage/lineage-feature-guide.md new file mode 100644 index 0000000000000..9652c77136dd6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/lineage/lineage-feature-guide.md @@ -0,0 +1,233 @@ +--- +title: About DataHub Lineage +sidebar_label: Lineage +slug: /lineage/lineage-feature-guide +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/lineage/lineage-feature-guide.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Lineage + + + +Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream. + +If you're using an ingestion source that supports extraction of Lineage (e.g. the "Table Lineage Capability"), then lineage information can be extracted automatically. For detailed instructions, refer to the source documentation for the source you are using. If you are not using a Lineage-support ingestion source, you can programmatically emit lineage edges between entities via API. + +Alternatively, as of `v0.9.5`, DataHub supports the manual editing of lineage between entities. Data experts are free to add or remove upstream and downstream lineage edges in both the Lineage Visualization screen as well as the Lineage tab on entity pages. Use this feature to supplement automatic lineage extraction or establish important entity relationships in sources that do not support automatic extraction. Editing lineage by hand is supported for Datasets, Charts, Dashboards, and Data Jobs. + +:::note + +Lineage added by hand and programmatically may conflict with one another to cause unwanted overwrites. It is strongly recommend that lineage is edited manually in cases where lineage information is not also extracted in automated fashion, e.g. by running an ingestion source. + +::: + +Types of lineage connections supported in DataHub are: + +- Dataset-to-dataset +- Pipeline lineage (dataset-to-job-to-dataset) +- Dashboard-to-chart lineage +- Chart-to-dataset lineage +- Job-to-dataflow (dbt lineage) + +## Lineage Setup, Prerequisites, and Permissions + +To edit lineage for an entity, you'll need the following [Metadata Privilege](../authorization/policies.md): + +- **Edit Lineage** metadata privilege to edit lineage at the entity level + +It is important to know that the **Edit Lineage** privilege is required for all entities whose lineage is affected by the changes. For example, in order to add "Dataset B" as an upstream dependency of "Dataset A", you'll need the **Edit Lineage** privilege for both Dataset A and Dataset B. + +## Managing Lineage via the DataHub UI + +### Viewing lineage on the Datahub UI + +The UI shows the latest version of the lineage. The time picker can be used to filter out edges within the latest version to exclude those that were last updated outside of the time window. Selecting time windows in the patch will not show you historical lineages. It will only filter the view of the latest version of the lineage. 
+ +### Editing from Lineage Graph View + +The first place that you can edit lineage for entities is from the Lineage Visualization screen. Click on the "Lineage" button on the top right of an entity's profile to get to this view. + +

+ +

+ +Once you find the entity that you want to edit the lineage of, click on the three-dot menu dropdown to select whether you want to edit lineage in the upstream direction or the downstream direction. + +

+ +

+ +If you want to edit upstream lineage for entities downstream of the center node or downstream lineage for entities upstream of the center node, you can simply re-center to focus on the node you want to edit. Once focused on the desired node, you can edit lineage in either direction. + +

+ +

+ +#### Adding Lineage Edges + +Once you click "Edit Upstream" or "Edit Downstream," a modal will open that allows you to manage lineage for the selected entity in the chosen direction. In order to add a lineage edge to a new entity, search for it by name in the provided search bar and select it. Once you're satisfied with everything you've added, click "Save Changes." If you change your mind, you can always cancel or exit without saving the changes you've made. + +

+ +

+ +#### Removing Lineage Edges + +You can remove lineage edges from the same modal used to add lineage edges. Find the edge(s) that you want to remove, and click the "X" on the right side of it. And just like adding, you need to click "Save Changes" to save and if you exit without saving, your changes won't be applied. + +

+ +

+ +#### Reviewing Changes + +Any time lineage is edited manually, we keep track of who made the change and when they made it. You can see this information in the modal where you add and remove edges. If an edge was added manually, a user avatar will be in line with the edge that was added. You can hover over this avatar in order to see who added it and when. + +

+ +

+ +### Editing from Lineage Tab + +The other place that you can edit lineage for entities is from the Lineage Tab on an entity's profile. Click on the "Lineage" tab in an entity's profile and then find the "Edit" dropdown that allows you to edit upstream or downstream lineage for the given entity. + +

+ +

+ +Using the modal from this view will work the same as described above for editing from the Lineage Visualization screen. + +## Managing Lineage via API + +:::note + +When you emit any lineage aspect, the existing aspect gets completely overwritten, unless specifically using patch semantics. +This means that the latest version visible in the UI will be your version. + +::: + +### Using Dataset-to-Dataset Lineage + +This relationship model uses dataset -> dataset connection through the UpstreamLineage aspect in the Dataset entity. + +Here are a few samples for the usage of this type of lineage: + +- [lineage_emitter_mcpw_rest.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_mcpw_rest.py) - emits simple bigquery table-to-table (dataset-to-dataset) lineage via REST as MetadataChangeProposalWrapper. +- [lineage_emitter_rest.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py) - emits simple dataset-to-dataset lineage via REST as MetadataChangeEvent. +- [lineage_emitter_kafka.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_kafka.py) - emits simple dataset-to-dataset lineage via Kafka as MetadataChangeEvent. +- [lineage_emitter_dataset_finegrained.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained.py) - emits fine-grained dataset-dataset lineage via REST as MetadataChangeProposalWrapper. +- [Datahub Snowflake Lineage](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py) - emits Datahub's Snowflake lineage as MetadataChangeProposalWrapper. +- [Datahub BigQuery Lineage](https://github.com/datahub-project/datahub/blob/3022c2d12e68d221435c6134362c1a2cba2df6b3/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py#L1028) - emits Datahub's Bigquery lineage as MetadataChangeProposalWrapper. **Use the patch feature to add to rather than overwrite the current lineage.** + +### Using dbt Lineage + +This model captures dbt specific nodes (tables, views, etc.) and + +- uses datasets as the base entity type and +- extends subclass datasets for each dbt-specific concept, and +- links them together for dataset-to-dataset lineage + +Here is a sample usage of this lineage: + +- [Datahub dbt Lineage](https://github.com/datahub-project/datahub/blob/a9754ebe83b6b73bc2bfbf49d9ebf5dbd2ca5a8f/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L625,L630) - emits Datahub's dbt lineage as MetadataChangeEvent. + +### Using Pipeline Lineage + +The relationship model for this is datajob-to-dataset through the dataJobInputOutput aspect in the DataJob entity. + +For Airflow, this lineage is supported using Airflow’s lineage backend which allows you to specify the inputs to and output from that task. + +If you annotate that on your task we can pick up that information and push that as lineage edges into datahub automatically. You can install this package from Airflow’s Astronomer marketplace [here](https://registry.astronomer.io/providers/datahub). 
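+
+For reference, here is a minimal sketch of emitting this dataset-to-job-to-dataset lineage directly with the DataHub Python emitter. The URNs and the GMS address below are illustrative placeholders; the linked samples that follow are the complete, maintained examples.
+
+```python
+# Sketch only: emit a dataJobInputOutput aspect that links a job (DataJob) to its
+# upstream input dataset and downstream output dataset. All URNs and addresses
+# here are placeholders.
+from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.metadata.schema_classes import DataJobInputOutputClass
+
+input_urn = make_dataset_urn(platform="mysql", name="db.orders", env="PROD")
+output_urn = make_dataset_urn(platform="kafka", name="orders_topic", env="PROD")
+job_urn = make_data_job_urn(orchestrator="airflow", flow_id="orders_dag", job_id="publish_orders")
+
+# dataJobInputOutput ties the job to its inputs and outputs, yielding dataset-to-job-to-dataset lineage.
+lineage = DataJobInputOutputClass(inputDatasets=[input_urn], outputDatasets=[output_urn])
+
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=lineage))
+```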
+ +Here are a few samples for the usage of this type of lineage: + +- [lineage_dataset_job_dataset.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_job_dataset.py) - emits mysql-to-airflow-to-kafka (dataset-to-job-to-dataset) lineage via REST as MetadataChangeProposalWrapper. +- [lineage_job_dataflow.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_job_dataflow.py) - emits the job-to-dataflow lineage via REST as MetadataChangeProposalWrapper. + +### Using Dashboard-to-Chart Lineage + +This relationship model uses the dashboardInfo aspect of the Dashboard entity and models an explicit edge between a dashboard and a chart (such that charts can be attached to multiple dashboards). + +Here is a sample usage of this lineage: + +- [lineage_chart_dashboard.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_chart_dashboard.py) - emits the chart-to-dashboard lineage via REST as MetadataChangeProposalWrapper. + +### Using Chart-to-Dataset Lineage + +This relationship model uses the chartInfo aspect of the Chart entity. + +Here is a sample usage of this lineage: + +- [lineage_dataset_chart.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_chart.py) - emits the dataset-to-chart lineage via REST as MetadataChangeProposalWrapper. + +## Additional Resources + +### Videos + +**DataHub Basics: Lineage 101** + +

+ +

+ +**DataHub November 2022 Town Hall - Including Manual Lineage Demo** + +

+ +

+ +### GraphQL + +- [updateLineage](../../graphql/mutations.md#updatelineage) +- [searchAcrossLineage](../../graphql/queries.md#searchacrosslineage) +- [searchAcrossLineageInput](../../graphql/inputObjects.md#searchacrosslineageinput) + +#### Examples + +**Updating Lineage** + +```graphql +mutation updateLineage { + updateLineage( + input: { + edgesToAdd: [ + { + downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)" + upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:datahub,Dataset,PROD)" + } + ] + edgesToRemove: [ + { + downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" + upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)" + } + ] + } + ) +} +``` + +### DataHub Blog + +- [Acryl Data introduces lineage support and automated propagation of governance information for Snowflake in DataHub](https://blog.datahubproject.io/acryl-data-introduces-lineage-support-and-automated-propagation-of-governance-information-for-339c99536561) +- [Data in Context: Lineage Explorer in DataHub](https://blog.datahubproject.io/data-in-context-lineage-explorer-in-datahub-a53a9a476dc4) +- [Harnessing the Power of Data Lineage with DataHub](https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4) + +## FAQ and Troubleshooting + +**The Lineage Tab is greyed out - why can’t I click on it?** + +This means you have not yet ingested lineage metadata for that entity. Please ingest lineage to proceed. + +**Are there any recommended practices for emitting lineage?** + +We recommend emitting aspects as MetadataChangeProposalWrapper over emitting them via the MetadataChangeEvent. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [DataHub Lineage Impact Analysis](../act-on-metadata/impact-analysis.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/links.md b/docs-website/versioned_docs/version-0.10.4/docs/links.md new file mode 100644 index 0000000000000..223f1a43d90b9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/links.md @@ -0,0 +1,73 @@ +--- +title: Articles & Talks +slug: /links +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/links.md" +--- + +# Articles & Talks + +## Overviews + +- [Tech Deep Dive: DataHub Metadata Service Authentication](https://blog.datahubproject.io/tech-deep-dive-introducing-datahub-metadata-service-authentication-661e3aabbad0) and [video](https://www.youtube.com/watch?v=DPY0G3Ix7Y8) +- [Data in Context: Lineage Explorer in DataHub](https://blog.datahubproject.io/data-in-context-lineage-explorer-in-datahub-a53a9a476dc4) +- [DataHub Basics — Users, Groups, & Authentication 101](https://www.youtube.com/watch?v=8Osw6p9vDYY) +- [DataHub Basics: Lineage 101](https://www.youtube.com/watch?v=rONGpsndzRw) + +## Best Practices + +- [Tags and Terms: Two Powerful DataHub Features, Used in Two Different Scenarios](https://blog.datahubproject.io/tags-and-terms-two-powerful-datahub-features-used-in-two-different-scenarios-b5b4791e892e) + +## Case Studies + +- [Enabling Data Discovery in a Data Mesh: The Saxo Journey](https://blog.datahubproject.io/enabling-data-discovery-in-a-data-mesh-the-saxo-journey-451b06969c8f) +- [DataHub @ Grofers Case Study](https://www.youtube.com/watch?v=m9kUYAuezFI) +- [DataHub @ LinkedIn: Extending the OSS UI](https://www.youtube.com/watch?v=Rdt4kJqDoww) +- [DataHub @ hipages Case Study: Oct 29 
2021](https://www.youtube.com/watch?v=OFNzjUdMcJQ) +- [DataHub @ Adevinta Case Study: Sept 24 2021 Community Town Hall](https://www.youtube.com/watch?v=u9DRa_5uPIM) +- [DataHub at Bizzy (Case Study): Aug 27 2021 Community Meeting](https://www.youtube.com/watch?v=SuhLRr3QKt8) + +## Related Articles + +- [DataHub: A Generalized Metadata Search & Discovery Tool](https://engineering.linkedin.com/blog/2019/data-hub) +- [Open sourcing DataHub: LinkedIn’s metadata search and discovery platform](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) +- [Data Catalogue — Knowing your data](https://medium.com/albert-franzi/data-catalogue-knowing-your-data-15f7d0724900) +- [LinkedIn Datahub Application Architecture Quick Understanding](https://medium.com/@liangjunjiang/linkedin-datahub-application-architecture-quick-understanding-a5b7868ee205) +- [LinkIn Datahub Metadata Ingestion Scripts Unofficical Guide](https://medium.com/@liangjunjiang/linkin-datahub-etl-unofficical-guide-7c3949483f8b) +- [Datahub - RPubs](https://rpubs.com/Priya_Shaji/dataHub) +- [A Dive Into Metadata Hubs](https://www.holistics.io/blog/a-dive-into-metadata-hubs/) +- [How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions](https://www.kdnuggets.com/2019/08/linkedin-uber-lyft-airbnb-netflix-solving-data-management-discovery-machine-learning-solutions.html) +- [Data Discovery in 2020](https://medium.com/@torokyle/data-discovery-in-2020-3c907383caa0) +- [Work-Bench Snapshot: The Evolution of Data Discovery & Catalog](https://medium.com/work-bench/work-bench-snapshot-the-evolution-of-data-discovery-catalog-2f6c0425616b) +- [In-house Data Discovery platforms](https://datastrategy.substack.com/p/in-house-data-discovery-platforms) +- [A Data Engineer’s Perspective On Data Democratization](https://towardsdatascience.com/a-data-engineers-perspective-on-data-democratization-a8aed10f4253) +- [25 Hot New Data Tools and What They DON’T Do](https://blog.amplifypartners.com/25-hot-new-data-tools-and-what-they-dont-do/) +- [4 Data Trends to Watch in 2020](https://medium.com/memory-leak/4-data-trends-to-watch-in-2020-491707902c09) +- [Application Performance Monitor and Distributed Tracing with Apache SkyWalking in Datahub](https://medium.com/@liangjunjiang/application-performance-monitor-and-distributed-tracing-with-apache-skywalking-in-datahub-16bc65e6c670) +- [Emerging Architectures for Modern Data Infrastructure](https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/) +- [Almost Everything You Need To Know on Data Discovery Platforms](https://eugeneyan.com/writing/data-discovery-platforms/) +- [Creating Notebook-based Dynamic Dashboards](https://towardsdatascience.com/creating-notebook-based-dynamic-dashboards-91f936adc6f3) + +## Talks & Presentations + +- [DataHub: Powering LinkedIn's Metadata](../../../docs/demo/DataHub_-_Powering_LinkedIn_Metadata.pdf) @ [Budapest Data Forum 2020](https://budapestdata.hu/2020/en/) +- [Taming the Data Beast Using DataHub](https://www.youtube.com/watch?v=bo4OhiPro7Y) @ [Data Engineering Melbourne Meetup November 2020](https://www.meetup.com/Data-Engineering-Melbourne/events/kgnvlrybcpbjc/) +- [Metadata Management And Integration At LinkedIn With DataHub](https://www.dataengineeringpodcast.com/datahub-metadata-management-episode-147/) @ [Data Engineering Podcast](https://www.dataengineeringpodcast.com) +- [The evolution of metadata: LinkedIn’s 
story](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019) @ [Strata Data Conference 2019](https://conferences.oreilly.com/strata/strata-ny-2019.html) +- [Journey of metadata at LinkedIn](https://www.youtube.com/watch?v=OB-O0Y6OYDE) @ [Crunch Data Conference 2019](https://crunchconf.com/2019) +- [DataHub Journey with Expedia Group](https://www.youtube.com/watch?v=ajcRdB22s5o) +- [Saxo Bank's Data Workbench](https://www.slideshare.net/SheetalPratik/linkedinsaxobankdataworkbench) +- [Data Discoverability at SpotHero](https://www.slideshare.net/MaggieHays/data-discoverability-at-spothero) + +## Non-English + +- [LinkedIn元数据之旅的最新进展—Data Hub](https://blog.csdn.net/DataPipeline/article/details/100155781) +- [数据治理篇: 元数据之datahub-概述](https://www.jianshu.com/p/04630b0c63f7) +- [DataHub——实时数据治理平台](https://segmentfault.com/a/1190000022563622) +- [数据治理工具-元数据管理](https://blog.csdn.net/weixin_42526352/article/details/105371012) +- [元数据管理框架的独舞](https://mp.weixin.qq.com/s/J6xtX3js70brdN3c_7ZkNg) +- [【DataHub】DataHub QuickStart](https://www.jianshu.com/p/eb34e7088c77) +- [数据治理工具调研之DataHub](https://www.cnblogs.com/CodingJacob/p/di2jiang-gong-ju-diao-yan-zhidatahub.html) +- [LinkedIn gibt die Datenplattform DataHub als Open Source frei](https://www.heise.de/developer/meldung/LinkedIn-gibt-die-Datenplattform-DataHub-als-Open-Source-frei-4663773.html) +- [Linkedin bringt Open-Source-Datahub](https://www.itmagazine.ch/artikel/71532/Linkedin_bringt_Open-Source-Datahub.html) +- [DataHub: универсальный инструмент поиска и обнаружения метаданных](https://habr.com/ru/post/520930/) +- [DataHub с открытым исходным кодом: платформа поиска и обнаружения метаданных от LinkedIn](https://habr.com/ru/post/521536/) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/approval-workflows.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/approval-workflows.md new file mode 100644 index 0000000000000..df1be6ba92c60 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/approval-workflows.md @@ -0,0 +1,202 @@ +--- +title: About DataHub Approval Workflows +sidebar_label: Approval Workflows +slug: /managed-datahub/approval-workflows +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/approval-workflows.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Approval Workflows + + + +## Overview + +Keeping all your metadata properly classified can be hard work when you only have a limited number of trusted data stewards. With Managed DataHub, you can source proposals of Tags and Glossary Terms associated to datasets or dataset columns. These proposals may come from users with limited context or programatic processes using hueristics. Then, data stewards and data owners can go through them and only approve proposals they consider correct. This reduces the burden of your stewards and owners while increasing coverage. + +Approval workflows also cover the Business Glossary itself. This allows you to source Glossary Terms and Glossary Term description changes from across your organization while limiting who has final control over what gets in. + +## Using Approval Workflows + +### Proposing Tags and Glossary Terms + +1. When adding a Tag or Glossary Term to a column or entity, you will see a propose button. + +

+ +

+ +2. After proposing the Glossary Term, you will see it appear in a proposed state. + +

+ +

+ +3. This proposal will be sent to the inbox of Admins with proposal permissions and data owners. + +

+ +

+ +4. From there, they can choose to either accept or reject the proposal. A full log of all accepted or rejected proposals is kept for each user. + +### Proposing additions to your Business Glossary + +1. Navigate to your glossary by going to the Govern menu in the top right and selecting Glossary. + +2. Click the plus button to create a new Glossary Term. From that menu, select Propose. + +

+ +

+ +3. This proposal will be sent to the inbox of Admins with proposal permissions and data owners. + +

+ +

+ +4. From there, they can choose to either accept or reject the proposal. A full log of all accepted or rejected proposals is kept for each user. + +### Proposing Glossary Term Description Updates + +1. When updating the description of a Glossary Term, click propse after making your change. + +

+ +

+ +2. This proposal will be sent to the inbox of Admins with proposal permissions and data owners. + +

+ +

+ +3. From there, they can choose to either accept or reject the proposal. + +## Proposing Programatically + +DataHub exposes a GraphQL API for proposing Tags and Glossary Terms. + +At a high level, callers of this API will be required to provide the following details: + +1. A unique identifier for the target Metadata Entity (URN) +2. An optional sub-resource identifier which designates a sub-resource to attach the Tag or Glossary Term to. for example reference to a particular "field" within a Dataset. +3. A unique identifier for the Tag or Glossary Term they wish to propose (URN) + +In the following sections, we will describe how to construct each of these items and use the DataHub GraphQL API to submit Tag or Glossary Term proposals. + +#### Constructing an Entity Identifier + +Inside DataHub, each Metadata Entity is uniquely identified by a Universal Resource Name, or an URN. This identifier can be copied from the entity page, extracted from the API, or read from a downloaded search result. You can also use the helper methods in the datahub python library given a set of components. + +#### Constructing a Sub-Resource Identifier + +Specific Metadata Entity types have additional sub-resources to which Tags may be applied. +Today, this only applies for Dataset Metadata Entities, which have a "fields" sub-resource. In this case, the `subResource` value would be the field path for the schema field. + +#### Finding a Tag or Glossary Term Identifier + +Tags and Glossary Terms are also uniquely identified by an URN. + +Tag URNs have the following format: +`urn:li:tag:` + +Glossary Term URNs have the following format: +`urn:li:glossaryTerm:` + +These full identifiers can be copied from the entity pages of the Tag or Glossary Term. + +
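+
+If you prefer to construct these identifiers in code, a minimal sketch using the helper methods in the datahub python library might look like the following (every name here is an illustrative placeholder, not a value from your deployment):
+
+```python
+# Sketch only: build the entity URN, sub-resource identifier, and Tag / Glossary Term
+# URNs described above using the DataHub Python library's URN helpers.
+from datahub.emitter.mce_builder import (
+    make_dataset_urn,
+    make_schema_field_urn,
+    make_tag_urn,
+    make_term_urn,
+)
+
+# Target entity (a Dataset) and an optional sub-resource (one of its fields).
+dataset_urn = make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD")
+field_path = "customer_email"  # would be passed as the subResource for dataset fields
+field_urn = make_schema_field_urn(parent_urn=dataset_urn, field_path=field_path)
+
+# The Tag or Glossary Term you want to propose.
+tag_urn = make_tag_urn("Marketing")             # urn:li:tag:Marketing
+term_urn = make_term_urn("Classification.PII")  # urn:li:glossaryTerm:Classification.PII
+```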

+ +

+ +#### Issuing a GraphQL Query + +Once we've constructed an Entity URN, any relevant sub-resource identifiers, and a Tag or Term URN, we're ready to propose! To do so, we'll use the DataHub GraphQL API. + +In particular, we'll be using the proposeTag, proposeGlossaryTerm, and proposeUpdateDescription Mutations, which have the following interface: + +``` +type Mutation { +proposeTerm(input: TermAssociationInput!): String! # Returns Proposal URN. +} + +input TermAssociationInput { + resourceUrn: String! # Required. e.g. "urn:li:dataset:(...)" + subResource: String # Optional. e.g. "fieldName" + subResourceType: String # Optional. "DATASET_FIELD" for dataset fields + term: String! # Required. e.g. "urn:li:tag:Marketing" +} +``` + +``` +type Mutation { +proposeTag(input: TagAssociationInput!): String! # Returns Proposal URN. +} + +input TagAssociationInput { + resourceUrn: String! # Required. e.g. "urn:li:dataset:(...)" subResource: String # Optional. e.g. "fieldName" + subResourceType: String # Optional. "DATASET_FIELD" for dataset fields + tagUrn: String! # Required. e.g. "urn:li:tag:Marketing" +} +``` + +``` +mutation proposeUpdateDescription($input: DescriptionUpdateInput!) { + proposeUpdateDescription(input: $input) +} + +""" +Currently supports updates to Glossary Term descriptions only +""" +input DescriptionUpdateInput { + description: String! # the new description + + resourceUrn: String! + + subResourceType: SubResourceType + + subResource: String +} +``` + +## Additional Resources + +### Permissions + +To create & manage metadata proposals, certain access policies or roles are required. + +#### Privileges for Creating Proposals + +To create a new proposal one of these Metadata privileges are required. All roles have these priveleges by default. + +- Propose Tags - Allows to propose tags at the Entity level +- Propose Dataset Column Tags - Allows to propose tags at the Dataset Field level +- Propose Glossary Terms - Allows to propose terms at the Entity level +- Propose Dataset Column Glossary Terms - Allows to propose terms at the Dataset Field level + +To be able to see the Proposals Tab you need the "View Metadata Proposals" PLATFORM privilege + +#### Privileges for Managing Proposals + +To be able to approve or deny proposals you need one of the following Metadata privileges. `Admin` and `Editor` roles already have these by default. + +- Manage Tag Proposals +- Manage Glossary Term Proposals +- Manage Dataset Column Tag Proposals +- Manage Dataset Column Term Proposals + +These map directly to the 4 privileges for doing the proposals. + +To be able to approve or deny proposals to the glossary itself, you just need one permission: + +- Manage Glossaries + +### Videos + +

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/chrome-extension.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/chrome-extension.md new file mode 100644 index 0000000000000..14e2c9eb64534 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/chrome-extension.md @@ -0,0 +1,90 @@ +--- +description: >- + Learn how to upload and use the Acryl DataHub Chrome extension (beta) locally + before it's available on the Chrome store. +title: Acryl DataHub Chrome Extension +slug: /managed-datahub/chrome-extension +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/chrome-extension.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Acryl DataHub Chrome Extension + + + +## Installing the Extension + +In order to use the Acryl DataHub Chrome extension, you need to download it onto your browser from the Chrome web store [here](https://chrome.google.com/webstore/detail/datahub-chrome-extension/aoenebhmfokhglijmoacfjcnebdpchfj). + +

+ +

+ +Simply click "Add to Chrome" then "Add extension" on the ensuing popup. + +## Configuring the Extension + +Once you have your extension installed, you'll need to configure it to work with your Acryl DataHub deployment. + +1. Click the extension button on the right of your browser's address bar to view all of your installed extensions. Click on the newly installed DataHub extension. + +

+ +

+ +2. Fill in your DataHub domain and click "Continue" in the extension popup that appears. + +

+ +

+ +If your organization uses standard SaaS domains for Looker, you should be ready to go! + +### Additional Configurations + +Some organizations have custom SaaS domains for Looker and some Acryl DataHub deployments utilize **Platform Instances** and set custom **Environments** when creating DataHub assets. If any of these situations applies to you, please follow the next few steps to finish configuring your extension. + +1. Click on the extension button and select your DataHub extension to open the popup again. Now click the settings icon in order to open the configurations page. + +

+ +

+ +2. Fill out any and save custom configurations you have in the **TOOL CONFIGURATIONS** section. Here you can configure a custom domain, a Platform Instance associated with that domain, and the Environment set on your DataHub assets. If you don't have a custom domain but do have a custom Platform Instance or Environment, feel free to leave the field domain empty. + +

+ +

+ +## Using the Extension + +Once you have everything configured on your extension, it's time to use it! + +1. First ensure that you are logged in to your Acryl DataHub instance. + +2. Navigate to Looker or Tableau and log in to view your data assets. + +3. Navigate to a page where DataHub can provide insights on your data assets (Dashboards and Explores). + +4. Click the Acryl DataHub extension button on the bottom right of your page to open a drawer where you can now see additional information about this asset right from your DataHub instance. + +

+ +

+ +## Advanced: Self-Hosted DataHub + +If you are using the Acryl DataHub Chrome extension for your self-hosted DataHub instance, everything above is applicable. However, there is one additional step you must take in order to set up your instance to be compatible with the extension. + +### Configure Auth Cookies + +In order for the Chrome extension to work with your instance, it needs to be able to make authenticated requests. Therefore, authentication cookies need to be set up so that they can be shared with the extension on your browser. You must update the values of two environment variables in your `datahub-frontend` container: + +``` +AUTH_COOKIE_SAME_SITE="NONE" +AUTH_COOKIE_SECURE=true +``` + +Once your re-deploy your `datahub-frontend` container with these values, you should be good to go! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/entity-events-api.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/entity-events-api.md new file mode 100644 index 0000000000000..bdb7ad6128adf --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/entity-events-api.md @@ -0,0 +1,812 @@ +--- +description: >- + This guide details the Entity Events API, which allows you to take action when + things change on DataHub. +title: Entity Events API +slug: /managed-datahub/datahub-api/entity-events-api +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/datahub-api/entity-events-api.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Entity Events API + + + +## Introduction + +The Events API allows you to integrate changes happening on the DataHub Metadata Graph in real time into a broader event-based architecture. + +### Supported Integrations + +- [AWS EventBridge](docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md) + +### Use Cases + +Real-time use cases broadly fall into the following categories: + +- **Workflow Integration:** Integrate DataHub flows into your organization's internal workflow management system. For example, create a Jira ticket when specific Tags or Terms are proposed on a Dataset. +- **Notifications**: Generate organization-specific notifications when a change is made on DataHub. For example, send an email to the governance team when a "PII" tag is added to any data asset. +- **Metadata Enrichment**: Trigger downstream metadata changes when an upstream change occurs. For example, propagating glossary terms or tags to downstream entities. +- **Synchronization**: Syncing changes made in DataHub into a 3rd party system. For example, reflecting Tag additions in DataHub into Snowflake. +- **Auditing:** Audit \*\*\*\* _who_ is making _what changes_ on DataHub through time. + +## Event Structure + +Each entity event is serialized to JSON & follows a common base structure. + +**Common Fields** + +| Name | Type | Description | Optional | +| -------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------- | +| **entityUrn** | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | Fals**e** | +| **entityType** | String | The type of the entity being changed. 
Supported values include `dataset`, `chart`, `dashboard`, `dataFlow (Pipeline)`, `dataJob` (Task), `domain`, `tag`, `glossaryTerm`, `corpGroup`, & `corpUser.` | False | +| **category** | String | The category of the change, related to the kind of operation that was performed. Examples include `TAG`, `GLOSSARY_TERM`, `DOMAIN`, `LIFECYCLE`, and more. | False | +| **operation** | String | The operation being performed on the entity given the category. For example, `ADD` ,`REMOVE`, `MODIFY`. For the set of valid operations, see the full catalog below. | False | +| **modifier** | String | The modifier that has been applied to the entity. The value depends on the category. An example includes the URN of a tag being applied to a Dataset or Schema Field. | True | +| **parameters** | Dict | Additional key-value parameters used to provide specific context. The precise contents depends on the category + operation of the event. See the catalog below for a full summary of the combinations. | True | +| **auditStamp.actor** | String | The urn of the actor who triggered the change. | False | +| **auditStamp.time** | Number | The timestamp in milliseconds corresponding to the event. | False | + +For example, an event indicating that a Tag has been added to a particular Dataset would populate each of these fields: + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "ADD", + "modifier": "urn:li:tag:PII", + "parameters": { + "tagUrn": "urn:li:tag:PII" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +In the following sections, we'll take a closer look at the purpose and structure of each supported event type. + +## Event Types + +Below, we will review the catalog of events available for consumption. + +### Add Tag Event + +This event is emitted when a Tag has been added to an entity on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| TAG | ADD | dataset, dashboard, chart, dataJob, container, dataFlow, schemaField |
+
+#### Parameters
+
+| Name | Type | Description | Optional |
+| --------- | ------ | ----------- | -------- |
+| tagUrn | String | The urn of the tag that has been added. | False |
+| fieldPath | String | The path of the schema field which the tag is being added to. This field is **only** present if the entity type is `schemaField`. | True |
+| parentUrn | String | The urn of a parent entity. This field is only present if the entity type is `schemaField`, and will contain the parent Dataset to which the field belongs. | True |
+
+#### Sample Event
+
+```
+{
+  "entityUrn": "urn:li:dataset:abc",
+  "entityType": "dataset",
+  "category": "TAG",
+  "operation": "ADD",
+  "modifier": "urn:li:tag:PII",
+  "parameters": {
+    "tagUrn": "urn:li:tag:PII"
+  },
+  "auditStamp": {
+    "actor": "urn:li:corpuser:jdoe",
+    "time": 1649953100653
+  }
+}
+```
+
+### Remove Tag Event
+
+This event is emitted when a Tag has been removed from an entity on DataHub.
+
+#### Header
+
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| TAG | REMOVE | dataset, dashboard, chart, dataJob, container, dataFlow, schemaField |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | +| tagUrn | String | The urn of the tag that has been removed. | False | +| fieldPath | String | The path of the schema field which the tag is being removed from. This field is **only** present if the entity type is `schemaField`. | True | +| parentUrn | String | The urn of a parent entity. This field is only present if the entity type is `schemaField`, and will contain the parent Dataset to which the field belongs. | True | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "REMOVE", + "modifier": "urn:li:tag:PII", + "parameters": { + "tagUrn": "urn:li:tag:PII" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Glossary Term Event + +This event is emitted when a Glossary Term has been added to an entity on DataHub. + +**Header** + +
+| Category | Operation | Entity Types |
+| ------------- | --------- | ------------ |
+| GLOSSARY_TERM | ADD | dataset, dashboard, chart, dataJob, container, dataFlow, schemaField |
+ +#### Parameters + +| Name | | Type | Description | Optional | +| --------- | --- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | +| termUrn | | String | The urn of the glossary term that has been added. | False | +| fieldPath | | String | The path of the schema field to which the term is being added. This field is **only** present if the entity type is `schemaField`. | True | +| parentUrn | | String | The urn of a parent entity. This field is only present if the entity type is `schemaField`, and will contain the parent Dataset to which the field belongs. | True | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "GLOSSARY_TERM", + "operation": "ADD", + "modifier": "urn:li:glossaryTerm:ExampleNode.ExampleTerm", + "parameters": { + "termUrn": "urn:li:glossaryTerm:ExampleNode.ExampleTerm" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Glossary Term Event + +This event is emitted when a Glossary Term has been removed from an entity on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| ------------- | --------- | ------------ |
+| GLOSSARY_TERM | REMOVE | dataset, dashboard, chart, dataJob, container, dataFlow, schemaField |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | +| termUrn | String | The urn of the glossary term that has been removed. | False | +| fieldPath | String | The path of the schema field from which the term is being removed. This field is **only** present if the entity type is `schemaField`. | True | +| parentUrn | String | The urn of a parent entity. This field is only present if the entity type is `schemaField`, and will contain the parent Dataset to which the field belongs. | True | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "GLOSSARY_TERM", + "operation": "REMOVE", + "modifier": "urn:li:glossaryTerm:ExampleNode.ExampleTerm", + "parameters": { + "termUrn": "urn:li:glossaryTerm:ExampleNode.ExampleTerm" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Domain Event + +This event is emitted when Domain has been added to an entity on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| DOMAIN | ADD | dataset, dashboard, chart, dataJob, container, dataFlow |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------ | ------------------------------------------ | -------- | +| domainUrn | String | The urn of the domain that has been added. | False | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DOMAIN", + "operation": "ADD", + "modifier": "urn:li:domain:ExampleDomain", + "parameters": { + "domainUrn": "urn:li:domain:ExampleDomain" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Domain Event + +This event is emitted when Domain has been removed from an entity on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| DOMAIN | REMOVE | dataset, dashboard, chart, dataJob, container, dataFlow |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------ | -------------------------------------------- | -------- | +| domainUrn | String | The urn of the domain that has been removed. | False | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DOMAIN", + "operation": "REMOVE", + "modifier": "urn:li:domain:ExampleDomain", + "parameters": { + "domainUrn": "urn:li:domain:ExampleDomain" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Owner Event + +This event is emitted when a new owner has been assigned to an entity on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| OWNER | ADD | dataset, dashboard, chart, dataJob, dataFlow, container, glossaryTerm, domain, tag |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------ | ------------------------------------------------------------------------------------------------------------ | -------- | +| ownerUrn | String | The urn of the owner that has been added. | False | +| ownerType | String | The type of the owner that has been added. `TECHNICAL_OWNER`, `BUSINESS_OWNER`, `DATA_STEWARD`, `NONE`, etc. | False | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "OWNER", + "operation": "ADD", + "modifier": "urn:li:corpuser:jdoe", + "parameters": { + "ownerUrn": "urn:li:corpuser:jdoe", + "ownerType": "BUSINESS_OWNER" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Remove Owner Event + +This event is emitted when an existing owner has been removed from an entity on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| OWNER | REMOVE | dataset, dashboard, chart, dataJob, container, dataFlow, glossaryTerm, domain, tag |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------ | -------------------------------------------------------------------------------------------------------------- | -------- | +| ownerUrn | String | The urn of the owner that has been removed. | False | +| ownerType | String | The type of the owner that has been removed. `TECHNICAL_OWNER`, `BUSINESS_OWNER`, `DATA_STEWARD`, `NONE`, etc. | False | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "OWNER", + "operation": "REMOVE", + "modifier": "urn:li:corpuser:jdoe", + "parameters": { + "ownerUrn": "urn:li:corpuser:jdoe", + "ownerType": "BUSINESS_OWNER" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Modify Deprecation Event + +This event is emitted when the deprecation status of an entity has been modified on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| ----------- | --------- | ------------ |
+| DEPRECATION | MODIFY | dataset, dashboard, chart, dataJob, dataFlow, container |
+ +#### Parameters + +| Name | Type | Description | Optional | +| ------ | ------ | -------------------------------------------------------------------------- | -------- | +| status | String | The new deprecation status of the entity, either `DEPRECATED` or `ACTIVE`. | False | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "DEPRECATION", + "operation": "MODIFY", + "modifier": "DEPRECATED", + "parameters": { + "status": "DEPRECATED" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Add Dataset Schema Field Event + +This event is emitted when a new field has been added to a **Dataset** **Schema**. + +#### Header + +
+| Category | Operation | Entity Types |
+| ---------------- | --------- | ------------ |
+| TECHNICAL_SCHEMA | ADD | dataset |
+
+#### Parameters
+
+| Name | Type | Description | Optional |
+| --------- | ------- | ----------- | -------- |
+| fieldUrn | String | The urn of the new schema field. | False |
+| fieldPath | String | The path of the new field. For more information about field paths, check out [Dataset Field Paths Explained](docs/generated/metamodel/entities/dataset.md#field-paths-explained) | False |
+| nullable | Boolean | Whether the new field is nullable. | False |
+
+#### Sample Event
+
+```
+{
+  "entityUrn": "urn:li:dataset:abc",
+  "entityType": "dataset",
+  "category": "TECHNICAL_SCHEMA",
+  "operation": "ADD",
+  "modifier": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)",
+  "parameters": {
+    "fieldUrn": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)",
+    "fieldPath": "newFieldName",
+    "nullable": false
+  },
+  "auditStamp": {
+    "actor": "urn:li:corpuser:jdoe",
+    "time": 1649953100653
+  }
+}
+```
+
+### Remove Dataset Schema Field Event
+
+This event is emitted when a field has been removed from a **Dataset** **Schema**.
+
+#### Header
+
+| Category | Operation | Entity Types |
+| ---------------- | --------- | ------------ |
+| TECHNICAL_SCHEMA | REMOVE | dataset |
+ +#### Parameters + +| Name | Type | Description | Optional | +| --------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- | +| fieldUrn | String | The urn of the removed schema field. | False | +| fieldPath | String | The path of the removed field. For more information about field paths, check out [Dataset Field Paths Explained](docs/generated/metamodel/entities/dataset.md#field-paths-explained) | False | +| nullable | Boolean | Whether the removed field is nullable. | False | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TECHNICAL_SCHEMA", + "operation": "REMOVE", + "modifier": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "parameters": { + "fieldUrn": "urn:li:schemaField:(urn:li:dataset:abc,newFieldName)", + "fieldPath": "newFieldName", + "nullable": false + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Create Event + +This event is emitted when a new entity has been created on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| --------- | --------- | ------------ |
+| LIFECYCLE | CREATE | dataset, dashboard, chart, dataJob, dataFlow, glossaryTerm, domain, tag, container |
+ +#### Parameters + +_None_ + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "CREATE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Soft-Delete Event + +This event is emitted when a new entity has been soft-deleted on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| --------- | ----------- | ------------ |
+| LIFECYCLE | SOFT_DELETE | dataset, dashboard, chart, dataJob, dataFlow, glossaryTerm, domain, tag, container |
+ +#### Parameters + +_None_ + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "LIFECYCLE", + "operation": "SOFT_DELETE", + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Entity Hard-Delete Event + +This event is emitted when a new entity has been hard-deleted on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| --------- | ----------- | ------------ |
+| LIFECYCLE | HARD_DELETE | dataset, dashboard, chart, dataJob, dataFlow, glossaryTerm, domain, tag, container |
+
+#### Parameters
+
+_None_
+
+#### Sample Event
+
+```
+{
+  "entityUrn": "urn:li:dataset:abc",
+  "entityType": "dataset",
+  "category": "LIFECYCLE",
+  "operation": "HARD_DELETE",
+  "auditStamp": {
+    "actor": "urn:li:corpuser:jdoe",
+    "time": 1649953100653
+  }
+}
+```
+
+### Completed Assertion Run Event
+
+This event is emitted when an Assertion run has completed on DataHub.
+
+#### Header
+
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| RUN | COMPLETED | assertion |
+ +#### Parameters + +| Name | Type | Description | Optional | +| ---------- | ------ | ----------------------------------------------------- | -------- | +| runResult | String | The result of the run, either `SUCCESS` or `FAILURE`. | False | +| runId | String | Native (platform-specific) identifier for this run. | False | +| aserteeUrn | String | Urn of entity on which the assertion is applicable. | False | + +#### + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:assertion:abc", + "entityType": "assertion", + "category": "RUN", + "operation": "COMPLETED", + "parameters": { + "runResult": "SUCCESS", + "runId": "123", + "aserteeUrn": "urn:li:dataset:def" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Started Data Process Instance Run Event + +This event is emitted when a Data Process Instance Run has STARTED on DataHub. + +#### Header + +
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| RUN | STARTED | dataProcessInstance |
+
+#### Parameters
+
+| Name | Type | Description | Optional |
+| ----------------- | ------- | ----------- | -------- |
+| attempt | Integer | The number of attempts that have been made. | True |
+| dataFlowUrn | String | The urn of the associated Data Flow. Only filled in if this run is associated with a Data Flow. | True |
+| dataJobUrn | String | The urn of the associated Data Job. Only filled in if this run is associated with a Data Job. | True |
+| parentInstanceUrn | String | Urn of the parent DataProcessInstance (if there is one). | True |
+
+#### Sample Event
+
+```
+{
+  "entityUrn": "urn:li:dataProcessInstance:abc",
+  "entityType": "dataProcessInstance",
+  "category": "RUN",
+  "operation": "STARTED",
+  "parameters": {
+    "dataFlowUrn": "urn:li:dataFlow:def",
+    "attempt": "1",
+    "parentInstanceUrn": "urn:li:dataProcessInstance:ghi"
+  },
+  "auditStamp": {
+    "actor": "urn:li:corpuser:jdoe",
+    "time": 1649953100653
+  }
+}
+```
+
+### Completed Data Process Instance Run Event
+
+This event is emitted when a Data Process Instance Run has been COMPLETED on DataHub.
+
+#### Header
+
+| Category | Operation | Entity Types |
+| -------- | --------- | ------------ |
+| RUN | COMPLETED | dataProcessInstance |
+
+#### Parameters
+
+| Name | Type | Description | Optional |
+| ----------------- | ------- | ----------- | -------- |
+| runResult | String | The result of the run, one of `SUCCESS`, `FAILURE`, `SKIPPED`, or `UP_FOR_RETRY`. | False |
+| attempt | Integer | The number of attempts that have been made. | True |
+| dataFlowUrn | String | The urn of the associated Data Flow. Only filled in if this run is associated with a Data Flow. | True |
+| dataJobUrn | String | The urn of the associated Data Job. Only filled in if this run is associated with a Data Job. | True |
+| parentInstanceUrn | String | Urn of the parent DataProcessInstance. | True |
+
+#### Sample Event
+
+```
+{
+  "entityUrn": "urn:li:dataProcessInstance:abc",
+  "entityType": "dataProcessInstance",
+  "category": "RUN",
+  "operation": "COMPLETED",
+  "parameters": {
+    "runResult": "SUCCESS",
+    "attempt": "2",
+    "dataFlowUrn": "urn:li:dataFlow:def"
+  },
+  "auditStamp": {
+    "actor": "urn:li:corpuser:jdoe",
+    "time": 1649953100653
+  }
+}
+```
+
+### Action Request Created Event
+
+This event is emitted when a new Action Request (Metadata Proposal) has been created.
+
+#### Header
+
+| Category | Operation | Entity Types |
+| --------- | --------- | ------------ |
+| LIFECYCLE | CREATED | actionRequest |
+ +#### Parameters + +These are the common parameters for all Action Request create events. + +| Name | Type | Description | Optional | +| ----------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | +| actionRequestType | String | The type of Action Request. One of `TAG_ASSOCIATION`, `TERM_ASSOCIATION`, `CREATE_GLOSSARY_NODE`, `CREATE_GLOSSARY_TERM`, or `UPDATE_DESCRIPTION.` | False | +| resourceType | String | The type of entity this Action Request is applied on, such as `dataset`. | True | +| resourceUrn | String | The entity this Action Request is applied on. | True | +| subResourceType | String | Filled if this Action Request is applied on a sub-resource, such as a `schemaField`. | True | +| subResource | String | Identifier of the sub-resource if this proposal is applied on one. | True | + +Parameters specific to different proposal types are listed below. + +#### Tag Association Proposal Specific Parameters and Sample Event + +| Name | Type | Description | Optional | +| ------ | ------ | ----------------------------------------- | -------- | +| tagUrn | String | The urn of the Tag that would be applied. | False | + +``` +{ + "entityUrn": "urn:li:actionRequest:abc", + "entityType": "actionRequest", + "category": "LIFECYCLE", + "operation": "CREATED", + "parameters": { + "actionRequestType": "TAG_ASSOCIATION", + "resourceType": "dataset", + "resourceUrn": "urn:li:dataset:snowflakeDataset, + "tagUrn": "urn:li:tag:Classification" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +#### Term Association Proposal Specific Parameters and Sample Event + +| Name | Type | Description | Optional | +| ------- | ------ | --------------------------------------------------- | -------- | +| termUrn | String | The urn of the Glossary Term that would be applied. | False | + +``` +{ + "entityUrn": "urn:li:actionRequest:abc", + "entityType": "actionRequest", + "category": "LIFECYCLE", + "operation": "CREATED", + "parameters": { + "actionRequestType": "TERM_ASSOCIATION", + "resourceType": "dataset", + "resourceUrn": "urn:li:dataset:snowflakeDataset, + "termUrn": "urn:li:glossaryTerm:Classification" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +#### Create Glossary Node/Term Proposal Specific Parameters and Sample Event + +| Name | Type | Description | Optional | +| ------------------ | ------ | --------------------------------------------------------------------------------- | -------- | +| glossaryEntityName | String | The name of the Glossary Entity that would be created. | False | +| parentNodeUrn | String | The urn of the Parent Node that would be associated with the new Glossary Entity. | True | +| description | String | The description of the new Glossary Entity. 
| True | + +``` +{ + "entityUrn": "urn:li:actionRequest:abc", + "entityType": "actionRequest", + "category": "LIFECYCLE", + "operation": "CREATED", + "parameters": { + "actionRequestType": "CREATE_GLOSSARY_TERM", + "resourceType": "glossaryNode", + "glossaryEntityName": "PII", + "parentNodeUrn": "urn:li:glossaryNode:Classification", + "description": "Personally Identifiable Information" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +#### Update Description Proposal Specific Parameters + +| Name | Type | Description | Optional | +| ----------- | ------ | --------------------------------- | -------- | +| description | String | The proposed updated description. | False | + +``` +{ + "entityUrn": "urn:li:actionRequest:abc", + "entityType": "actionRequest", + "category": "LIFECYCLE", + "operation": "CREATED", + "parameters": { + "actionRequestType": "UPDATE_DESCRIPTION", + "resourceType": "glossaryNode", + "description": "Personally Identifiable Information" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` + +### Action Request Status Change Event + +This event is emitted when an existing Action Request (proposal) changes status. For example, this event will be emitted when an Action Request transitions from pending to completed. + +#### Header + +
+| Category | Operation | Entity Types |
+| --------- | ------------------ | ------------ |
+| LIFECYCLE | PENDING, COMPLETED | actionRequest |
+ +#### Parameters + +These are the common parameters for all parameters. + +| Name | Type | Description | Optional | +| ------------------- | ------ | ----------------------------------------------------------------------------------------- | -------- | +| actionRequestStatus | String | The status of the Action Request. | False | +| actionRequestResult | String | Only filled if the `actionRequestStatus` is `COMPLETED`. Either `ACCEPTED` or `REJECTED`. | True | + +#### Sample Event + +``` +{ + "entityUrn": "urn:li:actionRequest:abc", + "entityType": "actionRequest", + "category": "LIFECYCLE", + "operation": "COMPLETED", + "parameters": { + "actionRequestStatus": "COMPLETED", + "actionRequestResult": "ACCEPTED" + }, + "auditStamp": { + "actor": "urn:li:corpuser:jdoe", + "time": 1649953100653 + } +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/graphql-api/getting-started.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/graphql-api/getting-started.md new file mode 100644 index 0000000000000..088307cb10e7c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/graphql-api/getting-started.md @@ -0,0 +1,48 @@ +--- +description: Getting started with the DataHub GraphQL API. +title: Getting Started +slug: /managed-datahub/datahub-api/graphql-api/getting-started +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/datahub-api/graphql-api/getting-started.md +--- + +# Getting Started + +The Acryl DataHub GraphQL API is an extension of the open source [DataHub GraphQL API.](docs/api/graphql/overview.md) + +For a full reference to the Queries & Mutations available for consumption, check out [Queries](graphql/queries.md) & [Mutations](graphql/mutations.md). + +### Connecting to the API + +

+ +

+ +When you generate the token you will see an example of `curl` command which you can use to connect to the GraphQL API. + +Note that there is a single URL mentioned there but it can be any of these + +- https://`your-account`.acryl.io/api/graphql +- https://`your-account`.acryl.io/api/gms/graphql + +If there is any example that requires you to connect to GMS then you can use the second URL and change the endpoints. + +e.g. to get configuration of your GMS server you can use + +``` +curl -X GET 'https://your-account.acryl.io/api/gms/config' --header +``` + +e.g. to connect to ingestion endpoint for doing ingestion programmatically you can use the below URL + +- https://your-account.acryl.io/api/gms/aspects?action=ingestProposal + +### Exploring the API + +The entire GraphQL API can be explored & [introspected](https://graphql.org/learn/introspection/) using GraphiQL, an interactive query tool which allows you to navigate the entire Acryl GraphQL schema as well as craft & issue using an intuitive UI. + +[GraphiQL](https://www.gatsbyjs.com/docs/how-to/querying-data/running-queries-with-graphiql/) is available for each Acryl DataHub deployment, locating at `https://your-account.acryl.io/api/graphiql`. + +### Querying the API + +Currently, we do not offer language-specific SDKs for accessing the DataHub GraphQL API. For querying the API, you can make use of a variety of per-language client libraries. For a full list, see [GraphQL Code Libraries, Tools, & Services](https://graphql.org/code/). diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md new file mode 100644 index 0000000000000..b5fdad7487c59 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md @@ -0,0 +1,416 @@ +--- +description: This page provides an overview of working with the DataHub Incidents API. +title: Incidents API (Beta) +slug: /managed-datahub/datahub-api/graphql-api/incidents-api-beta +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Incidents API (Beta) + + + +## Introduction + +**Incidents** are a concept used to flag particular Data Assets as being in an unhealthy state. Each incident has an independent lifecycle and details including a state (active, resolved), a title, a description, & more. + +A couple scenarios in which incidents can be useful are + +1. **Pipeline Circuit Breaking:** You can use Incidents as the basis for intelligent data pipelines that verify upstream inputs (e.g. datasets) are free of any active incidents before executing. +2. \[Coming Soon] **Announcing Known-Bad Assets**: You can mark a known-bad data asset as under an ongoing incident so consumers and stakeholders can be informed about the health status of a data asset via the DataHub UI. Moreover, they can follow the incident as it progresses toward resolution. + +In the next section, we'll show you how to + +1. Create a new incident +2. Fetch all incidents for a data asset +3. Resolve an incident + +for **Datasets** using the Acryl [GraphQL API](docs/api/graphql/overview.md). + +Let's get started! + +## Creating an Incident + +:::info +Creating incidents is currently only supported against **Dataset** assets. 
+::: + +To create (i.e. raise) a new incident for a data asset, simply create a GraphQL request using the `raiseIncident` mutation. + +``` +type Mutation { + """ + Raise a new incident for a data asset + """ + raiseIncident(input: RaiseIncidentInput!): String! # Returns new Incident URN. +} + +input RaiseIncidentInput { + """ + The type of incident, e.g. OPERATIONAL + """ + type: IncidentType! + + """ + A custom type of incident. Present only if type is 'CUSTOM' + """ + customType: String + + """ + An optional title associated with the incident + """ + title: String + + """ + An optional description associated with the incident + """ + description: String + + """ + The resource (dataset, dashboard, chart, dataFlow, etc) that the incident is associated with. + """ + resourceUrn: String! + + """ + The source of the incident, i.e. how it was generated + """ + source: IncidentSourceInput +} +``` + +### Examples + +First, we'll create a demo GraphQL query, then show how to represent it via CURL & Python. + +Imagine we want to raise a new incident on a Dataset with URN `urn:li:dataset:(abc)` because it's failed automated quality checks. To do so, we could make the following GraphQL query: + +_Request_ + +``` +mutation raiseIncident { + raiseIncident(input: { + type: OPERATIONAL + title: "Dataset Failed Quality Checks" + description: "Dataset failed 2/6 Quality Checks for suite run id xy123mksj812pk23." + resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)" + }) +} +``` + +After we make this query, we will get back a unique URN for the incident. + +_Response_ + +``` +{ + "data": { + "raiseIncident": "urn:li:incident:bfecab62-dc10-49a6-a305-78ce0cc6e5b1" + } +} +``` + +Now we'll see how to issue this query using a CURL or Python. + +#### CURL + +To issue the above GraphQL as a CURL: + +``` +curl --location --request POST 'https://your-account.acryl.io/api/graphql' \ +--header 'Authorization: Bearer your-access-token' \ +--header 'Content-Type: application/json' \ +--data-raw '{"query":"mutation raiseIncident {\n raiseIncident(input: {\n type: OPERATIONAL\n title: \"Dataset Failed Quality Checks\"\n description: \"Dataset failed 2/6 Quality Checks for suite run id xy123mksj812pk23.\"\n resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)\"\n })\n}","variables":{}}' +``` + +#### Python + +To issue the above GraphQL query in Python (requests): + +``` +import requests + +datahub_session = requests.Session() + +headers = { + "Content-Type": "application/json", + "Authorization": "Bearer your-personal-access-token", +} + +json = { + "query": """mutation raiseIncident {\n + raiseIncident(input: {\n + type: OPERATIONAL\n + resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)\"\n + })}""", + "variables": {}, +} + +response = datahub_session.post(f"https://your-account.acryl.io/api/graphql", headers=headers, json=json) +response.raise_for_status() +res_data = response.json() # Get result as JSON +``` + +## Retrieving Active Incidents + +To fetch the the ongoing incidents for a data asset, we can use the `incidents` GraphQL field on the entity of interest. + +### Datasets + +To retrieve all incidents for a Dataset with a particular [URN](docs/what/urn.md), you can reference the 'incidents' field of the Dataset type: + +``` +type Dataset { + .... + """ + Incidents associated with the Dataset + """ + incidents( + """ + Optional incident state to filter by, defaults to any state. 
+ """ + state: IncidentState, + """ + Optional start offset, defaults to 0. + """ + start: Int, + """ + Optional start offset, defaults to 20. + """ + count: Int): EntityIncidentsResult # Returns a list of incidents. +} +``` + +### Examples + +Now that we've raised an incident on it, imagine we want to fetch the first 10 "active" incidents for the Dataset with URN `urn:li:dataset:(abc`). To do so, we could issue the following request: + +_Request_ + +``` +query dataset { + dataset(urn: "urn:li:dataset:(abc)") { + incidents(state: ACTIVE, start: 0, count: 10) { + total + incidents { + urn + title + description + status { + state + } + } + } + } +} +``` + +After we make this query, we will get back a unique URN for the incident. + +_Response_ + +``` +{ + "data": { + "dataset": { + "incidents": { + "total": 1, + "incidents": [ + { + "urn": "urn:li:incident:bfecab62-dc10-49a6-a305-78ce0cc6e5b1", + "title": "Dataset Failed Quality Check", + "description": "Dataset failed 2/6 Quality Checks for suite run id xy123mksj812pk23.", + "status": { + "state": "ACTIVE" + } + } + ] + } + } + } +} +``` + +Now we'll see how to issue this query using a CURL or Python. + +#### CURL + +To issue the above GraphQL as a CURL: + +``` +curl --location --request POST 'https://your-account.acryl.io/api/graphql' \ +--header 'Authorization: Bearer your-access-token' \ +--header 'Content-Type: application/json' \ +--data-raw '{"query":"query dataset {\n dataset(urn: "urn:li:dataset:(abc)") {\n incidents(state: ACTIVE, start: 0, count: 10) {\n total\n incidents {\n urn\n title\n description\n status {\n state\n }\n }\n }\n }\n}","variables":{}}'Python +``` + +To issue the above GraphQL query in Python (requests): + +``` +import requests + +datahub_session = requests.Session() + +headers = { + "Content-Type": "application/json", + "Authorization": "Bearer your-personal-access-token", +} + +json = { + "query": """query dataset {\n + dataset(urn: "urn:li:dataset:(abc)") {\n + incidents(state: ACTIVE, start: 0, count: 10) {\n + total\n + incidents {\n + urn\n + title\n + description\n + status {\n + state\n + }\n + }\n + }\n + }\n + }""", + "variables": {}, +} + +response = datahub_session.post(f"https://your-account.acryl.io/api/graphql", headers=headers, json=json) +response.raise_for_status() +res_data = response.json() # Get result as JSON +``` + +## Resolving an Incident + +To resolve an incident for a data asset, simply create a GraphQL request using the `updateIncidentStatus` mutation. To mark an incident as resolved, simply update its state to `RESOLVED`. + +``` +type Mutation { + """ + Update an existing incident for a resource (asset) + """ + updateIncidentStatus( + """ + The urn for an existing incident + """ + urn: String! + + """ + Input required to update the state of an existing incident + """ + input: UpdateIncidentStatusInput!): String +} + +""" +Input required to update status of an existing incident +""" +input UpdateIncidentStatusInput { + """ + The new state of the incident + """ + state: IncidentState! + + """ + An optional message associated with the new state + """ + message: String +} +``` + +### Examples + +Imagine that we've fixed our Dataset with urn `urn:li:dataset:(abc)` so that it's passing validation. Now we want to mark the Dataset as healthy, so stakeholders and downstream consumers know it's ready to use. + +To do so, we need the URN of the Incident that we raised previously. 
+ +_Request_ + +``` +mutation updateIncidentStatus { + updateIncidentStatus(urn: "urn:li:incident:bfecab62-dc10-49a6-a305-78ce0cc6e5b1", + input: { + state: RESOLVED + message: "Dataset is now passing validations. Verified by John Joyce on Data Platform eng." + }) +} +``` + +_Response_ + +``` +{ + "data": { + "updateIncidentStatus": "true" + } +} +``` + +True is returned if the incident's was successfully marked as resolved. + +#### CURL + +To issue the above GraphQL as a CURL: + +``` +curl --location --request POST 'https://your-account.acryl.io/api/graphql' \ +--header 'Authorization: Bearer your-access-token' \ +--header 'Content-Type: application/json' \ +--data-raw '{"query":"mutation updateIncidentStatus {\n updateIncidentStatus(urn: "urn:li:incident:bfecab62-dc10-49a6-a305-78ce0cc6e5b1", \n input: {\n state: RESOLVED\n message: "Dataset is now passing validations. Verified by John Joyce on Data Platform eng."\n })\n}","variables":{}}'Python +``` + +To issue the above GraphQL query in Python (requests): + +``` +import requests + +datahub_session = requests.Session() + +headers = { + "Content-Type": "application/json", + "Authorization": "Bearer your-personal-access-token", +} + +json = { + "query": """mutation updateIncidentStatus {\n + updateIncidentStatus(urn: \"urn:li:incident:bfecab62-dc10-49a6-a305-78ce0cc6e5b1\",\n + input: {\n + state: RESOLVED\n + message: \"Dataset is now passing validations. Verified by John Joyce on Data Platform eng.\"\n + })\n + }""", + "variables": {}, +} + +response = datahub_session.post(f"https://your-account.acryl.io/api/graphql", headers=headers, json=json) +response.raise_for_status() +res_data = response.json() # Get result as JSON +``` + +## Tips + +:::info +**Authorization** + +Remember to always provide a DataHub Personal Access Token when calling the GraphQL API. To do so, just add the 'Authorization' header as follows: + +``` +Authorization: Bearer +``` + +**Exploring GraphQL API** + +Also, remember that you can play with an interactive version of the Acryl GraphQL API at `https://your-account-id.acryl.io/api/graphiql` +::: + +## Enabling Slack Notifications + +You can configure Acryl to send slack notifications to a specific channel when incidents are raised or their status is changed. + +These notifications are also able to tag the immediate asset's owners, along with the owners of downstream assets consuming it. + +

+ +

+ +To do so, simply follow the [Slack Integration Guide](docs/managed-datahub/saas-slack-setup.md) and contact your Acryl customer success team to enable the feature! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/integrations/aws-privatelink.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/integrations/aws-privatelink.md new file mode 100644 index 0000000000000..a523901740d5b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/integrations/aws-privatelink.md @@ -0,0 +1,39 @@ +--- +title: AWS PrivateLink +slug: /managed-datahub/integrations/aws-privatelink +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/integrations/aws-privatelink.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# AWS PrivateLink + + + +If you require a private connection between the provisioned DataHub instance and your own existing AWS account, Acryl supports using AWS PrivateLink in order to complete this private connection. + +In order to complete this connection, the Acryl integrations team will require the AWS ARN for a user or role that can accept and complete the connection to your AWS account. + +Once that team reports the PrivateLink has been created, the team will give you a VPC Endpoint Service Name to use. + +In order to complete the connection, you will have to create a VPC Endpoint in your AWS account. To do so, please follow these instructions: + +:::info +Before following the instructions below, please create a VPC security group with ports 80, and 443 (Both TCP) and any required CIDR blocks or other sources as an inbound rule +::: + +1. Open the AWS console to the region that the VPC Endpoint Service is created (Generally this will be in `us-west-2 (Oregon)` but will be seen in the service name itself) +2. Browse to the **VPC** Service and click on **Endpoints** +3. Click on **Create Endpoint** in the top right corner +4. Give the endpoint a name tag (such as _datahub-pl_) +5. Click on the **Other endpoint services** radio button +6. In the **Service setting**, copy the service name that was given to you by the integrations team into the **Service name** field and click **Verify Service** +7. Now select the VPC from the dropdown menu where the endpoint will be created. +8. A list of availability zones will now be shown in the **Subnets** section. Please select at least 1 availability zone and then a corresponding subnet ID from the drop down menu to the right of that AZ. +9. Choose **IPv4** for the **IP address type** +10. Choose an existing security group (or multiple) to use on this endpoint +11. (Optional) For **Policy,** you can keep it on **Full access** or **custom** if you have specific access requirements +12. (Optional) Create any tags you wish to add to this endpoint +13. Click **Create endpoint** +14. Once it has been created, Acryl will need to accept the incoming connection from your AWS account; the integrations team will advise you when this has been completed. 
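
If you prefer to script this step rather than use the console, the same Interface endpoint can be created with `boto3`. This is a minimal sketch only: the region, service name, VPC, subnet, and security group IDs below are placeholders for the values described in the steps above, and Acryl must still accept the connection as noted in step 14.

```python
# Optional scripted alternative to the console steps above.
# All identifiers are placeholders for the values gathered during setup.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region indicated in the service name

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                                 # VPC chosen in step 7
    ServiceName="com.amazonaws.vpce.us-west-2.vpce-svc-EXAMPLE",   # provided by the Acryl integrations team
    SubnetIds=["subnet-0123456789abcdef0"],                        # at least one AZ/subnet (step 8)
    SecurityGroupIds=["sg-0123456789abcdef0"],                     # security group allowing 80/443 inbound
    TagSpecifications=[
        {"ResourceType": "vpc-endpoint", "Tags": [{"Key": "Name", "Value": "datahub-pl"}]}
    ],
)
print(response["VpcEndpoint"]["VpcEndpointId"], response["VpcEndpoint"]["State"])
```

The endpoint will remain in a pending-acceptance state until Acryl accepts the connection on their side.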
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/integrations/oidc-sso-integration.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/integrations/oidc-sso-integration.md new file mode 100644 index 0000000000000..58a60575d8369 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/integrations/oidc-sso-integration.md @@ -0,0 +1,51 @@ +--- +description: >- + This page will help you set up OIDC SSO with your identity provider to log + into Acryl Data +title: OIDC SSO Integration +slug: /managed-datahub/integrations/oidc-sso-integration +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/integrations/oidc-sso-integration.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# OIDC SSO Integration + + + +_Note that we do not yet support LDAP or SAML authentication. Please let us know if either of these integrations would be useful for your organization._ + +If you'd like to do a deeper dive into OIDC configuration outside of the UI, please see our docs [here](/docs/authentication/guides/sso/configure-oidc-react.md) + +### Getting Details From Your Identity Provider + +To set up the OIDC integration, you will need the following pieces of information. + +1. _Client ID_ - A unique identifier for your application with the identity provider +2. _Client Secret_ - A shared secret to use for exchange between you and your identity provider. +3. _Discovery URL_ - A URL where the OIDC API of your identity provider can be discovered. This should suffixed by `.well-known/openid-configuration`. Sometimes, identity providers will not explicitly include this URL in their setup guides, though this endpoint will exist as per the OIDC specification. For more info see [here](http://openid.net/specs/openid-connect-discovery-1_0.html). + +The callback URL to register in your Identity Provider will be + +``` +https://.acryl.io/callback/oidc +``` + +### Configuring OIDC SSO + +> In order to set up the OIDC SSO integration, the user must have the `Manage Platform Settings` privilege. + +#### Enabling the OIDC Integration + +To enable the OIDC integration, start by navigating to **Settings > Platform > SSO.** + +1. Click **OIDC** +2. Enable the Integration +3. Enter the **Client ID, Client Secret, and Discovery URI** obtained in the previous steps +4. If there are any advanced settings you would like to configure, click on the **Advanced** button. These come with defaults, so only input settings here if there is something you need changed from the default configuration. +5. Click **Update** to save your settings. + +
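
Before saving, you can optionally sanity-check the Discovery URL from any machine with Python installed. This is a small sketch using `requests`; the issuer URL below is a placeholder for your identity provider, and the printed fields are defined by the OIDC discovery specification.

```python
# Optional sanity check: confirm the Discovery URL resolves and returns an OIDC configuration.
import requests

discovery_url = "https://your-idp.example.com/.well-known/openid-configuration"  # placeholder issuer

resp = requests.get(discovery_url, timeout=10)
resp.raise_for_status()
config = resp.json()

# Standard fields from the OIDC discovery document.
print(config["issuer"])
print(config["authorization_endpoint"])
print(config["token_endpoint"])
```

If this request fails or returns HTML instead of JSON, double-check the URL with your identity provider before entering it in the settings form.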

+ +

diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/managed-datahub-overview.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/managed-datahub-overview.md new file mode 100644 index 0000000000000..73000cf49341b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/managed-datahub-overview.md @@ -0,0 +1,35 @@ +--- +title: Managed DataHub Exclusives +slug: /managed-datahub/managed-datahub-overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/managed-datahub-overview.md +--- + +# Managed DataHub Exclusives + +Acryl DataHub offers a slew of additional features on top of the normal OSS project. + +## Chrome Extension + +- [Early Access to the DataHub Chrome Extension](docs/managed-datahub/chrome-extension.md) + +## Additional Integrations + +- [Slack Integration](docs/managed-datahub/saas-slack-setup.md) +- [AWS Privatelink](docs/managed-datahub/integrations/aws-privatelink.md) +- [AWS Ingestion Executor](docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md) +- [AWS Eventbridge](docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md) + +## Additional SSO/Login Support + +- [OIDC SSO Integration in the UI](docs/managed-datahub/integrations/oidc-sso-integration.md) + +## Expanded API Features + +- [Entity Events API](docs/managed-datahub/datahub-api/entity-events-api.md) +- [Incidents API](docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md) + +## More Ways to Act on Metadata + +- [Approval Workflows](docs/managed-datahub/approval-workflows.md) +- [Metadata Tests](docs/tests/metadata-tests.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md new file mode 100644 index 0000000000000..88ac3fce301b0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md @@ -0,0 +1,120 @@ +--- +title: Ingestion +slug: /managed-datahub/metadata-ingestion-with-acryl/ingestion +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md +--- + +# Ingestion + +Acryl Metadata Ingestion functions similarly to that in open source DataHub. Sources are configured via the[ UI Ingestion](docs/ui-ingestion.md) or via a [Recipe](metadata-ingestion/README.md#recipes), ingestion recipes can be scheduled using your system of choice, and metadata can be pushed from anywhere. +This document will describe the steps required to ingest metadata from your data sources. + +## Batch Ingestion + +Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion ](metadata-ingestion/README.md#install-from-pypi)framework. +The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata. + +### Step 1: Install DataHub CLI + +Regardless of how you ingest metadata, you'll need your account subdomain and API key handy. 
+ +#### **Install from Gemfury Private Repository** + +**Installing from command line with pip** + +Determine the version you would like to install and obtain a read access token by requesting a one-time-secret from the Acryl team then run the following command: + +`python3 -m pip install acryl-datahub== --index-url https://:@pypi.fury.io/acryl-data/` + +#### Install from PyPI for OSS Release + +Run the following commands in your terminal: + +``` +python3 -m pip install --upgrade pip wheel setuptools +python3 -m pip install --upgrade acryl-datahub +python3 -m datahub version +``` + +_Note: Requires Python 3.6+_ + +Your command line should return the proper version of DataHub upon executing these commands successfully. + +### Step 2: Install Connector Plugins + +Our CLI follows a plugin architecture. You must install connectors for different data sources individually. For a list of all supported data sources, see [the open source docs](metadata-ingestion/README.md#installing-plugins). +Once you've found the connectors you care about, simply install them using `pip install`. +For example, to install the `mysql` connector, you can run + +```python +pip install --upgrade acryl-datahub[mysql] +``` + +### Step 3: Write a Recipe + +[Recipes](metadata-ingestion/README.md#recipes) are yaml configuration files that serve as input to the Metadata Ingestion framework. Each recipe file define a single source to read from and a single destination to push the metadata. +The two most important pieces of the file are the _source_ and _sink_ configuration blocks. +The _source_ configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access metadata from the source. To see configurations required for each supported source, refer to the [Sources](metadata-ingestion/README.md#sources) documentation. +The _sink_ configuration block defines where to push metadata into. Each sink type requires specific configurations, the details of which are detailed in the [Sinks](metadata-ingestion/README.md#sinks) documentation. +In Acryl DataHub deployments, you _must_ use a sink of type `datahub-rest`, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are + +1. **server**: the location of the REST API exposed by your instance of DataHub +2. **token**: a unique API key used to authenticate requests to your instance's REST API + +The token can be retrieved by logging in as admin. You can go to Settings page and generate a Personal Access Token with your desired expiration date. + +
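
Once you have both the server URL and a token, you can optionally confirm they work before writing a full recipe. The sketch below uses the Python SDK bundled with the `acryl-datahub` package installed in Step 1; the `DatahubRestEmitter` import path and `test_connection()` call reflect recent SDK versions, so adjust if your installed version differs.

```python
# Quick sanity check that the server URL + token pair works before running a full recipe.
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(
    gms_server="https://<your-domain>.acryl.io/gms",   # same value you'll put in the recipe's sink
    token="<your-personal-access-token>",
)

# Raises an exception if the server is unreachable or the token is rejected.
emitter.test_connection()
print("Connection OK")
```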

+ +

+ +

+ +

+ +To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below. +A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance: + +```yaml +# example-recipe.yml + +# MySQL source configuration +source: + type: mysql + config: + username: root + password: password + host_port: localhost:3306 + +# Recipe sink configuration. +sink: + type: "datahub-rest" + config: + server: "https://.acryl.io/gms" + token: +``` + +:::info +Your API key is a signed JSON Web Token that is valid for 6 months from the date of issuance. Please keep this key secure & avoid sharing it. +::: + +If your key is compromised for any reason, please reach out to the Acryl team at support@acryl.io.::: + +### Step 4: Running Ingestion + +The final step requires invoking the DataHub CLI to ingest metadata based on your recipe configuration file. +To do so, simply run `datahub ingest` with a pointer to your YAML recipe file: + +``` +datahub ingest -c ./example-recipe.yml +``` + +### Step 5: Scheduling Ingestion + +Ingestion can either be run in an ad-hoc manner by a system administrator or scheduled for repeated executions. Most commonly, ingestion will be run on a daily cadence. +To schedule your ingestion job, we recommend using a job schedule like [Apache Airflow](https://airflow.apache.org/). In cases of simpler deployments, a CRON job scheduled on an always-up machine can also work. +Note that each source system will require a separate recipe file. This allows you to schedule ingestion from different sources independently or together. + +_Looking for information on real-time ingestion? Click_ [_here_](docs/lineage/airflow.md)_._ + +_Note: Real-time ingestion setup is not recommended for an initial POC as it generally takes longer to configure and is prone to inevitable system errors._ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md new file mode 100644 index 0000000000000..954cf9d8666b0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md @@ -0,0 +1,143 @@ +--- +description: >- + This guide will walk through the configuration required to start receiving + Acryl DataHub events via AWS EventBridge. +title: Setting up Events API on AWS EventBridge +slug: /managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/operator-guide/setting-up-events-api-on-aws-eventbridge.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Setting up Events API on AWS EventBridge + + + +## Entity Events API + +- See the Entity Events API Docs [here](docs/managed-datahub/datahub-api/entity-events-api.md) + +## Event Structure + +As are all AWS EventBridge events, the payload itself will be wrapped by a set of standard fields, outlined [here](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-events.html). The most notable include + +- **source:** A unique identifier for the source of the event. We tend to use \`acryl.events\` by default. +- **account**: The account in which the event originated. 
This will be the Acryl AWS Account ID provided by your Acryl customer success rep. +- **detail**: The place where the Entity Event payload will appear. + +#### Sample Event + +``` +{ + "version": "0", + "id": "6a7e8feb-b491-4cf7-a9f1-bf3703467718", + "detail-type": "entityChangeEvent", + "source": "acryl.events", + "account": "111122223333", + "time": "2017-12-22T18:43:48Z", + "region": "us-west-1", + "detail": { + "entityUrn": "urn:li:dataset:abc", + "entityType": "dataset", + "category": "TAG", + "operation": "ADD", + "modifier": "urn:li:tag:pii", + "parameters": { + "tagUrn": "urn:li:tag:pii" + } + } +} +``` + +#### Sample Pattern + +``` +{ + "source": ["acryl.events"], + "detail": { + "category": ["TAG"], + "parameters": { + "tagUrn": ["urn:li:tag:pii"] + } + } +} +``` + +_Sample Event Pattern Filtering any Add Tag Events on a PII Tag_ + +## Step 1: Create an Event Bus + +We recommend creating a dedicated event bus for Acryl. To do so, follow the steps below: + +1\. Navigate to the AWS console inside the account where you will deploy Event Bridge. + +2\. Search and navigate to the **EventBridge** page. + +3\. Navigate to the **Event Buses** tab. + +3\. Click **Create Event Bus.** + +4\. Give the new bus a name, e.g. **acryl-events.** + +5\. Define a **Resource Policy** + +When creating your new event bus, you need to create a Policy that allows the Acryl AWS account to publish messages to the bus. This involves granting the **PutEvents** privilege to the Acryl account via an account id. + +**Sample Policy** + +``` +{ + "Version": "2012-10-17", + "Statement": [{ + "Sid": "allow_account_to_put_events", + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::795586375822:root" + }, + "Action": "events:PutEvents", + "Resource": "" + }] +} +``` + +Notice that you'll need to populate the following fields on your own + +- **event-bus-arn**: This is the AWS ARN of your new event bus. + +## Step 2: Create a Routing Rule + +Once you've defined an event bus, you need to create a rule for routing incoming events to your destinations, for example an SQS topic, a Lambda function, a Log Group, etc. + +To do so, follow the below steps + +1\. Navigate to the **Rules** tab. + +2\. Click **Create Rule**. + +3\. Give the rule a name. This will usually depend on the target where you intend to route requests matching the rule. + +4\. In the **Event Bus** field, select the event bus created in **Step 1**. + +5\. Select the 'Rule with Event Pattern' option + +6\. Click **Next.** + +7\. For **Event Source**, choose **Other** + +8\. \***\* Optional: Define a Sample Event. You can use the Sample Event defined in the **Event Structure\*\* section above. + +9\. Define a matching Rule. This determines which Acryl events will be routed based on the current rule. You can use the Sample Rule defined in the **Event Structure** section above as a reference. + +10\. Define a Target: This defines where the events that match the rule should be routed. + +## Step 3: Configure Acryl to Send Events + +Once you've completed these steps, communicate the following information to your Acryl Customer Success rep: + +- The ARN of the new Event Bus. +- The AWS region in which the Event Bus is located. + +This will enable Acryl to begin sending events to your EventBridge bus. 
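
If you'd rather script Steps 1 and 2 than click through the console, the sketch below shows one way to do it with `boto3`. The bus and rule names, region, and target ARN are placeholders; the principal account ID and event pattern mirror the sample policy and sample pattern shown above. Note that an SQS target also needs a queue policy allowing EventBridge to send messages to it.

```python
# Optional scripted alternative to Steps 1 and 2 above. Names, region, and the target ARN
# are placeholders; the principal matches the sample resource policy.
import json
import boto3

events = boto3.client("events", region_name="us-west-2")

# Step 1: dedicated event bus + permission for the Acryl account to PutEvents onto it.
bus = events.create_event_bus(Name="acryl-events")
events.put_permission(
    EventBusName="acryl-events",
    Action="events:PutEvents",
    Principal="795586375822",
    StatementId="allow_account_to_put_events",
)

# Step 2: route matching events (here, Add Tag events on the PII tag) to a target.
events.put_rule(
    Name="acryl-pii-tag-events",
    EventBusName="acryl-events",
    EventPattern=json.dumps({
        "source": ["acryl.events"],
        "detail": {"category": ["TAG"], "parameters": {"tagUrn": ["urn:li:tag:pii"]}},
    }),
)
events.put_targets(
    Rule="acryl-pii-tag-events",
    EventBusName="acryl-events",
    Targets=[{"Id": "my-queue", "Arn": "arn:aws:sqs:us-west-2:111122223333:my-queue"}],
)

print(bus["EventBusArn"])  # share this ARN (and the region) with Acryl in Step 3
```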
+ +\_\_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md new file mode 100644 index 0000000000000..d8f0a5df6ec64 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md @@ -0,0 +1,133 @@ +--- +description: >- + This page describes the steps required to configure a remote ingestion + executor, which allows you to ingest metadata from private metadata sources + using private credentials via the DataHub UI. +title: Setting up Remote Ingestion Executor on AWS +slug: /managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Setting up Remote Ingestion Executor on AWS + + + +## Overview + +UI-based Metadata Ingestion reduces the overhead associated with operating DataHub. It allows you to create, schedule, and run batch metadata ingestion on demand in just a few clicks, without requiring custom orchestration. Behind the scenes, a simple ingestion "executor" abstraction makes this possible. + +Acryl DataHub comes packaged with an Acryl-managed ingestion executor, which is hosted inside of Acryl's environment on your behalf. However, there are certain scenarios in which an Acryl-hosted executor is not sufficient to cover all of an organization's ingestion sources. + +For example, if an ingestion source is not publicly accessible via the internet, e.g. hosted privately within a specific AWS account, then the Acryl executor will be unable to extract metadata from it. + +

+ +

+ +To accommodate these cases, Acryl supports configuring a remote ingestion executor which can be deployed inside of your AWS account. This setup allows you to continue leveraging the Acryl DataHub console to create, schedule, and run metadata ingestion, all while retaining network and credential isolation. + +

+ +

+ +## Deploying a Remote Ingestion Executor + +1. **Provide AWS Account Id**: Provide Acryl Team with the id of the AWS in which the remote executor will be hosted. This will be used to grant access to private Acryl containers and create a unique SQS queue which your remote agent will subscribe to. The account id can be provided to your Acryl representative via Email or [One Time Secret](https://onetimesecret.com/). + +2. **Provision an Acryl Executor** (ECS)**:** Acryl team will provide a [Cloudformation Template](https://github.com/acryldata/datahub-cloudformation/blob/master/Ingestion/templates/python.ecs.template.yaml) that you can run to provision an ECS cluster with a single remote ingestion task. It will also provision an AWS role for the task which grants the permissions necessary to read and delete from the private SQS queue created for you, along with reading the secrets you've specified. At minimum, the template requires the following parameters: + + 1. **Deployment Location:** The AWS VPC + subnet in which the Acryl Executor task is to be provisioned. + 2. **SQS Queue ARN**: Reference to your private SQS command queue. This is provided by Acryl and is used to configure IAM policies enabling the Task role to read from the shared queue. + 3. **SQS Queue URL**: The URL referring to your private SQS command queue. This is provided by Acryl and is used to read messages. + 4. **DataHub Personal Access Token**: A valid DataHub PAT. This can be generated inside of **Settings > Access Tokens** of DataHub web application. You can alternatively create a secret in AWS Secrets Manager and refer to that by ARN. + 5. **Acryl DataHub URL**: The URL for your DataHub instance, e.g. `.acryl.io/gms`. Note that you MUST enter the trailing /gms when configuring the executor. + 6. **Acryl Remote Executor Version:** The version of the remote executor to deploy. This is converted into a container image tag. It will be set to the latest version of the executor by default. + 7. **Ingestion Source Secrets:** The template accepts up to 10 named secrets which live inside your environment. Secrets are specified using the **OptionalSecrets** parameter in the following form: `SECRET_NAME=SECRET_ARN` with multiple separated by comma, e.g. `SECRET_NAME_1=SECRET_ARN_1,SECRET_NAME_2,SECRET_ARN_2.` + 8. **Environment Variables:** The template accepts up to 10 arbitrary environment variables. These can be used to inject properties into your ingestion recipe from within your environment. Environment variables are specified using the **OptionalEnvVars** parameter in the following form: `ENV_VAR_NAME=ENV_VAR_VALUE` with multiple separated by comma, e.g. `ENV_VAR_NAME_1=ENV_VAR_VALUE_1,ENV_VAR_NAME_2,ENV_VAR_VALUE_2.` + ` + + `Providing secrets enables you to manage ingestion sources from the DataHub UI without storing credentials inside DataHub. Once defined, secrets can be referenced by name inside of your DataHub Ingestion Source configurations using the usual convention: `${SECRET_NAME}`. + + Note that the only external secret provider that is currently supported is AWS Secrets Manager. + +

+ +

+ +

+ +

+ +3. **Test the Executor:** To test your remote executor: + + 1. Create a new Ingestion Source by clicking '**Create new Source**' the '**Ingestion**' tab of the DataHub console. Configure your Ingestion Recipe as though you were running it from inside of your environment. + 2. When working with "secret" fields (passwords, keys, etc), you can refer to any "self-managed" secrets by name: `${SECRET_NAME}:` + +

+ +

+ + 3. In the 'Finish Up' step, click '**Advanced'**. + 4. Update the '**Executor Id**' form field to be '**remote**'. This indicates that you'd like to use the remote executor. + 5. Click '**Done**'. + + Now, simple click '**Execute**' to test out the remote executor. If your remote executor is configured properly, you should promptly see the ingestion task state change to 'Running'. + +
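
Separately from the ingestion run state shown in the DataHub console, you can confirm that the executor container itself is up without opening the AWS console. This is a minimal `boto3` sketch; the cluster name is a placeholder for the ECS cluster created by the CloudFormation stack, and it assumes AWS credentials are configured locally.

```python
# Optional: confirm the remote executor's ECS task is running from the command line.
# The cluster name is a placeholder -- use the cluster created by the CloudFormation stack.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

task_arns = ecs.list_tasks(cluster="acryl-remote-executor", desiredStatus="RUNNING")["taskArns"]
if not task_arns:
    print("No running executor tasks found - check the ECS service events and logs.")
else:
    tasks = ecs.describe_tasks(cluster="acryl-remote-executor", tasks=task_arns)["tasks"]
    for task in tasks:
        print(task["taskArn"], task["lastStatus"])
```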

+ +

+ +## Updating a Remote Ingestion Executor + +In order to update the executor, ie. to deploy a new container version, you'll need to update the CloudFormation Stack to re-deploy the CloudFormation template with a new set of parameters. + +### Steps - AWS Console + +1. Navigate to CloudFormation in AWS Console +2. Select the stack dedicated to the remote executor +3. Click **Update** +4. Select **Replace Current Template** +5. Select **Upload a template file** +6. Upload a copy of the Acryl Remote Executor [CloudFormation Template](https://raw.githubusercontent.com/acryldata/datahub-cloudformation/master/Ingestion/templates/python.ecs.template.yaml) + +

+ +

+ +7. Click **Next** +8. Change parameters based on your modifications (e.g. ImageTag, etc) +9. Click **Next** +10. Confirm your parameter changes, and update. This should perform the necessary upgrades. + +## FAQ + +### If I need to change (or add) a secret that is stored in AWS Secrets Manager, e.g. for rotation, will the new secret automatically get picked up by Acryl's executor?\*\* + +Unfortunately, no. Secrets are wired into the executor container at deployment time, via environment variables. Therefore, the ECS Task will need to be restarted (either manually or via a stack parameter update) whenever your secrets change. + +### I want to deploy multiple Acryl Executors. Is this currently possible?\*\* + +This is possible, but requires a new SQS queue is maintained (on per executor). Please contact your Acryl representative for more information. + +### I've run the CloudFormation Template, how can I tell that the container was successfully deployed?\*\* + +We recommend verifying in AWS Console by navigating to **ECS > Cluster > Stack Name > Services > Logs.** +When you first deploy the executor, you should a single log line to indicate success: + +``` +Starting AWS executor consumer.. +``` + +This indicates that the remote executor has established a successful connection to your DataHub instance and is ready to execute ingestion runs. +If you DO NOT see this log line, but instead see something else, please contact your Acryl representative for support. + +## Release Notes + +This is where release notes for the Acryl Remote Executor Container will live. + +### v0.0.3.9 + +Bumping to the latest version of acryl-executor, which includes smarter messaging around OOM errors. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_69.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_69.md new file mode 100644 index 0000000000000..3d7a8171c7a45 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_69.md @@ -0,0 +1,22 @@ +--- +title: v0.1.69 +slug: /managed-datahub/release-notes/v_0_1_69 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_1_69.md +--- + +# v0.1.69 + +--- + +This is a scheduled release which contains all changes from OSS DataHub upto commit `10a31b1aa08138c616c0e44035f8f843bef13085`. In addition to all the features added in OSS DataHub below are Managed DataHub specific release notes. + +## Release Availability Date + +06 Dec 2022 + +## Release Changlog + +--- + +- We now support >10k results in Metadata Test results diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_70.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_70.md new file mode 100644 index 0000000000000..18d9fbb86e302 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_70.md @@ -0,0 +1,23 @@ +--- +title: v0.1.70 +slug: /managed-datahub/release-notes/v_0_1_70 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_1_70.md +--- + +# v0.1.70 + +--- + +This is a scheduled release which contains all changes from OSS DataHub upto commit `70659711a841bcce4bb1e0350027704b3783f6a5`. In addition to all the features added in OSS DataHub below are Managed DataHub specific release notes. 
+ +## Release Availability Date + +30 Dec 2022 + +## Release Changlog + +--- + +- Improvements in Caching implementation to fix search consistency problems +- We have heard many organisations ask for metrics for the SaaS product. We have made good progress towards this goal which allows us to share Grafana dashboards. We will be testing it selectively. Expect more updates in coming month on this. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_72.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_72.md new file mode 100644 index 0000000000000..823be20abfaec --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_72.md @@ -0,0 +1,28 @@ +--- +title: v0.1.72 +slug: /managed-datahub/release-notes/v_0_1_72 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_1_72.md +--- + +# v0.1.72 + +--- + +## Release Availability Date + +18 Jan 2023 + +## Release Changlog + +--- + +- Since `v0.1.70` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/43c566ee4ff2ee950a4f845c2fd8a1c690c1d607...afaee58ded40dc4cf39f94f1b4331ceb0a4d93eb have been pulled in +- add GZip compression to lineage cache +- Make browse paths upgrade non-blocking + +## Special Notes + +--- + +- If anyone faces issues with login please clear your cookies. Some security updates are part of this release. That may cause login issues until cookies are cleared. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_73.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_73.md new file mode 100644 index 0000000000000..e151270ec7315 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_1_73.md @@ -0,0 +1,23 @@ +--- +title: v0.1.73 +slug: /managed-datahub/release-notes/v_0_1_73 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_1_73.md +--- + +# v0.1.73 + +--- + +## Release Availability Date + +01 Fev 2023 + +## Release Changlog + +--- + +- Since `v0.1.72` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/afaee58ded40dc4cf39f94f1b4331ceb0a4d93eb...36afdec3946df2fb4166ac27a89b933ced87d00e have been pulled in +- Fixes related to Metadata tests to ensure correct results +- Adding Properties and Searchable Annotations to Usage + Storage Features +- Fixes delete references for single relationship aspects diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_0.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_0.md new file mode 100644 index 0000000000000..6d12f7b580ae6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_0.md @@ -0,0 +1,35 @@ +--- +title: v0.2.0 +slug: /managed-datahub/release-notes/v_0_2_0 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_0.md +--- + +# v0.2.0 + +--- + +## Release Availability Date + +09 Feb 2023 + +## Update Downtime + +During release installation the Elasticsearch indices will be reindex to improve search capabilities. While the upgrade is in progress +DataHub will be set to a read-only mode. Once this operation is completed, the upgrade will proceed normally. 
Depending on index sizes and +infrastructure this process can take 5 minutes to hours however as a rough estimate 1 hour for every 2.3 million entities. + +## Release Changlog + +--- + +- Since `v0.1.73` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/36afdec3946df2fb4166ac27a89b933ced87d00e...v0.10.0 have been pulled in + - Improved documentation editor + - Filter lineage graphs based on time windows + - Improvements in Search + - Metadata Ingestion + - Redshift: You can now extract lineage information from unload queries + - PowerBI: Ingestion now maps Workspaces to DataHub Containers + - BigQuery: You can now extract lineage metadata from the Catalog + - Glue: Ingestion now uses table name as the human-readable name +- SSO Preferred Algorithm Setting diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_1.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_1.md new file mode 100644 index 0000000000000..aa0ec0bf4522f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_1.md @@ -0,0 +1,25 @@ +--- +title: v0.2.1 +slug: /managed-datahub/release-notes/v_0_2_1 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_1.md +--- + +# v0.2.1 + +--- + +## Release Availability Date + +23-Feb-2023 + +## Release Changlog + +--- + +- Since `v0.2.0` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/cf1e627e55431fc69d72918b2bcc3c5f3a1d5002...36037cf288eea12f1760dd0718255eeb1d7039c7 have been pulled in. +- Add first, last synched + last updated properties to metadata tests. +- Update link colors to pass accessibility. +- Extend tag and term proposals to other entity types besides datasets. This allows proposals to work on entities other than datasets. +- We are skipping running metadata tests in real-time processing as it was not scaling out and causing issues in ingestion +- Re-enabling hard-deletes which was temporarily disabled diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_2.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_2.md new file mode 100644 index 0000000000000..41752acce48bd --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_2.md @@ -0,0 +1,26 @@ +--- +title: v0.2.2 +slug: /managed-datahub/release-notes/v_0_2_2 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_2.md +--- + +# v0.2.2 + +--- + +## Release Availability Date + +01-Mar-2023 + +## Release Changelog + +--- + +- Since `v0.2.1` no changes from OSS DataHub have been pulled in. +- fix(lineage): fix filtering for Timeline Lineage, regression for Search Ingestion Summaries +- fix(recommendations): recommendations now display on the homepage for recently viewed, searched, and most popular. 
+- fix(analytics): chart smoothing and date range fixes +- fix(search): case-sensitive exact match +- fix(search): fix handling of 2 character search terms when not a prefix or exact match +- fix(ingestion): fix ingestion run summary showing no results diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_3.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_3.md new file mode 100644 index 0000000000000..e1eb62fea5145 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_3.md @@ -0,0 +1,28 @@ +--- +title: v0.2.3 +slug: /managed-datahub/release-notes/v_0_2_3 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_3.md +--- + +# v0.2.3 + +--- + +## Release Availability Date + +14-Mar-2023 + +## Release Changelog + +--- + +- Since `v0.2.2` no changes from OSS DataHub have been pulled in. +- fix(mcl): only restate Lineage MCL's - This should help with some lag issues being seen +- feat(proposals): Add ability to propose descriptions on datasets +- Hotfix 2023 03 06 - Some Miscellaneous search improvements +- fix(bootstrap): only ingest default metadata tests once - This should help with some deleted metadata tests re-appearing. +- refactor(lineage): Fix & optimize getAndUpdatePaths - The impact should be a reduced page load time for the lineage-intensive entities +- refactor(ui): Loading schema dynamically for datasets +- fix(lineage): nullpointer exceptions - should fix some errors related to lineage search +- chore(ci): add daylight savings timezone for tests, fix daylight saving bug in analytics charts - Should fix gaps in Monthly charts for people with daylight savings diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_4.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_4.md new file mode 100644 index 0000000000000..27b8f7b1ef2f4 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_4.md @@ -0,0 +1,31 @@ +--- +title: v0.2.4 +slug: /managed-datahub/release-notes/v_0_2_4 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_4.md +--- + +# v0.2.4 + +--- + +## Release Availability Date + +24-Mar-2023 + +## Release Changelog + +--- + +- Since `v0.2.3` no changes from OSS DataHub have been pulled in. 
+- fix(ui) Safeguard ingestion execution request check - Fixes an error on frontend managed ingestion page +- fix(impactAnalysis): fix filtering for lightning mode search +- fix(search): fix tags with colons +- refactor(incidents): Remove dataset health caching to make incident health instantly update +- fix(ui): Address regression in column usage stats + add unit test +- fix(timeBasedLineage): fix ingestProposal flow for no ops +- feat(assertions + incidents): Support Querying Entities by Assertion / Incident Status + Chrome Embed Optimizations +- fix(lineage): change default lineage time window to All Time +- Truncate cache key for search lineage +- feat(config): Add endpoint to exact search query information +- fix(default policies): Add Manage Proposals Default Policies for Root User diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_5.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_5.md new file mode 100644 index 0000000000000..2a41b330ec3d9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_5.md @@ -0,0 +1,31 @@ +--- +title: v0.2.5 +slug: /managed-datahub/release-notes/v_0_2_5 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_5.md +--- + +# v0.2.5 + +--- + +## Release Availability Date + +11-Apr-2023 + +## Release Changelog + +--- + +- Since `v0.2.4` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/2764c44977583d8a34a3425454e81a730b120829...294c5ff50789564dc836ca0cbcd8f7020756eb0a have been pulled in. +- feat(graphql): Adding new offline features to dataset stats summary +- feat(metadata tests): Further Metadata Tests Improvements (Prep for Uplift) +- refactor(tests): Supporting soft-deleted Metadata Tests +- feat(tests): Adding a high-quality set of Default Metadata Tests +- refactor(tests): Uplift Metadata Tests UX +- refactor(Tests): Metadata Tests Uplift: Adding Empty Tests state +- refactor(Tests): Adding Test Results Modal +- refactor(tests): Adding more default tests and tags +- fix(graphQL): Add protection agaisnt optional null OwnershipTypes +- fix(ui): Fix tags display name + color in UI for autocomplete, search preview, entity profile +- fix(ui) Fix tags and terms columns on nested schema fields diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_6.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_6.md new file mode 100644 index 0000000000000..6d5287457e73f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_6.md @@ -0,0 +1,29 @@ +--- +title: v0.2.6 +slug: /managed-datahub/release-notes/v_0_2_6 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_6.md +--- + +# v0.2.6 + +--- + +## Release Availability Date + +28-Apr-2023 + +## Recommended CLI + +- `v0.10.1.2` with release notes at https://github.com/acryldata/datahub/releases/tag/v0.10.1.2 +- There is a newer CLI available https://github.com/acryldata/datahub/releases/tag/v0.10.2.2 currently but we do not recommend using that because of a Regression in Redshift connector. If you are not using Redshift connector then you can use the newer CLI version. 
+ +## Release Changelog + +--- + +- Since `v0.2.5` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/294c5ff50789564dc836ca0cbcd8f7020756eb0a...2bc0a781a63fd4aed50080ab453bcbd3ec0570bd have been pulled in. +- fix(tests): Ensure that default Test has a description field +- fix(openapi): allow configuration of async on openapi +- fix(cache): clear cache entry when skipped for search +- fix(ui) Update copy for chrome extension health component diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_7.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_7.md new file mode 100644 index 0000000000000..727d37d0f839a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_7.md @@ -0,0 +1,39 @@ +--- +title: v0.2.7 +slug: /managed-datahub/release-notes/v_0_2_7 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_7.md +--- + +# v0.2.7 + +--- + +## Release Availability Date + +19-May-2023 + +## Recommended CLI + +- `v0.10.2.3` with release notes at https://github.com/acryldata/datahub/releases/tag/v0.10.2.3 + +## Release Changelog + +--- + +- Since `v0.2.6` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/2bc0a781a63fd4aed50080ab453bcbd3ec0570bd...44406f7adf09674727e433c2136654cc21e79dd2 have been pulled in. +- feat(observability): Extending Incidents Models for Observability +- models(integrations + obs): Adding a Connection entity +- feat(observability): Extending Assertions Models for Observability +- feat(observability): Introducing Anomaly Models +- feat(fastpath): pre-process updateIndicesHook for UI sourced updates +- fix(metadataTests): change scroll to searchAfter based API +- feat(observability): Assertions-Based Incidents Generator Hook +- fix(notifications): fix double notifications issue +- fix(tag): render tag name via properties +- fix(jackson): add stream reader constraint with 16 MB limit +- fix(metadataTests): gold tier metadata tests condition +- fix(ingest/dbt): fix siblings resolution for sources +- Some search fixes +- fix(graphql) Fix autocomplete for views with un-searchable types +- fix(ui) Allow users to be able to propose new terms/term groups from UI diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_8.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_8.md new file mode 100644 index 0000000000000..daa818cea625d --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_8.md @@ -0,0 +1,43 @@ +--- +title: v0.2.8 +slug: /managed-datahub/release-notes/v_0_2_8 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_8.md +--- + +# v0.2.8 + +--- + +## Release Availability Date + +07-June-2023 + +## Recommended CLI + +- `v0.10.3.1` with release notes at https://github.com/acryldata/datahub/releases/tag/v0.10.3.1 + +## Release Changelog + +--- + +- Since `v0.2.7` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/a68833769e1fe1b27c22269971a51c63cc285c18...e7d1b900ec09cefca4e6ca979f391d3a17b473c9 have been pulled in. 
+- feat(assertions): Extending Assertions READ GraphQL APIs for Observability +- fix(embed): styling updates for chrome extension +- feat: slack integrations service +- feat(assertions): Extending Assertions WRITE GraphQL APIs for Observability +- feat(contracts): Adding models for Data Contracts +- feat(tests): prevent reprocessing of test sourced events +- feat(tests): add parallelization for workloads on metadata tests +- feat(observability): Monitor Models for Observability +- fix(datahub-upgrade) fix while loop predicates for scrolling +- fix(usage): Add resource spec for authenticated access where possible +- feat(observability): Assertions-Based Anomalies Generator Hook +- feat(observability): Adding the GraphQL Implementation for Monitor Entity +- fix(restli): update base client retry logic + +## Some notable features's documentation in this SaaS release + +- [Custom Ownership types](../../ownership/ownership-types.md) +- [Chrome extension](../chrome-extension.md) supports Tableau +- [Data Products](../../dataproducts.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_9.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_9.md new file mode 100644 index 0000000000000..46d2f2c8d622e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/release-notes/v_0_2_9.md @@ -0,0 +1,55 @@ +--- +title: v0.2.9 +slug: /managed-datahub/release-notes/v_0_2_9 +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/release-notes/v_0_2_9.md +--- + +# v0.2.9 + +--- + +## Release Availability Date + +28-June-2023 + +## Recommended CLI/SDK + +- `v0.10.4.3` with release notes at https://github.com/acryldata/datahub/releases/tag/v0.10.4.3 + +If you are using an older CLI/SDK version then please upgrade it. This applies for all CLI/SDK usages, if you are using it through your terminal, github actions, airflow, in python SDK somewhere, Java SKD etc. This is a strong recommendation to upgrade as we keep on pushing fixes in the CLI and it helps us support you better. + +Special Notes + +- We have a new search and browse experience. We cannot enable it unless all of your CLI/SDK usages are upgraded. If you are using a custom source then you need to upgrade your source to produce `browsePathv2` aspects. +- [BREAKING CHANGE] If you are using our okta source to do ingestion then you MUST read this. Okta source config option `okta_profile_to_username_attr` default changed from `login` to `email`. This determines which Okta profile attribute is used for the corresponding DataHub user and thus may change what DataHub users are generated by the Okta source. And in a follow up `okta_profile_to_username_regex` has been set to `.*` which taken together with previous change brings the defaults in line with OIDC which is used for login via Okta. +- [DEPRECATION] In the DataFlow class, the cluster argument is deprecated in favor of env. + +## Release Changelog + +--- + +- Since `v0.2.8` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/e7d1b900ec09cefca4e6ca979f391d3a17b473c9...1f0723fad109658a69bb1d4279100de8514f35d7 have been pulled in. 
+- fix(tests): Fixing pagination on Metadata Test results +- fix(assertions): fix assertion actions hook npe +- fix(notifications): Fixing duplicate ingestion started notifications +- feat(slack integrations): Existing component changes required for revised Slack integration +- fix(proposals): fixing propose glossary term description and adding tests +- fix(search): populate scroll ID properly for other scroll usages +- fix(metadata test icon): hide metadata test pass/fail icon on entity header + +These changes are for an upcoming feature which we have not enabled yet. We are putting it here for transparency purposes. Acryl team will reach out once we start the rollout of our observability features. + +- feat(incidents): Extending Incidents GraphQL APIs for Observability +- feat(anomalies): Adding Anomalies READ GraphQL APIs for Observability +- feat(observability): Minor models and graphql improvements +- feat(observability): UI for creating Dataset SLA Assertions +- feat(observability): Adding support for patching monitor info aspect +- feat(observability): Adding GraphQL APIs for enabling / disabling System Monitors +- feat(observability): DataHub Monitors Service +- feat(observability): display assertion externalUrl if available + +## Some notable features in this SaaS release + +- New search and Browse v2 experience. This can only be enabled if you upgrade all your CLI/SDK usage as per our recommendation provided above. +- Patch support for `dataJobInputOutput` as described [here](../../advanced/patch.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/saas-slack-setup.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/saas-slack-setup.md new file mode 100644 index 0000000000000..455fa96c00320 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/saas-slack-setup.md @@ -0,0 +1,122 @@ +--- +title: Configure Slack Notifications +slug: /managed-datahub/saas-slack-setup +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/saas-slack-setup.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# Configure Slack Notifications + + + +## Install the DataHub Slack App into your Slack workspace + +The following steps should be performed by a Slack Workspace Admin. + +- Navigate to https://api.slack.com/apps/ +- Click Create New App +- Use “From an app manifest” option +- Select your workspace +- Paste this Manifest in YAML. Suggest changing name and `display_name` to be `DataHub App YOUR_TEAM_NAME` but not required. 
This name will show up in your slack workspace + +```yml +display_information: + name: DataHub App + description: An app to integrate DataHub with Slack + background_color: "#000000" +features: + bot_user: + display_name: DataHub App + always_online: false +oauth_config: + scopes: + bot: + - channels:read + - chat:write + - commands + - groups:read + - im:read + - mpim:read + - team:read + - users:read + - users:read.email +settings: + org_deploy_enabled: false + socket_mode_enabled: false + token_rotation_enabled: false +``` + +Confirm you see the Basic Information Tab + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_basic_info.png) + +- Click **Install to Workspace** +- It will show you permissions the Slack App is asking for, what they mean and a default channel in which you want to add the slack app + - Note that the Slack App will only be able to post in channels that the app has been added to. This is made clear by slack’s Authentication screen also. +- Select the channel you'd like notifications to go to and click **Allow** +- Go to DataHub App page + - You can find your workspace's list of apps at https://api.slack.com/apps/ + +## Generating a Bot Token + +- Go to **OAuth & Permissions** Tab + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_oauth_and_permissions.png) + +Here you'll find a “Bot User OAuth Token” which DataHub will need to communicate with your slack through the bot. +In the next steps, we'll show you how to configure the Slack Integration inside of Acryl DataHub. + +## Configuring Notifications + +> In order to set up the Slack integration, the user must have the `Manage Platform Settings` privilege. + +To enable the integration with slack + +- Navigate to **Settings > Integrations** +- Click **Slack** +- Enable the Integration +- Enter the **Bot Token** obtained in the previous steps +- Enter a **Default Slack Channel** - this is where all notifications will be routed unless +- Click **Update** to save your settings + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_add_token.png) + +To enable and disable specific types of notifications, or configure custom routing for notifications, start by navigating to **Settings > Notifications**. +To enable or disable a specific notification type in Slack, simply click the check mark. By default, all notification types are enabled. +To customize the channel where notifications are send, click the button to the right of the check box. + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_channel.png) + +If provided, a custom channel will be used to route notifications of the given type. If not provided, the default channel will be used. +That's it! You should begin to receive notifications on Slack. Note that it may take up to 1 minute for notification settings to take effect after saving. + +## Sending Notifications + +For now we support sending notifications to + +- Slack Channel ID (e.g. `C029A3M079U`) +- Slack Channel Name (e.g. `#troubleshoot`) +- Specific Users (aka Direct Messages or DMs) via user ID + +## How to find Team ID and Channel ID in Slack + +- Go to the Slack channel for which you want to get channel ID +- Check the URL e.g. 
for the troubleshoot channel in OSS DataHub slack + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_channel_url.png) + +- Notice `TUMKD5EGJ/C029A3M079U` in the URL + - Team ID = `TUMKD5EGJ` from above + - Channel ID = `C029A3M079U` from above + +## How to find User ID in Slack + +- Go to user DM +- Click on their profile picture +- Click on View Full Profile +- Click on “More” +- Click on “Copy member ID” + +![](https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/integrations/slack/slack_user_id.png) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/welcome-acryl.md b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/welcome-acryl.md new file mode 100644 index 0000000000000..d20a444697652 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/managed-datahub/welcome-acryl.md @@ -0,0 +1,64 @@ +--- +title: Getting Started with Acryl DataHub +slug: /managed-datahub/welcome-acryl +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/managed-datahub/welcome-acryl.md +--- + +# Getting Started with Acryl DataHub + +Welcome to the Acryl DataHub! We at Acryl are on a mission to make data reliable by bringing clarity to the who, what, when, & how of your data ecosystem. We're thrilled to be on this journey with you; and cannot wait to see what we build together! + +Close communication is not only welcomed, but highly encouraged. For all questions, concerns, & feedback, please reach out to us directly at support@acryl.io. + +## Prerequisites + +Before you go further, you'll need to have a DataHub instance provisioned. The Acryl integrations team will provide you the following once it has been deployed: + +1. The URL for your Acryl instance (https://your-domain-name.acryl.io) +2. Admin account credentials for logging into the DataHub UI + +Once you have these, you're ready to go. + +:::info +If you wish to have a private connection to your DataHub instance, Acryl supports [AWS PrivateLink](https://aws.amazon.com/privatelink/) to complete this connection to your existing AWS account. Please see more details [here](integrations/aws-privatelink.md). +::: + +### Logging In + +Acryl DataHub currently supports the following means to log into a DataHub instance: + +1. **Admin account**: With each deployment of DataHub comes a master admin account. It has a randomly generated password that can be accessed by reaching out to Acryl Integrations team (support@acryl.io). To log in with an admin account, navigate to https://your-domain.acryl.io/login +2. **OIDC**: Acryl DataHub also supports OIDC integration with the Identity Provider of your choice (Okta, Google, etc). To set this up, Acryl integrations team will require the following: +3. _Client ID_ - A unique identifier for your application with the identity provider +4. _Client Secret_ - A shared secret to use for exchange between you and your identity provider. To send this over securely, we recommend using [onetimesecret.com](https://onetimesecret.com/) to create a link. +5. _Discovery URL_ - A URL where the OIDC API of your identity provider can be discovered. This should suffixed by `.well-known/openid-configuration`. Sometimes, identity providers will not explicitly include this URL in their setup guides, though this endpoint will exist as per the OIDC specification. For more info see [here](http://openid.net/specs/openid-connect-discovery-1_0.html). 
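For reference, the discovery URL usually follows a standard pattern; for example, with Okta it would look something like the sketch below, where `your-okta-domain` is a placeholder for your organization's Okta host:

```
https://your-okta-domain.okta.com/.well-known/openid-configuration
```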
+ +The callback URL to register in your Identity Provider will be + +``` +https://your-acryl-domain.acryl.io/callback/oidc +``` + +_Note that we do not yet support LDAP or SAML authentication. Please let us know if either of these integrations would be useful for your organization._ + +## Getting Started + +Acryl DataHub is first and foremost a metadata Search & Discovery product. As such, the two most important parts of the experience are + +1. Ingesting metadata +2. Discovering metadata + +### Ingesting Metadata + +Acryl DataHub employs a push-based metadata ingestion model. In practice, this means running an Acryl-provided agent inside your organization's infrastructure, and pushing that data out to your DataHub instance in the cloud. One benefit of this approach is that metadata can be aggregated across any number of distributed sources, regardless of form or location. + +This approach comes with another benefit: security. By managing your own instance of the agent, you can keep the secrets and credentials within your walled garden. Skip uploading secrets & keys into a third-party cloud tool. + +To push metadata into DataHub, Acryl provide's an ingestion framework written in Python. Typically, push jobs are run on a schedule at an interval of your choosing. For our step-by-step guide on ingestion, click [here](docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md). + +### Discovering Metadata + +There are 2 primary ways to find metadata: search and browse. Both can be accessed via the DataHub home page. + +By default, we provide rich search capabilities across your ingested metadata. This includes the ability to search by tags, descriptions, column names, column descriptions, and more using the global search bar found on the home page. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/modeling/extending-the-metadata-model.md b/docs-website/versioned_docs/version-0.10.4/docs/modeling/extending-the-metadata-model.md new file mode 100644 index 0000000000000..6060a7c4a353f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/modeling/extending-the-metadata-model.md @@ -0,0 +1,508 @@ +--- +slug: /metadata-modeling/extending-the-metadata-model +title: Extending the Metadata Model +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/modeling/extending-the-metadata-model.md +--- + +# Extending the Metadata Model + +You can extend the metadata model by either creating a new Entity or extending an existing one. Unsure if you need to +create a new entity or add an aspect to an existing entity? Read [metadata-model](./metadata-model.md) to understand +these two concepts prior to making changes. + +## To fork or not to fork? + +An important question that will arise once you've decided to extend the metadata model is whether you need to fork the main repo or not. Use the diagram below to understand how to make this decision. + +

+ +

+ +The green lines represent pathways that will lead to lesser friction for you to maintain your code long term. The red lines represent higher risk of conflicts in the future. We are working hard to move the majority of model extension use-cases to no-code / low-code pathways to ensure that you can extend the core metadata model without having to maintain a custom fork of DataHub. + +We will refer to the two options as the **open-source fork** and **custom repository** approaches in the rest of the document below. + +## This Guide + +This guide will outline what the experience of adding a new Entity should look like through a real example of adding the +Dashboard Entity. If you want to extend an existing Entity, you can skip directly to [Step 3](#step_3). + +At a high level, an entity is made up of: + +1. A Key Aspect: Uniquely identifies an instance of an entity, +2. A list of specified Aspects, groups of related attributes that are attached to an entity. + +## Defining an Entity + +Now we'll walk through the steps required to create, ingest, and view your extensions to the metadata model. We will use +the existing "Dashboard" entity for purposes of illustration. + +### Step 1: Define the Entity Key Aspect + +A key represents the fields that uniquely identify the entity. For those familiar with DataHub’s legacy architecture, +these fields were previously part of the Urn Java Class that was defined for each entity. + +This struct will be used to generate a serialized string key, represented by an Urn. Each field in the key struct will +be converted into a single part of the Urn's tuple, in the order they are defined. + +Let’s define a Key aspect for our new Dashboard entity. + +``` +namespace com.linkedin.metadata.key + +/** + * Key for a Dashboard + */ +@Aspect = { + "name": "dashboardKey", +} +record DashboardKey { + /** + * The name of the dashboard tool such as looker, redash etc. + */ + @Searchable = { + ... + } + dashboardTool: string + + /** + * Unique id for the dashboard. This id should be globally unique for a dashboarding tool even when there are multiple deployments of it. As an example, dashboard URL could be used here for Looker such as 'looker.linkedin.com/dashboards/1234' + */ + dashboardId: string +} + +``` + +The Urn representation of the Key shown above would be: + +``` +urn:li:dashboard:(,) +``` + +Because they are aspects, keys need to be annotated with an @Aspect annotation, This instructs DataHub that this struct +can be a part of. + +The key can also be annotated with the two index annotations: @Relationship and @Searchable. This instructs DataHub +infra to use the fields in the key to create relationships and index fields for search. See [Step 3](#step_3) for more details on +the annotation model. + +**Constraints**: Note that each field in a Key Aspect MUST be of String or Enum type. + +### Step 2: Create the new entity with its key aspect + +Define the entity within an `entity-registry.yml` file. Depending on your approach, the location of this file may vary. More on that in steps [4](#step_4) and [5](#step_5). + +Example: + +```yaml +- name: dashboard + doc: A container of related data assets. + keyAspect: dashboardKey +``` + +- name: The entity name/type, this will be present as a part of the Urn. +- doc: A brief description of the entity. +- keyAspect: The name of the Key Aspect defined in step 1. This name must match the value in the PDL annotation. 
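Putting Steps 1 and 2 together: the key fields defined in the key aspect are serialized, in order, into the entity's Urn. As a purely illustrative example (the values are hypothetical), a Looker dashboard with `dashboardTool` set to `looker` and `dashboardId` set to `dashboards.1234` would be addressed as:

```
urn:li:dashboard:(looker,dashboards.1234)
```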
+ +# + +### Step 3: Define custom aspects or attach existing aspects to your entity + +Some aspects, like Ownership and GlobalTags, are reusable across entities. They can be included in an entity’s set of +aspects freely. To include attributes that are not included in an existing Aspect, a new Aspect must be created. + +Let’s look at the DashboardInfo aspect as an example of what goes into a new aspect. + +``` +namespace com.linkedin.dashboard + +import com.linkedin.common.AccessLevel +import com.linkedin.common.ChangeAuditStamps +import com.linkedin.common.ChartUrn +import com.linkedin.common.Time +import com.linkedin.common.Url +import com.linkedin.common.CustomProperties +import com.linkedin.common.ExternalReference + +/** + * Information about a dashboard + */ +@Aspect = { + "name": "dashboardInfo" +} +record DashboardInfo includes CustomProperties, ExternalReference { + + /** + * Title of the dashboard + */ + @Searchable = { + "fieldType": "TEXT_WITH_PARTIAL_MATCHING", + "queryByDefault": true, + "enableAutocomplete": true, + "boostScore": 10.0 + } + title: string + + /** + * Detailed description about the dashboard + */ + @Searchable = { + "fieldType": "TEXT", + "queryByDefault": true, + "hasValuesFieldName": "hasDescription" + } + description: string + + /** + * Charts in a dashboard + */ + @Relationship = { + "/*": { + "name": "Contains", + "entityTypes": [ "chart" ] + } + } + charts: array[ChartUrn] = [ ] + + /** + * Captures information about who created/last modified/deleted this dashboard and when + */ + lastModified: ChangeAuditStamps + + /** + * URL for the dashboard. This could be used as an external link on DataHub to allow users access/view the dashboard + */ + dashboardUrl: optional Url + + /** + * Access level for the dashboard + */ + @Searchable = { + "fieldType": "KEYWORD", + "addToFilters": true + } + access: optional AccessLevel + + /** + * The time when this dashboard last refreshed + */ + lastRefreshed: optional Time +} +``` + +The Aspect has four key components: its properties, the @Aspect annotation, the @Searchable annotation and the +@Relationship annotation. Let’s break down each of these: + +- **Aspect properties**: The record’s properties can be declared as a field on the record, or by including another + record in the Aspect’s definition (`record DashboardInfo includes CustomProperties, ExternalReference {`). Properties + can be defined as PDL primitives, enums, records, or collections ( + see [pdl schema documentation](https://linkedin.github.io/rest.li/pdl_schema)) + references to other entities, of type Urn or optionally `Urn` +- **@Aspect annotation**: Declares record is an Aspect and includes it when serializing an entity. Unlike the following + two annotations, @Aspect is applied to the entire record, rather than a specific field. Note, you can mark an aspect + as a timeseries aspect. Check out this [doc](metadata-model.md#timeseries-aspects) for details. +- **@Searchable annotation**: This annotation can be applied to any primitive field or a map field to indicate that it + should be indexed in Elasticsearch and can be searched on. For a complete guide on using the search annotation, see + the annotation docs further down in this document. +- **@Relationship annotation**: These annotations create edges between the Entity’s Urn and the destination of the + annotated field when the entities are ingested. @Relationship annotations must be applied to fields of type Urn. In + the case of DashboardInfo, the `charts` field is an Array of Urns. 
The @Relationship annotation cannot be applied + directly to an array of Urns. That’s why you see the use of an Annotation override (`”/\*”:) to apply the @Relationship + annotation to the Urn directly. Read more about overrides in the annotation docs further down on this page. + +After you create your Aspect, you need to attach to all the entities that it applies to. + +**Constraints**: Note that all aspects MUST be of type Record. + +### Step 4: Choose a place to store your model extension + +At the beginning of this document, we walked you through a flow-chart that should help you decide whether you need to maintain a fork of the open source DataHub repo for your model extensions, or whether you can just use a model extension repository that can stay independent of the DataHub repo. Depending on what path you took, the place you store your aspect model files (the .pdl files) and the entity-registry files (the yaml file called `entity-registry.yaml` or `entity-registry.yml`) will vary. + +- Open source Fork: Aspect files go under [`metadata-models`](https://github.com/datahub-project/datahub/blob/master/metadata-models) module in the main repo, entity registry goes into [`metadata-models/src/main/resources/entity-registry.yml`](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml). Read on for more details in [Step 5](#step_5). +- Custom repository: Read the [metadata-models-custom](../../metadata-models-custom/README.md) documentation to learn how to store and version your aspect models and registry. + +### Step 5: Attaching your non-key Aspect(s) to the Entity + +Attaching non-key aspects to an entity can be done simply by adding them to the entity registry yaml file. The location of this file differs based on whether you are following the oss-fork path or the custom-repository path. + +Here is an minimal example of adding our new `DashboardInfo` aspect to the `Dashboard` entity. + +```yaml +entities: + - name: dashboard + - keyAspect: dashBoardKey + aspects: + # the name of the aspect must be the same as that on the @Aspect annotation on the class + - dashboardInfo +``` + +Previously, you were required to add all aspects for the entity into an Aspect union. You will see examples of this pattern throughout the code-base (e.g. `DatasetAspect`, `DashboardAspect` etc.). This is no longer required. + +### Step 6 (Oss-Fork approach): Re-build DataHub to have access to your new or updated entity + +If you opted for the open-source fork approach, where you are editing models in the `metadata-models` repository of DataHub, you will need to re-build the DataHub metadata service using the steps below. If you are following the custom model repository approach, you just need to build your custom model repository and deploy it to a running metadata service instance to read and write metadata using your new model extensions. + +Read on to understand how to re-build DataHub for the oss-fork option. + +**_NOTE_**: If you have updated any existing types or see an `Incompatible changes` warning when building, you will need to run +`./gradlew :metadata-service:restli-servlet-impl:build -Prest.model.compatibility=ignore` +before running `build`. + +Then, run `./gradlew build` from the repository root to rebuild Datahub with access to your new entity. + +Then, re-deploy metadata-service (gms), and mae-consumer and mce-consumer (optionally if you are running them unbundled). 
See [docker development](../../docker/README.md) for details on how +to deploy during development. This will allow Datahub to read and write your new entity or extensions to existing entities, along with serving search and graph queries for that entity type. + +To emit proposals to ingest from the Datahub CLI tool, first install datahub cli +locally [following the instructions here](../../metadata-ingestion/developing.md). `./gradlew build` generated the avro +schemas your local ingestion cli tool uses earlier. After following the developing guide, you should be able to emit +your new event using the local datahub cli. + +Now you are ready to start ingesting metadata for your new entity! + +### (Optional) Step 7: Extend the DataHub frontend to view your entity in GraphQL & React + +If you are extending an entity with additional aspects, and you can use the auto-render specifications to automatically render these aspects to your satisfaction, you do not need to write any custom code. + +However, if you want to write specific code to render your model extensions, or if you introduced a whole new entity and want to give it its own page, you will need to write custom React and Grapqhl code to view and mutate your entity in GraphQL or React. For +instructions on how to start extending the GraphQL graph, see [graphql docs](../../datahub-graphql-core/README.md). Once you’ve done that, you can follow the guide [here](../../datahub-web-react/README.md) to add your entity into the React UI. + +## Metadata Annotations + +There are four core annotations that DataHub recognizes: + +#### @Entity + +**Legacy** +This annotation is applied to each Entity Snapshot record, such as DashboardSnapshot.pdl. Each one that is included in +the root Snapshot.pdl model must have this annotation. + +It takes the following parameters: + +- **name**: string - A common name used to identify the entity. Must be unique among all entities DataHub is aware of. + +##### Example + +```aidl +@Entity = { + // name used when referring to the entity in APIs. + String name; +} +``` + +#### @Aspect + +This annotation is applied to each Aspect record, such as DashboardInfo.pdl. Each aspect that is included in an entity’s +set of aspects in the `entity-registry.yml` must have this annotation. + +It takes the following parameters: + +- **name**: string - A common name used to identify the Aspect. Must be unique among all aspects DataHub is aware of. +- **type**: string (optional) - set to "timeseries" to mark this aspect as timeseries. Check out + this [doc](metadata-model.md#timeseries-aspects) for details. +- **autoRender**: boolean (optional) - defaults to false. When set to true, the aspect will automatically be displayed + on entity pages in a tab using a default renderer. **_This is currently only supported for Charts, Dashboards, DataFlows, DataJobs, Datasets, Domains, and GlossaryTerms_**. +- **renderSpec**: RenderSpec (optional) - config for autoRender aspects that controls how they are displayed. **_This is currently only supported for Charts, Dashboards, DataFlows, DataJobs, Datasets, Domains, and GlossaryTerms_**. Contains three fields: + - **displayType**: One of `tabular`, `properties`. Tabular should be used for a list of data elements, properties for a single data bag. + - **displayName**: How the aspect should be referred to in the UI. Determines the name of the tab on the entity page. + - **key**: For `tabular` aspects only. Specifies the key in which the array to render may be found. 
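As a rough sketch of how these options fit together (the aspect name, display name, and key below are hypothetical, not part of DataHub's model), an auto-rendered tabular aspect could be annotated like so:

```aidl
@Aspect = {
  // name used when referring to the aspect in APIs
  "name": "deploymentInfo",
  // render this aspect automatically in a tab on entity pages
  "autoRender": true,
  "renderSpec": {
    // a list of data elements, so use the tabular display type
    "displayType": "tabular",
    // tab name shown in the UI
    "displayName": "Deployment Info",
    // field of the aspect that holds the array to render
    "key": "deployments"
  }
}
```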
+ +##### Example + +```aidl +@Aspect = { + // name used when referring to the aspect in APIs. + String name; +} +``` + +#### @Searchable + +This annotation is applied to fields inside an Aspect. It instructs DataHub to index the field so it can be retrieved +via the search APIs. + +It takes the following parameters: + +- **fieldType**: string - The settings for how each field is indexed is defined by the field type. Each field type is + associated with a set of analyzers Elasticsearch will use to tokenize the field. Such sets are defined in the + MappingsBuider, which generates the mappings for the index for each entity given the fields with the search + annotations. To customize the set of analyzers used to index a certain field, you must add a new field type and define + the set of mappings to be applied in the MappingsBuilder. + + Thus far, we have implemented 10 fieldTypes: + + 1. _KEYWORD_ - Short text fields that only support exact matches, often used only for filtering + + 2. _TEXT_ - Text fields delimited by spaces/slashes/periods. Default field type for string variables. + + 3. _TEXT_PARTIAL_ - Text fields delimited by spaces/slashes/periods with partial matching support. Note, partial + matching is expensive, so this field type should not be applied to fields with long values (like description) + + 4. _BROWSE_PATH_ - Field type for browse paths. Applies specific mappings for slash delimited paths. + + 5. _URN_ - Urn fields where each sub-component inside the urn is indexed. For instance, for a data platform urn like + "urn:li:dataplatform:kafka", it will index the platform name "kafka" and ignore the common components + + 6. _URN_PARTIAL_ - Urn fields where each sub-component inside the urn is indexed with partial matching support. + + 7. _BOOLEAN_ - Boolean fields used for filtering. + + 8. _COUNT_ - Count fields used for filtering. + 9. _DATETIME_ - Datetime fields used to represent timestamps. + + 10. _OBJECT_ - Each property in an object will become an extra column in Elasticsearch and can be referenced as + `field.property` in queries. You should be careful to not use it on objects with many properties as it can cause a + mapping explosion in Elasticsearch. + +- **fieldName**: string (optional) - The name of the field in search index document. Defaults to the field name where + the annotation resides. +- **queryByDefault**: boolean (optional) - Whether we should match the field for the default search query. True by + default for text and urn fields. +- **enableAutocomplete**: boolean (optional) - Whether we should use the field for autocomplete. Defaults to false +- **addToFilters**: boolean (optional) - Whether or not to add field to filters. Defaults to false +- **boostScore**: double (optional) - Boost multiplier to the match score. Matches on fields with higher boost score + ranks higher. +- **hasValuesFieldName**: string (optional) - If set, add an index field of the given name that checks whether the field + exists +- **numValuesFieldName**: string (optional) - If set, add an index field of the given name that checks the number of + elements +- **weightsPerFieldValue**: map[object, double] (optional) - Weights to apply to score for a given value. + +##### Example + +Let’s take a look at a real world example using the `title` field of `DashboardInfo.pdl`: + +```aidl +record DashboardInfo { + /** + * Title of the dashboard + */ + @Searchable = { + "fieldType": "TEXT_PARTIAL", + "enableAutocomplete": true, + "boostScore": 10.0 + } + title: string + .... 
+} +``` + +This annotation is saying that we want to index the title field in Elasticsearch. We want to support partial matches on +the title, so queries for `Cust` should return a Dashboard with the title `Customers`. `enableAutocomplete` is set to +true, meaning that we can autocomplete on this field when typing into the search bar. Finally, a boostScore of 10 is +provided, meaning that we should prioritize matches to title over matches to other fields, such as description, when +ranking. + +Now, when Datahub ingests Dashboards, it will index the Dashboard’s title in Elasticsearch. When a user searches for +Dashboards, that query will be used to search on the title index and matching Dashboards will be returned. + +Note, when @Searchable annotation is applied to a map, it will convert it into a list with "key.toString() +=value.toString()" as elements. This allows us to index map fields, while not increasing the number of columns indexed. +This way, the keys can be queried by `aMapField:key1=value1`. + +You can change this behavior by specifying the fieldType as OBJECT in the @Searchable annotation. It will put each key +into a column in Elasticsearch instead of an array of serialized kay-value pairs. This way the query would look more +like `aMapField.key1:value1`. As this method will increase the number of columns with each unique key - large maps can +cause a mapping explosion in Elasticsearch. You should _not_ use the object fieldType if you expect your maps to get +large. + +#### @Relationship + +This annotation is applied to fields inside an Aspect. This annotation creates edges between an Entity’s Urn and the +destination of the annotated field when the Entity is ingested. @Relationship annotations must be applied to fields of +type Urn. + +It takes the following parameters: + +- **name**: string - A name used to identify the Relationship type. +- **entityTypes**: array[string] (Optional) - A list of entity types that are valid values for the foreign-key + relationship field. + +##### Example + +Let’s take a look at a real world example to see how this annotation is used. The `Owner.pdl` struct is referenced by +the `Ownership.pdl` aspect. `Owned.pdl` contains a relationship to a CorpUser or CorpGroup: + +``` +namespace com.linkedin.common + +/** + * Ownership information + */ +record Owner { + + /** + * Owner URN, e.g. urn:li:corpuser:ldap, urn:li:corpGroup:group_name, and urn:li:multiProduct:mp_name + */ + @Relationship = { + "name": "OwnedBy", + "entityTypes": [ "corpUser", "corpGroup" ] + } + owner: Urn + + ... +} +``` + +This annotation says that when we ingest an Entity with an Ownership Aspect, DataHub will create an OwnedBy relationship +between that entity and the CorpUser or CorpGroup who owns it. This will be queryable using the Relationships resource +in both the forward and inverse directions. + +#### Annotating Collections & Annotation Overrides + +You will not always be able to apply annotations to a primitive field directly. This may be because the field is wrapped +in an Array, or because the field is part of a shared struct that many entities reference. In these cases, you need to +use annotation overrides. 
An override is done by specifying a fieldPath to the target field inside the annotation, like +so: + +``` + /** + * Charts in a dashboard + */ + @Relationship = { + "/*": { + "name": "Contains", + "entityTypes": [ "chart" ] + } + } + charts: array[ChartUrn] = [ ] +``` + +This override applies the relationship annotation to each element in the Array, rather than the array itself. This +allows a unique Relationship to be created for between the Dashboard and each of its charts. + +Another example can be seen in the case of tags. In this case, TagAssociation.pdl has a @Searchable annotation: + +``` + @Searchable = { + "fieldName": "tags", + "fieldType": "URN_WITH_PARTIAL_MATCHING", + "queryByDefault": true, + "hasValuesFieldName": "hasTags" + } + tag: TagUrn +``` + +At the same time, SchemaField overrides that annotation to allow for searching for tags applied to schema fields +specifically. To do this, it overrides the Searchable annotation applied to the `tag` field of `TagAssociation` and +replaces it with its own- this has a different boostScore and a different fieldName. + +``` + /** + * Tags associated with the field + */ + @Searchable = { + "/tags/*/tag": { + "fieldName": "fieldTags", + "fieldType": "URN_WITH_PARTIAL_MATCHING", + "queryByDefault": true, + "boostScore": 0.5 + } + } + globalTags: optional GlobalTags +``` + +As a result, you can issue a query specifically for tags on Schema Fields via `fieldTags:` or tags directly +applied to an entity via `tags:`. Since both have `queryByDefault` set to true, you can also search for +entities with either of these properties just by searching for the tag name. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/modeling/metadata-model.md b/docs-website/versioned_docs/version-0.10.4/docs/modeling/metadata-model.md new file mode 100644 index 0000000000000..28653090a09e3 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/modeling/metadata-model.md @@ -0,0 +1,631 @@ +--- +title: The Metadata Model +sidebar_label: The Metadata Model +slug: /metadata-modeling/metadata-model +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/modeling/metadata-model.md +--- + +# How does DataHub model metadata? + +DataHub takes a schema-first approach to modeling metadata. We use the open-source Pegasus schema language ([PDL](https://linkedin.github.io/rest.li/pdl_schema)) extended with a custom set of annotations to model metadata. The DataHub storage, serving, indexing and ingestion layer operates directly on top of the metadata model and supports strong types all the way from the client to the storage layer. + +Conceptually, metadata is modeled using the following abstractions + +- **Entities**: An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity. An entity is made up of a type, e.g. 'dataset', a unique identifier (e.g. an 'urn') and groups of metadata attributes (e.g. documents) which we call aspects. + +- **Aspects**: An aspect is a collection of attributes that describes a particular facet of an entity. They are the smallest atomic unit of write in DataHub. That is, multiple aspects associated with the same Entity can be updated independently. For example, DatasetProperties contains a collection of attributes that describes a Dataset. Aspects can be shared across entities, for example "Ownership" is an aspect that is re-used across all the Entities that have owners. 
Common aspects include + + - [ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl): Captures the users and groups who own an Entity. + - [globalTags](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/GlobalTags.pdl): Captures references to the Tags associated with an Entity. + - [glossaryTerms](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/GlossaryTerms.pdl): Captures references to the Glossary Terms associated with an Entity. + - [institutionalMemory](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/InstitutionalMemory.pdl): Captures internal company Documents associated with an Entity (e.g. links!) + - [status](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdl): Captures the "deletion" status of an Entity, i.e. whether it should be soft-deleted. + - [subTypes](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/SubTypes.pdl): Captures one or more "sub types" of a more generic Entity type. An example can be a "Looker Explore" Dataset, a "View" Dataset. Specific sub types can imply that certain additional aspects are present for a given Entity. + +- **Relationships**: A relationship represents a named edge between 2 entities. They are declared via foreign key attributes within Aspects along with a custom annotation (@Relationship). Relationships permit edges to be traversed bi-directionally. For example, a Chart may refer to a CorpUser as its owner via a relationship named "OwnedBy". This edge would be walkable starting from the Chart _or_ the CorpUser instance. + +- **Identifiers (Keys & Urns)**: A key is a special type of aspect that contains the fields that uniquely identify an individual Entity. Key aspects can be serialized into _Urns_, which represent a stringified form of the key fields used for primary-key lookup. Moreover, _Urns_ can be converted back into key aspect structs, making key aspects a type of "virtual" aspect. Key aspects provide a mechanism for clients to easily read fields comprising the primary key, which are usually generally useful like Dataset names, platform names etc. Urns provide a friendly handle by which Entities can be queried without requiring a fully materialized struct. + +Here is an example graph consisting of 3 types of entity (CorpUser, Chart, Dashboard), 2 types of relationship (OwnedBy, Contains), and 3 types of metadata aspect (Ownership, ChartInfo, and DashboardInfo). + +

+ +

+ +## The Core Entities + +DataHub's "core" Entity types model the Data Assets that comprise the Modern Data Stack. They include + +1. **[Data Platform](docs/generated/metamodel/entities/dataPlatform.md)**: A type of Data "Platform". That is, an external system that is involved in processing, storing, or visualizing Data Assets. Examples include MySQL, Snowflake, Redshift, and S3. +2. **[Dataset](docs/generated/metamodel/entities/dataset.md)**: A collection of data. Tables, Views, Streams, Document Collections, and Files are all modeled as "Datasets" on DataHub. Datasets can have tags, owners, links, glossary terms, and descriptions attached to them. They can also have specific sub-types, such as "View", "Collection", "Stream", "Explore", and more. Examples include Postgres Tables, MongoDB Collections, or S3 files. +3. **[Chart](docs/generated/metamodel/entities/chart.md)**: A single data vizualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart. +4. **[Dashboard](docs/generated/metamodel/entities/dashboard.md)**: A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard. +5. **[Data Job](docs/generated/metamodel/entities/dataJob.md)** (Task): An executable job that processes data assets, where "processing" implies consuming data, producing data, or both. Data Jobs can have tags, owners, links, glossary terms, and descriptions attached to them. They must belong to a single Data Flow. Examples include an Airflow Task. +6. **[Data Flow](docs/generated/metamodel/entities/dataFlow.md)** (Pipeline): An executable collection of Data Jobs with dependencies among them, or a DAG. Data Jobs can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include an Airflow DAG. + +See the **Metadata Modeling/Entities** section on the left to explore the entire model. + +## The Entity Registry + +Where are Entities and their aspects defined in DataHub? Where does the Metadata Model "live"? The Metadata Model is stitched together by means +of an **Entity Registry**, a catalog of Entities that comprise the Metadata Graph along with the aspects associated with each. Put +simply, this is where the "schema" of the model is defined. + +Traditionally, the Entity Registry was constructed using [Snapshot](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot) models, which are schemas that explicitly tie +an Entity to the Aspects associated with it. An example is [DatasetSnapshot](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdl), which defines the core `Dataset` Entity. +The Aspects of the Dataset entity are captured via a union field inside a special "Aspect" schema. An example is +[DatasetAspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdl). 
+This file associates dataset-specific aspects (like [DatasetProperties](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProperties.pdl)) and common aspects (like [Ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl), +[InstitutionalMemory](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/InstitutionalMemory.pdl), +and [Status](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdl)) +to the Dataset Entity. This approach to defining Entities will soon be deprecated in favor of a new approach. + +As of January 2022, DataHub has deprecated support for Snapshot models as a means of adding new entities. Instead, +the Entity Registry is defined inside a YAML configuration file called [entity-registry.yml](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml), +which is provided to DataHub's Metadata Service at start up. This file declares Entities and Aspects by referring to their [names](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L7). +At boot time, DataHub validates the structure of the registry file and ensures that it can find PDL schemas associated with +each aspect name provided by configuration (via the [@Aspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L6) annotation). + +By moving to this format, evolving the Metadata Model becomes much easier. Adding Entities & Aspects becomes a matter of adding a +to the YAML configuration, instead of creating new Snapshot / Aspect files. + +## Exploring DataHub's Metadata Model + +To explore the current DataHub metadata model, you can inspect this high-level picture that shows the different entities and edges between them showing the relationships between them. + +

+ +

+ +To navigate the aspect model for specific entities and explore relationships using the `foreign-key` concept, you can view them in our demo environment or navigate the auto-generated docs in the **Metadata Modeling/Entities** section on the left. + +For example, here are helpful links to the most popular entities in DataHub's metadata model: + +- [Dataset](docs/generated/metamodel/entities/dataset.md): [Profile]() [Documentation]() +- [Dashboard](docs/generated/metamodel/entities/dashboard.md): [Profile]() [Documentation]() +- [User (a.k.a CorpUser)](docs/generated/metamodel/entities/corpuser.md): [Profile]() [Documentation]() +- [Pipeline (a.k.a DataFlow)](docs/generated/metamodel/entities/dataFlow.md): [Profile]() [Documentation]() +- [Feature Table (a.k.a. MLFeatureTable)](docs/generated/metamodel/entities/mlFeatureTable.md): [Profile]() [Documentation]() +- For the full list of entities in the metadata model, browse them [here](https://demo.datahubproject.io/browse/dataset/prod/datahub/entities) or use the **Metadata Modeling/Entities** section on the left. + +### Generating documentation for the Metadata Model + +- This website: Metadata model documentation for this website is generated using `./gradlew :docs-website:yarnBuild`, which delegates the model doc generation to the `modelDocGen` task in the `metadata-ingestion` module. +- Uploading documentation to a running DataHub Instance: The metadata model documentation can be generated and uploaded into a running DataHub instance using the command `./gradlew :metadata-ingestion:modelDocUpload`. **_NOTE_**: This will upload the model documentation to the DataHub instance running at the environment variable `$DATAHUB_SERVER` (http://localhost:8080 by default) + +## Querying the Metadata Graph + +DataHub’s modeling language allows you to optimize metadata persistence to align with query patterns. + +There are three supported ways to query the metadata graph: by primary key lookup, a search query, and via relationship traversal. + +> New to [PDL](https://linkedin.github.io/rest.li/pdl_schema) files? Don't fret. They are just a way to define a JSON document "schema" for Aspects in DataHub. All Data ingested to DataHub's Metadata Service is validated against a PDL schema, with each @Aspect corresponding to a single schema. Structurally, PDL is quite similar to [Protobuf](https://developers.google.com/protocol-buffers) and conveniently maps to JSON. + +### Querying an Entity + +#### Fetching Latest Entity Aspects (Snapshot) + +Querying an Entity by primary key means using the "entities" endpoint, passing in the +urn of the entity to retrieve. + +For example, to fetch a Chart entity, we can use the following `curl`: + +``` +curl --location --request GET 'http://localhost:8080/entities/urn%3Ali%3Achart%3Acustomers +``` + +This request will return a set of versioned aspects, each at the latest version. + +As you'll notice, we perform the lookup using the url-encoded _Urn_ associated with an entity. +The response would be an "Entity" record containing the Entity Snapshot (which in turn contains the latest aspects associated with the Entity). + +#### Fetching Versioned Aspects + +DataHub also supports fetching individual pieces of metadata about an Entity, which we call aspects. To do so, +you'll provide both an Entity's primary key (urn) along with the aspect name and version that you'd like to retrieve. 
+ +For example, to fetch the latest version of a Dataset's SchemaMetadata aspect, you would issue the following query: + +``` +curl 'http://localhost:8080/aspects/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Afoo%2Cbar%2CPROD)?aspect=schemaMetadata&version=0' + +{ + "version":0, + "aspect":{ + "com.linkedin.schema.SchemaMetadata":{ + "created":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "platformSchema":{ + "com.linkedin.schema.KafkaSchema":{ + "documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}" + } + }, + "lastModified":{ + "actor":"urn:li:corpuser:fbar", + "time":0 + }, + "schemaName":"FooEvent", + "fields":[ + { + "fieldPath":"foo", + "description":"Bar", + "type":{ + "type":{ + "com.linkedin.schema.StringType":{ + + } + } + }, + "nativeDataType":"string" + } + ], + "version":0, + "hash":"", + "platform":"urn:li:dataPlatform:foo" + } + } +} +``` + +#### Fetching Timeseries Aspects + +DataHub supports an API for fetching a group of Timeseries aspects about an Entity. For example, you may want to use this API +to fetch recent profiling runs & statistics about a Dataset. To do so, you can issue a "get" request against the `/aspects` endpoint. + +For example, to fetch dataset profiles (ie. stats) for a Dataset, you would issue the following query: + +``` +curl -X POST 'http://localhost:8080/aspects?action=getTimeseriesAspectValues' \ +--data '{ + "urn": "urn:li:dataset:(urn:li:dataPlatform:redshift,global_dev.larxynx_carcinoma_data_2020,PROD)", + "entity": "dataset", + "aspect": "datasetProfile", + "startTimeMillis": 1625122800000, + "endTimeMillis": 1627455600000 +}' + +{ + "value":{ + "limit":10000, + "aspectName":"datasetProfile", + "endTimeMillis":1627455600000, + "startTimeMillis":1625122800000, + "entityName":"dataset", + "values":[ + { + "aspect":{ + "value":"{\"timestampMillis\":1626912000000,\"fieldProfiles\":[{\"uniqueProportion\":1.0,\"sampleValues\":[\"123MMKK12\",\"13KDFMKML\",\"123NNJJJL\"],\"fieldPath\":\"id\",\"nullCount\":0,\"nullProportion\":0.0,\"uniqueCount\":3742},{\"uniqueProportion\":1.0,\"min\":\"1524406400000\",\"max\":\"1624406400000\",\"sampleValues\":[\"1640023230002\",\"1640343012207\",\"16303412330117\"],\"mean\":\"1555406400000\",\"fieldPath\":\"date\",\"nullCount\":0,\"nullProportion\":0.0,\"uniqueCount\":3742},{\"uniqueProportion\":0.037,\"min\":\"21\",\"median\":\"68\",\"max\":\"92\",\"sampleValues\":[\"45\",\"65\",\"81\"],\"mean\":\"65\",\"distinctValueFrequencies\":[{\"value\":\"12\",\"frequency\":103},{\"value\":\"54\",\"frequency\":12}],\"fieldPath\":\"patient_age\",\"nullCount\":0,\"nullProportion\":0.0,\"uniqueCount\":79},{\"uniqueProportion\":0.00820873786407767,\"sampleValues\":[\"male\",\"female\"],\"fieldPath\":\"patient_gender\",\"nullCount\":120,\"nullProportion\":0.03,\"uniqueCount\":2}],\"rowCount\":3742,\"columnCount\":4}", + "contentType":"application/json" + } + }, + ] + } +} +``` + +You'll notice that the aspect itself is serialized as escaped JSON. This is part of a shift toward a more generic set of READ / WRITE APIs +that permit serialization of aspects in different ways. By default, the content type will be JSON, and the aspect can be deserialized into a normal JSON object +in the language of your choice. 
Note that this will soon become the de-facto way to both write and read individual aspects. + +### Search Query + +A search query allows you to search for entities matching an arbitrary string. + +For example, to search for entities matching the term "customers", we can use the following CURL: + +``` +curl --location --request POST 'http://localhost:8080/entities?action=search' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "input": "\"customers\"", + "entity": "chart", + "start": 0, + "count": 10 +}' +``` + +The notable parameters are `input` and `entity`. `input` specifies the query we are issuing and `entity` specifies the Entity Type we want to search over. This is the common name of the Entity as defined in the @Entity definition. The response contains a list of Urns, that can be used to fetch the full entity. + +### Relationship Query + +A relationship query allows you to find Entity connected to a particular source Entity via an edge of a particular type. + +For example, to find the owners of a particular Chart, we can use the following CURL: + +``` +curl --location --request GET --header 'X-RestLi-Protocol-Version: 2.0.0' 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3Achart%3Acustomers&types=List(OwnedBy)' +``` + +The notable parameters are `direction`, `urn` and `types`. The response contains _Urns_ associated with all entities connected +to the primary entity (urn:li:chart:customer) by an relationship named "OwnedBy". That is, it permits fetching the owners of a given +chart. + +### Special Aspects + +There are a few special aspects worth mentioning: + +1. Key aspects: Contain the properties that uniquely identify an Entity. +2. Browse Paths aspect: Represents a hierarchical path associated with an Entity. + +#### Key aspects + +As introduced above, Key aspects are structs / records that contain the fields that uniquely identify an Entity. There are +some constraints about the fields that can be present in Key aspects: + +- All fields must be of STRING or ENUM type +- All fields must be REQUIRED + +Keys can be created from and turned into _Urns_, which represent the stringified version of the Key record. +The algorithm used to do the conversion is straightforward: the fields of the Key aspect are substituted into a +string template based on their index (order of definition) using the following template: + +```aidl +// Case 1: # key fields == 1 +urn:li::key-field-1 + +// Case 2: # key fields > 1 +urn:li::(key-field-1, key-field-2, ... key-field-n) +``` + +By convention, key aspects are defined under [metadata-models/src/main/pegasus/com/linkedin/metadata/key](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/metadata/key). + +##### Example + +A CorpUser can be uniquely identified by a "username", which should typically correspond to an LDAP name. + +Thus, it's Key Aspect is defined as the following: + +```aidl +namespace com.linkedin.metadata.key + +/** + * Key for a CorpUser + */ +@Aspect = { + "name": "corpUserKey" +} +record CorpUserKey { + /** + * The name of the AD/LDAP user. + */ + username: string +} +``` + +and it's Entity Snapshot model is defined as + +```aidl +/** + * A metadata snapshot for a specific CorpUser entity. + */ +@Entity = { + "name": "corpuser", + "keyAspect": "corpUserKey" +} +record CorpUserSnapshot { + + /** + * URN for the entity the metadata snapshot is associated with. 
+ */ + urn: CorpuserUrn + + /** + * The list of metadata aspects associated with the CorpUser. Depending on the use case, this can either be all, or a selection, of supported aspects. + */ + aspects: array[CorpUserAspect] +} +``` + +Using a combination of the information provided by these models, we are able to generate the Urn corresponding to a CorpUser as + +``` +urn:li:corpuser: +``` + +Imagine we have a CorpUser Entity with the username "johnsmith". In this world, the JSON version of the Key Aspect associated with the Entity would be + +```aidl +{ + "username": "johnsmith" +} +``` + +and its corresponding Urn would be + +```aidl +urn:li:corpuser:johnsmith +``` + +#### BrowsePaths aspect + +The BrowsePaths aspect allows you to define a custom "browse path" for an Entity. A browse path is a way to hierarchically organize +entities. They manifest within the "Explore" features on the UI, allowing users to navigate through trees of related entities of a given type. + +To support browsing a particular entity, add the "browsePaths" aspect to the entity in your `entity-registry.yml` file. + +```aidl +/// entity-registry.yml +entities: + - name: dataset + doc: Datasets represent logical or physical data assets stored or represented in various data platforms. Tables, Views, Streams are all instances of datasets. + keyAspect: datasetKey + aspects: + ... + - browsePaths +``` + +By declaring this aspect, you can produce custom browse paths as well as query for browse paths manually using a CURL like the following: + +```aidl +curl --location --request POST 'http://localhost:8080/entities?action=browse' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "path": "/my/custom/browse/path", + "entity": "dataset", + "start": 0, + "limit": 10 +}' +``` + +Please note you must provide: + +- The "/"-delimited root path for which to fetch results. +- An entity "type" using its common name ("dataset" in the example above). + +### Types of Aspect + +There are 2 "types" of Metadata Aspects. Both are modeled using PDL schemas, and both can be ingested in the same way. +However, they differ in what they represent and how they are handled by DataHub's Metadata Service. + +#### 1. Versioned Aspects + +Versioned Aspects each have a **numeric version** associated with them. When a field in an aspect changes, a new +version is automatically created and stored within DataHub's backend. In practice, all versioned aspects are stored inside a relational database +that can be backed up and restored. Versioned aspects power much of the UI experience you're used to, including Ownership, Descriptions, +Tags, Glossary Terms, and more. Examples include Ownership, Global Tags, and Glossary Terms. + +#### 2. Timeseries Aspects + +Timeseries Aspects each have a **timestamp** associated with them. They are useful for representing +time-ordered events about an Entity. For example, the results of profiling a Dataset, or a set of Data Quality checks that +run every day. It is important to note that Timeseries aspects are NOT persisted inside the relational store, and are instead +persisted only in the search index (e.g. elasticsearch) and the message queue (Kafka). This makes restoring timeseries aspects +in a disaster scenario a bit more challenge. Timeseries aspects can be queried by time range, which is what makes them most different from Versioned Aspects. 
+A timeseries aspect can be identified by the "timeseries" [type](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl#L10) in its [@Aspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl#L8) annotation. +Examples include [DatasetProfile](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl) & [DatasetUsageStatistics](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetUsageStatistics.pdl). + +Timeseries aspects are aspects that have a timestampMillis field and are meant for data that changes continuously over +time, e.g. data profiles, usage statistics, etc. + +Each timeseries aspect must declare "type": "timeseries" and must +include [TimeseriesAspectBase](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/timeseries/TimeseriesAspectBase.pdl), +which contains a timestampMillis field. + +Timeseries aspects cannot have any fields that have the @Searchable or @Relationship annotation, as they go through a +completely different flow. + +Please refer +to [DatasetProfile](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl) +to see an example of a timeseries aspect. + +Because timeseries aspects are updated frequently, ingests of these aspects go straight to Elasticsearch +(instead of being stored in the local DB). + +You can retrieve timeseries aspects using the "aspects?action=getTimeseriesAspectValues" endpoint. + +##### Aggregatable Timeseries aspects + +Being able to perform SQL-like _group by + aggregate_ operations on the timeseries aspects is a very natural use-case for +this kind of data (dataset profiles, usage statistics etc.). This section describes how to define, ingest and perform an +aggregation query against a timeseries aspect. + +###### Defining a new aggregatable Timeseries aspect + +The _@TimeseriesField_ and the _@TimeseriesFieldCollection_ are two annotations that can be attached to a field of +a _Timeseries aspect_ to allow it to be part of an aggregatable query; a sketch showing both follows this list. The kinds of aggregations allowed on these +annotated fields depend on the type of the field, as well as the kind of aggregation, as +described [here](#Performing-an-aggregation-on-a-Timeseries-aspect). + +- `@TimeseriesField = {}` - this annotation can be used with any non-collection type field of the aspect, such as + primitive types and records (see the _stat_, _strStat_ and _strArray_ fields + of [TestEntityProfile.pdl](https://github.com/datahub-project/datahub/blob/master/test-models/src/main/pegasus/com/datahub/test/TestEntityProfile.pdl)). + +- `@TimeseriesFieldCollection = {"key":"<key field name>"}` - this annotation allows for + aggregation support on the items of a collection type (supported only for array-type collections for now), where the + value of `"key"` is the name of the field in the collection item type that will be used to specify the group-by clause + (see the _userCounts_ and _fieldCounts_ fields of [DatasetUsageStatistics.pdl](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetUsageStatistics.pdl)).
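Continuing the hypothetical sketch from above (all names are illustrative and modeled on [DatasetUsageStatistics.pdl](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetUsageStatistics.pdl); `AcmeUserQueryCount` is assumed to be a record with a `userEmail` field), the two annotations are attached to fields like this:

```aidl
namespace com.acme.metadata

import com.linkedin.timeseries.TimeseriesAspectBase

@Aspect = {
  "name": "acmeDatasetQueryProfile",
  "type": "timeseries"
}
record AcmeDatasetQueryProfile includes TimeseriesAspectBase {
  /**
   * Simple (non-collection) field: eligible for aggregations such as LATEST and SUM.
   */
  @TimeseriesField = {}
  queryCount: int

  /**
   * Collection field: aggregations on its items are grouped by the userEmail
   * field of each AcmeUserQueryCount item, as specified by "key".
   */
  @TimeseriesFieldCollection = {"key": "userEmail"}
  userCounts: array[AcmeUserQueryCount]
}
```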
+ +In addition to defining the new aspect with appropriate Timeseries annotations, +the [entity-registry.yml](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml) +file needs to be updated as well. Just add the new aspect name under the list of aspects against the appropriate entity as shown below, such as `datasetUsageStatistics` for the aspect DatasetUsageStatistics. + +```yaml +entities: + - name: dataset + keyAspect: datasetKey + aspects: + - datasetProfile + - datasetUsageStatistics +``` + +###### Ingesting a Timeseries aspect + +Timeseries aspects can be ingested via the GMS REST endpoint `/aspects?action=ingestProposal` or via the Python API. + +Example 1: Via the GMS REST API using curl. + +```shell +curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "proposal" : { + "entityType": "dataset", + "entityUrn" : "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)", + "changeType" : "UPSERT", + "aspectName" : "datasetUsageStatistics", + "aspect" : { + "value" : "{ \"timestampMillis\":1629840771000,\"uniqueUserCount\" : 10, \"totalSqlQueries\": 20, \"fieldCounts\": [ {\"fieldPath\": \"col1\", \"count\": 20}, {\"fieldPath\" : \"col2\", \"count\": 5} ]}", + "contentType": "application/json" + } + } +}' +``` + +Example 2: Via the Python API to Kafka (or REST) + +```python +from datahub.metadata.schema_classes import ( + ChangeTypeClass, + DatasetFieldUsageCountsClass, + DatasetUsageStatisticsClass, +) +from datahub.emitter.kafka_emitter import DatahubKafkaEmitter +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.emitter.rest_emitter import DatahubRestEmitter + +usageStats = DatasetUsageStatisticsClass( + timestampMillis=1629840771000, + uniqueUserCount=10, + totalSqlQueries=20, + fieldCounts=[ + DatasetFieldUsageCountsClass( + fieldPath="col1", + count=10 + ) + ] + ) + +mcpw = MetadataChangeProposalWrapper( + entityType="dataset", + aspectName="datasetUsageStatistics", + changeType=ChangeTypeClass.UPSERT, + entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)", + aspect=usageStats, +) + +# Instantiate the appropriate emitter (kafka_emitter/rest_emitter) +# my_emitter = DatahubKafkaEmitter("""""") +my_emitter = DatahubRestEmitter("http://localhost:8080") +my_emitter.emit(mcpw) +``` + +###### Performing an aggregation on a Timeseries aspect + +Aggregations on timeseries aspects can be performed via the GMS REST endpoint `/analytics?action=getTimeseriesStats`, which +accepts the following params. + +- `entityName` - The name of the entity the aspect is associated with. +- `aspectName` - The name of the aspect. +- `filter` - Any pre-filtering criteria before grouping and aggregations are performed. +- `metrics` - A list of aggregation specifications. The `fieldPath` member of an aggregation specification refers to the + field name against which the aggregation needs to be performed, and the `aggregationType` specifies the kind of aggregation. +- `buckets` - A list of grouping bucket specifications. Each grouping bucket has a `key` field that refers to the field + to use for grouping. The `type` field specifies the kind of grouping bucket. + +We support three kinds of aggregations that can be specified in an aggregation query on the Timeseries annotated fields.
+The values that `aggregationType` can take are: + +- `LATEST`: The latest value of the field in each bucket. Supported for any type of field. +- `SUM`: The cumulative sum of the field in each bucket. Supported only for integral types. +- `CARDINALITY`: The number of unique values or the cardinality of the set in each bucket. Supported for string and + record types. + +We support two types of grouping for defining the buckets to perform aggregations against: + +- `DATE_GROUPING_BUCKET`: Allows for creating time-based buckets such as by second, minute, hour, day, week, month, + quarter, year etc. Should be used in conjunction with a timestamp field whose value is in milliseconds since _epoch_. + The `timeWindowSize` param specifies the date histogram bucket width. +- `STRING_GROUPING_BUCKET`: Allows for creating buckets grouped by the unique values of a field. Should always be used in + conjunction with a string type field. + +The API returns a generic SQL-like table as the `table` member of the output that contains the results of +the `group-by/aggregate` query, in addition to echoing the input params. + +- `columnNames`: the names of the table columns. The group-by `key` names appear in the same order as they are specified + in the request. Aggregation specifications follow the grouping fields in the same order as specified in the request, + and will be named `<aggregationType>_<fieldPath>` (e.g. `latest_uniqueUserCount` in the sample response below). +- `columnTypes`: the data types of the columns. +- `rows`: the data values, each row corresponding to the respective bucket(s). + +Example: Latest unique user count for each day. + +```shell +# QUERY +curl --location --request POST 'http://localhost:8080/analytics?action=getTimeseriesStats' \ +--header 'X-RestLi-Protocol-Version: 2.0.0' \ +--header 'Content-Type: application/json' \ +--data-raw '{ + "entityName": "dataset", + "aspectName": "datasetUsageStatistics", + "filter": { + "criteria": [] + }, + "metrics": [ + { + "fieldPath": "uniqueUserCount", + "aggregationType": "LATEST" + } + ], + "buckets": [ + { + "key": "timestampMillis", + "type": "DATE_GROUPING_BUCKET", + "timeWindowSize": { + "multiple": 1, + "unit": "DAY" + } + } + ] +}' + +# SAMPLE RESPONSE +{ + "value": { + "filter": { + "criteria": [] + }, + "aspectName": "datasetUsageStatistics", + "entityName": "dataset", + "groupingBuckets": [ + { + "type": "DATE_GROUPING_BUCKET", + "timeWindowSize": { + "multiple": 1, + "unit": "DAY" + }, + "key": "timestampMillis" + } + ], + "aggregationSpecs": [ + { + "fieldPath": "uniqueUserCount", + "aggregationType": "LATEST" + } + ], + "table": { + "columnNames": [ + "timestampMillis", + "latest_uniqueUserCount" + ], + "rows": [ + [ + "1631491200000", + "1" + ] + ], + "columnTypes": [ + "long", + "int" + ] + } + } +} +``` + +For more examples of complex group-by/aggregations, refer to the tests in the group `getAggregatedStats` of [ElasticSearchTimeseriesAspectServiceTest.java](https://github.com/datahub-project/datahub/blob/master/metadata-io/src/test/java/com/linkedin/metadata/timeseries/elastic/ElasticSearchTimeseriesAspectServiceTest.java); one more illustrative query is sketched below.
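As a further illustration (a sketch only, not a captured request or response; the dotted field paths follow the collection-field convention described above and should be adjusted to your aspect), a query that sums per-column counts from `fieldCounts`, bucketed by day and by the string `fieldPath` value, might look like:

```shell
curl --location --request POST 'http://localhost:8080/analytics?action=getTimeseriesStats' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
  "entityName": "dataset",
  "aspectName": "datasetUsageStatistics",
  "filter": { "criteria": [] },
  "metrics": [
    { "fieldPath": "fieldCounts.count", "aggregationType": "SUM" }
  ],
  "buckets": [
    { "key": "timestampMillis", "type": "DATE_GROUPING_BUCKET", "timeWindowSize": { "multiple": 1, "unit": "DAY" } },
    { "key": "fieldCounts.fieldPath", "type": "STRING_GROUPING_BUCKET" }
  ]
}'
```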
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/ownership/ownership-types.md b/docs-website/versioned_docs/version-0.10.4/docs/ownership/ownership-types.md new file mode 100644 index 0000000000000..40e79ef853023 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/ownership/ownership-types.md @@ -0,0 +1,197 @@ +--- +title: Custom Ownership Types +slug: /ownership/ownership-types +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/ownership/ownership-types.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Custom Ownership Types + + + +**🤝 Version compatibility** + +> Open Source DataHub: **0.10.3** | Acryl: **0.2.8** + +## What are Custom Ownership Types? + +Custom Ownership Types are an improvement on the way to establish ownership relationships between users and the data assets they manage within DataHub. + +## Why Custom Ownership Types? + +DataHub brings a pre-defined opinion on ownership relationships. We are aware that it may not always precisely match what you may need. +With this feature you can modify it to better match the terminology used by stakeholders. + +## Benefits of Custom Ownership Types + +Custom ownership types allow users to bring in their organization's ownership nomenclature straight into DataHub. +This allows stakeholders to discover what relationships an owner of an entity has using the language already in-use at organizations. + +## How Can You Use Custom Ownership Types? + +Custom Ownership types have been implemented as a net-new entity in DataHub's Metadata Model meaning all entity-related APIs can be used for them. +Additionally, they can be managed through DataHub's Admin UI and then used for ownership across the system in the same way pre-existing ownership types are. + +## Custom Ownership Types Setup, Prerequisites, and Permissions + +What you need to create and add ownership types: + +- **Manage Ownership Types** metadata privilege to create/delete/update Ownership Types at the platform level. These can be granted by a [Platform Policy](./../authorization/policies.md#platform-policies). +- **Edit Owners** metadata privilege to add or remove an owner with an associated custom ownership type for a given entity. + +You can create this privileges by creating a new [Metadata Policy](./../authorization/policies.md#metadata-policies). + +## Using Custom Ownership Types + +Custom Ownership Types can be managed using the UI, via a graphQL command or ingesting an MCP which can be managed using software engineering (GitOps) practices. + +### Managing Custom Ownership Types + + + + +To manage a Custom Ownership type, first navigate to the DataHub Admin page: + +

+ +

+ +

+ +Then navigate to the `Ownership Types` tab under the `Management` section. + +To create a new type simply click '+ Create new Ownership Type'. + +This will open a new modal where you can configure your Ownership Type. + +Inside the form, you can choose a name for your Ownership Type. You can also add descriptions for your ownership types to help other users more easily understand their meaning. + +Don't worry, this can be changed later. + +

+ +

+ +Once you've chosen a name and a description, click 'Save' to create the new Ownership Type. + +You can also edit and delete types in this UI by click on the ellipsis in the management view for the type you wish to change/delete. +
+ +Just like all other DataHub metadata entities DataHub ships with a JSON-based custom ownership type spec, for defining and managing Custom Ownership Types as code. + +Here is an example of a custom ownership type named "Architect": + +```json +# Inlined from /metadata-ingestion/examples/ownership/ownership_type.json +{ + "urn": "urn:li:ownershipType:architect", + "info": { + "name": "Architect", + "description": "Technical person responsible for the asset" + } +} +``` + +To upload this file to DataHub, use the `datahub` cli via the `ingest` group of commands using the file-based recipe: + +```yaml +# see https://datahubproject.io/docs/generated/ingestion/sources/file for complete documentation +source: + type: "file" + config: + # path to json file + filename: "metadata-ingestion/examples/ownership/ownership_type.json" + +# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation +sink: + type: "datahub-rest" + config: + server: "http://localhost:9002/api/gms" +``` + +Finally running + +```shell +datahub ingest -c recipe.yaml +``` + +For any update you wish to do, simply update the json file and re-ingest via the cli. + +To delete the ownership type, simply run a [delete command](../how/delete-metadata.md#soft-delete-the-default) for the urn of the ownership type in question, in this case `urn:li:ownershipType:architect`. + + + + +You can also create/update/delete custom ownership types using DataHub's built-in [`GraphiQL` editor](../api/graphql/how-to-set-up-graphql.md#graphql-explorer-graphiql): + +```json +mutation { + createOwnershipType( + input: { + name: "Architect" + description: "Technical person responsible for the asset" + } + ) { + urn + type + info { + name + description + } + } +} +``` + +If you see the following response, the operation was successful: + +```json +{ + "data": { + "createOwnershipType": { + "urn": "urn:li:ownershipType:ccf9aa80-e3f3-4620-93a1-8d4a2ceaf5de", + "type": "CUSTOM_OWNERSHIP_TYPE", + "status": null, + "info": { + "name": "Architect", + "description": "Technical person responsible for the asset", + "created": null, + "lastModified": null + } + } + }, + "extensions": {} +} +``` + +There are also `updateOwnershipType`, `deleteOwnershipType` and `listOwnershipTypes` endpoints for CRUD operations. + +Feel free to read our [GraphQL reference documentation](../api/graphql/overview.md) on these endpoints. + +
+ +### Assigning a Custom Ownership Type to an Entity (UI) + +You can assign an owner with a custom ownership type to an entity either using the Entity's page as the starting point. + +On an Entity's profile page, use the right sidebar to locate the Owners section. + +

+ +

+ +Click 'Add Owners', select the owner you want and then search for the Custom Ownership Type you'd like to add this asset to. When you're done, click 'Add'. + +

+ +

+ +To remove ownership from an asset, click the 'x' icon on the Owner label. + +> Notice: Adding or removing an Owner on an asset requires the `Edit Owners` Metadata Privilege, which can be granted +> by a [Policy](./../authorization/policies.md). + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/platform-instances.md b/docs-website/versioned_docs/version-0.10.4/docs/platform-instances.md new file mode 100644 index 0000000000000..f867474b68f05 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/platform-instances.md @@ -0,0 +1,55 @@ +--- +title: Working With Platform Instances +slug: /platform-instances +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/platform-instances.md +--- + +# Working With Platform Instances + +DataHub's metadata model for Datasets currently supports a three-part key: + +- Data Platform (e.g. urn:li:dataPlatform:mysql) +- Name (e.g. db.schema.name) +- Env or Fabric (e.g. DEV, PROD, etc.) + +This naming scheme unfortunately does not allow for easy representation of the multiplicity of platforms (or technologies) that might be deployed at an organization within the same environment or fabric. For example, an organization might have multiple Redshift instances in Production and would want to see all the data assets located in those instances inside the DataHub metadata repository. + +As part of the `v0.8.24+` releases, we are unlocking the first phase of supporting Platform Instances in the metadata model. This is done via two main additions: + +- The `dataPlatformInstance` aspect that has been added to Datasets, which allows datasets to be associated with an instance of a platform +- Enhancements to all ingestion sources that allow them to attach a platform instance to the recipe, which changes the generated urns from the `urn:li:dataset:(urn:li:dataPlatform:<platform>,<table_name>,ENV)` format to the `urn:li:dataset:(urn:li:dataPlatform:<platform>,<platform_instance>.<table_name>,ENV)` format (see the example below). Sources that produce lineage to datasets in other platforms (e.g. Looker, Superset etc) also have specific configuration additions that allow the recipe author to specify the mapping between a platform and the instance name that it should be mapped to. + +
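For example (an illustrative sketch; the table name is made up, and `core_finance` is the instance name used in the recipe later in this guide), a MySQL table would move from the first URN form to the second once `platform_instance` is set:

```
# Without a platform instance
urn:li:dataset:(urn:li:dataPlatform:mysql,dbname.customers,PROD)

# With platform_instance: core_finance
urn:li:dataset:(urn:li:dataPlatform:mysql,core_finance.dbname.customers,PROD)
```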

+ +

+ +## Naming Platform Instances + +When configuring a platform instance, choose an instance name that is understandable and will be stable for the foreseeable future. e.g. `core_warehouse` or `finance_redshift` are allowed names, as are pure guids like `a37dc708-c512-4fe4-9829-401cd60ed789`. Remember that whatever instance name you choose, you will need to specify it in more than one recipe to ensure that the identifiers produced by different sources will line up. + +## Enabling Platform Instances + +Read the Ingestion source specific guides for how to enable platform instances in each of them. +The general pattern is to add an additional optional configuration parameter called `platform_instance`. + +e.g. here is how you would configure a recipe to ingest a mysql instance that you want to call `core_finance` + +```yaml +source: + type: mysql + config: + # Coordinates + host_port: localhost:3306 + platform_instance: core_finance + database: dbname + + # Credentials + username: root + password: example + +sink: + # sink configs +``` + +## diff --git a/docs-website/versioned_docs/version-0.10.4/docs/plugins.md b/docs-website/versioned_docs/version-0.10.4/docs/plugins.md new file mode 100644 index 0000000000000..5604c9190f659 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/plugins.md @@ -0,0 +1,324 @@ +--- +title: Plugins Guide +slug: /plugins +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/plugins.md" +--- + +# Plugins Guide + +Plugins are way to enhance the basic DataHub functionality in a custom manner. + +Currently, DataHub formally supports 2 types of plugins: + +- [Authentication](#authentication) +- [Authorization](#authorization) + +## Authentication + +> **Note:** This is in BETA version + +> It is recommend that you do not do this unless you really know what you are doing + +Custom authentication plugin makes it possible to authenticate DataHub users against any Identity Management System. +Choose your Identity Management System and write custom authentication plugin as per detail mentioned in this section. + +> Currently, custom authenticators cannot be used to authenticate users of DataHub's web UI. This is because the DataHub web app expects the presence of 2 special cookies PLAY_SESSION and actor which are explicitly set by the server when a login action is performed. +> Instead, custom authenticators are useful for authenticating API requests to DataHub's backend (GMS), and can stand in addition to the default Authentication performed by DataHub, which is based on DataHub-minted access tokens. + +The sample authenticator implementation can be found at [Authenticator Sample](https://github.com/datahub-project/datahub/blob/master/metadata-service/plugin/src/test/sample-test-plugins) + +### Implementing an Authentication Plugin + +1. Add _datahub-auth-api_ as compileOnly dependency: Maven coordinates of _datahub-auth-api_ can be found at [Maven](https://mvnrepository.com/artifact/io.acryl/datahub-auth-api) + + Example of gradle dependency is given below. + + ```groovy + dependencies { + + def auth_api = 'io.acryl:datahub-auth-api:0.9.3-3rc3' + compileOnly "${auth_api}" + testImplementation "${auth_api}" + + } + ``` + +2. Implement the Authenticator interface: Refer [Authenticator Sample](https://github.com/datahub-project/datahub/blob/master/metadata-service/plugin/src/test/sample-test-plugins) + +
+ Sample class which implements the Authenticator interface + + ```java + public class GoogleAuthenticator implements Authenticator { + + @Override + public void init(@Nonnull Map authenticatorConfig, @Nullable AuthenticatorContext context) { + // Plugin initialization code will go here + // DataHub will call this method on boot time + } + + @Nullable + @Override + public Authentication authenticate(@Nonnull AuthenticationRequest authenticationRequest) + throws AuthenticationException { + // DataHub will call this method whenever authentication decisions are need to be taken + // Authenticate the request and return Authentication + } + } + ``` + +
+ +3. Use `getResourceAsStream` to read files: If your plugin read any configuration file like properties or YAML or JSON or xml then use `this.getClass().getClassLoader().getResourceAsStream("")` to read that file from DataHub GMS plugin's class-path. For DataHub GMS resource look-up behavior please refer [Plugin Installation](#plugin-installation) section. Sample code of `getResourceAsStream` is available in sample Authenticator plugin [TestAuthenticator.java](https://github.com/datahub-project/datahub/blob/master/metadata-service/plugin/src/test/sample-test-plugins/src/main/java/com/datahub/plugins/test/TestAuthenticator.java). + +4. Bundle your Jar: Use `com.github.johnrengelman.shadow` gradle plugin to create an uber jar. + + To see an example of building an uber jar, check out the `build.gradle` file for the apache-ranger-plugin file of [Apache Ranger Plugin](https://github.com/acryldata/datahub-ranger-auth-plugin/tree/main/apache-ranger-plugin) for reference. + + Exclude signature files as shown in below `shadowJar` task. + + ```groovy + apply plugin: 'com.github.johnrengelman.shadow'; + shadowJar { + // Exclude com.datahub.plugins package and files related to jar signature + exclude "META-INF/*.RSA", "META-INF/*.SF","META-INF/*.DSA" + } + ``` + +5. Refer section [Plugin Installation](#plugin-installation) for plugin installation in DataHub environment + +## Enable GMS Authentication + +By default, authentication is disabled in DataHub GMS. + +Follow below steps to enable GMS authentication + +1. Download docker-compose.quickstart.yml: Download docker compose file [docker-compose.quickstart.yml](https://github.com/datahub-project/datahub/blob/master/docker/quickstart/docker-compose.quickstart.yml) + +2. Set environment variable: Set `METADATA_SERVICE_AUTH_ENABLED` environment variable to `true` + +3. Redeploy DataHub GMS: Below is quickstart command to redeploy DataHub GMS + + ```shell + datahub docker quickstart -f docker-compose.quickstart.yml + ``` + +## Authorization + +> **Note:** This is in BETA version + +> It is recommend that you do not do this unless you really know what you are doing + +Custom authorization plugin makes it possible to authorize DataHub users against any Access Management System. +Choose your Access Management System and write custom authorization plugin as per detail mentioned in this section. + +The sample authorizer implementation can be found at [Authorizer Sample](https://github.com/acryldata/datahub-ranger-auth-plugin/tree/main/apache-ranger-plugin) + +### Implementing an Authorization Plugin + +1. Add _datahub-auth-api_ as compileOnly dependency: Maven coordinates of _datahub-auth-api_ can be found at [Maven](https://mvnrepository.com/artifact/io.acryl/datahub-auth-api) + + Example of gradle dependency is given below. + + ```groovy + dependencies { + + def auth_api = 'io.acryl:datahub-auth-api:0.9.3-3rc3' + compileOnly "${auth_api}" + testImplementation "${auth_api}" + + } + ``` + +2. Implement the Authorizer interface: [Authorizer Sample](https://github.com/acryldata/datahub-ranger-auth-plugin/tree/main/apache-ranger-plugin) + +
+ Sample class which implements the Authorization interface + + ```java + public class ApacheRangerAuthorizer implements Authorizer { + @Override + public void init(@Nonnull Map authorizerConfig, @Nonnull AuthorizerContext ctx) { + // Plugin initialization code will go here + // DataHub will call this method on boot time + } + + @Override + public AuthorizationResult authorize(@Nonnull AuthorizationRequest request) { + // DataHub will call this method whenever authorization decisions are need be taken + // Authorize the request and return AuthorizationResult + } + + @Override + public AuthorizedActors authorizedActors(String privilege, Optional resourceSpec) { + // Need to add doc + } + } + ``` + +
+ +3. Use `getResourceAsStream` to read files: If your plugin read any configuration file like properties or YAML or JSON or xml then use `this.getClass().getClassLoader().getResourceAsStream("")` to read that file from DataHub GMS plugin's class-path. For DataHub GMS resource look-up behavior please refer [Plugin Installation](#plugin-installation) section. Sample code of `getResourceAsStream` is available in sample Authenticator plugin [TestAuthenticator.java](https://github.com/datahub-project/datahub/blob/master/metadata-service/plugin/src/test/sample-test-plugins/src/main/java/com/datahub/plugins/test/TestAuthenticator.java). + +4. Bundle your Jar: Use `com.github.johnrengelman.shadow` gradle plugin to create an uber jar. + + To see an example of building an uber jar, check out the `build.gradle` file for the apache-ranger-plugin file of [Apache Ranger Plugin](https://github.com/acryldata/datahub-ranger-auth-plugin/tree/main/apache-ranger-plugin) for reference. + + Exclude signature files as shown in below `shadowJar` task. + + ```groovy + apply plugin: 'com.github.johnrengelman.shadow'; + shadowJar { + // Exclude com.datahub.plugins package and files related to jar signature + exclude "META-INF/*.RSA", "META-INF/*.SF","META-INF/*.DSA" + } + ``` + +5. Install the Plugin: Refer to the section (Plugin Installation)[#plugin_installation] for plugin installation in DataHub environment + +## Plugin Installation + +DataHub's GMS Service searches for the plugins in container's local directory at location `/etc/datahub/plugins/auth/`. This location will be referred as `plugin-base-directory` hereafter. + +For docker, we set docker-compose to mount `${HOME}/.datahub` directory to `/etc/datahub` directory within the GMS containers. + +### Docker + +Follow below steps to install plugins: + +Lets consider you have created an uber jar for authorizer plugin and jar name is apache-ranger-authorizer.jar and class com.abc.RangerAuthorizer has implemented the [Authorizer](https://github.com/datahub-project/datahub/blob/master/metadata-auth/auth-api/src/main/java/com/datahub/plugins/auth/authorization/Authorizer.java) interface. + +1. Create a plugin configuration file: Create a `config.yml` file at `${HOME}/.datahub/plugins/auth/`. For more detail on configuration refer [Config Detail](#config-detail) section + +2. Create a plugin directory: Create plugin directory as `apache-ranger-authorizer`, this directory will be referred as `plugin-home` hereafter + + ```shell + mkdir -p ${HOME}/.datahub/plugins/auth/apache-ranger-authorizer + ``` + +3. Copy plugin jar to `plugin-home`: Copy `apache-ranger-authorizer.jar` to `plugin-home` + + ```shell + copy apache-ranger-authorizer.jar ${HOME}/.datahub/plugins/auth/apache-ranger-authorizer + ``` + +4. Update plugin configuration file: Add below entry in `config.yml` file, the plugin can take any arbitrary configuration under the "configs" block. in our example, there is username and password + + ```yaml + plugins: + - name: "apache-ranger-authorizer" + type: "authorizer" + enabled: "true" + params: + className: "com.abc.RangerAuthorizer" + configs: + username: "foo" + password: "fake" + ``` + +5. Restart datahub-gms container: + + On startup DataHub GMS service performs below steps + + 1. Load `config.yml` + 2. Prepare list of plugin where `enabled` is set to `true` + 3. Look for directory equivalent to plugin `name` in `plugin-base-directory`. In this case it is `/etc/datahub/plugins/auth/apache-ranger-authorizer/`, this directory will become `plugin-home` + 4. 
Look for the `params.jarFileName` attribute; otherwise look for a jar having the name <plugin-name>.jar (see the config sketch after this list). In this case it is `/etc/datahub/plugins/auth/apache-ranger-authorizer/apache-ranger-authorizer.jar` + 5. Load the class given in the plugin's `params.className` attribute from the jar, here load class `com.abc.RangerAuthorizer` from `apache-ranger-authorizer.jar` + 6. Call the `init` method of the plugin + +
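For example, if your jar does not follow the <plugin-name>.jar naming convention, you can point DataHub at it explicitly with `params.jarFileName`. The sketch below mirrors the `config.yml` shown in step 4; the jar file name is illustrative only:

```yaml
plugins:
  - name: "apache-ranger-authorizer"
    type: "authorizer"
    enabled: "true"
    params:
      className: "com.abc.RangerAuthorizer"
      # Optional: only needed when the jar is not named <plugin-name>.jar
      jarFileName: "ranger-authorizer-0.1.0-uber.jar"
      configs:
        username: "foo"
        password: "fake"
```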
On method call of `getResourceAsStream` DataHub GMS service looks for the resource in below order. + + 1. Look for the requested resource in plugin-jar file. if found then return the resource as InputStream. + 2. Look for the requested resource in `plugin-home` directory. if found then return the resource as InputStream. + 3. Look for the requested resource in application class-loader. if found then return the resource as InputStream. + 4. Return `null` as requested resource is not found. + +By default, authentication is disabled in DataHub GMS, Please follow section [Enable GMS Authentication](#enable-gms-authentication) to enable authentication. + +### Kubernetes + +Helm support is coming soon. + +## Config Detail + +A sample `config.yml` can be found at [config.yml](https://github.com/datahub-project/datahub/blob/master/metadata-service/plugin/src/test/resources/valid-base-plugin-dir1/config.yml). + +`config.yml` structure: + +| Field | Required | Type | Default | Description | +| ---------------------------- | -------- | ------------------------------- | ------------------------------- | ---------------------------------------------------------------------------------------- | +| plugins[].name | ✅ | string | | name of the plugin | +| plugins[].type | ✅ | enum[authenticator, authorizer] | | type of plugin, possible values are authenticator or authorizer | +| plugins[].enabled | ✅ | boolean | | whether this plugin is enabled or disabled. DataHub GMS wouldn't process disabled plugin | +| plugins[].params.className | ✅ | string | | Authenticator or Authorizer implementation class' fully qualified class name | +| plugins[].params.jarFileName | | string | default to `plugins[].name`.jar | jar file name in `plugin-home` | +| plugins[].params.configs | | map | default to empty map | Runtime configuration required for plugin | + +> plugins[] is an array of plugin, where you can define multiple authenticator and authorizer plugins. plugin name should be unique in plugins array. + +## Plugin Permissions + +Adhere to below plugin access control to keep your plugin forward compatible. + +- Plugin should read/write file to and from `plugin-home` directory only. Refer [Plugin Installation](#plugin-installation) step2 for `plugin-home` definition +- Plugin should access port 80 or 443 or port higher than 1024 + +All other access are forbidden for the plugin. + +> Disclaimer: In BETA version your plugin can access any port and can read/write to any location on file system, however you should implement the plugin as per above access permission to keep your plugin compatible with upcoming release of DataHub. + +## Migration Of Plugins From application.yml + +If you have any custom Authentication or Authorization plugin define in `authorization` or `authentication` section of [application.yml](https://github.com/datahub-project/datahub/blob/master/metadata-service/configuration/src/main/resources/application.yml) then migrate them as per below steps. + +1. Implement Plugin: For Authentication Plugin follow steps of [Implementing an Authentication Plugin](#implementing-an-authentication-plugin) and for Authorization Plugin follow steps of [Implementing an Authorization Plugin](#implementing-an-authorization-plugin) +2. Install Plugin: Install the plugins as per steps mentioned in [Plugin Installation](#plugin-installation). 
Here you need to map the configuration from [application.yml](https://github.com/datahub-project/datahub/blob/master/metadata-service/configuration/src/main/resources/application.yml) to configuration in `config.yml`. This mapping from `application.yml` to `config.yml` is described below + + **Mapping for Authenticators** + + a. In `config.yml` set `plugins[].type` to `authenticator` + + b. `authentication.authenticators[].type` is mapped to `plugins[].params.className` + + c. `authentication.authenticators[].configs` is mapped to `plugins[].params.configs` + + Example Authenticator Plugin configuration in `config.yml` + + ```yaml + plugins: + - name: "apache-ranger-authenticator" + type: "authenticator" + enabled: "true" + params: + className: "com.abc.RangerAuthenticator" + configs: + username: "foo" + password: "fake" + + ``` + + **Mapping for Authorizer** + + a. In `config.yml` set `plugins[].type` to `authorizer` + + b. `authorization.authorizers[].type` is mapped to `plugins[].params.className` + + c. `authorization.authorizers[].configs` is mapped to `plugins[].params.configs` + + Example Authorizer Plugin configuration in `config.yml` + + ```yaml + plugins: + - name: "apache-ranger-authorizer" + type: "authorizer" + enabled: "true" + params: + className: "com.abc.RangerAuthorizer" + configs: + username: "foo" + password: "fake" + + ``` + +3. Move any other configurations files of your plugin to `plugin_home` directory. The detail about `plugin_home` is mentioned in [Plugin Installation](#plugin-installation) section. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/posts.md b/docs-website/versioned_docs/version-0.10.4/docs/posts.md new file mode 100644 index 0000000000000..e062147be593e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/posts.md @@ -0,0 +1,135 @@ +--- +title: About DataHub Posts +sidebar_label: Posts +slug: /posts +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/posts.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Posts + + +DataHub allows users to make Posts that can be displayed on the app. Currently, Posts are only supported on the Home Page, but may be extended to other surfaces of the app in the future. Posts can be used to accomplish the following: + +- Allowing Admins to post announcements on the home page +- Pinning important DataHub assets or pages +- Pinning important external links + +## Posts Setup, Prerequisites, and Permissions + +Anyone can view Posts on the home page. To create Posts, a user must either have the **Create Global Announcements** Privilege, or possess the **Admin** DataHub Role. + +## Using Posts + +To create a post, users must use the [createPost](../graphql/mutations.md#createPost) GraphQL mutation. There is currently no way to create posts using the UI, though this will come in the future. + +There is only one type of Post that can be currently made, and that is a **Home Page Announcement**. This may be extended in the future to other surfaces. + +DataHub currently supports two types of Post content. Posts can either contain **TEXT** or can be a **LINK**. When creating a post through GraphQL, users will have to supply the post content. + +For **TEXT** posts, the following pieces of information are required in the `content` object (of type [UpdatePostContentInput](../graphql/inputObjects.md#updatepostcontentinput)) of the GraphQL `input` (of type [CreatePostInput](../graphql/inputObjects.md#createpostinput))). 
**TEXT** posts cannot be clicked. + +- `contentType: TEXT` +- `title` +- `description` + +The `link` and `media` attributes are currently unused for **TEXT** posts. + +For **LINK** posts, the following pieces of information are required in the `content` object (of type [UpdatePostContentInput](../graphql/inputObjects.md#updatepostcontentinput)) of the GraphQL `input` (of type [CreatePostInput](../graphql/inputObjects.md#createpostinput))). **LINK** posts redirect to the provided link when clicked. + +- `contentType: LINK` +- `title` +- `link` +- `media`. Currently only the **IMAGE** type is supported, and the URL of the image must be provided + +The `description` attribute is currently unused for **LINK** posts. + +Here are some examples of Posts displayed on the home page, with one **TEXT** post and two **LINK** posts. + +

+ +

+ +### GraphQL + +- [createPost](../graphql/mutations.md#createpost) +- [listPosts](../graphql/queries.md#listposts) +- [deletePosts](../graphql/queries.md#listposts) + +### Examples + +##### Create Post + +```graphql +mutation test { +  createPost( +    input: { + postType: HOME_PAGE_ANNOUNCEMENT, + content: { + contentType: TEXT, + title: "Planed Upgrade 2023-03-23 20:05 - 2023-03-23 23:05", + description: "datahub upgrade to v0.10.1" + } + } +  ) +} + +``` + +##### List Post + +```graphql +query listPosts($input: ListPostsInput!) { +  listPosts(input: $input) { +    start +    count +    total +    posts { +      urn +      type +      postType +      content { +        contentType +        title +        description +        link +        media { +          type +          location +          __typename +        } +        __typename +      } +      __typename +    } +    __typename +  } +} + +``` + +##### Input for list post + +```shell +{ +  "input": { +    "start": 0, +    "count": 10 +  } +} +``` + +##### Delete Post + +```graphql +mutation deletePosting { + deletePost ( +  urn: "urn:li:post:61dd86fa-9e76-4924-ad45-3a533671835e" + ) +} +``` + +## FAQ and Troubleshooting + +_Need more help with Posts? Join the conversation in [Slack](http://slack.datahubproject.io)! Please post in the **#ui** channel!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/configuration.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/configuration.md new file mode 100644 index 0000000000000..49ef1c1527bdc --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/configuration.md @@ -0,0 +1,153 @@ +--- +title: Configuration +sidebar_label: Configuration +slug: /quick-ingestion-guides/bigquery/configuration +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/bigquery/configuration.md +--- + +# Configuring Your BigQuery Connector to DataHub + +Now that you have created a Service Account and Service Account Key in BigQuery in [the prior step](setup.md), it's now time to set up a connection via the DataHub UI. + +## Configure Secrets + +1. Within DataHub, navigate to the **Ingestion** tab in the top, right corner of your screen + +

+ Navigate to the "Ingestion Tab" +

+ +:::note +If you do not see the Ingestion tab, please contact your DataHub admin to grant you the correct permissions +::: + +2. Navigate to the **Secrets** tab and click **Create new secret** + +

+ Secrets Tab +

+ +3. Create a Private Key secret + +This will securely store your BigQuery Service Account Private Key within DataHub + +- Enter a name like `BIGQUERY_PRIVATE_KEY` - we will use this later to refer to the secret +- Copy and paste the `private_key` value from your Service Account Key +- Optionally add a description +- Click **Create** + +

+ Private Key Secret +

+ +4. Create a Private Key ID secret + +This will securely store your BigQuery Service Account Private Key ID within DataHub + +- Click **Create new secret** again +- Enter a name like `BIGQUERY_PRIVATE_KEY_ID` - we will use this later to refer to the secret +- Copy and paste the `private_key_id` value from your Service Account Key +- Optionally add a description +- Click **Create** + +

+ Private Key Id Secret +

+ +## Configure Recipe + +5. Navigate to the **Sources** tab and click **Create new source** + +

+ Click "Create new source" +

+ +6. Select BigQuery + +

+ Select BigQuery from the options +

+ +7. Fill out the BigQuery Recipe + +You can find the following details in your Service Account Key file: + +- Project ID +- Client Email +- Client ID + +Populate the Secret Fields by selecting the Private Key and Private Key ID secrets you created in steps 3 and 4. + +
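If you prefer to review the YAML behind the form, the UI assembles a source recipe roughly like the sketch below. The values are taken from the sample Service Account Key shown in the setup guide and from the secrets created in steps 3 and 4; double-check the exact field names against the [BigQuery Ingestion Reference Guide](/docs/generated/ingestion/sources/bigquery/) before relying on them.

```yaml
source:
  type: bigquery
  config:
    credential:
      project_id: "project-id-1234567"
      client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
      client_id: "113545814931671546333"
      private_key: "${BIGQUERY_PRIVATE_KEY}"
      private_key_id: "${BIGQUERY_PRIVATE_KEY_ID}"
```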

+ Fill out the BigQuery Recipe +

+ +8. Click **Test Connection** + +This step will ensure you have configured your credentials accurately and confirm you have the required permissions to extract all relevant metadata. + +

+ Test BigQuery connection +

+ +After you have successfully tested your connection, click **Next**. + +## Schedule Execution + +Now it's time to schedule a recurring ingestion pipeline to regularly extract metadata from your BigQuery instance. + +9. Decide how regularly you want this ingestion to run-- day, month, year, hour, minute, etc. Select from the dropdown +

+ schedule selector +

+ +10. Ensure you've configured your correct timezone +

+ timezone_selector +

+ +11. Click **Next** when you are done + +## Finish Up + +12. Name your ingestion source, then click **Save and Run** +

+ Name your ingestion +

+ +You will now find your new ingestion source running + +

+ ingestion_running +

+ +## Validate Ingestion Runs + +13. View the latest status of ingestion runs on the Ingestion page + +

+ ingestion succeeded +

+ +14. Click the plus sign to expand the full list of historical runs and outcomes; click **Details** to see the outcomes of a specific run + +

+ ingestion_details +

+ +15. From the Ingestion Run Details page, pick **View All** to see which entities were ingested + +

+ ingestion_details_view_all +

+ +16. Pick an entity from the list to manually validate if it contains the detail you expected + +

+ ingestion_details_view_all +

+ +**Congratulations!** You've successfully set up BigQuery as an ingestion source for DataHub! + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/overview.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/overview.md new file mode 100644 index 0000000000000..f8d5135dae300 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/overview.md @@ -0,0 +1,44 @@ +--- +title: Overview +sidebar_label: Overview +slug: /quick-ingestion-guides/bigquery/overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/bigquery/overview.md +--- + +# BigQuery Ingestion Guide: Overview + +## What You Will Get Out of This Guide + +This guide will help you set up the BigQuery connector through the DataHub UI to begin ingesting metadata into DataHub. + +Upon completing this guide, you will have a recurring ingestion pipeline that will extract metadata from BigQuery and load it into DataHub. This will include to following BigQuery asset types: + +- [Projects](https://cloud.google.com/bigquery/docs/resource-hierarchy#projects) +- [Datasets](https://cloud.google.com/bigquery/docs/datasets-intro) +- [Tables](https://cloud.google.com/bigquery/docs/tables-intro) +- [Views](https://cloud.google.com/bigquery/docs/views-intro) +- [Materialized Views](https://cloud.google.com/bigquery/docs/materialized-views-intro) + +This recurring ingestion pipeline will also extract: + +- **Usage statistics** to help you understand recent query activity +- **Table-level lineage** (where available) to automatically define interdependencies between datasets +- **Table- and column-level profile statistics** to help you understand the shape of the data + +:::caution +You will NOT have extracted [Routines](https://cloud.google.com/bigquery/docs/routines), [Search Indexes](https://cloud.google.com/bigquery/docs/search-intro) from BigQuery, as the connector does not support ingesting these assets +::: + +## Next Steps + +If that all sounds like what you're looking for, navigate to the [next page](setup.md), where we'll talk about prerequisites + +## Advanced Guides and Reference + +If you're looking to do something more in-depth, want to use CLI instead of the DataHub UI, or just need to look at the reference documentation for this connector, use these links: + +- Learn about CLI Ingestion in the [Introduction to Metadata Ingestion](../../../metadata-ingestion/README.md) +- [BigQuery Ingestion Reference Guide](/docs/generated/ingestion/sources/bigquery/#module-bigquery) + +_Need more help? 
Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/setup.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/setup.md new file mode 100644 index 0000000000000..682ae713702dc --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/bigquery/setup.md @@ -0,0 +1,69 @@ +--- +title: Setup +sidebar_label: Setup +slug: /quick-ingestion-guides/bigquery/setup +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/bigquery/setup.md +--- + +# BigQuery Ingestion Guide: Setup & Prerequisites + +To configure ingestion from BigQuery, you'll need a [Service Account](https://cloud.google.com/iam/docs/creating-managing-service-accounts) configured with the proper permission sets and an associated [Service Account Key](https://cloud.google.com/iam/docs/creating-managing-service-account-keys). + +This setup guide will walk you through the steps you'll need to take via your Google Cloud Console. + +## BigQuery Prerequisites + +If you do not have an existing Service Account and Service Account Key, please work with your BigQuery Admin to ensure you have the appropriate permissions and/or roles to continue with this setup guide. + +When creating and managing new Service Accounts and Service Account Keys, we have found the following permissions and roles to be required: + +- Create a Service Account: `iam.serviceAccounts.create` permission +- Assign roles to a Service Account: `serviceusage.services.enable` permission +- Set permission policy to the project: `resourcemanager.projects.setIamPolicy` permission +- Generate Key for Service Account: Service Account Key Admin (`roles/iam.serviceAccountKeyAdmin`) IAM role + +:::note +Please refer to the BigQuery [Permissions](https://cloud.google.com/iam/docs/permissions-reference) and [IAM Roles](https://cloud.google.com/iam/docs/understanding-roles) references for details +::: + +## BigQuery Setup + +1. To set up a new Service Account follow [this guide](https://cloud.google.com/iam/docs/creating-managing-service-accounts) + +2. When you are creating a Service Account, assign the following predefined Roles: + - [BigQuery Job User](https://cloud.google.com/bigquery/docs/access-control#bigquery.jobUser) + - [BigQuery Metadata Viewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.metadataViewer) + - [BigQuery Resource Viewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.resourceViewer) -> This role is for Table-Level Lineage and Usage extraction + - [Logs View Accessor](https://cloud.google.com/bigquery/docs/access-control#bigquery.dataViewer) -> This role is for Table-Level Lineage and Usage extraction + - [BigQuery Data Viewer](https://cloud.google.com/bigquery/docs/access-control#bigquery.dataViewer) -> This role is for Profiling + - [BigQuery Read Session User](https://cloud.google.com/bigquery/docs/access-control#bigquery.readSessionUser) -> This role is for Profiling + +:::note +You can always add/remove roles to Service Accounts later on. Please refer to the BigQuery [Manage access to projects, folders, and organizations](https://cloud.google.com/iam/docs/granting-changing-revoking-access) guide for more details. +::: + +3. Create and download a [Service Account Key](https://cloud.google.com/iam/docs/creating-managing-service-account-keys). We will use this to set up authentication within DataHub. 
+ +The key file looks like this: + +```json +{ + "type": "service_account", + "project_id": "project-id-1234567", + "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0", + "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----", + "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com", + "client_id": "113545814931671546333", + "auth_uri": "https://accounts.google.com/o/oauth2/auth", + "token_uri": "https://oauth2.googleapis.com/token", + "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", + "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com" +} +``` + +## Next Steps + +Once you've confirmed all of the above in BigQuery, it's time to [move on](configuration.md) to configure the actual ingestion source within the DataHub UI. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/configuration.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/configuration.md new file mode 100644 index 0000000000000..e7da12c58979b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/configuration.md @@ -0,0 +1,165 @@ +--- +title: Configuration +sidebar_label: Configuration +slug: /quick-ingestion-guides/powerbi/configuration +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/powerbi/configuration.md +--- + +# Configuring Your PowerBI Connector to DataHub + +Now that you have created a DataHub specific Azure AD app with the relevant access in [the prior step](setup.md), it's now time to set up a connection via the DataHub UI. + +## Configure Secrets + +1. Within DataHub, navigate to the **Ingestion** tab in the top, right corner of your screen + +

+ Navigate to the "Ingestion Tab" +

+ +:::note +If you do not see the Ingestion tab, please contact your DataHub admin to grant you the correct permissions +::: + +2. Navigate to the **Secrets** tab and click **Create new secret**. + +

+ Secrets Tab +

+ +3. Create a client id secret + +This will securely store your PowerBI `Application (client) ID` within DataHub + +- Enter a name like `POWER_BI_CLIENT_ID` - we will use this later to refer to the `Application (client) ID` +- Enter the `Application (client) ID` +- Optionally add a description +- Click **Create** + +

+ Application (client) ID +

+ +4. Create a secret to store the Azure AD Client Secret + +This will securely store your client secret" + +- Enter a name like `POWER_BI_CLIENT_SECRET` - we will use this later to refer to the client secret +- Enter the client secret +- Optionally add a description +- Click **Create** + +

+ Azure AD app Secret +

+ +## Configure Recipe + +1. Navigate to the **Sources** tab and click **Create new source** + +

+ Click "Create new source" +

+ +2. Choose PowerBI + +

+ Select PowerBI from the options +

+ +3. Enter details into the PowerBI Recipe + + You need to set minimum 3 field in the recipe: + + a. **tenant_id:** This is the unique identifier (GUID) of the Azure Active Directory instance. Tenant Id can be found at: PowerBI Portal -> Click on `?` at top-right corner -> Click on `About PowerBI` + +

+ Select PowerBI from the options +

+ + On `About PowerBI` window copy `ctid`: + +

+ copy ctid +

+ + b. **client_id:** Use the secret POWER_BI_CLIENT_ID with the format "${POWER_BI_CLIENT_ID}". + + c. **client_secret:** Use the secret POWER_BI_CLIENT_SECRET with the format "${POWER_BI_CLIENT_SECRET}". + +Optionally, use the `workspace_id_pattern` field to filter for specific workspaces. + + config: + ... + workspace_id_pattern: + allow: + - "258829b1-82b1-4bdb-b9fb-6722c718bbd3" + +Your recipe should look something like this: + +
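For reference, the assembled YAML typically looks like the sketch below. The tenant id is a placeholder, and `tenant_id`, `client_id`, and `client_secret` are the three required fields described above; verify the field names against the [PowerBI Ingestion Source](/docs/generated/ingestion/sources/powerbi) reference.

```yaml
source:
  type: powerbi
  config:
    tenant_id: "your-tenant-id-guid"   # the ctid value copied from "About PowerBI"
    client_id: "${POWER_BI_CLIENT_ID}"
    client_secret: "${POWER_BI_CLIENT_SECRET}"
    workspace_id_pattern:
      allow:
        - "258829b1-82b1-4bdb-b9fb-6722c718bbd3"
```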

+ tenant id +

+ +After completing the recipe, click **Next**. + +## Schedule Execution + +Now it's time to schedule a recurring ingestion pipeline to regularly extract metadata from your PowerBI instance. + +1. Decide how regularly you want this ingestion to run-- day, month, year, hour, minute, etc. Select from the dropdown + +

+ schedule selector +

+ +2. Ensure you've configured your correct timezone +

+ timezone_selector +

+ +3. Click **Next** when you are done + +## Finish Up + +1. Name your ingestion source, then click **Save and Run** +

+ Name your ingestion +

+ +You will now find your new ingestion source running + +

+ ingestion_running +

+ +## Validate Ingestion Runs + +1. View the latest status of ingestion runs on the Ingestion page + +

+ ingestion succeeded +

+ +2. Click the plus sign to expand the full list of historical runs and outcomes; click **Details** to see the outcomes of a specific run + +

+ ingestion_details +

+ +3. From the Ingestion Run Details page, pick **View All** to see which entities were ingested + +

+ ingestion_details_view_all +

+ +4. Pick an entity from the list to manually validate if it contains the detail you expected + +

+ ingestion_details_view_all +

+ +**Congratulations!** You've successfully set up PowerBI as an ingestion source for DataHub! + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/overview.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/overview.md new file mode 100644 index 0000000000000..9842281860ddb --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/overview.md @@ -0,0 +1,37 @@ +--- +title: Overview +sidebar_label: Overview +slug: /quick-ingestion-guides/powerbi/overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/powerbi/overview.md +--- + +# PowerBI Ingestion Guide: Overview + +## What You Will Get Out of This Guide + +This guide will help you set up the PowerBI connector to begin ingesting metadata into DataHub. + +Upon completing this guide, you will have a recurring ingestion pipeline that will extract metadata from PowerBI and load it into DataHub. This will include to following PowerBI asset types: + +- Dashboards +- Tiles +- Reports +- Pages +- Datasets +- Lineage + +_To learn more about setting these advanced values, check out the [PowerBI Ingestion Source](/docs/generated/ingestion/sources/powerbi)._ + +## Next Steps + +Continue to the [setup guide](setup.md), where we'll describe the prerequisites. + +## Advanced Guides and Reference + +If you want to ingest metadata from PowerBI using the DataHub CLI, check out the following resources: + +- Learn about CLI Ingestion in the [Introduction to Metadata Ingestion](../../../metadata-ingestion/README.md) +- [PowerBI Ingestion Source](/docs/generated/ingestion/sources/powerbi) + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/setup.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/setup.md new file mode 100644 index 0000000000000..751c7d8b38134 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/powerbi/setup.md @@ -0,0 +1,87 @@ +--- +title: Setup +sidebar_label: Setup +slug: /quick-ingestion-guides/powerbi/setup +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/powerbi/setup.md +--- + +# PowerBI Ingestion Guide: Setup & Prerequisites + +In order to configure ingestion from PowerBI, you'll first have to ensure you have an Azure AD app with permission to access the PowerBI resources. + +## PowerBI Prerequisites + +1. **Create an Azure AD app:** Follow below steps to create an Azure AD app + + a. Login to https://portal.azure.com + + b. Go to `Azure Active Directory` + + c. Navigate to `App registrations` + + d. Click on `+ New registration` + + e. On `Register an application` window fill the `Name` of application says `powerbi-app-connector` and keep other default as is + +

+ app_registration +

+ + f. On the `Register an application` window, click on `Register` + + g. The Azure portal will open the `powerbi-app-connector` window as shown below. On this screen, note down the `Application (client) ID` and click on `Add a certificate or secret` to generate a secret for the `Application (client) ID` + +<p align="center">

+ powerbi_app_connector +

+ + h. On the `powerbi-connector-app | Certificates & secrets` window, generate the client secret and note down the `Secret` + +2. **Create an Azure AD Security Group:** You need to add the Azure AD app to a security group to control resource permissions for it. Follow the steps below to create an Azure AD Security Group. + + a. Go to `Azure Active Directory` + + b. Navigate to `Groups` and click on `New group` + + c. On the `New group` window, fill out the `Group type`, `Group name`, and `Group description`. `Group type` should be set to `Security`. The `New group` window is shown in the screenshot below. + +<p align="center">

+ powerbi_app_connector +

+ + d. On the `New group` window, click on `No members selected` and add the `Azure AD app` (i.e. _powerbi-connector-app_) as a member + + e. On the `New group` window, click on `Create` to create the security group `powerbi-connector-app-security-group`. + +3. **Assign privileges to powerbi-connector-app-security-group:** You need to add the newly created security group in the PowerBI portal to grant it resource access. Follow the steps below to assign privileges to your security group. + + a. Login to https://app.powerbi.com/ + + b. Go to `Settings` -> `Admin Portal` + + c. On the `Admin Portal`, navigate to `Tenant settings` as shown in the screenshot below. + +<p align="center">

+ powerbi_admin_portal +

+ + d. **Enable PowerBI API:** Under `Tenant settings` -> `Developer settings` -> `Allow service principals to use Power BI APIs`, add the previously created security group (i.e. _powerbi-connector-app-security-group_) to `Specific security groups (Recommended)` + + e. **Enable Admin API Settings:** Under `Tenant settings` -> `Admin API settings`, enable the following options + + - `Allow service principals to use read-only admin APIs` + - `Enhance admin APIs responses with detailed metadata` + - `Enhance admin APIs responses with DAX and mashup expressions` + + f. **Add Security Group to Workspace:** Navigate to the `Workspaces` window, open the workspace you want to ingest (as shown in the screenshot below), click on `Access`, and add `powerbi-connector-app-security-group` as a member + +<p align="center">

+ workspace-window-underlined +

+ +## Next Steps + +Once you've done all of the above steps, it's time to [move on](configuration.md) to configuring the actual ingestion source within DataHub. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/configuration.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/configuration.md new file mode 100644 index 0000000000000..f980d9f367b34 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/configuration.md @@ -0,0 +1,141 @@ +--- +title: Configuration +sidebar_label: Configuration +slug: /quick-ingestion-guides/redshift/configuration +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/configuration.md +--- + +# Configuring Your Redshift Connector to DataHub + +Now that you have created a DataHub user in Redshift in [the prior step](setup.md), it's time to set up a connection via the DataHub UI. + +## Configure Secrets + +1. Within DataHub, navigate to the **Ingestion** tab in the top, right corner of your screen + +

+ Navigate to the "Ingestion Tab" +

+ +:::note +If you do not see the Ingestion tab, please contact your DataHub admin to grant you the correct permissions +::: + +2. Navigate to the **Secrets** tab and click **Create new secret** + +

+ Secrets Tab +

+ +3. Create a Redshift User's Password secret + +This will securely store your Redshift User's password within DataHub + +- Click **Create new secret** again +- Enter a name like `REDSHIFT_PASSWORD` - we will use this later to refer to the secret +- Enter your `datahub` redshift user's password +- Optionally add a description +- Click **Create** + +

+ Redshift Password Secret +

+ +## Configure Recipe + +4. Navigate to the **Sources** tab and click **Create new source** + +

+ Click "Create new source" +

+ +5. Select Redshift + +

+ Select Redshift from the options 

+ +6. Fill out the Redshift Recipe + +Populate the **Password** field by selecting the Redshift password secret (e.g. `REDSHIFT_PASSWORD`) you created in step 3; a sample recipe sketch follows the screenshot below. + +<p align="center">

+ Fill out the Redshift Recipe +

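+
+For reference, the YAML view of this form might look roughly like the minimal sketch below. The `host_port` and `database` values are made-up placeholders, and the exact field names should be verified against the [Redshift Ingestion Reference Guide](/docs/generated/ingestion/sources/redshift/#module-redshift).
+
+```yml
+source:
+  type: redshift
+  config:
+    # Placeholder cluster endpoint; replace with your own host and port
+    host_port: "my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com:5439"
+    # The database to ingest (the source ingests one database per recipe)
+    database: "dev"
+    # The datahub user created during setup
+    username: "datahub"
+    # Resolves to the secret created in step 3
+    password: "${REDSHIFT_PASSWORD}"
+```
+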
+ + + +## Schedule Execution + +Now it's time to schedule a recurring ingestion pipeline to regularly extract metadata from your Redshift instance. + +7. Decide how regularly you want this ingestion to run-- day, month, year, hour, minute, etc. Select from the dropdown + +

+ schedule selector +

+ +8. Ensure you've configured the correct timezone + +<p align="center">

+ timezone_selector +

+ +9. Click **Next** when you are done + +## Finish Up + +10. Name your ingestion source, then click **Save and Run** + +

+ Name your ingestion +

+ +You will now find your new ingestion source running + +

+ ingestion_running +

+ +## Validate Ingestion Runs + +11. View the latest status of ingestion runs on the Ingestion page + +

+ ingestion succeeded +

+ +12. Click the plus sign to expand the full list of historical runs and outcomes; click **Details** to see the outcomes of a specific run + +

+ ingestion_details +

+ +13. From the Ingestion Run Details page, pick **View All** to see which entities were ingested + +

+ ingestion_details_view_all +

+ +14. Pick an entity from the list to manually validate if it contains the detail you expected + +

+ ingestion_details_view_all +

+ +**Congratulations!** You've successfully set up Redshift as an ingestion source for DataHub! + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/overview.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/overview.md new file mode 100644 index 0000000000000..3f5803302a7bf --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/overview.md @@ -0,0 +1,43 @@ +--- +title: Overview +sidebar_label: Overview +slug: /quick-ingestion-guides/redshift/overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/overview.md +--- + +# Redshift Ingestion Guide: Overview + +## What You Will Get Out of This Guide + +This guide will help you set up the Redshift connector through the DataHub UI to begin ingesting metadata into DataHub. + +Upon completing this guide, you will have a recurring ingestion pipeline that will extract metadata from Redshift and load it into DataHub. This will include to following Redshift asset types: + +- Database +- Schemas (External and Internal) +- Tables (External and Internal) +- Views + +This recurring ingestion pipeline will also extract: + +- **Usage statistics** to help you understand recent query activity +- **Table-level lineage** (where available) to automatically define interdependencies between datasets +- **Table- and column-level profile statistics** to help you understand the shape of the data + +:::caution +The source currently can ingest one database with one recipe +::: + +## Next Steps + +If that all sounds like what you're looking for, navigate to the [next page](setup.md), where we'll talk about prerequisites + +## Advanced Guides and Reference + +If you're looking to do something more in-depth, want to use CLI instead of the DataHub UI, or just need to look at the reference documentation for this connector, use these links: + +- Learn about CLI Ingestion in the [Introduction to Metadata Ingestion](../../../metadata-ingestion/README.md) +- [Redshift Ingestion Reference Guide](/docs/generated/ingestion/sources/redshift/#module-redshift) + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/setup.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/setup.md new file mode 100644 index 0000000000000..443c3cae03d01 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/redshift/setup.md @@ -0,0 +1,40 @@ +--- +title: Setup +sidebar_label: Setup +slug: /quick-ingestion-guides/redshift/setup +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/setup.md +--- + +# Redshift Ingestion Guide: Setup & Prerequisites + +To configure ingestion from Redshift, you'll need a [User](https://docs.aws.amazon.com/redshift/latest/gsg/t_adding_redshift_user_cmd.html) configured with the proper permission sets, and an associated. + +This setup guide will walk you through the steps you'll need to take via your Google Cloud Console. + +## Redshift Prerequisites + +1. Connect to your Amazon Redshift cluster using an SQL client such as SQL Workbench/J or Amazon Redshift Query Editor with your Admin user. +2. 
Create a [Redshift User](https://docs.aws.amazon.com/redshift/latest/gsg/t_adding_redshift_user_cmd.html) that will be used to perform the metadata extraction if you don't have one already. + For example: + +```sql +CREATE USER datahub WITH PASSWORD 'Datahub1234'; +``` + +## Redshift Setup + +1. Grant the following permission to your `datahub` user: + +```sql +ALTER USER datahub WITH SYSLOG ACCESS UNRESTRICTED; +GRANT SELECT ON pg_catalog.svv_table_info to datahub; +GRANT SELECT ON pg_catalog.svl_user_info to datahub; + +``` + +## Next Steps + +Once you've confirmed all of the above in Redshift, it's time to [move on](configuration.md) to configure the actual ingestion source within the DataHub UI. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/configuration.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/configuration.md new file mode 100644 index 0000000000000..bf28db559898a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/configuration.md @@ -0,0 +1,151 @@ +--- +title: Configuration +sidebar_label: Configuration +slug: /quick-ingestion-guides/snowflake/configuration +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/snowflake/configuration.md +--- + +# Configuring Your Snowflake Connector to DataHub + +Now that you have created a DataHub-specific user with the relevant roles in Snowflake in [the prior step](setup.md), it's now time to set up a connection via the DataHub UI. + +## Configure Secrets + +1. Within DataHub, navigate to the **Ingestion** tab in the top, right corner of your screen + +

+ Navigate to the "Ingestion Tab" +

+ +:::note +If you do not see the Ingestion tab, please contact your DataHub admin to grant you the correct permissions +::: + +2. Navigate to the **Secrets** tab and click **Create new secret** + +

+ Secrets Tab +

+ +3. Create a Password secret + +This will securely store your Snowflake password within DataHub + +- Enter a name like `SNOWFLAKE_PASSWORD` - we will use this later to refer to the secret +- Enter the password configured for the DataHub user in the previous step +- Optionally add a description +- Click **Create** + +

+ Snowflake Password Secret +

+ +## Configure Recipe + +4. Navigate to the **Sources** tab and click **Create new source** + +

+ Click "Create new source" +

+ +5. Select Snowflake + +

+ Select Snowflake from the options +

+ +6. Fill out the Snowflake Recipe + +Enter the Snowflake account identifier in the **Account ID** field. The account identifier is the part of your Snowflake host URL that comes before `.snowflakecomputing.com`: + +<p align="center">

+ Account Id Field +

+ +_Learn more about Snowflake Account Identifiers [here](https://docs.snowflake.com/en/user-guide/admin-account-identifier.html#account-identifiers)_ + +Add the previously created password secret to the **Password** field: + +- Click on the Password input field +- Select the `SNOWFLAKE_PASSWORD` secret + +<p align="center">

+ Password field +

+ +Populate the relevant fields using the same **Username**, **Role**, and **Warehouse** you created and/or specified in [Snowflake Prerequisites](setup.md). + +

+ Warehouse Field +

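+
+Put together, the YAML view of the recipe might look roughly like the minimal sketch below. The account identifier and warehouse shown are placeholders, and the exact field names should be verified against the [Snowflake Ingestion Source](/docs/generated/ingestion/sources/snowflake/#module-snowflake) reference.
+
+```yml
+source:
+  type: snowflake
+  config:
+    # Placeholder account identifier (the part before .snowflakecomputing.com)
+    account_id: "ab12345.us-east-1"
+    # The user and role created in the setup step
+    username: "datahub_user"
+    role: "datahub_role"
+    # Placeholder warehouse name; use the warehouse granted to datahub_role
+    warehouse: "COMPUTE_WH"
+    # Resolves to the secret created in step 3
+    password: "${SNOWFLAKE_PASSWORD}"
+```
+
+Because the password is referenced through `${SNOWFLAKE_PASSWORD}`, the secret value itself never appears in the recipe.
+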
+ +7. Click **Test Connection** + +This step will ensure you have configured your credentials accurately and confirm you have the required permissions to extract all relevant metadata. + +

+ Test Snowflake connection 

+ +After you have successfully tested your connection, click **Next**. + +## Schedule Execution + +Now it's time to schedule a recurring ingestion pipeline to regularly extract metadata from your Snowflake instance. + +8. Decide how regularly you want this ingestion to run-- day, month, year, hour, minute, etc. Select from the dropdown + +

+ schedule selector +

+ +9. Ensure you've configured the correct timezone +<p align="center">

+ timezone_selector +

+ +10. Click **Next** when you are done + +## Finish Up + +11. Name your ingestion source, then click **Save and Run** +

+ Name your ingestion +

+ +You will now find your new ingestion source running + +

+ ingestion_running +

+ +## Validate Ingestion Runs + +12. View the latest status of ingestion runs on the Ingestion page + +

+ ingestion succeeded +

+ +13. Click the plus sign to expand the full list of historical runs and outcomes; click **Details** to see the outcomes of a specific run + +

+ ingestion_details +

+ +14. From the Ingestion Run Details page, pick **View All** to see which entities were ingested + +

+ ingestion_details_view_all +

+ +15. Pick an entity from the list to manually validate if it contains the detail you expected + +

+ ingestion_details_view_all +

+ +**Congratulations!** You've successfully set up Snowflake as an ingestion source for DataHub! + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/overview.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/overview.md new file mode 100644 index 0000000000000..310e0bf254d27 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/overview.md @@ -0,0 +1,53 @@ +--- +title: Overview +sidebar_label: Overview +slug: /quick-ingestion-guides/snowflake/overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/snowflake/overview.md +--- + +# Snowflake Ingestion Guide: Overview + +## What You Will Get Out of This Guide + +This guide will help you set up the Snowflake connector to begin ingesting metadata into DataHub. + +Upon completing this guide, you will have a recurring ingestion pipeline that will extract metadata from Snowflake and load it into DataHub. This will include to following Snowflake asset types: + +- Databases +- Schemas +- Tables +- External Tables +- Views +- Materialized Views + +The pipeline will also extract: + +- **Usage statistics** to help you understand recent query activity (available if using Snowflake Enterprise edition or above) +- **Table- and Column-level lineage** to automatically define interdependencies between datasets and columns (available if using Snowflake Enterprise edition or above) +- **Table-level profile statistics** to help you understand the shape of the data + +:::caution +You will NOT have extracted Stages, Snowpipes, Streams, Tasks, Procedures from Snowflake, as the connector does not support ingesting these assets yet. +::: + +### Caveats + +By default, DataHub only profiles datasets that have changed in the past 1 day. This can be changed in the YAML editor by setting the value of `profile_if_updated_since_days` to something greater than 1. + +Additionally, DataHub only extracts usage and lineage information based on operations performed in the last 1 day. This can be changed by setting a custom value for `start_time` and `end_time` in the YAML editor. + +_To learn more about setting these advanced values, check out the [Snowflake Ingestion Source](/docs/generated/ingestion/sources/snowflake/#module-snowflake)._ + +## Next Steps + +If that all sounds like what you're looking for, navigate to the [next page](setup.md), where we'll talk about prerequisites. + +## Advanced Guides and Reference + +If you want to ingest metadata from Snowflake using the DataHub CLI, check out the following resources: + +- Learn about CLI Ingestion in the [Introduction to Metadata Ingestion](../../../metadata-ingestion/README.md) +- [Snowflake Ingestion Source](/docs/generated/ingestion/sources/snowflake/#module-snowflake) + +_Need more help? 
Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/setup.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/setup.md new file mode 100644 index 0000000000000..cb425e134558e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/snowflake/setup.md @@ -0,0 +1,75 @@ +--- +title: Setup +sidebar_label: Setup +slug: /quick-ingestion-guides/snowflake/setup +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/snowflake/setup.md +--- + +# Snowflake Ingestion Guide: Setup & Prerequisites + +In order to configure ingestion from Snowflake, you'll first have to ensure you have a Snowflake user with the `ACCOUNTADMIN` role or `MANAGE GRANTS` privilege. + +## Snowflake Prerequisites + +1. Create a DataHub-specific role by executing the following queries in Snowflake. Replace `` with an existing warehouse that you wish to use for DataHub ingestion. + + ```sql + create or replace role datahub_role; + -- Grant access to a warehouse to run queries to view metadata + grant operate, usage on warehouse "" to role datahub_role; + ``` + + Make note of this role and warehouse. You'll need this in the next step. + +2. Create a DataHub-specific user by executing the following queries. Replace `` with a strong password. Replace `` with the same warehouse used above. + + ```sql + create user datahub_user display_name = 'DataHub' password='' default_role = datahub_role default_warehouse = ''; + -- Grant access to the DataHub role created above + grant role datahub_role to user datahub_user; + ``` + + Make note of the user and its password. You'll need this in the next step. + +3. Assign privileges to read metadata about your assets by executing the following queries. Replace `` with an existing database. Repeat for all databases from your Snowflake instance that you wish to integrate with DataHub. 
+ + ```sql + set db_var = '""'; + -- Grant access to view database and schema in which your tables/views exist + grant usage on DATABASE identifier($db_var) to role datahub_role; + grant usage on all schemas in database identifier($db_var) to role datahub_role; + grant usage on future schemas in database identifier($db_var) to role datahub_role; + + -- Grant Select acccess enable Data Profiling + grant select on all tables in database identifier($db_var) to role datahub_role; + grant select on future tables in database identifier($db_var) to role datahub_role; + grant select on all external tables in database identifier($db_var) to role datahub_role; + grant select on future external tables in database identifier($db_var) to role datahub_role; + grant select on all views in database identifier($db_var) to role datahub_role; + grant select on future views in database identifier($db_var) to role datahub_role; + + -- Grant access to view tables and views + grant references on all tables in database identifier($db_var) to role datahub_role; + grant references on future tables in database identifier($db_var) to role datahub_role; + grant references on all external tables in database identifier($db_var) to role datahub_role; + grant references on future external tables in database identifier($db_var) to role datahub_role; + grant references on all views in database identifier($db_var) to role datahub_role; + grant references on future views in database identifier($db_var) to role datahub_role; + + -- Assign privileges to extract lineage and usage statistics from Snowflake by executing the below query. + grant imported privileges on database snowflake to role datahub_role; + + ``` + + If you have imported databases in your Snowflake instance that you wish to integrate with DataHub, you'll need to use the below query for them. + + ```sql + grant IMPORTED PRIVILEGES on database "" to role datahub_role; + ``` + +## Next Steps + +Once you've done all of the above in Snowflake, it's time to [move on](configuration.md) to configuring the actual ingestion source within DataHub. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/configuration.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/configuration.md new file mode 100644 index 0000000000000..02d57dc88f850 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/configuration.md @@ -0,0 +1,157 @@ +--- +title: Configuration +sidebar_label: Configuration +slug: /quick-ingestion-guides/tableau/configuration +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/tableau/configuration.md +--- + +# Configuring Your Tableau Connector to DataHub + +Now that you have created a DataHub-specific user with the relevant access in Tableau in [the prior step](setup.md), it's now time to set up a connection via the DataHub UI. + +## Configure Secrets + +1. Within DataHub, navigate to the **Ingestion** tab in the top, right corner of your screen + +

+ Navigate to the "Ingestion Tab" +

+ +:::note +If you do not see the Ingestion tab, please contact your DataHub admin to grant you the correct permissions +::: + +2. Navigate to the **Secrets** tab and click **Create new secret** + +

+ Secrets Tab +

+ +3. Create a `username` secret + +This will securely store your Tableau `username` within DataHub + +- Enter a name like `TABLEAU_USERNAME` - we will use this later to refer to it in the recipe +- Enter the `username` set up in the [setup guide](setup.md) +- Optionally add a description +- Click **Create** + +<p align="center">

+ Tableau Username Secret +

+ +4. Create a `password` secret + +This will securely store your Tableau `password` within DataHub + +- Enter a name like `TABLEAU_PASSWORD` - we will use this later to refer to it in the recipe +- Enter the `password` of the user set up in the [setup guide](setup.md) +- Optionally add a description +- Click **Create** + +<p align="center">

+ Tableau Password Secret +

+ +## Configure Recipe + +5. Navigate to the **Sources** tab and then click **Create new source** + +<p align="center">

+ Click "Create new source" +

+ +6. Select Tableau + +

+ Select Tableau from the options +

+ +7. Fill in the Tableau Recipe form: + + You need to set minimum following fields in the recipe: + + a. **Host URL:** URL of your Tableau instance (e.g., https://15az.online.tableau.com/). It is available in browser address bar on Tableau Portal. + + b. **Username:** Use the TABLEAU_USERNAME secret (e.g., "${TABLEAU_USERNAME}"). + + c. **Password:** Use the TABLEAU_PASSWORD secret (e.g., "${TABLEAU_PASSWORD}"). + + d. **Site**: Required only if using tableau cloud/ tableau online + +To filter specific project, use `project_pattern` fields. + + config: + ... + project_pattern: + allow: + - "SalesProject" + +Your recipe should look something like this: + +

+ tableau recipe in form format +

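+
+For reference, a YAML sketch of the same recipe is shown below. Treat the field names as assumptions to double-check against the [Tableau Ingestion Source](/docs/generated/ingestion/sources/tableau) reference (in particular `connect_uri` for the Host URL); the site and project values are placeholders.
+
+```yml
+source:
+  type: tableau
+  config:
+    # Host URL of your Tableau instance (example value from the step above)
+    connect_uri: "https://15az.online.tableau.com/"
+    # Required only for Tableau Cloud / Tableau Online (placeholder value)
+    site: "your-site"
+    # Resolve to the secrets created in steps 3 and 4
+    username: "${TABLEAU_USERNAME}"
+    password: "${TABLEAU_PASSWORD}"
+    # Optionally restrict ingestion to specific projects
+    project_pattern:
+      allow:
+        - "SalesProject"
+```
+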
+ +Click **Next** when you're done. + +## Schedule Execution + +Now it's time to schedule a recurring ingestion pipeline to regularly extract metadata from your Tableau instance. + +8. Decide how regularly you want this ingestion to run-- day, month, year, hour, minute, etc. Select from the dropdown + +

+ schedule selector +

+ +9. Ensure you've configured the correct timezone +<p align="center">

+ timezone_selector +

+ +10. Click **Next** when you are done + +## Finish Up + +11. Name your ingestion source, then click **Save and Run** +

+ Name your ingestion +

+ +You will now find your new ingestion source running + +

+ ingestion_running +

+ +## Validate Ingestion Runs + +12. View the latest status of ingestion runs on the Ingestion page + +

+ ingestion succeeded +

+ +13. Click the plus sign to expand the full list of historical runs and outcomes; click **Details** to see the outcomes of a specific run + +

+ ingestion_details +

+ +14. From the Ingestion Run Details page, pick **View All** to see which entities were ingested + +

+ ingestion_details_view_all +

+ +15. Pick an entity from the list to manually validate if it contains the detail you expected + +

+ ingestion_details_view_all +

+ +**Congratulations!** You've successfully set up Tableau as an ingestion source for DataHub! + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/overview.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/overview.md new file mode 100644 index 0000000000000..d79ee27480e12 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/overview.md @@ -0,0 +1,43 @@ +--- +title: Overview +sidebar_label: Overview +slug: /quick-ingestion-guides/tableau/overview +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/tableau/overview.md +--- + +# Tableau Ingestion Guide: Overview + +## What You Will Get Out of This Guide + +This guide will help you set up the Tableau connector to begin ingesting metadata into DataHub. + +Upon completing this guide, you will have a recurring ingestion pipeline that will extract metadata from Tableau and load it into DataHub. This will include to following Tableau asset types: + +- Dashboards +- Sheets +- Embedded DataSource +- Published DataSource +- Custom SQL Table +- Embedded or External Tables +- User +- Workbook +- Tag + +The pipeline will also extract: + +- **Usage statistics** help you understand top viewed Dashboard/Chart +- **Table- and Column-level lineage** automatically index relationships between datasets and columns + +## Next Steps + +Continue to the [setup guide](setup.md), where we'll describe the prerequisites. + +## Advanced Guides and Reference + +If you want to ingest metadata from Tableau using the DataHub CLI, check out the following resources: + +- Learn about CLI Ingestion in the [Introduction to Metadata Ingestion](../../../metadata-ingestion/README.md) +- [Tableau Ingestion Source](/docs/generated/ingestion/sources/tableau) + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/setup.md b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/setup.md new file mode 100644 index 0000000000000..5c1e3c165b9f9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quick-ingestion-guides/tableau/setup.md @@ -0,0 +1,62 @@ +--- +title: Setup +sidebar_label: Setup +slug: /quick-ingestion-guides/tableau/setup +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/tableau/setup.md +--- + +# Tableau Ingestion Guide: Setup & Prerequisites + +In order to configure ingestion from Tableau, you'll first have to enable Tableau Metadata API and you should have a user with Site Administrator Explorer permissions. + +## Tableau Prerequisites + +1. Grant `Site Administrator Explorer permissions` to a user + + A. Log in to Tableau Cloud https://sso.online.tableau.com/public/idp/SSO. + + B. Navigate to `Users`. + +

+ Navigate to the Users tab +

+ + C. **For New User**: Follow below steps to grant permission for new user. + + - Click `Add Users` -> `Add Users by Email` + +

+ Navigate to the Users tab +

+ + - Fill `Enter email addresses`, set `Site role` to `Site Administrator Explorer` and Click `Add Users` + +

+ Navigate to the Users tab +

+ + D. **For Existing User:** Follow below steps to grant permission for existing user. + + - Select a user and click `Actions` -> `Site Role` + +

+ Actions Site Role +

+ + - Change user role to `Site Administrator Explorer` + +

+ tableau site role +

+ +2. **Enable Tableau Metadata API:** This step is required only for Tableau Server. The Metadata API is installed with Tableau Server but disabled by default. + + - Open a command prompt as an admin on the initial node (_where TSM is installed_) in the cluster + - Run the command: `tsm maintenance metadata-services enable` + +## Next Steps + +Once you've done all of the above in Tableau, it's time to [move on](configuration.md) to configuring the actual ingestion source within DataHub. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/quickstart.md b/docs-website/versioned_docs/version-0.10.4/docs/quickstart.md new file mode 100644 index 0000000000000..06bdab98a78fb --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/quickstart.md @@ -0,0 +1,244 @@ +--- +title: DataHub Quickstart Guide +sidebar_label: Quickstart Guide +slug: /quickstart +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/quickstart.md" +--- + +# DataHub Quickstart Guide + +This guide provides instructions on deploying the open source DataHub locally. +If you're interested in a managed version, [Acryl Data](https://www.acryldata.io/product) provides a fully managed, premium version of DataHub. + + +Get Started with Managed DataHub + + +## Deploying DataHub + +To deploy a new instance of DataHub, perform the following steps. + +1. Install Docker and Docker Compose v2 for your platform. + +- On Windows or Mac, install [Docker Desktop](https://www.docker.com/products/docker-desktop/). +- On Linux, install [Docker for Linux](https://docs.docker.com/desktop/install/linux-install/) and [Docker Compose](https://docs.docker.com/compose/install/linux/). + +:::note + +Make sure to allocate enough hardware resources for Docker engine. +Tested & confirmed config: 2 CPUs, 8GB RAM, 2GB Swap area, and 10GB disk space. + +::: + +2. Launch the Docker Engine from command line or the desktop app. + +3. Install the DataHub CLI + + a. Ensure you have Python 3.7+ installed & configured. (Check using `python3 --version`). + + b. Run the following commands in your terminal + + ```sh + python3 -m pip install --upgrade pip wheel setuptools + python3 -m pip install --upgrade acryl-datahub + datahub version + ``` + + If you're using poetry, run the following command. + + ```sh + poetry add acryl-datahub + datahub version + ``` + +:::note + +If you see "command not found", try running cli commands with the prefix 'python3 -m' instead like `python3 -m datahub version` +Note that DataHub CLI does not support Python 2.x. + +::: + +4. To deploy a DataHub instance locally, run the following CLI command from your terminal + + ``` + datahub docker quickstart + ``` + + This will deploy a DataHub instance using [docker-compose](https://docs.docker.com/compose/). + If you are curious, the `docker-compose.yaml` file is downloaded to your home directory under the `.datahub/quickstart` directory. + + If things go well, you should see messages like the ones below: + + ``` + Fetching docker-compose file https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml from GitHub + Pulling docker images... + Finished pulling docker images! 
+ + [+] Running 11/11 + ⠿ Container zookeeper Running 0.0s + ⠿ Container elasticsearch Running 0.0s + ⠿ Container broker Running 0.0s + ⠿ Container schema-registry Running 0.0s + ⠿ Container elasticsearch-setup Started 0.7s + ⠿ Container kafka-setup Started 0.7s + ⠿ Container mysql Running 0.0s + ⠿ Container datahub-gms Running 0.0s + ⠿ Container mysql-setup Started 0.7s + ⠿ Container datahub-datahub-actions-1 Running 0.0s + ⠿ Container datahub-frontend-react Running 0.0s + ....... + ✔ DataHub is now running + Ingest some demo data using `datahub docker ingest-sample-data`, + or head to http://localhost:9002 (username: datahub, password: datahub) to play around with the frontend. + Need support? Get in touch on Slack: https://slack.datahubproject.io/ + ``` + + Upon completion of this step, you should be able to navigate to the DataHub UI + at [http://localhost:9002](http://localhost:9002) in your browser. You can sign in using `datahub` as both the + username and password. + +:::note + +On Mac computers with Apple Silicon (M1, M2 etc.), you might see an error like `no matching manifest for linux/arm64/v8 in the manifest list entries`, this typically means that the datahub cli was not able to detect that you are running it on Apple Silicon. To resolve this issue, override the default architecture detection by issuing `datahub docker quickstart --arch m1` + +::: + +5. To ingest the sample metadata, run the following CLI command from your terminal + + ```bash + datahub docker ingest-sample-data + ``` + +:::note + +If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token +using the `--token ` parameter in the command. + +::: + +That's it! Now feel free to play around with DataHub! + +## Troubleshooting Issues + +Please refer to [Quickstart Debugging Guide](./troubleshooting/quickstart.md). + +## Next Steps + +### Ingest Metadata + +To start pushing your company's metadata into DataHub, take a look at [UI-based Ingestion Guide](./ui-ingestion.md), or to run ingestion using the cli, look at the [Metadata Ingestion Guide](../metadata-ingestion/README.md). + +### Invite Users + +To add users to your deployment to share with your team check out our [Adding Users to DataHub](authentication/guides/add-users.md) + +### Enable Authentication + +To enable SSO, check out [Configuring OIDC Authentication](authentication/guides/sso/configure-oidc-react.md) or [Configuring JaaS Authentication](authentication/guides/jaas.md). + +To enable backend Authentication, check out [authentication in DataHub's backend](authentication/introducing-metadata-service-authentication.md#configuring-metadata-service-authentication). + +### Change the Default `datahub` User Credentials + +:::note +Please note that deleting the `Data Hub` user in the UI **WILL NOT** disable the default user. You will still be able to log in using the default 'datahub:datahub' credentials. To safely delete the default credentials, please follow the guide provided below. +::: + +Please refer to [Change the default user datahub in quickstart](authentication/changing-default-credentials.md#quickstart). + +### Move to Production + +We recommend deploying DataHub to production using Kubernetes. We provide helpful [Helm Charts](https://artifacthub.io/packages/helm/datahub/datahub) to help you quickly get up and running. Check out [Deploying DataHub to Kubernetes](./deploy/kubernetes.md) for a step-by-step walkthrough. 
+ +## Other Common Operations + +### Stopping DataHub + +To stop DataHub's quickstart, you can issue the following command. + +``` +datahub docker quickstart --stop +``` + +### Resetting DataHub (a.k.a factory reset) + +To cleanse DataHub of all of its state (e.g. before ingesting your own), you can use the CLI `nuke` command. + +``` +datahub docker nuke +``` + +### Backing up your DataHub Quickstart (experimental) + +The quickstart image is not recommended for use as a production instance. See [Moving to production](#move-to-production) for recommendations on setting up your production cluster. However, in case you want to take a backup of your current quickstart state (e.g. you have a demo to your company coming up and you want to create a copy of the quickstart data so you can restore it at a future date), you can supply the `--backup` flag to quickstart. + +``` +datahub docker quickstart --backup +``` + +will take a backup of your MySQL image and write it by default to your `~/.datahub/quickstart/` directory as the file `backup.sql`. You can customize this by passing a `--backup-file` argument. +e.g. + +``` +datahub docker quickstart --backup --backup-file /home/my_user/datahub_backups/quickstart_backup_2002_22_01.sql +``` + +:::note + +Note that the Quickstart backup does not include any timeseries data (dataset statistics, profiles, etc.), so you will lose that information if you delete all your indexes and restore from this backup. + +::: + +### Restoring your DataHub Quickstart (experimental) + +As you might imagine, these backups are restore-able. The following section describes a few different options you have to restore your backup. + +#### Restoring a backup (primary + index) [most common] + +To restore a previous backup, run the following command: + +``` +datahub docker quickstart --restore +``` + +This command will pick up the `backup.sql` file located under `~/.datahub/quickstart` and restore your primary database as well as the elasticsearch indexes with it. + +To supply a specific backup file, use the `--restore-file` option. + +``` +datahub docker quickstart --restore --restore-file /home/my_user/datahub_backups/quickstart_backup_2002_22_01.sql +``` + +#### Restoring only the index [to deal with index out of sync / corruption issues] + +Another situation that can come up is the index can get corrupt, or be missing some update. In order to re-bootstrap the index from the primary store, you can run this command to sync the index with the primary store. + +``` +datahub docker quickstart --restore-indices +``` + +#### Restoring a backup (primary but NO index) [rarely used] + +Sometimes, you might want to just restore the state of your primary database (MySQL), but not re-index the data. To do this, you have to explicitly disable the restore-indices capability. + +``` +datahub docker quickstart --restore --no-restore-indices +``` + +### Upgrading your local DataHub + +If you have been testing DataHub locally, a new version of DataHub got released and you want to try the new version then you can just issue the quickstart command again. It will pull down newer images and restart your instance without losing any data. 
+ +``` +datahub docker quickstart +``` + +### Customization + +If you would like to customize the DataHub installation further, please download the [docker-compose.yaml](https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml) used by the cli tool, modify it as necessary and deploy DataHub by passing the downloaded docker-compose file: + +``` +datahub docker quickstart --quickstart-compose-file +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/rfc.md b/docs-website/versioned_docs/version-0.10.4/docs/rfc.md new file mode 100644 index 0000000000000..6b88567746e38 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/rfc.md @@ -0,0 +1,127 @@ +--- +title: DataHub RFC Process +sidebar_label: RFC Process +slug: /rfc +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/rfc.md" +--- + +# DataHub RFC Process + +## What is an RFC? + +The "RFC" (request for comments) process is intended to provide a consistent and controlled path for new features, +significant modifications, or any other significant proposal to enter DataHub and its related frameworks. + +Many changes, including bug fixes and documentation improvements can be implemented and reviewed via the normal GitHub +pull request workflow. + +Some changes though are "substantial", and we ask that these be put through a bit of a design process and produce a +consensus among the DataHub core teams. + +## The RFC life-cycle + +An RFC goes through the following stages: + +- _Discussion_ (Optional): Create an issue with the "RFC" label to have a more open ended, initial discussion around + your proposal (useful if you don't have a concrete proposal yet). Consider posting to #rfc in [Slack](./slack.md) + for more visibility. +- _Pending_: when the RFC is submitted as a PR. Please add the "RFC" label to the PR. +- _Active_: when an RFC PR is merged and undergoing implementation. +- _Landed_: when an RFC's proposed changes are shipped in an actual release. +- _Rejected_: when an RFC PR is closed without being merged. + +[Pending RFC List](https://github.com/datahub-project/rfcs/pulls?q=is%3Apr+is%3Aopen) + +## When to follow this process + +You need to follow this process if you intend to make "substantial" changes to any components in the DataHub git repo, +their documentation, or any other projects under the purview of the DataHub core teams. What constitutes a "substantial" +change is evolving based on community norms, but may include the following: + +- A new feature that creates new API surface area, and would require a feature flag if introduced. +- The removal of features that already shipped as part of the release channel. +- The introduction of new idiomatic usage or conventions, even if they do not include code changes to DataHub itself. + +Some changes do not require an RFC: + +- Rephrasing, reorganizing or refactoring +- Addition or removal of warnings +- Additions that strictly improve objective, numerical quality criteria (speedup) + +If you submit a pull request to implement a new, major feature without going through the RFC process, it may be closed +with a polite request to submit an RFC first. + +## Gathering feedback before submitting + +It's often helpful to get feedback on your concept before diving into the level of API design detail required for an +RFC. 
You may open an issue on this repo to start a high-level discussion, with the goal of eventually formulating an RFC +pull request with the specific implementation design. We also highly recommend sharing drafts of RFCs in #rfc on the +[DataHub Slack](./slack.md) for early feedback. + +## The process + +In short, to get a major feature added to DataHub, one must first get the RFC merged into the RFC repo as a markdown +file. At that point the RFC is 'active' and may be implemented with the goal of eventual inclusion into DataHub. + +- Fork the [datahub-project/rfc repository](https://github.com/datahub-project/rfcs). +- Copy the `000-template.md` template file to `rfc/active/000-my-feature.md`, where `my-feature` is more + descriptive. Don't assign an RFC number yet. +- Fill in the RFC. Put care into the details. _RFCs that do not present convincing motivation, demonstrate understanding + of the impact of the design, or are disingenuous about the drawback or alternatives tend to be poorly-received._ +- Submit a pull request. As a pull request the RFC will receive design feedback from the larger community, and the + author should be prepared to revise it in response. +- Update the pull request to add the number of the PR to the filename and add a link to the PR in the header of the RFC. +- Build consensus and integrate feedback. RFCs that have broad support are much more likely to make progress than those + that don't receive any comments. +- Eventually, the DataHub team will decide whether the RFC is a candidate for inclusion. +- RFCs that are candidates for inclusion will entire a "final comment period" lasting 7 days. The beginning of this + period will be signaled with a comment and tag on the pull request. Furthermore, an announcement will be made in the + \#rfc Slack channel for further visibility. +- An RFC acan be modified based upon feedback from the DataHub team and community. Significant modifications may trigger + a new final comment period. +- An RFC may be rejected by the DataHub team after public discussion has settled and comments have been made summarizing + the rationale for rejection. The RFC will enter a "final comment period to close" lasting 7 days. At the end of the "FCP + to close" period, the PR will be closed. +- An RFC author may withdraw their own RFC by closing it themselves. Please state the reason for the withdrawal. +- An RFC may be accepted at the close of its final comment period. A DataHub team member will merge the RFC's associated + pull request, at which point the RFC will become 'active'. + +## Details on Active RFCs + +Once an RFC becomes active then authors may implement it and submit the feature as a pull request to the DataHub repo. +Becoming 'active' is not a rubber stamp, and in particular still does not mean the feature will ultimately be merged; it +does mean that the core team has agreed to it in principle and are amenable to merging it. + +Furthermore, the fact that a given RFC has been accepted and is 'active' implies nothing about what priority is assigned +to its implementation, nor whether anybody is currently working on it. + +Modifications to active RFC's can be done in followup PR's. 
We strive to write each RFC in a manner that it will reflect +the final design of the feature; but the nature of the process means that we cannot expect every merged RFC to actually +reflect what the end result will be at the time of the next major release; therefore we try to keep each RFC document +somewhat in sync with the language feature as planned, tracking such changes via followup pull requests to the document. + +## Implementing an RFC + +The author of an RFC is not obligated to implement it. Of course, the RFC author (like any other developer) is welcome +to post an implementation for review after the RFC has been accepted. + +An active RFC should have the link to the implementation PR(s) listed, if there are any. Feedback to the actual +implementation should be conducted in the implementation PR instead of the original RFC PR. + +If you are interested in working on the implementation for an 'active' RFC, but cannot determine if someone else is +already working on it, feel free to ask (e.g. by leaving a comment on the associated issue). + +## Implemented RFCs + +Once an RFC has finally be implemented, first off, congratulations! And thank you for your contribution! Second, to +help track the status of the RFC, please make one final PR to move the RFC from `rfc/active` to +`rfc/finished`. + +## Reviewing RFCs + +Most of the DataHub team will attempt to review some set of open RFC pull requests on a regular basis. If a DataHub +team member believes an RFC PR is ready to be accepted into active status, they can approve the PR using GitHub's +review feature to signal their approval of the RFCs. + +_DataHub's RFC process is inspired by many others, including [Vue.js](https://github.com/vuejs/rfcs) and +[Ember](https://github.com/emberjs/rfcs)._ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/roadmap.md b/docs-website/versioned_docs/version-0.10.4/docs/roadmap.md new file mode 100644 index 0000000000000..4fc85790ad48f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/roadmap.md @@ -0,0 +1,178 @@ +--- +title: DataHub Roadmap +sidebar_label: Roadmap +slug: /roadmap +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/roadmap.md" +--- + +# DataHub Roadmap + +## [The DataHub Roadmap has a new home!](https://feature-requests.datahubproject.io/roadmap) + +Please refer to the [new DataHub Roadmap](https://feature-requests.datahubproject.io/roadmap) for the most up-to-date details of what we are working on! + +_If you have suggestions about what we should consider in future cycles, feel free to submit a [feature request](https://feature-requests.datahubproject.io/) and/or upvote existing feature requests so we can get a sense of level of importance!_ + +## Historical Roadmap + +_This following represents the progress made on historical roadmap items as of January 2022. For incomplete roadmap items, we have created Feature Requests to gauge current community interest & impact to be considered in future cycles. 
If you see something that is still of high-interest to you, please up-vote via the Feature Request portal link and subscribe to the post for updates as we progress through the work in future cycles._ + +### Q4 2021 [Oct - Dec 2021] + +#### Data Lake Ecosystem Integration + +- [ ] Spark Delta Lake - [View in Feature Reqeust Portal](https://feature-requests.datahubproject.io/b/feedback/p/spark-delta-lake) +- [ ] Apache Iceberg - [Included in Q1 2022 Roadmap - Community-Driven Metadata Ingestion Sources](https://feature-requests.datahubproject.io/roadmap/540) +- [ ] Apache Hudi - [View in Feature Request Portal](https://feature-requests.datahubproject.io/b/feedback/p/apachi-hudi-ingestion-support) + +#### Metadata Trigger Framework + +[View in Feature Request Portal](https://feature-requests.datahubproject.io/b/User-Experience/p/ability-to-subscribe-to-an-entity-to-receive-notifications-when-something-changes) + +- [ ] Stateful sensors for Airflow +- [ ] Receive events for you to send alerts, email +- [ ] Slack integration + +#### ML Ecosystem + +- [x] Features (Feast) +- [x] Models (Sagemaker) +- [ ] Notebooks - View in Feature Request Portal](https://feature-requests.datahubproject.io/admin/p/jupyter-integration) + +#### Metrics Ecosystem + +[View in Feature Request Portal](https://feature-requests.datahubproject.io/b/User-Experience/p/ability-to-define-metrics-and-attach-them-to-entities) + +- [ ] Measures, Dimensions +- [ ] Relationships to Datasets and Dashboards + +#### Data Mesh oriented features + +- [ ] Data Product modeling +- [ ] Analytics to enable Data Meshification + +#### Collaboration + +[View in Feature Reqeust Portal](https://feature-requests.datahubproject.io/b/User-Experience/p/collaboration-within-datahub-ui) + +- [ ] Conversations on the platform +- [ ] Knowledge Posts (Gdocs, Gslides, Gsheets) + +### Q3 2021 [Jul - Sept 2021] + +#### Data Profiling and Dataset Previews + +Use Case: See sample data for a dataset and statistics on the shape of the data (column distribution, nullability etc.) + +- [x] Support for data profiling and preview extraction through ingestion pipeline (column samples, not rows) + +#### Data Quality + +Included in Q1 2022 Roadmap - [Display Data Quality Checks in the UI](https://feature-requests.datahubproject.io/roadmap/544) + +- [x] Support for data profiling and time-series views +- [ ] Support for data quality visualization +- [ ] Support for data health score based on data quality results and pipeline observability +- [ ] Integration with systems like Great Expectations, AWS deequ, dbt test etc. + +#### Fine-grained Access Control for Metadata + +- [x] Support for role-based access control to edit metadata +- Scope: Access control on entity-level, aspect-level and within aspects as well. + +#### Column-level lineage + +Included in Q1 2022 Roadmap - [Column Level Lineage](https://feature-requests.datahubproject.io/roadmap/541) + +- [ ] Metadata Model +- [ ] SQL Parsing + +#### Operational Metadata + +- [ ] Partitioned Datasets - - [View in Feature Request Portal](https://feature-requests.datahubproject.io/b/User-Experience/p/advanced-dataset-schema-properties-partition-support) +- [x] Support for operational signals like completeness, freshness etc. 
+ +### Q2 2021 (Apr - Jun 2021) + +#### Cloud Deployment + +- [x] Production-grade Helm charts for Kubernetes-based deployment +- [ ] How-to guides for deploying DataHub to all the major cloud providers + - [x] AWS + - [ ] Azure + - [x] GCP + +#### Product Analytics for DataHub + +- [x] Helping you understand how your users are interacting with DataHub +- [x] Integration with common systems like Google Analytics etc. + +#### Usage-Based Insights + +- [x] Display frequently used datasets, etc. +- [ ] Improved search relevance through usage data + +#### Role-based Access Control + +- Support for fine-grained access control for metadata operations (read, write, modify) +- Scope: Access control on entity-level, aspect-level and within aspects as well. +- This provides the foundation for Tag Governance, Dataset Preview access control etc. + +#### No-code Metadata Model Additions + +Use Case: Developers should be able to add new entities and aspects to the metadata model easily + +- [x] No need to write any code (in Java or Python) to store, retrieve, search and query metadata +- [ ] No need to write any code (in GraphQL or UI) to visualize metadata + +### Q1 2021 [Jan - Mar 2021] + +#### React UI + +- [x] Build a new UI based on React +- [x] Deprecate open-source support for Ember UI + +#### Python-based Metadata Integration + +- [x] Build a Python-based Ingestion Framework +- [x] Support common people repositories (LDAP) +- [x] Support common data repositories (Kafka, SQL databases, AWS Glue, Hive) +- [x] Support common transformation sources (dbt, Looker) +- [x] Support for push-based metadata emission from Python (e.g. Airflow DAGs) + +#### Dashboards and Charts + +- [x] Support for dashboard and chart entity page +- [x] Support browse, search and discovery + +#### SSO for Authentication + +- [x] Support for Authentication (login) using OIDC providers (Okta, Google etc) + +#### Tags + +Use-Case: Support for free-form global tags for social collaboration and aiding discovery + +- [x] Edit / Create new tags +- [x] Attach tags to relevant constructs (e.g. datasets, dashboards, users, schema_fields) +- [x] Search using tags (e.g. find all datasets with this tag, find all entities with this tag) + +#### Business Glossary + +- [x] Support for business glossary model (definition + storage) +- [ ] Browse taxonomy +- [x] UI support for attaching business terms to entities and fields + +#### Jobs, Flows / Pipelines + +Use case: Search and Discover your Pipelines (e.g. Airflow DAGs) and understand lineage with datasets + +- [x] Support for Metadata Models + Backend Implementation +- [x] Metadata Integrations with systems like Airflow. + +#### Data Profiling and Dataset Previews + +Use Case: See sample data for a dataset and statistics on the shape of the data (column distribution, nullability etc.) + +- [ ] Support for data profiling and preview extraction through ingestion pipeline +- Out of scope for Q1: Access control of data profiles and sample data diff --git a/docs-website/versioned_docs/version-0.10.4/docs/saas.md b/docs-website/versioned_docs/version-0.10.4/docs/saas.md new file mode 100644 index 0000000000000..24ae858a67163 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/saas.md @@ -0,0 +1,20 @@ +--- +title: Managed DataHub +slug: /saas +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/saas.md" +--- + +# DataHub SaaS + +Sign up for fully managed, hassle-free and secure SaaS service for DataHub, provided by [Acryl Data](https://www.acryl.io/). + +

+ + Sign up + +

+ +Refer to [Managed Datahub Exclusives](/docs/managed-datahub/managed-datahub-overview.md) for more information. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/schema-history.md b/docs-website/versioned_docs/version-0.10.4/docs/schema-history.md new file mode 100644 index 0000000000000..151c523eef884 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/schema-history.md @@ -0,0 +1,69 @@ +--- +title: About DataHub Schema History +sidebar_label: Schema History +slug: /schema-history +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/schema-history.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Schema History + + + +Schema History is a valuable tool for understanding how a Dataset changes over time and gives insight into the following cases, +along with informing Data Practitioners when these changes happened. + +- A new field is added +- An existing field is removed +- An existing field changes type + +Schema History uses DataHub's [Timeline API](/docs/dev-guides/timeline/) to compute schema changes. + +## Schema History Setup, Prerequisites, and Permissions + +Schema History is viewable in the DataHub UI for any Dataset that has had at least one schema change. To view a Dataset, a user +must have the **View Entity Page** privilege, or be assigned to **any** DataHub Role. + +## Using Schema History + +You can view the Schema History for a Dataset by navigating to that Dataset's Schema Tab. As long as that Dataset has more than +one version, you can view what a Dataset looked like at any given version by using the version selector. +Here's an example from DataHub's official Demo environment with the +[Snowflake pets dataset](). + +

+ +
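+
+The version history that powers this selector can also be fetched programmatically. Below is a minimal sketch that posts the `getSchemaVersionList` query (listed under the GraphQL section further down this page) to the `/api/graphql` endpoint. The endpoint, token, URN placeholder, and query fields are illustrative assumptions, so check the linked GraphQL reference for the authoritative shapes.
+
+```python
+# Hedged sketch: fetch the schema version list for one dataset via GraphQL.
+# The endpoint, token, URN, and response fields below are assumptions; confirm
+# them against the getSchemaVersionList entry in the GraphQL reference.
+import requests
+
+GRAPHQL_URL = "http://localhost:9002/api/graphql"  # assumed DataHub frontend endpoint
+TOKEN = "<personal-access-token>"                  # placeholder
+DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,<db.schema.table>,PROD)"  # placeholder
+
+QUERY = """
+query getSchemaVersionList($input: GetSchemaVersionListInput!) {
+  getSchemaVersionList(input: $input) {
+    latestVersion { semanticVersion }
+    semanticVersionList { semanticVersion semanticVersionTimestamp }
+  }
+}
+"""
+
+response = requests.post(
+    GRAPHQL_URL,
+    headers={"Authorization": f"Bearer {TOKEN}"},
+    json={"query": QUERY, "variables": {"input": {"datasetUrn": DATASET_URN}}},
+    timeout=30,
+)
+response.raise_for_status()
+print(response.json())
+```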

+ +If you click on an older version in the selector, you'll be able to see what the schema looked like back then. Notice +the changes here to the glossary terms for the `status` field, and to the descriptions for the `created_at` and `updated_at` +fields. + +

+ +

+
+In addition to this, you can also toggle the Audit view, which shows you when the most recent changes were made to each field.
+You can activate it by clicking on the Audit icon above the top right of the table.
+
+

+ +

+ +You can see here that some of these fields were added at the oldest dataset version, while some were added only at this latest +version. Some fields were even modified and had a type change at the latest version! + +### GraphQL + +- [getSchemaBlame](../graphql/queries.md#getSchemaBlame) +- [getSchemaVersionList](../graphql/queries.md#getSchemaVersionList) + +## FAQ and Troubleshooting + +**What updates are planned for the Schema History feature?** + +In the future, we plan on adding the following features + +- Supporting a linear timeline view where you can see what changes were made to various schema fields over time +- Adding a diff viewer that highlights the differences between two versions of a Dataset diff --git a/docs-website/versioned_docs/version-0.10.4/docs/slack.md b/docs-website/versioned_docs/version-0.10.4/docs/slack.md new file mode 100644 index 0000000000000..946026c2869e6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/slack.md @@ -0,0 +1,54 @@ +--- +title: Slack +slug: /slack +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/slack.md" +--- + +# Slack + +The DataHub Slack is a thriving and rapidly growing community - we can't wait for you to join us! + +_[Sign up here](https://slack.datahubproject.io) to join us on Slack and to subscribe to the DataHub Community newsletter. Already a member? [Log in here](https://slack.datahubproject.io)._ + +## Slack Guidelines + +In addition to our [Code of Conduct](CODE_OF_CONDUCT.md), we expect all Slack Community Members to respect the following guidelines: + +### Avoid using DMs and @mentions + +Whenever possible, post your questions and responses in public channels so other Community Members can benefit from the conversation and outcomes. Limit the use of @mentions of other Community Members to be considerate of notification noise. + +### Make use of threads + +Threads help us keep conversations contained and help us ensure we help you find a resolution and get you the support you need. + +Use threads when posting long messages and large blocks of code and/or stack trace - it is a MASSIVE help for us to keep track of the large volume of questions across our various support channels. + +### Do not post the same question across multiple channels + +If you're having a tough time getting the support you need (or aren't sure where to go!), please DM [@Maggie](https://datahubspace.slack.com/team/U0121TRV0FL) for support + +### Do not solicit members of our Slack + +The DataHub Community exists to collaborate with, learn from, and support one another. It is not a space to pitch your products or services directly to our members via public channels, private channels, or direct messages. + +We are excited to have a growing presence from vendors to help answer questions from Community Members as they may arise, but we have a strict 3-strike policy against solicitation: + +1. First occurrence: We'll give you a friendly but public reminder that the behavior is inappropriate according to our guidelines. +2. Second occurrence: We'll send you a DM warning that any additional violations will result in removal from the community. +3. Third occurrence: We'll delete or ban your account. + +We reserve the right to ban users without notice if they are clearly spamming our Community Members. 
+ +## Navigating DataHub Slack + +Lets get you settled in -- + +- **Head over to [#introduce-yourself](https://datahubspace.slack.com/archives/C01PU1K6GDP) to, well, introduce yourself!** We'd love to learn more about you, what brings you here, and how we can support you +- **Not sure how to start?** You guessed it, check out [#getting-started](https://datahubspace.slack.com/archives/CV2KB471C) - we'll point you in the right direction +- **Looking for general debugging help?** [#troubleshoot](https://datahubspace.slack.com/archives/C029A3M079U) is the place to go +- **Need some live support from the Core DataHub Team?** Join us during our 2xWeek Office Hours via Zoom! Check out [#office-hours](https://datahubspace.slack.com/archives/C02AD211493) for more details +- **Looking for ways to contribute to the DataHub project?** Tell us all about it in [#contribute](https://datahubspace.slack.com/archives/C017W0NTZHR) +- **Have suggestions on how to make DataHub better?** We can't wait to hear them in [#feature-requests](https://datahubspace.slack.com/archives/C02FWNS2F08) +- **Excited to share your experience working with DataHub?** [#show-and-tell](https://datahubspace.slack.com/archives/C02FD9PLCA0) is the perfect channel for you +- **Need something else?** Reach out to [@Maggie](https://datahubspace.slack.com/team/U0121TRV0FL) - our Community Product Manager diff --git a/docs-website/versioned_docs/version-0.10.4/docs/sync-status.md b/docs-website/versioned_docs/version-0.10.4/docs/sync-status.md new file mode 100644 index 0000000000000..90b85e12e17a8 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/sync-status.md @@ -0,0 +1,53 @@ +--- +title: About DataHub Sync Status +sidebar_label: Sync Status +slug: /sync-status +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/sync-status.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Sync Status + + + +When looking at metadata in DataHub, it's useful to know if the information you're looking at is relevant. +Specifically, if metadata is stale, or hasn't been updated in a while, then you should consider refreshing that metadata +using [metadata ingestion](./../metadata-ingestion/README.md) or [deleting](./how/delete-metadata.md) it if it no longer exists. + +## Sync Status Setup, Prerequisites, and Permissions + +The sync status feature is enabled by default and does not require any special setup. + +## Using Sync Status + +The DataHub UI will display the sync status in the top right corner of the page. + +The last synchronized date is basically the last time an ingestion run saw an entity. It is computed as the most recent update to the entity, excluding changes done through the UI. If an ingestion run restates an entity but doesn't actually cause any changes, we still count that as an update for the purposes of sync status. + +
+ Technical details: computing the last synchronized timestamp + +To compute the last synchronized timestamp, we look at the system metadata of all aspects associated with the entity. +We exclude any aspects where the system metadata `runId` value is unset or equal to `no-run-id-provided`, as this is what filters out changes made through the UI. +Finally, we take the most recent system metadata `lastObserved` timestamp across the aspects and use that as the last synchronized timestamp. + +
+ +
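+
+If it helps to see the rule spelled out, here is an illustrative sketch of the same logic in Python. It is a re-statement of the description above, not code taken from DataHub, and it assumes each aspect's system metadata is available as a plain dict:
+
+```python
+# Illustrative sketch of the rule described above, not DataHub's actual code.
+# Each entry mimics the system metadata attached to one aspect of an entity.
+aspect_system_metadata = [
+    {"runId": "snowflake-2023_07_01-12_00_00", "lastObserved": 1688212800000},
+    {"runId": "no-run-id-provided", "lastObserved": 1688905000000},  # UI edit: ignored
+    {"runId": None, "lastObserved": 1688000000000},                  # unset: ignored
+]
+
+def last_synchronized(aspects):
+    """Return the most recent lastObserved timestamp from ingestion-driven aspects."""
+    ingestion_driven = [
+        a["lastObserved"]
+        for a in aspects
+        if a.get("runId") not in (None, "", "no-run-id-provided")
+    ]
+    return max(ingestion_driven, default=None)
+
+print(last_synchronized(aspect_system_metadata))  # -> 1688212800000
+```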

+ +

+ +We'll automatically assign a color based on the sync status recency: + +- Green: last synchronized in the past week +- Yellow: last synchronized in the past month +- Red: last synchronized more than a month ago + +You can hover over the sync status message in the UI to view the exact timestamp of the most recent sync. + +

+ +
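+
+The color assignment described above comes down to simple thresholds. Here is a rough sketch that approximates "past week" and "past month" as 7 and 30 days; DataHub's exact internal cut-offs may differ:
+
+```python
+# Rough sketch of the recency buckets listed above (week/month approximated
+# as 7 and 30 days).
+import time
+
+def sync_status_color(last_synchronized_ms, now_ms=None):
+    """Map a last-synchronized timestamp (epoch millis) to a recency color."""
+    if now_ms is None:
+        now_ms = int(time.time() * 1000)
+    age_days = (now_ms - last_synchronized_ms) / (1000 * 60 * 60 * 24)
+    if age_days <= 7:
+        return "green"
+    if age_days <= 30:
+        return "yellow"
+    return "red"
+
+ONE_DAY_MS = 24 * 60 * 60 * 1000
+now = int(time.time() * 1000)
+print(sync_status_color(now - 2 * ONE_DAY_MS, now))   # green
+print(sync_status_color(now - 20 * ONE_DAY_MS, now))  # yellow
+print(sync_status_color(now - 90 * ONE_DAY_MS, now))  # red
+```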

+ +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/tags.md b/docs-website/versioned_docs/version-0.10.4/docs/tags.md new file mode 100644 index 0000000000000..1c8e6d333b6b5 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/tags.md @@ -0,0 +1,118 @@ +--- +title: About DataHub Tags +sidebar_label: Tags +slug: /tags +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/tags.md" +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About DataHub Tags + + + +Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary. + +Tags can help help you in: + +- Querying: Tagging a dataset with a phrase that a co-worker can use to query the same dataset +- Mapping assets to a category or group of your choice + +## Tags Setup, Prerequisites, and Permissions + +What you need to add tags: + +- **Edit Tags** metadata privilege to add tags at the entity level +- **Edit Dataset Column Tags** to edit tags at the column level + +You can create these privileges by creating a new [Metadata Policy](./authorization/policies.md). + +## Using DataHub Tags + +### Adding a Tag + +To add a tag at the dataset or container level, simply navigate to the page for that entity and click on the **Add Tag** button. + +

+ +

+ +Type in the name of the tag you want to add. You can add a new tag, or add a tag that already exists (the autocomplete will pull up the tag if it already exists). + +

+ +

+ +Click on the "Add" button and you'll see the tag has been added! + +

+ +

+ +If you would like to add a tag at the schema level, hover over the "Tags" column for a schema until the "Add Tag" button shows up, and then follow the same flow as above. + +

+ +

+ +### Removing a Tag + +To remove a tag, simply click on the "X" button in the tag. Then click "Yes" when prompted to confirm tag removal. + +### Searching by a Tag + +You can search for a tag in the search bar, and even filter entities by the presence of a specific tag. + +

+ +
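+
+Everything shown above can also be done programmatically through the GraphQL API (see the GraphQL list under Additional Resources below). Here is a minimal sketch using the `addTag` mutation; the endpoint, token, URNs, and input field names are assumptions to verify against the linked mutation reference:
+
+```python
+# Hedged sketch: attach an existing tag to a dataset via the addTag mutation.
+# The input field names below are assumptions; confirm them against the addTag
+# entry in the GraphQL mutation reference linked under Additional Resources.
+import requests
+
+GRAPHQL_URL = "http://localhost:9002/api/graphql"  # assumed DataHub frontend endpoint
+TOKEN = "<personal-access-token>"                  # placeholder
+
+MUTATION = """
+mutation addTag($input: TagAssociationInput!) {
+  addTag(input: $input)
+}
+"""
+
+variables = {
+    "input": {
+        "tagUrn": "urn:li:tag:Legacy",  # an existing tag
+        "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,<db.table>,PROD)",  # placeholder URN
+    }
+}
+
+response = requests.post(
+    GRAPHQL_URL,
+    headers={"Authorization": f"Bearer {TOKEN}"},
+    json={"query": MUTATION, "variables": variables},
+    timeout=30,
+)
+response.raise_for_status()
+print(response.json())
+```
+
+On success the mutation should simply return `true`; the corresponding `removeTag` and batch mutations listed below follow the same request pattern.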

+ +## Additional Resources + +### Videos + +**Add Ownership, Tags, Terms, and more to DataHub via CSV!** + +

+ +

+ +### GraphQL + +- [addTag](../graphql/mutations.md#addtag) +- [addTags](../graphql/mutations.md#addtags) +- [batchAddTags](../graphql/mutations.md#batchaddtags) +- [removeTag](../graphql/mutations.md#removetag) +- [batchRemoveTags](../graphql/mutations.md#batchremovetags) +- [createTag](../graphql/mutations.md#createtag) +- [updateTag](../graphql/mutations.md#updatetag) +- [deleteTag](../graphql/mutations.md#deletetag) + +You can easily fetch the Tags for an entity with a given its URN using the **tags** property. Check out [Working with Metadata Entities](./api/graphql/how-to-set-up-graphql.md#querying-for-tags-of-an-asset) for an example. + +### DataHub Blog + +- [Tags and Terms: Two Powerful DataHub Features, Used in Two Different Scenarios + Managing PII in DataHub: A Practitioner’s Guide](https://blog.datahubproject.io/tags-and-terms-two-powerful-datahub-features-used-in-two-different-scenarios-b5b4791e892e) + +## FAQ and Troubleshooting + +**What is the difference between DataHub Tags and Glossary Terms?** + +DataHub Tags are informal, loosely controlled labels while Terms are part of a controlled vocabulary, with optional hierarchy. Tags have no element of formal, central management. + +Usage and applications: + +- An asset may have multiple tags. +- Tags serve as a tool for search & discovery while Terms are typically used to standardize types of leaf-level attributes (i.e. schema fields) for governance. E.g. (EMAIL_PLAINTEXT) + +**How are DataHub Tags different from Domains?** + +Domains are a set of top-level categories usually aligned to business units/disciplines to which the assets are most relevant. They rely on central or distributed management. A single domain is assigned per data asset. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ + +### Related Features + +- [Glossary Terms](./glossary/business-glossary.md) +- [Domains](./domains.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/tests/metadata-tests.md b/docs-website/versioned_docs/version-0.10.4/docs/tests/metadata-tests.md new file mode 100644 index 0000000000000..990932df04d96 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/tests/metadata-tests.md @@ -0,0 +1,275 @@ +--- +title: About Metadata Tests +slug: /tests/metadata-tests +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/tests/metadata-tests.md +--- + +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + +# About Metadata Tests + + + +DataHub includes a highly configurable, no-code framework that allows you to configure broad-spanning monitors & continuous actions +for the data assets - datasets, dashboards, charts, pipelines - that make up your enterprise Metadata Graph. +At the center of this framework is the concept of a Metadata Test. + +There are two powerful use cases that are uniquely enabled by the Metadata Tests framework: + +1. Automated Asset Classification +2. Automated Metadata Completion Monitoring + +### Automated Asset Classification + +Metadata Tests allows you to define conditions for selecting a subset of data assets (e.g. datasets, dashboards, etc), +along with a set of actions to take for entities that are selected. After the test is defined, the actions +will be applied continuously over time, as the selection set evolves & changes with your data ecosystem. + +When defining selection criteria, you'll be able to choose from a range of useful technical signals (e.g. 
usage, size) that are automatically +extracted by DataHub (which vary by integration). This makes automatically classifying the "important" assets in your organization quite easy, which +is in turn critical for running effective Data Governance initiatives within your organization. + +For example, we can define a Metadata Test which selects all Snowflake Tables which are in the top 10% of "most queried" +for the past 30 days, and then assign those Tables to a special "Tier 1" group using DataHub Tags, Glossary Terms, or Domains. + +### Automated Data Governance Monitoring + +Metadata Tests allow you to define & monitor a set of rules that apply to assets in your data ecosystem (e.g. datasets, dashboards, etc). This is particularly useful when attempting to govern +your data, as it allows for the (1) definition and (2) measurement of centralized metadata standards, which are key for both bootstrapping +and maintaining a well-governed data ecosystem. + +For example, we can define a Metadata Test which requires that all "Tier 1" data assets (e.g. those marked with a special Tag or Glossary Term), +must have the following metadata: + +1. At least 1 explicit owner _and_ +2. High-level, human-authored documentation _and_ +3. At least 1 Glossary Term from the "Classification" Term Group + +Then, we can closely monitor which assets are passing and failing these rules as we work to improve things over time. +We can easily identify assets that are _in_ and _out of_ compliance with a set of centrally-defined standards. + +By applying automation, Metadata Tests +can enable the full lifecycle of complex Data Governance initiatives - from scoping to execution to monitoring. + +## Metadata Tests Setup, Prerequisites, and Permissions + +What you need to manage Metadata Tests on DataHub: + +- **Manage Tests** Privilege + +This Platform Privilege allows users to create, edit, and remove all Metadata Tests on DataHub. Therefore, it should only be +given to those users who will be serving as metadata Admins of the platform. The default `Admin` role has this Privilege. + +> Note that the Metadata Tests feature is currently limited in support for the following DataHub Asset Types: +> +> - Dataset +> - Dashboard +> - Chart +> - Data Flow (e.g. Pipeline) +> - Data Job (e.g. Task) +> - Container (Database, Schema, Project) +> +> If you'd like to see Metadata Tests for other asset types, please let your Acryl Customer Success partner know! + +## Using Metadata Tests + +Metadata Tests can be created by first navigating to **Govern > Tests**. + +To begin building a new Metadata, click **Create new Test**. + +

+ +

+
+### Creating a Metadata Test
+
+Inside the Metadata Test builder, we'll need to construct the 3 parts of a Metadata Test:
+
+1. **Selection Criteria** - Select assets that are in the scope of the test
+2. **Rules** - Define rules that selected assets can either pass or fail
+3. **Actions (Optional)** - Define automated actions to be taken for assets that are passing
+ or failing the test
+
+

+ +

+ +#### Step 1. Defining Selection Criteria (Scope) + +In the first step, we define a set of conditions that are used to select a subset of the assets in our Metadata Graph +that will be "in the scope" of the new test. Assets that **match** the selection conditions will be considered in scope, while those which do not are simply not applicable for the test. +Once the test is created, the test will be evaluated for any assets which fall in scope on a continuous basis (when an asset changes on DataHub +or once every day). + +##### Selecting Asset Types + +You must select at least one asset _type_ from a set that includes Datasets, Dashboards, Charts, Data Flows (Pipelines), Data Jobs (Tasks), +and Containers. + +

+ +

+
+Entities with the selected types will be considered in scope, while those of other types will be considered out of scope and
+thus omitted from evaluation of the test.
+
+##### Building Conditions
+
+**Property** conditions are the basic unit of comparison used for selecting data assets. Each **Property** condition consists of a target _property_,
+an _operator_, and an optional _value_.
+
+A _property_ is an attribute of a data asset. It can either be a technical signal (e.g. **metric** such as usage, storage size) or a
+metadata signal (e.g. owners, domain, glossary terms, tags, and more), depending on the asset type and applicability of the signal.
+The full set of supported _properties_ can be found in the table below.
+
+An _operator_ is the type of predicate that will be applied to the selected _property_ when evaluating the test for an asset. The types
+of operators that are applicable depend on the selected property. Some examples of operators include `Equals`, `Exists`, `Matches Regex`,
+and `Contains`.
+
+A _value_ defines the right-hand side of the condition, or a pre-configured value to evaluate the property and operator against. The type of the value
+is dependent on the selected _property_ and _operator_. For example, if the selected _operator_ is `Matches Regex`, the type of the
+value would be a string.
+
+By selecting a property, operator, and value, we can create a single condition (or predicate) used for
+selecting a data asset to be tested. For example, we can build property conditions that match:
+
+- All datasets in the top 25% of query usage in the past 30 days
+- All assets that have the "Tier 1" Glossary Term attached
+- All assets in the "Marketing" Domain
+- All assets without owners
+- All assets without a description
+
+To create a **Property** condition, simply click **Add Condition** then select **Property** condition.
+
+

+ +

+ +We can combine **Property** conditions using boolean operators including `AND`, `OR`, and `NOT`, by +creating **Logical** conditions. To create a **Logical** condition, simply click **Add Condition** then select an +**And**, **Or**, or **Not** condition. + +

+ +
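+
+To make the structure concrete, here is an illustrative sketch of how property and logical conditions compose. It is a plain-Python illustration of the concept, not DataHub's actual representation or evaluation engine:
+
+```python
+# Illustrative only: a tiny model of property + logical conditions, not the
+# format DataHub uses internally.
+def prop(name, operator, value=None):
+    """A property condition: (property, operator, optional value)."""
+    def check(asset):
+        actual = asset.get(name)
+        if operator == "exists":
+            return actual not in (None, [], "")
+        if operator == "equals":
+            return actual == value
+        if operator == "greater_than":
+            return actual is not None and actual > value
+        raise ValueError(f"unsupported operator: {operator}")
+    return check
+
+def and_(*conds):
+    return lambda asset: all(c(asset) for c in conds)
+
+def not_(cond):
+    return lambda asset: not cond(asset)
+
+# "Snowflake Tables in the top 25% of most queried that do not have a Domain"
+selector = and_(
+    prop("platform", "equals", "snowflake"),
+    prop("usage_percentile", "greater_than", 75),
+    not_(prop("domain", "exists")),
+)
+
+asset = {"platform": "snowflake", "usage_percentile": 90, "domain": None}
+print(selector(asset))  # True -> this asset would be selected
+```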

+ +Logical conditions allow us to accommodate complex real-world selection requirements: + +- All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain +- All Looker Dashboards that do not have a description authored in Looker OR in DataHub + +#### Step 2: Defining Rules + +In the second step, we can define a set of conditions that selected assets must match in order to be "passing" the test. +To do so, we can construct another set of **Property** conditions (as described above). + +> **Pro-Tip**: If no rules are supplied, then all assets that are selected by the criteria defined in Step 1 will be considered "passing". +> If you need to apply an automated Action to the selected assets, you can leave the Rules blank and continue to the next step. + +

+ +

+
+When combined with the selection criteria, Rules allow us to define complex, highly custom **Data Governance** policies such as:
+
+- All datasets in the top 25% of query usage in the past 30 days **must have an owner**.
+- All assets in the "Marketing" Domain **must have a description**
+- All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain **must have
+ a Glossary Term from the Classification Term Group**
+
+##### Validating Test Conditions
+
+During Step 2, we can quickly verify that the Selection Criteria & Rules we've authored
+match our expectations by testing them against some existing assets indexed by DataHub.
+
+To verify your Test conditions, simply click **Try it out**, find an asset to test against by searching & filtering down your assets,
+and finally click **Run Test** to see whether the asset passes or fails the provided conditions.
+
+

+ +

+
+#### Step 3: Defining Actions (Optional)
+
+> If you don't wish to take any actions for assets that pass or fail the test, simply click 'Skip'.
+
+In the third step, we can define a set of Actions that will be automatically applied to each selected asset which passes or fails the Rules conditions.
+
+For example, we may wish to mark **passing** assets with a special DataHub Tag or Glossary Term (e.g. "Tier 1"), or remove these special markings for those which are failing.
+This allows us to automatically control classifications of data assets as they move in and out of compliance with the Rules defined in Step 2.
+
+A few of the supported Action types include:
+
+- Adding or removing specific Tags
+- Adding or removing specific Glossary Terms
+- Adding or removing specific Owners
+- Adding to or removing from a specific Domain
+
+

+ +

+
+#### Step 4: Name, Category, Description
+
+In the final step, we can add a freeform name, category, and description for our new Metadata Test.
+
+### Viewing Test Results
+
+Metadata Test results can be viewed in 2 places:
+
+1. On an asset profile page (e.g. Dataset profile page), inside the **Validation** tab.
+2. On the Metadata Tests management page. To view all assets passing or failing a particular test,
+ simply click on the labels showing the number of passing or failing assets.
+
+

+ +

+ +### Updating an Existing Test + +To update an existing Test, simply click **Edit** on the test you wish to change. + +Then, make the changes required and click **Save**. When you save a Test, it may take up to 2 minutes for changes +to be reflected across DataHub. + +### Removing a Test + +To remove a Test, simply click on the trashcan icon located on the Tests list. This will remove the Test and +deactivate it so that it no is evaluated. + +When you delete a Test, it may take up to 2 minutes for changes to be reflected. + +### GraphQL + +- [listTests](../../graphql/queries.md#listtests) +- [createTest](../../graphql/mutations.md#createtest) +- [deleteTest](../../graphql/mutations.md#deletetest) + +## FAQ and Troubleshooting + +**When are Metadata Tests evaluated?** + +Metadata Tests are evaluated in 2 scenarios: + +1. When an individual asset is changed in DataHub, all tests that include it in scope are evaluated +2. On a recurring cadence (usually every 24 hours) by a dedicated Metadata Test evaluator, which evaluates all tests against the Metadata Graph + +**Can I configure a custom evaluation schedule for my Metadata Test?** + +No, you cannot. Currently, the internal evaluator will ensure that tests are run continuously for +each asset, regardless of whether it is being changed on DataHub. + +**How is a Metadata Test different from an Assertion?** + +An Assertion is a specific test, similar to a unit test, that is defined for a single data asset. Typically, +it will include domain-specific knowledge about the asset and test against physical attributes of it. For example, an Assertion +may verify that the number of rows for a specific table in Snowflake falls into a well-defined range. + +A Metadata Test is a broad spanning predicate which applies to a subset of the Metadata Graph (e.g. across multiple +data assets). Typically, it is defined against _metadata_ attributes, as opposed to the physical data itself. For example, +a Metadata Test may verify that ALL tables in Snowflake have at least 1 assigned owner, and a human-authored description. +Metadata Tests allow you to manage broad policies across your entire data ecosystem driven by metadata, for example to +augment a larger scale Data Governance initiative. + +_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_ diff --git a/docs-website/versioned_docs/version-0.10.4/docs/townhall-history.md b/docs-website/versioned_docs/version-0.10.4/docs/townhall-history.md new file mode 100644 index 0000000000000..964478170ebfb --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/townhall-history.md @@ -0,0 +1,523 @@ +--- +title: Town Hall History +slug: /townhall-history +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/townhall-history.md +--- + +# Town Hall History + +A list of previous Town Halls, their planned schedule, and the recording of the meeting. 
+ +## 03/23/2023 + +[Full YouTube video](https://youtu.be/BTX8rIBe0yo) + +### Agenda + +- Community & Roadmap Update +- Recent Releases +- Community Case Study — Jumio’s DataHub adoption journey +- DataHub 201: Data Debugging +- Sneak Peek: Streamlined Filtering Experience + +## 02/23/2023 + +[Full YouTube video](https://youtu.be/UItt4ppJSFc) + +### Agenda + +- Community & Roadmap Update +- Recent Releases +- Community Case Study - How the Hurb Team successfully implemented and adopted DataHub within their organization +- Sneak Peek: Subscriptions and Notifications +- Search Improvements - API support for pagination +- New Feature - Custom Queries +- Simplifying Metadata Ingestion +- DataHub 201: Rolling Out DataHub + +## 01/26/2023 + +[Full YouTube video](https://youtu.be/A3mSiGHZ6Rc) + +### Agenda + +- What’s to Come - Q1 2023 Roadmap: Data Products, Data Contracts and more +- Community Case Study - Notion: Automating annotations and metadata propagation +- Community Contribution - Grab: Improvements to documentation editing +- Simplifying DataHub - Removing Schema Registry requirement and introducing DataHub Lite + +## 01/05/2023 + +[Full YouTube video](https://youtu.be/ECxIMbKwuOY) + +### Agenda + +- DataHub Community: 2022 in Review - Our Community of Data Practitioners is one of a kind. We’ll take the time to celebrate who we are, what we’ve built, and how we’ve collaborated in the past 12 months. +- Search Improvements - Learn how we’re making the Search experience smarter and faster to connect you with the most relevant resources during data discovery. +- Removing Schema Registry Requirement - Hear all about ongoing work to simplify the DataHub deployment process. +- Smart Data Profiling - We’re making big improvements to data profiling! Smart data profiling will reduce processing time by only scanning datasets that have recently changed. +- Sneak Peek: Time-based Lineage - Get a preview of how you’ll soon be able to trace lineage between datasets across different points in time to understand how interdependencies have evolved. +- Sneak Peek: Chrome Extension - Soon, you’ll be able to quickly access rich metadata from DataHub while exploring resources in Looker via our upcoming Chrome Extension. + +## 12/01/2022 + +[Full YouTube video](https://youtu.be/BlCLhG8lGoY) + +### Agenda + +November Town Hall (in December!) + +- Community Case Study - The Pinterest Team will share how they have integrated DataHub + Thrift and extended the Metadata Model with a Data Element entity to capture semantic types. +- NEW! Ingestion Quickstart Guides - DataHub newbies, this one is for you! We’re rolling out ingestion quickstart guides to help you quickly get up and running with DataHub + Snowflake, BigQuery, and more! +- NEW! In-App Product Tours - We’re making it easier than ever for end-users to get familiar with all that DataHub has to offer - hear all about the in-product onboarding resources we’re rolling out soon! +- DataHub UI Navigation and Performance - Learn all about upcoming changes to our user experience to make it easier (and faster!) for end users to work within DataHub. +- Sneak Peek! Manual Lineage via the UI - The Community asked and we’re delivering! Soon you’ll be able to manually add lineage connections between Entities in DataHub. +- NEW! Slack + Microsoft Teams Integrations - Send automated alerts to Slack and/or Teams to keep track of critical events and changes within DataHub. 
+- Hacktoberfest Winners Announced - We’ll recap this year’s Hacktoberfest and announce three winners of a $250 Amazon gift card & DataHub Swag. + +## 10/27/2022 + +[Full YouTube video](https://youtu.be/B74WHxX5EMk) + +### Agenda + +- Conquer Data Governance with Acryl Data’s Metadata Tests - Learn how to tackle Data Governance with incremental, automation-driven governance using Metadata Tests provided in Acryl Data’s managed DataHub offering +- Community Case Study - The Grab Team shares how they are using DataHub for data discoverability, automated classification and governance workflows, data quality observability, and beyond! +- Upcoming Ingestion Sources - We’ll tell you the ins and outs of our upcoming dbt Cloud and Unity Catalog connectors +- Sneak Peek! Saved Views - Learn how you can soon use Saved Views to help end-users navigate entities in DataHub with more precision and focus +- Performance Improvements - Hear about the latest upgrades to DataHub performance + +## 9/29/2022 + +[Full YouTube video](https://youtu.be/FjkNySWkghY) + +### Agenda + +- Column Level Lineage is here! - Demo of column-level lineage and impact analysis in the DataHub UI +- Community Case Study - The Stripe Team shares how they leverage DataHub to power observability within their Airflow-based ecosystem +- Sneak Peek! Automated PII Classification - Preview upcoming functionality to automatically identify data fields that likely contain sensitive data +- Ingestion Improvements Galore - Improved performance and functionality for dbt, Looker, Tableau, and Presto ingestion sources + +## 8/25/2022 + +[Full YouTube video](https://youtu.be/EJCKxKBvCwo) + +### Agenda + +- Community Case Study - The Etsy Team shares their journey of adopting DataHub +- Looker & DataHub Improvements - surface the most relevant Looks and Dashboards +- Home Page Improvements to tailor the Browse experience +- Unified Ingestion Summaries - View live logs for UI-based ingestion and see historical ingestion reports across CLI and UI-based ingestion +- Patch Support - Native support for PATCH in the metadata protocol to support efficient updates to add & remove owners, lineage, tags and more +- Sneak Peek! Advanced Search + +## 7/28/2022 + +[Full YouTube video](https://youtu.be/Zrkf3Mzcvc4) + +### Agenda + +- Community Updates +- Project Updates +- Improvements to UI-Based Ingestion +- Sneak Preview - Bulk Edits via the UI +- Streamlined Metadata Ingestion +- DataHub 201: Metadata Enrichment + +## 6/30/2022 + +[Full YouTube video](https://youtu.be/fAD53fEJ6m0) + +### Agenda + +- Community Updates +- Project Updates +- dbt Integration Updates +- CSV Ingestion Support +- DataHub 201 - Glossary Term Deep Dive + +## 5/26/2022 + +[Full YouTube video](https://youtu.be/taKb_zyowEE) + +### Agenda + +- Community Case Study: Hear how the G-Research team is using Cassandra as DataHub’s Backend +- Creating & Editing Glossary Terms from the DataHub UI +- DataHub User Onboarding via the UI +- DataHub 201: Impact Analysis +- Sneak Peek: Data Reliability with DataHub +- Metadata Day Hackathon Winners + +## 4/28/2022 + +[Full YouTube video](https://www.youtube.com/watch?v=7iwNxHgqxtg) + +### Agenda + +- Community Case Study: Hear from Included Health about how they are embedding external tools into the DataHub UI +- New! 
Actions Framework: run custom code when changes happen within DataHub +- UI Refresh for ML Entities +- Improved deletion support for time-series aspects, tags, terms, & more +- OpenAPI Improvements + +## 3/31/2022 + +[Full YouTube video](https://www.youtube.com/watch?v=IVazVgcNRdw) + +### Agenda + +- Community Case Study: Hear from Zendesk about how they are applying “shift left” principles by authoring metadata in their Protobuf schemas +- RBAC Functionality: View-Based Policies +- Schema Version History - surfacing the history of schema changes in DataHub's UI +- Improvements to Airflow Ingestion, including Run History +- Container/Domain-Level Property Inheritance +- Delete API + +## 2/25/2022 + +[Full YouTube video](https://www.youtube.com/watch?v=enBqB2Dbuv4) + +### Agenda + +- Lineage Impact Analysis - using DataHub to understand the impact of changes on downstream dependencies +- Displaying Data Quality Checks in the UI +- Roadmap update: Schema Version History & Column-Level Lineage +- Community Case Study: Managing Lineage via YAML + +## 1/28/2022 + +[Full YouTube video](https://youtu.be/ShlSR3dMUnE) + +### Agenda + +- Community & Roadmap Updates by Maggie Hays (Acryl Data) +- Project Updates by Shirshanka Das (Acryl Data) +- Community Case Study: Adding Dataset Transformers by Eric Cooklin (Stash) +- Demo: Data Domains & Containers by John Joyce (Acryl Data) +- DataHub Basics — Data Profiling & Usage Stats 101 by Maggie Hays & Tamás Németh (Acryl Data) +- Demo: Spark Lineage by Mugdha Hardikar (GS Lab) & Shirshanka Das + +## 12/17/2021 + +[Full YouTube video](https://youtu.be/rYInKCwxu7o) + +### Agenda + +- Community & Roadmap Updates by Maggie Hays (Acryl Data) +- Project Updates by Shirshanka Das (Acryl Data) +- 2021 DataHub Community in Review by Maggie Hays +- DataHub Basics -- Users, Groups, & Authentication 101 by Pedro Silva (Acryl Data) +- Sneak Peek: UI-Based Ingestion by John Joyce (Acryl Data) +- Case Study — DataHub at Grofers by Shubham Gupta +- Top DataHub Contributors of 2021 - Maggie Hays +- Final Surprise! 
We Interviewed a 10yo and a 70yo about DataHub + +## 11/19/2021 + +[Full YouTube video](https://youtu.be/to80sEDZz7k) + +### Agenda + +- Community & Roadmap Updates by Maggie Hays (Acryl Data) +- Project Updates by Shirshanka Das (Acryl Data) +- DataHub Basics -- Lineage 101 by John Joyce & Surya Lanka (Acryl Data) +- Introducing No-Code UI by Gabe Lyons & Shirshanka Das (Acryl Data) +- DataHub API Authentication by John Joyce (Acryl Data) +- Case Study: LinkedIn pilot to extend the OSS UI by Aikepaer Abuduweili & Joshua Shinavier + +## 10/29/2021 + +[Full YouTube video](https://youtu.be/GrS_uZhYNm0) + +### Agenda + +- DataHub Community & Roadmap Update - Maggie Hays (Acryl Data) +- October Project Updates - Shirshanka Das (Acryl Data) +- Introducing Recommendations - John Joyce & Dexter Lee (Acryl Data) +- Case Study: DataHub @ hipages - Chris Coulson (hipages) +- Data Profiling Improvements - Surya Lanka & Harshal Sheth (Acryl Data) +- Lineage Improvements & BigQuery Dataset Lineage by Gabe Lyons & Varun Bharill (Acryl Data) + +## 9/24/2021 + +[Full YouTube video](https://youtu.be/nQDiKPKnLLQ) + +### Agenda + +- Project Updates and Callouts by Shirshanka + - GraphQL Public API Annoucement +- Demo: Faceted Search by Gabe Lyons (Acryl Data) +- Stateful Ingestion by Shirshanka Das & Surya Lanka (Acryl Data) +- Case-Study: DataHub @ Adevinta by Martinez de Apellaniz +- Recent Improvements to the Looker Connector by Shirshanka Das & Maggie Hays (Acryl Data) +- Offline + - Foreign Key and Related Term Mapping by Gabe Lyons (Acryl Data) [video](https://www.loom.com/share/79f27c2d9f6c4a3b8aacbc48c19add18) + +## 8/27/2021 + +[Full YouTube video](https://youtu.be/3joZINi3ti4) + +### Agenda + +- Project Updates and Callouts by Shirshanka + - Business Glossary Demo + - 0.8.12 Upcoming Release Highlights + - Users and Groups Management (Okta, Azure AD) +- Demo: Fine Grained Access Control by John Joyce (Acryl Data) +- Community Case-Study: DataHub @ Warung Pintar and Redash integration by Taufiq Ibrahim (Bizzy Group) +- New User Experience by John Joyce (Acryl Data) +- Offline + - Performance Monitoring by Dexter Lee (Acryl Data) [video](https://youtu.be/6Xfr_Y9abZo) + +## 7/23/2021 + +[Full YouTube video](https://www.youtube.com/watch?v=rZsiB8z5rG4) + +[Medium Post](https://medium.com/datahub-project/datahub-project-updates-f4299cd3602e?source=friends_link&sk=27af7637f7ae44786ede694c3af512a5) + +### Agenda + +- Project Updates by Shirshanka + - Release highlights +- Deep Dive: Data Observability: Phase 1 by Harshal Sheth, Dexter Lee (Acryl Data) +- Case Study: Building User Feedback into DataHub by Melinda Cardenas (NY Times) +- Demo: AWS SageMaker integration for Models and Features by Kevin Hu (Acryl Data) + +## 6/25/2021 + +[Full YouTube video](https://www.youtube.com/watch?v=xUHOdDfdFpY) + +[Medium Post](https://medium.com/datahub-project/datahub-project-updates-ed3155476408?source=friends_link&sk=02816a16ff2acd688e6db8eb55808d31) + +#### Agenda + +- Project Updates by Shirshanka + - Release notes + - RBAC update + - Roadmap for H2 2021 +- Demo: Table Popularity powered by Query Activity by Harshal Sheth (Acryl Data) +- Case Study: Business Glossary in production at Saxo Bank by Sheetal Pratik (Saxo Bank), Madhu Podila (ThoughtWorks) +- Developer Session: Simplified Deployment for DataHub by John Joyce, Gabe Lyons (Acryl Data) + +## 5/27/2021 + +[Full YouTube video](https://www.youtube.com/watch?v=qgW_xpIr1Ho) + +[Medium 
Post](https://medium.com/datahub-project/linkedin-datahub-project-updates-ed98cdf913c1?source=friends_link&sk=9930ec5579299b155ea87c747683d1ad) + +#### Agenda + +- Project Updates by Shirshanka - 10 mins + - 0.8.0 Release + - AWS Recipe by Dexter Lee (Acryl Data) +- Demo: Product Analytics design sprint (Maggie Hays (SpotHero), Dexter Lee (Acryl Data)) - 10 mins +- Use-Case: DataHub on GCP by Sharath Chandra (Confluent) - 10 mins +- Deep Dive: No Code Metadata Engine by John Joyce (Acryl Data) - 20 mins +- General Q&A and closing remarks + +## 4/23/2021 + +[Full YouTube video](https://www.youtube.com/watch?v=dlFa4ubJ9ho) + +[Medium Digest](https://medium.com/datahub-project/linkedin-datahub-project-updates-2b0d26066b8f?source=friends_link&sk=686c47219ed294e0838ae3e2fe29084d) + +#### Agenda + +- Welcome - 5 mins +- Project Updates by Shirshanka - 10 mins + - 0.7.1 Release and callouts (dbt by Gary Lucas) + - Product Analytics design sprint announcement (Maggie Hayes) +- Use-Case: DataHub at DefinedCrowd ([video](https://www.youtube.com/watch?v=qz5Rpmw8I5E)) by Pedro Silva - 15 mins +- Deep Dive + Demo: Lineage! Airflow, Superset integration ([video](https://www.youtube.com/watch?v=3wiaqhb8UR0)) by Harshal Sheth and Gabe Lyons - 10 mins +- Use-Case: DataHub Hackathon at Depop ([video](https://www.youtube.com/watch?v=SmOMyFc-9J0)) by John Cragg - 10 mins +- Observability Feedback share out - 5 mins +- General Q&A and closing remarks - 5 mins + +## 3/19/2021 + +[YouTube video](https://www.youtube.com/watch?v=xE8Uc27VTG4) + +[Medium Digest](https://medium.com/datahub-project/linkedin-datahub-project-updates-697f0faddd10?source=friends_link&sk=9888633c5c7219b875125e87a703ec4d) + +#### Agenda + +- Welcome - 5 mins +- Project Updates ([slides](https://drive.google.com/file/d/1c3BTP3oDAzJr07l6pY6CkDZi5nT0cLRs/view?usp=sharing)) by [Shirshanka](https://www.linkedin.com/in/shirshankadas/) - 10 mins + - 0.7.0 Release + - Project Roadmap +- Demo Time: Themes and Tags in the React App! by [Gabe Lyons](https://www.linkedin.com/in/gabe-lyons-9a574543/) - 10 mins +- Use-Case: DataHub at [Wolt](https://www.linkedin.com/company/wolt-oy/) ([slides](https://drive.google.com/file/d/1za7NKbnXpFV2bBDblP35CbQEIDwc9tog/view?usp=sharing)) by [Fredrik](https://www.linkedin.com/in/fredriksannholm/?originalSubdomain=fi) and Matti - 15 mins +- Poll Time: Observability Mocks! ([slides](https://drive.google.com/file/d/1Ih2EGf-76jhbNAjr2EsBLb7n8bra2WIz/view?usp=sharing)) - 5 mins +- General Q&A from sign up sheet, slack, and participants - 10 mins +- Closing remarks - 5 mins + +## 2/19/2021 + +[YouTube video](https://www.youtube.com/watch?v=Z9ImbcsAVl0) + +[Medium Digest](https://medium.com/datahub-project/linkedin-datahub-project-updates-february-2021-edition-338d2c6021f0) + +#### Agenda + +- Welcome - 5 mins +- Latest React App Demo! 
([video](https://www.youtube.com/watch?v=RQBEJhcen5E)) by John Joyce and Gabe Lyons - 5 mins +- Use-Case: DataHub at Geotab ([slides](https://docs.google.com/presentation/d/1qcgO3BW5NauuG0HnPqrxGcujsK-rJ1-EuU-7cbexkqE/edit?usp=sharing),[video](https://www.youtube.com/watch?v=boyjT2OrlU4)) by [John Yoon](https://www.linkedin.com/in/yhjyoon/) - 15 mins +- Tech Deep Dive: Tour of new pull-based Python Ingestion scripts ([slides](https://docs.google.com/presentation/d/15Xay596WDIhzkc5c8DEv6M-Bv1N4hP8quup1tkws6ms/edit#slide=id.gb478361595_0_10),[video](https://www.youtube.com/watch?v=u0IUQvG-_xI)) by [Harshal Sheth](https://www.linkedin.com/in/hsheth2/) - 15 mins +- General Q&A from sign up sheet, slack, and participants - 15 mins +- Closing remarks - 5 mins + +## 1/15/2021 + +[Full Recording](https://youtu.be/r862MZTLAJ0) + +[Slide-deck](https://docs.google.com/presentation/d/e/2PACX-1vQ2B0iHb2uwege1wlkXHOgQer0myOMEE5EGnzRjyqw0xxS5SaAc8VMZ_1XVOHuTZCJYzZZW4i9YnzSN/pub?start=false&loop=false&delayms=3000) + +Agenda + +- Announcements - 2 mins +- Community Updates ([video](https://youtu.be/r862MZTLAJ0?t=99)) - 10 mins +- Use-Case: DataHub at Viasat ([slides](../../../docs/demo/ViasatMetadataJourney.pdf),[video](https://youtu.be/2SrDAJnzkjE)) by [Anna Kepler](https://www.linkedin.com/in/akepler) - 15 mins +- Tech Deep Dive: GraphQL + React RFCs readout and discussion ([slides](https://docs.google.com/presentation/d/e/2PACX-1vRtnINnpi6PvFw7-5iW8PSQoT9Kdf1O_0YW7QAr1_mSdJMNftYFTVCjKL-e3fpe8t6IGkha8UpdmoOI/pub?start=false&loop=false&delayms=3000) ,[video](https://www.youtube.com/watch?v=PrBaFrb7pqA)) by [John Joyce](https://www.linkedin.com/in/john-joyce-759883aa) and [Arun Vasudevan](https://www.linkedin.com/in/arun-vasudevan-55117368/) - 15 mins +- General Q&A from sign up sheet, slack, and participants - 15 mins +- Closing remarks - 3 mins +- General Q&A from sign up sheet, slack, and participants - 15 mins +- Closing remarks - 5 minutes + +## 12/04/2020 + +[Recording](https://linkedin.zoom.us/rec/share/8E7-lFnCi_kQ8OvXR9kW6fn-AjvV8VlqOO2xYR8b5Y_UeWI_ODcKFlxlHqYgBP7j.S-c8C1YMrz7d3Mjq) + +Agenda + +- Quick intro - 5 mins +- [Why did Grofers choose DataHub for their data catalog?](../../../docs/demo/Datahub_at_Grofers.pdf) by [Shubham Gupta](https://www.linkedin.com/in/shubhamg931/) - 15 minutes +- [DataHub UI development - Part 2](../../../docs/demo/Town_Hall_Presentation_-_12-2020_-_UI_Development_Part_2.pdf) by [Charlie Tran](https://www.linkedin.com/in/charlie-tran/) (LinkedIn) - 20 minutes +- General Q&A from sign up sheet, slack, and participants - 15 mins +- Closing remarks - 5 minutes + +## 11/06/2020 + +[Recording](https://linkedin.zoom.us/rec/share/0yvjZ2fOzVmD8aaDo3lC59fXivmYG3EnF0U9tMVgKs827595usvSoIhtFUPjZCsU.b915nLRkw6iQlnoD) + +Agenda + +- Quick intro - 5 mins +- [Lightning talk on Metadata use-cases at LinkedIn](../../../docs/demo/Metadata_Use-Cases_at_LinkedIn_-_Lightning_Talk.pdf) by [Shirshanka Das](https://www.linkedin.com/in/shirshankadas/) (LinkedIn) - 5 mins +- [Strongly Consistent Secondary Index (SCSI) in GMA](../../../docs/demo/Datahub_-_Strongly_Consistent_Secondary_Indexing.pdf), an upcoming feature by [Jyoti Wadhwani](https://www.linkedin.com/in/jyotiwadhwani/) (LinkedIn) - 15 minutes +- [DataHub UI overview](../../../docs/demo/DataHub-UIOverview.pdf) by [Ignacio Bona](https://www.linkedin.com/in/ignaciobona) (LinkedIn) - 20 minutes +- General Q&A from sign up sheet, slack, and participants - 10 mins +- Closing remarks - 5 minutes + +## 09/25/2020 + 
+[Recording](https://linkedin.zoom.us/rec/share/uEQ2pRY0BHbVqk_sOTVRm05VXJ0xM_zKJ26yzfCBqNZItiBht__k_juCCahJ37QK.IKAU9qA_0qdURX4_) + +Agenda + +- Quick intro - 5 mins +- [Data Discoverability at SpotHero](../../../docs/demo/Data_Discoverability_at_SpotHero.pdf) by [Maggie Hays](https://www.linkedin.com/in/maggie-hays/) (SpotHero) - 20 mins +- [Designing the next generation of metadata events for scale](../../../docs/demo/Designing_the_next_generation_of_metadata_events_for_scale.pdf) by [Chris Lee](https://www.linkedin.com/in/chrisleecmu/) (LinkedIn) - 15 mins +- General Q&A from sign up sheet, slack, and participants - 15 mins +- Closing remarks - 5 mins + +## 08/28/2020 + +[Recording](https://linkedin.zoom.us/rec/share/vMBfcb31825IBZ3T71_wffM_GNv3T6a8hicf8_dcfzQlhfFxl5i_CPVKcmYaZA) + +Agenda + +- Quick intro - 5 mins +- [Data Governance look for a Digital Bank](https://www.slideshare.net/SheetalPratik/linkedinsaxobankdataworkbench) by [Sheetal Pratik](https://www.linkedin.com/in/sheetalpratik/) (Saxo Bank) - 20 mins +- Column level lineage for datasets demo by [Nagarjuna Kanamarlapudi](https://www.linkedin.com/in/nagarjunak/) (LinkedIn) - 15 mins +- General Q&A from sign up sheet and participants - 15 mins +- Closing remarks - 5 mins + +## 07/31/20 + +[Recording](https://bluejeans.com/s/wjnDRJevi5z/) + +Agenda + +- Quick intro - 5 mins +- Showcasing new entities onboarded to internal LinkedIn DataHub (Data Concepts, Schemas) by [Nagarjuna Kanamarlapudi](https://www.linkedin.com/in/nagarjunak) (LinkedIn) - 15 mins +- Showcasing new Lineage UI in internal LinkedIn DataHub By [Ignacio Bona](https://www.linkedin.com/in/ignaciobona) (LinkedIn) - 10 mins +- New [RFC Process](./rfc.md) by [John Plaisted](https://www.linkedin.com/in/john-plaisted-49a00a78/) (LinkedIn) - 2 mins +- Answering questions from the signup sheet - 13 mins +- Questions from the participants - 10 mins +- Closing remarks - 5 mins + +## 06/26/20 + +[Recording](https://bluejeans.com/s/yILyR/) + +Agenda + +- Quick intro - 5 mins +- Onboarding Data Process entity by [Liangjun Jiang](https://github.com/liangjun-jiang) (Expedia) - 15 mins +- How to onboard a new relationship to metadata graph by [Kerem Sahin](https://github.com/keremsahin1) (Linkedin) - 15 mins +- Answering questions from the signup sheet - 15 mins +- Questions from the participants - 10 mins +- Closing remarks - 5 mins + +## 05/29/20 + +[Recording](https://bluejeans.com/s/GCAzY) + +Agenda + +- Quick intro - 5 mins +- How to add a new aspect/feature for an existing entity in UI by [Charlie Tran](https://www.linkedin.com/in/charlie-tran/) (LinkedIn) - 10 mins +- How to search over a new field by [Jyoti Wadhwani](https://www.linkedin.com/in/jyotiwadhwani/) (LinkedIn) - 10 mins +- Answering questions from the signup sheet - 15 mins +- Questions from the participants - 10 mins +- Closing remarks - 5 mins + +## 04/17/20 + +[Recording](https://bluejeans.com/s/eYRD4) + +Agenda + +- Quick intro - 5 mins +- [DataHub Journey with Expedia Group](https://www.youtube.com/watch?v=ajcRdB22s5o&ab_channel=ArunVasudevan) by [Arun Vasudevan](https://www.linkedin.com/in/arun-vasudevan-55117368/) (Expedia) - 10 mins +- Deploying DataHub using Nix by [Larry Luo](https://github.com/clojurians-org) (Shanghai HuaRui Bank) - 10 mins +- Answering questions from the signup sheet - 15 mins +- Questions from the participants - 10 mins +- Closing remarks - 5 mins + +## 04/03/20 + +[Recording](https://bluejeans.com/s/vzYpa) + 
+[Q&A](https://docs.google.com/document/d/1ChF9jiJWv9wj3HLLkFYRg7NSYg8Kb0PT7COd7Hf9Zpk/edit?usp=sharing) + +- Agenda + - Quick intro - 5 mins + - Creating Helm charts for deploying DataHub on Kubernetes by [Bharat Akkinepalli](https://www.linkedin.com/in/bharat-akkinepalli-ba0b7223/) (ThoughtWorks) - 10 mins + - How to onboard a new metadata aspect by [Mars Lan](https://www.linkedin.com/in/marslan) (LinkedIn) - 10 mins + - Answering questions from the signup sheet - 15 mins + - Questions from the participants - 10 mins + - Closing remarks - 5 mins + +## 03/20/20 + +[Recording](https://bluejeans.com/s/FSKEF) + +[Q&A](https://docs.google.com/document/d/1vQ6tAGXsVafnPIcZv1GSYgnTJJXFOACa1aWzOQjiGHI/edit) + +Agenda + +- Quick intro - 5 mins +- Internal DataHub demo - 10 mins +- What's coming up next for DataHub (what roadmap items we are working on) - 10 mins +- Answering questions from the signup sheet - 15 mins +- Questions from the participants - 10 mins +- Closing remarks - 5 mins + +## 03/06/20 + +[Recording](https://bluejeans.com/s/vULMG) + +[Q&A](https://docs.google.com/document/d/1N_VGqlH9CD-54LBsVlpcK2Cf2Mgmuzq79EvN9qgBqtQ/edit) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/townhalls.md b/docs-website/versioned_docs/version-0.10.4/docs/townhalls.md new file mode 100644 index 0000000000000..c07997742cb04 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/townhalls.md @@ -0,0 +1,21 @@ +--- +title: DataHub Town Halls +sidebar_label: Town Halls +slug: /townhalls +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/townhalls.md" +--- + +# DataHub Town Halls + +We hold regular virtual town hall meetings to meet with DataHub community. +Currently it's held on the fourth Thursday of every month (with some exceptions such as holiday weekends). +It's the perfect venue to meet the team behind DataHub and other users, as well as to ask higher-level questions, such as roadmap and product direction. +From time to time we also use the opportunity to showcase upcoming features. + +## Meeting Invite & Agenda + +You can join with this link https://zoom.datahubproject.io, or [RSVP](https://rsvp.datahubproject.io/) to get a calendar invite - this will always have the most up-to-date agenda for upcoming sessions. + +## Past Meetings + +See [Town Hall History](townhall-history.md) for recordings of past town halls. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/build.md b/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/build.md new file mode 100644 index 0000000000000..c41508b75af21 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/build.md @@ -0,0 +1,49 @@ +--- +title: Build Debugging Guide +slug: /troubleshooting/build +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/troubleshooting/build.md +--- + +# Build Debugging Guide + +For when [Local Development](/docs/developers.md) did not work out smoothly. + +## Getting `Unsupported class file major version 57` + +You're probably using a Java version that's too new for gradle. Run the following command to check your Java version + +``` +java --version +``` + +While it may be possible to build and run DataHub using newer versions of Java, we currently only support [Java 11](https://openjdk.org/projects/jdk/11/) (aka Java 11). + +## Getting `cannot find symbol` error for `javax.annotation.Generated` + +Similar to the previous issue, please use Java 1.8 to build the project. 
+You can install multiple version of Java on a single machine and switch between them using the `JAVA_HOME` environment variable. See [this document](https://docs.oracle.com/cd/E21454_01/html/821-2531/inst_jdk_javahome_t.html) for more details. + +## `:metadata-models:generateDataTemplate` task fails with `java.nio.file.InvalidPathException: Illegal char <:> at index XX` or `Caused by: java.lang.IllegalArgumentException: 'other' has different root` error + +This is a [known issue](https://github.com/linkedin/rest.li/issues/287) when building the project on Windows due a bug in the Pegasus plugin. Please refer to [Windows Compatibility](/docs/developers.md#windows-compatibility). + +## Various errors related to `generateDataTemplate` or other `generate` tasks + +As we generate quite a few files from the models, it is possible that old generated files may conflict with new model changes. When this happens, a simple `./gradlew clean` should reosolve the issue. + +## `Execution failed for task ':metadata-service:restli-servlet-impl:checkRestModel'` + +This generally means that an [incompatible change](https://linkedin.github.io/rest.li/modeling/compatibility_check) was introduced to the rest.li API in GMS. You'll need to rebuild the snapshots/IDL by running the following command once + +``` +./gradlew :metadata-service:restli-servlet-impl:build -Prest.model.compatibility=ignore +``` + +## `java.io.IOException: No space left on device` + +This means you're running out of space on your disk to build. Please free up some space or try a different disk. + +## `Build failed` for task `./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint` + +This could mean that you need to update your [Yarn](https://yarnpkg.com/getting-started/install) version diff --git a/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/general.md b/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/general.md new file mode 100644 index 0000000000000..9c029a81e0758 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/general.md @@ -0,0 +1,19 @@ +--- +title: General Debugging Guide +slug: /troubleshooting/general +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/troubleshooting/general.md +--- + +# General Debugging Guide + +## Logo for my platform is not appearing on the Home Page or search results + +Please see if either of these guides help you + +- [Adding a custom Dataset Data Platform](../how/add-custom-data-platform.md) +- [DataHub CLI put platform command](../cli.md#put-platform) + +## How do I add dataset freshness indicator for datasets? + +You can emit an [operation aspect](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Operation.pdl) for the same. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/quickstart.md b/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/quickstart.md new file mode 100644 index 0000000000000..fc97ecfcd59a5 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/troubleshooting/quickstart.md @@ -0,0 +1,323 @@ +--- +title: Quickstart Debugging Guide +slug: /troubleshooting/quickstart +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/troubleshooting/quickstart.md +--- + +# Quickstart Debugging Guide + +For when [Quickstart](/docs/quickstart.md) did not work out smoothly. + +## Common Problems + +
+Command not found: datahub + + +If running the datahub cli produces "command not found" errors inside your terminal, your system may be defaulting to an +older version of Python. Try prefixing your `datahub` commands with `python3 -m`: + +```bash +python3 -m datahub docker quickstart +``` + +Another possibility is that your system PATH does not include pip's `$HOME/.local/bin` directory. On linux, you can add this to your `~/.bashrc`: + +```bash +if [ -d "$HOME/.local/bin" ] ; then + PATH="$HOME/.local/bin:$PATH" +fi +``` + +
+ +
+ +Port Conflicts + + +By default the quickstart deploy will require the following ports to be free on your local machine: + +- 3306 for MySQL +- 9200 for Elasticsearch +- 9092 for the Kafka broker +- 8081 for Schema Registry +- 2181 for ZooKeeper +- 9002 for the DataHub Web Application (datahub-frontend) +- 8080 for the DataHub Metadata Service (datahub-gms) + +In case the default ports conflict with software you are already running on your machine, you can override these ports by passing additional flags to the `datahub docker quickstart` command. +e.g. To override the MySQL port with 53306 (instead of the default 3306), you can say: `datahub docker quickstart --mysql-port 53306`. Use `datahub docker quickstart --help` to see all the supported options. +For the metadata service container (datahub-gms), you need to use an environment variable, `DATAHUB_MAPPED_GMS_PORT`. So for instance to use the port 58080, you would say `DATAHUB_MAPPED_GMS_PORT=58080 datahub docker quickstart` + +
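+
+If you are not sure which of these ports are already taken, a small Python sketch like the one below can check them before you run the quickstart (adjust the list if you have overridden any defaults):
+
+```python
+# Check which of the default quickstart ports are already in use locally.
+import socket
+
+DEFAULT_PORTS = {
+    3306: "MySQL",
+    9200: "Elasticsearch",
+    9092: "Kafka broker",
+    8081: "Schema Registry",
+    2181: "ZooKeeper",
+    9002: "DataHub Web Application (datahub-frontend)",
+    8080: "DataHub Metadata Service (datahub-gms)",
+}
+
+for port, service in DEFAULT_PORTS.items():
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        sock.settimeout(0.5)
+        in_use = sock.connect_ex(("127.0.0.1", port)) == 0
+    status = "IN USE" if in_use else "free"
+    print(f"{port:>5} ({service}): {status}")
+```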
+ +
+ +no matching manifest for linux/arm64/v8 in the manifest list entries + +On Mac computers with Apple Silicon (M1, M2 etc.), you might see an error like `no matching manifest for linux/arm64/v8 in the manifest list entries`, this typically means that the datahub cli was not able to detect that you are running it on Apple Silicon. To resolve this issue, override the default architecture detection by issuing `datahub docker quickstart --arch m1` + +
+
+ +Miscellaneous Docker issues + +There can be misc issues with Docker, like conflicting containers and dangling volumes, that can often be resolved by +pruning your Docker state with the following command. Note that this command removes all unused containers, networks, +images (both dangling and unreferenced), and optionally, volumes. + +``` +docker system prune +``` + +
+ +
+ +Still stuck? + + +Hop over to our [Slack community](https://slack.datahubproject.io) and ask for help in the [#troubleshoot](https://datahubspace.slack.com/archives/C029A3M079U) channel! + +
+ +## How can I confirm if all Docker containers are running as expected after a quickstart? + +If you set up the `datahub` CLI tool (see [here](../../metadata-ingestion/README.md)), you can use the built-in check utility: + +```shell +datahub docker check +``` + +You can list all Docker containers in your local by running `docker container ls`. You should expect to see a log similar to the below: + +``` +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +979830a342ce linkedin/datahub-mce-consumer:latest "bash -c 'while ping…" 10 hours ago Up 10 hours datahub-mce-consumer +3abfc72e205d linkedin/datahub-frontend-react:latest "datahub-frontend…" 10 hours ago Up 10 hours 0.0.0.0:9002->9002/tcp datahub-frontend +50b2308a8efd linkedin/datahub-mae-consumer:latest "bash -c 'while ping…" 10 hours ago Up 10 hours datahub-mae-consumer +4d6b03d77113 linkedin/datahub-gms:latest "bash -c 'dockerize …" 10 hours ago Up 10 hours 0.0.0.0:8080->8080/tcp datahub-gms +c267c287a235 landoop/schema-registry-ui:latest "/run.sh" 10 hours ago Up 10 hours 0.0.0.0:8000->8000/tcp schema-registry-ui +4b38899cc29a confluentinc/cp-schema-registry:5.2.1 "/etc/confluent/dock…" 10 hours ago Up 10 hours 0.0.0.0:8081->8081/tcp schema-registry +37c29781a263 confluentinc/cp-kafka:5.2.1 "/etc/confluent/dock…" 10 hours ago Up 10 hours 0.0.0.0:9092->9092/tcp, 0.0.0.0:29092->29092/tcp broker +15440d99a510 docker.elastic.co/kibana/kibana:5.6.8 "/bin/bash /usr/loca…" 10 hours ago Up 10 hours 0.0.0.0:5601->5601/tcp kibana +943e60f9b4d0 neo4j:4.0.6 "/sbin/tini -g -- /d…" 10 hours ago Up 10 hours 0.0.0.0:7474->7474/tcp, 7473/tcp, 0.0.0.0:7687->7687/tcp neo4j +6d79b6f02735 confluentinc/cp-zookeeper:5.2.1 "/etc/confluent/dock…" 10 hours ago Up 10 hours 2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp zookeeper +491d9f2b2e9e docker.elastic.co/elasticsearch/elasticsearch:5.6.8 "/bin/bash bin/es-do…" 10 hours ago Up 10 hours 0.0.0.0:9200->9200/tcp, 9300/tcp elasticsearch +ce14b9758eb3 mysql:5.7 +``` + +Also you can check individual Docker container logs by running `docker logs <>`. For `datahub-gms`, you should see a log similar to this at the end of the initialization: + +``` +2020-02-06 09:20:54.870:INFO:oejs.Server:main: Started @18807ms +``` + +For `datahub-frontend-react`, you should see a log similar to this at the end of the initialization: + +``` +09:20:22 [main] INFO play.core.server.AkkaHttpServer - Listening for HTTP on /0.0.0.0:9002 +``` + +## My elasticsearch or broker container exited with error or was stuck forever + +If you're seeing errors like below, chances are you didn't give enough resource to docker. Please make sure to allocate at least 8GB of RAM + 2GB swap space. + +``` +datahub-gms | 2020/04/03 14:34:26 Problem with request: Get http://elasticsearch:9200: dial tcp 172.19.0.5:9200: connect: connection refused. Sleeping 1s +broker | [2020-04-03 14:34:42,398] INFO Client session timed out, have not heard from server in 6874ms for sessionid 0x10000023fa60002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) +schema-registry | [2020-04-03 14:34:48,518] WARN Client session timed out, have not heard from server in 20459ms for sessionid 0x10000023fa60007 (org.apache.zookeeper.ClientCnxn) +``` + +## How can I check if [MXE](../what/mxe.md) Kafka topics are created? + +You can use a utility like [kafkacat](https://github.com/edenhill/kafkacat) to list all topics. +You can run below command to see the Kafka topics created in your Kafka broker. 
+ +```bash +kafkacat -L -b localhost:9092 +``` + +Confirm that `MetadataChangeEvent`, `MetadataAuditEvent`, `MetadataChangeProposal_v1` and `MetadataChangeLog_v1` topics exist besides the default ones. + +## How can I check if search indices are created in Elasticsearch? + +You can run below command to see the search indices created in your Elasticsearch. + +```bash +curl http://localhost:9200/_cat/indices +``` + +Confirm that `datasetindex_v2` & `corpuserindex_v2` indices exist besides the default ones. Example response as below: + +```bash +yellow open dataset_datasetprofileaspect_v1 HnfYZgyvS9uPebEQDjA1jg 1 1 0 0 208b 208b +yellow open datajobindex_v2 A561PfNsSFmSg1SiR0Y0qQ 1 1 2 9 34.1kb 34.1kb +yellow open mlmodelindex_v2 WRJpdj2zT4ePLSAuEvFlyQ 1 1 1 12 24.2kb 24.2kb +yellow open dataflowindex_v2 FusYIc1VQE-5NaF12uS8dA 1 1 1 3 23.3kb 23.3kb +yellow open mlmodelgroupindex_v2 QOzAaVx7RJ2ovt-eC0hg1w 1 1 0 0 208b 208b +yellow open datahubpolicyindex_v2 luXfXRlSRoS2-S_tvfLjHA 1 1 0 0 208b 208b +yellow open corpuserindex_v2 gbNXtnIJTzqh3vHSZS0Fwg 1 1 2 2 18.4kb 18.4kb +yellow open dataprocessindex_v2 9fL_4iCNTLyFv8MkDc6nIg 1 1 0 0 208b 208b +yellow open chartindex_v2 wYKlG5ylQe2dVKHOaswTww 1 1 2 7 29.4kb 29.4kb +yellow open tagindex_v2 GBQSZEvuRy62kpnh2cu1-w 1 1 2 2 19.7kb 19.7kb +yellow open mlmodeldeploymentindex_v2 UWA2ltxrSDyev7Tmu5OLmQ 1 1 0 0 208b 208b +yellow open dashboardindex_v2 lUjGAVkRRbuwz2NOvMWfMg 1 1 1 0 9.4kb 9.4kb +yellow open .ds-datahub_usage_event-000001 Q6NZEv1UQ4asNHYRywxy3A 1 1 36 0 54.8kb 54.8kb +yellow open datasetindex_v2 bWE3mN7IRy2Uj0QzeCt1KQ 1 1 7 47 93.7kb 93.7kb +yellow open mlfeatureindex_v2 fvjML5xoQpy8oxPIwltm8A 1 1 20 39 59.3kb 59.3kb +yellow open dataplatformindex_v2 GihumZfvRo27vt9yRpoE_w 1 1 0 0 208b 208b +yellow open glossarynodeindex_v2 ABKeekWTQ2urPWfGDsS4NQ 1 1 1 1 18.1kb 18.1kb +yellow open graph_service_v1 k6q7xV8OTIaRIkCjrzdufA 1 1 116 25 77.1kb 77.1kb +yellow open system_metadata_service_v1 9-FKAqp7TY2hs3RQuAtVMw 1 1 303 0 55.9kb 55.9kb +yellow open schemafieldindex_v2 Mi_lqA-yQnKWSleKEXSWeg 1 1 0 0 208b 208b +yellow open mlfeaturetableindex_v2 pk98zrSOQhGr5gPYUQwvvQ 1 1 5 14 36.4kb 36.4kb +yellow open glossarytermindex_v2 NIyi3WWiT0SZr8PtECo0xQ 1 1 3 8 23.1kb 23.1kb +yellow open mlprimarykeyindex_v2 R1WFxD9sQiapIZcXnDtqMA 1 1 7 6 35.5kb 35.5kb +yellow open corpgroupindex_v2 AYxVtFAEQ02BsJdahYYvlA 1 1 2 1 13.3kb 13.3kb +yellow open dataset_datasetusagestatisticsaspect_v1 WqPpDCKZRLaMIcYAAkS_1Q 1 1 0 0 208b 208b +``` + +## How can I check if data has been loaded into MySQL properly? + +Once the mysql container is up and running, you should be able to connect to it directly on `localhost:3306` using tools such as [MySQL Workbench](https://www.mysql.com/products/workbench/). You can also run the following command to invoke [MySQL Command-Line Client](https://dev.mysql.com/doc/refman/8.0/en/mysql.html) inside the mysql container. + +``` +docker exec -it mysql /usr/bin/mysql datahub --user=datahub --password=datahub +``` + +Inspect the content of `metadata_aspect_v2` table, which contains the ingested aspects for all entities. + +## Getting error while starting Docker containers + +There can be different reasons why a container fails during initialization. Below are the most common reasons: + +### `bind: address already in use` + +This error means that the network port (which is supposed to be used by the failed container) is already in use on your system. 
You need to find and kill the process which is using this specific port before starting the corresponding Docker container. If you don't want to kill the process which is using that port, another option is to change the port number for the Docker container. You need to find and change the [ports](https://docs.docker.com/compose/compose-file/#ports) parameter for the specific Docker container in the `docker-compose.yml` configuration file. + +``` +Example : On macOS + +ERROR: for mysql Cannot start service mysql: driver failed programming external connectivity on endpoint mysql (5abc99513affe527299514cea433503c6ead9e2423eeb09f127f87e2045db2ca): Error starting userland proxy: listen tcp 0.0.0.0:3306: bind: address already in use + + 1) sudo lsof -i :3306 + 2) kill -15 +``` + +### `OCI runtime create failed` + +If you see an error message like below, please make sure to git update your local repo to HEAD. + +``` +ERROR: for datahub-mae-consumer Cannot start service datahub-mae-consumer: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown +``` + +### `failed to register layer: devmapper: Unknown device` + +This most means that you're out of disk space (see [#1879](https://github.com/datahub-project/datahub/issues/1879)). + +### `ERROR: for kafka-rest-proxy Get https://registry-1.docker.io/v2/confluentinc/cp-kafka-rest/manifests/5.4.0: EOF` + +This is most likely a transient issue with [Docker Registry](https://docs.docker.com/registry/). Retry again later. + +## toomanyrequests: too many failed login attempts for username or IP address + +Try the following + +```bash +rm ~/.docker/config.json +docker login +``` + +More discussions on the same issue https://github.com/docker/hub-feedback/issues/1250 + +## Seeing `Table 'datahub.metadata_aspect' doesn't exist` error when logging in + +This means the database wasn't properly initialized as part of the quickstart processs (see [#1816](https://github.com/datahub-project/datahub/issues/1816)). Please run the following command to manually initialize it. + +``` +docker exec -i mysql sh -c 'exec mysql datahub -udatahub -pdatahub' < docker/mysql/init.sql +``` + +## I've messed up my docker setup. How do I start from scratch? + +Run the following script to remove all the containers and volumes created during the quickstart tutorial. Note that you'll also lose all the data as a result. + +``` +datahub docker nuke +``` + +## I'm seeing exceptions in DataHub GMS container like "Caused by: java.lang.IllegalStateException: Duplicate key com.linkedin.metadata.entity.ebean.EbeanAspectV2@dd26e011". What do I do? + +This is related to a SQL column collation issue. The default collation we previously used (prior to Oct 26, 2021) for URN fields was case-insensitive (utf8mb4_unicode_ci). We've recently moved +to deploying with a case-sensitive collation (utf8mb4_bin) by default. In order to update a deployment that was started before Oct 26, 2021 (v0.8.16 and below) to have the new collation, you must run this command against your SQL DB directly: + +``` +ALTER TABLE metadata_aspect_v2 CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin; +``` + +## I've modified the default user.props file to include a custom username and password, but I don't see the new user(s) inside the Users & Groups tab. Why not? + +Currently, `user.props` is a file used by the JAAS PropertyFileLoginModule solely for the purpose of **Authentication**. 
The file is not used as an source from which to +ingest additional metadata about the user. For that, you'll need to ingest some custom information about your new user using the Rest.li APIs or the [File-based ingestion source](../generated/ingestion/sources/file.md). + +For an example of a file that ingests user information, check out [single_mce.json](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/mce_files/single_mce.json), which ingests a single user object into DataHub. Notice that the "urn" field provided +will need to align with the custom username you've provided in user.props file. For example, if your user.props file contains: + +``` +my-custom-user:my-custom-password +``` + +You'll need to ingest some metadata of the following form to see it inside the DataHub UI: + +``` +{ + "auditHeader": null, + "proposedSnapshot": { + "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": { + "urn": "urn:li:corpuser:my-custom-user", + "aspects": [ + { + "com.linkedin.pegasus2avro.identity.CorpUserInfo": { + "active": true, + "displayName": { + "string": "The name of the custom user" + }, + "email": "my-custom-user-email@example.io", + "title": { + "string": "Engineer" + }, + "managerUrn": null, + "departmentId": null, + "departmentName": null, + "firstName": null, + "lastName": null, + "fullName": { + "string": "My Custom User" + }, + "countryCode": null + } + } + ] + } + }, + "proposedDelta": null +} +``` + +## I've configured OIDC, but I cannot login. I get continuously redirected. What do I do? + +Sorry to hear that! + +This phenomena may be due to the size of a Cookie DataHub uses to authenticate its users. If it's too large ( > 4096), then you'll see this behavior. The cookie embeds an encoded version of the information returned by your OIDC Identity Provider - if they return a lot of information, this can be the root cause. + +One solution is to use Play Cache to persist this session information for a user. This means the attributes about the user (and their session info) will be stored in an in-memory store in the `datahub-frontend` service, instead of a browser-side cookie. + +To configure the Play Cache session store, you can set the env variable "PAC4J_SESSIONSTORE_PROVIDER" as "PlayCacheSessionStore" for the `datahub-frontend` container. + +Do note that there are downsides to using the Play Cache. Specifically, it will make `datahub-frontend` a stateful server. If you have multiple instances of `datahub-frontend` deployed, you'll need to ensure that the same user is deterministically routed to the same service container (since the sessions are stored in memory). If you're using a single instance of `datahub-frontend` (the default), then things should "just work". + +For more details, please refer to https://github.com/datahub-project/datahub/pull/5114 diff --git a/docs-website/versioned_docs/version-0.10.4/docs/ui-ingestion.md b/docs-website/versioned_docs/version-0.10.4/docs/ui-ingestion.md new file mode 100644 index 0000000000000..6592e54046d3b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/ui-ingestion.md @@ -0,0 +1,270 @@ +--- +title: UI Ingestion Guide +slug: /ui-ingestion +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/ui-ingestion.md" +--- + +# UI Ingestion Guide + +## Introduction + +Starting in version `0.8.25`, DataHub supports creating, configuring, scheduling, & executing batch metadata ingestion using the DataHub user interface. 
This makes +getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines. + +This document will describe the steps required to configure, schedule, and execute metadata ingestion inside the UI. + +## Running Metadata Ingestion + +### Prerequisites + +To view & manage UI-based metadata ingestion, you must have the `Manage Metadata Ingestion` & `Manage Secrets` +privileges assigned to your account. These can be granted by a [Platform Policy](authorization/policies.md). + +

+ +

+ +Once you have these privileges, you can begin to manage ingestion by navigating to the 'Ingestion' tab in DataHub. + +

+ +

+ +On this page, you'll see a list of active **Ingestion Sources**. An Ingestion Source is a unique source of metadata ingested +into DataHub from an external source like Snowflake, Redshift, or BigQuery. + +If you're just getting started, you won't have any sources. In the following sections, we'll describe how to create +your first **Ingestion Source**. + +### Creating an Ingestion Source + +Before ingesting any metadata, you need to create a new Ingestion Source. Start by clicking **+ Create new source**. + +

+ +

+ +#### Step 1: Select a Platform Template + +In the first step, select a **Recipe Template** corresponding to the source type that you'd like to extract metadata from. Choose among +a variety of natively supported integrations, from Snowflake to Postgres to Kafka. +Select `Custom` to construct an ingestion recipe from scratch. + +

+ +

+ +Next, you'll configure an ingestion **Recipe**, which defines _how_ and _what_ to extract from the source system. + +#### Step 2: Configure a Recipe + +Next, you'll define an ingestion **Recipe** in [YAML](https://yaml.org/). A [Recipe](/docs/metadata-ingestion/#recipes) is a set of configurations which is +used by DataHub to extract metadata from a 3rd party system. It most often consists of the following parts: + +1. A source **type**: The type of system you'd like to extract metadata from (e.g. snowflake, mysql, postgres). If you've chosen a native template, this will already be populated for you. + To view a full list of currently supported **types**, check out [this list](/docs/metadata-ingestion/#installing-plugins). + +2. A source **config**: A set of configurations specific to the source **type**. Most sources support the following types of configuration values: + + - **Coordinates**: The location of the system you want to extract metadata from + - **Credentials**: Authorized credentials for accessing the system you want to extract metadata from + - **Customizations**: Customizations regarding the metadata that will be extracted, e.g. which databases or tables to scan in a relational DB + +3. A sink **type**: A type of sink to route the metadata extracted from the source type. The officially supported DataHub sink + types are `datahub-rest` and `datahub-kafka`. + +4. A sink **config**: Configuration required to send metadata to the provided sink type. For example, DataHub coordinates and credentials. + +A sample of a full recipe configured to ingest metadata from MySQL can be found in the image below. + +

+ +

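If the image above is hard to read, the sketch below shows roughly what such a MySQL recipe looks like in text form; host, database, and credentials are placeholders. The same YAML can also be executed outside the UI with the DataHub CLI:

```bash
# Hypothetical MySQL recipe mirroring the screenshot above; adjust values for your database.
cat > mysql_recipe.yml <<'EOF'
source:
  type: mysql
  config:
    host_port: "localhost:3306"
    database: my_db
    username: my_user
    password: my_password
    include_tables: true
    include_views: true
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
EOF

# Outside the UI, the equivalent ingestion can be triggered with the CLI:
datahub ingest -c mysql_recipe.yml
```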
+ +Detailed configuration examples & documentation for each source type can be found on the [DataHub Docs](/docs/metadata-ingestion/) website. + +##### Creating a Secret + +For production use cases, sensitive configuration values, such as database usernames and passwords, +should be hidden from plain view within your ingestion recipe. To accomplish this, you can create & embed **Secrets**. Secrets are named values +that are encrypted and stored within DataHub's storage layer. + +To create a secret, first navigate to the 'Secrets' tab. Then click `+ Create new secret`. + +

+ +

+ +_Creating a Secret to store the username for a MySQL database_ + +Inside the form, provide a unique name for the secret along with the value to be encrypted, and an optional description. Click **Create** when you are done. +This will create a Secret which can be referenced inside your ingestion recipe using its name. + +##### Referencing a Secret + +Once a Secret has been created, it can be referenced from within your **Recipe** using variable substitution. For example, +to substitute secrets for a MySQL username and password into a Recipe, your Recipe would be defined as follows: + +```yaml +source: + type: mysql + config: + host_port: "localhost:3306" + database: my_db + username: ${MYSQL_USERNAME} + password: ${MYSQL_PASSWORD} + include_tables: true + include_views: true + profiling: + enabled: true +sink: + type: datahub-rest + config: + server: "http://datahub-gms:8080" +``` + +_Referencing DataHub Secrets from a Recipe definition_ + +When the Ingestion Source with this Recipe executes, DataHub will attempt to 'resolve' Secrets found within the YAML. If a secret can be resolved, the reference is substituted for its decrypted value prior to execution. +Secret values are not persisted to disk beyond execution time, and are never transmitted outside DataHub. + +> **Attention**: Any DataHub users who have been granted the `Manage Secrets` [Platform Privilege](authorization/policies.md) will be able to retrieve plaintext secret values using the GraphQL API. + +#### Step 3: Schedule Execution + +Next, you can optionally configure a schedule on which to execute your new Ingestion Source. This enables you to schedule metadata extraction on a monthly, weekly, daily, or hourly cadence depending on the needs of your organization. +Schedules are defined using CRON format. + +

+ +

+ +_An Ingestion Source that is executed at 9:15am every day, Los Angeles time_ + +To learn more about the CRON scheduling format, check out the [Wikipedia](https://en.wikipedia.org/wiki/Cron) overview. + +If you plan to execute ingestion on an ad-hoc basis, you can click **Skip** to skip the scheduling step entirely. Don't worry - +you can always come back and change this. + +#### Step 4: Finishing Up + +Finally, give your Ingestion Source a name. + +

+ +

+ +Once you're happy with your configurations, click 'Done' to save your changes. + +##### Advanced: Running with a specific CLI version + +DataHub comes pre-configured to use the latest version of the DataHub CLI ([acryl-datahub](https://pypi.org/project/acryl-datahub/)) that is compatible +with the server. However, you can override the default package version using the 'Advanced' source configurations. + +To do so, simply click 'Advanced', then change the 'CLI Version' text box to contain the exact version +of the DataHub CLI you'd like to use. + +

+ +

+ +_Pinning the CLI version to version `0.8.23.2`_ + +Once you're happy with your changes, simply click 'Done' to save. + +### Running an Ingestion Source + +Once you've created your Ingestion Source, you can run it by clicking 'Execute'. Shortly after, +you should see the 'Last Status' column of the ingestion source change from `N/A` to `Running`. This +means that the request to execute ingestion has been successfully picked up by the DataHub ingestion executor. + +

+ +

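While a run is in the `Running` state, you can also follow the executor's logs from the command line. This is a sketch for Docker-based deployments; the container name may differ in your setup:

```bash
# Locate the ingestion executor container
docker ps --filter "name=datahub-actions"

# Stream its logs while the run is in progress
docker logs -f $(docker ps -q --filter "name=datahub-actions")
```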
+ +If ingestion has executed successfully, you should see its state shown in green as `Succeeded`. + +

+ +

+ +### Cancelling an Ingestion Run + +If your ingestion run is hanging, there may be a bug in the ingestion source, or another persistent issue like exponential timeouts. In these situations, +you can cancel ingestion by clicking **Cancel** on the problematic run. + +

+ +

+ +Once cancelled, you can view the output of the ingestion run by clicking **Details**. + +### Debugging a Failed Ingestion Run + +

+ +

+ +A variety of things can cause an ingestion run to fail. Common reasons for failure include: + +1. **Recipe Misconfiguration**: A recipe has not provided the required or expected configurations for the ingestion source. You can refer + to the [Metadata Ingestion Framework](/docs/metadata-ingestion) source docs to learn more about the configurations required for your source type. +2. **Failure to resolve Secrets**: If DataHub is unable to find secrets that were referenced by your Recipe configuration, the ingestion run will fail. + Verify that the names of the secrets referenced in your recipe match those which have been created. +3. **Connectivity / Network Reachability**: If DataHub is unable to reach a data source, for example due to DNS resolution + failures, metadata ingestion will fail. Ensure that the network where DataHub is deployed has access to the data source which + you are trying to reach. +4. **Authentication**: If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token +in your Recipe Configuration. To do this, set the 'token' field of the sink configuration to contain a Personal Access Token: +

+ +

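In text form, the relevant sink section looks roughly like the sketch below; the server URL is the quickstart default and the token value is a placeholder for your own Personal Access Token:

```bash
# Hypothetical sink section for a deployment with Metadata Service Authentication enabled.
cat >> recipe.yml <<'EOF'
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"
    token: "<your-personal-access-token>"
EOF
```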
+ +The output of each run is captured and available to view in the UI for easier debugging. To view output logs, click **DETAILS** +on the corresponding ingestion run. + +

+ +

+ +## FAQ + +### I tried to ingest metadata after running 'datahub docker quickstart', but ingestion is failing with 'Failed to Connect' errors. What do I do? + +If not due to one of the reasons outlined above, this may be because the executor running ingestion is unable +to reach DataHub's backend using the default configurations. Try changing your ingestion recipe to make the `sink.config.server` variable point to the Docker +DNS name for the `datahub-gms` pod: + +

+ +

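Concretely, inside the quickstart Docker network the metadata service is reachable as `datahub-gms:8080` rather than `localhost:8080`. A hedged sketch of patching an exported recipe file from the shell (GNU sed shown); editing the recipe in the UI editor has the same effect:

```bash
# Point the recipe's sink at the datahub-gms container instead of localhost
sed -i 's|http://localhost:8080|http://datahub-gms:8080|' recipe.yml
```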
+ +### I see 'N/A' when I try to run ingestion. What do I do? + +If you see 'N/A', and the ingestion run state never changes to 'Running', this may mean +that your executor (`datahub-actions`) container is down. + +This container is responsible for executing requests to run ingestion when they come in, either +on demand or on a particular schedule. You can verify the health of the container using `docker ps`. Moreover, you can inspect the container logs by finding the container id +for the `datahub-actions` container and running `docker logs <container_id>`. + +### When should I NOT use UI Ingestion? + +There are valid cases for ingesting metadata without the UI-based ingestion scheduler. For example: + +- You have written a custom ingestion Source +- Your data sources are not reachable on the network where DataHub is deployed +- Your ingestion source requires context from a local filesystem (e.g. input files, environment variables, etc.) +- You want to distribute metadata ingestion among multiple producers / environments + +### How do I attach policies to the actions pod to give it permissions to pull metadata from various sources? + +This varies across the underlying platform. For AWS, please refer to this [guide](./deploy/aws.md#iam-policies-for-ui-based-ingestion). + +## Demo + +Click [here](https://www.youtube.com/watch?v=EyMyLcaw_74) to see a full demo of the UI Ingestion feature. + +## Feedback / Questions / Concerns + +We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email)! diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what-is-datahub/datahub-concepts.md b/docs-website/versioned_docs/version-0.10.4/docs/what-is-datahub/datahub-concepts.md new file mode 100644 index 0000000000000..ab97137603f40 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what-is-datahub/datahub-concepts.md @@ -0,0 +1,214 @@ +--- +title: DataHub Concepts +sidebar_label: Concepts +slug: /what-is-datahub/datahub-concepts +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/docs/what-is-datahub/datahub-concepts.md +--- + +# DataHub Concepts + +Explore key concepts of DataHub to take full advantage of its capabilities in managing your data. + +## General Concepts + +### URN (Uniform Resource Name) + +URN (Uniform Resource Name) is the chosen scheme of URI to uniquely define any resource in DataHub. It has the following form: + +``` +urn:<Namespace>:<Entity Type>:<Instance Id> +``` + +Examples include `urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)` and `urn:li:corpuser:jdoe`. + +> - [What is URN?](/docs/what/urn.md) + +### Policy + +Access policies in DataHub define who can do what to which resources. + +> - [Authorization: Policies Guide](/docs/authorization/policies.md) +> - [Developer Guides: DataHubPolicy](/docs/generated/metamodel/entities/dataHubPolicy.md) +> - [Feature Guides: About DataHub Access Policies](/docs/authorization/access-policies-guide.md) + +### Role + +DataHub provides the ability to use Roles to manage permissions. + +> - [Authorization: About DataHub Roles](/docs/authorization/roles.md) +> - [Developer Guides: DataHubRole](/docs/generated/metamodel/entities/dataHubRole.md) + +### Access Token (Personal Access Token) + +Personal Access Tokens, or PATs for short, allow users to represent themselves in code and programmatically use DataHub's APIs in deployments where security is a concern.
+Used along-side with [authentication-enabled metadata service](/docs/authentication/introducing-metadata-service-authentication.md), PATs add a layer of protection to DataHub where only authorized users are able to perform actions in an automated way. + +> - [Authentication: About DataHub Personal Access Tokens](/docs/authentication/personal-access-tokens.md) +> - [Developer Guides: DataHubAccessToken](/docs/generated/metamodel/entities/dataHubAccessToken.md) + +### View + +Views allow you to save and share sets of filters for reuse when browsing DataHub. A view can either be public or personal. + +> - [DataHubView](/docs/generated/metamodel/entities/dataHubView.md) + +### Deprecation + +Deprecation is an aspect that indicates the deprecation status of an entity. Typically it is expressed as a Boolean value. + +> - [Deprecation of a dataset](/docs/generated/metamodel/entities/dataset.md#deprecation) + +### Ingestion Source + +Ingestion sources refer to the data systems that we are extracting metadata from. For example, we have sources for BigQuery, Looker, Tableau and many others. + +> - [Sources](/metadata-ingestion/README.md#sources) +> - [DataHub Integrations](/integrations) + +### Container + +A container of related data assets. + +> - [Developer Guides: Container](/docs/generated/metamodel/entities/container.md) + +### Data Platform + +Data Platforms are systems or tools that contain Datasets, Dashboards, Charts, and all other kinds of data assets modeled in the metadata graph. + +
+List of Data Platforms + + +- Azure Data Lake (Gen 1) +- Azure Data Lake (Gen 2) +- Airflow +- Ambry +- ClickHouse +- Couchbase +- External Source +- HDFS +- SAP HANA +- Hive +- Iceberg +- AWS S3 +- Kafka +- Kafka Connect +- Kusto +- Mode +- MongoDB +- MySQL +- MariaDB +- OpenAPI +- Oracle +- Pinot +- PostgreSQL +- Presto +- Tableau +- Vertica + +Reference : [data_platforms.json](https://github.com/acryldata/datahub-fork/blob/acryl-main/metadata-service/war/src/main/resources/boot/data_platforms.json) + +
+ +> - [Developer Guides: Data Platform](/docs/generated/metamodel/entities/dataPlatform.md) + +### Dataset + +Datasets represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift, etc.), Streams in a stream-processing environment (Kafka, Pulsar, etc.), or bundles of data found as Files or Folders in data lake systems (S3, ADLS, etc.). + +> - [Developer Guides: Dataset](/docs/generated/metamodel/entities/dataset.md) + +### Chart + +A single data visualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart. + +> - [Developer Guides: Chart](/docs/generated/metamodel/entities/chart.md) + +### Dashboard + +A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard. + +> - [Developer Guides: Dashboard](/docs/generated/metamodel/entities/dashboard.md) + +### Data Job + +An executable job that processes data assets, where "processing" implies consuming data, producing data, or both. +In orchestration systems, this is sometimes referred to as an individual "Task" within a "DAG". Examples include an Airflow Task. + +> - [Developer Guides: Data Job](/docs/generated/metamodel/entities/dataJob.md) + +### Data Flow + +An executable collection of Data Jobs with dependencies among them, or a DAG. +Sometimes referred to as a "Pipeline". Examples include an Airflow DAG. + +> - [Developer Guides: Data Flow](/docs/generated/metamodel/entities/dataFlow.md) + +### Glossary Term + +Shared vocabulary within the data ecosystem. + +> - [Feature Guides: Glossary](/docs/glossary/business-glossary.md) +> - [Developer Guides: GlossaryTerm](/docs/generated/metamodel/entities/glossaryTerm.md) + +### Glossary Term Group + +Glossary Term Group is similar to a folder, containing Terms and even other Term Groups to allow for a nested structure. + +> - [Feature Guides: Term & Term Group](/docs/glossary/business-glossary.md#terms--term-groups) + +### Tag + +Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary. + +> - [Feature Guides: About DataHub Tags](/docs/tags.md) +> - [Developer Guides: Tags](/docs/generated/metamodel/entities/tag.md) + +### Domain + +Domains are curated, top-level folders or categories where related assets can be explicitly grouped. + +> - [Feature Guides: About DataHub Domains](/docs/domains.md) +> - [Developer Guides: Domain](/docs/generated/metamodel/entities/domain.md) + +### Owner + +Owner refers to the users or groups that have ownership rights over an entity. For example, ownership can be assigned to a dataset or to a specific column of a dataset. + +> - [Getting Started: Adding Owners On Datasets/Columns](/docs/api/tutorials/owners.md#add-owners) + +### Users (CorpUser) + +CorpUser represents an identity of a person (or an account) in the enterprise. + +> - [Developer Guides: CorpUser](/docs/generated/metamodel/entities/corpuser.md) + +### Groups (CorpGroup) + +CorpGroup represents an identity of a group of users in the enterprise.
+ +> - [Developer Guides: CorpGroup](/docs/generated/metamodel/entities/corpGroup.md) + +## Metadata Model + +### Entity + +An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity. + +> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md) + +### Aspect + +An aspect is a collection of attributes that describes a particular facet of an entity. +Aspects can be shared across entities, for example, "Ownership" is an aspect that is re-used across all the Entities that have owners. + +> - [What is a metadata aspect?](/docs/what/aspect.md) +> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md) + +### Relationships + +A relationship represents a named edge between two entities. They are declared via foreign key attributes within Aspects along with a custom annotation (@Relationship). + +> - [What is a relationship?](/docs/what/relationship.md) +> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md) diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/aspect.md b/docs-website/versioned_docs/version-0.10.4/docs/what/aspect.md new file mode 100644 index 0000000000000..4e5c86f35843a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/aspect.md @@ -0,0 +1,52 @@ +--- +title: What is a metadata aspect? +slug: /what/aspect +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/aspect.md" +--- + +# What is a metadata aspect? + +A metadata aspect is a structured document, or more precisely a `record` in [PDL](https://linkedin.github.io/rest.li/pdl_schema), +that represents a specific kind of metadata (e.g. ownership, schema, statistics, upstreams). +A metadata aspect on its own has no meaning (e.g. ownership for what?) and must be associated with a particular entity (e.g. ownership for PageViewEvent). +We purposely chose not to impose any model requirements on metadata aspects, as each aspect is expected to differ significantly. + +Metadata aspects are immutable by design, i.e. every change to a particular aspect results in a [new version](../advanced/aspect-versioning.md) being created. +An optional retention policy can be applied such that X number of most recent versions will be retained after each update. +Setting X to 1 effectively means the metadata aspect is non-versioned. +It is also possible to apply the retention based on time, e.g. only keeping the metadata changes from the past 30 days. + +While a metadata aspect can be an arbitrarily complex document with multiple levels of nesting, it is sometimes desirable to break a monolithic aspect into smaller independent aspects. +This will provide the benefits of: + +1. **Faster read/write**: As metadata aspects are immutable, every "update" will lead to writing the entire large aspect back to the underlying data store. + Likewise, readers will need to retrieve the entire aspect even if they’re only interested in a small part of it. +2. **Ability to independently version different aspects**: For example, one may like to get the change history of all the "ownership metadata" independent of the changes made to "schema metadata" for a dataset. +3. **Help with rest.li endpoint modeling**: While it’s not required to have 1:1 mapping between rest.li endpoints and metadata aspects, + it’d follow this pattern naturally, which means one will end up with smaller, more modular endpoints instead of giant ones. + +Here’s an example metadata aspect.
Note that the `admin` and `members` fields are implicitly conveying a relationship between `Group` entity & `User` entity. +It’s very natural to save such relationships as URNs in a metadata aspect. +The [relationship](relationship.md) section explains how this relationship can be explicitly extracted and modelled. + +``` +namespace com.linkedin.group + +import com.linkedin.common.AuditStamp +import com.linkedin.common.CorpuserUrn + +/** + * The membership metadata for a group + */ +record Membership { + + /** Audit stamp for the last change */ + auditStamp: AuditStamp + + /** Admin of the group */ + admin: CorpuserUrn + + /** Members of the group, ordered in descending importance */ + members: array[CorpuserUrn] +} +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/delta.md b/docs-website/versioned_docs/version-0.10.4/docs/what/delta.md new file mode 100644 index 0000000000000..694ba30a3dfe6 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/delta.md @@ -0,0 +1,67 @@ +--- +title: What is a metadata delta? +slug: /what/delta +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/delta.md" +--- + +# What is a metadata delta? + +Rest.li supports [partial update](https://linkedin.github.io/rest.li/user_guide/restli_server#partial_update) natively without needing explicitly defined models. +However, the granularity of update is always limited to each field in a PDL model. +There are cases where the update need to happen at an even finer grain, e.g. adding or removing items from an array. + +To this end, we’re proposing the following entity-specific metadata delta model that allows atomic partial updates at any desired granularity. +Note that: + +1. Just like metadata [aspects](aspect.md), we’re not imposing any limit on the partial update model, as long as it’s a valid PDL record. + This is because the rest.li endpoint will have the logic that performs the corresponding partial update based on the information in the model. + That said, it’s common to have fields that denote the list of items to be added or removed (e.g. `membersToAdd` & `membersToRemove` from below) +2. Similar to metadata [snapshots](snapshot.md), entity that supports metadata delta will add an entity-specific metadata delta + (e.g. `GroupDelta` from below) that unions all supported partial update models. +3. The entity-specific metadata delta is then added to the global `Delta` typeref, which is added as part of [Metadata Change Event](mxe.md#metadata-change-event-mce) and used during [Metadata Ingestion](../architecture/metadata-ingestion.md). + +``` +namespace com.linkedin.group + +import com.linkedin.common.CorpuserUrn + +/** + * A metadata delta for a specific group entity + */ +record MembershipPartialUpdate { + + /** List of members to be added to the group */ + membersToAdd: array[CorpuserUrn] + + /** List of members to be removed from the group */ + membersToRemove: array[CorpuserUrn] +} +``` + +``` +namespace com.linkedin.metadata.delta + +import com.linkedin.common.CorpGroupUrn +import com.linkedin.group.MembershipPartialUpdate + +/** + * A metadata delta for a specific group entity + */ +record GroupDelta { + + /** URN for the entity the metadata delta is associated with */ + urn: CorpGroupUrn + + /** The specific type of metadata delta to apply */ + delta: union[MembershipPartialUpdate] +} +``` + +``` +namespace com.linkedin.metadata.delta + +/** + * A union of all supported metadata delta types. 
+ */ +typeref Delta = union[GroupDelta] +``` diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/entity.md b/docs-website/versioned_docs/version-0.10.4/docs/what/entity.md new file mode 100644 index 0000000000000..f81895a37df24 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/entity.md @@ -0,0 +1,10 @@ +--- +title: Entities +slug: /what/entity +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/entity.md" +--- + +# Entities + +This page has been moved. Please refer to [The Metadata Model](../modeling/extending-the-metadata-model.md) for details on +the metadata model. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/gma.md b/docs-website/versioned_docs/version-0.10.4/docs/what/gma.md new file mode 100644 index 0000000000000..82decaaf8fd1d --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/gma.md @@ -0,0 +1,19 @@ +--- +title: What is Generalized Metadata Architecture (GMA)? +slug: /what/gma +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/gma.md" +--- + +# What is Generalized Metadata Architecture (GMA)? + +GMA is the backend infrastructure for DataHub. Unlike existing architectures, GMA leverages multiple storage technologies to efficiently service the four most commonly used query patterns + +- Document-oriented CRUD +- Complex queries (including joining distributed tables) +- Graph traversal +- Fulltext search and autocomplete + +GMA also embraces a distributed model, where each team owns, develops and operates their own metadata services (known as [GMS](gms.md)), while the metadata are automatically aggregated to populate the central [metadata graph](graph.md) and [search indexes](search-index.md). This is made possible by standardizing the metadata models and the access layer. + +We strongly believe that GMA can bring tremendous leverage to any team that has a need to store and access metadata. +Moreover, standardizing metadata modeling promotes a model-first approach to developments, resulting in a more concise, consistent, and highly connected metadata ecosystem that benefits all DataHub users. diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/gms.md b/docs-website/versioned_docs/version-0.10.4/docs/what/gms.md new file mode 100644 index 0000000000000..de95ddd8cd06b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/gms.md @@ -0,0 +1,13 @@ +--- +title: What is Generalized Metadata Service (GMS)? +slug: /what/gms +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/gms.md" +--- + +# What is Generalized Metadata Service (GMS)? + +Metadata for [entities](entity.md) [onboarded](../modeling/metadata-model.md) to [GMA](gma.md) is served through microservices known as Generalized Metadata Service (GMS). GMS typically provides a [Rest.li](http://rest.li) API and must access the metadata using [GMA DAOs](../architecture/metadata-serving.md). + +While a GMS is completely free to define its public APIs, we do provide a list of [resource base classes](https://github.com/datahub-project/datahub-gma/tree/master/restli-resources/src/main/java/com/linkedin/metadata/restli) to leverage for common patterns. + +GMA is designed to support a distributed fleet of GMS, each serving a subset of the [GMA graph](graph.md). However, for simplicity we include a single centralized GMS ([datahub-gms](https://github.com/datahub-project/datahub/blob/master/gms)) that serves all entities. 
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/graph.md b/docs-website/versioned_docs/version-0.10.4/docs/what/graph.md new file mode 100644 index 0000000000000..9ec9ecbd5eb69 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/graph.md @@ -0,0 +1,22 @@ +--- +title: What is GMA graph? +slug: /what/graph +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/graph.md" +--- + +# What is GMA graph? + +All the [entities](entity.md) and [relationships](relationship.md) are stored in a graph database, Neo4j. +The graph always represents the current state of the world and has no direct support for versioning or history. +However, as stated in the [Metadata Modeling](../modeling/metadata-model.md) section, +the graph is merely a derived view of all metadata [aspects](aspect.md) and thus can always be rebuilt directly from historic [MAEs](mxe.md#metadata-audit-event-mae). +Consequently, it is possible to build a specific snapshot of the graph in time by replaying MAEs up to that point. + +In theory, the system can work with any generic [OLTP](https://en.wikipedia.org/wiki/Online_transaction_processing) graph DB that supports the following operations: + +- Dynamic creation, modification, and removal of nodes and edges +- Dynamic attachment of key-value properties to each node and edge +- Transactional partial updates of properties of a specific node or edge +- Fast ID-based retrieval of nodes & edges +- Efficient queries involving both graph traversal and property value filtering +- Support for efficient bidirectional graph traversal diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/mxe.md b/docs-website/versioned_docs/version-0.10.4/docs/what/mxe.md new file mode 100644 index 0000000000000..7e0433213941b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/docs/what/mxe.md @@ -0,0 +1,432 @@ +--- +title: Metadata Events +slug: /what/mxe +custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/mxe.md" +--- + +# Metadata Events + +DataHub makes use of a few important Kafka events for its operation. The most notable of these include: + +1. Metadata Change Proposal +2. Metadata Change Log (Versioned + Timeseries) +3. Platform Event + +Each event is originally authored using [PDL](https://linkedin.github.io/rest.li/pdl_schema), a modeling language developed by LinkedIn, and +then converted into their Avro equivalents, which are used when writing the events to and reading them from Kafka. + +In this document, we'll describe each of these events in detail - including notes about their structure & semantics. + +## Metadata Change Proposal (MCP) + +A Metadata Change Proposal represents a request to change a specific [aspect](aspect.md) on an enterprise's Metadata +Graph. Each MCP provides a new value for a given aspect. For example, a single MCP can +be emitted to change ownership or documentation or domains or deprecation status for a data asset. + +### Emission + +MCPs may be emitted by clients of DataHub's low-level ingestion APIs (e.g. ingestion sources) +during the process of metadata ingestion. The DataHub Python API exposes an interface for +easily sending MCPs into DataHub. + +The default Kafka topic name for MCPs is `MetadataChangeProposal_v1`. + +### Consumption + +DataHub's storage layer actively listens for new Metadata Change Proposals and attempts +to apply the requested change to the Metadata Graph.
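As a concrete, hedged illustration of the emission path: a single aspect change can be proposed from the command line. The sketch below assumes the `datahub put` command available in recent CLI versions and a local JSON file holding the aspect value, similar to the `aspect.value` payloads shown in the examples further down:

```bash
# Hypothetical example: propose a new 'ownership' aspect for a dataset.
# ownership.json contains the JSON form of the aspect (see the example payloads below).
datahub put \
  --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" \
  --aspect ownership \
  -d ownership.json
```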
+ +### Schema + +| Name | Type | Description | Optional | +| ------------------ | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | +| entityUrn | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | False | +| entityType | String | The type of the entity the new aspect is associated with. This corresponds to the entity name in the DataHub Entity Registry, for example 'dataset'. | False | +| entityKeyAspect | Object | The key struct of the entity that was changed. Only present if the Metadata Change Proposal contained the raw key struct. | True | +| changeType | String | The change type. CREATE, UPSERT and DELETE are currently supported. | False | +| aspectName | String | The entity aspect which was changed. | False | +| aspect | Object | The new aspect value. Null if the aspect was deleted. | True | +| aspect.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json`. | False | +| aspect.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| systemMetadata | Object | The new system metadata. This includes the the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | + +The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeProposal.pdl). + +### Examples + +An MCP representing a request to update the 'ownership' aspect for a particular Dataset: + +```json +{ + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "changeType": "UPSERT", + "aspectName": "ownership", + "aspect": { + "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516640493, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + } +} +``` + +Note how the aspect payload is serialized as JSON inside the "value" field. The exact structure +of the aspect is determined by its PDL schema. (For example, the [ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) schema) + +## Metadata Change Log (MCL) + +A Metadata Change Log represents _any_ change which has been made to the Metadata Graph. +Metadata Change Log events are emitted to Kafka immediately after writing the change +the durable storage. + +There are 2 flavors of Metadata Change Log: _versioned_ and _timeseries_. These correspond to the type +of aspects which were updated for a given change. **Versioned** aspects are those +which represent the "latest" state of some attributes, for example the most recent owners of an asset +or its documentation. 
**Timeseries** aspects are those which represent events related to an asset +that occurred at a particular time, for example profiling of a Dataset. + +### Emission + +MCLs are emitted when _any_ change is made to an entity on the DataHub Metadata Graph, this includes +writing to any aspect of an entity. + +Two distinct topics are maintained for Metadata Change Log. The default Kafka topic name for **versioned** aspects is `MetadataChangeLog_Versioned_v1` and for +**timeseries** aspects is `MetadataChangeLog_Timeseries_v1`. + +### Consumption + +DataHub ships with a Kafka Consumer Job (mae-consumer-job) which listens for MCLs and uses them to update DataHub's search and graph indices, +as well as to generate derived Platform Events (described below). + +In addition, the [Actions Framework](../actions/README.md) consumes Metadata Change Logs to power its [Metadata Change Log](../actions/events/metadata-change-log-event.md) event API. + +### Schema + +| Name | Type | Description | Optional | +| ------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------- | +| entityUrn | String | The unique identifier for the Entity being changed. For example, a Dataset's urn. | False | +| entityType | String | The type of the entity the new aspect is associated with. This corresponds to the entity name in the DataHub Entity Registry, for example 'dataset'. | False | +| entityKeyAspect | Object | The key struct of the entity that was changed. Only present if the Metadata Change Proposal contained the raw key struct. | True | +| changeType | String | The change type. CREATE, UPSERT and DELETE are currently supported. | False | +| aspectName | String | The entity aspect which was changed. | False | +| aspect | Object | The new aspect value. Null if the aspect was deleted. | True | +| aspect.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json`. | False | +| aspect.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| previousAspectValue | Object | The previous aspect value. Null if the aspect did not exist previously. | True | +| previousAspectValue.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json` | False | +| previousAspectValue.value | String | The serialized aspect. This is a JSON-serialized representing the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | +| systemMetadata | Object | The new system metadata. This includes the the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | +| previousSystemMetadata | Object | The previous system metadata. This includes the the ingestion run-id, model registry and more. 
For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl | True | +| created | Object | Audit stamp about who triggered the Metadata Change and when. | False | +| created.time | Number | The timestamp in milliseconds when the aspect change occurred. | False | +| created.actor | String | The URN of the actor (e.g. corpuser) that triggered the change. | + +The PDL schema for can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeLog.pdl). + +### Examples + +An MCL corresponding to a change in the 'ownership' aspect for a particular Dataset: + +```json +{ + "entityType": "dataset", + "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)", + "changeType": "UPSERT", + "aspectName": "ownership", + "aspect": { + "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}", + "contentType": "application/json" + }, + "previousAspectValue": { + "value": "{\"owners\":[{\"owner\":\"urn:li:corpuser:jdoe\",\"type\":\"DATAOWNER\"},{\"owner\":\"urn:li:corpuser:datahub\",\"type\":\"DATAOWNER\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:jdoe\",\"time\":1581407189000}}", + "contentType": "application/json" + }, + "systemMetadata": { + "lastObserved": 1651516640493, + "runId": "no-run-id-provided", + "registryName": "unknownRegistry", + "registryVersion": "0.0.0.0-dev", + "properties": null + }, + "previousSystemMetadata": { + "lastObserved": 1651516415088, + "runId": "file-2022_05_02-11_33_35", + "registryName": null, + "registryVersion": null, + "properties": null + }, + "created": { + "time": 1651516640490, + "actor": "urn:li:corpuser:datahub", + "impersonator": null + } +} +``` + +Note how the aspect payload is serialized as JSON inside the "value" field. The exact structure +of the aspect is determined by its PDL schema. (For example, the [ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) schema) + +## Platform Event (PE) + +A Platform Event represents an arbitrary business-logic event emitted by DataHub. Each +Platform Event has a `name` which determines its contents. + +### Types + +- **Entity Change Event** (entityChangeEvent): The most important Platform Event is named **Entity Change Event**, and represents a log of semantic changes + (tag addition, removal, deprecation change, etc) that have occurred on DataHub. It is used an important + component of the DataHub Actions Framework. + +All registered Platform Event types are declared inside the DataHub Entity Registry (`entity-registry.yml`). + +### Emission + +All Platform Events are generated by DataHub itself during normal operation. + +PEs are extremely dynamic - they can contain arbitrary payloads depending on the `name`. Thus, +can be emitted in a variety of circumstances. + +The default Kafka topic name for all Platform Events is `PlatformEvent_v1`. + +### Consumption + +The [Actions Framework](../actions/README.md) consumes Platform Events to power its [Entity Change Event](../actions/events/entity-change-event.md) API. 
+ +### Schema + +| Name | Type | Description | Optional | +| ---------------------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | +| header | Object | Header fields | False | +| header.timestampMillis | Long | The time at which the event was generated. | False | +| name | String | The name / type of the event. | False | +| payload | Object | The event itself. | False | +| payload.contentType | String | The serialization type of the event payload. The only supported value is `application/json`. | False | +| payload.value | String | The serialized payload. This is a JSON-serialized representing the payload document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False | + +The full PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/PlatformEvent.pdl). + +### Examples + +An example of an 'Entity Change Event' Platform Event that is emitted when a new owner is added to a Dataset: + +```json +{ + "header": { + "timestampMillis": 1655390732551 + }, + "name": "entityChangeEvent", + "payload": { + "value": "{\"entityUrn\":\"urn:li:dataset:abc\",\"entityType\":\"dataset\",\"category\":\"OWNER\",\"operation\":\"ADD\",\"modifier\":\"urn:li:corpuser:jdoe\",\"parameters\":{\"ownerUrn\":\"urn:li:corpuser:jdoe\",\"ownerType\":\"BUSINESS_OWNER\"},\"auditStamp\":{\"actor\":\"urn:li:corpuser:jdoe\",\"time\":1649953100653}}", + "contentType": "application/json" +} +``` + +Note how the actual payload for the event is serialized as JSON inside the 'payload' field. The exact +structure of the Platform Event is determined by its PDL schema. (For example, the [Entity Change Event](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/platform/event/v1/EntityChangeEvent.pdl) schema) + +## Failed Metadata Change Proposal (FMCP) + +When a Metadata Change Proposal cannot be processed successfully, the event is written to a [dead letter queue](https://en.wikipedia.org/wiki/Dead_letter_queue) +in an event called Failed Metadata Change Proposal (FMCP). + +The event simply wraps the original Metadata Change Proposal and an error message, which contains the reason for rejection. +This event can be used for debugging any potential ingestion issues, as well as for re-playing any previous rejected proposal if necessary. + +### Emission + +FMCEs are emitted when MCEs cannot be successfully committed to DataHub's storage layer. + +The default Kafka topic name for FMCPs is `FailedMetadataChangeProposal_v1`. + +### Consumption + +No active consumers. + +### Schema + +The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/FailedMetadataChangeProposal.pdl). + +# Deprecated Events + +DataHub ships with a set of deprecated events, which were historically used for proposing and logging +changes to the Metadata Graph. + +Each event in this category was deprecated due to its inflexibility - namely the fact that +the schemas had to be updated when a new aspect was introduced. These events have since been replaced +by the more flexible events described above (Metadata Change Proposal, Metadata Change Log). 
+
+It is not recommended to build dependencies on deprecated events.
+
+## Metadata Change Event (MCE)
+
+A Metadata Change Event represents a request to change multiple aspects for the same entity.
+It leverages the deprecated concept of a `Snapshot`, which is a strongly-typed list of aspects for the same
+entity.
+
+An MCE is a "proposal" for a set of metadata changes, as opposed to an [MAE](#metadata-audit-event), which conveys a committed change.
+Consequently, only successfully accepted and processed MCEs will lead to the emission of corresponding MAEs / MCLs.
+
+### Emission
+
+MCEs may be emitted by clients of DataHub's low-level ingestion APIs (e.g. ingestion sources)
+during the process of metadata ingestion.
+
+The default Kafka topic name for MCEs is `MetadataChangeEvent_v4`.
+
+### Consumption
+
+DataHub's storage layer actively listens for new Metadata Change Events and attempts
+to apply the requested changes to the Metadata Graph.
+
+### Schema
+
+The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeEvent.pdl).
+
+### Examples
+
+An example of an MCE emitted to change the 'ownership' aspect for an Entity:
+
+```json
+{
+  "proposedSnapshot": {
+    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
+      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
+      "aspects": [
+        {
+          "com.linkedin.pegasus2avro.common.Ownership": {
+            "owners": [
+              {
+                "owner": "urn:li:corpuser:jdoe",
+                "type": "DATAOWNER",
+                "source": null
+              },
+              {
+                "owner": "urn:li:corpuser:datahub",
+                "type": "DATAOWNER",
+                "source": null
+              }
+            ],
+            "lastModified": {
+              "time": 1581407189000,
+              "actor": "urn:li:corpuser:jdoe",
+              "impersonator": null
+            }
+          }
+        }
+      ]
+    }
+  }
+}
+```
+
+## Metadata Audit Event (MAE)
+
+A Metadata Audit Event captures changes made to one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md), in the form of a metadata [snapshot](snapshot.md) (deprecated) before the change, and a metadata snapshot after the change.
+
+Every source-of-truth for a particular metadata aspect is expected to emit an MAE whenever a change is committed to that aspect. By ensuring this, any listener of MAEs is able to construct a complete view of the latest state for all aspects.
+Furthermore, because each MAE contains the "after image", any mistake made in emitting an MAE can be easily mitigated by emitting a follow-up MAE with the correction. By the same token, the initial bootstrap problem for any newly added entity can also be solved by emitting an MAE containing all the latest metadata aspects associated with that entity.
+
+### Emission
+
+> Note: In recent versions of DataHub (mid 2022), MAEs are no longer actively emitted, and will soon be completely removed from DataHub.
+> Use Metadata Change Log instead.
+
+MAEs are emitted once any metadata change has been successfully committed into DataHub's storage
+layer.
+
+The default Kafka topic name for MAEs is `MetadataAuditEvent_v4`.
+
+### Consumption
+
+No active consumers.
+
+### Schema
+
+The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataAuditEvent.pdl).
+
+### Examples
+
+An example of an MAE emitted to represent a change made to the 'ownership' aspect for an Entity (owner removed):
+
+```json
+{
+  "oldSnapshot": {
+    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
+      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
+      "aspects": [
+        {
+          "com.linkedin.pegasus2avro.common.Ownership": {
+            "owners": [
+              {
+                "owner": "urn:li:corpuser:jdoe",
+                "type": "DATAOWNER",
+                "source": null
+              },
+              {
+                "owner": "urn:li:corpuser:datahub",
+                "type": "DATAOWNER",
+                "source": null
+              }
+            ],
+            "lastModified": {
+              "time": 1581407189000,
+              "actor": "urn:li:corpuser:jdoe",
+              "impersonator": null
+            }
+          }
+        }
+      ]
+    }
+  },
+  "newSnapshot": {
+    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
+      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
+      "aspects": [
+        {
+          "com.linkedin.pegasus2avro.common.Ownership": {
+            "owners": [
+              {
+                "owner": "urn:li:corpuser:datahub",
+                "type": "DATAOWNER",
+                "source": null
+              }
+            ],
+            "lastModified": {
+              "time": 1581407189000,
+              "actor": "urn:li:corpuser:jdoe",
+              "impersonator": null
+            }
+          }
+        }
+      ]
+    }
+  }
+}
+```
+
+## Failed Metadata Change Event (FMCE)
+
+When a Metadata Change Event cannot be processed successfully, the event is written to a [dead letter queue](https://en.wikipedia.org/wiki/Dead_letter_queue) in an event called Failed Metadata Change Event (FMCE).
+
+The event simply wraps the original Metadata Change Event and an error message, which contains the reason for rejection.
+This event can be used for debugging any potential ingestion issues, as well as for re-playing any previously rejected proposals if necessary.
+
+### Emission
+
+FMCEs are emitted when MCEs cannot be successfully committed to DataHub's storage layer.
+
+### Consumption
+
+No active consumers.
+
+### Schema
+
+The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/FailedMetadataChangeEvent.pdl).
+
+The default Kafka topic name for FMCEs is `FailedMetadataChangeEvent_v4`.
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/relationship.md b/docs-website/versioned_docs/version-0.10.4/docs/what/relationship.md
new file mode 100644
index 0000000000000..e13df313095e3
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/what/relationship.md
@@ -0,0 +1,116 @@
+---
+title: What is a relationship?
+slug: /what/relationship
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/what/relationship.md
+---
+
+# What is a relationship?
+
+A relationship is a named association between exactly two [entities](entity.md), a source and a destination.
+

+ +

+
+From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship.
+Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`.
+This is because the actual metadata aspect holding this information is associated with `Group`, rather than `User`.
+Had the direction been reversed, the relationship would have been named `IsMemberOf` instead.
+See [Direction of Relationships](#direction-of-relationships) for more discussion of relationship directionality.
+A specific instance of a relationship, e.g. `urn:li:corpGroup:group1` has a member `urn:li:corpuser:user1`,
+corresponds to an edge in the metadata graph.
+
+Similar to an entity, a relationship can also be associated with optional attributes that are derived from the metadata.
+For example, from the `Membership` metadata aspect shown below, we're able to derive the `HasMember` relationship that links a specific `Group` to a specific `User`. We can also include additional attributes on the relationship, e.g. importance, which corresponds to the position of the specific member in the original membership array (a sketch of this derivation follows the schema below). This allows complex graph queries that traverse only relationships matching certain criteria, e.g. "return only the top-5 most important members of this group."
+Similar to the entity attributes, relationship attributes should only be added based on the expected query patterns to reduce the indexing cost.
+
+```
+namespace com.linkedin.group
+
+import com.linkedin.common.AuditStamp
+import com.linkedin.common.CorpuserUrn
+
+/**
+ * The membership metadata for a group
+ */
+record Membership {
+
+  /** Audit stamp for the last change */
+  modified: AuditStamp
+
+  /** Admin of the group */
+  admin: CorpuserUrn
+
+  /** Members of the group, ordered in descending importance */
+  members: array[CorpuserUrn]
+}
+```
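+
+To make the derivation concrete, here is a minimal illustrative sketch - plain Python, not GMA's actual relationship-builder API - that turns a `Membership` aspect into `HasMember` edges, taking `importance` from each member's position in the array:
+
+```python
+from dataclasses import dataclass
+from typing import Iterator
+
+
+@dataclass
+class HasMember:
+    """One HasMember edge derived from a Membership aspect (illustrative only)."""
+    source: str       # Group URN, e.g. "urn:li:corpGroup:group1"
+    destination: str  # User URN, e.g. "urn:li:corpuser:user1"
+    importance: int   # Position in the members array (0 = most important)
+
+
+def derive_has_member(group_urn: str, membership: dict) -> Iterator[HasMember]:
+    """Derive HasMember edges from a Membership aspect payload.
+
+    `membership` is assumed to be the aspect as a plain dict, e.g.
+    {"modified": {...}, "admin": "urn:li:corpuser:admin1",
+     "members": ["urn:li:corpuser:user1", "urn:li:corpuser:user2"]}
+    """
+    for position, member_urn in enumerate(membership.get("members", [])):
+        yield HasMember(source=group_urn, destination=member_urn, importance=position)
+
+
+# Example usage:
+edges = list(
+    derive_has_member(
+        "urn:li:corpGroup:group1",
+        {"members": ["urn:li:corpuser:user1", "urn:li:corpuser:user2"]},
+    )
+)
+# edges[0].importance == 0, i.e. the most important member of the group
+```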
+
+Relationships are meant to be "entity-neutral". In other words, one would expect to use the same `OwnedBy` relationship to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`. As Pegasus doesn't allow typing a field using multiple URNs (because they're all essentially strings), we resort to using the generic URN type for the source and destination.
+We also introduce a `@pairings` [annotation](https://linkedin.github.io/rest.li/pdl_migration#shorthand-for-custom-properties) to limit the allowed source and destination URN types.
+
+While it's possible to model relationships in rest.li as [association resources](https://linkedin.github.io/rest.li/modeling/modeling#association), which often get stored as mapping tables, it is far more common to model them as "foreign key" fields in a metadata aspect. For instance, the `Ownership` aspect is likely to contain an array of the owners' corpuser URNs.
+
+Below is an example of how a relationship is modeled in PDL. Note that:
+
+1. As the `source` and `destination` are of generic URN type, we're able to factor them out to a common `BaseRelationship` model.
+2. Each model is expected to have a `@pairings` annotation that is an array of all allowed source-destination URN pairs.
+3. Unlike entity attributes, there's no requirement to make all relationship attributes optional, since relationships do not support partial updates.
+
+```
+namespace com.linkedin.metadata.relationship
+
+import com.linkedin.common.Urn
+
+/**
+ * Common fields that apply to all relationships
+ */
+record BaseRelationship {
+
+  /**
+   * Urn for the source of the relationship
+   */
+  source: Urn
+
+  /**
+   * Urn for the destination of the relationship
+   */
+  destination: Urn
+}
+```
+
+```
+namespace com.linkedin.metadata.relationship
+
+/**
+ * Data model for a has-member relationship
+ */
+@pairings = [ {
+  "destination" : "com.linkedin.common.urn.CorpGroupUrn",
+  "source" : "com.linkedin.common.urn.CorpUserUrn"
+} ]
+record HasMembership includes BaseRelationship
+{
+  /**
+   * The importance of the membership
+   */
+  importance: int
+}
+```
+
+## Direction of Relationships
+
+As relationships are modeled as directed edges between nodes, it's natural to ask which way an edge should point,
+or whether there should be edges going both ways. The answer is: it doesn't really matter. It's more an aesthetic choice than a technical one.
+
+For one, the actual direction doesn't really impact the execution of graph queries. Most graph DBs are fully capable of traversing edges in the reverse direction efficiently.
+
+That being said, generally there's a more "natural way" to specify the direction of a relationship, which closely relates to how the metadata is stored. For example, the membership information for an LDAP group is generally stored as a list in the group's metadata. As a result, it's more natural to model a `HasMember` relationship that points from a group to a member, instead of an `IsMemberOf` relationship pointing from member to group.
+
+Since all relationships are explicitly declared, it's fairly easy for a user to discover what relationships are available and their directionality by inspecting
+the [relationships directory](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/relationship). It's also possible to provide a UI for the catalog of entities and relationships for analysts who are interested in building complex graph queries to gain insights into the metadata.
+
+## High Cardinality Relationships
+
+See [this doc](../advanced/high-cardinality.md) for suggestions on how to best model relationships with high cardinality.
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/search-document.md b/docs-website/versioned_docs/version-0.10.4/docs/what/search-document.md
new file mode 100644
index 0000000000000..e4b1d380d13a4
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/what/search-document.md
@@ -0,0 +1,71 @@
+---
+title: What is a search document?
+slug: /what/search-document
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/what/search-document.md
+---
+
+# What is a search document?
+
+[Search documents](https://en.wikipedia.org/wiki/Search_engine_indexing) are also modeled using [PDL](https://linkedin.github.io/rest.li/pdl_schema) explicitly.
+In many ways, the model for a Document is very similar to an [Entity](entity.md) and [Relationship](relationship.md) model,
+where each attribute/field contains a value that's derived from various metadata aspects.
+However, a search document is also allowed to have array-type attributes that contain only primitives or enum values.
+This is because most full-text search engines support membership testing against an array field, e.g. an array field containing all the terms used in a document.
+
+One obvious use of the attributes is to perform search filtering, e.g.
give me all the `User`s whose first name or last name is similar to "Joe" and who report up to `userFoo` (a query sketch illustrating this follows the schemas below).
+Since the document also serves as the main interface for the search API, the attributes can also be used to format the search snippet.
+As a result, one may be tempted to add as many attributes as needed. This is acceptable, as the underlying search engine is designed to index a large number of fields.
+
+Below is an example schema for the `User` search document. Note that:
+
+1. Each search document is required to have a type-specific `urn` field, which generally maps to an entity in the [graph](graph.md).
+2. Similar to `Entity`, each document has an optional `removed` field for "soft deletion".
+   This is captured in [BaseDocument](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/search/BaseDocument.pdl), which is expected to be included by all documents.
+3. Similar to `Entity`, all remaining fields are made `optional` to support partial updates.
+4. `management` shows an example of a string array field.
+5. `ownedDatasets` shows an example of how a field can be derived from metadata [aspects](aspect.md) associated with other types of entity (in this case, `Dataset`).
+
+```
+namespace com.linkedin.metadata.search
+
+/**
+ * Common fields that may apply to all documents
+ */
+record BaseDocument {
+
+  /** Whether the entity has been removed or not */
+  removed: optional boolean = false
+}
+```
+
+```
+namespace com.linkedin.metadata.search
+
+import com.linkedin.common.CorpuserUrn
+import com.linkedin.common.DatasetUrn
+
+/**
+ * Data model for user entity search
+ */
+record UserDocument includes BaseDocument {
+
+  /** Urn for the user */
+  urn: CorpuserUrn
+
+  /** First name of the user */
+  firstName: optional string
+
+  /** Last name of the user */
+  lastName: optional string
+
+  /** The chain of management all the way to CEO */
+  management: optional array[CorpuserUrn] = []
+
+  /** Code for the cost center */
+  costCenter: optional int
+
+  /** The list of datasets the user owns */
+  ownedDatasets: optional array[DatasetUrn] = []
+}
+```
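+
+Purely as an illustration - the real GMA indices, mappings, and query generation live behind the [Search DAO](../architecture/metadata-serving.md#search-dao), and the index name, host, and client version (elasticsearch-py 8.x) below are assumptions - the filter described above could be expressed against an Elasticsearch index of `UserDocument`s roughly like this:
+
+```python
+from elasticsearch import Elasticsearch
+
+# Assumed local endpoint and index name - both are illustrative.
+es = Elasticsearch("http://localhost:9200")
+
+# "Give me all the Users whose first or last name is similar to 'Joe'
+#  and who report up to userFoo": fuzzy matching on the name fields plus
+#  membership testing on the `management` array field.
+response = es.search(
+    index="userdocument",
+    query={
+        "bool": {
+            "must": [
+                {
+                    "multi_match": {
+                        "query": "Joe",
+                        "fields": ["firstName", "lastName"],
+                        "fuzziness": "AUTO",
+                    }
+                },
+                {"term": {"management": "urn:li:corpuser:userFoo"}},
+                {"term": {"removed": False}},
+            ]
+        }
+    },
+)
+for hit in response["hits"]["hits"]:
+    print(hit["_source"]["urn"])
+```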
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/search-index.md b/docs-website/versioned_docs/version-0.10.4/docs/what/search-index.md
new file mode 100644
index 0000000000000..6fe50459e6541
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/what/search-index.md
@@ -0,0 +1,25 @@
+---
+title: What is GMA search index?
+slug: /what/search-index
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/docs/what/search-index.md
+---
+
+# What is GMA search index?
+
+Each [search document](search-document.md) type (or [entity](entity.md) type) will be mapped to an independent search index in Elasticsearch.
+Beyond the standard search engine features (analyzer, tokenizer, filter queries, faceting, sharding, etc),
+GMA also supports the following specific features:
+
+- Partial update of indexed documents
+- Membership testing on multi-value fields
+- Zero downtime switch between indices
+
+Check out [Search DAO](../architecture/metadata-serving.md#search-dao) for search query abstraction in GMA.
+
+## Search Automation (TBD)
+
+We aim to automate the index creation, schema evolution, and reindexing such that the team will only need to focus on the search document model and their custom [Index Builder](../architecture/metadata-ingestion.md#search-index-builders) logic.
+As the logic changes, a new version of the index will be created and populated from historic MAEs.
+Once it's fully populated, the team can switch to the new version through a simple config change from their [GMS](gms.md).
+They can also roll back to an older version of the index whenever needed.
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/snapshot.md b/docs-website/versioned_docs/version-0.10.4/docs/what/snapshot.md
new file mode 100644
index 0000000000000..8aa3446e14cae
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/what/snapshot.md
@@ -0,0 +1,55 @@
+---
+title: What is a snapshot?
+slug: /what/snapshot
+custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/snapshot.md"
+---
+
+# What is a snapshot?
+
+A metadata snapshot models the current state of one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md).
+Each entity type is expected to have:
+
+1. An entity-specific aspect (e.g. `CorpGroupAspect` below), which is a `typeref` containing a union of all possible metadata aspects for the entity.
+2. An entity-specific snapshot (e.g. `CorpGroupSnapshot` below), which contains an array (`aspects`) of entity-specific aspects.
+
+```
+namespace com.linkedin.metadata.aspect
+
+import com.linkedin.group.Membership
+import com.linkedin.group.SomeOtherMetadata
+
+/**
+ * A union of all supported metadata aspects for a group
+ */
+typeref CorpGroupAspect = union[Membership, SomeOtherMetadata]
+```
+
+```
+namespace com.linkedin.metadata.snapshot
+
+import com.linkedin.common.CorpGroupUrn
+import com.linkedin.metadata.aspect.CorpGroupAspect
+
+/**
+ * A metadata snapshot for a specific Group entity.
+ */
+record CorpGroupSnapshot {
+
+  /** URN for the entity the metadata snapshot is associated with */
+  urn: CorpGroupUrn
+
+  /** The list of metadata aspects associated with the group */
+  aspects: array[CorpGroupAspect]
+}
+```
+
+The generic `Snapshot` typeref contains a union of all entity-specific snapshots and can therefore be used to represent the state of any metadata aspect for all supported entity types.
+
+```
+namespace com.linkedin.metadata.snapshot
+
+/**
+ * A union of all supported metadata snapshot types.
+ */
+typeref Snapshot = union[DatasetSnapshot, CorpGroupSnapshot, CorpUserSnapshot]
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/docs/what/urn.md b/docs-website/versioned_docs/version-0.10.4/docs/what/urn.md
new file mode 100644
index 0000000000000..2fda8d35759c9
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/docs/what/urn.md
@@ -0,0 +1,46 @@
+---
+title: What is URN?
+slug: /what/urn
+custom_edit_url: "https://github.com/datahub-project/datahub/blob/master/docs/what/urn.md"
+---
+
+# What is URN?
+
+URN ([Uniform Resource Name](https://en.wikipedia.org/wiki/Uniform_Resource_Name)) is the chosen scheme of [URI](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier) to uniquely define any resource in DataHub. It has the following form:
+
+```
+urn:<Namespace>:<Entity Type>:<ID>
+```
+
+[Onboarding a new entity](../modeling/metadata-model.md) to [GMA](gma.md) starts with modelling a URN specific to that entity.
+You can use the existing [URN models](https://github.com/datahub-project/datahub/blob/master/li-utils/src/main/javaPegasus/com/linkedin/common/urn) for built-in entities as a reference.
+
+## Namespace
+
+All URNs available in DataHub use `li` as their namespace.
+This can be easily changed to a different namespace for your organization if you fork DataHub.
+
+## Entity Type
+
+Entity type for URN is different from [entity](entity.md) in the GMA context. It can be thought of as the object type of
+any resource for which you need a unique identifier for each of its instances. While you can create URNs for GMA entities such as
+[DatasetUrn](https://github.com/datahub-project/datahub/blob/master/li-utils/src/main/javaPegasus/com/linkedin/common/urn/DatasetUrn.java) with entity type `dataset`, you can also define URNs for other resources such as data platforms, e.g. [DataPlatformUrn](https://github.com/datahub-project/datahub/blob/master/li-utils/src/main/javaPegasus/com/linkedin/common/urn/DataPlatformUrn.java).
+
+## ID
+
+ID is the unique identifier part of a URN. It's unique for a specific entity type within a specific namespace.
+ID could contain a single field, or multiple fields in the case of complex URNs. A complex URN can even contain other URNs as ID fields. This type of URN is also referred to as a nested URN. For non-URN ID fields, the value can be either a string, number, or [Pegasus Enum](https://linkedin.github.io/rest.li/pdl_schema#enum-type).
+
+Here are some example URNs with a single ID field:
+
+```
+urn:li:dataPlatform:kafka
+urn:li:corpuser:jdoe
+```
+
+[DatasetUrn](https://github.com/datahub-project/datahub/blob/master/li-utils/src/main/javaPegasus/com/linkedin/common/urn/DatasetUrn.java) is an example of a complex nested URN. It contains 3 ID fields: `platform`, `name` and `fabric`, where `platform` is another [URN](https://github.com/datahub-project/datahub/blob/master/li-utils/src/main/javaPegasus/com/linkedin/common/urn/DataPlatformUrn.java). Here are some examples:
+
+```
+urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)
+urn:li:dataset:(urn:li:dataPlatform:hdfs,PageViewEvent,EI)
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/enums.md b/docs-website/versioned_docs/version-0.10.4/graphql/enums.md
new file mode 100644
index 0000000000000..2eca3b800c576
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/graphql/enums.md
@@ -0,0 +1,2398 @@
+---
+id: enums
+title: Enums
+slug: enums
+sidebar_position: 5
+---
+
+## AccessLevel
+
+The access level for a Metadata Entity, either public or private
+

Values

+ + + + + + + + + + + + + +
ValueDescription
PUBLIC +

Publicly available

+
PRIVATE +

Restricted to a subset of viewers

+
+ +## AccessTokenDuration + +The duration for which an Access Token is valid. + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
ONE_HOUR +

1 hour

+
ONE_DAY +

1 day

+
ONE_WEEK +

1 week

+
ONE_MONTH +

1 month

+
THREE_MONTHS +

3 months

+
SIX_MONTHS +

6 months

+
ONE_YEAR +

1 year

+
NO_EXPIRY +

No expiry

+
+ +## AccessTokenType + +A type of DataHub Access Token. + +

Values

+ + + + + + + + + +
ValueDescription
PERSONAL +

Generates a personal access token

+
+ +## AssertionResultType + +The result type of an assertion, success or failure. + +

Values

+ + + + + + + + + + + + + +
ValueDescription
SUCCESS +

The assertion succeeded.

+
FAILURE +

The assertion failed.

+
+ +## AssertionRunStatus + +The state of an assertion run, as defined within an Assertion Run Event. + +

Values

+ + + + + + + + + +
ValueDescription
COMPLETE +

An assertion run has completed.

+
+ +## AssertionStdAggregation + +An "aggregation" function that can be applied to column values of a Dataset to create the input to an Assertion Operator. + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
IDENTITY +

Assertion is applied on individual column value

+
MEAN +

Assertion is applied on column mean

+
MEDIAN +

Assertion is applied on column median

+
UNIQUE_COUNT +

Assertion is applied on number of distinct values in column

+
UNIQUE_PROPOTION +

Assertion is applied on proportion of distinct values in column

+
NULL_COUNT +

Assertion is applied on number of null values in column

+
NULL_PROPORTION +

Assertion is applied on proportion of null values in column

+
STDDEV +

Assertion is applied on column std deviation

+
MIN +

Assertion is applied on column min

+
MAX +

Assertion is applied on column std deviation

+
SUM +

Assertion is applied on column sum

+
COLUMNS +

Assertion is applied on all columns

+
COLUMN_COUNT +

Assertion is applied on number of columns

+
ROW_COUNT +

Assertion is applied on number of rows

+
_NATIVE_ +

Other

+
+ +## AssertionStdOperator + +A standard operator or condition that constitutes an assertion definition + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
BETWEEN +

Value being asserted is between min_value and max_value

+
LESS_THAN +

Value being asserted is less than max_value

+
LESS_THAN_OR_EQUAL_TO +

Value being asserted is less than or equal to max_value

+
GREATER_THAN +

Value being asserted is greater than min_value

+
GREATER_THAN_OR_EQUAL_TO +

Value being asserted is greater than or equal to min_value

+
EQUAL_TO +

Value being asserted is equal to value

+
NOT_NULL +

Value being asserted is not null

+
CONTAIN +

Value being asserted contains value

+
END_WITH +

Value being asserted ends with value

+
START_WITH +

Value being asserted starts with value

+
REGEX_MATCH +

Value being asserted matches the regex value.

+
IN +

Value being asserted is one of the array values

+
NOT_IN +

Value being asserted is not in one of the array values.

+
_NATIVE_ +

Other

+
+ +## AssertionStdParameterType + +The type of an AssertionStdParameter + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
STRING + +
NUMBER + +
LIST + +
SET + +
UNKNOWN + +
+ +## AssertionType + +The top-level assertion type. Currently single Dataset assertions are the only type supported. + +

Values

+ + + + + + + + + +
ValueDescription
DATASET + +
+ +## ChangeCategoryType + +Enum of CategoryTypes + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
DOCUMENTATION +

When documentation has been edited

+
GLOSSARY_TERM +

When glossary terms have been added or removed

+
OWNERSHIP +

When ownership has been modified

+
TECHNICAL_SCHEMA +

When technical schemas have been added or removed

+
TAG +

When tags have been added or removed

+
+ +## ChangeOperationType + +Enum of types of changes + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
ADD +

When an element is added

+
MODIFY +

When an element is modified

+
REMOVE +

When an element is removed

+
+ +## ChartQueryType + +The type of the Chart Query + +

Values

+ + + + + + + + + + + + + +
ValueDescription
SQL +

Standard ANSI SQL

+
LOOKML +

LookML

+
+ +## ChartType + +The type of a Chart Entity + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
BAR +

Bar graph

+
PIE +

Pie chart

+
SCATTER +

Scatter plot

+
TABLE +

Table

+
TEXT +

Markdown formatted text

+
LINE +

A line chart

+
AREA +

An area chart

+
HISTOGRAM +

A histogram chart

+
BOX_PLOT +

A box plot chart

+
WORD_CLOUD +

A word cloud chart

+
COHORT +

A Cohort Analysis chart

+
+ +## CorpUserStatus + +The state of a CorpUser + +

Values

+ + + + + + + + + +
ValueDescription
ACTIVE +

A User that has been provisioned and logged in

+
+ +## CostType + +

Values

+ + + + + + + + + +
ValueDescription
ORG_COST_TYPE +

Org Cost Type to which the Cost of this entity should be attributed to

+
+ +## DataHubViewType + +The type of a DataHub View + +

Values

+ + + + + + + + + + + + + +
ValueDescription
PERSONAL +

A personal view - e.g. saved filters

+
GLOBAL +

A global view, e.g. role view

+
+ +## DataProcessInstanceRunResultType + +The result of the data process run + +

Values

+ + + + + + + + + + + + + + + + + + + + + +
ValueDescription
SUCCESS +

The run finished successfully

+
FAILURE +

The run finished in failure

+
SKIPPED +

The run was skipped

+
UP_FOR_RETRY +

The run failed and is up for retry

+
+ +## DataProcessRunStatus + +The status of the data process instance + +

Values

+ + + + + + + + + + + + + +
ValueDescription
STARTED +

The data process instance has started but not completed

+
COMPLETE +

The data process instance has completed

+
+ +## DatasetAssertionScope + +The scope that a Dataset-level assertion applies to. + +

Values

+ + + + + + + + + + + + + + + + + + + + + +
ValueDescription
DATASET_COLUMN +

Assertion applies to columns of a dataset.

+
DATASET_ROWS +

Assertion applies to rows of a dataset.

+
DATASET_SCHEMA +

Assertion applies to schema of a dataset.

+
UNKNOWN +

The scope of an assertion is unknown.

+
+ +## DatasetLineageType + +Deprecated +The type of an edge between two Datasets + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
COPY +

Direct copy without modification

+
TRANSFORMED +

Transformed dataset

+
VIEW +

Represents a view defined on the sources

+
+ +## DateInterval + +For consumption by UI only + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
SECOND + +
MINUTE + +
HOUR + +
DAY + +
WEEK + +
MONTH + +
YEAR + +
+ +## EntityType + +A top level Metadata Entity Type + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
DOMAIN +

A Domain containing Metadata Entities

+
DATASET +

The Dataset Entity

+
CORP_USER +

The CorpUser Entity

+
CORP_GROUP +

The CorpGroup Entity

+
DATA_PLATFORM +

The DataPlatform Entity

+
DASHBOARD +

The Dashboard Entity

+
NOTEBOOK +

The Notebook Entity

+
CHART +

The Chart Entity

+
DATA_FLOW +

The Data Flow (or Data Pipeline) Entity,

+
DATA_JOB +

The Data Job (or Data Task) Entity

+
TAG +

The Tag Entity

+
GLOSSARY_TERM +

The Glossary Term Entity

+
GLOSSARY_NODE +

The Glossary Node Entity

+
CONTAINER +

A container of Metadata Entities

+
MLMODEL +

The ML Model Entity

+
MLMODEL_GROUP +

The MLModelGroup Entity

+
MLFEATURE_TABLE +

ML Feature Table Entity

+
MLFEATURE +

The ML Feature Entity

+
MLPRIMARY_KEY +

The ML Primary Key Entity

+
INGESTION_SOURCE +

A DataHub Managed Ingestion Source

+
EXECUTION_REQUEST +

A DataHub ExecutionRequest

+
ASSERTION +

A DataHub Assertion

+
DATA_PROCESS_INSTANCE +

An instance of an individual run of a data job or data flow

+
DATA_PLATFORM_INSTANCE +

Data Platform Instance Entity

+
ACCESS_TOKEN +

A DataHub Access Token

+
TEST +

A DataHub Test

+
DATAHUB_POLICY +

A DataHub Policy

+
DATAHUB_ROLE +

A DataHub Role

+
POST +

A DataHub Post

+
SCHEMA_FIELD +

A Schema Field

+
DATAHUB_VIEW +

A DataHub View

+
QUERY +

A dataset query

+
DATA_PRODUCT +

A Data Product

+
CUSTOM_OWNERSHIP_TYPE +

A Custom Ownership Type

+
ROLE +

" +A Role from an organisation

+
+ +## FabricType + +An environment identifier for a particular Entity, ie staging or production +Note that this model will soon be deprecated in favor of a more general purpose of notion +of data environment + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
DEV +

Designates development fabrics

+
TEST +

Designates testing fabrics

+
QA +

Designates quality assurance fabrics

+
UAT +

Designates user acceptance testing fabrics

+
EI +

Designates early integration fabrics

+
PRE +

Designates pre-production fabrics

+
STG +

Designates staging fabrics

+
NON_PROD +

Designates non-production fabrics

+
PROD +

Designates production fabrics

+
CORP +

Designates corporation fabrics

+
+ +## FilterOperator + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
CONTAIN +

Represent the relation: String field contains value, e.g. name contains Profile

+
EQUAL +

Represent the relation: field = value, e.g. platform = hdfs

+
IN +
    +
  • Represent the relation: String field is one of the array values to, e.g. name in ["Profile", "Event"]
  • +
+
+ +## HealthStatus + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
PASS +

The Asset is in a healthy state

+
WARN +

The Asset is in a warning state

+
FAIL +

The Asset is in a failing (unhealthy) state

+
+ +## HealthStatusType + +The type of the health status + +

Values

+ + + + + + + + + +
ValueDescription
ASSERTIONS +

Assertions status

+
+ +## IntendedUserType + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
ENTERPRISE +

Developed for Enterprise Users

+
HOBBY +

Developed for Hobbyists

+
ENTERTAINMENT +

Developed for Entertainment Purposes

+
+ +## LineageDirection + +Direction between two nodes in the lineage graph + +

Values

+ + + + + + + + + + + + + +
ValueDescription
UPSTREAM +

Upstream, or left-to-right in the lineage visualization

+
DOWNSTREAM +

Downstream, or right-to-left in the lineage visualization

+
+ +## LogicalOperator + +A Logical Operator, AND or OR. + +

Values

+ + + + + + + + + + + + + +
ValueDescription
AND +

An AND operator.

+
OR +

An OR operator.

+
+ +## MediaType + +The type of media + +

Values

+ + + + + + + + + +
ValueDescription
IMAGE +

An image

+
+ +## MLFeatureDataType + +The data type associated with an individual Machine Learning Feature + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
USELESS + +
NOMINAL + +
ORDINAL + +
BINARY + +
COUNT + +
TIME + +
INTERVAL + +
IMAGE + +
VIDEO + +
AUDIO + +
TEXT + +
MAP + +
SEQUENCE + +
SET + +
CONTINUOUS + +
BYTE + +
UNKNOWN + +
+ +## NotebookCellType + +The type for a NotebookCell + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
TEXT_CELL +

TEXT Notebook cell type. The cell context is text only.

+
QUERY_CELL +

QUERY Notebook cell type. The cell context is query only.

+
CHART_CELL +

CHART Notebook cell type. The cell content is chart only.

+
+ +## OperationSourceType + +Enum to define the source/reporter type for an Operation. + +

Values

+ + + + + + + + + + + + + +
ValueDescription
DATA_PROCESS +

A data process reported the operation.

+
DATA_PLATFORM +

A data platform reported the operation.

+
+ +## OperationType + +Enum to define the operation type when an entity changes. + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
INSERT +

When data is inserted.

+
UPDATE +

When data is updated.

+
DELETE +

When data is deleted.

+
CREATE +

When table is created.

+
ALTER +

When table is altered

+
DROP +

When table is dropped

+
UNKNOWN +

Unknown operation

+
CUSTOM +

Custom

+
+ +## OriginType + +Enum to define where an entity originated from. + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
NATIVE +

The entity is native to DataHub.

+
EXTERNAL +

The entity is external to DataHub.

+
UNKNOWN +

The entity is of unknown origin.

+
+ +## OwnerEntityType + +Entities that are able to own other entities + +

Values

+ + + + + + + + + + + + + +
ValueDescription
CORP_USER +

A corp user owner

+
CORP_GROUP +

A corp group owner

+
+ +## OwnershipSourceType + +The origin of Ownership metadata associated with a Metadata Entity + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
AUDIT +

Auditing system or audit logs

+
DATABASE +

Database, eg GRANTS table

+
FILE_SYSTEM +

File system, eg file or directory owner

+
ISSUE_TRACKING_SYSTEM +

Issue tracking system, eg Jira

+
MANUAL +

Manually provided by a user

+
SERVICE +

Other ownership like service, eg Nuage, ACL service etc

+
SOURCE_CONTROL +

SCM system, eg GIT, SVN

+
OTHER +

Other sources

+
+ +## OwnershipType + +The type of the ownership relationship between a Person and a Metadata Entity +Note that this field will soon become deprecated due to low usage + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
TECHNICAL_OWNER +

A person or group who is responsible for technical aspects of the asset.

+
BUSINESS_OWNER +

A person or group who is responsible for logical, or business related, aspects of the asset.

+
DATA_STEWARD +

A steward, expert, or delegate responsible for the asset.

+
NONE +

No specific type associated with the owner.

+
CUSTOM +

Associated ownership type is a custom ownership type. Please check OwnershipTypeEntity urn for custom value.

+
DATAOWNER +

A person or group that owns the data. +Deprecated! This ownership type is no longer supported. Use TECHNICAL_OWNER instead.

+
DEVELOPER +

A person or group that is in charge of developing the code +Deprecated! This ownership type is no longer supported. Use TECHNICAL_OWNER instead.

+
DELEGATE +

A person or a group that overseas the operation, eg a DBA or SRE +Deprecated! This ownership type is no longer supported. Use TECHNICAL_OWNER instead.

+
PRODUCER +

A person, group, or service that produces or generates the data +Deprecated! This ownership type is no longer supported. Use TECHNICAL_OWNER instead.

+
STAKEHOLDER +

A person or a group that has direct business interest +Deprecated! Use BUSINESS_OWNER instead.

+
CONSUMER +

A person, group, or service that consumes the data +Deprecated! This ownership type is no longer supported.

+
+ +## PartitionType + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
FULL_TABLE + +
QUERY + +
PARTITION + +
+ +## PlatformNativeType + +Deprecated, do not use this type +The logical type associated with an individual Dataset + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
TABLE +

Table

+
VIEW +

View

+
DIRECTORY +

Directory in file system

+
STREAM +

Stream

+
BUCKET +

Bucket in key value store

+
+ +## PlatformType + +The category of a specific Data Platform + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
FILE_SYSTEM +

Value for a file system

+
KEY_VALUE_STORE +

Value for a key value store

+
MESSAGE_BROKER +

Value for a message broker

+
OBJECT_STORE +

Value for an object store

+
OLAP_DATASTORE +

Value for an OLAP datastore

+
QUERY_ENGINE +

Value for a query engine

+
RELATIONAL_DB +

Value for a relational database

+
SEARCH_ENGINE +

Value for a search engine

+
OTHERS +

Value for other platforms

+
+ +## PolicyMatchCondition + +Match condition + +

Values

+ + + + + + + + + +
ValueDescription
EQUALS +

Whether the field matches the value

+
+ +## PolicyState + +The state of an Access Policy + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
DRAFT +

A Policy that has not been officially created, but in progress +Currently unused

+
ACTIVE +

A Policy that is active and being enforced

+
INACTIVE +

A Policy that is not active or being enforced

+
+ +## PolicyType + +The type of the Access Policy + +

Values

+ + + + + + + + + + + + + +
ValueDescription
METADATA +

An access policy that grants privileges pertaining to Metadata Entities

+
PLATFORM +

An access policy that grants top level administrative privileges pertaining to the DataHub Platform itself

+
+ +## PostContentType + +The type of post + +

Values

+ + + + + + + + + + + + + +
ValueDescription
TEXT +

Text content

+
LINK +

Link content

+
+ +## PostType + +The type of post + +

Values

+ + + + + + + + + +
ValueDescription
HOME_PAGE_ANNOUNCEMENT +

Posts on the home page

+
+ +## QueryLanguage + +A query language / dialect. + +

Values

+ + + + + + + + + +
ValueDescription
SQL +

Standard ANSI SQL

+
+ +## QuerySource + +The source of the query + +

Values

+ + + + + + + + + +
ValueDescription
MANUAL +

The query was provided manually, e.g. from the UI.

+
+ +## RecommendationRenderType + +Enum that defines how the modules should be rendered. +There should be two frontend implementation of large and small modules per type. + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
ENTITY_NAME_LIST +

Simple list of entities

+
PLATFORM_SEARCH_LIST +

List of platforms

+
TAG_SEARCH_LIST +

Tag search list

+
SEARCH_QUERY_LIST +

A list of recommended search queries

+
GLOSSARY_TERM_SEARCH_LIST +

Glossary Term search list

+
DOMAIN_SEARCH_LIST +

Domain Search List

+
+ +## RelationshipDirection + +Direction between a source and destination node + +

Values

+ + + + + + + + + + + + + +
ValueDescription
INCOMING +

A directed edge pointing at the source Entity

+
OUTGOING +

A directed edge pointing at the destination Entity

+
+ +## ScenarioType + +Type of the scenario requesting recommendation + +

Values

+ + + + + + + + + + + + + + + + + + + + + +
ValueDescription
HOME +

Recommendations to show on the users home page

+
SEARCH_RESULTS +

Recommendations to show on the search results page

+
ENTITY_PROFILE +

Recommendations to show on an Entity Profile page

+
SEARCH_BAR +

Recommendations to show on the search bar when clicked

+
+ +## SchemaFieldDataType + +The type associated with a single Dataset schema field + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
BOOLEAN +

A boolean type

+
FIXED +

A fixed bytestring type

+
STRING +

A string type

+
BYTES +

A string of bytes

+
NUMBER +

A number, including integers, floats, and doubles

+
DATE +

A datestrings type

+
TIME +

A timestamp type

+
ENUM +

An enum type

+
NULL +

A NULL type

+
MAP +

A map collection type

+
ARRAY +

An array collection type

+
UNION +

An union type

+
STRUCT +

An complex struct type

+
+ +## SourceCodeUrlType + +

Values

+ + + + + + + + + + + + + + + + + +
ValueDescription
ML_MODEL_SOURCE_CODE +

MLModel Source Code

+
TRAINING_PIPELINE_SOURCE_CODE +

Training Pipeline Source Code

+
EVALUATION_PIPELINE_SOURCE_CODE +

Evaluation Pipeline Source Code

+
+ +## SubResourceType + +A type of Metadata Entity sub resource + +

Values

+ + + + + + + + + +
ValueDescription
DATASET_FIELD +

A Dataset field or column

+
+ +## TermRelationshipType + +A type of Metadata Entity sub resource + +

Values

+ + + + + + + + + + + + + +
ValueDescription
isA +

When a Term inherits from, or has an 'Is A' relationship with another Term

+
hasA +

When a Term contains, or has a 'Has A' relationship with another Term

+
+ +## TestResultType + +The result type of a test that has been run + +

Values

+ + + + + + + + + + + + + +
ValueDescription
SUCCESS +

The test succeeded.

+
FAILURE +

The test failed.

+
+ +## TimeRange + +A time range used in fetching Usage statistics + +

Values

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ValueDescription
DAY +

Last day

+
WEEK +

Last week

+
MONTH +

Last month

+
QUARTER +

Last quarter

+
YEAR +

Last year

+
ALL +

All time

+
+ +## UserSetting + +An individual setting type for a Corp User. + +

Values

+ + + + + + + + + +
ValueDescription
SHOW_SIMPLIFIED_HOMEPAGE +

Show simplified homepage

+
+ +## WindowDuration + +The duration of a fixed window of time + +

Values

+ + + + + + + + + + + + + + + + + + + + + +
ValueDescription
DAY +

A one day window

+
WEEK +

A one week window

+
MONTH +

A one month window

+
YEAR +

A one year window

+
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/inputObjects.md b/docs-website/versioned_docs/version-0.10.4/graphql/inputObjects.md new file mode 100644 index 0000000000000..a07ac2b858e8e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/inputObjects.md @@ -0,0 +1,6313 @@ +--- +id: inputObjects +title: Input objects +slug: inputObjects +sidebar_position: 7 +--- + +## AcceptRoleInput + +Input provided when accepting a DataHub role using an invite token + +

Arguments

+ + + + + + + + + +
NameDescription
+inviteToken
+String! +
+

The token needed to accept the role

+
+ +## ActorFilterInput + +Input required when creating or updating an Access Policies Determines which actors the Policy applies to + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+users
+[String!] +
+

A disjunctive set of users to apply the policy to

+
+groups
+[String!] +
+

A disjunctive set of groups to apply the policy to

+
+resourceOwners
+Boolean! +
+

Whether the filter should return TRUE for owners of a particular resource +Only applies to policies of type METADATA, which have a resource associated with them

+
+resourceOwnersTypes
+[String!] +
+

Set of OwnershipTypes to apply the policy to (if resourceOwners field is set to True)

+
+allUsers
+Boolean! +
+

Whether the filter should apply to all users

+
+allGroups
+Boolean! +
+

Whether the filter should apply to all groups

+
+ +## AddGroupMembersInput + +Input required to add members to an external DataHub group + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+groupUrn
+String! +
+

The group to add members to

+
+userUrns
+[String!]! +
+

The members to add to the group

+
+ +## AddLinkInput + +Input provided when adding the association between a Metadata Entity and a Link + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+linkUrl
+String! +
+

The url of the link to add or remove

+
+label
+String! +
+

A label to attach to the link

+
+resourceUrn
+String! +
+

The urn of the resource or entity to attach the link to, for example a dataset urn

+
+ +## AddNativeGroupMembersInput + +Input required to add members to a native DataHub group + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+groupUrn
+String! +
+

The group to add members to

+
+userUrns
+[String!]! +
+

The members to add to the group

+
+ +## AddOwnerInput + +Input provided when adding the association between a Metadata Entity and an user or group owner + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownerUrn
+String! +
+

The primary key of the Owner to add or remove

+
+ownerEntityType
+OwnerEntityType! +
+

The owner type, either a user or group

+
+type
+OwnershipType +
+
Deprecated: No longer supported
+ +

The ownership type for the new owner. If none is provided, then a new NONE will be added. +Deprecated - Use ownershipTypeUrn field instead.

+
+ownershipTypeUrn
+String +
+

The urn of the ownership type entity.

+
+resourceUrn
+String! +
+

The urn of the resource or entity to attach or remove the owner from, for example a dataset urn

+
+ +## AddOwnersInput + +Input provided when adding multiple associations between a Metadata Entity and an user or group owner + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+owners
+[OwnerInput!]! +
+

The primary key of the Owner to add or remove

+
+resourceUrn
+String! +
+

The urn of the resource or entity to attach or remove the owner from, for example a dataset urn

+
+ +## AddTagsInput + +Input provided when adding tags to an asset + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+tagUrns
+[String!]! +
+

The primary key of the Tags

+
+resourceUrn
+String! +
+

The target Metadata Entity to add or remove the Tag to

+
+subResourceType
+SubResourceType +
+

An optional type of a sub resource to attach the Tag to

+
+subResource
+String +
+

An optional sub resource identifier to attach the Tag to

+
+ +## AddTermsInput + +Input provided when adding Terms to an asset + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+termUrns
+[String!]! +
+

The primary key of the Glossary Term to add or remove

+
+resourceUrn
+String! +
+

The target Metadata Entity to add or remove the Glossary Term from

+
+subResourceType
+SubResourceType +
+

An optional type of a sub resource to attach the Glossary Term to

+
+subResource
+String +
+

An optional sub resource identifier to attach the Glossary Term to

+
+ +## AggregateAcrossEntitiesInput + +Input arguments for a full text search query across entities to get aggregations + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+types
+[EntityType!] +
+

Entity types to be searched. If this is not provided, all entities will be searched.

+
+query
+String! +
+

The query string

+
+facets
+[String] +
+

The list of facets to get aggregations for. If list is empty or null, get aggregations for all facets +Sub-aggregations can be specified with the unicode character ␞ (U+241E) as a delimiter between the subtypes. +e.g. _entityType␞owners

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criterion for the filter. (or operation to combine filters)

+
+viewUrn
+String +
+

Optional - A View to apply when generating results

+
+searchFlags
+SearchFlags +
+

Flags controlling search options

+
+ +## AndFilterInput + +A list of disjunctive criterion for the filter. (or operation to combine filters) + +

Arguments

+ + + + + + + + + +
NameDescription
+and
+[FacetFilterInput!] +
+

A list of and criteria the filter applies to the query

+
+ +## AspectParams + +Params to configure what list of aspects should be fetched by the aspects property + +

Arguments

+ + + + + + + + + +
NameDescription
+autoRenderOnly
+Boolean +
+

Only fetch auto render aspects

+
+ +## AutoCompleteInput + +Input for performing an auto completion query against a single Metadata Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+EntityType +
+

Entity type to be autocompleted against

+
+query
+String! +
+

The raw query string

+
+field
+String +
+

An optional entity field name to autocomplete on

+
+limit
+Int +
+

The maximum number of autocomplete results to be returned

+
+filters
+[FacetFilterInput!] +
+

Faceted filters applied to autocomplete results

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criterion for the filter. (or operation to combine filters)

+
+ +## AutoCompleteMultipleInput + +Input for performing an auto completion query against a a set of Metadata Entities + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+types
+[EntityType!] +
+

Entity types to be autocompleted against +Optional, if none supplied, all searchable types will be autocompleted against

+
+query
+String! +
+

The raw query string

+
+field
+String +
+

An optional field to autocomplete against

+
+limit
+Int +
+

The maximum number of autocomplete results

+
+filters
+[FacetFilterInput!] +
+

Faceted filters applied to autocomplete results

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criterion for the filter. (or operation to combine filters)

+
+viewUrn
+String +
+

Optional - A View to apply when generating results

+
+ +## BatchAddOwnersInput + +Input provided when adding owners to a batch of assets + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+owners
+[OwnerInput!]! +
+

The primary key of the owners

+
+ownershipTypeUrn
+String +
+

The ownership type to remove, optional. By default will remove regardless of ownership type.

+
+resources
+[ResourceRefInput]! +
+

The target assets to attach the owners to

+
+ +## BatchAddTagsInput + +Input provided when adding tags to a batch of assets + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+tagUrns
+[String!]! +
+

The primary key of the Tags

+
+resources
+[ResourceRefInput!]! +
+

The target assets to attach the tags to

+
+ +## BatchAddTermsInput + +Input provided when adding glossary terms to a batch of assets + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+termUrns
+[String!]! +
+

The primary key of the Glossary Terms

+
+resources
+[ResourceRefInput]! +
+

The target assets to attach the glossary terms to

+
+ +## BatchAssignRoleInput + +Input provided when batch assigning a role to a list of users + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+roleUrn
+String +
+

The urn of the role to assign to the actors. If undefined, will remove the role.

+
+actors
+[String!]! +
+

The urns of the actors to assign the role to

+
+ +## BatchDatasetUpdateInput + +Arguments provided to batch update Dataset entities + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Primary key of the Dataset to which the update will be applied

+
+update
+DatasetUpdateInput! +
+

Arguments provided to update the Dataset

+
+ +## BatchGetStepStatesInput + +Input arguments required for fetching step states + +

Arguments

+ + + + + + + + + +
NameDescription
+ids
+[String!]! +
+

The unique ids for the steps to retrieve

+
+ +## BatchRemoveOwnersInput + +Input provided when removing owners from a batch of assets + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+ownerUrns
+[String!]! +
+

The primary key of the owners

+
+ownershipTypeUrn
+String +
+

The ownership type to remove, optional. By default will remove regardless of ownership type.

+
+resources
+[ResourceRefInput]! +
+

The target assets to remove the owners from

+
+ +## BatchRemoveTagsInput + +Input provided when removing tags from a batch of assets + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+tagUrns
+[String!]! +
+

The primary key of the Tags

+
+resources
+[ResourceRefInput]! +
+

The target assets to remove the tags from

+
+ +## BatchRemoveTermsInput + +Input provided when removing glossary terms from a batch of assets + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+termUrns
+[String!]! +
+

The primary key of the Glossary Terms

+
+resources
+[ResourceRefInput]! +
+

The target assets to remove the glossary terms from

+
+ +## BatchSetDataProductInput + +Input properties required for batch setting a DataProduct on other entities + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+dataProductUrn
+String +
+

The urn of the data product you are setting on a group of resources. +If this is null, the Data Product will be unset for the given resources.

+
+resourceUrns
+[String!]! +
+

The urns of the entities the given data product should be set on

+
+ +## BatchSetDomainInput + +Input provided when adding tags to a batch of assets + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+domainUrn
+String +
+

The primary key of the Domain, or null if the domain will be unset

+
+resources
+[ResourceRefInput!]! +
+

The target assets to attach the Domain

+
+ +## BatchUpdateDeprecationInput + +Input provided when updating the deprecation status for a batch of assets. + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+deprecated
+Boolean! +
+

Whether the Entity is marked as deprecated.

+
+decommissionTime
+Long +
+

Optional - The time user plan to decommission this entity

+
+note
+String +
+

Optional - Additional information about the entity deprecation plan

+
+resources
+[ResourceRefInput]! +
+

The target assets to attach the tags to

+
+ +## BatchUpdateSoftDeletedInput + +Input provided when updating the soft-deleted status for a batch of assets + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urns
+[String!]! +
+

The urns of the assets to soft delete

+
+deleted
+Boolean! +
+

Whether to mark the asset as soft-deleted (hidden)

+
+ +## BatchUpdateStepStatesInput + +Input arguments required for updating step states + +

Arguments

+ + + + + + + + + +
NameDescription
+states
+[StepStateInput!]! +
+

Set of step states. If the id does not exist, it will be created.

+
+ +## BrowseInput + +Input required for browse queries + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+EntityType! +
+

The browse entity type

+
+path
+[String!] +
+

The browse path

+
+start
+Int +
+

The starting point of paginated results

+
+count
+Int +
+

The number of elements included in the results

+
+filters
+[FacetFilterInput!] +
+
Deprecated: Use `orFilters`- they are more expressive
+ +

Deprecated in favor of the more expressive orFilters field +Facet filters to apply to search results. These will be 'AND'-ed together.

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criterion for the filter. (or operation to combine filters)

+
+ +## BrowsePathsInput + +Inputs for fetching the browse paths for a Metadata Entity + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+type
+EntityType! +
+

The browse entity type

+
+urn
+String! +
+

The entity urn

+
+ +## BrowseV2Input + +Input required for browse queries + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+EntityType! +
+

The browse entity type

+
+path
+[String!] +
+

The browse path V2 - a list with each entry being part of the browse path V2

+
+start
+Int +
+

The starting point of paginated results

+
+count
+Int +
+

The number of elements included in the results

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criterion for the filter. (or operation to combine filters)

+
+viewUrn
+String +
+

Optional - A View to apply when generating results

+
+query
+String +
+

The search query string

+
+ +## CancelIngestionExecutionRequestInput + +Input for cancelling an execution request input + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+ingestionSourceUrn
+String! +
+

Urn of the ingestion source

+
+executionRequestUrn
+String! +
+

Urn of the specific execution request to cancel

+
+ +## ChartEditablePropertiesUpdate + +Update to writable Chart fields + +

Arguments

+ + + + + + + + + +
NameDescription
+description
+String! +
+

Writable description aka documentation for a Chart

+
+ +## ChartUpdateInput + +Arguments provided to update a Chart Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownership
+OwnershipUpdate +
+

Update to ownership

+
+globalTags
+GlobalTagsUpdate +
+

Deprecated, use tags field instead +Update to global tags

+
+tags
+GlobalTagsUpdate +
+

Update to tags

+
+editableProperties
+ChartEditablePropertiesUpdate +
+

Update to editable properties

+
+ +## ContainerEntitiesInput + +Input required to fetch the entities inside of a container. + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+query
+String +
+

Optional query filter for particular entities inside the container

+
+start
+Int +
+

The offset of the result set

+
+count
+Int +
+

The number of entities to include in result set

+
+filters
+[FacetFilterInput!] +
+

Optional Facet filters to apply to the result set

+
+ +## CorpGroupUpdateInput + +Arguments provided to update a CorpGroup Entity + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+

DataHub description of the group

+
+slack
+String +
+

Slack handle for the group

+
+email
+String +
+

Email address for the group

+
+ +## CorpUserUpdateInput + +Arguments provided to update a CorpUser Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+displayName
+String +
+

Display name to show on DataHub

+
+title
+String +
+

Title to show on DataHub

+
+aboutMe
+String +
+

About me section of the user

+
+teams
+[String!] +
+

Teams that the user belongs to

+
+skills
+[String!] +
+

Skills that the user possesses

+
+pictureLink
+String +
+

A URL which points to a picture which the user wants to set as a profile photo

+
+slack
+String +
+

The slack handle of the user

+
+phone
+String +
+

Phone number for the user

+
+email
+String +
+

Email address for the user

+
+ +## CreateAccessTokenInput + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+AccessTokenType! +
+

The type of the Access Token.

+
+actorUrn
+String! +
+

The actor associated with the Access Token.

+
+duration
+AccessTokenDuration! +
+

The duration for which the Access Token is valid.

+
+name
+String! +
+

The name of the token to be generated.

+
+description
+String +
+

Description of the token if defined.

+
+ +## CreateDataProductInput + +Input required for creating a DataProduct. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+properties
+CreateDataProductPropertiesInput! +
+

Properties about the DataProduct

+
+domainUrn
+String! +
+

The primary key of the Domain

+
+ +## CreateDataProductPropertiesInput + +Input properties required for creating a DataProduct + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

A display name for the DataProduct

+
+description
+String +
+

An optional description for the DataProduct

+
+ +## CreateDomainInput + +Input required to create a new Domain. + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+id
+String +
+

Optional! A custom id to use as the primary key identifier for the domain. If not provided, a random UUID will be generated as the id.

+
+name
+String! +
+

Display name for the Domain

+
+description
+String +
+

Optional description for the Domain

+
+ +## CreateGlossaryEntityInput + +Input required to create a new Glossary Entity - a Node or a Term. + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+id
+String +
+

Optional! A custom id to use as the primary key identifier for the Glossary Node or Term. If not provided, a random UUID will be generated as the id.

+
+name
+String! +
+

Display name for the Node or Term

+
+description
+String +
+

Description for the Node or Term

+
+parentNode
+String +
+

Optional parent node urn for the Glossary Node or Term

+
+ +## CreateGroupInput + +Input for creating a new group + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+id
+String +
+

Optional! A custom id to use as the primary key identifier for the group. If not provided, a random UUID will be generated as the id.

+
+name
+String! +
+

The display name of the group

+
+description
+String +
+

The description of the group

+
+ +## CreateIngestionExecutionRequestInput + +Input for creating an execution request input + +

Arguments

+ + + + + + + + + +
NameDescription
+ingestionSourceUrn
+String! +
+

Urn of the ingestion source to execute

+
+ +## CreateInviteTokenInput + +Input provided when creating an invite token + +

Arguments

+ + + + + + + + + +
NameDescription
+roleUrn
+String +
+

The urn of the role to create the invite token for

+
+ +## CreateNativeUserResetTokenInput + +Input required to generate a password reset token for a native user. + +

Arguments

+ + + + + + + + + +
NameDescription
+userUrn
+String! +
+

The urn of the user to reset the password of

+
+ +## CreateOwnershipTypeInput + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the Custom Ownership Type

+
+description
+String! +
+

The description of the Custom Ownership Type

+
+ +## CreatePostInput + +Input provided when creating a Post + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+postType
+PostType! +
+

The type of post

+
+content
+UpdatePostContentInput! +
+

The content of the post

+
+ +## CreateQueryInput + +Input required for creating a Query. Requires the 'Edit Queries' privilege for all query subjects. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+properties
+CreateQueryPropertiesInput! +
+

Properties about the Query

+
+subjects
+[CreateQuerySubjectInput!]! +
+

Subjects for the query

+
+ +## CreateQueryPropertiesInput + +Input properties required for creating a Query + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+

An optional display name for the Query

+
+description
+String +
+

An optional description for the Query

+
+statement
+QueryStatementInput! +
+

The Query contents

+
+ +## CreateQuerySubjectInput + +Input required for creating a Query. For now, only datasets are supported. + +

Arguments

+ + + + + + + + + +
NameDescription
+datasetUrn
+String! +
+

The urn of the dataset that is the subject of the query

+
+ +## CreateSecretInput + +Input arguments for creating a new Secret + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the secret for reference in ingestion recipes

+
+value
+String! +
+

The value of the secret, to be encrypted and stored

+
+description
+String +
+

An optional description for the secret

+
+ +## CreateTagInput + +Input required to create a new Tag + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+id
+String +
+

Optional! A custom id to use as the primary key identifier for the Tag. If not provided, a random UUID will be generated as the id.

+
+name
+String! +
+

Display name for the Tag

+
+description
+String +
+

Optional description for the Tag

+
+ +## CreateTestConnectionRequestInput + +Input for creating a test connection request + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+recipe
+String! +
+

A JSON-encoded recipe

+
+version
+String +
+

Advanced: The version of the ingestion framework to use

+
+ +## CreateTestInput + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+id
+String +
+

Advanced: a custom id for the test.

+
+name
+String! +
+

The name of the Test

+
+category
+String! +
+

The category of the Test (user defined)

+
+description
+String +
+

Description of the test

+
+definition
+TestDefinitionInput! +
+

The test definition

+
+ +## CreateViewInput + +Input provided when creating a DataHub View + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+viewType
+DataHubViewType! +
+

The type of View

+
+name
+String! +
+

The name of the View

+
+description
+String +
+

An optional description of the View

+
+definition
+DataHubViewDefinitionInput! +
+

The view definition itself

+
+ +## DashboardEditablePropertiesUpdate + +Update to writable Dashboard fields + +

Arguments

+ + + + + + + + + +
NameDescription
+description
+String! +
+

Writable description aka documentation for a Dashboard

+
+ +## DashboardUpdateInput + +Arguments provided to update a Dashboard Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownership
+OwnershipUpdate +
+

Update to ownership

+
+globalTags
+GlobalTagsUpdate +
+

Deprecated, use tags field instead +Update to global tags

+
+tags
+GlobalTagsUpdate +
+

Update to tags

+
+editableProperties
+DashboardEditablePropertiesUpdate +
+

Update to editable properties

+
+ +## DataFlowEditablePropertiesUpdate + +Update to writable Data Flow fields + +

Arguments

+ + + + + + + + + +
NameDescription
+description
+String! +
+

Writable description aka documentation for a Data Flow

+
+ +## DataFlowUpdateInput + +Arguments provided to update a Data Flow aka Pipeline Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownership
+OwnershipUpdate +
+

Update to ownership

+
+globalTags
+GlobalTagsUpdate +
+

Deprecated, use tags field instead +Update to global tags

+
+tags
+GlobalTagsUpdate +
+

Update to tags

+
+editableProperties
+DataFlowEditablePropertiesUpdate +
+

Update to editable properties

+
+ +## DataHubViewDefinitionInput + +Input required for creating a DataHub View Definition + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+entityTypes
+[EntityType!]! +
+

A set of entity types that the view applies for. If left empty, then ALL entities will be in scope.

+
+filter
+DataHubViewFilterInput! +
+

A set of filters to apply.

+
+ +## DataHubViewFilterInput + +Input required for creating a DataHub View Definition + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+operator
+LogicalOperator! +
+

The operator used to combine the filters.

+
+filters
+[FacetFilterInput!]! +
+

A set of filters combined via an operator. If left empty, then no filters will be applied.

+
+ +## DataJobEditablePropertiesUpdate + +Update to writable Data Job fields + +

Arguments

+ + + + + + + + + +
NameDescription
+description
+String! +
+

Writable description aka documentation for a Data Job

+
+ +## DataJobUpdateInput + +Arguments provided to update a Data Job aka Task Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownership
+OwnershipUpdate +
+

Update to ownership

+
+globalTags
+GlobalTagsUpdate +
+

Deprecated, use tags field instead +Update to global tags

+
+tags
+GlobalTagsUpdate +
+

Update to tags

+
+editableProperties
+DataJobEditablePropertiesUpdate +
+

Update to editable properties

+
+ +## DataProductEntitiesInput + +Input required to fetch the entities inside of a Data Product. + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+query
+String +
+

Optional query filter for particular entities inside the Data Product

+
+start
+Int +
+

The offset of the result set

+
+count
+Int +
+

The number of entities to include in result set

+
+filters
+[FacetFilterInput!] +
+

Optional Facet filters to apply to the result set

+
+ +## DatasetDeprecationUpdate + +An update for the deprecation information for a Metadata Entity + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+deprecated
+Boolean! +
+

Whether the dataset is deprecated

+
+decommissionTime
+Long +
+

The time the user plans to decommission this dataset

+
+note
+String! +
+

Additional information about the dataset deprecation plan

+
+ +## DatasetEditablePropertiesUpdate + +Update to writable Dataset fields + +

Arguments

+ + + + + + + + + +
NameDescription
+description
+String! +
+

Writable description aka documentation for a Dataset

+
+ +## DatasetUpdateInput + +Arguments provided to update a Dataset Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownership
+OwnershipUpdate +
+

Update to ownership

+
+deprecation
+DatasetDeprecationUpdate +
+

Update to deprecation status

+
+institutionalMemory
+InstitutionalMemoryUpdate +
+

Update to institutional memory, ie documentation

+
+globalTags
+GlobalTagsUpdate +
+

Deprecated, use tags field instead +Update to global tags

+
+tags
+GlobalTagsUpdate +
+

Update to tags

+
+editableSchemaMetadata
+EditableSchemaMetadataUpdate +
+

Update to editable schema metadata of the dataset

+
+editableProperties
+DatasetEditablePropertiesUpdate +
+

Update to editable properties

+
+ +## DescriptionUpdateInput + +Incubating. Updates the description of a resource. Currently supports DatasetField descriptions only + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+description
+String! +
+

The new description

+
+resourceUrn
+String! +
+

The primary key of the resource to attach the description to, eg dataset urn

+
+subResourceType
+SubResourceType +
+

An optional sub resource type

+
+subResource
+String +
+

A sub resource identifier, eg dataset field path

+
+ +## DomainEntitiesInput + +Input required to fetch the entities inside of a Domain. + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+query
+String +
+

Optional query filter for particular entities inside the domain

+
+start
+Int +
+

The offset of the result set

+
+count
+Int +
+

The number of entities to include in result set

+
+filters
+[FacetFilterInput!] +
+

Optional Facet filters to apply to the result set

+
+ +## EditableSchemaFieldInfoUpdate + +Update to writable schema field metadata + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+fieldPath
+String! +
+

Flattened name of a field identifying the field the editable info is applied to

+
+description
+String +
+

Edited description of the field

+
+globalTags
+GlobalTagsUpdate +
+

Tags associated with the field

+
+ +## EditableSchemaMetadataUpdate + +Update to editable schema metadata of the dataset + +

Arguments

+ + + + + + + + + +
NameDescription
+editableSchemaFieldInfo
+[EditableSchemaFieldInfoUpdate!]! +
+

Update to writable schema field metadata

+
+ +## EntityCountInput + +Input for the get entity counts endpoint + +

Arguments

+ + + + + + + + + +
NameDescription
+types
+[EntityType!] +
+ +
+ +## EntityRequestContext + +Context that defines an entity page requesting recommendations + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+type
+EntityType! +
+

Type of the entity being displayed

+
+urn
+String! +
+

Urn of the entity being displayed

+
+ +## FacetFilterInput + +Facet filters to apply to search results + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+field
+String! +
+

Name of field to filter by

+
+value
+String +
+
Deprecated: Prefer `values` for single elements
+ +

Value of the field to filter by. Deprecated in favor of values, which should accept a single element array for a +value

+
+values
+[String!] +
+

Values, one of which the intended field should match.

+
+negated
+Boolean +
+

If the filter should or should not be matched

+
+condition
+FilterOperator +
+

Condition for the values. If unset, assumed to be equality

+
+ +## FilterInput + +A set of filter criteria + +

Arguments

+ + + + + + + + + +
NameDescription
+and
+[FacetFilterInput!]! +
+

A list of conjunctive filters

+
+ +## GetAccessTokenInput + +Input required to fetch a new Access Token. + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+type
+AccessTokenType! +
+

The type of the Access Token.

+
+actorUrn
+String! +
+

The actor associated with the Access Token.

+
+duration
+AccessTokenDuration! +
+

The duration for which the Access Token is valid.

+
+ +## GetGrantedPrivilegesInput + +Input for getting granted privileges + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+actorUrn
+String! +
+

Urn of the actor

+
+resourceSpec
+ResourceSpec +
+

Spec to identify resource. If empty, gets privileges granted to the actor

+
+ +## GetInviteTokenInput + +Input provided when getting an invite token + +

Arguments

+ + + + + + + + + +
NameDescription
+roleUrn
+String +
+

The urn of the role to get the invite token for

+
+ +## GetQuickFiltersInput + +Input for getting Quick Filters + +

Arguments

+ + + + + + + + + +
NameDescription
+viewUrn
+String +
+

Optional - A View to apply when generating results

+
+ +## GetRootGlossaryEntitiesInput + +Input required when getting Business Glossary entities + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Glossary Entities in the returned result set

+
+ +## GetSchemaBlameInput + +Input for getting schema changes computed at a specific version. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+datasetUrn
+String! +
+

The dataset urn

+
+version
+String +
+

Changes after this version are not shown. If not provided, this is the latestVersion.

+
+ +## GetSchemaVersionListInput + +Input for getting list of schema versions. + +

Arguments

+ + + + + + + + + +
NameDescription
+datasetUrn
+String! +
+

The dataset urn

+
+ +## GetSecretValuesInput + +Input arguments for retrieving the plaintext values of a set of secrets + +

Arguments

+ + + + + + + + + +
NameDescription
+secrets
+[String!]! +
+

A list of secret names

+
+ +## GlobalTagsUpdate + +Deprecated, use addTag or removeTag mutation instead +Update to the Tags associated with a Metadata Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+tags
+[TagAssociationUpdate!] +
+

The new set of tags

+
+ +## InstitutionalMemoryMetadataUpdate + +An institutional memory to add to a Metadata Entity +TODO Add a USER or GROUP actor enum + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+url
+String! +
+

Link to a document or wiki page or another internal resource

+
+description
+String +
+

Description of the resource

+
+author
+String! +
+

The corp user urn of the author of the metadata

+
+createdAt
+Long +
+

The time at which this metadata was created

+
+ +## InstitutionalMemoryUpdate + +An update for the institutional memory information for a Metadata Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+elements
+[InstitutionalMemoryMetadataUpdate!]! +
+

The individual references in the institutional memory

+
+ +## LineageEdge + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+downstreamUrn
+String! +
+

Urn of the source entity. This urn is downstream of the upstreamUrn.

+
+upstreamUrn
+String! +
+

Urn of the destination entity. This urn is upstream of the downstreamUrn

+
+ +## LineageInput + +Input for the list lineage property of an Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+direction
+LineageDirection! +
+

The direction of the relationship, either incoming or outgoing from the source entity

+
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+separateSiblings
+Boolean +
+

Optional flag to not merge siblings in the response. They are merged by default.

+
+startTimeMillis
+Long +
+

An optional starting time to filter on

+
+endTimeMillis
+Long +
+

An optional ending time to filter on

+
+ +## ListAccessTokenInput + +Input arguments for listing access tokens + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+filters
+[FacetFilterInput!] +
+

Facet filters to apply to search results

+
+ +## ListDomainsInput + +Input required when listing DataHub Domains + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Domains to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## ListGlobalViewsInput + +Input provided when listing DataHub Global Views + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Views to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## ListGroupsInput + +Input required when listing DataHub Groups + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Groups to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## ListIngestionSourcesInput + +Input arguments for listing Ingestion Sources + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+query
+String +
+

An optional search query

+
+ +## ListMyViewsInput + +Input provided when listing DataHub Views + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Views to be returned in the result set

+
+query
+String +
+

Optional search query

+
+viewType
+DataHubViewType +
+

Optional - The type of View to filter for.

+
+ +## ListOwnershipTypesInput + +Input required for listing custom ownership types entities + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned, default is 0

+
+count
+Int +
+

The maximum number of Custom Ownership Types to be returned in the result set, default is 20

+
+query
+String +
+

Optional search query

+
+filters
+[FacetFilterInput!] +
+

Optional Facet filters to apply to the result set

+
+ +## ListPoliciesInput + +Input required when listing DataHub Access Policies + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Policies to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## ListPostsInput + +Input provided when listing existing posts + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Posts to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## ListQueriesInput + +Input required for listing query entities + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Queries to be returned in the result set

+
+query
+String +
+

A raw search query

+
+source
+QuerySource +
+

An optional source for the query

+
+datasetUrn
+String +
+

An optional Urn for the parent dataset that the query is associated with.

+
+ +## ListRecommendationsInput + +Input arguments for fetching UI recommendations + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+userUrn
+String! +
+

Urn of the actor requesting recommendations

+
+requestContext
+RecommendationRequestContext +
+

Context provided by the caller requesting recommendations

+
+limit
+Int +
+

Max number of modules to return

+
+ +## ListRolesInput + +Input provided when listing existing roles + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Roles to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## ListSecretsInput + +Input for listing DataHub Secrets + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+query
+String +
+

An optional search query

+
+ +## ListTestsInput + +Input required when listing DataHub Tests + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Tests to be returned in the result set

+
+query
+String +
+

Optional query string to match on

+
+ +## ListUsersInput + +Input required when listing DataHub Users + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set returned

+
+count
+Int +
+

The maximum number of Users to be returned in the result set

+
+query
+String +
+

Optional search query

+
+ +## MetadataAnalyticsInput + +Input to fetch metadata analytics charts + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+entityType
+EntityType +
+

Entity type to fetch analytics for (If empty, queries across all entities)

+
+domain
+String +
+

Urn of the domain to fetch analytics for (If empty or GLOBAL, queries across all domains)

+
+query
+String +
+

Search query to filter down result (If empty, does not apply any search query)

+
+ +## NotebookEditablePropertiesUpdate + +Update to writable Notebook fields + +

Arguments

+ + + + + + + + + +
NameDescription
+description
+String! +
+

Writable description aka documentation for a Notebook

+
+ +## NotebookUpdateInput + +Arguments provided to update a Notebook Entity + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+ownership
+OwnershipUpdate +
+

Update to ownership

+
+tags
+GlobalTagsUpdate +
+

Update to tags

+
+editableProperties
+NotebookEditablePropertiesUpdate +
+

Update to editable properties

+
+ +## OwnerInput + +Input provided when adding an owner to an asset + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+ownerUrn
+String! +
+

The primary key of the Owner to add or remove

+
+ownerEntityType
+OwnerEntityType! +
+

The owner type, either a user or group

+
+type
+OwnershipType +
+
Deprecated: No longer supported
+ +

The ownership type for the new owner. If none is provided, the NONE ownership type will be added. +Deprecated - Use ownershipTypeUrn field instead.

+
+ownershipTypeUrn
+String +
+

The urn of the ownership type entity.

+
+ +## OwnershipUpdate + +An update for the ownership information for a Metadata Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+owners
+[OwnerUpdate!]! +
+

The updated list of owners

+
+ +## OwnerUpdate + +An owner to add to a Metadata Entity +TODO Add a USER or GROUP actor enum + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+owner
+String! +
+

The owner URN, either a corpGroup or corpuser

+
+type
+OwnershipType +
+
Deprecated: No longer supported
+ +

The owner type. Deprecated - Use ownershipTypeUrn field instead.

+
+ownershipTypeUrn
+String +
+

The urn of the ownership type entity.

+
+ +## PolicyMatchCriterionInput + +Criterion to define relationship between field and values + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+field
+String! +
+

The name of the field that the criterion refers to +e.g. entity_type, entity_urn, domain

+
+values
+[String!]! +
+

Values. Matches criterion if any one of the values matches condition (OR-relationship)

+
+condition
+PolicyMatchCondition! +
+

The condition used to match the field against the values, e.g. EQUALS

+
+ +## PolicyMatchFilterInput + +Filter object that encodes a complex filter logic with OR + AND + +

Arguments

+ + + + + + + + + +
NameDescription
+criteria
+[PolicyMatchCriterionInput!] +
+

List of criteria to apply

+
+ +## PolicyUpdateInput + +Input provided when creating or updating an Access Policy + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+PolicyType! +
+

The Policy Type

+
+name
+String! +
+

The Policy name

+
+state
+PolicyState! +
+

The Policy state

+
+description
+String +
+

A Policy description

+
+resources
+ResourceFilterInput +
+

The set of resources that the Policy privileges apply to

+
+privileges
+[String!]! +
+

The set of privileges that the Policy grants

+
+actors
+ActorFilterInput! +
+

The set of actors that the Policy privileges are granted to

+
+ +## QueryStatementInput + +Input required for creating a Query Statement + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+value
+String! +
+

The query text

+
+language
+QueryLanguage! +
+

The query language

+
+ +## RecommendationRequestContext + +Context that defines the page requesting recommendations +i.e. for search pages, the query/filters. for entity pages, the entity urn and tab + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+scenario
+ScenarioType! +
+

Scenario in which the recommendations will be displayed

+
+searchRequestContext
+SearchRequestContext +
+

Additional context for defining the search page requesting recommendations

+
+entityRequestContext
+EntityRequestContext +
+

Additional context for defining the entity page requesting recommendations

+
+ +## RelatedTermsInput + +Input provided when adding Terms to an asset + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The Glossary Term urn to add or remove this relationship to/from

+
+termUrns
+[String!]! +
+

The primary key of the Glossary Term to add or remove

+
+relationshipType
+TermRelationshipType! +
+

The type of relationship we're adding or removing to/from for a Glossary Term

+
+ +## RelationshipsInput + +Input for the list relationships field of an Entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+types
+[String!]! +
+

The types of relationships to query, representing an OR

+
+direction
+RelationshipDirection! +
+

The direction of the relationship, either incoming or outgoing from the source entity

+
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+ +## RemoveGroupMembersInput + +Input required to remove members from an external DataHub group + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+groupUrn
+String! +
+

The group to remove members from

+
+userUrns
+[String!]! +
+

The members to remove from the group

+
+ +## RemoveLinkInput + +Input provided when removing the association between a Metadata Entity and a Link + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+linkUrl
+String! +
+

The url of the link to add or remove, which uniquely identifies the Link

+
+resourceUrn
+String! +
+

The urn of the resource or entity to attach the link to, for example a dataset urn

+
+ +## RemoveNativeGroupMembersInput + +Input required to remove members from a native DataHub group + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+groupUrn
+String! +
+

The group to remove members from

+
+userUrns
+[String!]! +
+

The members to remove from the group

+
+ +## RemoveOwnerInput + +Input provided when removing the association between a Metadata Entity and an user or group owner + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+ownerUrn
+String! +
+

The primary key of the Owner to add or remove

+
+ownershipTypeUrn
+String +
+

The ownership type to remove, optional. By default will remove regardless of ownership type.

+
+resourceUrn
+String! +
+

The urn of the resource or entity to attach or remove the owner from, for example a dataset urn

+
+ +## ReportOperationInput + +Input provided to report an asset operation + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the asset (e.g. dataset) to report the operation for

+
+operationType
+OperationType! +
+

The type of operation that was performed. Required

+
+customOperationType
+String +
+

A custom type of operation. Required if operation type is CUSTOM.

+
+sourceType
+OperationSourceType! +
+

The source or reporter of the operation

+
+customProperties
+[StringMapEntryInput!] +
+

A list of key-value parameters to include

+
+partition
+String +
+

An optional partition identifier

+
+numAffectedRows
+Long +
+

Optional: The number of affected rows

+
+timestampMillis
+Long +
+

Optional: Provide a timestamp associated with the operation. If not provided, one will be generated for you based +on the current time.

+
+ +## ResourceFilterInput + +Input required when creating or updating an Access Policies Determines which resources the Policy applies to + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+String +
+

The type of the resource the policy should apply to +Not required because in the future we want to support filtering by type OR by domain

+
+resources
+[String!] +
+

A list of specific resource urns to apply the filter to

+
+allResources
+Boolean +
+

Whether or not to apply the filter to all resources of the type

+
+filter
+PolicyMatchFilterInput +
+

An optional complex filter, composed of match criteria, defining which resources the policy applies to

+
+ +## ResourceRefInput + +Reference to a resource to apply an action to + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+resourceUrn
+String! +
+

The urn of the resource being referenced

+
+subResourceType
+SubResourceType +
+

An optional type of a sub resource to attach the Tag to

+
+subResource
+String +
+

An optional sub resource identifier to attach the Tag to

+
+ +## ResourceSpec + +Spec to identify resource + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+resourceType
+EntityType! +
+

Resource type

+
+resourceUrn
+String! +
+

Resource urn

+
+ +## RollbackIngestionInput + +Input for rolling back an ingestion execution + +

Arguments

+ + + + + + + + + +
NameDescription
+runId
+String! +
+

An ingestion run ID

+
+ +## ScrollAcrossEntitiesInput + +Input arguments for a full text search query across entities, specifying a starting pointer. Allows paging beyond 10k results + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+types
+[EntityType!] +
+

Entity types to be searched. If this is not provided, all entities will be searched.

+
+query
+String! +
+

The query string

+
+scrollId
+String +
+

The starting point of paginated results, an opaque ID the backend understands as a pointer

+
+keepAlive
+String +
+

The amount of time to keep the point in time snapshot alive, takes a time unit based string ex: 5m or 30s

+
+count
+Int +
+

The number of elements included in the results

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criteria for the filter. (OR operation to combine filters)

+
+viewUrn
+String +
+

Optional - A View to apply when generating results

+
+searchFlags
+SearchFlags +
+

Flags controlling search options

+
+ +## ScrollAcrossLineageInput + +Input arguments for a search query over the results of a multi-hop graph query, uses scroll API + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String +
+

Urn of the source node

+
+direction
+LineageDirection! +
+

The direction of the relationship, either incoming or outgoing from the source entity

+
+types
+[EntityType!] +
+

Entity types to be searched. If this is not provided, all entities will be searched.

+
+query
+String +
+

The query string

+
+scrollId
+String +
+

The starting point of paginated results, an opaque ID the backend understands as a pointer

+
+keepAlive
+String +
+

The amount of time to keep the point in time snapshot alive, takes a time unit based string ex: 5m or 30s

+
+count
+Int +
+

The number of elements included in the results

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criteria for the filter. (OR operation to combine filters)

+
+startTimeMillis
+Long +
+

An optional starting time to filter on

+
+endTimeMillis
+Long +
+

An optional ending time to filter on

+
+searchFlags
+SearchFlags +
+

Flags controlling search options

+
+ +## SearchAcrossEntitiesInput + +Input arguments for a full text search query across entities + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+types
+[EntityType!] +
+

Entity types to be searched. If this is not provided, all entities will be searched.

+
+query
+String! +
+

The query string

+
+start
+Int +
+

The starting point of paginated results

+
+count
+Int +
+

The number of elements included in the results

+
+filters
+[FacetFilterInput!] +
+
Deprecated: Use `orFilters`- they are more expressive
+ +

Deprecated in favor of the more expressive orFilters field +Facet filters to apply to search results. These will be 'AND'-ed together.

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criteria for the filter. (OR operation to combine filters)

+
+viewUrn
+String +
+

Optional - A View to apply when generating results

+
+searchFlags
+SearchFlags +
+

Flags controlling search options

+
+ +## SearchAcrossLineageInput + +Input arguments for a search query over the results of a multi-hop graph query + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String +
+

Urn of the source node

+
+direction
+LineageDirection! +
+

The direction of the relationship, either incoming or outgoing from the source entity

+
+types
+[EntityType!] +
+

Entity types to be searched. If this is not provided, all entities will be searched.

+
+query
+String +
+

The query string

+
+start
+Int +
+

The starting point of paginated results

+
+count
+Int +
+

The number of elements included in the results

+
+filters
+[FacetFilterInput!] +
+
Deprecated: Use `orFilters`- they are more expressive
+ +

Deprecated in favor of the more expressive orFilters field +Facet filters to apply to search results. These will be 'AND'-ed together.

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criteria for the filter. (OR operation to combine filters)

+
+startTimeMillis
+Long +
+

An optional starting time to filter on

+
+endTimeMillis
+Long +
+

An optional ending time to filter on

+
+searchFlags
+SearchFlags +
+

Flags controlling search options

+
+ +## SearchFlags + +Set of flags to control search behavior + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+skipCache
+Boolean +
+

Whether to skip cache

+
+maxAggValues
+Int +
+

The maximum number of values in a facet aggregation

+
+fulltext
+Boolean +
+

Structured or unstructured fulltext query

+
+skipHighlighting
+Boolean +
+

Whether to skip highlighting

+
+skipAggregates
+Boolean +
+

Whether to skip aggregates/facets

+
+ +## SearchInput + +Input arguments for a full text search query + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+EntityType! +
+

The Metadata Entity type to be searched against

+
+query
+String! +
+

The raw query string

+
+start
+Int +
+

The offset of the result set

+
+count
+Int +
+

The number of entities to include in result set

+
+filters
+[FacetFilterInput!] +
+
Deprecated: Use `orFilters`- they are more expressive
+ +

Deprecated in favor of the more expressive orFilters field +Facet filters to apply to search results. These will be 'AND'-ed together.

+
+orFilters
+[AndFilterInput!] +
+

A list of disjunctive criteria for the filter. (OR operation to combine filters)

+
+searchFlags
+SearchFlags +
+

Flags controlling search options

+
+ +## SearchRequestContext + +Context that defines a search page requesting recommendatinos + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+query
+String! +
+

Search query

+
+filters
+[FacetFilterInput!] +
+

Faceted filters applied to search results

+
+ +## StepStateInput + +The input required to update the state of a step + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+id
+String! +
+

The globally unique id for the step

+
+properties
+[StringMapEntryInput]! +
+

The new properties for the step

+
+ +## StringMapEntryInput + +String map entry input + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+key
+String! +
+

The key of the map entry

+
+value
+String +
+

The value of the map entry

+
+ +## TagAssociationInput + +Input provided when updating the association between a Metadata Entity and a Tag + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+tagUrn
+String! +
+

The primary key of the Tag to add or remove

+
+resourceUrn
+String! +
+

The target Metadata Entity to add or remove the Tag to

+
+subResourceType
+SubResourceType +
+

An optional type of a sub resource to attach the Tag to

+
+subResource
+String +
+

An optional sub resource identifier to attach the Tag to

+
+ +## TagAssociationUpdate + +Deprecated, use addTag or removeTag mutation instead +A tag update to be applied + +

Arguments

+ + + + + + + + + +
NameDescription
+tag
+TagUpdateInput! +
+

The tag being applied

+
+ +## TagUpdateInput + +Deprecated, use addTag or removeTag mutations instead +An update for a particular Tag entity + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Tag

+
+name
+String! +
+

The display name of a Tag

+
+description
+String +
+

Description of the tag

+
+ownership
+OwnershipUpdate +
+

Ownership metadata of the tag

+
+ +## TermAssociationInput + +Input provided when updating the association between a Metadata Entity and a Glossary Term + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+termUrn
+String! +
+

The primary key of the Glossary Term to add or remove

+
+resourceUrn
+String! +
+

The target Metadata Entity to add or remove the Glossary Term from

+
+subResourceType
+SubResourceType +
+

An optional type of a sub resource to attach the Glossary Term to

+
+subResource
+String +
+

An optional sub resource identifier to attach the Glossary Term to

+
+ +## TestDefinitionInput + +

Arguments

+ + + + + + + + + +
NameDescription
+json
+String +
+

The string representation of the Test

+
+ +## UpdateCorpUserViewsSettingsInput + +Input required to update a users settings. + +

Arguments

+ + + + + + + + + +
NameDescription
+defaultView
+String +
+

The URN of the View that serves as this user's personal default. +If not provided, any existing default view will be removed.

+
+ +## UpdateDataProductInput + +Input properties required for update a DataProduct + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+name
+String +
+

A display name for the DataProduct

+
+description
+String +
+

An optional description for the DataProduct

+
+ +## UpdateDeprecationInput + +Input provided when setting the Deprecation status for an Entity. + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the Entity to set deprecation for.

+
+deprecated
+Boolean! +
+

Whether the Entity is marked as deprecated.

+
+decommissionTime
+Long +
+

Optional - The time the user plans to decommission this entity

+
+note
+String +
+

Optional - Additional information about the entity deprecation plan

+
+ +## UpdateEmbedInput + +Input required to set or clear information related to rendering a Data Asset inside of DataHub. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The URN associated with the Data Asset to update. Only dataset, dashboard, and chart urns are currently supported.

+
+renderUrl
+String +
+

Set or clear a URL used to render an embedded asset.

+
+ +## UpdateGlobalViewsSettingsInput + +Input required to update Global View Settings. + +

Arguments

+ + + + + + + + + +
NameDescription
+defaultView
+String +
+

The URN of the View that serves as the Global, or organization-wide, default. +If this field is not provided, the existing Global Default will be cleared.

+
+ +## UpdateIngestionSourceConfigInput + +Input parameters for creating / updating an Ingestion Source + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+recipe
+String! +
+

A JSON-encoded recipe

+
+version
+String +
+

The version of DataHub Ingestion Framework to use when executing the recipe.

+
+executorId
+String! +
+

The id of the executor to use for executing the recipe

+
+debugMode
+Boolean +
+

Whether or not to run ingestion in debug mode

+
+ +## UpdateIngestionSourceInput + +Input arguments for creating / updating an Ingestion Source + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

A name associated with the ingestion source

+
+type
+String! +
+

The type of the source itself, e.g. mysql, bigquery, bigquery-usage. Should match the recipe.

+
+description
+String +
+

An optional description associated with the ingestion source

+
+schedule
+UpdateIngestionSourceScheduleInput +
+

An optional schedule for the ingestion source. If not provided, the source is only available for run on-demand.

+
+config
+UpdateIngestionSourceConfigInput! +
+

A set of type-specific ingestion source configurations

+
+ +## UpdateIngestionSourceScheduleInput + +Input arguments for creating / updating the schedule of an Ingestion Source + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+interval
+String! +
+

The cron-formatted interval describing when the job should be executed

+
+timezone
+String! +
+

The name of the timezone in which the cron interval should be scheduled (e.g. America/Los_Angeles)

+
+ +## UpdateLineageInput + +Input required in order to upsert lineage edges + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+edgesToAdd
+[LineageEdge]! +
+

New lineage edges to upsert

+
+edgesToRemove
+[LineageEdge]! +
+

Lineage edges to remove. Takes precedence over edgesToAdd - so edges existing in both edgesToAdd +and edgesToRemove will be removed.

+
+ +## UpdateMediaInput + +Input provided for filling in a post content + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+type
+MediaType! +
+

The type of media

+
+location
+String! +
+

The location of the media (a URL)

+
+ +## UpdateNameInput + +Input for updating the name of an entity + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The new name

+
+urn
+String! +
+

The primary key of the resource to update the name for

+
+ +## UpdateOwnershipTypeInput + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+name
+String +
+

The name of the Custom Ownership Type

+
+description
+String +
+

The description of the Custom Ownership Type

+
+ +## UpdateParentNodeInput + +Input for updating the parent node of a resource. Currently only GlossaryNodes and GlossaryTerms have parentNodes. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+parentNode
+String +
+

The new parent node urn. If parentNode is null, this will remove the parent from this entity

+
+resourceUrn
+String! +
+

The primary key of the resource to update the parent node for

+
+ +## UpdatePostContentInput + +Input provided for filling in a post content + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+contentType
+PostContentType! +
+

The type of post content

+
+title
+String! +
+

The title of the post

+
+description
+String +
+

Optional content of the post

+
+link
+String +
+

Optional link that the post is associated with

+
+media
+UpdateMediaInput +
+

Optional media contained in the post

+
+ +## UpdateQueryInput + +Input required for updating an existing Query. Requires the 'Edit Queries' privilege for all query subjects. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+properties
+UpdateQueryPropertiesInput +
+

Properties about the Query

+
+subjects
+[UpdateQuerySubjectInput!] +
+

Subjects for the query

+
+ +## UpdateQueryPropertiesInput + +Input properties required for creating a Query. Any non-null fields will be updated if provided. + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+

An optional display name for the Query

+
+description
+String +
+

An optional description for the Query

+
+statement
+QueryStatementInput +
+

The Query contents

+
+ +## UpdateQuerySubjectInput + +Input required for creating a Query. For now, only datasets are supported. + +

Arguments

+ + + + + + + + + +
NameDescription
+datasetUrn
+String! +
+

The urn of the dataset that is the subject of the query

+
+ +## UpdateTestInput + +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the Test

+
+category
+String! +
+

The category of the Test (user defined)

+
+description
+String +
+

Description of the test

+
+definition
+TestDefinitionInput! +
+

The test definition

+
+ +## UpdateUserSettingInput + +Input for updating a user setting + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+name
+UserSetting! +
+

The name of the setting

+
+value
+Boolean! +
+

The new value of the setting

+
+ +## UpdateViewInput + +Input provided when updating a DataHub View + +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+

The name of the View

+
+description
+String +
+

An optional description of the View

+
+definition
+DataHubViewDefinitionInput +
+

The view definition itself

+
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/interfaces.md b/docs-website/versioned_docs/version-0.10.4/graphql/interfaces.md new file mode 100644 index 0000000000000..d627c8945bda4 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/interfaces.md @@ -0,0 +1,291 @@ +--- +id: interfaces +title: Interfaces +slug: interfaces +sidebar_position: 4 +--- + +## Aspect + +A versioned aspect, or single group of related metadata, associated with an Entity and having a unique version + +

Implemented by

+ +- [SchemaMetadata](/docs/graphql/objects#schemametadata) + +
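As a quick, hedged illustration (not part of the generated reference): versioned aspects such as SchemaMetadata are usually read through the entity that owns them, with version 0 denoting the latest version. The `dataset` root query, the `schemaMetadata(version: ...)` field, and the urn below are assumptions for this sketch.

```graphql
# Minimal sketch, assuming a `dataset` root query exposing the SchemaMetadata aspect.
# Version 0 is the latest version; the urn is a placeholder.
query latestSchemaVersion {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    schemaMetadata(version: 0) {
      version
    }
  }
}
```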

Fields

+ + + + + + + + + +
NameDescription
+version
+Long +
+

The version of the aspect, where zero represents the latest version

+
+ +## BrowsableEntity + +A Metadata Entity which is browsable, or has browse paths. + +

Implemented by

+ +- [Dataset](/docs/graphql/objects#dataset) +- [Notebook](/docs/graphql/objects#notebook) +- [Dashboard](/docs/graphql/objects#dashboard) +- [Chart](/docs/graphql/objects#chart) +- [DataFlow](/docs/graphql/objects#dataflow) +- [DataJob](/docs/graphql/objects#datajob) +- [MLModel](/docs/graphql/objects#mlmodel) +- [MLModelGroup](/docs/graphql/objects#mlmodelgroup) +- [MLFeatureTable](/docs/graphql/objects#mlfeaturetable) + +
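For example, the browse paths of a browsable entity can be selected directly on that entity. A minimal sketch, assuming a `dataset` root query, a `path` field on BrowsePath, and a placeholder urn:

```graphql
# Minimal sketch: read the browse paths of a dataset. The urn is a placeholder.
query datasetBrowsePaths {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    browsePaths {
      path
    }
  }
}
```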

Fields

+ + + + + + + + + +
NameDescription
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to an entity. If no Browse Paths have been generated before, this will be null.

+
+ +## Entity + +A top level Metadata Entity + +

Implemented by

+ +- [AccessTokenMetadata](/docs/graphql/objects#accesstokenmetadata) +- [Dataset](/docs/graphql/objects#dataset) +- [Role](/docs/graphql/objects#role) +- [VersionedDataset](/docs/graphql/objects#versioneddataset) +- [GlossaryTerm](/docs/graphql/objects#glossaryterm) +- [GlossaryNode](/docs/graphql/objects#glossarynode) +- [DataPlatform](/docs/graphql/objects#dataplatform) +- [DataPlatformInstance](/docs/graphql/objects#dataplatforminstance) +- [Container](/docs/graphql/objects#container) +- [SchemaFieldEntity](/docs/graphql/objects#schemafieldentity) +- [CorpUser](/docs/graphql/objects#corpuser) +- [CorpGroup](/docs/graphql/objects#corpgroup) +- [Tag](/docs/graphql/objects#tag) +- [Notebook](/docs/graphql/objects#notebook) +- [Dashboard](/docs/graphql/objects#dashboard) +- [Chart](/docs/graphql/objects#chart) +- [DataFlow](/docs/graphql/objects#dataflow) +- [DataJob](/docs/graphql/objects#datajob) +- [DataProcessInstance](/docs/graphql/objects#dataprocessinstance) +- [Assertion](/docs/graphql/objects#assertion) +- [DataHubPolicy](/docs/graphql/objects#datahubpolicy) +- [MLModel](/docs/graphql/objects#mlmodel) +- [MLModelGroup](/docs/graphql/objects#mlmodelgroup) +- [MLFeature](/docs/graphql/objects#mlfeature) +- [MLPrimaryKey](/docs/graphql/objects#mlprimarykey) +- [MLFeatureTable](/docs/graphql/objects#mlfeaturetable) +- [Domain](/docs/graphql/objects#domain) +- [DataHubRole](/docs/graphql/objects#datahubrole) +- [Post](/docs/graphql/objects#post) +- [DataHubView](/docs/graphql/objects#datahubview) +- [QueryEntity](/docs/graphql/objects#queryentity) +- [DataProduct](/docs/graphql/objects#dataproduct) +- [OwnershipTypeEntity](/docs/graphql/objects#ownershiptypeentity) +- [Test](/docs/graphql/objects#test) +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) + +
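Because every type above implements Entity, the shared fields listed below can be selected on any of them. A minimal sketch, assuming a `dataset` root query and a placeholder urn:

```graphql
# Minimal sketch: select the common Entity fields on a dataset.
# The urn is a placeholder.
query entityBasics {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    urn
    type
  }
}
```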

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

A primary key of the Metadata Entity

+
+type
+EntityType! +
+

A standard Entity Type

+
+relationships
+EntityRelationshipsResult +
+

List of relationships between the source Entity and some destination entities with given types

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## EntityWithRelationships + +Deprecated, use relationships field instead + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Implemented by

+ +- [Dataset](/docs/graphql/objects#dataset) +- [Dashboard](/docs/graphql/objects#dashboard) +- [Chart](/docs/graphql/objects#chart) +- [DataFlow](/docs/graphql/objects#dataflow) +- [DataJob](/docs/graphql/objects#datajob) +- [DataProcessInstance](/docs/graphql/objects#dataprocessinstance) +- [Assertion](/docs/graphql/objects#assertion) +- [MLModel](/docs/graphql/objects#mlmodel) +- [MLModelGroup](/docs/graphql/objects#mlmodelgroup) +- [MLFeature](/docs/graphql/objects#mlfeature) +- [MLPrimaryKey](/docs/graphql/objects#mlprimarykey) +- [MLFeatureTable](/docs/graphql/objects#mlfeaturetable) + +
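A minimal sketch of the lineage field described below, using the LineageInput documented earlier in this reference. The `dataset` root query, the DOWNSTREAM direction value, the shape of EntityLineageResult, and the urn are assumptions here:

```graphql
# Minimal sketch: fetch one page of downstream lineage for a dataset.
# The urn is a placeholder; total/relationships assume the EntityLineageResult shape.
query downstreamLineage {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    lineage(input: { direction: DOWNSTREAM, start: 0, count: 10 }) {
      total
      relationships {
        entity {
          urn
          type
        }
      }
    }
  }
}
```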

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

A primary key associated with the Metadata Entity

+
+type
+EntityType! +
+

A standard Entity Type

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+ +## TimeSeriesAspect + +A time series aspect, or a group of related metadata associated with an Entity and corresponding to a particular timestamp + +

Implemented by

+ +- [DataProcessRunEvent](/docs/graphql/objects#dataprocessrunevent) +- [DashboardUsageMetrics](/docs/graphql/objects#dashboardusagemetrics) +- [DatasetProfile](/docs/graphql/objects#datasetprofile) +- [AssertionRunEvent](/docs/graphql/objects#assertionrunevent) +- [Operation](/docs/graphql/objects#operation) + +
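Time series aspects are typically fetched per entity over a time window. A minimal sketch, in which the `datasetProfiles` field, its `limit` argument, and the urn are assumptions; `timestampMillis` comes from this interface:

```graphql
# Minimal sketch: read recent dataset profiles and their timestamps.
# The urn is a placeholder and the datasetProfiles field is an assumption.
query recentProfiles {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    datasetProfiles(limit: 5) {
      timestampMillis
    }
  }
}
```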

Fields

+ + + + + + + + + +
NameDescription
+timestampMillis
+Long! +
+

The timestamp associated with the time series aspect in milliseconds

+
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/mutations.md b/docs-website/versioned_docs/version-0.10.4/graphql/mutations.md new file mode 100644 index 0000000000000..6491678ac14fe --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/mutations.md @@ -0,0 +1,2399 @@ +--- +id: mutations +title: Mutations +slug: mutations +sidebar_position: 2 +--- + +## acceptRole + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Accept role using invite token + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AcceptRoleInput! +
+ +
+ +## addGroupMembers + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add members to a group + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AddGroupMembersInput! +
+ +
+ +## addLink + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add a link, or institutional memory, from a particular Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AddLinkInput! +
+ +
+ +## addOwner + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add an owner to a particular Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AddOwnerInput! +
+ +
+ +## addOwners + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add multiple owners to a particular Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AddOwnersInput! +
+ +
+ +## addRelatedTerms + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add multiple related Terms to a Glossary Term to establish relationships + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelatedTermsInput! +
+ +
+ +## addTag + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add a tag to a particular Entity or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+TagAssociationInput! +
+ +
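A hedged sketch of how `addTag` might be called; the urns are illustrative, and `TagAssociationInput` is assumed to accept `tagUrn` and `resourceUrn` (with optional sub-resource fields for column-level tags):

```graphql
# Illustrative sketch: attach an existing tag to a dataset.
mutation {
  addTag(
    input: {
      tagUrn: "urn:li:tag:Legacy"
      resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
    }
  )
}
```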
+ +## addTags + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add multiple tags to a particular Entity or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AddTagsInput! +
+ +
+ +## addTerm + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add a glossary term to a particular Entity or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+TermAssociationInput! +
+ +
+ +## addTerms + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add multiple glossary terms to a particular Entity or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AddTermsInput! +
+ +
+ +## batchAddOwners + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add owners to multiple Entities + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchAddOwnersInput! +
+ +
+ +## batchAddTags + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add tags to multiple Entities or subresources + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchAddTagsInput! +
+ +
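A sketch of a `batchAddTags` call, assuming `BatchAddTagsInput` carries a list of `tagUrns` and a list of `resources` (each with a `resourceUrn`); all urns are illustrative:

```graphql
# Illustrative sketch: apply two tags to two datasets in one request.
mutation {
  batchAddTags(
    input: {
      tagUrns: ["urn:li:tag:Legacy", "urn:li:tag:Tier1"]
      resources: [
        { resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,TableA,PROD)" }
        { resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,TableB,PROD)" }
      ]
    }
  )
}
```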
+ +## batchAddTerms + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Add glossary terms to multiple Entities or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchAddTermsInput! +
+ +
+ +## batchAssignRole + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Batch assign roles to users + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchAssignRoleInput! +
+ +
+ +## batchRemoveOwners + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove owners from multiple Entities + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchRemoveOwnersInput! +
+ +
+ +## batchRemoveTags + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove tags from multiple Entities or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchRemoveTagsInput! +
+ +
+ +## batchRemoveTerms + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove glossary terms from multiple Entities or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchRemoveTermsInput! +
+ +
+ +## batchSetDataProduct + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Batch set or unset a DataProduct to a list of entities + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchSetDataProductInput! +
+

Input for batch setting data product

+
+ +## batchSetDomain + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Set domain for multiple Entities + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchSetDomainInput! +
+ +
+ +## batchUpdateDeprecation + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Updates the deprecation status for a batch of assets. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchUpdateDeprecationInput! +
+ +
+ +## batchUpdateSoftDeleted + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Updates the soft deleted status for a batch of assets + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchUpdateSoftDeletedInput! +
+ +
+ +## batchUpdateStepStates + +**Type:** [BatchUpdateStepStatesResult!](/docs/graphql/objects#batchupdatestepstatesresult) + +Batch update the state for a set of steps. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchUpdateStepStatesInput! +
+ +
+ +## cancelIngestionExecutionRequest + +**Type:** [String](/docs/graphql/scalars#string) + +Cancel a running execution request, provided the urn of the original execution request + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CancelIngestionExecutionRequestInput! +
+ +
+ +## createAccessToken + +**Type:** [AccessToken](/docs/graphql/objects#accesstoken) + +Generates an access token for DataHub APIs for a particular user & of a particular type + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateAccessTokenInput! +
+ +
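A sketch of generating a token with `createAccessToken`, assuming `CreateAccessTokenInput` exposes `type`, `actorUrn`, `duration`, and `name`, and that `PERSONAL` and `ONE_MONTH` are valid enum values (treat these as assumptions, not guarantees):

```graphql
# Illustrative sketch: mint a personal access token and read back its metadata.
# The input field names and enum values are assumptions.
mutation {
  createAccessToken(
    input: {
      type: PERSONAL
      actorUrn: "urn:li:corpuser:datahub"
      duration: ONE_MONTH
      name: "automation-token"
    }
  ) {
    accessToken
    metadata {
      id
      expiresAt
    }
  }
}
```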
+ +## createDataProduct + +**Type:** [DataProduct](/docs/graphql/objects#dataproduct) + +Create a new Data Product + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateDataProductInput! +
+

Inputs required to create a new DataProduct.

+
+ +## createDomain + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new Domain. Returns the urn of the newly created Domain. Requires the 'Create Domains' or 'Manage Domains' Platform Privilege. If a Domain with the provided ID already exists, +it will be overwritten. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateDomainInput! +
+ +
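A sketch of a `createDomain` call, assuming `CreateDomainInput` accepts an optional `id` plus `name` and `description`; the values are illustrative:

```graphql
# Illustrative sketch: create a Domain; the mutation returns the new Domain urn.
mutation {
  createDomain(
    input: { id: "marketing", name: "Marketing", description: "Assets owned by the Marketing team" }
  )
}
```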
+ +## createGlossaryNode + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new GlossaryNode. Returns the urn of the newly created GlossaryNode. If a node with the provided ID already exists, it will be overwritten. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateGlossaryEntityInput! +
+ +
+ +## createGlossaryTerm + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new GlossaryTerm. Returns the urn of the newly created GlossaryTerm. If a term with the provided ID already exists, it will be overwritten. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateGlossaryEntityInput! +
+ +
+ +## createGroup + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new group. Returns the urn of the newly created group. Requires the Manage Users & Groups Platform Privilege + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateGroupInput! +
+ +
+ +## createIngestionExecutionRequest + +**Type:** [String](/docs/graphql/scalars#string) + +Create a request to execute an ingestion job +input: Input required for creating an ingestion execution request + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateIngestionExecutionRequestInput! +
+ +
+ +## createIngestionSource + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new ingestion source + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateIngestionSourceInput! +
+ +
+ +## createInviteToken + +**Type:** [InviteToken](/docs/graphql/objects#invitetoken) + +Create invite token + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateInviteTokenInput! +
+ +
+ +## createNativeUserResetToken + +**Type:** [ResetToken](/docs/graphql/objects#resettoken) + +Generates a token that can be shared with existing native users to reset their credentials. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateNativeUserResetTokenInput! +
+ +
+ +## createOwnershipType + +**Type:** [OwnershipTypeEntity](/docs/graphql/objects#ownershiptypeentity) + +Create a Custom Ownership Type. This requires the 'Manage Ownership Types' Metadata Privilege. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateOwnershipTypeInput! +
+

Inputs required to create a new Custom Ownership Type.

+
+ +## createPolicy + +**Type:** [String](/docs/graphql/scalars#string) + +Create a policy and returns the resulting urn + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+PolicyUpdateInput! +
+ +
+ +## createPost + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Create a post + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreatePostInput! +
+ +
+ +## createQuery + +**Type:** [QueryEntity](/docs/graphql/objects#queryentity) + +Create a new Query + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateQueryInput! +
+

Inputs required to create a new Query.

+
+ +## createSecret + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new Secret + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateSecretInput! +
+ +
+ +## createTag + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new tag. Requires the 'Manage Tags' or 'Create Tags' Platform Privilege. If a Tag with the provided ID already exists, +it will be overwritten. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateTagInput! +
+

Inputs required to create a new Tag.

+
+ +## createTest + +**Type:** [String](/docs/graphql/scalars#string) + +Create a new test + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateTestInput! +
+ +
+ +## createTestConnectionRequest + +**Type:** [String](/docs/graphql/scalars#string) + +Create a request to execute a test ingestion connection job +input: Input required for creating a test connection request + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateTestConnectionRequestInput! +
+ +
+ +## createView + +**Type:** [DataHubView](/docs/graphql/objects#datahubview) + +Create a new DataHub View (Saved Filter) + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+CreateViewInput! +
+

Input required to create a new DataHub View

+
+ +## deleteAssertion + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove an assertion associated with an entity. Requires the 'Edit Assertions' privilege on the entity. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+

The assertion to remove

+
+ +## deleteDataProduct + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a DataProduct by urn. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the data product to remove.

+
+ +## deleteDomain + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a Domain + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the Domain to delete

+
+ +## deleteGlossaryEntity + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove a glossary entity (GlossaryTerm or GlossaryNode). Return boolean whether it was successful or not. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## deleteIngestionSource + +**Type:** [String](/docs/graphql/scalars#string) + +Delete an existing ingestion source + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## deleteOwnershipType + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a Custom Ownership Type by urn. This requires the 'Manage Ownership Types' Metadata Privilege. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the Custom Ownership Type to remove.

+
+deleteReferences
+Boolean +
+ +
+ +## deletePolicy + +**Type:** [String](/docs/graphql/scalars#string) + +Remove an existing policy and returns the policy urn + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## deletePost + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a post + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## deleteQuery + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a Query by urn. This requires the 'Edit Queries' Metadata Privilege. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the query to remove.

+
+ +## deleteSecret + +**Type:** [String](/docs/graphql/scalars#string) + +Delete a Secret + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## deleteTag + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a Tag + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the Tag to delete

+
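For example, `deleteTag` takes only the tag urn (the urn shown is illustrative):

```graphql
# Illustrative sketch: delete a tag by urn.
mutation {
  deleteTag(urn: "urn:li:tag:Legacy")
}
```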
+ +## deleteTest + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete an existing test - note that this will NOT delete dangling pointers until the next execution of the test. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## deleteView + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Delete a DataHub View (Saved Filter) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the View to delete

+
+ +## removeGroup + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove a group. Requires Manage Users & Groups Platform Privilege + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## removeGroupMembers + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove members from a group + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RemoveGroupMembersInput! +
+ +
+ +## removeLink + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove a link, or institutional memory, from a particular Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RemoveLinkInput! +
+ +
+ +## removeOwner + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove an owner from a particular Entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RemoveOwnerInput! +
+ +
+ +## removeRelatedTerms + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove multiple related Terms for a Glossary Term + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelatedTermsInput! +
+ +
+ +## removeTag + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove a tag from a particular Entity or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+TagAssociationInput! +
+ +
+ +## removeTerm + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove a glossary term from a particular Entity or subresource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+TermAssociationInput! +
+ +
+ +## removeUser + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Remove a user. Requires Manage Users & Groups Platform Privilege + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## reportOperation + +**Type:** [String](/docs/graphql/scalars#string) + +Report a new operation for an asset + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ReportOperationInput! +
+

Input required to report an operation

+
+ +## revokeAccessToken + +**Type:** [Boolean!](/docs/graphql/scalars#boolean) + +Revokes access tokens. + +

Arguments

+ + + + + + + + + +
NameDescription
+tokenId
+String! +
+ +
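For example, `revokeAccessToken` takes only the token id (the placeholder must be replaced with a real id):

```graphql
# Illustrative sketch: revoke a previously issued token by its id.
mutation {
  revokeAccessToken(tokenId: "<token-id>")
}
```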
+ +## rollbackIngestion + +**Type:** [String](/docs/graphql/scalars#string) + +Rollback a specific ingestion execution run based on its runId + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RollbackIngestionInput! +
+ +
+ +## setDomain + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Sets the Domain for a Dataset, Chart, Dashboard, Data Flow (Pipeline), or Data Job (Task). Returns true if the Domain was successfully added, or already exists. Requires the Edit Domains privilege for the Entity. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+entityUrn
+String! +
+ +
+domainUrn
+String! +
+ +
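Both `setDomain` arguments are plain urns, so a call looks like the following sketch (urns illustrative):

```graphql
# Illustrative sketch: place a dataset inside the "marketing" Domain.
mutation {
  setDomain(
    entityUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
    domainUrn: "urn:li:domain:marketing"
  )
}
```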
+ +## setTagColor + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Set the hex color associated with an existing Tag + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+colorHex
+String! +
+ +
+ +## unsetDomain + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Removes the Domain from a Dataset, Chart, Dashboard, Data Flow (Pipeline), or Data Job (Task). Returns true if the Domain was successfully removed, or was already removed. Requires the Edit Domains privilege for an asset. + +

Arguments

+ + + + + + + + + +
NameDescription
+entityUrn
+String! +
+ +
+ +## updateChart + +**Type:** [Chart](/docs/graphql/objects#chart) + +Update the metadata about a particular Chart + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+ChartUpdateInput! +
+ +
+ +## updateCorpGroupProperties + +**Type:** [CorpGroup](/docs/graphql/objects#corpgroup) + +Update a particular Corp Group's editable properties + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+CorpGroupUpdateInput! +
+ +
+ +## updateCorpUserProperties + +**Type:** [CorpUser](/docs/graphql/objects#corpuser) + +Update a particular Corp User's editable properties + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+CorpUserUpdateInput! +
+ +
+ +## updateCorpUserViewsSettings + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Update the View-related settings for a user. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateCorpUserViewsSettingsInput! +
+ +
+ +## updateDashboard + +**Type:** [Dashboard](/docs/graphql/objects#dashboard) + +Update the metadata about a particular Dashboard + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+DashboardUpdateInput! +
+ +
+ +## updateDataFlow + +**Type:** [DataFlow](/docs/graphql/objects#dataflow) + +Update the metadata about a particular Data Flow (Pipeline) + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+DataFlowUpdateInput! +
+ +
+ +## updateDataJob + +**Type:** [DataJob](/docs/graphql/objects#datajob) + +Update the metadata about a particular Data Job (Task) + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+DataJobUpdateInput! +
+ +
+ +## updateDataProduct + +**Type:** [DataProduct](/docs/graphql/objects#dataproduct) + +Update a Data Product + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn identifier for the Data Product to update.

+
+input
+UpdateDataProductInput! +
+

Inputs required to update an existing DataProduct.

+
+ +## updateDataset + +**Type:** [Dataset](/docs/graphql/objects#dataset) + +Update the metadata about a particular Dataset + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+DatasetUpdateInput! +
+ +
+ +## updateDatasets + +**Type:** [[Dataset]](/docs/graphql/objects#dataset) + +Update the metadata about a batch of Datasets + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+[BatchDatasetUpdateInput!]! +
+ +
+ +## updateDeprecation + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Sets the Deprecation status for a Metadata Entity. Requires the Edit Deprecation status privilege for an entity. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateDeprecationInput! +
+

Input required to set deprecation for an Entity.

+
+ +## updateDescription + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Incubating. Updates the description of a resource. Currently only supports Dataset Schema Fields, Containers + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+DescriptionUpdateInput! +
+ +
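A sketch of updating a column description with `updateDescription`, assuming `DescriptionUpdateInput` takes `description`, `resourceUrn`, and optional `subResource`/`subResourceType` fields, and that `DATASET_FIELD` is a valid sub-resource type (assumptions, not guarantees):

```graphql
# Illustrative sketch: document a single schema field on a dataset.
# Field names and the DATASET_FIELD enum value are assumptions.
mutation {
  updateDescription(
    input: {
      description: "Unique identifier of the customer"
      resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
      subResourceType: DATASET_FIELD
      subResource: "customer_id"
    }
  )
}
```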
+ +## updateEmbed + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Update the Embed information for a Dataset, Dashboard, or Chart. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateEmbedInput! +
+ +
+ +## updateGlobalViewsSettings + +**Type:** [Boolean!](/docs/graphql/scalars#boolean) + +Update the global settings related to the Views feature. +Requires the 'Manage Global Views' Platform Privilege. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateGlobalViewsSettingsInput! +
+ +
+ +## updateIngestionSource + +**Type:** [String](/docs/graphql/scalars#string) + +Update an existing ingestion source + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+UpdateIngestionSourceInput! +
+ +
+ +## updateLineage + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Update lineage for an entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateLineageInput! +
+ +
+ +## updateName + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Updates the name of the entity. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateNameInput! +
+ +
+ +## updateNotebook + +**Type:** [Notebook](/docs/graphql/objects#notebook) + +Update the metadata about a particular Notebook + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+NotebookUpdateInput! +
+ +
+ +## updateOwnershipType + +**Type:** [OwnershipTypeEntity](/docs/graphql/objects#ownershiptypeentity) + +Update an existing Custom Ownership Type. This requires the 'Manage Ownership Types' Metadata Privilege. + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn identifier for the custom ownership type to update.

+
+input
+UpdateOwnershipTypeInput! +
+

Inputs required to update an existing Custom Ownership Type.

+
+ +## updateParentNode + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Updates the parent node of a resource. Currently only GlossaryNodes and GlossaryTerms have parentNodes. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateParentNodeInput! +
+ +
+ +## updatePolicy + +**Type:** [String](/docs/graphql/scalars#string) + +Update an existing policy and returns the resulting urn + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+PolicyUpdateInput! +
+ +
+ +## updateQuery + +**Type:** [QueryEntity](/docs/graphql/objects#queryentity) + +Update an existing Query + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn identifier for the query to update.

+
+input
+UpdateQueryInput! +
+

Inputs required to update a Query.

+
+ +## updateTag + +**Type:** [Tag](/docs/graphql/objects#tag) + +Update the information about a particular Entity Tag + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+TagUpdateInput! +
+ +
+ +## updateTest + +**Type:** [String](/docs/graphql/scalars#string) + +Update an existing test + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+UpdateTestInput! +
+ +
+ +## updateUserSetting + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Update a user setting + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+UpdateUserSettingInput! +
+ +
+ +## updateUserStatus + +**Type:** [String](/docs/graphql/scalars#string) + +Change the status of a user. Requires Manage Users & Groups Platform Privilege + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+status
+CorpUserStatus! +
+ +
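For example, assuming `SUSPENDED` is a valid `CorpUserStatus` value (the urn is illustrative):

```graphql
# Illustrative sketch: suspend a user account.
mutation {
  updateUserStatus(urn: "urn:li:corpuser:jdoe", status: SUSPENDED)
}
```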
+ +## updateView + +**Type:** [DataHubView](/docs/graphql/objects#datahubview) + +Update an existing DataHub View (Saved Filter) + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the View to update

+
+input
+UpdateViewInput! +
+

Input required to update an existing DataHub View

+
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/objects.md b/docs-website/versioned_docs/version-0.10.4/graphql/objects.md new file mode 100644 index 0000000000000..24a32e632834e --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/objects.md @@ -0,0 +1,18789 @@ +--- +id: objects +title: Objects +slug: objects +sidebar_position: 3 +--- + +## Access + +

Fields

+ + + + + + + + + +
NameDescription
+roles
+[RoleAssociation!] +
+ +
+ +## AccessToken + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+accessToken
+String! +
+

The access token itself

+
+metadata
+AccessTokenMetadata +
+

Metadata about the generated token

+
+ +## AccessTokenMetadata + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the access token

+
+type
+EntityType! +
+

The standard Entity Type

+
+id
+String! +
+

The unique identifier of the token.

+
+name
+String! +
+

The name of the token, if it exists.

+
+description
+String +
+

The description of the token if defined.

+
+actorUrn
+String! +
+

The actor associated with the Access Token.

+
+ownerUrn
+String! +
+

The actor who created the Access Token.

+
+createdAt
+Long! +
+

The time when token was generated at.

+
+expiresAt
+Long +
+

Time when token will be expired.

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## Actor + +

Fields

+ + + + + + + + + +
NameDescription
+users
+[RoleUser!] +
+

List of users for which the role is provisioned

+
+ +## ActorFilter + +The actors that a DataHub Access Policy applies to + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+users
+[String!] +
+

A disjunctive set of users to apply the policy to

+
+groups
+[String!] +
+

A disjunctive set of groups to apply the policy to

+
+roles
+[String!] +
+

A disjunctive set of roles to apply the policy to

+
+resourceOwners
+Boolean! +
+

Whether the filter should return TRUE for owners of a particular resource +Only applies to policies of type METADATA, which have a resource associated with them

+
+resourceOwnersTypes
+[String!] +
+

Set of OwnershipTypes to apply the policy to (if resourceOwners field is set to True)

+
+resolvedOwnershipTypes
+[OwnershipTypeEntity!] +
+

Set of OwnershipTypes to apply the policy to (if resourceOwners field is set to True), resolved.

+
+allUsers
+Boolean! +
+

Whether the filter should apply to all users

+
+allGroups
+Boolean! +
+

Whether the filter should apply to all groups

+
+resolvedUsers
+[CorpUser!] +
+

The list of users on the Policy, resolved.

+
+resolvedGroups
+[CorpGroup!] +
+

The list of groups on the Policy, resolved.

+
+resolvedRoles
+[DataHubRole!] +
+

The list of roles on the Policy, resolved.

+
+ +## AggregateResults + +Results returned from aggregateAcrossEntities + +

Fields

+ + + + + + + + + +
NameDescription
+facets
+[FacetMetadata!] +
+

Candidate facet aggregations used for search filtering

+
+ +## AggregationMetadata + +Information about the aggregation that can be used for filtering, included the field value and number of results + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+value
+String! +
+

A particular value of a facet field

+
+count
+Long! +
+

The number of search results containing the value

+
+entity
+Entity +
+

Entity corresponding to the facet field

+
+ +## AnalyticsChartGroup + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+groupId
+String! +
+ +
+title
+String! +
+ +
+charts
+[AnalyticsChart!]! +
+ +
+ +## AnalyticsConfig + +Configurations related to the Analytics Feature + +

Fields

+ + + + + + + + + +
NameDescription
+enabled
+Boolean! +
+

Whether the Analytics feature is enabled and should be displayed

+
+ +## AppConfig + +Config loaded at application boot time +This configuration dictates the behavior of the UI, such as which features are enabled or disabled + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+appVersion
+String +
+

App version

+
+authConfig
+AuthConfig! +
+

Auth-related configurations

+
+analyticsConfig
+AnalyticsConfig! +
+

Configurations related to the Analytics Feature

+
+policiesConfig
+PoliciesConfig! +
+

Configurations related to the Policies Feature

+
+identityManagementConfig
+IdentityManagementConfig! +
+

Configurations related to the User & Group management

+
+managedIngestionConfig
+ManagedIngestionConfig! +
+

Configurations related to UI-based ingestion

+
+lineageConfig
+LineageConfig! +
+

Configurations related to Lineage

+
+visualConfig
+VisualConfig! +
+

Configurations related to visual appearance, allows styling the UI without rebuilding the bundle

+
+telemetryConfig
+TelemetryConfig! +
+

Configurations related to tracking users in the app

+
+testsConfig
+TestsConfig! +
+

Configurations related to DataHub tests

+
+viewsConfig
+ViewsConfig! +
+

Configurations related to DataHub Views

+
+featureFlags
+FeatureFlagsConfig! +
+

Feature flags telling the UI whether a feature is enabled or not

+
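A sketch of reading this configuration, assuming it is exposed through a top-level `appConfig` query field:

```graphql
# Illustrative sketch: check which features the UI should enable.
query {
  appConfig {
    appVersion
    authConfig {
      tokenAuthEnabled
    }
    analyticsConfig {
      enabled
    }
  }
}
```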
+ +## AspectRenderSpec + +Details for the frontend on how the raw aspect should be rendered + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+displayType
+String +
+

Format the aspect should be displayed in for the UI. Powered by the renderSpec annotation on the aspect model

+
+displayName
+String +
+

Name to refer to the aspect type by for the UI. Powered by the renderSpec annotation on the aspect model

+
+key
+String +
+

Field in the aspect payload to index into for rendering.

+
+ +## Assertion + +An assertion represents a programmatic validation, check, or test performed periodically against another Entity. + +

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Assertion

+
+type
+EntityType! +
+

The standard Entity Type

+
+platform
+DataPlatform! +
+

Standardized platform urn where the assertion is evaluated

+
+info
+AssertionInfo +
+

Details about assertion

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+runEvents
+AssertionRunEventsResult +
+

Lifecycle events detailing individual runs of this assertion. If startTimeMillis & endTimeMillis are not provided, the most +recent events will be returned.

+ +

Arguments

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+status
+AssertionRunStatus +
+ +
+startTimeMillis
+Long +
+ +
+endTimeMillis
+Long +
+ +
+filter
+FilterInput +
+ +
+limit
+Int +
+ +
+ +
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+ +## AssertionInfo + +Type of assertion. Assertion types can evolve to span Datasets, Flows (Pipelines), Models, Features etc. + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+type
+AssertionType! +
+

Top-level type of the assertion.

+
+datasetAssertion
+DatasetAssertionInfo +
+

Dataset-specific assertion information

+
+ +## AssertionResult + +The result of evaluating an assertion. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+AssertionResultType! +
+

The final result, e.g. either SUCCESS or FAILURE.

+
+rowCount
+Long +
+

Number of rows for evaluated batch

+
+missingCount
+Long +
+

Number of rows with missing value for evaluated batch

+
+unexpectedCount
+Long +
+

Number of rows with unexpected value for evaluated batch

+
+actualAggValue
+Float +
+

Observed aggregate value for evaluated batch

+
+externalUrl
+String +
+

URL where full results are available

+
+nativeResults
+[StringMapEntry!] +
+

Native results / properties of evaluation

+
+ +## AssertionRunEvent + +An event representing an event in the assertion evaluation lifecycle. + +

Implements

+ +- [TimeSeriesAspect](/docs/graphql/interfaces#timeseriesaspect) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+timestampMillis
+Long! +
+

The time at which the assertion was evaluated

+
+assertionUrn
+String! +
+

Urn of assertion which is evaluated

+
+asserteeUrn
+String! +
+

Urn of entity on which the assertion is applicable

+
+runId
+String! +
+

Native (platform-specific) identifier for this run

+
+status
+AssertionRunStatus! +
+

The status of the assertion run as per this timeseries event.

+
+batchSpec
+BatchSpec +
+

Specification of the batch which this run is evaluating

+
+partitionSpec
+PartitionSpec +
+

Information about the partition that was evaluated

+
+runtimeContext
+[StringMapEntry!] +
+

Runtime parameters of evaluation

+
+result
+AssertionResult +
+

Results of assertion, present if the status is COMPLETE

+
+ +## AssertionRunEventsResult + +Result returned when fetching run events for an assertion. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+total
+Int! +
+

The total number of run events returned

+
+failed
+Int! +
+

The number of failed run events

+
+succeeded
+Int! +
+

The number of succeeded run events

+
+runEvents
+[AssertionRunEvent!]! +
+

The run events themselves

+
+ +## AssertionStdParameter + +Parameter for AssertionStdOperator. + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+value
+String! +
+

The parameter value

+
+type
+AssertionStdParameterType! +
+

The type of the parameter

+
+ +## AssertionStdParameters + +Parameters for AssertionStdOperators + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+value
+AssertionStdParameter +
+

The value parameter of an assertion

+
+maxValue
+AssertionStdParameter +
+

The maxValue parameter of an assertion

+
+minValue
+AssertionStdParameter +
+

The minValue parameter of an assertion

+
+ +## AuditStamp + +A time stamp along with an optional actor + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+time
+Long! +
+

When the audited action took place

+
+actor
+String +
+

Who performed the audited action

+
+ +## AuthConfig + +Configurations related to auth + +

Fields

+ + + + + + + + + +
NameDescription
+tokenAuthEnabled
+Boolean! +
+

Whether token-based auth is enabled.

+
+ +## AuthenticatedUser + +Information about the currently authenticated user + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+corpUser
+CorpUser! +
+

The user information associated with the authenticated user, including properties used in rendering the profile

+
+platformPrivileges
+PlatformPrivileges! +
+

The privileges assigned to the currently authenticated user, which dictates which parts of the UI they should be able to use

+
+ +## AutoCompleteMultipleResults + +The results returned on a multi entity autocomplete query + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+query
+String! +
+

The raw query string

+
+suggestions
+[AutoCompleteResultForEntity!]! +
+

The autocompletion suggestions

+
+ +## AutoCompleteResultForEntity + +An individual auto complete result specific to an individual Metadata Entity Type + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+type
+EntityType! +
+

Entity type

+
+suggestions
+[String!]! +
+

The autocompletion results for specified entity type

+
+entities
+[Entity!]! +
+

A list of entities to render in autocomplete

+
+ +## AutoCompleteResults + +The results returned on a single entity autocomplete query + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+query
+String! +
+

The query string

+
+suggestions
+[String!]! +
+

The autocompletion results

+
+entities
+[Entity!]! +
+

A list of entities to render in autocomplete

+
+ +## BarChart + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+title
+String! +
+ +
+bars
+[NamedBar!]! +
+ +
+ +## BarSegment + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+label
+String! +
+ +
+value
+Int! +
+ +
+ +## BaseData + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+dataset
+String! +
+

Dataset used for the Training or Evaluation of the MLModel

+
+motivation
+String +
+

Motivation to pick these datasets

+
+preProcessing
+[String!] +
+

Details of Data Preprocessing

+
+ +## BatchGetStepStatesResult + +Result returned when fetching step state + +

Fields

+ + + + + + + + + +
NameDescription
+results
+[StepStateResult!]! +
+

The step states

+
+ +## BatchSpec + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+nativeBatchId
+String +
+

The native identifier as specified by the system operating on the batch.

+
+query
+String +
+

A query that identifies a batch of data

+
+limit
+Int +
+

Any limit to the number of rows in the batch, if applied

+
+customProperties
+[StringMapEntry!] +
+

Custom properties of the Batch

+
+ +## BatchUpdateStepStatesResult + +Result returned when fetching step state + +

Fields

+ + + + + + + + + +
NameDescription
+results
+[UpdateStepStateResult!]! +
+

Results for each step

+
+ +## BooleanBox + +

Fields

+ + + + + + + + + +
NameDescription
+booleanValue
+Boolean! +
+ +
+ +## BrowsePath + +A hierarchical entity path + +

Fields

+ + + + + + + + + +
NameDescription
+path
+[String!]! +
+

The components of the browse path

+
+ +## BrowsePathEntry + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The path name of a group of browse results

+
+entity
+Entity +
+

An optional entity associated with this browse entry. This will usually be a container entity. +If this entity is not populated, the name must be used.

+
+ +## BrowsePathV2 + +A hierarchical entity path V2 + +

Fields

+ + + + + + + + + +
NameDescription
+path
+[BrowsePathEntry!]! +
+

The components of the browse path

+
+ +## BrowseResultGroup + +A group of Entities under a given browse path + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The path name of a group of browse results

+
+count
+Long! +
+

The number of entities within the group

+
+ +## BrowseResultGroupV2 + +A group of Entities under a given browse path + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The path name of a group of browse results

+
+entity
+Entity +
+

An optional entity associated with this browse group. This will usually be a container entity. +If this entity is not populated, the name must be used.

+
+count
+Long! +
+

The number of entities within the group

+
+hasSubGroups
+Boolean! +
+

Whether or not there are any more groups underneath this group

+
+ +## BrowseResultMetadata + +Metadata about the Browse Paths response + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+path
+[String!]! +
+

The provided path

+
+totalNumEntities
+Long! +
+

The total number of entities under the provided browse path

+
+ +## BrowseResults + +The results of a browse path traversal query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+entities
+[Entity!]! +
+

The browse results

+
+groups
+[BrowseResultGroup!]! +
+

The groups present at the provided browse path

+
+start
+Int! +
+

The starting point of paginated results

+
+count
+Int! +
+

The number of elements included in the results

+
+total
+Int! +
+

The total number of browse results under the path with filters applied

+
+metadata
+BrowseResultMetadata! +
+

Metadata containing resulting browse groups

+
+ +## BrowseResultsV2 + +The results of a browse path V2 traversal query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+groups
+[BrowseResultGroupV2!]! +
+

The groups present at the provided browse path V2

+
+start
+Int! +
+

The starting point of paginated results

+
+count
+Int! +
+

The number of groups included in the results

+
+total
+Int! +
+

The total number of browse groups under the path with filters applied

+
+metadata
+BrowseResultMetadata! +
+

Metadata containing resulting browse groups

+
+ +## CaveatDetails + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+needsFurtherTesting
+Boolean +
+

Did the results suggest any further testing

+
+caveatDescription
+String +
+

Caveat Description

+
+groupsNotRepresented
+[String!] +
+

Relevant groups that were not represented in the evaluation dataset

+
+ +## CaveatsAndRecommendations + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+caveats
+CaveatDetails +
+

Caveats on using this MLModel

+
+recommendations
+String +
+

Recommendations on where this MLModel should be used

+
+idealDatasetCharacteristics
+[String!] +
+

Ideal characteristics of an evaluation dataset for this MLModel

+
+ +## Cell + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+value
+String! +
+ +
+entity
+Entity +
+ +
+linkParams
+LinkParams +
+ +
+ +## ChangeAuditStamps + +Captures information about who created/last modified/deleted the entity and when + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+created
+AuditStamp! +
+

An AuditStamp corresponding to the creation

+
+lastModified
+AuditStamp! +
+

An AuditStamp corresponding to the modification

+
+deleted
+AuditStamp +
+

An optional AuditStamp corresponding to the deletion

+
+ +## Chart + +A Chart Metadata Entity + +

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Chart

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+container
+Container +
+

The parent container in which the entity resides

+
+parentContainers
+ParentContainersResult +
+

Recursively get the lineage of containers for this entity

+
+tool
+String! +
+

The chart tool name +Note that this field will soon be deprecated in favor of a unified notion of Data Platform

+
+chartId
+String! +
+

An id unique within the charting tool

+
+properties
+ChartProperties +
+

Additional read only properties about the Chart

+
+editableProperties
+ChartEditableProperties +
+

Additional read write properties about the Chart

+
+query
+ChartQuery +
+

Info about the query which is used to render the chart

+
+ownership
+Ownership +
+

Ownership metadata of the chart

+
+status
+Status +
+

Status metadata of the chart

+
+deprecation
+Deprecation +
+

The deprecation status of the chart

+
+embed
+Embed +
+

Embed information about the Chart

+
+tags
+GlobalTags +
+

The tags associated with the chart

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the dashboard

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the dashboard

+
+domain
+DomainAssociation +
+

The Domain associated with the Chart

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+statsSummary
+ChartStatsSummary +
+

Not yet implemented.

+

Experimental - Summary operational & usage statistics about a Chart

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the chart. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+info
+ChartInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional read only information about the chart

+
+editableInfo
+ChartEditableProperties +
+
Deprecated: No longer supported
+ +

Deprecated, use editableProperties field instead +Additional read write information about the Chart

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags instead +The structured tags associated with the chart

+
+platform
+DataPlatform! +
+

Standardized platform urn where the chart is defined

+
+inputFields
+InputFields +
+

Input fields to power the chart

+
+privileges
+EntityPrivileges +
+

Privileges given to a user relevant to this entity

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
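A sketch of fetching a Chart, assuming a top-level `chart` query field; the urn is illustrative:

```graphql
# Illustrative sketch: read basic metadata for a chart.
query {
  chart(urn: "urn:li:chart:(looker,sample_chart_id)") {
    tool
    properties {
      name
      description
      externalUrl
    }
    platform {
      urn
    }
  }
}
```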
+ +## ChartCell + +A Notebook cell which contains chart as content + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+cellTitle
+String! +
+

Title of the cell

+
+cellId
+String! +
+

Unique id for the cell.

+
+changeAuditStamps
+ChangeAuditStamps +
+

Captures information about who created/last modified/deleted this TextCell and when

+
+ +## ChartEditableProperties + +Chart properties that are editable via the UI This represents logical metadata, +as opposed to technical metadata + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

Description of the Chart

+
+ +## ChartInfo + +Deprecated, use ChartProperties instead +Additional read only information about the chart + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the chart

+
+description
+String +
+

Description of the chart

+
+inputs
+[Dataset!] +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship Consumes instead +Data sources for the chart

+
+externalUrl
+String +
+

Native platform URL of the chart

+
+type
+ChartType +
+

The type of the chart

+
+access
+AccessLevel +
+

Access level for the chart

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+lastRefreshed
+Long +
+

The time when this chart last refreshed

+
+created
+AuditStamp! +
+

An AuditStamp corresponding to the creation of this chart

+
+lastModified
+AuditStamp! +
+

An AuditStamp corresponding to the modification of this chart

+
+deleted
+AuditStamp +
+

An optional AuditStamp corresponding to the deletion of this chart

+
+ +## ChartProperties + +Additional read only properties about the chart + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the chart

+
+description
+String +
+

Description of the chart

+
+externalUrl
+String +
+

Native platform URL of the chart

+
+type
+ChartType +
+

The type of the chart

+
+access
+AccessLevel +
+

Access level for the chart

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+lastRefreshed
+Long +
+

The time when this chart last refreshed

+
+created
+AuditStamp! +
+

An AuditStamp corresponding to the creation of this chart

+
+lastModified
+AuditStamp! +
+

An AuditStamp corresponding to the modification of this chart

+
+deleted
+AuditStamp +
+

An optional AuditStamp corresponding to the deletion of this chart

+
+ +## ChartQuery + +The query that was used to populate a Chart + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+rawQuery
+String! +
+

Raw query to build a chart from input datasets

+
+type
+ChartQueryType! +
+

The type of the chart query

+
+ +## ChartStatsSummary + +Experimental - subject to change. A summary of usage metrics about a Chart. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+viewCount
+Int +
+

The total view count for the chart

+
+viewCountLast30Days
+Int +
+

The view count in the last 30 days

+
+uniqueUserCountLast30Days
+Int +
+

The unique user count in the past 30 days

+
+topUsersLast30Days
+[CorpUser!] +
+

The top users in the past 30 days

+
+ +## Container + +A container of other Metadata Entities + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the container

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+platform
+DataPlatform! +
+

Standardized platform.

+
+container
+Container +
+

Fetch an Entity Container by primary key (urn)

+
+parentContainers
+ParentContainersResult +
+

Recursively get the lineage of containers for this entity

+
+properties
+ContainerProperties +
+

Read-only properties that originate in the source data platform

+
+editableProperties
+ContainerEditableProperties +
+

Read-write properties that originate in DataHub

+
+ownership
+Ownership +
+

Ownership metadata of the dataset

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the dataset

+
+tags
+GlobalTags +
+

Tags used for searching dataset

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the dataset

+
+subTypes
+SubTypes +
+

Sub types of the container, e.g. "Database" etc

+
+domain
+DomainAssociation +
+

The Domain associated with the Dataset

+
+deprecation
+Deprecation +
+

The deprecation status of the container

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+entities
+SearchResults +
+

Children entities inside of the Container

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ContainerEntitiesInput +
+ +
+ +
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+status
+Status +
+

Status metadata of the container

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
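A sketch of listing a container's children, assuming a top-level `container` query field and that `ContainerEntitiesInput` supports `start`/`count` paging; the urn and the result shape selected here are assumptions:

```graphql
# Illustrative sketch: page through the entities that live inside a container.
query {
  container(urn: "urn:li:container:sample_database") {
    properties {
      name
    }
    entities(input: { start: 0, count: 10 }) {
      total
      searchResults {
        entity {
          urn
          type
        }
      }
    }
  }
}
```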
+ +## ContainerEditableProperties + +Read-write properties that originate in DataHub + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

DataHub description of the Container

+
+ +## ContainerProperties + +Read-only properties that originate in the source data platform + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the Container

+
+description
+String +
+

System description of the Container

+
+customProperties
+[CustomPropertiesEntry!] +
+

Custom properties of the Container

+
+externalUrl
+String +
+

Native platform URL of the Container

+
+qualifiedName
+String +
+

Fully-qualified name of the Container

+
+ +## ContentParams + +Params about the recommended content + +

Fields

+ + + + + + + + + +
NameDescription
+count
+Long +
+

Number of entities corresponding to the recommended content

+
+ +## CorpGroup + +A DataHub Group entity, which represents a Person on the Metadata Entity Graph + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the group

+
+type
+EntityType! +
+

A standard Entity Type

+
+name
+String! +
+

Group name eg wherehows dev, ask_metadata

+
+ownership
+Ownership +
+

Ownership metadata of the Corp Group

+
+properties
+CorpGroupProperties +
+

Additional read only properties about the group

+
+editableProperties
+CorpGroupEditableProperties +
+

Additional read write properties about the group

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+origin
+Origin +
+

Origin info about this group.

+
+info
+CorpGroupInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional read only info about the group

+
+ +## CorpGroupEditableProperties + +Additional read write properties about a group + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+

DataHub description of the group

+
+slack
+String +
+

Slack handle for the group

+
+email
+String +
+

Email address for the group

+
+ +## CorpGroupInfo + +Deprecated, use CorpUserProperties instead +Additional read only info about a group + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+displayName
+String +
+

The name to display when rendering the group

+
+description
+String +
+

The description provided for the group

+
+email
+String +
+

email of this group

+
+admins
+[CorpUser!] +
+
Deprecated: No longer supported
+ +

Deprecated, do not use +owners of this group

+
+members
+[CorpUser!] +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship IsMemberOfGroup instead +List of ldap urn in this group

+
+groups
+[String!] +
+
Deprecated: No longer supported
+ +

Deprecated, do not use +List of groups urns in this group

+
+ +## CorpGroupProperties + +Additional read only properties about a group + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+displayName
+String +
+

display name of this group

+
+description
+String +
+

The description provided for the group

+
+email
+String +
+

email of this group

+
+slack
+String +
+

Slack handle for the group

+
+ +## CorpUser + +A DataHub User entity, which represents a Person on the Metadata Entity Graph + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the user

+
+type
+EntityType! +
+

The standard Entity Type

+
+username
+String! +
+

A username associated with the user +This uniquely identifies the user within DataHub

+
+properties
+CorpUserProperties +
+

Additional read only properties about the corp user

+
+editableProperties
+CorpUserEditableProperties +
+

Read write properties about the corp user

+
+status
+CorpUserStatus +
+

The status of the user

+
+tags
+GlobalTags +
+

The tags associated with the user

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+isNativeUser
+Boolean +
+

Whether or not this user is a native DataHub user

+
+info
+CorpUserInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional read only info about the corp user

+
+editableInfo
+CorpUserEditableInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use editableProperties field instead +Read write info about the corp user

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use the tags field instead +The structured tags associated with the user

+
+settings
+CorpUserSettings +
+

Settings that a user can customize through the datahub ui

+
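A sketch of looking up a user, assuming a top-level `corpUser` query field; the urn is illustrative:

```graphql
# Illustrative sketch: fetch a user's profile and editable properties.
query {
  corpUser(urn: "urn:li:corpuser:datahub") {
    username
    properties {
      displayName
      email
    }
    editableProperties {
      slack
      pictureLink
    }
  }
}
```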
+ +## CorpUserAppearanceSettings + +Settings that control look and feel of the DataHub UI for the user + +

Fields

+ + + + + + + + + +
NameDescription
+showSimplifiedHomepage
+Boolean +
+

Flag whether the user should see a homepage with only datasets, charts & dashboards. Intended for users +who have fewer operational use cases for the datahub tool.

+
+ +## CorpUserEditableInfo + +Deprecated, use CorpUserEditableProperties instead +Additional read write info about a user + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+displayName
+String +
+

Display name to show on DataHub

+
+title
+String +
+

Title to show on DataHub

+
+aboutMe
+String +
+

About me section of the user

+
+teams
+[String!] +
+

Teams that the user belongs to

+
+skills
+[String!] +
+

Skills that the user possesses

+
+pictureLink
+String +
+

A URL which points to a picture which the user wants to set as a profile photo

+
+ +## CorpUserEditableProperties + +Additional read write properties about a user + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+displayName
+String +
+

Display name to show on DataHub

+
+title
+String +
+

Title to show on DataHub

+
+aboutMe
+String +
+

About me section of the user

+
+teams
+[String!] +
+

Teams that the user belongs to

+
+skills
+[String!] +
+

Skills that the user possesses

+
+pictureLink
+String +
+

A URL which points to a picture which the user wants to set as a profile photo

+
+slack
+String +
+

The slack handle of the user

+
+phone
+String +
+

Phone number for the user

+
+email
+String +
+

Email address for the user

+
+ +## CorpUserInfo + +Deprecated, use CorpUserProperties instead +Additional read only info about a user + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+active
+Boolean! +
+

Whether the user is active

+
+displayName
+String +
+

Display name of the user

+
+email
+String +
+

Email address of the user

+
+title
+String +
+

Title of the user

+
+manager
+CorpUser +
+

Direct manager of the user

+
+departmentId
+Long +
+

Department id the user belongs to

+
+departmentName
+String +
+

Department name the user belongs to

+
+firstName
+String +
+

first name of the user

+
+lastName
+String +
+

last name of the user

+
+fullName
+String +
+

Common name of this user, format is firstName plus lastName

+
+countryCode
+String +
+

Two-letter uppercase country code

+
+customProperties
+[CustomPropertiesEntry!] +
+

Custom properties sourced from LDAP

+
+ +## CorpUserProperties + +Additional read only properties about a user + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+active
+Boolean! +
+

Whether the user is active

+
+displayName
+String +
+

Display name of the user

+
+email
+String +
+

Email address of the user

+
+title
+String +
+

Title of the user

+
+manager
+CorpUser +
+

Direct manager of the user

+
+departmentId
+Long +
+

Department id the user belongs to

+
+departmentName
+String +
+

Department name the user belongs to

+
+firstName
+String +
+

first name of the user

+
+lastName
+String +
+

last name of the user

+
+fullName
+String +
+

Common name of this user, format is firstName plus lastName

+
+countryCode
+String +
+

Two-letter uppercase country code

+
+customProperties
+[CustomPropertiesEntry!] +
+

Custom properties sourced from LDAP

+
+ +## CorpUserSettings + +Settings that a user can customize through the datahub ui + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+appearance
+CorpUserAppearanceSettings +
+

Settings that control look and feel of the DataHub UI for the user

+
+views
+CorpUserViewsSettings +
+

Settings related to the DataHub Views feature

+
+ +## CorpUserViewsSettings + +Settings related to the Views feature of DataHub. + +

Fields

+ + + + + + + + + +
NameDescription
+defaultView
+DataHubView +
+

The default view for the User.

+
+ +## Cost + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+costType
+CostType! +
+

Type of Cost Code

+
+costValue
+CostValue! +
+

Code to which the cost of this entity should be attributed, i.e. an organizational cost ID

+
+ +## CostValue + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+costId
+Float +
+

Organizational Cost ID

+
+costCode
+String +
+

Organizational Cost Code

+
+ +## CustomPropertiesEntry + +An entry in a custom properties map represented as a tuple + +
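Custom properties show up on many entity types in this reference as a list of these tuples. The selection sketch below assumes the `dataset(urn: ...)` root query field and an illustrative urn; the `customProperties` shape itself is exactly the tuple documented below.

```graphql
# Sketch: the dataset(urn: ...) root field and the urn are assumptions;
# each customProperties entry is the key/value/associatedUrn tuple below.
query datasetCustomProperties {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    properties {
      customProperties {
        key
        value
        associatedUrn
      }
    }
  }
}
```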

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+key
+String! +
+

The key of the map entry

+
+value
+String +
+

The value of the map entry

+
+associatedUrn
+String! +
+

The urn of the entity this property came from for tracking purposes e.g. when sibling nodes are merged together

+
+ +## Dashboard + +A Dashboard Metadata Entity + +
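A minimal query sketch for this entity, assuming a root-level `dashboard(urn: ...)` query field (documented with the other root queries) and an illustrative urn; every selected field appears in the tables below.

```graphql
# Sketch: dashboard(urn: ...) and the urn value are assumptions.
query getDashboard {
  dashboard(urn: "urn:li:dashboard:(looker,dashboards.1)") {
    urn
    tool
    dashboardId
    properties {
      name
      description
      externalUrl
    }
    platform {
      urn
      name
    }
    statsSummary {
      viewCount
      uniqueUserCountLast30Days
    }
  }
}
```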

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Dashboard

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+container
+Container +
+

The parent container in which the entity resides

+
+parentContainers
+ParentContainersResult +
+

Recursively get the lineage of containers for this entity

+
+tool
+String! +
+

The dashboard tool name +Note that this will soon be deprecated in favor of a standardized notion of Data Platform

+
+dashboardId
+String! +
+

An id unique within the dashboard tool

+
+properties
+DashboardProperties +
+

Additional read only properties about the dashboard

+
+editableProperties
+DashboardEditableProperties +
+

Additional read write properties about the dashboard

+
+ownership
+Ownership +
+

Ownership metadata of the dashboard

+
+status
+Status +
+

Status metadata of the dashboard

+
+embed
+Embed +
+

Embed information about the Dashboard

+
+deprecation
+Deprecation +
+

The deprecation status of the dashboard

+
+tags
+GlobalTags +
+

The tags associated with the dashboard

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the dashboard

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the dashboard

+
+domain
+DomainAssociation +
+

The Domain associated with the Dashboard

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the dashboard. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+usageStats
+DashboardUsageQueryResult +
+

Experimental (Subject to breaking change) -- Statistics about how this Dashboard is used

+ +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+startTimeMillis
+Long +
+ +
+endTimeMillis
+Long +
+ +
+limit
+Int +
+ +
+ +
+statsSummary
+DashboardStatsSummary +
+

Experimental - Summary operational & usage statistics about a Dashboard

+
+info
+DashboardInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional read only information about the dashboard

+
+editableInfo
+DashboardEditableProperties +
+
Deprecated: No longer supported
+ +

Deprecated, use editableProperties instead +Additional read write properties about the Dashboard

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags field instead +The structured tags associated with the dashboard

+
+platform
+DataPlatform! +
+

Standardized platform urn where the dashboard is defined

+
+inputFields
+InputFields +
+

Input fields that power all the charts in the dashboard

+
+subTypes
+SubTypes +
+

Sub Types of the dashboard

+
+privileges
+EntityPrivileges +
+

Privileges given to a user relevant to this entity

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## DashboardEditableProperties + +Dashboard properties that are editable via the UI This represents logical metadata, +as opposed to technical metadata + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

Description of the Dashboard

+
+ +## DashboardInfo + +Deprecated, use DashboardProperties instead +Additional read only info about a Dashboard + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the dashboard

+
+description
+String +
+

Description of the dashboard

+
+charts
+[Chart!]! +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship Contains instead +Charts that comprise the dashboard

+
+externalUrl
+String +
+

Native platform URL of the dashboard

+
+access
+AccessLevel +
+

Access level for the dashboard +Note that this will soon be deprecated for low usage

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+lastRefreshed
+Long +
+

The time when this dashboard was last refreshed

+
+created
+AuditStamp! +
+

An AuditStamp corresponding to the creation of this dashboard

+
+lastModified
+AuditStamp! +
+

An AuditStamp corresponding to the modification of this dashboard

+
+deleted
+AuditStamp +
+

An optional AuditStamp corresponding to the deletion of this dashboard

+
+ +## DashboardProperties + +Additional read only properties about a Dashboard + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the dashboard

+
+description
+String +
+

Description of the dashboard

+
+externalUrl
+String +
+

Native platform URL of the dashboard

+
+access
+AccessLevel +
+

Access level for the dashboard +Note that this will soon be deprecated for low usage

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+lastRefreshed
+Long +
+

The time when this dashboard was last refreshed

+
+created
+AuditStamp! +
+

An AuditStamp corresponding to the creation of this dashboard

+
+lastModified
+AuditStamp! +
+

An AuditStamp corresponding to the modification of this dashboard

+
+deleted
+AuditStamp +
+

An optional AuditStamp corresponding to the deletion of this dashboard

+
+ +## DashboardStatsSummary + +Experimental - subject to change. A summary of usage metrics about a Dashboard. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+viewCount
+Int +
+

The total view count for the dashboard

+
+viewCountLast30Days
+Int +
+

The view count in the last 30 days

+
+uniqueUserCountLast30Days
+Int +
+

The unique user count in the past 30 days

+
+topUsersLast30Days
+[CorpUser!] +
+

The top users in the past 30 days

+
+ +## DashboardUsageAggregation + +An aggregation of Dashboard usage statistics + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+bucket
+Long +
+

The time window start time

+
+duration
+WindowDuration +
+

The time window span

+
+resource
+String +
+

The resource urn associated with the usage information, eg a Dashboard urn

+
+metrics
+DashboardUsageAggregationMetrics +
+

The rolled up usage metrics

+
+ +## DashboardUsageAggregationMetrics + +Rolled up metrics about Dashboard usage over time + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+uniqueUserCount
+Int +
+

The unique number of dashboard users within the time range

+
+viewsCount
+Int +
+

The total number of dashboard views within the time range

+
+executionsCount
+Int +
+

The total number of dashboard executions within the time range

+
+ +## DashboardUsageMetrics + +A set of absolute dashboard usage metrics + +

Implements

+ +- [TimeSeriesAspect](/docs/graphql/interfaces#timeseriesaspect) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+timestampMillis
+Long! +
+

The time at which the metrics were reported

+
+favoritesCount
+Int +
+

The total number of times the dashboard has been favorited (arguably a popularity metric rather than a usage metric)

+
+viewsCount
+Int +
+

The total number of dashboard views

+
+executionsCount
+Int +
+

The total number of dashboard executions

+
+lastViewed
+Long +
+

The time when this dashboard was last viewed

+
+ +## DashboardUsageQueryResult + +The result of a dashboard usage query + +
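This type is returned by the `usageStats` field on `Dashboard` documented above. A hedged sketch of how it might be queried, using variables for the `Long` timestamps; the `dashboard(urn: ...)` root field is an assumption, while the arguments and selected fields come from the tables on this page.

```graphql
# Sketch: dashboard(urn: ...) is assumed; startTimeMillis / endTimeMillis / limit
# are the usageStats arguments listed in the Dashboard section above.
query dashboardUsage($urn: String!, $start: Long, $end: Long) {
  dashboard(urn: $urn) {
    usageStats(startTimeMillis: $start, endTimeMillis: $end, limit: 10) {
      buckets {
        bucket
        duration
        metrics {
          uniqueUserCount
          viewsCount
          executionsCount
        }
      }
      aggregations {
        uniqueUserCount
        viewsCount
        executionsCount
      }
    }
  }
}
```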

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+buckets
+[DashboardUsageAggregation] +
+

A set of relevant time windows for use in displaying usage statistics

+
+aggregations
+DashboardUsageQueryResultAggregations +
+

A set of rolled up aggregations about the dashboard usage

+
+metrics
+[DashboardUsageMetrics!] +
+

A set of absolute dashboard usage metrics

+
+ +## DashboardUsageQueryResultAggregations + +A set of rolled up aggregations about the Dashboard usage + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+uniqueUserCount
+Int +
+

The count of unique Dashboard users within the queried time range

+
+users
+[DashboardUserUsageCounts] +
+

The specific per user usage counts within the queried time range

+
+viewsCount
+Int +
+

The total number of dashboard views within the queried time range

+
+executionsCount
+Int +
+

The total number of dashboard executions within the queried time range

+
+ +## DashboardUserUsageCounts + +Information about individual user usage of a Dashboard + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+user
+CorpUser +
+

The user of the Dashboard

+
+viewsCount
+Int +
+

The number of times the dashboard has been viewed by the user

+
+executionsCount
+Int +
+

The number of dashboard executions by the user

+
+usageCount
+Int +
+

Normalized numeric metric representing the user's dashboard usage; a higher value represents more usage

+
## DataFlow

A Data Flow Metadata Entity, representing a set of pipelined Data Jobs or Tasks required to produce an output Dataset. Also known as a Data Pipeline.
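A minimal query sketch, assuming a root-level `dataFlow(urn: ...)` query field and an illustrative urn; the selected fields are documented in the tables below.

```graphql
# Sketch: dataFlow(urn: ...) and the urn value are assumptions.
query getDataFlow {
  dataFlow(urn: "urn:li:dataFlow:(airflow,daily_ingest,prod)") {
    urn
    orchestrator
    flowId
    cluster
    properties {
      name
      description
      project
      externalUrl
    }
  }
}
```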

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of a Data Flow

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+orchestrator
+String! +
+

Workflow orchestrator, e.g. Azkaban, Airflow

+
+flowId
+String! +
+

Id of the flow

+
+cluster
+String! +
+

Cluster of the flow

+
+properties
+DataFlowProperties +
+

Additional read only properties about a Data flow

+
+editableProperties
+DataFlowEditableProperties +
+

Additional read write properties about a Data Flow

+
+ownership
+Ownership +
+

Ownership metadata of the flow

+
+tags
+GlobalTags +
+

The tags associated with the dataflow

+
+status
+Status +
+

Status metadata of the dataflow

+
+deprecation
+Deprecation +
+

The deprecation status of the Data Flow

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the data flow

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the data flow

+
+domain
+DomainAssociation +
+

The Domain associated with the DataFlow

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the data flow. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+info
+DataFlowInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional read only information about a Data flow

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags field instead +The structured tags associated with the dataflow

+
+dataJobs
+DataFlowDataJobsRelationships +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship IsPartOf instead +Data Jobs

+
+platform
+DataPlatform! +
+

Standardized platform urn where the data flow is defined

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## DataFlowDataJobsRelationships + +Deprecated, use relationships query instead + +

Fields

+ + + + + + + + + +
NameDescription
+entities
+[EntityRelationshipLegacy] +
+ +
+ +## DataFlowEditableProperties + +Data Flow properties that are editable via the UI This represents logical metadata, +as opposed to technical metadata + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

Description of the Data Flow

+
+ +## DataFlowInfo + +Deprecated, use DataFlowProperties instead +Additional read only properties about a Data Flow aka Pipeline + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the flow

+
+description
+String +
+

Description of the flow

+
+project
+String +
+

Optional project or namespace associated with the flow

+
+externalUrl
+String +
+

External URL associated with the DataFlow

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+ +## DataFlowProperties + +Additional read only properties about a Data Flow aka Pipeline + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the flow

+
+description
+String +
+

Description of the flow

+
+project
+String +
+

Optional project or namespace associated with the flow

+
+externalUrl
+String +
+

External URL associated with the DataFlow

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
## DataHubPolicy

A DataHub Platform Access Policy - Policies determine who can perform what actions against which resources on the platform.
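Since policies are usually fetched through dedicated policy queries rather than a generic entity lookup, the sketch below sticks to a reusable fragment over the fields listed here; how the fragment is embedded in a query is left open.

```graphql
# Fragment over the Policy fields documented below; only scalar and enum
# fields are selected, so it can be spread into any query returning DataHubPolicy.
fragment PolicyFields on DataHubPolicy {
  urn
  policyType
  name
  state
  description
  privileges
  editable
}
```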

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Policy

+
+type
+EntityType! +
+

The standard Entity Type

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from the Role

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+policyType
+PolicyType! +
+

The type of the Policy

+
+name
+String! +
+

The name of the Policy

+
+state
+PolicyState! +
+

The present state of the Policy

+
+description
+String +
+

The description of the Policy

+
+resources
+ResourceFilter +
+

The resources that the Policy privileges apply to

+
+privileges
+[String!]! +
+

The privileges that the Policy grants

+
+actors
+ActorFilter! +
+

The actors that the Policy grants privileges to

+
+editable
+Boolean! +
+

Whether the Policy is editable; system policies are not

+
+ +## DataHubRole + +A DataHub Role is a high-level abstraction on top of Policies that dictates what actions users can take. + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the role

+
+type
+EntityType! +
+

The standard Entity Type

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from the Role

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+name
+String! +
+

The name of the Role.

+
+description
+String! +
+

The description of the Role

+
## DataHubView

A DataHub View - Filters that are applied across the application automatically.
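A fragment sketch over the View fields below; the nested shapes reuse the `DataHubViewDefinition`, `DataHubViewFilter`, and `FacetFilter` types documented later on this page.

```graphql
# Fragment over the View fields documented in this section.
fragment ViewFields on DataHubView {
  urn
  viewType
  name
  description
  definition {
    entityTypes
    filter {
      operator
      filters {
        field
        condition
        values
        negated
      }
    }
  }
}
```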

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the View

+
+type
+EntityType! +
+

The standard Entity Type

+
+viewType
+DataHubViewType! +
+

The type of the View

+
+name
+String! +
+

The name of the View

+
+description
+String +
+

The description of the View

+
+definition
+DataHubViewDefinition! +
+

The definition of the View

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from the View

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
## DataHubViewDefinition

A DataHub View Definition

Fields

+ + + + + + + + + + + + + +
NameDescription
+entityTypes
+[EntityType!]! +
+

The entity types in scope for the View. If left empty, then ALL entity types are in scope.

+
+filter
+DataHubViewFilter! +
+

A set of filters to apply. If left empty, then no filters will be applied.

+
## DataHubViewFilter

A DataHub View Filter.

Fields

+ + + + + + + + + + + + + +
NameDescription
+operator
+LogicalOperator! +
+

The operator used to combine the filters.

+
+filters
+[FacetFilter!]! +
+

A set of filters combined using the operator. If left empty, then no filters will be applied.

+
## DataJob

A Data Job Metadata Entity, representing an individual unit of computation or Task used to produce an output Dataset. Always part of a parent Data Flow, aka a Pipeline.
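A minimal query sketch, assuming a root-level `dataJob(urn: ...)` query field and an illustrative urn; `runs` and its `start`/`count` arguments are documented in the table below, and the nested run fields come from `DataProcessInstanceResult` and `DataProcessInstance`.

```graphql
# Sketch: dataJob(urn: ...) and the urn value are assumptions.
query getDataJob {
  dataJob(urn: "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_ingest,prod),load_task)") {
    urn
    jobId
    properties {
      name
      description
      externalUrl
    }
    runs(start: 0, count: 5) {
      total
      runs {
        urn
        name
        externalUrl
      }
    }
  }
}
```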

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Data Job

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+dataFlow
+DataFlow +
+

Deprecated, use relationship IsPartOf instead +The associated data flow

+
+jobId
+String! +
+

Id of the job

+
+properties
+DataJobProperties +
+

Additional read only properties associated with the Data Job

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+editableProperties
+DataJobEditableProperties +
+

Additional read write properties associated with the Data Job

+
+tags
+GlobalTags +
+

The tags associated with the DataJob

+
+ownership
+Ownership +
+

Ownership metadata of the job

+
+status
+Status +
+

Status metadata of the DataJob

+
+deprecation
+Deprecation +
+

The deprecation status of the Data Flow

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the data job

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the data job

+
+domain
+DomainAssociation +
+

The Domain associated with the Data Job

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the data job. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+info
+DataJobInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional read only information about a Data processing job

+
+inputOutput
+DataJobInputOutput +
+

Information about the inputs and outputs of a Data processing job including column-level lineage.

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use the tags field instead +The structured tags associated with the DataJob

+
+runs
+DataProcessInstanceResult +
+

History of runs of this task

+ +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+start
+Int +
+ +
+count
+Int +
+ +
+ +
+privileges
+EntityPrivileges +
+

Privileges given to a user relevant to this entity

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## DataJobEditableProperties + +Data Job properties that are editable via the UI This represents logical metadata, +as opposed to technical metadata + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

Description of the Data Job

+
+ +## DataJobInfo + +Deprecated, use DataJobProperties instead +Additional read only information about a Data Job aka Task + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Job display name

+
+description
+String +
+

Job description

+
+externalUrl
+String +
+

External URL associated with the DataJob

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+ +## DataJobInputOutput + +The lineage information for a DataJob +TODO Rename this to align with other Lineage models + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+inputDatasets
+[Dataset!] +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship Consumes instead +Input datasets produced by the data job during processing

+
+outputDatasets
+[Dataset!] +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship Produces instead +Output datasets produced by the data job during processing

+
+inputDatajobs
+[DataJob!] +
+
Deprecated: No longer supported
+ +

Deprecated, use relationship DownstreamOf instead +Input datajobs that this data job depends on

+
+fineGrainedLineages
+[FineGrainedLineage!] +
+

Lineage information for the column-level. Includes a list of objects +detailing which columns are upstream and which are downstream of each other. +The upstream and downstream columns are from datasets.

+
+ +## DataJobProperties + +Additional read only properties about a Data Job aka Task + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Job display name

+
+description
+String +
+

Job description

+
+externalUrl
+String +
+

External URL associated with the DataJob

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+ +## DataPlatform + +A Data Platform represents a specific third party Data System or Tool Examples include +warehouses like Snowflake, orchestrators like Airflow, and dashboarding tools like Looker + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the data platform

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+name
+String! +
+

Name of the data platform

+
+properties
+DataPlatformProperties +
+

Additional read only properties associated with a data platform

+
+displayName
+String +
+
Deprecated: No longer supported
+ +

Deprecated, use properties displayName instead +Display name of the data platform

+
+info
+DataPlatformInfo +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +Additional properties associated with a data platform

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## DataPlatformInfo + +Deprecated, use DataPlatformProperties instead +Additional read only information about a Data Platform + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+PlatformType! +
+

The platform category

+
+displayName
+String +
+

Display name associated with the platform

+
+datasetNameDelimiter
+String! +
+

The delimiter in the dataset names on the data platform

+
+logoUrl
+String +
+

A logo URL associated with the platform

+
+ +## DataPlatformInstance + +A Data Platform instance represents an instance of a 3rd party platform like Looker, Snowflake, etc. + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the data platform

+
+type
+EntityType! +
+

A standard Entity Type

+
+platform
+DataPlatform! +
+

Name of the data platform

+
+instanceId
+String! +
+

The platform instance id

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+properties
+DataPlatformInstanceProperties +
+

Additional read only properties associated with a data platform instance

+
+ownership
+Ownership +
+

Ownership metadata of the data platform instance

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the data platform instance

+
+tags
+GlobalTags +
+

Tags used for searching the data platform instance

+
+deprecation
+Deprecation +
+

The deprecation status of the data platform instance

+
+status
+Status +
+

Status metadata of the container

+
+ +## DataPlatformInstanceProperties + +Additional read only properties about a DataPlatformInstance + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+

The name of the data platform instance used in display

+
+description
+String +
+

Read only technical description for the data platform instance

+
+customProperties
+[CustomPropertiesEntry!] +
+

Custom properties of the data platform instance

+
+externalUrl
+String +
+

External URL associated with the data platform instance

+
+ +## DataPlatformProperties + +Additional read only properties about a Data Platform + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+PlatformType! +
+

The platform category

+
+displayName
+String +
+

Display name associated with the platform

+
+datasetNameDelimiter
+String! +
+

The delimiter in the dataset names on the data platform

+
+logoUrl
+String +
+

A logo URL associated with the platform

+
+ +## DataProcessInstance + +A DataProcessInstance Metadata Entity, representing an individual run of +a task or datajob. + +
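A fragment sketch over the run fields below; the `state` field takes the `startTimeMillis` / `endTimeMillis` / `limit` arguments listed in its table, and the nested result uses `DataProcessInstanceRunResult`.

```graphql
# Fragment over the DataProcessInstance fields documented in this section.
fragment RunFields on DataProcessInstance {
  urn
  name
  externalUrl
  state(limit: 1) {
    status
    attempt
    timestampMillis
    result {
      resultType
      nativeResultType
    }
  }
}
```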

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the DataProcessInstance

+
+type
+EntityType! +
+

The standard Entity Type

+
+state
+[DataProcessRunEvent] +
+

The history of state changes for the run

+ +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+startTimeMillis
+Long +
+ +
+endTimeMillis
+Long +
+ +
+limit
+Int +
+ +
+ +
+created
+AuditStamp +
+

When the run was kicked off

+
+name
+String +
+

The name of the data process

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity. +In the UI, used for inputs, outputs and parentTemplate

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+externalUrl
+String +
+

The link to view the task run in the source system

+
+ +## DataProcessInstanceResult + +Data Process instances that match the provided query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+count
+Int +
+

The number of entities to include in result set

+
+start
+Int +
+

The offset of the result set

+
+total
+Int +
+

The total number of run events returned

+
+runs
+[DataProcessInstance] +
+

The data process instances that produced or consumed the entity

+
+ +## DataProcessInstanceRunResult + +the result of a run, part of the run state + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+resultType
+DataProcessInstanceRunResultType +
+

The outcome of the run

+
+nativeResultType
+String +
+

The outcome of the run in the data platform's native language

+
+ +## DataProcessRunEvent + +A state change event in the data process instance lifecycle + +

Implements

+ +- [TimeSeriesAspect](/docs/graphql/interfaces#timeseriesaspect) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+status
+DataProcessRunStatus +
+

The status of the data process instance

+
+attempt
+Int +
+

The attempt number of this instance run

+
+result
+DataProcessInstanceRunResult +
+

The result of a run

+
+timestampMillis
+Long! +
+

The timestamp associated with the run event in milliseconds

+
+ +## DataProduct + +A Data Product, or a logical grouping of Metadata Entities + +
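A fragment sketch over the Data Product fields below; the nested shapes reuse `DataProductProperties`, `DomainAssociation`, and `Domain` as documented on this page.

```graphql
# Fragment over the Data Product fields documented in this section.
fragment DataProductFields on DataProduct {
  urn
  properties {
    name
    description
    externalUrl
    numAssets
  }
  domain {
    associatedUrn
    domain {
      urn
      properties {
        name
      }
    }
  }
}
```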

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Data Product

+
+type
+EntityType! +
+

A standard Entity Type

+
+properties
+DataProductProperties +
+

Properties about a Data Product

+
+ownership
+Ownership +
+

Ownership metadata of the Data Product

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the Data Product

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the Data Product

+
+domain
+DomainAssociation +
+

The Domain associated with the Data Product

+
+tags
+GlobalTags +
+

Tags used for searching Data Product

+
+ +## DataProductProperties + +Properties about a domain + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the Data Product

+
+description
+String +
+

Description of the Data Product

+
+externalUrl
+String +
+

External URL for the DataProduct (most likely GitHub repo where Data Products are managed as code)

+
+numAssets
+Int +
+

Number of children entities inside of the Data Product

+
+customProperties
+[CustomPropertiesEntry!] +
+

Custom properties of the Data Product

+
+ +## Dataset + +A Dataset entity, which encompasses Relational Tables, Document store collections, streaming topics, and other sets of data having an independent lifecycle + +
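A minimal query sketch, assuming a root-level `dataset(urn: ...)` query field and an illustrative urn; `datasetProfiles` and its `limit` argument are listed in the table below, and the profile fields come from `DatasetProfile`.

```graphql
# Sketch: dataset(urn: ...) and the urn value are assumptions.
query getDataset {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    urn
    platform {
      name
    }
    properties {
      name
      qualifiedName
      description
      externalUrl
    }
    editableProperties {
      description
    }
    datasetProfiles(limit: 1) {
      timestampMillis
      rowCount
      columnCount
      sizeInBytes
    }
  }
}
```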

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Dataset

+
+type
+EntityType! +
+

The standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+platform
+DataPlatform! +
+

Standardized platform urn where the dataset is defined

+
+container
+Container +
+

The parent container in which the entity resides

+
+parentContainers
+ParentContainersResult +
+

Recursively get the lineage of containers for this entity

+
+name
+String! +
+

Unique guid for the dataset. No longer to be used as the Dataset display name; use properties.name instead

+
+properties
+DatasetProperties +
+

An additional set of read only properties

+
+editableProperties
+DatasetEditableProperties +
+

An additional set of read write properties

+
+ownership
+Ownership +
+

Ownership metadata of the dataset

+
+deprecation
+Deprecation +
+

The deprecation status of the dataset

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the dataset

+
+schemaMetadata
+SchemaMetadata +
+

Schema metadata of the dataset, available by version number

+ +

Arguments

+ + + + + + + + + +
NameDescription
+version
+Long +
+ +
+ +
+editableSchemaMetadata
+EditableSchemaMetadata +
+

Editable schema metadata of the dataset

+
+status
+Status +
+

Status of the Dataset

+
+embed
+Embed +
+

Embed information about the Dataset

+
+tags
+GlobalTags +
+

Tags used for searching dataset

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the dataset

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+domain
+DomainAssociation +
+

The Domain associated with the Dataset

+
+access
+Access +
+

The Roles and the properties to access the dataset

+
+usageStats
+UsageQueryResult +
+

Statistics about how this Dataset is used +The first parameter, resource, is deprecated and no longer needs to be provided

+ +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+resource
+String +
+ +
+range
+TimeRange +
+ +
+ +
+statsSummary
+DatasetStatsSummary +
+

Experimental - Summary operational & usage statistics about a Dataset

+
+datasetProfiles
+[DatasetProfile!] +
+

Profile Stats resource that retrieves the events in a previous unit of time in descending order. If no start or end time is provided, the most recent events will be returned

+ +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+startTimeMillis
+Long +
+ +
+endTimeMillis
+Long +
+ +
+filter
+FilterInput +
+ +
+limit
+Int +
+ +
+ +
+operations
+[Operation!] +
+

Operational events for an entity.

+ +

Arguments

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+startTimeMillis
+Long +
+ +
+endTimeMillis
+Long +
+ +
+filter
+FilterInput +
+ +
+limit
+Int +
+ +
+ +
+assertions
+EntityAssertionsResult +
+

Assertions associated with the Dataset

+ +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+start
+Int +
+ +
+count
+Int +
+ +
+ +
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the dataset. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+health
+[Health!] +
+

Experimental! The resolved health statuses of the Dataset

+
+schema
+Schema +
+
Deprecated: Use `schemaMetadata`
+ +

Schema metadata of the dataset

+
+externalUrl
+String +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +External URL associated with the Dataset

+
+origin
+FabricType! +
+
Deprecated: No longer supported
+ +

Deprecated, see the properties field instead +Environment in which the dataset belongs to or where it was generated +Note that this field will soon be deprecated in favor of a more standardized concept of Environment

+
+description
+String +
+
Deprecated: No longer supported
+ +

Deprecated, use the properties field instead +Read only technical description for dataset

+
+platformNativeType
+PlatformNativeType +
+
Deprecated: No longer supported
+ +

Deprecated, do not use this field +The logical type of the dataset ie table, stream, etc

+
+uri
+String +
+
Deprecated: No longer supported
+ +

Deprecated, use properties instead +Native Dataset Uri +Uri should not include any environment specific properties

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags field instead +The structured tags associated with the dataset

+
+subTypes
+SubTypes +
+

Sub Types that this entity implements

+
+viewProperties
+ViewProperties +
+

View related properties. Only relevant if subtypes field contains view.

+
+aspects
+[RawAspect!] +
+

Experimental API. +For fetching extra entities that do not have custom UI code yet

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AspectParams +
+ +
+ +
+runs
+DataProcessInstanceResult +
+

History of datajob runs that either produced or consumed this dataset

+ +

Arguments

+ + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+ +
+count
+Int +
+ +
+direction
+RelationshipDirection! +
+ +
+ +
+siblings
+SiblingProperties +
+

Metadata about the datasets siblings

+
+fineGrainedLineages
+[FineGrainedLineage!] +
+

Lineage information for the column-level. Includes a list of objects +detailing which columns are upstream and which are downstream of each other. +The upstream and downstream columns are from datasets.

+
+privileges
+EntityPrivileges +
+

Privileges given to a user relevant to this entity

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+testResults
+TestResults +
+

The results of evaluating tests

+
+ +## DatasetAssertionInfo + +Detailed information about a Dataset Assertion + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+datasetUrn
+String! +
+

The urn of the dataset that the assertion is related to

+
+scope
+DatasetAssertionScope! +
+

The scope of the Dataset assertion.

+
+fields
+[SchemaFieldRef!] +
+

The fields serving as input to the assertion. Empty if there are none.

+
+aggregation
+AssertionStdAggregation +
+

Standardized assertion operator

+
+operator
+AssertionStdOperator! +
+

Standardized assertion operator

+
+parameters
+AssertionStdParameters +
+

Standard parameters required for the assertion. e.g. min_value, max_value, value, columns

+
+nativeType
+String +
+

The native operator for the assertion. For Great Expectations, this will contain the original expectation name.

+
+nativeParameters
+[StringMapEntry!] +
+

Native parameters required for the assertion.

+
+logic
+String +
+

Logic comprising a raw, unstructured assertion.

+
+ +## DatasetDeprecation + +Deprecated, use Deprecation instead +Information about Dataset deprecation status +Note that this model will soon be migrated to a more general purpose Entity status + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+deprecated
+Boolean! +
+

Whether the dataset has been deprecated by owner

+
+decommissionTime
+Long +
+

The time at which the user plans to decommission this dataset

+
+note
+String! +
+

Additional information about the dataset deprecation plan

+
+actor
+String +
+

The user who will be credited for modifying this deprecation content

+
+ +## DatasetEditableProperties + +Dataset properties that are editable via the UI This represents logical metadata, +as opposed to technical metadata + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

Description of the Dataset

+
+ +## DatasetFieldProfile + +An individual Dataset Field Profile + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+fieldPath
+String! +
+

The standardized path of the field

+
+uniqueCount
+Long +
+

The unique value count for the field across the Dataset

+
+uniqueProportion
+Float +
+

The proportion of rows with unique values across the Dataset

+
+nullCount
+Long +
+

The number of NULL row values across the Dataset

+
+nullProportion
+Float +
+

The proportion of rows with NULL values across the Dataset

+
+min
+String +
+

The min value for the field

+
+max
+String +
+

The max value for the field

+
+mean
+String +
+

The mean value for the field

+
+median
+String +
+

The median value for the field

+
+stdev
+String +
+

The standard deviation for the field

+
+sampleValues
+[String!] +
+

A set of sample values for the field

+
+ +## DatasetProfile + +A Dataset Profile associated with a Dataset, containing profiling statistics about the Dataset + +

Implements

+ +- [TimeSeriesAspect](/docs/graphql/interfaces#timeseriesaspect) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+timestampMillis
+Long! +
+

The time at which the profile was reported

+
+rowCount
+Long +
+

An optional row count of the Dataset

+
+columnCount
+Long +
+

An optional column count of the Dataset

+
+sizeInBytes
+Long +
+

The storage size in bytes

+
+fieldProfiles
+[DatasetFieldProfile!] +
+

An optional set of per field statistics obtained in the profile

+
+partitionSpec
+PartitionSpec +
+

Information about the partition that was profiled

+
+ +## DatasetProperties + +Additional read only properties about a Dataset + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the dataset used in display

+
+qualifiedName
+String +
+

Fully-qualified name of the Dataset

+
+origin
+FabricType! +
+

Environment to which the dataset belongs or where it was generated. Note that this field will soon be deprecated in favor of a more standardized concept of Environment

+
+description
+String +
+

Read only technical description for dataset

+
+customProperties
+[CustomPropertiesEntry!] +
+

Custom properties of the Dataset

+
+externalUrl
+String +
+

External URL associated with the Dataset

+
+created
+Long +
+

Created timestamp millis associated with the Dataset

+
+createdActor
+String +
+

Actor associated with the Dataset's created timestamp

+
+lastModified
+Long +
+

Last Modified timestamp millis associated with the Dataset

+
+lastModifiedActor
+String +
+

Actor associated with the Dataset's lastModified timestamp

+
+ +## DatasetStatsSummary + +Experimental - subject to change. A summary of usage metrics about a Dataset. + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+queryCountLast30Days
+Int +
+

The query count in the past 30 days

+
+uniqueUserCountLast30Days
+Int +
+

The unique user count in the past 30 days

+
+topUsersLast30Days
+[CorpUser!] +
+

The top users in the past 30 days

+
+ +## DateRange + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+start
+String! +
+ +
+end
+String! +
+ +
+ +## Deprecation + +Information about Metadata Entity deprecation status + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+deprecated
+Boolean! +
+

Whether the entity has been deprecated by owner

+
+decommissionTime
+Long +
+

The time at which the user plans to decommission this entity

+
+note
+String +
+

Additional information about the entity deprecation plan

+
+actor
+String +
+

The user who will be credited for modifying this deprecation content

+
+ +## Domain + +A domain, or a logical grouping of Metadata Entities + +
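A minimal query sketch, assuming a root-level `domain(urn: ...)` query field and an illustrative urn; the selected fields are documented in the table below.

```graphql
# Sketch: domain(urn: ...) and the urn value are assumptions.
query getDomain {
  domain(urn: "urn:li:domain:marketing") {
    urn
    id
    properties {
      name
      description
    }
  }
}
```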

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the domain

+
+type
+EntityType! +
+

A standard Entity Type

+
+id
+String! +
+

Id of the domain

+
+properties
+DomainProperties +
+

Properties about a domain

+
+ownership
+Ownership +
+

Ownership metadata of the domain

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the domain

+
+entities
+SearchResults +
+

Children entities inside of the Domain

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+DomainEntitiesInput +
+ +
+ +
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## DomainAssociation + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+domain
+Domain! +
+

The domain related to the associated urn

+
+associatedUrn
+String! +
+

Reference back to the tagged urn for tracking purposes e.g. when sibling nodes are merged together

+
+ +## DomainProperties + +Properties about a domain + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Display name of the domain

+
+description
+String +
+

Description of the Domain

+
+ +## DownstreamEntityRelationships + +Deprecated, use relationships query instead + +

Fields

+ + + + + + + + + +
NameDescription
+entities
+[EntityRelationshipLegacy] +
+ +
+ +## EditableSchemaFieldInfo + +Editable schema field metadata ie descriptions, tags, etc + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+fieldPath
+String! +
+

Flattened name of a field identifying the field the editable info is applied to

+
+description
+String +
+

Edited description of the field

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags field instead +Tags associated with the field

+
+tags
+GlobalTags +
+

Tags associated with the field

+
+glossaryTerms
+GlossaryTerms +
+

Glossary terms associated with the field

+
+ +## EditableSchemaMetadata + +Information about schema metadata that is editable via the UI + +

Fields

+ + + + + + + + + +
NameDescription
+editableSchemaFieldInfo
+[EditableSchemaFieldInfo!]! +
+

Editable schema field metadata

+
+ +## EditableTagProperties + +Additional read write Tag properties +Deprecated! Replaced by TagProperties. + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String +
+

A display name for the Tag

+
+description
+String +
+

A description of the Tag

+
+ +## Embed + +Information required to render an embedded version of an asset + +

Fields

+ + + + + + + + + +
NameDescription
+renderUrl
+String +
+

A URL which can be rendered inside of an iframe.

+
+ +## EntityAssertionsResult + +A list of Assertions Associated with an Entity + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of assertions in the returned result set

+
+total
+Int! +
+

The total number of assertions in the result set

+
+assertions
+[Assertion!]! +
+

The assertions themselves

+
+ +## EntityCountResult + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+entityType
+EntityType! +
+ +
+count
+Int! +
+ +
+ +## EntityCountResults + +

Fields

+ + + + + + + + + +
NameDescription
+counts
+[EntityCountResult!] +
+ +
+ +## EntityLineageResult + +A list of lineage information associated with a source Entity + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

Start offset of the result set

+
+count
+Int +
+

Number of results in the returned result set

+
+total
+Int +
+

Total number of results in the result set

+
+filtered
+Int +
+

The number of results that were filtered out of the page (soft-deleted or non-existent)

+
+relationships
+[LineageRelationship!]! +
+

Relationships in the result set

+
+ +## EntityPath + +An overview of the field that was matched in the entity search document + +

Fields

+ + + + + + + + + +
NameDescription
+path
+[Entity] +
+

Path of entities between source and destination nodes

+
+ +## EntityPrivileges + +Shared privileges object across entities. Not all privileges apply to every entity. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+canManageChildren
+Boolean +
+

Whether or not a user can create child entities under a parent entity. For example, whether one can create Terms/Nodes under a Glossary Node.

+
+canManageEntity
+Boolean +
+

Whether or not a user can delete or move this entity.

+
+canEditLineage
+Boolean +
+

Whether or not a user can create or delete lineage edges for an entity.

+
+canEditEmbed
+Boolean +
+

Whether or not a user can update the embed information

+
+canEditQueries
+Boolean +
+

Whether or not a user can update the Queries for the entity (e.g. dataset)

+
+ +## EntityProfileConfig + +Configuration for an entity profile + +

Fields

+ + + + + + + + + +
NameDescription
+defaultTab
+String +
+

The enum value from EntityProfileTab for which tab should be shown by default on entity profile pages. If null, rely on default sorting from React code.

+
+ +## EntityProfileParams + +Context to define the entity profile page + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the entity being shown

+
+type
+EntityType! +
+

Type of the entity being displayed

+
+ +## EntityProfilesConfig + +Configuration for different entity profiles + +

Fields

+ + + + + + + + + +
NameDescription
+domain
+EntityProfileConfig +
+

The configurations for a Domain entity profile

+
+ +## EntityRelationship + +A relationship between two entities TODO Migrate all entity relationships to this more generic model + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+String! +
+

The type of the relationship

+
+direction
+RelationshipDirection! +
+

The direction of the relationship relative to the source entity

+
+entity
+Entity +
+

Entity that is related via lineage

+
+created
+AuditStamp +
+

An AuditStamp corresponding to the last modification of this relationship

+
+ +## EntityRelationshipLegacy + +Deprecated, use relationships query instead + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+entity
+EntityWithRelationships +
+

Entity that is related via lineage

+
+created
+AuditStamp +
+

An AuditStamp corresponding to the last modification of this relationship

+
+ +## EntityRelationshipsResult + +A list of relationship information associated with a source Entity + +
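This type is what the per-entity `relationships(input: ...)` fields documented throughout this page return. The sketch below assumes the `dataset(urn: ...)` root field, the example urn, and the shape of `RelationshipsInput` (types, direction, start, count), which is documented under Input Objects rather than in this section.

```graphql
# Sketch: dataset(urn: ...), the urn, and the RelationshipsInput fields
# (types, direction, start, count) are assumptions documented elsewhere.
query datasetRelationships {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)") {
    relationships(input: { types: ["DownstreamOf"], direction: INCOMING, start: 0, count: 10 }) {
      start
      count
      total
      relationships {
        type
        direction
        entity {
          urn
        }
      }
    }
  }
}
```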

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

Start offset of the result set

+
+count
+Int +
+

Number of results in the returned result set

+
+total
+Int +
+

Total number of results in the result set

+
+relationships
+[EntityRelationship!]! +
+

Relationships in the result set

+
+ +## EthicalConsiderations + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+data
+[String!] +
+

Does the model use any sensitive data, e.g. protected classes

+
+humanLife
+[String!] +
+

Is the model intended to inform decisions about matters central to human life or flourishing, e.g. health or safety

+
+mitigations
+[String!] +
+

What risk mitigation strategies were used during model development

+
+risksAndHarms
+[String!] +
+

What risks may be present in model usage. Try to identify the potential recipients, likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown

+
+useCases
+[String!] +
+

Are there any known model use cases that are especially fraught? This may connect directly to the intended use section

+
+ +## ExecutionRequest + +Retrieve an ingestion execution request + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Urn of the execution request

+
+id
+String! +
+

Unique id for the execution request

+
+input
+ExecutionRequestInput! +
+

Input provided when creating the Execution Request

+
+result
+ExecutionRequestResult +
+

Result of the execution request

+
+ +## ExecutionRequestInput + +Input provided when creating an Execution Request + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+task
+String! +
+

The type of the task to be executed

+
+source
+ExecutionRequestSource! +
+

The source of the execution request

+
+arguments
+[StringMapEntry!] +
+

Arguments provided when creating the execution request

+
+requestedAt
+Long! +
+

The time at which the request was created

+
+ +## ExecutionRequestResult + +The result of an ExecutionRequest + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+status
+String! +
+

The result of the request, e.g. either SUCCEEDED or FAILED

+
+startTimeMs
+Long +
+

Time at which the task began

+
+durationMs
+Long +
+

Duration of the task

+
+report
+String +
+

A report about the ingestion run

+
+structuredReport
+StructuredReport +
+

A structured report for this Execution Request

+
+ +## ExecutionRequestSource + +Information about the source of an execution request + +

Fields

+ + + + + + + + + +
NameDescription
+type
+String +
+

The type of the source, e.g. SCHEDULED_INGESTION_SOURCE

+
+ +## FacetFilter + +A single filter value + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+field
+String! +
+

Name of field to filter by

+
+condition
+FilterOperator +
+

Condition for the values.

+
+values
+[String!]! +
+

Values, one of which the intended field should match.

+
+negated
+Boolean +
+

If the filter should or should not be matched

+
+ +## FacetMetadata + +Contains valid fields to filter search results further on + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+field
+String! +
+

Name of a field present in the search entity

+
+displayName
+String +
+

Display name of the field

+
+aggregations
+[AggregationMetadata!]! +
+

Aggregated search result counts by value of the field

+
+ +## FeatureFlagsConfig + +Configurations related to DataHub feature flags + +
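The UI typically reads these flags once at load time; a sketch of such a query is shown below, assuming the flags are exposed via an `appConfig` root field with a `featureFlags` member.

```graphql
# Sketch: read the feature flags that toggle UI behavior (assumed root field and member name).
query GetFeatureFlags {
  appConfig {
    featureFlags {
      readOnlyModeEnabled
      showSearchFiltersV2
      showBrowseV2
      showAcrylInfo
    }
  }
}
```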

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+readOnlyModeEnabled
+Boolean! +
+

Whether read only mode is enabled on an instance. +Right now this only affects the ability to edit the user profile image URL but can be extended.

+
+showSearchFiltersV2
+Boolean! +
+

Whether search filters V2 should be shown instead of the default filter side-panel

+
+showBrowseV2
+Boolean! +
+

Whether browse V2 sidebar should be shown

+
+showAcrylInfo
+Boolean! +
+

Whether we should show CTAs in the UI related to moving to Managed DataHub by Acryl.

+
+ +## FieldUsageCounts + +The usage for a particular Dataset field + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+fieldName
+String +
+

The path of the field

+
+count
+Int +
+

The count of usages

+
+ +## FineGrainedLineage + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+upstreams
+[SchemaFieldRef!] +
+ +
+downstreams
+[SchemaFieldRef!] +
+ +
+ +## FloatBox + +

Fields

+ + + + + + + + + +
NameDescription
+floatValue
+Float! +
+ +
+ +## ForeignKeyConstraint + +Metadata around a foreign key constraint between two datasets + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+

The human-readable name of the constraint

+
+foreignFields
+[SchemaFieldEntity] +
+

List of fields in the foreign dataset

+
+sourceFields
+[SchemaFieldEntity] +
+

List of fields in this dataset

+
+foreignDataset
+Dataset +
+

The foreign dataset for easy reference

+
+ +## FreshnessStats + +Freshness stats for a query result. +Captures whether the query was served out of a cache, what the staleness was, etc. + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+cached
+Boolean +
+

Whether a cache was used to respond to this query

+
+systemFreshness
+[SystemFreshness] +
+

The latest timestamp in millis of the system that was used to respond to this query +In case a cache was consulted, this reflects the freshness of the cache +In case an index was consulted, this reflects the freshness of the index

+
+ +## GetQuickFiltersResult + +The result object when fetching quick filters + +

Fields

+ + + + + + + + + +
NameDescription
+quickFilters
+[QuickFilter]! +
+

The list of quick filters to render in the UI

+
+ +## GetRootGlossaryNodesResult + +The result when getting Glossary entities + +
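A sketch of paging through the root of the Business Glossary follows. The `getRootGlossaryNodes` root field and its input shape are assumptions; the selected fields follow the table below and the GlossaryNode section.

```graphql
# Sketch: page through top-level Glossary Nodes (assumed root field and input shape).
query RootGlossaryNodes {
  getRootGlossaryNodes(input: { start: 0, count: 25 }) {
    start
    count
    total
    nodes {
      urn
      properties {
        name
      }
    }
  }
}
```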

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+nodes
+[GlossaryNode!]! +
+

A list of Glossary Nodes without a parent node

+
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of nodes in the returned result

+
+total
+Int! +
+

The total number of nodes in the result set

+
+ +## GetRootGlossaryTermsResult + +The result when getting root GlossaryTerms + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+terms
+[GlossaryTerm!]! +
+

A list of Glossary Terms without a parent node

+
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of terms in the returned result

+
+total
+Int! +
+

The total number of terms in the result set

+
+ +## GetSchemaBlameResult + +Schema changes computed at a specific version. + +
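A sketch of requesting schema blame for a dataset is shown below. The `getSchemaBlame` root field, its `datasetUrn` input field, and the `semanticVersion`/`fieldPath` sub-fields are all assumptions, since those types are documented elsewhere.

```graphql
# Sketch: fetch schema blame for a dataset (assumed root field, input shape, and sub-fields).
query SchemaBlame {
  getSchemaBlame(input: { datasetUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)" }) {
    version {
      semanticVersion
    }
    schemaFieldBlameList {
      fieldPath
    }
  }
}
```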

Fields

+ + + + + + + + + + + + + +
NameDescription
+version
+SemanticVersionStruct +
+

Selected semantic version

+
+schemaFieldBlameList
+[SchemaFieldBlame!] +
+

List of schema blame. Absent when there are no fields to return history for.

+
+ +## GetSchemaVersionListResult + +The list of schema versions for a dataset, including the latest and the currently selected version. + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+latestVersion
+SemanticVersionStruct +
+

Latest and current semantic version

+
+version
+SemanticVersionStruct +
+

Selected semantic version

+
+semanticVersionList
+[SemanticVersionStruct!] +
+

All semantic versions. Absent when there are no versions.

+
+ +## GlobalTags + +Tags attached to a particular Metadata Entity + +

Fields

+ + + + + + + + + +
NameDescription
+tags
+[TagAssociation!] +
+

The set of tags attached to the Metadata Entity

+
+ +## GlobalViewsSettings + +Global (platform-level) settings related to the Views feature + +

Fields

+ + + + + + + + + +
NameDescription
+defaultView
+String +
+

The global default View. If a user does not have a personal default, then +this will be the default view.

+
+ +## GlossaryNode + +A Glossary Node, or a directory in a Business Glossary represents a container of +Glossary Terms or other Glossary Nodes + +
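A sketch of looking up a single node and its parent chain follows. The `glossaryNode` root field and the urn are assumptions; the selected fields match the table below.

```graphql
# Sketch: fetch one Glossary Node and its parents (assumed root field, placeholder urn).
query GetGlossaryNode {
  glossaryNode(urn: "urn:li:glossaryNode:example") {
    urn
    type
    properties {
      name
      description
    }
    parentNodes {
      count
      nodes {
        urn
      }
    }
  }
}
```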

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the glossary node

+
+ownership
+Ownership +
+

Ownership metadata of the glossary node

+
+type
+EntityType! +
+

A standard Entity Type

+
+properties
+GlossaryNodeProperties +
+

Additional properties associated with the Glossary Node

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+parentNodes
+ParentNodesResult +
+

Recursively get the lineage of glossary nodes for this entity

+
+privileges
+EntityPrivileges +
+

Privileges given to a user relevant to this entity

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## GlossaryNodeProperties + +Additional read only properties about a Glossary Node + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the Glossary Node

+
+description
+String +
+

Description of the glossary node

+
+ +## GlossaryTerm + +A Glossary Term, or a node in a Business Glossary representing a standardized domain +data type + +
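A sketch of fetching a single term with its read-only properties follows. The `glossaryTerm` root field and the urn are assumptions; the selections follow the tables in this section and the GlossaryTermProperties section.

```graphql
# Sketch: fetch one Glossary Term (assumed root field, placeholder urn).
query GetGlossaryTerm {
  glossaryTerm(urn: "urn:li:glossaryTerm:example") {
    urn
    hierarchicalName
    properties {
      name
      description
      termSource
      sourceRef
      sourceUrl
    }
    parentNodes {
      count
    }
  }
}
```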

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the glossary term

+
+ownership
+Ownership +
+

Ownership metadata of the glossary term

+
+domain
+DomainAssociation +
+

The Domain associated with the glossary term

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the Glossary Term

+
+type
+EntityType! +
+

A standard Entity Type

+
+name
+String! +
+
Deprecated: No longer supported
+ +

A unique identifier for the Glossary Term. Deprecated - Use properties.name field instead.

+
+hierarchicalName
+String! +
+

hierarchicalName of glossary term

+
+properties
+GlossaryTermProperties +
+

Additional properties associated with the Glossary Term

+
+glossaryTermInfo
+GlossaryTermInfo +
+

Deprecated, use properties field instead +Details of the Glossary Term

+
+deprecation
+Deprecation +
+

The deprecation status of the Glossary Term

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+schemaMetadata
+SchemaMetadata +
+

Schema metadata of the dataset

+ +

Arguments

+ + + + + + + + + +
NameDescription
+version
+Long +
+ +
+ +
+parentNodes
+ParentNodesResult +
+

Recursively get the lineage of glossary nodes for this entity

+
+privileges
+EntityPrivileges +
+

Privileges given to a user relevant to this entity

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## GlossaryTermAssociation + +An edge between a Metadata Entity and a Glossary Term Modeled as a struct to permit +additional attributes +TODO Consider whether this query should be serviced by the relationships field + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+term
+GlossaryTerm! +
+

The glossary term itself

+
+associatedUrn
+String! +
+

Reference back to the associated urn for tracking purposes e.g. when sibling nodes are merged together

+
+ +## GlossaryTermInfo + +Deprecated, use GlossaryTermProperties instead +Information about a glossary term + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+

The name of the Glossary Term

+
+description
+String +
+

Description of the glossary term

+
+definition
+String! +
+
Deprecated: No longer supported
+ +

Definition of the glossary term. Deprecated - Use 'description' instead.

+
+termSource
+String! +
+

Term Source of the glossary term

+
+sourceRef
+String +
+

Source Ref of the glossary term

+
+sourceUrl
+String +
+

Source Url of the glossary term

+
+customProperties
+[CustomPropertiesEntry!] +
+

Properties of the glossary term

+
+rawSchema
+String +
+

Schema definition of glossary term

+
+ +## GlossaryTermProperties + +Additional read only properties about a Glossary Term + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the Glossary Term

+
+description
+String +
+

Description of the glossary term

+
+definition
+String! +
+
Deprecated: No longer supported
+ +

Definition of the glossary term. Deprecated - Use 'description' instead.

+
+termSource
+String! +
+

Term Source of the glossary term

+
+sourceRef
+String +
+

Source Ref of the glossary term

+
+sourceUrl
+String +
+

Source Url of the glossary term

+
+customProperties
+[CustomPropertiesEntry!] +
+

Properties of the glossary term

+
+rawSchema
+String +
+

Schema definition of glossary term

+
+ +## GlossaryTerms + +Glossary Terms attached to a particular Metadata Entity + +

Fields

+ + + + + + + + + +
NameDescription
+terms
+[GlossaryTermAssociation!] +
+

The set of glossary terms attached to the Metadata Entity

+
+ +## Health + +The resolved Health of an Asset + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+HealthStatusType! +
+

An enum representing the type of health indicator

+
+status
+HealthStatus! +
+

An enum representing the resolved Health status of an Asset

+
+message
+String +
+

An optional message describing the resolved health status

+
+causes
+[String!] +
+

The causes responsible for the health status

+
+ +## Highlight + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+value
+Int! +
+ +
+title
+String! +
+ +
+body
+String! +
+ +
+ +## HyperParameterMap + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+key
+String! +
+ +
+value
+HyperParameterValueType! +
+ +
+ +## IdentityManagementConfig + +Configurations related to Identity Management + +

Fields

+ + + + + + + + + +
NameDescription
+enabled
+Boolean! +
+

Whether the identity management screen can be shown in the UI

+
+ +## IngestionConfig + +A set of configurations for an Ingestion Source + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+recipe
+String! +
+

The JSON-encoded recipe to use for ingestion

+
+executorId
+String! +
+

Advanced: The specific executor that should handle the execution request. Defaults to 'default'.

+
+version
+String +
+

Advanced: The version of the ingestion framework to use

+
+debugMode
+Boolean +
+

Advanced: Whether or not to run ingestion in debug mode

+
+ +## IngestionRun + +The runs associated with an Ingestion Source managed by DataHub + +

Fields

+ + + + + + + + + +
NameDescription
+executionRequestUrn
+String +
+

The urn of the execution request associated with this ingestion run

+
+ +## IngestionSchedule + +A schedule associated with an Ingestion Source + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+timezone
+String +
+

Time Zone abbreviation (e.g. GMT, EDT). Defaults to UTC.

+
+interval
+String! +
+

The cron-formatted interval to execute the ingestion source on

+
+ +## IngestionSource + +An Ingestion Source Entity + +
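A sketch of listing ingestion sources together with their schedule and recipe follows. The `listIngestionSources` root field and its input shape are assumptions; the selected fields come from the ListIngestionSourcesResult, IngestionSchedule, and IngestionConfig tables.

```graphql
# Sketch: list managed ingestion sources (assumed root field and input shape).
query ListIngestionSources {
  listIngestionSources(input: { start: 0, count: 10 }) {
    start
    count
    total
    ingestionSources {
      urn
      name
      type
      schedule {
        interval
        timezone
      }
      config {
        recipe
        executorId
        debugMode
      }
    }
  }
}
```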

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Ingestion Source

+
+type
+String! +
+

The type of the source itself, e.g. mysql, bigquery, bigquery-usage. Should match the recipe.

+
+name
+String! +
+

The display name of the Ingestion Source

+
+schedule
+IngestionSchedule +
+

An optional schedule associated with the Ingestion Source

+
+platform
+DataPlatform +
+

The data platform associated with this ingestion source

+
+config
+IngestionConfig! +
+

A type-specific set of configurations for the ingestion source

+
+executions
+IngestionSourceExecutionRequests +
+

Previous requests to execute the ingestion source

+ +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+start
+Int +
+ +
+count
+Int +
+ +
+ +
+ +## IngestionSourceExecutionRequests + +Requests for execution associated with an ingestion source + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+total
+Int +
+

The total number of results in the result set

+
+executionRequests
+[ExecutionRequest!]! +
+

The execution request objects comprising the result set

+
+ +## InputField + +Input field of the chart + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+schemaFieldUrn
+String +
+ +
+schemaField
+SchemaField +
+ +
+ +## InputFields + +Input fields of the chart + +

Fields

+ + + + + + + + + +
NameDescription
+fields
+[InputField] +
+ +
+ +## InstitutionalMemory + +Institutional memory metadata, meaning internal links and pointers related to an Entity + +

Fields

+ + + + + + + + + +
NameDescription
+elements
+[InstitutionalMemoryMetadata!]! +
+

List of records that represent the institutional memory or internal documentation of an entity

+
+ +## InstitutionalMemoryMetadata + +An institutional memory resource about a particular Metadata Entity + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+url
+String! +
+

Link to a document or wiki page or another internal resource

+
+label
+String! +
+

Label associated with the URL

+
+author
+CorpUser! +
+

The author of this metadata

+
+created
+AuditStamp! +
+

An AuditStamp corresponding to the creation of this resource

+
+description
+String! +
+
Deprecated: No longer supported
+ +

Deprecated, use label instead +Description of the resource

+
+associatedUrn
+String! +
+

Reference back to the owned urn for tracking purposes e.g. when sibling nodes are merged together

+
+ +## IntBox + +

Fields

+ + + + + + + + + +
NameDescription
+intValue
+Int! +
+ +
+ +## IntendedUse + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+primaryUses
+[String!] +
+

Primary Use cases for the model

+
+primaryUsers
+[IntendedUserType!] +
+

Primary Intended Users

+
+outOfScopeUses
+[String!] +
+

Out of scope uses of the MLModel

+
+ +## InviteToken + +Token that allows users to sign up as a native user + +

Fields

+ + + + + + + + + +
NameDescription
+inviteToken
+String! +
+

The invite token

+
+ +## KeyValueSchema + +Information about a raw Key Value Schema + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+keySchema
+String! +
+

Raw key schema

+
+valueSchema
+String! +
+

Raw value schema

+
+ +## LineageConfig + +Configurations related to Lineage + +

Fields

+ + + + + + + + + +
NameDescription
+supportsImpactAnalysis
+Boolean! +
+

Whether the backend supports the impact analysis feature

+
+ +## LineageRelationship + +Metadata about a lineage relationship between two entities + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+String! +
+

The type of the relationship

+
+entity
+Entity +
+

Entity that is related via lineage

+
+degree
+Int! +
+

Degree of relationship (number of hops to get to entity)

+
+createdOn
+Long +
+

Timestamp for when this lineage relationship was created. Could be null.

+
+createdActor
+Entity +
+

The actor who created this lineage relationship. Could be null.

+
+updatedOn
+Long +
+

Timestamp for when this lineage relationship was last updated. Could be null.

+
+updatedActor
+Entity +
+

The actor who last updated this lineage relationship. Could be null.

+
+isManual
+Boolean +
+

Whether this edge is a manual edge. Could be null.

+
+ +## LinkParams + +Parameters required to specify the page to land once clicked + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+searchParams
+SearchParams +
+

Context to define the search page

+
+entityProfileParams
+EntityProfileParams +
+

Context to define the entity profile page

+
+ +## ListAccessTokenResult + +Results returned when listing access tokens + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set

+
+count
+Int! +
+

The number of results to be returned

+
+total
+Int! +
+

The total number of results in the result set

+
+tokens
+[AccessTokenMetadata!]! +
+

The token metadata themselves

+
+ +## ListDomainsResult + +The result obtained when listing DataHub Domains + +
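A sketch of paging through Domains follows. The `listDomains` root field and its input shape are assumptions; only fields documented here, plus the Entity `urn`, are selected.

```graphql
# Sketch: page through Domains (assumed root field and input shape).
query ListDomains {
  listDomains(input: { start: 0, count: 20 }) {
    start
    count
    total
    domains {
      urn
    }
  }
}
```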

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Domains in the returned result set

+
+total
+Int! +
+

The total number of Domains in the result set

+
+domains
+[Domain!]! +
+

The Domains themselves

+
+ +## ListGroupsResult + +The result obtained when listing DataHub Groups + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Groups in the returned result set

+
+total
+Int! +
+

The total number of Groups in the result set

+
+groups
+[CorpGroup!]! +
+

The groups themselves

+
+ +## ListIngestionSourcesResult + +Results returned when listing ingestion sources + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set

+
+count
+Int! +
+

The number of results to be returned

+
+total
+Int! +
+

The total number of results in the result set

+
+ingestionSources
+[IngestionSource!]! +
+

The Ingestion Sources themselves

+
+ +## ListOwnershipTypesResult + +Results when listing custom ownership types. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set

+
+count
+Int! +
+

The number of results to be returned

+
+total
+Int! +
+

The total number of results in the result set

+
+ownershipTypes
+[OwnershipTypeEntity!]! +
+

The Custom Ownership Types themselves

+
+ +## ListPoliciesResult + +The result obtained when listing DataHub Access Policies + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Policies in the returned result set

+
+total
+Int! +
+

The total number of Policies in the result set

+
+policies
+[Policy!]! +
+

The Policies themselves

+
+ +## ListPostsResult + +The result obtained when listing Posts + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Posts in the returned result set

+
+total
+Int! +
+

The total number of Posts in the result set

+
+posts
+[Post!]! +
+

The Posts themselves

+
+ +## ListQueriesResult + +Results when listing entity queries + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set

+
+count
+Int! +
+

The number of results to be returned

+
+total
+Int! +
+

The total number of results in the result set

+
+queries
+[QueryEntity!]! +
+

The Queries themselves

+
+ +## ListRecommendationsResult + +Results returned by the ListRecommendations query + +

Fields

+ + + + + + + + + +
NameDescription
+modules
+[RecommendationModule!]! +
+

List of modules to show

+
+ +## ListRolesResult + +The result obtained when listing DataHub Roles + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Roles in the returned result set

+
+total
+Int! +
+

The total number of Roles in the result set

+
+roles
+[DataHubRole!]! +
+

The Roles themselves

+
+ +## ListSecretsResult + +Input for listing DataHub Secrets + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int +
+

The starting offset of the result set

+
+count
+Int +
+

The number of results to be returned

+
+total
+Int +
+

The total number of results in the result set

+
+secrets
+[Secret!]! +
+

The secrets themselves

+
+ +## ListTestsResult + +The result obtained when listing DataHub Tests + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Tests in the returned result set

+
+total
+Int! +
+

The total number of Tests in the result set

+
+tests
+[Test!]! +
+

The Tests themselves

+
+ +## ListUsersResult + +The result obtained when listing DataHub Users + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Users in the returned result set

+
+total
+Int! +
+

The total number of Users in the result set

+
+users
+[CorpUser!]! +
+

The users themselves

+
+ +## ListViewsResult + +The result obtained when listing DataHub Views + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The starting offset of the result set returned

+
+count
+Int! +
+

The number of Views in the returned result set

+
+total
+Int! +
+

The total number of Views in the result set

+
+views
+[DataHubView!]! +
+

The Views themselves

+
+ +## ManagedIngestionConfig + +Configurations related to managed, UI based ingestion + +

Fields

+ + + + + + + + + +
NameDescription
+enabled
+Boolean! +
+

Whether the ingestion screen is enabled in the UI

+
+ +## MatchedField + +An overview of the field that was matched in the entity search document + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Name of the field that matched

+
+value
+String! +
+

Value of the field that matched

+
+ +## Media + +Media content + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+type
+MediaType! +
+

The type of media

+
+location
+String! +
+

The location of the media (a URL)

+
+ +## Metrics + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+performanceMeasures
+[String!] +
+

Measures of ML Model performance

+
+decisionThreshold
+[String!] +
+

Decision Thresholds used if any

+
+ +## MLFeature + +An ML Feature Metadata Entity. Note that this entity is incubating. + +
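A sketch of fetching a single ML Feature follows. The `mlFeature` root field and the urn are assumptions; the selected fields match the tables in this section and the MLFeatureProperties section.

```graphql
# Sketch: fetch one ML Feature (assumed root field, placeholder urn).
query GetMLFeature {
  mlFeature(urn: "urn:li:mlFeature:(example_namespace,example_feature)") {
    urn
    name
    featureNamespace
    description
    dataType
    properties {
      description
      dataType
    }
  }
}
```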

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the ML Feature

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+name
+String! +
+

The display name for the ML Feature

+
+featureNamespace
+String! +
+

MLFeature featureNamespace

+
+description
+String +
+

The description about the ML Feature

+
+dataType
+MLFeatureDataType +
+

MLFeature data type

+
+ownership
+Ownership +
+

Ownership metadata of the MLFeature

+
+featureProperties
+MLFeatureProperties +
+
Deprecated: No longer supported
+ +

ModelProperties metadata of the MLFeature

+
+properties
+MLFeatureProperties +
+

ModelProperties metadata of the MLFeature

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the MLFeature

+
+status
+Status +
+

Status metadata of the MLFeature

+
+deprecation
+Deprecation +
+

Deprecation

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+tags
+GlobalTags +
+

Tags applied to entity

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the entity

+
+domain
+DomainAssociation +
+

The Domain associated with the entity

+
+editableProperties
+MLFeatureEditableProperties +
+

An additional set of read write properties

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## MLFeatureEditableProperties + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

The edited description

+
+ +## MLFeatureProperties + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+ +
+dataType
+MLFeatureDataType +
+ +
+version
+VersionTag +
+ +
+sources
+[Dataset] +
+ +
+ +## MLFeatureTable + +An ML Feature Table Entity. Note that this entity is incubating. + +

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the ML Feature Table

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+name
+String! +
+

The display name

+
+platform
+DataPlatform! +
+

Standardized platform urn where the MLFeatureTable is defined

+
+description
+String +
+

MLFeatureTable description

+
+ownership
+Ownership +
+

Ownership metadata of the MLFeatureTable

+
+properties
+MLFeatureTableProperties +
+

Additional read only properties associated with the ML Feature Table

+
+featureTableProperties
+MLFeatureTableProperties +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +ModelProperties metadata of the MLFeature

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the MLFeature

+
+status
+Status +
+

Status metadata of the MLFeatureTable

+
+deprecation
+Deprecation +
+

Deprecation

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the ML Feature Table. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+tags
+GlobalTags +
+

Tags applied to entity

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the entity

+
+domain
+DomainAssociation +
+

The Domain associated with the entity

+
+editableProperties
+MLFeatureTableEditableProperties +
+

An additional set of read write properties

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## MLFeatureTableEditableProperties + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

The edited description

+
+ +## MLFeatureTableProperties + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+ +
+mlFeatures
+[MLFeature] +
+ +
+mlPrimaryKeys
+[MLPrimaryKey] +
+ +
+customProperties
+[CustomPropertiesEntry!] +
+ +
+ +## MLHyperParam + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+ +
+description
+String +
+ +
+value
+String +
+ +
+createdAt
+Long +
+ +
+ +## MLMetric + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String +
+ +
+description
+String +
+ +
+value
+String +
+ +
+createdAt
+Long +
+ +
+ +## MLModel + +An ML Model Metadata Entity. Note that this entity is incubating. + +
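A sketch of fetching an ML Model together with its model-card style metadata follows. The `mlModel` root field and the urn are assumptions; the selections follow this table and the IntendedUse, EthicalConsiderations, and Metrics sections.

```graphql
# Sketch: fetch one ML Model and its model-card metadata (assumed root field, placeholder urn).
query GetMLModel {
  mlModel(urn: "urn:li:mlModel:(urn:li:dataPlatform:mlflow,example_model,PROD)") {
    urn
    name
    description
    intendedUse {
      primaryUses
      outOfScopeUses
    }
    ethicalConsiderations {
      data
      mitigations
    }
    metrics {
      performanceMeasures
      decisionThreshold
    }
  }
}
```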

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the ML model

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+name
+String! +
+

ML model display name

+
+platform
+DataPlatform! +
+

Standardized platform urn where the MLModel is defined

+
+origin
+FabricType! +
+

Fabric type where mlmodel belongs to or where it was generated

+
+description
+String +
+

Human readable description for mlmodel

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags field instead +The standard tags for the ML Model

+
+tags
+GlobalTags +
+

The standard tags for the ML Model

+
+ownership
+Ownership +
+

Ownership metadata of the mlmodel

+
+properties
+MLModelProperties +
+

Additional read only information about the ML Model

+
+intendedUse
+IntendedUse +
+

Intended use of the mlmodel

+
+factorPrompts
+MLModelFactorPrompts +
+

Factors metadata of the mlmodel

+
+metrics
+Metrics +
+

Metrics metadata of the mlmodel

+
+evaluationData
+[BaseData!] +
+

Evaluation Data of the mlmodel

+
+trainingData
+[BaseData!] +
+

Training Data of the mlmodel

+
+quantitativeAnalyses
+QuantitativeAnalyses +
+

Quantitative Analyses of the mlmodel

+
+ethicalConsiderations
+EthicalConsiderations +
+

Ethical Considerations of the mlmodel

+
+caveatsAndRecommendations
+CaveatsAndRecommendations +
+

Caveats and Recommendations of the mlmodel

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the mlmodel

+
+sourceCode
+SourceCode +
+

Source Code

+
+status
+Status +
+

Status metadata of the mlmodel

+
+cost
+Cost +
+

Cost Aspect of the mlmodel

+
+deprecation
+Deprecation +
+

Deprecation

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the ML Model. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the entity

+
+domain
+DomainAssociation +
+

The Domain associated with the entity

+
+editableProperties
+MLModelEditableProperties +
+

An additional set of read write properties

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## MLModelEditableProperties + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

The edited description

+
+ +## MLModelFactorPrompts + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+relevantFactors
+[MLModelFactors!] +
+

What are foreseeable salient factors for which MLModel performance may vary, and how were these determined

+
+evaluationFactors
+[MLModelFactors!] +
+

Which factors are being reported, and why were these chosen

+
+ +## MLModelFactors + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+groups
+[String!] +
+

Distinct categories with similar characteristics that are present in the evaluation data instances

+
+instrumentation
+[String!] +
+

Instrumentation used for MLModel

+
+environment
+[String!] +
+

Environment in which the MLModel is deployed

+
+ +## MLModelGroup + +An ML Model Group Metadata Entity +Note that this entity is incubating + +

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the ML Model Group

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+name
+String! +
+

The display name for the Entity

+
+platform
+DataPlatform! +
+

Standardized platform urn where the MLModelGroup is defined

+
+origin
+FabricType! +
+

Fabric type where MLModelGroup belongs to or where it was generated

+
+description
+String +
+

Human readable description for MLModelGroup

+
+properties
+MLModelGroupProperties +
+

Additional read only properties about the ML Model Group

+
+ownership
+Ownership +
+

Ownership metadata of the MLModelGroup

+
+status
+Status +
+

Status metadata of the MLModelGroup

+
+deprecation
+Deprecation +
+

Deprecation

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the ML Model Group. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+tags
+GlobalTags +
+

Tags applied to entity

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the entity

+
+domain
+DomainAssociation +
+

The Domain associated with the entity

+
+editableProperties
+MLModelGroupEditableProperties +
+

An additional set of read write properties

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## MLModelGroupEditableProperties + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

The edited description

+
+ +## MLModelGroupProperties + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+ +
+createdAt
+Long +
+ +
+version
+VersionTag +
+ +
+ +## MLModelProperties + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+ +
+date
+Long +
+ +
+version
+String +
+ +
+type
+String +
+ +
+hyperParameters
+HyperParameterMap +
+ +
+hyperParams
+[MLHyperParam] +
+ +
+trainingMetrics
+[MLMetric] +
+ +
+mlFeatures
+[String!] +
+ +
+tags
+[String!] +
+ +
+groups
+[MLModelGroup] +
+ +
+customProperties
+[CustomPropertiesEntry!] +
+ +
+externalUrl
+String +
+ +
+ +## MLPrimaryKey + +An ML Primary Key Entity. Note that this entity is incubating. + +

Implements

+ +- [EntityWithRelationships](/docs/graphql/interfaces#entitywithrelationships) +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the ML Primary Key

+
+type
+EntityType! +
+

A standard Entity Type

+
+lastIngested
+Long +
+

The timestamp for the last time this entity was ingested

+
+name
+String! +
+

The display name

+
+featureNamespace
+String! +
+

MLPrimaryKey featureNamespace

+
+description
+String +
+

MLPrimaryKey description

+
+dataType
+MLFeatureDataType +
+

MLPrimaryKey data type

+
+properties
+MLPrimaryKeyProperties +
+

Additional read only properties of the ML Primary Key

+
+primaryKeyProperties
+MLPrimaryKeyProperties +
+
Deprecated: No longer supported
+ +

Deprecated, use properties field instead +MLPrimaryKeyProperties

+
+ownership
+Ownership +
+

Ownership metadata of the MLPrimaryKey

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the MLPrimaryKey

+
+status
+Status +
+

Status metadata of the MLPrimaryKey

+
+deprecation
+Deprecation +
+

Deprecation

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+lineage
+EntityLineageResult +
+

Edges extending from this entity grouped by direction in the lineage graph

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+LineageInput! +
+ +
+ +
+tags
+GlobalTags +
+

Tags applied to entity

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the entity

+
+domain
+DomainAssociation +
+

The Domain associated with the entity

+
+editableProperties
+MLPrimaryKeyEditableProperties +
+

An additional set of read write properties

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## MLPrimaryKeyEditableProperties + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

The edited description

+
+ +## MLPrimaryKeyProperties + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+description
+String +
+ +
+dataType
+MLFeatureDataType +
+ +
+version
+VersionTag +
+ +
+sources
+[Dataset] +
+ +
+ +## NamedBar + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+ +
+segments
+[BarSegment!]! +
+ +
+ +## NamedLine + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+ +
+data
+[NumericDataPoint!]! +
+ +
+ +## Notebook + +A Notebook Metadata Entity + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) +- [BrowsableEntity](/docs/graphql/interfaces#browsableentity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Notebook

+
+type
+EntityType! +
+

A standard Entity Type

+
+tool
+String! +
+

The Notebook tool name

+
+notebookId
+String! +
+

An id unique within the Notebook tool

+
+info
+NotebookInfo +
+

Additional read only information about the Notebook

+
+editableProperties
+NotebookEditableProperties +
+

Additional read write properties about the Notebook

+
+ownership
+Ownership +
+

Ownership metadata of the Notebook

+
+status
+Status +
+

Status metadata of the Notebook

+
+content
+NotebookContent! +
+

The content of this Notebook

+
+tags
+GlobalTags +
+

The tags associated with the Notebook

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the Notebook

+
+domain
+DomainAssociation +
+

The Domain associated with the Notebook

+
+dataPlatformInstance
+DataPlatformInstance +
+

The specific instance of the data platform that this entity belongs to

+
+relationships
+EntityRelationshipsResult +
+

Edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+subTypes
+SubTypes +
+

Sub Types that this entity implements

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the notebook

+
+platform
+DataPlatform! +
+

Standardized platform.

+
+browsePaths
+[BrowsePath!] +
+

The browse paths corresponding to the Notebook. If no Browse Paths have been generated before, this will be null.

+
+browsePathV2
+BrowsePathV2 +
+

The browse path V2 corresponding to an entity. If no Browse Paths V2 have been generated before, this will be null.

+
+exists
+Boolean +
+

Whether or not this entity exists on DataHub

+
+ +## NotebookCell + +The Union of every NotebookCell + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+chartCell
+ChartCell +
+

The chart cell content. This will be non-null only when all other cell fields are null.

+
+textCell
+TextCell +
+

The text cell content. This will be non-null only when all other cell fields are null.

+
+queryChell
+QueryCell +
+

The query cell content. This will be non-null only when all other cell fields are null.

+
+type
+NotebookCellType! +
+

The type of this Notebook cell

+
+ +## NotebookContent + +The actual content in a Notebook + +

Fields

+ + + + + + + + + +
NameDescription
+cells
+[NotebookCell!]! +
+

The content of a Notebook which is composed by a list of NotebookCell

+
+ +## NotebookEditableProperties + +Notebook properties that are editable via the UI This represents logical metadata, +as opposed to technical metadata + +

Fields

+ + + + + + + + + +
NameDescription
+description
+String +
+

Description of the Notebook

+
+ +## NotebookInfo + +Additional read only information about a Notebook + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+title
+String +
+

Display title of the Notebook

+
+description
+String +
+

Description of the Notebook

+
+externalUrl
+String +
+

Native platform URL of the Notebook

+
+customProperties
+[CustomPropertiesEntry!] +
+

A list of platform specific metadata tuples

+
+changeAuditStamps
+ChangeAuditStamps +
+

Captures information about who created/last modified/deleted this Notebook and when

+
+ +## NumericDataPoint + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+x
+String! +
+ +
+y
+Int! +
+ +
+ +## Operation + +Operational info for an entity. + +

Implements

+ +- [TimeSeriesAspect](/docs/graphql/interfaces#timeseriesaspect) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+timestampMillis
+Long! +
+

The time at which the operation was reported

+
+actor
+String +
+

Actor who issued this operation.

+
+operationType
+OperationType! +
+

Operation type of change.

+
+customOperationType
+String +
+

A custom operation type

+
+sourceType
+OperationSourceType +
+

Source of the operation

+
+numAffectedRows
+Long +
+

How many rows were affected by this operation.

+
+affectedDatasets
+[String!] +
+

Which other datasets were affected by this operation.

+
+lastUpdatedTimestamp
+Long! +
+

The time at which the asset was actually updated

+
+partition
+String +
+

Optional partition identifier

+
+customProperties
+[StringMapEntry!] +
+

Custom operation properties

+
+ +## Origin + +Carries information about where an entity originated from. + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+type
+OriginType! +
+

Where an entity originated from. Either NATIVE or EXTERNAL

+
+externalType
+String +
+

Only populated if type is EXTERNAL. The externalType of the entity, such as the name of the identity provider.

+
+ +## Owner + +An owner of a Metadata Entity + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+owner
+OwnerType! +
+

Owner object

+
+type
+OwnershipType +
+
Deprecated: No longer supported
+ +

The type of the ownership. Deprecated - Use ownershipType field instead.

+
+ownershipType
+OwnershipTypeEntity +
+

Ownership type information

+
+source
+OwnershipSource +
+

Source information for the ownership

+
+associatedUrn
+String! +
+

Reference back to the owned urn for tracking purposes e.g. when sibling nodes are merged together

+
+ +## Ownership + +Ownership information about a Metadata Entity + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+owners
+[Owner!] +
+

List of owners of the entity

+
+lastModified
+AuditStamp! +
+

Audit stamp containing who last modified the record and when

+
+ +## OwnershipSource + +Information about the source of Ownership metadata about a Metadata Entity + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+type
+OwnershipSourceType! +
+

The type of the source

+
+url
+String +
+

An optional reference URL for the source

+
+ +## OwnershipTypeEntity + +A single Custom Ownership Type + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

A primary key associated with the custom ownership type.

+
+type
+EntityType! +
+

A standard Entity Type

+
+info
+OwnershipTypeInfo +
+

Information about the Custom Ownership Type

+
+status
+Status +
+

Status of the Custom Ownership Type

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from the Custom Ownership Type

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## OwnershipTypeInfo + +Properties about an individual Custom Ownership Type. + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the Custom Ownership Type

+
+description
+String +
+

The description of the Custom Ownership Type

+
+created
+AuditStamp +
+

An Audit Stamp corresponding to the creation of this resource

+
+lastModified
+AuditStamp +
+

An Audit Stamp corresponding to the update of this resource

+
+ +## ParentContainersResult + +All of the parent containers for a given entity. Returns parents with direct parent first followed by the parent's parent etc. + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+count
+Int! +
+

The number of containers bubbling up for this entity

+
+containers
+[Container!]! +
+

A list of parent containers in order from direct parent to parent's parent, etc. If there are no containers, return an empty list

+
+ +## ParentNodesResult + +All of the parent nodes for GlossaryTerms and GlossaryNodes + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+count
+Int! +
+

The number of parent nodes bubbling up for this entity

+
+nodes
+[GlossaryNode!]! +
+

A list of parent nodes in order from direct parent, to parent's parent etc. If there are no nodes, return an empty list

+
+ +## PartitionSpec + +Information about the partition being profiled + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+type
+PartitionType! +
+

The partition type

+
+partition
+String! +
+

The partition identifier

+
+timePartition
+TimeWindow +
+

The optional time window partition information

+
+ +## PlatformPrivileges + +The platform privileges that the currently authenticated user has + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+viewAnalytics
+Boolean! +
+

Whether the user should be able to view analytics

+
+managePolicies
+Boolean! +
+

Whether the user should be able to manage policies

+
+manageIdentities
+Boolean! +
+

Whether the user should be able to manage users & groups

+
+generatePersonalAccessTokens
+Boolean! +
+

Whether the user should be able to generate personal access tokens

+
+createDomains
+Boolean! +
+

Whether the user should be able to create new Domains

+
+manageDomains
+Boolean! +
+

Whether the user should be able to manage Domains

+
+manageIngestion
+Boolean! +
+

Whether the user is able to manage UI-based ingestion

+
+manageSecrets
+Boolean! +
+

Whether the user is able to manage UI-based secrets

+
+manageTokens
+Boolean! +
+

Whether the user should be able to manage tokens on behalf of other users.

+
+manageTests
+Boolean! +
+

Whether the user is able to manage Tests

+
+manageGlossaries
+Boolean! +
+

Whether the user should be able to manage Glossaries

+
+manageUserCredentials
+Boolean! +
+

Whether the user is able to manage user credentials

+
+createTags
+Boolean! +
+

Whether the user should be able to create new Tags

+
+manageTags
+Boolean! +
+

Whether the user should be able to create and delete all Tags

+
+manageGlobalViews
+Boolean! +
+

Whether the user should be able to create, update, and delete global views.

+
+manageOwnershipTypes
+Boolean! +
+

Whether the user should be able to create, update, and delete ownership types.

+
+ +## PoliciesConfig + +Configurations related to the Policies Feature + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+enabled
+Boolean! +
+

Whether the policies feature is enabled and should be displayed in the UI

+
+platformPrivileges
+[Privilege!]! +
+

A list of platform privileges to display in the Policy Builder experience

+
+resourcePrivileges
+[ResourcePrivileges!]! +
+

A list of resource privileges to display in the Policy Builder experience

+
+ +## Policy + +DEPRECATED +TODO: Eventually get rid of this in favor of DataHub Policy +A DataHub Platform Access Policy. Access Policies determine who can perform what actions against which resources on the platform + +
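A sketch of listing policies follows. The `listPolicies` root field and its input shape are assumptions; the selected fields follow the Policy, ResourceFilter, and ListPoliciesResult tables.

```graphql
# Sketch: list access policies (assumed root field and input shape).
query ListPolicies {
  listPolicies(input: { start: 0, count: 10 }) {
    start
    count
    total
    policies {
      urn
      name
      state
      privileges
      editable
      resources {
        type
        allResources
        resources
      }
    }
  }
}
```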

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Policy

+
+type
+PolicyType! +
+

The type of the Policy

+
+name
+String! +
+

The name of the Policy

+
+state
+PolicyState! +
+

The present state of the Policy

+
+description
+String +
+

The description of the Policy

+
+resources
+ResourceFilter +
+

The resources that the Policy privileges apply to

+
+privileges
+[String!]! +
+

The privileges that the Policy grants

+
+actors
+ActorFilter! +
+

The actors that the Policy grants privileges to

+
+editable
+Boolean! +
+

Whether the Policy is editable; system policies are not editable

+
+ +## PolicyMatchCriterion + +Criterion to define relationship between field and values + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+field
+String! +
+

The name of the field that the criterion refers to +e.g. entity_type, entity_urn, domain

+
+values
+[PolicyMatchCriterionValue!]! +
+

Values. Matches criterion if any one of the values matches condition (OR-relationship)

+
+condition
+PolicyMatchCondition! +
+

The condition relating the field to the values

+
+ +## PolicyMatchCriterionValue + +Value in PolicyMatchCriterion with hydrated entity if value is urn + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+value
+String! +
+

The value of the field to match

+
+entity
+Entity +
+

Hydrated entities of the above values. Only set if the value is an urn

+
+ +## PolicyMatchFilter + +Filter object that encodes a complex filter logic with OR + AND + +

Fields

+ + + + + + + + + +
NameDescription
+criteria
+[PolicyMatchCriterion!] +
+

List of criteria to apply

+
+ +## Post + +Input provided when creating a Post + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Post

+
+type
+EntityType! +
+

The standard Entity Type

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from the Post

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+postType
+PostType! +
+

The type of post

+
+content
+PostContent! +
+

The content of the post

+
+lastModified
+AuditStamp! +
+

When the post was last modified

+
+ +## PostContent + +Post content + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+contentType
+PostContentType! +
+

The type of post content

+
+title
+String! +
+

The title of the post

+
+description
+String +
+

Optional content of the post

+
+link
+String +
+

Optional link that the post is associated with

+
+media
+Media +
+

Optional media contained in the post

+
+ +## Privilege + +An individual DataHub Access Privilege + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+type
+String! +
+

Standardized privilege type, serving as a unique identifier for a privilege, e.g. EDIT_ENTITY

+
+displayName
+String +
+

The name to appear when displaying the privilege, e.g. Edit Entity

+
+description
+String +
+

A description of the privilege to display

+
+ +## Privileges + +Object that encodes the privileges the actor has for a given resource + +

Fields

+ + + + + + + + + +
NameDescription
+privileges
+[String!]! +
+

Granted Privileges

+
+ +## QuantitativeAnalyses + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+unitaryResults
+ResultsType +
+

Link to a dashboard with results showing how the model performed with respect to each factor

+
+intersectionalResults
+ResultsType +
+

Link to a dashboard with results showing how the model performed with respect to the intersection of evaluated factors

+
+ +## QueriesTabConfig + +Configuration for the queries tab + +

Fields

+ + + + + + + + + +
NameDescription
+queriesTabResultSize
+Int +
+

Number of queries to show in the queries tab

+
+ +## QueryCell + +A Notebook cell which contains Query as content + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+cellTitle
+String! +
+

Title of the cell

+
+cellId
+String! +
+

Unique id for the cell.

+
+changeAuditStamps
+ChangeAuditStamps +
+

Captures information about who created/last modified/deleted this TextCell and when

+
+rawQuery
+String! +
+

Raw query to explain some specific logic in a Notebook

+
+lastExecuted
+AuditStamp +
+

Captures information about who last executed this query cell and when

+
+ +## QueryEntity + +An individual Query + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

A primary key associated with the Query

+
+type
+EntityType! +
+

A standard Entity Type

+
+properties
+QueryProperties +
+

Properties about the Query

+
+subjects
+[QuerySubject!] +
+

Subjects for the query

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## QueryProperties + +Properties about an individual Query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+statement
+QueryStatement! +
+

The Query statement itself

+
+source
+QuerySource! +
+

The source of the Query

+
+name
+String +
+

The name of the Query

+
+description
+String +
+

The description of the Query

+
+created
+AuditStamp! +
+

An Audit Stamp corresponding to the creation of this resource

+
+lastModified
+AuditStamp! +
+

An Audit Stamp corresponding to the update of this resource

+
+ +## QueryStatement + +An individual Query Statement + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+value
+String! +
+

The query statement value

+
+language
+QueryLanguage! +
+

The language for the Query Statement

+
+ +## QuerySubject + +The subject for a Query + +

Fields

+ + + + + + + + + +
NameDescription
+dataset
+Dataset! +
+

The dataset which is the subject of the Query

+
+ +## QuickFilter + +A quick filter in search and auto-complete + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+field
+String! +
+

Name of field to filter by

+
+value
+String! +
+

Value to filter on

+
+entity
+Entity +
+

Entity that the value maps to if any

+
+ +## RawAspect + +Payload representing data about a single aspect + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+aspectName
+String! +
+

The name of the aspect

+
+payload
+String +
+

JSON string containing the aspect's payload

+
+renderSpec
+AspectRenderSpec +
+

Details for the frontend on how the raw aspect should be rendered

+
+ +## RecommendationContent + +Content to display within each recommendation module + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+value
+String! +
+

String representation of content

+
+entity
+Entity +
+

Entity being recommended. Empty if the content being recommended is not an entity

+
+params
+RecommendationParams +
+

Additional context required to generate the recommendation

+
+ +## RecommendationModule + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+title
+String! +
+

Title of the module to display

+
+moduleId
+String! +
+

Unique id of the module being recommended

+
+renderType
+RecommendationRenderType! +
+

Type of rendering that defines how the module should be rendered

+
+content
+[RecommendationContent!]! +
+

List of content to display inside the module

+
+ +## RecommendationParams + +Parameters required to render a recommendation of a given type + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+searchParams
+SearchParams +
+

Context to define the search recommendations

+
+entityProfileParams
+EntityProfileParams +
+

Context to define the entity profile page

+
+contentParams
+ContentParams +
+

Context about the recommendation

+
+ +## ResetToken + +Token that allows native users to reset their credentials + +

Fields

+ + + + + + + + + +
NameDescription
+resetToken
+String! +
+

The reset token

+
+ +## ResourceFilter + +The resources that a DataHub Access Policy applies to + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+type
+String +
+

The type of the resource the policy should apply to. Not required, because in the future we want to support filtering by type OR by domain

+
+resources
+[String!] +
+

A list of specific resource urns to apply the filter to

+
+allResources
+Boolean +
+

Whether or not to apply the filter to all resources of the type

+
+filter
+PolicyMatchFilter +
+

A filter defining the specific resources that the policy applies to

+
+ +## ResourcePrivileges + +A privilege associated with a particular resource type +A resource is most commonly a DataHub Metadata Entity + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+resourceType
+String! +
+

Resource type associated with the Access Privilege, eg dataset

+
+resourceTypeDisplayName
+String +
+

The name to use when displaying the resourceType

+
+entityType
+EntityType +
+

An optional entity type to use when performing search and navigation to the entity

+
+privileges
+[Privilege!]! +
+

A list of privileges that are supported against this resource

+
+ +## Role + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

A primary key of the Metadata Entity

+
+type
+EntityType! +
+

A standard Entity Type

+
+relationships
+EntityRelationshipsResult +
+

List of relationships between the source Entity and some destination entities of given types

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+id
+String! +
+

Id of the Role

+
+properties
+RoleProperties! +
+

Role properties, including the Request Access Url

+
+actors
+Actor! +
+

The actors to which this Role has been assigned

+
+ +## RoleAssociation + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+role
+Role! +
+

The Role entity itself

+
+associatedUrn
+String! +
+

Reference back to the tagged urn for tracking purposes e.g. when sibling nodes are merged together

+
+ +## RoleProperties + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

Name of the Role in an organisation

+
+description
+String +
+

Description of the role

+
+type
+String +
+

Role type can be READ, WRITE or ADMIN

+
+requestUrl
+String +
+

Url to request a role for a user in an organisation

+
+ +## RoleUser + +

Fields

+ + + + + + + + + +
NameDescription
+user
+CorpUser! +
+

Linked corp user of a role

+
+ +## Row + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+values
+[String!]! +
+ +
+cells
+[Cell!] +
+ +
+ +## Schema + +Deprecated, use SchemaMetadata instead +Metadata about a Dataset schema + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+datasetUrn
+String +
+

Dataset this schema metadata is associated with

+
+name
+String! +
+

Schema name

+
+platformUrn
+String! +
+

Platform this schema metadata is associated with

+
+version
+Long! +
+

The version of the GMS Schema metadata

+
+cluster
+String +
+

The cluster this schema metadata is derived from

+
+hash
+String! +
+

The SHA1 hash of the schema content

+
+platformSchema
+PlatformSchema +
+

The native schema in the dataset's platform, schemaless if it was not provided

+
+fields
+[SchemaField!]! +
+

Client-provided list of fields from the value schema

+
+primaryKeys
+[String!] +
+

Client-provided list of fields that define primary keys to access the record

+
+foreignKeys
+[ForeignKeyConstraint] +
+

Client-provided list of foreign key constraints

+
+createdAt
+Long +
+

The time at which the schema metadata information was created

+
+lastObserved
+Long +
+

The time at which the schema metadata information was last ingested

+
+ +## SchemaField + +Information about an individual field in a Dataset schema + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+fieldPath
+String! +
+

Flattened name of the field computed from jsonPath field

+
+jsonPath
+String +
+

Flattened name of a field in JSON Path notation

+
+label
+String +
+

Human readable label for the field. Not supplied by all data sources

+
+nullable
+Boolean! +
+

Indicates if this field is optional or nullable

+
+description
+String +
+

Description of the field

+
+type
+SchemaFieldDataType! +
+

Platform independent field type of the field

+
+nativeDataType
+String +
+

The native type of the field in the dataset's platform, as declared by the platform schema

+
+recursive
+Boolean! +
+

Whether the field references its own type recursively

+
+globalTags
+GlobalTags +
+
Deprecated: No longer supported
+ +

Deprecated, use tags field instead +Tags associated with the field

+
+tags
+GlobalTags +
+

Tags associated with the field

+
+glossaryTerms
+GlossaryTerms +
+

Glossary terms associated with the field

+
+isPartOfKey
+Boolean +
+

Whether the field is part of a key schema

+
+ +## SchemaFieldBlame + +Blame for a single field + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+fieldPath
+String! +
+

Flattened name of a schema field

+
+schemaFieldChange
+SchemaFieldChange! +
+

Attributes identifying a field change

+
+ +## SchemaFieldChange + +Attributes identifying a field change + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+timestampMillis
+Long! +
+

The time at which the schema was updated

+
+lastSemanticVersion
+String! +
+

The last semantic version that this schema was changed in

+
+versionStamp
+String! +
+

Version stamp of the change

+
+changeType
+ChangeOperationType! +
+

The type of the change

+
+lastSchemaFieldChange
+String +
+

Last column update, such as Added/Modified/Removed in v1.2.3.

+
+ +## SchemaFieldEntity + +Standalone schema field entity. Differs from the SchemaField struct because it is not directly nested inside a +schema field + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

Primary key of the schema field

+
+type
+EntityType! +
+

A standard Entity Type

+
+fieldPath
+String! +
+

Field path identifying the field in its dataset

+
+parent
+Entity! +
+

The field's parent.

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## SchemaFieldRef + +A Dataset schema field (i.e. column) + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

A schema field urn

+
+path
+String! +
+

A schema field path

+
+ +## SchemaMetadata + +Metadata about a Dataset schema + +

Implements

+ +- [Aspect](/docs/graphql/interfaces#aspect) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+aspectVersion
+Long +
+

The logical version of the schema metadata, where zero represents the latest version +with otherwise monotonic ordering starting at one

+
+datasetUrn
+String +
+

Dataset this schema metadata is associated with

+
+name
+String! +
+

Schema name

+
+platformUrn
+String! +
+

Platform this schema metadata is associated with

+
+version
+Long! +
+

The version of the GMS Schema metadata

+
+cluster
+String +
+

The cluster this schema metadata is derived from

+
+hash
+String! +
+

The SHA1 hash of the schema content

+
+platformSchema
+PlatformSchema +
+

The native schema in the dataset's platform, schemaless if it was not provided

+
+fields
+[SchemaField!]! +
+

Client-provided list of fields from the value schema

+
+primaryKeys
+[String!] +
+

Client-provided list of fields that define primary keys to access the record

+
+foreignKeys
+[ForeignKeyConstraint] +
+

Client-provided list of foreign key constraints

+
+createdAt
+Long +
+

The time at which the schema metadata information was created

+
+ +## ScrollAcrossLineageResults + +Results returned by issueing a search across relationships query using scroll API + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+nextScrollId
+String +
+

Opaque ID to pass to the next request to the server

+
+count
+Int! +
+

The number of entities included in the result set

+
+total
+Int! +
+

The total number of search results matching the query and filters

+
+searchResults
+[SearchAcrossLineageResult!]! +
+

The search result entities

+
+facets
+[FacetMetadata!] +
+

Candidate facet aggregations used for search filtering

+
+ +## ScrollResults + +Results returned by issuing a search query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+nextScrollId
+String +
+

Opaque ID to pass to the next request to the server

+
+count
+Int! +
+

The number of entities included in the result set

+
+total
+Int! +
+

The total number of search results matching the query and filters

+
+searchResults
+[SearchResult!]! +
+

The search result entities for a scroll request

+
+facets
+[FacetMetadata!] +
+

Candidate facet aggregations used for search filtering

+
+ +## SearchAcrossLineageResult + +Individual search result from a search across relationships query (has added metadata about the path) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+entity
+Entity! +
+

The resolved DataHub Metadata Entity matching the search query

+
+insights
+[SearchInsight!] +
+

Insights about why the search result was matched

+
+matchedFields
+[MatchedField!]! +
+

Matched field hint

+
+paths
+[EntityPath] +
+

Optional list of entities between the source and destination node

+
+degree
+Int! +
+

Degree of relationship (number of hops to get to entity)

+
+ +## SearchAcrossLineageResults + +Results returned by issueing a search across relationships query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The offset of the result set

+
+count
+Int! +
+

The number of entities included in the result set

+
+total
+Int! +
+

The total number of search results matching the query and filters

+
+searchResults
+[SearchAcrossLineageResult!]! +
+

The search result entities

+
+facets
+[FacetMetadata!] +
+

Candidate facet aggregations used for search filtering

+
+freshness
+FreshnessStats +
+

Optional freshness characteristics of this query (cached, staleness etc.)

+
+ +## SearchInsight + +Insights about why a search result was returned or ranked in the way that it was + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+text
+String! +
+

The insight to display

+
+icon
+String +
+

An optional emoji to display in front of the text

+
+ +## SearchParams + +Context to define the search recommendations + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+types
+[EntityType!] +
+

Entity types to be searched. If this is not provided, all entities will be searched.

+
+query
+String! +
+

Search query

+
+filters
+[FacetFilter!] +
+

Filters

+
+ +## SearchResult + +An individual search result hit + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+entity
+Entity! +
+

The resolved DataHub Metadata Entity matching the search query

+
+insights
+[SearchInsight!] +
+

Insights about why the search result was matched

+
+matchedFields
+[MatchedField!]! +
+

Matched field hint

+
+ +## SearchResults + +Results returned by issuing a search query + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+start
+Int! +
+

The offset of the result set

+
+count
+Int! +
+

The number of entities included in the result set

+
+total
+Int! +
+

The total number of search results matching the query and filters

+
+searchResults
+[SearchResult!]! +
+

The search result entities

+
+facets
+[FacetMetadata!] +
+

Candidate facet aggregations used for search filtering

+
+ +## Secret + +A referencible secret stored in DataHub's system. Notice that we do not return the actual secret value. + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The urn of the secret

+
+name
+String! +
+

The name of the secret

+
+description
+String +
+

An optional description for the secret

+
+ +## SecretValue + +A plaintext secret value + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

The name of the secret

+
+value
+String! +
+

The plaintext value of the secret.

+
+ +## SemanticVersionStruct + +Properties identify a semantic version + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+semanticVersion
+String +
+

Semantic version of the change

+
+semanticVersionTimestamp
+Long +
+

Semantic version timestamp

+
+versionStamp
+String +
+

Version stamp of the change

+
+ +## SiblingProperties + +Metadata about the entity's siblings + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+isPrimary
+Boolean +
+

Whether this entity is the primary sibling among the sibling set

+
+siblings
+[Entity] +
+

The sibling entities

+
+ +## SourceCode + +

Fields

+ + + + + + + + + +
NameDescription
+sourceCode
+[SourceCodeUrl!] +
+

Source Code along with types

+
+ +## SourceCodeUrl + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+type
+SourceCodeUrlType! +
+

Source Code Url Types

+
+sourceCodeUrl
+String! +
+

Source Code Url

+
+ +## Status + +The status of a particular Metadata Entity + +

Fields

+ + + + + + + + + +
NameDescription
+removed
+Boolean! +
+

Whether the entity is removed or not

+
+ +## StepStateResult + +A single step state + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+id
+String! +
+

Unique id of the step

+
+properties
+[StringMapEntry!]! +
+

The properties for the step state

+
+ +## StringBox + +

Fields

+ + + + + + + + + +
NameDescription
+stringValue
+String! +
+ +
+ +## StringMapEntry + +An entry in a string string map represented as a tuple + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+key
+String! +
+

The key of the map entry

+
+value
+String +
+

The value of the map entry

+
+ +## StructuredReport + +A flexible carrier for structured results of an execution request. + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+type
+String! +
+

The type of the structured report. (e.g. INGESTION_REPORT, TEST_CONNECTION_REPORT, etc.)

+
+serializedValue
+String! +
+

The serialized value of the structured report

+
+contentType
+String! +
+

The content-type of the serialized value (e.g. application/json, application/json;gzip etc.)

+
+ +## SubTypes + +

Fields

+ + + + + + + + + +
NameDescription
+typeNames
+[String!] +
+

The sub-types that this entity implements. e.g. Datasets that are views will implement the "view" subtype

+
+ +## SystemFreshness + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+systemName
+String! +
+

Name of the system

+
+freshnessMillis
+Long! +
+

The latest timestamp in millis of the system that was used to respond to this query +In case a cache was consulted, this reflects the freshness of the cache +In case an index was consulted, this reflects the freshness of the index

+
+ +## TableChart + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+title
+String! +
+ +
+columns
+[String!]! +
+ +
+rows
+[Row!]! +
+ +
+ +## TableSchema + +Information about a raw Table Schema + +

Fields

+ + + + + + + + + +
NameDescription
+schema
+String! +
+

Raw table schema

+
+ +## Tag + +A Tag Entity, which can be associated with other Metadata Entities and subresources + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Tag

+
+type
+EntityType! +
+

A standard Entity Type

+
+name
+String! +
+
Deprecated: No longer supported
+ +

A unique identifier for the Tag. Deprecated - Use properties.name field instead.

+
+properties
+TagProperties +
+

Additional properties about the Tag

+
+editableProperties
+EditableTagProperties +
+
Deprecated: No longer supported
+ +

Additional read write properties about the Tag +Deprecated! Use 'properties' field instead.

+
+ownership
+Ownership +
+

Ownership metadata of the Tag

+
+relationships
+EntityRelationshipsResult +
+

Granular API for querying edges extending from this entity

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+description
+String +
+
Deprecated: No longer supported
+ +

Deprecated, use properties.description field instead

+
+ +## TagAssociation + +An edge between a Metadata Entity and a Tag Modeled as a struct to permit +additional attributes +TODO Consider whether this query should be serviced by the relationships field + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+tag
+Tag! +
+

The tag itself

+
+associatedUrn
+String! +
+

Reference back to the tagged urn for tracking purposes e.g. when sibling nodes are merged together

+
+ +## TagProperties + +Properties for a DataHub Tag + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+name
+String! +
+

A display name for the Tag

+
+description
+String +
+

A description of the Tag

+
+colorHex
+String +
+

An optional RGB hex code for a Tag color, e.g. #FFFFFF

+
+ +## TelemetryConfig + +Configurations related to tracking users in the app + +

Fields

+ + + + + + + + + +
NameDescription
+enableThirdPartyLogging
+Boolean +
+

Env variable for whether or not third party logging should be enabled for this instance

+
+ +## Test + +A metadata entity representing a DataHub Test + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Test itself

+
+type
+EntityType! +
+

The standard Entity Type

+
+name
+String! +
+

The name of the Test

+
+category
+String! +
+

The category of the Test (user defined)

+
+description
+String +
+

Description of the test

+
+definition
+TestDefinition! +
+

Definition for the test

+
+relationships
+EntityRelationshipsResult +
+

Unused for tests

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## TestDefinition + +Definition of the test + +

Fields

+ + + + + + + + + +
NameDescription
+json
+String +
+

JSON-based definition for the test

+
+ +## TestResult + +The result of running a test + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+test
+Test +
+

The test itself, or null if the test has been deleted

+
+type
+TestResultType! +
+

The final result, e.g. either SUCCESS or FAILURE.

+
+ +## TestResults + +A set of test results + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+passing
+[TestResult!]! +
+

The tests passing

+
+failing
+[TestResult!]! +
+

The tests failing

+
+ +## TestsConfig + +Configurations related to DataHub Tests feature + +

Fields

+ + + + + + + + + +
NameDescription
+enabled
+Boolean! +
+

Whether Tests feature is enabled

+
+ +## TextCell + +A Notebook cell which contains text as content + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+cellTitle
+String! +
+

Title of the cell

+
+cellId
+String! +
+

Unique id for the cell.

+
+changeAuditStamps
+ChangeAuditStamps +
+

Captures information about who created/last modified/deleted this TextCell and when

+
+text
+String! +
+

The actual text in a TextCell in a Notebook

+
+ +## TimeSeriesChart + +For consumption by UI only + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+title
+String! +
+ +
+lines
+[NamedLine!]! +
+ +
+dateRange
+DateRange! +
+ +
+interval
+DateInterval! +
+ +
+ +## TimeWindow + +A time window with a finite start and end time + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+startTimeMillis
+Long! +
+

The start time of the time window

+
+durationMillis
+Long! +
+

The duration of the time window, in milliseconds

+
+ +## UpdateStepStateResult + +Result returned when fetching step state + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+id
+String! +
+

Id of the step

+
+succeeded
+Boolean! +
+

Whether the update succeeded.

+
+ +## UpstreamEntityRelationships + +Deprecated, use relationships query instead + +

Fields

+ + + + + + + + + +
NameDescription
+entities
+[EntityRelationshipLegacy] +
+ +
+ +## UsageAggregation + +An aggregation of Dataset usage statistics + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+bucket
+Long +
+

The time window start time

+
+duration
+WindowDuration +
+

The time window span

+
+resource
+String +
+

The resource urn associated with the usage information, eg a Dataset urn

+
+metrics
+UsageAggregationMetrics +
+

The rolled up usage metrics

+
+ +## UsageAggregationMetrics + +Rolled up metrics about Dataset usage over time + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+uniqueUserCount
+Int +
+

The unique number of users who have queried the dataset within the time range

+
+users
+[UserUsageCounts] +
+

Usage statistics within the time range by user

+
+totalSqlQueries
+Int +
+

The total number of queries issued against the dataset within the time range

+
+topSqlQueries
+[String] +
+

A set of common queries issued against the dataset within the time range

+
+fields
+[FieldUsageCounts] +
+

Per field usage statistics within the time range

+
+ +## UsageQueryResult + +The result of a Dataset usage query + +

Fields

+ + + + + + + + + + + + + +
NameDescription
+buckets
+[UsageAggregation] +
+

A set of relevant time windows for use in displaying usage statistics

+
+aggregations
+UsageQueryResultAggregations +
+

A set of rolled up aggregations about the Dataset usage

+
+ +## UsageQueryResultAggregations + +A set of rolled up aggregations about the Dataset usage + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+uniqueUserCount
+Int +
+

The count of unique Dataset users within the queried time range

+
+users
+[UserUsageCounts] +
+

The specific per user usage counts within the queried time range

+
+fields
+[FieldUsageCounts] +
+

The specific per field usage counts within the queried time range

+
+totalSqlQueries
+Int +
+

The total number of queries executed within the queried time range +Note that this field will likely be deprecated in favor of a totalQueries field

+
+ +## UserUsageCounts + +Information about individual user usage of a Dataset + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+user
+CorpUser +
+

The user of the Dataset

+
+count
+Int +
+

The number of queries issued by the user

+
+userEmail
+String +
+

The extracted user email +Note that this field will soon be deprecated and merged with user

+
+ +## VersionedDataset + +A Dataset entity, which encompasses Relational Tables, Document store collections, streaming topics, and other sets of data having an independent lifecycle + +

Implements

+ +- [Entity](/docs/graphql/interfaces#entity) + +

Fields

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+

The primary key of the Dataset

+
+type
+EntityType! +
+

The standard Entity Type

+
+platform
+DataPlatform! +
+

Standardized platform urn where the dataset is defined

+
+container
+Container +
+

The parent container in which the entity resides

+
+parentContainers
+ParentContainersResult +
+

Recursively get the lineage of containers for this entity

+
+name
+String! +
+

Unique guid for dataset +No longer to be used as the Dataset display name. Use properties.name instead

+
+properties
+DatasetProperties +
+

An additional set of read only properties

+
+editableProperties
+DatasetEditableProperties +
+

An additional set of read-write properties

+
+ownership
+Ownership +
+

Ownership metadata of the dataset

+
+deprecation
+Deprecation +
+

The deprecation status of the dataset

+
+institutionalMemory
+InstitutionalMemory +
+

References to internal resources related to the dataset

+
+editableSchemaMetadata
+EditableSchemaMetadata +
+

Editable schema metadata of the dataset

+
+status
+Status +
+

Status of the Dataset

+
+tags
+GlobalTags +
+

Tags used for searching dataset

+
+glossaryTerms
+GlossaryTerms +
+

The structured glossary terms associated with the dataset

+
+domain
+DomainAssociation +
+

The Domain associated with the Dataset

+
+health
+[Health!] +
+

Experimental! The resolved health status of the Dataset

+
+schema
+Schema +
+

Schema metadata of the dataset

+
+subTypes
+SubTypes +
+

Sub Types that this entity implements

+
+viewProperties
+ViewProperties +
+

View related properties. Only relevant if subtypes field contains view.

+
+origin
+FabricType! +
+
Deprecated: No longer supported
+ +

Deprecated, see the properties field instead +The environment to which the dataset belongs or where it was generated +Note that this field will soon be deprecated in favor of a more standardized concept of Environment

+
+relationships
+EntityRelationshipsResult +
+
Deprecated: No longer supported
+ +

No-op, has to be included due to model

+ +

Arguments

+ + + + + + + + + +
NameDescription
+input
+RelationshipsInput! +
+ +
+ +
+ +## VersionTag + +The technical version associated with a given Metadata Entity + +

Fields

+ + + + + + + + + +
NameDescription
+versionTag
+String +
+ +
+ +## ViewProperties + +Properties about a Dataset of type view + +

Fields

+ + + + + + + + + + + + + + + + + +
NameDescription
+materialized
+Boolean! +
+

Whether the view is materialized or not

+
+logic
+String! +
+

The logic associated with the view, most commonly a SQL statement

+
+language
+String! +
+

The language in which the view logic is written, for example SQL

+
+ +## ViewsConfig + +Configurations related to DataHub Views feature + +

Fields

+ + + + + + + + + +
NameDescription
+enabled
+Boolean! +
+

Whether Views feature is enabled

+
+ +## VisualConfig + +Configurations related to visual appearance of the app + +

Fields

+ + + + + + + + + + + + + + + + + + + + + +
NameDescription
+logoUrl
+String +
+

Custom logo url for the homepage & top banner

+
+faviconUrl
+String +
+

Custom favicon url for the homepage & top banner

+
+queriesTab
+QueriesTabConfig +
+

Configuration for the queries tab

+
+entityProfiles
+EntityProfilesConfig +
+

Configuration for entity profiles

+
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/queries.md b/docs-website/versioned_docs/version-0.10.4/graphql/queries.md new file mode 100644 index 0000000000000..d42f52a7de7aa --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/queries.md @@ -0,0 +1,1611 @@ +--- +id: queries +title: Queries +slug: queries +sidebar_position: 1 +--- + +## aggregateAcrossEntities + +**Type:** [AggregateResults](/docs/graphql/objects#aggregateresults) + +Aggregate across DataHub entities + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AggregateAcrossEntitiesInput! +
+ +
+ +## appConfig + +**Type:** [AppConfig](/docs/graphql/objects#appconfig) + +Fetch configurations +Used by DataHub UI + +## assertion + +**Type:** [Assertion](/docs/graphql/objects#assertion) + +Fetch an Assertion by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## autoComplete + +**Type:** [AutoCompleteResults](/docs/graphql/objects#autocompleteresults) + +Autocomplete a search query against a specific DataHub Entity Type + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AutoCompleteInput! +
+ +
+ +## autoCompleteForMultiple + +**Type:** [AutoCompleteMultipleResults](/docs/graphql/objects#autocompletemultipleresults) + +Autocomplete a search query against a specific set of DataHub Entity Types + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+AutoCompleteMultipleInput! +
+ +
+ +## batchGetStepStates + +**Type:** [BatchGetStepStatesResult!](/docs/graphql/objects#batchgetstepstatesresult) + +Batch fetch the state for a set of steps. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BatchGetStepStatesInput! +
+ +
+ +## browse + +**Type:** [BrowseResults](/docs/graphql/objects#browseresults) + +Hierarchically browse a specific type of DataHub Entity by path +Used by explore in the UI + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BrowseInput! +
+ +
+ +## browsePaths + +**Type:** [[BrowsePath!]](/docs/graphql/objects#browsepath) + +Retrieve the browse paths corresponding to an entity + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BrowsePathsInput! +
+ +
+ +## browseV2 + +**Type:** [BrowseResultsV2](/docs/graphql/objects#browseresultsv2) + +Browse for different entities by getting organizational groups and their +aggregated counts + content. Uses browsePathsV2 aspect and replaces our old +browse endpoint. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+BrowseV2Input! +
+ +
+ +## chart + +**Type:** [Chart](/docs/graphql/objects#chart) + +Fetch a Chart by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## container + +**Type:** [Container](/docs/graphql/objects#container) + +Fetch an Entity Container by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## corpGroup + +**Type:** [CorpGroup](/docs/graphql/objects#corpgroup) + +Fetch a CorpGroup, representing a DataHub platform group by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## corpUser + +**Type:** [CorpUser](/docs/graphql/objects#corpuser) + +Fetch a CorpUser, representing a DataHub platform user, by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## dashboard + +**Type:** [Dashboard](/docs/graphql/objects#dashboard) + +Fetch a Dashboard by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## dataFlow + +**Type:** [DataFlow](/docs/graphql/objects#dataflow) + +Fetch a Data Flow (or Data Pipeline) by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## dataJob + +**Type:** [DataJob](/docs/graphql/objects#datajob) + +Fetch a Data Job (or Data Task) by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## dataPlatform + +**Type:** [DataPlatform](/docs/graphql/objects#dataplatform) + +Fetch a Data Platform by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## dataProduct + +**Type:** [DataProduct](/docs/graphql/objects#dataproduct) + +Fetch a DataProduct by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## dataset + +**Type:** [Dataset](/docs/graphql/objects#dataset) + +Fetch a Dataset by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## domain + +**Type:** [Domain](/docs/graphql/objects#domain) + +Fetch a Domain by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## entities + +**Type:** [[Entity]](/docs/graphql/interfaces#entity) + +Gets entities based on their urns + +

Arguments

+ + + + + + + + + +
NameDescription
+urns
+[String!]! +
+ +
+ +## entity + +**Type:** [Entity](/docs/graphql/interfaces#entity) + +Gets an entity based on its urn + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## entityExists + +**Type:** [Boolean](/docs/graphql/scalars#boolean) + +Get whether or not not an entity exists + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## executionRequest + +**Type:** [ExecutionRequest](/docs/graphql/objects#executionrequest) + +Get an execution request +urn: The primary key associated with the execution request. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## getAccessToken + +**Type:** [AccessToken](/docs/graphql/objects#accesstoken) + +Generates an access token for DataHub APIs for a particular user & of a particular type +Deprecated, use createAccessToken instead + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetAccessTokenInput! +
+ +
+ +## getAnalyticsCharts + +**Type:** [[AnalyticsChartGroup!]!](/docs/graphql/objects#analyticschartgroup) + +Retrieves a set of server driven Analytics Charts to render in the UI + +## getEntityCounts + +**Type:** [EntityCountResults](/docs/graphql/objects#entitycountresults) + +Fetches the number of entities ingested by type + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+EntityCountInput +
+ +
+ +## getGrantedPrivileges + +**Type:** [Privileges](/docs/graphql/objects#privileges) + +Get all granted privileges for the given actor and resource + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetGrantedPrivilegesInput! +
+ +
+ +## getHighlights + +**Type:** [[Highlight!]!](/docs/graphql/objects#highlight) + +Retrieves a set of server driven Analytics Highlight Cards to render in the UI + +## getInviteToken + +**Type:** [InviteToken](/docs/graphql/objects#invitetoken) + +Get invite token + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetInviteTokenInput! +
+ +
+ +## getMetadataAnalyticsCharts + +**Type:** [[AnalyticsChartGroup!]!](/docs/graphql/objects#analyticschartgroup) + +Retrieves a set of charts regarding the ingested metadata + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+MetadataAnalyticsInput! +
+ +
+ +## getQuickFilters + +**Type:** [GetQuickFiltersResult](/docs/graphql/objects#getquickfiltersresult) + +Get quick filters to display in auto-complete + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetQuickFiltersInput! +
+ +
+ +## getRootGlossaryNodes + +**Type:** [GetRootGlossaryNodesResult](/docs/graphql/objects#getrootglossarynodesresult) + +Get all GlossaryNodes without a parentNode + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetRootGlossaryEntitiesInput! +
+ +
+ +## getRootGlossaryTerms + +**Type:** [GetRootGlossaryTermsResult](/docs/graphql/objects#getrootglossarytermsresult) + +Get all GlossaryTerms without a parentNode + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetRootGlossaryEntitiesInput! +
+ +
+ +## getSchemaBlame + +**Type:** [GetSchemaBlameResult](/docs/graphql/objects#getschemablameresult) + +Returns the most recent changes made to each column in a dataset at each dataset version. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetSchemaBlameInput! +
+ +
+ +## getSchemaVersionList + +**Type:** [GetSchemaVersionListResult](/docs/graphql/objects#getschemaversionlistresult) + +Returns the list of schema versions for a dataset. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetSchemaVersionListInput! +
+ +
+ +## getSecretValues + +**Type:** [[SecretValue!]](/docs/graphql/objects#secretvalue) + +Fetch the values of a set of secrets. The caller must have the MANAGE_SECRETS +privilege to use. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+GetSecretValuesInput! +
+ +
+ +## globalViewsSettings + +**Type:** [GlobalViewsSettings](/docs/graphql/objects#globalviewssettings) + +Fetch the Global Settings related to the Views feature. +Requires the 'Manage Global Views' Platform Privilege. + +## glossaryNode + +**Type:** [GlossaryNode](/docs/graphql/objects#glossarynode) + +Fetch a Glossary Node by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## glossaryTerm + +**Type:** [GlossaryTerm](/docs/graphql/objects#glossaryterm) + +Fetch a Glossary Term by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## ingestionSource + +**Type:** [IngestionSource](/docs/graphql/objects#ingestionsource) + +Fetch a specific ingestion source +urn: The primary key associated with the ingestion source. + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## isAnalyticsEnabled + +**Type:** [Boolean!](/docs/graphql/scalars#boolean) + +Deprecated, use appConfig Query instead +Whether the analytics feature is enabled in the UI + +## listAccessTokens + +**Type:** [ListAccessTokenResult!](/docs/graphql/objects#listaccesstokenresult) + +List access tokens stored in DataHub. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListAccessTokenInput! +
+ +
+ +## listDataProductAssets + +**Type:** [SearchResults](/docs/graphql/objects#searchresults) + +List Data Product assets for a given urn + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+input
+SearchAcrossEntitiesInput! +
+ +
+ +## listDomains + +**Type:** [ListDomainsResult](/docs/graphql/objects#listdomainsresult) + +List all DataHub Domains + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListDomainsInput! +
+ +
+ +## listGlobalViews + +**Type:** [ListViewsResult](/docs/graphql/objects#listviewsresult) + +List Global DataHub Views + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListGlobalViewsInput! +
+ +
+ +## listGroups + +**Type:** [ListGroupsResult](/docs/graphql/objects#listgroupsresult) + +List all DataHub Groups + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListGroupsInput! +
+ +
+ +## listIngestionSources + +**Type:** [ListIngestionSourcesResult](/docs/graphql/objects#listingestionsourcesresult) + +List all ingestion sources + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListIngestionSourcesInput! +
+ +
+ +## listMyViews + +**Type:** [ListViewsResult](/docs/graphql/objects#listviewsresult) + +List DataHub Views owned by the current user + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListMyViewsInput! +
+ +
+ +## listOwnershipTypes + +**Type:** [ListOwnershipTypesResult!](/docs/graphql/objects#listownershiptypesresult) + +List Custom Ownership Types + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListOwnershipTypesInput! +
+

Input required for listing custom ownership types

+
+ +## listPolicies + +**Type:** [ListPoliciesResult](/docs/graphql/objects#listpoliciesresult) + +List all DataHub Access Policies + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListPoliciesInput! +
+ +
+ +## listPosts + +**Type:** [ListPostsResult](/docs/graphql/objects#listpostsresult) + +List all Posts + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListPostsInput! +
+ +
+ +## listQueries + +**Type:** [ListQueriesResult](/docs/graphql/objects#listqueriesresult) + +List Dataset Queries + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListQueriesInput! +
+

Input required for listing queries

+
+ +## listRecommendations + +**Type:** [ListRecommendationsResult](/docs/graphql/objects#listrecommendationsresult) + +Fetch recommendations for a particular scenario + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListRecommendationsInput! +
+ +
+ +## listRoles + +**Type:** [ListRolesResult](/docs/graphql/objects#listrolesresult) + +List all DataHub Roles + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListRolesInput! +
+ +
+ +## listSecrets + +**Type:** [ListSecretsResult](/docs/graphql/objects#listsecretsresult) + +List all secrets stored in DataHub (no values) + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListSecretsInput! +
+ +
+ +## listTests + +**Type:** [ListTestsResult](/docs/graphql/objects#listtestsresult) + +List all DataHub Tests + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListTestsInput! +
+ +
+ +## listUsers + +**Type:** [ListUsersResult](/docs/graphql/objects#listusersresult) + +List all DataHub Users + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ListUsersInput! +
+ +
+ +## me + +**Type:** [AuthenticatedUser](/docs/graphql/objects#authenticateduser) + +Fetch details associated with the authenticated user, provided via an auth cookie or header + +## mlFeature + +**Type:** [MLFeature](/docs/graphql/objects#mlfeature) + +Incubating: Fetch a ML Feature by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## mlFeatureTable + +**Type:** [MLFeatureTable](/docs/graphql/objects#mlfeaturetable) + +Incubating: Fetch a ML Feature Table by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## mlModel + +**Type:** [MLModel](/docs/graphql/objects#mlmodel) + +Incubating: Fetch an ML Model by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## mlModelGroup + +**Type:** [MLModelGroup](/docs/graphql/objects#mlmodelgroup) + +Incubating: Fetch an ML Model Group by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## mlPrimaryKey + +**Type:** [MLPrimaryKey](/docs/graphql/objects#mlprimarykey) + +Incubating: Fetch a ML Primary Key by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## notebook + +**Type:** [Notebook](/docs/graphql/objects#notebook) + +Fetch a Notebook by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## scrollAcrossEntities + +**Type:** [ScrollResults](/docs/graphql/objects#scrollresults) + +Search DataHub entities by providing a pointer reference for scrolling through results. + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ScrollAcrossEntitiesInput! +
+ +
+ +## scrollAcrossLineage + +**Type:** [ScrollAcrossLineageResults](/docs/graphql/objects#scrollacrosslineageresults) + +Search across the results of a graph query on a node, uses scroll API + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+ScrollAcrossLineageInput! +
+ +
+ +## search + +**Type:** [SearchResults](/docs/graphql/objects#searchresults) + +Full text search against a specific DataHub Entity Type + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+SearchInput! +
+ +
+ +## searchAcrossEntities + +**Type:** [SearchResults](/docs/graphql/objects#searchresults) + +Search DataHub entities + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+SearchAcrossEntitiesInput! +
+ +
+ +## searchAcrossLineage + +**Type:** [SearchAcrossLineageResults](/docs/graphql/objects#searchacrosslineageresults) + +Search across the results of a graph query on a node + +

Arguments

+ + + + + + + + + +
NameDescription
+input
+SearchAcrossLineageInput! +
+ +
+ +## tag + +**Type:** [Tag](/docs/graphql/objects#tag) + +Fetch a Tag by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## test + +**Type:** [Test](/docs/graphql/objects#test) + +Fetch a Test by primary key (urn) + +

Arguments

+ + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+ +## versionedDataset + +**Type:** [VersionedDataset](/docs/graphql/objects#versioneddataset) + +Fetch a Dataset by primary key (urn) at a point in time based on aspect versions (versionStamp) + +

Arguments

+ + + + + + + + + + + + + +
NameDescription
+urn
+String! +
+ +
+versionStamp
+String +
+ +
diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/scalars.md b/docs-website/versioned_docs/version-0.10.4/graphql/scalars.md new file mode 100644 index 0000000000000..18070eacb034b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/scalars.md @@ -0,0 +1,24 @@ +--- +id: scalars +title: Scalars +slug: scalars +sidebar_position: 8 +--- + +## Boolean + +The `Boolean` scalar type represents `true` or `false`. + +## Float + +The `Float` scalar type represents signed double-precision fractional values as specified by [IEEE 754](https://en.wikipedia.org/wiki/IEEE_floating_point). + +## Int + +The `Int` scalar type represents non-fractional signed whole numeric values. Int can represent values between -(2^31) and 2^31 - 1. + +## Long + +## String + +The `String` scalar type represents textual data, represented as UTF-8 character sequences. The String type is most often used by GraphQL to represent free-form human-readable text. diff --git a/docs-website/versioned_docs/version-0.10.4/graphql/unions.md b/docs-website/versioned_docs/version-0.10.4/graphql/unions.md new file mode 100644 index 0000000000000..fd64e01ecbc4f --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/graphql/unions.md @@ -0,0 +1,49 @@ +--- +id: unions +title: Unions +slug: unions +sidebar_position: 6 +--- + +## AnalyticsChart + +For consumption by UI only + +

Possible types

+ +- [TimeSeriesChart](/docs/graphql/objects#timeserieschart) +- [BarChart](/docs/graphql/objects#barchart) +- [TableChart](/docs/graphql/objects#tablechart) + +## HyperParameterValueType + +

Possible types

+ +- [StringBox](/docs/graphql/objects#stringbox) +- [IntBox](/docs/graphql/objects#intbox) +- [FloatBox](/docs/graphql/objects#floatbox) +- [BooleanBox](/docs/graphql/objects#booleanbox) + +## OwnerType + +An owner of a Metadata Entity, either a user or group + +

Possible types

+ +- [CorpUser](/docs/graphql/objects#corpuser) +- [CorpGroup](/docs/graphql/objects#corpgroup) + +## PlatformSchema + +A type of Schema, either a table schema or a key value schema + +

Possible types

+ +- [TableSchema](/docs/graphql/objects#tableschema) +- [KeyValueSchema](/docs/graphql/objects#keyvalueschema) + +## ResultsType + +

Possible types

+ +- [StringBox](/docs/graphql/objects#stringbox) diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion-modules/airflow-plugin/README.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion-modules/airflow-plugin/README.md new file mode 100644 index 0000000000000..c8bb3a1f913ee --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion-modules/airflow-plugin/README.md @@ -0,0 +1,10 @@ +--- +title: Datahub Airflow Plugin +slug: /metadata-ingestion-modules/airflow-plugin +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion-modules/airflow-plugin/README.md +--- + +# Datahub Airflow Plugin + +See [the DataHub Airflow docs](/docs/lineage/airflow) for details. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/README.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/README.md new file mode 100644 index 0000000000000..d4f01c85fdd0a --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/README.md @@ -0,0 +1,214 @@ +--- +slug: /metadata-ingestion +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/README.md +--- + +# Introduction to Metadata Ingestion + + +Find Integration Source + + +## Integration Options + +DataHub supports both **push-based** and **pull-based** metadata integration. + +Push-based integrations allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible. + +Examples of push-based integrations include [Airflow](../docs/lineage/airflow.md), [Spark](../metadata-integration/java/spark-lineage/README.md), [Great Expectations](./integration_docs/great-expectations.md) and [Protobuf Schemas](../metadata-integration/java/datahub-protobuf/README.md). This allows you to get low-latency metadata integration from the "active" agents in your data ecosystem. Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others. + +This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack. + +## Getting Started + +### Prerequisites + +Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. You can either run ingestion via the [UI](../docs/ui-ingestion.md) or via the [CLI](../docs/cli.md). You can reference the CLI usage guide given there as you go through this page. + +## Core Concepts + +### Sources + +Please see our [Integrations page](/integrations) to browse our ingestion sources and filter on their features. + +Data systems that we are extracting metadata from are referred to as **Sources**. The `Sources` tab on the left in the sidebar shows you all the sources that are available for you to ingest metadata from. For example, we have sources for [BigQuery](/docs/generated/ingestion/sources/bigquery), [Looker](/docs/generated/ingestion/sources/looker), [Tableau](/docs/generated/ingestion/sources/tableau) and many others. + +#### Metadata Ingestion Source Status + +We apply a Support Status to each Metadata Source to help you understand the integration reliability at a glance. 
+ +![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen): Certified Sources are well-tested & widely-adopted by the DataHub Community. We expect the integration to be stable with few user-facing issues. + +![Incubating](https://img.shields.io/badge/support%20status-incubating-blue): Incubating Sources are ready for DataHub Community adoption but have not been tested for a wide variety of edge-cases. We eagerly solicit feedback from the Community to streghten the connector; minor version changes may arise in future releases. + +![Testing](https://img.shields.io/badge/support%20status-testing-lightgrey): Testing Sources are available for experiementation by DataHub Community members, but may change without notice. + +### Sinks + +Sinks are destinations for metadata. When configuring ingestion for DataHub, you're likely to be sending the metadata to DataHub over either the [REST (datahub-sink)](./sink_docs/datahub.md#datahub-rest) or the [Kafka (datahub-kafka)](./sink_docs/datahub.md#datahub-kafka) sink. In some cases, the [File](./sink_docs/file.md) sink is also helpful to store a persistent offline copy of the metadata during debugging. + +The default sink that most of the ingestion systems and guides assume is the `datahub-rest` sink, but you should be able to adapt all of them for the other sinks as well! + +### Recipes + +A recipe is the main configuration file that puts it all together. It tells our ingestion scripts where to pull data from (source) and where to put it (sink). + +:::tip +Name your recipe with **.dhub.yaml** extension like _myrecipe.dhub.yaml_ to use vscode or intellij as a recipe editor with autocomplete +and syntax validation. + +Make sure yaml plugin is installed for your editor: + +- For vscode install [Redhat's yaml plugin](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) +- For intellij install [official yaml plugin](https://plugins.jetbrains.com/plugin/13126-yaml) + +::: + +Since `acryl-datahub` version `>=0.8.33.2`, the default sink is assumed to be a DataHub REST endpoint: + +- Hosted at "http://localhost:8080" or the environment variable `${DATAHUB_GMS_URL}` if present +- With an empty auth token or the environment variable `${DATAHUB_GMS_TOKEN}` if present. + +Here's a simple recipe that pulls metadata from MSSQL (source) and puts it into the default sink (datahub rest). + +```yaml +# The simplest recipe that pulls metadata from MSSQL and puts it into DataHub +# using the Rest API. +source: + type: mssql + config: + username: sa + password: ${MSSQL_PASSWORD} + database: DemoData +# sink section omitted as we want to use the default datahub-rest sink +``` + +Running this recipe is as simple as: + +```shell +datahub ingest -c recipe.dhub.yaml +``` + +or if you want to override the default endpoints, you can provide the environment variables as part of the command like below: + +```shell +DATAHUB_GMS_URL="https://my-datahub-server:8080" DATAHUB_GMS_TOKEN="my-datahub-token" datahub ingest -c recipe.dhub.yaml +``` + +A number of recipes are included in the [examples/recipes](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](../docs/cli.md#installing-plugins). + +> Note that one recipe file can only have 1 source and 1 sink. If you want multiple sources then you will need multiple recipe files. 
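If you would rather be explicit than rely on the default sink, the sink section can be spelled out in the recipe. The following is a minimal sketch that assumes a DataHub REST endpoint at the placeholder address shown; adjust `server` (and `token`, if authentication is enabled) for your deployment.

```yaml
# The same MSSQL source, but with the sink written out explicitly
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080" # placeholder; point this at your GMS endpoint
    token: ${DATAHUB_GMS_TOKEN} # optional; only needed when token-based auth is enabled
```

Keeping credentials and endpoints in environment variables, as shown above, also fits the guidance on sensitive information in the next section.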
+ +### Handling sensitive information in recipes + +We automatically expand environment variables in the config (e.g. `${MSSQL_PASSWORD}`), +similar to variable substitution in GNU bash or in docker-compose files. For details, see +https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution. This environment variable substitution should be used to mask sensitive information in recipe files. As long as you can get env variables securely to the ingestion process there would not be any need to store sensitive information in recipes. + +### Basic Usage of CLI for ingestion + +```shell +pip install 'acryl-datahub[datahub-rest]' # install the required plugin +datahub ingest -c ./examples/recipes/mssql_to_datahub.dhub.yml +``` + +The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to validate that the +ingestion recipe is producing the desired metadata events before ingesting them into datahub. + +```shell +# Dry run +datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --dry-run +# Short-form +datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n +``` + +The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source. +This option helps with quick end-to-end smoke testing of the ingestion recipe. + +```shell +# Preview +datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --preview +# Preview with dry-run +datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview +``` + +By default `--preview` creates 10 workunits. But if you wish to try producing more workunits you can use another option `--preview-workunits` + +```shell +# Preview 20 workunits without sending anything to sink +datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview --preview-workunits=20 +``` + +#### Reporting + +By default, the cli sends an ingestion report to DataHub, which allows you to see the result of all cli-based ingestion in the UI. This can be turned off with the `--no-default-report` flag. + +```shell +# Running ingestion with reporting to DataHub turned off +datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml --no-default-report +``` + +The reports include the recipe that was used for ingestion. This can be turned off by adding an additional section to the ingestion recipe. + +```yaml +source: + # source configs + +sink: + # sink configs + +# Add configuration for the datahub reporter +reporting: + - type: datahub + config: + report_recipe: false +``` + +## Transformations + +If you'd like to modify data before it reaches the ingestion sinks – for instance, adding additional owners or tags – you can use a transformer to write your own module and integrate it with DataHub. Transformers require extending the recipe with a new section to describe the transformers that you want to run. 
+ +For example, a pipeline that ingests metadata from MSSQL and applies a default "important" tag to all datasets is described below: + +```yaml +# A recipe to ingest metadata from MSSQL and apply default tags to all tables +source: + type: mssql + config: + username: sa + password: ${MSSQL_PASSWORD} + database: DemoData + +transformers: # an array of transformers applied sequentially + - type: simple_add_dataset_tags + config: + tag_urns: + - "urn:li:tag:Important" +# default sink, no config needed +``` + +Check out the [transformers guide](./docs/transformer/intro.md) to learn more about how you can create really flexible pipelines for processing metadata using Transformers! + +## Using as a library (SDK) + +In some cases, you might want to construct Metadata events directly and use programmatic ways to emit that metadata to DataHub. In this case, take a look at the [Python emitter](./as-a-library.md) and the [Java emitter](../metadata-integration/java/as-a-library.md) libraries which can be called from your own code. + +### Programmatic Pipeline + +In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it. + +- [programmatic_pipeline.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/programatic_pipeline.py) - a basic mysql to REST programmatic pipeline. + +## Developing + +See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./docs/transformer/intro.md). + +## Compatibility + +DataHub server uses a 3 digit versioning scheme, while the CLI uses a 4 digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version. +We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month. + +For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/adding-source.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/adding-source.md new file mode 100644 index 0000000000000..7e7aa2ac9cbd8 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/adding-source.md @@ -0,0 +1,264 @@ +--- +title: Adding a Metadata Ingestion Source +slug: /metadata-ingestion/adding-source +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/adding-source.md +--- + +# Adding a Metadata Ingestion Source + +There are two ways of adding a metadata ingestion source. + +1. You are going to contribute the custom source directly to the Datahub project. +2. You are writing the custom source for yourself and are not going to contribute back (yet). + +If you are going for case (1) just follow the steps 1 to 9 below. In case you are building it for yourself you can skip +steps 4-9 (but maybe write tests and docs for yourself as well) and follow the documentation +on [how to use custom ingestion sources](../docs/how/add-custom-ingestion-source.md) +without forking Datahub. 
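+
+If you are building the source for yourself (case 2), the guide linked above describes how the custom class is referenced from a recipe by its fully-qualified Python path rather than a registered plugin alias. A minimal recipe sketch, assuming a hypothetical `my_package.custom_source.MySource` class that is importable in the same environment as the DataHub CLI:
+
+```yaml
+# Recipe sketch for a custom source that lives outside the DataHub repo.
+source:
+  type: my_package.custom_source.MySource # hypothetical fully-qualified class path
+  config:
+    # whatever fields your source's ConfigModel defines
+    env: PROD
+
+# no sink section: the default datahub-rest sink is used
+```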
+ +:::note + +This guide assumes that you've already followed the metadata ingestion [developing guide](./developing.md) to set up +your local environment. + +::: + +### 1. Set up the configuration model + +We use [pydantic](https://pydantic-docs.helpmanual.io/) for configuration, and all models must inherit +from `ConfigModel`. The [file source](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/file.py) is a good example. + +#### Documentation for Configuration Classes + +We use [pydantic](https://pydantic-docs.helpmanual.io) conventions for documenting configuration flags. Use the `description` attribute to write rich documentation for your configuration field. + +For example, the following code: + +```python +from pydantic import Field +from datahub.api.configuration.common import ConfigModel + +class LookerAPIConfig(ConfigModel): + client_id: str = Field(description="Looker API client id.") + client_secret: str = Field(description="Looker API client secret.") + base_url: str = Field( + description="Url to your Looker instance: `https://company.looker.com:19999` or `https://looker.company.com`, or similar. Used for making API calls to Looker and constructing clickable dashboard and chart urls." + ) + transport_options: Optional[TransportOptionsConfig] = Field( + default=None, + description="Populates the [TransportOptions](https://github.com/looker-open-source/sdk-codegen/blob/94d6047a0d52912ac082eb91616c1e7c379ab262/python/looker_sdk/rtl/transport.py#L70) struct for looker client", + ) +``` + +generates the following documentation: + +

+ +

+ +:::note +Inline markdown or code snippets are not yet supported for field level documentation. +::: + +### 2. Set up the reporter + +The reporter interface enables the source to report statistics, warnings, failures, and other information about the run. +Some sources use the default `SourceReport` class, but others inherit and extend that class. + +### 3. Implement the source itself + +The core for the source is the `get_workunits` method, which produces a stream of metadata events (typically MCP objects) wrapped up in a MetadataWorkUnit. +The [file source](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/file.py) is a good and simple example. + +The MetadataChangeEventClass is defined in the metadata models which are generated +under `metadata-ingestion/src/datahub/metadata/schema_classes.py`. There are also +some [convenience methods](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/mce_builder.py) for commonly used operations. + +### 4. Set up the dependencies + +Declare the source's pip dependencies in the `plugins` variable of the [setup script](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py). + +### 5. Enable discoverability + +Declare the source under the `entry_points` variable of the [setup script](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py). This enables the source to be +listed when running `datahub check plugins`, and sets up the source's shortened alias for use in recipes. + +### 6. Write tests + +Tests go in the `tests` directory. We use the [pytest framework](https://pytest.org/). + +### 7. Write docs + +#### 7.1 Set up the source class for automatic documentation + +- Indicate the platform name that this source class produces metadata for using the `@platform_name` decorator. We prefer using the human-readable platform name, so e.g. BigQuery (not bigquery). +- Indicate the config class being used by the source by using the `@config_class` decorator. +- Indicate the support status of the connector by using the `@support_status` decorator. +- Indicate what capabilities the connector supports (and what important capabilities it does NOT support) by using the `@capability` decorator. +- Add rich documentation for the connector by utilizing docstrings on your Python class. Markdown is supported. + +See below a simple example of how to do this for any source. + +```python + +from datahub.ingestion.api.decorators import ( + SourceCapability, + SupportStatus, + capability, + config_class, + platform_name, + support_status, +) + +@platform_name("File") +@support_status(SupportStatus.CERTIFIED) +@config_class(FileSourceConfig) +@capability( + SourceCapability.PLATFORM_INSTANCE, + "File based ingestion does not support platform instances", + supported=False, +) +@capability(SourceCapability.DOMAINS, "Enabled by default") +@capability(SourceCapability.DATA_PROFILING, "Optionally enabled via configuration") +@capability(SourceCapability.DESCRIPTIONS, "Enabled by default") +@capability(SourceCapability.LINEAGE_COARSE, "Enabled by default") +class FileSource(Source): + """ + + The File Source can be used to produce all kinds of metadata from a generic metadata events file. + :::note + Events in this file can be in MCE form or MCP form. + ::: + + """ + + ... 
source code goes here + +``` + +#### 7.2 Write custom documentation + +- Create a copy of [`source-docs-template.md`](./source-docs-template.md) and edit all relevant components. +- Name the document as `` and move it to `metadata-ingestion/docs/sources//.md`. For example for the Kafka platform, under the `kafka` plugin, move the document to `metadata-ingestion/docs/sources/kafka/kafka.md`. +- Add a quickstart recipe corresponding to the plugin under `metadata-ingestion/docs/sources//_recipe.yml`. For example, for the Kafka platform, under the `kafka` plugin, there is a quickstart recipe located at `metadata-ingestion/docs/sources/kafka/kafka_recipe.yml`. +- To write platform-specific documentation (that is cross-plugin), write the documentation under `metadata-ingestion/docs/sources//README.md`. For example, cross-plugin documentation for the BigQuery platform is located under `metadata-ingestion/docs/sources/bigquery/README.md`. + +#### 7.3 Viewing the Documentation + +Documentation for the source can be viewed by running the documentation generator from the `docs-website` module. + +##### Step 1: Build the Ingestion docs + +```console +# From the root of DataHub repo +./gradlew :metadata-ingestion:docGen +``` + +If this finishes successfully, you will see output messages like: + +```console +Ingestion Documentation Generation Complete +############################################ +{ + "source_platforms": { + "discovered": 40, + "generated": 40 + }, + "plugins": { + "discovered": 47, + "generated": 47, + "failed": 0 + } +} +############################################ +``` + +You can also find documentation files generated at `./docs/generated/ingestion/sources` relative to the root of the DataHub repo. You should be able to locate your specific source's markdown file here and investigate it to make sure things look as expected. + +#### Step 2: Build the Entire Documentation + +To view how this documentation looks in the browser, there is one more step. Just build the entire docusaurus page from the `docs-website` module. + +```console +# From the root of DataHub repo +./gradlew :docs-website:build +``` + +This will generate messages like: + +```console +... +> Task :docs-website:yarnGenerate +yarn run v1.22.0 +$ rm -rf genDocs/* && ts-node -O '{ "lib": ["es2020"], "target": "es6" }' generateDocsDir.ts && mv -v docs/* genDocs/ +Including untracked files in docs list: +docs/graphql -> genDocs/graphql +Done in 2.47s. + +> Task :docs-website:yarnBuild +yarn run v1.22.0 +$ docusaurus build + +╭──────────────────────────────────────────────────────────────────────────────╮│ ││ Update available 2.0.0-beta.8 → 2.0.0-beta.18 ││ ││ To upgrade Docusaurus packages with the latest version, run the ││ following command: ││ yarn upgrade @docusaurus/core@latest ││ @docusaurus/plugin-ideal-image@latest @docusaurus/preset-classic@latest ││ │╰──────────────────────────────────────────────────────────────────────────────╯ + + +[en] Creating an optimized production build... +Invalid docusaurus-plugin-ideal-image version 2.0.0-beta.7. +All official @docusaurus/* packages should have the exact same version as @docusaurus/core (2.0.0-beta.8). +Maybe you want to check, or regenerate your yarn.lock or package-lock.json file? +Browserslist: caniuse-lite is outdated. 
Please run: + npx browserslist@latest --update-db + Why you should do it regularly: https://github.com/browserslist/browserslist#browsers-data-updating +ℹ Compiling Client +ℹ Compiling Server +✔ Client: Compiled successfully in 1.95s +✔ Server: Compiled successfully in 7.52s +Success! Generated static files in "build". + +Use `npm run serve` command to test your build locally. + +Done in 11.59s. + +Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0. +Use '--warning-mode all' to show the individual deprecation warnings. +See https://docs.gradle.org/6.9.2/userguide/command_line_interface.html#sec:command_line_warnings + +BUILD SUCCESSFUL in 35s +36 actionable tasks: 16 executed, 20 up-to-date +``` + +After this you need to run the following script from the `docs-website` module. + +```console +cd docs-website +npm run serve +``` + +Now, browse to http://localhost:3000 or whichever port npm is running on, to browse the docs. +Your source should show up on the left sidebar under `Metadata Ingestion / Sources`. + +### 8. Add SQL Alchemy mapping (if applicable) + +Add the source in `get_platform_from_sqlalchemy_uri` function +in [sql_common.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py) if the source has an sqlalchemy source + +### 9. Add logo for the platform + +Add the logo image in [images folder](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/images) and add it to be ingested at [startup](https://github.com/datahub-project/datahub/blob/master/metadata-service/war/src/main/resources/boot/data_platforms.json) + +### 10. Update Frontend for UI-based ingestion + +We are currently transitioning to a more dynamic approach to display available sources for UI-based Managed Ingestion. For the time being, adhere to these next steps to get your source to display in the UI Ingestion tab. + +#### 10.1 Add to sources.json + +Add new source to the list in [sources.json](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/app/ingest/source/builder/sources.json) including a default quickstart recipe. This will render your source in the list of options when creating a new recipe in the UI. + +#### 10.2 Add logo to the React app + +Add your source logo to the React [images folder](https://github.com/datahub-project/datahub/tree/master/datahub-web-react/src/images) so your image is available in memory. + +#### 10.3 Update constants.ts + +Create new constants in [constants.ts](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/app/ingest/source/builder/constants.ts) for the source urn and source name. Update PLATFORM_URN_TO_LOGO to map your source urn to the newly added logo in the images folder. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/as-a-library.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/as-a-library.md new file mode 100644 index 0000000000000..d2dd9f4d1a776 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/as-a-library.md @@ -0,0 +1,136 @@ +--- +title: Python Emitter +slug: /metadata-ingestion/as-a-library +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/as-a-library.md +--- + +# Python Emitter + +In some cases, you might want to construct Metadata events directly and use programmatic ways to emit that metadata to DataHub. 
Use-cases are typically push-based and include emitting metadata events from CI/CD pipelines, custom orchestrators etc.
+
+The `acryl-datahub` Python package offers REST and Kafka emitter APIs, which can easily be imported and called from your own code.
+
+> **Pro Tip!** Throughout our API guides, we have examples of using the Python SDK.
+> Look out for the `| Python |` tab within our tutorials.
+
+## Installation
+
+Follow the installation guide for the main `acryl-datahub` package [here](./README.md#install-from-pypi). Read on for emitter-specific installation instructions.
+
+## REST Emitter
+
+The REST emitter is a thin wrapper on top of the `requests` module and offers a blocking interface for sending metadata events over HTTP. Use this when simplicity and acknowledgement of metadata being persisted to DataHub's metadata store is more important than throughput of metadata emission. Also use this when read-after-write scenarios exist, e.g. writing metadata and then immediately reading it back.
+
+### Installation
+
+```console
+pip install -U 'acryl-datahub[datahub-rest]'
+```
+
+### Example Usage
+
+```python
+import datahub.emitter.mce_builder as builder
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.metadata.schema_classes import DatasetPropertiesClass
+
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+
+# Create an emitter to DataHub over REST
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
+
+# Test the connection
+emitter.test_connection()
+
+# Construct a dataset properties object
+dataset_properties = DatasetPropertiesClass(description="This table stores the canonical User profile",
+    customProperties={
+        "governance": "ENABLED"
+    })
+
+# Construct a MetadataChangeProposalWrapper object.
+metadata_event = MetadataChangeProposalWrapper(
+    entityUrn=builder.make_dataset_urn("bigquery", "my-project.my-dataset.user-table"),
+    aspect=dataset_properties,
+)
+
+# Emit metadata! This is a blocking call
+emitter.emit(metadata_event)
+```
+
+Other examples:
+
+- [lineage_emitter_mcpw_rest.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_mcpw_rest.py) - emits simple bigquery table-to-table (dataset-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
+
+### Emitter Code
+
+If you're interested in looking at the REST emitter code, it is available [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/rest_emitter.py).
+
+## Kafka Emitter
+
+The Kafka emitter is a thin wrapper on top of the SerializingProducer class from `confluent-kafka` and offers a non-blocking interface for sending metadata events to DataHub. Use this when you want to decouple your metadata producer from the uptime of your DataHub metadata server by utilizing Kafka as a highly available message bus. For example, if your DataHub metadata service is down due to planned or unplanned outages, you can still continue to collect metadata from your mission-critical systems by sending it to Kafka. Also use this emitter when throughput of metadata emission is more important than acknowledgement of metadata being persisted to DataHub's backend store.
+
+**_Note_**: The Kafka emitter uses Avro to serialize the Metadata events to Kafka. Changing the serializer will result in unprocessable events as DataHub currently expects the metadata events over Kafka to be serialized in Avro.
+ +### Installation + +```console +# For emission over Kafka +pip install -U `acryl-datahub[datahub-kafka]` +``` + +### Example Usage + +```python +import datahub.emitter.mce_builder as builder +from datahub.emitter.mcp import MetadataChangeProposalWrapper +from datahub.metadata.schema_classes import DatasetPropertiesClass + +from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig +# Create an emitter to Kafka +kafka_config = { + "connection": { + "bootstrap": "localhost:9092", + "schema_registry_url": "http://localhost:8081", + "schema_registry_config": {}, # schema_registry configs passed to underlying schema registry client + "producer_config": {}, # extra producer configs passed to underlying kafka producer + } +} + +emitter = DatahubKafkaEmitter( + KafkaEmitterConfig.parse_obj(kafka_config) +) + +# Construct a dataset properties object +dataset_properties = DatasetPropertiesClass(description="This table stored the canonical User profile", + customProperties={ + "governance": "ENABLED" + }) + +# Construct a MetadataChangeProposalWrapper object. +metadata_event = MetadataChangeProposalWrapper( + entityUrn=builder.make_dataset_urn("bigquery", "my-project.my-dataset.user-table"), + aspect=dataset_properties, +) + + +# Emit metadata! This is a non-blocking call +emitter.emit( + metadata_event, + callback=lambda exc, message: print(f"Message sent to topic:{message.topic()}, partition:{message.partition()}, offset:{message.offset()}") if message else print(f"Failed to send with: {exc}") +) + +#Send all pending events +emitter.flush() +``` + +### Emitter Code + +If you're interested in looking at the Kafka emitter code, it is available [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/kafka_emitter.py) + +## Other Languages + +Emitter API-s are also supported for: + +- [Java](../metadata-integration/java/as-a-library.md) diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/developing.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/developing.md new file mode 100644 index 0000000000000..c0bbc88be2539 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/developing.md @@ -0,0 +1,211 @@ +--- +title: Developing on Metadata Ingestion +slug: /metadata-ingestion/developing +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/developing.md +--- + +# Developing on Metadata Ingestion + +If you just want to use metadata ingestion, check the [user-centric](./README.md) guide. +This document is for developers who want to develop and possibly contribute to the metadata ingestion framework. + +Also take a look at the guide to [adding a source](./adding-source.md). + +## Getting Started + +### Requirements + +1. Python 3.7+ must be installed in your host environment. +2. Java8 (gradle won't work with newer versions) +3. On MacOS: `brew install librdkafka` +4. On Debian/Ubuntu: `sudo apt install librdkafka-dev python3-dev python3-venv` +5. On Fedora (if using LDAP source integration): `sudo yum install openldap-devel` + +### Set up your Python environment + +From the repository root: + +```shell +cd metadata-ingestion +../gradlew :metadata-ingestion:installDev +source venv/bin/activate +datahub version # should print "DataHub CLI version: unavailable (installed in develop mode)" +``` + +### Common setup issues + +Common issues (click to expand): + +
+ datahub command not found with PyPI install + +If you've already run the pip install, but running `datahub` in your command line doesn't work, then there is likely an issue with your PATH setup and Python. + +The easiest way to circumvent this is to install and run via Python, and use `python3 -m datahub` in place of `datahub`. + +```shell +python3 -m pip install --upgrade acryl-datahub +python3 -m datahub --help +``` + +
+ +
+ Wheel issues e.g. "Failed building wheel for avro-python3" or "error: invalid command 'bdist_wheel'" + +This means Python's `wheel` is not installed. Try running the following commands and then retry. + +```shell +pip install --upgrade pip wheel setuptools +pip cache purge +``` + +
+ +
+ Failure to install confluent_kafka: "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1" + +This sometimes happens if there's a version mismatch between the Kafka's C library and the Python wrapper library. Try running `pip install confluent_kafka==1.5.0` and then retrying. + +
+ +### Using Plugins in Development + +The syntax for installing plugins is slightly different in development. For example: + +```diff +- pip install 'acryl-datahub[bigquery,datahub-rest]' ++ pip install -e '.[bigquery,datahub-rest]' +``` + +## Architecture + +

+ +

+ +The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub. + +## Code layout + +- The CLI interface is defined in [entrypoints.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/entrypoints.py) and in the [cli](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/cli) directory. +- The high level interfaces are defined in the [API directory](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/api). +- The actual [sources](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source) and [sinks](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/sink) have their own directories. The registry files in those directories import the implementations. +- The metadata models are created using code generation, and eventually live in the `./src/datahub/metadata` directory. However, these files are not checked in and instead are generated at build time. See the [codegen](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/scripts/codegen.sh) script for details. +- Tests live in the [`tests`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests) directory. They're split between smaller unit tests and larger integration tests. + +## Code style + +We use black, isort, flake8, and mypy to ensure consistent code style and quality. + +```shell +# Assumes: pip install -e '.[dev]' and venv is activated +black src/ tests/ +isort src/ tests/ +flake8 src/ tests/ +mypy src/ tests/ +``` + +or you can run from root of the repository + +```shell +./gradlew :metadata-ingestion:lintFix +``` + +Some other notes: + +- Prefer mixin classes over tall inheritance hierarchies. +- Write type annotations wherever possible. +- Use `typing.Protocol` to make implicit interfaces explicit. +- If you ever find yourself copying and pasting large chunks of code, there's probably a better way to do it. +- Prefer a standalone helper method over a `@staticmethod`. +- You probably should not be defining a `__hash__` method yourself. Using `@dataclass(frozen=True)` is a good way to get a hashable class. +- Avoid global state. In sources, this includes instance variables that effectively function as "global" state for the source. +- Avoid defining functions within other functions. This makes it harder to read and test the code. +- When interacting with external APIs, parse the responses into a dataclass rather than operating directly on the response object. + +## Dependency Management + +The vast majority of our dependencies are not required by the "core" package but instead can be optionally installed using Python "extras". This allows us to keep the core package lightweight. We should be deliberate about adding new dependencies to the core framework. + +Where possible, we should avoid pinning version dependencies. The `acryl-datahub` package is frequently used as a library and hence installed alongside other tools. 
If you need to restrict the version of a dependency, use a range like `>=1.2.3,<2.0.0` or a negative constraint like `>=1.2.3, !=1.2.7` instead. Every upper bound and negative constraint should be accompanied by a comment explaining why it's necessary.
+
+Caveat: Some packages like Great Expectations and Airflow frequently make breaking changes. For such packages, it's OK to add a "defensive" upper bound with the current latest version, accompanied by a comment. It's critical that we revisit these upper bounds at least once a month and broaden them if possible.
+
+## Guidelines for Ingestion Configs
+
+We use [pydantic](https://pydantic-docs.helpmanual.io/) to define the ingestion configs.
+In order to ensure that the configs are consistent and easy to use, we have a few guidelines:
+
+#### Naming
+
+- Most important point: we should **match the terminology of the source system**. For example, snowflake shouldn’t have a `host_port`, it should have an `account_id`.
+- We should prefer slightly more verbose names when the alternative isn’t descriptive enough. For example `client_id` or `tenant_id` over a bare `id` and `access_secret` over a bare `secret`.
+- AllowDenyPatterns should be used whenever we need to filter a list. The pattern should always apply to the fully qualified name of the entity. These configs should be named `*_pattern`, for example `table_pattern`.
+- Avoid `*_only` configs like `profile_table_level_only` in favor of `profile_table_level` and `profile_column_level`. `include_tables` and `include_views` are a good example.
+
+#### Content
+
+- All configs should have a description.
+- When using inheritance or mixin classes, make sure that the fields and documentation are applicable in the base class. The `bigquery_temp_table_schema` field definitely shouldn’t be showing up in every single source’s profiling config!
+- Set reasonable defaults!
+  - The configs should not contain a default that you’d reasonably expect to be built in. As a **bad** example, the Postgres source’s `schema_pattern` has a default deny pattern containing `information_schema`. This means that if the user overrides the schema_pattern, they’ll need to manually add the information_schema to their deny patterns. This is bad, and the filtering should’ve been handled automatically by the source’s implementation, not added at runtime by its config.
+
+#### Coding
+
+- Use a single pydantic validator per thing to validate - we shouldn’t have validation methods that are 50 lines long.
+- Use `SecretStr` for passwords, auth tokens, etc.
+- When doing simple field renames, use the `pydantic_renamed_field` helper.
+- When doing field deprecations, use the `pydantic_removed_field` helper.
+- Validator methods must only throw ValueError, TypeError, or AssertionError. Do not throw ConfigurationError from validators.
+- Set `hidden_from_docs` for internal-only config flags. However, needing this often indicates a larger problem with the code structure. The hidden field should probably be a class attribute or an instance variable on the corresponding source.
+
+## Testing
+
+```shell
+# Follow standard install from source procedure - see above.
+
+# Install, including all dev requirements.
+pip install -e '.[dev]'
+
+# For running integration tests, you can use
+pip install -e '.[integration-tests]'
+
+# Run the full testing suite
+pytest -vv
+
+# Run unit tests.
+pytest -m 'not integration and not slow_integration'
+
+# Run Docker-based integration tests.
+pytest -m 'integration'
+
+# Run Docker-based slow integration tests.
+pytest -m 'slow_integration'
+
+# You can also run these steps via the gradle build:
+../gradlew :metadata-ingestion:lint
+../gradlew :metadata-ingestion:lintFix
+../gradlew :metadata-ingestion:testQuick
+../gradlew :metadata-ingestion:testFull
+../gradlew :metadata-ingestion:check
+# Run all tests in a single file
+../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_airflow.py
+# Run all tests under tests/unit
+../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit
+```
+
+### Updating golden test files
+
+If you made some changes that require generating new "golden" data files for use in testing a specific ingestion source, you can run the following to re-generate them:
+
+```shell
+pytest tests/integration/<source>/<source>.py --update-golden-files
+```
+
+For example,
+
+```shell
+pytest tests/integration/dbt/test_dbt.py --update-golden-files
+```
diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md
new file mode 100644
index 0000000000000..cdc2ce9625568
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md
@@ -0,0 +1,313 @@
+---
+title: Adding Stateful Ingestion to a Source
+slug: /metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source
+custom_edit_url: >-
+  https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md
+---
+
+# Adding Stateful Ingestion to a Source
+
+Currently, DataHub supports the [Stale Metadata Removal](./stateful.md#stale-entity-removal) and
+the [Redundant Run Elimination](./stateful.md#redundant-run-elimination) use-cases on top of the more generic stateful ingestion
+capability available for the sources. This document describes how to add support for these two use-cases to new sources.
+
+## Adding Stale Metadata Removal to a Source
+
+Adding the stale metadata removal use-case to a new source involves
+
+1. Defining the new checkpoint state that stores the list of entities emitted from a specific ingestion run.
+2. Modifying the `SourceConfig` associated with the source to use a custom `stateful_ingestion` config param.
+3. Modifying the `SourceReport` associated with the source to include soft-deleted entities in the report.
+4. Modifying the `Source` to
+   1. Instantiate the StaleEntityRemovalHandler object
+   2. Add entities from the current run to the state object
+   3. Emit stale metadata removal workunits
+
+The [datahub.ingestion.source.state.stale_entity_removal_handler](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/stale_entity_removal_handler.py) module provides the supporting infrastructure for all the steps described
+above and substantially simplifies the implementation on the source side. Below is a detailed explanation of each of these
+steps along with examples.
+
+### 1. Defining the checkpoint state for the source.
+
+The checkpoint state class is responsible for tracking the entities emitted from each ingestion run. If none of the existing states meet the needs of the new source, a new checkpoint state must be created. The state must
+inherit from the `StaleEntityCheckpointStateBase` abstract class shown below, and implement each of the abstract methods.
+ +```python +class StaleEntityCheckpointStateBase(CheckpointStateBase, ABC, Generic[Derived]): + """ + Defines the abstract interface for the checkpoint states that are used for stale entity removal. + Examples include sql_common state for tracking table and & view urns, + dbt that tracks node & assertion urns, kafka state tracking topic urns. + """ + + @classmethod + @abstractmethod + def get_supported_types(cls) -> List[str]: + pass + + @abstractmethod + def add_checkpoint_urn(self, type: str, urn: str) -> None: + """ + Adds an urn into the list used for tracking the type. + :param type: The type of the urn such as a 'table', 'view', + 'node', 'topic', 'assertion' that the concrete sub-class understands. + :param urn: The urn string + :return: None. + """ + pass + + @abstractmethod + def get_urns_not_in( + self, type: str, other_checkpoint_state: Derived + ) -> Iterable[str]: + """ + Gets the urns present in this checkpoint but not the other_checkpoint for the given type. + :param type: The type of the urn such as a 'table', 'view', + 'node', 'topic', 'assertion' that the concrete sub-class understands. + :param other_checkpoint_state: the checkpoint state to compute the urn set difference against. + :return: an iterable to the set of urns present in this checkpoing state but not in the other_checkpoint. + """ + pass +``` + +Examples: + +1. [KafkaCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/kafka_state.py#L11). +2. [DbtCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/dbt_state.py#L16) +3. [BaseSQLAlchemyCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/sql_common_state.py#L17) + +### 2. Modifying the SourceConfig + +The source's config must inherit from `StatefulIngestionConfigBase`, and should declare a field named `stateful_ingestion` of type `Optional[StatefulStaleMetadataRemovalConfig]`. + +Examples: + +1. The `KafkaSourceConfig` + +```python +from typing import List, Optional +import pydantic +from datahub.ingestion.source.state.stale_entity_removal_handler import StatefulStaleMetadataRemovalConfig +from datahub.ingestion.source.state.stateful_ingestion_base import ( + StatefulIngestionConfigBase, +) + +class KafkaSourceConfig(StatefulIngestionConfigBase): + # ...... + + stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None +``` + +2. The [DBTStatefulIngestionConfig](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L131) + and the [DBTConfig](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L317). + +### 3. Modifying the SourceReport + +The report class of the source should inherit from `StaleEntityRemovalSourceReport` whose definition is shown below. + +```python +from typing import List +from dataclasses import dataclass, field +from datahub.ingestion.source.state.stateful_ingestion_base import StatefulIngestionReport +@dataclass +class StaleEntityRemovalSourceReport(StatefulIngestionReport): + soft_deleted_stale_entities: List[str] = field(default_factory=list) + + def report_stale_entity_soft_deleted(self, urn: str) -> None: + self.soft_deleted_stale_entities.append(urn) +``` + +Examples: + +1. 
The `KafkaSourceReport`
+
+```python
+from dataclasses import dataclass
+from datahub.ingestion.source.state.stale_entity_removal_handler import StaleEntityRemovalSourceReport
+
+@dataclass
+class KafkaSourceReport(StaleEntityRemovalSourceReport):
+    # <rest of the report class>
+    ...
+```
+
+### 4. Modifying the Source
+
+The source must inherit from `StatefulIngestionSourceBase`.
+
+#### 4.1 Instantiate StaleEntityRemovalHandler in the `__init__` method of the source.
+
+Example (from the Kafka source's `__init__` method):
+
+```python
+# Create and register the stateful ingestion stale entity removal handler.
+self.stale_entity_removal_handler = StaleEntityRemovalHandler(
+    source=self,
+    config=self.source_config,
+    state_type_class=KafkaCheckpointState,
+    pipeline_name=self.ctx.pipeline_name,
+    run_id=self.ctx.run_id,
+)
+```
+
+#### 4.2 Adding entities from the current run to the state object.
+
+Use the `add_entity_to_state` method of the `StaleEntityRemovalHandler`.
+
+Examples:
+
+```python
+# Kafka
+self.stale_entity_removal_handler.add_entity_to_state(
+    type="topic",
+    urn=topic_urn,
+)
+
+# DBT
+self.stale_entity_removal_handler.add_entity_to_state(
+    type="dataset",
+    urn=node_datahub_urn,
+)
+self.stale_entity_removal_handler.add_entity_to_state(
+    type="assertion",
+    urn=node_datahub_urn,
+)
+```
+
+#### 4.3 Emitting soft-delete workunits associated with the stale entities.
+
+```python
+def get_workunits(self) -> Iterable[MetadataWorkUnit]:
+    #
+    # Emit the rest of the workunits for the source.
+    # NOTE: Populating the current state happens during the execution of this code.
+    # ...
+
+    # Clean up stale entities at the end
+    yield from self.stale_entity_removal_handler.gen_removed_entity_workunits()
+```
+
+## Adding Redundant Run Elimination to a Source
+
+This use-case applies to sources that drive ingestion by querying logs over a specified duration via the config (such
+as snowflake usage, bigquery usage etc.). It typically involves expensive and long-running queries. To add redundant
+run elimination to a new source to prevent expensive reruns for the same time range (potentially due to a user
+error or a scheduler malfunction), the following steps
+are required.
+
+1. Update the `SourceConfig`
+2. Update the `SourceReport`
+3. Modify the `Source` to
+   1. Instantiate the RedundantRunSkipHandler object.
+   2. Check if the current run should be skipped.
+   3. Update the state for the current run (start & end times).
+
+The [datahub.ingestion.source.state.redundant_run_skip_handler](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/redundant_run_skip_handler.py)
+module provides the supporting infrastructure required for all the steps described above.
+
+NOTE: The handler currently uses a simple state,
+the [BaseUsageCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/usage_common_state.py),
+across all sources it supports (unlike the StaleEntityRemovalHandler).
+
+### 1. Modifying the SourceConfig
+
+The `SourceConfig` must inherit from the [StatefulRedundantRunSkipConfig](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/redundant_run_skip_handler.py#L23) class.
+
+Examples:
+
+1. Snowflake Usage
+
+```python
+from datahub.ingestion.source.state.redundant_run_skip_handler import (
+    StatefulRedundantRunSkipConfig,
+)
+class SnowflakeStatefulIngestionConfig(StatefulRedundantRunSkipConfig):
+    pass
+```
+
+### 2. Modifying the SourceReport
+
+The `SourceReport` must inherit from the [StatefulIngestionReport](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/stateful_ingestion_base.py#L102) class.
+Examples:
+
+1. 
Snowflake Usage + +```python +@dataclass +class SnowflakeUsageReport(BaseSnowflakeReport, StatefulIngestionReport): + # +``` + +### 3. Modifying the Source + +The source must inherit from `StatefulIngestionSourceBase`. + +#### 3.1 Instantiate RedundantRunSkipHandler in the `__init__` method of the source. + +The source should instantiate an instance of the `RedundantRunSkipHandler` in its `__init__` method. +Examples: +Snowflake Usage + +```python +from datahub.ingestion.source.state.redundant_run_skip_handler import ( + RedundantRunSkipHandler, +) +class SnowflakeUsageSource(StatefulIngestionSourceBase): + + def __init__(self, config: SnowflakeUsageConfig, ctx: PipelineContext): + super(SnowflakeUsageSource, self).__init__(config, ctx) + self.config: SnowflakeUsageConfig = config + self.report: SnowflakeUsageReport = SnowflakeUsageReport() + # Create and register the stateful ingestion use-case handlers. + self.redundant_run_skip_handler = RedundantRunSkipHandler( + source=self, + config=self.config, + pipeline_name=self.ctx.pipeline_name, + run_id=self.ctx.run_id, + ) +``` + +#### 3.2 Checking if the current run should be skipped. + +The sources can query if the current run should be skipped using `should_skip_this_run` method of `RedundantRunSkipHandler`. This should done from the `get_workunits` method, before doing any other work. + +Example code: + +```python +def get_workunits(self) -> Iterable[MetadataWorkUnit]: + # Skip a redundant run + if self.redundant_run_skip_handler.should_skip_this_run( + cur_start_time_millis=datetime_to_ts_millis(self.config.start_time) + ): + return + # Generate the workunits. +``` + +#### 3.3 Updating the state for the current run. + +The source should use the `update_state` method of `RedundantRunSkipHandler` to update the current run's state if the run has not been skipped. This step can be performed in the `get_workunits` if the run has not been skipped. + +Example code: + +```python + def get_workunits(self) -> Iterable[MetadataWorkUnit]: + # Skip a redundant run + if self.redundant_run_skip_handler.should_skip_this_run( + cur_start_time_millis=datetime_to_ts_millis(self.config.start_time) + ): + return + + # Generate the workunits. + # + # Update checkpoint state for this run. + self.redundant_run_skip_handler.update_state( + start_time_millis=datetime_to_ts_millis(self.config.start_time), + end_time_millis=datetime_to_ts_millis(self.config.end_time), + ) +``` diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/classification.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/classification.md new file mode 100644 index 0000000000000..a4bd0addf114d --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/classification.md @@ -0,0 +1,445 @@ +--- +title: Classification +slug: /metadata-ingestion/docs/dev_guides/classification +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/classification.md +--- + +# Classification + +The classification feature enables sources to be configured to automatically predict info types for columns and use them as glossary terms. This is an explicit opt-in feature and is not enabled by default. + +## Config details + +Note that a `.` is used to denote nested fields in the YAML recipe. 
+ +| Field | Required | Type | Description | Default | +| ------------------------- | -------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | +| enabled | | boolean | Whether classification should be used to auto-detect glossary terms | False | +| sample_size | | int | Number of sample values used for classification. | 100 | +| info_type_to_term | | Dict[str,string] | Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. | +| classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. | [{'type': 'datahub', 'config': None}] | +| table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*' | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} | +| table_pattern.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] | +| table_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] | +| table_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True | +| column_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} | +| column_pattern.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] | +| column_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] | +| column_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True | + +## DataHub Classifier + +DataHub Classifier is the default classifier implementation, which uses [acryl-datahub-classify](https://pypi.org/project/acryl-datahub-classify/) library to predict info types. + +### Config Details + +| Field | Required | Type | Description | Default | +| ------------------------------------------------------ | ------------------------------------------------------ | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| confidence_level_threshold | | number | | 0.68 | +| info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered, along with any custom infotypes configured in `info_types_config`. 
| None | +| info_types_config | Configuration details for infotypes | Dict[str, InfoTypeConfig] | | See [reference_input.py](https://github.com/acryldata/datahub-classify/blob/main/datahub-classify/src/datahub_classify/reference_input.py) for default configuration. | +| info_types_config.`key`.prediction_factors_and_weights | ❓ (required if info_types_config.`key` is set) | Dict[str,number] | Factors and their weights to consider when predicting info types | | +| info_types_config.`key`.name | | NameFactorConfig (see below for fields) | | | +| info_types_config.`key`.name.regex | | Array of string | List of regex patterns the column name follows for the info type | ['.*'] | +| info_types_config.`key`.description | | DescriptionFactorConfig (see below for fields) | | | +| info_types_config.`key`.description.regex | | Array of string | List of regex patterns the column description follows for the info type | ['.*'] | +| info_types_config.`key`.datatype | | DataTypeFactorConfig (see below for fields) | | | +| info_types_config.`key`.datatype.type | | Array of string | List of data types for the info type | ['.*'] | +| info_types_config.`key`.values | | ValuesFactorConfig (see below for fields) | | | +| info_types_config.`key`.values.prediction_type | ❓ (required if info_types_config.`key`.values is set) | string | | None | +| info_types_config.`key`.values.regex | | Array of string | List of regex patterns the column value follows for the info type | None | +| info_types_config.`key`.values.library | | Array of string | Library used for prediction | None | +| minimum_values_threshold | | number | Minimum number of non-null column values required to process `values` prediction factor. | 50 | +| | + +### Supported infotypes + +- `Email_Address` +- `Gender` +- `Credit_Debit_Card_Number` +- `Phone_Number` +- `Street_Address` +- `Full_Name` +- `Age` +- `IBAN` +- `US_Social_Security_Number` +- `Vehicle_Identification_Number` +- `IP_Address_v4` +- `IP_Address_v6` +- `US_Driving_License_Number` +- `Swift_Code` + +### Supported sources + +- snowflake + +#### Example + +```yml +source: + type: snowflake + config: + env: PROD + # Coordinates + account_id: account_name + warehouse: "COMPUTE_WH" + + # Credentials + username: user + password: pass + role: "sysadmin" + + # Options + top_n_queries: 10 + email_domain: mycompany.com + + classification: + enabled: True + classifiers: + - type: datahub +``` + +#### Example with Advanced Configuration: Customizing configuration for supported info types + +```yml +source: + type: snowflake + config: + env: PROD + # Coordinates + account_id: account_name + warehouse: "COMPUTE_WH" + + # Credentials + username: user + password: pass + role: "sysadmin" + + # Options + top_n_queries: 10 + email_domain: mycompany.com + + classification: + enabled: True + info_type_to_term: + Email_Address: "Email" + classifiers: + - type: datahub + config: + confidence_level_threshold: 0.7 + info_types_config: + Email_Address: + prediction_factors_and_weights: + name: 0.4 + description: 0 + datatype: 0 + values: 0.6 + name: + regex: + - "^.*mail.*id.*$" + - "^.*id.*mail.*$" + - "^.*mail.*add.*$" + - "^.*add.*mail.*$" + - email + - mail + description: + regex: + - "^.*mail.*id.*$" + - "^.*mail.*add.*$" + - email + - mail + datatype: + type: + - str + values: + prediction_type: regex + regex: + - "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}" + library: [] + Gender: + prediction_factors_and_weights: + name: 0.4 + description: 0 + datatype: 0 + values: 0.6 + name: + regex: + - 
"^.*gender.*$" + - "^.*sex.*$" + - gender + - sex + description: + regex: + - "^.*gender.*$" + - "^.*sex.*$" + - gender + - sex + datatype: + type: + - int + - str + values: + prediction_type: regex + regex: + - male + - female + - man + - woman + - m + - f + - w + - men + - women + library: [] + Credit_Debit_Card_Number: + prediction_factors_and_weights: + name: 0.4 + description: 0 + datatype: 0 + values: 0.6 + name: + regex: + - "^.*card.*number.*$" + - "^.*number.*card.*$" + - "^.*credit.*card.*$" + - "^.*debit.*card.*$" + description: + regex: + - "^.*card.*number.*$" + - "^.*number.*card.*$" + - "^.*credit.*card.*$" + - "^.*debit.*card.*$" + datatype: + type: + - str + - int + values: + prediction_type: regex + regex: + - "^4[0-9]{12}(?:[0-9]{3})?$" + - "^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$" + - "^3[47][0-9]{13}$" + - "^3(?:0[0-5]|[68][0-9])[0-9]{11}$" + - "^6(?:011|5[0-9]{2})[0-9]{12}$" + - "^(?:2131|1800|35\\d{3})\\d{11}$" + - "^(6541|6556)[0-9]{12}$" + - "^389[0-9]{11}$" + - "^63[7-9][0-9]{13}$" + - "^9[0-9]{15}$" + - "^(6304|6706|6709|6771)[0-9]{12,15}$" + - "^(5018|5020|5038|6304|6759|6761|6763)[0-9]{8,15}$" + - "^(62[0-9]{14,17})$" + - "^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$" + - "^(4903|4905|4911|4936|6333|6759)[0-9]{12}|(4903|4905|4911|4936|6333|6759)[0-9]{14}|(4903|4905|4911|4936|6333|6759)[0-9]{15}|564182[0-9]{10}|564182[0-9]{12}|564182[0-9]{13}|633110[0-9]{10}|633110[0-9]{12}|633110[0-9]{13}$" + - "^(6334|6767)[0-9]{12}|(6334|6767)[0-9]{14}|(6334|6767)[0-9]{15}$" + library: [] + Phone_Number: + prediction_factors_and_weights: + name: 0.4 + description: 0 + datatype: 0 + values: 0.6 + name: + regex: + - ".*phone.*(num|no).*" + - ".*(num|no).*phone.*" + - ".*[^a-z]+ph[^a-z]+.*(num|no).*" + - ".*(num|no).*[^a-z]+ph[^a-z]+.*" + - ".*mobile.*(num|no).*" + - ".*(num|no).*mobile.*" + - ".*telephone.*(num|no).*" + - ".*(num|no).*telephone.*" + - ".*cell.*(num|no).*" + - ".*(num|no).*cell.*" + - ".*contact.*(num|no).*" + - ".*(num|no).*contact.*" + - ".*landline.*(num|no).*" + - ".*(num|no).*landline.*" + - ".*fax.*(num|no).*" + - ".*(num|no).*fax.*" + - phone + - telephone + - landline + - mobile + - tel + - fax + - cell + - contact + description: + regex: + - ".*phone.*(num|no).*" + - ".*(num|no).*phone.*" + - ".*[^a-z]+ph[^a-z]+.*(num|no).*" + - ".*(num|no).*[^a-z]+ph[^a-z]+.*" + - ".*mobile.*(num|no).*" + - ".*(num|no).*mobile.*" + - ".*telephone.*(num|no).*" + - ".*(num|no).*telephone.*" + - ".*cell.*(num|no).*" + - ".*(num|no).*cell.*" + - ".*contact.*(num|no).*" + - ".*(num|no).*contact.*" + - ".*landline.*(num|no).*" + - ".*(num|no).*landline.*" + - ".*fax.*(num|no).*" + - ".*(num|no).*fax.*" + - phone + - telephone + - landline + - mobile + - tel + - fax + - cell + - contact + datatype: + type: + - int + - str + values: + prediction_type: library + regex: [] + library: + - phonenumbers + Street_Address: + prediction_factors_and_weights: + name: 0.5 + description: 0 + datatype: 0 + values: 0.5 + name: + regex: + - ".*street.*add.*" + - ".*add.*street.*" + - ".*full.*add.*" + - ".*add.*full.*" + - ".*mail.*add.*" + - ".*add.*mail.*" + - add[^a-z]+ + - address + - street + description: + regex: + - ".*street.*add.*" + - ".*add.*street.*" + - ".*full.*add.*" + - ".*add.*full.*" + - ".*mail.*add.*" + - ".*add.*mail.*" + - add[^a-z]+ + - address + - street + datatype: + type: + - str + values: + prediction_type: library + regex: [] + library: + - spacy + Full_name: + prediction_factors_and_weights: + name: 0.3 + description: 
0 + datatype: 0 + values: 0.7 + name: + regex: + - ".*person.*name.*" + - ".*name.*person.*" + - ".*user.*name.*" + - ".*name.*user.*" + - ".*full.*name.*" + - ".*name.*full.*" + - fullname + - name + - person + - user + description: + regex: + - ".*person.*name.*" + - ".*name.*person.*" + - ".*user.*name.*" + - ".*name.*user.*" + - ".*full.*name.*" + - ".*name.*full.*" + - fullname + - name + - person + - user + datatype: + type: + - str + values: + prediction_type: library + regex: [] + library: + - spacy + Age: + prediction_factors_and_weights: + name: 0.65 + description: 0 + datatype: 0 + values: 0.35 + name: + regex: + - age[^a-z]+.* + - ".*[^a-z]+age" + - ".*[^a-z]+age[^a-z]+.*" + - age + description: + regex: + - age[^a-z]+.* + - ".*[^a-z]+age" + - ".*[^a-z]+age[^a-z]+.*" + - age + datatype: + type: + - int + values: + prediction_type: library + regex: [] + library: + - rule_based_logic +``` + +#### Example with Advanced Configuration: Specifying custom info type + +```yml +source: + type: snowflake + config: + env: PROD + # Coordinates + account_id: account_name + warehouse: "COMPUTE_WH" + + # Credentials + username: user + password: pass + role: "sysadmin" + + # Options + top_n_queries: 10 + email_domain: mycompany.com + + classification: + enabled: True + classifiers: + - type: datahub + config: + confidence_level_threshold: 0.7 + minimum_values_threshold: 10 + info_types_config: + CloudRegion: + prediction_factors_and_weights: + name: 0 + description: 0 + datatype: 0 + values: 1 + values: + prediction_type: regex + regex: + - "(af|ap|ca|eu|me|sa|us)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)-\\d+" + library: [] +``` diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/reporting_telemetry.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/reporting_telemetry.md new file mode 100644 index 0000000000000..316979b208db0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/reporting_telemetry.md @@ -0,0 +1,113 @@ +--- +title: Datahub's Reporting Framework for Ingestion Job Telemetry +slug: /metadata-ingestion/docs/dev_guides/reporting_telemetry +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/reporting_telemetry.md +--- + +# Datahub's Reporting Framework for Ingestion Job Telemetry + +The Datahub's reporting framework allows for configuring reporting providers with the ingestion pipelines to send +telemetry about the ingestion job runs to external systems for monitoring purposes. It is powered by the Datahub's +stateful ingestion framework. The `datahub` reporting provider comes with the standard client installation, +and allows for reporting ingestion job telemetry to the datahub backend as the destination. + +**_NOTE_**: This feature requires the server to be `statefulIngestion` capable. +This is a feature of metadata service with version >= `0.8.20`. + +To check if you are running a stateful ingestion capable server: + +```console +curl http:///config + +{ +models: { }, +statefulIngestionCapable: true, # <-- this should be present and true +retention: "true", +noCode: "true" +} +``` + +## Config details + +The ingestion reporting providers are a list of reporting provider configurations under the `reporting` config +param of the pipeline, each reporting provider configuration begin a type and config pair object. 
The telemetry data will +be sent to all the reporting providers in this list. + +Note that a `.` is used to denote nested fields, and `[idx]` is used to denote an element of an array of objects in the YAML recipe. + +| Field | Required | Default | Description | +| ----------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `reporting[idx].type` | ✅ | `datahub` | The type of the ingestion reporting provider registered with datahub. | +| `reporting[idx].config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19) here. | The configuration required for initializing the datahub reporting provider. | +| `pipeline_name` | ✅ | | The name of the ingestion pipeline. This is used as a part of the identifying key for the telemetry data reported by each job in the ingestion pipeline. | + +#### Supported sources + +- All sql based sources. +- snowflake_usage. + +#### Sample configuration + +```yaml +source: + type: "snowflake" + config: + username: + password: + role: + host_port: + warehouse: + # Rest of the source specific params ... +# This is mandatory. Changing it will cause old telemetry correlation to be lost. +pipeline_name: "my_snowflake_pipeline_1" + +# Pipeline-level datahub_api configuration. +datahub_api: # Optional. But if provided, this config will be used by the "datahub" ingestion state provider. + server: "http://localhost:8080" + +sink: + type: "datahub-rest" + config: + server: "http://localhost:8080" + +reporting: + - type: "datahub" # Required + config: # Optional. + datahub_api: # default value + server: "http://localhost:8080" +``` + +## Reporting Ingestion State Provider (Developer Guide) + +An ingestion reporting state provider is responsible for saving and retrieving the ingestion telemetry +associated with the ingestion runs of various jobs inside the source connector of the ingestion pipeline. +The data model used for capturing the telemetry is [DatahubIngestionRunSummary](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/datajob/datahub/DatahubIngestionRunSummary.pdl). +A reporting ingestion state provider needs to implement the [IngestionReportingProviderBase](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/api/ingestion_job_reporting_provider_base.py) +interface and register itself with datahub by adding an entry under `datahub.ingestion.reporting_provider.plugins` +key of the entry_points section in [setup.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py) +with its type and implementation class as shown below. 
+ +```python +entry_points = { + # " + "datahub.ingestion.reporting_provider.plugins": [ + "datahub = datahub.ingestion.reporting.datahub_ingestion_run_summary_provider:DatahubIngestionRunSummaryProvider", + "file = datahub.ingestion.reporting.file_reporter:FileReporter", + ], +} +``` + +### Datahub Reporting Ingestion State Provider + +This is the reporting state provider implementation that is available out of the box in datahub. Its type is `datahub` and it is implemented on top +of the `datahub_api` client and the timeseries aspect capabilities of the datahub-backend. + +#### Config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- | +| `type` | ✅ | `datahub` | The type of the ingestion reporting provider registered with datahub. | +| `config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19) here. | The configuration required for initializing the datahub reporting provider. | diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/sql_profiles.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/sql_profiles.md new file mode 100644 index 0000000000000..7c154ab6a9cbd --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/sql_profiles.md @@ -0,0 +1,40 @@ +--- +title: SQL Profiling +slug: /metadata-ingestion/docs/dev_guides/sql_profiles +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/sql_profiles.md +--- + +# SQL Profiling + +SQL Profiling collects table level and column level statistics. +The SQL-based profiler does not run alone, but rather can be enabled for other SQL-based sources. +Enabling profiling will slow down ingestion runs. + +:::caution + +Running profiling against many tables or over many rows can run up significant costs. +While we've done our best to limit the expensiveness of the queries the profiler runs, you +should be prudent about the set of tables profiling is enabled on or the frequency +of the profiling runs. + +::: + +## Capabilities + +Extracts: + +- Row and column counts for each table +- For each column, if applicable: + - null counts and proportions + - distinct counts and proportions + - minimum, maximum, mean, median, standard deviation, some quantile values + - histograms or frequencies of unique values + +## Supported Sources + +SQL profiling is supported for all SQL sources. Check the individual source page to verify if it supports profiling. + +## Questions + +If you've got any questions on configuring profiling, feel free to ping us on [our Slack](https://slack.datahubproject.io/)! 
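+
+## Sample Configuration
+
+As a quick reference, profiling is enabled under the source's `config` block of an ingestion recipe. The sketch below assumes a Postgres source and shows a couple of commonly used options (`profiling.enabled`, `profiling.profile_table_level_only`, and a `profile_pattern` allow-list); the exact set of options varies by source and version, so treat the field names as indicative and confirm them on the individual source page.
+
+```yaml
+source:
+  type: postgres
+  config:
+    host_port: localhost:5432
+    database: mydb
+    username: user
+    password: pass
+    # Turn on the SQL profiler for this source
+    profiling:
+      enabled: true
+      profile_table_level_only: false # also collect column-level statistics
+    # Keep costs down by only profiling a subset of tables
+    profile_pattern:
+      allow:
+        - "mydb.public.*"
+
+sink:
+  type: datahub-rest
+  config:
+    server: "http://localhost:8080"
+```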
diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/stateful.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/stateful.md new file mode 100644 index 0000000000000..98caf2d96b8f0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/dev_guides/stateful.md @@ -0,0 +1,188 @@ +--- +title: Stateful Ingestion +slug: /metadata-ingestion/docs/dev_guides/stateful +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/stateful.md +--- + +# Stateful Ingestion + +The stateful ingestion feature enables sources to be configured to save custom checkpoint states from their +runs, and query these states back from subsequent runs to make decisions about the current run based on the state saved +from the previous run(s) using a supported ingestion state provider. This is an explicit opt-in feature and is not enabled +by default. + +**_NOTE_**: This feature requires the server to be `statefulIngestion` capable. This is a feature of metadata service with version >= `0.8.20`. + +To check if you are running a stateful ingestion capable server: + +```console +curl http:///config + +{ +models: { }, +statefulIngestionCapable: true, # <-- this should be present and true +retention: "true", +noCode: "true" +} +``` + +## Config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| ------------------------------------------------------------ | -------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `source.config.stateful_ingestion.enabled` | | False | The type of the ingestion state provider registered with datahub. | +| `source.config.stateful_ingestion.ignore_old_state` | | False | If set to True, ignores the previous checkpoint state. | +| `source.config.stateful_ingestion.ignore_new_state` | | False | If set to True, ignores the current checkpoint state. | +| `source.config.stateful_ingestion.max_checkpoint_state_size` | | 2^24 (16MB) | The maximum size of the checkpoint state in bytes. | +| `source.config.stateful_ingestion.state_provider` | | The default [datahub ingestion state provider](#datahub-ingestion-state-provider) configuration. | The ingestion state provider configuration. | +| `pipeline_name` | ✅ | | The name of the ingestion pipeline the checkpoint states of various source connector job runs are saved/retrieved against via the ingestion state provider. | + +NOTE: If either `dry-run` or `preview` mode are set, stateful ingestion will be turned off regardless of the rest of the configuration. + +## Use-cases powered by stateful ingestion. + +Following is the list of current use-cases powered by stateful ingestion in datahub. + +### Stale Entity Removal + +Stateful ingestion can be used to automatically soft-delete the tables and views that are seen in a previous run +but absent in the current run (they are either deleted or no longer desired). + +
+ +#### Supported sources + +- All sql based sources. + +#### Additional config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| ------------------------------------------ | -------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| `stateful_ingestion.remove_stale_metadata` | | True | Soft-deletes the tables and views that were found in the last successful run but missing in the current run with stateful_ingestion enabled. | + +#### Sample configuration + +```yaml +source: + type: "snowflake" + config: + username: + password: + host_port: + warehouse: + role: + include_tables: True + include_views: True + # Rest of the source specific params ... + ## Stateful Ingestion config ## + stateful_ingestion: + enabled: True # False by default + remove_stale_metadata: True # default value + ## Default state_provider configuration ## + # state_provider: + # type: "datahub" # default value + # This section is needed if the pipeline-level `datahub_api` is not configured. + # config: # default value + # datahub_api: + # server: "http://localhost:8080" + +# The pipeline_name is mandatory for stateful ingestion and the state is tied to this. +# If this is changed after using with stateful ingestion, the previous state will not be available to the next run. +pipeline_name: "my_snowflake_pipeline_1" + +# Pipeline-level datahub_api configuration. +datahub_api: # Optional. But if provided, this config will be used by the "datahub" ingestion state provider. + server: "http://localhost:8080" + +sink: + type: "datahub-rest" + config: + server: "http://localhost:8080" +``` + +### Redundant Run Elimination + +Typically, the usage runs are configured to fetch the usage data for the previous day(or hour) for each run. Once a usage +run has finished, subsequent runs until the following day would be fetching the same usage data. With stateful ingestion, +the redundant fetches can be avoided even if the ingestion job is scheduled to run more frequently than the granularity of +usage ingestion. + +#### Supported sources + +- Snowflake Usage source. + +#### Additional config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| -------------------------------- | -------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------- | +| `stateful_ingestion.force_rerun` | | False | Custom-alias for `stateful_ingestion.ignore_old_state`. Prevents a rerun for the same time window if there was a previous successful run. | + +#### Sample Configuration + +```yaml +source: + type: "snowflake-usage-legacy" + config: + username: + password: + role: + host_port: + warehouse: + # Rest of the source specific params ... + ## Stateful Ingestion config ## + stateful_ingestion: + enabled: True # default is false + force_rerun: False # Specific to this source(alias for ignore_old_state), used to override default behavior if True. + +# The pipeline_name is mandatory for stateful ingestion and the state is tied to this. +# If this is changed after using with stateful ingestion, the previous state will not be available to the next run. 
+pipeline_name: "my_snowflake_usage_ingestion_pipeline_1" +sink: + type: "datahub-rest" + config: + server: "http://localhost:8080" +``` + +## Adding Stateful Ingestion Capability to New Sources (Developer Guide) + +See [this documentation](./add_stateful_ingestion_to_source.md) for more details on how to add stateful ingestion +capability to new sources for the use-cases supported by datahub. + +## The Checkpointing Ingestion State Provider (Developer Guide) + +The ingestion checkpointing state provider is responsible for saving and retrieving the ingestion checkpoint state associated with the ingestion runs +of various jobs inside the source connector of the ingestion pipeline. The checkpointing data model is [DatahubIngestionCheckpoint](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/datajob/datahub/DatahubIngestionCheckpoint.pdl) and it supports any custom state to be stored using the [IngestionCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/datajob/datahub/IngestionCheckpointState.pdl#L9). A checkpointing ingestion state provider needs to implement the +[IngestionCheckpointingProviderBase](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/api/ingestion_job_checkpointing_provider_base.py) interface and +register itself with datahub by adding an entry under `datahub.ingestion.checkpointing_provider.plugins` key of the entry_points section in [setup.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py) with its type and implementation class as shown below. + +```python +entry_points = { + # " + "datahub.ingestion.checkpointing_provider.plugins": [ + "datahub = datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:DatahubIngestionCheckpointingProvider", + ], +} +``` + +### Datahub Checkpointing Ingestion State Provider + +This is the state provider implementation that is available out of the box. Its type is `datahub` and it is implemented on top +of the `datahub_api` client and the timeseries aspect capabilities of the datahub-backend. + +#### Config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| ----------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- | +| `state_provider.type` | | `datahub` | The type of the ingestion state provider registered with datahub | +| `state_provider.config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19) here. | The configuration required for initializing the state provider. 
| diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/transformer/dataset_transformer.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/transformer/dataset_transformer.md new file mode 100644 index 0000000000000..0d455e84085d9 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/transformer/dataset_transformer.md @@ -0,0 +1,1316 @@ +--- +title: Dataset +sidebar_label: Dataset +slug: /metadata-ingestion/docs/transformer/dataset_transformer +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/dataset_transformer.md +--- + +# Dataset Transformers + +The below table shows transformer which can transform aspects of entity [Dataset](../../../docs/generated/metamodel/entities/dataset.md). + +| Dataset Aspect | Transformer | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `status` | - [Mark Dataset status](#mark-dataset-status) | +| `ownership` | - [Simple Add Dataset ownership](#simple-add-dataset-ownership)
- [Pattern Add Dataset ownership](#pattern-add-dataset-ownership)
- [Simple Remove Dataset Ownership](#simple-remove-dataset-ownership) | +| `globalTags` | - [Simple Add Dataset globalTags](#simple-add-dataset-globaltags)
- [Pattern Add Dataset globalTags](#pattern-add-dataset-globaltags)
- [Add Dataset globalTags](#add-dataset-globaltags) | +| `browsePaths` | - [Set Dataset browsePath](#set-dataset-browsepath) | +| `glossaryTerms` | - [Simple Add Dataset glossaryTerms](#simple-add-dataset-glossaryterms)
- [Pattern Add Dataset glossaryTerms](#pattern-add-dataset-glossaryterms) | +| `schemaMetadata` | - [Pattern Add Dataset Schema Field glossaryTerms](#pattern-add-dataset-schema-field-glossaryterms)
- [Pattern Add Dataset Schema Field globalTags](#pattern-add-dataset-schema-field-globaltags) | +| `datasetProperties` | - [Simple Add Dataset datasetProperties](#simple-add-dataset-datasetproperties)
- [Add Dataset datasetProperties](#add-dataset-datasetproperties) | +| `domains` | - [Simple Add Dataset domains](#simple-add-dataset-domains)
- [Pattern Add Dataset domains](#pattern-add-dataset-domains) | + +## Mark Dataset Status + +### Config Details + +| Field | Required | Type | Default | Description | +| --------- | -------- | ------- | ------- | ---------------------------------------------------- | +| `removed` | ✅ | boolean | | Flag to control visibility of the dataset in the UI. | + +If you would like to stop a dataset from appearing in the UI, you need to mark the status of the dataset as removed. + +You can use this transformer in your source recipe to mark the status as removed. + +```yaml +transformers: + - type: "mark_dataset_status" + config: + removed: true +``` + +## Simple Add Dataset ownership + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | ------------ | ----------- | ----------------------------------------------------------------- | +| `owner_urns` | ✅ | list[string] | | List of owner urns. | +| `ownership_type` | | string | `DATAOWNER` | ownership type of the owners. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +For transformer behaviour on `replace_existing` and `semantics`, please refer to the section [Relationship Between replace_existing And semantics](#relationship-between-replace_existing-and-semantics). + +
+Let’s suppose we’d like to append a series of users who we know to own a dataset but aren't detected during normal ingestion. To do so, we can use the `simple_add_dataset_ownership` transformer that’s included in the ingestion framework. + +The config, which we’d append to our ingestion recipe YAML, would look like this: + +Below configuration will add listed owner_urns in ownership aspect + +```yaml +transformers: + - type: "simple_add_dataset_ownership" + config: + owner_urns: + - "urn:li:corpuser:username1" + - "urn:li:corpuser:username2" + - "urn:li:corpGroup:groupname" + ownership_type: "PRODUCER" +``` + +`simple_add_dataset_ownership` can be configured in below different way + +- Add owners, however replace existing owners sent by ingestion source + ```yaml + transformers: + - type: "simple_add_dataset_ownership" + config: + replace_existing: true # false is default behaviour + owner_urns: + - "urn:li:corpuser:username1" + - "urn:li:corpuser:username2" + - "urn:li:corpGroup:groupname" + ownership_type: "PRODUCER" + ``` +- Add owners, however overwrite the owners available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "simple_add_dataset_ownership" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + owner_urns: + - "urn:li:corpuser:username1" + - "urn:li:corpuser:username2" + - "urn:li:corpGroup:groupname" + ownership_type: "PRODUCER" + ``` +- Add owners, however keep the owners available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "simple_add_dataset_ownership" + config: + semantics: PATCH + owner_urns: + - "urn:li:corpuser:username1" + - "urn:li:corpuser:username2" + - "urn:li:corpGroup:groupname" + ownership_type: "PRODUCER" + ``` + +## Pattern Add Dataset ownership + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | -------------------- | ----------- | --------------------------------------------------------------------------------------- | +| `owner_pattern` | ✅ | map[regx, list[urn]] | | entity urn with regular expression and list of owners urn apply to matching entity urn. | +| `ownership_type` | | string | `DATAOWNER` | ownership type of the owners. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +let’s suppose we’d like to append a series of users who we know to own a different dataset from a data source but aren't detected during normal ingestion. To do so, we can use the `pattern_add_dataset_ownership` module that’s included in the ingestion framework. This will match the pattern to `urn` of the dataset and assign the respective owners. 
+ +The config, which we’d append to our ingestion recipe YAML, would look like this: + +```yaml +transformers: + - type: "pattern_add_dataset_ownership" + config: + owner_pattern: + rules: + ".*example1.*": ["urn:li:corpuser:username1"] + ".*example2.*": ["urn:li:corpuser:username2"] + ownership_type: "DEVELOPER" +``` + +`pattern_add_dataset_ownership` can be configured in below different way + +- Add owner, however replace existing owner sent by ingestion source + ```yaml + transformers: + - type: "pattern_add_dataset_ownership" + config: + replace_existing: true # false is default behaviour + owner_pattern: + rules: + ".*example1.*": ["urn:li:corpuser:username1"] + ".*example2.*": ["urn:li:corpuser:username2"] + ownership_type: "PRODUCER" + ``` +- Add owner, however overwrite the owners available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_ownership" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + owner_pattern: + rules: + ".*example1.*": ["urn:li:corpuser:username1"] + ".*example2.*": ["urn:li:corpuser:username2"] + ownership_type: "PRODUCER" + ``` +- Add owner, however keep the owners available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_ownership" + config: + semantics: PATCH + owner_pattern: + rules: + ".*example1.*": ["urn:li:corpuser:username1"] + ".*example2.*": ["urn:li:corpuser:username2"] + ownership_type: "PRODUCER" + ``` + +## Simple Remove Dataset ownership + +If we wanted to clear existing owners sent by ingestion source we can use the `simple_remove_dataset_ownership` transformer which removes all owners sent by the ingestion source. + +```yaml +transformers: + - type: "simple_remove_dataset_ownership" + config: {} +``` + +The main use case of `simple_remove_dataset_ownership` is to remove incorrect owners present in the source. You can use it along with the [Simple Add Dataset ownership](#simple-add-dataset-ownership) to remove wrong owners and add the correct ones. + +Note that whatever owners you send via `simple_remove_dataset_ownership` will overwrite the owners present in the UI. + +## Extract Dataset globalTags + +### Config Details + +| Field | Required | Type | Default | Description | +| -------------------- | -------- | ------- | ----------- | ------------------------------------------------------------------- | +| `extract_tags_from` | ✅ | string | `urn` | Which field to extract tag from. Currently only `urn` is supported. | +| `extract_tags_regex` | ✅ | string | `.*` | Regex to use to extract tag. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +Let’s suppose we’d like to add a dataset tags based on part of urn. To do so, we can use the `extract_dataset_tags` transformer that’s included in the ingestion framework. + +The config, which we’d append to our ingestion recipe YAML, would look like this: + +```yaml +transformers: + - type: "extract_dataset_tags" + config: + extract_tags_from: "urn" + extract_tags_regex: ".([^._]*)_" +``` + +So if we have input URNs like + +- `urn:li:dataset:(urn:li:dataPlatform:kafka,clusterid.USA-ops-team_table1,PROD)` +- `urn:li:dataset:(urn:li:dataPlatform:kafka,clusterid.Canada-marketing_table1,PROD)` + +a tag called `USA-ops-team` and `Canada-marketing` will be added to them respectively. 
This is helpful in case you are using prefixes in your datasets to segregate different things. Now you can turn that segregation into a tag on your dataset in DataHub for further use. + +## Simple Add Dataset globalTags + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | ------------ | ----------- | ---------------------------------------------------------------- | +| `tag_urns` | ✅ | list[string] | | List of globalTags urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +Let’s suppose we’d like to add a set of dataset tags. To do so, we can use the `simple_add_dataset_tags` transformer that’s included in the ingestion framework. + +The config, which we’d append to our ingestion recipe YAML, would look like this: + +```yaml +transformers: + - type: "simple_add_dataset_tags" + config: + tag_urns: + - "urn:li:tag:NeedsDocumentation" + - "urn:li:tag:Legacy" +``` + +`simple_add_dataset_tags` can be configured in below different way + +- Add tags, however replace existing tags sent by ingestion source + ```yaml + transformers: + - type: "simple_add_dataset_tags" + config: + replace_existing: true # false is default behaviour + tag_urns: + - "urn:li:tag:NeedsDocumentation" + - "urn:li:tag:Legacy" + ``` +- Add tags, however overwrite the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "simple_add_dataset_tags" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + tag_urns: + - "urn:li:tag:NeedsDocumentation" + - "urn:li:tag:Legacy" + ``` +- Add tags, however keep the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "simple_add_dataset_tags" + config: + semantics: PATCH + tag_urns: + - "urn:li:tag:NeedsDocumentation" + - "urn:li:tag:Legacy" + ``` + +## Pattern Add Dataset globalTags + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | -------------------- | ----------- | ------------------------------------------------------------------------------------- | +| `tag_pattern` | ✅ | map[regx, list[urn]] | | Entity urn with regular expression and list of tags urn apply to matching entity urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +Let’s suppose we’d like to append a series of tags to specific datasets. To do so, we can use the `pattern_add_dataset_tags` module that’s included in the ingestion framework. This will match the regex pattern to `urn` of the dataset and assign the respective tags urns given in the array. 
+ +The config, which we’d append to our ingestion recipe YAML, would look like this: + +```yaml +transformers: + - type: "pattern_add_dataset_tags" + config: + tag_pattern: + rules: + ".*example1.*": ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] +``` + +`pattern_add_dataset_tags` can be configured in below different way + +- Add tags, however replace existing tags sent by ingestion source + ```yaml + transformers: + - type: "pattern_add_dataset_tags" + config: + replace_existing: true # false is default behaviour + tag_pattern: + rules: + ".*example1.*": + ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] + ``` +- Add tags, however overwrite the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_tags" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + tag_pattern: + rules: + ".*example1.*": + ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] + ``` +- Add tags, however keep the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_tags" + config: + semantics: PATCH + tag_pattern: + rules: + ".*example1.*": + ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] + ``` + +## Add Dataset globalTags + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | ------------------------------------------ | ----------- | -------------------------------------------------------------------------- | +| `get_tags_to_add` | ✅ | callable[[str], list[TagAssociationClass]] | | A function which takes entity urn as input and return TagAssociationClass. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +If you'd like to add more complex logic for assigning tags, you can use the more generic add_dataset_tags transformer, which calls a user-provided function to determine the tags for each dataset. + +```yaml +transformers: + - type: "add_dataset_tags" + config: + get_tags_to_add: "." +``` + +Then define your function to return a list of TagAssociationClass tags, for example: + +```python +import logging + +import datahub.emitter.mce_builder as builder +from datahub.metadata.schema_classes import ( + TagAssociationClass +) + +def custom_tags(entity_urn: str) -> List[TagAssociationClass]: + """Compute the tags to associate to a given dataset.""" + + tag_strings = [] + + ### Add custom logic here + tag_strings.append('custom1') + tag_strings.append('custom2') + + tag_strings = [builder.make_tag_urn(tag=n) for n in tag_strings] + tags = [TagAssociationClass(tag=tag) for tag in tag_strings] + + logging.info(f"Tagging dataset {entity_urn} with {tag_strings}.") + return tags +``` + +Finally, you can install and use your custom transformer as [shown here](#installing-the-package). + +`add_dataset_tags` can be configured in below different way + +- Add tags, however replace existing tags sent by ingestion source + ```yaml + transformers: + - type: "add_dataset_tags" + config: + replace_existing: true # false is default behaviour + get_tags_to_add: "." 
+ ``` +- Add tags, however overwrite the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "add_dataset_tags" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + get_tags_to_add: "." + ``` +- Add tags, however keep the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "add_dataset_tags" + config: + semantics: PATCH + get_tags_to_add: "." + ``` + +## Set Dataset browsePath + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | ------------ | ----------- | ---------------------------------------------------------------- | +| `path_templates` | ✅ | list[string] | | List of path templates. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +If you would like to add to browse paths of dataset can use this transformer. There are 3 optional variables that you can use to get information from the dataset `urn`: + +- ENV: env passed (default: prod) +- PLATFORM: `mysql`, `postgres` or different platform supported by datahub +- DATASET_PARTS: slash separated parts of dataset name. e.g. `database_name/schema_name/[table_name]` for postgres + +e.g. this can be used to create browse paths like `/prod/postgres/superset/public/logs` for table `superset.public.logs` in a `postgres` database + +```yaml +transformers: + - type: "set_dataset_browse_path" + config: + path_templates: + - /ENV/PLATFORM/DATASET_PARTS +``` + +If you don't want the environment but wanted to add something static in the browse path like the database instance name you can use this. + +```yaml +transformers: + - type: "set_dataset_browse_path" + config: + path_templates: + - /PLATFORM/marketing_db/DATASET_PARTS +``` + +It will create browse path like `/mysql/marketing_db/sales/orders` for a table `sales.orders` in `mysql` database instance. + +You can use this to add multiple browse paths. Different people might know the same data assets by different names. + +```yaml +transformers: + - type: "set_dataset_browse_path" + config: + path_templates: + - /PLATFORM/marketing_db/DATASET_PARTS + - /data_warehouse/DATASET_PARTS +``` + +This will add 2 browse paths like `/mysql/marketing_db/sales/orders` and `/data_warehouse/sales/orders` for a table `sales.orders` in `mysql` database instance. + +Default behaviour of the transform is to add new browse paths, you can optionally set `replace_existing: True` so +the transform becomes a _set_ operation instead of an _append_. + +```yaml +transformers: + - type: "set_dataset_browse_path" + config: + replace_existing: True + path_templates: + - /ENV/PLATFORM/DATASET_PARTS +``` + +In this case, the resulting dataset will have only 1 browse path, the one from the transform. 
+ +`set_dataset_browse_path` can be configured in below different way + +- Add browsePath, however replace existing browsePath sent by ingestion source + ```yaml + transformers: + - type: "set_dataset_browse_path" + config: + replace_existing: true # false is default behaviour + path_templates: + - /PLATFORM/marketing_db/DATASET_PARTS + ``` +- Add browsePath, however overwrite the browsePath available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "set_dataset_browse_path" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + path_templates: + - /PLATFORM/marketing_db/DATASET_PARTS + ``` +- Add browsePath, however keep the browsePath available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "set_dataset_browse_path" + config: + semantics: PATCH + path_templates: + - /PLATFORM/marketing_db/DATASET_PARTS + ``` + +## Simple Add Dataset glossaryTerms + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | ------------ | ----------- | ---------------------------------------------------------------- | +| `term_urns` | ✅ | list[string] | | List of glossaryTerms urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +We can use a similar convention to associate [Glossary Terms](../../../docs/generated/ingestion/sources/business-glossary.md) to datasets. +We can use the `simple_add_dataset_terms` transformer that’s included in the ingestion framework. + +The config, which we’d append to our ingestion recipe YAML, would look like this: + +```yaml +transformers: + - type: "simple_add_dataset_terms" + config: + term_urns: + - "urn:li:glossaryTerm:Email" + - "urn:li:glossaryTerm:Address" +``` + +`simple_add_dataset_terms` can be configured in below different way + +- Add terms, however replace existing terms sent by ingestion source + ```yaml + transformers: + - type: "simple_add_dataset_terms" + config: + replace_existing: true # false is default behaviour + term_urns: + - "urn:li:glossaryTerm:Email" + - "urn:li:glossaryTerm:Address" + ``` +- Add terms, however overwrite the terms available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "simple_add_dataset_terms" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + term_urns: + - "urn:li:glossaryTerm:Email" + - "urn:li:glossaryTerm:Address" + ``` +- Add terms, however keep the terms available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "simple_add_dataset_terms" + config: + semantics: PATCH + term_urns: + - "urn:li:glossaryTerm:Email" + - "urn:li:glossaryTerm:Address" + ``` + +## Pattern Add Dataset glossaryTerms + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | -------------------- | ----------- | ---------------------------------------------------------------------------------------------- | +| `term_pattern` | ✅ | map[regx, list[urn]] | | entity urn with regular expression and list of glossaryTerms urn apply to matching entity urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +We can add glossary terms to datasets based on a regex filter. 
+ +```yaml +transformers: + - type: "pattern_add_dataset_terms" + config: + term_pattern: + rules: + ".*example1.*": + ["urn:li:glossaryTerm:Email", "urn:li:glossaryTerm:Address"] + ".*example2.*": ["urn:li:glossaryTerm:PostalCode"] +``` + +`pattern_add_dataset_terms` can be configured in below different way + +- Add terms, however replace existing terms sent by ingestion source + + ```yaml + transformers: + - type: "pattern_add_dataset_terms" + config: + replace_existing: true # false is default behaviour + term_pattern: + rules: + ".*example1.*": + ["urn:li:glossaryTerm:Email", "urn:li:glossaryTerm:Address"] + ".*example2.*": ["urn:li:glossaryTerm:PostalCode"] + ``` + +- Add terms, however overwrite the terms available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_terms" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + term_pattern: + rules: + ".*example1.*": + ["urn:li:glossaryTerm:Email", "urn:li:glossaryTerm:Address"] + ".*example2.*": ["urn:li:glossaryTerm:PostalCode"] + ``` +- Add terms, however keep the terms available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_terms" + config: + semantics: PATCH + term_pattern: + rules: + ".*example1.*": + ["urn:li:glossaryTerm:Email", "urn:li:glossaryTerm:Address"] + ".*example2.*": ["urn:li:glossaryTerm:PostalCode"] + ``` + +## Pattern Add Dataset Schema Field glossaryTerms + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | -------------------- | ----------- | ---------------------------------------------------------------------------------------------- | +| `term_pattern` | ✅ | map[regx, list[urn]] | | entity urn with regular expression and list of glossaryTerms urn apply to matching entity urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +We can add glossary terms to schema fields based on a regex filter. + +Note that only terms from the first matching pattern will be applied. 
+ +```yaml +transformers: + - type: "pattern_add_dataset_schema_terms" + config: + term_pattern: + rules: + ".*email.*": ["urn:li:glossaryTerm:Email"] + ".*name.*": ["urn:li:glossaryTerm:Name"] +``` + +`pattern_add_dataset_schema_terms` can be configured in below different way + +- Add terms, however replace existing terms sent by ingestion source + ```yaml + transformers: + - type: "pattern_add_dataset_schema_terms" + config: + replace_existing: true # false is default behaviour + term_pattern: + rules: + ".*email.*": ["urn:li:glossaryTerm:Email"] + ".*name.*": ["urn:li:glossaryTerm:Name"] + ``` +- Add terms, however overwrite the terms available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_schema_terms" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + term_pattern: + rules: + ".*email.*": ["urn:li:glossaryTerm:Email"] + ".*name.*": ["urn:li:glossaryTerm:Name"] + ``` +- Add terms, however keep the terms available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_schema_terms" + config: + semantics: PATCH + term_pattern: + rules: + ".*email.*": ["urn:li:glossaryTerm:Email"] + ".*name.*": ["urn:li:glossaryTerm:Name"] + ``` + +## Pattern Add Dataset Schema Field globalTags + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | -------------------- | ----------- | ------------------------------------------------------------------------------------- | +| `tag_pattern` | ✅ | map[regx, list[urn]] | | entity urn with regular expression and list of tags urn apply to matching entity urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +We can also append a series of tags to specific schema fields. To do so, we can use the `pattern_add_dataset_schema_tags` transformer. This will match the regex pattern to each schema field path and assign the respective tags urns given in the array. + +Note that the tags from the first matching pattern will be applied, not all matching patterns. 
+ +The config would look like this: + +```yaml +transformers: + - type: "pattern_add_dataset_schema_tags" + config: + tag_pattern: + rules: + ".*email.*": ["urn:li:tag:Email"] + ".*name.*": ["urn:li:tag:Name"] +``` + +`pattern_add_dataset_schema_tags` can be configured in below different way + +- Add tags, however replace existing tag sent by ingestion source + ```yaml + transformers: + - type: "pattern_add_dataset_schema_tags" + config: + replace_existing: true # false is default behaviour + tag_pattern: + rules: + ".*example1.*": + ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] + ``` +- Add tags, however overwrite the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_schema_tags" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + tag_pattern: + rules: + ".*example1.*": + ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] + ``` +- Add tags, however keep the tags available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "pattern_add_dataset_schema_tags" + config: + semantics: PATCH + tag_pattern: + rules: + ".*example1.*": + ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"] + ".*example2.*": ["urn:li:tag:NeedsDocumentation"] + ``` + +## Simple Add Dataset datasetProperties + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | -------------- | ----------- | ---------------------------------------------------------------- | +| `properties` | ✅ | dict[str, str] | | Map of key value pair. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +`simple_add_dataset_properties` transformer assigns the properties to dataset entity from the configuration. +`properties` field is a dictionary of string values. Note in case of any key collision, the value in the config will +overwrite the previous value. 
+ +```yaml +transformers: + - type: "simple_add_dataset_properties" + config: + properties: + prop1: value1 + prop2: value2 +``` + +`simple_add_dataset_properties` can be configured in below different way + +- Add dataset-properties, however replace existing dataset-properties sent by ingestion source + + ```yaml + transformers: + - type: "simple_add_dataset_properties" + config: + replace_existing: true # false is default behaviour + properties: + prop1: value1 + prop2: value2 + ``` + +- Add dataset-properties, however overwrite the dataset-properties available for the dataset on DataHub GMS + + ```yaml + transformers: + - type: "simple_add_dataset_properties" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + properties: + prop1: value1 + prop2: value2 + ``` + +- Add dataset-properties, however keep the dataset-properties available for the dataset on DataHub GMS + + ```yaml + transformers: + - type: "simple_add_dataset_properties" + config: + semantics: PATCH + properties: + prop1: value1 + prop2: value2 + ``` + +## Add Dataset datasetProperties + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------------------- | -------- | -------------------------------------- | ----------- | ---------------------------------------------------------------- | +| `add_properties_resolver_class` | ✅ | Type[AddDatasetPropertiesResolverBase] | | A class extends from `AddDatasetPropertiesResolverBase` | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +If you'd like to add more complex logic for assigning properties, you can use the `add_dataset_properties` transformer, which calls a user-provided class (that extends from `AddDatasetPropertiesResolverBase` class) to determine the properties for each dataset. + +The config, which we’d append to our ingestion recipe YAML, would look like this: + +```yaml +transformers: + - type: "add_dataset_properties" + config: + add_properties_resolver_class: "." +``` + +Then define your class to return a list of custom properties, for example: + +```python +import logging +from typing import Dict +from datahub.ingestion.transformer.add_dataset_properties import AddDatasetPropertiesResolverBase + +class MyPropertiesResolver(AddDatasetPropertiesResolverBase): + def get_properties_to_add(self, entity_urn: str) -> Dict[str, str]: + ### Add custom logic here + properties= {'my_custom_property': 'property value'} + logging.info(f"Adding properties: {properties} to dataset: {entity_urn}.") + return properties +``` + +`add_dataset_properties` can be configured in below different way + +- Add dataset-properties, however replace existing dataset-properties sent by ingestion source + + ```yaml + transformers: + - type: "add_dataset_properties" + config: + replace_existing: true # false is default behaviour + add_properties_resolver_class: "." + ``` + +- Add dataset-properties, however overwrite the dataset-properties available for the dataset on DataHub GMS + + ```yaml + transformers: + - type: "add_dataset_properties" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + add_properties_resolver_class: "." + ``` + +- Add dataset-properties, however keep the dataset-properties available for the dataset on DataHub GMS + ```yaml + transformers: + - type: "add_dataset_properties" + config: + semantics: PATCH + add_properties_resolver_class: "." 
+ ``` + +## Simple Add Dataset domains + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | --------------------- | ----------- | ---------------------------------------------------------------- | +| `domains` | ✅ | list[union[urn, str]] | | List of simple domain name or domain urns. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +For transformer behaviour on `replace_existing` and `semantics`, please refer section [Relationship Between replace_existing And semantics](#relationship-between-replace_existing-and-semantics). + +
+ +let’s suppose we’d like to add a series of domain to dataset, in this case you can use `simple_add_dataset_domain` transformer. + +The config, which we’d append to our ingestion recipe YAML, would look like this: + +Here we can set domains to either urn (i.e. urn:li:domain:engineering) or simple domain name (i.e. engineering) in both of the cases domain should be provisioned on DataHub GMS + +```yaml +transformers: + - type: "simple_add_dataset_domain" + config: + semantics: OVERWRITE + domains: + - urn:li:domain:engineering +``` + +`simple_add_dataset_domain` can be configured in below different way + +- Add domains, however replace existing domains sent by ingestion source + +```yaml +transformers: + - type: "simple_add_dataset_domain" + config: + replace_existing: true # false is default behaviour + domains: + - "urn:li:domain:engineering" + - "urn:li:domain:hr" +``` + +- Add domains, however overwrite the domains available for the dataset on DataHub GMS + +```yaml +transformers: + - type: "simple_add_dataset_domain" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + domains: + - "urn:li:domain:engineering" + - "urn:li:domain:hr" +``` + +- Add domains, however keep the domains available for the dataset on DataHub GMS + +```yaml +transformers: + - type: "simple_add_dataset_domain" + config: + semantics: PATCH + domains: + - "urn:li:domain:engineering" + - "urn:li:domain:hr" +``` + +## Pattern Add Dataset domains + +### Config Details + +| Field | Required | Type | Default | Description | +| ------------------ | -------- | ------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------- | +| `domain_pattern` | ✅ | map[regx, list[union[urn, str]] | | dataset urn with regular expression and list of simple domain name or domain urn need to be apply on matching dataset urn. | +| `replace_existing` | | boolean | `false` | Whether to remove owners from entity sent by ingestion source. | +| `semantics` | | enum | `OVERWRITE` | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. | + +Let’s suppose we’d like to append a series of domain to specific datasets. To do so, we can use the pattern_add_dataset_domain transformer that’s included in the ingestion framework. +This will match the regex pattern to urn of the dataset and assign the respective domain urns given in the array. + +The config, which we’d append to our ingestion recipe YAML, would look like this: +Here we can set domain list to either urn (i.e. urn:li:domain:hr) or simple domain name (i.e. 
hr) +in both of the cases domain should be provisioned on DataHub GMS + +```yaml +transformers: + - type: "pattern_add_dataset_domain" + config: + semantics: OVERWRITE + domain_pattern: + rules: + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.n.*': + ["hr"] + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.t.*': + ["urn:li:domain:finance"] +``` + +`pattern_add_dataset_domain` can be configured in below different way + +- Add domains, however replace existing domains sent by ingestion source + +```yaml +transformers: + - type: "pattern_add_dataset_ownership" + config: + replace_existing: true # false is default behaviour + domain_pattern: + rules: + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.n.*': + ["hr"] + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.t.*': + ["urn:li:domain:finance"] +``` + +- Add domains, however overwrite the domains available for the dataset on DataHub GMS + +```yaml +transformers: + - type: "pattern_add_dataset_ownership" + config: + semantics: OVERWRITE # OVERWRITE is default behaviour + domain_pattern: + rules: + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.n.*': + ["hr"] + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.t.*': + ["urn:li:domain:finance"] +``` + +- Add domains, however keep the domains available for the dataset on DataHub GMS + +```yaml +transformers: + - type: "pattern_add_dataset_ownership" + config: + semantics: PATCH + domain_pattern: + rules: + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.n.*': + ["hr"] + 'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.t.*': + ["urn:li:domain:finance"] +``` + +## Relationship Between replace_existing and semantics + +The transformer behaviour mentioned here is in context of `simple_add_dataset_ownership`, however it is applicable for all dataset transformers which are supporting `replace_existing` +and `semantics` configuration attributes, for example `simple_add_dataset_tags` will add or remove tags as per behaviour mentioned in this section. + +`replace_existing` controls whether to remove owners from currently executing ingestion pipeline. + +`semantics` controls whether to overwrite or patch owners present on DataHub GMS server. These owners might be added from DataHub Portal. + +if `replace_existing` is set to `true` and `semantics` is set to `OVERWRITE` then transformer takes below steps + +1. As `replace_existing` is set to `true`, remove the owners from input entity (i.e. dataset) +2. Add owners mentioned in ingestion recipe to input entity +3. As `semantics` is set to `OVERWRITE` no need to fetch owners present on DataHub GMS server for the input entity +4. Return input entity + +if `replace_existing` is set to `true` and `semantics` is set to `PATCH` then transformer takes below steps + +1. `replace_existing` is set to `true`, first remove the owners from input entity (i.e. dataset) +2. Add owners mentioned in ingestion recipe to input entity +3. As `semantics` is set to `PATCH` fetch owners for the input entity from DataHub GMS Server +4. Add owners fetched from DataHub GMS Server to input entity +5. Return input entity + +if `replace_existing` is set to `false` and `semantics` is set to `OVERWRITE` then transformer takes below steps + +1. As `replace_existing` is set to `false`, keep the owners present in input entity as is +2. Add owners mentioned in ingestion recipe to input entity +3. 
As `semantics` is set to `OVERWRITE` no need to fetch owners from DataHub GMS Server for the input entity +4. Return input entity + +if `replace_existing` is set to `false` and `semantics` is set to `PATCH` then transformer takes below steps + +1. `replace_existing` is set to `false`, keep the owners present in input entity as is +2. Add owners mentioned in ingestion recipe to input entity +3. As `semantics` is set to `PATCH` fetch owners for the input entity from DataHub GMS Server +4. Add owners fetched from DataHub GMS Server to input entity +5. Return input entity + +## Writing a custom transformer from scratch + +In the above couple of examples, we use classes that have already been implemented in the ingestion framework. However, it’s common for more advanced cases to pop up where custom code is required, for instance if you'd like to utilize conditional logic or rewrite properties. In such cases, we can add our own modules and define the arguments it takes as a custom transformer. + +As an example, suppose we want to append a set of ownership fields to our metadata that are dependent upon an external source – for instance, an API endpoint or file – rather than a preset list like above. In this case, we can set a JSON file as an argument to our custom config, and our transformer will read this file and append the included ownership elements to all metadata events. + +Our JSON file might look like the following: + +```json +[ + "urn:li:corpuser:athos", + "urn:li:corpuser:porthos", + "urn:li:corpuser:aramis", + "urn:li:corpGroup:the_three_musketeers" +] +``` + +### Defining a config + +To get started, we’ll initiate an `AddCustomOwnershipConfig` class that inherits from [`datahub.configuration.common.ConfigModel`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/configuration/common.py). The sole parameter will be an `owners_json` which expects a path to a JSON file containing a list of owner URNs. This will go in a file called `custom_transform_example.py`. + +```python +from datahub.configuration.common import ConfigModel + +class AddCustomOwnershipConfig(ConfigModel): + owners_json: str +``` + +### Defining the transformer + +Next, we’ll define the transformer itself, which must inherit from [`datahub.ingestion.api.transform.Transformer`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/api/transform.py). The framework provides a helper class called [`datahub.ingestion.transformer.base_transformer.BaseTransformer`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/transformer/base_transformer.py) that makes it super-simple to write transformers. 
First, let's get all our imports in:

```python
# append these to the start of custom_transform_example.py
import json
from typing import List, Optional

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.add_dataset_ownership import Semantics
from datahub.ingestion.transformer.base_transformer import (
    BaseTransformer,
    SingleAspectTransformer,
)
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)
```

Next, let's define the base scaffolding for the class:

```python
# append this to the end of custom_transform_example.py

class AddCustomOwnership(BaseTransformer, SingleAspectTransformer):
    """Transformer that adds owners to datasets according to a JSON file of owner URNs."""

    # context param to generate run metadata such as a run ID
    ctx: PipelineContext
    # as defined in the previous block
    config: AddCustomOwnershipConfig

    def __init__(self, config: AddCustomOwnershipConfig, ctx: PipelineContext):
        super().__init__()
        self.ctx = ctx
        self.config = config

        with open(self.config.owners_json, "r") as f:
            raw_owner_urns = json.load(f)

        self.owners = [
            OwnerClass(owner=owner, type=OwnershipTypeClass.DATAOWNER)
            for owner in raw_owner_urns
        ]
```

A transformer must have two functions: a `create()` function for initialization and a `transform()` function for executing the transformation. Transformers that extend `BaseTransformer` and `SingleAspectTransformer` can avoid implementing the more complex `transform` function and instead implement only the `transform_aspect` function.

Let's begin by adding a `create()` method for parsing our configuration dictionary:

```python
# add this as a function of AddCustomOwnership

@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "AddCustomOwnership":
    config = AddCustomOwnershipConfig.parse_obj(config_dict)
    return cls(config, ctx)
```

Next, we need to tell the helper classes which entity types and which aspect we are interested in transforming. In this case, we want to process only `dataset` entities and transform the `ownership` aspect.

```python
# add these as functions of AddCustomOwnership

def entity_types(self) -> List[str]:
    return ["dataset"]

def aspect_name(self) -> str:
    return "ownership"
```

Finally, we need to implement the `transform_aspect()` method that does the work of adding our custom ownership classes. This method will be called by the framework with an optional aspect value filled out if the upstream source produced a value for this aspect. The framework takes care of pre-processing both MCEs and MCPs so that the `transform_aspect()` function is only called once per entity. Our job is merely to inspect the incoming aspect (or its absence) and produce a transformed value for this aspect. Returning `None` from this method will effectively suppress this aspect from being emitted.
```python
# add this as a function of AddCustomOwnership

def transform_aspect(  # type: ignore
    self, entity_urn: str, aspect_name: str, aspect: Optional[OwnershipClass]
) -> Optional[OwnershipClass]:
    owners_to_add = self.owners
    assert aspect is None or isinstance(aspect, OwnershipClass)

    if not owners_to_add:
        # nothing to add: pass the incoming aspect through unchanged
        return aspect

    ownership = aspect if aspect else OwnershipClass(owners=[])
    ownership.owners.extend(owners_to_add)
    return ownership
```

### More Sophistication: Making calls to DataHub during Transformation

In some advanced cases, you might want to check with DataHub before performing a transformation. A good example is retrieving the current set of owners of a dataset before providing a new set of owners during ingestion. To allow transformers to query the graph at any time, the framework gives them access to the graph through the context object `ctx`. Connectivity to the graph is automatically instantiated whenever the pipeline uses a REST sink. If you are using the Kafka sink, you can additionally provide access to the graph by configuring it in your pipeline.

Here is an example of a recipe that uses Kafka as the sink, but provides access to the graph by explicitly configuring the `datahub_api`:

```yaml
source:
  type: mysql
  config:
    # ... source configs

sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: localhost:9092
      schema_registry_url: "http://localhost:8081"

datahub_api:
  server: http://localhost:8080
  # standard configs accepted by the datahub rest client ...
```

#### Advanced Use-Case: Patching Owners

With the above capability, we can now build more powerful transformers that check the server-side state before issuing changes to metadata. For example, here is how the `AddDatasetOwnership` transformer supports PATCH semantics by ensuring that it never deletes any owners that are stored on the server.

```python
def transform_one(self, mce: MetadataChangeEventClass) -> MetadataChangeEventClass:
    if not isinstance(mce.proposedSnapshot, DatasetSnapshotClass):
        return mce
    owners_to_add = self.config.get_owners_to_add(mce.proposedSnapshot)
    if owners_to_add:
        ownership = builder.get_or_add_aspect(
            mce,
            OwnershipClass(
                owners=[],
            ),
        )
        ownership.owners.extend(owners_to_add)

        if self.config.semantics == Semantics.PATCH:
            assert self.ctx.graph
            patch_ownership = AddDatasetOwnership.get_ownership_to_set(
                self.ctx.graph, mce.proposedSnapshot.urn, ownership
            )
            builder.set_aspect(
                mce, aspect=patch_ownership, aspect_type=OwnershipClass
            )
    return mce
```

### Installing the package

Now that we've defined the transformer, we need to make it visible to DataHub. The easiest way to do this is to place it in the same directory as your recipe, in which case the module name is the same as the file name – in this case, `custom_transform_example`.
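If the module simply sits next to your recipe rather than being installed as a package, a recipe can reference the transformer by its fully-qualified class name. A minimal sketch, assuming the layout described above (the `owners.json` path is illustrative):

```yaml
transformers:
  - type: "custom_transform_example.AddCustomOwnership"
    config:
      owners_json: "owners.json" # illustrative path to the JSON list of owner URNs
```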
**Advanced: Installing as a package and enabling discoverability**

Alternatively, create a `setup.py` in the same directory as our transform script to make it visible globally. After installing this package (e.g. with `python setup.py install` or `pip install -e .`), our module will be installed and importable as `custom_transform_example`.

```python
from setuptools import find_packages, setup

setup(
    name="custom_transform_example",
    version="1.0",
    packages=find_packages(),
    # if you don't already have DataHub installed, add it under install_requires
    # install_requires=["acryl-datahub"],
    entry_points={
        "datahub.ingestion.transformer.plugins": [
            "custom_transform_example_alias = custom_transform_example:AddCustomOwnership",
        ],
    },
)
```

Additionally, declare the transformer under the `entry_points` variable of the setup script. This enables the transformer to be listed when running `datahub check plugins`, and sets up the transformer's shortened alias for use in recipes.
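To sanity-check the installation, you can (for example) install the package in editable mode and then list the registered plugins; the exact output will vary with your environment:

```shell
pip install -e .
# custom_transform_example_alias should now appear among the registered transformer plugins
datahub check plugins
```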
+ +### Running the transform + +```yaml +transformers: + - type: "custom_transform_example_alias" + config: + owners_json: "" # the JSON file mentioned at the start +``` + +After running `datahub ingest -c `, our MCEs will now have the following owners appended: + +```json +"owners": [ + { + "owner": "urn:li:corpuser:athos", + "type": "DATAOWNER", + "source": null + }, + { + "owner": "urn:li:corpuser:porthos", + "type": "DATAOWNER", + "source": null + }, + { + "owner": "urn:li:corpuser:aramis", + "type": "DATAOWNER", + "source": null + }, + { + "owner": "urn:li:corpGroup:the_three_musketeers", + "type": "DATAOWNER", + "source": null + }, + // ...and any additional owners +], +``` + +All the files for this tutorial may be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/transforms/). diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/transformer/intro.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/transformer/intro.md new file mode 100644 index 0000000000000..b701aac36d25c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/docs/transformer/intro.md @@ -0,0 +1,46 @@ +--- +title: Introduction +sidebar_label: Introduction +slug: /metadata-ingestion/docs/transformer/intro +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/intro.md +--- + +# Transformers + +## What’s a transformer? + +Oftentimes we want to modify metadata before it reaches the ingestion sink – for instance, we might want to add custom tags, ownership, properties, or patch some fields. A transformer allows us to do exactly these things. + +Moreover, a transformer allows one to have fine-grained control over the metadata that’s ingested without having to modify the ingestion framework's code yourself. Instead, you can write your own module that can transform metadata events however you like. To include a transformer into a recipe, all that's needed is the name of the transformer as well as any configuration that the transformer needs. + +:::note + +Providing urns for metadata that does not already exist will result in unexpected behavior. Ensure any tags, terms, domains, etc. urns that you want to apply in your transformer already exist in your DataHub instance. + +For example, adding a domain urn in your transformer to apply to datasets will not create the domain entity if it doesn't exist. Therefore, you can't add documentation to it and it won't show up in Advanced Search. This goes for any metadata you are applying in transformers. + +::: + +## Provided transformers + +Aside from the option of writing your own transformer (see below), we provide some simple transformers for the use cases of adding: tags, glossary terms, properties and ownership information. 
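For example, including one of these provided transformers in a recipe takes only its name and its configuration. A minimal sketch using `simple_add_dataset_ownership` (the owner URN is illustrative):

```yaml
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username"
```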
The transformers DataHub provides for datasets are:

- [Simple Add Dataset ownership](./dataset_transformer.md#simple-add-dataset-ownership)
- [Pattern Add Dataset ownership](./dataset_transformer.md#pattern-add-dataset-ownership)
- [Simple Remove Dataset ownership](./dataset_transformer.md#simple-remove-dataset-ownership)
- [Mark Dataset Status](./dataset_transformer.md#mark-dataset-status)
- [Simple Add Dataset globalTags](./dataset_transformer.md#simple-add-dataset-globaltags)
- [Pattern Add Dataset globalTags](./dataset_transformer.md#pattern-add-dataset-globaltags)
- [Add Dataset globalTags](./dataset_transformer.md#add-dataset-globaltags)
- [Set Dataset browsePath](./dataset_transformer.md#set-dataset-browsepath)
- [Simple Add Dataset glossaryTerms](./dataset_transformer.md#simple-add-dataset-glossaryterms)
- [Pattern Add Dataset glossaryTerms](./dataset_transformer.md#pattern-add-dataset-glossaryterms)
- [Pattern Add Dataset Schema Field glossaryTerms](./dataset_transformer.md#pattern-add-dataset-schema-field-glossaryterms)
- [Pattern Add Dataset Schema Field globalTags](./dataset_transformer.md#pattern-add-dataset-schema-field-globaltags)
- [Simple Add Dataset datasetProperties](./dataset_transformer.md#simple-add-dataset-datasetproperties)
- [Add Dataset datasetProperties](./dataset_transformer.md#add-dataset-datasetproperties)
- [Simple Add Dataset domains](./dataset_transformer.md#simple-add-dataset-domains)
- [Pattern Add Dataset domains](./dataset_transformer.md#pattern-add-dataset-domains)

diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/examples/transforms/README.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/examples/transforms/README.md
new file mode 100644
index 0000000000000..6f3da8504be40
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/examples/transforms/README.md
@@ -0,0 +1,12 @@
---
title: Custom transformer script
slug: /metadata-ingestion/examples/transforms
custom_edit_url: >-
  https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/transforms/README.md
---

# Custom transformer script

This script sets up a transformer that reads in a list of owner URNs from a JSON file specified via `owners_json` and appends these owners to every MCE.

See the transformers tutorial (https://datahubproject.io/docs/metadata-ingestion/transformers) for how this module is built and run.

diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/integration_docs/great-expectations.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/integration_docs/great-expectations.md
new file mode 100644
index 0000000000000..b9f26793fcdb4
--- /dev/null
+++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/integration_docs/great-expectations.md
@@ -0,0 +1,67 @@
---
title: Great Expectations
slug: /metadata-ingestion/integration_docs/great-expectations
custom_edit_url: >-
  https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/integration_docs/great-expectations.md
---

# Great Expectations

This guide helps you set up and configure `DataHubValidationAction` in Great Expectations to send assertions (expectations) and their results to DataHub using DataHub's Python REST emitter.

## Capabilities

`DataHubValidationAction` pushes assertion metadata to DataHub. This includes:

- **Assertion Details**: Details of assertions (i.e. expectations) set on a Dataset (Table).
- **Assertion Results**: Evaluation results for an assertion, tracked over time.

This integration supports v3 API datasources that use `SqlAlchemyExecutionEngine`.

## Limitations

This integration does not support:

- v2 Datasources such as SqlAlchemyDataset
- v3 Datasources using an execution engine other than `SqlAlchemyExecutionEngine` (e.g. Spark, Pandas)
- Cross-dataset expectations (those involving more than one table)

## Setting up

1. Install the required dependency in your Great Expectations environment.

   ```shell
   pip install 'acryl-datahub[great-expectations]'
   ```

2. To add `DataHubValidationAction` to a Great Expectations Checkpoint, add the following configuration to the `action_list` of your `Checkpoint`. For more details on setting up the `action_list`, see [Checkpoints and Actions](https://docs.greatexpectations.io/docs/reference/checkpoints_and_actions/).

   ```yml
   action_list:
     - name: datahub_action
       action:
         module_name: datahub.integrations.great_expectations.action
         class_name: DataHubValidationAction
         server_url: http://localhost:8080 # datahub server url
   ```

   **Configuration options:**

   - `server_url` (required): URL of the DataHub GMS endpoint.
   - `env` (optional, defaults to "PROD"): Environment to use in the namespace when constructing dataset URNs.
   - `exclude_dbname` (optional): Exclude the dbname / catalog when constructing dataset URNs. (Highly applicable to Trino / Presto, where we want to omit the catalog, e.g. `hive`.)
   - `platform_alias` (optional): Platform alias to use when constructing dataset URNs, e.g. the main data platform is `presto-on-hive` but `trino` is used to run the tests.
   - `platform_instance_map` (optional): Platform instance mapping to use when constructing dataset URNs. Maps the GX 'data source' name to a platform instance on DataHub, e.g. `platform_instance_map: { "datasource_name": "warehouse" }`.
   - `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall checkpoint to fail. Note that configuration issues will still throw exceptions.
   - `token` (optional): Bearer token used for authentication.
   - `timeout_sec` (optional): Per-HTTP request timeout.
   - `retry_status_codes` (optional): Retry the HTTP request also on these status codes.
   - `retry_max_times` (optional): Maximum times to retry if the HTTP request fails. The delay between retries is increased exponentially.
   - `extra_headers` (optional): Extra headers which will be added to the DataHub request.
   - `parse_table_names_from_sql` (defaults to false): The integration can use an SQL parser to try to parse the datasets being asserted. This parsing is disabled by default, but can be enabled by setting `parse_table_names_from_sql: True`. The parser is based on the [`sqllineage`](https://pypi.org/project/sqllineage/) package.
   - `convert_urns_to_lowercase` (optional): Whether to convert dataset urns to lowercase.

## Debugging

Set the environment variable `DATAHUB_DEBUG` (default `false`) to `true` to enable debug logging for `DataHubValidationAction`.

## Learn more

To see the Great Expectations integration in action, check out [this demo](https://www.loom.com/share/d781c9f0b270477fb5d6b0c26ef7f22d) from the Feb 2022 townhall.
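For quick reference, here is a sketch of a checkpoint `action_list` entry that combines several of the options documented above; every value is illustrative and should be adapted to your environment:

```yml
action_list:
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: http://localhost:8080
      env: PROD
      platform_instance_map: { "my_datasource": "warehouse" } # illustrative GX data source name
      graceful_exceptions: true
      parse_table_names_from_sql: false
      convert_urns_to_lowercase: true
```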
diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/airflow.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/airflow.md new file mode 100644 index 0000000000000..28f2ccaaee384 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/airflow.md @@ -0,0 +1,49 @@ +--- +title: Using Airflow +slug: /metadata-ingestion/schedule_docs/airflow +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/airflow.md +--- + +# Using Airflow + +If you are using Apache Airflow for your scheduling then you might want to also use it for scheduling your ingestion recipes. For any Airflow specific questions you can go through [Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/) for more details. + +We've provided a few examples of how to configure your DAG: + +- [`mysql_sample_dag`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py) embeds the full MySQL ingestion configuration inside the DAG. + +- [`snowflake_sample_dag`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/snowflake_sample_dag.py) avoids embedding credentials inside the recipe, and instead fetches them from Airflow's [Connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection/index.html) feature. You must configure your connections in Airflow to use this approach. + +:::tip + +These example DAGs use the `PythonVirtualenvOperator` to run the ingestion. This is the recommended approach, since it guarantees that there will not be any conflicts between DataHub and the rest of your Airflow environment. + +When configuring the task, it's important to specify the requirements with your source and set the `system_site_packages` option to false. + +```py +ingestion_task = PythonVirtualenvOperator( + task_id="ingestion_task", + requirements=[ + "acryl-datahub[]", + ], + system_site_packages=False, + python_callable=your_callable, +) +``` + +::: + +
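For completeness, here is a minimal sketch of what such a DAG might look like with the recipe embedded in the callable. The DAG id, schedule, and MySQL coordinates are all illustrative, and the imports live inside the callable because it executes in its own virtualenv:

```py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def ingest_from_mysql():
    # imports stay inside the callable: it runs in a fresh virtualenv
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # illustrative coordinates
                    "database": "dbname",
                    "username": "root",
                    "password": "example",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()


with DAG(
    dag_id="datahub_mysql_ingestion",  # illustrative DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingestion_task = PythonVirtualenvOperator(
        task_id="ingestion_task",
        requirements=["acryl-datahub[mysql]"],  # include the extras for your source
        system_site_packages=False,
        python_callable=ingest_from_mysql,
    )
```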
**Advanced: loading a recipe file**

In more advanced cases, you might want to store your ingestion recipe in a file and load it from your task.

- Ensure the recipe file is in a folder accessible to your Airflow workers. You can either specify an absolute path on the machines where Airflow is installed or a path relative to `AIRFLOW_HOME`.
- Ensure the [DataHub CLI](../../docs/cli.md) is installed in your Airflow environment.
- Create a DAG task to read your DataHub ingestion recipe file and run it. See the example below for reference.
- Deploy the DAG file into Airflow for scheduling. Typically this involves checking the DAG file into the dags folder that is accessible to your Airflow instance.

Example: [`generic_recipe_sample_dag`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/generic_recipe_sample_dag.py)
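For reference, here is a minimal sketch of such a task's callable; the recipe path is illustrative, and the callable can be wired into a `PythonOperator` or `PythonVirtualenvOperator` task just like the example in the tip above:

```py
def run_datahub_recipe():
    # keep imports inside the callable so it also works with PythonVirtualenvOperator
    import yaml

    from datahub.ingestion.run.pipeline import Pipeline

    # illustrative path; use an absolute path or one relative to AIRFLOW_HOME as described above
    with open("/opt/airflow/datahub_recipes/mysql_to_datahub.yml") as f:
        config = yaml.safe_load(f)

    pipeline = Pipeline.create(config)
    pipeline.run()
    pipeline.raise_from_status()
```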
diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/cron.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/cron.md new file mode 100644 index 0000000000000..e2a4a50ff4082 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/cron.md @@ -0,0 +1,36 @@ +--- +title: Using Cron +slug: /metadata-ingestion/schedule_docs/cron +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/cron.md +--- + +# Using Cron + +Assume you have a recipe file `/home/ubuntu/datahub_ingest/mysql_to_datahub.yml` on your machine + +``` +source: + type: mysql + config: + # Coordinates + host_port: localhost:3306 + database: dbname + + # Credentials + username: root + password: example + +sink: + type: datahub-rest + config: + server: http://localhost:8080 +``` + +We can use crontab to schedule ingestion to run five minutes after midnight, every day using [DataHub CLI](../../docs/cli.md). + +``` +5 0 * * * datahub ingest -c /home/ubuntu/datahub_ingest/mysql_to_datahub.yml +``` + +Read through [crontab docs](https://man7.org/linux/man-pages/man5/crontab.5.html) for more options related to scheduling. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/datahub.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/datahub.md new file mode 100644 index 0000000000000..5e4a12c75948c --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/datahub.md @@ -0,0 +1,10 @@ +--- +title: Using DataHub +slug: /metadata-ingestion/schedule_docs/datahub +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/datahub.md +--- + +# Using DataHub + +[UI Ingestion](../../docs/ui-ingestion.md) can be used to schedule metadata ingestion through DataHub. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/intro.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/intro.md new file mode 100644 index 0000000000000..7dc471af52be0 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/intro.md @@ -0,0 +1,38 @@ +--- +title: Introduction to Scheduling Metadata Ingestion +slug: /metadata-ingestion/schedule_docs/intro +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/intro.md +--- + +# Introduction to Scheduling Metadata Ingestion + +Given a recipe file `/home/ubuntu/datahub_ingest/mysql_to_datahub.yml`. + +``` +source: + type: mysql + config: + # Coordinates + host_port: localhost:3306 + database: dbname + + # Credentials + username: root + password: example + +sink: + type: datahub-rest + config: + server: http://localhost:8080 +``` + +We can do ingestion of our metadata using [DataHub CLI](../../docs/cli.md) as follows + +``` +datahub ingest -c /home/ubuntu/datahub_ingest/mysql_to_datahub.yml +``` + +This will ingest metadata from the `mysql` source which is configured in the recipe file. This does ingestion once. As the source system changes we would like to have the changes reflected in DataHub. To do this someone will need to re-run the ingestion command using a recipe file. + +An alternate to running the command manually we can schedule the ingestion to run on a regular basis. 
In this section we give some examples of how scheduling ingestion of metadata into DataHub can be done. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/kubernetes.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/kubernetes.md new file mode 100644 index 0000000000000..a29fe6248cd6b --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/schedule_docs/kubernetes.md @@ -0,0 +1,55 @@ +--- +title: Using Kubernetes +slug: /metadata-ingestion/schedule_docs/kubernetes +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/kubernetes.md +--- + +# Using Kubernetes + +If you have deployed DataHub using our official [helm charts](https://github.com/acryldata/datahub-helm) you can use the +datahub ingestion cron subchart to schedule ingestions. + +Here is an example of what that configuration would look like in your **values.yaml**: + +```yaml +datahub-ingestion-cron: + enabled: true + crons: + mysql: + schedule: "0 * * * *" # Every hour + recipe: + configmapName: recipe-config + fileName: mysql_recipe.yml +``` + +This assumes the pre-existence of a Kubernetes ConfigMap which holds all recipes being scheduled in the same namespace as +where the cron jobs will be running. + +An example could be: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: recipe-config +data: + mysql_recipe.yml: |- + source: + type: mysql + config: + # Coordinates + host_port: :3306 + database: dbname + + # Credentials + username: root + password: example + + sink: + type: datahub-rest + config: + server: http://:8080 +``` + +For more information, please see the [documentation](https://github.com/acryldata/datahub-helm/tree/master/charts/datahub/subcharts/datahub-ingestion-cron) of this sub-chart. diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/sink_docs/console.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/sink_docs/console.md new file mode 100644 index 0000000000000..c3806b0fb20a2 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/sink_docs/console.md @@ -0,0 +1,40 @@ +--- +title: Console +slug: /metadata-ingestion/sink_docs/console +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/sink_docs/console.md +--- + +# Console + +For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md). + +## Setup + +Works with `acryl-datahub` out of the box. + +## Capabilities + +Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes. + +## Quickstart recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes). + +```yml +source: + # source configs + +sink: + type: "console" +``` + +## Config details + +None! + +## Questions + +If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)! 
diff --git a/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/sink_docs/datahub.md b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/sink_docs/datahub.md new file mode 100644 index 0000000000000..14cf947874212 --- /dev/null +++ b/docs-website/versioned_docs/version-0.10.4/metadata-ingestion/sink_docs/datahub.md @@ -0,0 +1,190 @@ +--- +title: DataHub +slug: /metadata-ingestion/sink_docs/datahub +custom_edit_url: >- + https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/sink_docs/datahub.md +--- + +# DataHub + +## DataHub Rest + +For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md). + +### Setup + +To install this plugin, run `pip install 'acryl-datahub[datahub-rest]'`. + +### Capabilities + +Pushes metadata to DataHub using the GMS REST API. The advantage of the REST-based interface +is that any errors can immediately be reported. + +### Quickstart recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes). This should point to the GMS server. + +```yml +source: + # source configs +sink: + type: "datahub-rest" + config: + server: "http://localhost:8080" +``` + +If you are running the ingestion in a container in docker and your [GMS is also running in docker](../../docker/README.md) then you should use the internal docker hostname of the GMS pod. Usually it would look something like + +```yml +source: + # source configs +sink: + type: "datahub-rest" + config: + server: "http://datahub-gms:8080" +``` + +If GMS is running in a kubernetes pod [deployed through the helm charts](../../docs/deploy/kubernetes.md) and you are trying to connect to it from within the kubernetes cluster then you should use the Kubernetes service name of GMS. Usually it would look something like + +```yml +source: + # source configs +sink: + type: "datahub-rest" + config: + server: "http://datahub-datahub-gms.datahub.svc.cluster.local:8080" +``` + +If you are using [UI based ingestion](../../docs/ui-ingestion.md) then where GMS is deployed decides what hostname you should use. + +### Config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| -------------------------- | -------- | -------------------- | -------------------------------------------------------------------------------------------------- | +| `server` | ✅ | | URL of DataHub GMS endpoint. | +| `timeout_sec` | | 30 | Per-HTTP request timeout. | +| `retry_max_times` | | 1 | Maximum times to retry if HTTP request fails. The delay between retries is increased exponentially | +| `retry_status_codes` | | [429, 502, 503, 504] | Retry HTTP request also on these status codes | +| `token` | | | Bearer token used for authentication. | +| `extra_headers` | | | Extra headers which will be added to the request. | +| `max_threads` | | `15` | Experimental: Max parallelism for REST API calls | +| `ca_certificate_path` | | | Path to CA certificate for HTTPS communications | +| `disable_ssl_verification` | | false | Disable ssl certificate validation | + +## DataHub Kafka + +For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md). + +### Setup + +To install this plugin, run `pip install 'acryl-datahub[datahub-kafka]'`. 
+ +### Capabilities + +Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based +interface is that it's asynchronous and can handle higher throughput. + +### Quickstart recipe + +Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options. + +For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes). + +```yml +source: + # source configs + +sink: + type: "datahub-kafka" + config: + connection: + bootstrap: "localhost:9092" + schema_registry_url: "http://localhost:8081" +``` + +### Config details + +Note that a `.` is used to denote nested fields in the YAML recipe. + +| Field | Required | Default | Description | +| -------------------------------------------- | -------- | ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `connection.bootstrap` | ✅ | | Kafka bootstrap URL. | +| `connection.producer_config.