feat: add ingestion overview pages #9210
Merged

Commits (14):

f3e4f60 fix: add ingestion guides (yoonhyejin)
2ee9e3e Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
ec8c8b9 Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
b1c363f feat: reflect feedbacks (yoonhyejin)
f680aa5 feat: fix ingestion overview explanations (yoonhyejin)
90f2e79 Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
4fb646f Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
c5b001d Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
e23fb6d fix: fix image name (yoonhyejin)
2deb055 Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
e10197b Merge branch 'master' into docs/ingestion-guide (yoonhyejin)
bcaa01c fix: fix image path (yoonhyejin)
5619cff fix: reflect feedbacks (yoonhyejin)
62bbaa6 Merge branch 'master' into docs/ingestion-guide (yoonhyejin)

@@ -1,228 +1,61 @@

# Introduction to Metadata Ingestion

<a
  className='button button--primary button--lg'
  href="https://datahubproject.io/integrations">
  Find Integration Source
</a>

## Integration Options

DataHub supports both **push-based** and **pull-based** metadata integration.

Push-based integrations allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible.

Examples of push-based integrations include [Airflow](../docs/lineage/airflow.md), [Spark](../metadata-integration/java/spark-lineage/README.md), [Great Expectations](./integration_docs/great-expectations.md) and [Protobuf Schemas](../metadata-integration/java/datahub-protobuf/README.md). These give you low-latency metadata integration from the "active" agents in your data ecosystem. Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others.

This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack.

## Getting Started

### Prerequisites

Before running any metadata ingestion job, make sure that the DataHub backend services are all running. You can run ingestion via the [UI](../docs/ui-ingestion.md) or via the [CLI](../docs/cli.md); refer to the CLI usage guide as you work through this page.

## Core Concepts

### Sources

Please see our [Integrations page](https://datahubproject.io/integrations) to browse our ingestion sources and filter on their features.

Data systems that we extract metadata from are referred to as **Sources**. The `Sources` tab in the left sidebar shows you all the sources that are available for you to ingest metadata from. For example, we have sources for [BigQuery](https://datahubproject.io/docs/generated/ingestion/sources/bigquery), [Looker](https://datahubproject.io/docs/generated/ingestion/sources/looker), [Tableau](https://datahubproject.io/docs/generated/ingestion/sources/tableau) and many others.

#### Metadata Ingestion Source Status

We apply a Support Status to each Metadata Source to help you understand the integration reliability at a glance.

![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen): Certified Sources are well-tested & widely adopted by the DataHub Community. We expect the integration to be stable with few user-facing issues.

![Incubating](https://img.shields.io/badge/support%20status-incubating-blue): Incubating Sources are ready for DataHub Community adoption but have not been tested for a wide variety of edge cases. We eagerly solicit feedback from the Community to strengthen the connector; minor version changes may arise in future releases.

![Testing](https://img.shields.io/badge/support%20status-testing-lightgrey): Testing Sources are available for experimentation by DataHub Community members, but may change without notice.

### Sinks

Sinks are destinations for metadata. When configuring ingestion for DataHub, you're likely to be sending the metadata to DataHub over either the [REST (datahub-rest)](./sink_docs/datahub.md#datahub-rest) or the [Kafka (datahub-kafka)](./sink_docs/datahub.md#datahub-kafka) sink. In some cases, the [File](./sink_docs/file.md) sink is also helpful to store a persistent offline copy of the metadata during debugging.

The default sink that most of the ingestion systems and guides assume is the `datahub-rest` sink, but you should be able to adapt all of them for the other sinks as well!
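
As a rough illustration (the server URL and token below are placeholders for your own deployment, not required values), the `sink` section of a recipe for the REST sink looks something like this:

```yaml
# Illustrative only: point the datahub-rest sink at your own GMS endpoint.
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: ${DATAHUB_GMS_TOKEN} # only needed when authentication is enabled
```

The `datahub-kafka` and `file` sinks follow the same `type`/`config` shape; see the sink docs linked above for their specific options.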

### Recipes

A recipe is the main configuration file that puts it all together. It tells our ingestion scripts where to pull data from (source) and where to put it (sink).

:::tip
Name your recipe with the **.dhub.yaml** extension, like _myrecipe.dhub.yaml_, to use VS Code or IntelliJ as a recipe editor with autocomplete and syntax validation.

Make sure the YAML plugin is installed for your editor:

- For VS Code, install [Redhat's yaml plugin](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml)
- For IntelliJ, install the [official yaml plugin](https://plugins.jetbrains.com/plugin/13126-yaml)

:::

:::tip Find Integration Source
Please see our **[Integrations page](https://datahubproject.io/integrations)** to browse our ingestion sources and filter on their features.
:::

Since `acryl-datahub` version `>=0.8.33.2`, the default sink is assumed to be a DataHub REST endpoint:

- Hosted at "http://localhost:8080", or the environment variable `${DATAHUB_GMS_URL}` if present
- With an empty auth token, or the environment variable `${DATAHUB_GMS_TOKEN}` if present

Here's a simple recipe that pulls metadata from MSSQL (source) and puts it into the default sink (datahub-rest):

```yaml
# The simplest recipe that pulls metadata from MSSQL and puts it into DataHub
# using the Rest API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

# sink section omitted as we want to use the default datahub-rest sink
```

Running this recipe is as simple as:

```shell
datahub ingest -c recipe.dhub.yaml
```

or, if you want to override the default endpoints, you can provide the environment variables as part of the command, like below:

```shell
DATAHUB_GMS_URL="https://my-datahub-server:8080" DATAHUB_GMS_TOKEN="my-datahub-token" datahub ingest -c recipe.dhub.yaml
```

A number of recipes are included in the [examples/recipes](./examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](../docs/cli.md#installing-plugins).

> Note that one recipe file can only have 1 source and 1 sink. If you want multiple sources then you will need multiple recipe files.

## Integration Methods

DataHub offers three methods for data ingestion:

- UI ingestion
- CLI ingestion
- SDK-based ingestion

### UI Ingestion

DataHub supports configuring and monitoring ingestion via the UI.
For a detailed guide on UI ingestion, please refer to the [UI Ingestion](../docs/ui-ingestion.md) page.

### CLI Ingestion

DataHub supports configuring ingestion via the [CLI](../docs/cli.md).
For more information, refer to the [CLI Ingestion guide](cli-ingestion.md).

### SDK-based ingestion

In some cases, you might want to construct Metadata events directly and use programmatic ways to emit that metadata to DataHub.
In this case, take a look at the [Python emitter](./as-a-library.md) and the [Java emitter](../metadata-integration/java/as-a-library.md) libraries, which can be called from your own code.

For instance, if you want to configure and run a pipeline entirely from within your custom Python script, please refer to [programmatic_pipeline.py](./examples/library/programatic_pipeline.py) - a basic mysql to REST programmatic pipeline.
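
As a rough sketch of what that looks like (the connection details below are placeholders, and the config dictionary mirrors what you would otherwise put in a YAML recipe):

```python
# Illustrative sketch of a programmatic pipeline: pull from a MySQL source
# and push to a DataHub REST sink, all from Python.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "username": "user",
                "password": "pass",
                "database": "db_name",
                "host_port": "localhost:3306",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)

pipeline.run()                # execute the ingestion run
pipeline.raise_from_status()  # fail loudly if the run reported errors
```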

### Handling sensitive information in recipes

We automatically expand environment variables in the config (e.g. `${MSSQL_PASSWORD}`), similar to variable substitution in GNU bash or in docker-compose files. For details, see https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution. This environment variable substitution should be used to mask sensitive information in recipe files. As long as you can get environment variables securely to the ingestion process, there is no need to store sensitive information in recipes.
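
For example, with the MSSQL recipe above, the password never needs to appear in the recipe file; you only have to make sure the variable is set in the environment that runs the ingestion (an illustrative sketch):

```shell
# The secret lives only in the environment of the ingestion process,
# not in the recipe file checked into version control.
export MSSQL_PASSWORD='<your-password>'
datahub ingest -c recipe.dhub.yaml
```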

## Types of Integration

Integration can be divided into two concepts based on the method:

- Push-based integration
- Pull-based integration

### Basic Usage of CLI for ingestion

```shell
pip install 'acryl-datahub[datahub-rest]' # install the required plugin
datahub ingest -c ./examples/recipes/mssql_to_datahub.dhub.yml
```

The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to validate that the ingestion recipe is producing the desired metadata events before ingesting them into DataHub.

```shell
# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n
```

The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source.
This option helps with quick end-to-end smoke testing of the ingestion recipe.

```shell
# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview
```

By default, `--preview` creates 10 workunits. If you wish to try producing more workunits, you can use the `--preview-workunits` option:

```shell
# Preview 20 workunits without sending anything to the sink
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview --preview-workunits=20
```

#### Reporting

By default, the CLI sends an ingestion report to DataHub, which allows you to see the result of all CLI-based ingestion in the UI. This can be turned off with the `--no-default-report` flag.

```shell
# Running ingestion with reporting to DataHub turned off
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml --no-default-report
```

The reports include the recipe that was used for ingestion. This can be turned off by adding an additional section to the ingestion recipe.

```yaml
source:
  # source configs

sink:
  # sink configs

# Add configuration for the datahub reporter
reporting:
  - type: datahub
    config:
      report_recipe: false

# Optional log to put failed JSONs into a file
# Helpful in case you are trying to debug some issue with specific ingestion failing
failure_log:
  enabled: false
  log_config:
    filename: ./path/to/failure.json
```

#### Deploying and scheduling ingestion to the UI

The `deploy` subcommand of the `ingest` command tree allows users to upload their recipes and schedule them on the server.

```shell
datahub ingest deploy -n <user friendly name for ingestion> -c recipe.yaml
```

By default, no schedule is set unless explicitly configured with the `--schedule` parameter. Schedule timezones are UTC by default and can be overridden with the `--time-zone` flag.

```shell
datahub ingest deploy -n test --schedule "0 * * * *" --time-zone "Europe/London" -c recipe.yaml
```

## Transformations

If you'd like to modify data before it reaches the ingestion sinks (for instance, adding additional owners or tags), you can use a transformer to write your own module and integrate it with DataHub. Transformers require extending the recipe with a new section to describe the transformers that you want to run.

For example, a pipeline that ingests metadata from MSSQL and applies a default "important" tag to all datasets is described below:

```yaml
# A recipe to ingest metadata from MSSQL and apply default tags to all tables
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

transformers: # an array of transformers applied sequentially
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:Important"

# default sink, no config needed
```

Check out the [transformers guide](./docs/transformer/intro.md) to learn more about how you can create really flexible pipelines for processing metadata using Transformers!

## Using as a library (SDK)

In some cases, you might want to construct Metadata events directly and use programmatic ways to emit that metadata to DataHub. In this case, take a look at the [Python emitter](./as-a-library.md) and the [Java emitter](../metadata-integration/java/as-a-library.md) libraries which can be called from your own code.
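
As a rough illustration of the Python emitter (the server address, dataset name and description below are made-up placeholders):

```python
# Illustrative sketch: emit a single dataset-properties aspect to DataHub over REST.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

event = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="db_name.my_table", env="PROD"),
    aspect=DatasetPropertiesClass(description="Table emitted from custom code."),
)
emitter.emit(event)
```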

### Programmatic Pipeline

In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it:

- [programmatic_pipeline.py](./examples/library/programatic_pipeline.py) - a basic mysql to REST programmatic pipeline.

### Push-based Integration

Push-based integrations allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible.

Examples of push-based integrations include [Airflow](../docs/lineage/airflow.md), [Spark](../metadata-integration/java/spark-lineage/README.md), [Great Expectations](./integration_docs/great-expectations.md) and [Protobuf Schemas](../metadata-integration/java/datahub-protobuf/README.md). These give you low-latency metadata integration from the "active" agents in your data ecosystem.

### Pull-based Integration

Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others.
This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack.

Review comment: this line doesn't make sense anymore

Reply: done!

## Core Concepts

yoonhyejin marked this conversation as resolved.

The following are the core concepts related to ingestion:

- [Sources](source_overview.md): Data systems from which metadata is extracted (e.g. BigQuery, MySQL)
- [Sinks](sink_overview.md): Destinations for metadata (e.g. File, DataHub)
- [Recipe](recipe_overview.md): The main configuration for ingestion, in the form of a .yaml file (a minimal example is sketched below)
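
To give a feel for how these fit together, a minimal recipe might look like the following sketch (the source type and connection details are placeholders):

```yaml
# A recipe ties a source (where to pull metadata from) to a sink (where to send it).
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: user
    password: ${MYSQL_PASSWORD}

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```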

## Developing

See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./docs/transformer/intro.md).

## Compatibility

DataHub server uses a 3-digit versioning scheme, while the CLI uses a 4-digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version. We do this because CLI releases happen at a much higher frequency than server releases, usually every few days vs twice a month.

For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.

For more advanced guides, please refer to the following:

- [Developing on Metadata Ingestion](./developing.md)
- [Adding a Metadata Ingestion Source](./adding-source.md)
- [Using Transformers](./docs/transformer/intro.md)

Review comment: let's just make these one-line sentences integrated into the bulleted list above

Reply: done!