feat: add ingestion overview pages #9210

Merged (14 commits, Nov 20, 2023)
docs-website/sidebars.js (2 changes: 1 addition & 1 deletion)

@@ -120,7 +120,7 @@ module.exports = {
```diff
 },
 {
   type: "category",
-  label: "Sinks",
+  label: "Transformers",
   link: {
     type: "doc",
     id: "metadata-ingestion/docs/transformer/intro",
```
docs/cli.md (61 changes: 61 additions & 0 deletions)

@@ -99,6 +99,36 @@ Command Options:

```
  --strict-warnings          If enabled, ingestion runs with warnings will yield a non-zero error code
  --test-source-connection   When set, ingestion will only test the source connection details from the recipe
```
#### ingest --dry-run

The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to validate that the
ingestion recipe is producing the desired metadata events before ingesting them into DataHub.

```shell
# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n
```

#### ingest --preview

The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source.
This option helps with quick end-to-end smoke testing of the ingestion recipe.

```shell
# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview
```

By default, `--preview` processes 10 workunits. If you wish to produce more, use the `--preview-workunits` option:

```shell
# Preview 20 workunits without sending anything to sink
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview --preview-workunits=20
```

#### ingest deploy

@@ -115,6 +145,37 @@

To update an existing recipe, use the `--urn` parameter to specify the id of the recipe to update.

**Note:** Updating a recipe replaces the existing options with what was specified in the CLI command. For example, not specifying a schedule in the update command will remove the schedule from the recipe.
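
For illustration, creating and then updating a deployed recipe might look like the sketch below (the recipe name and URN value are placeholders):

```shell
# Upload a recipe and schedule it on the server (cron syntax, UTC by default)
datahub ingest deploy -n "my mysql recipe" --schedule "0 * * * *" -c recipe.yaml

# Update it by URN; options omitted here (e.g. --schedule) are removed from the stored recipe
datahub ingest deploy --urn "urn:li:dataHubIngestionSource:<id>" -c recipe.yaml
```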

#### ingest --no-default-report
By default, the CLI sends an ingestion report to DataHub, which allows you to see the result of all CLI-based ingestion in the UI. This can be turned off with the `--no-default-report` flag.

```shell
# Running ingestion with reporting to DataHub turned off
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml --no-default-report
```

The reports include the recipe that was used for ingestion. This can be turned off by adding an additional section to the ingestion recipe.

```yaml
source:
  # source configs

sink:
  # sink configs

# Add configuration for the datahub reporter
reporting:
  - type: datahub
    config:
      report_recipe: false

# Optional log to put failed JSONs into a file
# Helpful in case you are trying to debug some issue with specific ingestion failing
failure_log:
  enabled: false
  log_config:
    filename: ./path/to/failure.json
```

### init

The init command is used to tell `datahub` about where your DataHub instance is located. The CLI will point to localhost DataHub by default.
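
For example (a sketch; `init` walks through interactive prompts for the server host and token):

```shell
# Point the CLI at your DataHub instance via the interactive wizard
datahub init
# Alternatively, set these environment variables instead
export DATAHUB_GMS_URL=http://localhost:8080
export DATAHUB_GMS_TOKEN=<your_token>
```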
metadata-ingestion/README.md (27 changes: 5 additions & 22 deletions)

@@ -8,26 +8,9 @@

Please see our **[Integrations page](https://datahubproject.io/integrations)** to browse our ingestion sources and filter on their features.

DataHub offers three methods for data ingestion:

- [UI Ingestion](../docs/ui-ingestion.md)
- [CLI Ingestion guide](cli-ingestion.md)
- SDK-based ingestion: [Python Emitter](./as-a-library.md), [Java emitter](../metadata-integration/java/as-a-library.md)

## Types of Integration

@@ -38,13 +21,13 @@

Integration can be divided into two concepts based on the method:

### Push-based Integration

Push-based integrations allow you to emit metadata directly from your data systems when metadata changes.
Examples of push-based integrations include [Airflow](../docs/lineage/airflow.md), [Spark](../metadata-integration/java/spark-lineage/README.md), [Great Expectations](./integration_docs/great-expectations.md) and [Protobuf Schemas](../metadata-integration/java/datahub-protobuf/README.md). This allows you to get low-latency metadata integration from the "active" agents in your data ecosystem.
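
As a sketch of what push-based emission can look like with the Python emitter mentioned above (the server address, platform, and dataset name are assumptions for illustration):

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Connect to a DataHub instance (address is an assumption)
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Push a small metadata update for one dataset as soon as it changes
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="db.table", env="PROD"),
    aspect=DatasetPropertiesClass(description="Updated by our pipeline"),
)
emitter.emit(mcp)
```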

### Pull-based Integration

Pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner.
Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others.
This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack.

## Core Concepts
metadata-ingestion/cli-ingestion.md (101 changes: 33 additions & 68 deletions)

@@ -1,94 +1,59 @@

# CLI Ingestion

## Installing CLI

Make sure you have installed the DataHub CLI before following this guide.

```shell
# Requires Python 3.7+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
# validate that the install was successful
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version
```

Check out the [CLI Installation Guide](../docs/cli.md#installation) for more details.

After that, install the required plugin for the ingestion.

```shell
pip install 'acryl-datahub[datahub-rest]' # install the required plugin
```

Check out the [alternative installation options](../docs/cli.md#alternate-installation-options) for more details.

## Configuring a Recipe

Create a recipe.yml file that defines the source and sink for metadata, as shown below.

```yaml
# my_recipe.yml
source:
  type: <source_name>
  config:
    option_1: <value>
    ...

sink:
  type: <sink_type_name>
  config:
    ...
```

For more information and examples on configuring recipes, please refer to [Recipes](recipe_overview.md).

## Ingesting Metadata

You can run ingestion using `datahub ingest` as shown below.

```shell
datahub ingest -c <path_to_recipe_file.yml>
```

## Reference

Refer to the following pages for advanced guides on CLI ingestion.

- [Reference for the `datahub ingest` command](../docs/cli.md#ingest)
- [UI Ingestion Guide](../docs/ui-ingestion.md)

:::tip Compatibility
DataHub server uses a 3 digit versioning scheme, while the CLI uses a 4 digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version.
We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month.

For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.
:::
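
For instance, pinning the CLI to a 0.10.0 server's patch series might look like this (version numbers are illustrative):

```shell
# Check the installed CLI version
datahub version
# Pin the CLI to the 4 digit series matching a 0.10.0 server
pip install 'acryl-datahub==0.10.0.*'
```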
metadata-ingestion/recipe_overview.md (44 changes: 42 additions & 2 deletions)

@@ -1,7 +1,15 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Recipes

A recipe is the main configuration file for metadata ingestion. It tells our ingestion scripts where to pull data from (source) and where to put it (sink).

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/sources-sinks.png"/>
</p>


## Configuring a Recipe

The basic form of the recipe file consists of:
@@ -33,16 +41,48 @@

A number of recipes are included in the [examples/recipes](./examples/recipes) directory.
:::note
One recipe file can have only one source and one sink. If you want multiple sources, you will need multiple recipe files.
:::
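
As a concrete illustration, a minimal recipe pulling from MySQL into a DataHub REST endpoint might look like the sketch below (connection details and credentials are placeholders):

```yaml
# mysql_to_datahub.dhub.yml
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub_reader      # placeholder credentials
    password: ${MYSQL_PASSWORD}   # expanded from the environment

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```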

## Running a Recipe

DataHub supports running recipes via the CLI or UI.

<Tabs>
<TabItem value="cli" label="CLI" default>

Install the CLI and the plugin required for ingestion.
```shell
python3 -m pip install --upgrade acryl-datahub
pip install 'acryl-datahub[datahub-rest]'
```
Running a recipe is as simple as:

```shell
datahub ingest -c recipe.dhub.yaml
```

For a detailed guide on running recipes via CLI, please refer to [CLI Ingestion Guide](cli-ingestion.md).

</TabItem>

<TabItem value="ui" label="UI">

You can configure and run the recipe in the **Ingestion** tab in DataHub.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-tab.png"/>
</p>

* Make sure you have the **Manage Metadata Ingestion** and **Manage Secrets** privileges.
* Navigate to the **Ingestion** tab in DataHub.
* Create an ingestion source and configure the recipe via the UI.
* Hit **Execute**.

For a detailed guide on running recipes via UI, please refer to [UI Ingestion Guide](../docs/ui-ingestion.md).

</TabItem>
</Tabs>


## Advanced Configuration

### Handling Sensitive Information in Recipes

We automatically expand environment variables in the config (e.g. `${MSSQL_PASSWORD}`),
metadata-ingestion/sink_overview.md (11 changes: 10 additions & 1 deletion)

@@ -1,6 +1,11 @@
# Sinks

Sinks are **destinations for metadata**.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/docs/main/imgs/sources-sinks.png"/>
</p>

In general, the sink will be defined in the recipe after the _source_ like below.

@@ -11,13 +16,17 @@

```yaml
sink:
  config: ...
```

## Types of Sink

When configuring ingestion for DataHub, you're likely to be sending the metadata to DataHub using one of the following sinks.

- [REST (datahub-rest)](sink_docs/datahub.md#datahub-rest)
- [Kafka (datahub-kafka)](sink_docs/datahub.md#datahub-kafka)
- [File](sink_docs/file.md)

## Default Sink

Since `acryl-datahub` version `>=0.8.33.2`, the default sink is assumed to be a `datahub-rest` endpoint.
- Hosted at "http://localhost:8080" or the environment variable `${DATAHUB_GMS_URL}` if present
- With an empty auth token or the environment variable `${DATAHUB_GMS_TOKEN}` if present.
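
Making that default explicit in a recipe looks like the following sketch (the values shown restate the documented defaults):

```yaml
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080  # or ${DATAHUB_GMS_URL}
    token: ${DATAHUB_GMS_TOKEN}    # optional; an empty token is assumed otherwise
```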
metadata-ingestion/source_overview.md (28 changes: 20 additions & 8 deletions)

@@ -1,20 +1,32 @@
# Sources

Sources are **the data systems that we are extracting metadata from.**

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/sources-sinks.png"/>
</p>

In general, the source will be defined at the top of the recipe like below.
> Review comment: make the word recipe a link?

```yaml
# my_recipe.yml
source:
  type: <source_name>
  config:
    option_1: <value>
    ...
```

## Types of Source
The `Sources` tab on the left in the sidebar shows you all the sources that are available for you to ingest metadata from. For example, we have sources for [BigQuery](https://datahubproject.io/docs/generated/ingestion/sources/bigquery), [Looker](https://datahubproject.io/docs/generated/ingestion/sources/looker), [Tableau](https://datahubproject.io/docs/generated/ingestion/sources/tableau) and many others.

:::tip Find an Integration Source
See the full **[list of integrations](https://datahubproject.io/integrations)** and filter on their features.
:::

## Metadata Ingestion Source Status

We apply a Support Status to each Metadata Source to help you understand the integration reliability at a glance.
