diff --git a/docs/lineage/airflow.md b/docs/lineage/airflow.md index 9d838ef8a4404..2d7707637e2d1 100644 --- a/docs/lineage/airflow.md +++ b/docs/lineage/airflow.md @@ -17,7 +17,7 @@ There are two actively supported implementations of the plugin, with different Air | Approach | Airflow Version | Notes | | --------- | --------------- | --------------------------------------------------------------------------- | -| Plugin v2 | 2.3+ | Recommended. Requires Python 3.8+ | +| Plugin v2 | 2.3.4+ | Recommended. Requires Python 3.8+ | | Plugin v1 | 2.1+ | No automatic lineage extraction; may not extract lineage if the task fails. | If you're using Airflow older than 2.1, it's possible to use the v1 plugin with older versions of `acryl-datahub-airflow-plugin`. See the [compatibility section](#compatibility) for more details. @@ -66,7 +66,7 @@ enabled = True # default ``` | Name | Default value | Description | -|----------------------------|----------------------|------------------------------------------------------------------------------------------| +| -------------------------- | -------------------- | ---------------------------------------------------------------------------------------- | | enabled | true | If the plugin should be enabled. | | conn_id | datahub_rest_default | The name of the datahub rest connection. 
| | cluster | prod | name of the airflow cluster, this is equivalent to the `env` of the instance | @@ -132,7 +132,7 @@ conn_id = datahub_rest_default # or datahub_kafka_default ``` | Name | Default value | Description | -|----------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| -------------------------- | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | enabled | true | If the plugin should be enabled. | | conn_id | datahub_rest_default | The name of the datahub connection you set in step 1. | | cluster | prod | name of the airflow cluster | @@ -240,6 +240,7 @@ See this [example PR](https://github.com/datahub-project/datahub/pull/10452) whi There might be a case where DAGs are removed from Airflow but the corresponding pipelines and tasks are still present in DataHub; let's call such pipelines and tasks `obsolete pipelines and tasks`. Following are the steps to clean them up from DataHub: + - create a DAG named `Datahub_Cleanup`, e.g. 
```python @@ -263,8 +264,8 @@ with DAG( ) ``` -- ingest this DAG, and it will remove all the obsolete pipelines and tasks from the Datahub based on the `cluster` value set in the `airflow.cfg` +- ingest this DAG, and it will remove all the obsolete pipelines and tasks from the Datahub based on the `cluster` value set in the `airflow.cfg` ## Get all dataJobs associated with a dataFlow @@ -274,12 +275,7 @@ If you are looking to find all tasks (aka DataJobs) that belong to a specific pi query { dataFlow(urn: "urn:li:dataFlow:(airflow,db_etl,prod)") { childJobs: relationships( - input: { - types: ["IsPartOf"], - direction: INCOMING, - start: 0, - count: 100 - } + input: { types: ["IsPartOf"], direction: INCOMING, start: 0, count: 100 } ) { total relationships {
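The GraphQL query in the last hunk can also be sent programmatically. A minimal sketch, assuming a DataHub GMS instance reachable at `http://localhost:8080/api/graphql` (the URL and auth header are assumptions; adjust them for your deployment):

```python
# Sketch: list all DataJobs of a DataFlow via DataHub's GraphQL API.
# DATAHUB_GRAPHQL_URL is a hypothetical local address -- check your deployment.
DATAHUB_GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Same query as in the docs, parameterized by the dataFlow URN.
CHILD_JOBS_QUERY = """
query ($urn: String!) {
  dataFlow(urn: $urn) {
    childJobs: relationships(
      input: { types: ["IsPartOf"], direction: INCOMING, start: 0, count: 100 }
    ) {
      total
      relationships {
        entity { urn }
      }
    }
  }
}
"""

def build_child_jobs_query(dataflow_urn: str) -> dict:
    """Build the GraphQL request payload for listing a DataFlow's child jobs."""
    return {"query": CHILD_JOBS_QUERY, "variables": {"urn": dataflow_urn}}

payload = build_child_jobs_query("urn:li:dataFlow:(airflow,db_etl,prod)")
# Send it with any HTTP client, e.g.:
#   requests.post(DATAHUB_GRAPHQL_URL, json=payload,
#                 headers={"Authorization": "Bearer <token>"})
```

Using a `$urn` variable instead of inlining the URN keeps the query reusable across pipelines and avoids string-escaping issues in the URN.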