docs(ingest/airflow): clarify docs around 1.x compat #6436

Merged
merged 1 commit on Nov 15, 2022
2 changes: 1 addition & 1 deletion docker/airflow/local_airflow.md
@@ -4,7 +4,7 @@
This document describes how you can run Airflow side-by-side with DataHub's quickstart docker images to test out Airflow lineage with DataHub.
This offers a much easier way to try out Airflow with DataHub, compared to configuring containers by hand, setting up configurations and networking connectivity between the two systems.

## Pre-requisites
## Prerequisites
- Docker: ensure that you have a working Docker installation and you have at least 8GB of memory to allocate to both Airflow and DataHub combined.
```
docker info | grep Memory
```
43 changes: 28 additions & 15 deletions docs/lineage/airflow.md
@@ -6,27 +6,32 @@ DataHub supports integration of
- DAG and Task run information as well as
- Lineage information when present

There are a few ways to enable these integrations from Airflow into DataHub.
You can use either the DataHub Airflow lineage plugin (recommended) or the Airflow lineage backend (deprecated).

## Using Datahub's Airflow lineage plugin (new)
## Using Datahub's Airflow lineage plugin

:::note

We recommend you use the lineage plugin if you are on Airflow version >= 2.0.2 or on MWAA with an Airflow version >= 2.0.2
The Airflow lineage plugin is only supported with Airflow version >= 2.0.2 or on MWAA with an Airflow version >= 2.0.2.

If you're using Airflow 1.x, we recommend using the Airflow lineage backend with acryl-datahub <= 0.9.1.0.

:::

### Setup

1. You need to install the required dependency in your Airflow environment.

```shell
pip install acryl-datahub-airflow-plugin
```

2. Disable lazy plugin load in your airflow.cfg.
2. Disable lazy plugin loading in your airflow.cfg.
On MWAA you should add this config to your [Apache Airflow configuration options](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html#configuring-2.0-airflow-override).

```yaml
core.lazy_load_plugins : False
```ini title="airflow.cfg"
[core]
lazy_load_plugins = False
```

3. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.
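
The exact commands are collapsed in this hunk; a sketch, mirroring the `airflow connections add` commands shown in the plugin README later in this diff (adjust the `--conn-host` values to your deployment):

```shell
# For REST-based:
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
# For Kafka-based (standard Kafka sink config can be passed via extras):
airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
```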
@@ -57,18 +62,19 @@ We recommend you use the lineage plugin if you are on Airflow version >= 2.0.2 o
### How to validate installation

1. In Airflow, check the Admin -> Plugins menu to confirm that the Datahub plugin is listed
2. Run an Airflow DAG and you should see in the task logs Datahub releated log messages like:
2. Run an Airflow DAG. In the task logs, you should see Datahub related log messages like:

```
Emitting Datahub ...
```

## Using Datahub's Airflow lineage backend
## Using Datahub's Airflow lineage backend (deprecated)

:::caution

The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.
For managed services like MWAA you should use the Datahub Airflow plugin as the lineage backend is not supported there
The DataHub Airflow plugin (above) is the recommended way to integrate Airflow with DataHub. For managed services like MWAA, the lineage backend is not supported and so you must use the Airflow plugin.

If you're using Airflow 1.x, we recommend using the Airflow lineage backend with acryl-datahub <= 0.9.1.0.

:::
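
For the Airflow 1.x path, a minimal sketch of the pinned install (assuming the `[airflow]` extra used in the setup below; verify the exact pin against your environment):

```shell
pip install 'acryl-datahub[airflow]<=0.9.1.0'
```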

@@ -77,13 +83,13 @@ For managed services like MWAA you should use the Datahub Airflow plugin as the
If you are looking to run Airflow and DataHub using docker locally, follow the guide [here](../../docker/airflow/local_airflow.md). Otherwise proceed to follow the instructions below.
:::

## Setting up Airflow to use DataHub as Lineage Backend
### Setting up Airflow to use DataHub as Lineage Backend

1. You need to install the required dependency in your Airflow environment. See <https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend>

```shell
```shell
pip install acryl-datahub[airflow]
```
```

2. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.

@@ -96,7 +102,7 @@ If you are looking to run Airflow and DataHub using docker locally, follow the g

3. Add the following lines to your `airflow.cfg` file.

```ini
```ini title="airflow.cfg"
[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
@@ -114,8 +120,9 @@ If you are looking to run Airflow and DataHub using docker locally, follow the g
- `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
- `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
- `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
- `capture_executions` (defaults to false): If true, it captures task runs as DataHub DataProcessInstances. **This feature only works with Datahub GMS version v0.8.33 or greater.**
- `capture_executions` (defaults to false): If true, it captures task runs as DataHub DataProcessInstances.
- `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.
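
Putting the options above together, a full configuration might look like the following sketch; the `datahub_conn_id` value is an assumption and should match the hook connection created in step 2:

```ini title="airflow.cfg"
[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "cluster": "prod",
    "capture_ownership_info": true,
    "capture_tags_info": true,
    "capture_executions": false,
    "graceful_exceptions": true }
```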

4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or at [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html); a minimal sketch also follows this list.
5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
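
To illustrate step 4, a minimal sketch loosely modeled on the demo DAGs linked above; the platform and table names are placeholders, and the `Dataset` helper is assumed to come from `datahub_provider.entities`:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

with DAG("datahub_lineage_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    transform = BashOperator(
        task_id="run_transform",
        bash_command="echo 'run your data tooling here'",
        # Upstream table(s) this task reads (placeholder names):
        inlets=[Dataset("snowflake", "mydb.schema.table_a")],
        # Downstream table(s) this task writes (placeholder name):
        outlets=[Dataset("snowflake", "mydb.schema.table_c")],
    )
```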

@@ -126,3 +133,9 @@ Take a look at this sample DAG:
- [`lineage_emission_dag.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.

In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 1 above for details.
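
A rough sketch of explicit emission with the `DatahubEmitterOperator`; the import paths, the `mces` argument, and the URN helpers are assumptions to verify against `lineage_emission_dag.py`:

```python
import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

emit_lineage = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",  # the REST hook configured earlier
    mces=[
        builder.make_lineage_mce(
            # Upstream dataset URNs (placeholders):
            [builder.make_dataset_urn("snowflake", "mydb.schema.table_a")],
            # Downstream dataset URN (placeholder):
            builder.make_dataset_urn("snowflake", "mydb.schema.table_c"),
        )
    ],
)
```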

## Additional references

Related Datahub videos:
- [Airflow Lineage](https://www.youtube.com/watch?v=3wiaqhb8UR0)
- [Airflow Run History in DataHub](https://www.youtube.com/watch?v=YpUOqDU5ZYg)
65 changes: 1 addition & 64 deletions metadata-ingestion-modules/airflow-plugin/README.md
@@ -1,67 +1,4 @@
# Datahub Airflow Plugin

## Capabilities
See [the DataHub Airflow docs](https://datahubproject.io/docs/lineage/airflow) for details.

DataHub supports integration of

- Airflow Pipeline (DAG) metadata
- DAG and Task run information
- Lineage information when present

## Installation

1. You need to install the required dependency in your airflow.

```shell
pip install acryl-datahub-airflow-plugin
```

::: note

We recommend you use the lineage plugin if you are on Airflow version >= 2.0.2 or on MWAA with an Airflow version >= 2.0.2
:::

2. Disable lazy plugin load in your airflow.cfg

```yaml
core.lazy_load_plugins : False
```

3. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.

```shell
# For REST-based:
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
# For Kafka-based (standard Kafka sink config can be passed via extras):
airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
```

4. Add your `datahub_conn_id` and/or `cluster` to your `airflow.cfg` file if it is not align with the default values. See configuration parameters below

**Configuration options:**

|Name | Default value | Description |
|---|---|---|
| datahub.datahub_conn_id | datahub_rest_deafault | The name of the datahub connection you set in step 1. |
| datahub.cluster | prod | name of the airflow cluster |
| datahub.capture_ownership_info | true | If true, the owners field of the DAG will be capture as a DataHub corpuser. |
| datahub.capture_tags_info | true | If true, the tags field of the DAG will be captured as DataHub tags. |
| datahub.graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.|

5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
6. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.

## How to validate installation

1. Go and check in Airflow at Admin -> Plugins menu if you can see the Datahub plugin
2. Run an Airflow DAG and you should see in the task logs Datahub releated log messages like:

```
Emitting Datahub ...
```

## Additional references

Related Datahub videos:
[Airflow Lineage](https://www.youtube.com/watch?v=3wiaqhb8UR0)
[Airflow Run History in DataHub](https://www.youtube.com/watch?v=YpUOqDU5ZYg)