docs(spark): add configuration instructions for databricks #6206

Merged
93 changes: 79 additions & 14 deletions metadata-integration/java/spark-lineage/README.md
@@ -4,9 +4,9 @@ To integrate Spark with DataHub, we provide a lightweight Java agent that listen

## Configuring Spark agent

The Spark agent can be configured using a config file or while creating a spark Session.
The Spark agent can be configured using a config file or while creating a Spark Session. If you are using Spark on Databricks, refer to [Configuration Instructions for Databricks](#configuration-instructions--databricks).

## Before you begin: Versions and Release Notes
### Before you begin: Versions and Release Notes

Versioning of the jar artifact will follow the semantic versioning of the main [DataHub repo](https://github.com/datahub-project/datahub) and release notes will be available [here](https://github.com/datahub-project/datahub/releases).
Always check [the Maven central repository](https://search.maven.org/search?q=a:datahub-spark-lineage) for the latest released version.
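
If it helps with automation, the latest released version can also be queried from the Maven Central search API; the endpoint and parameters below follow the publicly documented search API and are shown only as a convenience:

```sh
# Query Maven Central for the most recent datahub-spark-lineage release (JSON output)
curl -s 'https://search.maven.org/solrsearch/select?q=a:datahub-spark-lineage&rows=1&wt=json'
```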
@@ -16,18 +16,18 @@ Always check [the Maven central repository](https://search.maven.org/search?q=a:
When running jobs using spark-submit, the agent needs to be configured in the config file.

```text
#Configuring datahub spark agent jar
#Configuring DataHub spark agent jar
spark.jars.packages io.acryl:datahub-spark-lineage:0.8.23
spark.extraListeners datahub.spark.DatahubSparkListener
spark.datahub.rest.server http://localhost:8080
```
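
If editing the spark-defaults config file is not convenient, the same settings can typically be passed directly on the spark-submit command line; the application script name below is a placeholder:

```sh
# Equivalent configuration passed as spark-submit options (my_spark_job.py is a placeholder)
spark-submit \
  --packages io.acryl:datahub-spark-lineage:0.8.23 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080" \
  my_spark_job.py
```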

#### Configuration for Amazon EMR
### Configuration Instructions: Amazon EMR

Set the following spark-defaults configuration properties as stated [here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html):

```
spark.jars.packages io.acryl:datahub-spark-lineage:0.8.23
```text
spark.jars.packages io.acryl:datahub-spark-lineage:0.8.23
spark.extraListeners datahub.spark.DatahubSparkListener
spark.datahub.rest.server https://your_datahub_host/gms
#If you have authentication set up, then you also need to specify the DataHub access token
@@ -64,7 +64,63 @@ spark = SparkSession.builder()
.getOrCreate();
```
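
For interactive shells and notebooks, a similar effect can usually be achieved by supplying the same options at launch time; the values below mirror the earlier example and are placeholders:

```sh
# Launch an interactive PySpark shell with the DataHub listener configured
pyspark \
  --packages io.acryl:datahub-spark-lineage:0.8.23 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://localhost:8080"
```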

### Configuration details
### Configuration Instructions: Databricks

The Spark agent can be configured using Databricks Cluster [Spark configuration](https://docs.databricks.com/clusters/configure.html#spark-configuration) and [Init script](https://docs.databricks.com/clusters/configure.html#init-scripts).

[Databricks Secrets](https://docs.databricks.com/security/secrets/secrets.html) can be leveraged to store sensitive information like tokens.

- Download the `datahub-spark-lineage` jar from [the Maven central repository](https://search.maven.org/search?q=a:datahub-spark-lineage); a sample download command is shown after the init script below.
- Create `init.sh` with the following content:

```sh
#!/bin/bash
cp /dbfs/datahub/datahub-spark-lineage*.jar /databricks/jars
```
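
- The jar from the first step can be fetched straight from Maven Central if preferred; the version below is illustrative, so substitute the latest release.

```sh
# Version is illustrative; check Maven Central for the latest release
curl -L -o datahub-spark-lineage-0.8.23.jar \
  https://repo1.maven.org/maven2/io/acryl/datahub-spark-lineage/0.8.23/datahub-spark-lineage-0.8.23.jar
```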

- Install and configure [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html).
- Copy the jar and init script to the Databricks File System (DBFS) using the Databricks CLI.

```sh
databricks fs mkdirs dbfs:/datahub
databricks fs cp --overwrite datahub-spark-lineage*.jar dbfs:/datahub
databricks fs cp --overwrite init.sh dbfs:/datahub
```
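
- Optionally, list the directory to confirm both files are in DBFS.

```sh
# Verify the jar and init script were uploaded
databricks fs ls dbfs:/datahub
```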

- Open the Databricks Cluster configuration page. Click the **Advanced Options** toggle. Click the **Spark** tab. Add the below configurations under `Spark Config`.

```text
spark.extraListeners datahub.spark.DatahubSparkListener
spark.datahub.rest.server http://localhost:8080
spark.datahub.databricks.cluster cluster-name<any preferred cluster identifier>
```

- Click the **Init Scripts** tab. Set cluster init script as `dbfs:/datahub/init.sh`.

- Configuring DataHub authentication token

  - Add the below config in the cluster's Spark config.

```text
spark.datahub.rest.token <token>
```

  - Alternatively, Databricks secrets can be used to secure the token.
    - Create a secret using the Databricks CLI.

```sh
databricks secrets create-scope --scope datahub --initial-manage-principal users
databricks secrets put --scope datahub --key rest-token  # opens an editor; enter the token value and save
databricks secrets list --scope datahub
```

    - Add the following in the Spark config:

```text
spark.datahub.rest.token {{secrets/datahub/rest-token}}
```

## Configuration Options

| Field | Required | Default | Description |
|-------------------------------------------------|----------|---------|-------------------------------------------------------------------------|
@@ -86,15 +142,23 @@ As of current writing, the Spark agent produces metadata related to the Spark jo
- A pipeline is created per Spark <master, appName>.
- A task is created per unique Spark query execution within an app.

For Spark on Databricks,

- A pipeline is created per
  - cluster_identifier: specified with spark.datahub.databricks.cluster
  - applicationID: a new Spark applicationID is created on every restart of the cluster.
- A task is created per unique Spark query execution.

### Custom properties & how they relate to the Spark UI

The following custom properties in pipelines and tasks relate to the Spark UI:

- appName and appId in a pipeline can be used to determine the Spark application
- description and SQLQueryId in a task can be used to determine the Query Execution within the application on the SQL tab of Spark UI
- Other custom properties of pipelines and tasks capture the start and end times of execution etc.
- The query plan is captured in the *queryPlan* property of a task.

Other custom properties of pipelines and tasks capture the start and end times of execution etc.
The query plan is captured in the *queryPlan* property of a task.
For Spark on Databricks, pipeline start time is the cluster start time.

### Spark versions supported

@@ -110,8 +174,9 @@ This initial release has been tested with the following environments:
- spark-submit of Python/Java applications to local and remote servers
- Jupyter notebooks with pyspark code
- Standalone Java applications
- Databricks Standalone Cluster

Note that testing for other environments such as Databricks is planned in near future.
Testing with Databricks Standard and High-concurrency clusters has not been done yet.

### Spark commands supported

@@ -157,27 +222,27 @@ YY/MM/DD HH:mm:ss INFO SparkContext: Registered listener datahub.spark.DatahubSp

On application start

```
```text
YY/MM/DD HH:mm:ss INFO DatahubSparkListener: Application started: SparkListenerApplicationStart(AppName,Some(local-1644489736794),1644489735772,user,None,None)
YY/MM/DD HH:mm:ss INFO McpEmitter: REST Emitter Configuration: GMS url <rest.server>
YY/MM/DD HH:mm:ss INFO McpEmitter: REST Emitter Configuration: Token XXXXX
```

On pushing data to server

```
```text
YY/MM/DD HH:mm:ss INFO McpEmitter: MetadataWriteResponse(success=true, responseContent={"value":"<URN>"}, underlyingResponse=HTTP/1.1 200 OK [Date: day, DD month year HH:mm:ss GMT, Content-Type: application/json, X-RestLi-Protocol-Version: 2.0.0, Content-Length: 97, Server: Jetty(9.4.46.v20220331)] [Content-Length: 97,Chunked: false])
```

On application end

```
```text
YY/MM/DD HH:mm:ss INFO DatahubSparkListener: Application ended : AppName AppID
```

- To enable debugging logs, add the below configuration to the log4j.properties file:

```
```properties
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
```
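
One way to apply such a file with spark-submit, on Spark builds that still use log4j 1.x, is to point the driver at it explicitly; the file path and application script name below are placeholders:

```sh
# Point the driver at a custom log4j.properties with DEBUG logging enabled
spark-submit \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --packages io.acryl:datahub-spark-lineage:0.8.23 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  my_spark_job.py
```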