
README improvements & fixes #283

Open · wants to merge 1 commit into base: master
49 changes: 43 additions & 6 deletions README.md
@@ -25,7 +25,11 @@ To build this project, please execute:
```
mvn package -DskipTests
```

`mvn package` will assemble all the required dependencies and package them into an uber jar:
**Author:** Is this connector supposed to be always built from source, or are ready-made uber jar downloads available somewhere?


    spark-atlas-connector-assembly/target/spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar

(`spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar` is a thin jar without dependencies)
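
For instance, a quick way to build and confirm the uber jar is in place (a sketch; the exact version in the file name may differ in your checkout):

```shell
# Build the connector (tests skipped) and check that the assembly jar was produced
mvn package -DskipTests
ls spark-atlas-connector-assembly/target/spark-atlas-connector-assembly-*.jar
```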

Create Atlas models
===================
@@ -38,26 +42,59 @@ Please copy `1100-spark_model.json` to `<ATLAS_HOME>/models/1000-Hadoop` directory
How To Use
==========

The connector itself is configured with `atlas-application.properties`.

To get started, you can copy the `atlas-application.properties` from your Atlas server.
**Author:** I don't know though – is this the recommended way?
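
For instance, a sketch of pulling the file from the Atlas host (the host name and `<ATLAS_HOME>` are placeholders; depending on the installation the file may live elsewhere, e.g. under `/etc/atlas/conf`):

```shell
# Copy the Atlas client config from the Atlas host into the current directory
# (replace your-atlas-host and <ATLAS_HOME> with real values)
scp your-atlas-host:<ATLAS_HOME>/conf/atlas-application.properties .
```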


## Quick start with Atlas rest client

Modify your copy of `atlas-application.properties` as shown below.

Set this:

    atlas.client.type=rest

Add credentials. These are the defaults for a vanilla Atlas server installation:

    atlas.client.username=admin
    atlas.client.password=admin

If your Atlas server is not on the same host where your Spark job runs:
- Replace `atlas.rest.address=http://localhost:21000` with `http://your-atlas-host:21000`
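
Putting those pieces together, a minimal `atlas-application.properties` for the rest client could look like this (the host name is a placeholder; the credentials are only the vanilla defaults mentioned above):

```
atlas.client.type=rest
atlas.rest.address=http://your-atlas-host:21000
atlas.client.username=admin
atlas.client.password=admin
```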

For production use, consider using `atlas.client.type=kafka` instead.
**Author:** For me kafka didn't work out of the box. Maybe I would've had to modify some other properties to set the host names right. Anyway, rest should work in all configurations and can be handy when people want to quickly try out this connector.


## Spark config

To use SAC in a Spark job, you need to make the uber jar available to the Spark Driver and set these Spark confs:

```
spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker
```

For example, to run `spark-shell`:

```shell
bin/spark-shell --jars spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
    --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
    --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
    --conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker
```

**Author:** Here I'm fixing this to use the name of the uber jar. Isn't that generally the jar to be used?

If you're using Spark with `--deploy-mode=client` (which is the default):
- Make sure that `atlas-application.properties` is in the Driver's classpath
- For example, place it at `<SPARK_HOME>/conf/`.
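
A one-line sketch of that step (assuming the properties file is in your current directory):

```shell
cp atlas-application.properties $SPARK_HOME/conf/
```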

If you're using Spark with `--deploy-mode=cluster`:
- Add this Spark argument to copy `atlas-application.properties` to all containers:

`--files atlas-application.properties`

For `--jars` (and `--files`, if applicable), use the full path to the file.
- For example, use an `hdfs://` path for the `spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar` if you store the jar on hdfs, etc.
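
Putting the cluster-mode pieces together, a `spark-submit` sketch might look like the following (the YARN master, the `hdfs://` locations, and the application class/jar are placeholders, not part of this project):

```shell
bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --jars hdfs:///apps/sac/spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
    --files hdfs:///apps/sac/atlas-application.properties \
    --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
    --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
    --conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker \
    --class com.example.MySparkApp \
    hdfs:///apps/my-spark-app.jar
```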

Spark Atlas Connector supports two types of Atlas clients, "kafka" and "rest". You can choose the client type by setting `atlas.client.type` to either `kafka` or `rest`.
The default value is `kafka`, which provides a stable and secure way of publishing changes. Atlas ships with an embedded Kafka instance, so you can try it out in a test environment, but you are encouraged to use an external Kafka cluster in production. If you don't have a Kafka cluster in production, you may want to set the client type to `rest`.
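
If you stick with the default `kafka` client, the relevant properties are typically along these lines (a sketch based on a standard Atlas `atlas-application.properties`; the host and ports are placeholders and, as the author's comment above notes, additional properties may be needed in your environment):

```
atlas.client.type=kafka
atlas.kafka.zookeeper.connect=your-atlas-host:9026
atlas.kafka.bootstrap.servers=your-atlas-host:9027
```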