docs: minor type fixes
vbarua committed Oct 3, 2024
1 parent f7759a3 commit 4de0da9
Showing 1 changed file with 9 additions and 20 deletions.
29 changes: 9 additions & 20 deletions examples/substrait-spark/README.md
@@ -4,7 +4,7 @@ The Substrait-Spark library allows Substrait plans to convert to and from Spark

## How does this work in practice?

Once Spark SQL and Spark DataFrame APIs queries have been created, Spark's optimized query plan can be used generate Substrait plans; and Substrait Plans can be executed on a Spark cluster. Below is a description of how to use this library; there are two sample datasets included for demonstration.
Once Spark SQL and Spark DataFrame APIs queries have been created, Spark's optimized query plan can be used to generate Substrait plans; and Substrait Plans can be executed on a Spark cluster. Below is a description of how to use this library; there are two sample datasets included for demonstration.

The most commonly used logical relations are supported, including those generated from all the TPC-H queries, but there are currently some gaps in support that prevent all the TPC-DS queries from being translatable.
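
For orientation, the producer side of that round trip can be sketched as below. This is a minimal sketch only, assuming the library's `ToSubstraitRel` and `PlanProtoConverter` converters and an existing `Dataset<Row>` named `result`; see the linked example sources for the exact code used in this repository.

```java
// Minimal sketch (assumed converter APIs): take the optimized logical plan of an
// existing Dataset<Row> `result`, convert it to a Substrait plan, and write the
// serialised protobuf bytes to a file.
io.substrait.spark.logical.ToSubstraitRel toSubstrait =
    new io.substrait.spark.logical.ToSubstraitRel();
io.substrait.plan.Plan plan =
    toSubstrait.convert(result.queryExecution().optimizedPlan());

byte[] buffer = new io.substrait.plan.PlanProtoConverter().toProto(plan).toByteArray();
java.nio.file.Files.write(java.nio.file.Paths.get("spark_sql_substrait.plan"), buffer);
```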

@@ -25,13 +25,13 @@ To run these you will need:

- Java 17 or greater
- Docker to start a test Spark Cluster
- you could use your own cluster, but would need to adjust file locations defined in [SparkHelper](./app/src/main/java/io/substrait/examples/SparkHelper.java)
- You could use your own cluster, but would need to adjust file locations defined in [SparkHelper](./app/src/main/java/io/substrait/examples/SparkHelper.java)
- The [just task runner](https://github.com/casey/just#installation) is optional, but very helpful to run the bash commands
- [Two datafiles](./app/src/main/resources/) are provided (CSV format)

To build using the `substrait-spark` library yourself, add the dependency from the [mvn repository](https://mvnrepository.com/artifact/io.substrait/spark):

Using maven:
Using Maven:
```xml
<!-- https://mvnrepository.com/artifact/io.substrait/spark -->
<dependency>
    <groupId>io.substrait</groupId>
    <artifactId>spark</artifactId>
    <version>0.36.0</version>
</dependency>
```

Using Gradle (groovy)
Using Gradle (Groovy)
```groovy
// https://mvnrepository.com/artifact/io.substrait/spark
implementation 'io.substrait:spark:0.36.0'
```

### Setup configuration

Firstly the application needs to be built; this is a simple Java application. As well issuing the `gradle` build command it also creates two directories `_apps` and `_data`. The JAR file and will be copied to the `_apps` directory and the datafiles to the `_data`. Note that the permissions on the `_data` directory are set to group write - this allows the spark process in the docker container to write the output plan
Firstly the application needs to be built; this is a simple Java application. As well as building the code, the `gradle` build command creates two directories, `_apps` and `_data`. The JAR file will be copied to the `_apps` directory and the datafiles to `_data`. Note that the permissions on the `_data` directory are set to group write - this allows the Spark process in the Docker container to write the output plan.

To run using `just`
```
@@ -152,15 +152,14 @@ If we were to just run this as-is, the output table would be below.

### Logical and Optimized Query Plans

THe next step is to look at the logical and optimised query plans that Spark has constructed.
The next step is to look at the logical and optimised query plans that Spark has constructed.

```java
LogicalPlan logical = result.logicalPlan();
System.out.println(logical);

LogicalPlan optimised = result.queryExecution().optimizedPlan();
System.out.println(optimised);

```

The logical plan will be:
@@ -310,15 +309,14 @@ As `Substrait Spark` library also allows plans to be loaded and executed, so the

The [`SparkConsumeSubstrait`](./app/src/main/java/io/substrait/examples/SparkConsumeSubstrait.java) code shows how to load this file and, most importantly, how to convert it to a Spark engine plan to execute.

Loading the binary protobuf file is the reverse of the writing process (in the code the file name comes from a command line argument, here we're showing the hardcode file name )
Loading the binary protobuf file is the reverse of the writing process (in the code the file name comes from a command-line argument; here we're showing the hardcoded file name).

```java
byte[] buffer = Files.readAllBytes(Paths.get("spark_sql_substrait.plan"));
io.substrait.proto.Plan proto = io.substrait.proto.Plan.parseFrom(buffer);

ProtoPlanConverter protoToPlan = new ProtoPlanConverter();
Plan plan = protoToPlan.from(proto);

```
The loaded byte array is first converted into the protobuf Plan, and then into the Substrait Plan object. Note it can be useful to name the variables, and/or use the full class names, to keep track of whether it's the protobuf Plan or the high-level POJO Plan.
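
The remaining step, sketched below, is to convert that Substrait `Plan` POJO into a Spark logical plan. This is a minimal sketch assuming the library's `ToLogicalPlan` converter and an active `SparkSession` named `spark`; see the linked `SparkConsumeSubstrait` source for the exact code.

```java
// Minimal sketch (assumed converter API): convert the Substrait Plan POJO back
// into a Spark logical plan that the Spark engine can execute.
io.substrait.spark.logical.ToLogicalPlan substraitConverter =
    new io.substrait.spark.logical.ToLogicalPlan(spark);
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan sparkPlan =
    substraitConverter.convert(plan);
```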

@@ -342,10 +340,9 @@ If you were to print out this plan, it has the identical structure to the plan s
+- Project [vehicle_id#10]
+- Filter ((isnotnull(test_result#14) AND (test_result#14 = P)) AND isnotnull(vehicle_id#10))
+- Relation [test_id#9,vehicle_id#10,test_date#11,test_class#12,test_type#13,test_result#14,test_mileage#15,postcode_area#16] csv
```

Executed of this plan is then simple `Dataset.ofRows(spark, sparkPlan).show();` giving the output of
Execution of this plan is then a simple `Dataset.ofRows(spark, sparkPlan).show();` giving the output of

```java
+------+-----+
@@ -409,7 +406,6 @@ Spark's inner join is taking as inputs the two filtered relations; it's ensuring
+- Project [vehicle_id#10]
+- Filter ((isnotnull(test_result#14) AND (test_result#14 = P)) AND isnotnull(vehicle_id#10))
```

The Substrait representation looks longer, but it shows the same structure. (Note that this format is a custom format implemented as [SubstraitStringify](...), as the standard text output can be hard to read.)
@@ -483,7 +479,6 @@ For example the above plan can be handled with PyArrow or DuckDB (as an example
table_provider=self.simple_provider,
)
result = reader.read_all()

```

When run with the plan, PyArrow instantly rejects it with
@@ -533,7 +528,7 @@ As names is a array, we can check the final part of the name and return the matc
```


When run the output is along these lines (the query is slightly different here for simplicity); we can see the tables being request and the schema expected. Nothing is done with the schema here but could be useful for ensuring that the expectations of the plan match the schema of the data held in the engine.
When run, the output is along these lines (the query is slightly different here for simplicity); we can see the tables being requested and the schema expected. Nothing is done with the schema here, but it could be useful for ensuring that the expectations of the plan match the schema of the data held in the engine.

```
== Requesting table ['spark_catalog', 'default', 'vehicles'] with schema:
@@ -578,9 +573,3 @@ In the case of Spark for example, identical plans can be created with the Spark
Logical references to a 'table' via a `NamedScan` give more flexibility, but the structure of the reference still needs to be properly understood and agreed upon.
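
For instance, the plan above refers to its table as `['spark_catalog', 'default', 'vehicles']`, while an engine may register tables under bare names. A hypothetical helper (illustration only, not code from this example) that resolves such a reference by its final component might look like:

```java
// Hypothetical helper (illustration only): resolve a NamedScan name list such as
// ["spark_catalog", "default", "vehicles"] against tables registered under their
// bare names by matching on the final component.
static String resolveTableName(java.util.List<String> names, java.util.Set<String> knownTables) {
  String last = names.get(names.size() - 1);
  if (!knownTables.contains(last)) {
    throw new IllegalArgumentException("No table registered for reference " + names);
  }
  return last;
}
```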

Once a common understanding is reached, transferring plans between engines brings great flexibility and potential.