release version 3.0.0
Refactor join implementations to support existence joins and building the BHJ hash map on the driver side.

Support spark333 batch shuffle reading.

Update rust-toolchain to the latest nightly version.

Other minor improvements.

Update docs.
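The existence-join support mentioned in this commit can be illustrated with a minimal sketch: each probe-side row is kept and annotated with a boolean flag saying whether any build-side row matches its key. This is an illustration of the semantics only, not Blaze's actual Rust implementation; all names here are made up.

```python
# Rough sketch of existence-join semantics: every left row survives the
# join and carries an "exists" flag, true iff some right row matches.
# (Illustration only; not Blaze's Rust code.)

def existence_join(left_rows, right_rows, key):
    right_keys = {key(r) for r in right_rows}  # hash set of build-side keys
    return [(row, key(row) in right_keys) for row in left_rows]

left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}, {"id": 2}]
print(existence_join(left, right, lambda r: r["id"]))
# -> [({'id': 1}, False), ({'id': 2}, True), ({'id': 3}, False)]
```

This is the join flavor Spark plans for uncorrelated `EXISTS`/`IN` predicates that cannot be rewritten as semi joins: no rows are filtered at join time, the flag is consumed by a later filter.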
zhangli20 committed Jul 1, 2024
1 parent 173607e commit 8e4b4cd
Showing 73 changed files with 4,750 additions and 3,592 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build-ce7-releases.yml
@@ -12,7 +12,7 @@ jobs:
strategy:
matrix:
sparkver: [spark303, spark333]
-blazever: [2.0.9.1]
+blazever: [3.0.0]

steps:
- uses: actions/checkout@v4
15 changes: 7 additions & 8 deletions .github/workflows/tpcds.yml
@@ -34,19 +34,18 @@ jobs:
with: {version: "21.7"}

- uses: actions-rust-lang/setup-rust-toolchain@v1
-with: {rustflags: --allow warnings -C target-cpu=native}
+with:
+  toolchain: nightly
+  rustflags: --allow warnings -C target-feature=+aes
+  components:
+    cargo
+    rustfmt

- name: Rustfmt Check
uses: actions-rust-lang/rustfmt@v1

## - name: Rust Clippy Check
## uses: actions-rs/clippy-check@v1
## with:
## token: ${{ secrets.GITHUB_TOKEN }}
## args: --all-features

- name: Cargo test
-run: cargo test --workspace --all-features
+run: cargo +nightly test --workspace --all-features

- name: Build Spark303
run: mvn package -Ppre -Pspark303
51 changes: 26 additions & 25 deletions Cargo.lock


36 changes: 18 additions & 18 deletions Cargo.toml
@@ -64,26 +64,26 @@ serde_json = { version = "1.0.96" }

[patch.crates-io]
# datafusion: branch=v36-blaze
-datafusion = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-common = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-execution = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-optimizer = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-physical-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
+datafusion = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-common = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-execution = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-optimizer = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-physical-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}

# arrow: branch=v50-blaze
-arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-arith = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-array = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-buffer = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-cast = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-data = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-ord = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-row = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-schema = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-select = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-string = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-parquet = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
+arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-arith = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-array = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-buffer = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-cast = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-data = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-ord = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-row = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-schema = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-select = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-string = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+parquet = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}

# serde_json: branch=v1.0.96-blaze
serde_json = { git = "https://github.com/blaze-init/json", branch = "v1.0.96-blaze" }
22 changes: 13 additions & 9 deletions README.md
@@ -73,7 +73,7 @@ Blaze._

```shell
SHIM=spark333 # or spark303
-MODE=release # or dev
+MODE=release # or pre
mvn package -P"${SHIM}" -P"${MODE}"
```

@@ -94,11 +94,16 @@ This section describes how to submit and configure a Spark Job with Blaze support.
1. Move the Blaze jar package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).

2. Add the following confs to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:

```properties
spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false

# other blaze confs defined in spark-extension/src/main/java/org/apache/spark/sql/blaze/BlazeConf.java
# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
```

3. Submit a query with spark-sql, or with other tools like spark-thriftserver:
@@ -108,16 +113,15 @@ spark-sql -f tpcds/q01.sql

## Performance

-Check [Benchmark Results](./benchmark-results/20240202.md) with the latest date for the performance
-comparison with vanilla Spark on TPC-DS 1TB dataset. The benchmark result shows that Blaze saved
-~55% query time and ~60% cluster resources in average. ~6x performance achieved for the best case (q06).
+Check [Benchmark Results](./benchmark-results/20240701-blaze300.md) with the latest date for the performance
+comparison with vanilla Spark 3.3.3. The benchmark results show that Blaze saves about 50% of query time on TPC-DS/TPC-H 1TB datasets.
+Stay tuned and join us for more upcoming thrilling numbers.

-Query time:
-![20240202-query-time](./benchmark-results/blaze-query-time-comparison-20240202.png)
+TPC-DS Query time:
+![20240701-query-time-tpcds](./benchmark-results/spark333-vs-blaze300-query-time-20240701.png)

-Cluster resources:
-![20240202-resources](./benchmark-results/blaze-cluster-resources-cost-comparison-20240202.png)
+TPC-H Query time:
+![20240701-query-time-tpch](./benchmark-results/spark333-vs-blaze300-query-time-20240701-tpch.png)

We also encourage you to benchmark Blaze and share the results with us. 🤗

15 changes: 9 additions & 6 deletions RELEASES.md
@@ -1,12 +1,15 @@
-# blaze-v2.0.9.1
+# blaze-v3.0.0

## Features
* Supports falling back to Spark for nondeterministic expressions.
* Supports "$[].xxx" jsonpath syntax in get_json_object().
* Supports using spark.io.compression.codec for shuffle/broadcast compression.
* Supports date type casting.
* Refactored join implementations to support existence joins and building the BHJ hash map on the driver side.

## Performance
* Supports adaptive batch size in ParquetScan, improving vectorized reading performance.
* Supports spilling directly to disk files when on-heap memory is full.
* Fixed performance issues when running on spark3 with default configurations.
* Uses cached parquet metadata.
* Refactored native broadcast to avoid duplicated broadcast jobs.
* Supports spark333 batch shuffle reading.

## Bugfix
* Fixed incorrect parquet rowgroup pruning with files containing deprecated min/max values.
* Fixed in_list conversion in from_proto.rs.
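The "$[].xxx" jsonpath form listed under Features can be illustrated with a rough sketch. It assumes, as the syntax suggests, that "$[]" iterates the elements of a top-level JSON array and collects field "xxx" from each one; this is an illustration of the assumed semantics, not Blaze's Rust implementation, and the function name is made up.

```python
import json

# Rough sketch of get_json_object("$[].xxx") semantics: iterate a
# top-level JSON array and collect the named field from each element.
# (Illustration of assumed behavior only; not Blaze's code.)

def get_json_array_field(json_str, field):
    data = json.loads(json_str)
    if not isinstance(data, list):
        return None  # "$[]" only applies to a top-level array
    return [elem.get(field) for elem in data if isinstance(elem, dict)]

print(get_json_array_field('[{"a": 1}, {"a": 2}]', "a"))
# -> [1, 2]
```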
