release version 3.0.0
Refactor join implementations to support existence joins and building the BHJ hash map on the driver side.

Support spark333 batch shuffle reading.

Update rust-toolchain to the latest nightly version.

Other minor improvements.

Update docs.
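The existence-join support mentioned in this commit can be illustrated with a minimal sketch: each probe-side row is kept and annotated with a boolean flag saying whether any build-side row matches its key. This is an illustration of the semantics only, not Blaze's actual Rust implementation; all names here are made up.

```python
# Rough sketch of existence-join semantics: every left row survives the
# join and carries an "exists" flag, true iff some right row matches.
# (Illustration only; not Blaze's Rust code.)

def existence_join(left_rows, right_rows, key):
    right_keys = {key(r) for r in right_rows}  # hash set of build-side keys
    return [(row, key(row) in right_keys) for row in left_rows]

left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}, {"id": 2}]
print(existence_join(left, right, lambda r: r["id"]))
# -> [({'id': 1}, False), ({'id': 2}, True), ({'id': 3}, False)]
```

This is the join flavor Spark plans for uncorrelated `EXISTS`/`IN` predicates that cannot be rewritten as semi joins: no rows are filtered at join time, the flag is consumed by a later filter.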
zhangli20 committed Jul 1, 2024
1 parent 173607e commit 8e4b4cd
Showing 73 changed files with 4,750 additions and 3,592 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build-ce7-releases.yml
@@ -12,7 +12,7 @@ jobs:
strategy:
matrix:
sparkver: [spark303, spark333]
-blazever: [2.0.9.1]
+blazever: [3.0.0]

steps:
- uses: actions/checkout@v4
15 changes: 7 additions & 8 deletions .github/workflows/tpcds.yml
@@ -34,19 +34,18 @@ jobs:
with: {version: "21.7"}

- uses: actions-rust-lang/setup-rust-toolchain@v1
-with: {rustflags: --allow warnings -C target-cpu=native}
+with:
+  toolchain: nightly
+  rustflags: --allow warnings -C target-feature=+aes
+  components:
+    cargo
+    rustfmt

- name: Rustfmt Check
uses: actions-rust-lang/rustfmt@v1

## - name: Rust Clippy Check
## uses: actions-rs/clippy-check@v1
## with:
## token: ${{ secrets.GITHUB_TOKEN }}
## args: --all-features

- name: Cargo test
-run: cargo test --workspace --all-features
+run: cargo +nightly test --workspace --all-features

- name: Build Spark303
run: mvn package -Ppre -Pspark303
51 changes: 26 additions & 25 deletions Cargo.lock


36 changes: 18 additions & 18 deletions Cargo.toml
@@ -64,26 +64,26 @@ serde_json = { version = "1.0.96" }

[patch.crates-io]
# datafusion: branch=v36-blaze
-datafusion = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-common = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-execution = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-optimizer = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
-datafusion-physical-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "71433f743b2c399ea1728531b0e56fd7c6ef5282"}
+datafusion = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-common = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-execution = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-optimizer = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}
+datafusion-physical-expr = { git = "https://github.com/blaze-init/arrow-datafusion.git", rev = "17b1ad3c7432391b94dd54e48a60db6d5712a7ef"}

# arrow: branch=v50-blaze
-arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-arith = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-array = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-buffer = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-cast = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-data = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-ord = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-row = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-schema = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-select = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-arrow-string = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
-parquet = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "2c39d9a251f7e3f8f15312bdd0c38759e465e8bc"}
+arrow = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-arith = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-array = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-buffer = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-cast = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-data = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-ord = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-row = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-schema = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-select = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+arrow-string = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}
+parquet = { git = "https://github.com/blaze-init/arrow-rs.git", rev = "7471d70f7ae6edd5d4da82b7d966a8ede720e499"}

# serde_json: branch=v1.0.96-blaze
serde_json = { git = "https://github.com/blaze-init/json", branch = "v1.0.96-blaze" }
22 changes: 13 additions & 9 deletions README.md
@@ -73,7 +73,7 @@ Blaze._

```shell
SHIM=spark333 # or spark303
-MODE=release # or dev
+MODE=release # or pre
mvn package -P"${SHIM}" -P"${MODE}"
```

@@ -94,11 +94,16 @@ This section describes how to submit and configure a Spark Job with Blaze support.
1. Move the Blaze jar package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).

2. Add the following confs to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:

```properties
spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false

# other blaze confs defined in spark-extension/src/main/java/org/apache/spark/sql/blaze/BlazeConf.java
# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
```

3. Submit a query with spark-sql, or with other tools like spark-thriftserver:
@@ -108,16 +113,15 @@ spark-sql -f tpcds/q01.sql

## Performance

-Check [Benchmark Results](./benchmark-results/20240202.md) with the latest date for the performance
-comparison with vanilla Spark on TPC-DS 1TB dataset. The benchmark result shows that Blaze saved
-~55% query time and ~60% cluster resources in average. ~6x performance achieved for the best case (q06).
+Check [Benchmark Results](./benchmark-results/20240701-blaze300.md) with the latest date for the performance
+comparison with vanilla Spark 3.3.3. The benchmark results show that Blaze saves about 50% of query time on TPC-DS/TPC-H 1TB datasets.
+Stay tuned and join us for more upcoming thrilling numbers.

-Query time:
-![20240202-query-time](./benchmark-results/blaze-query-time-comparison-20240202.png)
+TPC-DS Query time:
+![20240701-query-time-tpcds](./benchmark-results/spark333-vs-blaze300-query-time-20240701.png)

-Cluster resources:
-![20240202-resources](./benchmark-results/blaze-cluster-resources-cost-comparison-20240202.png)
+TPC-H Query time:
+![20240701-query-time-tpch](./benchmark-results/spark333-vs-blaze300-query-time-20240701-tpch.png)

We also encourage you to benchmark Blaze and share the results with us. 🤗

15 changes: 9 additions & 6 deletions RELEASES.md
@@ -1,12 +1,15 @@
-# blaze-v2.0.9.1
+# blaze-v3.0.0

## Features
* Supports falling back to Spark for nondeterministic expressions.
* Supports "$[].xxx" jsonpath syntax in get_json_object().
* Supports using spark.io.compression.codec for shuffle/broadcast compression.
* Supports date type casting.
* Refactored join implementations to support existence joins and building the BHJ hash map on the driver side.

## Performance
* Supports adaptive batch size in ParquetScan, improving vectorized reading performance.
* Supports spilling directly to disk files when on-heap memory is full.
* Fixed performance issues when running on spark3 with default configurations.
* Uses cached parquet metadata.
* Refactored native broadcast to avoid duplicated broadcast jobs.
* Supports spark333 batch shuffle reading.

## Bugfix
* Fixed incorrect parquet rowgroup pruning with files containing deprecated min/max values.
* Fixed in_list conversion in from_proto.rs.
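The "$[].xxx" jsonpath form listed under Features can be illustrated with a rough sketch. It assumes, as the syntax suggests, that "$[]" iterates the elements of a top-level JSON array and collects field "xxx" from each one; this is an illustration of the assumed semantics, not Blaze's Rust implementation, and the function name is made up.

```python
import json

# Rough sketch of get_json_object("$[].xxx") semantics: iterate a
# top-level JSON array and collect the named field from each element.
# (Illustration of assumed behavior only; not Blaze's code.)

def get_json_array_field(json_str, field):
    data = json.loads(json_str)
    if not isinstance(data, list):
        return None  # "$[]" only applies to a top-level array
    return [elem.get(field) for elem in data if isinstance(elem, dict)]

print(get_json_array_field('[{"a": 1}, {"a": 2}]', "a"))
# -> [1, 2]
```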
