Skip to content

Latest commit

 

History

History
76 lines (47 loc) · 4.08 KB

Velox.md

File metadata and controls

76 lines (47 loc) · 4.08 KB

Velox Backend

Currently the mvn script can automatically fetch and build all dependency libraries incluing Velox and Arrow. Our nightly build still use Velox under oap-project.

Velox use the script setup-ubuntu.sh to install all dependency libraries, but Arrow's dependency library can't be installed. So we need to install them manually:

apt install maven build-essential cmake libssl-dev libre2-dev libcurl4-openssl-dev clang lldb lld libz-dev

Also we need to setup the JAVA_HOME env.

export JAVA_HOME=path/to/java/home
export PATH=$JAVA_HOME/bin:$PATH

Velox home directory

The command below clones velox source code from OAP-project/velox to tools/build/velox_ep. Then it applies some patches to Velox build script and builds the velox library.

mvn clean package -DskipTests -Dcheckstyle.skip -Pbackends-velox -Dbuild_protobuf=OFF -Dbuild_cpp=ON -Dbuild_velox=ON -Dbuild_velox_from_source=ON -Dbuild_arrow=ON

You can also clone the Velox source to some other folder then specify it by -Dvelox_home as below. With -Dbuild_velox=ON, the script applies the patches and build the Velox library. With -Dbuild_velox=OFF, script skips the velox build steps and reuse the existed library. It's useful if Velox isn't changed.

mvn clean package -DskipTests -Dcheckstyle.skip -Pbackends-velox -Dbuild_protobuf=OFF -Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME} -Dbuild_arrow=ON -Dcompile_velox=ON

Arrow home directory

Arrow home can be set as the same of Velox. Without -Darrow_home, arrow is cloned to toos/build/arrow_ep. You can specify the arrow home directory by -Darrow_home and then use -Dbuild_arrow to control arrow build or not.

Test TPC-H on Gluten with Velox backend

In Gluten, all 22 queries can be fully offloaded into Velox for computing.

Data preparation

Considering current Velox does not fully support Decimal and Date data type, the datagen script transforms "Decimal-to-Double" and "Date-to-String". As a result, we need to modify the TPCH queries a bit. You can find the modified TPC-H queries.

Submit the Spark SQL job

Submit test script from spark-shell. You can find the scala code to Run TPC-H as an example. Please remember to modify the location of TPC-H files as well as TPC-H queries in backends-velox/workload/tpch/run_tpch/tpch_parquet.scala before you run the testing.

var parquet_file_path = "/PATH/TO/TPCH_PARQUET_PATH"
var gluten_root = "/PATH/TO/GLUTEN"

Below script shows an example about how to run the testing, you should modify the parameters such as executor cores, memory, offHeap size based on your environment.

cat tpch_parquet.scala | spark-shell --name tpch_powertest_velox --master yarn --deploy-mode client --conf spark.plugins=io.glutenproject.GlutenPlugin --conf --conf spark.gluten.sql.columnar.backend.lib=velox --conf spark.driver.extraClassPath=${gluten_jvm_jar} --conf spark.executor.extraClassPath=${gluten_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g

Result

TPC-H Q6

Performance

Below table shows the TPC-H Q1 and Q6 Performance in a multiple-thread test (--num-executors 6 --executor-cores 6) for Velox and vanilla Spark. Both Parquet and ORC datasets are sf1024.

Query Performance (s) Velox (ORC) Vanilla Spark (Parquet) Vanilla Spark (ORC)
TPC-H Q6 13.6 21.6 34.9
TPC-H Q1 26.1 76.7 84.9

External reference setup

TO ease your first hand experience of using Gluten, we have setup an external reference cluster. If you are interested, please contact [email protected]