Using tpch script from datafusion-benchmarks (apache#12)
* Using tpch script from datafusion-benchmarks

* Using tpch script from datafusion-benchmarks

* Reverting to single partition

* Removing plans, reverting to single partition

* Trying one partition only

* Fixing tests

* One partition only

* Using TPCH Dbgen from Databricks

* Restored partition count

* Will tests eventually pass?

* Introducing regexp for determinism

* Ignored additional tests

* Ignored additional tests

* Update README.md
edmondop authored Oct 4, 2024
1 parent 2523e9f commit 2c8b8b8
Showing 31 changed files with 1,485 additions and 2,311 deletions.
34 changes: 23 additions & 11 deletions .github/workflows/rust.yml
@@ -2,24 +2,36 @@ name: Rust

on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]

env:
CARGO_TERM_COLOR: always
PYTHON_VERSION: 3.9
TPCH_SCALING_FACTOR: "1"
TPCH_TEST_PARTITIONS: "1"
TPCH_DATA_PATH: "data"

jobs:
build:

runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3
- name: Install protobuf compiler
shell: bash
run: sudo apt-get install protobuf-compiler
- name: Build Rust code
run: cargo build --verbose
- name: Run tests
run: cargo test --verbose
- uses: actions/checkout@v3
- name: Install protobuf compiler
shell: bash
run: sudo apt-get install protobuf-compiler
- name: Build Rust code
run: cargo build --verbose
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install test dependencies
run: |
python -m pip install --upgrade pip
pip install -r tpch/requirements.txt
- name: Generate test data
run: |
./scripts/gen-test-data.sh
- name: Run tests
run: cargo test --verbose
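
The updated workflow installs Python, pulls in the TPC-H tooling, and generates test data before `cargo test` runs. A rough local equivalent of the new steps might look like the sketch below (assumptions: you are at the repository root, `python` is 3.9 as in the workflow, and you want the same single-partition, scale-factor-1 dataset the CI uses):

```bash
# Values mirror the env block added to the workflow above.
export TPCH_SCALING_FACTOR="1"
export TPCH_TEST_PARTITIONS="1"
export TPCH_DATA_PATH="data"

sudo apt-get install protobuf-compiler   # protoc is needed for the Rust build
cargo build --verbose

python -m pip install --upgrade pip
pip install -r tpch/requirements.txt     # dependencies for the TPC-H data generator

./scripts/gen-test-data.sh               # clones tpch-dbgen, generates and converts the tables
cargo test --verbose
```
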
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@ venv
*.so
*.log
results-sf*
data
tpch/tpch-dbgen
37 changes: 31 additions & 6 deletions Cargo.lock

Some generated files are not rendered by default.

7 changes: 6 additions & 1 deletion Cargo.toml
@@ -45,6 +45,11 @@ uuid = "1.2"
rustc_version = "0.4.0"
tonic-build = { version = "0.8", default-features = false, features = ["transport", "prost"] }

[dev-dependencies]
anyhow = "1.0.89"
pretty_assertions = "1.4.0"
regex = "1.11.0"

[lib]
name = "datafusion_ray"
crate-type = ["cdylib", "rlib"]
@@ -54,4 +59,4 @@ name = "datafusion_ray._datafusion_ray_internal"

[profile.release]
codegen-units = 1
lto = true
lto = true
40 changes: 34 additions & 6 deletions README.md
@@ -19,8 +19,8 @@

# DataFusion on Ray

> This was originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
[Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).
> This was originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
> [Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).
DataFusion Ray is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow](https://arrow.apache.org/), [Apache DataFusion](https://datafusion.apache.org/) and [Ray](https://www.ray.io/).

@@ -33,7 +33,7 @@ DataFusion Ray is a distributed SQL query engine powered by the Rust implementat

## Non Goals

- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
- Ballista is extremely complex and utilizing Ray feels like it abstracts some of that complexity away.
- Datafusion Ray is delegating cluster management to Ray.

@@ -120,10 +120,38 @@ python -m pip install -r requirements-in.txt

Whenever rust code changes (your changes or via `git pull`):

```bash
# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
```

## Testing

Running the local Rust tests requires generating the TPC-H data first. This can
be done by running the following command:

```bash
./scripts/gen-test-data.sh
```

The tests compare generated plans with expected plans, which unfortunately
contain the path to the Parquet tables. The path committed under version
control is the one of a GitHub runner and won't work locally. You can fix it
by running the following command:

```bash
./scripts/replace-expected-plan-paths.sh local-dev
```

When you instead need to regenerate the plans (by removing all the content of
`testdata/expected-plans` and re-running the planner tests), the regenerated
plans will contain your local paths. Fix them before committing by running:

```bash
./scripts/replace-expected-plan-paths.sh pre-ci
```

## Benchmarking
60 changes: 60 additions & 0 deletions scripts/gen-test-data.sh
@@ -0,0 +1,60 @@
#!/bin/bash

set -e

create_directories() {
  mkdir -p data
}

clone_and_build_tpch_dbgen() {
  if [ -z "$(ls -A tpch/tpch-dbgen)" ]; then
    echo "tpch/tpch-dbgen folder is empty. Cloning repository..."
    git clone https://github.com/databricks/tpch-dbgen.git tpch/tpch-dbgen
    cd tpch/tpch-dbgen
    make
    cd ../../
  else
    echo "tpch/tpch-dbgen folder is not empty. Skipping cloning of TPCH dbgen."
  fi
}

generate_data() {
  cd tpch/tpch-dbgen
  if [ "$TPCH_TEST_PARTITIONS" -gt 1 ]; then
    for i in $(seq 1 "$TPCH_TEST_PARTITIONS"); do
      ./dbgen -f -s "$TPCH_SCALING_FACTOR" -C "$TPCH_TEST_PARTITIONS" -S "$i"
    done
  else
    ./dbgen -f -s "$TPCH_SCALING_FACTOR"
  fi
  mv ./*.tbl* ../../data
}

convert_data() {
  cd ../../
  python -m tpch.tpchgen convert --partitions "$TPCH_TEST_PARTITIONS"
}

main() {
  if [ -z "$TPCH_TEST_PARTITIONS" ]; then
    echo "Error: TPCH_TEST_PARTITIONS is not set."
    exit 1
  fi

  if [ -z "$TPCH_SCALING_FACTOR" ]; then
    echo "Error: TPCH_SCALING_FACTOR is not set."
    exit 1
  fi

  create_directories

  if [ -z "$(ls -A data)" ]; then
    clone_and_build_tpch_dbgen
    generate_data
    convert_data
  else
    echo "Data folder is not empty. Skipping cloning and data generation."
  fi
}

main
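
The script above refuses to run unless both environment variables are set, and it only generates data when the `data` directory is empty. A minimal invocation sketch, assuming the single-partition, scale-factor-1 defaults used elsewhere in this change:

```bash
# Generate scale-factor-1 TPC-H data in a single partition, then convert it.
export TPCH_SCALING_FACTOR="1"
export TPCH_TEST_PARTITIONS="1"
./scripts/gen-test-data.sh

# The script skips generation while ./data is non-empty; clear it to force a rerun.
rm -rf data
./scripts/gen-test-data.sh
```
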
44 changes: 44 additions & 0 deletions scripts/replace-expected-plans-paths.sh
@@ -0,0 +1,44 @@
#!/bin/bash

# This script helps change the path to parquet files in expected plans for
# local development and CI

set -e

if [ "$#" -ne 1 ]; then
  echo "Usage: $0 <mode>"
  echo "Modes: pre-ci, local-dev"
  exit 1
fi

# Assign the parameter to the mode variable
mode=$1

ci_dir="home/runner/work/datafusion-ray/datafusion-ray"
current_dir=$(pwd)
current_dir_no_leading_slash="${current_dir#/}"
expected_plans_dir="./testdata/expected-plans"

# Function to replace paths in files
replace_paths() {
  local search=$1
  local replace=$2
  find "$expected_plans_dir" -type f -exec sed -i "s|$search|$replace|g" {} +
  echo "Replaced all occurrences of '$search' with '$replace' in files within '$expected_plans_dir'."
}

# Handle the modes
case $mode in
  pre-ci)
    replace_paths "$current_dir_no_leading_slash" "$ci_dir"
    ;;
  local-dev)
    replace_paths "$ci_dir" "$current_dir_no_leading_slash"
    ;;
  *)
    echo "Invalid mode: $mode"
    echo "Usage: $0 <mode>"
    echo "Modes: pre-ci, local-dev"
    exit 1
    ;;
esac
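
Tying this together with the README section above, a plausible local round trip when working on the planner tests might look like the following sketch (paths and the test command are assumptions; the script name here matches the file added in this diff, while the README spells it `replace-expected-plan-paths.sh`):

```bash
# Rewrite the CI runner paths baked into testdata/expected-plans to local paths.
./scripts/replace-expected-plans-paths.sh local-dev

# Run (or regenerate) the planner tests locally.
cargo test --verbose

# Before committing regenerated plans, switch the local paths back to the CI paths.
./scripts/replace-expected-plans-paths.sh pre-ci
```
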
