
spark-arrow-experiments

An experimentation guidance framework for testing and running automated experiments with Spark and Ceph.

We have 3 independently usable modules:

  1. experimenter: Main experimentation guidance framework.
  2. data_generator: Plugin-based data generator. Use this to generate sample parquet files.
  3. graph_generator: Simple graph generator, to plot results.

Requirements

For the experimentation framework, we require:

  • python>=3.2
  • metareserve
  • spark_deploy>=0.1.1
  • rados_deploy>=0.1.1
  • data_deploy>=0.5.0

For the data generator, we require:

  • pandas
  • pyarrow

For the graph generator, we require:

  • numpy>=1.20.1

Many tools also require:

  • scipy>=0.19.1
  • scikit-learn>=0.24.2

Experiments

All experiments follow the same general cycle:

  1. Divide nodes into pools.
  2. Install requirements.
  3. Start frameworks.
  4. Deploy data.
  5. Submit Spark application.
  6. Aggregate results.
  7. Stop frameworks.

We formalized this cycle inside the framework. Experiments define how they give shape to each phase by registering functions with the execution framework, which triggers these functions at the correct time in the cycle.

We provide many default implementations of these functions, so automated experiments can be performed out of the box.
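As a minimal sketch of how an experiment might expose these functions (the class and hook names below are illustrative, not the actual experimenter API; see /experimenter/README.md for the real interface):

# Hypothetical sketch; the real hook names used by the experimenter
# framework may differ.
class ExampleExperiment:
    def distribute(self, nodes):
        # Phase 1: divide nodes into pools, e.g. Spark workers and Ceph nodes.
        half = len(nodes) // 2
        return {'spark': nodes[:half], 'ceph': nodes[half:]}

    def start(self, config):
        # Phases 2-4: install requirements, start frameworks, deploy data.
        pass

    def experiment(self, config):
        # Phases 5-6: submit the Spark application, aggregate results.
        pass

    def stop(self, config):
        # Phase 7: stop frameworks.
        pass

def get_experiment():
    # The execution framework could discover experiments through a function
    # like this and trigger the registered phases at the correct times.
    return ExampleExperiment()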

To use the experimentation framework, provided in the experimenter directory, run:

python3 experimenter/entrypoint.py -h

For more information, see /experimenter/README.md.

Data Generation

We built a simple data generator in the data_generator directory. Execute it using:

python3 data_generator/entrypoint.py -h

By default, generated data is written to /data_generator/generated/. We wrote one plugin, which generates a simple parquet file.
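As an illustration of what such a plugin boils down to (the schema and output path below are placeholders, not the plugin's actual ones), a simple parquet file can be written with the listed dependencies, pandas and pyarrow:

# Illustrative only: writes a small parquet file with a made-up schema.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def generate(path='data_generator/generated/example.parquet', num_rows=1000):
    df = pd.DataFrame({
        'id': range(num_rows),
        'value': [i * 0.5 for i in range(num_rows)],
    })
    pq.write_table(pa.Table.from_pandas(df), path)

if __name__ == '__main__':
    generate()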

Instead of generating data for the experiments, a few pre-generated files are also available in the JayjeetAtGithub/datasets repository. To fetch the git lfs objects from that repository, use:

apt update
apt install git-lfs
git clone https://github.com/JayjeetAtGithub/datasets
cd datasets/
git lfs pull

For more information, see /data_generator/README.md.

Graph Generation

Our basic experiments return a time series of datapoints, each consisting of two 64-bit integers. The first number is the initialization time the Spark reader implementation needed to start up. The second number is the computation time Spark needed.
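As a rough illustration of consuming such a time series (the whitespace-separated, two-columns-per-line layout assumed here is a guess; see /graph_generator/README.md for the actual result format), the datapoints can be loaded and summarized with numpy:

# Assumes one datapoint per line: "<init_time> <compute_time>" as 64-bit ints.
import numpy as np

def summarize(path):
    data = np.loadtxt(path, dtype=np.int64)  # shape: (num_datapoints, 2)
    init_times, compute_times = data[:, 0], data[:, 1]
    print('mean initialization time:', init_times.mean())
    print('mean computation time:   ', compute_times.mean())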

To run the graph generator, use:

python3 graph_generator/entrypoint.py -h

By default, generated graphs are written to /graph_generator/generated/.

For more information, see /graph_generator/README.md.

