This project interconnects Delta Lake with Ceph. The interconnection is aimed in particular at offloading operations from compute systems to storage systems when Delta Lake is used together with Spark to process data.
The technologies involved are listed below, starting from the user's point of view and going deeper into the stack:
- Apache Spark
- Delta Lake
- Spark (Delta Lake uses Spark internally)
- Arrow-Spark
- Apache Arrow (C++ core)
- Skyhook
- Ceph
This project uses a custom flavor of Delta Lake, which allows us to capture data reads and writes and forward them to Arrow-Spark.
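To make this concrete, here is a minimal sketch of what the top of the stack looks like from the user's point of view: a plain Delta Lake read through Spark. The session settings and table path are placeholders; the point is that the read itself uses the standard Delta Lake API, and the custom flavor intercepts it and forwards it down through Arrow-Spark and Skyhook to Ceph.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; a real deployment would point at a cluster.
val spark = SparkSession.builder()
  .appName("delta-on-ceph-example")
  .master("local[*]")
  .getOrCreate()

// A standard Delta Lake read. With the custom Delta Lake flavor on the
// classpath, this read is captured and served through Arrow-Spark, Skyhook,
// and ultimately Ceph, instead of the default file I/O path.
val df = spark.read.format("delta").load("/hypothetical/path/to/delta-table")
df.show()
```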
1. Compile the custom Delta Lake. For build instructions on how to do this, refer to its repository.
2. Build Arrow-Spark, and make sure it is available in your local Maven repository. For build instructions on how to do this, refer to its repository.
3. Once you have finished step 1, i.e., you have the custom Delta Lake as a JAR file at `some/path/custom_delta.jar`, execute the following in the root folder of this project:
   ```bash
   mkdir depjars
   cp some/path/custom_delta.jar depjars/
   ```
4. Finally, choose one of the following commands to build:
   - `./gradlew shadowJar` builds the full JAR file, with all dependencies included.
   - `./gradlew lightJar` builds a JAR file that skips dependencies Apache Spark already provides. This is a useful build target when planning to deploy on a Spark cluster.
   - `./gradlew essentialJar` builds a JAR file containing only the absolute essentials. This is a useful build target when planning to deploy on a Spark production cluster.
After compilation, you will find the output JAR in:
- `app/build/libs/app-all.jar` (if compiled with `shadowJar`)
- `app/build/libs/app-light.jar` (if compiled with `lightJar`)
- `app/build/libs/app-essential.jar` (if compiled with `essentialJar`)
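As a hedged sketch of using one of these artifacts (the application name is made up, and the exact deployment mechanics depend on your cluster): the light JAR can be shipped to executors via Spark's standard `spark.jars` setting, for example:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical deployment sketch: distribute the light JAR to the executors
// of an existing Spark cluster via `spark.jars`. The driver itself must also
// have the JAR on its classpath (e.g. passed to spark-submit).
val spark = SparkSession.builder()
  .appName("delta-on-ceph-deployment")
  .config("spark.jars", "app/build/libs/app-light.jar")
  .getOrCreate()
```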