This repo contains sample code that implements streaming feature engineering on top of a local Kafka cluster deployed with Docker Compose.
This project contains 3 components:
- Data producer
- Kafka
- Spark consumer
The data producer is a simple Python application deployed in a Docker container. It randomly samples records from data/ and sends them to Kafka. A rough sketch of such a producer follows (the topic name, broker address, record schema, and use of the kafka-python package are assumptions, not the repo's actual implementation).
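```python
import json
import random
import time

from kafka import KafkaProducer  # assumes the kafka-python package

# Hypothetical topic and broker address; the real values live in the repo's config.
TOPIC = "transactions"
BOOTSTRAP_SERVERS = "localhost:9092"

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stand-in records; the real producer samples these from files under data/.
records = [
    {"transaction_id": 1, "amount": 42.0},
    {"transaction_id": 2, "amount": 13.5},
]

while True:
    record = random.choice(records)     # randomly sample a record
    producer.send(TOPIC, value=record)  # publish it to Kafka
    time.sleep(1)                       # throttle the stream
```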
Kafka is deployed in a Docker container with the necessary ports exposed.
The Spark consumer is a PySpark application that consumes data from Kafka and performs feature transformations. It can run in either streaming or batch mode. A minimal sketch of the streaming path is shown below, assuming a JSON payload and a topic named transactions (both assumptions; see src/consumer_spark.py for the actual logic).
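```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("kafka-feature-consumer")
    .getOrCreate()
)

# Hypothetical schema matching the producer's JSON records.
schema = StructType([
    StructField("transaction_id", LongType()),
    StructField("amount", DoubleType()),
])

# Read the Kafka topic as a stream; broker address and topic are assumptions.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
)

# Parse the JSON value and derive a simple illustrative feature column.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*")
    .withColumn("amount_is_large", F.col("amount") > 100)
)

# Write the parsed features to the console just to illustrate the flow.
query = parsed.writeStream.format("console").start()
query.awaitTermination()
```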
To install the necessary Python dependencies and build the JARs required by Spark, run:
make install
To activate your virtual environment, run:
source .venv/bin/activate
To run Kafka along with the random data producer, run:
docker compose up --build -d
This will build and run the necessary containers.
To run the Spark streaming job, run:
python -m src.consumer_spark
The feature store is available in demo/feature_repo.
Feature definitions live in transaction_repo.py, and a test workflow is provided to exercise the feature store by requesting online features via the feature server and the Feast SDK. For orientation, a feature view definition in transaction_repo.py might look roughly like the sketch below (the entity, field names, parquet source path, and modern Feast API are assumptions and may differ from the repo and your Feast version).
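```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical entity keyed on the transaction id.
transaction = Entity(name="transaction_id", join_keys=["transaction_id"])

# Hypothetical offline source written by the Spark job.
transactions_source = FileSource(
    path="data/transactions.parquet",
    timestamp_field="event_timestamp",
)

# Feature view exposing a couple of illustrative features.
transaction_features = FeatureView(
    name="transaction_features",
    entities=[transaction],
    ttl=timedelta(days=1),
    schema=[
        Field(name="amount", dtype=Float32),
        Field(name="merchant_count_1h", dtype=Int64),
    ],
    source=transactions_source,
)
```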
Note: to ensure data is available for the feature store, run the earlier steps first and make sure the Docker containers are up and the Spark streaming job is running successfully. See the Run section for more details.
To set up the feature store, run:
cd demo/feature_repo
feast plan
feast apply
To populate the online store, materialize the features from the offline store:
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
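The same incremental materialization can also be triggered from Python via the Feast SDK; the following is a minimal sketch, assuming it is run from demo/feature_repo:

```python
from datetime import datetime

from feast import FeatureStore

# Load the feature store defined in the current directory (demo/feature_repo).
store = FeatureStore(repo_path=".")

# Materialize everything from the last materialization point up to now,
# mirroring `feast materialize-incremental`.
store.materialize_incremental(end_date=datetime.utcnow())
```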
To start the Feast feature server, run:
feast serve
You can then run the test workflow:
python test_workflow.py
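The online lookups in the test workflow are roughly equivalent to the following sketch, which fetches features both through the Feast SDK and over HTTP from the feature server (the feature reference, entity key, and default port 6566 are assumptions):

```python
import json
import urllib.request

from feast import FeatureStore

# --- Via the Feast SDK ---
store = FeatureStore(repo_path=".")
online = store.get_online_features(
    features=["transaction_features:amount"],  # hypothetical feature reference
    entity_rows=[{"transaction_id": 1}],       # hypothetical entity key
).to_dict()
print(online)

# --- Via the feature server started by `feast serve` ---
payload = {
    "features": ["transaction_features:amount"],
    "entities": {"transaction_id": [1]},
}
req = urllib.request.Request(
    "http://localhost:6566/get-online-features",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```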