Tool for fast prototyping of distributed stream processing applications.
The tool was tested on Ubuntu 20.04.4 and is based on Python 3.8.10, Kafka 2.13-2.8.0, PySpark 3.2.1 and MySQL 8.0.30.
- Clone the repository and enter it.
git clone https://github.com/PINetDalhousie/stream2gym.git
cd stream2gym
- Install dependencies. Our tool depends on the following software:
- pip3
- Mininet 2.3.0
- Networkx 2.5.1
- Java 11
- Xterm
- Kafka-python 2.0.2
- Matplotlib 3.3.4
- Seaborn 0.12.1
- PyYAML 5.3.1
Most dependencies can be installed using apt install and pip3 install:
$ sudo apt install python3-pip mininet default-jdk xterm netcat
$ sudo pip3 install mininet networkx kafka-python matplotlib python-snappy lz4 seaborn pyyaml
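The pip3 command above does not pin versions, so the installed packages may differ from the versions listed in this README. The following is a minimal sketch, not part of stream2gym, that compares installed Python packages against those versions using the standard library's importlib.metadata (available in Python 3.8). The helper name check_versions is hypothetical, and a mismatch is not necessarily an error; it only means the setup differs from the one the tool was tested with.

```python
from importlib import metadata

# Versions listed in this README, keyed by PyPI distribution name.
EXPECTED = {
    "networkx": "2.5.1",
    "kafka-python": "2.0.2",
    "matplotlib": "3.3.4",
    "seaborn": "0.12.1",
    "PyYAML": "5.3.1",
}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(EXPECTED).items():
        print(f"{name}: README lists {wanted}, found {installed or 'not installed'}")
```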
- You are ready to go! You should be able to get help using:
sudo python3 main.py -h
- Navigate through the use-cases/ directory to explore the diverse applications we tested using stream2gym. Details of each application, including the exact data processing pipeline, topology, executed queries, and platform configurations, can be found inside the respective application directory. Example command to test a streaming data analytics application in a small network:
sudo python3 main.py use-cases/app-testing/document-analytics/input.graphml
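Before running an application, it can help to look at what its input.graphml describes. The sketch below assumes only that the file is standard GraphML (XML in the http://graphml.graphdrawing.org/xmlns namespace) and lists its nodes and edges with the standard library. The embedded sample topology and the helper name summarize_graphml are hypothetical; the stream2gym-specific attributes attached to nodes and edges are not shown here and should be taken from the files under use-cases/.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical topology for illustration only; real inputs live under use-cases/.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="h1"/>
    <node id="s1"/>
    <node id="h2"/>
    <edge source="h1" target="s1"/>
    <edge source="s1" target="h2"/>
  </graph>
</graphml>
"""


def summarize_graphml(text):
    """Return (node ids, (source, target) pairs) from a GraphML document."""
    root = ET.fromstring(text)
    nodes = [n.get("id") for n in root.iterfind(".//g:node", GRAPHML_NS)]
    edges = [
        (e.get("source"), e.get("target"))
        for e in root.iterfind(".//g:edge", GRAPHML_NS)
    ]
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = summarize_graphml(SAMPLE)
    print("nodes:", nodes)
    print("edges:", edges)
```

To inspect a file on disk, read its contents and pass them to summarize_graphml. Since Networkx is already a dependency, networkx.read_graphml is an alternative that returns a graph object.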
- Automatically log production and consumption history and metrics of interest (e.g., bandwidth consumption) for STANDARD producers and consumers. Look over the logs in the logs/output/ directory once the simulation ends.
- Set a duration for the simulation (note: this is the time the workload will run, not the total simulation time):
sudo python3 main.py use-cases/disconnection/military-coordination/input.graphml --time 300
- Capture the traffic of all the hosts while testing your application.
sudo python3 main.py use-cases/disconnection/military-coordination/input.graphml --capture-all
- Run the event streaming and stream processing engines jointly or individually. The default setup runs event streaming (Apache Kafka) and the stream processing engine (Apache Spark) as a sequential pipeline.
sudo python3 main.py use-cases/reproducibility/input.graphml --only-spark 1
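When scripting repeated runs, the options shown above can be assembled programmatically. The sketch below is not part of stream2gym: build_command is a hypothetical helper that uses only the flags appearing in this README (--time, --capture-all, --only-spark). Whether these flags can be combined in a single run is an assumption; check sudo python3 main.py -h for the supported combinations.

```python
import shlex


def build_command(topology, time=None, capture_all=False, only_spark=None):
    """Assemble a main.py invocation from the options shown in this README."""
    cmd = ["sudo", "python3", "main.py", topology]
    if time is not None:
        cmd += ["--time", str(time)]  # workload duration, not total simulation time
    if capture_all:
        cmd.append("--capture-all")  # capture traffic of all hosts
    if only_spark is not None:
        cmd += ["--only-spark", str(only_spark)]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "use-cases/disconnection/military-coordination/input.graphml",
        time=300,
    )
    # Print the command instead of running it; pass cmd to subprocess.run to execute.
    print(shlex.join(cmd))
```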
- Explore the configuration parameters supported by stream2gym in documentation/config-parameters.pdf. Set up the parameters as you need and quickly test your prototype in a distributed emulated environment.
If you find our work relevant to your research, please consider citing:
@INPROCEEDINGS{10272479,
author={Amin Ifath, Md. Monzurul and Neves, Miguel and Haque, Israat},
booktitle={2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS)},
title={Fast Prototyping of Distributed Stream Processing Applications with stream2gym},
year={2023},
volume={},
number={},
pages={395-405},
keywords={Portable computers;Visual analytics;Emulation;Telecommunication traffic;Production;Hardware;Reproducibility of results},
doi={10.1109/ICDCS57875.2023.00034}
}
Md. Monzurul Amin Ifath ([email protected])