Spark On Yarn Cluster

A procedure to create a YARN cluster based on Docker, run Spark, and run a TPC-DS performance test.


This repo was used in the following publication:

M. M. Aseman-Manzar, S. Karimian-Aliabadi, R. Entezari-Maleki, B. Egger and A. Movaghar, "Cost-Aware Resource Recommendation for DAG-Based Big Data Workflows: An Apache Spark Case Study," in IEEE Transactions on Services Computing, vol. 16, no. 3, pp. 1726-1737, 1 May-June 2023, doi: 10.1109/TSC.2022.3203010. https://ieeexplore.ieee.org/abstract/document/9894699

I also have another repo that is helpful for parsing the Spark history logs:

https://github.com/mohsenasm/Python-Spark-Log-Parser

1. Test Yarn+HDFS with Wordcount

  1. First, start the cluster and open a shell in the resourcemanager container (a scripted version of this whole section follows the list):
    docker-compose -f hadoop-docker-compose.yml up -d && docker-compose -f hadoop-docker-compose.yml exec resourcemanager bash
    1. Then copy a sample file for the wordcount application:
      hdfs dfs -mkdir -p /in/ && hdfs dfs -copyFromLocal /opt/hadoop-3.1.1/README.txt /in/
    2. Run the wordcount application on the cluster:
      yarn jar /opt/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /in /out
    3. See the output with:
      hdfs dfs -cat /out/* and check the Hadoop history server at http://localhost:8188
  2. Remove the cluster:
    docker-compose -f hadoop-docker-compose.yml down -v
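The steps above can also be scripted end to end without an interactive shell. A minimal sketch, assuming the resourcemanager service name and Hadoop paths from the compose file above (the startup sleep may need adjusting for your machine):

  # Start the cluster and give HDFS/YARN some time to come up.
  docker-compose -f hadoop-docker-compose.yml up -d
  sleep 30
  # Run the whole wordcount test inside the resourcemanager container.
  docker-compose -f hadoop-docker-compose.yml exec -T resourcemanager bash -c '
    hdfs dfs -mkdir -p /in/ &&
    hdfs dfs -copyFromLocal /opt/hadoop-3.1.1/README.txt /in/ &&
    yarn jar /opt/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /in /out &&
    hdfs dfs -cat /out/*
  '
  # Tear everything down again.
  docker-compose -f hadoop-docker-compose.yml down -v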

2. Test Spark+Yarn in cluster/client mode with SparkPi

  1. First run the cluster:
    docker-compose -f spark-client-docker-compose.yml up -d --build
  2. Then go into the spark container:
    docker-compose -f spark-client-docker-compose.yml run -p 18080:18080 spark-client bash
    1. Start the history server:
      setup-history-server.sh
    2. Run the SparkPi application on the YARN cluster:
      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples*.jar 3
      and see the run history at http://localhost:18080
    3. (Optional) Run the SparkPi application in client mode on the YARN cluster:
      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client examples/jars/spark-examples*.jar 3
      and see the run history at http://localhost:18080
  3. Remove the cluster:
    docker-compose -f spark-client-docker-compose.yml down -v
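The SparkPi runs above use the default resource allocation. To control how much of the cluster the job takes, the standard spark-submit resource flags (the same ones passed through ADDITIONAL_SPARK_CONFIG in section 5) can be added. A minimal sketch with illustrative values:

  # Cluster-mode SparkPi pinned to 2 executors with 1 core and 1 GB of memory each.
  ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    --num-executors 2 --executor-cores 1 --executor-memory 1G \
    examples/jars/spark-examples*.jar 3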

3. Run TPC-DS on Spark+Yarn

  1. First run the cluster:
    docker-compose -f spark-client-with-tpcds-docker-compose.yml up -d --build
  2. Then go into the tpc-ds container:
    docker-compose -f spark-client-with-tpcds-docker-compose.yml run tpc-ds /run.sh bash
    1. Generate the data at scale factor 1: /run.sh gen_data 1
    2. Copy the queries: /run.sh copy_queries
    3. Generate the DDL: /run.sh gen_ddl 1 for Parquet tables, or /run.sh gen_ddl_csv 1 for CSV tables
  3. Then go into the spark container:
    docker-compose -f spark-client-with-tpcds-docker-compose.yml run -p 18080:18080 spark-client bash
    1. Start the history server:
      setup-history-server.sh
    2. Copy the data to HDFS (the source path is local to the container, the destination is on HDFS):
      hdfs dfs -mkdir -p /tpc-ds-files/data/parquet_1 && hdfs dfs -copyFromLocal /tpc-ds-files/data/csv_1 /tpc-ds-files/data/csv_1
    3. Create Parquet tables (for steps 4, 5, 6 and 10; a quick sanity check for this step follows the list):
      spark-sql --master yarn --deploy-mode client -f /tpc-ds-files/ddl/tpcds_1.sql --name create_db_scale_1
    4. (Optional) Run a sample query:
      spark-submit --master yarn --deploy-mode client /root/scripts/query.py -s 1 -q 'SELECT * from (SELECT count(*) from store_returns)' --name 'query for test database creation'
    5. (Client Mode) Run a TPC-DS query from the pre-generated queries with spark-submit (a loop over all queries also follows the list):
      spark-submit --master yarn --deploy-mode client /root/scripts/query.py -s 1 -lf /tpc-ds-files/pre_generated_queries/query5.sql --name query5_client
    6. (Client Mode + spark-sql) Run a TPC-DS query from the pre-generated queries with spark-sql:
      spark-sql --master yarn --deploy-mode client --conf spark.sql.crossJoin.enabled=true -database scale_1 -f /tpc-ds-files/pre_generated_queries/query26.sql --name query26_client
    7. Create csv tables (for step 8):
      spark-sql --master yarn --deploy-mode client -f /tpc-ds-files/ddl/tpcds_1_csv.sql --name create_db_scale_1_csv
    8. (Client Mode + spark-sql + CSV database) Run a TPC-DS query from the pre-generated queries with spark-sql:
      spark-sql --master yarn --deploy-mode client --conf spark.sql.crossJoin.enabled=true -database scale_1_csv -f /tpc-ds-files/pre_generated_queries/query26.sql --name query26_csv_client
    9. Copy the TPC-DS pre-generated queries to HDFS (needed for the cluster-mode run in step 10):
      hdfs dfs -mkdir -p /tpc-ds-files/pre_generated_queries && hdfs dfs -copyFromLocal /tpc-ds-files/pre_generated_queries /tpc-ds-files/
    10. (Cluster Mode) Run a TPC-DS query from pre-generated queries with spark-submit:
      spark-submit --master yarn --deploy-mode cluster /root/scripts/query.py -s 1 -hf /tpc-ds-files/pre_generated_queries/query40.sql -hf /tpc-ds-files/pre_generated_queries/query52.sql --name query40_and_query52_cluster
  4. Remove the cluster:
    docker-compose -f spark-client-with-tpcds-docker-compose.yml down -v
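To sanity-check the table-creation steps (3 and 7), the created tables can be listed with spark-sql. A minimal sketch, assuming the scale_1 database name produced by the DDL above:

  # List the TPC-DS tables created in the scale-1 database.
  spark-sql --master yarn --deploy-mode client -database scale_1 -e 'SHOW TABLES;'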
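Steps 5 and 10 run one query at a time. To benchmark every pre-generated query in client mode, a simple shell loop over the query files works; a minimal sketch, assuming the query.py flags shown above:

  # Run each pre-generated TPC-DS query, naming every run after its
  # query file so it is easy to find in the history server.
  for q in /tpc-ds-files/pre_generated_queries/*.sql; do
    name=$(basename "$q" .sql)
    spark-submit --master yarn --deploy-mode client \
      /root/scripts/query.py -s 1 -lf "$q" --name "${name}_client"
  done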

4. Run Multiple Samples of TPC-DS on Spark+Yarn

  1. Run python3 run_tpcds.py 1 3 5 10. The Spark history logs will then be available at hdfs:///spark-history and under ./output/spark-history on the host.
  2. Remove the cluster:
    docker-compose -f spark-client-with-tpcds-docker-compose.yml down -v
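These collected event logs are what the log-parser repo mentioned at the top consumes. A minimal sketch for inspecting them, assuming the spark-client service name from the compose file above:

  # Inspect the collected history on the host ...
  ls -lh ./output/spark-history
  # ... or read it straight from HDFS inside the cluster.
  docker-compose -f spark-client-with-tpcds-docker-compose.yml run spark-client \
    hdfs dfs -ls /spark-history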

5. Run Multiple Samples of TPC-DS on Spark+Yarn in Swarm Cluster

  1. Change directory to the swarm directory in the root of the project.
  2. Preparations (a worked example of these commands follows this list):
    1. Set up the swarm manager with docker swarm init --advertise-addr <the_manager_ip_address>. This command prints a docker swarm join command; copy it.
    2. Run the join command on each worker.
    3. On the swarm manager, assign a node-id label to each node: docker node update --label-add node-id=1 node1_hostname
    4. Update the file swarm/spark-swarm-client.yml.
  3. Run the swarm cluster with docker stack deploy -c spark-swarm.yml tpcds and wait until all services shown by docker service ls are running.
  4. Run ADDITIONAL_SPARK_CONFIG="--num-executors 25 --executor-cores 1 --executor-memory 1G" USE_CSV="False" python3 run_tpcds_on_swarm.py 1 10 20 40 35 70 100 120 135 150. The Spark history logs will then be available at hdfs:///spark-history and under ./output/spark-history on the host.
  5. Remove the cluster:
    1. Remove all services: docker stack rm tpcds && docker-compose -f spark-swarm-client.yml down -v
    2. On each node:
      • wait until docker ps lists no more containers,
      • then run docker container prune && docker volume prune, confirming each prompt with y.
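As a worked example of the preparation steps above, for a hypothetical manager at 192.168.1.10 and two workers:

  # On the manager: initialize the swarm (this prints a "docker swarm join ..." command).
  docker swarm init --advertise-addr 192.168.1.10
  # On each worker: paste the printed join command, e.g.
  #   docker swarm join --token <token> 192.168.1.10:2377
  # Back on the manager: give every node a node-id label, then verify.
  docker node update --label-add node-id=1 worker1-hostname
  docker node update --label-add node-id=2 worker2-hostname
  docker node ls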

To See Progress in Swarm

docker service create --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock -p 80:8080 -e PORT=8080 --constraint 'node.role == manager' --name swarm-dashboard mohsenasm/swarm-dashboard
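The command maps port 8080 in the container to port 80 on the manager, so once the service is up the dashboard is reachable at http://<manager_ip>/. A minimal check:

  # Confirm the dashboard task is running, then open http://<manager_ip>/ in a browser.
  docker service ps swarm-dashboard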

Web Tools

  • Hadoop history server: http://localhost:8188
  • Spark history server: http://localhost:18080
