-
Notifications
You must be signed in to change notification settings - Fork 100
Home
Apache Spark community is working on storing shuffle data on remote storage (SPARK-25299, Discussion Document). This Remote Shuffle Service demonstrates one implementation by streaming shuffle data to shuffle servers which run on separate machines.
There are multiple options to implement a shuffle service storing data on remote storage. Following are some examples (not comprehensive list of all possible solutions):
-
Streaming Based Remote Shuffle Service: shuffle clients send shuffle records to remote shuffle service in a streaming style, and shuffle service writes the records to different partition files based on the records’ partition id.
-
Async Shuffle File Upload: shuffle clients write shuffle records to local disk first, then upload/merge shuffle files to remote storage asynchronously.
This project focuses on the first option to illustrate how to design and implement a streaming based remote shuffle service. See this High Level Design Doc for a quick overview.