
Support Parallel save result with export_workspace. #995

Open
HansVRP opened this issue Jan 14, 2025 · 4 comments
Comments

@HansVRP

HansVRP commented Jan 14, 2025

Example Job: j-25011314204947baa2057fa6f64bae8b

More info: Spark is perfectly fine with running multiple jobs and stages concurrently. By doing so, we can keep the executor allocation rate high.
So in the presence of multiple 'save_result' nodes, running them in parallel could really help to improve overall performance.
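The idea can be sketched with a driver-side thread pool: submit every save_result branch at once and let Spark interleave the jobs. This is a minimal sketch, not backend code; `save_result` below is a hypothetical stand-in for whatever triggers one branch's Spark job.

```python
from concurrent.futures import ThreadPoolExecutor

def save_result(node_id: str) -> str:
    # Hypothetical stand-in for triggering one 'save_result' node; in the
    # real backend this would launch the Spark job that writes the assets.
    return f"asset-{node_id}"

# Submitting all branches at once lets the Spark scheduler interleave their
# jobs and keep executor allocation high, instead of running the branches
# strictly one after another.
save_nodes = ["save1", "save2", "save3"]
with ThreadPoolExecutor(max_workers=len(save_nodes)) as pool:
    assets = list(pool.map(save_result, save_nodes))
```

Because all threads live on the driver, each one may legally touch the SparkContext when submitting its job.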

@JeroenVerstraelen
Contributor

TODO: Estimate the amount of work this issue will require

@JeroenVerstraelen
Contributor

LCFM also benefits from this issue

@EmileSonneveld
Contributor

A quick attempt to distribute the SaveResult nodes to multiple Python threads gave an error when trying to serialize the RDDs inside it:
RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

_pickle.PicklingError: Could not pickle the task to send it to the workers.

It might be necessary to run the ProcessGraphDeserializer.evaluate for each thread too.
Or find a way to start all the write_assets tasks at once from the main thread, and then poll for the results on the main thread as well.
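The PicklingError can be reproduced in miniature without Spark: an RDD keeps a reference back to its SparkContext, and the context holds live resources that pickle refuses to serialize. The classes below are illustrative stand-ins, not Spark code.

```python
import pickle
import threading

class FakeSparkContext:
    """Stand-in: the real SparkContext holds live resources
    (sockets, locks) that cannot be serialized."""
    def __init__(self):
        self._lock = threading.Lock()

class FakeRDD:
    """Stand-in: an RDD keeps a reference back to its SparkContext."""
    def __init__(self, ctx):
        self.ctx = ctx

rdd = FakeRDD(FakeSparkContext())
try:
    pickle.dumps(rdd)  # same failure mode as the PicklingError above
    pickled_ok = True
except TypeError:
    pickled_ok = False  # cannot pickle '_thread.lock' object
```

This is why anything that closes over an RDD must stay on the driver: only the driver-side threads can hold these objects without serializing them.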

@JeroenVerstraelen
Contributor

Would something like collectAsync help in this case:

val value = rdd.collect() // RDD elements will be copied to the Spark driver
val future = rdd.collectAsync() // no copy here
future.get() // now the RDD elements will be copied to the Spark driver
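On the Python side, the same non-blocking pattern can be emulated with a driver-side thread, since PySpark does not expose the async action API directly. This is a sketch under that assumption; `fake_collect` stands in for `rdd.collect()`.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_collect():
    # Stand-in for rdd.collect(); the real call would block while Spark
    # runs the job and copies the elements back to the driver.
    return [1, 2, 3]

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_collect)  # submitted; call returns immediately
    # ...the main thread is free to submit other save_result jobs here...
    elements = future.result()          # block only when the output is needed
```

The submit/result split mirrors collectAsync/get: all jobs can be launched up front, and the driver polls for their results afterwards.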
