
Support Parallel save result with export_workspace. #995

Open
HansVRP opened this issue Jan 14, 2025 · 4 comments
Comments

@HansVRP

HansVRP commented Jan 14, 2025

Example Job: j-25011314204947baa2057fa6f64bae8b

More info: Spark is perfectly fine with running multiple jobs and stages concurrently. By doing so, we can keep the executor allocation rate high.
So in the presence of multiple 'save_result' nodes, running them in parallel could really help to improve overall performance.
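The idea can be sketched with a driver-side thread pool: submit every save_result branch at once and let Spark interleave the jobs. This is a minimal sketch, not backend code; `save_result` below is a hypothetical stand-in for whatever triggers one branch's Spark job.

```python
from concurrent.futures import ThreadPoolExecutor

def save_result(node_id: str) -> str:
    # Hypothetical stand-in for triggering one 'save_result' node; in the
    # real backend this would launch the Spark job that writes the assets.
    return f"asset-{node_id}"

# Submitting all branches at once lets the Spark scheduler interleave their
# jobs and keep executor allocation high, instead of running the branches
# strictly one after another.
save_nodes = ["save1", "save2", "save3"]
with ThreadPoolExecutor(max_workers=len(save_nodes)) as pool:
    assets = list(pool.map(save_result, save_nodes))
```

Because all threads live on the driver, each one may legally touch the SparkContext when submitting its job.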

@JeroenVerstraelen
Contributor

TODO: Estimate the amount of work this issue will require

@JeroenVerstraelen
Contributor

LCFM also benefits from this issue

@EmileSonneveld
Contributor

A quick attempt to distribute the SaveResult nodes to multiple Python threads gave an error when trying to serialize the RDDs inside it:
RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

_pickle.PicklingError: Could not pickle the task to send it to the workers.

It might be necessary to run the ProcessGraphDeserializer.evaluate for each thread too.
Or find a way to start all the write_assets tasks at once from the main thread, and then poll for the results on the main thread as well.
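The PicklingError can be reproduced in miniature without Spark: an RDD keeps a reference back to its SparkContext, and the context holds live resources that pickle refuses to serialize. The classes below are illustrative stand-ins, not Spark code.

```python
import pickle
import threading

class FakeSparkContext:
    """Stand-in: the real SparkContext holds live resources
    (sockets, locks) that cannot be serialized."""
    def __init__(self):
        self._lock = threading.Lock()

class FakeRDD:
    """Stand-in: an RDD keeps a reference back to its SparkContext."""
    def __init__(self, ctx):
        self.ctx = ctx

rdd = FakeRDD(FakeSparkContext())
try:
    pickle.dumps(rdd)  # same failure mode as the PicklingError above
    pickled_ok = True
except TypeError:
    pickled_ok = False  # cannot pickle '_thread.lock' object
```

This is why anything that closes over an RDD must stay on the driver: only the driver-side threads can hold these objects without serializing them.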

@JeroenVerstraelen
Contributor

Would something like collectAsync help in this case:

val value = rdd.collect() // RDD elements will be copied to the Spark driver
val future = rdd.collectAsync() // no copy here
future.get() // now the RDD elements will be copied to the Spark driver
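On the Python side, the same non-blocking pattern can be emulated with a driver-side thread, since PySpark does not expose the async action API directly. This is a sketch under that assumption; `fake_collect` stands in for `rdd.collect()`.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_collect():
    # Stand-in for rdd.collect(); the real call would block while Spark
    # runs the job and copies the elements back to the driver.
    return [1, 2, 3]

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_collect)  # submitted; call returns immediately
    # ...the main thread is free to submit other save_result jobs here...
    elements = future.result()          # block only when the output is needed
```

The submit/result split mirrors collectAsync/get: all jobs can be launched up front, and the driver polls for their results afterwards.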
