More info: Spark is perfectly fine with running multiple jobs and stages concurrently. By doing so, we can keep the executor allocation rate high.
So in the presence of multiple 'save_result' nodes, this could really help improve overall performance.
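As a minimal PySpark sketch of that (not the backend's actual code; the RDDs and output paths are made up), two jobs submitted from separate driver threads will have their stages scheduled concurrently, optionally with spark.scheduler.mode=FAIR for fair sharing between them:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concurrent-save-results").getOrCreate()
sc = spark.sparkContext

# Two independent result RDDs; in the real process graph these would come from
# separate save_result branches. They are defined on the main (driver) thread.
rdd_a = sc.parallelize(range(1000)).map(lambda x: (x, x * x))
rdd_b = sc.parallelize(range(1000)).map(lambda x: (x, x + 1))

def save(rdd, path):
    # Each action triggers its own Spark job; jobs submitted from different
    # driver threads can run their stages concurrently.
    rdd.saveAsTextFile(path)
    return path

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(save, rdd_a, "/tmp/result_a"),  # hypothetical output paths
        pool.submit(save, rdd_b, "/tmp/result_b"),
    ]
    for f in futures:
        print("finished:", f.result())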
A quick attempt to distribute the SaveResult nodes across multiple Python threads failed when trying to serialize the RDDs inside them:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
_pickle.PicklingError: Could not pickle the task to send it to the workers.
It might be necessary to run ProcessGraphDeserializer.evaluate in each thread as well.
Or find a way to start the write_assets tasks all at once from the main thread, and then poll for the results on the main thread as well.
Would something like collectAsync help in this case:
val value = rdd.collect()        // blocks: the job runs and the elements are copied to the Spark driver
val future = rdd.collectAsync()  // returns a FutureAction right away; the job runs in the background
future.get()                     // blocks until the job finishes and returns the collected elements on the driver
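collectAsync is part of the Scala RDD API (AsyncRDDActions) and, as far as I know, is not exposed in PySpark. On the Python side, the "start all write_assets tasks at once, then poll from the main thread" idea could look roughly like the sketch below; the write_assets method on the nodes and the output layout are assumptions, not the actual API:

from concurrent.futures import ThreadPoolExecutor, as_completed

def write_all_assets(save_result_nodes, output_dir):
    # save_result_nodes: the SaveResult nodes of the process graph, assumed here
    # to expose a write_assets(path) method; the real signature may differ.
    with ThreadPoolExecutor(max_workers=len(save_result_nodes)) as pool:
        # Kick off all write_assets tasks at once from the main thread ...
        futures = {
            pool.submit(node.write_assets, f"{output_dir}/result_{i}"): node
            for i, node in enumerate(save_result_nodes)
        }
        # ... and poll for their results on the main thread as well.
        results = []
        for future in as_completed(futures):
            node = futures[future]
            results.append((node, future.result()))  # result() re-raises any failure from that job
        return results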
Example Job: j-25011314204947baa2057fa6f64bae8b