Is your feature request related to a problem? Please describe.
Spark has lots of configs related to shuffle. The following configs are intended to give some control over the size of the data that each task can receive:

```
spark.sql.shuffle.partitions
spark.sql.adaptive.coalescePartitions.initialPartitionNum
spark.sql.adaptive.advisoryPartitionSizeInBytes
spark.sql.adaptive.coalescePartitions.minPartitionSize
spark.sql.adaptive.coalescePartitions.minPartitionNum
spark.sql.adaptive.coalescePartitions.parallelismFirst
```
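As a rough illustration of how these knobs fit together (the specific values below are made up for the example, not recommendations): start with many partitions, let AQE coalesce them toward an advisory size, and tell it to prefer the target size over raw parallelism.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: begin with 2000 partitions and let AQE coalesce
// them down toward ~256 MiB each, never below 16 MiB, preferring the target
// size over maximizing parallelism.
val spark = SparkSession.builder()
  .appName("shuffle-size-example")
  .config("spark.sql.shuffle.partitions", "2000")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionSize", "16m")
  .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
  .getOrCreate()
```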
Ideally we want to be able to give the GPU lots of data without worrying too much about running out of GPU memory, which can cause spilling and increase the run time of a query. But we know that in some situations spilling is going to be inevitable when we get lots of data.
So the plan is to see if we can override/augment the AQE planning that coalesces partitions. The idea is that we could look at the plan and know which nodes cache a significant amount of data on the GPU, which operators increase the size of the data on the GPU, and which might decrease it. With that we could, in theory, adjust the target shuffle size to avoid overloading GPU memory.
Please note that a lot of heuristics would need to be deployed as part of this, specifically size estimation for the various stages of the plan. #12121 would be a great addition here, but we probably need a decent set of heuristics to start with, since we cannot guarantee that those estimates will be available.
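For illustration only, a very rough version of such a heuristic might walk the physical plan, assign each operator a guessed growth factor, and scale the target shuffle size against a GPU memory budget. Everything here (the `ShuffleSizeHeuristic` name, the factors, the `gpuBudgetBytes` parameter) is a hypothetical sketch, not an existing API:

```scala
import org.apache.spark.sql.execution.{FilterExec, SparkPlan}
import org.apache.spark.sql.execution.aggregate.HashAggregateExec
import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, SortMergeJoinExec}

// Hypothetical heuristic: guess how much each operator grows or shrinks its
// input, then scale the advisory shuffle size so the post-shuffle data is
// still expected to fit within a GPU memory budget.
object ShuffleSizeHeuristic {
  // Rough per-operator growth factors; the numbers are illustrative only.
  private def growthFactor(op: SparkPlan): Double = op match {
    case _: FilterExec            => 0.5 // filters usually shrink data
    case _: HashAggregateExec     => 0.3 // aggregation typically reduces rows
    case _: SortMergeJoinExec     => 2.0 // joins can expand output
    case _: BroadcastHashJoinExec => 1.5
    case _                        => 1.0
  }

  /** Walk the plan and compound the growth factors along it. */
  def estimatedGrowth(plan: SparkPlan): Double =
    plan.collect { case op => growthFactor(op) }.product

  /** Shrink the advisory shuffle size when the plan is expected to expand data. */
  def targetShuffleBytes(plan: SparkPlan, gpuBudgetBytes: Long): Long =
    math.max(1L, (gpuBudgetBytes / estimatedGrowth(plan)).toLong)
}
```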
Besides configs, will we also think about exploiting hints in SQL (https://spark.apache.org/docs/3.5.4/sql-ref-syntax-qry-select-hints.html)? In the context of the Auto Tuner we already have real metrics from the vanilla Spark run, so it might be beneficial to modify the query by adding hints and then run it on the GPU.
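For example (the table and column names here are made up), Spark's documented partitioning hints could be injected into the query text before the GPU run, with the partition count derived from the vanilla-run metrics:

```scala
import org.apache.spark.sql.SparkSession

object HintedQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hinted-query").getOrCreate()
    // Hypothetical data so the query has something to run against.
    spark.range(1000)
      .selectExpr("id % 10 AS key", "id AS value")
      .createOrReplaceTempView("events")
    // REPARTITION is one of Spark's documented partitioning hints; here the
    // count (200) stands in for a value chosen from the vanilla-run metrics.
    val hinted = spark.sql(
      """SELECT /*+ REPARTITION(200, key) */ key, SUM(value) AS total
        |FROM events
        |GROUP BY key""".stripMargin)
    hinted.explain()
  }
}
```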