High CPU utilization with all query operators/stages GPU based. #11963
-
Hello, I have a general question about CPU utilization. It's unlikely to be a bug, just behavior I don't fully understand. I'm running benchmarks on synthetic data and I see surprisingly high CPU utilization. I was expecting that when all query parts are executed on the GPU, CPU utilization would be relatively low, since all the CPU has to do is fetch data from other workers during the shuffle. Instead I see 80-90% utilization for most of the query's runtime. When I compare this against a CPU-based instance that I run separately, I also get 80-90% utilization there, which makes sense I guess, since the CPUs do all the work in that case.

Setup: 1 master; master: …; worker: …
Logic: …
Data: S3
Schema: …
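Roughly, the query shape is something like the sketch below (simplified; the column names, window spec, and paths here are placeholders, not the exact benchmark code):

```scala
// Simplified sketch of the benchmark query shape: parquet read ->
// dropDuplicates -> window -> write. Column names, the window spec,
// and S3 paths are placeholders.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.read.parquet("s3a://<bucket>/synthetic-input/")   // wide table with many string columns

val deduped = df.dropDuplicates("key")                           // deduplication

val w = Window.partitionBy("key").orderBy("ts")                  // window over the deduplicated data
val result = deduped.withColumn("rn", row_number().over(w))

result.write.mode("overwrite").parquet("s3a://<bucket>/output/")
```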
-
It could be a number of things causing this. We would need to do some profiling to really find out, and I am happy to do some for you; your use case is simple enough that I should be able to reproduce it locally. Be aware that we try to use the GPU for things the GPU is good at and still use the CPU for things it is good at. The CPU still wins at compression and decompression when you have lots of cores, so that is my guess, but I would have to run something to really see.
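If you want to see which CPU-side compression settings your run is actually using, a quick check from the spark-shell (these are standard Spark configs; the values shown are just the usual defaults):

```scala
// Check the CPU-side shuffle compression settings currently in effect
// (standard Spark configs; "true" and "lz4" are the usual defaults).
spark.conf.get("spark.shuffle.compress", "true")    // whether shuffle blocks are compressed
spark.conf.get("spark.io.compression.codec", "lz4") // codec used for shuffle/spill compression
```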
-
Thanks!
-
Setup scripts with all the versions, plus the CPU & GPU spark-shell run commands:
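Since the exact commands are in the attachment rather than inlined, a minimal sketch of what a GPU-run spark-shell invocation with the RAPIDS Accelerator typically looks like; the jar version, master URL, and resource amounts below are placeholders, not the attached benchmark settings.

```bash
# Sketch of a GPU-run spark-shell launch (illustrative only; paths, version,
# and resource amounts are placeholders).
spark-shell \
  --master spark://<master-host>:7077 \
  --jars rapids-4-spark_2.12-24.12.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.083
```

The CPU run is the same launch without the `--jars` and the `spark.plugins` / `spark.rapids.*` settings.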
-
I first wanted to verify that I got similar results, because I was running in local mode with 12 CPU cores and 1 GPU instead of your 4 GPUs with 12 CPU cores each. It is just a lot simpler to profile things in local mode.

The query has three stages. The first stage reads in the parquet data and does a partial aggregation to drop the duplicates. The second stage finishes the deduplication and repartitions the data so that the window operation can happen. The last stage sorts the data, does the window operation, and writes the results out.

For the first stage about 9 CPU cores were fully utilized the entire time. For the second stage I saw about 10 CPU cores being fully utilized. The final stage only had about 3.5 CPU cores being utilized. So yes, this does look like there is a lot of CPU being used, more than I would want/expect. I did some very simple hprof profiling (…)
It looks like just about all of the slowness is related to shuffle, and most of that comes from shuffle serialization. We know that this is an issue and have been working on improving it. It is still a WIP, but you should hopefully start to see some improvements in 25.02. On my setup I see about 105 seconds to run the query, just FYI.
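For reference, a minimal sketch of one way to reproduce that kind of simple CPU profiling in local mode; this assumes a Java 8 JVM (the built-in hprof agent was removed in JDK 9+), and `local[12]` and the sampling options are placeholders.

```bash
# Rough sketch: sample-based CPU profiling of a local-mode run with the JDK 8
# hprof agent (writes java.hprof.txt to the working directory when the JVM exits).
spark-shell \
  --master local[12] \
  --driver-java-options "-agentlib:hprof=cpu=samples,interval=10,depth=30"
```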
-
https://github.com/NVIDIA/spark-rapids/pulls?q=is%3Apr+kudo+is%3Aclosed is a list of closed PRs that have gone in already. Some of the changes went into 24.12, but it is still experimental and off by default. You can turn it on by setting …

The main point of the improvements was to try and reduce the size and overhead of serializing small batches. The original format that we used was self-describing: you could deserialize the serialized data without any extra knowledge. For shuffle batches with very few rows this resulted in a lot of extra data being written out. The new format requires that you know the schema of the data before you can deserialize it, but that is fine because Spark knows the schema of the data it is going to read. With only 200 shuffle partitions it should not be a big deal, except that you have 100 string columns. Along with this there have been some improvements in how we serialize variable-length data, like strings.

We are also working on having the option to move some or all of this processing to the GPU. The goal would be to use the CPU if the GPU is busy, but the GPU if it is more idle. But don't expect any miracles just yet; this only drops the runtime on my setup to about 100 seconds currently.
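For reference, the relevant switch is believed to be the one below; the name is assumed from the kudo PRs linked above, so double-check it against the configs documentation for your release.

```scala
// Assumed config name for the experimental kudo shuffle serializer (off by
// default in 24.12); it may need to be passed as --conf at startup rather
// than set at runtime.
spark.conf.set("spark.rapids.shuffle.kudo.serializer.enabled", "true")
```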