High CPU utilization with all query operators/stages GPU based. #11963
-
Hello, I have a general question about CPU utilization. It's unlikely to be a bug, just behavior I don't fully understand. I'm running benchmarks on synthetic data and I see surprisingly high CPU utilization. I was expecting that when all query parts are executed on the GPU, CPU utilization would be relatively low, since all the CPU has to do is fetch data from other workers during the shuffle. Instead I see 80-90% utilization for most of the query's runtime. When I compare this against a CPU-based instance that I run separately, I also get 80-90% utilization there, which makes sense I guess, since the CPUs do all the work in that case.

Setup: 1 master; master: …; worker: …
Logic: …
Data: S3
Schema: …
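Roughly, the query shape is something like the sketch below (simplified; the column names, window spec, and paths here are placeholders, not the exact benchmark code):

```scala
// Simplified sketch of the benchmark query shape: parquet read ->
// dropDuplicates -> window -> write. Column names, the window spec,
// and S3 paths are placeholders.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.read.parquet("s3a://<bucket>/synthetic-input/")   // wide table with many string columns

val deduped = df.dropDuplicates("key")                           // deduplication

val w = Window.partitionBy("key").orderBy("ts")                  // window over the deduplicated data
val result = deduped.withColumn("rn", row_number().over(w))

result.write.mode("overwrite").parquet("s3a://<bucket>/output/")
```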
-
It could be a number of things causing this. We would need to do some profiling to really find out, and I am happy to do some for you; your use case is simple enough that I should be able to reproduce it locally. Be aware that we try to use the GPU for things the GPU is good at and still use the CPU for things it is good at. The CPU still wins at compression and decompression when you have lots of cores, so that is my guess, but I would have to run something to really see.
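If you want to see which CPU-side compression settings your run is actually using, a quick check from the spark-shell (these are standard Spark configs; the values shown are just the usual defaults):

```scala
// Check the CPU-side shuffle compression settings currently in effect
// (standard Spark configs; "true" and "lz4" are the usual defaults).
spark.conf.get("spark.shuffle.compress", "true")    // whether shuffle blocks are compressed
spark.conf.get("spark.io.compression.codec", "lz4") // codec used for shuffle/spill compression
```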
-
Thanks!
-
Setup scripts with all the versions, plus the CPU & GPU spark-shell run commands:
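Since the exact commands are in the attachment rather than inlined, a minimal sketch of what a GPU-run spark-shell invocation with the RAPIDS Accelerator typically looks like; the jar version, master URL, and resource amounts below are placeholders, not the attached benchmark settings.

```bash
# Sketch of a GPU-run spark-shell launch (illustrative only; paths, version,
# and resource amounts are placeholders).
spark-shell \
  --master spark://<master-host>:7077 \
  --jars rapids-4-spark_2.12-24.12.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.083
```

The CPU run is the same launch without the `--jars` and the `spark.plugins` / `spark.rapids.*` settings.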
-
I first wanted to verify that I got similar results, because I was running in local mode with 12 CPU cores and 1 GPU instead of your 4 GPUs with 12 CPU cores each. It is just a lot simpler to profile things in local mode.

The query has three stages. The first stage reads in the parquet data and does a partial aggregation to drop the duplicates. The second stage finishes the deduplication and repartitions the data so that the window operation can happen. The last stage sorts the data, does the window operation, and writes the results out.

For the first stage about 9 CPU cores were fully utilized the entire time. For the second stage I saw about 10 CPU cores being fully utilized. The final stage only had about 3.5 CPU cores being utilized. So yes, this does look like there is a lot of CPU being used, more than I would want/expect. I did some very simple hprof profiling (…)
It looks like just about all of the slowness is related to shuffle, and most of that comes from shuffle serialization. We know that this is an issue and have been working on improving it. It is still a WIP, but you should hopefully start to see some improvements in 25.02. On my setup I see about 105 seconds to run the query, just FYI.
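For reference, a minimal sketch of one way to reproduce that kind of simple CPU profiling in local mode; this assumes a Java 8 JVM (the built-in hprof agent was removed in JDK 9+), and `local[12]` and the sampling options are placeholders.

```bash
# Rough sketch: sample-based CPU profiling of a local-mode run with the JDK 8
# hprof agent (writes java.hprof.txt to the working directory when the JVM exits).
spark-shell \
  --master local[12] \
  --driver-java-options "-agentlib:hprof=cpu=samples,interval=10,depth=30"
```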
-
https://github.com/NVIDIA/spark-rapids/pulls?q=is%3Apr+kudo+is%3Aclosed is a list of closed PRs that have gone in already. Some of the changes went into 24.12, but it is still experimental and off by default. You can turn it on by setting …

The main point of the improvements was to try and reduce the size and overhead of serializing small batches. The original format that we used was self-describing: you could deserialize the serialized data without any extra knowledge. For shuffle batches with very few rows this resulted in a lot of extra data being written out. The new format requires that you know the schema of the data before you can deserialize it, but that is fine because Spark knows the schema of the data it is going to read. With only 200 shuffle partitions it should not be a big deal, except that you have 100 string columns. Along with this there have been some improvements in how we serialize variable-length data, like strings.

We are also working on having the option to move some or all of this processing to the GPU. The goal would be to use the CPU if the GPU is busy, but the GPU if it is more idle. But don't expect any miracles just yet; this only drops the runtime on my setup to about 100 seconds currently.
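For reference, the relevant switch is believed to be the one below; the name is assumed from the kudo PRs linked above, so double-check it against the configs documentation for your release.

```scala
// Assumed config name for the experimental kudo shuffle serializer (off by
// default in 24.12); it may need to be passed as --conf at startup rather
// than set at runtime.
spark.conf.set("spark.rapids.shuffle.kudo.serializer.enabled", "true")
```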