I want to test the execution of multiple concurrent tasks on the GPU in Spark RAPIDS ML. I'm using K-means (code below), and I'm partitioning the dataset with `repartition(n_partitions)` so that multiple tasks are created for the `fit` method and run on the GPU.

```python
# example_kmeans_spark_rapids_ml.py
from pyspark.sql import SparkSession
from spark_rapids_ml.clustering import KMeans
import time
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder \
    .appName("Example") \
    .getOrCreate()

n_rows = 12500
n_cols = 100
n_clusters_data = 100
np_data = np.random.rand(n_rows, n_cols)
pd_data = pd.DataFrame(np_data, columns=[f"feature_{i}" for i in range(n_cols)])
df = spark.createDataFrame(pd_data)

# number of partitions (assign a partition per task)
df = df.repartition(8)

feature_columns = [f"feature_{i}" for i in range(n_cols)]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
assembled_df = assembler.transform(df)

kmeans = KMeans(k=n_clusters_data, maxIter=5, seed=1, featuresCol="features")
start_time = time.time()
model = kmeans.fit(assembled_df)
print(f"Fit took: {time.time() - start_time} sec")
spark.stop()
```

Basically, I have the following testing cases:

- Case 1: 1 task on a GPU
- Case 2: 8 tasks on a GPU

I measured the execution time of the `fit` method, and Case 1 is faster than Case 2 for a toy dataset. However, I only see 1 task in the Spark UI in both cases (see images below). I was expecting to see 8 tasks running in parallel in Case 2.

[Spark UI screenshots]

Could anyone help me interpret these results? Why am I not seeing multiple tasks for Case 2? Am I doing something wrong?

Job submission (example for Case 2 -- 8 tasks):

My setup:
Thank you for sharing the experimental results. The data is presented very clearly. Allow me to offer an interpretation.

Given that there is only one physical GPU in your cluster, it's expected that there would be a single GPU task. Internally, Spark RAPIDS ML repartitions the input Spark DataFrame into a number of partitions equal to the number of available GPUs before invoking cuML and NCCL for the GPU computation. cuML and NCCL currently operate under a one-process-per-GPU model, so repartitioning occurs whenever the number of input partitions does not match the number of available GPUs.

As for Case 2, the runtime is slightly slower than Case 1. This suggests that the overhead of repartitioning (from 8 partitions back down to 1) outweighs any gain from the requested parallelism on this particular dataset and workload.
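A minimal sketch of the behavior described above (the function name and return shape are illustrative, not the actual `spark_rapids_ml` internals): before the cuML/NCCL call, the estimator aligns the partition count with the number of available GPUs, repartitioning if they differ.

```python
# Illustrative sketch, not the real spark_rapids_ml implementation:
# cuML/NCCL run one process per GPU, so the GPU stage always uses
# exactly num_gpus partitions, paying a shuffle when the input differs.

def plan_fit_partitions(input_partitions: int, num_gpus: int) -> tuple[int, bool]:
    """Return (partitions used for the GPU stage, whether a repartition occurs)."""
    if input_partitions == num_gpus:
        return input_partitions, False
    # Mismatch: repartition down (or up) to one partition per GPU,
    # which adds shuffle overhead before the actual fit.
    return num_gpus, True

# Case 1: 1 input partition, 1 GPU -> no extra shuffle
print(plan_fit_partitions(1, 1))  # (1, False)
# Case 2: df.repartition(8) on a 1-GPU cluster -> shuffled back to 1
print(plan_fit_partitions(8, 1))  # (1, True)
```

This is why Case 2 shows a single GPU task in the Spark UI and runs slightly slower: the 8 requested partitions are merged back to 1 before the fit even starts.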