I want to test the execution of multiple concurrent tasks on the GPU in Spark RAPIDS ML. I'm using K-means (code below), and I'm partitioning the dataset with `repartition(n_partitions)` so that multiple tasks are created for the `fit` method and run on the GPU.

```python
# example_kmeans_spark_rapids_ml.py
from pyspark.sql import SparkSession
from spark_rapids_ml.clustering import KMeans
import time
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder \
    .appName("Example") \
    .getOrCreate()

n_rows = 12500
n_cols = 100
n_clusters_data = 100
np_data = np.random.rand(n_rows, n_cols)
pd_data = pd.DataFrame(np_data, columns=[f"feature_{i}" for i in range(n_cols)])
df = spark.createDataFrame(pd_data)

# number of partitions (assign a partition per task)
df = df.repartition(8)

feature_columns = [f"feature_{i}" for i in range(n_cols)]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
assembled_df = assembler.transform(df)

kmeans = KMeans(k=n_clusters_data, maxIter=5, seed=1, featuresCol="features")
start_time = time.time()
model = kmeans.fit(assembled_df)
print(f"Fit took: {time.time() - start_time} sec")
spark.stop()
```

Basically, I have the following testing cases:

- Case 1: 1 task on a GPU
- Case 2: 8 tasks on a GPU

I measured the execution time of the `fit` method, and Case 1 is faster than Case 2 for a toy dataset. However, I only see 1 task in the Spark UI in both cases (see images below). I was expecting to see 8 tasks running in parallel in Case 2.

[Spark UI screenshots]

Could anyone help me interpret these results? Why am I not seeing multiple tasks for Case 2? Am I doing something wrong?

Job submission (example for Case 2 -- 8 tasks):

My setup:
Thank you for sharing the experimental results. The data is presented very clearly. Allow me to offer an interpretation.

Given that there is only one physical GPU in your cluster, it's expected that there would be a single GPU task. Internally, Spark RAPIDS ML repartitions the input Spark DataFrame into a number of partitions equal to the number of available GPUs before invoking cuML and NCCL for the GPU computation. cuML and NCCL currently operate under a one-process-per-GPU model, so repartitioning occurs whenever the number of input partitions does not match the number of available GPUs.

As for Case 2, the runtime is slightly slower than Case 1. This suggests that the overhead of repartitioning (from 8 partitions back down to 1) outweighs any gain from the requested parallelism on this particular dataset and workload.
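A minimal sketch of the behavior described above (the function name and return shape are illustrative, not the actual `spark_rapids_ml` internals): before the cuML/NCCL call, the estimator aligns the partition count with the number of available GPUs, repartitioning if they differ.

```python
# Illustrative sketch, not the real spark_rapids_ml implementation:
# cuML/NCCL run one process per GPU, so the GPU stage always uses
# exactly num_gpus partitions, paying a shuffle when the input differs.

def plan_fit_partitions(input_partitions: int, num_gpus: int) -> tuple[int, bool]:
    """Return (partitions used for the GPU stage, whether a repartition occurs)."""
    if input_partitions == num_gpus:
        return input_partitions, False
    # Mismatch: repartition down (or up) to one partition per GPU,
    # which adds shuffle overhead before the actual fit.
    return num_gpus, True

# Case 1: 1 input partition, 1 GPU -> no extra shuffle
print(plan_fit_partitions(1, 1))  # (1, False)
# Case 2: df.repartition(8) on a 1-GPU cluster -> shuffled back to 1
print(plan_fit_partitions(8, 1))  # (1, True)
```

This is why Case 2 shows a single GPU task in the Spark UI and runs slightly slower: the 8 requested partitions are merged back to 1 before the fit even starts.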