
Driver Pod Creation Delay When Submitting Spark Jobs with Kubeflow Spark Operator #2374

ramsinghtmdc opened this issue Dec 30, 2024 · 7 comments
@ramsinghtmdc

What question do you want to ask?

  • ✋ When I submit a Spark job using the Kubeflow Spark Operator (installed via the Helm chart), there is a significant delay before the driver pod is created. This impacts the overall job execution time and can lead to inefficient resource utilization.

Thanks.

Additional context

No response

Have the same question?

Give it a 👍. We prioritize the questions with the most 👍

@nitishtw commented Jan 7, 2025

We are also facing a similar issue: the SparkApplication waits around 10 minutes to be scheduled via the Spark Operator.

@hongshaoyang (Contributor)

Are there any custom Helm values?

@ChenYi015 (Contributor)

@ramsinghtmdc How many Spark jobs are submitted at the same time?

@bnetzi commented Jan 9, 2025

I suggest configuring proper values.

For a large, consistent operator we are using:

Spark operator controller pod:
  31 vCPUs

Worker queue config:
  bucketQPS: '1000'
  bucketSize: '2000'

Controller config:
  workers: 100
  maxTrackedExecutorPerApp: '1'
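
For context, a minimal sketch of how these settings might map onto the chart's values.yaml. The exact key paths (e.g. controller.workqueueRateLimiter, controller.resources) are assumptions based on the v2 chart layout, so verify them against the chart version you deploy:

  controller:
    # More reconcile workers let the controller process more SparkApplications in parallel.
    workers: 100
    # Track only one executor pod per app to cut down on status-update overhead.
    maxTrackedExecutorPerApp: 1
    # Token-bucket rate limiter for the controller work queue.
    workqueueRateLimiter:
      bucketQPS: 1000
      bucketSize: 2000
    # Resource requests for the controller pod itself.
    resources:
      requests:
        cpu: "31"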

@nitishtw commented Jan 10, 2025

@ChenYi015 More detailed documentation about these properties/values and their use cases would be beneficial. While the values file includes some one-liners, it takes time to understand which combination of worker configurations aligns with specific node group sizes. Clearer guidance would save a lot of effort!

@bnetzi The default value of maxTrackedExecutorPerApp in the values file is set to 1000. Could you confirm if it's ok to override this with 1?

# -- Specifies the maximum number of Executor pods that can be tracked by the controller per SparkApplication.
  maxTrackedExecutorPerApp: 1000

What is this property used for? Since every driver will be launching multiple executors, why would we track only one?
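
For reference, overriding this could be a small values file applied at upgrade time; a rough sketch (the file name is a placeholder and the key path should be checked against your chart version):

  # my-overrides.yaml (placeholder file name), applied with: helm upgrade ... -f my-overrides.yaml
  controller:
    # Override the default of 1000 to track a single executor per SparkApplication.
    maxTrackedExecutorPerApp: 1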

@bnetzi commented Jan 10, 2025

@nitishtw - As far as I understand, the only benefit of tracking executors is that their status is also reported in the SparkApplication object. For our needs, 1 is sufficient to easily spot cases where the executors failed to start; anything more is just noise.

@nitishtw

@bnetzi I tried the same configuration you mentioned above; my Spark operator is running on a c7i.8xlarge machine [32 vCPU, 64 GiB memory].

All jobs go straight to the SUBMISSION_FAILED state; with the default configs, however, the jobs run fine.

The Spark operator logs show that it failed to download the jar dependencies:

:: problems summary ::
:::: WARNINGS
  [FAILED ] org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1!hadoop-shaded-guava.jar: Downloaded file size (0) doesn't match expected Content Length (3362359) for https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar. Please retry. (1984ms)
  [FAILED ] org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1!hadoop-shaded-guava.jar: Downloaded file size (0) doesn't match expected Content Length (3362359) for https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar. Please retry. (1984ms)
  ==== central: tried
    https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar
  ::::::::::::::::::::::::::::::::::::::::::::::
  :: FAILED DOWNLOADS ::
  :: ^ see resolution messages for details ^ ::
  ::::::::::::::::::::::::::::::::::::::::::::::
  :: org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1!hadoop-shaded-guava.jar
  ::::::::::::::::::::::::::::::::::::::::::::::

Is this expected? Is some kind of throttling happening while fetching from the Maven Central repo?
We are already thinking of moving all our dependencies to an S3 bucket, but could you help determine whether this is happening because of the worker config change?
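
For the S3 approach mentioned above, a rough sketch of a SparkApplication that pins the dependency from a bucket instead of resolving it from Maven Central. The bucket name, image, paths, and service account are placeholders, and reading s3a:// paths also requires the hadoop-aws and AWS SDK jars in the image:

  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: my-spark-job            # placeholder
    namespace: spark-jobs         # placeholder
  spec:
    type: Scala
    mode: cluster
    sparkVersion: 3.5.0
    image: my-registry/spark:3.5.0                        # placeholder image with hadoop-aws + AWS SDK baked in
    mainApplicationFile: s3a://my-bucket/jars/my-job.jar  # placeholder path
    deps:
      jars:
        # Pre-staged dependency; avoids resolving from Maven Central at submission time.
        - s3a://my-bucket/jars/hadoop-shaded-guava-1.1.1.jar   # placeholder path
    driver:
      cores: 1
      memory: 2g
      serviceAccount: spark       # placeholder
    executor:
      instances: 2
      cores: 1
      memory: 2g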
