
Driver Pod Creation Delay When Submitting Spark Jobs with Kubeflow Spark Operator #2374

ramsinghtmdc opened this issue Dec 30, 2024 · 7 comments
@ramsinghtmdc

What question do you want to ask?

  • ✋ When I submit a Spark job using the Kubeflow Spark Operator (installed via the Helm chart), there is a significant delay before the driver pod is created. This impacts the overall job execution time and can lead to inefficient resource utilization.

Thanks.

Additional context

No response

Have the same question?

Give it a 👍. We prioritize the questions with the most 👍

@nitishtw commented Jan 7, 2025

We are also facing a similar issue: the SparkApplication waits around 10 minutes to be scheduled via the Spark Operator.

@hongshaoyang (Contributor)

Are there any custom Helm values?

@ChenYi015 (Contributor)

@ramsinghtmdc How many Spark jobs are submitted at the same time?

@bnetzi commented Jan 9, 2025

I suggest configuring proper values.

For a large, consistent operator we are using:

Spark operator controller pod:
  31 vCPUs

Worker queue config:
  bucketQPS: '1000'
  bucketSize: '2000'

Controller config:
  workers: 100
  maxTrackedExecutorPerApp: '1'
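
For context, a minimal sketch of how these settings might map onto the chart's values.yaml. The exact key paths (e.g. controller.workqueueRateLimiter, controller.resources) are assumptions based on the v2 chart layout, so verify them against the chart version you deploy:

  controller:
    # More reconcile workers let the controller process more SparkApplications in parallel.
    workers: 100
    # Track only one executor pod per app to cut down on status-update overhead.
    maxTrackedExecutorPerApp: 1
    # Token-bucket rate limiter for the controller work queue.
    workqueueRateLimiter:
      bucketQPS: 1000
      bucketSize: 2000
    # Resource requests for the controller pod itself.
    resources:
      requests:
        cpu: "31"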

@nitishtw commented Jan 10, 2025

@ChenYi015 More detailed documentation about these properties/values and their use cases would be beneficial. While the values file includes some one-liners, it takes time to understand which combination of worker configurations aligns with specific node group sizes. Clearer guidance would save a lot of effort!

@bnetzi The default value of maxTrackedExecutorPerApp in the values file is set to 1000. Could you confirm if it's ok to override this with 1?

# -- Specifies the maximum number of Executor pods that can be tracked by the controller per SparkApplication.
  maxTrackedExecutorPerApp: 1000

What is this property used for? Since every driver will be launching multiple executors, why would we track only one?
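
For reference, overriding this could be a small values file applied at upgrade time; a rough sketch (the file name is a placeholder and the key path should be checked against your chart version):

  # my-overrides.yaml (placeholder file name), applied with: helm upgrade ... -f my-overrides.yaml
  controller:
    # Override the default of 1000 to track a single executor per SparkApplication.
    maxTrackedExecutorPerApp: 1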

@bnetzi commented Jan 10, 2025

@nitishtw - As far as I understand, the only benefit of tracking executors is that their status is also reported in the SparkApplication object. For our needs, 1 is sufficient to easily spot cases where the executors failed to start; anything more is just noise.

@nitishtw

@bnetzi I tried the same configuration you mentioned above; my Spark operator is running on a c7i.8xlarge machine [32 vCPU, 64 GiB memory].

All jobs go straight to the SUBMISSION_FAILED state; with the default configs, however, the jobs run fine.

The Spark operator logs show that it failed to download the jar dependencies:

:: problems summary ::
:::: WARNINGS
  [FAILED ] org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1!hadoop-shaded-guava.jar: Downloaded file size (0) doesn't match expected Content Length (3362359) for https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar. Please retry. (1984ms)
  [FAILED ] org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1!hadoop-shaded-guava.jar: Downloaded file size (0) doesn't match expected Content Length (3362359) for https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar. Please retry. (1984ms)
  ==== central: tried
    https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar
  ::::::::::::::::::::::::::::::::::::::::::::::
  :: FAILED DOWNLOADS ::
  :: ^ see resolution messages for details ^ ::
  ::::::::::::::::::::::::::::::::::::::::::::::
  :: org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1!hadoop-shaded-guava.jar
  ::::::::::::::::::::::::::::::::::::::::::::::

Is this expected? Is some kind of throttling happening while fetching from the Maven Central repo?
We are already thinking of moving all our dependencies to an S3 bucket, but could you help determine whether this is happening because of the worker config change?
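
For the S3 approach mentioned above, a rough sketch of a SparkApplication that pins the dependency from a bucket instead of resolving it from Maven Central. The bucket name, image, paths, and service account are placeholders, and reading s3a:// paths also requires the hadoop-aws and AWS SDK jars in the image:

  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: my-spark-job            # placeholder
    namespace: spark-jobs         # placeholder
  spec:
    type: Scala
    mode: cluster
    sparkVersion: 3.5.0
    image: my-registry/spark:3.5.0                        # placeholder image with hadoop-aws + AWS SDK baked in
    mainApplicationFile: s3a://my-bucket/jars/my-job.jar  # placeholder path
    deps:
      jars:
        # Pre-staged dependency; avoids resolving from Maven Central at submission time.
        - s3a://my-bucket/jars/hadoop-shaded-guava-1.1.1.jar   # placeholder path
    driver:
      cores: 1
      memory: 2g
      serviceAccount: spark       # placeholder
    executor:
      instances: 2
      cores: 1
      memory: 2g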
