[Bug] RayJob does not shut down the submitter pod properly #2359

Open
1 of 2 tasks
Moonquakes opened this issue Sep 6, 2024 · 1 comment
Labels
bug (Something isn't working), external-author-action-required, P1 (Issue that should be fixed within a few weeks), rayjob

Comments

@Moonquakes
Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

In some cases with KubeRay v1.0.0, especially when a RayJob requests a lot of resources and runs for a long time (more than half an hour), the job completes but the log output does not: the normal success message is never printed, although the job's final output is visible in the dashboard. The RayJob then stays stuck in that state, and the submitter pod is not cleaned up.

The status information returned by KubeRay is shown in the screenshot below:
[Image: img_v3_02ef_97fbe77b-d958-4ebb-929c-31daf282b13g]

After I upgraded to v1.1.1, neither the submitter pod nor the head node was cleaned up. The jobDeploymentStatus field stayed at Running, and nothing else changed.

Reproduction script

This is easy to reproduce with a RayJob that requests a lot of resources and runs for a long time.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Moonquakes added the bug and triage labels on Sep 6, 2024
@kevin85421
Member

RayJob has improved a lot in KubeRay v1.1.0, so I'm not surprised that there are some stability issues in v1.0.0. However, I am surprised that KubeRay v1.1.1 also has the issue. Would you mind (1) checking the KubeRay v1.1.1 logs to see whether any of them relate to this logic, and (2) providing a simple RayJob YAML so that I can check whether you are using the correct config?

@kevin85421 added the rayjob label and removed the triage label on Sep 7, 2024
@anyscalesam added the external-author-action-required and P1 labels on Sep 16, 2024

3 participants