2/3 Trainings on NVIDIA H100 80GB HBM3 get stuck with no chance to resume #964

WhoIsElMasri · 2025-01-01T11:25:16Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

Impacted Trainings:

6emf7AeSmgKZdjIDqE78
CTnXcC06MFRvU9BItUkK

e.g. 31% Disconnected. Checkpoint saved for epoch 167.
Resume: "Something went wrong. Please try again later."

Environment

Independent of browser and local environment

Minimal Reproducible Example

No response

Additional

No response

UltralyticsAssistant · 2025-01-01T11:25:59Z

👋 Hello @WhoIsElMasri, thank you for raising an issue about Ultralytics HUB 🚀! Your feedback helps us improve.

From your description, it seems you're encountering issues with your trainings getting disconnected and being unable to resume them. If this is a 🐛 Bug Report, could you please provide a minimum reproducible example (MRE) with detailed steps to replicate the issue? Additionally, if you have any screenshots or logs that can help us pinpoint the problem, those would be very helpful.

If this is a ❓ Question, sharing more details about the training setup, including your dataset, model, and any specific configurations, will help us provide a more thorough response.

An Ultralytics engineer will review your issue shortly and assist further. Thank you for your patience! 😊

pderrenger · 2025-01-02T02:11:02Z

@WhoIsElMasri thank you for reporting this issue and for providing the impacted training IDs! Let’s address this step by step.

It seems like the training sessions on the NVIDIA H100 80GB are encountering disconnection issues and subsequently failing to resume as expected. Here’s how we can troubleshoot and proceed:

Verify Checkpoints: First confirm that the checkpoints for the affected sessions (e.g., epoch 167) were saved correctly. If you can open the training details in Ultralytics HUB, check that the checkpoints exist and are accessible.
Resume Training: When resuming training, ensure the process uses the appropriate checkpoint file. If you’re using the Ultralytics HUB, resuming should happen automatically via the Resume Training option. However, as you mentioned encountering an error ("Something went wrong"), this might indicate an issue with the backend or the checkpoint file itself.
Backend Issue: Since the problem persists across multiple sessions, this could be related to a backend issue specific to the environment or GPU setup (NVIDIA H100). Our team will need to investigate further using the training session IDs you provided.
Next Steps:
- I recommend reaching out through the HUB Feedback option within the platform to ensure this is escalated directly to the engineering team.
- If reproducible, our team may request additional details or logs to better diagnose the root cause.
Temporary Workaround: If resuming via the HUB continues to fail, an alternative is to download the last checkpoint (if available) and manually resume training using the Python SDK. For example:
```
from ultralytics import YOLO

# Load the saved checkpoint
model = YOLO("path_to_downloaded_checkpoint.pt")

# Resume training
model.train(resume=True)
```
Monitoring and Updates: We suggest keeping track of the training progress and reconnecting via the HUB periodically, as intermittent backend issues may resolve themselves.
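The checkpoint check and manual resume described above can be combined into one small script. This is only a sketch under stated assumptions: `check_checkpoint` and `resume_from_checkpoint` are hypothetical helper names, the `.pt` suffix and non-empty size checks are simple local sanity checks (not something HUB performs), and the downloaded file is assumed to be a full training checkpoint such as `last.pt` rather than a stripped export.

```python
from pathlib import Path


def check_checkpoint(path: str, min_bytes: int = 1) -> Path:
    """Return the checkpoint path if it passes basic sanity checks, else raise.

    Hypothetical helper: it only looks at the file on disk and does not
    validate the contents of the checkpoint.
    """
    ckpt = Path(path)
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt}")
    if ckpt.suffix != ".pt":
        raise ValueError(f"Expected a .pt checkpoint, got: {ckpt.name}")
    if ckpt.stat().st_size < min_bytes:
        raise ValueError(f"Checkpoint looks empty or truncated: {ckpt}")
    return ckpt


def resume_from_checkpoint(path: str):
    """Sanity-check the downloaded checkpoint, then resume training from it."""
    ckpt = check_checkpoint(path)

    # Imported lazily so the file check above works without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(ckpt))
    # resume=True continues from the state stored in the checkpoint.
    return model.train(resume=True)


# Example call (the path is a placeholder):
# resume_from_checkpoint("path_to_downloaded_checkpoint.pt")
```

Failing early on a missing or empty file makes it easier to tell a bad download apart from a problem inside the training run itself.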

We truly appreciate your patience and understanding. The Ultralytics team is committed to resolving this issue swiftly. Let us know if you experience any additional errors or need further assistance! 🚀

yogendrasinghx · 2025-01-02T09:36:08Z

Hi @WhoIsElMasri,

We sincerely apologize for the inconvenience caused and thank you for bringing this issue to our attention. We have successfully reproduced the issue on our end, and your provided model IDs were crucial in identifying the problem.

I’ve informed the development team, and they will begin working on a fix shortly. I’ll keep you updated and let you know as soon as the issue is resolved.

Thank you for your patience and understanding!

WhoIsElMasri added the "bug" label (Something isn't working) on Jan 1, 2025

UltralyticsAssistant added the "HUB" label (Ultralytics HUB issues) on Jan 1, 2025

yogendrasinghx added the "web" label (Related to web interface or web functionality) on Jan 2, 2025

yogendrasinghx assigned bitcodr on Jan 2, 2025
