2/3 Trainings on NVIDIA H100 80GB HBM3 get stuck with no chance to resume #964
Comments
👋 Hello @WhoIsElMasri, thank you for raising an issue about Ultralytics HUB 🚀! Your feedback helps us improve. From your description, it seems you're encountering issues with your trainings getting disconnected and being unable to resume them. If this is a 🐛 Bug Report, could you please provide a minimum reproducible example (MRE) with detailed steps to replicate the issue? Additionally, if you have any screenshots or logs that can help us pinpoint the problem, those would be very helpful. If this is a ❓ Question, sharing more details about the training setup, including your dataset, model, and any specific configurations, will help us provide a more thorough response. An Ultralytics engineer will review your issue shortly and assist further. Thank you for your patience! 😊
@WhoIsElMasri thank you for reporting this issue and for providing the impacted training IDs! Let’s address this step by step. It seems like the training sessions on the NVIDIA H100 80GB are encountering disconnection issues and subsequently failing to resume as expected. Here’s how we can troubleshoot and proceed:
We truly appreciate your patience and understanding. The Ultralytics team is committed to resolving this issue swiftly. Let us know if you experience any additional errors or need further assistance! 🚀
Hi @WhoIsElMasri, we sincerely apologize for the inconvenience caused and thank you for bringing this issue to our attention. We have successfully reproduced the issue on our end, and your provided model IDs were crucial in identifying the problem. I’ve informed the development team, and they will begin working on a fix shortly. I’ll keep you updated and let you know as soon as the issue is resolved. Thank you for your patience and understanding!
Search before asking
HUB Component
Training
Bug
Impacted Trainings:
e.g. 31% Disconnected. Checkpoint saved for epoch 167.
Resume: "Something went wrong. Please try again later."
Environment
Independent from Browser and local environment
Minimal Reproducible Example
No response
Additional
No response