2/3 Trainings on NVIDIA H100 80GB HBM3 get stuck with no chance to resume #964

Open
1 task done
WhoIsElMasri opened this issue Jan 1, 2025 · 3 comments
Labels: bug (Something isn't working), HUB (Ultralytics HUB issues), web (Related to web interface or web functionality)

Comments

@WhoIsElMasri

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

Impacted Trainings:

  • 6emf7AeSmgKZdjIDqE78
  • CTnXcC06MFRvU9BItUkK

e.g. 31% Disconnected. Checkpoint saved for epoch 167.
Resume: "Something went wrong. Please try again later."

Environment

Independent of browser and local environment

Minimal Reproducible Example

No response

Additional

No response

@WhoIsElMasri added the bug label Jan 1, 2025
@UltralyticsAssistant added the HUB label Jan 1, 2025
@UltralyticsAssistant
Member

👋 Hello @WhoIsElMasri, thank you for raising an issue about Ultralytics HUB 🚀! Your feedback helps us improve.

From your description, it seems you're encountering issues with your trainings getting disconnected and being unable to resume them. If this is a 🐛 Bug Report, could you please provide a minimum reproducible example (MRE) with detailed steps to replicate the issue? Additionally, if you have any screenshots or logs that can help us pinpoint the problem, those would be very helpful.

If this is a ❓ Question, sharing more details about the training setup, including your dataset, model, and any specific configurations, will help us provide a more thorough response.

An Ultralytics engineer will review your issue shortly and assist further. Thank you for your patience! 😊

@pderrenger
Member

@WhoIsElMasri thank you for reporting this issue and for providing the impacted training IDs! Let’s address this step by step.

It seems like the training sessions on the NVIDIA H100 80GB are encountering disconnection issues and subsequently failing to resume as expected. Here’s how we can troubleshoot and proceed:

  1. Verify Checkpoints: First confirm that the checkpoints for the affected sessions (e.g., epoch 167) were saved correctly. If you can open the training details in Ultralytics HUB, check that the checkpoints exist and are accessible.

  2. Resume Training: When resuming training, ensure the process uses the appropriate checkpoint file. If you’re using the Ultralytics HUB, resuming should happen automatically via the Resume Training option. However, as you mentioned encountering an error ("Something went wrong"), this might indicate an issue with the backend or the checkpoint file itself.

  3. Backend Issue: Since the problem persists across multiple sessions, this could be related to a backend issue specific to the environment or GPU setup (NVIDIA H100). Our team will need to investigate further using the training session IDs you provided.

  4. Next Steps:

    • I recommend reaching out through the HUB Feedback option within the platform to ensure this is escalated directly to the engineering team.
    • If reproducible, our team may request additional details or logs to better diagnose the root cause.
  5. Temporary Workaround: If resuming via the HUB continues to fail, an alternative is to download the last checkpoint (if available) and manually resume training locally with the `ultralytics` Python package. For example:

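    from ultralytics import YOLO
    
    # Load the checkpoint downloaded from HUB (placeholder path, replace with your own file)
    model = YOLO("path_to_downloaded_checkpoint.pt")
    
    # Resume training from the state stored in the checkpoint
    model.train(resume=True)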
  6. Monitoring and Updates: We suggest keeping track of the training progress and reconnecting via the HUB periodically, as intermittent backend issues may resolve themselves.
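
Steps 1 and 5 can be combined into a small guard that checks the downloaded file before attempting a resume. This is only a minimal sketch: the helper names `checkpoint_looks_usable` and `resume_from_checkpoint` and the `min_bytes` threshold are hypothetical and not part of the `ultralytics` package, and the check only catches a missing or empty download, not a corrupted checkpoint.

```python
from pathlib import Path


def checkpoint_looks_usable(path, min_bytes=1024):
    """Cheap sanity check: the file exists, ends in .pt, and is not (nearly) empty.

    min_bytes is an arbitrary, assumed threshold used to reject truncated downloads.
    """
    p = Path(path)
    return p.is_file() and p.suffix == ".pt" and p.stat().st_size >= min_bytes


def resume_from_checkpoint(path):
    """Resume training from a downloaded checkpoint, failing early if it looks unusable."""
    if not checkpoint_looks_usable(path):
        raise FileNotFoundError(f"No usable checkpoint at {path}")

    # Imported lazily so the sanity check above works without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(path))
    return model.train(resume=True)
```

Calling `resume_from_checkpoint("path_to_downloaded_checkpoint.pt")` then either starts the resume or fails with a clear local error. Resuming locally also assumes the dataset referenced by the checkpoint is available on the local machine.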

We truly appreciate your patience and understanding. The Ultralytics team is committed to resolving this issue swiftly. Let us know if you experience any additional errors or need further assistance! 🚀

@yogendrasinghx
Member

Hi @WhoIsElMasri,

We sincerely apologize for the inconvenience caused and thank you for bringing this issue to our attention. We have successfully reproduced the issue on our end, and your provided model IDs were crucial in identifying the problem.

I’ve informed the development team, and they will begin working on a fix shortly. I’ll keep you updated and let you know as soon as the issue is resolved.

Thank you for your patience and understanding!

@yogendrasinghx added the web label Jan 2, 2025

5 participants