Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow accelerator to instantiate the device #5255

Merged
merged 18 commits into from
Aug 15, 2024

Conversation

nelyahu
Copy link
Contributor

@nelyahu nelyahu commented Mar 11, 2024

when instantiating torch.device for HPU it cannot be fed with HPU:1 annotation, but only "HPU".
moving the logic to accelerator will allow to solve this issue, with single line change.

@loadams
Copy link
Contributor

loadams commented Mar 27, 2024

Will merge #5275 first, that way we can have the newly added unit tests run on this, then we can merge it?

@nelyahu
Copy link
Contributor Author

nelyahu commented Mar 28, 2024

@loadams Sure, thanks :)

@loadams
Copy link
Contributor

loadams commented Mar 28, 2024

@loadams Sure, thanks :)

Done now, so the tests should run on this PR, then we can merge it, thanks!

@nelyahu
Copy link
Contributor Author

nelyahu commented Apr 3, 2024

Halting this PR till habana version 1.16.0 will be released. CI issue should be fixed there.

@loadams
Copy link
Contributor

loadams commented Apr 3, 2024

Halting this PR till habana version 1.16.0 will be released. CI issue should be fixed there.

Thanks @nelyahu - when it is released, go ahead and just update it in this PR? Also curious if you can share what is required in that release that is causing failures here since it appears the tests just hang now?

And just tag one of us when it is updated and we can get this merged.

@nelyahu
Copy link
Contributor Author

nelyahu commented Apr 4, 2024

Halting this PR till habana version 1.16.0 will be released. CI issue should be fixed there.

Thanks @nelyahu - when it is released, go ahead and just update it in this PR? Also curious if you can share what is required in that release that is causing failures here since it appears the tests just hang now?

And just tag one of us when it is updated and we can get this merged.

Sure @loadams , i will update.
I verified this patch on v1.16.0, as our dev process is on the next release branch, and i did not verified v1.14.0 (the release which is being tested here). There was a lot of bug fixed since then, cannot say for sure what is the root cause.

@loadams
Copy link
Contributor

loadams commented Apr 19, 2024

Halting this PR till habana version 1.16.0 will be released. CI issue should be fixed there.

Thanks @nelyahu - when it is released, go ahead and just update it in this PR? Also curious if you can share what is required in that release that is causing failures here since it appears the tests just hang now?
And just tag one of us when it is updated and we can get this merged.

Sure @loadams , i will update. I verified this patch on v1.16.0, as our dev process is on the next release branch, and i did not verified v1.14.0 (the release which is being tested here). There was a lot of bug fixed since then, cannot say for sure what is the root cause.

Thanks @nelyahu - feel free to tag us when it the new version is released and updated so we can merge this then, thanks!

@loadams
Copy link
Contributor

loadams commented May 28, 2024

Halting this PR till habana version 1.16.0 will be released. CI issue should be fixed there.

Thanks @nelyahu - when it is released, go ahead and just update it in this PR? Also curious if you can share what is required in that release that is causing failures here since it appears the tests just hang now?
And just tag one of us when it is updated and we can get this merged.

Sure @loadams , i will update. I verified this patch on v1.16.0, as our dev process is on the next release branch, and i did not verified v1.14.0 (the release which is being tested here). There was a lot of bug fixed since then, cannot say for sure what is the root cause.

Thanks @nelyahu - feel free to tag us when it the new version is released and updated so we can merge this then, thanks!

@nelyahu - I tried 1.15.1, looks like same errors, so we will wait for 1.16

@nelyahu
Copy link
Contributor Author

nelyahu commented May 28, 2024

Halting this PR till habana version 1.16.0 will be released. CI issue should be fixed there.

Thanks @nelyahu - when it is released, go ahead and just update it in this PR? Also curious if you can share what is required in that release that is causing failures here since it appears the tests just hang now?
And just tag one of us when it is updated and we can get this merged.

Sure @loadams , i will update. I verified this patch on v1.16.0, as our dev process is on the next release branch, and i did not verified v1.14.0 (the release which is being tested here). There was a lot of bug fixed since then, cannot say for sure what is the root cause.

Thanks @nelyahu - feel free to tag us when it the new version is released and updated so we can merge this then, thanks!

@nelyahu - I tried 1.15.1, looks like same errors, so we will wait for 1.16

@loadams Yes.

@loadams
Copy link
Contributor

loadams commented Aug 14, 2024

@nelyahu - now that we have updated the hpu runner to 1.17, should we move ahead with merging this PR?

@loadams loadams enabled auto-merge August 14, 2024 18:27
@loadams loadams disabled auto-merge August 14, 2024 21:43
@loadams
Copy link
Contributor

loadams commented Aug 14, 2024

@nelyahu - now that we have updated the hpu runner to 1.17, should we move ahead with merging this PR?

@nelyahu/ @BacharL - looks like there is an error with this, could you take a look when you have time?

@loadams loadams self-assigned this Aug 14, 2024
known issue, WIP by habana SW team
@nelyahu
Copy link
Contributor Author

nelyahu commented Aug 15, 2024

@nelyahu - now that we have updated the hpu runner to 1.17, should we move ahead with merging this PR?

@nelyahu/ @BacharL - looks like there is an error with this, could you take a look when you have time?

Yes, this test is disabled locally in our env and fails for the same reason. I removed it from gaudi2 workflow file. once fixed will be re-added

@loadams loadams added this pull request to the merge queue Aug 15, 2024
Merged via the queue into microsoft:master with commit eb07d41 Aug 15, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants