Scheduling fixes on MPS #10549

Merged
4 commits merged into huggingface:main on Jan 16, 2025

Conversation

@hlky (Collaborator) commented Jan 13, 2025

What does this PR do?

The segfault in the MPS scheduler tests is caused by randn_like; there are a few related PyTorch issues about problems with *_like functions on MPS.
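
A minimal sketch of the workaround (illustrative only; the tensor names are made up, not the actual test code):

```python
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
sample = torch.randn(4, 3, 8, 8).to(device)

# torch.randn_like(sample) is what segfaults in the MPS scheduler tests.
# Workaround: build the noise from the shape instead, then move it to the device.
noise = torch.randn(sample.shape).to(device)
```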

float64 is unsupported on MPS, and timesteps are float64 in scheduling_heun_discrete and scheduling_lms_discrete. This change should be fine because the timestep is downcast later anyway.
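
A rough illustration of the dtype issue (not the exact scheduler code; the values are made up):

```python
import numpy as np
import torch

# np.linspace produces float64 by default, and MPS cannot hold float64 tensors,
# so a float64 timestep schedule cannot simply be moved to the MPS device.
timesteps = np.linspace(0, 999, 50)

# Keeping the schedule in float32 avoids that; the timestep is downcast to the
# model dtype later anyway, so nothing meaningful is lost.
timesteps = timesteps.astype(np.float32)
t = torch.from_numpy(timesteps)
```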

In test_schedulers, using .to(sample.device, dtype=sample.dtype) instead of .to(sample.device).to(sample.dtype) should be equivalent but compatible with MPS.
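
A small illustration of the difference (the variable names are hypothetical, not the actual test code):

```python
import torch

sample = torch.zeros(2, 3, dtype=torch.float16)
residual = torch.ones(2, 3, dtype=torch.float64)

# Chained form: .to(sample.device) first materializes a float64 tensor on the
# target device, which is exactly what MPS cannot do.
# residual = residual.to(sample.device).to(sample.dtype)

# Combined form: device and dtype change in a single call, so the float64
# tensor is converted as it is moved.
residual = residual.to(sample.device, dtype=sample.dtype)
```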

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@hlky requested a review from yiyixuxu on January 13, 2025 at 05:49
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu (Collaborator)

Maybe it is easier to only use np.float32 for MPS? Some of the models we recently integrated are very sensitive to precision (e.g. Mochi, LTX).

cc @bghira here for his opinions too

@bghira (Contributor) commented Jan 13, 2025

Sana is especially sensitive, but it could be like the RoPE for Flux, where we went from fp64 to fp32 and saw no real degradation. If it won't work on MPS, maybe some CPU fallback code can work for those systems, but that sounds like an upstream PyTorch limitation.

I guess I'd give it a whirl and see if the known sensitive models have an issue, and document potential instabilities with PyTorch on MPS (which is in general a good idea to temper expectations).

@hlky (Collaborator, Author) commented Jan 14, 2025

Given the timesteps range, casting int64->int32 should be lossless, no? And when the timestep is cast to a float type before the model, int32->float16 etc. should also be lossless, no? (See the sketch at the end of this comment.) Anyway, it seems like the main fix for the CI failure is

noise = torch.randn(scaled_sample.shape).to(torch_device)

There are some issues on PyTorch regarding *_like failures on MPS.

self.timesteps is cast to torch.int64 in some schedulers, and add_noise (which is what the failing test was for) already handles casting for MPS, so we can revert these np.int64->np.int32 changes.
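
For reference, a quick standalone check of the losslessness argument (a sketch, not code from the PR; it assumes the usual 1000 training timesteps):

```python
import torch

# Timestep values stay in [0, 1000), far below 2**31 - 1, so narrowing
# int64 -> int32 cannot lose information here.
t64 = torch.arange(1000, dtype=torch.int64)
t32 = t64.to(torch.int32)
assert torch.equal(t64, t32.to(torch.int64))

# Integers up to 2048 are exactly representable in float16, so casting these
# values int32 -> float16 before the model call is also exact.
assert torch.equal(t32.to(torch.float16).to(torch.int32), t32)
```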

@hlky (Collaborator, Author) commented Jan 16, 2025

The np.int64->np.int32 changes were not needed. With this PR, all scheduler tests on MPS are passing:

917 passed, 15 skipped, 46 deselected, 15 warnings in 15.48s

@yiyixuxu (Collaborator) left a review comment

thanks!

@yiyixuxu merged commit 08e62fe into huggingface:main on Jan 16, 2025
12 checks passed