Fallback from Vulkan to CPU #2411

Open

thewh1teagle (Contributor) commented Sep 9, 2024

Vulkan has a lot of bugs on Windows / Linux, but when it works, it is much faster than the CPU backend (10-20x). I'm forced to use Vulkan in the vibe project, but many users report that it crashes on Windows / Linux.

Some of the errors:

PopOS
thewh1teagle/vibe#269

Ubuntu

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 620 (KBL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
2024-09-09T10:58:08.692125Z ERROR whisper_rs::whisper_sys_tracing: whisper_model_load: ERROR not all tensors loaded from model file - expected 947, got 3
2024-09-09T10:58:08.711251Z ERROR whisper_rs::whisper_sys_tracing: whisper_init_with_params_no_state: failed to load model

Arch
thewh1teagle/vibe#267

Windows
thewh1teagle/vibe#266

thewh1teagle/vibe#263

Windows

ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GT 730 buffer from size 0.00 MiB to 565.06 MiB
ggml_vulkan: Device memory allocation of size 592512000 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate NVIDIA GeForce GT 730 buffer of size 592512000
@thewh1teagle (Contributor Author)

@ggerganov

Do you have any suggestions on how we can improve the stability of ggml and whisper.cpp to reduce crashes (aborts) and ensure they consistently return errors instead?

@ggerganov (Owner)

Hm, I haven't tested the Vulkan backend with whisper.cpp at all, so I cannot recommend a way to improve the stability. But looking at the error, this seems like it's trying to load an invalid model, no?

The other error seems like the GPU device runs out of memory. I think your application can check if there is enough available memory before trying to load the Whisper model.

@thewh1teagle (Contributor Author)

@ggerganov

There are a lot of different issues with Vulkan. For instance, a new issue reports that Vulkan failed because the device doesn't support fp16 storage: ggerganov/llama.cpp#7620

How can we fall back to CPU in case it fails?
Vulkan is really important on Windows; it's the only broadly available GPU acceleration we currently have there.

I considered using OpenVINO on Windows instead, but last time I checked it requires special files to be installed and a special model file, so it wouldn't work better than Vulkan in a desktop app.

@thewh1teagle (Contributor Author)

@ggerganov

I've noticed that CoreML/Metal includes a fallback mechanism to CPU. Since Vulkan has compatibility issues on many modern PCs, it would be great if Vulkan could have a similar fallback.

Would you be able to outline the steps needed to implement a CPU fallback for Vulkan? I'm willing to work on it and collaborate with others to push this forward. Should I focus on this in the ggml repository or in whisper.cpp?

Thanks!

@ggerganov (Owner)

I think the fallback mechanism only applies to operators that are not yet implemented on the backend. Are there such operators in the Vulkan backend?

With the change that I just pushed, the memory usage should be reduced significantly. I will make a new whisper.cpp release in the following days, and after that, if the issues still persist, we can discuss how to improve the Vulkan state.

@thewh1teagle (Contributor Author)

thewh1teagle commented Oct 6, 2024

@ggerganov

The tiny model still fails to load with Vulkan on the latest commit; 1 GB of GPU memory is available.

C:\ReallyTempEmptyEveryDay\vibe.test>.\vibe.exe

C:\ReallyTempEmptyEveryDay\vibe.test>ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce GTX 1660 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 11.08 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 60.29 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 2.20 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 89.95 MiB
ggml_vulkan: Device memory allocation of size 94318336 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate NVIDIA GeForce GTX 1660 Ti buffer of size 94318336

I think the fallback mechanism only applies to operators that are not yet implemented on the backend. Are there such operators in the Vulkan backend?

Not that I'm aware of. I thought it would fall back completely to the CPU; that would be useful.

@ggerganov (Owner)

@thewh1teagle Can you confirm that the memory allocation issue is now fixed with the latest commit on master?

@thewh1teagle (Contributor Author)

thewh1teagle commented Oct 12, 2024

Can you confirm that the memory allocation issue is now fixed with the latest commit on master?

@ggerganov

The memory allocation issue seems to be fixed in the latest version. However, many users are still reporting problems related to Vulkan. For example:

ggml_vulkan: device Vulkan0 does not support 16-bit storage

I believe providing an option to fall back to CPU-only inference would still be very useful, especially on Windows.
