Fallback from Vulkan to CPU #2411

Open

thewh1teagle (Contributor) commented Sep 9, 2024

Vulkan has a lot of bugs on Windows / Linux, but when it works, it is much faster than the CPU backend (10-20x). I'm forced to use Vulkan in the vibe project, but many users report that it crashes on Windows / Linux.

Some of the errors:

PopOS
thewh1teagle/vibe#269

Ubuntu

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 620 (KBL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
2024-09-09T10:58:08.692125Z ERROR whisper_rs::whisper_sys_tracing: whisper_model_load: ERROR not all tensors loaded from model file - expected 947, got 3
2024-09-09T10:58:08.711251Z ERROR whisper_rs::whisper_sys_tracing: whisper_init_with_params_no_state: failed to load model

Arch
thewh1teagle/vibe#267

Windows
thewh1teagle/vibe#266

thewh1teagle/vibe#263

Windows

ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GT 730 buffer from size 0.00 MiB to 565.06 MiB
ggml_vulkan: Device memory allocation of size 592512000 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate NVIDIA GeForce GT 730 buffer of size 592512000
@thewh1teagle (Contributor Author)

@ggerganov

Do you have any suggestions on how we can improve the stability of ggml and whisper.cpp to reduce crashes (aborts) and ensure they consistently return errors instead?

@ggerganov (Owner)

Hm, I haven't tested the Vulkan backend with whisper.cpp at all, so I cannot recommend a way to improve the stability. But looking at the error, this seems like it's trying to load an invalid model, no?

The other error seems like the GPU device runs out of memory. I think your application can check if there is enough available memory before trying to load the Whisper model.

@thewh1teagle (Contributor Author)

@ggerganov

There are a lot of different issues with Vulkan. For instance, a new issue reports that Vulkan failed because the device doesn't support fp16 storage: ggerganov/llama.cpp#7620

How can we fall back to CPU in case it fails?
Vulkan is really important on Windows; it's the only broadly available GPU acceleration we currently have there.

I considered using OpenVINO on Windows instead, but last time I checked it requires special files to be installed and a special model file, so it wouldn't work better than Vulkan in a desktop app.

@thewh1teagle (Contributor Author)

@ggerganov

I've noticed that CoreML/Metal includes a fallback mechanism to CPU. Since Vulkan has compatibility issues on many modern PCs, it would be great if Vulkan could have a similar fallback.

Would you be able to outline the steps needed to implement a CPU fallback for Vulkan? I'm willing to work on it and collaborate with others to push this forward. Should I focus on this in the ggml repository or in whisper.cpp?

Thanks!

@ggerganov (Owner)

I think the fallback mechanism only applies to operators that are not yet implemented on the backend. Are there such operators in the Vulkan backend?

With the change that I just pushed, the memory usage should be reduced significantly. I will make a new whisper.cpp release in the following days, and after that, if the issues still persist, we can discuss how to improve the Vulkan state.

@thewh1teagle (Contributor Author)

thewh1teagle commented Oct 6, 2024

@ggerganov

The tiny model still fails to load with Vulkan on the latest commit; 1 GB of GPU memory is available.

C:\ReallyTempEmptyEveryDay\vibe.test>.\vibe.exe

C:\ReallyTempEmptyEveryDay\vibe.test>ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce GTX 1660 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 11.08 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 60.29 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 2.20 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 89.95 MiB
ggml_vulkan: Device memory allocation of size 94318336 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate NVIDIA GeForce GTX 1660 Ti buffer of size 94318336

I think the fallback mechanism only applies to operators that are not yet implemented on the backend. Are there such operators in the Vulkan backend?

Not that I'm aware of. I thought it would fall back completely to the CPU; that would be useful.

@ggerganov (Owner)

@thewh1teagle Can you confirm that the memory allocation issue is now fixed with the latest commit on master?

@thewh1teagle (Contributor Author)

thewh1teagle commented Oct 12, 2024

Can you confirm that the memory allocation issue is now fixed with the latest commit on master?

@ggerganov

The memory allocation issue seems to be fixed in the latest version. However, many users are still reporting problems related to Vulkan. For example:

ggml_vulkan: device Vulkan0 does not support 16-bit storage

I believe providing an option to fall back to CPU-only inference would still be very useful, especially on Windows.
