Feature request: AMD GPU support with oneDNN AMD support #1072

santhoshtr · 2023-02-09T10:07:19Z

Hi, CTranslate2 uses oneDNN. The latest versions of oneDNN support AMD GPUs, which requires the Intel oneAPI DPC++ compiler. The same approach could potentially enable NVIDIA GPU support too.

It would help with running MT models on AMD GPUs. With ROCm, this would be a fully open-source way to run MT models on GPUs.

Thanks

guillaumekln · 2023-02-09T13:23:03Z

Hello,

Currently we only use oneDNN for specific operators such as matrix multiplications and convolutions, but a full MT model contains many other operators (softmax, layer norm, gather, concat, etc.). Even though some of them are available in oneDNN, it would require quite some work to specialize all operations for AMD GPUs.

At this time I don't plan to work on this feature, but it would indeed be a nice one to have!

leuc · 2023-04-04T20:36:24Z

I wanted to try faster-whisper on an Intel A770 dGPU (16GB). Full use of oneDNN could also enable support for that hardware.

towel · 2023-04-05T12:32:28Z

I'm migrating a transcription component to faster-whisper and using an AMD GPU, so I'd also very much appreciate ROCm support in faster-whisper.

phineas-pta · 2023-04-11T20:45:22Z

@towel did you manage to get faster-whisper working on AMD ?

CristianPi · 2023-05-15T19:51:44Z

any way to run with an amd gpu?

MidnightKittenCat · 2023-08-29T05:17:24Z

Any update on this?

guillaumekln · 2023-08-29T08:18:44Z

I still don't plan to work on this at this time, and as far as I know no one else is working on this. I expect it would be quite some work to have complete ROCm support.

lenhone · 2023-09-10T05:43:57Z

I had a go at converting the existing cuda stuff to rocm a few months ago but could never get it to build, not surprising as I have zero C++ or cmakelists skills.

curand, cublas, cudnn, cuda, cub appear to map to hip with minor adjustments, but I could never get the cmakelists to include thrust (the version supplied by rocm) and it always halted compiling due to producing too many errors.
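The mapping described above can be sketched in a few lines. This is only a toy illustration of the renaming idea, not the real hipify tool: the `toy_hipify` helper and the prefix table are hypothetical names made up for this example, and actual ports also need build-system and library changes (cuDNN, for instance, has no drop-in equivalent in MIOpen).

```python
import re

# Library-level correspondences (CUDA name -> ROCm counterpart).
# MIOpen is the ROCm counterpart of cuDNN but does not share its API.
CUDA_TO_ROCM_LIBRARIES = {
    "cuRAND": "hipRAND / rocRAND",
    "cuBLAS": "hipBLAS / rocBLAS",
    "cuDNN": "MIOpen",
    "CUB": "hipCUB",
    "Thrust": "rocThrust",
}

# Identifier prefixes that translate by plain renaming (hypothetical, partial table).
_PREFIXES = {"cuda": "hip", "cublas": "hipblas", "curand": "hiprand"}
_PATTERN = re.compile(r"\b(cuda|cublas|curand)(?=[A-Z_])")


def toy_hipify(source: str) -> str:
    """Rename CUDA-prefixed identifiers to their HIP spelling."""
    return _PATTERN.sub(lambda match: _PREFIXES[match.group(1)], source)


if __name__ == "__main__":
    print(toy_hipify("cudaMalloc(&ptr, n); cublasSgemm(handle); curandGenerator_t gen;"))
```

Identifiers such as `cudaMalloc` become `hipMalloc`, which is why most of a CUDA code base converts mechanically while the build configuration still has to be adapted by hand.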

TheJKM · 2023-10-26T23:11:36Z

I started trying to port CTranslate2 to ROCm last weekend and decided to share my (non-working) results here. The code is available in the rocm Branch of my fork.

Basically, hipify was able to convert most of the code automatically. I added a new CMake config option to enable compiling with ROCm, and so far calling the HIP compiler works. However, it breaks the other options and requires a CMake version new enough to have HIP language support.

Current issues are some CUDA library dependencies I did not look at yet, and the use of bfloat16 data type. While latest ROCm has a (according to this GH issue -> ROCm/ROCm#2534) drop-in replacement for the CUDA bf16, it currently has some issues in missing operators. Therefore, I'm trying to completely disable bf16 for now, but without luck so far.

Right now, the only goal of this work is to make it work, not to integrate HIP/ROCm into the (CMake) infrastructure.

In case someone wants to have a look at the code and help porting, feel free to look at my fork. Unfortunately, I don't expect to have much time in the near future for this project.

BBC-Esq · 2023-10-26T23:15:14Z

This is awesome dude. Wish I had programming experience to help with this, but alas I don't. I've been looking for ways to enable gpu acceleration for amd gpus using ctranslate2...Let me know if I can help in any way, whether it be by testing or what have you.

BBC-Esq · 2023-10-26T23:20:44Z

Have you gotten it to work at all yet?

MidnightKittenCat · 2023-10-26T23:21:53Z

Have you gotten it to work at all yet?

“I started trying to port CTranslate2 to ROCm last weekend and decided to share my (non-working) results here”

I believe that should answer your question.

commandline-be · 2023-12-31T01:48:34Z

@TheJKM Thanks for sharing, I'm very interested in the ripple effects this port may have for other projects.

There's now ROCm 6.0 available, which I believe specifically addresses the bf16 issue you're referencing.

FYI: https://repo.radeon.com/amdgpu/6.0/ubuntu/dists/jammy/

I've tried all kinds of dumb, uninformed stuff trying to get LibreTranslate to work with ROCm, to no avail. It depends on a CUDA version too recent to be tricked by ROCm. The latest pytorch+rocm5.7 also did not work out well.

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html

commandline-be · 2024-01-03T07:42:50Z

So, would it make sense to create an experimental version of ctranslate2 using a more recent oneDNN which does have AMD GPU support?

from https://github.com/oneapi-src/oneDNN
oneAPI Deep Neural Network Library (oneDNN) is an open-source cross-platform performance library of basic building blocks for deep learning applications. oneDNN is part of oneAPI. The library is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. oneDNN has experimental support for the following architectures: NVIDIA* GPU, AMD* GPU, OpenPOWER* Power ISA (PPC64), IBMz* (s390x), and RISC-V.

https://github.com/oneapi-src/oneDNN?tab=readme-ov-file#system-requirements
SYCL runtime with AMD GPU support requires
oneAPI DPC++ Compiler with support for HIP AMD
AMD ROCm, version 5.3 or later
MIOpen, version 2.18 or later (optional if AMD ROCm includes the required version of MIOpen)
rocBLAS, version 2.45.0 or later (optional if AMD ROCm includes the required version of rocBLAS)

https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/amd/README.md
Support for AMD backend is implemented via SYCL HIP backend. The feature is disabled by default. Users must enable it at build time with a CMake option DNNL_GPU_VENDOR=AMD. The AMD GPUs can be used via oneDNN engine abstraction. The engine should be created using dnnl::engine::kind::gpu engine kind or the user can provide a sycl::device objects that corresponds to AMD GPUs.

vince62s · 2024-01-03T18:01:23Z

As said in the Feb 2023 comment "Even though some of them are available in oneDNN, it would require quite some work to specialize all operations for AMD GPUs." Since no one is making those changes, it won't move on.

katzmike · 2024-03-21T13:10:16Z

I am not a developer but I work at AMD and handle developer relationships. We would like to assist with the effort to enable CTranslate2 for AMD dGPUs and iGPU. We will have engineers investigate, but we may also be able to provide hardware to the lead contributors of this effort. Please contact me via michael dot katz at amd dot com if this would help.

radna0 · 2024-07-01T17:18:17Z

is there any update on this?

kvrban · 2024-07-05T12:16:47Z

is there any update on this?

I suspect Lisa and Jensen have a deal that AMD only gets the crumbs from the AI pie.
So there is nothing left for us peasants but to continue paying the NVIDIA tax.

DDXDB · 2024-07-13T19:13:55Z

Regarding the oneDNN requirements quoted above by @commandline-be: does that mean Intel Arc GPUs can also be supported?

chboishabba · 2024-07-14T13:49:52Z

https://bbs.archlinux.org/viewtopic.php?pid=2183865#p2183865

yeetmanpat · 2024-07-30T18:55:22Z

This has just been released: https://docs.scale-lang.com/

Could someone more technical see whether this toolkit would make running ctranslate2 on AMD possible?

genehand · 2024-08-02T02:07:37Z

For whisper.cpp at least, it now supports Vulkan as a GPU backend (ggerganov/whisper.cpp#2302). With Home Assistant this is working well for me through https://github.com/ser/wyoming-whisper-api-client

tannisroot · 2024-08-02T04:10:08Z

For whisper.cpp at least, it now supports vulkan as a gpu backend. With home assistant this is working well for me through https://github.com/ser/wyoming-whisper-api-client

Personally, on my hardware, even with GPU acceleration, whisper.cpp is way slower than faster-whisper using the same model and CPU, and the transcription time is also very unpredictable.

chboishabba · 2024-08-02T05:22:07Z

Try whisperx if you are able to use faster-whisper, it does distraction and has a better VAD...

chboishabba · 2024-08-02T05:22:28Z

Diarisation

tannisroot · 2024-08-02T07:44:43Z

whisperx

I don't believe there is a way to hook it up to the Wyoming protocol, which is my sole usecase for it.

genehand · 2024-08-03T21:54:55Z

Personally, on my hardware, even with GPU acceleration, whisper.cpp is way slower than faster-whisper using the same model and CPU, and the transcription time is also very unpredictable.

Alright, with ROCm 6.2 now supporting my GPU, I was curious to do a quick test. Using the medium model and this test file, here's what I'm seeing:

| project | backend | beam size | transcribe time |
|---|---|---|---|
| faster-whisper | cpu | 5 | 1m26.002s |
| whisper.cpp | vulkan | 5 | 1m24.906s |
| whisper.cpp | rocm | 5 | 59.834s |
| faster-whisper | rocm | 5 | 37.649s |

This is with an i5-10400F and RX5700 using code adapted from the readme:

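from faster_whisper import WhisperModel

model_size = "medium"
model = WhisperModel(model_size, device="cpu", compute_type="int8", cpu_threads=12)
# transcribe() returns a lazy generator; iterate over segments to actually run the transcription
segments, info = model.transcribe("tests/data/physicsworks.wav", beam_size=5, language="en")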

Edit: Lowered cpu_threads from the default 16 with improved results.
Edit 2: Added faster-whisper with @arlo-phoenix's fork
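
To make the table easier to compare, here is a small sketch that converts the reported times to seconds and expresses each run as a speedup over the faster-whisper CPU run. The timings are copied from the table above; `parse_duration` is a hypothetical helper written for this example.

```python
import re


def parse_duration(text: str) -> float:
    """Convert strings like '1m26.002s' or '59.834s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# (project, backend) -> transcribe time as reported in the table.
RESULTS = {
    ("faster-whisper", "cpu"): "1m26.002s",
    ("whisper.cpp", "vulkan"): "1m24.906s",
    ("whisper.cpp", "rocm"): "59.834s",
    ("faster-whisper", "rocm"): "37.649s",
}

baseline = parse_duration(RESULTS[("faster-whisper", "cpu")])
speedups = {key: baseline / parse_duration(value) for key, value in RESULTS.items()}

for (project, backend), factor in speedups.items():
    print(f"{project:15s} {backend:7s} {factor:.2f}x")
```

On these numbers the ROCm build of faster-whisper comes out roughly 2.3x faster than the CPU run, while the Vulkan backend of whisper.cpp is about on par with it. This is a single run on one machine, so treat the ratios as indicative only.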

commandline-be · 2024-08-05T11:55:50Z

I am not a developer but I work at AMD and handle developer relationships. We would like to assist with the effort to enable CTranslate2 for AMD dGPUs and iGPU. We will have engineers investigate, but we may also be able to provide hardware to the lead contributors of this effort. Please contact me via michael dot katz at amd dot com if this would help.

Could a first step be to test this against ZLUDA?
https://github.com/vosen/ZLUDA (runs CUDA on AMD)

I look forward to being able to run ctranslate2 with GPU acceleration without having to buy an NVIDIA card.

arlo-phoenix · 2024-08-06T17:58:23Z

I ported CTranslate2 over to ROCm. My fork is here: https://github.com/arlo-phoenix/CTranslate2-rocm
and install instructions can be found under README_ROCM.md. I also wrote about the issues I had and the libraries using CT2 I tested.

Status Tracker

faster whisper
whisperX
bfloat16 (main blocker for upstreaming imo)
sync with upstream (I intentionally went back a couple commits to avoid having to deal with fa2 and AWQ)

Instead of using oneDNN I just hipified the repo and extracted HIP to CUDA function mapping to create a preprocessor solution similar to projects like llama.cpp. Besides the listed stuff it is feature complete and works very well. I included some benchmark scripts with the file from #1072 (comment) (@genehand would be nice if you could try this and add the numbers to a table!). On my RX6800 I'm getting 11s-12s with faster_whisper and 4.2s with whisperX. For RDNA this should now be the fastest working whisper inference solution :)

Btw should we split issues up? This is two combined into one. I personally believe porting all operators to oneDNN is far too much effort and might not even lead to good performance. This repo hipified quite well, I was able to use simple defines from HIP to CUDA functions for the majority of the project. I only had to rewrite the conv1d operator from scratch since hipDNN isn't maintained anymore.
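
The fork's benchmark scripts are not reproduced in this thread. For anyone who wants comparable numbers, a minimal timing sketch could look like the following. `timed_transcribe` is a hypothetical helper, not part of faster-whisper or the fork; it only assumes the model object has a `transcribe()` method that returns `(segments, info)` with `segments` being a lazy generator, as in faster-whisper.

```python
import time
from typing import Any, List, Tuple


def timed_transcribe(model: Any, audio_path: str, **options: Any) -> Tuple[List[Any], float]:
    """Run a transcription and return (segments, elapsed seconds).

    The segments generator is consumed inside the timed region, because
    the actual decoding only happens while iterating over it.
    """
    start = time.perf_counter()
    segments, _info = model.transcribe(audio_path, **options)
    collected = list(segments)  # forces the lazy generator to run
    elapsed = time.perf_counter() - start
    return collected, elapsed


# Example usage (requires faster-whisper; parameters taken from the snippet earlier in the thread):
# from faster_whisper import WhisperModel
# model = WhisperModel("medium", device="cpu", compute_type="int8")
# segments, seconds = timed_transcribe(model, "tests/data/physicsworks.wav", beam_size=5, language="en")
# print(f"{len(segments)} segments in {seconds:.1f}s")
```

Note that this excludes model loading time, so the numbers are not directly comparable to measurements that time the whole process.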

BBC-Esq · 2024-08-06T18:05:34Z

@arlo-phoenix Can you add the "issues" tab on your github so we can communicate that way? I'm possibly interested in incorporating this into my projects.

yeetmanpat · 2024-08-07T21:08:09Z

Thanks @arlo-phoenix I've successfully installed it. Using your benchmark I get 8-9 seconds on faster_whisper with an RX 7800 XT. I tried testing with whisperx but cannot get it to work, I get OSError: libtorch_cuda.so no such file.

chboishabba · 2024-08-07T23:28:58Z

Did you install pytorch with CUDA? E.g.:
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
See https://pytorch.org/get-started/locally/

chboishabba · 2024-08-07T23:30:11Z

Sorry, just read you're using an RX 7800. You will need pytorch-rocm, which is only available on Linux:
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/rocm6.0

genehand · 2024-08-09T12:39:36Z

(@genehand would be nice if you could try this and add the numbers to a table!). On my RX6800 I'm getting 11s-12s with faster_whisper and 4.2s with whisperX. For RDNA this should now be the fastest working whisper inference solution :)

Nice! I’m traveling now but will definitely try that out after the weekend. 👏

BBC-Esq · 2024-08-09T12:54:35Z

Any idea if this will work on WSL on Windows?

arlo-phoenix · 2024-08-10T13:09:52Z

I updated my fork to work with ROCm 6.2; install instructions are still here: https://github.com/arlo-phoenix/CTranslate2-rocm/blob/rocm/README_ROCM.md. There is some regression (or a poorly documented change) in MIOpen's GetWorkspaceSize that makes it return 0, which caused the subsequent functions to trigger fallbacks and led to a significant slowdown. I don't think this is the correct solution, but just reusing the last workspace size worked in my short time testing.
I also enabled the use of the AsyncAllocator in CT2 (had a CUDA_VERSION guard that I missed) which improved faster-whisper consistency and speed (whisperX is almost 6% faster now because of this).
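
The "reuse the last workspace size" idea can be illustrated in isolation. This is a purely illustrative Python sketch, not the fork's C++ code; the `WorkspaceSizeTracker` class and its method name are hypothetical.

```python
class WorkspaceSizeTracker:
    """Remember the last non-zero workspace size and fall back to it."""

    def __init__(self) -> None:
        self._last_size = 0

    def resolve(self, reported_size: int) -> int:
        # A positive value is trusted and remembered for later calls.
        if reported_size > 0:
            self._last_size = reported_size
            return reported_size
        # A zero report is treated as suspicious: reuse the previous size.
        return self._last_size


if __name__ == "__main__":
    tracker = WorkspaceSizeTracker()
    for reported in (4096, 0, 8192, 0):
        print(reported, "->", tracker.resolve(reported))
```

The obvious weakness, in line with the caveat above, is that a stale size may be too small or too large for the next operation.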

I tried testing with whisperx but cannot get it to work, I get OSError: libtorch_cuda.so no such file.

@yeetmanpat as @chboishabba said, that's a typical torch error. I added install instructions for whisperX (https://github.com/arlo-phoenix/CTranslate2-rocm/blob/rocm/README_ROCM.md#whisperX) that worked for me.

Nice! I’m traveling now but will definitely try that out after the weekend. 👏

@genehand Nice. But just as a warning, I realized it might not work after all since this relies on MIOpen which depends on composable_kernel which afaik isn't made / tuned for RDNA 1. It's still worth a try though.

Any idea if this will work on WSL on Windows?

depends on your GPU. ROCm WSL is officially only supported on RX7800W+ (source). This doesn't have as many dependencies as pytorch, but it's still up there. If you can't run pytorch you likely can't run this.

chboishabba · 2024-08-12T03:53:01Z

Amazing to have ROCm support for ct2, if anyone is able to assist regarding older cards supporting newer ROCm versions, this would massively increase the number of AMD cards able to run ML apps. I can get whisper to run on gpu using below docker but ran into issues with ct2, tested a few months ago... I'm running on RX580 which was highly sought after during GPU shortages of COVID... I believe latest supported ROCm for gfx803 and gfx900 is 5.4.2 https://github.com/jrcichra/rocm-pytorch-gfx803 https://wiki.archlinux.org/title/AMD_Radeon_Instinct_MI25#ROCm

genehand · 2024-08-13T17:37:30Z

I realized it might not work after all since this relies on MIOpen which depends on composable_kernel which afaik isn't made / tuned for RDNA 1. It's still worth a try though.

Updated the table with faster-whisper results (includes loading the model) 😄 So far I'm not able to run whisperx, after messing with batch_size and HSA_OVERRIDE_GFX_VERSION I'm still running into what sounds like what you mentioned:

RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

chboishabba · 2024-08-14T01:36:48Z

This is a ROCm version error. Check your drivers and that you have rocm-pytorch, e.g. from my link above. I get an HSA override error because the RX580 isn't supported by ROCm 6, so I don't think it shows up as a device. You can test whether it's a torch error with torch.cuda.is_enabled(), which will show True even with AMD cards if ROCm is working.

chboishabba · 2024-08-14T01:37:09Z

Sorry is_available() I think
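
A minimal sketch of that check, assuming only that PyTorch may or may not be installed. `describe_torch_backend` is a hypothetical helper name; `torch.cuda.is_available()`, `torch.version.hip` and `torch.version.cuda` are existing PyTorch attributes.

```python
def describe_torch_backend() -> str:
    """Report which PyTorch build is installed and whether it sees a GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is set on ROCm builds, torch.version.cuda on CUDA builds.
    hip_version = getattr(torch.version, "hip", None)
    cuda_version = getattr(torch.version, "cuda", None)
    if hip_version:
        build = f"ROCm/HIP build ({hip_version})"
    elif cuda_version:
        build = f"CUDA build ({cuda_version})"
    else:
        build = "CPU-only build"

    # On ROCm builds the CUDA API names are reused, so this is True for a working AMD GPU.
    available = torch.cuda.is_available()
    return f"{build}; torch.cuda.is_available() = {available}"


if __name__ == "__main__":
    print(describe_torch_backend())
```

If this reports a CUDA build on an AMD system, the libtorch_cuda.so error mentioned earlier in the thread is the likely consequence, and reinstalling from the ROCm wheel index is the usual fix.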

Bazza-63 · 2024-08-17T12:42:41Z

Probably not possible, but if MIOpen were to be removed, would it compile on Windows? Or do all ctranslate2 models require it?

Donkey545 · 2024-08-17T19:43:01Z

For those interested in this thread, I made use of @arlo-phoenix 's fork to build a Wyoming Faster Whisper for ROCm container.

Check it out here if you are interested: https://github.com/Donkey545/wyoming-faster-whisper-rocm. I don't have much hardware to test with, so all I have tested is my APU.

chboishabba · 2024-08-19T02:59:59Z

Very cool I will be testing this on RX580. Cards of that age strike a good balance between power/compute/cost for many entry level users. Not sure if I already mentioned but there is a patch by I think xuhuisheng on GitHub for rocm on gfx803 and a few docker images floating around with the patches...

sleppyrobot · 2024-10-11T18:41:17Z

Just tried @arlo-phoenix's fork a bit ago and ran into two issues. The README points to one of them, which is about building the wheel. However, before getting to that point I got an error about the Intel runtime file libiomp5 not being found, even after installing the runtime. Adding -DOPENMP_RUNTIME=NONE to the cmake args fixed it.

I am on linux 24.04 with a 7900xtx python 3.10 and rocm 6.2

full command
CLANG_CMAKE_CXX_COMPILER=clang++ CXX=clang++ HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DWITH_MKL=OFF -DOPENMP_RUNTIME=NONE -DWITH_HIP=ON -DCMAKE_HIP_ARCHITECTURES=$PYTORCH_ROCM_ARCH -DBUILD_TESTS=ON -DWITH_CUDNN=ON
cmake --build build -- -j16
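
For scripted builds, the configure step above could be assembled along these lines. This is a sketch only: `cmake_configure_command` is a hypothetical helper, the option names are the ones from the command above, and the compiler-related environment variables (CXX, HIPCXX, HIP_PATH) still have to be set separately as shown there.

```python
import os
import shlex
from typing import Dict, List


def cmake_configure_command(
    options: Dict[str, str], source_dir: str = ".", build_dir: str = "build"
) -> List[str]:
    """Build the argument list for the cmake configure step."""
    command = ["cmake", "-S", source_dir, "-B", build_dir]
    command += [f"-D{name}={value}" for name, value in options.items()]
    return command


# Options copied from the command above; OPENMP_RUNTIME=NONE is the libiomp5 workaround.
rocm_options = {
    "WITH_MKL": "OFF",
    "OPENMP_RUNTIME": "NONE",
    "WITH_HIP": "ON",
    "BUILD_TESTS": "ON",
    "WITH_CUDNN": "ON",
}

# The target architecture is taken from the environment when it is set.
arch = os.environ.get("PYTORCH_ROCM_ARCH")
if arch:
    rocm_options["CMAKE_HIP_ARCHITECTURES"] = arch

print(shlex.join(cmake_configure_command(rocm_options)))
```

The script only prints the command, so it can be inspected before running it, for example by passing the list to `subprocess.run`.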

Bazza-63 · 2024-11-02T11:18:21Z

Thought some of you might find this project interesting. It's a ROCm builder with GPU specific patches and extended GPU support.

https://github.com/lamikr/rocm_sdk_builder

It really speeds up inference and I haven't experienced any system hangs. It's currently on ROCm 6.1.2.

commandline-be · 2024-11-04T08:32:54Z

Thanks for sharing. Having a Radeon VII, I'm at a loss over how its support has been put on EOL while it never really got much use as a CL accelerator. Meanwhile, it was basically AMD's spearhead for GPU accelerator cards.

moyutegong · 2024-11-12T08:12:33Z

I am unable to compile CTranslate2-rocm from source with ROCm on WSL2, so I downloaded the wheel built for Python 3.9 from Wyoming Faster Whisper for ROCm and installed it successfully. It can run faster_whisper, but Python 3.9 does not support many functions. My graphics card is also a 7900xtx. If possible, could you release a wheel for Python 3.10? I would be very grateful.

creeloper27 · 2024-11-12T17:12:43Z

@moyutegong Are you able to provide a link to the wheel you downloaded or the commands you used, by any chance?
Thanks in advance!

moyutegong · 2024-11-13T08:08:20Z

@moyutegong Are you able to provide a link to the wheel you downloaded or the commands you used, by any chance? Thanks in advance!

I first executed the commands according to https://github.com/arlo-phoenix/CTranslate2-rocm/blob/rocm/README_ROCM.md. When I got to the python setup.py bdist_wheel step, it failed; I was unable to build the wheel using WSL. Then, I downloaded https://github.com/Donkey545/wyoming-faster-whisper-rocm/blob/main/src/ctranslate2-4.1.0-cp39-cp39-linux_x86_64.whl and installed this wheel using pip, after which I was able to install faster-whisper.
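The `cp39-cp39` part of that wheel's filename is why it only installs on Python 3.9. As a small sketch (the helper names are made up for illustration, and it assumes a CPython interpreter and a standard wheel filename), the tags can be read from the filename and compared with the running interpreter:

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag).

    Wheel filenames end in -<python>-<abi>-<platform>.whl.
    """
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def interpreter_tag(version_info=None):
    """Return the CPython tag (for example cp310) for a (major, minor) version."""
    major, minor = (version_info or sys.version_info)[:2]
    return f"cp{major}{minor}"


def matches_interpreter(filename, version_info=None):
    """True if the wheel's python tag includes this interpreter's tag."""
    python_tag, _, _ = wheel_tags(filename)
    return interpreter_tag(version_info) in python_tag.split(".")


if __name__ == "__main__":
    name = "ctranslate2-4.1.0-cp39-cp39-linux_x86_64.whl"
    print(wheel_tags(name), matches_interpreter(name))
```

For the wheel linked above this returns a match only on Python 3.9, so a Python 3.10 environment needs a wheel built with a `cp310` tag.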

guillaumekln added enhancement New feature or request help wanted Extra attention is needed labels Feb 9, 2023

guillaumekln mentioned this issue Apr 19, 2023

Can I use my amd gpu to run it? althougn the official pytorch does not support it on windows however SYSTRAN/faster-whisper#162

Closed

Doomsdayrs mentioned this issue Jul 2, 2023

OpenCL for AMD / Intel GPUs? argosopentech/argos-translate#296

Closed

kandeshvari mentioned this issue Dec 14, 2023

ROCM support for AMD GPUs m-bain/whisperX#566

Open

commandline-be mentioned this issue Jan 23, 2024

Document how to build ctranslate2 with zenddn amd/ZenDNN-pytorch#1

Open

hlevring mentioned this issue Jan 23, 2024

OpenVino GPU plugin support #1603

Closed

dbolger mentioned this issue Feb 26, 2024

wyoming-(normal)-whisper? rhasspy/wyoming-faster-whisper#15

Closed

tannisroot mentioned this issue Mar 3, 2024

please include openvino iGPU support for hardware acceleration rhasspy/wyoming-faster-whisper#25

Closed

minhthuc2502 pinned this issue Apr 12, 2024

Snuupy mentioned this issue Aug 12, 2024

ROCm isnt supported abdeladim-s/subsai#67

Open
