
Refactor/tinyblas #10343 (Draft)

Djip007 wants to merge 3 commits into master
Conversation

@Djip007 (Contributor) commented Nov 16, 2024

This is a sample of creating a full backend for "LLAMAFILE" (see #10183).

  • add a "ggml-tinyblas" backend
  • remove LLAMAFILE from the CPU backend
  • add BF16 GEMM for Zen 4 (a minimal instruction sketch is included at the end of this description)

TODO:

  • the CMakeLists.txt is largely copied from the CPU backend; it may be too big and need cleaning
  • multi-threading is only implemented with OpenMP

Note: it is possible to split sgemm.cpp (float / QN_0 / x86, ...).

I did not see a slowdown on my AMD Ryzen 9 7940HS (Zen 4).
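
For reference, the BF16 path for Zen 4 relies on the AVX512-BF16 instructions. The PR's actual kernel lives in sgemm.cpp; the snippet below is only a minimal, hypothetical sketch of the core idea (the `dot_bf16` helper and plain `uint16_t` storage of bfloat16 values are illustrative assumptions, not code from this PR):

```cpp
// Hedged sketch: accumulate a BF16 dot product with AVX512-BF16 on Zen 4.
// Assumes n is a multiple of 32 and that GCC/Clang accept the vector cast;
// compile with e.g. -mavx512f -mavx512bf16.
#include <immintrin.h>
#include <cstdint>

float dot_bf16(const uint16_t * a, const uint16_t * b, int64_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (int64_t i = 0; i < n; i += 32) {                      // 32 BF16 values per step
        __m512bh va = (__m512bh)_mm512_loadu_si512((const void *)(a + i));
        __m512bh vb = (__m512bh)_mm512_loadu_si512((const void *)(b + i));
        acc = _mm512_dpbf16_ps(acc, va, vb);                   // acc += va·vb, accumulated in FP32
    }
    return _mm512_reduce_add_ps(acc);                          // horizontal sum
}
```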

@slaren (Collaborator) commented Nov 16, 2024

I am not convinced that this is a good idea. The maintenance cost of keeping tinyblas in the CPU backend is effectively negligible; moving it to a separate backend, however, has a significant cost:

  • Requires duplicating a lot of code: this implementation is not correct as it is, because it depends on the CPU backend, and backends need to be independent of each other. This means that all the code for optimized quantization will need to be duplicated, in addition to the CMakeLists file and other smaller pieces.
  • In the future, it will make distributing a single llama.cpp package more complicated. My goal is to be able to bundle several versions of the CPU backend, each for a different instruction set, and choose the best one at runtime. So we would have e.g. ggml-cpu-avx.dll, ggml-cpu-avx2.dll, ggml-cpu-avx512.dll, and so on, and the best one would be used (see the dispatch sketch after this comment). We will also need to do the same for llamafile, effectively duplicating the number of files and the load process.
  • Currently this is not compatible with the ggml threadpool, which we should look into making the default in the future, instead of OpenMP.

I believe we should go the other way and instead remove the AMX backend and add its code to the CPU backend. At the time the AMX backend was created this was not feasible due to the weight repacking that it does, but this is no longer a problem and can be implemented in a similar way to the aarch64 online repacking.
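
To illustrate the runtime selection described in the second point above, a loader could probe the CPU and pick the most capable prebuilt library. This is only a hypothetical sketch (the file names mirror the examples above; the probing logic is an assumption, not llama.cpp's actual loader):

```cpp
// Hypothetical sketch of picking the best prebuilt CPU backend at load time,
// using the GCC/Clang __builtin_cpu_supports() feature probe on x86.
const char * pick_cpu_backend(void) {
    if (__builtin_cpu_supports("avx512f")) return "ggml-cpu-avx512.dll";
    if (__builtin_cpu_supports("avx2"))    return "ggml-cpu-avx2.dll";
    if (__builtin_cpu_supports("avx"))     return "ggml-cpu-avx.dll";
    return "ggml-cpu.dll";  // portable fallback
}
```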

@github-actions bot added the documentation (Improvements or additions to documentation), build (Compilation issues), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 16, 2024
@Djip007 (Contributor, Author) commented Nov 17, 2024

> Requires duplicating a lot of code: this implementation is not correct because it uses the CPU backend, which cannot be done since backends need to be independent of each other

That's the main point...

  • having backends independent of each other is nice for discrete accelerators (dGPU, PCIe FPGA, ...)
  • but it is too much work for iGPUs, NPUs, ... that use main memory or "special" CPU instructions.

Having a mega CPU backend that can use integrated accelerators will make the CPU backend a lot more complicated.

So, for me, if we do not allow a backend to share ops with the CPU backend, we need to add something new, like allowing ops to be registered in the CPU backend.
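
A purely hypothetical sketch of what "registering an op in the CPU backend" could look like; none of these names exist in ggml, they only illustrate the dispatch idea:

```cpp
// Hypothetical op-registration sketch: extensions provide alternative MUL_MAT
// kernels and the CPU backend picks the first one that supports the node.
#include <vector>

struct tensor;  // stand-in for ggml_tensor

struct mul_mat_impl {
    bool (*supports)(const tensor * dst);          // can this kernel handle the node?
    void (*compute)(tensor * dst, int n_threads);  // run it with n_threads threads
};

static std::vector<mul_mat_impl> g_mul_mat_impls;  // extensions register here

void register_mul_mat(mul_mat_impl impl) { g_mul_mat_impls.push_back(impl); }

void compute_mul_mat(tensor * dst, int n_threads) {
    for (const auto & impl : g_mul_mat_impls) {
        if (impl.supports(dst)) { impl.compute(dst, n_threads); return; }
    }
    // otherwise fall back to the generic CPU kernel
}
```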

> ggml-cpu-avx.dll, ggml-cpu-avx2.dll, ggml-cpu-avx512.dll

If we add the possibility to register ops, this is not needed: we can build all possible versions and use the best one at init time. llamafile does it in a way; we could use the gcc target mechanism, or create a more fully featured registration service.
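
The "gcc target" mechanism referred to here is function multi-versioning. A minimal sketch, illustrative only and not code from this PR:

```cpp
// GCC/Clang function multi-versioning: one clone is emitted per listed target
// and an ifunc resolver selects the best one when the binary is loaded.
__attribute__((target_clones("avx512f", "avx2", "default")))
void vec_add(float * dst, const float * a, const float * b, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```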

> can be implemented in a similar way to the aarch64 online repacking.

Online repacking is nice, but it would be better to have static repacking of the weights. 😎
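
For context, "repacking" here means rewriting the weight matrices into the tile layout the GEMM kernel consumes, either when the model is loaded (online) or ahead of time in the model file (static). A simplified, hedged sketch of the idea follows; the 4-column block layout is an illustrative assumption, not ggml's actual format:

```cpp
// Simplified repacking sketch: interleave a row-major K x N weight matrix into
// blocks of 4 columns so the GEMM kernel can stream one contiguous tile at a
// time. Online repacking would run this at model load time; static repacking
// would store the result in the model file. Assumes N is a multiple of 4.
#include <cstdint>
#include <vector>

std::vector<float> repack_4col(const float * w, int64_t K, int64_t N) {
    std::vector<float> packed(K * N);
    int64_t idx = 0;
    for (int64_t n0 = 0; n0 < N; n0 += 4) {          // one 4-column block
        for (int64_t k = 0; k < K; k++) {
            for (int64_t n = n0; n < n0 + 4; n++) {  // interleave the block's columns
                packed[idx++] = w[k * N + n];
            }
        }
    }
    return packed;
}
```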


Some benchmarks with this backend on an AMD Ryzen 9 7940HS with Mistral-7B-Instruct-v0.3:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 2.00 ± 0.01 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 3.98 ± 0.01 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 5.91 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 7.89 ± 0.04 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 9.78 ± 0.09 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 6.17 ± 0.01 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 7.17 ± 0.08 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 8.15 ± 0.07 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 9.12 ± 0.13 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 19.12 ± 0.12 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 11.19 ± 0.10 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 12.07 ± 0.28 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 13.00 ± 0.26 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 13.92 ± 0.43 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 27.27 ± 0.32 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 15.82 ± 0.23 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 42.86 ± 0.59 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 28.16 ± 0.29 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 41.06 ± 0.43 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 48.85 ± 0.39 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 43.93 ± 0.04 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 49.93 ± 2.38 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 49.87 ± 1.27 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 45.86 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 40.93 ± 0.91 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 41.69 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 2.00 ± 0.00 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 3.98 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 7.95 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 11.87 ± 0.06 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 15.73 ± 0.11 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 19.60 ± 0.10 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 12.67 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 14.72 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 16.75 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 18.70 ± 0.10 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 37.15 ± 0.08 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 22.66 ± 0.11 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 24.62 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 26.69 ± 0.11 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 28.50 ± 0.08 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 50.90 ± 0.48 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 31.65 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 61.04 ± 0.17 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 46.24 ± 0.25 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 51.26 ± 0.24 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 60.34 ± 1.61 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 56.52 ± 0.41 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 61.57 ± 0.50 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 57.19 ± 0.05 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 53.46 ± 0.04 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 45.76 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 44.95 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 3.98 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 4.00 ± 0.03 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 7.98 ± 0.03 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 11.92 ± 0.09 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 15.85 ± 0.07 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 19.72 ± 0.02 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 12.74 ± 0.09 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 14.91 ± 0.04 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 16.91 ± 0.02 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 19.04 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 38.40 ± 0.24 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 23.19 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 25.27 ± 0.16 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 27.30 ± 0.07 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 29.43 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 56.03 ± 0.34 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 33.32 ± 0.08 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 88.02 ± 0.60 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 59.20 ± 1.13 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 78.50 ± 0.55 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 82.31 ± 0.17 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 87.22 ± 1.69 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 99.64 ± 1.55 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 108.06 ± 0.40 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 99.17 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 97.31 ± 0.08 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 93.39 ± 0.09 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 4.00 ± 0.01 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 7.45 ± 0.05 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 14.84 ± 0.11 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 21.77 ± 0.72 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 28.98 ± 0.22 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 20.92 ± 0.17 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 24.80 ± 0.12 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 28.57 ± 0.14 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 51.67 ± 1.03 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 35.81 ± 0.05 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 39.00 ± 0.32 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 42.47 ± 0.16 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 60.79 ± 0.37 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 44.46 ± 0.34 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 47.41 ± 0.27 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 50.07 ± 0.33 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 64.08 ± 0.28 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 59.13 ± 0.13 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 68.56 ± 0.28 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 69.53 ± 1.65 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 62.54 ± 1.10 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 71.39 ± 0.79 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 67.31 ± 0.60 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 67.93 ± 0.11 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 69.19 ± 0.09 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 66.04 ± 0.05 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 65.54 ± 1.38 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 7.42 ± 0.05 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 9.53 ± 0.11 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 18.62 ± 0.06 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 26.01 ± 0.30 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 30.66 ± 0.18 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 31.69 ± 0.22 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 32.47 ± 0.39 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 32.15 ± 0.02 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 33.11 ± 0.90 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 35.84 ± 0.94 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 37.19 ± 0.52 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 37.40 ± 0.20 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 37.74 ± 0.03 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 38.00 ± 0.16 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 38.27 ± 0.16 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 38.45 ± 0.04 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 38.67 ± 0.13 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 39.03 ± 0.78 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 39.07 ± 1.12 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 39.79 ± 0.69 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 39.35 ± 0.61 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 40.30 ± 0.02 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 40.08 ± 0.06 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 40.07 ± 0.04 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 40.04 ± 0.04 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 39.57 ± 0.02 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 39.54 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 9.57 ± 0.10 |

@ggerganov (Owner) commented

> In the future, it will make distributing a single llama.cpp package more complicated. My goal is to be able to bundle several versions of the CPU backend, each for a different instruction set, and choose the best one at runtime. So we would have eg. ggml-cpu-avx.dll, ggml-cpu-avx2.dll, ggml-cpu-avx512.dll, and so on, and the best would be used. We will also need to do the same for llamafile, effectively duplicating the number of files and the load process.
>
> I believe we should go the other way and instead remove the AMX backend and add its code to the CPU backend. At the time the AMX backend was created this was not feasible due to the weight repacking that it does, but this is no longer a problem and can be implemented in a similar way to the aarch64 online repacking.

This makes a lot of sense. Will close the #10183 issue and create a new one to track the AMX backend integration in the CPU backend.

@ggerganov added the demo (Demonstrate some concept or idea, not intended to be merged) label on Nov 17, 2024