
Refactor/tinyblas #10343 (Draft)

Djip007 wants to merge 3 commits into master
Conversation

@Djip007 (Contributor) commented Nov 16, 2024

This is a sample of creating a full backend for "LLAMAFILE" (see #10183).

  • add a "ggml-tinyblas" backend
  • remove LLAMAFILE from the CPU backend
  • add BF16 GEMM for Zen 4 (a minimal instruction sketch is included at the end of this description)

TODO:

  • the CMakeLists.txt is largely copied from the CPU backend; it may be too big and need cleaning
  • multi-threading is only implemented with OpenMP

Note: it is possible to split sgemm.cpp (float / QN_0 / x86, ...).

I did not see a slowdown on my AMD Ryzen 9 7940HS (Zen 4).
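
For reference, the BF16 path for Zen 4 relies on the AVX512-BF16 instructions. The PR's actual kernel lives in sgemm.cpp; the snippet below is only a minimal, hypothetical sketch of the core idea (the `dot_bf16` helper and plain `uint16_t` storage of bfloat16 values are illustrative assumptions, not code from this PR):

```cpp
// Hedged sketch: accumulate a BF16 dot product with AVX512-BF16 on Zen 4.
// Assumes n is a multiple of 32 and that GCC/Clang accept the vector cast;
// compile with e.g. -mavx512f -mavx512bf16.
#include <immintrin.h>
#include <cstdint>

float dot_bf16(const uint16_t * a, const uint16_t * b, int64_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (int64_t i = 0; i < n; i += 32) {                      // 32 BF16 values per step
        __m512bh va = (__m512bh)_mm512_loadu_si512((const void *)(a + i));
        __m512bh vb = (__m512bh)_mm512_loadu_si512((const void *)(b + i));
        acc = _mm512_dpbf16_ps(acc, va, vb);                   // acc += va·vb, accumulated in FP32
    }
    return _mm512_reduce_add_ps(acc);                          // horizontal sum
}
```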

@slaren (Collaborator) commented Nov 16, 2024

I am not convinced that this is a good idea. The maintenance cost of keeping tinyblas in the CPU backend is effectively negligible; moving it to a separate backend, however, has a significant cost:

  • Requires duplicating a lot of code: this implementation is not correct as it is, because it depends on the CPU backend, and backends need to be independent of each other. This means that all the code for optimized quantization will need to be duplicated, in addition to the CMakeLists file and other smaller pieces.
  • In the future, it will make distributing a single llama.cpp package more complicated. My goal is to be able to bundle several versions of the CPU backend, each for a different instruction set, and choose the best one at runtime. So we would have e.g. ggml-cpu-avx.dll, ggml-cpu-avx2.dll, ggml-cpu-avx512.dll, and so on, and the best one would be used (see the dispatch sketch after this comment). We will also need to do the same for llamafile, effectively duplicating the number of files and the load process.
  • Currently this is not compatible with the ggml threadpool, which we should look into making the default in the future, instead of OpenMP.

I believe we should go the other way and instead remove the AMX backend and add its code to the CPU backend. At the time the AMX backend was created this was not feasible due to the weight repacking that it does, but this is no longer a problem and can be implemented in a similar way to the aarch64 online repacking.
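
To illustrate the runtime selection described in the second point above, a loader could probe the CPU and pick the most capable prebuilt library. This is only a hypothetical sketch (the file names mirror the examples above; the probing logic is an assumption, not llama.cpp's actual loader):

```cpp
// Hypothetical sketch of picking the best prebuilt CPU backend at load time,
// using the GCC/Clang __builtin_cpu_supports() feature probe on x86.
const char * pick_cpu_backend(void) {
    if (__builtin_cpu_supports("avx512f")) return "ggml-cpu-avx512.dll";
    if (__builtin_cpu_supports("avx2"))    return "ggml-cpu-avx2.dll";
    if (__builtin_cpu_supports("avx"))     return "ggml-cpu-avx.dll";
    return "ggml-cpu.dll";  // portable fallback
}
```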

@github-actions bot added the documentation (Improvements or additions to documentation), build (Compilation issues), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 16, 2024
@Djip007 (Contributor, Author) commented Nov 17, 2024

> Requires duplicating a lot of code: this implementation is not correct because it uses the CPU backend, which cannot be done since backends need to be independent of each other

That's the main point...

  • having backends independent of each other is nice for discrete accelerators (dGPU, PCIe FPGA, ...)
  • but it is too much work for iGPUs, NPUs, ... that use main memory or "special" CPU instructions.

Having a mega CPU backend that can use integrated accelerators will make the CPU backend a lot more complicated.

So, for me, if we do not allow a backend to share ops with the CPU backend, we need to add something new, like allowing ops to be registered in the CPU backend.
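
A purely hypothetical sketch of what "registering an op in the CPU backend" could look like; none of these names exist in ggml, they only illustrate the dispatch idea:

```cpp
// Hypothetical op-registration sketch: extensions provide alternative MUL_MAT
// kernels and the CPU backend picks the first one that supports the node.
#include <vector>

struct tensor;  // stand-in for ggml_tensor

struct mul_mat_impl {
    bool (*supports)(const tensor * dst);          // can this kernel handle the node?
    void (*compute)(tensor * dst, int n_threads);  // run it with n_threads threads
};

static std::vector<mul_mat_impl> g_mul_mat_impls;  // extensions register here

void register_mul_mat(mul_mat_impl impl) { g_mul_mat_impls.push_back(impl); }

void compute_mul_mat(tensor * dst, int n_threads) {
    for (const auto & impl : g_mul_mat_impls) {
        if (impl.supports(dst)) { impl.compute(dst, n_threads); return; }
    }
    // otherwise fall back to the generic CPU kernel
}
```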

> ggml-cpu-avx.dll, ggml-cpu-avx2.dll, ggml-cpu-avx512.dll

If we add the possibility to register ops, this is not needed: we can build all possible versions and use the best one at init time. llamafile does it in a way; we could use the gcc target mechanism, or create a more fully featured registration service.
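
The "gcc target" mechanism referred to here is function multi-versioning. A minimal sketch, illustrative only and not code from this PR:

```cpp
// GCC/Clang function multi-versioning: one clone is emitted per listed target
// and an ifunc resolver selects the best one when the binary is loaded.
__attribute__((target_clones("avx512f", "avx2", "default")))
void vec_add(float * dst, const float * a, const float * b, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```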

> can be implemented in a similar way to the aarch64 online repacking.

Online repacking is nice, but it would be better to have static repacking of the weights. 😎
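
For context, "repacking" here means rewriting the weight matrices into the tile layout the GEMM kernel consumes, either when the model is loaded (online) or ahead of time in the model file (static). A simplified, hedged sketch of the idea follows; the 4-column block layout is an illustrative assumption, not ggml's actual format:

```cpp
// Simplified repacking sketch: interleave a row-major K x N weight matrix into
// blocks of 4 columns so the GEMM kernel can stream one contiguous tile at a
// time. Online repacking would run this at model load time; static repacking
// would store the result in the model file. Assumes N is a multiple of 4.
#include <cstdint>
#include <vector>

std::vector<float> repack_4col(const float * w, int64_t K, int64_t N) {
    std::vector<float> packed(K * N);
    int64_t idx = 0;
    for (int64_t n0 = 0; n0 < N; n0 += 4) {          // one 4-column block
        for (int64_t k = 0; k < K; k++) {
            for (int64_t n = n0; n < n0 + 4; n++) {  // interleave the block's columns
                packed[idx++] = w[k * N + n];
            }
        }
    }
    return packed;
}
```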


Some benchmarks with this backend on an AMD Ryzen 9 7940HS with Mistral-7B-Instruct-v0.3:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 2.00 ± 0.01 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 3.98 ± 0.01 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 5.91 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 7.89 ± 0.04 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 9.78 ± 0.09 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 6.17 ± 0.01 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 7.17 ± 0.08 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 8.15 ± 0.07 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 9.12 ± 0.13 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 19.12 ± 0.12 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 11.19 ± 0.10 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 12.07 ± 0.28 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 13.00 ± 0.26 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 13.92 ± 0.43 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 27.27 ± 0.32 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 15.82 ± 0.23 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 42.86 ± 0.59 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 28.16 ± 0.29 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 41.06 ± 0.43 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 48.85 ± 0.39 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 43.93 ± 0.04 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 49.93 ± 2.38 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 49.87 ± 1.27 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 45.86 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 40.93 ± 0.91 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 41.69 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 2.00 ± 0.00 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 3.98 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 7.95 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 11.87 ± 0.06 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 15.73 ± 0.11 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 19.60 ± 0.10 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 12.67 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 14.72 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 16.75 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 18.70 ± 0.10 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 37.15 ± 0.08 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 22.66 ± 0.11 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 24.62 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 26.69 ± 0.11 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 28.50 ± 0.08 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 50.90 ± 0.48 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 31.65 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 61.04 ± 0.17 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 46.24 ± 0.25 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 51.26 ± 0.24 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 60.34 ± 1.61 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 56.52 ± 0.41 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 61.57 ± 0.50 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 57.19 ± 0.05 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 53.46 ± 0.04 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 45.76 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 44.95 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 3.98 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 4.00 ± 0.03 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 7.98 ± 0.03 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 11.92 ± 0.09 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 15.85 ± 0.07 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 19.72 ± 0.02 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 12.74 ± 0.09 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 14.91 ± 0.04 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 16.91 ± 0.02 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 19.04 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 38.40 ± 0.24 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 23.19 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 25.27 ± 0.16 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 27.30 ± 0.07 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 29.43 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 56.03 ± 0.34 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 33.32 ± 0.08 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 88.02 ± 0.60 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 59.20 ± 1.13 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 78.50 ± 0.55 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 82.31 ± 0.17 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 87.22 ± 1.69 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 99.64 ± 1.55 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 108.06 ± 0.40 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 99.17 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 97.31 ± 0.08 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 93.39 ± 0.09 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 4.00 ± 0.01 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 7.45 ± 0.05 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 14.84 ± 0.11 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 21.77 ± 0.72 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 28.98 ± 0.22 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 20.92 ± 0.17 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 24.80 ± 0.12 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 28.57 ± 0.14 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 51.67 ± 1.03 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 35.81 ± 0.05 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 39.00 ± 0.32 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 42.47 ± 0.16 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 60.79 ± 0.37 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 44.46 ± 0.34 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 47.41 ± 0.27 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 50.07 ± 0.33 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 64.08 ± 0.28 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 59.13 ± 0.13 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 68.56 ± 0.28 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 69.53 ± 1.65 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 62.54 ± 1.10 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 71.39 ± 0.79 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 67.31 ± 0.60 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 67.93 ± 0.11 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 69.19 ± 0.09 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 66.04 ± 0.05 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 65.54 ± 1.38 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 7.42 ± 0.05 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp1 | 9.53 ± 0.11 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp2 | 18.62 ± 0.06 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp3 | 26.01 ± 0.30 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp4 | 30.66 ± 0.18 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp5 | 31.69 ± 0.22 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp6 | 32.47 ± 0.39 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp7 | 32.15 ± 0.02 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp8 | 33.11 ± 0.90 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp9 | 35.84 ± 0.94 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp10 | 37.19 ± 0.52 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp11 | 37.40 ± 0.20 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp12 | 37.74 ± 0.03 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp13 | 38.00 ± 0.16 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp14 | 38.27 ± 0.16 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp15 | 38.45 ± 0.04 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp16 | 38.67 ± 0.13 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp30 | 39.03 ± 0.78 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp32 | 39.07 ± 1.12 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp64 | 39.79 ± 0.69 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp65 | 39.35 ± 0.61 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp128 | 40.30 ± 0.02 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp130 | 40.08 ± 0.06 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp255 | 40.07 ± 0.04 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp256 | 40.04 ± 0.04 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp510 | 39.57 ± 0.02 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | pp512 | 39.54 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | tinyBLAS | 8 | tg16 | 9.57 ± 0.10 |

@ggerganov (Owner) commented

> In the future, it will make distributing a single llama.cpp package more complicated. My goal is to be able to bundle several versions of the CPU backend, each for a different instruction set, and choose the best one at runtime. So we would have eg. ggml-cpu-avx.dll, ggml-cpu-avx2.dll, ggml-cpu-avx512.dll, and so on, and the best would be used. We will also need to do the same for llamafile, effectively duplicating the number of files and the load process.
>
> I believe we should go the other way and instead remove the AMX backend and add its code to the CPU backend. At the time the AMX backend was created this was not feasible due to the weight repacking that it does, but this is no longer a problem and can be implemented in a similar way to the aarch64 online repacking.

This makes a lot of sense. Will close the #10183 issue and create a new one to track the AMX backend integration in the CPU backend.

@ggerganov added the demo (Demonstrate some concept or idea, not intended to be merged) label on Nov 17, 2024