
[ GEMM ] HGEMM noTrans case #2488

Closed
skykongkong8 opened this issue Feb 26, 2024 · 6 comments · Fixed by #2578

skykongkong8 (Member) commented Feb 26, 2024

The latest NNTrainer HGEMM (as of 26.02.24) does not take the TLB cache into account.
I believe this is one of the most likely reasons for the nonlinear latency in large-sized GEMM computation.
I am currently implementing micro- and macro-scale HGEMM kernels to resolve this issue: skykongkong8@ac9cbb5
and I am observing some latency improvements without deterioration of accuracy (w.r.t. full-fp16 HGEMM).
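
For readers new to the approach, here is a minimal sketch of the packing plus cache-blocking idea behind these kernels. This is not the actual NNTrainer code: the block sizes, the `pack_B` helper, and the scalar inner update are illustrative assumptions (`__fp16` is the Arm half-precision extension type).

```cpp
#include <cstring>

// Illustrative block sizes only; real values are tuned per target cache/TLB.
constexpr unsigned KC = 256; // K-dimension block
constexpr unsigned NC = 512; // N-dimension block

// Pack a kc x nc panel of B into one contiguous buffer so the inner loops
// stream it linearly and touch far fewer pages (fewer TLB misses).
static void pack_B(const __fp16 *B, unsigned ldb, unsigned kc, unsigned nc,
                   __fp16 *B_pack) {
  for (unsigned k = 0; k < kc; ++k)
    std::memcpy(&B_pack[k * nc], &B[k * ldb], nc * sizeof(__fp16));
}

// Cache-blocked C += A * B (row-major, fp16 inputs, fp32 accumulation).
// B_pack must hold at least KC * NC half-precision values.
void hgemm_blocked(unsigned M, unsigned N, unsigned K, const __fp16 *A,
                   const __fp16 *B, float *C, __fp16 *B_pack) {
  for (unsigned k0 = 0; k0 < K; k0 += KC) {
    const unsigned kc = (K - k0 < KC) ? K - k0 : KC;
    for (unsigned n0 = 0; n0 < N; n0 += NC) {
      const unsigned nc = (N - n0 < NC) ? N - n0 : NC;
      pack_B(&B[k0 * N + n0], N, kc, nc, B_pack);
      for (unsigned m = 0; m < M; ++m)       // macro loop over rows of A
        for (unsigned k = 0; k < kc; ++k) {  // rank-1 micro update
          const float a = static_cast<float>(A[m * K + k0 + k]);
          for (unsigned n = 0; n < nc; ++n)
            C[m * N + n0 + n] += a * static_cast<float>(B_pack[k * nc + n]);
        }
    }
  }
}
```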

[ outdated ]

  • v5 : apply SIMD in Brute Force implementation
  • v6 : v5 + loop unrolling
  • v7 : v6 + bigger kernel
  • v8 : v7 + cache blocking
  • v9 : v8 + packing
  • v10 : v9 + bigger kernel + adaptive kernel use
  • v11 : v10 + discontinuous packing on B
  • [ WIP ] v10_modified : v10 + small kernels + adaptive data packing

[ current status ]: #2531, #2541, #2578

  1. continuous data packing
  2. 4x4, 4x8, 8x8 hgemm kernel for f16 and f16-f32
  3. software prefetching (see the sketch below)
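
To make items 2 and 3 concrete, a hand-wavy 1x8 f16-f32 inner update with software prefetching is sketched below (ARMv8.2-A NEON, e.g. `-march=armv8.2-a+fp16`). The function name, the 1x8 shape, and the prefetch distance are assumptions for illustration; the actual kernels in the linked PRs use the 4x4/4x8/8x8 (and 8x16) tiles mentioned above.

```cpp
#include <arm_neon.h>

// One row of C (8 columns): A and packed B stay in fp16, C accumulates in
// fp32 (the "f16-f32" variant). b_pack holds K groups of 8 packed B values.
static inline void hgemm_kernel_1x8_f16_f32(unsigned K, const float16_t *a,
                                            const float16_t *b_pack,
                                            float *c) {
  float32x4_t acc0 = vld1q_f32(c);
  float32x4_t acc1 = vld1q_f32(c + 4);
  for (unsigned k = 0; k < K; ++k) {
    // Software prefetch a few iterations ahead of the packed B stream.
    __builtin_prefetch(b_pack + 8 * (k + 4));
    const float a_k = static_cast<float>(a[k]);
    const float16x8_t b_k = vld1q_f16(b_pack + 8 * k);
    acc0 = vfmaq_n_f32(acc0, vcvt_f32_f16(vget_low_f16(b_k)), a_k);
    acc1 = vfmaq_n_f32(acc1, vcvt_f32_f16(vget_high_f16(b_k)), a_k);
  }
  vst1q_f32(c, acc0);
  vst1q_f32(c + 4, acc1);
}
```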
taos-ci commented Feb 26, 2024

:octocat: cibot: Thank you for posting issue #2488. The person in charge will reply soon.

@skykongkong8 skykongkong8 self-assigned this Feb 26, 2024

skykongkong8 commented Feb 26, 2024

If you would like to comment on or review the WIP branch, please leave your feedback here, or let me know! :)

skykongkong8 changed the title from "[ GEMM ] Apply advanced cache miss avoiding techniques in HGEMM" to "[ GEMM ] Consider TLB cache in HGEMM" Feb 27, 2024
skykongkong8 commented

@s-debadri As we discussed, please get started from here :) Thanks a lot!


skykongkong8 commented Apr 8, 2024

Current Status: 08.04.2024

Unittest output using Galaxy S23 with #2541

| GEMM dimension | fp32 | prev | 8x8 f16-f32 | 8x16 | full-f16 |
| --- | --- | --- | --- | --- | --- |
| 4096 square | 2087 ms | 7172 ms | ... | 1964 ms | 1452 ms |
| 2048 square | 260 ms | 413 ms | ... | 250 ms | 185 ms |
| 1024 square | 34 ms | 52 ms | ... | 30 ms | 103 ms |
| 768 square | 13 ms | 18 ms | ... | 11 ms | 10 ms |
| 256x1440x256 | 2869 mcrs | 3807 mcrs | ... | 2544 mcrs | 2055 mcrs |
| 256x256x1440 | 2929 mcrs | 3950 mcrs | ... | 2467 mcrs | 2523 mcrs |
| 8x1440x8 | 5 mcrs | 5 mcrs | ... | 10 mcrs | |
| 8x8x1440 | 5 mcrs | 4 mcrs | ... | 8 mcrs | |


skykongkong8 commented Apr 25, 2024

Status Update: 24.04.2024

  • Macro style kernel
  • Adaptive loops for macros (see the sketch after this list)
  • More digits per loop
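
Roughly, the "adaptive loops for macros" idea can be pictured as below. This is only a sketch under assumptions: the 8x16/8x8 tile widths, the function-pointer micro-kernel signature, and the packed layouts are stand-ins, not the exact code of the local commit.

```cpp
#include <cstddef>

// Hypothetical micro-kernel signature: computes an 8 x W tile of C from an
// 8 x K slice of packed A and a W x K panel of packed B.
using MicroKernel = void (*)(std::size_t K, const __fp16 *a_pack,
                             const __fp16 *b_pack, float *c, std::size_t ldc);

// Adaptive macro loop over one 8-row strip of C: use the widest tile while
// full 16-column tiles remain, fall back to 8-column tiles, then finish the
// ragged edge with scalar f16-f32 updates.
void hgemm_macro_8xN(std::size_t N, std::size_t K, const __fp16 *a_pack,
                     const __fp16 *b_pack, float *c, std::size_t ldc,
                     MicroKernel kernel_8x16, MicroKernel kernel_8x8) {
  std::size_t n = 0;
  for (; n + 16 <= N; n += 16)
    kernel_8x16(K, a_pack, b_pack + n * K, c + n, ldc);
  for (; n + 8 <= N; n += 8)
    kernel_8x8(K, a_pack, b_pack + n * K, c + n, ldc);
  for (; n < N; ++n) // scalar edge handling
    for (std::size_t m = 0; m < 8; ++m)
      for (std::size_t k = 0; k < K; ++k)
        c[m * ldc + n] += static_cast<float>(a_pack[m * K + k]) *
                          static_cast<float>(b_pack[n * K + k]);
}
```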

Unittest output using Galaxy S23 with local commit (TBA)

Latency

mean latency with TC = 100

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
| --- | --- | --- | --- |
| 1024 | 23 ms | 30 ms | 32 ms |
| 768 | 9 ms | 12.8 ms | 13.6 ms |
| 256x1440x256 | 2054 mcrs | 2664 mcrs | 2701 mcrs |
| 256x256x1440 | 2359 mcrs | 2965 mcrs | 3104 mcrs |

mse w.r.t. sgemm

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
| --- | --- | --- |
| 1024 | 0.00608169 | 0.00226737 |
| 768 | 0.00310214 | 0.0017091 |
| 256x1440x256 | 0.0149112 | 0.00518965 |
| 256x256x1440 | 0.00119428 | 0.000306849 |
  • Overall, this shows a 150% boost-up with f16-f32 w.r.t. cblas fp32.
  • Considering the enlarged vector length from f32 to f16 and the partial accumulation, the result above sounds reasonable.
  • However, this code pays for that speed with a slight accuracy loss. It should be checked once more with model output (see the sketch below).
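
For context, here is a minimal sketch of what `KERNEL_8x16_ACC16` vs `KERNEL_8x16_ACC8` trades off, assuming ACC counts how many fp16 FMAs run before the partial sum is flushed into fp32. The 1x8 shape and the template are illustrative only; requires ARMv8.2-A with `+fp16`.

```cpp
#include <arm_neon.h>

// Partial accumulation: run ACC consecutive fp16 FMAs in half precision,
// then flush into fp32. A larger ACC means fewer widening conversions
// (lower latency) but more fp16 rounding (higher MSE), which matches the
// ACC16-vs-ACC8 latency/MSE tables above.
template <unsigned ACC>
static inline void dot_1x8_partial_acc(unsigned K, const float16_t *a,
                                       const float16_t *b_pack, float *c) {
  float32x4_t c0 = vld1q_f32(c), c1 = vld1q_f32(c + 4);
  for (unsigned k0 = 0; k0 < K; k0 += ACC) {
    float16x8_t p = vdupq_n_f16(0.0f); // fp16 partial sums for this chunk
    for (unsigned k = k0; k < k0 + ACC && k < K; ++k)
      p = vfmaq_n_f16(p, vld1q_f16(b_pack + 8 * k), a[k]);
    c0 = vaddq_f32(c0, vcvt_f32_f16(vget_low_f16(p))); // flush to fp32
    c1 = vaddq_f32(c1, vcvt_f32_f16(vget_high_f16(p)));
  }
  vst1q_f32(c, c0);
  vst1q_f32(c + 4, c1);
}

// Usage: dot_1x8_partial_acc<16>(K, a, b_pack, c); // faster, larger MSE
//        dot_1x8_partial_acc<8>(K, a, b_pack, c);  // slower, smaller MSE
```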

skykongkong8 changed the title from "[ GEMM ] Consider TLB cache in HGEMM" to "[ GEMM ] Kernel-based fp16 GEMM (HGEMM)" May 14, 2024
skykongkong8 changed the title from "[ GEMM ] Kernel-based fp16 GEMM (HGEMM)" to "[ GEMM ] HGEMM noTrans case (HGEMM)" May 14, 2024
skykongkong8 changed the title from "[ GEMM ] HGEMM noTrans case (HGEMM)" to "[ GEMM ] HGEMM noTrans case" May 14, 2024

This issue is temporarily resolved, and further work can be discussed in other issues.
