
[ GEMM ] HGEMM noTrans case #2488

Closed
skykongkong8 opened this issue Feb 26, 2024 · 6 comments · Fixed by #2578

skykongkong8 (Member) commented Feb 26, 2024

The latest NNTrainer HGEMM (as of 26.02.24) does not take the TLB cache into account.
I believe this is one of the most likely reasons for the nonlinear latency in large-sized GEMM computation.
I am currently implementing micro- and macro-scale HGEMM kernels to resolve this issue: skykongkong8@ac9cbb5
and I am observing some latency improvements without deterioration of accuracy (w.r.t. full-fp16 HGEMM).
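
For readers new to the approach, here is a minimal sketch of the packing plus cache-blocking idea behind these kernels. This is not the actual NNTrainer code: the block sizes, the `pack_B` helper, and the scalar inner update are illustrative assumptions (`__fp16` is the Arm half-precision extension type).

```cpp
#include <cstring>

// Illustrative block sizes only; real values are tuned per target cache/TLB.
constexpr unsigned KC = 256; // K-dimension block
constexpr unsigned NC = 512; // N-dimension block

// Pack a kc x nc panel of B into one contiguous buffer so the inner loops
// stream it linearly and touch far fewer pages (fewer TLB misses).
static void pack_B(const __fp16 *B, unsigned ldb, unsigned kc, unsigned nc,
                   __fp16 *B_pack) {
  for (unsigned k = 0; k < kc; ++k)
    std::memcpy(&B_pack[k * nc], &B[k * ldb], nc * sizeof(__fp16));
}

// Cache-blocked C += A * B (row-major, fp16 inputs, fp32 accumulation).
// B_pack must hold at least KC * NC half-precision values.
void hgemm_blocked(unsigned M, unsigned N, unsigned K, const __fp16 *A,
                   const __fp16 *B, float *C, __fp16 *B_pack) {
  for (unsigned k0 = 0; k0 < K; k0 += KC) {
    const unsigned kc = (K - k0 < KC) ? K - k0 : KC;
    for (unsigned n0 = 0; n0 < N; n0 += NC) {
      const unsigned nc = (N - n0 < NC) ? N - n0 : NC;
      pack_B(&B[k0 * N + n0], N, kc, nc, B_pack);
      for (unsigned m = 0; m < M; ++m)       // macro loop over rows of A
        for (unsigned k = 0; k < kc; ++k) {  // rank-1 micro update
          const float a = static_cast<float>(A[m * K + k0 + k]);
          for (unsigned n = 0; n < nc; ++n)
            C[m * N + n0 + n] += a * static_cast<float>(B_pack[k * nc + n]);
        }
    }
  }
}
```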

[ outdated ]

  • v5 : apply SIMD in Brute Force implementation
  • v6 : v5 + loop unrolling
  • v7 : v6 + bigger kernel
  • v8 : v7 + cache blocking
  • v9 : v8 + packing
  • v10 : v9 + bigger kernel + adaptive kernel use
  • v11 : v10 + discontinuous packing on B
  • [ WIP ] v10_modified : v10 + small kernels + adaptive data packing

[ current status ]: #2531, #2541, #2578

  1. continuous data packing
  2. 4x4, 4x8, 8x8 hgemm kernel for f16 and f16-f32
  3. software prefetching (see the sketch below)
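
To make items 2 and 3 concrete, a hand-wavy 1x8 f16-f32 inner update with software prefetching is sketched below (ARMv8.2-A NEON, e.g. `-march=armv8.2-a+fp16`). The function name, the 1x8 shape, and the prefetch distance are assumptions for illustration; the actual kernels in the linked PRs use the 4x4/4x8/8x8 (and 8x16) tiles mentioned above.

```cpp
#include <arm_neon.h>

// One row of C (8 columns): A and packed B stay in fp16, C accumulates in
// fp32 (the "f16-f32" variant). b_pack holds K groups of 8 packed B values.
static inline void hgemm_kernel_1x8_f16_f32(unsigned K, const float16_t *a,
                                            const float16_t *b_pack,
                                            float *c) {
  float32x4_t acc0 = vld1q_f32(c);
  float32x4_t acc1 = vld1q_f32(c + 4);
  for (unsigned k = 0; k < K; ++k) {
    // Software prefetch a few iterations ahead of the packed B stream.
    __builtin_prefetch(b_pack + 8 * (k + 4));
    const float a_k = static_cast<float>(a[k]);
    const float16x8_t b_k = vld1q_f16(b_pack + 8 * k);
    acc0 = vfmaq_n_f32(acc0, vcvt_f32_f16(vget_low_f16(b_k)), a_k);
    acc1 = vfmaq_n_f32(acc1, vcvt_f32_f16(vget_high_f16(b_k)), a_k);
  }
  vst1q_f32(c, acc0);
  vst1q_f32(c + 4, acc1);
}
```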
taos-ci commented Feb 26, 2024

:octocat: cibot: Thank you for posting issue #2488. The person in charge will reply soon.

@skykongkong8 skykongkong8 self-assigned this Feb 26, 2024

skykongkong8 commented Feb 26, 2024

If you would like to comment on or review the WIP branch, please leave your feedback here, or let me know! :)

skykongkong8 changed the title from "[ GEMM ] Apply advanced cache miss avoiding techniques in HGEMM" to "[ GEMM ] Consider TLB cache in HGEMM" Feb 27, 2024
skykongkong8 commented

@s-debadri As we discussed, please get started from here :) Thanks a lot!


skykongkong8 commented Apr 8, 2024

Current Status: 08.04.2024

Unittest output using Galaxy S23 with #2541

| GEMM dimension | fp32 | prev | 8x8 f16-f32 | 8x16 | full-f16 |
| --- | --- | --- | --- | --- | --- |
| 4096 square | 2087 ms | 7172 ms | ... | 1964 ms | 1452 ms |
| 2048 square | 260 ms | 413 ms | ... | 250 ms | 185 ms |
| 1024 square | 34 ms | 52 ms | ... | 30 ms | 103 ms |
| 768 square | 13 ms | 18 ms | ... | 11 ms | 10 ms |
| 256x1440x256 | 2869 mcrs | 3807 mcrs | ... | 2544 mcrs | 2055 mcrs |
| 256x256x1440 | 2929 mcrs | 3950 mcrs | ... | 2467 mcrs | 2523 mcrs |
| 8x1440x8 | 5 mcrs | 5 mcrs | ... | 10 mcrs | |
| 8x8x1440 | 5 mcrs | 4 mcrs | ... | 8 mcrs | |


skykongkong8 commented Apr 25, 2024

Status Update: 24.04.2024

  • Macro style kernel
  • Adaptive loops for macros (see the sketch after this list)
  • More digits per loop
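
Roughly, the "adaptive loops for macros" idea can be pictured as below. This is only a sketch under assumptions: the 8x16/8x8 tile widths, the function-pointer micro-kernel signature, and the packed layouts are stand-ins, not the exact code of the local commit.

```cpp
#include <cstddef>

// Hypothetical micro-kernel signature: computes an 8 x W tile of C from an
// 8 x K slice of packed A and a W x K panel of packed B.
using MicroKernel = void (*)(std::size_t K, const __fp16 *a_pack,
                             const __fp16 *b_pack, float *c, std::size_t ldc);

// Adaptive macro loop over one 8-row strip of C: use the widest tile while
// full 16-column tiles remain, fall back to 8-column tiles, then finish the
// ragged edge with scalar f16-f32 updates.
void hgemm_macro_8xN(std::size_t N, std::size_t K, const __fp16 *a_pack,
                     const __fp16 *b_pack, float *c, std::size_t ldc,
                     MicroKernel kernel_8x16, MicroKernel kernel_8x8) {
  std::size_t n = 0;
  for (; n + 16 <= N; n += 16)
    kernel_8x16(K, a_pack, b_pack + n * K, c + n, ldc);
  for (; n + 8 <= N; n += 8)
    kernel_8x8(K, a_pack, b_pack + n * K, c + n, ldc);
  for (; n < N; ++n) // scalar edge handling
    for (std::size_t m = 0; m < 8; ++m)
      for (std::size_t k = 0; k < K; ++k)
        c[m * ldc + n] += static_cast<float>(a_pack[m * K + k]) *
                          static_cast<float>(b_pack[n * K + k]);
}
```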

Unittest output using Galaxy S23 with local commit (TBA)

Latency

mean latency with TC = 100

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
| --- | --- | --- | --- |
| 1024 | 23 ms | 30 ms | 32 ms |
| 768 | 9 ms | 12.8 ms | 13.6 ms |
| 256x1440x256 | 2054 mcrs | 2664 mcrs | 2701 mcrs |
| 256x256x1440 | 2359 mcrs | 2965 mcrs | 3104 mcrs |

mse w.r.t. sgemm

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
| --- | --- | --- |
| 1024 | 0.00608169 | 0.00226737 |
| 768 | 0.00310214 | 0.0017091 |
| 256x1440x256 | 0.0149112 | 0.00518965 |
| 256x256x1440 | 0.00119428 | 0.000306849 |
  • Overall, this shows a 150% boost-up with f16-f32 w.r.t. cblas fp32.
  • Considering the enlarged vector length from f32 to f16 and the partial accumulation, the result above sounds reasonable.
  • However, this code pays for that speed with a slight accuracy loss. It should be checked once more with model output (see the sketch below).
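
For context, here is a minimal sketch of what `KERNEL_8x16_ACC16` vs `KERNEL_8x16_ACC8` trades off, assuming ACC counts how many fp16 FMAs run before the partial sum is flushed into fp32. The 1x8 shape and the template are illustrative only; requires ARMv8.2-A with `+fp16`.

```cpp
#include <arm_neon.h>

// Partial accumulation: run ACC consecutive fp16 FMAs in half precision,
// then flush into fp32. A larger ACC means fewer widening conversions
// (lower latency) but more fp16 rounding (higher MSE), which matches the
// ACC16-vs-ACC8 latency/MSE tables above.
template <unsigned ACC>
static inline void dot_1x8_partial_acc(unsigned K, const float16_t *a,
                                       const float16_t *b_pack, float *c) {
  float32x4_t c0 = vld1q_f32(c), c1 = vld1q_f32(c + 4);
  for (unsigned k0 = 0; k0 < K; k0 += ACC) {
    float16x8_t p = vdupq_n_f16(0.0f); // fp16 partial sums for this chunk
    for (unsigned k = k0; k < k0 + ACC && k < K; ++k)
      p = vfmaq_n_f16(p, vld1q_f16(b_pack + 8 * k), a[k]);
    c0 = vaddq_f32(c0, vcvt_f32_f16(vget_low_f16(p))); // flush to fp32
    c1 = vaddq_f32(c1, vcvt_f32_f16(vget_high_f16(p)));
  }
  vst1q_f32(c, c0);
  vst1q_f32(c + 4, c1);
}

// Usage: dot_1x8_partial_acc<16>(K, a, b_pack, c); // faster, larger MSE
//        dot_1x8_partial_acc<8>(K, a, b_pack, c);  // slower, smaller MSE
```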

skykongkong8 changed the title from "[ GEMM ] Consider TLB cache in HGEMM" to "[ GEMM ] Kernel-based fp16 GEMM (HGEMM)" May 14, 2024
skykongkong8 changed the title from "[ GEMM ] Kernel-based fp16 GEMM (HGEMM)" to "[ GEMM ] HGEMM noTrans case (HGEMM)" May 14, 2024
skykongkong8 changed the title from "[ GEMM ] HGEMM noTrans case (HGEMM)" to "[ GEMM ] HGEMM noTrans case" May 14, 2024

This issue is temporarily resolved, and further work can be discussed in other issues.
