
[ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 #2578

Merged: 8 commits into nnstreamer:main on May 22, 2024

Conversation

skykongkong8
Member

@skykongkong8 skykongkong8 commented May 10, 2024

Unit test output using a Galaxy S23 with the local commit

Latency

mean latency with TC = 100

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
| --- | --- | --- | --- |
| 1024 | 23 ms | 30 ms | 32 ms |
| 768 | 9 ms | 12.8 ms | 13.6 ms |
| 256x1440x256 | 2054 mcrs | 2664 mcrs | 2701 mcrs |
| 256x256x1440 | 2359 mcrs | 2965 mcrs | 3104 mcrs |

MSE w.r.t. sgemm

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
| --- | --- | --- |
| 1024 | 0.00608169 | 0.00226737 |
| 768 | 0.00310214 | 0.0017091 |
| 256x1440x256 | 0.0149112 | 0.00518965 |
| 256x256x1440 | 0.00119428 | 0.000306849 |
  • Overall, this shows roughly a 150% speed-up of the f16-f32 kernels w.r.t. cblas fp32.
  • Considering the doubled vector length when moving from f32 to f16, plus the partial accumulation, the result above sounds reasonable.
  • However, this code trades a small amount of accuracy for that speed-up. It should be checked once more against model output.

+ Trivial refactor:

  • unified macro-style GEMM kernels
  • adaptive K-direction loops (a rough sketch of the idea follows below)
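
For readers unfamiliar with the adaptive K-direction loop mentioned above, here is a minimal, hedged sketch of the dispatch idea: consume K with the deepest available partial-sum depth first, then fall back to shallower ones, so K no longer needs to be divisible by a single fixed depth. The function name, the ACC4 depth, and the scalar remainder are illustrative assumptions, not the actual nntrainer kernels.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for a macro-generated partial-sum kernel body:
// accumulate `acc` K-steps in fp16, then fold the result into fp32.
static void kernel_8x16_accN(std::size_t acc, std::size_t k0) {
  std::printf("accumulate k in [%zu, %zu) in fp16, fold into fp32\n", k0, k0 + acc);
}

// Adaptive K-direction loop: no divisibility constraint on K remains,
// because leftover iterations fall through to shallower depths.
static void run_tile_K(std::size_t K) {
  std::size_t k = 0;
  for (; k + 16 <= K; k += 16) kernel_8x16_accN(16, k); // KERNEL_8x16_ACC16
  for (; k + 8 <= K; k += 8)   kernel_8x16_accN(8, k);  // KERNEL_8x16_ACC8
  for (; k + 4 <= K; k += 4)   kernel_8x16_accN(4, k);  // hypothetical ACC4
  for (; k < K; ++k)           kernel_8x16_accN(1, k);  // scalar remainder
}

int main() { run_tile_K(23); } // 23 = 16 + 4 + 3 x 1
```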

@taos-ci

taos-ci commented May 10, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2578. Please follow the 1 commit / 1 PR (one commit per PR) policy to get comments from reviewers quickly. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. In order to monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8
Member Author

This PR addresses the issue raised in #2488.

@taos-ci

taos-ci commented May 10, 2024

:octocat: cibot: @skykongkong8, a build check could not be completed because one of the checkers did not finish. In order to find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2578-202405100912040.87964606285095-8c6c01e8d19abf1cdb80840c610af30669339647/.

@skykongkong8 skykongkong8 changed the title [ Wait for #2577 ] [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor [ Wait for #2577 ] [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 May 10, 2024

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon
Collaborator

What do ACC16 and ACC8 mean?

@skykongkong8
Member Author

skykongkong8 commented May 10, 2024

What do ACC16 and ACC8 mean?

They refer to the number of K-direction iterations of the tile in the current GEMM kernel.
Suppose GEMM(M, K, N, ...).
Using an a x b GEMM kernel with ACC'c' will therefore compute a * b * c values in fp16, and accumulate into f32 M * K * N / (a * b * c) times.
That's why I wrote partial sum up to '2048 digits': with 8x16 ACC16, 8 * 16 * 16 = 2048.

In a nutshell (a scalar sketch follows below):

  • 8x16 kernel : tile size of 8x16 (related to M and N)
  • ACC16 / ACC8 : iterate the current tile along the K-direction 16 / 8 times before folding the fp16 partial sum into fp32
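
To make the arithmetic above concrete, here is a minimal scalar sketch of the partial-accumulation scheme, assuming a compiler that supports the `_Float16` extension (e.g. GCC/Clang on AArch64); the real kernel does the same per 8x16 tile with half-precision NEON vectors rather than scalars.

```cpp
#include <cstddef>

using fp16 = _Float16; // compiler extension; an assumption for this sketch only

// Sum ACC fp16 products at a time, then fold each partial sum into an fp32
// accumulator. For the 8x16 ACC16 kernel, one fp32 fold therefore covers
// 8 * 16 * 16 = 2048 fp16 products, and the fold happens M * K * N / 2048 times.
void gemm_partial_acc(std::size_t M, std::size_t N, std::size_t K, std::size_t ACC,
                      const fp16 *A, const fp16 *B, float *C) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float c32 = 0.f;                              // fp32 accumulator ("C32")
      for (std::size_t k0 = 0; k0 < K; k0 += ACC) {
        fp16 c16 = 0;                               // fp16 partial sum, reset every ACC steps
        std::size_t kend = (k0 + ACC < K) ? k0 + ACC : K;
        for (std::size_t k = k0; k < kend; ++k)
          c16 += A[m * K + k] * B[k * N + n];       // multiply-accumulate in fp16
        c32 += static_cast<float>(c16);             // widen and accumulate in fp32
      }
      C[m * N + n] = c32;
    }
  }
}
```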

```cpp
const int MOD = 10;

GEN_TEST_INPUT(A, ((i * (batch * height * channel) + j * (batch * height) +
                    k * (width) + l + 1) %
```
Contributor


Is the parenthesis around (width) just for readability?

Member Author


Correct

Contributor

@baek2sm baek2sm left a comment


LGTM

Member

@SeoHyungjun SeoHyungjun left a comment


LGTM

```diff
  hgemm_noTrans_4x8(M, N, K, A, K, B, N, C32, N, alpha, beta);
- } else if (N % 8 == 0) {
+ } else if ((K & 0x7) == 0 && (N & 0x7) == 0) {
```
Member


In terms of readability, wouldn't it be better to put K after N?
The order of the function parameters and the other conditions is also M, N, K.

Member Author


good point
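
As a small aside on the dispatch condition shown in this thread: for power-of-two divisors the bitmask test is equivalent to the modulo test, and keeping the operands in M, N, K order keeps the condition consistent with the kernel signatures. The helper below is only an illustration, not code from this PR.

```cpp
#include <cassert>

// For any unsigned v, (v & 0x7) == 0 holds exactly when v % 8 == 0,
// because the low three bits of v encode the remainder modulo 8.
static bool divisible_by_8(unsigned v) {
  assert(((v % 8) == 0) == ((v & 0x7u) == 0));
  return (v & 0x7u) == 0;
}

// Illustrative dispatch shape with the operands ordered N, then K,
// matching the M, N, K parameter order of the hgemm functions:
// if (divisible_by_8(N) && divisible_by_8(K)) { /* pick the 8x... kernel */ }
```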

- Implement a 4x4 GEMM kernel that performs f16-f32 partial accumulation

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Now hgemm supports the 4x4 f16-f32 partial accumulation strategy

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- With macro-defined code, the function latency is expected to be optimized more easily by the compiler

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…16 kernel

- With more values computed in fp16 (in this case, 1024 -> 2048 per partial sum), I could observe a latency improvement at the cost of some accuracy loss. However, according to the current accuracy measurement criteria, it is still acceptable. Note that it is highly desirable to verify this with model output once more.
- With a variety of partial-sum kernels, we can adaptively apply internal macro kernels without being constrained to K-divisibility w.r.t. 4, 8, 16.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…8x8 kernel

- Apply a change similar to the one made in commit #52a3c734, but for the 8x8 kernel

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- To avoid the constraint that K must be divisible by 4 or 8, loop adaptively along the K direction.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- I found repeated matrix initialization before the fused multiply-add operations.
- With separate initialization code (a short sketch follows below), we gain:
	1. Cleaner code that is reusable for both the f16 and the f16-f32 kernels
	2. A minimized redundant init process for the f16 kernel: better latency with the SAME accuracy.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
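
A rough sketch of the separate-initialization idea described in the commit above; the function names are hypothetical, not the actual nntrainer macro kernels. The point is that the accumulator tile is zeroed once up front, so the fused multiply-add body carries no initialization and can be shared by the f16 and f16-f32 paths.

```cpp
#include <cstddef>
#include <cstring>

// Zero the accumulator tile once per tile, instead of re-initializing
// inside every multiply-add step.
static void init_acc(float *acc, std::size_t len) {
  std::memset(acc, 0, len * sizeof(float));
}

// Pure fused multiply-add body: free of init code, reusable by both the
// f16 and the f16-f32 kernel variants.
static void fma_acc(float *acc, const float *a, const float *b, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    acc[i] += a[i] * b[i];
}
```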
- Due to adaptive macro kernel usage, the previous comment is no longer needed.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 changed the title [ Wait for #2577 ] [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 May 20, 2024

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon jijoongmoon merged commit e75f8ba into nnstreamer:main May 22, 2024
32 checks passed
@skykongkong8 skykongkong8 deleted the hgemm/f16acc/2048 branch June 18, 2024 02:01
Successfully merging this pull request may close these issues:

  • [ HGEMM ] Half-Precision GEMM Roadmap
  • [ GEMM ] HGEMM noTrans case
5 participants