
[ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 #2578

Merged: 8 commits into nnstreamer:main on May 22, 2024

Conversation

skykongkong8
Member

@skykongkong8 skykongkong8 commented May 10, 2024

Unit test output using a Galaxy S23 with the local commit

Latency

mean latency with TC = 100

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 | cblas fp32 |
| --- | --- | --- | --- |
| 1024 | 23 ms | 30 ms | 32 ms |
| 768 | 9 ms | 12.8 ms | 13.6 ms |
| 256x1440x256 | 2054 mcrs | 2664 mcrs | 2701 mcrs |
| 256x256x1440 | 2359 mcrs | 2965 mcrs | 3104 mcrs |

MSE w.r.t. sgemm

| dim | KERNEL_8x16_ACC16 | KERNEL_8x16_ACC8 |
| --- | --- | --- |
| 1024 | 0.00608169 | 0.00226737 |
| 768 | 0.00310214 | 0.0017091 |
| 256x1440x256 | 0.0149112 | 0.00518965 |
| 256x256x1440 | 0.00119428 | 0.000306849 |
  • Overall, this shows roughly a 150% speed-up of the f16-f32 kernels w.r.t. cblas fp32.
  • Considering the doubled vector length when moving from f32 to f16, plus the partial accumulation, the result above sounds reasonable.
  • However, this code trades a small amount of accuracy for that speed-up. It should be checked once more against model output.

+ Trivial refactor:

  • unified macro-style GEMM kernels
  • adaptive K-direction loops (a rough sketch of the idea follows below)
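
For readers unfamiliar with the adaptive K-direction loop mentioned above, here is a minimal, hedged sketch of the dispatch idea: consume K with the deepest available partial-sum depth first, then fall back to shallower ones, so K no longer needs to be divisible by a single fixed depth. The function name, the ACC4 depth, and the scalar remainder are illustrative assumptions, not the actual nntrainer kernels.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for a macro-generated partial-sum kernel body:
// accumulate `acc` K-steps in fp16, then fold the result into fp32.
static void kernel_8x16_accN(std::size_t acc, std::size_t k0) {
  std::printf("accumulate k in [%zu, %zu) in fp16, fold into fp32\n", k0, k0 + acc);
}

// Adaptive K-direction loop: no divisibility constraint on K remains,
// because leftover iterations fall through to shallower depths.
static void run_tile_K(std::size_t K) {
  std::size_t k = 0;
  for (; k + 16 <= K; k += 16) kernel_8x16_accN(16, k); // KERNEL_8x16_ACC16
  for (; k + 8 <= K; k += 8)   kernel_8x16_accN(8, k);  // KERNEL_8x16_ACC8
  for (; k + 4 <= K; k += 4)   kernel_8x16_accN(4, k);  // hypothetical ACC4
  for (; k < K; ++k)           kernel_8x16_accN(1, k);  // scalar remainder
}

int main() { run_tile_K(23); } // 23 = 16 + 4 + 3 x 1
```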

@taos-ci

taos-ci commented May 10, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2578. Please follow the 1 commit / 1 PR (one commit per PR) policy to get comments from reviewers quickly. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. In order to monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8
Member Author

This PR addresses the issue raised in #2488.

@taos-ci

taos-ci commented May 10, 2024

:octocat: cibot: @skykongkong8, a build check could not be completed because one of the checkers did not finish. In order to find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2578-202405100912040.87964606285095-8c6c01e8d19abf1cdb80840c610af30669339647/.

@skykongkong8 skykongkong8 changed the title [ Wait for #2577 ] [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor [ Wait for #2577 ] [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 May 10, 2024

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon
Collaborator

What do ACC16 and ACC8 mean?

@skykongkong8
Member Author

skykongkong8 commented May 10, 2024

What do ACC16 and ACC8 mean?

They refer to the number of K-direction iterations of the tile in the current GEMM kernel.
Suppose GEMM(M, K, N, ...).
Using an a x b GEMM kernel with ACC'c' will therefore compute a * b * c values in fp16, and accumulate into f32 M * K * N / (a * b * c) times.
That's why I wrote partial sum up to '2048 digits': with 8x16 ACC16, 8 * 16 * 16 = 2048.

In a nutshell (a scalar sketch follows below):

  • 8x16 kernel : tile size of 8x16 (related to M and N)
  • ACC16 / ACC8 : iterate the current tile along the K-direction 16 / 8 times before folding the fp16 partial sum into fp32
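
To make the arithmetic above concrete, here is a minimal scalar sketch of the partial-accumulation scheme, assuming a compiler that supports the `_Float16` extension (e.g. GCC/Clang on AArch64); the real kernel does the same per 8x16 tile with half-precision NEON vectors rather than scalars.

```cpp
#include <cstddef>

using fp16 = _Float16; // compiler extension; an assumption for this sketch only

// Sum ACC fp16 products at a time, then fold each partial sum into an fp32
// accumulator. For the 8x16 ACC16 kernel, one fp32 fold therefore covers
// 8 * 16 * 16 = 2048 fp16 products, and the fold happens M * K * N / 2048 times.
void gemm_partial_acc(std::size_t M, std::size_t N, std::size_t K, std::size_t ACC,
                      const fp16 *A, const fp16 *B, float *C) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float c32 = 0.f;                              // fp32 accumulator ("C32")
      for (std::size_t k0 = 0; k0 < K; k0 += ACC) {
        fp16 c16 = 0;                               // fp16 partial sum, reset every ACC steps
        std::size_t kend = (k0 + ACC < K) ? k0 + ACC : K;
        for (std::size_t k = k0; k < kend; ++k)
          c16 += A[m * K + k] * B[k * N + n];       // multiply-accumulate in fp16
        c32 += static_cast<float>(c16);             // widen and accumulate in fp32
      }
      C[m * N + n] = c32;
    }
  }
}
```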

```cpp
const int MOD = 10;

GEN_TEST_INPUT(A, ((i * (batch * height * channel) + j * (batch * height) +
                    k * (width) + l + 1) %
```
Contributor


Is the parenthesis around (width) just for readability?

Member Author


Correct

Contributor

@baek2sm baek2sm left a comment


LGTM

Member

@SeoHyungjun SeoHyungjun left a comment


LGTM

```diff
  hgemm_noTrans_4x8(M, N, K, A, K, B, N, C32, N, alpha, beta);
- } else if (N % 8 == 0) {
+ } else if ((K & 0x7) == 0 && (N & 0x7) == 0) {
```
Member


In terms of readability, wouldn't it be better to put K after N?
The order of the function parameters and the other conditions is also M, N, K.

Member Author


good point
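
As a small aside on the dispatch condition shown in this thread: for power-of-two divisors the bitmask test is equivalent to the modulo test, and keeping the operands in M, N, K order keeps the condition consistent with the kernel signatures. The helper below is only an illustration, not code from this PR.

```cpp
#include <cassert>

// For any unsigned v, (v & 0x7) == 0 holds exactly when v % 8 == 0,
// because the low three bits of v encode the remainder modulo 8.
static bool divisible_by_8(unsigned v) {
  assert(((v % 8) == 0) == ((v & 0x7u) == 0));
  return (v & 0x7u) == 0;
}

// Illustrative dispatch shape with the operands ordered N, then K,
// matching the M, N, K parameter order of the hgemm functions:
// if (divisible_by_8(N) && divisible_by_8(K)) { /* pick the 8x... kernel */ }
```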

- Implement a 4x4 GEMM kernel that performs f16-f32 partial accumulation

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Now hgemm supports the 4x4 f16-f32 partial accumulation strategy

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- With macro-defined code, the function latency is expected to be optimized more easily by the compiler

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…16 kernel

- With more values computed in fp16 (in this case, 1024 -> 2048 per partial sum), I could observe a latency improvement at the cost of some accuracy loss. However, according to the current accuracy measurement criteria, it is still acceptable. Note that it is highly desirable to verify this with model output once more.
- With a variety of partial-sum kernels, we can adaptively apply internal macro kernels without being constrained to K-divisibility w.r.t. 4, 8, 16.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…8x8 kernel

- Apply a change similar to the one made in commit #52a3c734, but for the 8x8 kernel

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- To avoid the constraint that K must be divisible by 4 or 8, loop adaptively along the K direction.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- I found repeated matrix initialization before the fused multiply-add operations.
- With separate initialization code (a short sketch follows below), we gain:
	1. Cleaner code that is reusable for both the f16 and the f16-f32 kernels
	2. A minimized redundant init process for the f16 kernel: better latency with the SAME accuracy.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
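
A rough sketch of the separate-initialization idea described in the commit above; the function names are hypothetical, not the actual nntrainer macro kernels. The point is that the accumulator tile is zeroed once up front, so the fused multiply-add body carries no initialization and can be shared by the f16 and f16-f32 paths.

```cpp
#include <cstddef>
#include <cstring>

// Zero the accumulator tile once per tile, instead of re-initializing
// inside every multiply-add step.
static void init_acc(float *acc, std::size_t len) {
  std::memset(acc, 0, len * sizeof(float));
}

// Pure fused multiply-add body: free of init code, reusable by both the
// f16 and the f16-f32 kernel variants.
static void fma_acc(float *acc, const float *a, const float *b, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    acc[i] += a[i] * b[i];
}
```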
- Due to adaptive macro kernel usage, the previous comment is no longer needed.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 changed the title [ Wait for #2577 ] [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 [ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 May 20, 2024

@taos-ci taos-ci left a comment


@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon jijoongmoon merged commit e75f8ba into nnstreamer:main May 22, 2024
32 checks passed
@skykongkong8 skykongkong8 deleted the hgemm/f16acc/2048 branch June 18, 2024 02:01
Successfully merging this pull request may close these issues:

  • [ HGEMM ] Half-Precision GEMM Roadmap
  • [ GEMM ] HGEMM noTrans case
5 participants