[ hgemm ] Partial sum up to 2048 digits for more acceleration & trivial refactor @open sesame 05/10 10:47 #2578
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2578. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start the review process. If you are a new member of this project, please read the manuals in the documentation folder and on the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.
This PR includes the issue raised at #2488.
cibot: @skykongkong8, a builder checker could not be completed because one of the checkers did not complete. To find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2578-202405100912040.87964606285095-8c6c01e8d19abf1cdb80840c610af30669339647/.
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
What do ACC16 and ACC8 mean?
They denote the number of tile iterations of the current GEMM kernel. In a nutshell,
const int MOD = 10;

GEN_TEST_INPUT(A, ((i * (batch * height * channel) + j * (batch * height) +
                    k * (width) + l + 1) %
(width)
Is this parenthesis for readability?
Correct
LGTM
LGTM
nntrainer/tensor/hgemm/hgemm.cpp
Outdated
hgemm_noTrans_4x8(M, N, K, A, K, B, N, C32, N, alpha, beta);
} else if (N % 8 == 0) {
} else if ((K & 0x7) == 0 && (N & 0x7) == 0) {
For readability, wouldn't it be better to check N before K?
The function parameters and the other conditions also follow the order M, N, K.
good point
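The reordered condition suggested above could be sketched as follows. This is only an illustration: `is_multiple_of_8`, `n_and_k_divisible_by_8`, and `check_mask_matches_modulo` are hypothetical helper names, not functions from hgemm.cpp, and the sketch assumes unsigned dimensions, for which `(x & 0x7) == 0` is equivalent to `x % 8 == 0`.

```cpp
#include <cassert>

// Hypothetical helpers (not part of hgemm.cpp) illustrating the mask test
// from the diff, with N checked before K to follow the M, N, K order.

// True when x is a multiple of 8: its low three bits are all zero.
constexpr bool is_multiple_of_8(unsigned int x) { return (x & 0x7u) == 0u; }

// N first, then K, matching the parameter order of the GEMM functions.
constexpr bool n_and_k_divisible_by_8(unsigned int N, unsigned int K) {
  return is_multiple_of_8(N) && is_multiple_of_8(K);
}

// Sanity check: the mask test agrees with the modulo test on a small range.
inline void check_mask_matches_modulo() {
  for (unsigned int x = 0; x < 1024u; ++x) {
    assert(is_multiple_of_8(x) == (x % 8u == 0u));
  }
}
```

Keeping both checks in one named helper makes the operand order a single-place decision rather than something repeated in every branch of the dispatch.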
- Implement 4x4 GEMM kernel that works f16-f32 partial accumulation **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- Now Hgemm supports 4x4 f16-f32 partial accumulation strategy **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- With macro-defined code, the function latency is expected to be optimized by compiler more easily **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
…16 kernel - With more digits computed in fp16 (in this case 1024 -> 2048), I could observe a latency improvement at the cost of some accuracy loss. However, according to the current accuracy measurement criteria, it is still acceptable. Note that it is highly desirable to verify this once more with model output. - With a variety of partial sum kernels, we can adaptively apply internal macro kernels without being constrained to K-divisibility w.r.t. 4, 8, 16. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
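The latency/accuracy trade-off of f16-f32 partial accumulation can be illustrated with a small scalar sketch. It is not the NEON kernel from this PR: `round_to_fp16_precision` and `dot_partial_sum` are hypothetical names, fp16 is only emulated by rounding a float to 11 significant bits (exponent range, subnormals and overflow are ignored), and `acc_len` is an assumed parameter for how many products are summed in low precision before being flushed to an fp32 total.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rough emulation of fp16 rounding (assumption of this sketch): keep 11
// significant bits, i.e. 10 stored mantissa bits plus the implicit one.
// std::nearbyint uses the current rounding mode, round-to-nearest-even by
// default.
inline float round_to_fp16_precision(float x) {
  if (x == 0.0f || !std::isfinite(x)) return x;
  int exponent = 0;
  float mantissa = std::frexp(x, &exponent); // |mantissa| in [0.5, 1)
  mantissa = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
  return std::ldexp(mantissa, exponent);
}

// Dot product where every acc_len products are summed in the emulated fp16
// accumulator and the partial sum is then added to an fp32 total.
inline float dot_partial_sum(const std::vector<float> &a,
                             const std::vector<float> &b,
                             std::size_t acc_len) {
  assert(acc_len > 0 && a.size() == b.size());
  float total = 0.0f; // fp32 accumulator
  for (std::size_t k0 = 0; k0 < a.size(); k0 += acc_len) {
    const std::size_t k1 = std::min(k0 + acc_len, a.size());
    float acc = 0.0f; // emulated fp16 accumulator
    for (std::size_t k = k0; k < k1; ++k) {
      const float prod = round_to_fp16_precision(a[k] * b[k]);
      acc = round_to_fp16_precision(acc + prod);
    }
    total += acc; // flush the partial sum
  }
  return total;
}
```

A longer `acc_len` means fewer flushes to fp32 but more rounding error in the low-precision accumulator. With 4096 ones as both inputs, short partial sums stay exact, while accumulating all 4096 products in the emulated fp16 accumulator stalls at 2048, because 2049 is not representable with 11 significant bits.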
…8x8 kernel - Apply similar change made in commit#52a3c734 but in 8x8 kernel **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- To avoid the constraint of 4-8 divisibility w.r.t. K, loop adaptively in the K direction. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
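One way to picture an adaptive K-direction loop is to consume K in blocks of 16, then 8, then 4, and finish with a scalar tail, so that no divisibility of K is required. The sketch below is an assumption about the general shape only, not the kernel code of this PR; `KSplit`, `split_k`, `dot_block`, and `dot_adaptive_k` are hypothetical names, and plain float arithmetic stands in for the fp16 macro kernels.

```cpp
#include <cassert>
#include <vector>

// Loop bounds in the K direction (hypothetical helper type).
struct KSplit {
  unsigned int k16_end; // largest multiple of 16 not exceeding K
  unsigned int k8_end;  // largest multiple of 8 not exceeding K
  unsigned int k4_end;  // largest multiple of 4 not exceeding K
};

inline KSplit split_k(unsigned int K) {
  return {(K >> 4) << 4, (K >> 3) << 3, (K >> 2) << 2};
}

// Stand-in for a macro kernel that handles len consecutive K elements.
inline float dot_block(const float *a, const float *b, unsigned int len) {
  float sum = 0.0f;
  for (unsigned int i = 0; i < len; ++i) sum += a[i] * b[i];
  return sum;
}

// Walk K with the widest block that still fits, then fall back to narrower
// blocks and finally to single elements.
inline float dot_adaptive_k(const std::vector<float> &a,
                            const std::vector<float> &b) {
  assert(a.size() == b.size());
  const unsigned int K = static_cast<unsigned int>(a.size());
  const KSplit s = split_k(K);
  unsigned int k = 0;
  float sum = 0.0f;
  for (; k < s.k16_end; k += 16) sum += dot_block(a.data() + k, b.data() + k, 16);
  for (; k < s.k8_end; k += 8) sum += dot_block(a.data() + k, b.data() + k, 8);
  for (; k < s.k4_end; k += 4) sum += dot_block(a.data() + k, b.data() + k, 4);
  for (; k < K; ++k) sum += a[k] * b[k]; // scalar tail
  return sum;
}
```

For K = 29, for example, this runs one block of 16, one of 8, one of 4, and a single scalar step.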
- I found there was a repeated usage of matrix initialization before mul-add fused operations. - With separate initialization code, we can enjoy: 1. Cleaner code that is reusable for both f16 & f16-f32 kernel 2. Redundant init process is minimized for f16 kernel. Better latency with the SAME accuracy. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
8c6c01e to c8f6491
- Due to adaptive macro kernel usage, previous comment is no longer needed. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
c8f6491 to ee66e16
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
Results reported: latency; MSE w.r.t. sgemm (values not preserved)
+ Trivial refactor: