[ tensor ] Apply SIMD in matrix transpose #2603

skykongkong8 · 2024-05-23T04:11:07Z

PR for issue raised in #2582

The matrix transpose function in the latest NNTrainer (14.05.24) is implemented with plain for-loops.
Although the current implementation works for general (b,c,h,w)-Tensor transposes, it is a somewhat naive implementation for the (h,w)-matrix transpose.
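For reference, a for-loop transpose of a row-major (h,w) matrix can be sketched as below. This is only an illustration of the general technique, not the NNTrainer code; `transpose_naive` is a hypothetical name and the `std::vector` interface is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical baseline (not an NNTrainer function): transpose a row-major
// rows x cols matrix with two nested loops.
template <typename T>
std::vector<T> transpose_naive(const std::vector<T> &src, std::size_t rows,
                               std::size_t cols) {
  assert(src.size() == rows * cols);
  std::vector<T> dst(src.size());
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      // Reads walk src contiguously; writes to dst are strided by `rows`.
      dst[j * rows + i] = src[i * cols + j];
    }
  }
  return dst;
}
```

Each element is moved one at a time, and the strided writes are what a tiled SIMD kernel tries to avoid.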

Latency measurement

TC = 20, tested on a Galaxy S23 with frequently used dimensions

| dim | prev | neon |
|---|---|---|
| 768x768 | 400 µs | 121 µs |
| 1440x1440 | 2 ms | 0.44 ms |
| 1920x1560 | 4.3 ~ 1.6 ms | 1.8 ~ 0.8 ms |
| 1560x2048 | 4.18 ms | 0.618 ms |
| 512x2048 | 1.31 ms | 0.18 ms |

Overall, 200% ~ 500% acceleration. (Note that this method is effective for sufficiently large matrices.)
Merging this PR will immediately affect BiQGEMM calculation (transpose("0:2:1")).
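The latency figures in the measurement above can be turned into per-dimension speedup factors with a small helper. This is just arithmetic on the reported numbers; `speedup` is a hypothetical name introduced for illustration.

```cpp
#include <cassert>

// Hypothetical helper: speedup factor = previous latency / new latency.
// Both arguments must use the same unit (for example both in ms).
inline double speedup(double prev_latency, double new_latency) {
  assert(new_latency > 0.0);
  return prev_latency / new_latency;
}
```

For the 768x768 row, `speedup(400.0, 121.0)` is about 3.3, and for the 512x2048 row, `speedup(1.31, 0.18)` is about 7.3.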

…matrix transpose

- Previously, matrix transpose relied on a naive for-loop implementation.
- Using SIMD instructions, there is room to optimize latency.
- Note that the current implementation only supports half-precision matrices.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- Add new function "transpose_matrix" to use the newly implemented matrix transpose code

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- For a height-width transpose, the SIMD-accelerated code can be used.
- Use the SIMD version if possible, otherwise fall back.
- Through this commit, the following are expected to be accelerated, or can be accelerated with ease in the near future:
  - "0:2:1" transpose
  - BiQHGEMM
  - HGEMM

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
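The "use SIMD if possible, otherwise fall back" rule described in this commit could look roughly like the sketch below. All names here (`TransposePath`, `choose_transpose_path`, and the two boolean flags) are hypothetical and not taken from the NNTrainer sources; the half-precision condition reflects the note that the current kernels only support half-precision matrices.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: which transpose implementation to run.
enum class TransposePath { Simd, Fallback };

// Hypothetical dispatch rule: take the SIMD path only for a height-width
// swap ("0:2:1") on half-precision data when SIMD support is available;
// every other case goes to the generic for-loop fallback.
inline TransposePath choose_transpose_path(const std::string &direction,
                                           bool simd_available,
                                           bool is_half_precision) {
  assert(!direction.empty());
  const bool is_height_width_swap = (direction == "0:2:1");
  if (is_height_width_swap && simd_available && is_half_precision) {
    return TransposePath::Simd;
  }
  return TransposePath::Fallback;
}
```

Keeping the decision in one small function means the generic path stays the default and the accelerated path is opt-in.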

taos-ci · 2024-05-23T04:11:10Z

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2603. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start the review process. If you are a new member of this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

taos-ci · 2024-05-23T04:11:14Z

cibot: @skykongkong8, nntrainer/tensor/matrix_transpose_neon/matrix_transpose_kernels_neon.h does not include Doxygen tags such as @file @brief @author @bug. You must include the Doxygen tags in the source code. Please refer to a Doxygen manual at http://github.com/nnstreamer/TAOS-CI/blob/main/ci/doc/doxygen-documentation.md

- Previously, there was a code defect when transposing a matrix whose column length is not divisible by 4.
- Fix the bug and refactor the interface that uses it: move the transpose fallback when NEON is supported.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
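One common way to handle lengths that are not divisible by 4 in a tiled transpose is to process full 4x4 tiles first and finish the leftover rows and columns with scalar code. The sketch below is portable C++ that only mimics a 4-lane kernel. It is not the NEON code from this PR, and `transpose_tile_4x4` and `transpose_blocked` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4-lane SIMD kernel: transpose one 4x4 tile.
// ld_src / ld_dst are the row strides of the source and destination.
template <typename T>
void transpose_tile_4x4(const T *src, std::size_t ld_src, T *dst,
                        std::size_t ld_dst) {
  for (std::size_t i = 0; i < 4; ++i)
    for (std::size_t j = 0; j < 4; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

// Hypothetical tiled transpose of a row-major rows x cols matrix that also
// works when rows or cols is not a multiple of 4.
template <typename T>
std::vector<T> transpose_blocked(const std::vector<T> &src, std::size_t rows,
                                 std::size_t cols) {
  assert(src.size() == rows * cols);
  std::vector<T> dst(src.size());
  constexpr std::size_t B = 4;
  const std::size_t rows_main = rows - rows % B; // rows covered by tiles
  const std::size_t cols_main = cols - cols % B; // cols covered by tiles

  // Full 4x4 tiles.
  for (std::size_t i = 0; i < rows_main; i += B)
    for (std::size_t j = 0; j < cols_main; j += B)
      transpose_tile_4x4(src.data() + i * cols + j, cols,
                         dst.data() + j * rows + i, rows);

  // Leftover columns of the tiled rows.
  for (std::size_t i = 0; i < rows_main; ++i)
    for (std::size_t j = cols_main; j < cols; ++j)
      dst[j * rows + i] = src[i * cols + j];

  // Leftover rows, across all columns.
  for (std::size_t i = rows_main; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      dst[j * rows + i] = src[i * cols + j];

  return dst;
}
```

A 5x6 input exercises both tails: one 4x4 tile is handled by the tile routine, and the remaining two columns and one row go through the scalar loops.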

taos-ci

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

nntrainer/tensor/matrix_transpose_neon/matrix_transpose_kernels_neon.h

SeoHyungjun

LGTM

nntrainer/tensor/blas_interface.cpp

- Add doxygen tags to avoid CI failure
- Trivial formatting

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

jijoongmoon

LGTM

skykongkong8 added 3 commits May 23, 2024 11:11

[ blas ] Add transpose_matrix function

5d75e5a

skykongkong8 requested review from myungjoo, jijoongmoon, again4you, jaeyun-jung, leemgs, wooksong, helloahn, kparichay, gichan-jang, anyj0527, zhoonit, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, djeong20, EunjuYang and a team as code owners May 23, 2024 04:11

github-actions bot added the Need Review label May 23, 2024

skykongkong8 force-pushed the pr/transpose/without_wait_for branch from 433ec98 to 4efa98b Compare May 23, 2024 04:12

skykongkong8 linked an issue May 23, 2024 that may be closed by this pull request

[ Tensor ] Accelerate fp16 matrix transpose with SIMD #2582

Closed

skykongkong8 force-pushed the pr/transpose/without_wait_for branch 2 times, most recently from 41f6812 to bc598c4 Compare May 23, 2024 04:40

taos-ci approved these changes May 23, 2024

myungjoo reviewed May 23, 2024

nntrainer/tensor/matrix_transpose_neon/matrix_transpose_kernels_neon.h (outdated, resolved)

skykongkong8 force-pushed the pr/transpose/without_wait_for branch from bc598c4 to c42beca Compare May 23, 2024 07:36

taos-ci approved these changes May 23, 2024

myungjoo approved these changes May 27, 2024

SeoHyungjun approved these changes May 30, 2024

nntrainer/tensor/blas_interface.cpp (outdated, resolved)

github-actions bot added PR/READY2MERGE and removed Need Review labels May 30, 2024

skykongkong8 force-pushed the pr/transpose/without_wait_for branch from c42beca to c7daba7 Compare June 3, 2024 06:07

[ trivial ] Add doxygen tags in matrix transpose functions

845c7d8

skykongkong8 force-pushed the pr/transpose/without_wait_for branch from c7daba7 to 845c7d8 Compare June 3, 2024 06:08

taos-ci approved these changes Jun 3, 2024

jijoongmoon approved these changes Jun 4, 2024

jijoongmoon merged commit b838d79 into nnstreamer:main Jun 4, 2024
28 of 29 checks passed

skykongkong8 deleted the pr/transpose/without_wait_for branch June 18, 2024 02:01
