[ tensor ] Apply SIMD in matrix transpose #2603

Merged

Conversation

skykongkong8
Member

@skykongkong8 skykongkong8 commented May 23, 2024

PR for issue raised in #2582

The matrix transpose function in the latest NNTrainer (14.05.24) is implemented using for-loops.
Although the current implementation is useful for general (b,c,h,w)-Tensor transposes, it is somewhat naive for an (h,w)-matrix transpose.
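
For reference, a for-loop transpose of a row-major (h,w) matrix can be sketched as below. This is only an illustration of the baseline being discussed, not NNTrainer's actual code; the name `transpose_naive` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an NNTrainer API): transpose a row-major
// (height x width) matrix into a row-major (width x height) matrix.
template <typename T>
std::vector<T> transpose_naive(const std::vector<T> &src, std::size_t height,
                               std::size_t width) {
  assert(src.size() == height * width);
  std::vector<T> dst(src.size());
  for (std::size_t h = 0; h < height; ++h) {
    for (std::size_t w = 0; w < width; ++w) {
      // Reads walk the source contiguously; writes jump by `height`
      // elements, which is the cache-unfriendly part of this loop.
      dst[w * height + h] = src[h * width + w];
    }
  }
  return dst;
}
```

For example, calling `transpose_naive` on the 2x3 matrix `{1, 2, 3, 4, 5, 6}` yields the 3x2 matrix `{1, 4, 2, 5, 3, 6}`.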

Latency measurement

TC = 20, tested on a Galaxy S23 with frequently used dimensions

| dim | prev | neon |
| --- | --- | --- |
| 768x768 | 400 µs | 121 µs |
| 1440x1440 | 2 ms | 0.44 ms |
| 1920x1560 | 4.3 ~ 1.6 ms | 1.8 ~ 0.8 ms |
| 1560x2048 | 4.18 ms | 0.618 ms |
| 512x2048 | 1.31 ms | 0.18 ms |

  • Overall, 200% ~ 500% acceleration. (Note that this method is effective for sufficiently large matrices.)
  • Merging this PR will immediately impact the BiQGEMM calculation (transpose("0:2:1"))
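
As a rough illustration of what a height-width swap such as transpose("0:2:1") amounts to, the sketch below transposes every (h,w) plane of a row-major (batch, channel, height, width) buffer independently. The helper name `transpose_hw_planes` and the flattening of batch and channel into a single `planes` count are assumptions made for this example, not NNTrainer's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: swap the last two axes of a row-major buffer laid out
// as (planes, height, width), where planes = batch * channel.
// The result is laid out as (planes, width, height).
template <typename T>
std::vector<T> transpose_hw_planes(const std::vector<T> &src,
                                   std::size_t planes, std::size_t height,
                                   std::size_t width) {
  const std::size_t plane_size = height * width;
  assert(src.size() == planes * plane_size);
  std::vector<T> dst(src.size());
  for (std::size_t p = 0; p < planes; ++p) {
    const T *in = src.data() + p * plane_size;
    T *out = dst.data() + p * plane_size;
    // Each plane is an ordinary (h,w) matrix transpose, which is the case
    // a dedicated matrix kernel can take over.
    for (std::size_t h = 0; h < height; ++h)
      for (std::size_t w = 0; w < width; ++w)
        out[w * height + h] = in[h * width + w];
  }
  return dst;
}
```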

…matrix transpose

- Previously, matrix transpose relied on a naive for-loop implementation.
- Using SIMD instructions, there is room to optimize latency.
- Note that the current implementation only supports half-precision matrices.
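
A minimal sketch of the tiling idea behind a SIMD transpose follows. It is portable scalar C++, not NEON intrinsics and not the kernel added in this PR: it only shows the 4x4 access pattern such a kernel would work on. The alias `half_bits` (raw 16-bit storage standing in for fp16), the function names, and the 4x4 tile size are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: raw 16-bit storage used as a stand-in for half precision.
using half_bits = std::uint16_t;

// Transpose one 4x4 tile. A SIMD kernel would load the four rows into
// vector registers and shuffle them instead of copying element by element.
inline void transpose_tile_4x4(const half_bits *src, std::size_t src_stride,
                               half_bits *dst, std::size_t dst_stride) {
  half_bits tile[4][4];
  for (std::size_t r = 0; r < 4; ++r)
    for (std::size_t c = 0; c < 4; ++c)
      tile[c][r] = src[r * src_stride + c];
  for (std::size_t r = 0; r < 4; ++r)
    for (std::size_t c = 0; c < 4; ++c)
      dst[r * dst_stride + c] = tile[r][c];
}

// Transpose a row-major (height x width) matrix tile by tile.
// In this sketch both dimensions must be multiples of 4.
inline std::vector<half_bits> transpose_tiled(const std::vector<half_bits> &src,
                                              std::size_t height,
                                              std::size_t width) {
  assert(src.size() == height * width);
  assert(height % 4 == 0 && width % 4 == 0);
  std::vector<half_bits> dst(src.size());
  for (std::size_t h = 0; h < height; h += 4)
    for (std::size_t w = 0; w < width; w += 4)
      // Source tile at (h, w) lands at (w, h) in the transposed matrix.
      transpose_tile_4x4(&src[h * width + w], width, &dst[w * height + h],
                         height);
  return dst;
}
```

Working tile by tile keeps both the reads and the writes within a few cache lines at a time, which is one plausible reason the gain shows up mainly on larger matrices.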

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Add a new function "transpos_matrix" to use the newly implemented matrix transpose code

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- For a height-width transpose, we can use the SIMD-accelerated code.
- Use the SIMD version if possible; otherwise fall back.
- Through this commit, the following are expected to be accelerated, or can be accelerated with ease in the near future:
  - "0:2:1" transpose
  - BiQHGEMM
  - HGEMM
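
The "SIMD if possible, otherwise fallback" decision could look roughly like the sketch below. The enum, the function name, and the exact conditions are assumptions made for illustration; the only conditions taken from this PR are that the accelerated path targets height-width transposes of half-precision data. `__ARM_NEON` is the standard ACLE feature macro, used here only to pick a default.

```cpp
// Hypothetical names throughout; this is not NNTrainer's dispatch code.
enum class TransposePath { Simd, Fallback };

// Default availability flag derived from the ACLE feature macro.
constexpr bool kSimdAvailable =
#if defined(__ARM_NEON)
  true;
#else
  false;
#endif

// Pick the accelerated path only when every precondition holds;
// anything else goes through the general for-loop transpose.
constexpr TransposePath
choose_transpose_path(bool is_height_width_swap, bool is_half_precision,
                      bool simd_available = kSimdAvailable) {
  return (is_height_width_swap && is_half_precision && simd_available)
           ? TransposePath::Simd
           : TransposePath::Fallback;
}
```

Keeping the decision in one small function makes it easy to extend later, for example if other data types gain an accelerated kernel.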

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@taos-ci

taos-ci commented May 23, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2603. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process can start. If you are a new member of this project, please read the manuals in the documentation folder and on the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

@taos-ci

taos-ci commented May 23, 2024

:octocat: cibot: @skykongkong8, nntrainer/tensor/matrix_transpose_neon/matrix_transpose_kernels_neon.h does not include Doxygen tags such as @file @brief @author @bug. You must include the Doxygen tags in the source code. Please refer to a Doxygen manual at http://github.com/nnstreamer/TAOS-CI/blob/main/ci/doc/doxygen-documentation.md

- Previously, there was a defect when transposing a matrix whose column length is not divisible by 4.
- Fix the bug and refactor the calling interface: move the transpose fallback when NEON is supported.
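
One common way to handle dimensions that are not divisible by the tile size is to run the tiled path on the largest divisible region and finish the leftover rows and columns with plain loops. The sketch below shows that pattern in portable C++. It is not the fix applied in this commit, whose details are not described here, and the name `transpose_with_tail` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: transpose a row-major (height x width) matrix,
// processing full 4x4 tiles first and the remainder element by element.
template <typename T>
std::vector<T> transpose_with_tail(const std::vector<T> &src,
                                   std::size_t height, std::size_t width) {
  assert(src.size() == height * width);
  std::vector<T> dst(src.size());
  const std::size_t h_main = height - height % 4; // rows covered by tiles
  const std::size_t w_main = width - width % 4;   // cols covered by tiles

  // Full 4x4 tiles: the region a SIMD kernel would handle.
  for (std::size_t h = 0; h < h_main; h += 4)
    for (std::size_t w = 0; w < w_main; w += 4)
      for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
          dst[(w + j) * height + (h + i)] = src[(h + i) * width + (w + j)];

  // Leftover columns (width % 4 of them), for every row.
  for (std::size_t h = 0; h < height; ++h)
    for (std::size_t w = w_main; w < width; ++w)
      dst[w * height + h] = src[h * width + w];

  // Leftover rows (height % 4 of them), for the tiled columns only.
  for (std::size_t h = h_main; h < height; ++h)
    for (std::size_t w = 0; w < w_main; ++w)
      dst[w * height + h] = src[h * width + w];

  return dst;
}
```

The three regions are disjoint and together cover the whole matrix, so no element is skipped or written twice.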

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the pr/transpose/without_wait_for branch from 433ec98 to 4efa98b Compare May 23, 2024 04:12
@taos-ci

taos-ci commented May 23, 2024

:octocat: cibot: @skykongkong8, nntrainer/tensor/matrix_transpose_neon/matrix_transpose_kernels_neon.h does not include Doxygen tags such as @file @brief @author @bug. You must include the Doxygen tags in the source code. Please refer to a Doxygen manual at http://github.com/nnstreamer/TAOS-CI/blob/main/ci/doc/doxygen-documentation.md

@skykongkong8 skykongkong8 linked an issue May 23, 2024 that may be closed by this pull request
@skykongkong8 skykongkong8 force-pushed the pr/transpose/without_wait_for branch 2 times, most recently from 41f6812 to bc598c4 Compare May 23, 2024 04:40

@taos-ci taos-ci left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@skykongkong8 skykongkong8 force-pushed the pr/transpose/without_wait_for branch from bc598c4 to c42beca Compare May 23, 2024 07:36

@taos-ci taos-ci left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

Member

@SeoHyungjun SeoHyungjun left a comment

LGTM

nntrainer/tensor/blas_interface.cpp Outdated Show resolved Hide resolved
@skykongkong8 skykongkong8 force-pushed the pr/transpose/without_wait_for branch from c42beca to c7daba7 Compare June 3, 2024 06:07
- add doxygen tags to avoid CI failure
- trivial formatting

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the pr/transpose/without_wait_for branch from c7daba7 to 845c7d8 Compare June 3, 2024 06:08

@taos-ci taos-ci left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

Collaborator

@jijoongmoon jijoongmoon left a comment

LGTM

@jijoongmoon jijoongmoon merged commit b838d79 into nnstreamer:main Jun 4, 2024
28 of 29 checks passed
@skykongkong8 skykongkong8 deleted the pr/transpose/without_wait_for branch June 18, 2024 02:01
Development

Successfully merging this pull request may close these issues.

[ Tensor ] Accelerate fp16 matrix transpose with SIMD
5 participants