This repository has been archived by the owner on Dec 22, 2021. It is now read-only.
ARM NEON has pairwise-folding addition instructions where pairs of narrow (e.g. 8-bit) input lanes are added together and the sums are written (SADDLP) or accumulated (SADALP) into wider (e.g. 16-bit) integer lanes.
This is in addition to plain pairwise-folding additions with all operands of the same bit width, like ADDP.
An extreme case of such folding is the dot-product instructions (SDOT, see PR #127), where the folding addition is performed 4-fold. When one of the source operands has all lanes set to 1, this acts as a 4-fold addition of 8-bit values into 32-bit accumulators.
This combination of folding behavior and mixed bit widths makes it possible to maximize the number of scalar operations done per instruction.
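To make the lane semantics concrete, here is a minimal scalar sketch in portable C of what these instructions compute. It is a reference model, not NEON code: the function names are made up for illustration, and the lane count `n` is a parameter rather than a fixed register width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SADDLP-like: add adjacent pairs of signed 8-bit lanes, writing each
 * sum to one 16-bit lane. n input lanes produce n / 2 output lanes. */
void pairwise_add_long_s8(const int8_t *src, size_t n, int16_t *dst) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        /* The sum of two int8 values always fits in int16. */
        dst[i] = (int16_t)(src[2 * i] + src[2 * i + 1]);
    }
}

/* SADALP-like: same pairwise widening sum, but added to the existing
 * 16-bit accumulator lanes instead of overwriting them. The hardware
 * wraps on overflow; this sketch assumes the accumulators stay in range. */
void pairwise_add_accumulate_long_s8(const int8_t *src, size_t n,
                                     int16_t *acc) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        acc[i] = (int16_t)(acc[i] + src[2 * i] + src[2 * i + 1]);
    }
}

/* SDOT-like: each 32-bit accumulator lane receives the dot product of
 * four adjacent 8-bit lanes of a and b. With b set to all ones this
 * reduces to a 4-fold widening addition of a. */
void dot4_accumulate_s8(const int8_t *a, const int8_t *b, size_t n,
                        int32_t *acc) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n / 4; ++i) {
        int32_t sum = 0;
        for (size_t k = 0; k < 4; ++k) {
            sum += (int32_t)a[4 * i + k] * (int32_t)b[4 * i + k];
        }
        /* Unsigned arithmetic models the wrapping accumulate. */
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)sum);
    }
}
```

Each output lane consumes two (or four) input lanes, which is why one instruction can cover more scalar additions than a same-width lane-wise add.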
This pattern is widely used in integer arithmetic code. For example, matrix multiplication kernels using plain NEON without SDOT are based on the idea of multiplying 8-bit input values into 16-bit local products (see Issue #226), then pairwise-folding those 16-bit products into 32-bit accumulators:
https://github.com/google/ruy/blob/808ff748e0c7dc746a413fe45fa022d63e6253e8/ruy/kernel_arm64.cc#L1233
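The multiply-then-fold idea can be sketched in scalar C as follows. This is a simplified model of the pattern described above, not a transcription of the linked kernel; the function name and the flat array layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each pair of adjacent lanes: widen-multiply 8-bit inputs into
 * 16-bit products, then pairwise-fold the two products into one 32-bit
 * accumulator lane (the 16-bit to 32-bit form of a SADALP-like step). */
void mul_fold_accumulate_s8(const int8_t *a, const int8_t *b, size_t n,
                            int32_t *acc) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        /* An int8 * int8 product is at most 16384 in magnitude,
         * so it always fits in a 16-bit lane. */
        int16_t p0 = (int16_t)(a[2 * i] * b[2 * i]);
        int16_t p1 = (int16_t)(a[2 * i + 1] * b[2 * i + 1]);
        /* The fold widens to 32 bits before adding, so the pair sum
         * cannot overflow; this sketch assumes acc stays in range. */
        acc[i] += (int32_t)p0 + (int32_t)p1;
    }
}
```

The 16-bit intermediate keeps twice as many products per vector as a direct widening to 32 bits would, and the folding step restores headroom before accumulation.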