-
Hi! I am wondering what the best approach is to perform a reduction that computes the two smallest values (with their indices) for both the rows and the columns of a GEMM result. Can we fuse such a reduction with the GEMM, so that no cost is paid to transfer D?
Answered by
S-o-T
Apr 25, 2023
Replies: 1 comment 4 replies
-
Yes, you can fuse such computations into the epilogue of our kernels. In this instance, you can pass in pointers to two values in gmem and then perform atomic min and max operations on them.
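To make the atomic-min idea concrete when an index has to travel with the value, here is a minimal host-side C++ sketch. It is not CUTLASS code. It assumes the value is fp32 and never NaN, and that the index fits in 32 bits. The helper names (`ordered_key`, `pack`, `atomic_min_u64`) are made up for illustration. The float is mapped to an unsigned key that preserves ordering, then packed above the index in one 64-bit word, so a single 64-bit atomic min yields both the minimum and its position. On the device the compare-exchange loop would presumably be replaced by CUDA's `atomicMin` on `unsigned long long`, which as far as I know requires compute capability 5.0 or higher.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Map a float to a uint32 whose unsigned order matches the float order.
// Assumption: inputs are never NaN.
inline std::uint32_t ordered_key(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    // Negative values: flip every bit. Non-negative values: set the sign bit.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Inverse of ordered_key.
inline float key_to_float(std::uint32_t key) {
    std::uint32_t bits = (key & 0x80000000u) ? (key & 0x7FFFFFFFu) : ~key;
    float v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// Value key in the high 32 bits, index in the low 32 bits.
// Comparing packed words compares values first, then indices on ties.
inline std::uint64_t pack(float v, std::uint32_t index) {
    return (static_cast<std::uint64_t>(ordered_key(v)) << 32) | index;
}

inline float unpack_value(std::uint64_t packed) {
    return key_to_float(static_cast<std::uint32_t>(packed >> 32));
}

inline std::uint32_t unpack_index(std::uint64_t packed) {
    return static_cast<std::uint32_t>(packed & 0xFFFFFFFFu);
}

// Host stand-in for a 64-bit atomic min, built from compare-exchange.
inline void atomic_min_u64(std::atomic<std::uint64_t>& target,
                           std::uint64_t candidate) {
    std::uint64_t seen = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `seen`, so the loop
    // re-checks whether the candidate is still smaller.
    while (candidate < seen &&
           !target.compare_exchange_weak(seen, candidate,
                                         std::memory_order_relaxed)) {
    }
}
```

Note that a single atomic min only tracks the smallest element. Getting the second-smallest as well needs extra logic, for example keeping a per-thread pair and merging pairs before the final write.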
4 replies
So I figured out how to modify epilogue_with_reduction.h (to use element indices) and gemm_with_fused_epilogue.h (to skip the transfer of the GEMM result to gmem) in order to achieve such a reduction (but only in a single direction; reducing over columns still requires computing the transposed problem). However, the examples of their usage (using fp16) require cc >= 7.5, while I am still interested in fp32 at cc 6.1. Any advice on which direction to look in?
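For checking a fused kernel against a known-good answer, a plain host-side reference can help. The sketch below is not CUTLASS code and says nothing about how the epilogue should be structured. It assumes D is a dense row-major fp32 matrix, and the names `Top2` and `top2_rows_and_cols` are hypothetical. It keeps a two-smallest accumulator per row and per column and fills both in one pass over D, so the reference itself needs no transposed problem. The `merge` method shows how two partial accumulators could be combined, which is the operation a per-thread or per-tile reduction would need.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Accumulator for the two smallest values seen so far, with their indices.
// Empty slots hold +infinity and index -1.
struct Top2 {
    float v0 = std::numeric_limits<float>::infinity();
    float v1 = std::numeric_limits<float>::infinity();
    int i0 = -1;
    int i1 = -1;

    // Insert one candidate; strict '<' keeps the earlier element on ties.
    void push(float v, int i) {
        if (v < v0) {
            v1 = v0; i1 = i0;
            v0 = v;  i0 = i;
        } else if (v < v1) {
            v1 = v;  i1 = i;
        }
    }

    // Combine with another partial result (empty slots are ignored
    // because +infinity never compares smaller).
    void merge(const Top2& other) {
        push(other.v0, other.i0);
        push(other.v1, other.i1);
    }
};

// Reference: two smallest entries of every row and every column of a
// row-major m x n matrix d, computed in a single pass.
// rows[r] stores column indices; cols[c] stores row indices.
inline void top2_rows_and_cols(const std::vector<float>& d, int m, int n,
                               std::vector<Top2>& rows,
                               std::vector<Top2>& cols) {
    rows.assign(static_cast<std::size_t>(m), Top2{});
    cols.assign(static_cast<std::size_t>(n), Top2{});
    for (int r = 0; r < m; ++r) {
        for (int c = 0; c < n; ++c) {
            const float v = d[static_cast<std::size_t>(r) * n + c];
            rows[r].push(v, c);
            cols[c].push(v, r);
        }
    }
}
```

Comparing the kernel's output against `rows` and `cols` on small random problems is one way to validate both reduction directions, including tie handling.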