-
I have not noticed this before. Is there any chance to eliminate this requirement?
-
I noticed while reading the example that you specialize this code only for the Turing and Ampere architectures. I'm interested in using this on a Volta GPU. Should I have any particular concerns with that? I know that small issues in the compiler can often reduce performance greatly.
-
In the 2.5 release, we added new examples of fusing two back-to-back fprop kernels: https://github.com/NVIDIA/cutlass/tree/master/examples/13_two_tensor_op_fusion .
In DL networks, we have observed a set of problems that require convolution kernels with small channel counts. For example, several convolution layers of ResNet50 have channel counts between 64 and 256. These kernels are generally memory limited: the time spent in the prologue and epilogue dominates compared with the short main loop. In particular, when these kernels run back to back, the memory traffic introduced by storing and loading the inter-layer activation is a major performance bottleneck.
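As a rough illustration of why that inter-layer traffic matters, the back-of-the-envelope estimate below computes the global-memory bytes spent storing and reloading one intermediate activation. The tensor shape (N=32, H=W=56, C=64, fp16) is a hypothetical ResNet50-like layer chosen for this sketch, not a figure taken from the example; it is only meant to show the order of magnitude.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical ResNet50-like activation shape (NHWC), fp16 elements.
  const std::int64_t N = 32, H = 56, W = 56, C = 64;
  const std::int64_t bytes_per_element = 2;

  const std::int64_t activation_bytes = N * H * W * C * bytes_per_element;

  // Unfused execution: layer 1 writes the activation to global memory,
  // layer 2 reads it back. Fusion eliminates both transfers.
  const std::int64_t avoidable_traffic = 2 * activation_bytes;

  std::cout << "Intermediate activation: "
            << activation_bytes / double(1 << 20) << " MiB\n"
            << "Store + reload traffic eliminated by fusion: "
            << avoidable_traffic / double(1 << 20) << " MiB\n";
  return 0;
}
```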
The benefit of fusing these convolution kernels together lies in two aspects: the intermediate activation no longer has to be stored to and loaded back from global memory, and the prologue/epilogue overhead is amortized because the two main loops run back to back inside a single kernel instead of being paid twice.
The example uses a block-resident approach: the output activation generated by each thread block remains inside that thread block (either in shared memory or in the register file) and is used as the input activation for the next layer. Similar to GEMM-GEMM fusion, block-resident CONV-CONV fusion has certain requirements. Specifically, the 2nd convolution is required to use 1x1 filters so that its input activations do not depend on data produced by other thread blocks. The sketch below illustrates why.
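This is a minimal illustrative sketch, not CUTLASS code and not the original example: it checks, for one spatial dimension with stride 1 and "same" padding, whether the input footprint of a thread block's output tile stays inside that tile. Only a 1x1 filter satisfies this, which is why the 2nd convolution can read the 1st convolution's output without touching other thread blocks' data.

```cpp
#include <iostream>

struct Range { int lo, hi; };  // half-open interval [lo, hi) of output/input rows

// Input rows needed to compute output rows [tile.lo, tile.hi) for a 1-D
// convolution with filter extent R, stride 1 and padding R/2.
Range input_footprint(Range tile, int R) {
  int pad = R / 2;
  return { tile.lo - pad, tile.hi - pad + (R - 1) };
}

// True if the 2nd conv's inputs for this tile are all produced by the same
// thread block, i.e. the fusion can stay block-resident.
bool block_resident(Range tile, int R) {
  Range need = input_footprint(tile, R);
  return need.lo >= tile.lo && need.hi <= tile.hi;
}

int main() {
  Range tile{32, 64};  // one thread block's output tile (rows 32..63)
  std::cout << std::boolalpha
            << "1x1 filter block-resident: " << block_resident(tile, 1) << "\n"   // true
            << "3x3 filter block-resident: " << block_resident(tile, 3) << "\n";  // false
  return 0;
}
```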