-
I have not noticed this before. Is there any chance to eliminate this requirement?
-
I noticed while reading the example that you specialize this code only for the Turing and Ampere architectures. I'm interested in using this on a Volta GPU. Should I have any particular concerns with that? I know that small issues in the compiler can often reduce performance greatly.
-
In the 2.5 release, we added new examples of fusing two back-to-back fprop kernels: https://github.com/NVIDIA/cutlass/tree/master/examples/13_two_tensor_op_fusion .
In DL networks, we have observed a set of problems that require convolution kernels with small channel counts. For example, several convolution layers of ResNet50 have channel counts between 64 and 256. These kernels are generally memory limited: the time spent in the prologue and epilogue dominates compared with the short main loop. In particular, when these kernels run back to back, the memory traffic introduced by storing and loading the inter-layer activation is a major performance bottleneck.
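As a rough illustration of why that inter-layer traffic matters, the back-of-the-envelope estimate below computes the global-memory bytes spent storing and reloading one intermediate activation. The tensor shape (N=32, H=W=56, C=64, fp16) is a hypothetical ResNet50-like layer chosen for this sketch, not a figure taken from the example; it is only meant to show the order of magnitude.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical ResNet50-like activation shape (NHWC), fp16 elements.
  const std::int64_t N = 32, H = 56, W = 56, C = 64;
  const std::int64_t bytes_per_element = 2;

  const std::int64_t activation_bytes = N * H * W * C * bytes_per_element;

  // Unfused execution: layer 1 writes the activation to global memory,
  // layer 2 reads it back. Fusion eliminates both transfers.
  const std::int64_t avoidable_traffic = 2 * activation_bytes;

  std::cout << "Intermediate activation: "
            << activation_bytes / double(1 << 20) << " MiB\n"
            << "Store + reload traffic eliminated by fusion: "
            << avoidable_traffic / double(1 << 20) << " MiB\n";
  return 0;
}
```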
The benefit of fusing these convolution kernels together lies in two aspects: the intermediate activation no longer has to be stored to and loaded back from global memory, and the prologue/epilogue overhead is amortized because the two main loops run back to back inside a single kernel instead of being paid twice.
The example uses a block-resident approach: the output activation generated by each thread block remains inside that thread block (either in shared memory or in the register file) and is used as the input activation for the next layer. Similar to GEMM-GEMM fusion, block-resident CONV-CONV fusion has certain requirements. Specifically, the 2nd convolution is required to use 1x1 filters so that its input activations do not depend on data produced by other thread blocks. The sketch below illustrates why.
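This is a minimal illustrative sketch, not CUTLASS code and not the original example: it checks, for one spatial dimension with stride 1 and "same" padding, whether the input footprint of a thread block's output tile stays inside that tile. Only a 1x1 filter satisfies this, which is why the 2nd convolution can read the 1st convolution's output without touching other thread blocks' data.

```cpp
#include <iostream>

struct Range { int lo, hi; };  // half-open interval [lo, hi) of output/input rows

// Input rows needed to compute output rows [tile.lo, tile.hi) for a 1-D
// convolution with filter extent R, stride 1 and padding R/2.
Range input_footprint(Range tile, int R) {
  int pad = R / 2;
  return { tile.lo - pad, tile.hi - pad + (R - 1) };
}

// True if the 2nd conv's inputs for this tile are all produced by the same
// thread block, i.e. the fusion can stay block-resident.
bool block_resident(Range tile, int R) {
  Range need = input_footprint(tile, R);
  return need.lo >= tile.lo && need.hi <= tile.hi;
}

int main() {
  Range tile{32, 64};  // one thread block's output tile (rows 32..63)
  std::cout << std::boolalpha
            << "1x1 filter block-resident: " << block_resident(tile, 1) << "\n"   // true
            << "3x3 filter block-resident: " << block_resident(tile, 3) << "\n";  // false
  return 0;
}
```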