
Add experimental cuFFT support. #587

Open · wants to merge 1 commit into master

Conversation


@cbalint13 cbalint13 commented Nov 8, 2020

Enable experimental offload of FFT processing to CUDA-based GPUs.

Description

  • Proposed is a very simple patch that swaps the main FFTW3 routines for their drop-in cuFFTW equivalents, see: cuFFTW interface (a minimal sketch follows this list)
  • cuFFT does not support R2R transforms, so those are skipped in this PR; however, they are not used (yet) in the srsLTE runtime.
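
As a minimal illustration of the drop-in idea (a sketch only, not the actual patch; the USE_CUDA macro name is assumed from the CMake option mentioned later in this thread): with cuFFTW the existing FFTW3 calls compile unchanged, only the included header and the link flags differ.

/* Illustrative sketch, not the actual patch: the same FFTW3 code compiles
 * against cuFFTW by switching the header and linking with
 * -lcufftw -lcufft instead of -lfftw3f. */
#ifdef USE_CUDA
#include <cufftw.h> /* FFTW3-compatible API backed by cuFFT */
#else
#include <fftw3.h>
#endif

static void run_fft(fftwf_complex* in, fftwf_complex* out, int n)
{
  /* Identical planning/execution path for both backends; R2R plans are
   * the only part of the FFTW3 API that cuFFTW does not provide. */
  fftwf_plan p = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
  fftwf_execute(p);
  fftwf_destroy_plan(p);
}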

Target

  • Meeting the under-4 ms computation requirement within such a tight timeframe, with only a limited number of threads, is very challenging; offloading the FFT is one way to attack it.

Evaluation

  1. Enabling cuFFT with the proposed patch works just as well as the CPU target code; no degradation or loss was observed.

  2. According to a simple benchmark, CUDA is slower than the CPU at the ~1024-point target sizes (see the rough sketch of the measured operation after this list):

Running ./cufftwf-benchmark
---------------------------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------
cu_fftwf/1024/manual_time         23298 ns        23283 ns        29583 bytes_per_second=335.33M/s items_per_second=43.9524M/s
cu_fftwf/2048/manual_time         27701 ns        27674 ns        25377 bytes_per_second=564.059M/s items_per_second=73.9324M/s
cu_fftwf/524288/manual_time     2002004 ns      1996820 ns          352 bytes_per_second=1.95117G/s items_per_second=261.882M/s
cu_fftwf/1048576/manual_time    4032475 ns      4022351 ns          175 bytes_per_second=1.9374G/s items_per_second=260.033M/s
cu_fftwf/manual_time_BigO          3.85 N          3.84 N    
cu_fftwf/manual_time_RMS              3 %             3 %    

Running ./fftw3f-benchmark
------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
fftwf/1024/manual_time          1644 ns         1669 ns       425523 bytes_per_second=4.63961G/s items_per_second=622.718M/s
fftwf/2048/manual_time          4070 ns         4094 ns       171156 bytes_per_second=3.74898G/s items_per_second=503.18M/s
fftwf/524288/manual_time    10701510 ns     10675552 ns           64 bytes_per_second=373.779M/s items_per_second=48.992M/s
fftwf/1048576/manual_time   33639549 ns     33554670 ns           20 bytes_per_second=237.815M/s items_per_second=31.1709M/s
fftwf/manual_time_BigO          0.00 N^2        0.00 N^2  
fftwf/manual_time_RMS             16 %            16 %  
  3. But with the FFT offloaded, the CPU may gain some free time slots on low-end SBCs.

  4. Beyond the FFT, more baseband benefits may come from Nvidia Aerial's cuPHY, which also targets FEC / Turbo codes.
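
For reference, a rough sketch of what each benchmark iteration measures (the figures above come from a Google-Benchmark-style harness; the function name here is hypothetical): one single-precision complex forward transform of the given size on a pre-planned buffer, throughput then being iterations divided by elapsed time.

/* Rough illustrative sketch, not the actual harness. */
#include <time.h>
#include <fftw3.h> /* or <cufftw.h> for the CUDA run */

static double bench_fft_seconds(int n, int iters)
{
  fftwf_complex* buf = fftwf_malloc(sizeof(fftwf_complex) * n);
  fftwf_plan     p   = fftwf_plan_dft_1d(n, buf, buf, FFTW_FORWARD, FFTW_MEASURE);

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < iters; i++) {
    fftwf_execute(p); /* one forward FFT per iteration */
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  fftwf_destroy_plan(p);
  fftwf_free(buf);
  return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}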


@andrepuschmann, @suttonpd, @ismagom, looking forward to your thoughts.

Thank you!

@CLAassistant

CLAassistant commented Nov 8, 2020

CLA assistant check
All committers have signed the CLA.

Collaborator

@andrepuschmann andrepuschmann left a comment


Hey Balint, thanks a lot for that. We've been looking at GPU-based acceleration for a while but haven't seen an actual use case just yet, primarily because, as you've pointed out, FFTs and other basic operations are super efficient on x86. That might change for coding and other more complex PHY procedures. We'll see. We also follow Aerial, but haven't tried it out or seen any benchmark results. Definitely looking forward to it. That being said, I think your PR is a good basis for possible GPU offloading using CUDA.

lib/src/phy/dft/dft_fftw.c (outdated review comment, resolved)
@xavierarteaga
Contributor

Hello, thanks for your experiment. I was curious and benchmarked it. There is an OFDM unit test that can be used as a benchmark: lib/src/phy/dft/test/ofdm_test -N 2048 -n 100 -r 10000. The benchmark result is the number of samples divided by the processing time.

For USE_CUDA=Off (i7 7800X OC 4.5GHz):

Running test for 100 PRB, 16800 RE...  [email protected] [email protected] MSE=0.000005

For USE_CUDA=On (GeForce GTX 970 WindForce 3X OC):

Running test for 100 PRB, 16800 RE...  [email protected] [email protected] MSE=0.000005

Also, the whole DL processing chain (including PDCCH and PDSCH encoding/decoding) can be tested and benchmarked with lib/test/phy/phy_dl_test -p 100 -s 1000 -m 28. This test reports the number of bits encoded in a subframe divided by the processing time.

For USE_CUDA=Off:

lib/test/phy/phy_dl_test -p 100 -s 1000 -m 28
Finished! The UE failed decoding 0 of 1000 transport blocks.
75376000 were transmitted, 75376000 bits were received.
[Rates in Mbps] Granted  Processed
           eNb:    75.4      263.4
            UE:    75.4      102.6
BLER:   0.0%
Ok

For USE_CUDA=On:

Finished! The UE failed decoding 0 of 1000 transport blocks.
75376000 were transmitted, 75376000 bits were received.
[Rates in Mbps] Granted  Processed
           eNb:    75.4      142.1
            UE:    75.4       79.5
BLER:   0.0%
Ok

I am curious how cuPHY may perform.

@ghost ghost mentioned this pull request Dec 23, 2020
@andrepuschmann
Collaborator

Just a quick follow-up on this. We've decided to leave the PR for the upcoming release, simply because the user benefit isn't obvious right now. We are happy to leave the PR open and build on top of it should Nvidia decide to make cuBB available publicly. Thanks again @cbalint13 for your contribution. Those are much appreciated.

@cbalint13
Author

cbalint13 commented May 3, 2021

Just a quick follow-up on this. We've decided to leave the PR for the upcoming release, simply because the user benefit isn't obvious right now. We are happy to leave the PR open and build on top of it should Nvidia decide to make cuBB available publicly. Thanks again @cbalint13 for your contribution. Those are much appreciated.

@andrepuschmann , @xavierarteaga , @ismagom

The current srsRAN implementation seems to gain little benefit from CUDA; however, it is an encouragement for future development.


Notes from further tests and benchmarks:

  • the FFT/IFFT buffers/batches are too small for cuFFT to yield a substantial benefit (compared to large batches); see the batching sketch after this list
  • the intense GPU<->PCIe<->RAM transfers for such small buffers/batches are also a major drawback (even with DMA)
  • the current schema (small buffers/batches) would suffer the same drawbacks for CUDA turbo-code/LDPC routines too
  • on e.g. NVIDIA Jetson (GPU on the same CPU+memory bus) the intense buffer exchange is even worse (up to freezing the bus).
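
To illustrate the batching point (a hypothetical sketch, not existing srsLTE code): cuFFT amortizes kernel-launch and PCIe-transfer overhead only when many transforms are planned, copied, and executed as a single batch, which the current per-symbol call pattern does not allow.

/* Hypothetical sketch only: one plan, one upload, one launch and one
 * download for the whole batch instead of `batch` separate round trips. */
#include <cufft.h>
#include <cuda_runtime.h>

void fft_batched(const cufftComplex* host_in, cufftComplex* host_out, int n, int batch)
{
  size_t        bytes = sizeof(cufftComplex) * (size_t)n * batch;
  cufftComplex* dev   = NULL;
  cudaMalloc((void**)&dev, bytes);

  cufftHandle plan;
  cufftPlanMany(&plan, 1, &n, NULL, 1, n, NULL, 1, n, CUFFT_C2C, batch);

  cudaMemcpy(dev, host_in, bytes, cudaMemcpyHostToDevice);  /* single upload   */
  cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);              /* batched FFT     */
  cudaMemcpy(host_out, dev, bytes, cudaMemcpyDeviceToHost); /* single download */

  cufftDestroy(plan);
  cudaFree(dev);
}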

Some possible future solutions that may enable a more heterogeneous computation schema:

  • To gain a benefit, the whole computation chain, including the data, should stay on the GPU (until "to-air" release); a sketch follows the references below.
  • PHY code/buffer re-organization would enable coupling with more advanced (not yet seen in practice) mod/demod schemes, e.g. [1] [2], which might help avoid the current/classical "not-so-parallelizable" FEC computation.

[1] LDPC https://arxiv.org/pdf/2007.07644.pdf
[2] TurboAE https://github.com/yihanjiang/turboae
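
A hedged sketch of the "stay on the GPU" idea (hypothetical function and stage names, not existing srsRAN code): successive PHY stages consume the same device-resident buffer on one CUDA stream, so data crosses PCIe only once per subframe rather than once per FFT call.

/* Hypothetical sketch only: FFT and later stages share one device buffer
 * and one stream; no host round trip between stages. */
#include <cufft.h>
#include <cuda_runtime.h>

void process_symbols_on_gpu(cufftComplex* dev_samples, int n, int batch, cudaStream_t stream)
{
  cufftHandle plan;
  cufftPlanMany(&plan, 1, &n, NULL, 1, n, NULL, 1, n, CUFFT_C2C, batch);
  cufftSetStream(plan, stream); /* keep all work ordered on the caller's stream */

  /* FFT stage: in-place on device memory. */
  cufftExecC2C(plan, dev_samples, dev_samples, CUFFT_FORWARD);

  /* Equalization, demapping and FEC kernels would be launched on the same
   * stream here, consuming dev_samples directly, before a single copy back
   * (or direct "to-air" hand-off) at the end of the chain. */

  cufftDestroy(plan);
}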
