A GPU based FX correlator for radio astronomy
Release Notes for xGPU
----------------------

Overview:

xGPU is a library for performing the cross-multiplication step of the FX
correlator algorithm, which is popular for radio astronomy signal
processing.

Precision:

By default xGPU accepts signed 8-bit integer input, which is converted
to and computed in 32-bit floating point, with the final result output
in 32-bit floating point. However, on architectures that support the
dp4a instruction (sm_61, sm_70 and sm_72 at the time of writing) a pure
integer correlator is supported (8-bit integer multiply, 32-bit integer
accumulation). This can provide up to a 4x speedup versus the floating
point correlator. Note that if dp4a is requested on an unsupported
architecture, the computation will be emulated, which gives the correct
answer but is much slower than native dp4a.

Software Compatibility:

The library has been tested under Linux (Ubuntu 16.04 and CentOS 7)
using release 9.2 of the CUDA toolkit. Other versions of CUDA are
expected to be compatible.

Hardware Compatibility:

For a list of supported devices see
https://developer.nvidia.com/cuda-gpus

xGPU has been tested and is known to work on all GPUs from Fermi
onwards. While this library will run on pre-Fermi GPUs with appropriate
changes to the Makefile, the kernels make Fermi-specific optimizations,
so performance on sm_1x CUDA architectures will likely be sub-standard.
Note that as of CUDA 9.0, Fermi support has been removed from the CUDA
toolkit, with Kepler being the minimum.

Building the Library:

The library, the library query tool "xgpuinfo", and the sample program
"cuda_correlator" can be built by changing into the src subdirectory
and running "make".

    $ cd src
    $ make

The default target architecture is Kepler (sm_35), though this can be
overridden with the CUDA_ARCH variable. E.g.,

    $ make CUDA_ARCH=sm_60

would build xGPU for Pascal GP100.

The possible options are:

    arch    example chips                 example products               dp4a
    sm_20   GF100, GF110                  Tesla 2050, Tesla 2070
    sm_21   GF114, GF116                  GTX 460
    sm_30   GK104, GK106, GK107           Tesla K10, GTX 680
    sm_35   GK110, GK180                  Tesla K20, Tesla K40
    sm_50   GM107, GM108                  GTX 750
    sm_52   GM200, GM204, GM206, GM207    Tesla M40, GTX 980
    sm_53   GM20B                         Jetson TX1
    sm_60   GP100                         Tesla P100
    sm_61   GP102, GP104, GP106, GP107    Tesla P40, Tesla P4, GTX 1080  x
    sm_62   GP10B                         Jetson TX2
    sm_70   GV100                         Tesla V100, Titan V            x
    sm_72   GV10B                         Xavier                         x

In general one should target the same architecture that the code will
run on. While code compiled for an older architecture will in general
run on a more recent architecture, this may lead to sub-optimal
performance. For example, native dp4a instructions will only be
generated for the architectures marked in the dp4a column.

Other options include:

    CUDA_DIR     # path to CUDA toolkit installation (default /usr/local/cuda)
    DEBUG        # set any specific compiler flags (default is -O3)
    TEXTURE_DIM  # whether to use 1-d or 2-d textures (default is 1)
    DP4A         # whether to use dp4a instruction (default is no)

For the full list of environment variables that control compilation and
installation options see the top of src/Makefile.

Currently, a number of sizing parameters must be specified when
building the library. Default values of these parameters are specified
near the top of src/xgpu_info.h. The default values can be overridden
on the make command line to suit your instrument's needs. The options
that can be given on the make command line are shown here with their
default values.

    NPOL=2
    NSTATION=256
    NFREQUENCY=10
    NTIME=1024
    NTIME_PIPE=128

Note that NTIME_PIPE must be a multiple of 4 (16 when dp4a is used) and
NTIME must be a multiple of NTIME_PIPE. The preprocessor will error out
if those two conditions are not met.
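
The two divisibility rules for NTIME and NTIME_PIPE can be sketched as
preprocessor checks. This is only an illustration of the idea, not the
checks that ship in xGPU's headers: the macro names USE_DP4A and
NTIME_PIPE_STEP and the helpers sizes_ok() and check_compiled_sizes()
are hypothetical names made up for this sketch.

```c
#include <assert.h>

/* Defaults quoted from the list above; override with e.g. -DNTIME=2048. */
#ifndef NTIME
#define NTIME 1024
#endif
#ifndef NTIME_PIPE
#define NTIME_PIPE 128
#endif

/* USE_DP4A is a hypothetical switch for this sketch (0 = off, 1 = on). */
#ifndef USE_DP4A
#define USE_DP4A 0
#endif

/* NTIME_PIPE must be a multiple of 4, or of 16 when dp4a is used. */
#if USE_DP4A
#define NTIME_PIPE_STEP 16
#else
#define NTIME_PIPE_STEP 4
#endif

#if (NTIME_PIPE % NTIME_PIPE_STEP) != 0
#error "NTIME_PIPE must be a multiple of 4 (16 when dp4a is used)"
#endif

#if (NTIME % NTIME_PIPE) != 0
#error "NTIME must be a multiple of NTIME_PIPE"
#endif

/* Run-time version of the same two rules: returns 1 if valid, else 0. */
int sizes_ok(int ntime, int ntime_pipe, int use_dp4a)
{
    const int step = use_dp4a ? 16 : 4;

    if (ntime <= 0 || ntime_pipe <= 0) {
        return 0; /* also avoids a modulo by zero below */
    }
    return (ntime_pipe % step == 0) && (ntime % ntime_pipe == 0);
}

/* Confirms the helper agrees with the values that passed the #if checks. */
void check_compiled_sizes(void)
{
    assert(sizes_ok(NTIME, NTIME_PIPE, USE_DP4A));
}
```

With this sketch, compiling with -DNTIME_PIPE=6 stops at the first
#error, and -DNTIME=1000 stops at the second.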
For example, to compile with NSTATION set to 128 and all other
parameters at their default values:

    $ make NSTATION=128

Installing the Library:

The library can be installed by changing into the src subdirectory and
running "make install". By default, this will install xgpuinfo into
/usr/local/bin, xgpu.h into /usr/local/include, and libxgpu.so (or
libxgpu.dll on Cygwin) into /usr/local/lib. Specifying
"prefix=/some/path" on the "make install" command line will install
these files into /some/path/bin, /some/path/include, and /some/path/lib
instead.

    $ cd src
    $ make install                      # install under /usr/local
    $ make install prefix=$HOME/local   # install under $HOME/local

Using the Library:

The library can be called from C or C++ code. To use the library, your
source files need to #include <xgpu.h> and your executable needs to be
linked with libxgpu.so (or libxgpu.dll on Cygwin). On UNIX systems,
this usually means adding "-L/path/to/lib/dir" and "-lxgpu" to the link
command line. Please see the comments in xgpu.h as well as the usage in
the sample program cuda_correlator.cu for more details on how to use
the library.

This library has been designed to be interfaced with other parts of an
FX correlator pipeline, and so not much can be achieved in isolation. A
simple test program "cuda_correlator.cu" is included which performs
cross-multiplication on the host and the device and verifies that the
device obtained the correct answer. The many options regarding number
of stations, frequency channels etc. are set at the top of this file.

Benchmarking Performance:

xGPU includes an additional benchmarking utility: CUBE - CUDA
BEnchmarking. This uses C preprocessor directives to obtain arithmetic
throughput and device memory bandwidth performance. To invoke a
benchmarking run, simply execute the "bench" script. This will perform
four runs of the test.
The first two of these count all flops and transfers performed by the
kernels and measure the time taken for each of these steps. The latter
two measure the asynchronous performance of the device<->host
transfers. By default the results are printed to stdout, and they are
also written to file (cube_benchmark.log and cube_benchmark.csv).

Acknowledging xGPU:

If you find this code useful in your work, please cite:

M. A. Clark, P. C. La Plante, and L. J. Greenhill, "Accelerating Radio
Astronomy Cross-Correlation with Graphics Processing Units",
[arXiv:1107.4264 [astro-ph]].

Authors:

Kate Clark (NVIDIA)
Paul La Plante (Loyola University Maryland)
Lincoln Greenhill (Harvard-Smithsonian Center for Astrophysics)
David MacMahon (University of California, Berkeley)
Ben Barsdell (NVIDIA)
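
Example (integer precision):

The Precision section describes the dp4a path as 8-bit integer multiply
with 32-bit integer accumulation. As a rough host-side illustration of
that arithmetic, the sketch below computes what a single dp4a operation
produces: the dot product of four signed 8-bit pairs added to a 32-bit
accumulator. It is not xGPU's own emulation code, and the function name
dp4a_emulated() is a hypothetical name chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Host-side sketch of one dp4a step: multiply four signed 8-bit pairs
 * and add the four products to a 32-bit integer accumulator.
 */
int32_t dp4a_emulated(const int8_t a[4], const int8_t b[4], int32_t acc)
{
    assert(a != NULL && b != NULL);

    for (int k = 0; k < 4; ++k) {
        /* Widen before multiplying so each product is formed in 32 bits. */
        acc += (int32_t)a[k] * (int32_t)b[k];
    }
    return acc;
}
```

Because each call consumes four samples at once, a time series handled
this way is naturally processed in groups of four, which fits the
requirement that NTIME_PIPE be a multiple of 4.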