NAME

Inline::CUDA - Inline NVIDIA's CUDA code and GPU processing from within any Perl script.

VERSION

Version 0.15

SYNOPSIS

WARNING: see section "INSTALLATION" for how to install this package.

WARNING: prior to installation, please install https://github.com/hadjiprocopis/perl-nvidia2-ml

Inline::CUDA is a module that allows you to write Perl subroutines in C or C++ with CUDA extensions.

Similarly to Inline::C, Inline::CUDA is not meant to be used directly but rather in this way:

    #Firstly, specify some configuration options:

    use Inline CUDA => Config =>
    # optionally specify some options,
    # you don't have to
    # if they are already stored in a configuration file
    # which is consulted before running any CUDA program
    #    host_compiler_bindir => '/usr/local/gcc82/bin',
    #    cc => '/usr/local/gcc82/bin/gcc82',
    #    cxx => '/usr/local/gcc82/bin/g++82',
    #    ld => '/usr/local/gcc82/bin/gcc82',
    #    nvcc => '/usr/local/cuda/bin/nvcc',
    #    nvld => '/usr/local/cuda/bin/nvcc',
    # pass options to nvcc:
    #  this is how to deal with unknown compiler flags passed on to nvcc: pass them all to gcc
    #  only supported in nvcc versions 11+
    #    nvccflags => '--forward-unknown-to-host-compiler',
    #  do not check compiler version, use whatever the user wants
    #    nvccflags => '--allow-unsupported-compiler',
    # this will use CC or CXX depending on the language specified here
    # you can use C++ in your CUDA code, and there are tests in t/*
    # which check if c or c++ and show how to do this:
        host_code_language => 'c', # or 'c++' or 'cpp', case insensitive, see also cxx =>
    # optional extra Include and Lib dirs
        #inc => '-I...',
        #libs => '-L... -l...',
    # for debugging
        BUILD_NOISY => 1,
    # code will be left in ./_Inline/build/ after successful build
        clean_after_build => 0,
        warnings => 10,
      ;

    # and then, suck in code from __DATA__ and run it at runtime
    # notice that Inline->use(CUDA => <<'EOCUDA') is run at compiletime
    my $codestr;
    { local $/ = undef; $codestr = <DATA> }
    Inline->bind( CUDA => $codestr );

    if( do_add() ){ die "error running do_add()..." }

    1;
    __DATA__
    /* this is C code with CUDA extensions */
    #include <stdio.h>
    
    #define N 1000
    
    /* This is the CUDA Kernel which nvcc compiles: */
    __global__
    void add(int *a, int *b) {
            int i = blockIdx.x;
            if (i<N) b[i] = a[i]+b[i];
    }
    /* this function can be called from Perl.
       It returns 0 on success or 1 on failure.
       This simple code does not support passing parameters in,
       which is covered elsewhere.
    */
    int do_add() {
            cudaError_t err;
    
            // Create int arrays on the CPU.
            // ('h' stands for "host".)
            int ha[N], hb[N];
    
            // Create corresponding int arrays on the GPU.
            // ('d' stands for "device".)
            int *da, *db;
            if( (err=cudaMalloc((void **)&da, N*sizeof(int))) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaMalloc() has failed for %zu bytes for da: %s\n",
                            N*sizeof(int), cudaGetErrorString(err)
                    );
                    return 1;
            }
            if( (err=cudaMalloc((void **)&db, N*sizeof(int))) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaMalloc() has failed for %zu bytes for db: %s\n",
                            N*sizeof(int), cudaGetErrorString(err)
                    );
                    return 1;
            }
    
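            // Initialise the input data on the CPU.
            for (int i = 0; i<N; ++i){ ha[i] = i; hb[i] = 2*i; }
    
            // Copy input data to arrays on GPU.
            // (both arrays must be copied: the kernel reads a[i] and b[i])
            if( (err=cudaMemcpy(da, ha, N*sizeof(int), cudaMemcpyHostToDevice)) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaMemcpy(cudaMemcpyHostToDevice) has failed for %zu bytes for ha->da: %s\n",
                            N*sizeof(int), cudaGetErrorString(err)
                    );
                    return 1;
            }
            if( (err=cudaMemcpy(db, hb, N*sizeof(int), cudaMemcpyHostToDevice)) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaMemcpy(cudaMemcpyHostToDevice) has failed for %zu bytes for hb->db: %s\n",
                            N*sizeof(int), cudaGetErrorString(err)
                    );
                    return 1;
            }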
    
            // Launch GPU code with N blocks of one thread each, one block per array element.
            add<<<N, 1>>>(da, db);
            if( (err=cudaGetLastError()) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, failed to launch the kernel into the device: %s\n",
                            cudaGetErrorString(err)
                    );
                    return 1;
            }
    
            // Copy output array from GPU back to CPU.
            if( (err=cudaMemcpy(hb, db, N*sizeof(int), cudaMemcpyDeviceToHost)) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaMemcpy(cudaMemcpyDeviceToHost) has failed for %zu bytes for db->hb: %s\n",
                            N*sizeof(int), cudaGetErrorString(err)
                    );
                    return 1;
            }
    
            //for (int i = 0; i<N; ++i) printf("%d\n", hb[i]); // print results
    
            // Free up the arrays on the GPU.
            if( (err=cudaFree(da)) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaFree() has failed for da: %s\n",
                            cudaGetErrorString(err)
                    );
                    return 1;
            }
            if( (err=cudaFree(db)) != cudaSuccess ){
                    fprintf(stderr, "do_add(): error, call to cudaFree() has failed for db: %s\n",
                            cudaGetErrorString(err)
                    );
                    return 1;
            }
    
            return 0;
    }
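
To make the launch configuration concrete, here is a host-only sketch in plain C that mimics what add<<<N, 1>>>(da, db) asks the GPU to do: N blocks, each with a single thread, where each block handles the element selected by blockIdx.x. The names add_one_block and emulate_add_launch are made up for this illustration and are not part of CUDA or Inline::CUDA. On a real GPU the blocks run concurrently rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only stand-in for the kernel body (hypothetical name):
   'block' plays the role of blockIdx.x in the kernel above. */
static void add_one_block(size_t block, size_t n, const int *a, int *b)
{
    if (block < n) b[block] = a[block] + b[block];
}

/* Emulates add<<<grid, 1>>>(a, b): 'grid' blocks with one thread each.
   Here the blocks simply run one after another on the CPU. */
void emulate_add_launch(size_t grid, size_t n, const int *a, int *b)
{
    assert(a != NULL && b != NULL);
    for (size_t block = 0; block < grid; ++block)
        add_one_block(block, n, a, b);
}
```

The bounds check in add_one_block mirrors the if (i<N) guard of the kernel: launching more blocks than there are elements is harmless because the extra blocks do nothing.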

The statement use Inline CUDA => ...; is executed at compile-time. Often this is not desirable because you may want to read code from a file, modify code at runtime or even auto-generate the inlined code at runtime. In these situations Inline provides bind().

Here is how to inline code read at runtime from a file called my_cruncher.cu, whose contents are exactly the same as the __DATA__ section in the previous example:

      use Inline;
      use File::Slurp;

      my $data = read_file('my_cruncher.cu');
      Inline->bind(CUDA => $data);

Using Inline->use(CUDA => "DATA") seems to have a problem when the __DATA__ section contains identifiers enclosed in double underscores, e.g. __global__ (a CUDA reserved keyword). One workaround is to declare #define CUDA_GLOBAL __global__ and then replace all __global__ with CUDA_GLOBAL.

Sometimes it is more convenient to configure Inline::CUDA not in a use statement (as above) but in a require statement. The latter is executed at runtime, whereas the former is executed at compile time. This has certain benefits, as you can enclose it in a conditional, an eval or a try/catch block. This is how (thank you [email protected]):

    require Inline;
    # configuration:
    Inline->import(
      CUDA => Config =>
        ccflagsex => '...'
    );
    # compile your code:
    Inline->import(
      CUDA => $my_code
    );

CUDA

The somewhat old news, at least since 2007, is that the Graphics Processing Unit (GPU) has found uses beyond its traditional role of calculating and displaying graphics on our computer monitors. This stems from the fact that a GPU is a highly parallel computing machine. Just as the operating system sends data and instructions to the GPU, frame after frame from the moment it boots, in order to display windows, widgets, transparent menus, spinning animations, video games and visual effects, a developer can now send data and instructions to the GPU for doing any sort of arithmetic calculation in a highly parallel manner. A case in point is matrix multiplication, where thousands of GPU computing elements process the matrices' elements in parallel. True parallelism, that is, as opposed to a CPU's parallelism, which is either emulated or limited by the number of cores (2, 4 or 8 for cheap desktops). It goes without saying that GPU processing is very powerful and opens up a new world of number-crunching possibilities without the need for expensive supercomputers.

NVIDIA's CUDA is "a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant" (from NVIDIA's site). In short, we use CUDA to dispatch number-crunching code to a Graphics Processing Unit (GPU) and then get the results back.

NVIDIA's CUDA comprises a few keywords which can be inserted in C, C++, Fortran, etc. code. In effect, developers still write programs in their preferred language (C, C++ etc.) and whenever they need to access the GPU they use the CUDA extensions. For more information see the CUDA Programming Guide.

A CUDA program is, therefore, a C or C++ program with a few CUDA keywords added. Generally, compiling such a program is done by a CUDA compiler, namely nvcc (nvidia cuda compiler) which, simplistically put, splits the code into two parts, the CUDA part and the C part. The C part is delegated to a C compiler, like gcc, and the CUDA part is handled by nvcc. Finally nvcc links these components into an ordinary standalone executable. For more information read the CUDA Toolkit Documentation.

Notice that in NVIDIA jargon, a "device" is (one of) the GPU(s) and the "host" is the CPU and the OS.

CAVEATS

In practice there are huge caveats, although conquering them can be surprisingly easy with some CLI magic. This is fine in Linux or even OSX, but for poor M$-windows victims the same process can be painfully tortuous, possibly ending in a mental breakdown. As I don't belong to that category I will not be able to help you with very specific requests regarding the so-called OS.

And on to the caveats.

Does your GPU support CUDA?

First of all, not all GPUs support CUDA. But new NVIDIA ones usually do, at a price of around 100 euros or less.

Different CUDA SDKs exist for different hardware

Secondly, different GPUs have different "compute capability", requiring different versions of the CUDA SDK, which provides nvcc and friends. For example my GeForce GTX 650 has a compute capability of 3.0 and that requires SDK version 10.2. That's the last SDK to support a 3.0 capability GPU. Currently, the SDK has reached version 11.4 and supports compute capabilities of 3.5 to 8.6. See the Wikipedia article on CUDA for what GPUs are supported and by what CUDA SDK version.

CUDA compiler requires specific compiler version

Thirdly and most importantly, nvcc has specific and strict requirements regarding the version of the "host compiler", for example, gcc/g++, clang, cl.exe. See which compilers are supported at

For example, my GPU's compute capability (3.0) requires CUDA SDK version 10.2, which requires a gcc version less than or equal to 8. Find out what compiler your CUDA SDK supports in ax3l's gist.
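
As a small illustration of that constraint, the following plain C sketch reports the major version of the compiler that builds it and compares it against a maximum. The function names are hypothetical and the limit of 8 is simply the figure quoted above for SDK 10.2; this is not how nvcc performs its own check. Note that compilers other than gcc may also define __GNUC__ (clang does), so the result is only meaningful when building with gcc itself.

```c
#include <assert.h>

/* Major version of the GCC-compatible compiler building this file,
   or 0 if the compiler does not define __GNUC__. */
int host_gcc_major(void)
{
#ifdef __GNUC__
    return __GNUC__;
#else
    return 0;
#endif
}

/* Returns 1 if 'major' does not exceed 'max_major', the highest major
   version the CUDA SDK accepts (8 for SDK 10.2, per the text above). */
int host_compiler_supported(int major, int max_major)
{
    assert(max_major > 0);
    return major > 0 && major <= max_major;
}
```

Calling host_compiler_supported(host_gcc_major(), 8) from a program built with the candidate host compiler tells you whether that compiler is within the limit before you attempt a full nvcc build.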

There is a hack to stop nvcc from checking the compiler version and make it use whatever compiler the user specifies. Simply pass --allow-unsupported-compiler to nvcc and hope for the best. According to the CUDA Toolkit Documentation, this flag has no effect on macOS.

xt/30-check-basic-with-system-compiler.t shows how to tell Inline::CUDA to use the system compiler and also tell nvcc to not check compiler version. This test can fail in particular OS/versions. It seems to have worked for my particular setting. With this option you are at least safe from getting into trouble because of "Perl and XS objects with mismatched compiler versions".

GPU Programming: memory transfers overheads

Additionally, general GPU programming has, in practice, quite a few caveats of its own that the potential GPU programmer must be aware of. To start with, there are some quite large overheads associated with sending data to the GPU and receiving it back. This is because the memory generally accessible to any program running on the CPU (e.g. the C-part of the CUDA code) is not available to the GPU in the simple and elegant manner C programmers take for granted when they are handed a memory pointer and read the memory it points to. And vice versa. Memory in the C-part of the code must be cudaMemcpy()'ed (the equivalent of memcpy() for host-to-device and device-to-host data transfers) to the GPU. And the results calculated on the GPU remain there until they are transferred back to the host with another cudaMemcpy() call.

Add to this the overhead of copying the value of each item of a Perl array into a C array which cudaMemcpy() understands and expects and you get quite a significant overhead and a lot of paper-pushing for finally getting the same block of data onto the GPU. And the same applies in doing the reverse.
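
The reason a Perl array cannot simply be handed to cudaMemcpy() is that its elements are separately allocated scalars reached through pointers, not one contiguous block of numbers. The plain C sketch below models that situation without any Perl headers: boxed_value is a hypothetical stand-in for a Perl scalar and stage_for_device is a made-up helper name. It only illustrates the element-by-element staging copy, not the real Perl API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Perl scalar: each value lives in its
   own allocation and is reached through a pointer. */
typedef struct { double num; } boxed_value;

/* Copies n boxed values into a newly allocated contiguous int array,
   which is the layout cudaMemcpy() needs for its host-side buffer.
   Returns NULL if the allocation fails. The caller frees the result. */
int *stage_for_device(boxed_value *const *items, size_t n)
{
    assert(items != NULL);
    int *flat = malloc(n * sizeof *flat);
    if (flat == NULL) return NULL;
    for (size_t i = 0; i < n; ++i)
        flat[i] = (int)items[i]->num;   /* one conversion and copy per element */
    return flat;
}
```

Every element costs a pointer dereference and a conversion before a single byte reaches the GPU, and the same work is repeated in reverse when the results come back.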

Here is a rough sketch of what memory transfers are required for calling an Inline::CUDA function from Perl and doing GPU processing:

    my @array = (1..5); # memory allocated for Perl array
    my $result;         # will be set to an arrayref holding the results
    inline_cuda_function(\@array, \$result);
    ...
    // now inside an Inline::CUDA code block
    int inline_cuda_function(SV *in, SV *out){
            int *h_in, *h_out, *d_in, *d_out;
            // allocate memory for copying Perl array (in) to C
            h_in = (int *)malloc(5*sizeof(int));
            // allocate memory for holding the results on host
            h_out = (int *)malloc(5*sizeof(int));
            // allocate memory on the GPU for this same data
            cudaMalloc((void **)&d_in, 5*sizeof(int));
            // allocate memory on the GPU for the result
            cudaMalloc((void **)&d_out, 5*sizeof(int));
            // transfer Perl data onto host's C-array
            AV *anAV = (AV *)SvRV(in);
            for(int i=0;i<5;i++){
                    SV *anSV = *av_fetch(anAV, i, FALSE);
                    h_in[i] = (int)SvIV(anSV);
            }
            // and now transfer host's C-array onto the GPU
            cudaMemcpy(d_in, h_in, 5*sizeof(int), cudaMemcpyHostToDevice);
            // launch the kernel and do the processing onto the GPU
            ...
            // extract results from the GPU onto host memory
            cudaMemcpy(h_out, d_out, 5*sizeof(int), cudaMemcpyDeviceToHost);
            // and now from host memory (the C array) onto Perl
            // we have been passed a reference to a scalar, we create a new array
            // and place a reference to it in that scalar
            AV *outAV = newAV();
            av_extend(outAV, 4); // resize the Perl array to fit the result (4 is the highest index)
            // sv_setrv() is a macro created by LeoNerd, see section "CONTRIBUTIONS BY OTHERS"
            // it places the new array we created onto the scalar referenced by (out)
            sv_setrv(SvRV(out), (SV *)outAV);
            for(int i=0;i<5;i++){
                    av_store(outAV, i, newSViv(h_out[i]));
            }
            free(h_in); free(h_out);
            cudaFree(d_in); cudaFree(d_out);
            return 0; // success
    }

There are some benchmarks in xt/benchmarks/*.b which compare the performance of a small (size ~10x10), medium (size ~100x100) and large (size ~1000x1000) data scenario for doing matrix multiplication (run them with make benchmark). On my computer at least, the pure-C, CPU-hosted code outperforms the GPU for the small and medium scenarios, exactly because of these overheads. But the GPU is a clear winner for the large data scenario.

See for example this particular benchmark: xt/benchmarks/30-matrix-multiply.b
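
For orientation, a pure-C, CPU-hosted matrix multiplication of the kind such a comparison uses as its baseline can be sketched as below. This is a minimal illustration, not the code in xt/benchmarks; the function name matmul_cpu and the row-major layout are assumptions. It also hints at why the GPU only wins for large inputs: for N x N matrices the arithmetic grows as N^3 multiply-adds while the data to transfer grows as 3*N^2 elements, so at N=10 there are about 1000 multiply-adds for 300 elements moved, but at N=1000 there are about 10^9 multiply-adds for 3*10^6 elements moved.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B with A (n x m), B (m x p) and C (n x p), all stored
   row-major in flat arrays. Plain triple loop, no blocking or SIMD. */
void matmul_cpu(const double *A, const double *B, double *C,
                size_t n, size_t m, size_t p)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < p; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < m; ++k)
                sum += A[i*m + k] * B[k*p + j];
            C[i*p + j] = sum;
        }
    }
}
```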

Perl and XS objects with mismatched compiler versions

Finally, there is an issue with compiling XS code, which is essentially what Inline::CUDA does, with a compiler different from the one the current Perl was built with. This is the case when a special host compiler had to be installed because of the CUDA SDK version. If that's true, then you are essentially loading XS code compiled with gcc82 (as per the example in section "INSTALLATION") into a perl executable which was compiled with the system compiler, for example gcc11. If that turns out to be a real issue then it is insurmountable and the only solution is to perlbrew a new Perl built with the special host compiler, e.g. gcc82.

The manual on installing Perl states that specifying the compiler is as simple as sh Configure -Dcc=/usr/local/gcc82/bin/gcc82

If you want to compile and install a new Perl using perlbrew then this will do it (thank you [email protected]):

PERLBREW_CONFIGURE_FLAGS='-d -Dcc=/usr/local/gcc82/bin/gcc' perlbrew install 5.38.2 --as 5.38.2.gcc82

The -d stops Configure from asking trivial questions about the compilation options and makes it use sane defaults. The --as 5.38.2.gcc82 tells perlbrew to give the newly installed perl its own name, in case there is already one with the same name.

INSTALLATION

Installation of Inline::CUDA is a nightmare because of its external dependencies. It needs NVIDIA's CUDA SDK, which provides nvcc (the nvidia cuda compiler), which in turn requires specific host compiler versions. This means that it is very likely that you will also need to install on your system an older compiler compatible with your nvcc version. Even if your GPU supports the latest CUDA SDK version (at 11.4 as of July 2021), the maximum gcc version allowed with that is 10.21. Currently, gcc is at version 11.2 and upgrades monthly.

Installing a "private" compiler, in Linux, can be easy or hard depending on whether the package manager allows it. Mine does not. See "how-to-install-compiler" for instructions on how to do that on Linux and label the new compiler with its own name so that the system compiler and the older compiler can live in parallel without disturbing each other.

That said, there is a workaround: pass the --allow-unsupported-compiler flag to nvcc. This can be achieved via use Inline CUDA => Config => ..., as below:

    use Inline CUDA => Config =>
            nvccflags => '--allow-unsupported-compiler',
            ... # other config options
    ;
    ... # Inline::CUDA etc.

The long and proper way of installing Inline::CUDA is described below.

So, if all goes Murphy's way, you will have to install nvcc and an additional host compiler, gcc. The latter is not the most pleasant of experiences in Linux. I don't know what the situation is with Windows. I can only imagine the horror.

Here is a rough sketch of what one should do.

  • Find the NVIDIA GPU name+version you have installed on your hardware kit. For example, GeForce GTX 650. This can be easy or hard.

    • Use the executable nvidia-smi, if you already have it installed or want to install it (e.g. in Fedora CLI do dnf provides nvidia-smi and make sure you have repo rpmfusion-nonfree enabled, somehow).
    • Install nvidia::ml and run the script I provide with Inline::CUDA at scripts/nvidia-ml-test.pl
  • With the NVIDIA GPU name+version available search this Wikipedia article for the "compute capability" of the GPU. For example this is 3.0 for GeForce GTX 650.

  • Use the "compute capability" of the GPU in order to find the CUDA SDK version you must install in the same Wikipedia article. For example, for the GPU GeForce GTX 650, one should download and install CUDA SDK 10.2.

  • Download, but do not yet install, the specific version of the CUDA SDK from the CUDA Toolkit Archive.

  • If you are lucky, your system's C compiler will be compatible with the CUDA SDK version you downloaded and installing the above archive will be successful. It is worth giving it a try, i.e. try to install and see if it complains about an incompatible host compiler version. If it doesn't then you are good to go.

  • If installing the above archive yields errors about incompatible host compiler then you must install a supported host compiler at a private path (so as not to interfere with your actual system compiler) and provide that path during installation (see below) of the CUDA SDK and also during installation of Inline::CUDA (see below).

  • Find the maximum host compiler version supported by the CUDA SDK you downloaded. For example, CUDA SDK 10.2 in Linux is documented at https://docs.nvidia.com/cuda/archive/10.2/cuda-installation-guide-linux/. It states that the maximum gcc version is 8.2.1 for RHEL 8.1. I suspect that it is the compiler's major version, e.g. 8, that matters. I can confirm that gcc 8.4.0 works fine for Linux, Fedora 34, kernel 5.12, perl v5.32, GeForce GTX 650.

  • Once you decide on the compiler version, download it and install it to a private path so as not to interfere with the system compiler. Note that path for later use.

  • I have instructions on how to do the above, in Linux for gcc. Download specific gcc version from: ftp://ftp.fu-berlin.de/unix/languages/gcc/releases/ (other mirrors exist here https://gcc.gnu.org/mirrors.html). Compile the compiler and make sure you give it a prefix and a suffix. You must also download packages https://ftp.gnu.org/gnu/mpfr, https://ftp.gnu.org/gnu/mpc/ and https://ftp.gnu.org/gnu/gmp/, choosing versions compatible with the gcc version you have already downloaded. The crucial line in the configuration stage of compiling gcc is configure --prefix=/usr/local/gcc82 --program-suffix=82 --enable-languages=c,c++ --disable-multilib --disable-libstdcxx-pch . Here is a gist from https://stackoverflow.com/questions/58859081/how-to-install-an-older-version-of-gcc-on-fedora:

         tar xvf gcc-8.2.0.tar.xz 
         cd gcc-8.2.0/
         tar xvf mpfr-4.0.2.tar.xz && mv -v mpfr-4.0.2 mpfr
         tar xvf gmp-6.1.2.tar.xz && mv -v gmp-6.1.2 gmp
         tar xvf mpc-1.1.0.tar.gz && mv -v mpc-1.1.0 mpc
         cd ../
         mkdir build-gcc820
         cd build-gcc820/
         ../gcc-8.2.0/configure --prefix=/usr/local/gcc82 --program-suffix=82 --enable-languages=c,c++,fortran --disable-multilib --disable-libstdcxx-pch
         make && make install
    

    From now on, I will be using /usr/local/gcc82/bin/gcc82 and /usr/local/gcc82/bin/g++82 as my host compilers.

  • Now you have the special compiler installed under /usr/local/gcc82, as /usr/local/gcc82/bin/gcc82 and /usr/local/gcc82/bin/g++82. Next, install the CUDA SDK and tell it to skip checking host compiler compatibility (I don't think there is a way to point it to the correct compiler to use). In Linux, this is like sh cuda_10.2.89_440.33.01_linux.run --override. After a successful installation you should be able to see /usr/local/cuda/bin/nvcc. Optionally add this to your PATH: export PATH="${PATH}:/usr/local/cuda/bin"

  • In general, compiling CUDA code, for example this one, is as simple as:

          nvcc --compiler-bindir /usr/local/gcc82/bin/gcc82 simple.cu && ./a.out
    

    Notice the cuda program extension .cu. It is important to keep nvcc happy. Also note that if your CUDA SDK does not require installing an older version of a compiler but instead it is happy with your system compiler, then you can omit this: --compiler-bindir /usr/local/gcc82/bin/gcc82.

  • If you did compile the simple cuda program and managed to run it, then you are ready to install Inline::CUDA. If your system compiler is accepted by the CUDA SDK, then it is as simple as running

          perl Makefile.PL
          make
          make install
    

    But if you need to declare a special host compiler (re: /usr/local/gcc82/bin/gcc82) because your system compiler is not accepted by CUDA SDK then you need to specify that to the installation process via one of the following two methods:

    • The first method is more permanent but assumes that you can (re-)install the module. During installation, specify the following environment variables. Assuming a bash-based terminal, this should do it:

            CC=/usr/local/gcc82/bin/gcc82 \
            CXX=/usr/local/gcc82/bin/g++82 \
            LD=/usr/local/gcc82/bin/g++82 \
            perl Makefile.PL
            make
            make install
      

    • The second method assumes you can edit Inline::CUDA's configuration file, located at a place like /usr/local/share/perl5/5.32/auto/share/dist/Inline-CUDA/Inline-CUDA.conf (different systems will have a slightly different path), and modify the entries for 'cc', 'cxx' and 'ld'.

  • Whatever the host compiler was, the configuration will be saved in a file called Inline-CUDA.conf. This file will be saved in a share-dir relative to your current Perl installation path. As an example mine is at /usr/local/share/perl5/5.32/auto/share/dist/Inline-CUDA/Inline-CUDA.conf

    This configuration file will be consulted every time you use Inline::CUDA, which is how the module knows where the special host compiler resides.

  • Finally, make test will run a suite of test scripts and if all goes well all will succeed. Additionally, make benchmark will run a matrix multiplication benchmark which will reveal if you can indeed get any benefits using GPGPU on your specific hardware for this specific problem. Feel free to extend benchmarks for your use-case.

  • At this stage I would urge people installing the code to run also make author-test and report back errors.

DEMO

The folder demos/ in the base dir of the current distribution contains self-contained Inline::CUDA demo(s). One of them produces the Mandelbrot Fractal on the GPU using Cuda code copied from marioroy's excellent work at https://github.com/marioroy/mandelbrot-python, see also the PerlMonks post at https://perlmonks.org/?node_id=11139880. The demo is not complete, it just plugs marioroy's Cuda code into Inline::CUDA.

From the base dir of the current distribution run:

make demo

CAVEATS

In your CUDA code do not implement main()! Place your CUDA code in your own functions which you call from Perl. If you get segmentation faults check the above first.

CONTRIBUTIONS BY OTHERS

This is a module which stands on the shoulders of Giants.

Literally!

To start with, CUDA and nvidia cuda compiler are two NVIDIA projects which offer general programming on the GPU to the masses opening a new world of computational capabilities as an alternative to the traditional CPU model. A big thank you to NVIDIA.

Then there is Perl's Inline module created by Ingy döt Net. This module makes it easy to inline a lot of computer languages and call them within a Perl script, passing Perl data structures and obtaining results back.

This module is the key to opening many doors for Perl scripts.

A big thank you to Ingy döt Net.

Then there is Perl's Inline::C module created/co-created/maintained by Ingy döt Net, Sisyphus and Tina Müller.

The current Inline::CUDA module relies heavily on Inline::C. Because the underlying CUDA language is C, I decided that, instead of copying what Inline::C does and modifying the section where the Makefile is written, I would inject all of Inline::C's subs into Inline::CUDA, except for some sections which require special treatment, like writing the Makefile and allowing some special Config keywords. The sub injection happens every time the module is called, and that definitely adds a tiny overhead which, in my opinion, is compensated by the huge advantage of not copy-pasting code from Inline::C into Inline::CUDA and then incorporating my changes every time Inline::C updates. A big thank you to Ingy döt Net (again!), Sisyphus and Tina Müller.

For writing test cases and benchmarks I had to descend into C and become acquainted with perlguts, e.g. what is an SV. In this process I had to ask for the wisdom of PerlMonks.org and #perl. A particular question was how to pass an arrayref, a scalar or a scalarref to a C function and store the results of the computation in there, in a call-by-reference manner. Fortunately LeoNerd at #perl created the following sv_setrv() macro which saved the day. Big thank you LeoNerd.

    /************************************************************/
    /* MONKEYPATCH by LeoNerd to set an arrayref into a scalarref
       As posted on https://kiwiirc.com/nextclient/#irc://irc.perl.org/#perl
       at 10:50 23/07/2021
       A BIG THANK YOU LeoNerd
    */
    #define HAVE_PERL_VERSION(R, V, S) \
        (PERL_REVISION > (R) || (PERL_REVISION == (R) && (PERL_VERSION > (V) || (PERL_VERSION == (V) && (PERL_SUBVERSION >= (S))))))

    #define sv_setrv(s, r)  S_sv_setrv(aTHX_ s, r)
    static void S_sv_setrv(pTHX_ SV *sv, SV *rv)
    {
      sv_setiv(sv, (IV)rv);
    #if !HAVE_PERL_VERSION(5, 24, 0)
      SvIOK_off(sv);
    #endif
      SvROK_on(sv);
    }
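
The HAVE_PERL_VERSION(R, V, S) guard above compares the three-part version of the perl being compiled against (PERL_REVISION, PERL_VERSION and PERL_SUBVERSION, which come from perl's own headers) with a threshold, so that SvIOK_off() is only compiled in for perls older than 5.24.0. The standalone C sketch below expresses the same comparison as a function with the running version passed in explicitly, so it can be tried without perl's headers. The function name version_at_least is made up for this illustration.

```c
#include <assert.h>

/* Same comparison as HAVE_PERL_VERSION(R, V, S), but the version being
   tested (rev.ver.sub) is passed in instead of read from perl's headers.
   Returns 1 if rev.ver.sub is at least R.V.S, else 0. */
int version_at_least(int rev, int ver, int sub, int R, int V, int S)
{
    assert(rev >= 0 && ver >= 0 && sub >= 0);
    return rev > R
        || (rev == R && (ver > V || (ver == V && sub >= S)));
}
```

For instance, a perl v5.32.1 gives version_at_least(5, 32, 1, 5, 24, 0) equal to 1, so the SvIOK_off() line is left out, whereas a perl v5.22.4 gives 0 and the line is compiled in.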

I copied numerical recipes (as C code, Cuda kernels, etc.) from the repository of Zhengchun Liu. This code resides in 'C/inlinecuda' of the current distribution and offers shortcuts to GPU-based matrix multiplication, for example.

The idea of this project came to me when kcott asked whether there are https://www.perlmonks.org/?node_id=11134476 to which I responded with the preliminary idea for what is now Inline::CUDA. A big thank you to kcott.

I got helpful comments, advice and the odd smiley from LeoNerd, mst, Bojte, shlomif at #perl, thank you.

I got helpful comments and advice in this PerlMonks.org post from syphilis and perlfan, although the problem was cracked by LeoNerd at #perl.

I also got helpful comments and advice from Ed J when I filed a bug over at ExtUtils::MakeMaker (see https://rt.cpan.org/Ticket/Display.html?id=138022 and https://rt.cpan.org/Ticket/Display.html?id=137912).

AUTHOR

Andreas Hadjiprocopis, <bliako at cpan dot org>, <andreashad2 at gmail dot com>, https://perlmonks.org/?node=bliako

DEDICATIONS

!Almaz!

BUGS

Please report any bugs or feature requests to bug-inline-cuda at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Inline-CUDA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

NOTE: this project is not yet on CPAN so report bugs by email to the author. I am not very comfortable with github so cloning and merging and pushing and pulling are beyond me.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Inline::CUDA

You can also look for information at:

ACKNOWLEDGEMENTS

This module stands on the shoulders of giants, namely the authors of Inline and Inline::C. I wish to thank them here and pass most credit to them. I will keep 1%.

A big thank you to NVIDIA for providing tools and support for doing numerical programming on their GPU.

All mentioned above provided keys to many doors, all free and open source. Thank you!

LICENSE AND COPYRIGHT

This software is Copyright (c) 2021 by Andreas Hadjiprocopis.

This is free software, licensed under:

The Artistic License 2.0 (GPL Compatible)