Kernel reuse #93

stolk · 2020-06-08T06:07:02Z

It's unclear to me whether CUDA kernels can ever be reused.

It seems the CUDA code is compiled every time, even when I run the same code (with different data) every display frame.

wjakob · 2020-06-08T08:04:41Z

They are reused if your computation results in the same assembly code. You can check this by increasing the debug level a bit and looking for "cache hit" messages. One thing to avoid is literal constants that change from iteration to iteration, which cause different PTX code to be generated.
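To illustrate the point about literal constants, here is a minimal sketch of a kernel cache keyed by the generated source text. It is a toy model written for this thread, not Enoki's actual implementation: ToyKernelCache, source_with_literal, source_with_parameter and count_misses are hypothetical names, and the "source" strings are made-up pseudo-code rather than real PTX.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Toy stand-in for a JIT kernel cache: compiled kernels are keyed by the
// exact text of the generated source (hypothetical, not Enoki's code).
struct ToyKernelCache {
    std::unordered_map<std::string, int> kernels; // source text -> kernel id
    std::size_t hits = 0, misses = 0;

    // Returns the kernel id for `source`, "compiling" it on a cache miss.
    int lookup(const std::string &source) {
        auto it = kernels.find(source);
        if (it != kernels.end()) { ++hits; return it->second; }
        ++misses; // this is where the expensive compile would happen
        int id = static_cast<int>(kernels.size());
        kernels.emplace(source, id);
        return id;
    }
};

// Variant A: the shift is baked into the source as a literal constant,
// so every new value produces a different source string.
std::string source_with_literal(float shift) {
    return "v = noise(1.2 * x + " + std::to_string(shift) + ")";
}

// Variant B: the shift is read from memory at run time, so the source
// text does not depend on its value.
std::string source_with_parameter() {
    return "v = noise(1.2 * x + load(shift_ptr))";
}

// Simulates `frames` frames with a changing shift and returns the number
// of cache misses (that is, simulated recompiles).
std::size_t count_misses(bool bake_literal, int frames) {
    ToyKernelCache cache;
    float shift = 0.0f;
    for (int i = 0; i < frames; ++i) {
        shift += 0.003f;
        cache.lookup(bake_literal ? source_with_literal(shift)
                                  : source_with_parameter());
    }
    return cache.misses;
}
```

In this model, count_misses(true, 10) misses on every frame because each new shift value changes the source text, while count_misses(false, 10) misses only on the first frame. How a run-time value is best fed into a real Enoki kernel is a separate question that this sketch does not answer.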

wjakob · 2020-06-08T08:07:00Z

If you post a small example of the problematic code, it will be easier to give feedback, by the way.

stolk · 2020-06-08T15:54:41Z

I was wrong about the lack of caching: cuda_eval() does take less time when called repeatedly with the same code and data.
It takes 170ms when the shift value changes, and 17ms when the shift value does not change.

typedef enoki::CUDAArray<int32_t>  IV;  // Int vector
typedef enoki::CUDAArray<float>    FV;  // Flt vector

And the top-level function:

void osino_computefield(void)
{
#ifdef SHIFTED
        static float shift=0.0f;
        shift += 0.003f;
#endif
        if (!field) field = new FV;
        const int mag = BLKMAG;
        const int sz = (1<<mag);
        const int msk = sz-1;
        const int cnt = sz * sz * sz;
        const float lo = 0.5f * (sz-1);
        const IV ix = enoki::arange<IV>( cnt );
        const IV zc = ix & msk;
        const IV yc = enoki::sr<mag>(ix) & msk;
        const IV xc = enoki::sr<mag+mag>(ix) & msk;
        const float s0 = 1.000f / lo;
        const float s1 = 1.003f / lo;
        const float s2 = 1.005f / lo;
        FV x = ( FV(xc) - lo ) * s0;
        FV y = ( FV(yc) - lo ) * s1;
        FV z = ( FV(zc) - lo ) * s2;

        const FV lsq_unwarped = x*x + y*y + z*z; // 0 .. 0.25
        const FV depth = 0.25f - lsq_unwarped;
        const FV warpstrength = 0.39f + enoki::max(0, depth) * 8.2f;

        const FV wx = osino_3d(11+y, 23-z, 17+x) * warpstrength;
        const FV wy = osino_3d(19-z, 13+x, 11-y) * warpstrength;
        const FV wz = osino_3d(31+x, 41-z, 61+y) * warpstrength;

        x += wx;
        y += wy;
        z += wz;

        const FV lsq = x*x + y*y + z*z;
        const FV len = enoki::sqrt(lsq);
        const FV d = 2.0f - 4.0f * len;
#ifdef SHIFTED
        const FV v = osino_3d_4o(1.2f*x+shift,1.2f*y,1.2f*z);
#else
        const FV v = osino_3d_4o(1.2f*x,1.2f*y,1.2f*z);
#endif
        *field = enoki::clamp(v + d, -1, 1);
        TT_BEGIN("cuda_eval");
        enoki::cuda_eval(); // may return before the GPU finished executing the kernel.
        TT_END  ("cuda_eval");
}

Which makes me wonder: is there a way to change some parameters of that code without triggering what I assume is a recompile?

I use it to generate Simple Noise fields, and it would be nice to generate the field with different offsets without triggering a costly operation.

Also note: the manual says that cuda_eval() may return early (asynchronously), but my profiling shows that the bulk of the time is spent in there.

cuda_eval(): launching kernel (n=2097152, in=0, out=18, ops=4274)

stolk · 2020-06-09T01:45:47Z

I see in jit.cu that the call to cuLaunchKernel() passes null for the kernelParams argument.
So how are the kernel launches parameterized?

https://github.com/mitsuba-renderer/enoki/blob/master/src/cuda/jit.cu#L1372
