Kernel reuse #93

Open
stolk opened this issue Jun 8, 2020 · 4 comments

@stolk
Contributor

stolk commented Jun 8, 2020

It's unclear to me whether CUDA kernels can ever be reused.

It seems that the CUDA code is compiled every time, even when I call the same code (with different data) every display frame. Is that right?

@wjakob
Member

wjakob commented Jun 8, 2020

They are reused if your computation results in the same assembly code. You can see this by increasing the debug level a bit and checking for "cache hit" messages. One thing to avoid is literal constants that change from iteration to iteration, since they lead to different PTX code being generated.
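To make the point about literal constants concrete, here is a toy model of a kernel cache that reuses a kernel only when the generated code text is identical. It is a hypothetical sketch, not enoki's implementation: `KernelCache`, `codegen_literal`, `codegen_param`, `count_compiles`, and the PTX-like strings are all made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy model of a JIT kernel cache (hypothetical; not enoki's actual code):
// a kernel is reused only if the generated code text matches exactly.
class KernelCache {
public:
    int compiles = 0;
    int hits = 0;

    // Returns an id for the kernel, "compiling" it only on a cache miss.
    int get(const std::string &src) {
        assert(!src.empty());
        auto it = kernels_.find(src);
        if (it != kernels_.end()) {
            ++hits;
            return it->second;
        }
        ++compiles;  // stands in for the expensive code generation + PTX compile
        int id = static_cast<int>(kernels_.size());
        kernels_.emplace(src, id);
        return id;
    }

private:
    std::unordered_map<std::string, int> kernels_;  // source text -> kernel id
};

// Variant A: the per-frame value is printed into the code as a literal,
// so every new value yields new code text (illustrative PTX-like string).
std::string codegen_literal(float shift) {
    return "add.f32 %f1, %f0, " + std::to_string(shift) + ";";
}

// Variant B: the value is read from a kernel input, so the code text is
// identical every frame and only the launch-time data differs.
std::string codegen_param() {
    return "ld.param.f32 %f2, [shift]; add.f32 %f1, %f0, %f2;";
}

// Simulates `frames` display frames with a shift that grows by 0.003 per
// frame and returns how many compiles the cache had to perform.
int count_compiles(int frames, bool bake_literal) {
    KernelCache cache;
    float shift = 0.0f;
    for (int i = 0; i < frames; ++i) {
        shift += 0.003f;
        cache.get(bake_literal ? codegen_literal(shift) : codegen_param());
    }
    return cache.compiles;
}
```

With these definitions, `count_compiles(100, true)` returns 100 because every frame's `shift` produces new code text, while `count_compiles(100, false)` returns 1: one compile, then cache hits. In this model the per-frame value would be supplied as kernel input data at launch time instead of being printed into the code.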

@wjakob
Member

wjakob commented Jun 8, 2020

If you give a small example of your problematic code, it will be easier to give feedback btw.

@stolk
Contributor Author

stolk commented Jun 8, 2020

I was wrong about the lack of caching: cuda_eval() does take less time when called repeatedly with the same code and data.
It takes 170 ms when the shift value changes, and 17 ms when the shift value does not change.

```cpp
typedef enoki::CUDAArray<int32_t>  IV;  // Int vector
typedef enoki::CUDAArray<float>    FV;  // Flt vector
```

And the top-level function:

```cpp
// field (FV*), osino_3d(), osino_3d_4o(), BLKMAG and TT_BEGIN/TT_END
// are defined elsewhere in the project.
void osino_computefield(void)
{
#ifdef SHIFTED
        static float shift=0.0f;   // changes every frame
        shift += 0.003f;
#endif
        if (!field) field = new FV;
        const int mag = BLKMAG;
        const int sz = (1<<mag);
        const int msk = sz-1;
        const int cnt = sz * sz * sz;
        const float lo = 0.5f * (sz-1);
        // Decompose a linear index into (x, y, z) grid coordinates.
        const IV ix = enoki::arange<IV>( cnt );
        const IV zc = ix & msk;
        const IV yc = enoki::sr<mag>(ix) & msk;
        const IV xc = enoki::sr<mag+mag>(ix) & msk;
        const float s0 = 1.000f / lo;
        const float s1 = 1.003f / lo;
        const float s2 = 1.005f / lo;
        FV x = ( FV(xc) - lo ) * s0;
        FV y = ( FV(yc) - lo ) * s1;
        FV z = ( FV(zc) - lo ) * s2;

        const FV lsq_unwarped = x*x + y*y + z*z; // 0 .. 0.25
        const FV depth = 0.25f - lsq_unwarped;
        const FV warpstrength = 0.39f + enoki::max(0, depth) * 8.2f;

        const FV wx = osino_3d(11+y, 23-z, 17+x) * warpstrength;
        const FV wy = osino_3d(19-z, 13+x, 11-y) * warpstrength;
        const FV wz = osino_3d(31+x, 41-z, 61+y) * warpstrength;

        x += wx;
        y += wy;
        z += wz;

        const FV lsq = x*x + y*y + z*z;
        const FV len = enoki::sqrt(lsq);
        const FV d = 2.0f - 4.0f * len;
#ifdef SHIFTED
        const FV v = osino_3d_4o(1.2f*x+shift,1.2f*y,1.2f*z);
#else
        const FV v = osino_3d_4o(1.2f*x,1.2f*y,1.2f*z);
#endif
        *field = enoki::clamp(v + d, -1, 1);
        TT_BEGIN("cuda_eval");
        enoki::cuda_eval(); // may return before the GPU finished executing the kernel.
        TT_END  ("cuda_eval");
}
```

Which makes me wonder: is there a way to change some parameters of that code without triggering what I assume is a recompile?

I use it to generate simplex noise fields, and it would be nice if I could generate the field with different offsets without triggering a costly operation.

Also note: the manual says that cuda_eval() may return early (asynchronously), but my profiling shows that the bulk of the cycles is spent inside it. How does that fit together?

cuda_eval(): launching kernel (n=2097152, in=0, out=18, ops=4274)
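One possible way to reconcile an asynchronous launch with a profile dominated by cuda_eval() is that only the kernel launch is asynchronous, while generating and compiling code on a cache miss happens on the calling thread. The following is a hypothetical toy model of that split, not a description of enoki's internals; `ToyJit`, `eval`, and `sync` are illustrative names, and the queue of callbacks merely stands in for a GPU stream.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical model (not enoki's code) of an eval() whose launch is
// asynchronous but whose code generation and compilation are not.
class ToyJit {
public:
    int compiles = 0;                          // compiles done on the caller's thread
    std::vector<std::function<void()>> queue;  // stand-in for a GPU stream

    void eval(const std::string &src, std::function<void()> kernel) {
        assert(!src.empty());
        // Synchronous part: a cache miss must be compiled before anything can
        // be launched, so the caller pays this cost inside eval().
        if (compiled_.insert(src).second)
            ++compiles;
        // Asynchronous part: the launch is only enqueued, and eval() returns
        // before the "kernel" has run.
        queue.push_back(std::move(kernel));
    }

    // Stand-in for a device synchronization: runs all queued kernels in order.
    void sync() {
        for (auto &k : queue)
            k();
        queue.clear();
    }

private:
    std::unordered_set<std::string> compiled_;  // code text already compiled
};
```

In this model, a first `eval()` with new code text increments `compiles` before returning, yet the queued callback only runs at `sync()`. A second `eval()` with the same text skips the compile entirely and only enqueues work.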

@stolk
Contributor Author

stolk commented Jun 9, 2020

I see in jit.cu that the call to cuLaunchKernel() passes null for the kernelParams argument.
So how are the kernel launches parameterized?

https://github.com/mitsuba-renderer/enoki/blob/master/src/cuda/jit.cu#L1372
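For what it's worth, the CUDA driver API's cuLaunchKernel() accepts kernel arguments in two ways: as an array of pointers in `kernelParams`, or packed into a single buffer passed through `extra` with `CU_LAUNCH_PARAM_BUFFER_POINTER` and `CU_LAUNCH_PARAM_BUFFER_SIZE`. A null `kernelParams` is therefore not by itself proof that the kernel takes no arguments; whether jit.cu uses the `extra` route is not confirmed here. The sketch below, built around a hypothetical `ArgBuffer` helper, only shows how such a packed buffer can be laid out, with each argument aligned to its own alignment requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: packs kernel arguments back to back into one byte
// buffer, placing each one at an offset that is a multiple of its alignment.
// This is the kind of buffer cuLaunchKernel can take via `extra` instead of
// `kernelParams`. No CUDA calls are made here.
class ArgBuffer {
public:
    template <typename T>
    void push(const T &value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "kernel arguments must be trivially copyable");
        const std::size_t align = alignof(T);
        // Round the current end of the buffer up to the next multiple of `align`.
        const std::size_t offset = (bytes_.size() + align - 1) / align * align;
        assert(offset % align == 0);
        bytes_.resize(offset + sizeof(T));  // padding bytes are zero-filled
        std::memcpy(bytes_.data() + offset, &value, sizeof(T));
    }

    void *data() { return bytes_.data(); }
    std::size_t size() const { return bytes_.size(); }

private:
    std::vector<unsigned char> bytes_;
};
```

A launch using this route would pass `nullptr` for `kernelParams` and an `extra` array such as `{CU_LAUNCH_PARAM_BUFFER_POINTER, buf.data(), CU_LAUNCH_PARAM_BUFFER_SIZE, &size, CU_LAUNCH_PARAM_END}`, where `size` is a `size_t` holding `buf.size()`. The application is then responsible for getting each argument's size, alignment, and padding right.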
