diff --git a/README.md b/README.md
index 3be3c1ec..eb84a8d7 100644
--- a/README.md
+++ b/README.md
@@ -30,10 +30,11 @@ within-method parallelism, but the solution of the same structure (same `f`) is
 different choices of `u0` or `p`. For these cases, DiffEqGPU exports the following ensemble algorithms:
 
 - `EnsembleGPUArray`: Utilizes the CuArray setup to parallelize ODE solves across the GPU.
+- `EnsembleGPUKernel`: Utilizes the GPU kernels to parallelize each ODE solve with their separate ODE integrator on each kernel. 
 - `EnsembleCPUArray`: A test version for analyzing the overhead of the array-based parallelism setup.
 
 For more information on using the ensemble interface, see
-[the DiffEqDocs page on ensembles](http://docs.juliadiffeq.org/dev/features/ensemble)
+[the DiffEqDocs page on ensembles](https://docs.sciml.ai/dev/modules/DiffEqDocs/features/ensemble/)
 
 For example, the following solves the Lorenz equation with 10,000 separate random parameters on the GPU:
 
@@ -54,6 +55,41 @@ monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
 @time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
 ```
 
+### EnsembleGPUKernel
+
+The `EnsembleGPUKernel` requires a specialized ODE algorithm which is written on the GPU kernel. These implementations are part of DiffEqGPU. These implementations do not allow mutation of arrays, hence use out-of-place (OOP) `ODEProblem`.
+
+#### Support
+
+- `Tsit5`: The kernelized version can be called using `GPUTsit5()` with the `EnsembleProblem`. 
+
+Taking the example above, we simulate the lorenz equation:
+
+```julia
+using DiffEqGPU, OrdinaryDiffEq, StaticArrays
+
+function lorenz(u, p, t)
+    σ = p[1]
+    ρ = p[2]
+    β = p[3]
+    du1 = σ * (u[2] - u[1])
+    du2 = u[1] * (ρ - u[3]) - u[2]
+    du3 = u[1] * u[2] - β * u[3]
+    return SVector{3}(du1, du2, du3)
+end
+
+u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
+tspan = (0.0f0, 10.0f0)
+p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
+prob = ODEProblem{false}(lorenz, u0, tspan, p)
+prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)).*p)
+monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
+
+@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000, adaptive = false, dt = 0.1f0)
+
+@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000, adaptive = true, dt = 0.1f0, save_everystep = false)
+```
+
 #### Current Support
 
 Automated GPU parameter parallelism support is continuing to be improved, so there are currently a few limitations.
diff --git a/test/gpu_tsit5_tests.jl b/test/gpu_tsit5_tests.jl
index 53efd3c0..8c3aa98e 100644
--- a/test/gpu_tsit5_tests.jl
+++ b/test/gpu_tsit5_tests.jl
@@ -32,3 +32,13 @@ bench_asol = solve(prob, Tsit5(), dt = 0.1f-1, save_everystep = false, abstol =
 
 @show norm(bench_sol.u[end] - sol[1].u[end]) < 2e-2
 @show norm(bench_asol.u[end] - asol[1].u[end]) < 1e-4
+
+## With random parameters
+
+prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
+monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
+
+sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10,
+            adaptive = false, dt = 0.1f0)
+asol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10,
+             adaptive = true, dt = 0.1f-1, abstol = 1.0f-8, reltol = 1.0f-5)