Lux case working with Reactant but not with CUDA #2244

Open

yolhan83 opened this issue Dec 31, 2024 · 1 comment

Hello, this is a case I just tested while benchmarking: the gradient calculation goes nicely with Reactant+Enzyme but not with CUDA+Enzyme. Not sure where to post this, I hope it's ok to put it here.

using Lux,Random
using Reactant
using CUDA,LuxCUDA
using Enzyme

Reactant.set_default_backend("gpu")

const dev = xla_device()
const dev_test = gpu_device() 
const rng = MersenneTwister(1234)

model = Lux.Chain(
    Conv((3,3),1=>3,tanh,pad = SamePad()),
    MaxPool((2,2)), # (14,14,3,N)
    MaxPool((2,2)), # (7,7,3,N)
    MaxPool((2,2)), # (3,3,3,N)
    Lux.FlattenLayer(),
    Dense(3*3*3=>10,tanh),
    Dense(10=>10)
)
psn,stn = Lux.setup(rng,model);
pst,stt = (psn,stn)  |> dev_test;
ps,st = (psn,stn)  |> dev;
function loss(model,ps,st,x,y)
    m,_ = model(x,ps,st)
    return Lux.MSELoss()(m,y)
end

xn = rand(Float32,28,28,1,1000) ;
yn = rand(Float32,10,1000);

xt = xn |> dev_test;
yt = yn |> dev_test;

x = xn |> dev;
y = yn |> dev;

L = Reactant.@compile loss(model,ps,st,x,y);

L(model,ps,st,x,y) # works
loss(model,pst,stt,xt,yt) # works

function get_grad(model,ps,st,x,y)
    dps = Enzyme.make_zero(ps)
    Enzyme.autodiff(
        Enzyme.Reverse,
        loss,
        Const(model),
        Duplicated(ps,dps),
        Const(st),
        Const(x),
        Const(y)
        )
    return dps
end;
G = Reactant.@compile get_grad(model,ps,st,x,y);

G(model,ps,st,x,y) # works 
get_grad(model,pst,stt,xt,yt); # does not work 

and here is the error,

No create nofree of empty function (jl_gc_safe_enter) jl_gc_safe_enter)
 at context:   call fastcc void @julia__launch_configuration_979_80290([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %9, i64 noundef signext 0, { i64, {} addrspace(10)* } addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %192) #837, !dbg !1983 (julia__launch_configuration_979_80290)

Stacktrace:
 [1] launch_configuration
   @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/occupancy.jl:56
 [2] #launch_heuristic#1200
   @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:22
 [3] launch_heuristic
   @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:15
 [4] _copyto!
   @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:78
 [5] materialize!
   @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:38
 [6] materialize!
   @ ./broadcast.jl:911
 [7] broadcast!
   @ ./broadcast.jl:880


Stacktrace:
  [1] launch_configuration
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/occupancy.jl:56 [inlined]
  [2] #launch_heuristic#1200
    @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:22 [inlined]
  [3] launch_heuristic
    @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:15 [inlined]
  [4] _copyto!
    @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:78 [inlined]
  [5] materialize!
    @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:38 [inlined]
  [6] materialize!
    @ ./broadcast.jl:911 [inlined]
  [7] broadcast!
    @ ./broadcast.jl:880
  [8] bias_activation!
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/bias_activation.jl:178 [inlined]
  [9] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:114 [inlined]
 [10] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:0 [inlined]
 [11] augmented_julia_conv_bias_act_78596_inner_1wrap
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:0
 [12] macro expansion
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:5317 [inlined]
 [13] enzyme_call
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4863 [inlined]
 [14] AugmentedForwardThunk
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4799 [inlined]
 [15] runtime_generic_augfwd(activity::Type{Val{(false, false, false, true, false, true, false)}}, runtimeActivity::Val{false}, width::Val{1}, ModifiedBetween::Val{(true, true, true, true, true, true, true)}, RT::Val{@NamedTuple{1, 2, 3}}, f::typeof(LuxLib.Impl.conv_bias_act), df::Nothing, primal_1::Type{Nothing}, shadow_1_1::Nothing, primal_2::CuArray{Float32, 4, CUDA.DeviceMemory}, shadow_2_1::Nothing, primal_3::CuArray{Float32, 4, CUDA.DeviceMemory}, shadow_3_1::CuArray{Float32, 4, CUDA.DeviceMemory}, primal_4::DenseConvDims{2, 2, 2, 4, 2}, shadow_4_1::Nothing, primal_5::CuArray{Float32, 1, CUDA.DeviceMemory}, shadow_5_1::CuArray{Float32, 1, CUDA.DeviceMemory}, primal_6::typeof(tanh_fast), shadow_6_1::Nothing)
    @ Enzyme.Compiler ~/.julia/packages/Enzyme/DiEvV/src/rules/jitrules.jl:480
 [16] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:126 [inlined]
 [17] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:107 [inlined]
 [18] fused_conv
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:148 [inlined]
 [19] fused_conv
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:134 [inlined]
 [20] fused_conv_bias_activation
    @ ~/.julia/packages/LuxLib/TbynI/src/api/conv.jl:33 [inlined]
 [21] Conv
    @ ~/.julia/packages/Lux/fMnM0/src/layers/conv.jl:204 [inlined]
 [22] apply
    @ ~/.julia/packages/LuxCore/GlbG3/src/LuxCore.jl:155 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/Lux/fMnM0/src/layers/containers.jl:0 [inlined]
 [24] applychain
    @ ~/.julia/packages/Lux/fMnM0/src/layers/containers.jl:482
 [25] Chain
    @ ~/.julia/packages/Lux/fMnM0/src/layers/containers.jl:480 [inlined]
 [26] loss
    @ ./REPL[325]:2 [inlined]
 [27] loss
    @ ./REPL[325]:0 [inlined]
 [28] diffejulia_loss_34822_inner_1wrap
    @ ./REPL[325]:0
 [29] macro expansion
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:5317 [inlined]
 [30] enzyme_call
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4863 [inlined]
 [31] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4735 [inlined]
 [32] autodiff
    @ ~/.julia/packages/Enzyme/DiEvV/src/Enzyme.jl:503 [inlined]
 [33] autodiff
    @ ~/.julia/packages/Enzyme/DiEvV/src/Enzyme.jl:544 [inlined]
 [34] autodiff
    @ ~/.julia/packages/Enzyme/DiEvV/src/Enzyme.jl:516 [inlined]
 [35] get_grad(model::Chain{@NamedTuple{layer_1::Conv{typeof(tanh), Int64, Int64, Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}, Int64, Nothing, Nothing, Static.True, Static.False}, layer_2::MaxPool{Lux.PoolingLayer{Lux.GenericPoolMode{Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}}, Lux.MaxPoolOp}}, layer_3::MaxPool{Lux.PoolingLayer{Lux.GenericPoolMode{Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}}, Lux.MaxPoolOp}}, layer_4::MaxPool{Lux.PoolingLayer{Lux.GenericPoolMode{Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}}, Lux.MaxPoolOp}}, layer_5::FlattenLayer{Nothing}, layer_6::Dense{typeof(tanh), Int64, Int64, Nothing, Nothing, Static.True}, layer_7::Dense{typeof(identity), Int64, Int64, Nothing, Nothing, Static.True}}, Nothing}, ps::@NamedTuple{layer_1::@NamedTuple{weight::CuArray{Float32, 4, CUDA.DeviceMemory}, bias::CuArray{Float32, 1, CUDA.DeviceMemory}}, layer_2::@NamedTuple{}, layer_3::@NamedTuple{}, layer_4::@NamedTuple{}, layer_5::@NamedTuple{}, layer_6::@NamedTuple{weight::CuArray{Float32, 2, CUDA.DeviceMemory}, bias::CuArray{Float32, 1, CUDA.DeviceMemory}}, layer_7::@NamedTuple{weight::CuArray{Float32, 2, CUDA.DeviceMemory}, bias::CuArray{Float32, 1, CUDA.DeviceMemory}}}, st::@NamedTuple{layer_1::@NamedTuple{}, layer_2::@NamedTuple{}, layer_3::@NamedTuple{}, layer_4::@NamedTuple{}, layer_5::@NamedTuple{}, layer_6::@NamedTuple{}, layer_7::@NamedTuple{}}, x::CuArray{Float32, 4, CUDA.DeviceMemory}, y::CuArray{Float32, 2, CUDA.DeviceMemory})        
    @ Main ./REPL[335]:3
 [36] top-level scope
    @ REPL[345]:1
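
The trace bottoms out in the broadcast launched by LuxLib's bias_activation! inside conv_bias_act (frames [8]-[9] above). A minimal sketch that isolates just that pattern, Enzyme reverse mode over a plain CuArray broadcast, could look like this (hypothetical, not part of the original report, and it may or may not hit the same error on these versions):

using CUDA, Enzyme

# Hypothetical reduction: reverse-mode AD over a broadcast on a CuArray,
# which goes through GPUArrays' launch_heuristic like the frames above.
f(x) = sum(tanh.(x))

x  = CUDA.rand(Float32, 16)
dx = Enzyme.make_zero(x)
Enzyme.autodiff(Enzyme.Reverse, f, Active, Duplicated(x, dx))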

and my versions:
Julia:

Julia Version 1.10.7
Commit 4976d05258e (2024-11-26 15:57 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)

CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.4
NVIDIA driver 552.12.0

CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.73.1

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.10.7
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89, 1.565 GiB / 7.996 GiB available)

pkg:

  [052768ef] CUDA v5.5.2
  [7da242da] Enzyme v0.13.26
  [b2108857] Lux v1.4.3
  [d0bbae9a] LuxCUDA v0.3.3
  [3c362404] Reactant v0.2.12
@yolhan83 (Author)

Oh, I think I saw some broadcast-related issues with CUDA before; is it this again?
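
As a side note, a possible cross-check on the CUDA path would be to take the same gradient with Zygote instead of Enzyme. This is hypothetical and not part of the original report; it assumes Zygote.jl is installed and reuses loss, model, pst, stt, xt and yt from the snippet above:

using Zygote

# Same loss on the same CUDA arrays, differentiated with Zygote instead of Enzyme.
grads = Zygote.gradient(p -> loss(model, p, stt, xt, yt), pst)[1]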
