Lux case working with Reactant but not with CUDA #2244

Open

yolhan83 opened this issue Dec 31, 2024 · 1 comment

Hello, this is a case I just tested while benchmarking: the gradient calculation goes nicely with Reactant+Enzyme but not with CUDA+Enzyme. Not sure where to post this, I hope it's ok to put it here.

using Lux,Random
using Reactant
using CUDA,LuxCUDA
using Enzyme

Reactant.set_default_backend("gpu")

const dev = xla_device()
const dev_test = gpu_device() 
const rng = MersenneTwister(1234)

model = Lux.Chain(
    Conv((3,3),1=>3,tanh,pad = SamePad()),
    MaxPool((2,2)), # (14,14,3,N)
    MaxPool((2,2)), # (7,7,3,N)
    MaxPool((2,2)), # (3,3,3,N)
    Lux.FlattenLayer(),
    Dense(3*3*3=>10,tanh),
    Dense(10=>10)
)
psn,stn = Lux.setup(rng,model);
pst,stt = (psn,stn)  |> dev_test;
ps,st = (psn,stn)  |> dev;
function loss(model,ps,st,x,y)
    m,_ = model(x,ps,st)
    return Lux.MSELoss()(m,y)
end

xn = rand(Float32,28,28,1,1000) ;
yn = rand(Float32,10,1000);

xt = xn |> dev_test;
yt = yn |> dev_test;

x = xn |> dev;
y = yn |> dev;

L = Reactant.@compile loss(model,ps,st,x,y);

L(model,ps,st,x,y) # works
loss(model,pst,stt,xt,yt) # works

function get_grad(model,ps,st,x,y)
    dps = Enzyme.make_zero(ps)
    Enzyme.autodiff(
        Enzyme.Reverse,
        loss,
        Const(model),
        Duplicated(ps,dps),
        Const(st),
        Const(x),
        Const(y)
        )
    return dps
end;
G = Reactant.@compile get_grad(model,ps,st,x,y);

G(model,ps,st,x,y) # works 
get_grad(model,pst,stt,xt,yt); # does not work 

and here is the error,

No create nofree of empty function (jl_gc_safe_enter) jl_gc_safe_enter)
 at context:   call fastcc void @julia__launch_configuration_979_80290([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %9, i64 noundef signext 0, { i64, {} addrspace(10)* } addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %192) #837, !dbg !1983 (julia__launch_configuration_979_80290)

Stacktrace:
 [1] launch_configuration
   @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/occupancy.jl:56
 [2] #launch_heuristic#1200
   @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:22
 [3] launch_heuristic
   @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:15
 [4] _copyto!
   @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:78
 [5] materialize!
   @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:38
 [6] materialize!
   @ ./broadcast.jl:911
 [7] broadcast!
   @ ./broadcast.jl:880


Stacktrace:
  [1] launch_configuration
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/occupancy.jl:56 [inlined]
  [2] #launch_heuristic#1200
    @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:22 [inlined]
  [3] launch_heuristic
    @ ~/.julia/packages/CUDA/2kjXI/src/gpuarrays.jl:15 [inlined]
  [4] _copyto!
    @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:78 [inlined]
  [5] materialize!
    @ ~/.julia/packages/GPUArrays/qt4ax/src/host/broadcast.jl:38 [inlined]
  [6] materialize!
    @ ./broadcast.jl:911 [inlined]
  [7] broadcast!
    @ ./broadcast.jl:880
  [8] bias_activation!
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/bias_activation.jl:178 [inlined]
  [9] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:114 [inlined]
 [10] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:0 [inlined]
 [11] augmented_julia_conv_bias_act_78596_inner_1wrap
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:0
 [12] macro expansion
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:5317 [inlined]
 [13] enzyme_call
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4863 [inlined]
 [14] AugmentedForwardThunk
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4799 [inlined]
 [15] runtime_generic_augfwd(activity::Type{Val{(false, false, false, true, false, true, false)}}, runtimeActivity::Val{false}, width::Val{1}, ModifiedBetween::Val{(true, true, true, true, true, true, true)}, RT::Val{@NamedTuple{1, 2, 3}}, f::typeof(LuxLib.Impl.conv_bias_act), df::Nothing, primal_1::Type{Nothing}, shadow_1_1::Nothing, primal_2::CuArray{Float32, 4, CUDA.DeviceMemory}, shadow_2_1::Nothing, primal_3::CuArray{Float32, 4, CUDA.DeviceMemory}, shadow_3_1::CuArray{Float32, 4, CUDA.DeviceMemory}, primal_4::DenseConvDims{2, 2, 2, 4, 2}, shadow_4_1::Nothing, primal_5::CuArray{Float32, 1, CUDA.DeviceMemory}, shadow_5_1::CuArray{Float32, 1, CUDA.DeviceMemory}, primal_6::typeof(tanh_fast), shadow_6_1::Nothing)
    @ Enzyme.Compiler ~/.julia/packages/Enzyme/DiEvV/src/rules/jitrules.jl:480
 [16] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:126 [inlined]
 [17] conv_bias_act
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:107 [inlined]
 [18] fused_conv
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:148 [inlined]
 [19] fused_conv
    @ ~/.julia/packages/LuxLib/TbynI/src/impl/conv.jl:134 [inlined]
 [20] fused_conv_bias_activation
    @ ~/.julia/packages/LuxLib/TbynI/src/api/conv.jl:33 [inlined]
 [21] Conv
    @ ~/.julia/packages/Lux/fMnM0/src/layers/conv.jl:204 [inlined]
 [22] apply
    @ ~/.julia/packages/LuxCore/GlbG3/src/LuxCore.jl:155 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/Lux/fMnM0/src/layers/containers.jl:0 [inlined]
 [24] applychain
    @ ~/.julia/packages/Lux/fMnM0/src/layers/containers.jl:482
 [25] Chain
    @ ~/.julia/packages/Lux/fMnM0/src/layers/containers.jl:480 [inlined]
 [26] loss
    @ ./REPL[325]:2 [inlined]
 [27] loss
    @ ./REPL[325]:0 [inlined]
 [28] diffejulia_loss_34822_inner_1wrap
    @ ./REPL[325]:0
 [29] macro expansion
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:5317 [inlined]
 [30] enzyme_call
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4863 [inlined]
 [31] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/DiEvV/src/compiler.jl:4735 [inlined]
 [32] autodiff
    @ ~/.julia/packages/Enzyme/DiEvV/src/Enzyme.jl:503 [inlined]
 [33] autodiff
    @ ~/.julia/packages/Enzyme/DiEvV/src/Enzyme.jl:544 [inlined]
 [34] autodiff
    @ ~/.julia/packages/Enzyme/DiEvV/src/Enzyme.jl:516 [inlined]
 [35] get_grad(model::Chain{@NamedTuple{layer_1::Conv{typeof(tanh), Int64, Int64, Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}, Int64, Nothing, Nothing, Static.True, Static.False}, layer_2::MaxPool{Lux.PoolingLayer{Lux.GenericPoolMode{Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}}, Lux.MaxPoolOp}}, layer_3::MaxPool{Lux.PoolingLayer{Lux.GenericPoolMode{Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}}, Lux.MaxPoolOp}}, layer_4::MaxPool{Lux.PoolingLayer{Lux.GenericPoolMode{Tuple{Int64, Int64}, Tuple{Int64, Int64}, NTuple{4, Int64}, Tuple{Int64, Int64}}, Lux.MaxPoolOp}}, layer_5::FlattenLayer{Nothing}, layer_6::Dense{typeof(tanh), Int64, Int64, Nothing, Nothing, Static.True}, layer_7::Dense{typeof(identity), Int64, Int64, Nothing, Nothing, Static.True}}, Nothing}, ps::@NamedTuple{layer_1::@NamedTuple{weight::CuArray{Float32, 4, CUDA.DeviceMemory}, bias::CuArray{Float32, 1, CUDA.DeviceMemory}}, layer_2::@NamedTuple{}, layer_3::@NamedTuple{}, layer_4::@NamedTuple{}, layer_5::@NamedTuple{}, layer_6::@NamedTuple{weight::CuArray{Float32, 2, CUDA.DeviceMemory}, bias::CuArray{Float32, 1, CUDA.DeviceMemory}}, layer_7::@NamedTuple{weight::CuArray{Float32, 2, CUDA.DeviceMemory}, bias::CuArray{Float32, 1, CUDA.DeviceMemory}}}, st::@NamedTuple{layer_1::@NamedTuple{}, layer_2::@NamedTuple{}, layer_3::@NamedTuple{}, layer_4::@NamedTuple{}, layer_5::@NamedTuple{}, layer_6::@NamedTuple{}, layer_7::@NamedTuple{}}, x::CuArray{Float32, 4, CUDA.DeviceMemory}, y::CuArray{Float32, 2, CUDA.DeviceMemory})        
    @ Main ./REPL[335]:3
 [36] top-level scope
    @ REPL[345]:1
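
The trace bottoms out in the broadcast launched by LuxLib's bias_activation! inside conv_bias_act (frames [8]-[9] above). A minimal sketch that isolates just that pattern, Enzyme reverse mode over a plain CuArray broadcast, could look like this (hypothetical, not part of the original report, and it may or may not hit the same error on these versions):

using CUDA, Enzyme

# Hypothetical reduction: reverse-mode AD over a broadcast on a CuArray,
# which goes through GPUArrays' launch_heuristic like the frames above.
f(x) = sum(tanh.(x))

x  = CUDA.rand(Float32, 16)
dx = Enzyme.make_zero(x)
Enzyme.autodiff(Enzyme.Reverse, f, Active, Duplicated(x, dx))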

and my versions:
Julia:

Julia Version 1.10.7
Commit 4976d05258e (2024-11-26 15:57 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)

CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.4
NVIDIA driver 552.12.0

CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.73.1

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.10.7
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89, 1.565 GiB / 7.996 GiB available)

pkg:

  [052768ef] CUDA v5.5.2
  [7da242da] Enzyme v0.13.26
  [b2108857] Lux v1.4.3
  [d0bbae9a] LuxCUDA v0.3.3
  [3c362404] Reactant v0.2.12
@yolhan83 (Author)

Oh, I think I saw some broadcast-related issues with CUDA before; is it this again?
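
As a side note, a possible cross-check on the CUDA path would be to take the same gradient with Zygote instead of Enzyme. This is hypothetical and not part of the original report; it assumes Zygote.jl is installed and reuses loss, model, pst, stt, xt and yt from the snippet above:

using Zygote

# Same loss on the same CUDA arrays, differentiated with Zygote instead of Enzyme.
grads = Zygote.gradient(p -> loss(model, p, stt, xt, yt), pst)[1]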
