I'm implementing a tensor product operation in Halide that involves gathering inputs and scattering the final output on a GPU. I'm aiming to optimize shared memory usage for better performance, but I'm encountering some challenges.
Objective:
I want to achieve the following optimizations on the GPU:
- Accumulate the `product` in a 4x1 (`p0` x `m0`) register block.
- Load `gather_weight` into shared memory at the outer reduction loop (`c1`) in `product`.
Inside `c1`, `gather_weight` requires `m1(32) x c0(16) = 512` elements.
Per GPU block, there are `m1(32) x p1(32)` GPU threads. Since `c0` is independent of the `p1` dimension (the y-axis of GPU threads), `gather_weight` can be reused across threads if we load it into shared memory.
Ideal Scenario:
- Shared memory allocation: allocate only `m1(32) x c0(16) = 512` elements.
- Data loading: use only a subset of the GPU threads, such as `m1(32) x (p1/2)(16)`, to load `gather_weight` into shared memory.
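For concreteness, here is the footprint arithmetic of this ideal scenario as a small Python check (tile extents taken from the description above; float32 weights assumed):

```python
# Footprint of the ideal staging scheme: one c1 slice of gather_weight at a
# time, loaded cooperatively by half of the thread block.
m1, p1, c0 = 32, 32, 16
ideal_elems = m1 * c0            # elements staged per c1 iteration
ideal_bytes = ideal_elems * 4    # float32
loader_threads = m1 * (p1 // 2)  # subset of threads doing the load
print(ideal_elems, ideal_bytes, loader_threads)  # 512 2048 512
```

With 512 loader threads and 512 elements, a single pass per `c1` iteration suffices.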
Issue Encountered:
Current Observation:
- The actual shared memory allocation is significantly larger than expected (e.g., 16384 elements).
- Only the x-axis of GPU threads is used for loading `gather_weight` into shared memory.
Hypothesis:
Halide might not recognize that `gather_weight` can be reused across the independent thread variable `p1`, leading to a larger shared memory allocation (`m1 x c0 x p1`).
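The observed number is consistent with this hypothesis, as a quick Python check shows (extents are the ones stated above):

```python
# If Halide cannot infer reuse across p1, the staging buffer scales with p1.
m1, p1, c0 = 32, 32, 16
print(m1 * c0)       # 512: the slice actually needed per c1 iteration
print(m1 * c0 * p1)  # 16384: matches the observed allocation
```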
Attempted Solution:
I tried adjusting the schedule to compute `gather_weight` at `output` instead of at `product`. The resulting loop nest and conceptual Stmt:
```
produce output:
  gpu_block o<Default_GPU>:
    gpu_thread m<Default_GPU>:
      output(...) = ...
  for r39:
    gpu_block r39.r39<Default_GPU>:
      gpu_block m.m<Default_GPU>:
        produce GatherWeight:
          for c.wc1:
            gpu_thread c.wc0 in [0, 15]<Default_GPU>:
              gpu_thread m.m0 in [0, 31]<Default_GPU>:
                GatherWeight(...) = ...
        consume GatherWeight:
          gpu_thread r39.p1.p1 in [0, 31]<Default_GPU>:
            gpu_thread m.m1.m1 in [0, 31]<Default_GPU>:
              produce Product:
                for p:
                  Product(...) = ...
                for r28.c1:
                  for p:
                    for r28.c0 in [0, 15]:
                      Product(...) = ...
              consume Product:
                for r39.p1.p0 in [0, 3]:
                  output(...) = ...
```
```
let t160 = (maxPos + 127)/128
let t161 = (output.extent.0 + 31)/32
let t163 = output.min.1*output.stride.1
let t162 = (input.min.1*input.stride.1) + input.min.0
for (output.s1.r39$x, 0, weight.extent.2) {
 let t166 = ((output.s1.r39$x - omap.min.1)*omap.stride.1) - omap.min.0
 let t165 = ((output.s1.r39$x - imap.min.1)*imap.stride.1) - imap.min.0
 let t164 = (output.s1.r39$x*weight.stride.2) + output.min.0
 gpu_block<CUDA> (output.s1.r39$y.r39$y.block_id_y, 0, t160) {
  gpu_block<CUDA> (output.s1.m.m.block_id_x, 0, t161) {
   allocate GatherWeight.0[float32 * 2048] in GPUShared
   gpu_thread<CUDA> (.thread_id_y, 0, 32) {
    gpu_thread<CUDA> (.thread_id_x, 0, 32) {
     allocate Product.0[float32 * 4] in Register
     if (.thread_id_y < 16) {
      produce GatherWeight {
       let t143.s = (output.s1.m.m.block_id_x*32) + t164
       let t167 = .thread_id_x + t143.s
       for (GatherWeight.s0.c.wc1, 0, 4) {
        let t158 = (GatherWeight.s0.c.wc1*16) + .thread_id_y
        GatherWeight.0[(t158*32) + .thread_id_x] = weight[(t158*weight.stride.1) + t167]
       }
      }
     }
     gpu_thread_barrier(2)
     consume GatherWeight {
      produce Product {
       let Product.s0.p.loop_extent.s = (maxPos - (output.s1.r39$y.r39$y.block_id_y*128)) - (.thread_id_y*4)
       let t168 = min(Product.s0.p.loop_extent.s, 4)
       for (Product.s0.p.rebased, 0, t168) {
        Product.0[Product.s0.p.rebased] = 0.000000f
       }
       let t169 = min(Product.s0.p.loop_extent.s, 4)
       let t170 = (((output.s1.r39$y.r39$y.block_id_y*32) + .thread_id_y)*4) + t165
       for (Product.s1.r28$x.c1, 0, 4) {
        let t148 = (Product.s1.r28$x.c1*16) - t162
        let t171 = Product.s1.r28$x.c1*16
        for (Product.s1.p.rebased, 0, t169) {
         let t151 = Product.s1.p.rebased + t170
         for (Product.s1.r28$x.c0, 0, 16) {
          Product.0[Product.s1.p.rebased] = Product.0[Product.s1.p.rebased] + (input[((imap[t151]*input.stride.1) + t148) + Product.s1.r28$x.c0]*GatherWeight.0[((Product.s1.r28$x.c0 + t171)*32) + .thread_id_x])
         }
        }
       }
      }
      consume Product {
       let output.s1.r39$y.p1.p0.epilogue.s = maxPos - (((output.s1.r39$y.r39$y.block_id_y*32) + .thread_id_y)*4)
       let t154.s = (output.s1.m.m.block_id_x*32) - t163
       let t172 = max(min(output.s1.r39$y.p1.p0.epilogue.s, 4), 0)
       let t174 = (((output.s1.r39$y.r39$y.block_id_y*32) + .thread_id_y)*4) + t166
       let t173 = .thread_id_x + t154.s
       for (output.s1.r39$y.p1.p0, 0, t172) {
        let t111 = (omap[output.s1.r39$y.p1.p0 + t174]*output.stride.1) + t173
        let t112 = Product.0[output.s1.r39$y.p1.p0]
        atomic (output) {
         output[t111] = output[t111] + t112
        }
       }
      }
      free Product.0
     }
    }
   }
   free GatherWeight.0
  }
 }
}
}
}
```
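As a sanity check of the register-blocked accumulation in the Stmt (each thread keeps `Product.0[float32 * 4]` in registers across the whole reduction and only scatters afterwards), here is a toy Python model; the extents and the name `thread_body` are illustrative, not the generator code:

```python
# Toy model of one thread's work: P0 (4) partial sums live in a local array
# (the 4x1 register block) across the whole reduction of extent C.
P0, C = 4, 64  # C assumed as c1 * c0 = 4 * 16

def thread_body(inp, wgt):
    acc = [0.0] * P0                  # the register block (Product.0)
    for c in range(C):                # reduction never leaves registers
        for p in range(P0):
            acc[p] += inp[p][c] * wgt[c]
    return acc                        # atomically added to output afterwards

print(thread_body([[1.0] * C for _ in range(P0)], [2.0] * C))  # [128.0, 128.0, 128.0, 128.0]
```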
Result:
- The loop nest now appears closer to the desired structure.
- Only half of `thread_id_y` is involved in loading data into shared memory.
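This can be verified with a small simulation of the staging loop from the Stmt (the index expression `(t158*32) + thread_id_x` is copied from it): the 16 x 32 loader threads, each iterating `wc1` four times, cover the 2048-element buffer exactly once.

```python
# Simulate the GatherWeight.0 writes: loader threads are tid_y in [0, 16)
# (half of thread_id_y) x tid_x in [0, 32), with wc1 in [0, 4).
written = [0] * 2048
for tid_y in range(16):
    for tid_x in range(32):
        for wc1 in range(4):
            t158 = wc1 * 16 + tid_y
            written[t158 * 32 + tid_x] += 1
print(all(w == 1 for w in written))  # True: full, duplicate-free coverage
```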
Remaining Issue:
- The shared memory allocation still covers an entire `m0 x c0 x c1` block, which is larger than necessary (`m0 x c0`).
- Ideally, the shared-memory load would happen inside the loop over `c1`, loading only a block of `m0 x c0 = 512` elements.
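The desired behavior can be modeled in plain Python (not Halide): refill a single `m0 x c0` buffer each `c1` iteration and check that it computes the same reduction as staging the whole `m0 x c0 x c1` block; sizes match the tiles above.

```python
# Per-c1 staging (512-element buffer, refilled each iteration) vs staging the
# whole 2048-element weight tile up front: same result, 4x less shared memory.
import random
random.seed(0)
M, C0, C1 = 32, 16, 4
weights = [[random.random() for _ in range(C0 * C1)] for _ in range(M)]
xs = [random.random() for _ in range(C0 * C1)]

# Whole-reduction staging (the current behavior): one M x (C0*C1) buffer.
full = [sum(weights[m][c] * xs[c] for c in range(C0 * C1)) for m in range(M)]

# Per-c1 staging: one M x C0 buffer, reused across c1 iterations.
acc = [0.0] * M
for c1 in range(C1):
    smem = [[weights[m][c1 * C0 + c0] for c0 in range(C0)] for m in range(M)]
    for m in range(M):
        for c0 in range(C0):
            acc[m] += smem[m][c0] * xs[c1 * C0 + c0]

print(all(abs(a - b) < 1e-9 for a, b in zip(acc, full)))  # True
```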
Is there a way to adjust the Halide schedule to achieve this shared memory usage? I think `.compute_at(product, c1)` is necessary at some point, but I don't know how to bring these shared-memory loads inside `c1` while meeting the requirements above. I feel I'm almost there. Or is this the kind of loop nest Halide wasn't designed for?