[BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned at 64, but MinWorkspaceAlignment =16 #2042

ankutalev · 2025-01-16T12:40:53Z

Describe the bug
See title - I expected GroupedGemm works, when workspace pointer 16-bits aligned, but it fails with Got bad cuda status: misaligned address at line: 596 for 16 and 32 alignments.

Steps/Code to reproduce bug
You can apply following patch to example for reproducing

diff --git a/hopper_grouped.cu b/orig_hopper_groped.cu
index a927a2b..f578b85 100644
--- a/hopper_grouped.cu
+++ b/orig_hopper_groped.cu
@@ -664,13 +664,13 @@ int run(Options &options, bool host_problem_shapes_available = true)
   size_t workspace_size = GemmT::get_workspace_size(arguments);
 
   // Allocate workspace memory
-  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size * 2);
+  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
 
   // Check if the problem size is supported or not
   CUTLASS_CHECK(gemm.can_implement(arguments));
 
   // Initialize CUTLASS kernel with arguments and workspace pointer
-  CUTLASS_CHECK(gemm.initialize(arguments, workspace.get() + 32));
+  CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
 
   // Correctness / Warmup iteration
   CUTLASS_CHECK(gemm.run());

You can change 32 to 16 and still see fail. 64/128/256 alignments works fine. Note that workspace_size * 2 here is only to not get out of bounds with additional offset.

Expected behavior
GEMM must work, because minimumWorkspaceAlignment set to 16. Link

Environment details (please complete the following information):

Bare metal, CUDA Version is 12.6

Thanks!

The text was updated successfully, but these errors were encountered:

ankutalev · 2025-01-20T13:41:22Z

cuda-gdb complains for this example built with -g -G:

Thread 1 "hopper_grouped" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4, block (4,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x00007ffdbd309940 in cute::tma_descriptor_replace_dims_strides_in_shared_mem (smem_desc=..., prob_shape=..., prob_stride=...)
    at cutlass/include/cute/arch/copy_sm90_desc.hpp:325
325         :: "l"(smem_int64_desc), "r"(prob_shape[0]));

...
at cutlass/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp:719
    cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormap.smem_tensormap_A,
                                                            prob_shape_A,
                                                            prob_stride_A);

even without original changes, i.e. example is broken

ankutalev · 2025-01-24T13:21:44Z

@thakkarV can you take a look?
Thanks!

thakkarV · 2025-01-24T14:19:26Z

@ANIKET-SHIVAM CC

ankutalev added ? - Needs Triage bug Something isn't working labels Jan 16, 2025

ankutalev changed the title ~~[BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned for 64, but MinWorkspaceAlignment =16~~ [BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned at 64, but MinWorkspaceAlignment =16 Jan 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned at 64, but MinWorkspaceAlignment =16 #2042

[BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned at 64, but MinWorkspaceAlignment =16 #2042

ankutalev commented Jan 16, 2025

ankutalev commented Jan 20, 2025

ankutalev commented Jan 24, 2025

thakkarV commented Jan 24, 2025

[BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned at 64, but MinWorkspaceAlignment =16 #2042

[BUG][QST] Hopper Grouped GEMM Fails When Workspace not aligned at 64, but MinWorkspaceAlignment =16 #2042

Comments

ankutalev commented Jan 16, 2025

ankutalev commented Jan 20, 2025

ankutalev commented Jan 24, 2025

thakkarV commented Jan 24, 2025