The current global-to-shared load uses a fixed 16x16 base tile to match TensorCore's warp-tile requirement. This is inefficient when the overall problem size is large enough to support a larger warp tile that would allow better-coalesced memory access.
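For illustration, here is a minimal sketch (not the project's actual API; the function name, template parameters, and half-precision element type are assumptions) of a global-to-shared copy whose base tile shape is a template parameter rather than a hardcoded 16x16. With a wider tile such as 16x64 halves, each warp's vectorized loads cover whole 128-byte lines, giving the coalescing a fixed 16x16 tile cannot:

```cpp
#include <cuda_fp16.h>

// Hypothetical sketch: copy a kRows x kCols half-precision tile from global
// to shared memory with kThreads cooperating threads. The tile shape is a
// compile-time parameter so larger problems can pick a wider base tile.
template <int kRows, int kCols, int kThreads>
__device__ void load_tile_g2s(const __half* __restrict__ g_ptr, int g_stride,
                              __half* s_ptr) {
    // Each thread moves 8 halves (16 bytes) per float4 transaction. With
    // kCols = 64, one row takes 8 threads, so a 32-thread warp touches 4
    // consecutive rows and every access lands on a full 128-byte line.
    constexpr int kVec = 8;                            // halves per vector load
    constexpr int kThreadsPerRow = kCols / kVec;       // threads covering a row
    constexpr int kRowStride = kThreads / kThreadsPerRow;  // rows per iteration

    int tid = threadIdx.x;
    int row = tid / kThreadsPerRow;
    int col = (tid % kThreadsPerRow) * kVec;

    for (int r = row; r < kRows; r += kRowStride) {
        const float4* src =
            reinterpret_cast<const float4*>(g_ptr + r * g_stride + col);
        float4* dst = reinterpret_cast<float4*>(s_ptr + r * kCols + col);
        *dst = *src;  // one 16-byte vectorized global load + shared store
    }
}
```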
haruhi55 changed the title from "fix the fixed warp tile used in global to shared memory load/store." to "Fix the fixed warp tile used in global to shared memory load/store." on Dec 18, 2024.