The current global-to-shared load uses a fixed 16x16 base tile to match TensorCore's warp-tile requirement. This is inefficient when the overall problem size is large enough to support a larger warp tile that would allow better-coalesced memory access.
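For illustration, here is a minimal sketch (not the project's actual API; the function name, template parameters, and half-precision element type are assumptions) of a global-to-shared copy whose base tile shape is a template parameter rather than a hardcoded 16x16. With a wider tile such as 16x64 halves, each warp's vectorized loads cover whole 128-byte lines, giving the coalescing a fixed 16x16 tile cannot:

```cpp
#include <cuda_fp16.h>

// Hypothetical sketch: copy a kRows x kCols half-precision tile from global
// to shared memory with kThreads cooperating threads. The tile shape is a
// compile-time parameter so larger problems can pick a wider base tile.
template <int kRows, int kCols, int kThreads>
__device__ void load_tile_g2s(const __half* __restrict__ g_ptr, int g_stride,
                              __half* s_ptr) {
    // Each thread moves 8 halves (16 bytes) per float4 transaction. With
    // kCols = 64, one row takes 8 threads, so a 32-thread warp touches 4
    // consecutive rows and every access lands on a full 128-byte line.
    constexpr int kVec = 8;                            // halves per vector load
    constexpr int kThreadsPerRow = kCols / kVec;       // threads covering a row
    constexpr int kRowStride = kThreads / kThreadsPerRow;  // rows per iteration

    int tid = threadIdx.x;
    int row = tid / kThreadsPerRow;
    int col = (tid % kThreadsPerRow) * kVec;

    for (int r = row; r < kRows; r += kRowStride) {
        const float4* src =
            reinterpret_cast<const float4*>(g_ptr + r * g_stride + col);
        float4* dst = reinterpret_cast<float4*>(s_ptr + r * kCols + col);
        *dst = *src;  // one 16-byte vectorized global load + shared store
    }
}
```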
haruhi55 changed the title from "fix the fixed warp tile used in global to shared memory load/store." to "Fix the fixed warp tile used in global to shared memory load/store." on Dec 18, 2024.