[QST] Cute docs need a concrete example using tensor cores #2063
I'm trying to learn how to use CuTe, and it's surprising that even the sgemm_sm80.cu example defaults to UniversalFMA and gives much lower GFLOPS than possible for the MMA. Could you update a version of this file that has, even if just commented out, TiledMMA versions for different tensor ops? I feel like that would be worth more than many, many lines of documentation. From:

Having additional mmaC's there commented out would be fantastic, especially if they lined up with the copy tiling.

In the meantime, could somebody help me with the edits I need to make to turn sgemm_sm80.cu to use the TF32 tensor cores for SM80?
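Since no complete TF32 variant appears in the thread, here is a minimal, hedged sketch of the kind of TiledMMA swap being asked about, using the m16n8k8 TF32 atom from cute/arch/mma_sm80.hpp. The 2x2x1 warp tiling, the element types, and the surrounding layout changes are all assumptions, not something the maintainers confirmed:

```cpp
#include <cute/tensor.hpp>  // TiledMMA, make_tiled_mma, SM80 MMA atoms

using namespace cute;

// Inside gemm_tn(), sgemm_sm80.cu currently builds its TiledMMA roughly as
//   TiledMMA mmaC = make_tiled_mma(UniversalFMA<TC,TA,TB>{},
//                                  Layout<Shape<_16,_16,_1>>{});  // 16x16x1 grid of FMA threads
// An untested TF32 tensor-core sketch might instead be:
TiledMMA mmaC = make_tiled_mma(SM80_16x8x8_F32TF32TF32F32_TN{},   // m16n8k8, f32 += tf32 * tf32
                               Layout<Shape<_2,_2,_1>>{});        // 2x2x1 grid of atoms -> 4 warps, 128 threads
```

This also assumes TA/TB become cute::tfloat32_t (TC stays float), and that the smem layouts and copies are reworked to match the register fragments the atom expects; the TN in the atom name refers to k-major A and B operands, which is why it pairs with the gemm_tn path.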
Comments
Agreed, I have an open MR to update a lot of these CuTe examples, but it's been delayed for various reasons. In the meantime, here's the current state within that update MR of
Thank you! Happy to hear it's in the works. Being able to look at different data types and transposes and see how the tiling and swizzle change will be really helpful, especially if the versions get close to the hardware limit. Following the code from the CUTLASS profiler down to what's eventually happening in CuTe is pretty hard. Thanks for your example! What should I replace the
Ah yes, I've updated the code above to remove those for you, thanks.
Thank you, this is so helpful! It gets really close to the 16x8x8 CUTLASS one (0.6 ms vs 0.56 ms) for m = n = k = 4096. I don't want to wear out my welcome, but the CUTLASS version has a CTA size of (256, 128, 32) where this one is (128, 128, 64).
To help me understand something, would you mind showing me how the code changes if I drive the CTA size m up from 128 to 256 and drive the CTA size k down from 64 to 32 in this example? I'm assuming they're just simple number changes in the gemm_tn function; if it's more involved than that, please don't bother.
Then you get into some layout engineering; here's another thread where we walk through some of that: An easy way to get 256x128x32 is to cut down the vectorization and change the layouts:

```cpp
// Define CTA tile sizes (static)
auto bM = Int<256>{};
auto bN = Int<128>{};
auto bK = Int< 32>{};
auto cta_tiler = make_shape(bM, bN, bK);   // (BLK_M, BLK_N, BLK_K)
auto bP = Int<3>{};                        // Pipeline

// Define the smem layouts (static)
// Swizzles for LDSM and 64b k-major loads
auto swizzle_atom = composition(Swizzle<3,3,3>{},
                                Layout<Shape <_16,Shape <_8, _4>>,
                                       Stride< _8,Stride<_1,_128>>>{});
auto sA = tile_to_shape(swizzle_atom, make_shape(bM,bK,bP));

// Define the thread layouts (static)
TiledCopy copyA = make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint64_t>, cute::half_t>{},
                                  Layout<Shape<_16,_8>,Stride<_8,_1>>{}, // Thr layout 16x8 k-major
                                  Layout<Shape< _1,_4>>{});              // Val layout 1x4 k-major
```

Untested; maybe it works with this swizzle, maybe a different swizzle, etc. You'll want to balance GMEM vectorization and cache-line utilization against SMEM bank conflicts, and optimize the data layout accordingly.
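Not part of the reply above, but for completeness: a guessed mirror of the B-side pieces under the same assumptions (k-major cute::half_t data, 64-bit cp.async vectorization, the same 128-thread copy arrangement); untested:

```cpp
// Hypothetical B-side analogs, mirroring the A-side definitions above.
// Reuses the same swizzle atom; assumes B is also k-major half_t.
auto sB = tile_to_shape(swizzle_atom, make_shape(bN,bK,bP));  // (BLK_N, BLK_K, PIPE)

TiledCopy copyB = make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint64_t>, cute::half_t>{},
                                  Layout<Shape<_16,_8>,Stride<_8,_1>>{}, // Thr layout 16x8 k-major
                                  Layout<Shape< _1,_4>>{});              // Val layout 1x4 k-major
```

Each copy step moves a 16x32 tile (16x8 threads times 1x4 values), so the 256x32 A tile is covered in 16 steps and the 128x32 B tile in 8.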
Thank you very much. Seeing the numbers change, running pdflatex on the layouts from the two versions, and comparing them gives me a good amount of study material.
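For anyone studying along the same way: CuTe can generate those LaTeX diagrams itself via its print utilities. A minimal host-side sketch of that workflow (my assumption of what was done here: compile as a .cu against the CUTLASS headers with nvcc, then run the emitted LaTeX through pdflatex):

```cpp
#include <cute/tensor.hpp>  // layouts, swizzles, Copy_Atom/TiledCopy, print utilities

using namespace cute;

int main() {
  // The swizzled smem atom from the 256x128x32 reply above
  auto swizzle_atom = composition(Swizzle<3,3,3>{},
                                  Layout<Shape <_16,Shape <_8, _4>>,
                                         Stride< _8,Stride<_1,_128>>>{});
  print(swizzle_atom);  // textual form of the composed (swizzled) layout
  print("\n");

  // The tiled copy from the same reply
  TiledCopy copyA = make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint64_t>, cute::half_t>{},
                                    Layout<Shape<_16,_8>,Stride<_8,_1>>{},
                                    Layout<Shape< _1,_4>>{});
  print_latex(copyA);   // emits a LaTeX document of the thread/value partitioning; compile with pdflatex
  return 0;
}
```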