Dear Team,

I would like to implement a fused mixed-precision matrix multiplication such as w4a4 + w16a16, where the w16a16 part is small. An example use of this kernel is accelerating an LLM with LoRA applied.

I can find some examples in torchao that implement w4a4/w4a8 matrix multiplication and fuse the dequantization into the epilogue, but I don't know how to further integrate a w16a16 matrix multiplication on top of that. Are there any examples I can refer to?
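For reference, here is a minimal unfused PyTorch sketch of the computation I have in mind. The int4 quantization is simulated with a fake-quant round-trip (`fake_quant_int4` is a hypothetical helper for illustration, not torchao's actual API); a real fused kernel would keep packed int4 operands and dequantize in the epilogue.

```python
import torch

def fake_quant_int4(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical symmetric per-tensor fake-quant to the int4 range [-8, 7].
    # A real kernel would store packed int4 data and dequantize in the epilogue.
    scale = t.abs().max().clamp(min=1e-8) / 7.0
    return (t / scale).round().clamp(-8, 7) * scale

def fused_reference(x1, w1, x2, w2):
    # Unfused reference for the desired fused kernel:
    # a w4a4 main GEMM plus a small w16a16 GEMM, summed at the output.
    y_main = fake_quant_int4(x1) @ fake_quant_int4(w1).T  # w4a4 path
    y_side = x2 @ w2.T                                    # w16a16 path
    return y_main + y_side
```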
Thank you very much for your reply!
The input consists of two activations $X_1 \in \mathbb{R}^{L \times D_1}$ and $X_2 \in \mathbb{R}^{L \times D_2}$ and two weight matrices $W_1 \in \mathbb{R}^{D \times D_1}$ and $W_2 \in \mathbb{R}^{D \times D_2}$, where $L = 2048$, $D_1 = 4096$, $D_2 = 64$, $D = 4096$. The output is $Y = X_1 W_1^\top + X_2 W_2^\top$. Meanwhile, $X_1$ and $W_1$ will be quantized to 4-bit.
I think the difference from example 13 is that I need to add the results of the two GEMMs, rather than computing the second GEMM on the output of the first.
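Concretely, with these shapes, the fp16 baseline that the fused kernel should match is the following sketch (for simplicity, everything is fp16 here; in the real kernel $X_1$ and $W_1$ would be int4):

```python
import torch

L, D1, D2, D = 2048, 4096, 64, 4096

x1 = torch.randn(L, D1, dtype=torch.half, device="cuda")  # 4-bit in the real kernel
w1 = torch.randn(D, D1, dtype=torch.half, device="cuda")  # 4-bit in the real kernel
x2 = torch.randn(L, D2, dtype=torch.half, device="cuda")  # stays fp16 (w16a16 path)
w2 = torch.randn(D, D2, dtype=torch.half, device="cuda")  # stays fp16 (w16a16 path)

# Two independent GEMMs whose results are summed, unlike the
# back-to-back pattern where the second GEMM consumes the first's output.
y = x1 @ w1.T + x2 @ w2.T
assert y.shape == (L, D)
```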