BitBLAS leverages Apache TVM for high-performance code generation. Within our TensorIR + schedule-based templates, we must register tensor intrinsics (for CUDA Tensor Cores, AMD Matrix Cores, and LOP3). Hundreds of intrinsics need to be registered at runtime, and each registration requires interaction between C++ and Python via ctypes, so the whole process can take several minutes to complete.
This results in a suboptimal user experience. Even with our TileLang backend, we still have to go through the registration when getting hardware-aware configs. Some refactoring of the code is needed to alleviate this overhead.
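One way to picture the refactoring is deferred registration: declare every intrinsic cheaply up front and pay the expensive registration cost only when a given intrinsic is first requested. The following is a minimal sketch under that assumption. `LazyIntrinRegistry`, `make_factory`, and the `hypothetical_intrin_*` names are all hypothetical and are not part of BitBLAS or TVM; the factory stands in for the real ctypes-backed registration call.

```python
from typing import Callable, Dict, List


class LazyIntrinRegistry:
    """Hypothetical registry that defers expensive registration until first use."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._registered: Dict[str, object] = {}

    def declare(self, name: str, factory: Callable[[], object]) -> None:
        # Cheap: only remembers how to build the intrinsic, does not build it.
        self._factories[name] = factory

    def get(self, name: str) -> object:
        # Expensive work happens here, once per intrinsic, on first request.
        if name not in self._registered:
            self._registered[name] = self._factories[name]()
        return self._registered[name]

    def registered_count(self) -> int:
        return len(self._registered)


# Records which factories actually ran, to show that most never do.
build_calls: List[str] = []


def make_factory(name: str) -> Callable[[], object]:
    def factory() -> object:
        # Stand-in for the real (slow) registration call.
        build_calls.append(name)
        return {"name": name}

    return factory


registry = LazyIntrinRegistry()
for i in range(300):
    intrin_name = f"hypothetical_intrin_{i}"
    registry.declare(intrin_name, make_factory(intrin_name))

first = registry.get("hypothetical_intrin_7")
second = registry.get("hypothetical_intrin_7")  # served from the cache
```

With this layout, declaring 300 intrinsics triggers no registration at all, and a backend that never asks for an intrinsic never pays for one.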
Pull Request #255 made some modifications related to this issue: the lop3 intrin registration is now done through a lazy import. However, the tl backend still incurs some registration overhead when hardware-aware tuning is enabled, even though it does not require any intrin to be registered. This can be improved by replacing get_mma_intrin_group with a customized, faster alternative.
bitblas/gpu/matmul_analysis.py::650
if target.kind.name == "cuda" and check_sm_version(target.arch) >= 70:
    # TODO(lei): we should consider the dtype of the input a and b
    # instead of assuming both a and b share the same dtype.
    # As the tensorcore may supports e4m3_float8 * e5m2_float8
    in_dtype, out_dtype = get_in_out_dtypes(block_stmt)
    try:
        _ = get_mma_intrin_group(
            a_dtype=in_dtype,
            b_dtype=in_dtype,
            out_dtype=out_dtype,
        )
    except Exception:
        logger.debug("Cannot find the corresponding mma intrin group")
        return func, None
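The snippet above calls get_mma_intrin_group only to learn whether a matching intrin group exists, and discards the result. A customized, fast replacement could answer that question from a plain lookup table without registering anything. The sketch below is an assumption about what such a check might look like: `has_mma_intrin_group` is a hypothetical helper, and the dtype combinations in `_ASSUMED_SUPPORTED` are illustrative placeholders, not the actual list supported by BitBLAS or TVM.

```python
from typing import FrozenSet, Tuple

# Placeholder table of (a_dtype, b_dtype, out_dtype) combinations.
# These entries are assumptions for illustration only; the real set
# would have to mirror what get_mma_intrin_group accepts.
_ASSUMED_SUPPORTED: FrozenSet[Tuple[str, str, str]] = frozenset(
    {
        ("float16", "float16", "float16"),
        ("float16", "float16", "float32"),
        ("int8", "int8", "int32"),
    }
)


def has_mma_intrin_group(a_dtype: str, b_dtype: str, out_dtype: str) -> bool:
    """Hypothetical fast check: a set lookup, no intrin registration."""
    return (a_dtype, b_dtype, out_dtype) in _ASSUMED_SUPPORTED


fp16_supported = has_mma_intrin_group("float16", "float16", "float32")
fp64_supported = has_mma_intrin_group("float64", "float64", "float64")
```

Taking separate a_dtype and b_dtype arguments also leaves room for the mixed-dtype case mentioned in the TODO, although whether that matches the intended design is not confirmed by the issue.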