
[Feature Request] Lazy TVM tensor intrin registration is required to save import time #256

Closed
LeiWang1999 opened this issue Dec 3, 2024 · 2 comments
Labels: enhancement (New feature or request)
@LeiWang1999 (Contributor)

BitBLAS leverages Apache TVM for high-performance code generation. Within our TensorIR + schedule-based templates, we have to register tensor intrinsics (for CUDA Tensor Cores, AMD Matrix Cores, and LOP3). There are hundreds of intrins to register at runtime, and each registration crosses the C++/Python boundary via ctypes, so the whole process can take several minutes.

This results in a suboptimal user experience. Even with our TileLang backend we still have to go through the registration when fetching hardware-aware configs, so some refactoring is needed to alleviate this overhead.
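One way to remove this import-time cost is to defer registration until an intrinsic is actually needed. Below is a minimal sketch of the lazy-registration pattern, under stated assumptions: the bitblas.gpu.intrin.lop3 module path and its register_all entry point are hypothetical placeholders, not the actual BitBLAS API.

import threading

_INTRIN_LOCK = threading.Lock()
_INTRINS_REGISTERED = False

def ensure_tensor_intrins_registered() -> None:
    """Register TVM tensor intrinsics on first use, not at import time."""
    global _INTRINS_REGISTERED
    if _INTRINS_REGISTERED:  # fast path: already registered
        return
    with _INTRIN_LOCK:
        if _INTRINS_REGISTERED:  # double-checked under the lock
            return
        # Importing the intrin module here (rather than at package import)
        # defers the expensive ctypes round trips until a schedule needs them.
        from bitblas.gpu.intrin import lop3  # hypothetical module path
        lop3.register_all()  # hypothetical registration entry point
        _INTRINS_REGISTERED = True

Schedule templates would call ensure_tensor_intrins_registered() right before looking up an intrinsic, so merely importing the package costs nothing.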

LeiWang1999 added the enhancement (New feature or request) label on Dec 3, 2024
LeiWang1999 self-assigned this on Dec 3, 2024
@LeiWang1999 (Contributor Author) commented Dec 8, 2024

Pull Request #255 made some modifications related to this issue: we've moved the LOP3 intrin registration into a lazy-import form. But there is still some registration overhead in the TL backend (even though it doesn't require any intrins to be registered) when we enable hardware-aware tuning; this can be improved by replacing get_mma_intrin_group with a customized, faster check.

bitblas/gpu/matmul_analysis.py, line 650:

if target.kind.name == "cuda" and check_sm_version(target.arch) >= 70:
    # TODO(lei): we should consider the dtypes of inputs a and b
    # separately instead of assuming they share the same dtype,
    # as the tensor core may support e4m3_float8 * e5m2_float8.
    in_dtype, out_dtype = get_in_out_dtypes(block_stmt)
    try:
        _ = get_mma_intrin_group(
            a_dtype=in_dtype,
            b_dtype=in_dtype,
            out_dtype=out_dtype,
        )
    except Exception:
        logger.debug("Cannot find the corresponding mma intrin group")
        return func, None
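A faster check could skip get_mma_intrin_group entirely and validate the dtype combination against a static table, so the lookup never touches intrin registration. A minimal sketch, assuming a hypothetical helper name (is_mma_dtype_supported) and an illustrative, not exhaustive, dtype table:

# Input/output dtype pairs with Tensor Core MMA support on sm70+
# (illustrative, not exhaustive).
_SUPPORTED_MMA_DTYPES = {
    ("float16", "float16"),
    ("float16", "float32"),
    ("int8", "int32"),
}

def is_mma_dtype_supported(in_dtype: str, out_dtype: str) -> bool:
    """Cheap set lookup; never triggers tensor-intrin registration."""
    return (in_dtype, out_dtype) in _SUPPORTED_MMA_DTYPES

The try/except above could then be replaced with a plain check such as: if not is_mma_dtype_supported(in_dtype, out_dtype): return func, None.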

@LeiWang1999 (Contributor Author)

Closed, as the recent pull request #262 has covered this issue :)
