
[tritonbench] Fix colfax_cutlass flash_attention operator #2401

Merged (4 commits)

Conversation

@xuzhao9 (Contributor) commented on Jul 31, 2024

The colfax_cutlass kernels fail to build because of missing C++ template instantiations. We need to explicitly include the implementation header so that all required template specializations are instantiated.

Test plan:

Install the colfax_cutlass operators:

python install.py --userbenchmark triton --cutlass
/home/xz/git/benchmark/submodules/cutlass-kernels/src/fmha/fmha_forward.cu(826): warning #117-D: non-void function "main" should return a value
      return;
            ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"


Run the flash_attention operator from colfax_cutlass

python run_benchmark.py triton --op flash_attention --only colfax_cutlass --num-inputs 1

  (Batch, Heads, SeqLen, Dhead)    colfax_cutlass-latency
-------------------------------  ------------------------
              (32, 32, 512, 64)                  0.001024

@facebook-github-bot (Contributor) commented: @xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -128,6 +128,7 @@ class Operator(BenchmarkOperator):
def __init__(self, tb_args: argparse.Namespace, extra_args: Optional[List[str]] = None):
super().__init__(tb_args, extra_args)
args = parse_op_args(self.extra_args)
self.use_cuda_graphs = False
A reviewer (Contributor) commented:
I wonder why we need to turn off cuda_graphs.

@xuzhao9 (Author) replied on Aug 1, 2024:

This is not about colfax_cutlass; with ThunderKittens (#2370), my run fails with this error:

Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/home/xz/git/benchmark/torchbenchmark/util/triton_op.py", line 558, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
                                                  ^^^^^^^^^^^^^^^^^
  File "/home/xz/git/benchmark/torchbenchmark/util/triton_op.py", line 546, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
                   ^^^^^^^^^^^^^^^
  File "/home/xz/git/benchmark/torchbenchmark/util/triton_op.py", line 753, in _do_bench
    metrics.latency = triton.testing.do_bench_cudagraph(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xz/miniconda3/lib/python3.11/site-packages/triton/testing.py", line 46, in do_bench_cudagraph
    with torch.cuda.graph(g):
  File "/home/xz/miniconda3/lib/python3.11/site-packages/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/home/xz/miniconda3/lib/python3.11/site-packages/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

So I am thinking CUDA graph capture might not work with ThunderKittens?

Since it is working with colfax_cutlass, I am reverting this line for this PR.


@facebook-github-bot: @xuzhao9 merged this pull request in 0a2ff22.

@xuzhao9 deleted the xz9/fix-cutlass branch on August 1, 2024.