streamk v0.1 #619

xiaohuguo2023 · 2024-07-29T13:17:18Z

Triton stream-k gemm v0.1

performance comparable to tune_gemm
persistent non atomic kernel implementation
pid renumbering based on chiplet structure of MI300X
dynamic grid setting
tuning script adapted from tune_gemm

zhanglx13 · 2024-07-29T16:14:19Z

Can you write a README to introduce the features implemented in this version of the streamK kernel?

vgokhale · 2024-07-29T17:08:27Z

persistent non atomic kernel implementation

What does this mean?

xiaohuguo2023 · 2024-07-29T19:01:04Z

persistent non atomic kernel implementation

What does this mean?

In this version, the stream-k kernel uses a persistent loop so that a WG may work on multiple output tiles, while also allowing workgroups to do only part of the work for an output tile.

vgokhale · 2024-07-29T19:26:34Z

persistent non atomic kernel implementation

What does this mean?

In this version, the stream-k kernel uses a persistent loop so that a WG may work on multiple output tiles, while also allowing workgroups to do only part of the work for an output tile.

But it uses atomics right? Did you mean non atomic as in does not do atomic add?

xiaohuguo2023 · 2024-07-29T20:41:14Z

persistent non atomic kernel implementation

What does this mean?

In this version, the stream-k kernel uses a persistent loop so that a WG may work on multiple output tiles, while also allowing workgroups to do only part of the work for an output tile.

But it uses atomics right? Did you mean non atomic as in does not do atomic add?

Yeah, my description was not precise: we still use atomics for the spin lock, but not atomic_add for the final output.
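
To illustrate the work split described in this thread, here is a minimal pure-Python sketch (not the Triton kernel from this PR). It assumes the K-loop iterations of all output tiles are flattened into one range and divided evenly across workgroups; the function name `streamk_partition` and its parameters are hypothetical.

```python
def streamk_partition(total_tiles, iters_per_tile, num_wgs):
    """Split the flattened K-iteration space evenly across workgroups.

    Returns, for each workgroup, a list of (tile_id, first_iter, end_iter)
    segments, where the iteration indices are local to the tile and
    end_iter is exclusive.
    """
    total_iters = total_tiles * iters_per_tile
    base, extra = divmod(total_iters, num_wgs)
    work = []
    start = 0
    for wg in range(num_wgs):
        # The first `extra` workgroups take one additional iteration.
        end = start + base + (1 if wg < extra else 0)
        segments = []
        it = start
        while it < end:
            tile_id, local = divmod(it, iters_per_tile)
            # Stop at the tile boundary or at the end of this workgroup's share.
            seg_end = min(end, (tile_id + 1) * iters_per_tile)
            segments.append((tile_id, local, local + (seg_end - it)))
            it = seg_end
        work.append(segments)
        start = end
    return work


# 3 output tiles, 4 K-iterations each, 2 workgroups.
work = streamk_partition(total_tiles=3, iters_per_tile=4, num_wgs=2)
for wg, segments in enumerate(work):
    print(wg, segments)
```

In this example workgroup 0 finishes tile 0 and the first half of tile 1, while workgroup 1 finishes the second half of tile 1 and all of tile 2. Tile 1 is the shared case: per the reply above, the partial results for such a tile are combined under a spin lock built on atomics, and the final output is written with a plain store rather than `atomic_add`.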

xiaohuguo2023 · 2024-07-30T11:42:56Z

Can you write a README to introduce the features implemented in this version of the streamK kernel?

done

vgokhale · 2024-07-31T14:14:28Z

python/perf-kernels/streamk/README.md

@@ -0,0 +1,43 @@
+# streamk gemm script v0.1


What would be needed to get it to 1.0?

I need to make it ready to explore half a million benchmarks and to reach performance comparable to Tensile development.

I don't think we can make it comparable to Tensile, because that is outside the scope of streamk. I think we can call this 0.1 until we have the wider tuning space working.

vgokhale · 2024-07-31T14:15:56Z

python/perf-kernels/streamk/streamk_kernel.py

+        acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=acc_dtype)
+        for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
+            if EVEN_K:
+                a = tl.load(A_BASE)


Can we peel the masking for the last iteration when EVEN_K is False so that only the last loop pays the price of the mask?

As discussed, this will be in the next PR. Thanks!
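
For reference, the peeling suggested above can be sketched in plain Python (an emulation, not Triton code and not the follow-up PR). The helper `load_block` stands in for a block load, with `limit` playing the role of the mask; both names are hypothetical.

```python
def load_block(vec, start, block, limit=None):
    """Emulate a block load; with `limit`, out-of-range lanes read as 0.0."""
    if limit is None:
        # Unmasked fast path: the caller guarantees the block is in range.
        return vec[start:start + block]
    # Masked path: lanes at or beyond `limit` are replaced by 0.0.
    return [vec[i] if i < limit else 0.0 for i in range(start, start + block)]


def dot_peeled(a, b, block_k):
    """Blocked dot product where only the peeled tail iteration is masked."""
    k = len(a)
    full_iters = k // block_k  # iterations that need no mask
    acc = 0.0
    for it in range(full_iters):
        s = it * block_k
        xa = load_block(a, s, block_k)
        xb = load_block(b, s, block_k)
        acc += sum(x * y for x, y in zip(xa, xb))
    if k % block_k != 0:
        # Peeled last iteration: the only one that pays for the mask.
        s = full_iters * block_k
        xa = load_block(a, s, block_k, limit=k)
        xb = load_block(b, s, block_k, limit=k)
        acc += sum(x * y for x, y in zip(xa, xb))
    return acc


result = dot_peeled([1.0, 2.0, 3.0, 4.0, 5.0], [2.0] * 5, block_k=2)
```

With K = 5 and a block size of 2, the two full iterations run unmasked and only the final one-element remainder goes through the masked path. When K is a multiple of the block size (the `EVEN_K` case), the tail is skipped entirely.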

vgokhale · 2024-07-31T14:18:20Z

python/perf-kernels/streamk/streamk_kernel.py

+
+
+@triton.jit()
+def get_new_pid(current_pid, num_sms):


/s/num_sms/num_cus

vgokhale · 2024-07-31T14:19:30Z

python/perf-kernels/streamk/streamk_kernel.py

+    # Number of XCDs
+    num_xcds = 8
+    # Number of pids per XCD in the new arrangement
+    pids_per_xcd = num_sms // num_xcds


I thought the grid could have a multiple of num_cus pids.

For a persistent kernel, the grid has to be either num_cus, or total_tiles if total_tiles < num_cus.
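
A small host-side Python sketch of the two points in this exchange, under stated assumptions. `grid_size` follows the rule in the reply above. `remap_pid` is a guess at what a chiplet-aware renumbering could look like: it assumes the hardware hands consecutive pids to the 8 XCDs round-robin, that the grid equals num_cus, and that num_cus is divisible by the number of XCDs. Only `num_xcds = 8` and `pids_per_xcd = num_cus // num_xcds` come from the diff; the function names, the rest of the mapping, and the value 304 used for num_cus are illustrative.

```python
NUM_XCDS = 8  # value taken from the diff above


def cdiv(a, b):
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def grid_size(m, n, block_m, block_n, num_cus):
    """Persistent-kernel grid: num_cus, or total_tiles if that is smaller."""
    total_tiles = cdiv(m, block_m) * cdiv(n, block_n)
    return min(num_cus, total_tiles)


def remap_pid(pid, num_cus, num_xcds=NUM_XCDS):
    """Renumber pids so that pids sharing an XCD become contiguous.

    Assumption: pid p is dispatched to XCD (p % num_xcds).
    """
    pids_per_xcd = num_cus // num_xcds
    xcd = pid % num_xcds    # XCD this pid is assumed to land on
    slot = pid // num_xcds  # position of this pid within its XCD
    return xcd * pids_per_xcd + slot


num_cus = 304  # illustrative value
small = grid_size(4096, 4096, 256, 256, num_cus)  # 16 * 16 = 256 tiles
large = grid_size(8192, 8192, 256, 256, num_cus)  # 32 * 32 = 1024 tiles
new_ids = [remap_pid(p, num_cus) for p in range(num_cus)]
```

Under the round-robin assumption, original pids 0, 8, 16, ... all sit on XCD 0 and receive new ids 0, 1, 2, ..., so neighbouring new ids (and the neighbouring tiles they process) share one XCD. The mapping is a permutation only when the grid equals num_cus; the smaller-grid case would need separate handling.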

streamk v0.1

1064463

xiaohuguo2023 requested review from vgokhale, zhanglx13, jayfurmanek and scxiao July 29, 2024 13:17

xiaohuguo2023 added 2 commits July 29, 2024 08:27

remove unused variable

b7a18c2

fix format issues

8c1d058

xiaohuguo2023 added 2 commits July 30, 2024 06:40

add README

1dfebe1

fix format issue

480ad2c

vgokhale reviewed Jul 31, 2024

change num_sms to num_cus

bdd1f8e

vgokhale approved these changes Jul 31, 2024

xiaohuguo2023 merged commit 52a908f into main_perf Jul 31, 2024
4 checks passed
