[Sharktank][Llama][FP8] Minimal changes for numerically correct fp8 #859
Conversation
Force-pushed from 20e7316 to 6643fb3.
Force-pushed from 99d3a50 to 74344e0.
I mostly have a bunch of naive questions, so I won't give approval or request changes.
```diff
@@ -88,7 +87,7 @@ def forward(self, x):
         # level to do this, but for now its here.
         if not isinstance(y, QuantizedTensor):
             if y.dtype == torch.float8_e4m3fnuz:
-                y = ops.to(y, torch.float16)
+                y = ops.to(y, torch.bfloat16)
```
Is this change because the specific float8 type accumulates to something which only truncates safely to `bfloat16` instead of `float16`?
No, this is an artifact of the way the model was quantized. The actual fp8 matmul intrinsic accumulates into f32, which IREE can truncate, but in Python we just cast to match the reference model.
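For context, a minimal eager-mode sketch of the behavior described here (illustrative only, not the sharktank implementation):

```python
import torch

# Toy fp8 inputs (float8_e4m3fnuz is the fnuz variant used on MI300-class GPUs).
x = torch.randn(4, 8).to(torch.float8_e4m3fnuz)
w = torch.randn(8, 8).to(torch.float8_e4m3fnuz)

# Eager PyTorch has no direct fp8 matmul, so emulate it: upcast, accumulate in
# f32 (what the hardware intrinsic does), then cast to the reference model's dtype.
y = torch.matmul(x.to(torch.float32), w.to(torch.float32)).to(torch.bfloat16)
```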
I'm not a fan that the Python implementation can really only compare against one specific quantization method like this. I don't have an answer off the top of my head, so it's fine for now, but ideally it would be good to make this more agnostic somehow.
```python
xk = (
    self.cache_quantizer.quantize(xk)
    .unpack()
    .dequant()
    .to(torch.float16)
)
xv = (
    self.cache_quantizer.quantize(xv)
    .unpack()
    .dequant()
    .to(torch.float16)
)
```
Why was this removed?
Quark's model loader didn't support the fp8 KV cache. We are still doing it for export, but it is missing in the Python comparison.
```python
if attention_mask is not None:
    attention_mask = attention_mask.to(torch.bfloat16)
```
I don't see any quantization stuff here. Is the indentation incorrect?
```python
if self.cache_quantizer and not self.fake_quant:
```
Yeah, that's what I'm asking about. Do we need `cache_quantizer` and `not fake_quant` for the attention mask to be in `bfloat16`? I guess that's probably the case.
```diff
@@ -82,7 +86,7 @@ def __init__(self, theta: Theta, config: LlamaModelConfig):

         self.add_module(
             "token_embedding",
-            TokenEmbeddingLayer(theta("token_embd"), dtype=config.activation_dtype),
+            TokenEmbeddingLayer(theta("token_embd"), dtype=self.activation_dtype),
```
From earlier in `__init__`, it doesn't look like `self.activation_dtype` is different from `config.activation_dtype`.
Yeah, this change is a no-op. Style preference.
For consistency, we can use either `self` or `config`.
```diff
@@ -55,11 +62,30 @@ def __init__(self):
         # Map of module_name to last used index for duplicated tensors.
         self.duplicate_tensors = {}

+    def before_forward(self, module_name, module, *args, **kwargs):
```
It might be useful to have a docstring for this. In isolation, I'm not really sure what it does, or what should be passed as `args` and `kwargs`. It seems like the intent is that some input tensors are passed as `args` in a specific order, and this function will add them to `self.tensors` while managing duplicates appropriately.
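For instance, a docstring along these lines (the wording is mine, inferred from the code below, not text from the PR):

```python
def before_forward(self, module_name, module, *args, **kwargs):
    """Record a module's input tensors before its forward pass runs.

    Each positional torch.Tensor argument is detached, copied to the CPU,
    and stored in self.tensors under f"{module_name}_input_{idx}". When the
    same module name is seen more than once, entries are suffixed with
    "#0", "#1", ... via self.duplicate_tensors so earlier captures are not
    overwritten.
    """
```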
```python
for idx, arg in enumerate(args):
    if not isinstance(arg, torch.Tensor):
        continue
    result_tensor = torch.detach(arg).contiguous().to(device="cpu").clone()
```
From the `name_base`, would this more appropriately be called `input_tensor`?
Now that I think about it, perhaps adding this method warrants renaming the class to something like `SaveInputAndOutputTensorsPatch` and updating the docstrings for it.
```python
if name_base in self.tensors:
    orig_dup = self.tensors[name_base]
    del self.tensors[name_base]
    self.duplicate_tensors[name_base] = 0
    self.tensors[f"{name_base}#0"] = orig_dup
elif name_base in self.duplicate_tensors:
    index = self.duplicate_tensors[name_base] + 1
    self.duplicate_tensors[name_base] = index
    self.tensors[f"{name_base}#{index}"] = result_tensor
else:
    self.tensors[name_base] = result_tensor
```
I'm trying to parse this... Let's say you are passing one tensor `x` in through `args`, but `f"{module_name}_input_0"` is already a key in `self.tensors`. Then `x` isn't going to get added to anything; instead you just rename the original `tensors` element. Because I don't really know what the point of this function is, my naive assumption is that the `elif` should just be an `if`, so that `x` would get added to `self.tensors` as `f"{name_base}#1"`.
Since it looks like this logic is used in `after_forward`, would it be useful to factor out something like the following?

```python
def insert_tensor(self, tensor: torch.Tensor, name_base: str):
    """Adds a tensor to self.tensors while updating duplicate counts."""
```
Actually, looking around more, I don't really see any instance where `name_base` would appear as one of the keys of `self.tensors`, so maybe the initial if statement is completely unnecessary?
It is to cover cases of modules with the same name. That won't happen often in our own models, but it does happen in the wild.
Force-pushed from b600453 to 9e479b0.
Do you have a test or a command I can run to verify the numerics are correct?
```diff
@@ -49,7 +49,7 @@ def from_gguf_props(p: dict[str, Any]):
     name_prefix = p.get("general.architecture", "llama")
     default_expert_count = 0
     default_expert_used_count = 0
-    default_rope_freq_base = 10000.0
+    default_rope_freq_base = 500000.0
```
Can we pass this separately? I have noticed `rope_freq_base` not being explicitly set in some models, and it might need to default to 10000.
Perplexity seems to be passing, so not a blocker.
Llama 3 defaults to 500000, so I think we should use that.
Do we know where that 10000 came from?
llama2
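For reference, a minimal sketch of the fallback being discussed (the `general.architecture` key comes from the diff above; the `*.rope.freq_base` key name and the helper itself are assumptions, not the PR's code):

```python
def rope_freq_base(p: dict, default: float = 500000.0) -> float:
    # Prefer the value recorded in the GGUF metadata; fall back to the default
    # (500000.0 for Llama 3, versus the old 10000.0 Llama 2 value) only when
    # the key is absent.
    arch = p.get("general.architecture", "llama")
    return float(p.get(f"{arch}.rope.freq_base", default))
```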
```diff
@@ -160,6 +157,23 @@ def prefill(self):
         attention_mask = replicate(attention_mask, tp)
         seq_block_ids_tensor = replicate(seq_block_ids_tensor, tp)

+        if self.dump_bins:
```
You can use generate_data.py to fetch input data, given a model & a prompt.
That doesn't work for values not supported by numpy.
Can we update generate_data.py to use torch tensors instead of numpy arrays? That should work around this issue.
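For what it's worth, a small sketch of the torch-based dump being suggested (file names are placeholders): torch can hold fp8 tensors, while numpy has no fp8 dtype.

```python
import torch

t = torch.randn(2, 3).to(torch.float8_e4m3fnuz)

# Option 1: pickle the tensor directly; round-trips the fp8 dtype.
torch.save(t, "prefill_arg0.pt")

# Option 2: dump raw bytes for tools that expect flat .bin files, reinterpreting
# the 1-byte fp8 elements as uint8 first (numpy cannot represent fp8 itself).
t.view(torch.uint8).numpy().tofile("prefill_arg0.bin")
```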
```diff
-        return matmul(x_layout.qs, weight_layout.qs, transpose_rhs=True).to(
-            torch.float16
-        )
+        return matmul(x_layout.qs, weight_layout.qs, transpose_rhs=True)
```
If we do a return here, we're not actually inserting the mmt kernel, which is what's intended from this script. This function is called when the input tensors are quantized, and at least punet still expects to have the quantized kernel inserted if and when that happens.
What does the mmt kernel do? Because we can just lower the fp8 matmul with torch.
You can run `pytest sharktank/tests/models/llama/quark_parity_test.py` on MI300X right now. I plan to make it run anywhere in a follow-up patch.
Mostly looks good to me, some minor nit comments
It would be great if you could add to the documentation so it's easy for anyone else to pick up and test/use fp8 models.
I'll add documentation on halo models this afternoon.
10000 is the default for Llama 2.
This patch enables the use of Quark-quantized models of the latest generation. Many changes were required to reach parity with the source model, which is very sensitive to any numerical fluctuations. A test has been added to maintain this parity, but it is disabled until I can get the relevant data set up on the CI machine.
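As an illustration only (the env var, paths, and tolerances below are assumptions, not the actual quark_parity_test), one common way to keep such a test checked in but skipped until the reference data exists on the CI machine:

```python
import os

import pytest
import torch

# Hypothetical location of the golden data; not the PR's actual configuration.
GOLDEN_DIR = os.environ.get("QUARK_PARITY_DATA")


@pytest.mark.skipif(
    GOLDEN_DIR is None or not os.path.isdir(GOLDEN_DIR),
    reason="quark parity reference data is not set up on this machine",
)
def test_quark_parity():
    reference = torch.load(os.path.join(GOLDEN_DIR, "reference_logits.pt"))
    ours = torch.load(os.path.join(GOLDEN_DIR, "sharktank_logits.pt"))
    torch.testing.assert_close(ours, reference, atol=1e-2, rtol=1e-2)
```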