Subclass API (#966) #995
Summary: Adds new int8_dynamic_activation_intx_weight quantization with subclass API.
Differential Revision: D62464487
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/995
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 41a40cb with merge base 09b8b3c:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D62464487
@@ -300,7 +300,7 @@ def _quantize_affine_no_dtype_cast(
 elif zero_point_domain is None:
 # This case handles quantization for float8 we expect no zero point and no zero point domain
 assert zero_point is None, "zero_point should be None when zero_point_domain is None"
-quant = torch.clamp(input * scale.reciprocal(), quant_min, quant_max)
+quant = torch.clamp(torch.round(input * (1.0 / scale)), quant_min, quant_max)
@jerryzh168, please confirm whether this is OK. It was needed to match the behavior of the other quantizer.
Hmm, I think it might be fine as long as all the tests pass.
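To make the effect of the added rounding concrete, here is a minimal sketch in plain Python (lists stand in for tensors so it runs without PyTorch). The helper name `quantize_scale_only` is hypothetical and not part of torchao; it only mirrors the two expressions in the hunk above. Python's built-in `round` uses round-half-to-even, the same tie rule as `torch.round`.

```python
def quantize_scale_only(values, scale, quant_min, quant_max, use_round):
    """Scale-only quantization of a list of floats (hypothetical helper).

    use_round=False mirrors the old expression: clamp(input * scale.reciprocal()).
    use_round=True mirrors the new one: clamp(round(input * (1.0 / scale))).
    """
    result = []
    for value in values:
        q = value * (1.0 / scale)
        if use_round:
            q = round(q)  # snap to the nearest integer level before clamping
        result.append(min(max(q, quant_min), quant_max))
    return result


values = [1.3, -3.7, 10.0]
without_round = quantize_scale_only(values, 0.5, -8, 7, use_round=False)
with_round = quantize_scale_only(values, 0.5, -8, 7, use_round=True)
```

Without rounding, in-range values stay fractional after the clamp; with rounding, every output lands on an integer level. For float8 targets the later dtype cast may perform its own rounding, so whether the explicit round is appropriate there is exactly what the thread is asking.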
if preserve_zero:
zero_point = quant_min - torch.round(min_val_neg / scale)
zero_point = torch.clamp(zero_point, quant_min, quant_max)
if zero_point_domain is None:
@jerryzh168, please confirm whether this is OK. It was needed to get scale-only quantization in affine_quantized_tensor.
OK, should zero_point be None here?
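One possible reading of the question above, sketched in plain Python on scalars: when `zero_point_domain` is `None`, return only a scale and `None` for the zero point. The function name `choose_qparams_sketch` is hypothetical, and the symmetric scale formula used in the scale-only branch is an assumption, not taken from the PR. The other branch follows the `preserve_zero` lines in the diff.

```python
def choose_qparams_sketch(min_val, max_val, quant_min, quant_max, zero_point_domain):
    """Return (scale, zero_point) for one group of values (hypothetical helper).

    Assumes the value range is non-degenerate, so scale is never zero.
    """
    min_val_neg = min(min_val, 0.0)
    max_val_pos = max(max_val, 0.0)

    if zero_point_domain is None:
        # Scale-only: assumed symmetric range around zero, no zero point at all.
        scale = max(-min_val_neg, max_val_pos) / ((quant_max - quant_min) / 2)
        return scale, None

    # Asymmetric case, following the preserve_zero lines shown in the diff.
    scale = (max_val_pos - min_val_neg) / (quant_max - quant_min)
    zero_point = quant_min - round(min_val_neg / scale)
    zero_point = min(max(zero_point, quant_min), quant_max)
    return scale, zero_point


scale_only = choose_qparams_sketch(-1.0, 3.0, -8, 7, zero_point_domain=None)
with_zero_point = choose_qparams_sketch(-1.0, 3.0, -8, 7, zero_point_domain="INT")
```

Returning `None` rather than a tensor of zeros would keep this consistent with the assertion in `_quantize_affine_no_dtype_cast` that `zero_point` is `None` whenever `zero_point_domain` is `None`.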
exported = torch.export.export(model, (activations,))

print("Compiling quantized model")
compiled = torch.compile(unwrapped_model)
@jerryzh168 do you see unification for compile and export coming soon? The fact that one requires an unwrapped tensor subclass and the other requires a wrapped one makes using this API inconvenient in torchchat.
Yes, it's blocked by pytorch/pytorch#129682, and I heard @tugsbayasgalan is working on this.
@kimishpatel @jerryzh168 moving review over to GH. I hope I've addressed most of your concerns. @jerryzh168, the fact that compile and export cannot handle the same model (export requires an unwrapped tensor subclass, compile requires a wrapped one, and eager can handle both) makes using this API inconvenient in torchchat. Do you know if there is planned unification there?
input_tensor = input_tensor.reshape(-1, m, k)

res = [
_impl_2d(input_tensor[i, :, :], weight_tensor)
Why are you doing it like this? You can just fuse the first N dims. Line 379 should be
input_tensor = input_tensor.reshape(-1, k)
no?
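A small sketch of the suggestion above, using nested Python lists in place of tensors so it runs without PyTorch. All three function names are hypothetical. `linear_2d` stands in for `_impl_2d`, and `linear_fused` plays the role of `reshape(-1, k)` followed by a single 2-D call and a reshape back.

```python
def linear_2d(rows, weight):
    """rows: list of length-k rows; weight: n rows of length k. Returns len(rows) x n."""
    return [[sum(x * w for x, w in zip(row, w_row)) for w_row in weight] for row in rows]


def linear_batched_loop(batch, weight):
    # One 2-D call per leading index, as in the diff under review.
    return [linear_2d(matrix, weight) for matrix in batch]


def linear_fused(batch, weight):
    # Reviewer's suggestion: collapse the leading dims into one, make a single
    # 2-D call, then restore the original leading shape.
    m = len(batch[0])
    flat_rows = [row for matrix in batch for row in matrix]
    flat_out = linear_2d(flat_rows, weight)
    return [flat_out[i:i + m] for i in range(0, len(flat_out), m)]


batch = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # shape (2, 2, 2)
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # shape (3, 2)
looped = linear_batched_loop(batch, weight)
fused = linear_fused(batch, weight)
```

Both paths produce the same values because each output row depends only on its own input row. The fused form makes one kernel call instead of one per batch entry, which is presumably the motivation for the comment.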
# TODO(T200095131): convert self.n, self.k, self.group_size to
# int when supported by AOTI
args = (
input_tensor,
Does this expect input_tensor to be of fp32 type? Would it work with bf16? If not, we should assert on the dtype.
I can add an assert here too, although this is already asserted in the C++ layer. In general, I like asserting in the C++ layer because if the op is expanded to support BF16 in the future, an assert in Python would not block that support from taking effect.
)

# Quantize activations
activation_scales, activation_zeros = choose_qparams_affine(
Dynamic quantization should be reusing the affine quantized tensor. Example:
ao/torchao/quantization/quant_api.py
Line 586 in 900f9ac
def int8_dynamic_activation_int8_weight(layout_type=PlainLayoutType()):
Why is this calling these functions here?

# This format is intended for use with int8 dynamic quantization
class IntxWeightLayoutType(LayoutType):
Sorry, I still find this name not descriptive. What kernels is this layout targeting? Are these executorch native kernels? If so, maybe IntxExecutorchLayout or similar would be more helpful.
n, k_ = weight_tensor.shape
assert k_ == k

weights_dequantized = dequantize_per_channel_group(
I think we tend to use quantize_affine/dequantize_affine. Also, this should probably be:
weights_dequantized = weight_tensor.dequantize()
?
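For readers unfamiliar with group-wise dequantization, the arithmetic behind a call like this can be sketched in plain Python. The helper `dequantize_groupwise` is hypothetical and is not the torchao implementation; it assumes one scale and one zero point per group of `group_size` consecutive elements in each row, applied as `(q - zero_point) * scale`.

```python
def dequantize_groupwise(q_rows, scales, zeros, group_size):
    """Dequantize an (n, k) integer matrix with per-row groups (hypothetical helper).

    scales and zeros each have shape (n, k // group_size).
    """
    out = []
    for q_row, scale_row, zero_row in zip(q_rows, scales, zeros):
        if len(q_row) % group_size != 0:
            raise ValueError("row length must be a multiple of group_size")
        out.append([
            (q - zero_row[j // group_size]) * scale_row[j // group_size]
            for j, q in enumerate(q_row)
        ])
    return out


dequantized = dequantize_groupwise(
    q_rows=[[1, 2, 3, 4]],
    scales=[[0.5, 2.0]],
    zeros=[[0, 1]],
    group_size=2,
)
```

A tensor subclass that already stores its scales, zero points, and block size can do this internally, which is why calling `weight_tensor.dequantize()` avoids repeating those arguments at the call site.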
return isinstance(layout_type, IntxWeightLayoutType) and bias is None


def _linear_int8_dynamic_activation_intx_weight_fallback_impl(
See the fallback here:
ao/torchao/dtypes/affine_quantized_tensor.py
Lines 1760 to 1764 in 900f9ac
if isinstance(input_tensor, AffineQuantizedTensor):
    input_tensor = input_tensor.dequantize()
if isinstance(weight_tensor, AffineQuantizedTensor):
    weight_tensor = weight_tensor.dequantize()
return torch.nn.functional.linear(input_tensor, weight_tensor, bias)
assert len(weight_tensor.block_size) == 2
assert weight_tensor.block_size[0] == 1
group_size = weight_tensor.block_size[1]
assert group_size == weight_tensor.layout_tensor.layout_type.group_size
This can probably be weight_tensor.layout_type.group_size (although we are renaming layout_type to layout now).
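The assertions in this hunk encode the convention that a block size of `(1, group_size)` means each row is split into groups of `group_size` consecutive elements. A minimal plain-Python sketch of that check follows. The helper name `group_size_from_block_size` is hypothetical, and the derived scale shape is an assumption about how per-group parameters are laid out.

```python
def group_size_from_block_size(block_size, weight_shape):
    """Validate a (1, group_size) block size against an (n, k) weight shape.

    Returns (group_size, scales_shape), where scales_shape is (n, k // group_size).
    Hypothetical helper, for illustration only.
    """
    if len(block_size) != 2 or block_size[0] != 1:
        raise ValueError("expected block_size of the form (1, group_size)")
    group_size = block_size[1]
    n, k = weight_shape
    if k % group_size != 0:
        raise ValueError("k must be divisible by group_size")
    return group_size, (n, k // group_size)


group_size, scales_shape = group_size_from_block_size((1, 32), (4, 64))
```

Raising an exception with a message, instead of using a bare `assert`, keeps the check active when Python runs with `-O`. That is a design choice for this sketch, not something the PR does.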