From 282850d2c40933805f30e4c904eeb583615f7cef Mon Sep 17 00:00:00 2001
From: Jing Xu
Date: Mon, 17 Feb 2025 14:29:51 +0900
Subject: [PATCH] fix docstring format issue (#3515)

---
 cpu/2.6.0+cpu/tutorials/api_doc.html | 86 +++++++++++++---------
 1 file changed, 41 insertions(+), 45 deletions(-)

diff --git a/cpu/2.6.0+cpu/tutorials/api_doc.html b/cpu/2.6.0+cpu/tutorials/api_doc.html
index f53b29762..47133f924 100644
--- a/cpu/2.6.0+cpu/tutorials/api_doc.html
+++ b/cpu/2.6.0+cpu/tutorials/api_doc.html
@@ -1413,65 +1413,61 @@

 Graph Optimization

-ipex.quantization.get_weight_only_quant_qconfig_mapping(*, weight_dtype: int = WoqWeightDtype.INT8, lowp_mode: int = WoqLowpMode.NONE, act_quant_mode: int = WoqActQuantMode.PER_BATCH_IC_BLOCK_SYM, group_size: int = -1, weight_qscheme: int = WoqWeightQScheme.UNDEFINED)
-
-Configuration for weight-only quantization (WOQ) for LLM.
-:param weight_dtype: Data type for weight, WoqWeightDtype.INT8/INT4/NF4, etc.
-:param lowp_mode: specify the lowest precision data type for computation. Data types
-that has even lower precision won’t be used.
-Not necessarily related to activation or weight dtype.
-- NONE(0): Use the activation data type for computation.
-- FP16(1): Use float16 (a.k.a. half) as the lowest precision for computation.
-- BF16(2): Use bfloat16 as the lowest precision for computation.
-- INT8(3): Use INT8 as the lowest precision for computation.
-Activation is quantized to int8 at runtime in this case.
+intel_extension_for_pytorch.quantization.get_weight_only_quant_qconfig_mapping(*, weight_dtype: int = WoqWeightDtype.INT8, lowp_mode: int = WoqLowpMode.NONE, act_quant_mode: int = WoqActQuantMode.PER_BATCH_IC_BLOCK_SYM, group_size: int = -1, weight_qscheme: int = WoqWeightQScheme.UNDEFINED)
+
+Configuration for weight-only quantization (WOQ) for LLM.
 Parameters:

-  • act_quant_mode – Quantization granularity of activation. It only works for lowp_mode=INT8.
+  • weight_dtype – Data type for weight, WoqWeightDtype.INT8/INT4/NF4, etc.
+
+  • lowp_mode – Specify the lowest precision data type for computation. Data types
+    that have even lower precision won’t be used.
+    Not necessarily related to activation or weight dtype.
+
+    • NONE(0): Use the activation data type for computation.
+    • FP16(1): Use float16 (a.k.a. half) as the lowest precision for computation.
+    • BF16(2): Use bfloat16 as the lowest precision for computation.
+    • INT8(3): Use INT8 as the lowest precision for computation.
+      Activation is quantized to int8 at runtime in this case.
+
+  • act_quant_mode – Quantization granularity of activation. It only works for lowp_mode=INT8.
     It has no effect in other cases. The tensor is divided into groups, and
     each group is quantized with its own quantization parameters.
-Suppose the activation has shape batch_size by input_channel (IC).
-- PER_TENSOR(0): Use the same quantization parameters for the entire tensor.
-- PER_IC_BLOCK(1): Tensor is divided along IC with group size = IC_BLOCK.
-- PER_BATCH(2): Tensor is divided along batch_size with group size = 1.
-- PER_BATCH_IC_BLOCK(3): Tenosr is divided into blocks of 1 x IC_BLOCK.
-Note that IC_BLOCK is determined by group_size automatically.
+    Suppose the activation has shape batch_size by input_channel (IC).
+
+    • PER_TENSOR(0): Use the same quantization parameters for the entire tensor.
+    • PER_IC_BLOCK(1): Tensor is divided along IC with group size = IC_BLOCK.
+    • PER_BATCH(2): Tensor is divided along batch_size with group size = 1.
+    • PER_BATCH_IC_BLOCK(3): Tensor is divided into blocks of 1 x IC_BLOCK.
+
+    Note that IC_BLOCK is determined by group_size automatically.
   • group_size – Control quantization granularity along input channel (IC) dimension of weight.
-Must be a positive power of 2 (i.e., 2^k, k > 0) or -1.
-If group_size = -1:
-If act_quant_mode = PER_TENSOR ro PER_BATCH:
-No grouping along IC for both activation and weight
-If act_quant_mode = PER_IC_BLOCK or PER_BATCH_IC_BLOCK:
-No grouping along IC for weight. For activation,
-IC_BLOCK is determined automatically by IC.
-If group_size > 0:
-act_quant_mode can be any. If act_quant_mode is PER_IC_BLOCK(_SYM)
-or PER_BATCH_IC_BLOCK(_SYM), weight is grouped along IC by group_size.
-The IC_BLOCK for activation is determined by group_size automatically.
-Each group has its own quantization parameters.
+    Must be a positive power of 2 (i.e., 2^k, k > 0) or -1. The rule is
+
+    If group_size = -1:
+        If act_quant_mode = PER_TENSOR or PER_BATCH:
+            No grouping along IC for both activation and weight
+        If act_quant_mode = PER_IC_BLOCK or PER_BATCH_IC_BLOCK:
+            No grouping along IC for weight. For activation,
+            IC_BLOCK is determined automatically by IC.
+    If group_size > 0:
+        act_quant_mode can be any. If act_quant_mode is PER_IC_BLOCK(_SYM)
+        or PER_BATCH_IC_BLOCK(_SYM), weight is grouped along IC by group_size.
+        The IC_BLOCK for activation is determined by group_size automatically.
+        Each group has its own quantization parameters.

   • weight_qscheme – Specify how to quantize weight, asymmetrically or symmetrically.
     Generally, asymmetric quantization has better accuracy than symmetric quantization
     at the cost of performance. Symmetric quantization is faster but may have worse
     accuracy. Default is undefined and determined by weight dtype: asymmetric in most
     cases and symmetric if

       1. weight_dtype is NF4, or
       2. weight_dtype is INT8 and lowp_mode is INT8.

    One must use WoqWeightQScheme.SYMMETRIC in the above two cases.
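
For intuition about the asymmetric-versus-symmetric trade-off described above, here is a
toy comparison of both schemes on one tensor. It is plain PyTorch arithmetic using the
standard affine-quantization formulas, written for illustration only; it is not IPEX's
internal kernel, and the 8x16 tensor is an arbitrary example.

    # Symmetric vs. asymmetric int8 quantization of one weight tensor.
    import torch

    w = torch.randn(8, 16)

    # Symmetric: zero point fixed at 0, scale taken from max |w|.
    s_sym = w.abs().max() / 127.0
    q_sym = torch.clamp((w / s_sym).round(), -127, 127)

    # Asymmetric: scale and zero point cover the full [min, max] range,
    # tracking skewed distributions more tightly (better accuracy, but
    # extra zero-point arithmetic at runtime).
    s_asym = (w.max() - w.min()) / 255.0
    zp = (-w.min() / s_asym).round()
    q_asym = torch.clamp((w / s_asym).round() + zp, 0, 255)

    print((w - q_sym * s_sym).abs().max())            # symmetric round-trip error
    print((w - (q_asym - zp) * s_asym).abs().max())   # asymmetric round-trip error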

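To make the corrected docstring concrete, here is a minimal usage sketch of the API it
documents. get_weight_only_quant_qconfig_mapping and the Woq* enums come from the
signature above; the prepare/convert flow is the standard intel_extension_for_pytorch
quantization path. The toy Linear model, the input shape, and the chosen values (INT4
weights, BF16 compute, group_size=128) are illustrative assumptions, not part of this
patch.

    # Sketch: weight-only quantization via the qconfig mapping documented above.
    import torch
    from intel_extension_for_pytorch.quantization import (
        WoqWeightDtype,   # INT8 / INT4 / NF4 weight storage dtypes
        WoqLowpMode,      # NONE / FP16 / BF16 / INT8 compute precision
        get_weight_only_quant_qconfig_mapping,
        prepare,
        convert,
    )

    model = torch.nn.Linear(4096, 4096).eval()  # placeholder for an LLM block

    # INT4 weights, bfloat16 as the lowest compute precision, and weight grouped
    # along the input-channel (IC) dimension in blocks of 128 (a positive power of 2).
    qconfig_mapping = get_weight_only_quant_qconfig_mapping(
        weight_dtype=WoqWeightDtype.INT4,
        lowp_mode=WoqLowpMode.BF16,
        group_size=128,
    )

    example_input = torch.randn(1, 4096)
    prepared = prepare(model, qconfig_mapping, example_inputs=example_input)
    quantized = convert(prepared)  # weights are quantized at convert time

    with torch.no_grad():
        print(quantized(example_input).shape)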
@@ -1781,4 +1777,4 @@

Graph Optimization
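
Finally, as a companion to the act_quant_mode and group_size rules in the hunk above,
this toy computation shows how PER_BATCH_IC_BLOCK granularity carves an activation of
shape (batch_size, IC) into 1 x IC_BLOCK groups, each with its own scale. Plain PyTorch,
assuming symmetric int8 quantization and IC_BLOCK equal to a group_size of 128; it is
not IPEX's actual kernel.

    # PER_BATCH_IC_BLOCK(3): one scale per 1 x IC_BLOCK block of the activation.
    import torch

    batch_size, ic, ic_block = 4, 1024, 128   # assume IC_BLOCK = group_size = 128
    x = torch.randn(batch_size, ic)

    # Last dim becomes one IC_BLOCK; each (row, block) pair is a quantization group.
    groups = x.reshape(batch_size, ic // ic_block, ic_block)

    # Symmetric int8: map the max |value| of each group to 127.
    scales = groups.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp((groups / scales).round(), -127, 127).to(torch.int8)

    print(scales.shape)  # torch.Size([4, 8, 1]): one scale per group
    print(q.shape)       # torch.Size([4, 8, 128])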