Releases · intel/neural-compressor
Intel® Neural Compressor v3.1 Release
- Highlights
- Features
- Improvements
- Validated Hardware
- Validated Configurations
Highlights
- Aligned with the Habana 1.18 release, bringing improvements to FP8 and INT4 quantization for Intel® Gaudi® AI accelerators
- Provided a Transformers-like quantization API for weight-only quantization on LLMs, which offers Transformers-based users a one-stop experience for quantization & inference with IPEX on Intel GPU and CPU (see the sketch after this list)
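A minimal sketch of that Transformers-like flow for INT4 weight-only quantization; the class and argument names below follow the API as described in this release and should be read as assumptions, not a verified reference:

```python
# Sketch: Transformers-like weight-only INT4 quantization (assumed API).
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "facebook/opt-125m"                # any HF causal-LM checkpoint
woq_config = RtnConfig(bits=4, group_size=128)  # argument names assumed

# Quantize during load; inference dispatches to IPEX kernels on Intel CPU/GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=woq_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Intel Neural Compressor is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```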
Features
- Add Transformers-like quantization API for weight-only quantization on LLMs
- Support fast quantization with a lightweight recipe and layer-wise approach on Intel AI PC
- Support INT4 quantization of Visual Language Models (VLMs) such as Llava, Phi-3-vision, and Qwen-VL with the AutoRound algorithm (see the sketch after this list)
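For reference, the AutoRound flow this INT4 support builds on looks roughly like the following; it is shown on a text LLM for brevity, argument names follow the auto-round README, and the checkpoint name is only an example:

```python
# Sketch: INT4 quantization with AutoRound via the auto-round package that
# INC integrates; the VLM flow in this release follows the same pattern.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2-0.5B"  # illustrative; a VLM such as Qwen-VL plugs in similarly
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)  # common knobs
autoround.quantize()
autoround.save_quantized("./qwen2-0.5b-int4")  # auto-round export format
```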
Improvements
- Support AWQ-format INT4 model loading and conversion for IPEX inference in the Transformers-like API
- Enable AutoRound-format export for INT4 models
- Support per-channel INT8 Post Training Quantization for PT2E (see the sketch after this list)
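The PT2E path builds on PyTorch's torch.export-based quantization. Below is a minimal sketch of the underlying stock-PyTorch flow; INC layers its own configs (including the per-channel INT8 support above) on top of a flow like this, and the capture API has moved between PyTorch releases, so the imports are version-dependent:

```python
# Sketch: PT2E static INT8 post-training quantization with stock PyTorch APIs.
# capture_pre_autograd_graph is the PyTorch 2.3/2.4 capture entry point.
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(2, 16),)

exported = capture_pre_autograd_graph(model, example_inputs)
quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)           # calibration pass
quantized = convert_pt2e(prepared)  # INT8 reference model
```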
Validated Hardware
- Intel Gaudi AI Accelerators (Gaudi 2 and 3)
- Intel Xeon Scalable processor (4th, 5th, 6th Gen)
- Intel Core Ultra Processors (Series 1 and 2)
- Intel Data Center GPU Max Series (1100)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11
- Python 3.9, 3.10, 3.11, 3.12
- PyTorch/IPEX 2.2, 2.3, 2.4
Intel® Neural Compressor v3.0 Release
- Highlights
- Features
- Improvements
- Examples
- Bug Fixes
- Documentations
- Validated Configurations
Highlights
- FP8 quantization and INT4 model loading support on Intel® Gaudi® AI accelerator
- Framework extension API for quantization, mixed-precision and benchmarking
- Accuracy-aware FP16 mixed precision support on Intel® Xeon® 6 Processors
- Performance optimizations and usability improvements on client-side quantization
Features
- [Quantization] Support FP8 quantization on Gaudi (95197d)
- [Quantization] Support INC and Hugging Face model loading on framework extension API for PyTorch (0eced1, bacc16)
- [Quantization] Support Weight-only Quantization on framework extension API for PyTorch (34f0a9, de43d8, 4a4509, 1a4509, a3a065, 1386ac, a0dee9, 503d9e, 84d705, 099b7a, e3c736, e87c95, 2694bb, ec49a2, e7b4b6, a9bf79, ac717b, 915018, 8447d7, dc9328)
- [Quantization] Support static and dynamic quantization in PT2E path (7a4715, 43c358, 30b36b, 1f58f0, 02958d)
- [Quantization] Support SmoothQuant and static quantization in IPEX path with framework extension API (53e6ee, 72fbce, eaa3a5, 95e67e, 855c10, 9c6102, 5dafe5, a5e5f5, 191383, 776645)
- [Quantization] Support Layer-wise Quantization for RTN/GPTQ on framework extension API for PyTorch (649e6b)
- [Quantization] Support Post Training Quantization on framework extension API for TensorFlow (6c27c1, e22c61, f21afb, 3882e9, 2627d3)
- [Quantization] Support Post Training Quantization on Keras 3 (f67e86, 047560)
- [Quantization] Support Weight-only Quantization on Gaudi2 (4b9b44, 14868c, 0a3d4b)
- [Quantization] Improve performance and usability of quantization procedure on client side (16a7b1)
- [Quantization] Support auto-device detection on framework extension API for PyTorch (368ba5, 4b9b44, e81a2d, 0a3d4b, 534300, 2a86ae)
- [Quantization] Support Microscaling (MX) Quant for PyTorch (4a24a6, 455f1e)
- [Quantization] Enable cross-device Half-Quadratic Quantization (HQQ) support for LLMs (db6164, 07f940)
- [Quantization] Support FP8 cast Weight-only Quantization (57ed61)
- [Mixed-Precision] Support FP16 mixed-precision on framework extension autotune API for PyTorch (2e1cdc)
- [Mixed-Precision] Support mixed INT8 with FP16 in PT2E path (fa961e)
- [AutoTune] Support accuracy-aware tuning on framework extension API (e97659, 7b8aec, 5a0374, a4675c, 3a254e, ac47d9, b8d98e, fb6142, fa8e66, d22df5, 09eb5d, c6a8fa) (see the sketch after this list)
- [Benchmarking] Implement `incbench` command for ease-of-use benchmark (2fc725)
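A sketch of the new accuracy-aware tuning entry point on the framework extension API; the `autotune`/`TuningConfig`/`RTNConfig` names follow the v3.0 documentation, but the exact argument spellings here should be treated as assumptions:

```python
# Sketch: accuracy-aware tuning over an RTN weight-only config space
# (framework extension API; argument names assumed from the v3.0 docs).
import torch
from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))

def eval_fn(model) -> float:
    # Stand-in metric (higher is better); replace with a real accuracy measure.
    with torch.no_grad():
        return -model(torch.ones(1, 64)).abs().mean().item()

# Tuning walks the cross-product of the listed options until eval_fn is satisfied.
tune_config = TuningConfig(
    config_set=[RTNConfig(use_sym=[False, True], group_size=[32, 64])]
)
best_model = autotune(model=fp32_model, tune_config=tune_config, eval_fn=eval_fn)
```

For benchmarking, the new `incbench` command wraps multi-instance runs, e.g. `incbench main.py` to benchmark a script on the local machine (see the v3.0 docs for the supported flags).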
Improvements
- [Quantization] Integrate AutoRound v0.3 (bfa27e, [fd9...
Intel® Neural Compressor v2.6 Release
- Highlights
- Features
- Improvements
- Examples
- Bug Fixes
- External Contributions
- Validated Configurations
Highlights
- Integrated recent AutoRound with lm-head quantization support and calibration process optimizations
- Migrated ONNX model quantization capability into the ONNX Neural Compressor project
Features
- [Quantization] Integrate recent AutoRound with lm-head quantization support and calibration process optimizations (4728fd)
- [Quantization] Support true sequential options in GPTQ (92c942)
Improvements
- [Quantization] Improve WOQ Linear pack/unpack speed with a NumPy implementation (daa143) (see the sketch after this list)
- [Quantization] Auto detect available device when exporting (7be355)
- [Quantization] Refine AutoRound export to support Intel GPU (409231)
- [Benchmarking] Detect the number of sockets when needed (e54b93)
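The pack/unpack speedup comes from replacing per-element Python loops with vectorized NumPy bit manipulation. A self-contained illustration of the idea (not INC's actual `WeightOnlyLinear` kernel, whose layout differs):

```python
# Illustration of vectorized INT4 pack/unpack with NumPy (concept only).
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed INT4 values in [-8, 7] pairwise into uint8."""
    u = (q + 8).astype(np.uint8)            # shift to unsigned [0, 15]
    return (u[..., 0::2] << 4) | u[..., 1::2]

def unpack_int4(p: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4."""
    hi, lo = p >> 4, p & 0x0F
    u = np.stack([hi, lo], axis=-1).reshape(*p.shape[:-1], -1)
    return u.astype(np.int8) - 8

w = np.random.randint(-8, 8, size=(4, 8), dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)
```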
Examples
- Upgrade lm_eval to 0.4.2 in PT and ORT LLM examples (fdb509, 54f039)
- Add diffusers/dreambooth example with IPEX (ba4798)
Bug Fixes
- Fix incorrect dtype of unpacked tensor issue in PT (29fdec)
- Fix TF LLM SQ legacy Keras environment variable issue (276449)
- Fix TF estimator issue by adding version check on TF2.16 (855b98)
- Fix missing tokenizer issue in run_clm_no_trainer.py after using lm-eval 0.4.2 (d64029)
- Fix AWQ padding issue in ORT (903da4)
- Fix recover function issue in ORT (ee24db)
- Update model ckpt download url in prepare_model.py (0ba573)
- Fix case where pad_max_length set to None (960bd2)
- Fix a failure for GPU backend (71a9f3)
- Fix numpy versions for rnnt and 3d-unet examples (12b8f4)
- Fix CVEs (5b5579, 25c71a, 47d73b, 41da74)
External Contributions
- Update model ckpt download url in prepare_model.py (0ba573)
- Fix case where pad_max_length set to None (960bd2)
- Add diffusers/dreambooth example with IPEX (ba4798)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- PyTorch/IPEX 2.1, 2.2, 2.3
- TensorFlow 2.14, 2.15, 2.16
- ITEX 2.13.0, 2.14.0, 2.15.0
- ONNX Runtime 1.16, 1.17, 1.18
Intel® Neural Compressor v2.5.1 Release
- Improvement
- Bug Fixes
- Validated Configurations
Improvement
- Improve WOQ AutoRound export (409231, 7ee721)
- Adapt ITREX v1.4 release for example evaluation (9d7a05)
- Update more supported LLM recipes (ce9b16)
Bug Fixes
- Fix WOQ RTN supported layer checking condition (079177)
- Fix in-place processing error in quant_weight function (92533a)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.15
- ITEX 2.14.0
- PyTorch/IPEX 2.2
- ONNX Runtime 1.17
Intel® Neural Compressor v2.5 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- External Contributions
- Validated Configurations
Highlights
- Integrated Weight-Only Quantization algorithm AutoRound and verified it on Gaudi2, Intel CPU, and NVIDIA GPU
- Applied SmoothQuant & Weight-Only Quantization algorithms to 15+ popular LLMs for INT8 & INT4 quantization and published the recipes (see the sketch after this list)
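The published recipes drive the 2.x `PostTrainingQuantConfig` entry point. A minimal SmoothQuant sketch follows; the toy model, calibration data, and `alpha=0.5` are illustrative stand-ins, and the per-LLM recipes pin model-specific alpha values (or use `"auto"`):

```python
# Sketch: INT8 SmoothQuant with the INC 2.x API; alpha=0.5 is illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

fp32_model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
calib_data = TensorDataset(torch.randn(32, 16), torch.zeros(32))  # (input, label)
calib_dataloader = DataLoader(calib_data, batch_size=8)

conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}}
)
q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
```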
Features
- [Quantization] Integrate Weight-Only Quantization algorithm AutoRound (5c7f33, dfd083, 9a7ddd, cf1de7)
- [Quantization] Quantize weight with in-place mode in Weight-Only Quantization (deb1ed)
- [Pruning] Enable SNIP on multiple cards using DeepSpeed ZeRO-3 (49ab28)
- [Pruning] Support new pruning approaches Wanda and DSNOT for PyTorch LLM (7a3671)
Improvement
- [Quantization] SmoothQuant code structure refactor (a8d81c)
- [Quantization] Optimize the workflow of parsing Keras model (b816d7)
- [Quantization] Support static_groups options in GPTQ API (1c426a)
- [Quantization] Update TEQ train dataloader (d1e994)
- [Quantization] WeightOnlyLinear keeps self.weight after recover (2835bd)
- [Quantization] Add version condition for IPEX prepare init (d96e14)
- [Quantization] Enhance the ORT node name checking (f1597a)
- [Pruning] Stop the tuning process early when enabling smooth quant (844a03)
Productivity
- ORT LLM examples support latest optimum version (26b260)
- Add coding style docs and recommended VS Code setting (c1f23c)
- Adapt transformers 4.37 loading (6133f4)
- Upgrade pre-commit checker for black/blacken-docs/ruff (7763ed)
- Support CI summary in PR comments (d4bcdd)
- Notebook example update to install latest INC & TF, add metric in fit (4239d3)
Bug Fixes
- Fix QA IPEX example fp32 input issue (c4de19)
- Update conditions of getting min-max during TF MatMul requantize (d07175)
- Fix TF saved_model issues (d8e60b)
- Fix comparison of module_type and MulLinear (ba3aba)
- Fix ORT calibration issue (cd6d24)
- Fix ORT example bart export failure (b0dc0d)
- Fix TF example accuracy diff during benchmark and quantization (5943ea)
- Fix bugs for GPTQ exporting with static_groups (b4e37b)
- Fix ORT quant issue caused by tensors having same name (0a20f3)
- Fix Neural Solution SQL/CMD injection (14b7b0)
- Fix the best qmodel recovery issue (f2d9b7)
- Fix logger issue (83bc77)
- Store token in protected file (c6f9cc)
- Define the default SSL context (b08725)
- Fix IPEX stats bug (5af383)
- Fix ORT calibration for Dml EP (c58aea)
- Fix wrong socket number retrieval for non-English systems (5b2a88)
- Fix trust remote for llm examples (2f2c9a)
External Contributions
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 2.13.0, 2.14.0
- PyTorch/IPEX 2.0, 2.1, 2.2
- ONNX Runtime 1.15, 1.16, 1.17
Intel® Neural Compressor v2.4.1 Release
- Improvement
- Bug Fixes
- Examples
- Validated Configurations
Improvement
- Narrow down the tuning space of SmoothQuant auto-tune (9600e1)
- Support ONNXRT Weight-Only Quantization with different dtypes (5119fc)
- Add progress bar for ONNXRT Weight-Only Quantization and SmoothQuant (4d26e3)
Bug Fixes
- Fix SmoothQuant alpha-space generation (33ece9)
- Fix inputs error for SmoothQuant example_inputs (39f63a)
- Fix LLMs accuracy regression with IPEX 2.1.100 (3cb6d3)
- Fix quantizable add ops detection on IPEX backend (4c004d)
- Fix range step bug in ORTSmoothQuant (40275c)
- Fix unit test bugs and update CI versions (6c78df, 835805)
- Fix notebook issues (08221e)
Examples
- Add verified LLMs list and recipes for SmoothQuant and Weight-Only Quantization (f19cc9)
- Add code-generation evaluation for Weight-Only Quantization GPTQ (763440)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.14
- ITEX 2.14.0.1
- PyTorch/IPEX 2.1.0
- ONNX Runtime 1.16.3
Intel® Neural Compressor v2.4 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization
- Supported Weight-Only Quantization tuning for the ONNX Runtime backend
- Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with the FW extension API (see the sketch after this list)
- Supported SmoothQuant of big SavedModel for the TensorFlow backend
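GGML-style double quant shrinks quantization metadata by quantizing the per-group scales themselves against a single higher-precision super-scale. A self-contained NumPy illustration of the idea follows; this is a concept sketch, not INC's or GGML's actual layout:

```python
# Concept sketch of "double quant": quantize weights to INT4 per group, then
# quantize the FP scales to INT8 against one FP super-scale per tensor.
import numpy as np

def double_quant(w: np.ndarray, group: int = 32):
    w = w.reshape(-1, group)
    scales = np.abs(w).max(axis=1) / 7.0               # per-group INT4 scale
    q = np.clip(np.round(w / scales[:, None]), -8, 7).astype(np.int8)
    super_scale = np.abs(scales).max() / 127.0         # one FP scale overall
    q_scales = np.round(scales / super_scale).astype(np.int8)
    return q, q_scales, super_scale

def dequant(q, q_scales, super_scale):
    return q.astype(np.float32) * (q_scales.astype(np.float32) * super_scale)[:, None]

w = np.random.randn(4, 64).astype(np.float32)
q, qs, ss = double_quant(w)
err = np.abs(dequant(q, qs, ss).reshape(w.shape) - w).max()
print(f"max abs error: {err:.4f}")
```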
Features
- [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
- [Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
- [Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2)
- [Quantization] Support SmoothQuant of Big Saved Model for TensorFlow Backend (3b2925, 4f2c35)
- [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
- [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
- [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
- [Common] [Experimental] FW extension API implementation (76b8b3, 8447d7, 258236)
- [Quantization] [Experimental] FW extension API for PT backend support Weight-Only Quantization (915018, dc9328)
- [Quantization] [Experimental] FW extension API for TF backend support Keras Quantization (2627d3)
- [Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)
Improvement
- [Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
- [Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
- [Quantization] Support restoring IPEX model from JSON (c3214c)
- [Quantization] ONNX Runtime add attr to MatMulNBits (7057e3)
- [Quantization] Increase SmoothQuant auto alpha running speed (173c18)
- [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
- [Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
- [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
- [Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
- [Quantization] Support trace with dictionary type example_inputs (afe315)
- [Quantization] Support falcon Weight-Only Quantization (595d3a)
- [Common] Add deprecation decorator in experimental folder (aeb3ed)
- [Common] Remove 1.x API dependency (ee617a)
- [Mixed Precision] Support PyTorch eager mode BF16 MixedPrecision (3bfb76)
Productivity
- Support quantization and benchmark on macOS (16d6a0)
- Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
- Support TensorFlow new API for gnr-base (8160c7)
Bug Fixes
- Fix GraphModule object has no attribute bias (7f53d1)
- Fix ONNX model export issue (af0aea, eaa57f)
- Add clip for ONNX Runtime SmoothQuant (cbb69b)
- Fix SmoothQuant minmax observer init (b1db1c)
- Fix SmoothQuant issue in get/set_module (dffcfe)
- Align sparsity with block-wise masks in progressive pruning (fcdc29)
Examples
- Support peft model with SmoothQuant (5e21b7)
- Enable two ONNX Runtime examples: table-transformer-detection (550cee) and BEiT (7265df)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 10 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 1.2.0, 2.13.0.0, 2.14.0.1
- PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
- ONNX Runtime 1.14.1, 1.15.1, 1.16.3
- MXNet 1.9.1
Intel® Neural Compressor v2.3.2 Release
- Features
- Bug Fixes
Features
- Reduce memory consumption in ONNXRT adaptor (f64833)
- Support MatMulFpQ4 for onnxruntime 1.16 (1beb43)
- Support MatMulNBits for onnxruntime 1.17 (67a31b)
Bug Fixes
- Update ITREX version in ONNXRT WOQ example and fix bugs in hf models (0ca51a)
- Update ONNXRT WOQ example to llama-2-7b (7f2063)
- Fix ONNXRT WOQ failed with None model_path (cbd0a4)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.13
- ITEX 2.13
- PyTorch/IPEX 2.0.1+cpu
- ONNX Runtime 1.15.1
- MXNet 1.9.1
Intel® Neural Compressor v2.3.1 Release
- Bug Fixes
- Productivity
Bug Fixes
- Fix PyTorch SmoothQuant for auto alpha (e9c14a, 35def7)
- Fix PyTorch SmoothQuant calibration memory overhead (49e950)
- Fix PyTorch SmoothQuant issue in get/set_module (Issue #1265) (6de9ce)
- Support falcon Weight-Only Quantization (bf7b5c)
- Remove Conv2d in Weight-Only Quantization adaptor white list (1a6526)
- Fix TensorFlow ssd_resnet50_v1 Example for TF New API (c63fc5)
Productivity
- Adapt Example for TensorFlow 2.14 AutoTrackable API Change (424cf3)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.13, 2.14
- ITEX 2.13
- PyTorch/IPEX 2.0.1+cpu
- ONNX Runtime 1.15.1
- MXNet 1.9.1
Intel® Neural Compressor v2.3 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Integrated Intel Neural Compressor into MSFT ONNX Runtime (#16288) and Olive (#411, #412, #469)
- Supported low-precision data types (INT4, NF4, FP4) and Weight-Only Quantization algorithms (RTN, AWQ, GPTQ, TEQ) on ONNX Runtime and PyTorch for LLM optimization (see the sketch after this list)
- Supported sparseGPT pruner (88adfc)
- Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids)
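In the 2.x API, these WOQ algorithms are selected through `PostTrainingQuantConfig` with `approach="weight_only"`. A minimal RTN sketch follows; the `op_type_dict` key names follow the 2.3 documentation pattern and should be read as assumptions rather than a verified interface:

```python
# Sketch: 4-bit RTN weight-only quantization with the INC 2.x API.
# RTN needs no calibration data; AWQ/GPTQ/TEQ additionally take a dataloader.
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all supported op types
            "weight": {
                "bits": 4,          # INT4 weights
                "group_size": 32,   # per-group scales
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(fp32_model, conf)
```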
Features
- [Quantization] Support ONNX Runtime quantization and inference for DNNL EP (79be8b)
- [Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP (750bb9)
- [Quantization] Support low precision and Weight-Only Quantization (WOQ) algorithms, including RTN (501440, 19ab16, 859315), AWQ (2562f2, 641d42), GPTQ (b5ac3c, 6ba783) and TEQ (d2f995, 9ff7f0) for PyTorch
- [Quantization] Support NF4 and FP4 data type for PyTorch Weight-Only Quantization (3d11b5)
- [Quantization] Support low precision and Weight-Only Quantization algorithms, including RTN, AWQ and GPTQ for ONNX Runtime (da4c92)
- [Quantization] Support layer-wise quantization (d9d1fc) and enable with SmoothQuant (ec9ae9)
- [Pruning] Add sparseGPT pruner and refactor pruning class (88adfc)
- [Pruning] Add Hyper-parameter Optimization algorithm for pruning (6613cf)
- [Model Export] Support PT2ONNX dynamic quantization export (165532)
Improvement
- [Common] Clean up dataloader usage in examples (1044d8, a2931e, 447cc7)
- [Common] Enhance ONNX Runtime backend check (4ce9de)
- [Strategy] Add block-wise distributed fallback in basic strategy (ea309f)
- [Strategy] Enhance strategy exit policy (d19b42)
- [Quantization] Add WeightOnlyLinear for Weight-Only approach to allow low memory inference (00bbf8)
- [Quantization] Support more ONNX Runtime direct INT8 ops (b9ce61)
- [Quantization] Support TensorFlow per-channel MatMul quantization (cf5589)
- [Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant (084eda)
- [Quantization] Enhance ONNX SmoothQuant tuning structure (f0d51c)
- [Quantization] Enhance PyTorch SmoothQuant tuning structure (81da40)
- [Quantization] Update PyTorch examples dataloader to support transformers 4.31.x (59371f)
- [Quantization] Enhance ONNX Runtime backend setting for GPU EP support (295535)
- [Pruning] Refactor pruning (92d14d)
- [Mixed Precision] Update the list of supported layers for Keras mix-precision (692c8b)
- [Mixed Precision] Introduce quant_level into mixed precision (0dc6a9)
Productivity
- [Ecosystem] MSFT Olive integrates SmoothQuant and 3 LLM examples (#411, #412, #469)
- [Ecosystem] MSFT ONNX Runtime integrates SmoothQuant static quantization (#16288)
- [Neural Insights] Support PyTorch FX inspect tensor and integrate with Neural Insights (775def, 74a785)
- [Neural Insights] Add step-by-step diagnosis cases (99c3b0)
- [Neural Solution] Resource management and user-facing API enhancement (fbba10)
- [Auto CI] Integrate auto CI code scan bug fix tools (f77a2c, 06cc38)
Bug Fixes
- Fix bugs in PyTorch SmoothQuant (0349b9, 8f3645)
- Fix PyTorch dataloader batch size issue (6a98d0)
- Fix bugs for ONNX Runtime CUDA EP (a1b566, d1f315)
- Fix bug in ONNX Runtime adapter where _rename_node function fails with model size > 2 GB (1f6b1a)
- Fix ONNX Runtime diagnosis bug (f10e26)
- Update Neural Solution example and fix grpc port issue (528868)
- Fix the objective initialization issue (9d7546)
- Fix reshape issue for bayesian strategy (77cb83)
- Fix CVEs (d86922, 2bbfcd, fc71fa)
Examples
- Add Weight-Only LLM examples for PyTorch (4b24be, 66f7c1, aa457a)
- Add Weight-Only LLM examples for ONNX Runtime ([10c133](https://github.c...