Releases · intel/neural-compressor
Intel® Neural Compressor v3.1 Release
- Highlights
- Features
- Improvements
- Validated Hardware
- Validated Configurations
Highlights
- Aligned with the Habana 1.18 release, bringing improvements to FP8 and INT4 quantization for Intel® Gaudi® AI accelerators
- Provided a Transformers-like quantization API for weight-only quantization on LLMs, which offers Transformers-based users a one-stop experience for quantization & inference with IPEX on Intel GPU and CPU (see the sketch after this list)
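A minimal sketch of that Transformers-like flow for INT4 weight-only quantization; the class and argument names below follow the API as described in this release and should be read as assumptions, not a verified reference:

```python
# Sketch: Transformers-like weight-only INT4 quantization (assumed API).
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "facebook/opt-125m"                # any HF causal-LM checkpoint
woq_config = RtnConfig(bits=4, group_size=128)  # argument names assumed

# Quantize during load; inference dispatches to IPEX kernels on Intel CPU/GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=woq_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Intel Neural Compressor is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```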
Features
- Add Transformers-like quantization API for weight-only quantization on LLMs
- Support fast quantization with a lightweight recipe and layer-wise approach on Intel AI PC
- Support INT4 quantization of Visual Language Models (VLMs) such as Llava, Phi-3-vision, and Qwen-VL with the AutoRound algorithm (see the sketch after this list)
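For reference, the AutoRound flow this INT4 support builds on looks roughly like the following; it is shown on a text LLM for brevity, argument names follow the auto-round README, and the checkpoint name is only an example:

```python
# Sketch: INT4 quantization with AutoRound via the auto-round package that
# INC integrates; the VLM flow in this release follows the same pattern.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2-0.5B"  # illustrative; a VLM such as Qwen-VL plugs in similarly
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)  # common knobs
autoround.quantize()
autoround.save_quantized("./qwen2-0.5b-int4")  # auto-round export format
```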
Improvements
- Support AWQ-format INT4 model loading and conversion for IPEX inference in the Transformers-like API
- Enable AutoRound-format export for INT4 models
- Support per-channel INT8 Post Training Quantization for PT2E (see the sketch after this list)
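The PT2E path builds on PyTorch's torch.export-based quantization. Below is a minimal sketch of the underlying stock-PyTorch flow; INC layers its own configs (including the per-channel INT8 support above) on top of a flow like this, and the capture API has moved between PyTorch releases, so the imports are version-dependent:

```python
# Sketch: PT2E static INT8 post-training quantization with stock PyTorch APIs.
# capture_pre_autograd_graph is the PyTorch 2.3/2.4 capture entry point.
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(2, 16),)

exported = capture_pre_autograd_graph(model, example_inputs)
quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)           # calibration pass
quantized = convert_pt2e(prepared)  # INT8 reference model
```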
Validated Hardware
- Intel Gaudi AI Accelerators (Gaudi 2 and 3)
- Intel Xeon Scalable processor (4th, 5th, 6th Gen)
- Intel Core Ultra Processors (Series 1 and 2)
- Intel Data Center GPU Max Series (1100)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11
- Python 3.9, 3.10, 3.11, 3.12
- PyTorch/IPEX 2.2, 2.3, 2.4
Intel® Neural Compressor v3.0 Release
- Highlights
- Features
- Improvements
- Examples
- Bug Fixes
- Documentations
- Validated Configurations
Highlights
- FP8 quantization and INT4 model loading support on Intel® Gaudi® AI accelerator
- Framework extension API for quantization, mixed-precision and benchmarking
- Accuracy-aware FP16 mixed precision support on Intel® Xeon® 6 Processors
- Performance optimizations and usability improvements on client-side quantization
Features
- [Quantization] Support FP8 quantization on Gaudi (95197d)
- [Quantization] Support INC and Hugging Face model loading on framework extension API for PyTorch (0eced1, bacc16)
- [Quantization] Support Weight-only Quantization on framework extension API for PyTorch (34f0a9, de43d8, 4a4509, 1a4509, a3a065, 1386ac, a0dee9, 503d9e, 84d705, 099b7a, e3c736, e87c95, 2694bb, ec49a2, e7b4b6, a9bf79, ac717b, 915018, 8447d7, dc9328)
- [Quantization] Support static and dynamic quantization in PT2E path (7a4715, 43c358, 30b36b, 1f58f0, 02958d)
- [Quantization] Support SmoothQuant and static quantization in IPEX path with framework extension API (53e6ee, 72fbce, eaa3a5, 95e67e, 855c10, 9c6102, 5dafe5, a5e5f5, 191383, 776645)
- [Quantization] Support Layer-wise Quantization for RTN/GPTQ on framework extension API for PyTorch (649e6b)
- [Quantization] Support Post Training Quantization on framework extension API for TensorFlow (6c27c1, e22c61, f21afb, 3882e9, 2627d3)
- [Quantization] Support Post Training Quantization on Keras 3 (f67e86, 047560)
- [Quantization] Support Weight-only Quantization on Gaudi2 (4b9b44, 14868c, 0a3d4b)
- [Quantization] Improve performance and usability of quantization procedure on client side (16a7b1)
- [Quantization] Support auto-device detection on framework extension API for PyTorch (368ba5, 4b9b44, e81a2d, 0a3d4b, 534300, 2a86ae)
- [Quantization] Support Microscaling (MX) Quant for PyTorch (4a24a6, 455f1e)
- [Quantization] Enable cross-device Half-Quadratic Quantization (HQQ) support for LLMs (db6164, 07f940)
- [Quantization] Support FP8 cast Weight-only Quantization (57ed61)
- [Mixed-Precision] Support FP16 mixed-precision on framework extension autotune API for PyTorch (2e1cdc)
- [Mixed-Precision] Support mixed INT8 with FP16 in PT2E path (fa961e)
- [AutoTune] Support accuracy-aware tuning on framework extension API (e97659, 7b8aec, 5a0374, a4675c, 3a254e, ac47d9, b8d98e, fb6142, fa8e66, d22df5, 09eb5d, c6a8fa) (see the sketch after this list)
- [Benchmarking] Implement `incbench` command for ease-of-use benchmark (2fc725)
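A sketch of the new accuracy-aware tuning entry point on the framework extension API; the `autotune`/`TuningConfig`/`RTNConfig` names follow the v3.0 documentation, but the exact argument spellings here should be treated as assumptions:

```python
# Sketch: accuracy-aware tuning over an RTN weight-only config space
# (framework extension API; argument names assumed from the v3.0 docs).
import torch
from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))

def eval_fn(model) -> float:
    # Stand-in metric (higher is better); replace with a real accuracy measure.
    with torch.no_grad():
        return -model(torch.ones(1, 64)).abs().mean().item()

# Tuning walks the cross-product of the listed options until eval_fn is satisfied.
tune_config = TuningConfig(
    config_set=[RTNConfig(use_sym=[False, True], group_size=[32, 64])]
)
best_model = autotune(model=fp32_model, tune_config=tune_config, eval_fn=eval_fn)
```

For benchmarking, the new `incbench` command wraps multi-instance runs, e.g. `incbench main.py` to benchmark a script on the local machine (see the v3.0 docs for the supported flags).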
Improvements
- [Quantization] Integrate AutoRound v0.3 (bfa27e, [fd9...
Intel® Neural Compressor v2.6 Release
- Highlights
- Features
- Improvements
- Examples
- Bug Fixes
- External Contributions
- Validated Configurations
Highlights
- Integrated recent AutoRound with lm-head quantization support and calibration process optimizations
- Migrated ONNX model quantization capability into the ONNX Neural Compressor project
Features
- [Quantization] Integrate recent AutoRound with lm-head quantization support and calibration process optimizations (4728fd)
- [Quantization] Support true sequential options in GPTQ (92c942)
Improvements
- [Quantization] Improve WOQ Linear pack/unpack speed with a NumPy implementation (daa143) (see the sketch after this list)
- [Quantization] Auto detect available device when exporting (7be355)
- [Quantization] Refine AutoRound export to support Intel GPU (409231)
- [Benchmarking] Detect the number of sockets when needed (e54b93)
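The pack/unpack speedup comes from replacing per-element Python loops with vectorized NumPy bit manipulation. A self-contained illustration of the idea (not INC's actual `WeightOnlyLinear` kernel, whose layout differs):

```python
# Illustration of vectorized INT4 pack/unpack with NumPy (concept only).
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed INT4 values in [-8, 7] pairwise into uint8."""
    u = (q + 8).astype(np.uint8)            # shift to unsigned [0, 15]
    return (u[..., 0::2] << 4) | u[..., 1::2]

def unpack_int4(p: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4."""
    hi, lo = p >> 4, p & 0x0F
    u = np.stack([hi, lo], axis=-1).reshape(*p.shape[:-1], -1)
    return u.astype(np.int8) - 8

w = np.random.randint(-8, 8, size=(4, 8), dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)
```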
Examples
- Upgrade lm_eval to 0.4.2 in PT and ORT LLM examples (fdb509, 54f039)
- Add diffusers/dreambooth example with IPEX (ba4798)
Bug Fixes
- Fix incorrect dtype of unpacked tensor issue in PT (29fdec)
- Fix TF LLM SQ legacy Keras environment variable issue (276449)
- Fix TF estimator issue by adding version check on TF2.16 (855b98)
- Fix missing tokenizer issue in run_clm_no_trainer.py after using lm-eval 0.4.2 (d64029)
- Fix AWQ padding issue in ORT (903da4)
- Fix recover function issue in ORT (ee24db)
- Update model ckpt download url in prepare_model.py (0ba573)
- Fix case where pad_max_length set to None (960bd2)
- Fix a failure for GPU backend (71a9f3)
- Fix numpy versions for rnnt and 3d-unet examples (12b8f4)
- Fix CVEs (5b5579, 25c71a, 47d73b, 41da74)
External Contributions
- Update model ckpt download url in prepare_model.py (0ba573)
- Fix case where pad_max_length set to None (960bd2)
- Add diffusers/dreambooth example with IPEX (ba4798)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- PyTorch/IPEX 2.1, 2.2, 2.3
- TensorFlow 2.14, 2.15, 2.16
- ITEX 2.13.0, 2.14.0, 2.15.0
- ONNX Runtime 1.16, 1.17, 1.18
Intel® Neural Compressor v2.5.1 Release
- Improvement
- Bug Fixes
- Validated Configurations
Improvement
- Improve WOQ AutoRound export (409231, 7ee721)
- Adapt ITREX v1.4 release for example evaluation (9d7a05)
- Update more supported LLM recipes (ce9b16)
Bug Fixes
- Fix WOQ RTN supported layer checking condition (079177)
- Fix in-place processing error in quant_weight function (92533a)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.15
- ITEX 2.14.0
- PyTorch/IPEX 2.2
- ONNX Runtime 1.17
Intel® Neural Compressor v2.5 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- External Contributions
- Validated Configurations
Highlights
- Integrated Weight-Only Quantization algorithm AutoRound and verified it on Gaudi2, Intel CPU, and NVIDIA GPU
- Applied SmoothQuant & Weight-Only Quantization algorithms to 15+ popular LLMs for INT8 & INT4 quantization and published the recipes (see the sketch after this list)
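The published recipes drive the 2.x `PostTrainingQuantConfig` entry point. A minimal SmoothQuant sketch follows; the toy model, calibration data, and `alpha=0.5` are illustrative stand-ins, and the per-LLM recipes pin model-specific alpha values (or use `"auto"`):

```python
# Sketch: INT8 SmoothQuant with the INC 2.x API; alpha=0.5 is illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

fp32_model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
calib_data = TensorDataset(torch.randn(32, 16), torch.zeros(32))  # (input, label)
calib_dataloader = DataLoader(calib_data, batch_size=8)

conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}}
)
q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
```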
Features
- [Quantization] Integrate Weight-Only Quantization algorithm AutoRound (5c7f33, dfd083, 9a7ddd, cf1de7)
- [Quantization] Quantize weight with in-place mode in Weight-Only Quantization (deb1ed)
- [Pruning] Enable SNIP on multiple cards using DeepSpeed ZeRO-3 (49ab28)
- [Pruning] Support new pruning approaches Wanda and DSNOT for PyTorch LLM (7a3671)
Improvement
- [Quantization] SmoothQuant code structure refactor (a8d81c)
- [Quantization] Optimize the workflow of parsing Keras model (b816d7)
- [Quantization] Support static_groups options in GPTQ API (1c426a)
- [Quantization] Update TEQ train dataloader (d1e994)
- [Quantization] WeightOnlyLinear keeps self.weight after recover (2835bd)
- [Quantization] Add version condition for IPEX prepare init (d96e14)
- [Quantization] Enhance the ORT node name checking (f1597a)
- [Pruning] Stop the tuning process early when enabling smooth quant (844a03)
Productivity
- ORT LLM examples support latest optimum version (26b260)
- Add coding style docs and recommended VS Code setting (c1f23c)
- Adapt transformers 4.37 loading (6133f4)
- Upgrade pre-commit checker for black/blacken-docs/ruff (7763ed)
- Support CI summary in PR comments (d4bcdd)
- Notebook example update to install latest INC & TF, add metric in fit (4239d3)
Bug Fixes
- Fix QA IPEX example fp32 input issue (c4de19)
- Update conditions of getting min-max during TF MatMul requantize (d07175)
- Fix TF saved_model issues (d8e60b)
- Fix comparison of module_type and MulLinear (ba3aba)
- Fix ORT calibration issue (cd6d24)
- Fix ORT example bart export failure (b0dc0d)
- Fix TF example accuracy diff during benchmark and quantization (5943ea)
- Fix bugs for GPTQ exporting with static_groups (b4e37b)
- Fix ORT quant issue caused by tensors having same name (0a20f3)
- Fix Neural Solution SQL/CMD injection (14b7b0)
- Fix the best qmodel recovery issue (f2d9b7)
- Fix logger issue (83bc77)
- Store token in protected file (c6f9cc)
- Define the default SSL context (b08725)
- Fix IPEX stats bug (5af383)
- Fix ORT calibration for Dml EP (c58aea)
- Fix wrong socket number retrieval for non-English systems (5b2a88)
- Fix trust remote for llm examples (2f2c9a)
External Contributions
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 2.13.0, 2.14.0
- PyTorch/IPEX 2.0, 2.1, 2.2
- ONNX Runtime 1.15, 1.16, 1.17
Intel® Neural Compressor v2.4.1 Release
- Improvement
- Bug Fixes
- Examples
- Validated Configurations
Improvement
- Narrow down the tuning space of SmoothQuant auto-tune (9600e1)
- Support ONNXRT Weight-Only Quantization with different dtypes (5119fc)
- Add progress bar for ONNXRT Weight-Only Quantization and SmoothQuant (4d26e3)
Bug Fixes
- Fix SmoothQuant alpha-space generation (33ece9)
- Fix inputs error for SmoothQuant example_inputs (39f63a)
- Fix LLMs accuracy regression with IPEX 2.1.100 (3cb6d3)
- Fix quantizable add ops detection on IPEX backend (4c004d)
- Fix range step bug in ORTSmoothQuant (40275c)
- Fix unit test bugs and update CI versions (6c78df, 835805)
- Fix notebook issues (08221e)
Examples
- Add verified LLMs list and recipes for SmoothQuant and Weight-Only Quantization (f19cc9)
- Add code-generation evaluation for Weight-Only Quantization GPTQ (763440)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.14
- ITEX 2.14.0.1
- PyTorch/IPEX 2.1.0
- ONNX Runtime 1.16.3
Intel® Neural Compressor v2.4 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization
- Supported Weight-Only Quantization tuning for the ONNX Runtime backend
- Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with the FW extension API (see the sketch after this list)
- Supported SmoothQuant of big SavedModel for the TensorFlow backend
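GGML-style double quant shrinks quantization metadata by quantizing the per-group scales themselves against a single higher-precision super-scale. A self-contained NumPy illustration of the idea follows; this is a concept sketch, not INC's or GGML's actual layout:

```python
# Concept sketch of "double quant": quantize weights to INT4 per group, then
# quantize the FP scales to INT8 against one FP super-scale per tensor.
import numpy as np

def double_quant(w: np.ndarray, group: int = 32):
    w = w.reshape(-1, group)
    scales = np.abs(w).max(axis=1) / 7.0               # per-group INT4 scale
    q = np.clip(np.round(w / scales[:, None]), -8, 7).astype(np.int8)
    super_scale = np.abs(scales).max() / 127.0         # one FP scale overall
    q_scales = np.round(scales / super_scale).astype(np.int8)
    return q, q_scales, super_scale

def dequant(q, q_scales, super_scale):
    return q.astype(np.float32) * (q_scales.astype(np.float32) * super_scale)[:, None]

w = np.random.randn(4, 64).astype(np.float32)
q, qs, ss = double_quant(w)
err = np.abs(dequant(q, qs, ss).reshape(w.shape) - w).max()
print(f"max abs error: {err:.4f}")
```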
Features
- [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
- [Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
- [Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2)
- [Quantization] Support SmoothQuant of Big Saved Model for TensorFlow Backend (3b2925, 4f2c35)
- [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
- [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
- [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
- [Common] [Experimental] FW extension API implementation (76b8b3, 8447d7, 258236)
- [Quantization] [Experimental] FW extension API for PT backend support Weight-Only Quantization (915018, dc9328)
- [Quantization] [Experimental] FW extension API for TF backend support Keras Quantization (2627d3)
- [Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)
Improvement
- [Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
- [Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
- [Quantization] Support restoring IPEX model from JSON (c3214c)
- [Quantization] ONNX Runtime add attr to MatMulNBits (7057e3)
- [Quantization] Increase SmoothQuant auto alpha running speed (173c18)
- [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
- [Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
- [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
- [Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
- [Quantization] Support trace with dictionary type example_inputs (afe315)
- [Quantization] Support falcon Weight-Only Quantization (595d3a)
- [Common] Add deprecation decorator in experimental folder (aeb3ed)
- [Common] Remove 1.x API dependency (ee617a)
- [Mixed Precision] Support PyTorch eager mode BF16 MixedPrecision (3bfb76)
Productivity
- Support quantization and benchmark on macOS (16d6a0)
- Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
- Support TensorFlow new API for gnr-base (8160c7)
Bug Fixes
- Fix GraphModule object has no attribute bias (7f53d1)
- Fix ONNX model export issue (af0aea, eaa57f)
- Add clip for ONNX Runtime SmoothQuant (cbb69b)
- Fix SmoothQuant minmax observer init (b1db1c)
- Fix SmoothQuant issue in get/set_module (dffcfe)
- Align sparsity with block-wise masks in progressive pruning (fcdc29)
Examples
- Support peft model with SmoothQuant (5e21b7)
- Enable two ONNX Runtime examples: table-transformer-detection (550cee) and BEiT (7265df)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 10 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 1.2.0, 2.13.0.0, 2.14.0.1
- PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
- ONNX Runtime 1.14.1, 1.15.1, 1.16.3
- MXNet 1.9.1
Intel® Neural Compressor v2.3.2 Release
- Features
- Bug Fixes
Features
- Reduce memory consumption in ONNXRT adaptor (f64833)
- Support MatMulFpQ4 for onnxruntime 1.16 (1beb43)
- Support MatMulNBits for onnxruntime 1.17 (67a31b)
Bug Fixes
- Update ITREX version in ONNXRT WOQ example and fix bugs in hf models (0ca51a)
- Update ONNXRT WOQ example to llama-2-7b (7f2063)
- Fix ONNXRT WOQ failed with None model_path (cbd0a4)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.13
- ITEX 2.13
- PyTorch/IPEX 2.0.1+cpu
- ONNX Runtime 1.15.1
- MXNet 1.9.1
Intel® Neural Compressor v2.3.1 Release
- Bug Fixes
- Productivity
Bug Fixes
- Fix PyTorch SmoothQuant for auto alpha (e9c14a, 35def7)
- Fix PyTorch SmoothQuant calibration memory overhead (49e950)
- Fix PyTorch SmoothQuant issue in get/set_module (Issue #1265) (6de9ce)
- Support falcon Weight-Only Quantization (bf7b5c)
- Remove Conv2d in Weight-Only Quantization adaptor white list (1a6526)
- Fix TensorFlow ssd_resnet50_v1 Example for TF New API (c63fc5)
Productivity
- Adapt Example for TensorFlow 2.14 AutoTrackable API Change (424cf3)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.13, 2.14
- ITEX 2.13
- PyTorch/IPEX 2.0.1+cpu
- ONNX Runtime 1.15.1
- MXNet 1.9.1
Intel® Neural Compressor v2.3 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Integrated Intel Neural Compressor into MSFT ONNX Runtime (#16288) and Olive (#411, #412, #469)
- Supported low-precision data types (INT4, NF4, FP4) and Weight-Only Quantization algorithms (RTN, AWQ, GPTQ, TEQ) on ONNX Runtime and PyTorch for LLM optimization (see the sketch after this list)
- Supported sparseGPT pruner (88adfc)
- Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids)
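In the 2.x API, these WOQ algorithms are selected through `PostTrainingQuantConfig` with `approach="weight_only"`. A minimal RTN sketch follows; the `op_type_dict` key names follow the 2.3 documentation pattern and should be read as assumptions rather than a verified interface:

```python
# Sketch: 4-bit RTN weight-only quantization with the INC 2.x API.
# RTN needs no calibration data; AWQ/GPTQ/TEQ additionally take a dataloader.
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all supported op types
            "weight": {
                "bits": 4,          # INT4 weights
                "group_size": 32,   # per-group scales
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)
q_model = quantization.fit(fp32_model, conf)
```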
Features
- [Quantization] Support ONNX Runtime quantization and inference for DNNL EP (79be8b)
- [Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP (750bb9)
- [Quantization] Support low precision and Weight-Only Quantization (WOQ) algorithms, including RTN (501440, 19ab16, 859315), AWQ (2562f2, 641d42), GPTQ (b5ac3c, 6ba783) and TEQ (d2f995, 9ff7f0) for PyTorch
- [Quantization] Support NF4 and FP4 data type for PyTorch Weight-Only Quantization (3d11b5)
- [Quantization] Support low precision and Weight-Only Quantization algorithms, including RTN, AWQ and GPTQ for ONNX Runtime (da4c92)
- [Quantization] Support layer-wise quantization (d9d1fc) and enable with SmoothQuant (ec9ae9)
- [Pruning] Add sparseGPT pruner and refactor pruning class (88adfc)
- [Pruning] Add Hyper-parameter Optimization algorithm for pruning (6613cf)
- [Model Export] Support PT2ONNX dynamic quantization export (165532)
Improvement
- [Common] Clean up dataloader usage in examples (1044d8, a2931e, 447cc7)
- [Common] Enhance ONNX Runtime backend check (4ce9de)
- [Strategy] Add block-wise distributed fallback in basic strategy (ea309f)
- [Strategy] Enhance strategy exit policy (d19b42)
- [Quantization] Add WeightOnlyLinear for Weight-Only approach to allow low memory inference (00bbf8)
- [Quantization] Support more ONNX Runtime direct INT8 ops (b9ce61)
- [Quantization] Support TensorFlow per-channel MatMul quantization (cf5589)
- [Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant (084eda)
- [Quantization] Enhance ONNX SmoothQuant tuning structure (f0d51c)
- [Quantization] Enhance PyTorch SmoothQuant tuning structure (81da40)
- [Quantization] Update PyTorch examples dataloader to support transformers 4.31.x (59371f)
- [Quantization] Enhance ONNX Runtime backend setting for GPU EP support (295535)
- [Pruning] Refactor pruning (92d14d)
- [Mixed Precision] Update the list of supported layers for Keras mix-precision (692c8b)
- [Mixed Precision] Introduce quant_level into mixed precision (0dc6a9)
Productivity
- [Ecosystem] MSFT Olive integrates SmoothQuant and 3 LLM examples (#411, #412, #469)
- [Ecosystem] MSFT ONNX Runtime integrates SmoothQuant static quantization (#16288)
- [Neural Insights] Support PyTorch FX inspect tensor and integrate with Neural Insights (775def, 74a785)
- [Neural Insights] Add step-by-step diagnosis cases (99c3b0)
- [Neural Solution] Resource management and user-facing API enhancement (fbba10)
- [Auto CI] Integrate auto CI code scan bug fix tools (f77a2c, 06cc38)
Bug Fixes
- Fix bugs in PyTorch SmoothQuant (0349b9, 8f3645)
- Fix PyTorch dataloader batch size issue (6a98d0)
- Fix bugs for ONNX Runtime CUDA EP (a1b566, d1f315)
- Fix bug in ONNX Runtime adapter where _rename_node function fails with model size > 2 GB (1f6b1a)
- Fix ONNX Runtime diagnosis bug (f10e26)
- Update Neural Solution example and fix grpc port issue (528868)
- Fix the objective initialization issue (9d7546)
- Fix reshape issue for bayesian strategy (77cb83)
- Fix CVEs (d86922, 2bbfcd, fc71fa)
Examples
- Add Weight-Only LLM examples for PyTorch (4b24be, 66f7c1, aa457a)
- Add Weight-Only LLM examples for ONNX Runtime ([10c133](https://github.c...