
Releases: intel/neural-compressor

Intel® Neural Compressor v3.1 Release

25 Oct 08:18
  • Highlights
  • Features
  • Improvements
  • Validated Hardware
  • Validated Configurations

Highlights

  • Aligned with the Habana 1.18 release, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
  • Provided a Transformer-like quantization API for weight-only quantization of LLMs, offering Transformers-based users a one-stop experience for quantization and inference with IPEX on Intel GPUs and CPUs

Features

  • Add a Transformer-like quantization API for weight-only quantization of LLMs (see the sketch after this list)
  • Support fast quantization with lightweight recipes and a layer-wise approach on Intel AI PCs
  • Support INT4 quantization of Visual Language Models (VLMs) such as Llava, Phi-3-vision, and Qwen-VL with the AutoRound algorithm
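The Transformer-like API mirrors the Hugging Face from_pretrained flow, so quantization happens at load time. A minimal weight-only INT4 sketch, assuming the neural_compressor.transformers entry point and the RtnConfig arguments shown in the INC 3.1 documentation (the model name is illustrative):

```python
# Weight-only INT4 quantization via the Transformer-like API (sketch).
# Assumes neural_compressor.transformers exposes AutoModelForCausalLM and
# RtnConfig as documented for INC 3.1; the model name is illustrative.
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "facebook/opt-125m"
woq_config = RtnConfig(bits=4, group_size=128)  # round-to-nearest INT4

# Quantization runs inside from_pretrained when quantization_config is set.
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```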

Improvements

  • Support loading and converting AWQ-format INT4 models for IPEX inference in the Transformer-like API
  • Enable auto-round format export for INT4 models
  • Support per-channel INT8 Post-Training Quantization for PT2E

Validated Hardware

  • Intel Gaudi AI Accelerators (Gaudi 2 and 3)
  • Intel Xeon Scalable processors (4th, 5th, 6th Gen)
  • Intel Core Ultra Processors (Series 1 and 2)
  • Intel Data Center GPU Max Series (1100)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Win 11
  • Python 3.9, 3.10, 3.11, 3.12
  • PyTorch/IPEX 2.2, 2.3, 2.4

Intel® Neural Compressor v3.0 Release

12 Aug 04:09
7056720
  • Highlights
  • Features
  • Improvements
  • Examples
  • Bug Fixes
  • Documentations
  • Validated Configurations

Highlights

  • FP8 quantization and INT4 model loading support on Intel® Gaudi® AI accelerator
  • Framework extension API for quantization, mixed precision, and benchmarking (a sketch follows this list)
  • Accuracy-aware FP16 mixed precision support on Intel® Xeon® 6 processors
  • Performance optimizations and usability improvements on client-side quantization
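The framework extension API centers on a prepare/convert pair under neural_compressor.torch. A minimal weight-only RTN sketch, assuming the RTNConfig, prepare, and convert entry points documented for INC 3.0 (the model is a toy stand-in; FP8 on Gaudi follows the same flow with FP8Config plus a calibration pass):

```python
# Framework extension API sketch: weight-only RTN through prepare/convert.
# Assumes neural_compressor.torch.quantization exposes RTNConfig, prepare,
# and convert as documented for INC 3.0; the model is a toy stand-in.
import torch
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
model = prepare(model, RTNConfig(bits=4, group_size=32))
model = convert(model)  # RTN needs no calibration between prepare and convert
print(model)
```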

Features

Improvements

  • [Quantization] Integrate AutoRound v0.3 (bfa27e, [fd9...
Read more

Intel® Neural Compressor v2.6 Release

14 Jun 13:55
2928d85
  • Highlights
  • Features
  • Improvements
  • Examples
  • Bug Fixes
  • External Contributions
  • Validated Configurations

Highlights

  • Integrated recent AutoRound with lm-head quantization support and calibration process optimizations
  • Migrated the ONNX model quantization capability into the ONNX Neural Compressor project

Features

  • [Quantization] Integrate recent AutoRound with lm-head quantization support and calibration process optimizations (4728fd) (see the sketch after this list)
  • [Quantization] Support the true_sequential option in GPTQ (92c942)
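For reference, the integrated AutoRound algorithm also ships as the standalone auto-round package. A minimal INT4 sketch under that assumption (the model name is illustrative, and constructor defaults may differ across versions):

```python
# AutoRound INT4 sketch using the standalone auto-round package that INC
# integrates; the constructor arguments follow the auto-round README and
# may differ across versions. The model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()                         # signed-gradient rounding search
autoround.save_quantized("./opt-125m-int4")  # export the packed INT4 model
```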

Improvements

  • [Quantization] Improve WOQ Linear pack/unpack speed with numpy implementation (daa143)
  • [Quantization] Auto-detect available device when exporting (7be355)
  • [Quantization] Refine AutoRound export to support Intel GPU (409231)
  • [Benchmarking] Detect the number of sockets when needed (e54b93)

Examples

  • Upgrade lm_eval to 0.4.2 in PT and ORT LLM examples (fdb509) (54f039) (see the evaluation sketch after this list)
  • Add diffusers/dreambooth example with IPEX (ba4798)
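lm-eval 0.4.x replaces the old evaluator interface with a simple_evaluate entry point. A sketch of how the upgraded examples can call it (the model and task names are illustrative):

```python
# Accuracy evaluation with lm-eval 0.4.x (sketch); the model and task
# names are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face model backend
    model_args="pretrained=facebook/opt-125m",
    tasks=["lambada_openai"],
)
print(results["results"])
```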

Bug Fixes

  • Fix incorrect dtype of unpacked tensor issue in PT (29fdec)
  • Fix TF LLM SQ legacy Keras environment variable issue (276449)
  • Fix TF estimator issue by adding version check on TF2.16 (855b98)
  • Fix missing tokenizer issue in run_clm_no_trainer.py after using lm-eval 0.4.2 (d64029)
  • Fix AWQ padding issue in ORT (903da4)
  • Fix recover function issue in ORT (ee24db)
  • Update model ckpt download url in prepare_model.py (0ba573)
  • Fix case where pad_max_length set to None (960bd2)
  • Fix a failure for GPU backend (71a9f3)
  • Fix numpy versions for rnnt and 3d-unet examples (12b8f4)
  • Fix CVEs (5b5579) (25c71a) (47d73b) (41da74)

External Contributions

  • Update model ckpt download url in prepare_model.py (0ba573)
  • Fix case where pad_max_length set to None (960bd2)
  • Add diffusers/dreambooth example with IPEX (ba4798)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • PyTorch/IPEX 2.1, 2.2, 2.3
  • TensorFlow 2.14, 2.15, 2.16
  • ITEX 2.13.0, 2.14.0, 2.15.0
  • ONNX Runtime 1.16, 1.17, 1.18

Intel® Neural Compressor v2.5.1 Release

03 Apr 14:03
  • Improvement
  • Bug Fixes
  • Validated Configurations

Improvement

  • Improve WOQ AutoRound export (409231, 7ee721)
  • Adapt the ITREX v1.4 release for example evaluation (9d7a05)
  • Update more supported LLM recipes (ce9b16)

Bug Fixes

  • Fix WOQ RTN supported layer checking condition (079177)
  • Fix in-place processing error in quant_weight function (92533a)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.15
  • ITEX 2.14.0
  • PyTorch/IPEX 2.2
  • ONNX Runtime 1.17

Intel® Neural Compressor v2.5 Release

26 Mar 10:21
24419c9
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • External Contributions
  • Validated Configurations

Highlights

  • Integrated the Weight-Only Quantization algorithm AutoRound and verified it on Gaudi2, Intel CPU, and NVIDIA GPU
  • Applied SmoothQuant and Weight-Only Quantization algorithms to 15+ popular LLMs for INT8 and INT4 quantization and published the recipes (a SmoothQuant sketch follows this list)
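The published recipes plug into the 2.x PostTrainingQuantConfig API. A minimal SmoothQuant sketch, assuming the documented smooth_quant recipe keys (the model and calibration data are toy stand-ins):

```python
# SmoothQuant INT8 sketch with the INC 2.x API; the recipe keys follow the
# documented smooth_quant options, and the model/data are toy stand-ins.
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)
calib_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(32, 64), torch.zeros(32)),  # (input, label)
    batch_size=8,
)

conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}}  # alpha="auto" also works
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save("./int8-model")
```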

Features

  • [Quantization] Integrate Weight-Only Quantization algorithm AutoRound (5c7f33, dfd083, 9a7ddd, cf1de7)
  • [Quantization] Quantize weight with in-place mode in Weight-Only Quantization (deb1ed)
  • [Pruning] Enable SNIP on multiple cards using DeepSpeed ZeRO-3 (49ab28)
  • [Pruning] Support new pruning approach Wanda and DSNOT for PyTorch LLM (7a3671)

Improvement

  • [Quantization] SmoothQuant code structure refactor (a8d81c)
  • [Quantization] Optimize the workflow of parsing Keras model (b816d7)
  • [Quantization] Support the static_groups option in the GPTQ API (1c426a)
  • [Quantization] Update TEQ train dataloader (d1e994)
  • [Quantization] WeightOnlyLinear keeps self.weight after recover (2835bd)
  • [Quantization] Add version condition for IPEX prepare init (d96e14)
  • [Quantization] Enhance the ORT node name checking (f1597a)
  • [Pruning] Stop the tuning process early when enabling smooth quant (844a03)

Productivity

  • ORT LLM examples support latest optimum version (26b260)
  • Add coding style docs and recommended VS Code setting (c1f23c)
  • Adapt transformers 4.37 loading (6133f4)
  • Upgrade pre-commit checker for black/blacken-docs/ruff (7763ed)
  • Support CI summary in PR comments (d4bcdd)
  • Notebook example update to install latest INC & TF, add metric in fit (4239d3)

Bug Fixes

  • Fix QA IPEX example fp32 input issue (c4de19)
  • Update Conditions of Getting min-max during TF MatMul Requantize (d07175)
  • Fix TF saved_model issues (d8e60b)
  • Fix comparison of module_type and MulLinear (ba3aba)
  • Fix ORT calibration issue (cd6d24)
  • Fix ORT example bart export failure (b0dc0d)
  • Fix TF example accuracy diff during benchmark and quantization (5943ea)
  • Fix bugs for GPTQ exporting with static_groups (b4e37b)
  • Fix ORT quant issue caused by tensors having same name (0a20f3)
  • Fix Neural Solution SQL/CMD injection (14b7b0)
  • Fix the best qmodel recovery issue (f2d9b7)
  • Fix logger issue (83bc77)
  • Store token in protected file (c6f9cc)
  • Define the default SSL context (b08725)
  • Fix IPEX stats bug (5af383)
  • Fix ORT calibration for Dml EP (c58aea)
  • Fix wrong socket number retrieval for non-english system (5b2a88)
  • Fix trust remote for llm examples (2f2c9a)

External Contributions

  • Intel Mac support (21cfeb)
  • Add PTQ example for PyTorch CV Segment Anything Model (bd5e69)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.13, 2.14, 2.15
  • ITEX 2.13.0, 2.14.0
  • PyTorch/IPEX 2.0, 2.1, 2.2
  • ONNX Runtime 1.15, 1.16, 1.17

Intel® Neural Compressor v2.4.1 Release

29 Dec 13:12
b8c7f1a
  • Improvement
  • Bug Fixes
  • Examples
  • Validated Configurations

Improvement

  • Narrow down the tuning space of SmoothQuant auto-tune (9600e1)
  • Support ONNXRT Weight-Only Quantization with different dtypes (5119fc)
  • Add progress bar for ONNXRT Weight-Only Quantization and SmoothQuant (4d26e3)

Bug Fixes

  • Fix SmoothQuant alpha-space generation (33ece9)
  • Fix inputs error for SmoothQuant example_inputs (39f63a)
  • Fix LLMs accuracy regression with IPEX 2.1.100 (3cb6d3)
  • Fix quantizable add ops detection on IPEX backend (4c004d)
  • Fix range step bug in ORTSmoothQuant (40275c)
  • Fix unit test bugs and update CI versions (6c78df, 835805)
  • Fix notebook issues (08221e)

Examples

  • Add verified LLMs list and recipes for SmoothQuant and Weight-Only Quantization (f19cc9)
  • Add code-generation evaluation for Weight-Only Quantization GPTQ (763440)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.14
  • ITEX 2.14.0.1
  • PyTorch/IPEX 2.1.0
  • ONNX Runtime 1.16.3

Intel® Neural Compressor v2.4 Release

17 Dec 03:26
111b3ce
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • Examples
  • Validated Configurations

Highlights

  • Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization (a layer-wise sketch follows this list)
  • Supported Weight-Only Quantization tuning for the ONNX Runtime backend
  • Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with the FW extension API
  • Supported SmoothQuant of big SavedModel for the TensorFlow backend
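Layer-wise quantization loads and quantizes one layer at a time to cap peak memory. A sketch with the 2.x API, assuming the layer_wise_quant recipe key from the INC 2.4 docs (fp32_model is a placeholder for a PyTorch LLM):

```python
# Layer-wise RTN weight-only sketch with the INC 2.x API; the recipe and
# op_type_dict keys are taken from the 2.4 docs and are assumptions here.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={".*": {"weight": {"bits": 4, "algorithm": "RTN"}}},
    recipes={"layer_wise_quant": True},  # stream one layer at a time
)
q_model = quantization.fit(fp32_model, conf)  # fp32_model: a PyTorch LLM (placeholder)
```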

Features

  • [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
  • [Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
  • [Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2)
  • [Quantization] Support SmoothQuant of Big Saved Model for TensorFlow Backend (3b2925, 4f2c35)
  • [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
  • [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
  • [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
  • [Common] [Experimental] FW extension API implement (76b8b3, 8447d7, 258236)
  • [Quantization] [Experimental] FW extension API for PT backend support Weight-Only Quantization (915018, dc9328)
  • [Quantization] [Experimental] FW extension API for TF backend support Keras Quantization (2627d3)
  • [Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)

Improvement

  • [Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
  • [Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
  • [Quantization] Support restoring IPEX models from JSON (c3214c)
  • [Quantization] ONNX Runtime add attr to MatMulNBits (7057e3)
  • [Quantization] Increase SmoothQuant auto alpha running speed (173c18)
  • [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
  • [Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
  • [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
  • [Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
  • [Quantization] Support trace with dictionary type example_inputs (afe315)
  • [Quantization] Support Falcon Weight-Only Quantization (595d3a)
  • [Common] Add deprecation decorator in experimental fold (aeb3ed)
  • [Common] Remove 1.x API dependency (ee617a)
  • [Mixed Precision] Support PyTorch eager mode BF16 MixedPrecision (3bfb76)

Productivity

  • Support quantization and benchmark on macOS (16d6a0)
  • Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
  • Support TensorFlow new API for gnr-base (8160c7)

Bug Fixes

  • Fix GraphModule object has no attribute bias (7f53d1)
  • Fix ONNX model export issue (af0aea, eaa57f)
  • Add clip for ONNX Runtime SmoothQuant (cbb69b)
  • Fix SmoothQuant minmax observer init (b1db1c)
  • Fix SmoothQuant issue in get/set_module (dffcfe)
  • Align sparsity with block-wise masks in progressive pruning (fcdc29)

Examples

  • Support peft model with SmoothQuant (5e21b7)
  • Enable two ONNX Runtime examples: table-transformer-detection (550cee) and BEiT (7265df)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Win 10 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.13, 2.14, 2.15
  • ITEX 1.2.0, 2.13.0.0, 2.14.0.1
  • PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
  • ONNX Runtime 1.14.1, 1.15.1, 1.16.3
  • MXNet 1.9.1

Intel® Neural Compressor v2.3.2 Release

23 Nov 15:30
  • Features
  • Bug Fixes

Features

  • Reduce memory consumption in ONNXRT adaptor (f64833)
  • Support MatMulFpQ4 for onnxruntime 1.16 (1beb43)
  • Support MatMulNBits for onnxruntime 1.17 (67a31b)

Bug Fixes

  • Update ITREX version in ONNXRT WOQ example and fix bugs in hf models (0ca51a)
  • Update ONNXRT WOQ example into llama-2-7b (7f2063)
  • Fix ONNXRT WOQ failed with None model_path (cbd0a4)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.13
  • ITEX 2.13
  • PyTorch/IPEX 2.0.1+cpu
  • ONNX Runtime 1.15.1
  • MXNet 1.9.1

Intel® Neural Compressor v2.3.1 Release

28 Sep 09:48
  • Bug Fixes
  • Productivity

Bug Fixes

  • Fix PyTorch SmoothQuant for auto alpha (e9c14a, 35def7)
  • Fix PyTorch SmoothQuant calibration memory overhead (49e950)
  • Fix PyTorch SmoothQuant issue in get/set_module (Issue #1265) (6de9ce)
  • Support Falcon Weight-Only Quantization (bf7b5c)
  • Remove Conv2d in Weight-Only Quantization adaptor white list (1a6526)
  • Fix TensorFlow ssd_resnet50_v1 Example for TF New API (c63fc5)

Productivity

  • Adapt Example for TensorFlow 2.14 AutoTrackable API Change (424cf3)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.13, 2.14
  • ITEX 2.13
  • PyTorch/IPEX 2.0.1+cpu
  • ONNX Runtime 1.15.1
  • MXNet 1.9.1

Intel® Neural Compressor v2.3 Release

15 Sep 07:56
3e1b9d4
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • Examples
  • Validated Configurations

Highlights

  • Integrated Intel Neural Compressor into MSFT ONNX Runtime (#16288) and Olive (#411, #412, #469).
  • Supported low precision (INT4, NF4, FP4) and Weight-Only Quantization algorithms, including RTN, AWQ, GPTQ and TEQ, on ONNX Runtime and PyTorch for LLM optimization (a weight-only sketch follows this list).
  • Supported sparseGPT pruner (88adfc).
  • Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids).
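The weight-only algorithms share one configuration surface in the 2.x API. A minimal INT4 GPTQ sketch, assuming the op_type_dict weight keys from the 2.x weight-only docs (model and calib_dataloader are placeholders):

```python
# Weight-only INT4 (GPTQ) sketch with the INC 2.3 API; the op_type_dict keys
# follow the 2.x weight-only docs, and model/dataloader are placeholders.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match every quantizable op
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "GPTQ",  # or RTN / AWQ / TEQ
            }
        }
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```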

Features

  • [Quantization] Support ONNX Runtime quantization and inference for DNNL EP (79be8b)
  • [Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP (750bb9)
  • [Quantization] Support low precision and Weight-Only Quantization (WOQ) algorithms, including RTN (501440, 19ab16, 859315), AWQ (2562f2, 641d42), GPTQ (b5ac3c, 6ba783) and TEQ (d2f995, 9ff7f0) for PyTorch
  • [Quantization] Support NF4 and FP4 data type for PyTorch Weight-Only Quantization (3d11b5)
  • [Quantization] Support low precision and Weight-Only Quantization algorithms, including RTN, AWQ and GPTQ for ONNX Runtime (da4c92)
  • [Quantization] Support layer-wise quantization (d9d1fc) and enable with SmoothQuant (ec9ae9)
  • [Pruning] Add sparseGPT pruner and refactor pruning class (88adfc)
  • [Pruning] Add Hyper-parameter Optimization algorithm for pruning (6613cf)
  • [Model Export] Support PT2ONNX dynamic quantization export (165532)

Improvement

  • [Common] Clean up dataloader usage in examples (1044d8, a2931e, 447cc7)
  • [Common] Enhance ONNX Runtime backend check (4ce9de)
  • [Strategy] Add block-wise distributed fallback in basic strategy (ea309f)
  • [Strategy] Enhance strategy exit policy (d19b42)
  • [Quantization] Add WeightOnlyLinear for Weight-Only approach to allow low memory inference (00bbf8)
  • [Quantization] Support more ONNX Runtime direct INT8 ops (b9ce61)
  • [Quantization] Support TensorFlow per-channel MatMul quantization (cf5589)
  • [Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant (084eda)
  • [Quantization] Enhance ONNX SmoothQuant tuning structure (f0d51c)
  • [Quantization] Enhance PyTorch SmoothQuant tuning structure (81da40)
  • [Quantization] Update PyTorch examples dataloader to support transformers 4.31.x (59371f)
  • [Quantization] Enhance ONNX Runtime backend setting for GPU EP support (295535)
  • [Pruning] Refactor pruning (92d14d)
  • [Mixed Precision] Update the list of supported layers for Keras mixed precision (692c8b)
  • [Mixed Precision] Introduce quant_level into mixed precision (0dc6a9)

Productivity

  • [Ecosystem] MSFT Olive integrated SmoothQuant and 3 LLM examples (#411, #412, #469)
  • [Ecosystem] MSFT ONNX Runtime integrated SmoothQuant static quantization (#16288)
  • [Neural Insights] Support PyTorch FX inspect tensor and integrate with Neural Insights (775def, 74a785)
  • [Neural Insights] Add step-by-step diagnosis cases (99c3b0)
  • [Neural Solution] Resource management and user-facing API enhancement (fbba10)
  • [Auto CI] Integrate auto CI code scan bug fix tools (f77a2c, 06cc38)

Bug Fixes

  • Fix bugs in PyTorch SmoothQuant (0349b9, 8f3645)
  • Fix PyTorch dataloader batch size issue (6a98d0)
  • Fix bugs for ONNX Runtime CUDA EP (a1b566, d1f315)
  • Fix bug in ONNX Runtime adapter where _rename_node function fails with model size > 2 GB (1f6b1a)
  • Fix ONNX Runtime diagnosis bug (f10e26)
  • Update Neural Solution example and fix grpc port issue (528868)
  • Fix the objective initialization issue (9d7546)
  • Fix reshape issue for bayesian strategy (77cb83)
  • Fix CVEs (d86922, 2bbfcd, fc71fa)

Examples

Read more