AutoRound

Added Apr 24, 2026
Model & Inference Framework
Open Source
Python, PyTorch, Large Language Models, Multimodal, Transformers, CLI, Model & Inference Framework, Model Training & Inference, Computer Vision & Multimodal

An advanced post-training quantization toolkit for LLMs and VLMs by Intel, leveraging SignRound optimization to support 2–4 bit weight quantization and automatic mixed-precision scheme generation across Intel CPU/GPU, NVIDIA GPU, and Habana Gaudi.

AutoRound is an Intel-maintained post-training quantization toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs). Its core algorithm, SignRound, uses SignSGD to optimize rounding decisions and weight clipping in roughly 200 steps, combining the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) without introducing extra inference overhead. SignRoundV1 reports average zero-shot accuracy improvements of 6.91%–33.22% at 2-bit weight quantization.
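The core idea can be sketched in a few lines. The toy below is NOT AutoRound's implementation; it only illustrates the mechanism the paper describes: learn a per-weight rounding offset on top of round-to-nearest and update it with SignSGD through a straight-through estimator. The matrix sizes, learning rate, and the per-tensor scale are illustrative assumptions, and the weight-clipping tuning that SignRound also performs is omitted.

```python
import numpy as np

# Toy SignRound-style rounding optimization (illustrative sketch only).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # toy layer weights
x = rng.normal(size=(64, 128)).astype(np.float32)  # calibration activations

n_bits = 2
qmax = 2 ** (n_bits - 1) - 1                       # signed 2-bit range: [-2, 1]
scale = np.abs(w).max() / qmax                     # per-tensor scale, for simplicity

def dequant(v):
    # round-to-nearest plus a learnable offset v in [-0.5, 0.5]
    q = np.clip(np.round(w / scale + v), -(qmax + 1), qmax)
    return q * scale

ref = w @ x                                        # full-precision layer output
mse_rtn = np.mean((dequant(np.zeros_like(w)) @ x - ref) ** 2)

v, lr = np.zeros_like(w), 5e-3
best_v, best_mse = v.copy(), np.inf
for step in range(200):                            # ~200 steps, as in the paper
    err = dequant(v) @ x - ref
    mse = np.mean(err ** 2)
    if mse < best_mse:                             # keep the best iterate seen
        best_mse, best_v = mse, v.copy()
    # straight-through estimator: treat round() as identity for the gradient
    grad_v = err @ x.T
    v = np.clip(v - lr * np.sign(grad_v), -0.5, 0.5)  # SignSGD update
```

Because only the sign of the gradient is used, the update is cheap and robust to gradient magnitude, which is part of why the procedure converges in so few steps.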

The project supports a rich set of quantization data-type combinations, including W2A16, W3A16, W4A16, W8A16, W4A4 (research stage), NVFP4, MXFP4, block-wise FP8, and W8A8, and exports to five formats: AutoRound native, AutoAWQ, AutoGPTQ, GGUF, and LLM-Compressor. SignRoundV2 further introduces a fast sensitivity metric combining gradient information and quantization bias for layer-wise bit allocation, plus a lightweight search that pre-tunes quantization scales.
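The WxAy naming means weights stored at x bits with activations kept at y bits. The toy below unpacks what a W4A16 scheme amounts to in storage terms: group-wise symmetric 4-bit integer weights with a 16-bit float compute path. The group size and shapes are illustrative assumptions, not AutoRound's code.

```python
import numpy as np

# "W4A16": weights as 4-bit signed integers, activations in 16-bit float.
# Toy symmetric group-wise quantizer (one scale per group of 128 weights
# along the input dimension), illustrating the storage layout only.
def quantize_w4(w, group_size=128):
    qmax = 7                                   # signed 4-bit range: [-8, 7]
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
q, scale = quantize_w4(w)                      # 4-bit codes + per-group scales
w_hat = dequantize(q, scale, w.shape)

x = rng.normal(size=(512, 8)).astype(np.float16)  # "A16": fp16 activations
y = w_hat.astype(np.float16) @ x                  # compute stays in 16-bit
```

Each weight's reconstruction error is bounded by half its group's scale, which is why smaller group sizes trade a little extra metadata for better accuracy.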

The AutoScheme feature can automatically generate layer-wise mixed-bit/data-type quantization plans within minutes (extra memory overhead ~1.1–1.5× the BF16 model size), with support for per-layer customization via layer_config. On the engineering side, a 7B model completes W4A16 quantization in ~10 minutes on a single GPU, with three preset schemes (auto-round / auto-round-best / auto-round-light) covering different accuracy-speed tradeoffs.
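To make the memory figure concrete, a quick back-of-the-envelope estimate for the 7B example above (plain arithmetic from the stated 1.1–1.5× ratio, not AutoRound-specific numbers):

```python
# AutoScheme extra-memory estimate for a 7B model quantized from BF16.
params = 7e9                          # 7B parameters
bf16_gib = params * 2 / 1024**3       # BF16 = 2 bytes/param -> ~13.0 GiB
extra_low = 1.1 * bf16_gib            # stated lower bound on extra memory
extra_high = 1.5 * bf16_gib           # stated upper bound
print(f"BF16 size ~ {bf16_gib:.1f} GiB; "
      f"AutoScheme extra ~ {extra_low:.1f}-{extra_high:.1f} GiB")
```

So planning for roughly 14–20 GiB of headroom beyond the model itself covers the stated range for a 7B model.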

Quantized models can be loaded directly in Transformers, vLLM, SGLang, and other mainstream inference frameworks without code modifications. The same quantization pipeline adapts to multiple hardware backends: Intel Xeon CPU, Intel GPU (XPU), NVIDIA GPU (CUDA), and Habana Gaudi (HPU). Additionally, it supports 10+ VLM models out of the box, Multi-Token Prediction (MTP) layer quantization, and switching between HuggingFace and ModelScope model sources via an environment variable.

The underlying CUDA quantization kernels reuse open-source libraries including AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2. The academic foundation includes the SignRoundV1 paper (EMNLP 2024 Findings) and the subsequent SignRoundV2 paper.
