Low-Bit Quantization Enables LLMs/VLMs on Edge Devices
Low-bit quantization, a technique that compresses models and reduces memory demands, enables more efficient inference on resource-constrained edge devices.
Below is a list of GitHub projects that implement low-bit quantization (the list is subject to change). For practitioners, integrating tools like OmniServe or VPTQ can dramatically reduce inference costs while scaling deployments on edge devices.
Featured Projects
1. Microsoft/VPTQ
GitHub: microsoft/VPTQ
Key Features:
- Post-Training Quantization (PTQ) method leveraging vector quantization.
- Strong accuracy at 1-2 bits (e.g., 405B @ <2 bits, 70B @ 2 bits) without retraining.
- Lightweight quantization algorithm: only ~17 hours to quantize Llama-3.1 405B.
- Agile quantized inference: low decode overhead, high throughput, and low time to first token (TTFT).
- Integrates with Hugging Face Transformers for GPU deployment (A100, RTX 4090); see the loading sketch after this list.
- Preliminary support for inference with DeepSeek-R1.
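A minimal loading sketch follows, assuming the `vptq` pip package exposes an `AutoModelForCausalLM` wrapper that plugs into Hugging Face Transformers; the exact API and the VPTQ-community model id shown are illustrative and may differ between releases.
```python
# Hedged sketch: assumes `pip install vptq`; the model id below is illustrative.
import transformers
import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain low-bit quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```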
2. Microsoft/BitNet
GitHub: Microsoft/BitNet
Key Features:
- Built on the llama.cpp framework.
- BitNet a4.8: 4-bit activations for 1-bit LLMs.
- BitNet b1.58: fast and lossless inference on CPUs.
- Ternary lookup table (TL) kernels.
- Int2 with a scale (I2_S) kernels; see the conceptual sketch after this list.
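To make the I2_S idea concrete, here is a conceptual NumPy sketch (not the bitnet.cpp kernel code): weights are rounded to ternary values with a single per-tensor scale, then four 2-bit codes are packed into each byte.
```python
import numpy as np

def quantize_i2s(w: np.ndarray):
    # Per-tensor scale via the absolute mean (absmean), as used for ternary 1.58-bit weights.
    scale = np.mean(np.abs(w)) + 1e-8
    # Round each weight to the nearest value in {-1, 0, +1}.
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def pack_2bit(q: np.ndarray) -> np.ndarray:
    # Map {-1, 0, +1} to unsigned codes {0, 1, 2} and pack four codes per byte.
    codes = (q + 1).astype(np.uint8).reshape(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_i2s(w)
packed = pack_2bit(q)                   # 2 bits per weight instead of 32
w_hat = q.astype(np.float32) * scale    # dequantized approximation of w
```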
3. NVIDIA/TensorRT-Model-Optimizer
GitHub: NVIDIA/TensorRT-Model-Optimizer
Key Features:
- NVFP4 models quantized with Model Optimizer are available for download on Hugging Face: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, and Llama-3.1-405B-Instruct-FP4.
- FP8 Llama-3.1 Instruct models quantized with Model Optimizer are available for download on Hugging Face: 8B, 70B, and 405B. A PTQ sketch using the library follows this list.
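A hedged sketch of post-training quantization with the `nvidia-modelopt` package is shown below; the config name `FP8_DEFAULT_CFG` and the calibration details are assumptions that should be checked against the version you install.
```python
# Sketch only: config names and calibration requirements may differ across
# nvidia-modelopt releases.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative model id
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration pass: run a few representative prompts through the model.
    batch = tokenizer(["A short calibration prompt."], return_tensors="pt").to("cuda")
    m(**batch)

# Swap in quantized modules and calibrate their scales.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```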
4. NVIDIA/TensorRT-LLM
GitHub: NVIDIA/TensorRT-LLM
Key Features:
- DeepSeek-R1 performance now optimized for Blackwell GPUs.
- New KV-cache reuse optimizations.
- Inference-time scaling for generating optimized GPU kernels. A sketch of the high-level Python API follows this list.
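For context, the sketch below uses TensorRT-LLM's high-level Python `LLM` API; the argument names and output fields are assumptions based on recent releases and may change.
```python
# Hedged sketch of the high-level LLM API; exact arguments may vary by release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")      # builds or loads an engine
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["What is low-bit quantization?"], params):
    print(output.outputs[0].text)
```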
5. Vahe1994/AQLM
GitHub: Vahe1994/AQLM
Key Features:
- Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization.
- Extends AQLM with a new fine-tuning algorithm, PV-Tuning (Beyond Straight-Through Estimation for Extreme LLM Compression); a loading sketch follows this list.
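AQLM checkpoints published on Hugging Face can typically be loaded through Transformers once the `aqlm` package is installed; the model id below is illustrative.
```python
# Hedged sketch: requires `pip install aqlm[gpu]`; the model id is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-8B-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
```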
6. bitsandbytes-foundation/bitsandbytes
GitHub: bitsandbytes-foundation/bitsandbytes
Key Features:
- 8-bit optimizers use block-wise quantization to maintain 32-bit performance at a small fraction of the memory cost.
- LLM.int8(), or 8-bit quantization, enables large language model inference with only half the required memory and without performance degradation. The method uses vector-wise quantization to quantize most features to 8 bits, while outliers are handled separately with 16-bit matrix multiplication.
- QLoRA, or 4-bit quantization, enables large language model training with several memory-saving techniques that don't compromise performance. The method quantizes a model to 4 bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training; see the loading example after this list.
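The QLoRA-style 4-bit path is exposed through Transformers' `BitsAndBytesConfig`; the example below loads an illustrative model in NF4 with double quantization.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight storage
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",     # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```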
7. usyd-fsalab/fp6_llm
GitHub: usyd-fsalab/fp6_llm
Key Features:
- Supports model weights in FP6_e3m2 or FP5_e2m2 with activations in FP16.
- Tensor Core-enabled CUDA implementation of mixed-input matrix multiplication for linear layers (FP6 weights, FP16 activations); a conceptual FP6 quantizer sketch follows this list.
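As a rough intuition for what FP6_e3m2 can represent, the sketch below enumerates the format's representable magnitudes (assuming exponent bias 3 and no inf/NaN encodings) and rounds a tensor to the nearest one; the real kernels also apply scaling and bit-packing, which are omitted here.
```python
import numpy as np

def fp6_e3m2_grid() -> np.ndarray:
    # All non-negative FP6 (1 sign, 3 exponent, 2 mantissa bits) magnitudes,
    # assuming exponent bias 3 and no inf/NaN encodings; max value is 28.
    vals = [0.0]
    for e in range(8):
        for m in range(4):
            if e == 0:
                vals.append((m / 4.0) * 2.0 ** (1 - 3))        # subnormals
            else:
                vals.append((1.0 + m / 4.0) * 2.0 ** (e - 3))  # normals
    return np.unique(np.array(vals, dtype=np.float32))

def quantize_fp6_e3m2(w: np.ndarray) -> np.ndarray:
    grid = fp6_e3m2_grid()
    mags = np.minimum(np.abs(w), grid[-1])                     # saturate at the max magnitude
    idx = np.abs(mags[..., None] - grid).argmin(axis=-1)       # nearest representable magnitude
    return np.sign(w) * grid[idx]

w = np.random.randn(8, 8).astype(np.float32)
print(np.abs(w - quantize_fp6_e3m2(w)).max())                  # rounding error of the format
```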
8. mit-han-lab/OmniServe
GitHub: mit-han-lab/omniserve
Key Features:
- Unified engine combining QServe (W4A8KV4 quantization) and LServe (long-context optimization).
- QServe: efficient and accurate LLM serving system on GPUs with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache), suitable for large-scale synthetic data generation with both LLMs and VLMs; see the W4A8 sketch after this list.
- LServe: efficient long-sequence LLM serving with unified sparse attention.
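The W4A8 half of W4A8KV4 can be sketched with plain NumPy: symmetric 4-bit per-channel weights, 8-bit per-token activations, an integer GEMM, then a rescale. This is a conceptual illustration, not the QServe kernels.
```python
import numpy as np

def sym_quant(x: np.ndarray, bits: int, axis: int):
    # Symmetric per-axis quantization: scale = max|x| / qmax, values rounded to integers.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(128, 128).astype(np.float32)   # (out_features, in_features)
X = np.random.randn(16, 128).astype(np.float32)    # (tokens, in_features)
qW, sW = sym_quant(W, bits=4, axis=1)              # 4-bit weights, per-output-channel scales
qX, sX = sym_quant(X, bits=8, axis=1)              # 8-bit activations, per-token scales

# Integer GEMM plus rescaling approximates X @ W.T; the gap is the quantization error.
Y = (qX.astype(np.int32) @ qW.T.astype(np.int32)).astype(np.float32) * sX * sW.T
print(np.abs(Y - X @ W.T).max())
```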
9. mit-han-lab/nunchaku
GitHub: mit-han-lab/nunchaku
Key Features:
- Implements SVDQuant, a 4-bit method that uses low-rank branches to absorb outliers (see the sketch after this list).
- Optimized for NVIDIA Blackwell GPUs (NVFP4 format).
- Maintains 16-bit quality in models like FLUX; released NVFP4 4-bit Shuttle-Jaguar and FLUX.1-tools models and upgraded the INT4 FLUX.1-tools models.
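The core SVDQuant idea can be sketched in a few lines: keep a small low-rank branch of the weight in high precision so the 4-bit residual has a narrower range. The real method also migrates activation outliers into the weights via smoothing before the SVD; the sketch below covers only the weight side.
```python
import numpy as np

def svdquant_sketch(W: np.ndarray, rank: int = 32, bits: int = 4) -> np.ndarray:
    # Low-rank branch (kept in high precision) absorbs the large components of W.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # The residual has a much smaller dynamic range, so 4-bit quantization hurts less.
    R = W - L
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(R), axis=1, keepdims=True) / qmax
    Rq = np.clip(np.round(R / scale), -qmax - 1, qmax) * scale
    return L + Rq   # reconstructed weight: low-rank branch + quantized residual

W = np.random.randn(256, 256).astype(np.float32)
print(np.abs(W - svdquant_sketch(W)).max())
```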
10. mit-han-lab/smoothquant
GitHub: mit-han-lab/smoothquant
Key Features:
- SmoothQuant enables INT8 model inference on AMD Instinct MI300X using Composable Kernel.
- Enables W8A8 quantization for Llama-1/2/3, Falcon, Mistral, and Mixtral models with negligible accuracy loss; see the smoothing sketch after this list.
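SmoothQuant's key step is a mathematically equivalent rescaling that moves activation outliers into the weights before both are quantized to INT8; the sketch below shows that rescaling with an assumed smoothing strength alpha = 0.5.
```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    # Per-input-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    act_max = np.max(np.abs(X), axis=0)            # activation range per channel (from calibration)
    w_max = np.max(np.abs(W), axis=1)              # weight range per input channel
    s = np.maximum(act_max, 1e-8) ** alpha / np.maximum(w_max, 1e-8) ** (1 - alpha)
    # X @ W == (X / s) @ (diag(s) @ W): outliers migrate from activations into weights.
    return X / s, W * s[:, None]

X = np.random.randn(64, 512).astype(np.float32) * np.random.lognormal(size=512).astype(np.float32)
W = np.random.randn(512, 256).astype(np.float32)
Xs, Ws = smooth(X, W)
print(np.abs(X @ W - Xs @ Ws).max())               # equal up to floating-point error
```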
Key Trends
- Extreme Compression: Projects like VPTQ push boundaries with 1-2 bit models.
- Hardware Co-Design: Tools like QServe and SVDQuant leverage GPU-specific optimizations (e.g., NVFP4).
Further Reading
- Advances to low-bit quantization enable LLMs on edge devices
- Low-bit Quantization of Neural Networks for Efficient Inference
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
- LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration
- MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
- bitsandbytes: k-bit quantization for PyTorch
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
- Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
- GGUF
- Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer
Explore the links above to dive deeper into code and benchmarks!