Low-Bit Quantization Enables LLMs/VLMs on Edge Devices
Low-bit quantization, a technique that compresses models and reduces memory demands, enables more efficient inference on resource-constrained edge devices.
Below is a list of GitHub projects that implement low-bit quantization (the list is subject to change). For practitioners, integrating tools like OmniServe or VPTQ can dramatically reduce inference costs while scaling deployments on edge devices.
Featured Projects
1. Microsoft/VPTQ
GitHub: microsoft/VPTQ
Key Features:
- Post-Training Quantization (PTQ) method leveraging vector quantization.
- Strong accuracy at 1-2 bits (e.g., 405B @ <2 bits, 70B @ 2 bits) without retraining.
- Lightweight quantization algorithm: only ~17 hours to quantize Llama-3.1 405B.
- Agile quantized inference: low decode overhead, high throughput, and low time to first token (TTFT).
- Integrates with Hugging Face Transformers for GPU deployment (A100, RTX 4090); see the loading sketch after this list.
- Preliminary support for inference with DeepSeek-R1.
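A minimal loading sketch follows, assuming the `vptq` pip package exposes an `AutoModelForCausalLM` wrapper that plugs into Hugging Face Transformers; the exact API and the VPTQ-community model id shown are illustrative and may differ between releases.
```python
# Hedged sketch: assumes `pip install vptq`; the model id below is illustrative.
import transformers
import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain low-bit quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```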
2. Microsoft/BitNet
GitHub: Microsoft/BitNet
Key Features:
- Built on the llama.cpp framework.
- BitNet a4.8: 4-bit activations for 1-bit LLMs.
- BitNet b1.58: fast and lossless inference on CPUs.
- Ternary lookup table (TL) kernels.
- Int2 with a scale (I2_S) kernels; see the conceptual sketch after this list.
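To make the I2_S idea concrete, here is a conceptual NumPy sketch (not the bitnet.cpp kernel code): weights are rounded to ternary values with a single per-tensor scale, then four 2-bit codes are packed into each byte.
```python
import numpy as np

def quantize_i2s(w: np.ndarray):
    # Per-tensor scale via the absolute mean (absmean), as used for ternary 1.58-bit weights.
    scale = np.mean(np.abs(w)) + 1e-8
    # Round each weight to the nearest value in {-1, 0, +1}.
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def pack_2bit(q: np.ndarray) -> np.ndarray:
    # Map {-1, 0, +1} to unsigned codes {0, 1, 2} and pack four codes per byte.
    codes = (q + 1).astype(np.uint8).reshape(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_i2s(w)
packed = pack_2bit(q)                   # 2 bits per weight instead of 32
w_hat = q.astype(np.float32) * scale    # dequantized approximation of w
```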
3. NVIDIA/TensorRT-Model-Optimizer
GitHub: NVIDIA/TensorRT-Model-Optimizer
Key Features:
- NVFP4 models quantized with Model Optimizer are available for download on Hugging Face: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, and Llama-3.1-405B-Instruct-FP4.
- FP8 Llama-3.1 Instruct models quantized with Model Optimizer are available for download on Hugging Face: 8B, 70B, and 405B. A PTQ sketch using the library follows this list.
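A hedged sketch of post-training quantization with the `nvidia-modelopt` package is shown below; the config name `FP8_DEFAULT_CFG` and the calibration details are assumptions that should be checked against the version you install.
```python
# Sketch only: config names and calibration requirements may differ across
# nvidia-modelopt releases.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative model id
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration pass: run a few representative prompts through the model.
    batch = tokenizer(["A short calibration prompt."], return_tensors="pt").to("cuda")
    m(**batch)

# Swap in quantized modules and calibrate their scales.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```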
4. NVIDIA/TensorRT-LLM
GitHub: NVIDIA/TensorRT-LLM
Key Features:
- DeepSeek-R1 performance now optimized for Blackwell GPUs.
- New KV-cache reuse optimizations.
- Inference-time scaling for generating optimized GPU kernels. A sketch of the high-level Python API follows this list.
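For context, the sketch below uses TensorRT-LLM's high-level Python `LLM` API; the argument names and output fields are assumptions based on recent releases and may change.
```python
# Hedged sketch of the high-level LLM API; exact arguments may vary by release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")      # builds or loads an engine
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["What is low-bit quantization?"], params):
    print(output.outputs[0].text)
```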
5. Vahe1994/AQLM
GitHub: Vahe1994/AQLM
Key Features:
- Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization.
- Extends AQLM with a new fine-tuning algorithm, PV-Tuning (Beyond Straight-Through Estimation for Extreme LLM Compression); a loading sketch follows this list.
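AQLM checkpoints published on Hugging Face can typically be loaded through Transformers once the `aqlm` package is installed; the model id below is illustrative.
```python
# Hedged sketch: requires `pip install aqlm[gpu]`; the model id is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-8B-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
```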
6. bitsandbytes-foundation/bitsandbytes
GitHub: bitsandbytes-foundation/bitsandbytes
Key Features:
- 8-bit optimizers use block-wise quantization to maintain 32-bit performance at a small fraction of the memory cost.
- LLM.int8(), or 8-bit quantization, enables large language model inference with only half the required memory and without performance degradation. The method uses vector-wise quantization to quantize most features to 8 bits, while outliers are handled separately with 16-bit matrix multiplication.
- QLoRA, or 4-bit quantization, enables large language model training with several memory-saving techniques that don't compromise performance. The method quantizes a model to 4 bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training; see the loading example after this list.
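The QLoRA-style 4-bit path is exposed through Transformers' `BitsAndBytesConfig`; the example below loads an illustrative model in NF4 with double quantization.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight storage
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",     # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```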
7. usyd-fsalab/fp6_llm
GitHub: usyd-fsalab/fp6_llm
Key Features:
- Supports model weights in FP6_e3m2 or FP5_e2m2 with activations in FP16.
- Tensor Core-enabled CUDA implementation of mixed-input matrix multiplication for linear layers (FP6 weights, FP16 activations); a conceptual FP6 quantizer sketch follows this list.
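As a rough intuition for what FP6_e3m2 can represent, the sketch below enumerates the format's representable magnitudes (assuming exponent bias 3 and no inf/NaN encodings) and rounds a tensor to the nearest one; the real kernels also apply scaling and bit-packing, which are omitted here.
```python
import numpy as np

def fp6_e3m2_grid() -> np.ndarray:
    # All non-negative FP6 (1 sign, 3 exponent, 2 mantissa bits) magnitudes,
    # assuming exponent bias 3 and no inf/NaN encodings; max value is 28.
    vals = [0.0]
    for e in range(8):
        for m in range(4):
            if e == 0:
                vals.append((m / 4.0) * 2.0 ** (1 - 3))        # subnormals
            else:
                vals.append((1.0 + m / 4.0) * 2.0 ** (e - 3))  # normals
    return np.unique(np.array(vals, dtype=np.float32))

def quantize_fp6_e3m2(w: np.ndarray) -> np.ndarray:
    grid = fp6_e3m2_grid()
    mags = np.minimum(np.abs(w), grid[-1])                     # saturate at the max magnitude
    idx = np.abs(mags[..., None] - grid).argmin(axis=-1)       # nearest representable magnitude
    return np.sign(w) * grid[idx]

w = np.random.randn(8, 8).astype(np.float32)
print(np.abs(w - quantize_fp6_e3m2(w)).max())                  # rounding error of the format
```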
8. mit-han-lab/OmniServe
GitHub: mit-han-lab/omniserve
Key Features:
- Unified engine combining QServe (W4A8KV4 quantization) and LServe (long-context optimization).
- QServe: efficient and accurate LLM serving system on GPUs with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache), suitable for large-scale synthetic data generation with both LLMs and VLMs; see the W4A8 sketch after this list.
- LServe: efficient long-sequence LLM serving with unified sparse attention.
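The W4A8 half of W4A8KV4 can be sketched with plain NumPy: symmetric 4-bit per-channel weights, 8-bit per-token activations, an integer GEMM, then a rescale. This is a conceptual illustration, not the QServe kernels.
```python
import numpy as np

def sym_quant(x: np.ndarray, bits: int, axis: int):
    # Symmetric per-axis quantization: scale = max|x| / qmax, values rounded to integers.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(128, 128).astype(np.float32)   # (out_features, in_features)
X = np.random.randn(16, 128).astype(np.float32)    # (tokens, in_features)
qW, sW = sym_quant(W, bits=4, axis=1)              # 4-bit weights, per-output-channel scales
qX, sX = sym_quant(X, bits=8, axis=1)              # 8-bit activations, per-token scales

# Integer GEMM plus rescaling approximates X @ W.T; the gap is the quantization error.
Y = (qX.astype(np.int32) @ qW.T.astype(np.int32)).astype(np.float32) * sX * sW.T
print(np.abs(Y - X @ W.T).max())
```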
9. mit-han-lab/nunchaku
GitHub: mit-han-lab/nunchaku
Key Features:
- Implements SVDQuant, a 4-bit method that uses low-rank branches to absorb outliers (see the sketch after this list).
- Optimized for NVIDIA Blackwell GPUs (NVFP4 format).
- Maintains 16-bit quality in models like FLUX; released NVFP4 4-bit Shuttle-Jaguar and FLUX.1-tools models and upgraded the INT4 FLUX.1-tools models.
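The core SVDQuant idea can be sketched in a few lines: keep a small low-rank branch of the weight in high precision so the 4-bit residual has a narrower range. The real method also migrates activation outliers into the weights via smoothing before the SVD; the sketch below covers only the weight side.
```python
import numpy as np

def svdquant_sketch(W: np.ndarray, rank: int = 32, bits: int = 4) -> np.ndarray:
    # Low-rank branch (kept in high precision) absorbs the large components of W.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # The residual has a much smaller dynamic range, so 4-bit quantization hurts less.
    R = W - L
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(R), axis=1, keepdims=True) / qmax
    Rq = np.clip(np.round(R / scale), -qmax - 1, qmax) * scale
    return L + Rq   # reconstructed weight: low-rank branch + quantized residual

W = np.random.randn(256, 256).astype(np.float32)
print(np.abs(W - svdquant_sketch(W)).max())
```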
10. mit-han-lab/smoothquant
GitHub: mit-han-lab/smoothquant
Key Features:
- SmoothQuant enables INT8 model inference on AMD Instinct MI300X using Composable Kernel.
- Enables W8A8 quantization for Llama-1/2/3, Falcon, Mistral, and Mixtral models with negligible accuracy loss; see the smoothing sketch after this list.
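SmoothQuant's key step is a mathematically equivalent rescaling that moves activation outliers into the weights before both are quantized to INT8; the sketch below shows that rescaling with an assumed smoothing strength alpha = 0.5.
```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    # Per-input-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    act_max = np.max(np.abs(X), axis=0)            # activation range per channel (from calibration)
    w_max = np.max(np.abs(W), axis=1)              # weight range per input channel
    s = np.maximum(act_max, 1e-8) ** alpha / np.maximum(w_max, 1e-8) ** (1 - alpha)
    # X @ W == (X / s) @ (diag(s) @ W): outliers migrate from activations into weights.
    return X / s, W * s[:, None]

X = np.random.randn(64, 512).astype(np.float32) * np.random.lognormal(size=512).astype(np.float32)
W = np.random.randn(512, 256).astype(np.float32)
Xs, Ws = smooth(X, W)
print(np.abs(X @ W - Xs @ Ws).max())               # equal up to floating-point error
```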
Key Trends
- Extreme Compression: Projects like VPTQ push boundaries with 1-2 bit models.
- Hardware Co-Design: Tools like QServe and SVDQuant leverage GPU-specific optimizations (e.g., NVFP4).
Further Reading
- Advances to low-bit quantization enable LLMs on edge devices
- Low-bit Quantization of Neural Networks for Efficient Inference
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
- LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration
- MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
- bitsandbytes: k-bit quantization for PyTorch
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
- Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
- GGUF
- Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer
Explore the links above to dive deeper into code and benchmarks!