Low-bit quantization, a technique that compresses models and reduces memory demands, offers a solution by enabling more efficient inference on resource-constrained edge devices.

Here is a list of GitHub projects that implement low-bit quantization (list subject to change). For practitioners, integrating tools like OmniServe or VPTQ can dramatically reduce inference costs while scaling deployments on edge devices.


1. Microsoft/VPTQ

GitHub: microsoft/VPTQ
Key Features:

  • PTQ (Post-Training Quantization) method leveraging Vector Quantization (a toy VQ sketch follows after this list).
    • Better accuracy at 1-2 bits (e.g., 405B @ <2 bits, 70B @ 2 bits) without retraining.
    • Lightweight quantization algorithm: quantizing Llama-3.1 405B takes only ~17 hours.
  • Agile quantization inference: low decode overhead, best throughput, and low TTFT.
  • Integrates with Hugging Face Transformers for GPU deployment (A100, RTX 4090).
  • Preliminary support for inference with DeepSeek-R1.
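
To make the vector-quantization idea concrete, here is a toy sketch in plain PyTorch (all names are my own): weights are grouped into short vectors and each vector is replaced by the index of its nearest codebook entry. VPTQ's actual algorithm adds Hessian-aware optimization, residual/outlier codebooks, and optimized inference kernels on top of this.

```python
import torch

def vector_quantize_weight(W, codebook_size=256, vec_dim=8, iters=10):
    """Toy vector quantization: group W into length-`vec_dim` vectors and
    replace each with its nearest entry in a k-means codebook."""
    out_f, in_f = W.shape
    vecs = W.reshape(-1, vec_dim)                       # (num_vecs, vec_dim)

    # Initialize the codebook from random weight vectors, then run a few Lloyd steps.
    codebook = vecs[torch.randperm(vecs.shape[0])[:codebook_size]].clone()
    for _ in range(iters):
        assign = torch.cdist(vecs, codebook).argmin(dim=1)
        for k in range(codebook_size):
            members = vecs[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(dim=0)

    # Storage: low-bit indices (here 8 bits per group of 8 weights) plus a small codebook.
    assign = torch.cdist(vecs, codebook).argmin(dim=1)
    W_hat = codebook[assign].reshape(out_f, in_f)       # dequantized weights
    return codebook, assign, W_hat

W = torch.randn(128, 128)
codebook, idx, W_hat = vector_quantize_weight(W)
print((W - W_hat).abs().mean())                         # reconstruction error
```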

2. Microsoft/BitNet

GitHub: microsoft/BitNet
Key Features:

  • Official inference framework for 1-bit LLMs (bitnet.cpp), built on the llama.cpp framework.
  • BitNet a4.8: 4-bit activations for 1-bit LLMs.
  • BitNet b1.58: fast and lossless inference on CPUs (see the ternarization sketch after this list).
    • Ternary Lookup Table (TL) kernels.
    • Int2 with a Scale (I2_S) kernels.
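
For intuition, here is a minimal sketch of the absmean ternarization described in the BitNet b1.58 paper; the function name is my own, and bitnet.cpp itself runs already-quantized models through optimized CPU kernels rather than this Python.

```python
import torch

def ternarize(W, eps=1e-5):
    """BitNet b1.58-style absmean quantization: scale by the mean absolute
    value, then round every weight to {-1, 0, +1}."""
    gamma = W.abs().mean().clamp(min=eps)        # per-tensor scale
    W_ternary = (W / gamma).round().clamp(-1, 1)
    return W_ternary, gamma                      # ~1.58 bits/weight + one fp scale

W = torch.randn(1024, 1024)
W_t, gamma = ternarize(W)
W_hat = W_t * gamma                              # dequantized view used in matmuls
print(torch.unique(W_t))                         # tensor([-1., 0., 1.])
```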

3. NVIDIA/TensorRT-Model-Optimizer

GitHub: NVIDIA/TensorRT-Model-Optimizer
Key Features:

  • Unified library of model optimization techniques (post-training quantization, quantization-aware training, sparsity, pruning, distillation) for NVIDIA GPUs.
  • Supports low-bit formats such as FP8, INT8 (SmoothQuant), and INT4 (AWQ); quantized checkpoints can be deployed with TensorRT-LLM and TensorRT (a minimal PTQ sketch follows below).

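A minimal PTQ sketch against the project's modelopt Python package; the config name (INT4_AWQ_CFG), the placeholder model id, and the tiny calibration loop are assumptions and should be checked against the ModelOpt docs for your version.

```python
import modelopt.torch.quantization as mtq             # pip install nvidia-modelopt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"                         # small placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration: push a few representative batches through the model so that
    # activation ranges / AWQ scales can be collected (use real data in practice).
    batch = tokenizer(["Low-bit quantization shrinks LLMs."], return_tensors="pt").to("cuda")
    m(**batch)

# INT4 AWQ post-training quantization (config name assumed; FP8/INT8 variants also exist).
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```
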
4. NVIDIA/TensorRT-LLM

GitHub: NVIDIA/TensorRT-LLM
Key Features:

  • Open-source library for high-performance LLM inference on NVIDIA GPUs, with optimized attention kernels, in-flight batching, and paged KV caching.
  • Supports low-bit quantization schemes such as FP8, INT8 SmoothQuant, and INT4 AWQ/GPTQ (a hedged usage sketch follows below).

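A hedged sketch of quantized inference through the high-level LLM API; the QuantConfig/QuantAlgo import path and argument names are assumptions based on the project's examples and may differ between releases.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo   # import path assumed; check your version

# Request 4-bit AWQ weights when building the engine (enum name assumed).
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ),
)

outputs = llm.generate(
    ["Explain low-bit quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```
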
5. Vahe1994/AQLM

GitHub: Vahe1994/AQLM
Key Features:

  • Official implementation of AQLM (Additive Quantization of Language Models), an extreme 2-3 bit PTQ method that represents each group of weights as a sum of vectors drawn from several learned codebooks (a toy reconstruction sketch follows below).
  • Includes PV-Tuning for recovering accuracy of extremely compressed models, plus prequantized checkpoints that load through Hugging Face Transformers.

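The sketch below illustrates only the additive-quantization reconstruction, not AQLM's codebook learning or calibration; all names are my own.

```python
import torch

group_size, n_codebooks, codebook_size = 8, 2, 256

# Learned codebooks (in AQLM these are optimized against a calibration loss;
# here they are random, purely for illustration).
codebooks = torch.randn(n_codebooks, codebook_size, group_size)

# Low-bit storage per weight group: one 8-bit index per codebook.
num_groups = 512
codes = torch.randint(0, codebook_size, (num_groups, n_codebooks))

# Dequantization: sum the selected code vectors from each codebook.
W_hat = torch.zeros(num_groups, group_size)
for c in range(n_codebooks):
    W_hat += codebooks[c, codes[:, c]]

print(W_hat.shape)   # (512, 8) -> reshape back into the layer's weight matrix
```
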
6. bitsandbytes-foundation/bitsandbytes

GitHub: bitsandbytes-foundation/bitsandbytes
Key Features:

  • 8-bit optimizers use block-wise quantization to maintain 32-bit performance at a small fraction of the memory cost.
  • LLM.int8(), or 8-bit quantization, enables large language model inference with only half the required memory and without any performance degradation. It uses vector-wise quantization to quantize most features to 8 bits while treating outliers separately with 16-bit matrix multiplication.
  • QLoRA, or 4-bit quantization, enables large language model training with several memory-saving techniques that don't compromise performance. It quantizes a model to 4 bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training (see the loading sketch after this list).
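
Below is a minimal 4-bit loading sketch through the Hugging Face Transformers integration; the model id is a placeholder and the settings shown are the commonly documented NF4/QLoRA-style options.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) loading: NF4 quantization, bf16 compute, double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",                  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# For LLM.int8() inference instead, use BitsAndBytesConfig(load_in_8bit=True).
```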

7. usyd-fsalab/fp6_llm

GitHub: usyd-fsalab/fp6_llm
Key Features:

  • Supports model weights in FP6 (e3m2) or FP5 (e2m2) format with activations in FP16.
  • CUDA implementation of mixed-input matrix multiplication for linear layers (FP6 weights × FP16 activations) with Tensor Cores enabled (a rounding sketch follows after this list).
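
For intuition, here is a toy sketch of rounding weights onto an FP6 (e3m2) grid; the exponent bias and subnormal handling are assumptions, and the project's real value lies in its fused CUDA Tensor Core kernels, not in this Python rounding.

```python
import torch

def e3m2_grid(exp_bias=3):
    """Enumerate all non-negative values of a 1-3-2 (sign/exponent/mantissa) FP6 format.
    The bias and subnormal convention are assumptions; fp6_llm's encoding may differ."""
    vals = [0.0]
    for e in range(8):            # 3 exponent bits
        for m in range(4):        # 2 mantissa bits
            if e == 0:            # subnormals: 0.m * 2^(1 - bias)
                v = (m / 4.0) * 2.0 ** (1 - exp_bias)
            else:                 # normals: 1.m * 2^(e - bias)
                v = (1.0 + m / 4.0) * 2.0 ** (e - exp_bias)
            vals.append(v)
    return torch.tensor(sorted(set(vals)))

def quantize_fp6(W):
    """Round each weight to the nearest FP6 (e3m2) value after per-channel scaling."""
    grid = e3m2_grid()
    scale = (W.abs().amax(dim=1, keepdim=True) / grid.max()).clamp(min=1e-8)
    Ws = W / scale
    idx = (Ws.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return torch.sign(Ws) * grid[idx] * scale        # dequantized FP6 weights

W = torch.randn(128, 256)
print((W - quantize_fp6(W)).abs().mean())
```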

8. mit-han-lab/OmniServe

GitHub: mit-han-lab/omniserve
Key Features:

  • Unified engine combining QServe (W4A8KV4 quantization) and LServe (long-context optimization).
    • QServe: efficient and accurate LLM serving system on GPUs with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache), suitable for large-scale synthetic data generation with both LLMs and VLMs (a minimal W4A8 sketch follows after this list).
    • LServe: Efficient Long-Sequence LLM Serving with Unified Sparse Attention
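
A toy sketch of the W4A8 idea only (per-group 4-bit weights, per-token 8-bit activations); QServe's real kernels additionally quantize the KV cache to 4 bits and fuse everything on the GPU. All names here are my own.

```python
import torch

def quantize_weights_int4(W, group_size=128):
    """Symmetric per-group 4-bit weight quantization (values in [-8, 7])."""
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    scale = Wg.abs().amax(dim=-1, keepdim=True) / 7.0
    q = (Wg / scale).round().clamp(-8, 7)
    return q, scale                        # store q in 4 bits + one scale per group

def quantize_activations_int8(X):
    """Symmetric per-token 8-bit activation quantization."""
    scale = X.abs().amax(dim=-1, keepdim=True) / 127.0
    q = (X / scale).round().clamp(-128, 127)
    return q, scale

W = torch.randn(512, 512)
X = torch.randn(4, 512)
qW, sW = quantize_weights_int4(W)
qX, sX = quantize_activations_int8(X)

# Dequantized matmul as a correctness check (real kernels compute in low precision).
W_hat = (qW * sW).reshape(512, 512)
X_hat = qX * sX
print((X @ W.T - X_hat @ W_hat.T).abs().mean())
```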

9. mit-han-lab/nunchaku

GitHub: mit-han-lab/nunchaku
Key Features:

  • Implements SVDQuant, a 4-bit method that uses a low-rank branch to absorb outliers (see the sketch after this list).
  • Optimized for NVIDIA Blackwell GPUs (NVFP4 format).
  • Maintains 16-bit quality in models like FLUX; released NVFP4 4-bit Shuttle-Jaguar and FLUX.1-tools models, and upgraded the INT4 FLUX.1-tools models.
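
A toy sketch of the SVDQuant decomposition idea: a truncated-SVD low-rank branch absorbs the dominant component so the residual quantizes well to 4 bits. The actual method also migrates activation outliers into the weights and runs both branches in fused kernels; names below are my own.

```python
import torch

def svdquant_decompose(W, rank=32):
    """Split W into a low-rank branch plus a 4-bit quantized residual."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]            # kept in 16-bit
    L2 = Vh[:rank]
    R = W - L1 @ L2                        # residual has a much smaller dynamic range
    scale = R.abs().amax(dim=1, keepdim=True) / 7.0
    Rq = (R / scale).round().clamp(-8, 7)  # 4-bit signed residual
    return L1, L2, Rq, scale

W = torch.randn(512, 512)
L1, L2, Rq, s = svdquant_decompose(W)
W_hat = L1 @ L2 + Rq * s                   # forward pass: low-rank branch + int4 branch
print((W - W_hat).abs().mean())
```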

10. mit-han-lab/smoothquant

GitHub: mit-han-lab/smoothquant

Key Features:

  • SmoothQuant enables INT8 model inference on AMD Instinct MI300X using Composable Kernel.
  • Enables W8A8 quantization for Llama-1/2/3, Falcon, Mistral, and Mixtral models with negligible accuracy loss (a scale-migration sketch follows after this list).
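
To show the core trick, here is a toy sketch of SmoothQuant's scale migration: per-channel factors move quantization difficulty from activations (which have outlier channels) into weights before standard W8A8 quantization. Names are my own; α = 0.5 is the paper's default.

```python
import torch

def smooth(W, act_absmax, alpha=0.5, eps=1e-5):
    """Compute per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha),
    fold them into the weights, and divide activations by s at runtime."""
    w_absmax = W.abs().amax(dim=0)                    # per input channel, shape (in_features,)
    s = (act_absmax.clamp(min=eps) ** alpha) / (w_absmax.clamp(min=eps) ** (1 - alpha))
    W_smoothed = W * s                                # W' = W * diag(s)
    return W_smoothed, s                              # X' = X / s, so X' W'^T == X W^T

in_features, out_features = 512, 256
W = torch.randn(out_features, in_features)
X = torch.randn(8, in_features)
X[:, 7] *= 50.0                                       # simulate an activation outlier channel

act_absmax = X.abs().amax(dim=0)                      # collected from calibration data
W_s, s = smooth(W, act_absmax)
print(torch.allclose(X @ W.T, (X / s) @ W_s.T, atol=1e-4))
```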

Key Takeaways

  • Extreme Compression: Projects like VPTQ push boundaries with 1-2 bit models.
  • Hardware Co-Design: Tools like QServe and SVDQuant leverage GPU-specific optimizations (e.g., NVFP4).

Further reading

Explore the links above to dive deeper into code and benchmarks!