
Quantization

Making large AI models run on smaller hardware.


Definition

Quantization reduces the numerical precision of model weights and activations — from 32-bit or 16-bit floating-point to 8-bit or 4-bit integers. This slashes memory requirements (a 70B model at FP16 needs ~140GB; at INT4, ~35GB) while preserving most performance, making powerful models deployable on consumer hardware.

Floating-point precision explained

Every model parameter is stored as a number. The precision format determines how many bytes each number uses — and therefore the total memory footprint:

| Format | Bits | Bytes/param | 7B model size | 70B model size | Typical use |
|---|---|---|---|---|---|
| FP32 (float32) | 32 | 4 | 28 GB | 280 GB | Pretraining, gradient computation |
| BF16 (bfloat16) | 16 | 2 | 14 GB | 140 GB | Training + inference (A100/H100) |
| FP16 (float16) | 16 | 2 | 14 GB | 140 GB | Inference on older GPUs (V100) |
| INT8 | 8 | 1 | 7 GB | 70 GB | Quantised inference — near-lossless |
| INT4 / NF4 | 4 | 0.5 | 3.5 GB | 35 GB | Quantised inference — standard for local LLMs |
| INT2 / INT3 | 2–3 | 0.25–0.375 | ~2 GB | ~17 GB | Extreme compression — noticeable quality loss |

Why BF16 over FP16?

BF16 and FP16 both use 16 bits but allocate them differently. FP16: 1 sign + 5 exponent + 10 mantissa bits. BF16: 1 sign + 8 exponent + 7 mantissa bits (the same exponent range as FP32). BF16 can therefore represent much larger and smaller magnitudes without overflow, which is critical during training, when gradient magnitudes vary widely. Modern AI GPUs (A100, H100, RTX 4090) have native BF16 tensor cores.
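The overflow difference is easy to demonstrate. Since a bfloat16 value is just the top 16 bits of a float32, it can be emulated with bit manipulation — a hedged sketch (real kernels use hardware BF16 support, and this uses round-half-up rather than the hardware's round-to-nearest-even):

```python
import struct
import numpy as np

def to_bf16(x: float) -> float:
    """Emulate bfloat16: keep a float32's sign + 8 exponent + top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000   # round half up, then truncate to 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(np.float16(70000.0))   # inf: FP16's maximum value is 65504, so it overflows
print(to_bf16(70000.0))      # 70144.0: BF16 keeps FP32's exponent range, losing only mantissa precision
```

The same value survives BF16 with a ~0.2% rounding error but is unrepresentable in FP16 entirely — exactly the behaviour that matters for large gradients during training.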

Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)

| Method | Description | Quality | Cost | Best for |
|---|---|---|---|---|
| Naive PTQ (absmax) | Scale weights linearly to INT8 range | Good for INT8, poor for INT4 | Negligible | Quick INT8 deployment |
| GPTQ | Layer-by-layer quantisation using second-order (Hessian) information to minimise per-layer error | Excellent — near FP16 quality at INT4 | Hours on 1 GPU | Offline GPU inference (vLLM, AutoGPTQ) |
| AWQ (Activation-aware) | Identifies important weights via activation magnitude, protects them from quantisation | Better than GPTQ at INT4 | Hours on 1 GPU | Production GPU inference — state of the art |
| GGUF / llama.cpp | CPU-friendly quantisation with mixed precision per tensor group | Good — especially Q4_K_M | Minutes | Local CPU/Apple Silicon inference |
| QAT (Quantization-Aware Training) | Simulate quantisation noise during training — model adapts | Best quality at any bit width | Full retraining budget | When maximum quality at low bit width is required |
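Naive absmax PTQ from the first row is only a few lines of NumPy. A minimal per-tensor sketch (not a production kernel, which would quantise per-channel or per-group):

```python
import numpy as np

def absmax_quantize(w: np.ndarray):
    """Naive PTQ: map [-max|w|, +max|w|] linearly onto the INT8 range [-127, 127]."""
    scale = 127.0 / np.max(np.abs(w))
    return np.round(w * scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = absmax_quantize(w)
max_err = np.max(np.abs(w - dequantize(q, scale)))
print(max_err)  # at most half a quantisation step, tiny at INT8
```

The weakness the table notes at INT4 follows directly: with only 15 symmetric levels instead of 255, that "half a quantisation step" error grows ~17×, which is why INT4 needs the smarter methods below.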

AWQ vs GPTQ in 2025

AWQ consistently outperforms GPTQ at the same bit width — the key insight is that not all weights are equally important. AWQ identifies the ~1% of weights with the highest activation magnitudes and preserves their precision. For production GPU serving, AWQ INT4 is the current best practice.
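The salience idea can be checked in a few lines of NumPy. This toy version literally keeps the top activation channel's weights in full precision, matching the simplified description above; the actual AWQ method achieves the same protection with per-channel scaling, since mixed-precision storage is hardware-unfriendly. The shapes and the injected outlier channel are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)).astype(np.float32)   # (out_features, in_features)
X = rng.standard_normal((256, 64)).astype(np.float32)  # calibration activations
X[:, 3] *= 100.0  # one outlier activation channel makes its weights "salient"

def int4_absmax(w: np.ndarray) -> np.ndarray:
    """Per-row absmax quantisation to the symmetric INT4 range [-7, 7], dequantised."""
    scale = 7.0 / np.max(np.abs(w), axis=1, keepdims=True)
    return np.round(w * scale) / scale

ref = X @ W.T

# Plain INT4: quantise every weight equally
err_plain = np.abs(ref - X @ int4_absmax(W).T).mean()

# AWQ-style: find the channel(s) with the largest mean activation magnitude
# and protect their weights from quantisation
salient = np.argsort(np.mean(np.abs(X), axis=0))[-1:]
W_mixed = int4_absmax(W)
W_mixed[:, salient] = W[:, salient]
err_protected = np.abs(ref - X @ W_mixed.T).mean()

print(err_plain, err_protected)  # protecting ~1% of weights slashes output error
```

Because output error is quantisation error times activation magnitude, the one outlier channel dominates the total — protecting just its weights removes most of the damage.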

GGUF and llama.cpp: running LLMs locally

GGUF (GPT-Generated Unified Format) is the quantisation format used by llama.cpp — a pure C++ LLM inference library that runs on CPU, Apple Silicon, and consumer GPUs with no CUDA required:

Running a quantised LLM locally with llama.cpp

```bash
# Install llama.cpp (macOS with Metal GPU acceleration)
brew install llama.cpp

# Download a GGUF model (Llama 3.1 8B Q4_K_M = 4.9GB)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run inference (uses the Apple Silicon GPU via Metal)
llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -n 512 -p "Explain quantum computing in 3 sentences" \
    --gpu-layers 99   # offload all layers to GPU

# GGUF quantisation levels (Q4_K_M recommended):
# Q2_K:   ~2.6GB, significant quality loss
# Q4_K_M: ~4.9GB, excellent quality — best size/quality balance
# Q5_K_M: ~5.7GB, near-lossless
# Q8_0:   ~8.5GB, essentially identical to FP16
# F16:    ~15GB,  full precision
```

Running 70B on a MacBook Pro

A MacBook Pro M3 Max with 128GB unified memory can run Llama 3 70B Q4_K_M (~40GB) entirely in memory at ~8–12 tokens/second. Apple Silicon's unified memory architecture (no separate VRAM) makes it uniquely capable for large quantised models — memory bandwidth (400 GB/s on the M3 Max, 800 GB/s on the M3 Ultra) is the main bottleneck.
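The tokens/second figure follows directly from the bandwidth bound: each generated token must stream every active weight through memory once, so decode speed is roughly bandwidth divided by model size. A back-of-envelope sketch only — real throughput also depends on KV-cache reads and kernel efficiency:

```python
model_bytes = 40e9   # Llama 3 70B at Q4_K_M, ~40 GB of weights
bandwidth = 400e9    # M3 Max unified-memory bandwidth, bytes/second

# Upper bound on autoregressive decode speed: one full weight pass per token
print(bandwidth / model_bytes)  # 10.0 tokens/second, consistent with the observed 8-12
```

The same arithmetic explains why quantisation speeds up decoding even on compute-rich hardware: halving the bytes per weight halves the data streamed per token.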

Mixed precision and hardware support

| Hardware | Native INT8 ops | Native INT4 ops | Unified memory | Best for |
|---|---|---|---|---|
| NVIDIA A100 80GB | ✅ 624 TOPS | ✅ 1248 TOPS | ❌ | Large model training + serving |
| NVIDIA H100 80GB | ✅ 1979 TOPS INT8 | ✅ FP8 native (Transformer Engine) | ❌ | Frontier model training |
| NVIDIA RTX 4090 24GB | ✅ 1457 TOPS INT8 | ⚠️ via software | ❌ | Consumer fine-tuning + inference |
| Apple M3 Max 128GB | ✅ (ANE) | ✅ (ANE) | ✅ 400 GB/s | Local large model inference |
| Apple M3 Ultra | ✅ (ANE) | ✅ (ANE) | ✅ 800 GB/s | Best local inference available (2025) |

Mixed precision inference pattern

Production LLM inference uses mixed precision: weights stored in INT4 on disk/VRAM, loaded and dequantised to BF16 for actual matrix multiplications (accumulation in BF16/FP32 preserves numerical stability), then results cast back. This pattern (store in INT4, compute in BF16) achieves 90–95% of the memory reduction with near-FP16 accuracy.
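A minimal NumPy sketch of the store-low/compute-high pattern (FP32 stands in for BF16, which NumPy lacks natively; shapes are illustrative):

```python
import numpy as np

def pack_int4(w: np.ndarray):
    """Store side: per-row absmax scales plus integer codes in the INT4 range [-7, 7]."""
    scale = (np.max(np.abs(w), axis=1, keepdims=True) / 7.0).astype(np.float32)
    return np.round(w / scale).astype(np.int8), scale

def matmul_w4a16(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Compute side: dequantise just before the matmul, which runs in full precision."""
    w = q.astype(np.float32) * scale   # cheap elementwise dequantisation
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal((4, 16)).astype(np.float32)
q, s = pack_int4(w)
mean_err = np.mean(np.abs(x @ w.T - matmul_w4a16(x, q, s)))
print(mean_err)  # small: the matmul and accumulation never see INT4 values
```

Real inference kernels fuse the dequantisation into the matmul so the low-bit weights never materialise in full precision in memory; the structure, though, is exactly this.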

Quantization's impact on model quality

| Bit width | Quality vs FP16 (large models, 70B+) | Quality vs FP16 (small models, 3–7B) | Recommended? |
|---|---|---|---|
| FP16 / BF16 | 100% (baseline) | 100% (baseline) | ✅ If VRAM allows |
| INT8 (GPTQ/AWQ) | ~99.5% — essentially lossless | ~98% — minimal degradation | ✅ Yes — free performance at half the memory |
| INT4 (GPTQ/AWQ) | ~96–98% — barely noticeable | ~92–95% — slightly noticeable on hard tasks | ✅ Yes — standard for local inference |
| INT4 (GGUF Q4_K_M) | ~96% — comparable to AWQ | ~91–94% | ✅ Yes — best for CPU/Apple Silicon |
| INT3 / INT2 | ~85–90% — noticeable regression | ~75–85% — significant degradation | ⚠️ Only when size is critical |

The 70B INT4 vs 13B FP16 principle

A 70B parameter model at INT4 (~35GB) fits on the same GPU as a 13B model at FP16 (~26GB). The 70B INT4 almost always outperforms the 13B FP16 — larger models tolerate lower precision much better than smaller ones. This makes quantisation the standard approach for maximising capability-per-dollar in deployment.
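The arithmetic behind the comparison is one line: billions of parameters times bytes per parameter gives gigabytes of weight memory (ignoring KV cache and runtime overhead):

```python
def model_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB: 1B params at 1 byte each is 1 GB (decimal units)."""
    return params_billions * bytes_per_param

print(model_gb(70, 0.5))  # 35.0 GB: 70B at INT4 (0.5 bytes/param)
print(model_gb(13, 2.0))  # 26.0 GB: 13B at FP16 (2 bytes/param), same hardware class
```

Both fit a single 40–48GB GPU, and the 70B model brings far more capability per byte.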

Practice questions

  1. What is the difference between post-training quantisation (PTQ) and quantisation-aware training (QAT)? (Answer: PTQ: quantise a trained model's weights without any retraining — fast (minutes), no training data required. Quality: INT8 PTQ achieves ~1% accuracy drop; INT4 PTQ achieves ~3-5% drop. QAT: simulate quantisation during training (fake quantise and dequantise weights in the forward pass, train with full precision gradients). The model adapts to quantisation noise. Quality: QAT matches full precision accuracy at INT8; INT4 QAT achieves ~1% drop. Required for very low bit (INT4 and below) without significant accuracy loss.)
  2. What is GPTQ and why is it important for LLM quantisation? (Answer: GPTQ (Frantar et al. 2022): layer-by-layer post-training quantisation using second-order weight updates (Hessian-based). For each layer, it iteratively quantises weights and compensates for quantisation error by updating the remaining unquantised weights in that layer — using the inverse Hessian of the loss. Achieves INT4 quantisation of 175B GPT-3 in 4 GPU-hours with <1% perplexity increase. Critical for making large LLMs practically deployable: GPT-J, LLaMA, and Falcon all have GPTQ-quantised versions serving millions of users.)
  3. What is AWQ (Activation-Aware Weight Quantisation) and how does it improve on GPTQ? (Answer: AWQ (Lin et al. 2023): identifies salient weights (those corresponding to large activation channels) and protects them by scaling before quantisation. Observation: 1% of weights are crucial — they correspond to channels with very large input activations. GPTQ quantises all weights equally. AWQ scales crucial weight channels by an activation-dependent factor before quantisation, effectively giving them higher precision. Result: AWQ achieves better perplexity than GPTQ at same bit width, especially at very low precision (INT3). AWQ is the preferred quantisation method in llama.cpp and many deployment frameworks.)
  4. What is the difference between weight-only quantisation and weight-activation quantisation? (Answer: Weight-only (W4A16): quantise model weights to INT4/INT8; keep activations and computations in FP16. Reduces model size (memory bandwidth) but not compute FLOPs. Dequantise weights to FP16 before matrix multiply. Memory-bandwidth-bound operations benefit; compute-bound operations do not. Weight-activation (W8A8): quantise both weights AND activations to INT8. Enables INT8 matrix multiply (much faster on A100/H100 Tensor Cores). Requires careful per-token activation quantisation — LLM.int8() and SmoothQuant handle this.)
  5. A 7B model at FP16 requires 14GB VRAM. What quantisation enables running it on a 6GB GPU? (Answer: INT4 weight-only quantisation (GPTQ/AWQ/NF4): each weight is 4 bits instead of 16 — a 4× size reduction. 14GB / 4 = 3.5GB of weights + ~1.5GB for KV cache and activations ≈ 5GB total, which fits in a 6GB GPU. The bitsandbytes library (used by QLoRA) offers NF4 (NormalFloat 4-bit), which better preserves values near zero, where most weights cluster after training. Practical routes: a GGUF Q4_K_M build in llama.cpp, or Hugging Face's BitsAndBytesConfig(load_in_4bit=True).)
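The NF4 intuition from the last answer — codebook levels matched to the weight distribution beat uniformly spaced levels — can be checked empirically. A hedged sketch using data-driven quantiles (real NF4 uses a fixed 16-level codebook derived from the standard normal distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10000).astype(np.float32)  # trained weights are roughly Gaussian

def quantize_to_levels(w: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Snap each weight to its nearest codebook level (16 levels = 4 bits)."""
    idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
    return levels[idx]

# Uniform INT4-style levels vs quantile-based (NF4-like) levels
uniform_levels = np.linspace(w.min(), w.max(), 16)
quantile_levels = np.quantile(w, np.linspace(0.01, 0.99, 16))

err_uniform = np.mean((w - quantize_to_levels(w, uniform_levels)) ** 2)
err_quantile = np.mean((w - quantize_to_levels(w, quantile_levels)) ** 2)
print(err_uniform, err_quantile)  # quantile levels fit Gaussian weights better
```

Uniform levels waste codes on the sparsely populated tails; quantile levels concentrate codes where the weight mass actually is, which is exactly why NF4 outperforms plain INT4 at the same bit width.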
