
Quantization

Making large AI models run on smaller hardware.


Definition

Quantization reduces the numerical precision of model weights and activations — from 32-bit or 16-bit floating-point to 8-bit or 4-bit integers. This slashes memory requirements (a 70B model at FP16 needs ~140GB; at INT4, ~35GB) while preserving most performance, making powerful models deployable on consumer hardware.

Floating-point precision explained

Every model parameter is stored as a number. The precision format determines how many bytes each number uses — and therefore the total memory footprint:

| Format | Bits | Bytes/param | 7B model size | 70B model size | Typical use |
|---|---|---|---|---|---|
| FP32 (float32) | 32 | 4 | 28 GB | 280 GB | Pretraining, gradient computation |
| BF16 (bfloat16) | 16 | 2 | 14 GB | 140 GB | Training + inference (A100/H100) |
| FP16 (float16) | 16 | 2 | 14 GB | 140 GB | Inference on older GPUs (V100) |
| INT8 | 8 | 1 | 7 GB | 70 GB | Quantised inference — near-lossless |
| INT4 / NF4 | 4 | 0.5 | 3.5 GB | 35 GB | Quantised inference — standard for local LLMs |
| INT2 / INT3 | 2–3 | 0.25–0.375 | ~2 GB | ~17 GB | Extreme compression — noticeable quality loss |

Why BF16 over FP16?

BF16 and FP16 both use 16 bits but allocate them differently. FP16: 1 sign + 5 exponent + 10 mantissa bits. BF16: 1 sign + 8 exponent + 7 mantissa bits (the same exponent range as FP32). BF16 can therefore represent much larger and smaller magnitudes without overflow, which is critical during training, when gradient magnitudes vary widely. Modern AI GPUs (A100, H100, RTX 4090) have native BF16 tensor cores.
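The overflow difference is easy to demonstrate. Since a bfloat16 value is just the top 16 bits of a float32, it can be emulated with bit manipulation — a hedged sketch (real kernels use hardware BF16 support, and this uses round-half-up rather than the hardware's round-to-nearest-even):

```python
import struct
import numpy as np

def to_bf16(x: float) -> float:
    """Emulate bfloat16: keep a float32's sign + 8 exponent + top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000   # round half up, then truncate to 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(np.float16(70000.0))   # inf: FP16's maximum value is 65504, so it overflows
print(to_bf16(70000.0))      # 70144.0: BF16 keeps FP32's exponent range, losing only mantissa precision
```

The same value survives BF16 with a ~0.2% rounding error but is unrepresentable in FP16 entirely — exactly the behaviour that matters for large gradients during training.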

Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)

| Method | Description | Quality | Cost | Best for |
|---|---|---|---|---|
| Naive PTQ (absmax) | Scale weights linearly to INT8 range | Good for INT8, poor for INT4 | Negligible | Quick INT8 deployment |
| GPTQ | Layer-by-layer quantisation using second-order (Hessian) information to minimise per-layer error | Excellent — near FP16 quality at INT4 | Hours on 1 GPU | Offline GPU inference (vLLM, AutoGPTQ) |
| AWQ (Activation-aware) | Identifies important weights via activation magnitude, protects them from quantisation | Better than GPTQ at INT4 | Hours on 1 GPU | Production GPU inference — state of the art |
| GGUF / llama.cpp | CPU-friendly quantisation with mixed precision per tensor group | Good — especially Q4_K_M | Minutes | Local CPU/Apple Silicon inference |
| QAT (Quantization-Aware Training) | Simulate quantisation noise during training — model adapts | Best quality at any bit width | Full retraining budget | When maximum quality at low bit width is required |
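Naive absmax PTQ from the first row is only a few lines of NumPy. A minimal per-tensor sketch (not a production kernel, which would quantise per-channel or per-group):

```python
import numpy as np

def absmax_quantize(w: np.ndarray):
    """Naive PTQ: map [-max|w|, +max|w|] linearly onto the INT8 range [-127, 127]."""
    scale = 127.0 / np.max(np.abs(w))
    return np.round(w * scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = absmax_quantize(w)
max_err = np.max(np.abs(w - dequantize(q, scale)))
print(max_err)  # at most half a quantisation step, tiny at INT8
```

The weakness the table notes at INT4 follows directly: with only 15 symmetric levels instead of 255, that "half a quantisation step" error grows ~17×, which is why INT4 needs the smarter methods below.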

AWQ vs GPTQ in 2025

AWQ consistently outperforms GPTQ at the same bit width — the key insight is that not all weights are equally important. AWQ identifies the ~1% of weights with the highest activation magnitudes and preserves their precision. For production GPU serving, AWQ INT4 is the current best practice.
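The salience idea can be checked in a few lines of NumPy. This toy version literally keeps the top activation channel's weights in full precision, matching the simplified description above; the actual AWQ method achieves the same protection with per-channel scaling, since mixed-precision storage is hardware-unfriendly. The shapes and the injected outlier channel are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)).astype(np.float32)   # (out_features, in_features)
X = rng.standard_normal((256, 64)).astype(np.float32)  # calibration activations
X[:, 3] *= 100.0  # one outlier activation channel makes its weights "salient"

def int4_absmax(w: np.ndarray) -> np.ndarray:
    """Per-row absmax quantisation to the symmetric INT4 range [-7, 7], dequantised."""
    scale = 7.0 / np.max(np.abs(w), axis=1, keepdims=True)
    return np.round(w * scale) / scale

ref = X @ W.T

# Plain INT4: quantise every weight equally
err_plain = np.abs(ref - X @ int4_absmax(W).T).mean()

# AWQ-style: find the channel(s) with the largest mean activation magnitude
# and protect their weights from quantisation
salient = np.argsort(np.mean(np.abs(X), axis=0))[-1:]
W_mixed = int4_absmax(W)
W_mixed[:, salient] = W[:, salient]
err_protected = np.abs(ref - X @ W_mixed.T).mean()

print(err_plain, err_protected)  # protecting ~1% of weights slashes output error
```

Because output error is quantisation error times activation magnitude, the one outlier channel dominates the total — protecting just its weights removes most of the damage.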

GGUF and llama.cpp: running LLMs locally

GGUF (GPT-Generated Unified Format) is the quantisation format used by llama.cpp — a pure C++ LLM inference library that runs on CPU, Apple Silicon, and consumer GPUs with no CUDA required:

Running a quantised LLM locally with llama.cpp

```bash
# Install llama.cpp (macOS with Metal GPU acceleration)
brew install llama.cpp

# Download a GGUF model (Llama 3.1 8B Q4_K_M = 4.9GB)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run inference (uses the Apple Silicon GPU via Metal)
llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -n 512 -p "Explain quantum computing in 3 sentences" \
    --gpu-layers 99   # offload all layers to GPU

# GGUF quantisation levels (Q4_K_M recommended):
# Q2_K:   ~2.6GB, significant quality loss
# Q4_K_M: ~4.9GB, excellent quality — best size/quality balance
# Q5_K_M: ~5.7GB, near-lossless
# Q8_0:   ~8.5GB, essentially identical to FP16
# F16:    ~15GB,  full precision
```

Running 70B on a MacBook Pro

A MacBook Pro M3 Max with 128GB unified memory can run Llama 3 70B Q4_K_M (~40GB) entirely in memory at ~8–12 tokens/second. Apple Silicon's unified memory architecture (no separate VRAM) makes it uniquely capable for large quantised models — memory bandwidth (400 GB/s on the M3 Max, 800 GB/s on the M3 Ultra) is the main bottleneck.
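The tokens/second figure follows directly from the bandwidth bound: each generated token must stream every active weight through memory once, so decode speed is roughly bandwidth divided by model size. A back-of-envelope sketch only — real throughput also depends on KV-cache reads and kernel efficiency:

```python
model_bytes = 40e9   # Llama 3 70B at Q4_K_M, ~40 GB of weights
bandwidth = 400e9    # M3 Max unified-memory bandwidth, bytes/second

# Upper bound on autoregressive decode speed: one full weight pass per token
print(bandwidth / model_bytes)  # 10.0 tokens/second, consistent with the observed 8-12
```

The same arithmetic explains why quantisation speeds up decoding even on compute-rich hardware: halving the bytes per weight halves the data streamed per token.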

Mixed precision and hardware support

| Hardware | Native INT8 ops | Native INT4 ops | Unified memory | Best for |
|---|---|---|---|---|
| NVIDIA A100 80GB | ✅ 624 TOPS | ✅ 1248 TOPS | ❌ | Large model training + serving |
| NVIDIA H100 80GB | ✅ 1979 TOPS INT8 | ✅ FP8 native (Transformer Engine) | ❌ | Frontier model training |
| NVIDIA RTX 4090 24GB | ✅ 1457 TOPS INT8 | ⚠️ via software | ❌ | Consumer fine-tuning + inference |
| Apple M3 Max 128GB | ✅ (ANE) | ✅ (ANE) | ✅ 400 GB/s | Local large model inference |
| Apple M3 Ultra | ✅ (ANE) | ✅ (ANE) | ✅ 800 GB/s | Best local inference available (2025) |

Mixed precision inference pattern

Production LLM inference uses mixed precision: weights stored in INT4 on disk/VRAM, loaded and dequantised to BF16 for actual matrix multiplications (accumulation in BF16/FP32 preserves numerical stability), then results cast back. This pattern (store in INT4, compute in BF16) achieves 90–95% of the memory reduction with near-FP16 accuracy.
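A minimal NumPy sketch of the store-low/compute-high pattern (FP32 stands in for BF16, which NumPy lacks natively; shapes are illustrative):

```python
import numpy as np

def pack_int4(w: np.ndarray):
    """Store side: per-row absmax scales plus integer codes in the INT4 range [-7, 7]."""
    scale = (np.max(np.abs(w), axis=1, keepdims=True) / 7.0).astype(np.float32)
    return np.round(w / scale).astype(np.int8), scale

def matmul_w4a16(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Compute side: dequantise just before the matmul, which runs in full precision."""
    w = q.astype(np.float32) * scale   # cheap elementwise dequantisation
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal((4, 16)).astype(np.float32)
q, s = pack_int4(w)
mean_err = np.mean(np.abs(x @ w.T - matmul_w4a16(x, q, s)))
print(mean_err)  # small: the matmul and accumulation never see INT4 values
```

Real inference kernels fuse the dequantisation into the matmul so the low-bit weights never materialise in full precision in memory; the structure, though, is exactly this.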

Quantization's impact on model quality

| Bit width | Quality vs FP16 (large models, 70B+) | Quality vs FP16 (small models, 3–7B) | Recommended? |
|---|---|---|---|
| FP16 / BF16 | 100% (baseline) | 100% (baseline) | ✅ If VRAM allows |
| INT8 (GPTQ/AWQ) | ~99.5% — essentially lossless | ~98% — minimal degradation | ✅ Yes — free performance at half the memory |
| INT4 (GPTQ/AWQ) | ~96–98% — barely noticeable | ~92–95% — slightly noticeable on hard tasks | ✅ Yes — standard for local inference |
| INT4 (GGUF Q4_K_M) | ~96% — comparable to AWQ | ~91–94% | ✅ Yes — best for CPU/Apple Silicon |
| INT3 / INT2 | ~85–90% — noticeable regression | ~75–85% — significant degradation | ⚠️ Only when size is critical |

The 70B INT4 vs 13B FP16 principle

A 70B parameter model at INT4 (~35GB) fits on the same GPU as a 13B model at FP16 (~26GB). The 70B INT4 almost always outperforms the 13B FP16 — larger models tolerate lower precision much better than smaller ones. This makes quantisation the standard approach for maximising capability-per-dollar in deployment.
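The arithmetic behind the comparison is one line: billions of parameters times bytes per parameter gives gigabytes of weight memory (ignoring KV cache and runtime overhead):

```python
def model_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB: 1B params at 1 byte each is 1 GB (decimal units)."""
    return params_billions * bytes_per_param

print(model_gb(70, 0.5))  # 35.0 GB: 70B at INT4 (0.5 bytes/param)
print(model_gb(13, 2.0))  # 26.0 GB: 13B at FP16 (2 bytes/param), same hardware class
```

Both fit a single 40–48GB GPU, and the 70B model brings far more capability per byte.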

Practice questions

  1. What is the difference between post-training quantisation (PTQ) and quantisation-aware training (QAT)? (Answer: PTQ: quantise a trained model's weights without any retraining — fast (minutes), no training data required. Quality: INT8 PTQ achieves ~1% accuracy drop; INT4 PTQ achieves ~3-5% drop. QAT: simulate quantisation during training (fake quantise and dequantise weights in the forward pass, train with full precision gradients). The model adapts to quantisation noise. Quality: QAT matches full precision accuracy at INT8; INT4 QAT achieves ~1% drop. Required for very low bit (INT4 and below) without significant accuracy loss.)
  2. What is GPTQ and why is it important for LLM quantisation? (Answer: GPTQ (Frantar et al. 2022): layer-by-layer post-training quantisation using second-order weight updates (Hessian-based). For each layer, it iteratively quantises weights and compensates for quantisation error by updating the remaining unquantised weights in that layer — using the inverse Hessian of the loss. Achieves INT4 quantisation of 175B GPT-3 in 4 GPU-hours with <1% perplexity increase. Critical for making large LLMs practically deployable: GPT-J, LLaMA, and Falcon all have GPTQ-quantised versions serving millions of users.)
  3. What is AWQ (Activation-Aware Weight Quantisation) and how does it improve on GPTQ? (Answer: AWQ (Lin et al. 2023): identifies salient weights (those corresponding to large activation channels) and protects them by scaling before quantisation. Observation: 1% of weights are crucial — they correspond to channels with very large input activations. GPTQ quantises all weights equally. AWQ scales crucial weight channels by an activation-dependent factor before quantisation, effectively giving them higher precision. Result: AWQ achieves better perplexity than GPTQ at same bit width, especially at very low precision (INT3). AWQ is the preferred quantisation method in llama.cpp and many deployment frameworks.)
  4. What is the difference between weight-only quantisation and weight-activation quantisation? (Answer: Weight-only (W4A16): quantise model weights to INT4/INT8; keep activations and computations in FP16. Reduces model size (memory bandwidth) but not compute FLOPs. Dequantise weights to FP16 before matrix multiply. Memory-bandwidth-bound operations benefit; compute-bound operations do not. Weight-activation (W8A8): quantise both weights AND activations to INT8. Enables INT8 matrix multiply (much faster on A100/H100 Tensor Cores). Requires careful per-token activation quantisation — LLM.int8() and SmoothQuant handle this.)
  5. A 7B model at FP16 requires 14GB VRAM. What quantisation enables running it on a 6GB GPU? (Answer: INT4 weight-only quantisation (GPTQ/AWQ/NF4): each weight is 4 bits instead of 16 — a 4× size reduction. 14GB / 4 = 3.5GB of weights + ~1.5GB for KV cache and activations ≈ 5GB total, which fits in a 6GB GPU. The bitsandbytes library (used by QLoRA) offers NF4 (NormalFloat 4-bit), which better preserves values near zero, where most weights cluster after training. Practical routes: a GGUF Q4_K_M build in llama.cpp, or Hugging Face's BitsAndBytesConfig(load_in_4bit=True).)
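The NF4 intuition from the last answer — codebook levels matched to the weight distribution beat uniformly spaced levels — can be checked empirically. A hedged sketch using data-driven quantiles (real NF4 uses a fixed 16-level codebook derived from the standard normal distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10000).astype(np.float32)  # trained weights are roughly Gaussian

def quantize_to_levels(w: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Snap each weight to its nearest codebook level (16 levels = 4 bits)."""
    idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
    return levels[idx]

# Uniform INT4-style levels vs quantile-based (NF4-like) levels
uniform_levels = np.linspace(w.min(), w.max(), 16)
quantile_levels = np.quantile(w, np.linspace(0.01, 0.99, 16))

err_uniform = np.mean((w - quantize_to_levels(w, uniform_levels)) ** 2)
err_quantile = np.mean((w - quantize_to_levels(w, quantile_levels)) ** 2)
print(err_uniform, err_quantile)  # quantile levels fit Gaussian weights better
```

Uniform levels waste codes on the sparsely populated tails; quantile levels concentrate codes where the weight mass actually is, which is exactly why NF4 outperforms plain INT4 at the same bit width.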
