Model Training & Optimization

Model Pruning — Removing Redundant Weights for Faster Inference

Deleting the least important connections in a neural network without hurting performance.


Definition

Pruning removes weights, neurons, or entire layers from a trained neural network based on importance scores — typically weight magnitude (absolute value). The goal: a smaller model, faster inference, and lower memory use, with minimal accuracy loss. Unstructured pruning zeroes out individual weights (sparse but hard to accelerate). Structured pruning removes entire neurons, attention heads, or layers (dense and hardware-friendly). Magnitude pruning, gradient-based pruning, and the lottery ticket hypothesis are the main approaches. Combined with quantization and distillation, pruning is part of the model compression trinity.

Real-life analogy: The overgrown garden

A trained neural network is like an overgrown garden where many plants (weights) contribute little to the overall beauty (task performance). Pruning is the gardener who removes dead and insignificant plants while keeping the valuable ones. After pruning, the garden requires less water (memory) and maintenance (compute) while looking nearly as good. The key skill is identifying which plants are truly expendable.

Pruning types and methods

Weight magnitude pruning with PyTorch

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)

# Train model first...
# Then prune

# ── Method 1: Unstructured magnitude pruning (zeroes individual weights) ──
# Prune 40% of weights in each Linear layer with smallest absolute magnitude
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.4)
        # Creates a weight_mask: 0 for pruned, 1 for kept

# Check sparsity. Note: while masks are attached, named_parameters() exposes
# the unmasked values as 'weight_orig' — read module.weight (the masked tensor)
total, zeros = 0, 0
for module in model.modules():
    if isinstance(module, nn.Linear):
        total += module.weight.numel()
        zeros += (module.weight == 0).sum().item()
print(f"Global sparsity: {100 * zeros / total:.1f}%")   # ~40%

# Make pruning permanent (removes masks, keeps zero weights)
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, 'weight')

# ── Method 2: Global magnitude pruning (prune globally across all layers) ──
parameters_to_prune = [
    (module, 'weight') for _, module in model.named_modules()
    if isinstance(module, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,    # Remove 50% of weights globally (not per-layer)
)

# ── Method 3: Structured pruning (removes entire neurons — hardware friendly) ──
# Prune 30% of channels in each layer based on L2 norm of weight rows
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.ln_structured(module, name='weight', amount=0.3, n=2, dim=0)
        # dim=0: prune output neurons (rows). Each pruned row = dead neuron.

# ── Method 4: Iterative magnitude pruning (gradual removal + fine-tuning) ──
def iterative_prune(model, target_sparsity, n_iterations, fine_tune_steps):
    """Gradually prune model, fine-tuning after each round."""
    amount_per_round = 1 - (1 - target_sparsity) ** (1 / n_iterations)
    for iteration in range(n_iterations):
        print(f"Pruning round {iteration+1}/{n_iterations}")
        for module in model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, 'weight', amount=amount_per_round)
        # Fine-tune for fine_tune_steps steps here...
        print(f"  Fine-tuning for {fine_tune_steps} steps...")
    return model

# ── LLM-specific: Attention head pruning ──
# Many attention heads are redundant after training.
# Importance is approximated by loss sensitivity: a first-order Taylor score
# |weight × gradient|, accumulated per head over a calibration set
def get_head_importance(model, eval_loader, loss_fn):
    """Approximate per-head importance from gradients on a calibration set."""
    model.zero_grad()
    for inputs, targets in eval_loader:
        loss_fn(model(inputs), targets).backward()
    head_importance = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.MultiheadAttention):
            w = module.out_proj.weight            # (embed_dim, num_heads * head_dim)
            head_dim = w.shape[1] // module.num_heads
            score = (w * w.grad).abs().view(w.shape[0], module.num_heads, head_dim)
            head_importance[name] = score.sum(dim=(0, 2)).detach()   # per head
    return head_importance

Pruning methods comparison

| Method | What is removed | Hardware acceleration | Accuracy loss | Best for |
|---|---|---|---|---|
| Unstructured (magnitude) | Individual weights (irregular pattern) | Low (needs sparse hardware) | Low at 50-60% sparsity | Research; sparse GPUs (NVIDIA Ampere+) |
| Structured (neuron) | Entire neurons / filters | High (computation stays dense) | Medium | Production deployment on standard hardware |
| Structured (head) | Entire attention heads | High | Low at 30-40% of heads | Transformer inference optimisation |
| Layer pruning | Entire transformer layers | Very high | Moderate | Aggressive compression (LLM depth reduction) |
| Lottery ticket | All except the winning subnetwork | Varies | Very low (with matched masks) | Research: finds sparse trainable subnetworks |
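The hardware advantage of structured pruning comes from physically shrinking the weight matrices rather than leaving zeros in place. A minimal sketch (the two-layer MLP setup and the helper name are illustrative, not from any library) of removing zeroed output neurons from one Linear layer together with the matching input columns of the next:

```python
import torch
import torch.nn as nn

def remove_dead_neurons(fc1: nn.Linear, fc2: nn.Linear):
    """Drop output neurons of fc1 whose weight rows are all zero, plus the
    matching input columns of fc2. Assumes the pruned neurons' biases are
    also zero, so the removed activations contributed nothing downstream."""
    keep = fc1.weight.abs().sum(dim=1) != 0          # surviving output neurons
    n_keep = int(keep.sum())
    new_fc1 = nn.Linear(fc1.in_features, n_keep)
    new_fc1.weight.data = fc1.weight.data[keep].clone()
    new_fc1.bias.data = fc1.bias.data[keep].clone()
    new_fc2 = nn.Linear(n_keep, fc2.out_features)
    new_fc2.weight.data = fc2.weight.data[:, keep].clone()
    new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2
```

After `prune.ln_structured(..., dim=0)` followed by `prune.remove`, the zeroed rows can be dropped this way; the result is a smaller dense model that standard GEMM kernels execute faster, with no sparse support needed.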

SparseGPT and Wanda — LLM-scale pruning

Standard magnitude pruning does not work well for LLMs (billions of parameters, expensive to fine-tune after pruning). SparseGPT (2023) uses second-order Hessian information to prune LLMs to 50-60% sparsity in a single pass, without any fine-tuning. Wanda (2023) prunes based on weight magnitude × input activation magnitude, removing the weights whose product is smallest. Both can prune a 70B LLM in a few GPU-hours.
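Wanda's scoring rule can be sketched in a few lines. This is a simplified illustration of the idea, not the reference implementation; the function name and tensor shapes are assumptions:

```python
import torch

def wanda_prune_layer(weight: torch.Tensor, activations: torch.Tensor,
                      sparsity: float = 0.5) -> torch.Tensor:
    """Score each weight by |W_ij| * ||X_j||_2 (L2 norm of its input channel
    over a calibration batch) and zero the lowest-scoring fraction per row.
    weight: (out_features, in_features); activations: (n_samples, in_features)."""
    act_norm = activations.norm(p=2, dim=0)          # (in_features,)
    score = weight.abs() * act_norm                  # broadcasts across rows
    k = int(weight.shape[1] * sparsity)              # weights to drop per row
    _, drop_idx = torch.topk(score, k, dim=1, largest=False)
    mask = torch.ones_like(weight)
    mask.scatter_(1, drop_idx, 0.0)
    return weight * mask
```

Comparing scores within each output row, rather than globally, keeps every output neuron partially connected; the activation-norm term is what lets a small weight survive if it multiplies a consistently large input.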

Practice questions

  1. A model has 80% unstructured sparsity after pruning. Why might it not be 5× faster on standard GPUs? (Answer: Standard GPU CUDA cores execute dense matrix multiplications — zeroed weights still participate in computation (multiplied by 0, result discarded). Speed gains require sparse kernels or hardware (NVIDIA Ampere structured sparsity with 2:4 pattern). Unstructured sparsity mainly saves memory, not compute on standard hardware.)
  2. Why is structured pruning preferred for production deployment over unstructured? (Answer: Structured pruning removes entire neurons/channels/heads — the resulting model is dense with a smaller shape. Standard BLAS/cuBLAS matrix multiplication can be used directly. Unstructured pruning creates irregular sparse patterns requiring specialised sparse kernels.)
  3. What is the Lottery Ticket Hypothesis? (Answer: Frankle & Carbin (2019): a large neural network contains a small subnetwork ("winning ticket") that can be trained in isolation to match the full network's accuracy. The winning ticket is identified by training the full network, pruning low-magnitude weights, and resetting remaining weights to their initial values. Implies large models are over-parameterised for training but not for final inference.)
  4. After pruning a model to 50% sparsity, accuracy dropped significantly. What is the recommended fix? (Answer: Fine-tune (re-train) the pruned model on the original task. Pruning removes weights that were useful in context of other weights; fine-tuning allows remaining weights to redistribute responsibilities. Iterative pruning (prune → fine-tune → prune → fine-tune) minimises accuracy loss better than one-shot pruning.)
  5. How does SparseGPT prune a 70B LLM without fine-tuning? (Answer: SparseGPT uses second-order information (the Hessian of the loss w.r.t. weights) to identify weights whose removal can be exactly compensated by updating the remaining weights in the same layer. This one-shot compensation eliminates the need for gradient-based fine-tuning while maintaining accuracy.)
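The lottery-ticket procedure from question 3 can be sketched with PyTorch's pruning utilities. This is an illustrative one-shot round (the `train_fn` callback and function name are assumptions about the surrounding training code), not Frankle & Carbin's full iterative recipe:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def find_winning_ticket(model: nn.Module, train_fn, amount: float = 0.8):
    """One-shot lottery-ticket round: train, prune low-magnitude weights,
    then rewind the survivors to their initial values (masks stay attached)."""
    init_state = copy.deepcopy(model.state_dict())   # 1. save the initialisation
    train_fn(model)                                  # 2. train to convergence
    for module in model.modules():                   # 3. magnitude-prune
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, 'weight', amount=amount)
    for name, module in model.named_modules():       # 4. rewind survivors
        if isinstance(module, nn.Linear):
            # module.weight is recomputed as weight_orig * mask on the next forward
            module.weight_orig.data.copy_(init_state[f'{name}.weight'])
    return model
```

The rewound sparse model can then be retrained from (near-)initialisation; if the hypothesis holds for the task, it matches the dense network's accuracy despite the mask.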

On LumiChats

LLM inference optimisation combines pruning, quantization, and compilation. When LumiChats serves responses, the underlying models use structured pruning (removing redundant attention heads), quantization (BF16/INT8), and kernel fusion (FlashAttention) to achieve fast, cost-effective inference.

