Model Training & Optimization

Model Pruning — Removing Redundant Weights for Faster Inference

Deleting the least important connections in a neural network without hurting performance.


Definition

Pruning removes weights, neurons, or entire layers from a trained neural network based on importance scores — typically weight magnitude (absolute value). The goal: a smaller model, faster inference, and lower memory use, with minimal accuracy loss. Unstructured pruning zeroes out individual weights (sparse but hard to accelerate). Structured pruning removes entire neurons, attention heads, or layers (dense and hardware-friendly). Magnitude pruning, gradient-based pruning, and the lottery ticket hypothesis are the main approaches. Combined with quantization and distillation, pruning is part of the model compression trinity.

Real-life analogy: The overgrown garden

A trained neural network is like an overgrown garden where many plants (weights) contribute little to the overall beauty (task performance). Pruning is the gardener who removes dead and insignificant plants while keeping the valuable ones. After pruning, the garden requires less water (memory) and maintenance (compute) while looking nearly as good. The key skill is identifying which plants are truly expendable.

Pruning types and methods

Weight magnitude pruning with PyTorch

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)

# Train model first...
# Then prune

# ── Method 1: Unstructured magnitude pruning (zeroes individual weights) ──
# Prune 40% of weights in each Linear layer with smallest absolute magnitude
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.4)
        # Creates a weight_mask: 0 for pruned, 1 for kept

# Check sparsity. Note: while masks are attached, named_parameters() exposes
# the unmasked values as 'weight_orig' — read module.weight (the masked tensor)
total, zeros = 0, 0
for module in model.modules():
    if isinstance(module, nn.Linear):
        total += module.weight.numel()
        zeros += (module.weight == 0).sum().item()
print(f"Global sparsity: {100 * zeros / total:.1f}%")   # ~40%

# Make pruning permanent (removes masks, keeps zero weights)
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, 'weight')

# ── Method 2: Global magnitude pruning (prune globally across all layers) ──
parameters_to_prune = [
    (module, 'weight') for _, module in model.named_modules()
    if isinstance(module, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,    # Remove 50% of weights globally (not per-layer)
)

# ── Method 3: Structured pruning (removes entire neurons — hardware friendly) ──
# Prune 30% of channels in each layer based on L2 norm of weight rows
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        prune.ln_structured(module, name='weight', amount=0.3, n=2, dim=0)
        # dim=0: prune output neurons (rows). Each pruned row = dead neuron.

# ── Method 4: Iterative magnitude pruning (gradual removal + fine-tuning) ──
def iterative_prune(model, target_sparsity, n_iterations, fine_tune_steps):
    """Gradually prune model, fine-tuning after each round."""
    amount_per_round = 1 - (1 - target_sparsity) ** (1 / n_iterations)
    for iteration in range(n_iterations):
        print(f"Pruning round {iteration+1}/{n_iterations}")
        for module in model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, 'weight', amount=amount_per_round)
        # Fine-tune for fine_tune_steps steps here...
        print(f"  Fine-tuning for {fine_tune_steps} steps...")
    return model

# ── LLM-specific: Attention head pruning ──
# Many attention heads are redundant after training.
# Importance is approximated by loss sensitivity: a first-order Taylor score
# |weight × gradient|, accumulated per head over a calibration set
def get_head_importance(model, eval_loader, loss_fn):
    """Approximate per-head importance from gradients on a calibration set."""
    model.zero_grad()
    for inputs, targets in eval_loader:
        loss_fn(model(inputs), targets).backward()
    head_importance = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.MultiheadAttention):
            w = module.out_proj.weight            # (embed_dim, num_heads * head_dim)
            head_dim = w.shape[1] // module.num_heads
            score = (w * w.grad).abs().view(w.shape[0], module.num_heads, head_dim)
            head_importance[name] = score.sum(dim=(0, 2)).detach()   # per head
    return head_importance

Pruning methods comparison

| Method | What is removed | Hardware acceleration | Accuracy loss | Best for |
|---|---|---|---|---|
| Unstructured (magnitude) | Individual weights (irregular pattern) | Low (needs sparse hardware) | Low at 50-60% sparsity | Research; sparse GPUs (NVIDIA Ampere+) |
| Structured (neuron) | Entire neurons / filters | High (computation stays dense) | Medium | Production deployment on standard hardware |
| Structured (head) | Entire attention heads | High | Low at 30-40% of heads | Transformer inference optimisation |
| Layer pruning | Entire transformer layers | Very high | Moderate | Aggressive compression (LLM depth reduction) |
| Lottery ticket | All except the winning subnetwork | Varies | Very low (with matched masks) | Research: finds sparse trainable subnetworks |
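The hardware advantage of structured pruning comes from physically shrinking the weight matrices rather than leaving zeros in place. A minimal sketch (the two-layer MLP setup and the helper name are illustrative, not from any library) of removing zeroed output neurons from one Linear layer together with the matching input columns of the next:

```python
import torch
import torch.nn as nn

def remove_dead_neurons(fc1: nn.Linear, fc2: nn.Linear):
    """Drop output neurons of fc1 whose weight rows are all zero, plus the
    matching input columns of fc2. Assumes the pruned neurons' biases are
    also zero, so the removed activations contributed nothing downstream."""
    keep = fc1.weight.abs().sum(dim=1) != 0          # surviving output neurons
    n_keep = int(keep.sum())
    new_fc1 = nn.Linear(fc1.in_features, n_keep)
    new_fc1.weight.data = fc1.weight.data[keep].clone()
    new_fc1.bias.data = fc1.bias.data[keep].clone()
    new_fc2 = nn.Linear(n_keep, fc2.out_features)
    new_fc2.weight.data = fc2.weight.data[:, keep].clone()
    new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2
```

After `prune.ln_structured(..., dim=0)` followed by `prune.remove`, the zeroed rows can be dropped this way; the result is a smaller dense model that standard GEMM kernels execute faster, with no sparse support needed.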

SparseGPT and Wanda — LLM-scale pruning

Standard magnitude pruning does not work well for LLMs (billions of parameters, expensive to fine-tune after pruning). SparseGPT (2023) uses second-order Hessian information to prune LLMs to 50-60% sparsity in a single pass, without any fine-tuning. Wanda (2023) prunes based on weight magnitude × input activation magnitude, removing the weights whose product is smallest. Both can prune a 70B LLM in a few GPU-hours.
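Wanda's scoring rule can be sketched in a few lines. This is a simplified illustration of the idea, not the reference implementation; the function name and tensor shapes are assumptions:

```python
import torch

def wanda_prune_layer(weight: torch.Tensor, activations: torch.Tensor,
                      sparsity: float = 0.5) -> torch.Tensor:
    """Score each weight by |W_ij| * ||X_j||_2 (L2 norm of its input channel
    over a calibration batch) and zero the lowest-scoring fraction per row.
    weight: (out_features, in_features); activations: (n_samples, in_features)."""
    act_norm = activations.norm(p=2, dim=0)          # (in_features,)
    score = weight.abs() * act_norm                  # broadcasts across rows
    k = int(weight.shape[1] * sparsity)              # weights to drop per row
    _, drop_idx = torch.topk(score, k, dim=1, largest=False)
    mask = torch.ones_like(weight)
    mask.scatter_(1, drop_idx, 0.0)
    return weight * mask
```

Comparing scores within each output row, rather than globally, keeps every output neuron partially connected; the activation-norm term is what lets a small weight survive if it multiplies a consistently large input.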

Practice questions

  1. A model has 80% unstructured sparsity after pruning. Why might it not be 5× faster on standard GPUs? (Answer: Standard GPU CUDA cores execute dense matrix multiplications — zeroed weights still participate in computation (multiplied by 0, result discarded). Speed gains require sparse kernels or hardware (NVIDIA Ampere structured sparsity with 2:4 pattern). Unstructured sparsity mainly saves memory, not compute on standard hardware.)
  2. Why is structured pruning preferred for production deployment over unstructured? (Answer: Structured pruning removes entire neurons/channels/heads — the resulting model is dense with a smaller shape. Standard BLAS/cuBLAS matrix multiplication can be used directly. Unstructured pruning creates irregular sparse patterns requiring specialised sparse kernels.)
  3. What is the Lottery Ticket Hypothesis? (Answer: Frankle & Carbin (2019): a large neural network contains a small subnetwork ("winning ticket") that can be trained in isolation to match the full network's accuracy. The winning ticket is identified by training the full network, pruning low-magnitude weights, and resetting remaining weights to their initial values. Implies large models are over-parameterised for training but not for final inference.)
  4. After pruning a model to 50% sparsity, accuracy dropped significantly. What is the recommended fix? (Answer: Fine-tune (re-train) the pruned model on the original task. Pruning removes weights that were useful in context of other weights; fine-tuning allows remaining weights to redistribute responsibilities. Iterative pruning (prune → fine-tune → prune → fine-tune) minimises accuracy loss better than one-shot pruning.)
  5. How does SparseGPT prune a 70B LLM without fine-tuning? (Answer: SparseGPT uses second-order information (the Hessian of the loss w.r.t. weights) to identify weights whose removal can be exactly compensated by updating the remaining weights in the same layer. This one-shot compensation eliminates the need for gradient-based fine-tuning while maintaining accuracy.)
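The lottery-ticket procedure from question 3 can be sketched with PyTorch's pruning utilities. This is an illustrative one-shot round (the `train_fn` callback and function name are assumptions about the surrounding training code), not Frankle & Carbin's full iterative recipe:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def find_winning_ticket(model: nn.Module, train_fn, amount: float = 0.8):
    """One-shot lottery-ticket round: train, prune low-magnitude weights,
    then rewind the survivors to their initial values (masks stay attached)."""
    init_state = copy.deepcopy(model.state_dict())   # 1. save the initialisation
    train_fn(model)                                  # 2. train to convergence
    for module in model.modules():                   # 3. magnitude-prune
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, 'weight', amount=amount)
    for name, module in model.named_modules():       # 4. rewind survivors
        if isinstance(module, nn.Linear):
            # module.weight is recomputed as weight_orig * mask on the next forward
            module.weight_orig.data.copy_(init_state[f'{name}.weight'])
    return model
```

The rewound sparse model can then be retrained from (near-)initialisation; if the hypothesis holds for the task, it matches the dense network's accuracy despite the mask.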

On LumiChats

LLM inference optimisation combines pruning, quantization, and compilation. When LumiChats serves responses, the underlying models use structured pruning (removing redundant attention heads), quantization (BF16/INT8), and kernel fusion (FlashAttention) to achieve fast, cost-effective inference.

