Deep Learning & Neural Networks

Convolutional Neural Network (CNN)

How AI sees — the architecture behind computer vision.


Definition

A Convolutional Neural Network (CNN) is a deep learning architecture designed for grid-structured data like images. CNNs use convolutional layers that apply learnable filters across the input, exploiting spatial locality and translation invariance — enabling them to efficiently detect features (edges, textures, shapes, objects) regardless of where they appear in the image.

The convolution operation

A convolutional layer applies a small learned filter (kernel) — typically 3×3 or 5×5 pixels — by sliding it across the entire input image. At each position it computes a dot product between filter weights and the local patch:

(I ∗ K)(i, j) = Σ_{m} Σ_{n} I(i + m, j + n) · K(m, n)

2D convolution: the filter K slides over image I. (Deep learning frameworks actually compute cross-correlation, but the term "convolution" is standard.) A 3×3 filter has 9 learnable parameters but applies to every spatial position — far fewer than a fully-connected layer.

A layer with 64 filters produces 64 feature maps — each detecting a different learned pattern (edges, curves, textures). The critical insight is parameter sharing: the same 9 weights are reused at every position, giving CNNs their extraordinary parameter efficiency.
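The parameter-efficiency claim is easy to check. A quick sketch in PyTorch (the layer sizes are illustrative, not from any specific architecture):

```python
import torch
import torch.nn as nn

# A conv layer with 64 filters of size 3x3 over a 1-channel input:
conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())
# 64 filters x (1 x 3 x 3 weights) + 64 biases = 640 parameters
print(conv_params)  # 640

# Compare: a fully-connected layer mapping a 224x224 image to just 64 units
fc = nn.Linear(224 * 224, 64)
fc_params = sum(p.numel() for p in fc.parameters())
print(fc_params)  # 224*224*64 + 64 = 3,211,328

# The conv layer produces 64 feature maps, one per filter:
x = torch.randn(1, 1, 224, 224)  # (batch, channels, height, width)
print(conv(x).shape)             # torch.Size([1, 64, 224, 224])
```

The same 640 parameters are reused at all 224×224 spatial positions, which is the parameter sharing described above.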

Why convolution works for images

Images have two structural properties CNNs exploit: (1) Locality — nearby pixels are more related than distant ones, so a 3×3 filter captures local structure efficiently. (2) Translation invariance — a cat is a cat whether in the top-left or bottom-right of the image. Shared filter weights encode the same detector everywhere.
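Strictly speaking, convolution is translation *equivariant*: shifting the input shifts the output by the same amount (pooling then makes the representation approximately invariant). This can be demonstrated directly; a small sketch with a random filter and a toy image:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel = torch.randn(1, 1, 3, 3)   # one shared 3x3 filter

img = torch.zeros(1, 1, 16, 16)
img[0, 0, 2:5, 2:5] = 1.0          # a small square "feature" at the top-left

# The same feature, moved 6 pixels down and right
shifted = torch.roll(img, shifts=(6, 6), dims=(2, 3))

out1 = F.conv2d(img, kernel, padding=1)
out2 = F.conv2d(shifted, kernel, padding=1)

# Shifting the input shifts the response by the same amount (equivariance):
print(torch.allclose(torch.roll(out1, shifts=(6, 6), dims=(2, 3)), out2))  # True
```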

Pooling and receptive fields

Pooling layers reduce spatial dimensions, making representations smaller and approximately translation-invariant:

| Operation | Formula | Effect | Use case |
| --- | --- | --- | --- |
| Max pooling 2×2 | max(x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}) | Keeps strongest activation, discards exact position | Standard in CNNs — preserves sharp features |
| Average pooling | mean of region | Smoother, dilutes strong signals | Global average pooling before classifier head |
| Strided convolution | Conv with stride=2 | Learns to downsample (preferred in modern CNNs) | ResNet, EfficientNet — replaces explicit pooling |

The receptive field is the region of the original input that influences a neuron. A first-layer 3×3 conv neuron sees only 3×3 pixels; after several conv + pool stages, a single neuron can see on the order of 100×100. Stacking conv + pool layers progressively expands the receptive field — early layers detect local edges, later layers detect global objects.
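This growth follows a standard recursion: each layer adds (kernel_size − 1) times the cumulative stride so far. A minimal sketch (the layer configurations are illustrative):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples, input to output.
    Each layer grows the field by (kernel_size - 1) times the cumulative
    stride ("jump") of all layers before it.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Five 3x3 convs with stride 1: rf = 2*5 + 1 = 11
print(receptive_field([(3, 1)] * 5))          # 11

# Alternating 3x3 conv + 2x2 max pool (stride 2), three times:
# downsampling makes the field grow much faster
print(receptive_field([(3, 1), (2, 2)] * 3))  # 22
```

Note how the 2× downsampling layers double the "jump", so later convolutions expand the field far more per layer than early ones.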

Landmark CNN architectures

| Architecture | Year | Key innovation | Depth | ImageNet top-5 error |
| --- | --- | --- | --- | --- |
| AlexNet | 2012 | ReLU, dropout, GPU training | 8 layers | 15.3% |
| VGGNet | 2014 | Depth with uniform 3×3 filters | 16–19 layers | 7.3% |
| GoogLeNet / Inception | 2014 | Inception modules — parallel multi-scale filters | 22 layers | 6.7% |
| ResNet | 2015 | Residual (skip) connections — solved vanishing gradients | 50–152 layers | 3.57% |
| EfficientNet | 2019 | Compound scaling of width, depth, resolution | B0–B7 | 2.9% |
| ConvNeXt | 2022 | Modernized ResNet with Transformer design choices | ~200M params | Competitive with ViT |

ResNet skip connection insight

Rather than learning H(x), a residual block learns F(x) = H(x) − x, then outputs x + F(x). If the optimal transformation is close to identity, F just needs to be near zero — much easier to optimize. This simple change enabled training 152-layer networks that previously couldn't converge.
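The x + F(x) structure can be sketched as a minimal PyTorch module. This sketch assumes the channel count is unchanged, so no 1×1 projection shortcut is needed (real ResNet blocks also handle stride and channel changes):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(x + F(x))."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 convs with batch norm, as in basic ResNet blocks
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # skip connection: x + F(x)

block = ResidualBlock(64)
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)  # torch.Size([2, 64, 32, 32]) -- shape preserved
```

If the optimal mapping is near identity, the conv weights in `body` only need to drive F(x) toward zero, which is exactly the optimization shortcut described above.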

Transfer learning with CNNs

The most practical way to use CNNs — and why you almost never need to train from scratch:

Fine-tuning ResNet-50 on a custom image classification task

import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader

# Load ResNet-50 pretrained on ImageNet (1.2M images, 1000 classes)
model = models.resnet50(weights="IMAGENET1K_V2")

# Strategy 1: Freeze all layers, only train the final head
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head for your number of classes
num_classes = 5   # e.g., flower species
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only model.fc.parameters() have requires_grad=True

# Strategy 2: also unfreeze the last residual stage (layer4) for deeper fine-tuning
for param in model.layer4.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-4},   # low LR for pretrained
    {'params': model.fc.parameters(),     'lr': 1e-3},   # higher LR for new head
])

Transfer works across surprising domains

ImageNet-pretrained CNNs transfer well to medical imaging, satellite imagery, and industrial inspection — domains with completely different content. Early layers learn universal edge/texture detectors that are useful everywhere. Only fine-tune the last 1–2 blocks unless your domain is very different from natural images.

CNNs vs Vision Transformers (ViTs)

Vision Transformers (ViT, Dosovitskiy et al., 2020) divide the image into 16×16 patches, treat each as a token, then process with self-attention. The comparison:

| Dimension | CNN (ResNet/EfficientNet) | Vision Transformer (ViT/CLIP) |
| --- | --- | --- |
| Inductive bias | Strong: locality + translation invariance | Weak: must learn spatial structure from data |
| Data hunger | Works well with 10K–100K images | Needs 1M+ images (or large-scale pretraining) |
| Compute | O(HW) — linear in image pixels | O((HW/p²)²) — quadratic in patch count |
| Scale ceiling | Saturates around 1B params | Keeps improving with more data + compute |
| Best use (2025) | Edge/mobile, small datasets, real-time | Foundation models (CLIP, SAM, DINOv2), large-scale tasks |

2025 practical guidance

For new projects: use a pretrained ViT (DINOv2, CLIP) if you have GPU budget and large data. Use EfficientNet or ConvNeXt if you need lower latency, mobile deployment, or have limited data. Hybrid models (ConvFormer, CvT) combine both — useful middle ground.

Practice questions

  1. What is the receptive field of a CNN layer and why does it matter? (Answer: The receptive field is the region of the input image that contributes to one output neuron's activation. A 3×3 conv layer: each output pixel sees 3×3 input pixels. Two stacked 3×3 layers: each output pixel sees 5×5 input pixels. N layers of 3×3 convolutions: receptive field = (2N+1)×(2N+1). Deep CNNs with small filters achieve large effective receptive fields (global context) while using fewer parameters than single large filters. For image classification, the final feature maps must have receptive fields large enough to span the full input.)
  2. What is the difference between same padding and valid padding in a convolution? (Answer: Valid padding: no padding — output is smaller than input. Input 32×32, kernel 3×3: output is 30×30 (shrinks by kernel_size−1 = 2). Same padding: pad the input so the output has the same spatial dimensions as the input. Input 32×32, kernel 3×3: pad by 1 on each side, output is 32×32. Both Keras/TensorFlow (padding='valid') and PyTorch (padding=0) default to valid padding; pass padding='same' explicitly to preserve size. Use same padding when you want to preserve spatial dimensions through many layers; use valid when spatial reduction is intentional.)
  3. What is the difference between regular convolution and depthwise separable convolution (used in MobileNet)? (Answer: Regular convolution: each filter operates across ALL input channels simultaneously — one filter per output channel, each with in_channels × k × k parameters. Total: out_channels × in_channels × k². Depthwise separable: (1) Depthwise: one filter per input channel (operates in each channel independently). (2) Pointwise: 1×1 convolution combines channels. Total: in_channels × k² + in_channels × out_channels. ~8–9× fewer parameters for 3×3 conv. MobileNet achieves competitive accuracy at 10× fewer parameters using this factorisation.)
  4. What is feature map visualisation and what does it reveal about CNN learning? (Answer: Visualising activations of filters at different layers shows the hierarchy of learned representations: Layer 1: simple edges and colours (oriented Gabor-like filters). Layer 2: textures and simple shapes (combinations of edges). Layer 3–4: object parts (wheels, eyes, windows). Final layers: complete objects and scenes. This hierarchical feature learning (Zeiler & Fergus 2013) confirmed that CNNs learn semantically meaningful features automatically — without hand-crafting, as required by pre-deep-learning vision systems.)
  5. What is the difference between stride and pooling for spatial downsampling in CNNs? (Answer: Pooling (max/average): take the max or average over a spatial window, reducing spatial size by the pooling factor. Fixed operation — no learned parameters. Max pooling: keeps the strongest activation (feature present or absent). Strided convolution: move the conv filter by stride>1, producing smaller output. Learned downsampling — the network learns how to combine spatially adjacent information. Modern CNNs (ResNet, EfficientNet) prefer strided convolutions for downsampling: they're learnable and often outperform fixed pooling. Pooling is still used in some architectures, notably as global average pooling before the classifier head.)
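The parameter arithmetic in question 3 can be checked directly in PyTorch (the channel counts here are illustrative):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

cin, cout, k = 64, 128, 3

# Regular 3x3 convolution: cout filters, each spanning all cin channels
regular = nn.Conv2d(cin, cout, k, padding=1, bias=False)

# Depthwise separable = depthwise 3x3 (groups=cin, one filter per channel)
# followed by a pointwise 1x1 conv that mixes channels
separable = nn.Sequential(
    nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False),  # depthwise
    nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
)

print(n_params(regular))    # 128 * 64 * 9 = 73,728
print(n_params(separable))  # 64 * 9 + 64 * 128 = 8,768
print(n_params(regular) / n_params(separable))  # ~8.4x fewer parameters
```

The `groups=cin` argument is what makes the first conv depthwise: each input channel gets its own 3×3 filter instead of filters that span all channels.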

