Computer vision (CV) is the field of AI that enables computers to interpret and understand visual information from the world — images, video, depth sensors, and medical scans. Modern computer vision is almost entirely powered by deep learning, particularly convolutional neural networks (CNNs) and, increasingly, Vision Transformers (ViT). Applications range from photo search and autonomous vehicles to medical diagnostics and factory quality control.
Core computer vision tasks
| Task | What the model outputs | Example application | Key models |
|---|---|---|---|
| Image classification | A label for the whole image | "This is a cat" — photo apps, content moderation | ResNet, EfficientNet, ViT |
| Object detection | Bounding boxes + labels for each object in the image | Autonomous vehicles, surveillance cameras | YOLO, Faster R-CNN, DETR |
| Semantic segmentation | Per-pixel class labels — every pixel assigned a category | Medical image analysis, road scene parsing | U-Net, DeepLab, SegFormer |
| Instance segmentation | Per-pixel labels + distinct IDs for each instance | Separating individual people in a crowd | Mask R-CNN, SAM (Segment Anything) |
| Depth estimation | Distance of each pixel from the camera | Robots, AR, autonomous navigation | DPT, Depth Anything, Marigold |
| Optical flow | Pixel-level motion vectors between frames | Video understanding, action recognition | FlowNet, RAFT |
| Image generation | New images from text or noise | DALL-E, Midjourney, Stable Diffusion | Diffusion models, GANs |
| Visual question answering | Text answer to a question about an image | Multimodal chatbots, accessibility tools | LLaVA, GPT-4V, Gemini |
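The "what the model outputs" column above corresponds to concrete data structures. Here is a toy NumPy sketch of those output shapes — the field names and shapes are illustrative, not any particular library's API:

```python
import numpy as np

H, W, num_classes = 4, 6, 3  # tiny "image" for illustration

# Image classification: one score per class, one label for the whole image.
cls_logits = np.array([0.1, 2.3, -0.5])
cls_label = int(np.argmax(cls_logits))

# Object detection: a box (x1, y1, x2, y2) + label + confidence per object.
detections = [
    {"box": (0, 0, 2, 3), "label": 1, "conf": 0.97},
    {"box": (3, 1, 5, 3), "label": 2, "conf": 0.88},
]

# Semantic segmentation: one class id per pixel -> an (H, W) label map.
sem_mask = np.zeros((H, W), dtype=int)
sem_mask[0:3, 0:2] = 1  # these pixels belong to class 1

# Instance segmentation: one boolean mask per detected instance.
inst_masks = np.zeros((len(detections), H, W), dtype=bool)
inst_masks[0, 0:3, 0:2] = True
inst_masks[1, 1:3, 3:5] = True

print(cls_label, len(detections), sem_mask.shape, inst_masks.shape)
```

Note how the outputs get progressively richer: a single integer, then a list of boxes, then a per-pixel map, then one mask per instance.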
From CNNs to Vision Transformers
Computer vision went through a major architectural shift around 2020-2021. Convolutional neural networks (CNNs) dominated the decade following AlexNet (2012), using spatial filters to extract hierarchical features. Vision Transformers (ViT, 2020) applied the Transformer architecture directly to images by splitting them into patches and treating each patch as a token.
| Architecture | Core idea | Strengths | Weaknesses |
|---|---|---|---|
| CNNs (ResNet, EfficientNet) | Learnable filters slide over image, detecting edges → textures → shapes → objects | Data-efficient; great for small datasets; fast inference; strong inductive biases | Limited long-range context; architecture engineering required |
| Vision Transformers (ViT) | Image split into 16×16 patches; each patch treated as a token; self-attention across all patches | Excellent at long-range dependencies; scales extremely well with data; same architecture as language models | Needs large datasets; computationally heavier than CNNs on small data |
| Hybrid models (ConvNeXt, EfficientViT) | CNN-style locality + Transformer-style global attention | Best of both; competitive at all scales | More complex to design and tune |
| Foundation models (SAM, CLIP, DINOv2) | Trained on billions of images; one model for many downstream tasks via prompting or fine-tuning | Zero-shot generalization; no task-specific training data needed | Very large; expensive to run at edge |
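The ViT row above — "image split into 16×16 patches; each patch treated as a token" — can be sketched in a few lines of NumPy. A random matrix stands in for the learned patch-embedding projection:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping patch tokens (ViT-style)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    n_h, n_w = H // patch, W // patch
    patches = image.reshape(n_h, patch, n_w, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (n_h, n_w, patch, patch, C)
    return patches.reshape(n_h * n_w, patch * patch * C)  # one flat vector per patch

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3)).astype(np.float32)

tokens = patchify(img)  # (196, 768): 14 x 14 patches, each 16 x 16 x 3 = 768 values
W_embed = rng.standard_normal((768, 768)).astype(np.float32)  # random stand-in for the learned projection
embeddings = tokens @ W_embed  # (196, 768) token embeddings fed to the Transformer

print(tokens.shape, embeddings.shape)
```

From here on, the 196 embeddings (plus a class token and position embeddings, omitted in this sketch) are processed exactly like word tokens in a language model.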
CLIP: connecting vision and language
OpenAI's CLIP (2021) jointly trained a vision encoder and a text encoder to match image–text pairs across 400M web-scraped examples. The result: a model that can classify images into any category described in text, with zero task-specific training. CLIP's encoders became building blocks across generative and multimodal AI — its text encoder conditions Stable Diffusion, DALL-E 2 was built around CLIP embeddings, and its image encoder serves as the visual backbone of many multimodal LLMs such as LLaVA.
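The zero-shot mechanism is easy to sketch. The toy below uses random unit vectors as stand-ins for CLIP's trained text embeddings (real CLIP would run the prompts and the image through its trained encoders); the image embedding is deliberately constructed near the "cat" prompt so the cosine-similarity argmax recovers it:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in text embeddings: in real CLIP these come from the trained text encoder.
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bus"]
text_embeds = normalize(rng.standard_normal((3, 512)))

# Pretend the image encoder mapped our query image close to the "cat" prompt.
image_embed = normalize(text_embeds[0] + 0.05 * rng.standard_normal(512))

# Zero-shot classification = argmax of image-text cosine similarity.
similarities = text_embeds @ image_embed  # all vectors are unit-norm
predicted = class_prompts[int(np.argmax(similarities))]
print(predicted)  # prints the matching prompt: a photo of a cat
```

Swapping in new classes means writing new prompts — no retraining, which is exactly what "zero-shot" means here.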
Computer vision in practice: tools and libraries
Object detection with YOLOv8 — detect and label objects in any image in a few lines of code

```python
from ultralytics import YOLO

# Load a pre-trained YOLOv8 model (downloads automatically)
model = YOLO("yolov8n.pt")  # 'n' = nano (fastest); also: s, m, l, x

# Run detection on an image
results = model("https://ultralytics.com/images/bus.jpg")

# Print all detected objects
for r in results:
    for box in r.boxes:
        class_name = r.names[int(box.cls)]
        confidence = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{class_name}: {confidence:.1%} confidence at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")

# Save the annotated image
results[0].save("detected.jpg")
# Detects: person (97%), bus (95%), bench (88%), car (72%)...
```

| Library / Tool | Best for | Skill level |
|---|---|---|
| OpenCV | Traditional CV: filtering, edge detection, video processing | Beginner–Intermediate |
| Ultralytics (YOLOv8/11) | Fast object detection and segmentation — best starting point | Beginner |
| Torchvision | PyTorch-native; pre-trained models, transforms, datasets | Intermediate |
| Hugging Face Transformers | Vision Transformers, CLIP, ViT, SAM, multimodal models | Intermediate |
| Roboflow | Dataset management, annotation, training pipeline in the browser | Beginner |
| SAM (Segment Anything) | Interactive segmentation of any object in any image | Beginner–Intermediate |
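To make "traditional CV: filtering, edge detection" in the OpenCV row concrete, here is a naive NumPy Sobel filter — a sketch of what OpenCV's optimized `cv2.filter2D` / `cv2.Sobel` compute:

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 'valid' 2-D cross-correlation over a grayscale image."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel x-kernel: responds to horizontal intensity changes (vertical edges).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

edges = conv2d(img, sobel_x)
print(edges)  # large values where the dark/bright boundary falls in the window
```

This hand-crafted filter is exactly the kind of fixed feature extractor (like SIFT and HOG) that CNNs replaced with *learned* filters after 2012.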
Practice questions
- What is the difference between image classification, object detection, semantic segmentation, and instance segmentation? (Answer: Classification: one label for the entire image (cat/dog). Object detection: bounding boxes with labels around each object (3 cats, 2 dogs, with coordinates). Semantic segmentation: every pixel assigned a class label (all cat pixels = class 'cat', all background pixels = 'background'). Instance segmentation: every pixel assigned to a specific instance (cat1 pixels, cat2 pixels, background) — combines detection and segmentation. Each is progressively harder and richer in spatial information.)
- What was the key innovation of AlexNet (2012) that sparked the deep learning era in computer vision? (Answer: AlexNet (Krizhevsky, Sutskever, Hinton) won ILSVRC 2012 by a 10.9-point margin over the runner-up (15.3% vs 26.2% top-5 error). Key innovations: (1) Deep architecture (8 layers vs previous 3–4). (2) ReLU activations (faster training than tanh). (3) GPU training (two GTX 580 3GB GPUs, 6 days). (4) Dropout regularisation. (5) Data augmentation. The scale of improvement shocked the community and triggered the shift from hand-crafted features (SIFT, HOG) to learned CNN features.)
- What is the Vision Transformer (ViT) and why did it challenge the dominance of CNNs? (Answer: ViT (Dosovitskiy et al., 2020): apply a standard transformer directly to sequences of image patches (16×16 patches) — no convolutional inductive bias. Trained with supervised classification on ImageNet-21k + JFT-300M. At scale, ViT-Large/16 surpassed state-of-the-art CNNs (EfficientNet) while being more scalable and parallelisable. CNNs have an inductive bias (local connectivity) that helps on small datasets but limits flexibility at scale. ViT can learn any spatial relationship from data when given enough training examples.)
- What is CLIP and how does it enable zero-shot image classification? (Answer: CLIP (Contrastive Language-Image Pretraining, OpenAI 2021): train image and text encoders together on 400M image-text pairs using contrastive loss — matching images and captions attract, mismatched ones repel. Zero-shot classification: encode all class names as text prompts ('a photo of a cat'), encode the query image, and pick the class with the highest image-text cosine similarity. No task-specific training required — just natural language class descriptions. CLIP achieves 76.2% zero-shot top-1 on ImageNet, matching a fully supervised ResNet-50 without using any of its 1.28M labeled training examples. Enables 'describe what you're looking for' style search.)
- What is the current state of foundation models for computer vision and how do they compare to specialised models? (Answer: Vision foundation models (SAM, DINOv2, OpenCLIP, Florence-2) are pretrained on massive image datasets and can be fine-tuned for many downstream tasks with few examples. SAM (Segment Anything Model, Meta 2023): trained on 11M images, 1.1B masks — can segment any object from a single point click or bounding box prompt. DINOv2: self-supervised ViT providing powerful features for depth estimation, semantic segmentation, and classification. Trade-off: foundation models are general-purpose but may underperform narrow specialists trained with domain-specific data.)