
Computer Vision

Teaching machines to see, understand, and act on visual information.


Definition

Computer vision (CV) is the field of AI that enables computers to interpret and understand visual information from the world — images, video, depth sensors, and medical scans. Modern computer vision is almost entirely powered by deep learning, particularly convolutional neural networks (CNNs) and, increasingly, Vision Transformers (ViT). Applications range from photo search and autonomous vehicles to medical diagnostics and factory quality control.

Core computer vision tasks

| Task | What the model outputs | Example application | Key models |
|---|---|---|---|
| Image classification | A label for the whole image | "This is a cat" — photo apps, content moderation | ResNet, EfficientNet, ViT |
| Object detection | Bounding boxes + labels for each object in the image | Autonomous vehicles, surveillance cameras | YOLO, Faster R-CNN, DETR |
| Semantic segmentation | Per-pixel class labels — every pixel assigned a category | Medical image analysis, road scene parsing | U-Net, DeepLab, SegFormer |
| Instance segmentation | Per-pixel labels + distinct IDs for each instance | Separating individual people in a crowd | Mask R-CNN, SAM (Segment Anything) |
| Depth estimation | Distance of each pixel from the camera | Robots, AR, autonomous navigation | DPT, Depth Anything, Marigold |
| Optical flow | Pixel-level motion vectors between frames | Video understanding, action recognition | FlowNet, RAFT |
| Image generation | New images from text or noise | DALL-E, Midjourney, Stable Diffusion | Diffusion models, GANs |
| Visual question answering | Text answer to a question about an image | Multimodal chatbots, accessibility tools | LLaVA, GPT-4V, Gemini |
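These tasks differ mainly in the shape and density of the model's output. A minimal NumPy sketch of the output formats (illustrative shapes only — no real model here; the class count and the 3 detections are arbitrary):

```python
import numpy as np

num_classes, H, W = 10, 224, 224

# Image classification: one score per class for the whole image
class_scores = np.random.rand(num_classes)                  # shape (10,)

# Object detection: N boxes, each (x1, y1, x2, y2), plus a class id per box
boxes = np.random.rand(3, 4) * W                            # shape (3, 4)
box_classes = np.random.randint(0, num_classes, size=3)     # shape (3,)

# Semantic segmentation: a class id for every pixel
semantic_mask = np.random.randint(0, num_classes, (H, W))   # shape (224, 224)

# Instance segmentation: one binary mask per detected instance
instance_masks = np.zeros((3, H, W), dtype=bool)            # shape (3, 224, 224)

print(class_scores.shape, boxes.shape, semantic_mask.shape, instance_masks.shape)
```

Reading down the list, each task adds spatial detail: a single vector, then per-object coordinates, then a dense per-pixel map, then one dense map per object.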

From CNNs to Vision Transformers

Computer vision went through a major architectural shift around 2020-2021. Convolutional neural networks (CNNs) dominated the decade following AlexNet (2012), using spatial filters to extract hierarchical features. Vision Transformers (ViT, 2020) applied the Transformer architecture directly to images by splitting them into patches and treating each patch as a token.

| Architecture | Core idea | Strengths | Weaknesses |
|---|---|---|---|
| CNNs (ResNet, EfficientNet) | Learnable filters slide over the image, detecting edges → textures → shapes → objects | Data-efficient; great for small datasets; fast inference; strong inductive biases | Limited long-range context; architecture engineering required |
| Vision Transformers (ViT) | Image split into 16×16 patches; each patch treated as a token; self-attention across all patches | Excellent at long-range dependencies; scales extremely well with data; same architecture as language models | Needs large datasets; computationally heavier than CNNs on small data |
| Hybrid models (ConvNeXt, EfficientViT) | CNN-style locality + Transformer-style global attention | Best of both; competitive at all scales | More complex to design and tune |
| Foundation models (SAM, CLIP, DINOv2) | Trained on billions of images; one model for many downstream tasks via prompting or fine-tuning | Zero-shot generalization; no task-specific training data needed | Very large; expensive to run at the edge |
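The step that distinguishes ViT from a CNN — cutting an image into 16×16 patches and flattening each into a token — can be sketched in a few lines of NumPy (the `patchify` name is ours; a real ViT follows this with a learned linear projection and position embeddings):

```python
import numpy as np

def patchify(image, patch=16):
    """Split a (C, H, W) image into a (num_patches, patch*patch*C) token matrix."""
    C, H, W = image.shape
    gh, gw = H // patch, W // patch                      # patch grid, e.g. 14x14
    x = image.reshape(C, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)                       # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)         # one row per patch

img = np.random.rand(3, 224, 224).astype(np.float32)     # a fake RGB image
tokens = patchify(img)
print(tokens.shape)  # (196, 768) -- 14*14 patches, each 16*16*3 values
```

A 224×224 RGB image thus becomes a sequence of 196 tokens — exactly the kind of input a standard Transformer consumes.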

CLIP: connecting vision and language

OpenAI's CLIP (2021) trained a vision encoder and a text encoder jointly to match image-text pairs across 400M web-scraped examples. The result: a model that can classify images into any category describable in text, with zero task-specific training. CLIP's encoders are now everywhere — the text encoder conditions Stable Diffusion, and the image encoder serves as the visual backbone of DALL-E 2 and many multimodal LLMs such as LLaVA.
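At inference time, CLIP-style zero-shot classification reduces to a cosine-similarity lookup over text embeddings. A sketch of that logic with tiny hypothetical 4-d vectors standing in for CLIP's real embeddings (the `zero_shot_classify` helper and the temperature value are our illustrative assumptions, not CLIP's API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    # L2-normalise so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # one similarity score per label
    logits = sims / temperature            # sharpen before softmax
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

labels = ["a photo of a cat", "a photo of a dog"]
# hypothetical 4-d embeddings standing in for CLIP's 512-d outputs
image_emb = np.array([0.9, 0.1, 0.0, 0.2])
text_embs = np.array([[0.8, 0.2, 0.1, 0.1],
                      [0.1, 0.9, 0.3, 0.0]])
best, probs = zero_shot_classify(image_emb, text_embs, labels)
print(best)  # a photo of a cat
```

Swapping in a different label list requires no retraining at all — only re-encoding the new text prompts.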

Computer vision in practice: tools and libraries

Object detection with YOLOv8 — detect and label objects in any image in ~5 lines

from ultralytics import YOLO

# Load a pre-trained YOLOv8 model (downloads automatically)
model = YOLO("yolov8n.pt")   # 'n' = nano (fastest); also: s, m, l, x

# Run detection on an image
results = model("https://ultralytics.com/images/bus.jpg")

# Print all detected objects
for r in results:
    for box in r.boxes:
        class_name = r.names[int(box.cls)]
        confidence = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{class_name}: {confidence:.1%} confidence at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")

# Save the annotated image
results[0].save("detected.jpg")
# Detects: person (97%), bus (95%), bench (88%), car (72%)...
| Library / Tool | Best for | Skill level |
|---|---|---|
| OpenCV | Traditional CV: filtering, edge detection, video processing | Beginner–Intermediate |
| Ultralytics (YOLOv8/11) | Fast object detection and segmentation — best starting point | Beginner |
| Torchvision | PyTorch-native; pre-trained models, transforms, datasets | Intermediate |
| Hugging Face Transformers | Vision Transformers, CLIP, ViT, SAM, multimodal models | Intermediate |
| Roboflow | Dataset management, annotation, training pipeline in the browser | Beginner |
| SAM (Segment Anything) | Interactive segmentation of any object in any image | Beginner–Intermediate |
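To make the "traditional CV" row concrete: classical pipelines convolve hand-designed kernels over the image instead of learning them. A naive NumPy Sobel-filter sketch (OpenCV's cv2.filter2D / cv2.Sobel are the optimized equivalents; the toy 8×8 image is ours):

```python
import numpy as np

def convolve2d(img, kernel):
    """Naive valid-mode 2-D cross-correlation, as a teaching stand-in for cv2.filter2D."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernel for horizontal gradients (responds to vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Synthetic image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0

edges = np.abs(convolve2d(img, sobel_x))
print(edges.max())  # 4.0 -- strongest response at the column boundary
```

A CNN learns thousands of such filters from data; here the filter is fixed by hand, which is exactly the pre-2012 recipe that features like SIFT and HOG generalised.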

Practice questions

  1. What is the difference between image classification, object detection, semantic segmentation, and instance segmentation? (Answer: Classification: one label for the entire image (cat/dog). Object detection: bounding boxes with labels around each object (3 cats, 2 dogs, with coordinates). Semantic segmentation: every pixel assigned a class label (all cat pixels = class 'cat', all background pixels = 'background'). Instance segmentation: every pixel assigned to a specific instance (cat1 pixels, cat2 pixels, background) — combines detection and segmentation. Each is progressively harder and richer in spatial information.)
  2. What was the key innovation of AlexNet (2012) that sparked the deep learning era in computer vision? (Answer: AlexNet (Krizhevsky, Sutskever, Hinton) won ILSVRC 2012 by a 10.8-point margin over the second place (15.3% vs 26.2% top-5 error). Key innovations: (1) Deep architecture (8 layers vs previous 3–4). (2) ReLU activations (faster training than tanh). (3) GPU training (two GTX 580 3GB GPUs, 6 days). (4) Dropout regularisation. (5) Data augmentation. The scale of improvement shocked the community and triggered the shift from hand-crafted features (SIFT, HOG) to learned CNN features.)
  3. What is the Vision Transformer (ViT) and why did it challenge the dominance of CNNs? (Answer: ViT (Dosovitskiy et al., 2020): apply a standard transformer directly to sequences of image patches (16×16 patches) — no convolutional inductive bias. Trained with supervised classification on ImageNet-21k + JFT-300M. At scale, ViT-Large/16 surpassed state-of-the-art CNNs (EfficientNet) while being more scalable and parallelisable. CNNs have an inductive bias (local connectivity) that helps on small datasets but limits flexibility at scale. ViT can learn any spatial relationship from data when given enough training examples.)
  4. What is CLIP and how does it enable zero-shot image classification? (Answer: CLIP (Contrastive Language-Image Pretraining, OpenAI 2021): train image and text encoders together on 400M image-text pairs using a contrastive loss — matching images and captions are pulled together in embedding space, mismatched pairs pushed apart. Zero-shot classification: encode all class names as text prompts ('a photo of a cat'), encode the query image, and pick the class with the highest image-text cosine similarity. No task-specific training required — just natural-language class descriptions. Zero-shot CLIP reaches 76.2% top-1 accuracy on ImageNet, matching a fully supervised ResNet-50 without using any ImageNet training labels. Enables 'describe what you're looking for' style search.)
  5. What is the current state of foundation models for computer vision and how do they compare to specialised models? (Answer: Vision foundation models (SAM, DINO, OpenCLIP, Florence-2) are pretrained on massive image datasets and can be fine-tuned for many downstream tasks with few examples. SAM (Segment Anything Model, Meta 2023): trained on 11M images, 1.1B masks — can segment any object from a single point click or bounding box prompt. DINO v2: self-supervised ViT providing powerful features for depth estimation, semantic segmentation, and classification. Trade-off: foundation models are general-purpose but may underperform narrow specialists trained with domain-specific data.)
