Computer vision (CV) is the field of AI that enables computers to interpret and understand visual information from the world — images, video, depth sensors, and medical scans. Modern computer vision is almost entirely powered by deep learning, particularly convolutional neural networks (CNNs) and, increasingly, Vision Transformers (ViT). Applications range from photo search and autonomous vehicles to medical diagnostics and factory quality control.
Core computer vision tasks
| Task | What the model outputs | Example application | Key models |
|---|---|---|---|
| Image classification | A label for the whole image | "This is a cat" — photo apps, content moderation | ResNet, EfficientNet, ViT |
| Object detection | Bounding boxes + labels for each object in the image | Autonomous vehicles, surveillance cameras | YOLO, Faster R-CNN, DETR |
| Semantic segmentation | Per-pixel class labels — every pixel assigned a category | Medical image analysis, road scene parsing | U-Net, DeepLab, SegFormer |
| Instance segmentation | Per-pixel labels + distinct IDs for each instance | Separating individual people in a crowd | Mask R-CNN, SAM (Segment Anything) |
| Depth estimation | Distance of each pixel from the camera | Robots, AR, autonomous navigation | DPT, Depth Anything, Marigold |
| Optical flow | Pixel-level motion vectors between frames | Video understanding, action recognition | FlowNet, RAFT |
| Image generation | New images from text or noise | DALL-E, Midjourney, Stable Diffusion | Diffusion models, GANs |
| Visual question answering | Text answer to a question about an image | Multimodal chatbots, accessibility tools | LLaVA, GPT-4V, Gemini |
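The "what the model outputs" column above corresponds to concrete data structures. Here is a toy NumPy sketch of those output shapes — the field names and shapes are illustrative, not any particular library's API:

```python
import numpy as np

H, W, num_classes = 4, 6, 3  # tiny "image" for illustration

# Image classification: one score per class, one label for the whole image.
cls_logits = np.array([0.1, 2.3, -0.5])
cls_label = int(np.argmax(cls_logits))

# Object detection: a box (x1, y1, x2, y2) + label + confidence per object.
detections = [
    {"box": (0, 0, 2, 3), "label": 1, "conf": 0.97},
    {"box": (3, 1, 5, 3), "label": 2, "conf": 0.88},
]

# Semantic segmentation: one class id per pixel -> an (H, W) label map.
sem_mask = np.zeros((H, W), dtype=int)
sem_mask[0:3, 0:2] = 1  # these pixels belong to class 1

# Instance segmentation: one boolean mask per detected instance.
inst_masks = np.zeros((len(detections), H, W), dtype=bool)
inst_masks[0, 0:3, 0:2] = True
inst_masks[1, 1:3, 3:5] = True

print(cls_label, len(detections), sem_mask.shape, inst_masks.shape)
```

Note how the outputs get progressively richer: a single integer, then a list of boxes, then a per-pixel map, then one mask per instance.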
From CNNs to Vision Transformers
Computer vision went through a major architectural shift around 2020-2021. Convolutional neural networks (CNNs) dominated the decade following AlexNet (2012), using spatial filters to extract hierarchical features. Vision Transformers (ViT, 2020) applied the Transformer architecture directly to images by splitting them into patches and treating each patch as a token.
| Architecture | Core idea | Strengths | Weaknesses |
|---|---|---|---|
| CNNs (ResNet, EfficientNet) | Learnable filters slide over image, detecting edges → textures → shapes → objects | Data-efficient; great for small datasets; fast inference; strong inductive biases | Limited long-range context; architecture engineering required |
| Vision Transformers (ViT) | Image split into 16×16 patches; each patch treated as a token; self-attention across all patches | Excellent at long-range dependencies; scales extremely well with data; same architecture as language models | Needs large datasets; computationally heavier than CNNs on small data |
| Hybrid models (ConvNeXt, EfficientViT) | CNN-style locality + Transformer-style global attention | Best of both; competitive at all scales | More complex to design and tune |
| Foundation models (SAM, CLIP, DINOv2) | Trained on billions of images; one model for many downstream tasks via prompting or fine-tuning | Zero-shot generalization; no task-specific training data needed | Very large; expensive to run at edge |
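The ViT row above — "image split into 16×16 patches; each patch treated as a token" — can be sketched in a few lines of NumPy. A random matrix stands in for the learned patch-embedding projection:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping patch tokens (ViT-style)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    n_h, n_w = H // patch, W // patch
    patches = image.reshape(n_h, patch, n_w, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (n_h, n_w, patch, patch, C)
    return patches.reshape(n_h * n_w, patch * patch * C)  # one flat vector per patch

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3)).astype(np.float32)

tokens = patchify(img)  # (196, 768): 14 x 14 patches, each 16 x 16 x 3 = 768 values
W_embed = rng.standard_normal((768, 768)).astype(np.float32)  # random stand-in for the learned projection
embeddings = tokens @ W_embed  # (196, 768) token embeddings fed to the Transformer

print(tokens.shape, embeddings.shape)
```

From here on, the 196 embeddings (plus a class token and position embeddings, omitted in this sketch) are processed exactly like word tokens in a language model.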
CLIP: connecting vision and language
OpenAI's CLIP (2021) jointly trained a vision encoder and a text encoder to match image–text pairs across 400M web-scraped examples. The result: a model that can classify images into any category described in text, with zero task-specific training. CLIP's encoders became building blocks across generative and multimodal AI — its text encoder conditions Stable Diffusion, DALL-E 2 was built around CLIP embeddings, and its image encoder serves as the visual backbone of many multimodal LLMs such as LLaVA.
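The zero-shot mechanism is easy to sketch. The toy below uses random unit vectors as stand-ins for CLIP's trained text embeddings (real CLIP would run the prompts and the image through its trained encoders); the image embedding is deliberately constructed near the "cat" prompt so the cosine-similarity argmax recovers it:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in text embeddings: in real CLIP these come from the trained text encoder.
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bus"]
text_embeds = normalize(rng.standard_normal((3, 512)))

# Pretend the image encoder mapped our query image close to the "cat" prompt.
image_embed = normalize(text_embeds[0] + 0.05 * rng.standard_normal(512))

# Zero-shot classification = argmax of image-text cosine similarity.
similarities = text_embeds @ image_embed  # all vectors are unit-norm
predicted = class_prompts[int(np.argmax(similarities))]
print(predicted)  # prints the matching prompt: a photo of a cat
```

Swapping in new classes means writing new prompts — no retraining, which is exactly what "zero-shot" means here.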
Computer vision in practice: tools and libraries
Object detection with YOLOv8 — detect and label objects in any image in a few lines of code

```python
from ultralytics import YOLO

# Load a pre-trained YOLOv8 model (downloads automatically)
model = YOLO("yolov8n.pt")  # 'n' = nano (fastest); also: s, m, l, x

# Run detection on an image
results = model("https://ultralytics.com/images/bus.jpg")

# Print all detected objects
for r in results:
    for box in r.boxes:
        class_name = r.names[int(box.cls)]
        confidence = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{class_name}: {confidence:.1%} confidence at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")

# Save the annotated image
results[0].save("detected.jpg")
# Detects: person (97%), bus (95%), bench (88%), car (72%)...
```

| Library / Tool | Best for | Skill level |
|---|---|---|
| OpenCV | Traditional CV: filtering, edge detection, video processing | Beginner–Intermediate |
| Ultralytics (YOLOv8/11) | Fast object detection and segmentation — best starting point | Beginner |
| Torchvision | PyTorch-native; pre-trained models, transforms, datasets | Intermediate |
| Hugging Face Transformers | Vision Transformers, CLIP, ViT, SAM, multimodal models | Intermediate |
| Roboflow | Dataset management, annotation, training pipeline in the browser | Beginner |
| SAM (Segment Anything) | Interactive segmentation of any object in any image | Beginner–Intermediate |
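To make "traditional CV: filtering, edge detection" in the OpenCV row concrete, here is a naive NumPy Sobel filter — a sketch of what OpenCV's optimized `cv2.filter2D` / `cv2.Sobel` compute:

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 'valid' 2-D cross-correlation over a grayscale image."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel x-kernel: responds to horizontal intensity changes (vertical edges).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

edges = conv2d(img, sobel_x)
print(edges)  # large values where the dark/bright boundary falls in the window
```

This hand-crafted filter is exactly the kind of fixed feature extractor (like SIFT and HOG) that CNNs replaced with *learned* filters after 2012.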
Practice questions
- What is the difference between image classification, object detection, semantic segmentation, and instance segmentation? (Answer: Classification: one label for the entire image (cat/dog). Object detection: bounding boxes with labels around each object (3 cats, 2 dogs, with coordinates). Semantic segmentation: every pixel assigned a class label (all cat pixels = class 'cat', all background pixels = 'background'). Instance segmentation: every pixel assigned to a specific instance (cat1 pixels, cat2 pixels, background) — combines detection and segmentation. Each is progressively harder and richer in spatial information.)
- What was the key innovation of AlexNet (2012) that sparked the deep learning era in computer vision? (Answer: AlexNet (Krizhevsky, Sutskever, Hinton) won ILSVRC 2012 by a 10.9-point margin over the runner-up (15.3% vs 26.2% top-5 error). Key innovations: (1) Deep architecture (8 layers vs previous 3–4). (2) ReLU activations (faster training than tanh). (3) GPU training (two GTX 580 3GB GPUs, 6 days). (4) Dropout regularisation. (5) Data augmentation. The scale of improvement shocked the community and triggered the shift from hand-crafted features (SIFT, HOG) to learned CNN features.)
- What is the Vision Transformer (ViT) and why did it challenge the dominance of CNNs? (Answer: ViT (Dosovitskiy et al., 2020): apply a standard transformer directly to sequences of image patches (16×16 patches) — no convolutional inductive bias. Trained with supervised classification on ImageNet-21k + JFT-300M. At scale, ViT-Large/16 surpassed state-of-the-art CNNs (EfficientNet) while being more scalable and parallelisable. CNNs have an inductive bias (local connectivity) that helps on small datasets but limits flexibility at scale. ViT can learn any spatial relationship from data when given enough training examples.)
- What is CLIP and how does it enable zero-shot image classification? (Answer: CLIP (Contrastive Language-Image Pretraining, OpenAI 2021): train image and text encoders together on 400M image-text pairs using contrastive loss — matching images and captions attract, mismatched ones repel. Zero-shot classification: encode all class names as text prompts ('a photo of a cat'), encode the query image, and pick the class with the highest image-text cosine similarity. No task-specific training required — just natural language class descriptions. CLIP achieves 76.2% zero-shot top-1 on ImageNet, matching a fully supervised ResNet-50 without using any of its 1.28M labeled training examples. Enables 'describe what you're looking for' style search.)
- What is the current state of foundation models for computer vision and how do they compare to specialised models? (Answer: Vision foundation models (SAM, DINOv2, OpenCLIP, Florence-2) are pretrained on massive image datasets and can be fine-tuned for many downstream tasks with few examples. SAM (Segment Anything Model, Meta 2023): trained on 11M images, 1.1B masks — can segment any object from a single point click or bounding box prompt. DINOv2: self-supervised ViT providing powerful features for depth estimation, semantic segmentation, and classification. Trade-off: foundation models are general-purpose but may underperform narrow specialists trained with domain-specific data.)