Semi-supervised learning combines a small amount of labelled data with a large amount of unlabelled data during training. Because labelling is expensive and time-consuming, leveraging abundant unlabelled data can substantially improve performance over training on the labelled set alone. Self-supervised learning goes further: it generates its own supervision signal from the structure of the raw data, requiring no human labels at all. Self-supervised learning powers GPT, BERT, DALL-E, and most modern foundation models. Understanding these paradigms is essential for modern ML practice.
Real-life analogy: learning French from a few translated dialogues
Semi-supervised: Imagine learning French with 10 labelled dialogues (English translation provided) and 10,000 unlabelled French texts. You use the labelled examples to build initial understanding, then let the patterns from unlabelled texts reinforce and extend your knowledge. Self-supervised: You learn by predicting the next word in every sentence — no teacher needed, the text itself is the teacher. This is exactly how GPT was trained.
Semi-supervised learning
In semi-supervised learning you have a dataset D = D_L ∪ D_U where |D_L| << |D_U|. D_L is the small labelled set, D_U is the large unlabelled set. Common techniques:
- Pseudo-labelling: Train on D_L, predict labels for D_U with high confidence (e.g., > 0.95 probability), add those pseudo-labelled examples to D_L, retrain. Repeat.
- Label Propagation: Build a graph where similar data points are connected. Propagate known labels through the graph to nearby unlabelled points.
- Co-training: Train two models on different feature subsets. Each model labels examples for the other when confident. Models teach each other.
- FixMatch / MixMatch: Strong modern methods that combine pseudo-labelling with consistency regularisation — the model should give the same prediction for a data point and its augmented version.
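The pseudo-labelling loop described above can be sketched with scikit-learn's built-in `SelfTrainingClassifier`. The 0.95 confidence threshold and the SVC base model are illustrative choices, not the only sensible ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Toy dataset: 20 labelled points, 980 marked unlabelled with -1
X, y_true = make_classification(n_samples=1000, n_features=20, random_state=0)
y_semi = y_true.copy()
y_semi[20:] = -1  # -1 means "unlabelled" in sklearn's semi-supervised API

# Self-training: fit on labelled data, pseudo-label unlabelled points the
# model is confident about (> 0.95), add them to the training set, repeat.
base = SVC(probability=True, random_state=0)  # base model needs predict_proba
self_training = SelfTrainingClassifier(base, threshold=0.95, max_iter=10)
self_training.fit(X, y_semi)

# labeled_iter_ records the iteration each point was labelled in
# (0 = originally labelled, >0 = pseudo-labelled, -1 = never labelled)
print("Pseudo-labelled points:", int((self_training.labeled_iter_ > 0).sum()))
print("Accuracy on unlabelled portion:",
      (self_training.predict(X[20:]) == y_true[20:]).mean())
```

This is the same iterate-and-retrain idea as the bullet above, with scikit-learn handling the confidence filtering and retraining loop.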
Label propagation and label spreading implementation
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.datasets import make_classification
import numpy as np
# Create dataset: 10 labelled, 990 unlabelled
X, y_true = make_classification(n_samples=1000, n_features=20, random_state=42)
y_semi = y_true.copy()
y_semi[10:] = -1 # -1 means "unlabelled" in sklearn semi-supervised
# Label Propagation: spread labels through nearest-neighbour graph
lp = LabelPropagation(kernel='rbf', gamma=20, max_iter=1000)
lp.fit(X, y_semi)
y_pred_lp = lp.predict(X[10:]) # Predictions for unlabelled data
# Label Spreading (more robust to noise)
ls = LabelSpreading(kernel='rbf', alpha=0.2, max_iter=1000)
ls.fit(X, y_semi)
y_pred_ls = ls.predict(X[10:])
# Compare with supervised-only (only uses 10 labelled examples)
from sklearn.svm import SVC
svm_supervised = SVC().fit(X[:10], y_true[:10])
y_pred_sup = svm_supervised.predict(X[10:])
from sklearn.metrics import accuracy_score
print(f"Supervised only (10 labels): {accuracy_score(y_true[10:], y_pred_sup):.3f}")
print(f"Label Propagation (10 + 990): {accuracy_score(y_true[10:], y_pred_lp):.3f}")
print(f"Label Spreading (10 + 990): {accuracy_score(y_true[10:], y_pred_ls):.3f}")
# Semi-supervised typically beats supervised-only significantly
Self-supervised learning — the engine of modern AI
Self-supervised learning creates a pretext task from the data itself — a task where the supervision signal is automatically derived from the input, requiring zero human labels. The model learns rich representations by solving the pretext task. Those representations are then fine-tuned for downstream tasks.
| Domain | Self-supervised pretext task | Model trained | Downstream use |
|---|---|---|---|
| NLP (text) | Predict masked tokens (15% of words hidden) | BERT | Classification, NER, QA |
| NLP (text) | Predict next token (autoregressive) | GPT-4, Claude, Llama | Chat, completion, reasoning |
| Vision | Predict missing image patches | MAE, BEiT | Image classification, detection |
| Vision | Contrast similar vs dissimilar images | SimCLR, CLIP, DINO | Image search, zero-shot |
| Audio | Predict masked audio frames (contrastive / cluster targets) | Wav2Vec 2.0, HuBERT | Speech recognition (ASR) |
| Multimodal | Match image to its text caption | CLIP, ALIGN | Zero-shot image classification |
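The first two rows of the table can be made concrete with plain NumPy: next-token targets (GPT-style) are just the sequence shifted by one position, and masked-token targets (BERT-style) hide a random subset of positions. The 15% mask rate follows BERT; the token ids and the `MASK_ID` value are toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([5, 17, 3, 42, 8, 99, 11, 23, 6, 31])  # toy token ids
MASK_ID = 0  # hypothetical id reserved for the [MASK] token

# GPT-style pretext task: predict token t+1 from tokens up to t.
inputs_ar = tokens[:-1]   # the model sees these...
targets_ar = tokens[1:]   # ...and must predict these

# BERT-style pretext task: hide ~15% of tokens, predict the originals.
mask = rng.random(tokens.shape) < 0.15
corrupted = np.where(mask, MASK_ID, tokens)  # the model sees this
# the loss is computed only at masked positions, against tokens[mask]

print("next-token inputs :", inputs_ar)
print("next-token targets:", targets_ar)
print("masked input      :", corrupted)
print("positions to fill :", np.flatnonzero(mask))
```

In both cases the targets come entirely from the raw sequence — no human ever labels anything, which is the defining property of a pretext task.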
Why self-supervised learning is revolutionary
Pre-2017, you typically needed on the order of a million labelled images to train a strong vision model. With self-supervised pre-training, CLIP trains on 400 million image-text pairs scraped from the web (no manual labelling) and achieves zero-shot image classification competitive with fully supervised models trained on ImageNet. GPT-4 was trained on trillions of tokens of raw web text with no human labelling of the training data (only RLHF for alignment).
Practice questions
- Why is semi-supervised learning practically important? (Answer: Labelling data is expensive — medical images require radiologists, legal documents require lawyers. Unlabelled data is cheap. Semi-supervised learning lets you use abundant cheap data + small expensive labelled set to achieve near-supervised performance.)
- What is a pretext task in self-supervised learning? (Answer: An automatically generated task derived from the data structure itself with no human labels. Examples: predict masked words (BERT), predict next word (GPT), predict missing image patches (MAE). The model learns representations while solving the pretext task.)
- BERT uses masked language modelling. What percentage of tokens are masked? (Answer: 15% of input tokens are randomly selected — 80% replaced with [MASK], 10% replaced with random word, 10% unchanged. This prevents the model from learning to always ignore [MASK] tokens.)
- What is the difference between semi-supervised and self-supervised learning? (Answer: Semi-supervised: uses small labelled set + large unlabelled set. Self-supervised: uses NO labelled data — creates supervision from data structure itself (masking, next-token prediction, contrastive pairs).)
- Contrastive learning (SimCLR, CLIP) — what does "contrastive" mean? (Answer: The model learns by contrasting similar pairs (positive: same image under two augmentations) against dissimilar pairs (negative: different images). Loss pulls positives together and pushes negatives apart in embedding space — teaching meaningful similarity.)
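The "pull positives together, push negatives apart" idea from the last answer can be written down as the InfoNCE loss used, in variants, by SimCLR and CLIP. The embeddings and temperature below are toy values, and this is a minimal sketch rather than a production implementation:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: z1[i] and z2[i] are embeddings of two views of the
    same item (positives); all non-matching pairs act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature  # pairwise cosine similarities
    # cross-entropy with the matching index as the correct "class"
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 8))
aligned = anchor + 0.01 * rng.normal(size=(4, 8))  # views of the same items
shuffled = rng.normal(size=(4, 8))                 # unrelated items

print("aligned pairs loss :", info_nce(anchor, aligned))   # should be low
print("random pairs loss  :", info_nce(anchor, shuffled))  # should be higher
```

Minimising this loss makes matching pairs more similar than any non-matching pair, which is exactly the "meaningful similarity" the answer describes.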
On LumiChats
LumiChats itself was trained with self-supervised learning (next-token prediction) on trillions of text tokens, then fine-tuned with RLHF. Understanding self-supervised learning directly explains how modern LLMs acquire their broad knowledge before specialisation.