Batch learning trains a model on the entire dataset at once — the model is static after training. Online learning updates the model incrementally as each new example arrives, enabling adaptation to changing data distributions without full retraining. Parametric models (linear regression, neural networks) represent knowledge in a fixed set of parameters learned during training. Non-parametric models (KNN, kernel SVM, decision trees) do not fix the model structure in advance — complexity can grow with data. These distinctions drive fundamental architecture choices in ML systems.
Batch vs Online learning
| Property | Batch Learning | Online Learning |
|---|---|---|
| Training data | Entire dataset at once | One example (or mini-batch) at a time |
| Model update | Full retrain on new data | Incremental update with each new example |
| Memory | Needs all data in memory | O(1) memory — only current example needed |
| Adaptation | Static after training — cannot adapt | Adapts continuously to new patterns |
| Compute | Expensive upfront, cheap inference | Cheap updates, runs continuously |
| Instability | Stable — learns from full distribution | Can drift if distribution changes rapidly |
| Use cases | Image classifiers, LLMs, offline models | Fraud detection, ad click prediction, IoT |
| Examples | Batch gradient descent, sklearn fit() | SGD, river library, Kafka-based systems |
Online learning with incremental fitting
```python
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np

# Online learning: partial_fit() updates the model with each new batch
# Simulating a data stream
np.random.seed(42)
n_total = 10000
batch_size = 100
n_features = 10

# SGDClassifier supports online learning via partial_fit
model = SGDClassifier(loss='log_loss', learning_rate='optimal', random_state=42)
scaler = StandardScaler()

# Simulate stream processing
classes = np.array([0, 1])  # partial_fit needs the full class list on the first call
accuracies = []
for batch_start in range(0, n_total, batch_size):
    # Simulate an incoming batch of data
    X_batch = np.random.randn(batch_size, n_features)
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)  # Concept: label depends on features 0 and 1

    if batch_start == 0:
        X_scaled = scaler.fit_transform(X_batch)
        model.partial_fit(X_scaled, y_batch, classes=classes)
    else:
        X_scaled = scaler.transform(X_batch)  # Reuse the scaler fitted on the first batch
        model.partial_fit(X_scaled, y_batch)  # Incremental update

    if batch_start % 1000 == 0:
        acc = model.score(X_scaled, y_batch)
        accuracies.append((batch_start, acc))
        print(f"Batch {batch_start}: accuracy = {acc:.3f}")
```
When the data distribution shifts mid-stream (concept drift), an online model adapts through continued partial_fit updates, while a batch model trained once at the start would degrade.
Parametric vs Non-parametric models
Parametric models assume a specific functional form for the mapping function and learn a fixed, finite set of parameters. Once trained, the training data can be discarded. Examples: linear regression (parameters = β₀, β₁, ..., βₙ), logistic regression, neural networks. Non-parametric models do not fix the functional form — the model complexity can grow with the data. Some store the training data itself. Examples: KNN (stores all training data), kernel SVM (complexity grows with support vectors), decision trees (depth not fixed).
| Property | Parametric | Non-Parametric |
|---|---|---|
| Model complexity | Fixed regardless of data size | Can grow with data size |
| Training data | Can discard after training | Often kept (KNN, kernel SVM) |
| Memory at inference | Low (just parameters) | High (stores training data) |
| Assumptions | Strong (assumes functional form) | Fewer (more flexible) |
| Data needed | Less data needed if assumptions hold | More data needed for good fit |
| Examples | Linear/logistic regression, NN, Naive Bayes | KNN, kernel SVM, decision trees, random forests, GP |
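The "model complexity" and "memory" rows can be made concrete. A rough sketch (assuming scikit-learn; `n_samples_fit_` is the fitted neighbors estimator's documented sample-count attribute):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
p = 5  # number of features

for n in (100, 10_000):
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

    lin = LinearRegression().fit(X, y)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

    # Parametric: p coefficients + 1 intercept, regardless of n
    n_params = lin.coef_.size + 1
    # Non-parametric: the fitted model keeps all n training points
    n_stored = knn.n_samples_fit_

    print(f"n={n:>6}: linear params={n_params}, KNN stored samples={n_stored}")
```

The linear model's size is constant as `n` grows 100-fold; the KNN model's storage grows linearly with it.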
Why neural networks are parametric despite being very flexible
A neural network with 1B parameters is still parametric — it has a fixed number of parameters determined at architecture design time. The architecture (number of layers, neurons) is fixed; only the parameter values are learned from data. Non-parametric means the number of "parameters" can grow with training data — a KNN model with 1M training points effectively has 1M "parameters" (the training examples themselves).
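A quick way to see this: the parameter count of a fully connected network is a function of the layer sizes alone. A small illustrative helper (not tied to any framework) that counts weights and biases:

```python
def mlp_param_count(layer_sizes):
    """Total weights + biases for a fully connected network."""
    return sum(ins * outs + outs
               for ins, outs in zip(layer_sizes, layer_sizes[1:]))

# Architecture fixes the count: 10 inputs -> 64 -> 64 -> 1 output
print(mlp_param_count([10, 64, 64, 1]))  # 10*64+64 + 64*64+64 + 64*1+1 = 4929

# Training on 1k or 1M examples changes the VALUES of these parameters,
# never their number — that is what makes the network parametric.
```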
Practice questions
- A fraud detection system needs to adapt to new fraud patterns daily without full retraining. Should it use batch or online learning? (Answer: Online learning — fraud patterns evolve continuously. Online learning (SGD, Passive-Aggressive classifier) updates the model with each new transaction, adapting to concept drift without expensive full retraining.)
- Why is KNN considered non-parametric? (Answer: KNN has no fixed parameters learned during training — it stores the entire training set. The "model" IS the training data. Model complexity grows linearly with number of training examples (more data = larger model).)
- What is concept drift and how does online learning handle it? (Answer: Concept drift = the statistical properties of the target variable change over time (e.g., fraud patterns change as criminals adapt). Online learning continuously updates the model with recent data, giving higher weight to recent examples, allowing adaptation to drift.)
- Linear regression has p+1 parameters for p features. Is this parametric or non-parametric? (Answer: Parametric — the number of parameters (β₀, β₁, ..., βₚ) is fixed at p+1 regardless of how many training examples you have.)
- What is the "curse of dimensionality" and why does it affect non-parametric models more? (Answer: In high dimensions, all points become equidistant — nearest neighbours are no longer meaningfully close. Non-parametric models like KNN rely on distance in feature space, so they degrade badly in high dimensions. Parametric models encode structure in parameters rather than distance, handling high dimensions better.)
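The distance-concentration effect described in the last answer can be observed directly. A small simulation (illustrative setup: random points in the unit hypercube, one random query point) measures how much farther the farthest neighbour is than the nearest:

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))   # 500 random points in the unit hypercube
    q = rng.uniform(size=d)          # a query point
    dists = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther is the farthest point than the nearest?
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>4}: relative contrast = {contrasts[d]:.2f}")
```

As `d` grows, the contrast shrinks toward zero: the nearest and farthest neighbours become nearly the same distance away, so distance-based methods like KNN lose their signal.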
On LumiChats
Modern LLMs like Claude are trained with batch learning on massive corpora, followed by further batch fine-tuning stages such as RLHF — not true online learning. This distinction explains why LLMs have a knowledge cutoff date and why real-time adaptation requires explicit retraining or fine-tuning cycles rather than incremental updates.