Ensemble methods combine predictions from multiple models to outperform any individual model. Bagging (Bootstrap Aggregating) trains models in parallel on random subsets of the data and averages their predictions, which reduces variance. Boosting trains models sequentially, with each model focusing on the errors of its predecessors, which reduces bias. Stacking uses a meta-model to learn how to combine the base models' predictions. XGBoost and LightGBM are gradient boosting implementations that dominate structured-data competitions and industry applications.
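Bagging's variance reduction follows directly from the statistics of averaging. A minimal NumPy sketch (the noise level and names are illustrative) simulates many independent high-variance "models" and shows that averaging 100 of them cuts the variance by roughly a factor of 100:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 5.0
sigma = 2.0  # each simulated "model" errs with standard deviation 2.0

# 10,000 predictions from a single noisy model: variance ≈ sigma² = 4.0
single_preds = true_value + rng.normal(0, sigma, size=10_000)

# Bagging-style ensembles: each prediction is the mean of 100 independent models
ensemble_preds = (true_value + rng.normal(0, sigma, size=(10_000, 100))).mean(axis=1)

print(f"Single model variance:  {single_preds.var():.3f}")
print(f"100-model avg variance: {ensemble_preds.var():.3f}")
```

With perfectly independent models the variance drops as sigma²/N; real trees trained on overlapping data are correlated, which is exactly why Random Forest decorrelates them with random feature subsets.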
Real-life analogy: The committee decision
A hospital board makes decisions by committee rather than delegating to one doctor. Bagging: ask 100 doctors to each examine a random 70% of the patient files independently, then take the majority vote. Boosting: ask doctor 1 to diagnose, then have doctor 2 focus specifically on the cases doctor 1 got wrong, then doctor 3 on the cases both 1 and 2 got wrong. Stacking: train a senior consultant to learn whose opinion to trust for which type of case.
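The fraction of files each doctor sees is not arbitrary: a full-size bootstrap sample (n draws with replacement from n cases) contains about 63.2% of the distinct cases, since each case is missed with probability (1 - 1/n)^n ≈ 1/e. A quick sketch (the sample size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of "patient files"

# One bootstrap sample: n draws with replacement
sample = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(sample)) / n
print(f"Distinct cases seen: {unique_fraction:.1%}")  # ≈ 63.2% (1 - 1/e)
```

The remaining ~36.8% of cases are the out-of-bag samples that Random Forest's `oob_score` exploits as a free validation set.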
Bagging — reducing variance
Bagging and Random Forest comparison
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import numpy as np
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Single Decision Tree (high variance)
single_tree = DecisionTreeClassifier(max_depth=None, random_state=42)
scores_tree = cross_val_score(single_tree, X, y, cv=5)
print(f"Single Tree: {scores_tree.mean():.3f} ± {scores_tree.std():.3f}")
# Bagging: 100 trees, each on a bootstrap sample of rows and a random feature subset
bagging = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=100,
max_samples=0.8, # 80% of data per tree
max_features=0.8, # 80% of features per tree
bootstrap=True, # Sample with replacement (bootstrap)
random_state=42
)
scores_bag = cross_val_score(bagging, X, y, cv=5)
print(f"Bagging (100 trees): {scores_bag.mean():.3f} ± {scores_bag.std():.3f}")
# Random Forest: Bagging + random feature subset at each split
rf = RandomForestClassifier(
n_estimators=100,
max_features='sqrt', # sqrt(p) features per split (key difference from bagging)
max_depth=None, # Full trees (bias reduced; variance reduced by averaging)
min_samples_split=2,
bootstrap=True,
oob_score=True, # Out-of-bag score (free validation set from unused samples)
random_state=42
)
rf.fit(X, y)
print(f"Random Forest OOB: {rf.oob_score_:.3f}")
scores_rf = cross_val_score(rf, X, y, cv=5)
print(f"Random Forest CV: {scores_rf.mean():.3f} ± {scores_rf.std():.3f}")
# Feature importances from Random Forest (impurity-based)
importances = rf.feature_importances_
top5 = np.argsort(importances)[-5:]
print(f"Top 5 features: {top5} with importance {importances[top5].round(3)}")
Boosting — AdaBoost and Gradient Boosting
AdaBoost, Gradient Boosting, XGBoost comparison
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score
# ADABOOST: Adaptive Boosting
# Idea: misclassified examples get higher weight in next round
# Weak learners: usually shallow decision trees (stumps — depth=1)
adaboost = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1), # "stump"
n_estimators=200,
learning_rate=0.1, # Shrinkage of each tree contribution
random_state=42
)
scores_ada = cross_val_score(adaboost, X, y, cv=5)
print(f"AdaBoost: {scores_ada.mean():.3f} ± {scores_ada.std():.3f}")
# GRADIENT BOOSTING: sklearn
# Idea: each new tree fits the RESIDUALS (negative gradient of loss) of previous ensemble
gbm = GradientBoostingClassifier(
n_estimators=200,
max_depth=4,
learning_rate=0.05, # Lower = better generalisation, needs more trees
subsample=0.8, # Stochastic GB: use 80% of data per tree (reduces variance)
random_state=42
)
scores_gbm = cross_val_score(gbm, X, y, cv=5)
print(f"GradientBoosting: {scores_gbm.mean():.3f} ± {scores_gbm.std():.3f}")
# XGBOOST: eXtreme Gradient Boosting — industry standard
# Faster, regularised, handles missing values, parallel processing
xgb_model = xgb.XGBClassifier(
n_estimators=200,
max_depth=6,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8, # Feature sampling per tree (like Random Forest)
reg_alpha=0.1, # L1 regularisation
reg_lambda=1.0, # L2 regularisation
eval_metric='logloss',
random_state=42,
verbosity=0
)
scores_xgb = cross_val_score(xgb_model, X, y, cv=5)
print(f"XGBoost: {scores_xgb.mean():.3f} ± {scores_xgb.std():.3f}")
# LightGBM: even faster than XGBoost for large datasets
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(
n_estimators=200,
learning_rate=0.05,
num_leaves=31, # Controls tree complexity (not max_depth)
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
verbosity=-1
)
scores_lgb = cross_val_score(lgb_model, X, y, cv=5)
print(f"LightGBM: {scores_lgb.mean():.3f} ± {scores_lgb.std():.3f}")
Stacking — meta-learning
Stacking with cross-validation to prevent leakage
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Stacking: train diverse base models, use their predictions as features
# for a meta-learner (Level 1 model)
base_models = [
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('xgb', xgb.XGBClassifier(n_estimators=100, random_state=42, verbosity=0)),
('svm', SVC(probability=True, random_state=42)),
('lr', LogisticRegression(random_state=42))
]
# Meta-learner: learns which base models to trust for which inputs
stacker = StackingClassifier(
estimators=base_models,
final_estimator=LogisticRegression(), # Meta-learner
cv=5, # Use 5-fold CV to generate base model predictions (prevents leakage)
stack_method='predict_proba',
passthrough=False # True = include original features in meta-learning
)
scores_stack = cross_val_score(stacker, X, y, cv=5)
print(f"Stacking: {scores_stack.mean():.3f} ± {scores_stack.std():.3f}")
# Summary comparison
methods = {'Single Tree': scores_tree, 'Bagging': scores_bag,
'Random Forest': scores_rf, 'AdaBoost': scores_ada,
'GBM': scores_gbm, 'XGBoost': scores_xgb, 'LightGBM': scores_lgb,
'Stacking': scores_stack}
print("\n── Summary ──")
for name, scores in sorted(methods.items(), key=lambda x: -x[1].mean()):
    print(f"{name:<16}: {scores.mean():.4f} ± {scores.std():.4f}")
| Method | Training | Reduces | Best for | Top implementation |
|---|---|---|---|---|
| Bagging | Parallel (independent) | Variance | High-variance models (deep trees) | Random Forest |
| Boosting | Sequential (dependent) | Bias | Weak learners on structured data | XGBoost, LightGBM |
| Stacking | Parallel + meta stage | Both | Maximum performance, diverse base models | StackingClassifier, Kaggle |
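The cv-based stacking in the table can also be built by hand with scikit-learn's `cross_val_predict`, which makes the out-of-fold mechanism explicit. This is a sketch of the idea rather than a rigorous benchmark; a proper evaluation would nest the whole pipeline inside an outer CV, which is what `StackingClassifier` does when scored with `cross_val_score`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

base = RandomForestClassifier(n_estimators=100, random_state=42)

# Out-of-fold probabilities: each row is predicted by a model that never saw it,
# so the meta-learner's input features carry no training-set leakage.
oof = cross_val_predict(base, X, y, cv=5, method='predict_proba')

meta_scores = cross_val_score(LogisticRegression(), oof, y, cv=5)
print(f"Meta-learner on out-of-fold features: {meta_scores.mean():.3f}")
```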
Practice questions
- Why does Random Forest outperform a single decision tree? (Answer: Single decision tree: high variance — small data changes create very different trees. Random Forest averages 100+ decorrelated trees (decorrelated because each uses a random feature subset), dramatically reducing variance while keeping bias low. The average of many noisy unbiased estimators is an unbiased estimator with much lower variance.)
- What is the key difference between AdaBoost and Gradient Boosting? (Answer: AdaBoost: re-weights training examples (misclassified get higher weight in next round). Uses any weak learner. Gradient Boosting: fits new trees to the residuals (negative gradient of loss function) of the current ensemble. More general framework — works with any differentiable loss function.)
- Why does a lower learning_rate in XGBoost generally give better generalisation? (Answer: Lower learning_rate shrinks each tree's contribution, requiring more trees to fit the training data. More trees trained on residuals = finer-grained error correction = smoother decision boundary. Acts like L2 regularisation — prevents any single tree from having too large an influence.)
- What problem does stacking with cross-validation (cv=5) solve? (Answer: Data leakage. If base models are trained on all data and their predictions used as features for the meta-learner, the meta-learner sees data the base models already trained on — biased evaluation. CV stacking generates out-of-fold predictions, ensuring the meta-learner trains on data the base models have never seen.)
- XGBoost vs LightGBM — when would you choose LightGBM? (Answer: LightGBM is significantly faster on large datasets (millions of rows) because it uses histogram-based splitting (buckets features into ~256 bins) instead of exact splitting. Also uses leaf-wise (best-first) tree growth vs XGBoost's depth-wise growth — can achieve lower loss with same number of leaves.)
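The residual-fitting loop described in the answers above is short enough to write out. A from-scratch sketch of gradient boosting with squared loss (toy data, illustrative only, not a production implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

learning_rate = 0.1
pred = np.full_like(y, y.mean())             # F0: constant model (the mean)
for _ in range(100):
    residuals = y - pred                     # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                   # each tree fits what is still wrong
    pred += learning_rate * tree.predict(X)  # shrunken contribution

print(f"MSE before boosting: {np.mean((y - y.mean()) ** 2):.3f}")
print(f"MSE after 100 trees: {np.mean((y - pred) ** 2):.3f}")
```

Lowering learning_rate shrinks each correction, so more rounds are needed to reach the same training fit, which is the learning_rate/n_estimators trade-off discussed in the questions above.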
On LumiChats
XGBoost and LightGBM are used in production ML systems at companies like Google and Facebook, and appear in most winning Kaggle solutions on structured data. LumiChats can help you tune XGBoost hyperparameters, design stacking architectures, and debug overfitting in boosting models.
Try it free