Federated learning is a machine learning paradigm where a model is trained across multiple decentralised devices or servers — each holding local data — without the raw data ever leaving the device. Instead of sending data to a central server, each participant trains a local model on its own data and sends only the model updates (gradients or weights) to a central aggregator, which combines them into a global model update. Federated learning enables training on sensitive data (medical records, personal messages, financial transactions) while preserving privacy and complying with data localisation regulations.
How federated learning works
- Initialisation: A central server distributes the current global model to all participating clients (devices or institutions).
- Local training: Each client trains the model on its local data for E epochs, producing updated local weights.
- Upload gradients/weights: Each client sends only the model update (difference between local weights and global weights) to the server — not the raw training data.
- Aggregation: The server aggregates updates from all clients, typically using FedAvg (weighted average of updates proportional to local dataset sizes).
- Distribution: The aggregated global model is redistributed and the cycle repeats.
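The cycle above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production system: `local_train` is a hypothetical stand-in for E epochs of local SGD, using a toy quadratic objective whose minimum is each client's data mean.

```python
import numpy as np

def local_train(global_w, data, epochs=1, lr=0.1):
    """Stand-in for local training: pull weights towards the client's data mean."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = w - data.mean(axis=0)  # gradient of a toy quadratic loss
        w -= lr * grad
    return w

def fedavg_round(global_w, client_datasets):
    """One FedAvg round: collect local updates, average them weighted by dataset size."""
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    updates = [local_train(global_w, d) - global_w for d in client_datasets]
    weights = sizes / sizes.sum()  # weight proportional to local dataset size
    aggregated = sum(p * u for p, u in zip(weights, updates))
    return global_w + aggregated

# Two clients with different data distributions and very different sizes.
rng = np.random.default_rng(0)
clients = [rng.normal(loc=m, size=(n, 2)) for m, n in [(0.0, 1000), (5.0, 100)]]
w = np.zeros(2)
for _ in range(50):
    w = fedavg_round(w, clients)
```

With this toy objective the global model converges towards the size-weighted average of the clients' optima, which is exactly the behaviour FedAvg's weighting is designed to produce.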
| Property | Traditional centralised training | Federated learning |
|---|---|---|
| Data location | All data sent to central server | Data stays on device — never transmitted |
| Privacy | Raw data exposed to and trusted with the central server | Raw data private by design; gradient leakage still possible |
| Communication cost | Data upload once | Model updates sent each round — potentially many rounds |
| Data heterogeneity | IID data (shuffled from one pool) | Non-IID: each device has different data distribution |
| Stragglers | Not relevant — central GPU cluster | Slow devices delay training; need asynchronous strategies |
Real deployments in 2026
Google uses federated learning for keyboard next-word prediction, autocorrect, and voice recognition on Android, training across large fleets of phones without ever seeing individual users' typed words. Apple uses it for Safari suggestions and Siri improvements on iOS. In healthcare, federated learning lets hospital networks collaboratively train diagnostic models on patient data without any hospital sharing records with the others. Federated learning is also frequently cited as a privacy-preserving technique in compliance discussions around regulations such as the EU AI Act and India's DPDP Act.
Limitations and active research areas
- Gradient leakage: Gradients sent to the aggregator can be used to reconstruct training data samples with surprising fidelity. Differential privacy (adding calibrated noise to gradients) mitigates this at the cost of model quality.
- Non-IID data: In real deployments, each device's data distribution can differ substantially from the others'. Standard FedAvg converges poorly on highly non-IID data, which remains an active research problem.
- Communication overhead: Frontier models have billions of parameters. Sending full gradient updates each round is bandwidth-prohibitive. Gradient compression, quantisation, and sparse updates are active research areas.
- Byzantine robustness: A malicious participant can send adversarial gradients to poison the global model. Robust aggregation algorithms (Krum, FLTrust) detect and exclude outlier updates.
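The differential-privacy mitigation mentioned above can be sketched as client-side clipping plus Gaussian noise. The values of `clip_norm` and `noise_multiplier` here are illustrative assumptions, not calibrated to any real privacy budget.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip the update's L2 norm, then add calibrated Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

u = np.array([3.0, 4.0])                    # raw update, L2 norm 5
priv = privatize_update(u, rng=np.random.default_rng(42))
```

Clipping bounds any single example's influence on the update; the noise then masks individual contributions at some cost to model quality, as noted above.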
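The simplest of the robust aggregation ideas above, the coordinate-wise median, is easy to demonstrate: a single poisoned update can drag a plain mean arbitrarily far, but barely moves the median. This is a toy sketch, not a full Krum or FLTrust implementation.

```python
import numpy as np

def robust_aggregate(updates):
    """Coordinate-wise median: resistant to a minority of outlier updates."""
    return np.median(np.stack(updates), axis=0)

honest = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([0.9, 1.1])]
poisoned = np.array([1e6, -1e6])            # a malicious client's update
agg = robust_aggregate(honest + [poisoned])
mean = np.mean(np.stack(honest + [poisoned]), axis=0)
```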
Practice questions
- In FedAvg, client A has 1000 examples and client B has 100 examples. How are their updates weighted in aggregation? (Answer: FedAvg weights updates proportionally to dataset size. Client A's update gets weight 1000/(1000+100) ≈ 0.909; client B's gets weight 100/1100 ≈ 0.091. With a single local gradient step per round, this weighting makes the aggregated update equal to the gradient over the pooled dataset; with multiple local epochs it is only an approximation.)
- What is the non-IID problem in federated learning and why does it matter? (Answer: Non-IID (non-independently-identically-distributed): different clients have different data distributions. A keyboard FL client in Paris has French text; one in Tokyo has Japanese text. Their local gradients point in very different directions. FedAvg with non-IID data can diverge or converge to a poor global minimum. Strategies: FedProx (proximal term keeps local model close to global), SCAFFOLD (variance reduction), MOON (contrastive learning).)
- Why can gradient-only transmission still leak private information? (Answer: Gradient inversion attacks (Zhu et al. 2019): an adversarial server can reconstruct the original training data from gradients, especially for small batch sizes. The gradients of the loss with respect to the model weights encode enough information about the inputs to approximately reconstruct them. Defences: gradient compression (reducing information in updates), differential-privacy noise addition, and secure aggregation (the server never sees individual updates, only the aggregate).)
- Google uses federated learning for keyboard autocorrect on Android. Why not just collect keystroke data centrally? (Answer: Keystroke data is extremely private — it captures everything users type including passwords, medical searches, financial information, and personal messages. User privacy expectations and legal requirements (GDPR Article 5 data minimisation) make centralised collection problematic. FL allows Google to improve autocorrect quality from billions of devices while never transmitting typed text to Google servers.)
- What is the Byzantine fault tolerance problem in federated learning? (Answer: In federated settings, some clients may be adversarial (Byzantine clients), sending malicious gradients designed to poison the global model. Since the server cannot verify the integrity of updates from untrusted devices, a small fraction of malicious clients can corrupt training. Defences: robust aggregation methods (median, trimmed mean, Krum) that resist outlier updates, rather than simple averaging.)
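The arithmetic in the first practice question is worth checking explicitly: FedAvg weights each client's update by its share of the total example count.

```python
# Worked arithmetic for the FedAvg weighting question.
n_a, n_b = 1000, 100
w_a = n_a / (n_a + n_b)   # client A's aggregation weight
w_b = n_b / (n_a + n_b)   # client B's aggregation weight
```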
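The secure-aggregation defence mentioned in the gradient-leakage answer can be illustrated with pairwise masking. This is a toy sketch of the core cancellation idea only; real protocols (e.g. Bonawitz et al.'s secure aggregation) add key agreement, secret sharing, and dropout handling.

```python
import numpy as np

# Two clients agree on a shared random mask out-of-band. One adds it,
# the other subtracts it, so the masks cancel in the server's sum and
# the server only ever sees masked individual vectors.
rng = np.random.default_rng(7)
u1, u2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
mask = rng.normal(size=2)      # shared secret between clients 1 and 2
sent1 = u1 + mask              # what the server receives from client 1
sent2 = u2 - mask              # what the server receives from client 2
aggregate = sent1 + sent2      # masks cancel: equals u1 + u2 exactly
```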