DeepSeek is a Chinese AI research lab (founded 2023, backed by High-Flyer Capital) whose 2024–2025 model releases shocked the AI industry. DeepSeek V3 (December 2024) matched or exceeded GPT-4o on most benchmarks while reportedly costing only $5.6 million to train, compared to estimates of $100M+ for GPT-4. DeepSeek R1 (January 2025) matched OpenAI's o1 on math and coding reasoning while releasing its weights openly. R1's release triggered a global stock-market selloff and forced every major AI lab to revisit its cost assumptions.
Multi-Head Latent Attention (MLA): DeepSeek's architectural innovation
Standard transformer attention stores one Key-Value (KV) cache entry per attention head per token — this grows linearly with context length and becomes the memory bottleneck for long-context inference. DeepSeek V3 uses Multi-Head Latent Attention (MLA), which compresses the KV cache into a low-dimensional latent vector, then reconstructs the per-head keys and values via learned up-projection matrices. This reduces KV cache memory by ~13.5x.
MLA in one line: down-project the hidden state h_t to a low-rank latent c_t^KV = W^DKV h_t, then reconstruct the per-head keys k_t = W^UK c_t^KV and values v_t = W^UV c_t^KV via learned up-projection matrices. Only c_t^KV is cached, never the full K and V.
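The compress-then-reconstruct step can be sketched in a few lines of NumPy. All dimensions here are toy values chosen for illustration, not DeepSeek's real config, and the weight names (`W_DKV`, `W_UK`, `W_UV`) follow the notation above:

```python
import numpy as np

# Toy MLA sketch: cache one small latent per token instead of full per-head K/V.
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128  # illustrative sizes

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_model, d_latent)) * 0.02            # down-projection
W_UK  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection to keys
W_UV  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection to values

h_t  = rng.standard_normal(d_model)            # hidden state for one token
c_kv = h_t @ W_DKV                             # cached latent: d_latent floats
k_t  = (c_kv @ W_UK).reshape(n_heads, d_head)  # reconstructed per-head keys
v_t  = (c_kv @ W_UV).reshape(n_heads, d_head)  # reconstructed per-head values

full_cache = 2 * n_heads * d_head  # MHA caches K and V for every head
mla_cache  = d_latent              # MLA caches only c_kv
print(f"cache floats/token: MHA={full_cache}, MLA={mla_cache}, "
      f"ratio={full_cache / mla_cache:.1f}x")
# → cache floats/token: MHA=2048, MLA=128, ratio=16.0x
```

With these toy dimensions the latent cache is 16x smaller per token; the actual savings depend on the model's real head count and latent width.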
| Model | Attention type | KV cache per token | Context memory |
|---|---|---|---|
| GPT-4 | Multi-head (MHA) | Full per head | High |
| LLaMA 3 | Grouped-query (GQA) | Shared across groups | Medium |
| DeepSeek V3 | Multi-head Latent (MLA) | Compressed latent | Very low |
Why $5.6M training cost matters — and the catch
The $5.6M figure covers only the final pre-training run: 2,048 H800 GPUs over roughly two months. It excludes research and experimentation, training runs for earlier models (V1, V2, Coder), infrastructure, salaries, and Nvidia hardware purchases. Still, DeepSeek showed that algorithmic efficiency improvements (MLA, MoE routing, FP8 training) can dramatically compress training costs, a finding with significant implications for AI lab economics globally.
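The headline number reconciles with DeepSeek's own accounting in the V3 technical report: roughly 2.788 million H800 GPU-hours priced at a nominal $2 per GPU-hour rental rate:

```python
gpu_hours = 2.788e6   # H800 GPU-hours for the final V3 pre-training run (as reported)
rate = 2.0            # nominal rental price used in the report, USD per GPU-hour
cost = gpu_hours * rate
print(f"${cost / 1e6:.3f}M")  # → $5.576M, the headline ~$5.6M figure

# Sanity check against the "2,048 GPUs for ~2 months" framing:
days = gpu_hours / 2048 / 24
print(f"≈ {days:.0f} days on 2,048 GPUs")  # → ≈ 57 days
```

Note this is a priced-out GPU-hour total, not a cash outlay: DeepSeek owns its cluster, so the $2/hour rate is an accounting convention, which is exactly why the exclusions above matter.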
FP8 training
DeepSeek V3 used FP8 (8-bit floating point) mixed-precision training throughout — a technique most labs avoided due to numerical instability. DeepSeek developed custom stability techniques that made FP8 training viable at scale, roughly halving training compute vs BF16.
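One of the reported stability techniques is fine-grained scaling: quantizing small tiles of a tensor independently so one outlier value doesn't crush the precision of everything else. The sketch below simulates that idea in NumPy. It is a rough illustration only: the `quantize_e4m3_sim` function, the tile size, and the 3-significant-bit rounding are my simplifications, not DeepSeek's actual kernels:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_e4m3_sim(x, tile=128):
    """Simulate fine-grained FP8 quantization: scale each `tile`-wide block
    to the E4M3 range independently, then round to ~3 mantissa bits.
    (Sketch only; real E4M3 rounding differs in detail.)"""
    shape = x.shape
    xt = x.reshape(-1, tile)
    scales = np.abs(xt).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = xt / scales
    # Crude stand-in for E4M3's 3-bit mantissa: round to 3 significant bits.
    exp = np.floor(np.log2(np.maximum(np.abs(q), 1e-9)))
    step = 2.0 ** (exp - 3)
    q = np.round(q / step) * step
    return (q * scales).reshape(shape)

x = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
xq = quantize_e4m3_sim(x)
rel_err = np.abs(xq - x).max() / np.abs(x).max()
print(f"max relative error: {rel_err:.4f}")
```

The per-tile scaling is the key point: with one global scale, a single large activation would force every other value into the format's coarsest region, which is a major source of the instability that kept other labs on BF16.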