DeepSeek is a Chinese AI research lab (founded 2023, backed by High-Flyer Capital) whose 2024–2025 model releases shocked the AI industry. DeepSeek V3 (December 2024) matched or exceeded GPT-4o on most benchmarks while reportedly costing only $5.6 million to train, compared to estimates of $100M+ for GPT-4. DeepSeek R1 (January 2025) matched OpenAI's o1 on math and coding reasoning while releasing its weights openly. R1's release triggered a global stock-market selloff and forced every major AI lab to revisit its cost assumptions.
Multi-Head Latent Attention (MLA): DeepSeek's architectural innovation
Standard transformer attention stores one Key-Value (KV) cache entry per attention head per token — this grows linearly with context length and becomes the memory bottleneck for long-context inference. DeepSeek V3 uses Multi-Head Latent Attention (MLA), which compresses the KV cache into a low-dimensional latent vector, then reconstructs the per-head keys and values via learned up-projection matrices. This reduces KV cache memory by ~13.5x.
MLA in one line: down-project the hidden state h_t to a low-rank latent c_t^KV = W^DKV h_t, then reconstruct the per-head keys k_t = W^UK c_t^KV and values v_t = W^UV c_t^KV via learned up-projection matrices. Only c_t^KV is cached, never the full K and V.
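The compress-then-reconstruct step can be sketched in a few lines of NumPy. All dimensions here are toy values chosen for illustration, not DeepSeek's real config, and the weight names (`W_DKV`, `W_UK`, `W_UV`) follow the notation above:

```python
import numpy as np

# Toy MLA sketch: cache one small latent per token instead of full per-head K/V.
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128  # illustrative sizes

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_model, d_latent)) * 0.02            # down-projection
W_UK  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection to keys
W_UV  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection to values

h_t  = rng.standard_normal(d_model)            # hidden state for one token
c_kv = h_t @ W_DKV                             # cached latent: d_latent floats
k_t  = (c_kv @ W_UK).reshape(n_heads, d_head)  # reconstructed per-head keys
v_t  = (c_kv @ W_UV).reshape(n_heads, d_head)  # reconstructed per-head values

full_cache = 2 * n_heads * d_head  # MHA caches K and V for every head
mla_cache  = d_latent              # MLA caches only c_kv
print(f"cache floats/token: MHA={full_cache}, MLA={mla_cache}, "
      f"ratio={full_cache / mla_cache:.1f}x")
# → cache floats/token: MHA=2048, MLA=128, ratio=16.0x
```

With these toy dimensions the latent cache is 16x smaller per token; the actual savings depend on the model's real head count and latent width.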
| Model | Attention type | KV cache per token | Context memory |
|---|---|---|---|
| GPT-4 | Multi-head (MHA) | Full per head | High |
| LLaMA 3 | Grouped-query (GQA) | Shared across groups | Medium |
| DeepSeek V3 | Multi-head Latent (MLA) | Compressed latent | Very low |
Why $5.6M training cost matters — and the catch
The $5.6M figure covers only the final pre-training run: 2,048 H800 GPUs over roughly two months. It excludes research and experimentation, training runs for earlier models (V1, V2, Coder), infrastructure, salaries, and Nvidia hardware purchases. Still, DeepSeek showed that algorithmic efficiency improvements (MLA, MoE routing, FP8 training) can dramatically compress training costs, a finding with significant implications for AI lab economics globally.
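The headline number reconciles with DeepSeek's own accounting in the V3 technical report: roughly 2.788 million H800 GPU-hours priced at a nominal $2 per GPU-hour rental rate:

```python
gpu_hours = 2.788e6   # H800 GPU-hours for the final V3 pre-training run (as reported)
rate = 2.0            # nominal rental price used in the report, USD per GPU-hour
cost = gpu_hours * rate
print(f"${cost / 1e6:.3f}M")  # → $5.576M, the headline ~$5.6M figure

# Sanity check against the "2,048 GPUs for ~2 months" framing:
days = gpu_hours / 2048 / 24
print(f"≈ {days:.0f} days on 2,048 GPUs")  # → ≈ 57 days
```

Note this is a priced-out GPU-hour total, not a cash outlay: DeepSeek owns its cluster, so the $2/hour rate is an accounting convention, which is exactly why the exclusions above matter.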
FP8 training
DeepSeek V3 used FP8 (8-bit floating point) mixed-precision training throughout — a technique most labs avoided due to numerical instability. DeepSeek developed custom stability techniques that made FP8 training viable at scale, roughly halving training compute vs BF16.
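One of the reported stability techniques is fine-grained scaling: quantizing small tiles of a tensor independently so one outlier value doesn't crush the precision of everything else. The sketch below simulates that idea in NumPy. It is a rough illustration only: the `quantize_e4m3_sim` function, the tile size, and the 3-significant-bit rounding are my simplifications, not DeepSeek's actual kernels:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_e4m3_sim(x, tile=128):
    """Simulate fine-grained FP8 quantization: scale each `tile`-wide block
    to the E4M3 range independently, then round to ~3 mantissa bits.
    (Sketch only; real E4M3 rounding differs in detail.)"""
    shape = x.shape
    xt = x.reshape(-1, tile)
    scales = np.abs(xt).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = xt / scales
    # Crude stand-in for E4M3's 3-bit mantissa: round to 3 significant bits.
    exp = np.floor(np.log2(np.maximum(np.abs(q), 1e-9)))
    step = 2.0 ** (exp - 3)
    q = np.round(q / step) * step
    return (q * scales).reshape(shape)

x = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
xq = quantize_e4m3_sim(x)
rel_err = np.abs(xq - x).max() / np.abs(x).max()
print(f"max relative error: {rel_err:.4f}")
```

The per-tile scaling is the key point: with one global scale, a single large activation would force every other value into the format's coarsest region, which is a major source of the instability that kept other labs on BF16.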