World models are AI systems that learn internal representations of how the physical world works — predicting the next state of an environment given actions within it, rather than predicting the next token in a text sequence. While LLMs model the statistical patterns of language, world models model causality, physics, spatial relationships, and object permanence. In late 2025 and early 2026, world models emerged as the field's most hyped new frontier: Yann LeCun left Meta to launch AMI Labs (seeking €3B valuation), Fei-Fei Li's World Labs shipped Marble, Google DeepMind released Genie 3, and Nvidia's Cosmos platform surpassed 2 million downloads.
The fundamental difference: tokens vs. states
The core distinction between LLMs and world models is what they predict:
| Property | Large Language Model | World Model |
|---|---|---|
| What it predicts | The next token in a text sequence | The next state of an environment given an action |
| Learning signal | Statistical co-occurrence of words across text | Causal dynamics — what happens when you push this object |
| Representation space | Token embeddings in high-dimensional language space | Latent representations of physical state |
| Understanding of physics | None built in — can describe physics fluently without modeling it | Built-in — trained on video and sensor data of real physical interactions |
| Hallucinations | Common — predicts plausible-sounding text, not grounded truth | Rarer — grounded in physical observations, not statistical text patterns |
| Best analogy | Extremely well-read librarian who has read every physics textbook | A child who has played with blocks, water, and gravity for years |
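The distinction in the table can be made concrete as two function signatures. The sketch below uses random, untrained weights purely to show the interfaces (a toy bigram language model vs. a toy state-transition model); the dimensions and weight matrices are illustrative assumptions, not any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 100          # token vocabulary size (toy)
STATE_DIM = 8        # dimensionality of the environment state (toy)
ACTION_DIM = 2       # dimensionality of an action (toy)

W_lm = rng.normal(size=(VOCAB, VOCAB))

def next_token_logits(token_history):
    """LLM-style interface: token history -> scores over the next token."""
    return W_lm[token_history[-1]]          # toy bigram: conditions on the last token only

W_dyn = rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM)) * 0.1

def next_state(state, action):
    """World-model-style interface: (state, action) -> predicted next state."""
    return np.concatenate([state, action]) @ W_dyn

tokens = [3, 17, 42]
logits = next_token_logits(tokens)          # shape (VOCAB,)

s = rng.normal(size=STATE_DIM)
a = np.array([1.0, 0.0])                    # e.g. "push right"
s_next = next_state(s, a)                   # shape (STATE_DIM,)
```

The point is the conditioning signal: the first function only ever sees text; the second is conditioned on an action, which is what lets a world model capture "what happens when you push this object."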
LeCun's critique of LLMs
Yann LeCun has argued publicly for years that LLMs will never achieve general intelligence: "They predict the next word based on statistics, not the next state of the world based on physics." When GPT-4 generates text about a ball rolling down a hill, it is not simulating physics — it is predicting which words typically follow other words. It has no internal model of gravity, friction, or momentum. World models are designed to close this gap.
The 2026 world models race
In the span of a few months bridging late 2025 and early 2026, world models went from a niche research topic to the industry's most-funded frontier:
| Player | Product / Project | Key milestone | Valuation / Investment |
|---|---|---|---|
| AMI Labs (Yann LeCun) | JEPA-based world models | LeCun left Meta (Dec 2025) to found AMI; builds on V-JEPA 2 trained on 1M+ hours of video | €3B valuation pre-product; offices in Paris, NYC, Montreal, Singapore |
| World Labs (Fei-Fei Li) | Marble | Ships Marble (Nov 2025) — generates navigable 3D worlds from text/images/video; users can move through and interact with generated environments | $5B valuation in talks; $230M seed raised in 2024 |
| Google DeepMind | Genie 3 / Project Genie | First real-time interactive world model; generates navigable 3D worlds at 24fps from text prompts; paired with SIMA 2 agent for in-world training | Part of DeepMind (Alphabet) |
| Nvidia | Cosmos platform | Trained on 20M hours of real-world data; 2M+ downloads; three model families (Predict, Transfer, Reason); key infrastructure for robotics AI | $100B+ market cap acceleration from AI adoption |
| Runway | GWM-1 World Model | First world model from a creative AI company; released Dec 2025; targets robotics and gaming beyond its traditional media/VFX market | Est. $4B valuation |
JEPA — LeCun's architecture
AMI Labs is built on Joint Embedding Predictive Architecture (JEPA), developed at Meta. Unlike LLMs that process tokens, JEPA-based models operate in abstract latent spaces and predict how the state of the world changes in response to actions. The key insight: predict in representation space, not pixel space — this avoids the exponential complexity of modeling every visual detail, focusing instead on the semantically meaningful changes.
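The "predict in representation space" idea can be sketched in a few lines. This is a heavily simplified, JEPA-flavoured toy with random untrained linear maps and made-up dimensions — not the actual V-JEPA architecture — but it shows where the loss lives: in a 32-dim latent space rather than the 4096-dim pixel space.

```python
import numpy as np

rng = np.random.default_rng(1)

OBS_DIM, LATENT_DIM, ACTION_DIM = 64 * 64, 32, 4   # toy sizes (assumptions)

W_enc = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.01        # shared encoder (stand-in)
W_pred = rng.normal(size=(LATENT_DIM + ACTION_DIM, LATENT_DIM)) * 0.1  # latent predictor

def encode(obs):
    """Map a high-dimensional observation to a compact latent representation."""
    return obs @ W_enc

def predict(z, action):
    """Predict the *embedding* of the next observation, never its pixels."""
    return np.concatenate([z, action]) @ W_pred

obs_t = rng.normal(size=OBS_DIM)       # observation at time t (e.g. a video frame)
obs_t1 = rng.normal(size=OBS_DIM)      # observation at time t+1
action = rng.normal(size=ACTION_DIM)

z_t, z_t1 = encode(obs_t), encode(obs_t1)
z_pred = predict(z_t, action)

# The training signal is distance in latent space, not pixel reconstruction.
latent_loss = np.mean((z_pred - z_t1) ** 2)
```

Because the predictor never has to reproduce textures, lighting, or other pixel-level detail, it is free to spend its capacity on the semantically meaningful changes the text describes.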
Why world models matter — real applications
| Application | How world models help | Who is doing it |
|---|---|---|
| Robotics training | Generate infinite simulated environments for robot training without physical hardware; simulate rare or dangerous scenarios safely | Figure AI, Agility Robotics, 1X — all using Nvidia Cosmos |
| Autonomous vehicles | Simulate rare edge cases (ice, accidents, unusual pedestrian behavior) that are dangerous or rare in real-world data collection | Waymo, Wayve (GAIA-2 model), Uber, XPENG using Cosmos |
| Video game development | Generate reactive, physically consistent 3D game worlds from text; procedural generation with real physics | Google Project Genie demos, Iconic AI-native game engine |
| AR / VR / Spatial computing | Maintain coherent 4D (3D + time) models of the user's environment for stable AR overlays; predict object movement | Apple Vision Pro content pipelines, Meta Orion research |
| Scientific simulation | Simulate protein folding dynamics, fluid dynamics, material properties — with faster-than-physics-engine speed | DeepMind AlphaFold successors, Runway scientific models |
| Medical / surgical AI | Simulate surgical procedures; train surgical robots without human patients; predict treatment outcomes in 3D | AMI Labs / Nabla partnership focus area |
For students: where to start
World models are a frontier research area — most of the best work is in papers, not products. Start with: (1) DreamerV3 (Hafner et al., 2023) — the most complete open-source world model for RL tasks; (2) Nvidia Cosmos — download and experiment with the open models; (3) Genie 3 technical report from DeepMind; (4) LeCun's 2022 position paper "A Path Towards Autonomous Machine Intelligence" (available free) — the theoretical blueprint for everything AMI Labs is building.
Practice questions
- What is the difference between a model-free and model-based reinforcement learning agent? (Answer: Model-free: learns a policy (what to do) or value function (how good is each state) directly from experience, without modelling the environment dynamics. Simple but sample-inefficient — needs many environment interactions. Model-based: explicitly learns a transition model P(s' | s, a) (what happens when action a is taken in state s). Can plan by simulating future trajectories without real environment interaction. Sample-efficient but requires accurate world model. World models aim to give RL agents model-based efficiency.)
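The model-based recipe in that answer (learn P(s' | s, a), then plan by simulation) can be sketched on a toy 5-state chain. The environment, counts-based model, and policy here are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

N_STATES, N_ACTIONS = 5, 2
counts = np.ones((N_STATES, N_ACTIONS, N_STATES))  # Laplace-smoothed transition counts

def true_step(s, a):
    """Hidden environment: action 0 moves left, action 1 moves right (10% noise)."""
    move = -1 if a == 1 else 1
    move = -move  # action 1 -> +1, action 0 -> -1
    if rng.random() < 0.1:
        move = -move
    return int(np.clip(s + move, 0, N_STATES - 1))

# Learn the model from real experience (the only place the real env is touched).
for _ in range(2000):
    s, a = rng.integers(N_STATES), rng.integers(N_ACTIONS)
    counts[s, a, true_step(s, a)] += 1

P = counts / counts.sum(axis=2, keepdims=True)     # empirical P[s, a, s']

def imagine_rollout(s, policy, horizon=10):
    """Simulate a trajectory inside the learned model — no real interaction."""
    traj = [s]
    for _ in range(horizon):
        a = policy(s)
        s = rng.choice(N_STATES, p=P[s, a])
        traj.append(int(s))
    return traj

traj = imagine_rollout(0, policy=lambda s: 1)       # always "move right"
```

A model-free agent would need fresh environment steps for every policy it evaluates; here, once P is fitted, arbitrarily many trajectories can be imagined for free — the sample-efficiency argument in the answer above.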
- What is DreamerV3 and how does it use a world model? (Answer: DreamerV3 (Hafner 2023): learns a compact world model in latent space — a Recurrent State Space Model (RSSM) that predicts latent states from current latent state and action. The agent is trained ENTIRELY within imagined rollouts from this world model — never directly interacting with the real environment during policy training. Environment interaction only updates the world model. This enables DreamerV3 to master diverse tasks (Minecraft, robot locomotion, classic games) with orders of magnitude fewer real environment steps than model-free RL.)
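The "train entirely inside imagination" loop can be caricatured as follows. This is not the real RSSM — the dynamics, reward head, and random-search policy update are toy stand-ins (DreamerV3 backpropagates through the rollout instead) — but the control flow matches the answer: the policy only ever sees imagined latent rollouts.

```python
import numpy as np

rng = np.random.default_rng(3)

LATENT_DIM, ACTION_DIM, HORIZON = 8, 2, 15          # toy sizes (assumptions)

W_dyn = rng.normal(size=(LATENT_DIM + ACTION_DIM, LATENT_DIM)) * 0.1  # "learned" dynamics
W_rew = rng.normal(size=LATENT_DIM) * 0.1                             # "learned" reward head
W_pol = rng.normal(size=(LATENT_DIM, ACTION_DIM)) * 0.1               # policy being trained

def imagine(z):
    """Roll the policy forward inside the world model; sum imagined rewards."""
    ret = 0.0
    for _ in range(HORIZON):
        a = np.tanh(z @ W_pol)                       # policy acts on the latent state
        z = np.tanh(np.concatenate([z, a]) @ W_dyn)  # model predicts the next latent
        ret += z @ W_rew                             # model predicts the reward
    return ret

# Crude policy improvement by random search over imagined returns.
z0 = rng.normal(size=LATENT_DIM)
best = imagine(z0)
for _ in range(50):
    W_pol_old, W_pol = W_pol, W_pol + rng.normal(size=W_pol.shape) * 0.05
    r = imagine(z0)
    if r <= best:
        W_pol = W_pol_old      # revert: the imagined return did not improve
    else:
        best = r
```

Note what is absent: no call to a real environment anywhere in the improvement loop. In DreamerV3, real interaction happens only to refit W_dyn and W_rew.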
- Why are world models important for safety in autonomous systems? (Answer: An autonomous car without a world model must learn purely from real experience — including crashes. A car with a world model can simulate thousands of dangerous scenarios internally without real risk, test 'what if I miss the red light?' in simulation before ever encountering it, plan by rolling out multiple candidate trajectories and choosing the safest, and predict other agents' behaviours. Real-world failures are catastrophic; a world model lets safety-critical scenarios be explored in imagination.)
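The "roll out candidates and choose the safest" step can be illustrated with a toy 1-D braking problem. All numbers here are invented (a hand-written kinematics function stands in for a learned world model): three candidate braking strengths are rolled out in imagination and the one that stays clear of an obstacle at x = 10 is selected.

```python
import numpy as np

OBSTACLE_X, DANGER_MARGIN, DT, HORIZON = 10.0, 1.0, 0.1, 30   # toy parameters

def rollout(v0, brake):
    """Predict future positions under constant braking (world-model stand-in)."""
    x, v, xs = 0.0, v0, []
    for _ in range(HORIZON):
        v = max(v - brake * DT, 0.0)   # decelerate, never reverse
        x += v * DT
        xs.append(x)
    return np.array(xs)

candidates = [0.5, 2.0, 5.0]           # candidate braking strengths to imagine

def risk(brake):
    """1.0 if the imagined trajectory enters the danger zone, else 0.0."""
    xs = rollout(v0=8.0, brake=brake)
    return float(np.max(xs) > OBSTACLE_X - DANGER_MARGIN)

safest = min(candidates, key=risk)     # pick the plan that stays safe in imagination
```

The dangerous options are rejected without the car ever experiencing a collision — the whole comparison happens inside the model.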
- How does the concept of a 'mental model' in cognitive science relate to AI world models? (Answer: Cognitive science: humans maintain mental models of physics, social relationships, causality, and others' mental states. We plan actions by mentally simulating their consequences. Johnson-Laird (1983): mental models are the basis of reasoning and language understanding. AI world models operationalise this: a neural network that represents environment dynamics enables planning by simulation. The connection is deep — both biological and artificial agents that model their environment before acting are more adaptive and efficient than reactive systems.)
- What is a 'latent space world model' and why is it more efficient than pixel-space models? (Answer: Pixel-space world model: learns to predict future video frames at full pixel resolution — computationally expensive (high-dimensional output, each step generates thousands of pixels). Latent space world model: compress observation to compact latent representation via VAE/encoder, model dynamics in latent space (small vectors), decode only for visualisation. DreamerV3's RSSM models 32-dimensional latent states. Planning and policy learning happen in this compact space — 100-1000× fewer computations than pixel-space modelling.)
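The efficiency claim in that answer is easy to check with back-of-envelope arithmetic. Using a small 64x64 RGB frame (a hypothetical size) and the 32-dim latent mentioned above, counting only output dimensions per prediction step:

```python
# Back-of-envelope comparison of per-step prediction targets (toy frame size).
H, W, C = 64, 64, 3            # hypothetical small video frame
LATENT = 32                    # compact latent, per the DreamerV3 RSSM note above

pixel_dim = H * W * C          # 12,288 values to predict per frame in pixel space
ratio = pixel_dim / LATENT     # how many times more outputs pixel space demands

print(f"pixel-space prediction has {ratio:.0f}x more outputs per step")
```

That is 384x fewer prediction targets per step before accounting for the cost of modelling pixel-level detail, consistent with the 100-1000x range quoted in the answer.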