Edge AI refers to the deployment of AI models directly on end-user devices — smartphones, laptops, tablets, IoT sensors, and embedded systems — rather than sending data to cloud servers for processing. In 2026, every major smartphone chipmaker (Apple with Neural Engine, Qualcomm with Hexagon NPU, MediaTek with APU) ships dedicated AI accelerator hardware, enabling on-device inference of capable language, vision, and audio models. On-device AI offers lower latency, privacy preservation, and offline functionality at the cost of reduced model capability compared to cloud inference.
## The hardware enabling on-device AI in 2026
| Chip | Device | AI performance | On-device LLM capability |
|---|---|---|---|
| Apple A18 Pro (Neural Engine) | iPhone 16 Pro / iPad Pro M4 | 38 TOPS (tera-ops/sec) | Gemma 2 9B, Llama 3.2 11B at full quality |
| Apple M4 (Neural Engine) | MacBook Pro, iPad Pro | 38 TOPS neural engine + 120 TFLOPS GPU | Llama 3 70B with quantisation; Apple's Private Cloud Compute for larger |
| Qualcomm Snapdragon 8 Elite (Hexagon NPU) | Android flagship phones | 75 TOPS | Llama 3.2 3B–11B, Gemma 2 2B at high speed |
| MediaTek Dimensity 9400 (APU) | Mid-high Android phones | 50 TOPS | Gemma 2 2B, Phi-3 Mini comfortably |
| NVIDIA Orin (automotive/edge) | Cars, robots, edge servers | 275 TOPS | Full 7B–13B models at production latency |
The key enabler: quantisation. A Llama 3 8B model at FP16 requires 16 GB of memory — too large for most phones. At INT4 quantisation (4-bit), the same model requires ~4.5 GB and fits comfortably in a modern smartphone's unified memory. Apple's Neural Engine and Qualcomm's Hexagon NPU both have hardware acceleration for 4-bit integer operations, making quantised on-device inference extremely efficient.
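The arithmetic above can be checked with a small sizing helper. This is a rough sketch that counts weight storage only, ignoring the KV cache and activations; the 4.5 bits/weight figure is an illustrative value that folds in typical quantisation-group overhead:

```python
def weight_memory_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a model with
    n_params_b billion parameters at the given precision."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(8, 16))   # FP16 Llama 3 8B -> 16.0 GB
print(weight_memory_gb(8, 4.5))  # ~4-bit with grouping overhead -> 4.5 GB
```

The same helper shows why INT8 (8 GB for an 8B model) is still too large for many phones, while 4-bit fits.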
## Privacy advantages of on-device AI
- Zero data transmission: Processing happens on-device — your text, images, voice, and documents never leave your hardware. No cloud logs, no training data collection, no exposure to third-party infrastructure.
- Apple Private Cloud Compute: For tasks that exceed on-device capability, Apple's system offloads computation to cloud servers whose software stack is open to independent security audit and designed not to retain user data. The software actually running on PCC servers can be externally verified. This is the architecture behind the new Siri with Gemini.
- Offline functionality: On-device AI works without internet connection — useful for remote areas, privacy-sensitive applications, and use cases requiring air-gapped operation.
- Medical and legal data: Healthcare and legal applications where data cannot leave the premises can use on-device AI without cloud compliance overhead.
## The capability gap is closing fast
In 2023, on-device models were clearly inferior to cloud models for complex tasks. In 2026, Apple's on-device Llama 3.2 11B and Google's Gemma 3 12B running on smartphones handle the majority of real-world tasks — email drafting, document summarisation, translation, code explanation — at quality indistinguishable from small cloud models. The remaining gap is at the frontier: tasks requiring GPT-5.4 or Claude Sonnet 4.6-level capability still need cloud inference. But the 'good enough on-device' threshold has risen dramatically.
## Practice questions
- A 7B LLM at FP16 requires 14 GB. What quantisation is needed for a phone with 6 GB unified memory? (Answer: INT4 quantisation reduces each weight to 0.5 bytes. 7B × 0.5 = 3.5 GB base weight storage + ~1 GB KV cache and activations ≈ 4.5–5 GB. This fits within 6 GB. Most on-device LLM frameworks (llama.cpp, MLX, ONNX Runtime) use 4-bit or 4.5-bit quantisation for flagship phone deployment.)
- What are three scenarios where on-device AI is preferable to cloud AI despite lower model capability? (Answer: 1) Privacy-sensitive tasks (medical symptoms, financial data, personal documents) where users do not want data leaving their device. 2) Offline environments (remote areas, flights, secure facilities) where cloud connectivity is unavailable. 3) Ultra-low latency requirements (real-time voice, on-device autocorrect) where network round-trip adds unacceptable delay.)
- Apple Neural Engine achieves 38 TOPS. What does TOPS mean and why does it matter for AI? (Answer: TOPS = Tera Operations Per Second — 10¹² integer/floating-point operations per second. AI inference is dominated by matrix multiplications, which decompose into large numbers of multiply-accumulate operations. Higher TOPS means the chip can run larger models or the same model at lower latency. For comparison, an NVIDIA H100 delivers 3958 TOPS at INT8 — roughly 100× more compute, but at ~700 W versus a phone NPU's ~5 W.)
- What is the simulation-to-real gap in autonomous driving AI? (Answer: Models trained in simulation (synthetic sensor data from game engines like CARLA) may not transfer to real sensors due to differences in noise patterns, lighting, weather, and physics accuracy. The simulated camera/LiDAR/radar data looks realistic but subtly differs from real sensor physics. This causes models trained entirely in simulation to underperform when deployed in real vehicles.)
- Why is Apple's Private Cloud Compute architecturally significant for AI privacy? (Answer: Apple PCC provides a verifiable privacy guarantee for cloud AI processing: computations run on servers whose software stack is publicly auditable, no user data is retained after computation, and cryptographic attestation proves the claimed software is actually running. Unlike standard cloud AI (where the provider has full access), PCC gives users technical verification of privacy promises rather than relying only on contractual commitments.)
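The memory and throughput arithmetic behind the first and third practice questions can be sketched in Python. The KV-cache configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the 30% NPU utilisation are illustrative assumptions, not measured figures for any specific chip:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB: one K and one V tensor per layer,
    each ctx_len x n_kv_heads x head_dim, at FP16 by default."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem / 1e9

def ms_per_token(n_params_b: float, tops: float,
                 utilisation: float = 0.3) -> float:
    """Compute-bound decode latency estimate: ~2 ops
    (one multiply-accumulate) per weight per generated token."""
    ops = 2 * n_params_b * 1e9
    return ops / (tops * 1e12 * utilisation) * 1e3

# Illustrative 7B-class config with a 4096-token context:
print(round(kv_cache_gb(32, 8, 128, 4096), 2))  # ~0.54 GB

# 8B model on a 38-TOPS NPU at an assumed 30% utilisation.
# Memory bandwidth, not compute, often dominates decode in
# practice, so treat this as an optimistic lower bound:
print(round(ms_per_token(8, 38), 2))            # ~1.4 ms/token
```

Note how grouped-query attention (8 KV heads instead of 32) cuts the KV cache by 4×, which is why the "~1 GB KV cache" budget in the first question is realistic on a phone.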