On-device AI agents hit a hard memory limit. Apple’s new architecture routes around it.
On-device AI models have stayed small because the entire weight set has to live in DRAM, capping practical parameter counts well below what server-side deployments use. Enterprise architects evaluating agentic workloads have had to choose between capable cloud-dependent models and limited on-device ones. Apple’s third-generation foundation models, announced at WWDC26, break that constraint by moving the weight set off DRAM entirely. The AFM 3 family was developed in collaboration with Google and spans five models: two on-device and three server-based, all running within Apple’s Private Cloud Compute boundary. The server-side models, including AFM 3 Cloud Pro for agentic tool use and complex reasoning, run on Nvidia GPUs in Google Cloud. The on-device architecture is Apple’s own. AFM 3 Core Advanced is a 20-billion-parameter model that stores weights in NAND flash rather than DRAM. “Instead of forcing the entire model into DRAM, the full model is stored in flash memory,” Apple’s research team wrote. “Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes …









