All posts tagged: inference

Perplexity AI unveils hybrid local-cloud inference system at Computex 2026

Published by skeptic

Perplexity AI, the fast-growing search startup now valued at $20 billion, unveiled what it calls the first hybrid local-server inference orchestrator at Computex 2026 on Monday night, demonstrating software that autonomously decides — in real time and mid-task — which AI workloads stay on a user’s device and which get routed to frontier models in the cloud. CEO Aravind Srinivas demonstrated the system onstage alongside Intel CEO Lip-Bu Tan during Intel’s keynote address, using Perplexity’s “Personal Computer” agent to process confidential deal materials. In the demonstration, local models running on Intel Core Ultra Series 3 determined which information should remain on the device and which information could be sent to cloud-based models. Srinivas said the approach balances intelligence, accuracy, privacy, and cost. The key claim is not that a model can run locally — dozens of tools already do that. It is that Perplexity’s system makes the routing decision itself, task by task, without requiring the user to choose in advance. Sensitive data like financial records or health information stays on the local machine; the …

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

Published by skeptic

One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, drives up token costs, and makes it difficult to train the entire system as a cohesive unit. To overcome this challenge, researchers at University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that enables agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains. Experiments show that RecursiveMAS achieves accuracy improvement across complex domains like code generation, medical reasoning, and search, while also increasing inference speed and slashing token usage. RecursiveMAS is significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems. The challenges of improving multi-agent systems Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a big challenge is enabling the system to evolve, improve, and adapt to different scenarios over time. Prompt-based adaptation …

Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

Published by skeptic

The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques to increase the accuracy of model responses, such as drawing multiple reasoning samples from a model at deployment. To bridge this gap, researchers at University of Wisconsin-Madison and Stanford University have introduced Train-to-Test (T2) scaling laws, a framework that jointly optimizes a model’s parameter size, its training data volume, and the number of test-time inference samples. In practice, their approach proves that it is compute-optimal to train substantially smaller models on vastly more data than traditional rules prescribe, and then use the saved computational overhead to generate multiple repeated samples at inference. For enterprise AI application developers who are training their own models, this research provides a proven blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending huge amounts on frontier models. Instead, smaller models can yield stronger performance on complex tasks while keeping per-query inference costs manageable …

Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

Published by skeptic

For the last 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser. Security teams tightened cloud access security broker (CASB) policies, blocked or monitored traffic to well-known AI endpoints, and routed usage through sanctioned gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is starting to break. A quiet hardware shift is pushing large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the “bring your own model” (BYOM) era: Employees running capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate enterprise risk is increasingly “unvetted inference inside the device.” When inference happens locally, traditional data loss prevention (DLP) doesn’t see the interaction. And when security can’t see it, it can’t manage it. Why local inference is suddenly practical …

IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

Published by skeptic

Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length. The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help enterprises provide faster user experiences for production-scale, long-context models, a capability already proven in preliminary tests on the 744-billion-parameter GLM-5 model. The DSA bottleneck Large language models rely on the self-attention mechanism, a process where the model computes the relationship between every token in its context and all the preceding ones to predict the next token. However, self-attention has a severe limitation. Its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (e.g., large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to sluggish …

$Mistral’s Small 4 consolidates reasoning, vision and coding into one model — at a fraction of the inference cost$

Mistral’s Small 4 consolidates reasoning, vision and coding into one model — at a fraction of the inference cost

Published by skeptic

Enterprises that have been juggling separate models for reasoning, multimodal tasks, and agentic coding may be able to simplify their stack: Mistral’s new Small 4 brings all three into a single open-source model, with adjustable reasoning levels under the hood. Small 4 enters a crowded field of small models — including Qwen and Claude Haiku — that are competing on inference cost and benchmark performance. Mistral’s pitch: shorter outputs that translate to lower latency and cheaper tokens. Mistral Small 4 updates Mistral Small 3.2, which came out in June 2025, and is available under an Apache 2.0 license. “With Small 4, users no longer need to choose between a fast instruct model, a powerful reasoning engine, or a multimodal assistant: one model now delivers all three, with configurable reasoning effort and best-in-class efficiency,” Mistral said in a blog post. The company said that despite its smaller size — Mistral Small 4 has 119 billion total parameters with only 6 billion active parameters per token — the model combines the capabilities of all Mistral’s models. It …

The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

Published by skeptic

Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin. The obvious workaround is spot GPU markets — renting spare capacity to whoever needs it. But spot instances mean the cloud vendor is still the one doing the renting, and engineers buying that capacity are still paying for raw compute with no inference stack attached. FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became foundational to vLLM, the open source inference engine used across most production deployments today. Chun spent over a decade as a professor at Seoul National University studying efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically rather than waiting to fill a fixed batch before executing. It is …

Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

Published by skeptic

As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a way to bake 3x throughput gains directly into a model’s weights. Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model’s existing architecture. The limits of next-token prediction Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially problematic in reasoning models, which frequently generate thousands of “chain of thought” tokens before producing the final response, leading to a slow and expensive user experience. Multi-token prediction (MTP) offers an alternative training paradigm that allows a language model to produce multiple tokens simultaneously in a single forward pass. For example, the model can be trained to predict a block of tokens all at once instead of just the immediate …

New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

Published by skeptic

Agents built on top of today’s models often break with simple changes — a new library, a workflow modification — and require a human engineer to fix it. That’s one of the most persistent challenges in deploying AI for the enterprise: creating agents that can adapt to dynamic environments without constant hand-holding. While today’s models are powerful, they are largely static. To address this, researchers at the University of California, Santa Barbara have developed Group-Evolving Agents (GEA), a new framework that enables groups of AI agents to evolve together, sharing experiences and reusing their innovations to autonomously improve over time. In experiments on complex coding and software engineering tasks, GEA substantially outperformed existing self-improving frameworks. Perhaps most notably for enterprise decision-makers, the system autonomously evolved agents that matched or exceeded the performance of frameworks painstakingly designed by human experts. The limitations of ‘lone wolf’ evolution Most existing agentic AI systems rely on fixed architectures designed by engineers. These systems often struggle to move beyond the capability boundaries imposed by their initial designs. To solve this, …

AI inference costs dropped up to 10x on Nvidia’s Blackwell — but hardware is only half the equation

Published by skeptic

Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token. The dramatic cost reductions were achieved using Nvidia’s Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users. The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed source APIs that charge premium rates. The economics prove counterintuitive. Reducing inference costs requires investing in higher-performance infrastructure because throughput improvements translate directly into lower per-token costs. “Performance is what drives …

Skeptic Society Magazine

for honest conversations

Years

Authors

Filter by Month

Filter by Categories

Filter by Tags

All posts tagged: inference

Perplexity AI unveils hybrid local-cloud inference system at Computex 2026

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

Mistral’s Small 4 consolidates reasoning, vision and coding into one model — at a fraction of the inference cost

The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

AI inference costs dropped up to 10x on Nvidia’s Blackwell — but hardware is only half the equation