All posts tagged: inference

Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques to increase the accuracy of model responses, such as drawing multiple reasoning samples from a model at deployment. To bridge this gap, researchers at University of Wisconsin-Madison and Stanford University have introduced Train-to-Test (T2) scaling laws, a framework that jointly optimizes a model’s parameter size, its training data volume, and the number of test-time inference samples. In practice, their approach proves that it is compute-optimal to train substantially smaller models on vastly more data than traditional rules prescribe, and then use the saved computational overhead to generate multiple repeated samples at inference. For enterprise AI application developers who are training their own models, this research provides a proven blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending huge amounts on frontier models. Instead, smaller models can yield stronger performance on complex tasks while keeping per-query inference costs manageable …

Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

For the last 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser. Security teams tightened cloud access security broker (CASB) policies, blocked or monitored traffic to well-known AI endpoints, and routed usage through sanctioned gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is starting to break. A quiet hardware shift is pushing large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the “bring your own model” (BYOM) era: Employees running capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate enterprise risk is increasingly “unvetted inference inside the device.” When inference happens locally, traditional data loss prevention (DLP) doesn’t see the interaction. And when security can’t see it, it can’t manage it. Why local inference is suddenly practical …

IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length. The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help enterprises provide faster user experiences for production-scale, long-context models, a capability already proven in preliminary tests on the 744-billion-parameter GLM-5 model. The DSA bottleneck Large language models rely on the self-attention mechanism, a process where the model computes the relationship between every token in its context and all the preceding ones to predict the next token. However, self-attention has a severe limitation. Its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (e.g., large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to sluggish …

Mistral’s Small 4 consolidates reasoning, vision and coding into one model — at a fraction of the inference cost

Mistral’s Small 4 consolidates reasoning, vision and coding into one model — at a fraction of the inference cost

Enterprises that have been juggling separate models for reasoning, multimodal tasks, and agentic coding may be able to simplify their stack: Mistral’s new Small 4 brings all three into a single open-source model, with adjustable reasoning levels under the hood. Small 4 enters a crowded field of small models — including Qwen and Claude Haiku — that are competing on inference cost and benchmark performance. Mistral’s pitch: shorter outputs that translate to lower latency and cheaper tokens. Mistral Small 4 updates Mistral Small 3.2, which came out in June 2025, and is available under an Apache 2.0 license. “With Small 4, users no longer need to choose between a fast instruct model, a powerful reasoning engine, or a multimodal assistant: one model now delivers all three, with configurable reasoning effort and best-in-class efficiency,” Mistral said in a blog post. The company said that despite its smaller size — Mistral Small 4 has 119 billion total parameters with only 6 billion active parameters per token — the model combines the capabilities of all Mistral’s models. It …

The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin. The obvious workaround is spot GPU markets — renting spare capacity to whoever needs it. But spot instances mean the cloud vendor is still the one doing the renting, and engineers buying that capacity are still paying for raw compute with no inference stack attached. FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became foundational to vLLM, the open source inference engine used across most production deployments today. Chun spent over a decade as a professor at Seoul National University studying efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically rather than waiting to fill a fixed batch before executing. It is …

Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a way to bake 3x throughput gains directly into a model’s weights. Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model’s existing architecture. The limits of next-token prediction Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially problematic in reasoning models, which frequently generate thousands of “chain of thought” tokens before producing the final response, leading to a slow and expensive user experience. Multi-token prediction (MTP) offers an alternative training paradigm that allows a language model to produce multiple tokens simultaneously in a single forward pass.  For example, the model can be trained to predict a block of tokens all at once instead of just the immediate …

New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

New agent framework matches human-engineered AI systems — and adds zero inference cost to deploy

Agents built on top of today’s models often break with simple changes — a new library, a workflow modification — and require a human engineer to fix it. That’s one of the most persistent challenges in deploying AI for the enterprise: creating agents that can adapt to dynamic environments without constant hand-holding. While today’s models are powerful, they are largely static. To address this, researchers at the University of California, Santa Barbara have developed Group-Evolving Agents (GEA), a new framework that enables groups of AI agents to evolve together, sharing experiences and reusing their innovations to autonomously improve over time. In experiments on complex coding and software engineering tasks, GEA substantially outperformed existing self-improving frameworks. Perhaps most notably for enterprise decision-makers, the system autonomously evolved agents that matched or exceeded the performance of frameworks painstakingly designed by human experts. The limitations of ‘lone wolf’ evolution Most existing agentic AI systems rely on fixed architectures designed by engineers. These systems often struggle to move beyond the capability boundaries imposed by their initial designs. To solve this, …

AI inference costs dropped up to 10x on Nvidia’s Blackwell — but hardware is only half the equation

AI inference costs dropped up to 10x on Nvidia’s Blackwell — but hardware is only half the equation

Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token. The dramatic cost reductions were achieved using Nvidia’s Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users. The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed source APIs that charge premium rates. The economics prove counterintuitive. Reducing inference costs requires investing in higher-performance infrastructure because throughput improvements translate directly into lower per-token costs. “Performance is what drives …

AI inference startup Modal Labs in talks to raise at .5B valuation, sources say

AI inference startup Modal Labs in talks to raise at $2.5B valuation, sources say

Modal Labs, a startup specializing in AI inference infrastructure, is talking to VCs about a new round at a valuation of about $2.5 billion, according to four people with knowledge of the deal. Should the deal close at these terms, the funding round would more than double the company’s valuation of $1.1 billion announced less than five months ago. General Catalyst is in talks to lead the round, the people told TechCrunch. Modal’s annualized revenue run rate (ARR) is approximately $50 million, our sources said. The discussions are early, and terms could still change.   Modal Labs co-founder and CEO Erik Bernhardsson denied that his company was actively fundraising and characterized his recent interactions with VCs as general conversations. General Catalyst did not respond to our requests for comment. Modal is focused on optimizing inference, the process of running trained AI models to generate answers from user requests. Improving inference efficiency reduces compute costs and cuts down the lag time between a user’s prompt and the AI’s response. Modal is one of the handful of inference-focused …

TTT-Discover optimizes GPU kernels 2x faster than human experts — by training during inference

TTT-Discover optimizes GPU kernels 2x faster than human experts — by training during inference

Researchers from Stanford, Nvidia, and Together AI have developed a new technique that can discover new solutions to very complex problems. For example, they managed to optimize a critical GPU kernel to run 2x faster than the previous state-of-the-art written by human experts. Their technique, called “Test-Time Training to Discover” (TTT-Discover), challenges the current paradigm of letting models “think longer” for reasoning problems. TTT-Discover allows the model to continue training during the inference process and update its weights for the problem at hand. The limits of ‘frozen’ reasoning Current enterprise AI strategies often rely on “frozen” models. Whether you use a closed or open reasoning model, the model’s parameters are static. When you prompt these models, they search for answers within the fixed manifold of their training data. This works well for problems that resemble what the model has seen before. However, true discovery problems, like inventing a novel algorithm or proving a new mathematical theorem, are, by definition, out-of-distribution. If the solution requires a leap of logic that doesn’t exist in the training set, …