All posts tagged: token

The token bill comes due: Inside the industry scramble to manage AI’s runaway costs

Across the industry, companies are starting to balk at the price of AI. Uber blew through its entire 2026 AI coding budget by April. Microsoft revoked its developers’ Claude Code licenses months after enabling them. A Priceline employee told TechCrunch that a routine Cursor contract renewal came back 4-5x more expensive. Even though per-token prices have fallen, the push for more AI adoption and increasingly autonomous agents have driven token consumption higher and higher. Companies that gorged themselves in early 2025 on all-you-can-eat subscriptions are now scrambling to understand where their money is going, pull back spending, and figure out whether they can salvage some ROI from the wreckage of their budgets. Meanwhile, a market is forming to meet them there. Startups, established vendors, and a new standards body are all racing to give companies the tools and language to track what they spend. “Six months ago, I would have a conversation with a customer and it would be all about ‘What can it do? Is it good enough?’” Alexander Embiricos, OpenAI’s head of enterprise, told …

Alibaba’s Qwen3.7-Plus supports text, video and imagery inputs at low cost of $0.4/$1.6 per 1M token — but it’s proprietary

Alibaba this week released Qwen3.7-Plus, the latest AI large language model (LLM) in its globally beloved and increasingly expansive Qwen family, boasting more multimodal capabilities and a 60% lower cost than the prior, text-only Qwen3.7-Max model released just weeks ago. However, like its immediate predecessor, Qwen3.7-Plus is available only under a “closed” commercial license via proprietary application programming interfaces (APIs) and Qwen Chat. That marks a big departure from the Qwen strategy to date, which was focused mainly on releasing powerful, near state-of-the-art open source models. Those enterprises and users who relied on the open source Qwen models, among them U.S. giants such as Airbnb, will no doubt be disappointed to see that Alibaba is going closed for its newer releases. Still, the model is worth a look because of its low cost and high performance on multimodal tasks like creating enterprise-grade visuals or analyzing video, imagery and screenshots, which Qwen3.7-Max cannot do (it’s text-only). It is among the cheaper powerful AI models available now, coming in price-wise just above Chinese rival’s new MiniMax-M3’s …

Researchers automated LLM reasoning strategy design and cut token usage by 69.5%

Test-time scaling (TTS) has emerged as a proven method to improve the performance of large language models in real-world applications by giving them extra compute cycles at inference time. However, TTS strategies have historically been handcrafted, relying heavily on human intuition to dictate the rules of the model’s reasoning. To address this bottleneck, researchers from Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually tuning heuristics. By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs of deploying advanced reasoning models in production environments. In experimental trials, AutoTTS managed inference budgets efficiently, successfully reducing token consumption by up to 69.5% without sacrificing accuracy.
The manual bottleneck in test-time scaling
Test-time scaling enhances LLMs by granting them extra compute when generating answers. This extra compute allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final response. The primary challenge for …

How DeepSeek’s radical architecture is shattering Silicon Valley’s token moat

DeepSeek’s announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley’s frontier labs.  The reduction on DeepSeek V4 Pro directly undercuts comparable Western models used as workhorses for enterprise production. It is 7x cheaper on inputs and 17x cheaper on outputs than Anthropic’s Claude Sonnet or OpenAI’s GPT 5.5-Med, while the lightweight DeepSeek V4 Flash undercuts entry-tier alternatives like Claude Haiku by 10x to 25x.  The price cuts are enabled by a series of hardware-software innovations, especially around cache, that make DeepSeek’s models radically more efficient to run. When hosted natively in China, DeepSeek’s cache-read pricing is a whopping 87x cheaper than Western clouds — a deflationary floor so aggressive that handset giant Xiaomi just moved to match the exact pricing tier for its newly deployed MiMo architecture. DeepSeek V4 Pro’s performance is ranked almost on par with Western frontier models, hitting 80.6% on coding-agent tasks via the SWE-bench Verified leaderboard and an elite …

The attack dominating financial services doesn’t steal passwords. It resets MFA and steals the token.

The attacker who hit the most financial services organizations over the past 12 months never phished a password. They called an IT support line, convinced an employee to reset their MFA, and registered their own device on the network. CrowdStrike’s 2026 Financial Services Threat Landscape Report, released this month and covering activity from April 2025 through March 2026, identified Mutant Spider as the single most active threat to the financial services sector. The group’s primary technique was voice phishing over Microsoft Teams. Operators impersonated internal IT support, convinced employees to reset their credentials and multifactor authentication, then registered their own devices on corporate networks. The security control worked exactly as designed — and that was the problem. Within days, the FBI published a public service announcement warning about Kali365, a phishing-as-a-service platform sold on Telegram for as little as $250 a month. Kali365 captures Microsoft 365 OAuth tokens through the legitimate device code authentication flow. MFA fires on the victim’s device, not the attacker’s. The token grants persistent access to Outlook, Teams, and OneDrive without …

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, drives up token costs, and makes it difficult to train the entire system as a cohesive unit. To overcome this challenge, researchers at the University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that enables agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains. Experiments show that RecursiveMAS improves accuracy across complex domains like code generation, medical reasoning, and search, while also increasing inference speed and slashing token usage. RecursiveMAS is significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems.
The challenges of improving multi-agent systems
Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a big challenge is enabling the system to evolve, improve, and adapt to different scenarios over time. Prompt-based adaptation …

Opus 4.7 vs. Opus 4.6: Is the 35% Token Increase Worth It?

Opus 4.7 brings a host of advancements to the table, from refined coding accuracy to improved visual processing and a more intuitive user interface. Better Stack highlights how these upgrades enhance both functionality and precision, making the model a strong contender for diverse applications. However, one notable drawback is the increased token usage, up to 35% higher in certain configurations, particularly at the default “extra high” effort level. This change could impact users managing large-scale projects or operating within tight budgets, requiring careful adjustments to settings to balance costs and performance. Dive into this feature to explore how Opus 4.7’s enhanced instruction-following capabilities can improve alignment with user intent, why its upgraded multimodal processing is a fantastic option for combining text and visuals and how its memory improvements streamline workflows for long-term projects. You’ll also gain insight into practical strategies for managing token consumption and configuring the model to suit your specific needs. By understanding these nuances, you can make the most of Opus 4.7’s strengths while navigating its trade-offs effectively. Key Performance Enhancements TL;DR …

Grassroots venues call Labour rates relief ‘token gesture’ amid fight for survival

It’s no secret that the situation of Britain’s music venues has grown increasingly fraught, made worse by the Covid pandemic, the cost-of-living crisis, and shifting trends in alcohol consumption. That’s why, when Labour pledged to reduce business rates for pubs and music venues by 15 per cent back in January, many business owners across the country breathed a collective sigh of relief. But, after the changes finally came into effect earlier this week, just how positive are the UK’s grassroots venues feeling about the future?
[Image: Punk project Total Con perform at The Lughole as part of its Noise Annoys Fest in 2025. Credit: Instagram / Alex Brown / @aroutinesearch]
Adam Regan, owner of the historic Hare & Hounds in south Birmingham, told The Independent that the new rates relief left much to be desired, even while he feels confident that his business is …

How xMemory cuts token costs and context bloat in AI agents

Standard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows. xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, solves this by organizing conversations into a searchable hierarchy of semantic themes. Experiments show that xMemory improves answer quality and long-range reasoning across various LLMs while cutting inference costs. According to the researchers, it drops token usage from over 9,000 to roughly 4,700 tokens per query compared to existing systems on some tasks. For real-world enterprise applications like personalized AI assistants and multi-session decision support tools, this means organizations can deploy more reliable, context-aware agents capable of maintaining coherent long-term memory without blowing up computational expenses.
RAG wasn’t built for this
In many enterprise LLM applications, a critical expectation is that these systems will maintain coherence and personalization across long, multi-session interactions. To support this long-term reasoning, one common approach is to use standard RAG: store past dialogues and events, retrieve …

ChatGPT 5.4 Thinking vs Earlier Models: Token Savings and Stronger Self-Checks

The integration of GPT-5.4 Thinking into frontend development introduces a new level of efficiency and precision, particularly through its enhanced Computer Use Ability (CUA). This feature allows the model to interact with digital systems in a human-like manner, eliminating the need for external environments and streamlining complex workflows. OpenAI highlights how ChatGPT 5.4 Thinking can handle intricate tasks, such as designing and testing a 3D chess game with advanced textures and rule adherence, all while significantly reducing computational overhead. These capabilities not only simplify technical processes but also prioritize high-quality output and usability. In this overview, you’ll explore how ChatGPT 5.4 Thinking enables developers to convert design inputs, like images, into fully functional websites with accurate styling and responsive layouts. You’ll also learn how its self-checking mechanisms ensure alignment between design and output, reducing manual adjustments. Additionally, the model’s ability to manage concurrent processes, such as generating visual assets and validating functionality, offers practical insights into optimizing workflows for both small-scale and complex projects. This breakdown provides a clear look at how these features can …