All posts tagged: Claude

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI’s GPT-5 family, Anthropic’s Claude Opus, and Google’s Gemini Pro have clustered within a narrow band on Scale AI’s SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases. On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models — and crowns OpenAI’s GPT-5.5 as the clear leader at 70%, sixteen points ahead of its nearest competitor. “On public leaderboards, top models often look relatively close in capability,” wrote Datacurve co-author Serena Ge on X. “DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.” The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve’s audit found that …

I cancelled ChatGPT, Perplexity, and Gemini — Claude does everything I need

If you’re looking at your monthly expenses and feel that your AI subscriptions are getting out of hand, you’re not alone. I thought I’d never pay for AI again when I started using local LLMs, but cloud-based AI companies keep releasing features that keep them ahead of their local counterparts, and if you want to use them, you’ll have to buy that $20 subscription. As a result, I ended up with multiple AI subscriptions, each filling its own distinct gap. ChatGPT, Gemini, Perplexity, and Claude all had their uses, until I decided to scrap all my subscriptions except one. And that’s because Claude does everything I need.

I realized I only use AI for a few things
Writing, research, coding, and why some features don’t matter

Before you go around comparing tools to save on your subscription costs, it helps to be honest about what you actually need. I write a lot of code, …

The Ultimate Guide to Customizing the Claude Desktop App in 2026

Creating a personalized Claude AI system can help you streamline your workflows by aligning them with your specific needs and goals. Simon Scrapes demonstrates how to use the Claude desktop app as the foundation for building this system, emphasizing the importance of two key files: claude.md, which defines rules and behavior, and memory.md, which stores contextual information to maintain continuity across tasks. By structuring these components and integrating them into a well-organized folder architecture, you can create a scalable and efficient AI system tailored to both personal and professional projects. In this breakdown, you’ll explore how to design a folder hierarchy that supports multiple clients or departments, ensuring clear organization and adaptability. Learn how to customize Claude’s behavior using markdown files to automate repetitive tasks, such as generating reports or managing workflows. Additionally, discover strategies for addressing challenges like memory management and scalability to maintain optimal performance as your system evolves. These insights will equip you to build a system that simplifies complex processes while remaining flexible to your changing needs. Getting Started: Laying …

ChatGPT’s decline is real — I tested it against Claude on 3 routine tasks and lost every time

I’d be lying if I said I was a Claude user from the beginning. For the longest time, ChatGPT was my go-to AI tool for everything, whether it was automating tasks, brainstorming ideas, or simply handling work I didn’t want to spend hours doing myself. So much so that I have also used ChatGPT on CarPlay, and I absolutely love it. But after using the paid versions of both Claude and ChatGPT extensively, I’ve slowly come to one conclusion: ChatGPT’s decline feels very real. And no, I’m not saying that for the sake of it — I tested both tools across my everyday workflow, and more often than not, Claude came out ahead. ChatGPT still has the bigger name and more features, but it no longer feels as consistent or as reliable as it once did. The responses often feel repetitive, less thoughtful, and sometimes oddly disconnected from the actual context. Meanwhile, Claude simply feels more natural, focused, and better at understanding nuance. Before moving on to the tasks, I think it’s worth knowing that …

Alibaba’s proprietary Qwen3.7-Max can run for 35 hours autonomously and supports external harnesses like Anthropic’s Claude Code

The AI industry has fully entered the “agent era,” a paradigm where AI models do far more than generate text: they now actively plan, execute, and course-correct complex tasks over days rather than seconds. Thus, it’s perhaps unsurprising to see the famed Qwen Team of AI researchers at Chinese e-commerce giant Alibaba release a model capable of performing autonomous agentic AI work over multiple days. That model has arrived in the form of Qwen3.7-Max, which the company reports in a blog post achieved “~35 hours of continuous autonomous execution,” albeit in a proprietary format rather than the open source one used for prior Qwen Team releases. This is also to be expected: it’s what many analysts and industry experts feared in the wake of the departure of several key Qwen Team leaders earlier this year. But it makes sense for Alibaba financially, at least in the short term: training AI models, especially ones as powerful as Qwen3.7-Max, is expensive, and giving them away essentially for free, as open source models are, does not immediately help recoup any …

Anthropic’s Code with Claude showed off coding’s future—whether you like it or not

Pull requests are fixes or updates to existing software that are submitted for review before they go live. They are the bread and butter of software development, the chunks of code that most professional developers spend their lives writing—or did until now. “Who here has shipped a pull request that was completely written by Claude where they did not read the code at all?” Hadfield asked next. Nervous laughter. Most of the hands stayed up. It’s not news that LLM-powered tools like Anthropic’s Claude Code and OpenAI’s Codex have upended the way software gets made. Top tech companies now like to boast of how little code their developers write by hand. (“Most software at Anthropic is now written by Claude,” Hadfield said. “Claude has written most of the code in Claude Code.”) OpenAI, Google, and Microsoft make similar claims. Many others wish they could. Even so, it is striking how normal this new paradigm already seems, and how fast it has set in. This was the second year that Anthropic has put on developer events, …

SpaceX Is Spending $2.8 Billion to Buy Gas Turbines for Its AI Data Centers

Elon Musk’s SpaceX committed to spending over $2.8 billion in recent months to buy gas turbines to power data centers for its artificial intelligence unit, the company revealed in a regulatory filing on Wednesday. The relatively large investment shows that Musk is continuing to double down on gas turbines, even after SpaceX’s use of them prompted public complaints, a lawsuit, and regulatory inquiries into whether the company may be polluting the air with carbon emissions and dodging environmental requirements. A shortage of electricity is the leading constraint on an otherwise roaring data center boom happening across the US. Portable gas turbines—generators that can run without drawing power from the grid—have been viewed as quick and temporary solutions until more robust sources of energy come online. In addition to launching rockets and selling satellite internet, SpaceX also owns Musk’s xAI unit, which develops Grok. To support the chatbot and other AI efforts, xAI operates a pair of data centers known as Colossus 1 in Memphis, Tennessee and Colossus 2 in Southaven, Mississippi. SpaceX is leasing access …

Claude agents can finally connect to enterprise APIs without leaking credentials

The reason enterprises have been slow to connect AI agents to internal APIs and databases isn’t the models; it’s the credentials. In most production deployments, the agent carries authentication tokens with it as it executes tool calls, which means a compromised or misbehaving agent takes the keys with it. Anthropic is addressing that problem with two new capabilities for Claude Managed Agents: self-hosted sandboxes, which let teams run tool execution inside their own infrastructure perimeter, and MCP tunnels, which connect agents to private MCP servers without exposing credentials in the agent’s context. Together they move credential control to the network boundary rather than leaving it inside the agent. Self-hosted sandboxes are now available to Claude Managed Agents users in public beta, while MCP tunnels are in research preview. Anthropic isn’t the only model provider making this bet. OpenAI added local execution to its Agents SDK in April in response to similar demand. The architectural distinction Anthropic draws is a split: the agent loop runs on Anthropic’s infrastructure, while tool execution runs on …

SandboxAQ brings its drug discovery models to Claude — no PhD in computing required

Drug discovery is one of the most expensive pursuits in modern industry. Finding a single viable molecule can take a decade and cost billions, and most candidates still don’t make it. A generation of AI startups has promised to fix that — most have made the problem less painful for researchers, who are already technically sophisticated enough to use the tools. But SandboxAQ thinks the bottleneck isn’t the models. It’s the interface. The company has teamed up with Anthropic to integrate its scientific AI models directly into Claude — putting powerful drug discovery and materials science tools behind a conversational interface that requires no specialized computing infrastructure to use. Founded roughly five years ago as an Alphabet spinout, SandboxAQ counts Eric Schmidt, Google’s former CEO, as its chairman. The company, which has raised more than $950 million from investors, has built out a number of different business lines, including a cybersecurity business. One of the more unique things SandboxAQ does, however, is produce large quantitative models, or LQMs. These proprietary models are “physics-grounded,” meaning they’re …