All posts tagged: GPT5.5

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

Published by skeptic

Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE), a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows. In a shocking upset, OpenAI’s GPT-5.5 from April, operating through the Codex harness, secured the absolute top spot on the new ALE Leaderboard with a 24.0% pass rate, beating Anthropic’s highly anticipated, brand new Mythos-class Claude Fable 5 model released just yesterday, which came in third with a score of 22.0%. Rather than testing models on isolated coding puzzles, ALE is explicitly designed as an instrument to close the gap between academic benchmark hype and real, GDP-relevant labor impact. And right now, the data proves the most advanced models in the world are fundamentally failing the exam.

ALE Leaderboard full chart. Credit: Agents’ Last Exam/UC Berkeley RDI

Ending the Era of ‘Cheating’ and Brittle Graders

The fundamental shift in ALE lies in …

How Claude Opus 4.8 Compares to OpenAI GPT-5.5

Published by skeptic

Anthropic has released Opus 4.8, introducing updates designed to enhance its AI’s performance in areas like coding accuracy, reasoning and task management. A notable feature is the addition of dynamic workflows, which break down complex operations into smaller, verifiable subtasks to streamline automation. According to Universe of AI, these updates reflect Anthropic’s attempt to meet user demands and remain competitive against alternatives such as OpenAI’s GPT-5.5, though the absence of the anticipated Mythos model adds complexity to understanding their broader direction. Learn how features like effort control and fast mode aim to address varied user requirements. Discover the practical applications of dynamic workflows in scenarios like security audits and large-scale code migrations. Gain insight into Anthropic’s emphasis on model alignment and transparency as part of its strategy to rebuild user confidence and refine its position in the AI field.

Key Enhancements in Opus 4.8

TL;DR: Anthropic has launched Opus 4.8, featuring incremental improvements such as enhanced coding accuracy, refined reasoning and better task management, but it is described as an evolutionary update rather than …

MiniMax-M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5-10% of the cost

Published by skeptic

Big news in enterprise AI broke over the weekend as Chinese AI startup MiniMax released its highly anticipated M3 large language model on Sunday evening Eastern time, pairing frontier-tier coding and agentic performance with a 1-million-token context window and native multimodality for a fraction of the cost of leading proprietary models, with pricing starting at just $20 per month under its new subscription token plans. The company’s leadership also announced plans to deliver the model under an open source license including “open weights,” allowing for full enterprise downloading and customizability free of charge, coming sometime in the next 10 days. For now, it is available via the MiniMax API at a special discounted price of $0.30 per million input tokens and $1.20 per million output tokens (on fresh cache) for the next week, beating proprietary U.S. giants like Google, OpenAI and Anthropic handily on cost, while also eclipsing the performance of the latest models from the former two on selected benchmarks. Even at its full price of $0.60/$2.40 per million input/output tokens, MiniMax-M3 remains at …

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

Published by skeptic

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI’s GPT-5 family, Anthropic’s Claude Opus, and Google’s Gemini Pro have clustered within a narrow band on Scale AI’s SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases. On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models — and crowns OpenAI’s GPT-5.5 as the clear leader at 70%, sixteen points ahead of its nearest competitor. “On public leaderboards, top models often look relatively close in capability,” wrote Datacurve co-author Serena Ge on X. “DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.” The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve’s audit found that …

OpenAI’s New Daybreak Platform Uses GPT-5.5 to Find Software Vulnerabilities

Published by skeptic

OpenAI today launched Daybreak, an answer to Anthropic’s Project Glasswing initiative and Mythos AI model. Like Glasswing, Daybreak is a cyber defense effort that will help tech companies find security vulnerabilities in their platforms. OpenAI says Daybreak is aimed at building cyber defense into software from the start. It builds on OpenAI’s April launch of GPT-5.4-Cyber, which the company says has contributed to fixing more than 3,000 vulnerabilities. OpenAI describes the platform this way: “Daybreak combines the intelligence of OpenAI models, the extensibility of Codex as an agentic harness, and our partners across the security flywheel to help make the world safer for everyone. Defenders can bring secure code review, threat modeling, patch validation, dependency risk analysis, detection, and remediation guidance into the everyday development loop so software becomes more resilient from the start.” OpenAI CEO Sam Altman said OpenAI would like to work with “as many companies as possible” to help them continuously secure their software against cyber threats. Several companies have already adopted Anthropic’s competing Glasswing program, including Apple, Microsoft, Google, and Amazon. Daybreak uses Codex Security to build …

GPT-5.5 Instant shows you what it remembered — just not all of it

Published by skeptic

OpenAI updated the default model for ChatGPT to its new GPT-5.5 Instant, along with a new memory capability that finally shows which context shaped responses, or at least some of it. This limitation signals that models are starting to create a second, incomplete memory observability layer that could conflict with existing audit systems and agent logs. GPT-5.5 Instant replaces GPT-5.3 Instant as the default ChatGPT model and is a version of OpenAI’s new flagship GPT-5.5 LLM. It’s supposed to be more dependable, accurate and smarter than 5.3. But it’s the introduction of memory sources, which will be enabled across all models in the platform, that could help enterprises in their projects. “When a response is personalized, you can see what context was used, such as saved memories or past chats, and delete or correct it if something is outdated or no longer relevant,” OpenAI said in a blog post. After asking ChatGPT something, users can tap the sources button (at the bottom of the response) to see which files or past chats the …

OpenAI releases GPT-5.5 Instant, a new default model for ChatGPT

Published by skeptic

On Monday, OpenAI released a new foundation model called GPT-5.5 Instant, which will replace GPT-5.3 Instant as the default ChatGPT model. The company said the model reduces hallucination in sensitive areas such as law, medicine, and finance, while maintaining the low latency of its predecessor. OpenAI released the latest GPT-5.5 model last month with the company claiming improvements in areas like coding and knowledge work. The new model also achieved a score of 81.2 in the AIME 2025 math test, compared to 65.4 for the older model. It also outperformed its predecessor on the MMMU-Pro multimodal reasoning benchmark, with a score of 76.0 vs 69.2. The release placed a particular emphasis on context management. GPT-5.5 Instant can use its search tool to refer back to past conversations, files, and Gmail to give you more personalized answers. This feature will be available to Plus and Pro users on the web, with plans to roll it out to mobile soon. OpenAI said that it plans to extend access to this feature to Free, Go Business, and enterprise …

OpenAI turns its sold-out GPT-5.5 party into a monthlong Codex giveaway for 8,000 developers

Published by skeptic

OpenAI on Monday began emailing more than 8,000 developers who applied for its invite-only GPT-5.5 party with a surprise consolation prize: a tenfold increase in Codex rate limits on their personal ChatGPT accounts, effective immediately and lasting through June 5. “We had over 8,000 people express interest in just 24 hours, and while we wish our office was big enough to welcome everyone, we weren’t able to make space for every person who applied,” the company wrote in the email, which VentureBeat obtained. “As a small token of appreciation, we’ve 10x’ed your Codex rate limits until June 5th on your personal ChatGPT account.” The gift is not limited to the lucky few who scored invitations to the party itself. Everyone who raised their hand — whether they were accepted, waitlisted, or turned away — received the rate limit boost, according to the email and confirmed by multiple recipients on social media. CEO Sam Altman telegraphed the move on X shortly before inboxes started lighting up. “We are gonna do something nice for everyone who applied …

OpenAI’s GPT-5.5 vs Claude Opus 4.7: Which is better?

Published by skeptic

OpenAI released its latest model, GPT-5.5, on April 23, just a week after Anthropic introduced Claude Opus 4.7. Because these are the two leading models from the two leading AI labs, we wanted to see how they compare. Spoiler alert: We think Claude Opus 4.7 has an edge on advanced and agentic coding, but GPT-5.5 performs better on most benchmarks.

GPT-5.5 and Opus 4.7: Leaderboards

GPT-5.5 isn’t ranked on all AI leaderboards yet, but it should be very competitive with Claude Opus 4.7. On the leaderboards of verified benchmark tests such as Arc Prize, GPT-5.5 beats Opus 4.7 (more on this below). On the popular Arena leaderboard, which is based on user testing, Claude Opus 4.7 Thinking has the top overall spot. Interestingly, Opus 4.7 is currently ranked below Opus 4.6, though that will likely change in time. …

DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5

Published by skeptic

The whale has resurfaced. DeepSeek, the Chinese AI startup that is an offshoot of the quantitative analysis firm High-Flyer Capital Management, became a near-overnight sensation globally in January 2025 with the release of its open source R1 model that matched proprietary U.S. giants. It’s been an epoch in AI since then, and while DeepSeek has released several updates to that model and its other V3 series, the international AI and business community has been largely waiting with bated breath for the follow-up to the R1 moment. Now it’s arrived with last night’s release of DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts (MoE) model available free under the commercially friendly, open source MIT License, which nears, and on some benchmarks surpasses, the performance of the world’s most advanced closed-source systems at approximately 1/6th the cost over the application programming interface (API). This release, which DeepSeek AI researcher Deli Chen described on X as a “labor of love” 484 days after the launch of V3, is being hailed as the “second DeepSeek moment”. As Chen noted in his post, “AGI belongs to everyone”. It’s available now …

Skeptic Society Magazine

for honest conversations
