All posts tagged: benchmarks

Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safety

Published by skeptic

Anthropic has been shipping products and making news at a blistering pace in 2026, and on Thursday, the AI company announced the launch of Claude Opus 4.7. Claude Opus 4.7 is Anthropic’s most intelligent model available to the general public. Notably, Anthropic said in a press release that Opus 4.7 is not as powerful as Claude Mythos, which Anthropic deemed too dangerous for public release. Claude Opus is a family of hybrid reasoning models capable of multi-step reasoning and advanced coding. Until the announcement of Claude Mythos on April 7, Claude Opus was considered Anthropic’s most advanced series of AI models.

How to try Claude Opus 4.7

Claude Opus 4.7 is available now via Claude AI, the Claude API, and Anthropic partners such as Microsoft Foundry. The new model is priced the same as Claude Opus 4.6. However, Anthropic noted that because “Opus 4.7 thinks more at …

Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks

Published by skeptic

For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google’s Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba’s Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, “open” with asterisks isn’t the same as open. Gemma 4 eliminates that friction entirely. Google DeepMind’s newest open model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem. No custom clauses, no “Harmful Use” carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over. The timing is notable. As some Chinese AI labs (most notably Alibaba’s latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully …

The Download: gig workers training humanoids, and better AI benchmarks

Published by skeptic

Zeus is a data recorder for Micro1, which sells the data he collects to robotics firms. As these companies race to build humanoids, videos from workers like Zeus have become the hottest new way to train them. Micro1 has hired thousands of them in more than 50 countries, including India, Nigeria, and Argentina. The jobs pay well locally, but raise thorny questions around privacy and informed consent. The work can be challenging, and weird. By Michelle Kim.

Our readers recently voted humanoid robots the “11th breakthrough” to add to our 2026 list of 10 Breakthrough Technologies.

AI benchmarks are broken. Here’s what we need instead.

For decades, AI has been evaluated based on whether it can outperform humans on isolated problems. But it’s seldom used this way in the real world. While AI is assessed in a vacuum, it operates in messy, complex, multi-person environments over time. This misalignment leads us to misunderstand its capabilities, risks, and impacts. We need new benchmarks that assess AI’s performance over longer horizons within human teams, workflows, and organizations. Here’s a …

DeepSeek V4 Benchmarks Leaked Details Explained

Published by skeptic

Leaked benchmarks for DeepSeek V4 have sparked significant discussion, describing a model that reportedly scales between 200 billion and 1 trillion parameters. According to the leaks, its novel MHC (Multi-Hierarchical Context) architecture enables multimodal processing of text, images and video, with a context window of 1 million tokens for handling expansive inputs. Universe of AI examines these claims alongside Anthropic’s updates to Claude Code, which now includes enhanced “computer use” capabilities for managing applications and systems directly through AI. These developments highlight both the potential and the challenges of scaling advanced AI systems. Explore specific insights into how Anthropic’s Claude Code balances functionality with safety, including session-based controls and app-specific permissions designed to mitigate risks. You’ll also gain a closer look at OpenAI’s Codex plugin integration, which fosters cross-platform collaboration by bridging Claude Code workflows with OpenAI’s systems. This overview provides a detailed breakdown of these advancements, offering a practical lens on their implications for developers and researchers navigating the rapidly evolving AI landscape.

DeepSeek V4: Ambitious Benchmarks & Uncertainty

TL;DR Key Takeaways: …

Europe’s coming energy crunch – POLITICO

Published by Sam Flores

“Markets are now grappling with a scenario long discussed in theory but rarely thought of as a legitimate possibility: the effective shutdown of the world’s most critical energy chokepoint,” said Ana Maria Jaller-Makarewicz, lead energy analyst for the Europe team at the Institute for Energy Economics and Financial Analysis. While the 1970s crises knocked out 7 percent of global supplies, she said, the closure of the Strait of Hormuz affects 20 percent.

When the war first broke out, EU officials hoped the bloc would be spared from serious shortages thanks to its relatively low exposure to the Persian Gulf, which it relied on for just 6 percent of its crude oil and under 10 percent of its natural gas. The biggest risk articulated in countless ministerial and technical meetings was higher prices. Europe’s security of supply was rarely questioned, with officials pointing to the continent’s diversified sources beyond the Persian Gulf: the U.S., …

AI benchmarks are broken. Here’s what we need instead.

Published by skeptic

Across the organizations where this approach has emerged and started to be applied, the first step is shifting the unit of analysis. For example, in one UK hospital system in the period 2021–2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically assessed coordination and deliberation in human teams using and not using AI. Multiple stakeholders (within and outside the hospital) decided on metrics like how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices. This shift is fundamental. It matters a lot in high-stakes contexts where system-level effects matter more than task-level accuracy. It also matters for the economy. It may help recalibrate inflated expectations of sweeping productivity gains that are so far predicated largely on the promise of improving individual task performance. Once that foundation is set, HAIC benchmarking can begin to …

M5 Ultra Mac Studio Leaks: 8K Video and GPU Benchmarks

Published by skeptic

Apple’s highly anticipated next-generation Mac Studio is sparking widespread excitement, with leaks suggesting it will be powered by the M5 Ultra and M5 Max chips. These processors are expected to set new benchmarks in high-performance computing, positioning the Mac Studio as a pivotal release in Apple’s product lineup. Below, we delve into the expected design, internal upgrades, performance capabilities, and release timeline of this new device. A video from Matt Talks Tech gives more details about the rumored M5 Ultra Mac Studio.

Design: Familiar Yet Purpose-Driven

The Mac Studio is expected to retain its compact and minimalist design, a hallmark of its predecessors. While some may hope for a bold redesign, Apple appears to be prioritizing functionality and practicality over aesthetic changes. This approach keeps the device user-friendly and professional, catering to its core audience of creative and technical professionals. The port configuration is also likely to remain consistent, featuring Thunderbolt 5 and USB 3 ports. This ensures compatibility with a wide range of peripherals, from external storage devices …

The 30-second sit-to-stand test is a scientific standard for assessing longevity—here are the benchmarks to aim for in your 60s, 70s, 80s and 90s

Published by Jenni Sidey

How many times could you stand up from a chair and sit down again, without using your hands, in 30 seconds? The answer may indicate your ability to maintain independence in later life. The 30-second sit-to-stand test, as it’s known, first appeared in a 1999 study by California State University researchers Roberta E. Rikli and C. Jessie Jones. The test formed a central component of the Fullerton Functional Fitness Test battery the pair developed to predict mobility, fall risk and independence in later life.

Nearly three decades later, it’s still frequently used by physical therapists to assess fall risk, including as part of the Centers for Disease Control and Prevention’s (CDC) STEADI framework (Stopping Elderly Accidents, Deaths and Injuries). “Preventing or delaying the onset of physical frailty is an increasingly important goal because more individuals are living well into their 8th and 9th decades,” the study’s authors noted at the turn of the century. In 2013, Rikli and Jones published benchmarks for the 30-second sit-to-stand test for older adults …

Global energy body convenes summit on unlocking emergency oil reserves – POLITICO

Published by Sam Flores

BRUSSELS - The head of the International Energy Agency on Tuesday summoned an extraordinary meeting to decide whether to tap millions of barrels of emergency oil supplies amid soaring energy costs. The meeting, to be held at an undisclosed time on Tuesday, will “assess the current security of supply and market conditions to inform a subsequent decision on whether to make emergency stocks of IEA countries available to the market,” the agency’s chief, Fatih Birol, said in an emailed statement. The IEA’s 31 members (mostly advanced Western economies) have grown increasingly panicked as Iran and a U.S.-Israeli coalition trade airstrikes, imperilling critical supply chains and energy infrastructure across the Gulf. Merchant shipping has also abandoned the Strait of Hormuz, a chokepoint for 20 percent of the world’s energy trade, prompting fears of price spikes.

Why Vladimir Putin is the biggest winner from the war in Iran – POLITICO

Published by Sam Flores

For Russia, the surge in oil prices amounts to an economic windfall at a crucial moment, as the cost of four years of war in Ukraine threatened to spill over into a domestic economic crisis. The assault on Iran may undermine Moscow’s claim to stand by its allies, but it is already benefiting Russia’s economy and, by extension, its war against Ukraine, leaving the Kremlin well placed to emerge as one of the main beneficiaries of the expanding conflict in the Middle East.

Economic turnaround

Only several weeks ago, the mood among Russia’s economic elite was grim. The Russian finance ministry’s budget plan for this year assumed a baseline benchmark of $59 per barrel of Urals crude, the country’s main export blend. But in January, energy revenues plunged to their lowest level since 2020, compounding a disappointing tax haul. As Western sanctions, high interest rates and labor shortages strained the economy, tension between the finance ministry and the central bank on how to mitigate the damage became increasingly visible. “It was far from a …

Skeptic Society Magazine

for honest conversations
