All posts tagged: Benchmark

Windows has a benchmark tool so good it makes you wonder why Microsoft never mentioned it

Ask anyone about benchmarking on Windows, and you’ll hear about Cinebench, CrystalDiskMark, 3DMark, or one of the many free benchmark programs for Windows that people swear by. All these third-party tools, yet nobody mentions the one Microsoft already built into every copy of Windows. That tool is WinSAT. It’s been sitting in your system since Vista, capable of scoring your CPU, RAM, GPU, and storage in seconds, and Microsoft has done almost nothing to tell you it exists. I stumbled on it while poking around the command line, and after running it on a few machines, I’m genuinely surprised this never got more attention. It’s not a replacement for specialized benchmarks, but for a quick hardware health check, it does more than you’d expect from a buried command-line tool.

WinSAT is a built-in benchmark that Microsoft buried after removing the Windows Experience Index

A capable tool hidden behind a forgotten interface

WinSAT stands for Windows System Assessment Tool. It runs a series of synthetic tests that measure the performance of your CPU, RAM, storage, desktop …
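By way of illustration, a quick session looks something like this. These are WinSAT's standard subcommands; an elevated prompt is required, and available flags can vary between Windows versions, so treat this as a sketch rather than a definitive reference:

```shell
:: Run from an elevated Command Prompt; winsat requires administrator rights.

:: Full formal assessment: CPU, memory, graphics, and storage in one pass.
winsat formal

:: Or exercise a single subsystem at a time.

:: CPU throughput, using a compression workload.
winsat cpu -compression

:: Memory bandwidth.
winsat mem

:: Disk throughput on drive C:.
winsat disk -drive c

:: Afterwards, the stored scores can be read via WMI from PowerShell:
:: Get-CimInstance Win32_WinSAT | Select-Object CPUScore, MemoryScore, GraphicsScore, DiskScore, WinSPRLevel
```

Each run prints per-test timings and throughput figures to the console, and `winsat formal` also persists its results, which is what the WMI query above reads back.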

Scale AI launches Voice Showdown, the first real-world benchmark for voice AI — and the results are humbling for some top models

Voice AI is moving faster than the tools we use to measure it. Every major AI lab — OpenAI, Google DeepMind, Anthropic, xAI — is racing to ship voice models capable of natural, real-time conversation. But the benchmarks used to evaluate those models are largely still running on synthetic speech, English-only prompts, and scripted test sets that bear little resemblance to how people actually talk. Scale AI, the large data annotation startup whose founder was poached by Meta last year to lead its Superintelligence Lab, is still going strong, and today it is tackling the problem head-on with the launch of Voice Showdown, which it calls the first global preference-based arena designed to benchmark voice AI through the lens of real human interaction. The product offers users a unique strategic value: free access to the world’s leading frontier models. Through Scale’s ChatLab platform, users can interact with high-tier models—which typically require multiple $20-per-month subscriptions—at no cost. In exchange, users participate in occasional blind, head-to-head “battles” to choose which of two anonymized leading voice models offers a better …

AI benchmark numbers are meaningless — here’s what to look for instead

Every time a new AI model launches, the cacophony of AI benchmarking sites whirs into life, bombarding us with colorful charts of marginal, imperceptible improvements to uncontextualized numbers. Unless you’re an AI researcher, most of those figures and charts mean nothing. Sure, “numbers go up = AI gets better” is a basic level of understanding, but those numbers often don’t reveal anything pertinent to how most folks actually use AI. The problem isn’t that benchmarks are useless. It’s that they cater to the wrong audience, functioning more like marketing than a clear explanation of what’s new, what works, and how it’ll save you time.

Why AI companies love benchmark charts

And why that’s what causes all the problems

The reasoning behind AI benchmarking, like all benchmarking tests, is sound. It helps simplify complex systems into easy-to-understand numbers. Instead of describing subtle improvements in reasoning or language understanding, companies can point to a chart and say their model scored 92% …

Apple’s M5 Max Chip Achieves a New Record in First Benchmark Result

The first Geekbench 6 result for a 16-inch MacBook Pro with the M5 Max chip surfaced today, and Apple has achieved record-breaking performance. In this unconfirmed result, the M5 Max with an 18-core CPU achieved a score of 29,233 for multi-core CPU performance, which tops the 27,726 score achieved by the Mac Studio’s M3 Ultra chip with a 32-core CPU. M5 Max is now the fastest Apple silicon chip ever, and it even topped every other consumer PC processor in the Geekbench database. In terms of multi-core CPU performance, the M5 Max is up to 5% faster than the M3 Ultra, and up to 14% to 15% faster than the M4 Max chip with a 16-core CPU. Here is a comparison of the multi-core CPU results:

16-inch MacBook Pro with M5 Max (18-core CPU): 29,233 (one result)
Mac Studio with M3 Ultra (32-core CPU): 27,726 (average of all results)
Mac Studio with M4 Max (16-core CPU): 26,166 (average of all results)
16-inch MacBook Pro with M4 Max (16-core CPU): 25,702 (average of all results)

As …
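For readers who want to sanity-check the percentages, a quick back-of-the-envelope calculation from the listed multi-core scores looks like this (small differences from the quoted figures come down to rounding and to which results get averaged):

```python
# Multi-core Geekbench 6 scores as listed above.
scores = {
    "M3 Ultra (32-core, Mac Studio)": 27726,
    "M4 Max (16-core, Mac Studio)": 26166,
    "M4 Max (16-core, MacBook Pro)": 25702,
}
m5_max = 29233  # 16-inch MacBook Pro with M5 Max (18-core CPU)

# Relative gain of the M5 Max over each earlier chip, as a percentage.
for chip, score in scores.items():
    gain = (m5_max / score - 1) * 100
    print(f"M5 Max vs {chip}: +{gain:.1f}%")
```

The comparison against the M3 Ultra works out to roughly the "up to 5%" figure quoted in the article.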

Gemini 3.1 Pro vs Gemini 3 Pro: Benchmark Score Gains Explained

Gemini 3.1 Pro represents a significant advancement in artificial intelligence, emphasizing autonomous task execution and practical problem-solving. According to Wes Roth, this latest iteration builds on the strengths of its predecessor, Gemini 3 Pro, while achieving remarkable improvements in abstract reasoning and real-world performance. For example, it scored an impressive 77% on the ARC-AGI-2 benchmark, a substantial leap from the 31% achieved by the earlier version. These metrics highlight its growing ability to handle complex reasoning tasks with precision and efficiency. In this explainer, you’ll learn how Gemini 3.1 Pro excels in areas such as internet navigation, office productivity, and command-line operations, as reflected in its performance on benchmarks like BrowseComp and Terminal-Bench 2.0. You’ll also discover its adaptability in dynamic environments, which makes it particularly suited to industries like telecommunications and IT. By understanding these capabilities, you can better appreciate its potential to automate workflows, enhance decision-making, and streamline operations in professional settings.

Gemini 3.1 Pro Overview

TL;DR Key Takeaways:

Gemini 3.1 Pro sets a new standard in AI …

Google’s new Gemini Pro model has record benchmark scores—again

On Thursday, Google released the newest version of Gemini Pro, its powerful LLM. The model, 3.1, is currently available as a preview and will be generally released soon, the company said. Google’s new model may be one of the most powerful LLMs yet. Onlookers have noted that Gemini 3.1 Pro appears to be a big step up from its predecessor, Gemini 3—which, upon its release in November, was already considered a highly capable AI tool. On Thursday, Google also shared statistics from independent benchmarks—such as one called Humanity’s Last Exam—that showed it performing significantly better than its previous version. Gemini 3.1 Pro was also praised by Brendan Foody, the CEO of AI startup Mercor, whose benchmarking system, APEX, is designed to measure how well new AI models perform real professional tasks. “Gemini 3.1 Pro is now at the top of the APEX-Agents leaderboard,” Foody said in a social media post, adding that the model’s impressive results show “how quickly agents are improving at real knowledge work.” The release comes as the AI model wars are …

Jack Altman joins Benchmark as GP

Jack Altman and Benchmark announced today that he is joining the firm as a general partner. The news is a big deal, especially since Altman has been running his own VC firm, Alt Capital, since at least 2024. The fund raised a $150 million Fund I in early 2024 and then, just last September, announced a $274 million Fund II, raised in just a week. On LinkedIn, Altman called the past two years running Alt Capital “the most rewarding” of his life, adding that he loved “new ideas and being part of a team with a mission.” Alt Capital invested in at least 52 companies, according to PitchBook, including Rippling, Antares Nuclear, and CompLabs. It’s unclear what happens to Alt Capital or whether Benchmark has acquired its portfolio, as Altman also announced that his teammates from the fund will be following him to Benchmark. (That, too, is unusual: Benchmark has historically been structured as a flat firm of general partners only, without layers of junior investors.) Altman added that he will retain the board seats at …

Claude Sonnet 4.6: Benchmark performance, how to try it

Anthropic has just released its latest large language model (LLM), Claude Sonnet 4.6. The Tuesday release quickly follows the launch of Claude Opus 4.6, the company’s premium AI model, on Feb. 5. According to Anthropic, “Claude Sonnet 4.6 is our most capable Sonnet model yet.” The company says Sonnet 4.6 has a 1 million token context window in beta. Crucially, Anthropic reports that Sonnet 4.6 performed well on internal safety tests, showing a low tendency to hallucinate or engage in sycophancy. “Sonnet 4.6 brings much-improved coding skills to more of our users,” Anthropic said, referring to Claude’s popularity among developers who use AI to code. If you’re looking to use Anthropic’s latest AI model, the company has made it really easy. Here’s how to access Claude Sonnet 4.6.

How to use Claude Sonnet 4.6

For both free and Pro users, Claude Sonnet 4.6 is available now as the default model on claude.ai and Claude Cowork. Anthropic has also rolled the model out through its API and all major cloud platforms. Free users …

Benchmark raises $225M in special funds to double down on Cerebras

This week, AI chipmaker Cerebras Systems announced that it raised $1 billion in fresh capital at a valuation of $23 billion — a nearly threefold increase from the $8.1 billion valuation the Nvidia rival had reached just six months earlier. While the round was led by Tiger Global, a huge part of the new capital came from one of the company’s earliest backers: Benchmark Capital. The prominent Silicon Valley firm invested at least $225 million in Cerebras’ latest round, according to a person familiar with the deal. Benchmark first bet on 10-year-old Cerebras when it led the startup’s $27 million Series A in 2016. Since Benchmark deliberately keeps its funds under $450 million, the firm raised two separate vehicles, both called ‘Benchmark Infrastructure,’ according to regulatory filings. According to the person familiar with the deal, these vehicles were created specifically to fund the Cerebras investment. Benchmark declined to comment. What sets Cerebras apart is the sheer physical scale of its processors. The company’s Wafer Scale Engine, its flagship chip announced in 2024, measures approximately 8.5 inches on each side and …