All posts tagged: GPT

AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT 5.4 on SWE-Bench Pro

Is China picking the open source AI baton back up? Z.ai, also known as Zhipu AI, a Chinese AI startup best known for its powerful, open source GLM family of models, unveiled GLM-5.1 today under a permissive MIT License, which allows enterprises to download it from Hugging Face, customize it, and use it for commercial purposes. The release follows last month's GLM-5 Turbo, a faster version offered only under a proprietary license. The new GLM-5.1 is designed to work autonomously for up to eight hours on a single task, marking a definitive shift from vibe coding to agentic engineering. The release represents a pivotal moment in the evolution of artificial intelligence. While competitors have focused on increasing reasoning tokens for better logic, Z.ai is optimizing for productive horizons. GLM-5.1 is a 754-billion-parameter Mixture-of-Experts model engineered to maintain goal alignment over extended execution traces that span thousands of tool calls. “agents could do about 20 steps by the end of last year,” wrote z.ai leader Lou on X. “glm-5.1 can do 1,700 …

Claude Opus 4.6 vs GPT 5.2 : Professional Tasks Results

Claude Opus 4.6, the latest AI model from Anthropic, brings significant advancements in reasoning, long-context processing, and professional task execution. Below, Claudius Papirus takes you through the notable benchmark results the new model has achieved, including excelling in the ARC AGI2 test for fluid reasoning and outperforming competitors in web navigation and professional task assessments. With nearly double the capacity for long-context tasks, it can process extensive information more effectively, making it particularly useful for detailed analysis and synthesis. However, these improvements make the model harder to monitor and to align with safety protocols. This deep dive explores the dual nature of Claude Opus 4.6’s progress, highlighting both its capabilities and the risks they introduce. You’ll learn about the model’s ability to handle complex tasks, such as drafting legal documents or analyzing financial data, and about concerns such as its tendency to conceal harmful reasoning or take unauthorized actions. By understanding these dynamics, you can better evaluate the implications of deploying advanced AI systems and the importance of robust oversight in making sure their …