All posts tagged: Evaluate

Could UAPs be aliens? Use this 11-point scale to evaluate the evidence

Unidentified Aerial Phenomena (or UAPs, formerly known as UFOs) are back in the news, with the U.S. Defense Department’s recent publication of (mostly old) case files, and the release of Steven Spielberg’s new film Disclosure Day on June 12. It’s unlikely that either will move us any closer to unraveling the mystery of UAPs. What might help, though, is a method for determining which of the thousands of sightings reported every year are truly worth investigating. Toward that end, we have proposed a rating scale meant to help citizens and scientists alike assess the reliability of UAP reports based on the type and quality of evidence. This won’t solve most cases: more than half of the sightings reviewed by the U.S. government’s All-domain Anomaly Resolution Office (AARO) lack sufficient data for rigorous analysis. But it might help us reduce the number of “false alarms.” We start from the premise that most UAP reports stem from misunderstandings: people seeing unfamiliar objects in the sky that, if investigated fully, would have an ordinary explanation. These objects …

Developers can now debug and evaluate AI agents locally with Raindrop’s open source tool Workshop

Observability startup Raindrop AI’s new open source, MIT-licensed “Workshop” tool, launched today, gives developers something that they’ve likely wanted, perhaps subconsciously, since the agentic AI era kicked off in earnest last year: a local debugger and evaluation tool specifically designed for AI agents. It lets devs see all the traces of what their agent has been doing in a single, lightweight Structured Query Language (SQL) database file (.db). It functions as a local daemon and UI that streams every token, tool call, and decision to a local dashboard, typically hosted at localhost:5899, the moment it occurs. By visiting that address, developers can see everything their agent was up to, including mistakes or errors, and identify what went wrong, when, and ideally, discern why. It’s all stored in a single .db file, which takes up relatively little memory, according to an X direct message VentureBeat received from Ben Hylak, Raindrop’s co-founder and CTO (and a former Apple and SpaceX engineer). This real-time telemetry eliminates the latency of traditional polling and addresses a growing developer …

Artificial intelligence struggles to consistently evaluate scientific facts

Generative artificial intelligence programs can write fluently, but they still struggle to accurately and consistently evaluate basic scientific statements. A recent study shows that when an artificial intelligence is asked the exact same question multiple times, it often gives completely different answers. These results, published in the Rutgers Business Review, highlight the limits of current automated reasoning and the ongoing need for human oversight. Generative artificial intelligence is a type of technology trained on massive databases of text to produce human-like writing. Millions of people now use these applications daily for tasks ranging from marketing to software development. The software writes with an authoritative tone that often sounds correct even when it is entirely wrong. Some high-profile consulting firms have even faced public embarrassment after relying on automated reports that included fabricated data. Despite these known flaws, many businesses have partnered with technology vendors to incorporate these tools into their daily operations. Professionals frequently rely on automated software to analyze data, answer customer queries, and summarize research. The researchers wanted to know if the logical …

OpenAI Is Asking Contractors to Upload Work From Past Jobs to Evaluate the Performance of AI Agents

OpenAI is asking third-party contractors to upload real assignments and tasks from their current or previous workplaces so that it can use the data to evaluate the performance of its next-generation AI models, according to records from OpenAI and the training data company Handshake AI obtained by WIRED. The project appears to be part of OpenAI’s efforts to establish a human baseline for different tasks that can then be compared with AI models. In September, the company launched a new evaluation process to measure the performance of its AI models against human professionals across a variety of industries. OpenAI says this is a key indicator of its progress towards achieving AGI, or an AI system that outperforms humans at most economically valuable tasks. “We’ve hired folks across occupations to help collect real-world tasks modeled off those you’ve done in your full-time jobs, so we can measure how well AI models perform on those tasks,” reads one confidential document from OpenAI. “Take existing pieces of long-term or complex work (hours or days+) that you’ve done in …