All posts tagged: queries

Databricks tested a stronger model against its multi-step agent on hybrid queries. The stronger model still lost by 21%.

Data teams building AI agents keep running into the same failure mode. Questions that require joining structured data with unstructured content (sales figures alongside customer reviews, or citation counts alongside academic papers) break single-turn RAG systems. New research from Databricks puts a number on that failure gap. The company’s AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks and reported gains of 20% or more on Stanford’s STaRK benchmark suite, along with consistent improvement across Databricks’ own KARLBench evaluation framework, according to the research. Databricks argues the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem. The work builds on Databricks’ earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources (relational tables and SQL warehouses) into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures. “RAG works, but it doesn’t scale,” Michael …

Google removes AI Overviews for certain medical queries

Following an investigation by the Guardian that found Google AI Overviews offering misleading information in response to certain health-related queries, the company appears to have removed the AI Overviews for some of those queries. For example, the Guardian initially reported that when users asked “what is the normal range for liver blood tests,” they would be presented with numbers that did not account for factors such as nationality, sex, ethnicity, or age, potentially leading them to think their results were healthy when they were not. Now, the Guardian says AI Overviews have been removed from the results for “what is the normal range for liver blood tests” and “what is the normal range for liver function tests.” However, it found that variations on those queries, such as “lft reference range” or “lft test reference range,” could still lead to AI-generated summaries. When I tried those queries this morning — several hours after the Guardian published its story — none of them resulted in seeing AI Overviews, though Google still gave me the option to ask …

AI chatbots miss urgent issues in queries about women’s health

Commonly used AI models fail to accurately diagnose or offer advice for many queries relating to women’s health that require urgent attention. Thirteen large language models, produced by the likes of OpenAI, Google, Anthropic, Mistral AI and xAI, were given 345 medical queries across five specialities, including emergency medicine, gynaecology and neurology. The queries were written by 17 women’s health researchers, pharmacists and clinicians from the US and Europe. The answers were reviewed by the same experts. Any questions that the models failed at were collated into a benchmarking test of AI models’ medical expertise that included 96 queries. Across all the models, some 60 per cent of questions were answered in a way that the human experts had previously said wasn’t sufficient for medical advice. GPT-5 was the best-performing model, failing on 47 per cent of queries, while Ministral 8B had the highest failure rate of 73 per cent. “I saw more and more women …