16 June 2026 · 1 min read

When a $12B Clinical AI Ties Google's Free Answer Box

A Nature Medicine study found OpenEvidence, a clinical AI now valued at $12 billion, scored no better than free Google Search on blinded real-world physician queries, while general frontier models led. The sharper lesson for anyone buying clinical AI: which benchmark you trust decides the winner you see.

Author

Christian Hein

A medical AI now valued at $12 billion, free for verified US clinicians, scored no better than the free AI box at the top of a Google search on a real-world clinician-query test.

That is OpenEvidence, in a Nature Medicine study published Friday. NYU Langone researchers tested two specialist clinical tools, OpenEvidence and UpToDate Expert AI, against three general frontier models: GPT-5.2, Gemini 3.1 Pro Preview and Claude Opus 4.6.

The frontier models won across the study's three stages. On the test of real clinical queries, the specialist tools, built for doctors, finished in the bottom tier next to Google's free Overview.

Now it gets interesting: is "better" actually better?

The authors did something with the benchmark hierarchy that vendors rarely do. They flagged that HealthBench, where the gap looked widest, was built by OpenAI, and treated it as supplementary. They elevated their hardest test instead: 100 de-identified physician queries pulled from a live clinical environment, scored blind by 12 clinicians.

On that test the lead is real but narrow: Gemini scored 3.62, Google's free Overview 3.27 and OpenEvidence 3.24 on a 4-point scale. Safety and hallucination flags showed no significant difference between models.
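To see how narrow those margins are, here is a minimal sketch that subtracts the three mean scores quoted above. The dictionary keys and the `gap` helper are illustrative names, not anything from the study, and a raw difference in means says nothing on its own about statistical significance.

```python
# Mean scores as quoted in this article (real-world physician query test).
scores = {
    "Gemini": 3.62,
    "Google Overview": 3.27,
    "OpenEvidence": 3.24,
}


def gap(a: str, b: str) -> float:
    """Difference in mean score between two tools, rounded to 2 decimals."""
    return round(scores[a] - scores[b], 2)


# Hypothetical variable names, chosen for readability.
leader_vs_overview = gap("Gemini", "Google Overview")
leader_vs_openevidence = gap("Gemini", "OpenEvidence")
overview_vs_openevidence = gap("Google Overview", "OpenEvidence")

print("Gemini vs Google Overview:", leader_vs_overview)
print("Gemini vs OpenEvidence:", leader_vs_openevidence)
print("Google Overview vs OpenEvidence:", overview_vs_openevidence)
```

The gap between the free Overview and OpenEvidence is roughly a tenth of the gap between either of them and the leader, which is why the two read as a tie.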

OpenEvidence is contesting the study publicly, alleging methodological flaws and an undisclosed conflict of interest. Worth watching how that resolves.

If you buy clinical AI, ask which benchmark the vendor quotes, who built it, and whether anyone outside the company has run the tool on real queries from your own clinic.

And the fight between foundation models and dedicated players has only just started.
