20 March 2025 · 1 min read
We're increasingly relying on AI and LLMs in medical settings — from self-diagnosis to clinical co-pilots — but there are still no good benchmarks for how well these models actually perform in real clinical practice. I think we have a long way to go before LLMs become truly reliable thought partners in our medical journeys.
Filed under Clinical AI
We are all increasingly relying on AI and LLMs in the medical setting, whether that means replacing Dr. Google with the LLM provider of our choice in our efforts to self-diagnose (this is not something I endorse, but let's be real: it is happening anyway), or the growing number of clinical co-pilots being rolled out by players big and small across the healthcare ecosystem.
So far, there are no good benchmarks for how well LLMs perform in real clinical practice. Brittany Trang of STAT has just published an interesting article about efforts out of Stanford on how we should really be benchmarking LLMs, since the benchmarks currently in use mostly replicate medical exams, which are often far from meaningful in clinical practice.
We still have a long way to go before LLMs become truly reliable thought partners in our medical journeys (though I think many are already better than Dr. Google at this point).