20 March 2025 · 1 min read

We're increasingly relying on AI and LLMs in medical settings — from self-diagnosis to clinical co-pilots — but there are still no good benchmarks for how well these models actually perform in real clinical practice. I think we have a long way to go before LLMs become truly reliable thought partners in our medical journeys.

Author

Christian Hein

Last updated

6 May 2026

Filed under Clinical AI

Artificial Intelligence · Generative AI · Digital Health · Innovation Management · Regulatory / Compliance

We are all increasingly relying on AI and LLMs in medical settings, whether that means replacing Dr. Google with the LLM provider of our choice in our efforts to self-diagnose (this is not something I endorse, but let's be real: it is happening anyway), or the growing number of clinical co-pilots being rolled out by players big and small across the healthcare ecosystem.

So far, there are no good benchmarks for how well LLMs perform in real clinical practice. Brittany Trang of STAT has just published this interesting article on efforts out of Stanford to rethink how we should be benchmarking LLMs, since the benchmarks currently in use mostly replicate medical exams, which are often far from meaningful in clinical practice.

We still have a long way to go before LLMs become truly reliable thought partners in our medical journeys (but I think many are already better than Dr. Google at this point).

https://lnkd.in/eF5Vx5HC

https://lnkd.in/eYPKDJE6
