10 March 2026 · 4 min read
AI drug discovery benchmarks: why pharma needs model evaluation discipline
Three big AI drug discovery launches in 90 days: Boltz, IsoDDE, OpenFold3. Every benchmark was built by the team that built the model. The real question isn’t which model has the best benchmark. It’s whether discovery teams have a rigorous internal framework to evaluate any model that shows up. The real moat is evaluation discipline.
Last updated: 6 May 2026
TL;DR
One of the most consequential questions in AI drug discovery has nothing to do with which model has the best benchmark. The real question: do discovery teams have a rigorous way to decide what to trust? In the last 90 days alone, Boltz launched (open-source, 100,000+ scientists), Isomorphic unveiled IsoDDE (claiming more than 2x AlphaFold 3 accuracy), and OpenFold3 went federated with five pharma companies. Every benchmark was built by the team that built the model. Then on March 2, a UCL preprint stress-tested Boltz-2 on 38,482 compounds and found weak top-100 correlation, exactly where lead selection costs sit. The teams that win will build the internal capability to benchmark rigorously, integrate models into a discovery stack, and swap tools as the field moves. The real moat is evaluation discipline.
The Model Wars: AI drug discovery model evaluation is becoming the strategic capability for pharma in 2026.
One of the most consequential questions in AI drug discovery right now has nothing to do with which model has the best benchmark. The real question: do discovery teams have a rigorous way to decide what to trust?
In the last 90 days alone:
- Boltz launched on January 8 with a $28M seed round co-led by Amplify, Andreessen Horowitz, and Zetta Venture Partners, a Pfizer collaboration right off the bat, and open-source foundation models already used by more than 100,000 scientists.
- Isomorphic Labs then unveiled IsoDDE, claiming more than 2x AlphaFold 3 accuracy on hard protein-ligand generalization tasks and binding-affinity performance above gold-standard physics-based methods.
- OpenFold3 took a different route entirely, with five pharma companies joining a federated initiative to co-train the model on proprietary structural data without pooling the raw data itself.
Every one of these announcements came with impressive benchmarks. And every benchmark was constructed by the team that built the model.
Then on March 2, a preprint from UCL put Boltz-2 through a harder test: 38,482 compounds across two targets, benchmarked against the physics-based ESMACS protocol. The result is the kind of nuance this field needs more of. Boltz-2 looks useful for fast initial screening, but the study found only weak to moderate correlation globally and no significant correlation in the top 100 compounds, exactly where lead selection decisions get expensive. Even this datapoint should be taken with a grain of salt, given that it comes from a team whose own methods are the benchmark being compared against.
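To see why a respectable global correlation can coexist with no signal in the top 100, here is a minimal sketch of the effect on synthetic scores. Nothing below comes from the UCL preprint; the noise model, array names, and cutoff are illustrative assumptions, with only the library size borrowed for scale.

```python
# Illustrative only: synthetic scores, not the UCL preprint's data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 38_482                       # library size, borrowed for scale
reference = rng.normal(size=n)   # stand-in for physics-based scores
# Predictions that track the reference globally but carry noise:
predicted = 0.6 * reference + 0.8 * rng.normal(size=n)

rho_global, p_global = spearmanr(predicted, reference)

# Now restrict to the 100 compounds the model itself would advance.
top_idx = np.argsort(predicted)[-100:]
rho_top, p_top = spearmanr(predicted[top_idx], reference[top_idx])

print(f"global:  rho={rho_global:.2f}, p={p_global:.1e}")
print(f"top-100: rho={rho_top:.2f}, p={p_top:.2f}")
```

Range restriction does the damage: once you condition on the model's own top slice, the spread of predicted scores collapses and the residual noise dominates. A model can rank an entire library sensibly while telling you almost nothing about which of its favorite 100 compounds is actually best.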
So the real question for pharma: do we have an internal framework to evaluate any model that shows up?
- What does it do well, and where does it break? The Boltz-2 preprint is actually useful precisely because it maps the boundary. That level of specificity is rare and valuable.
- Has it been tested on our target classes with our historical data? Two academic targets are not your pipeline. Internal benchmarking is the only evaluation that counts (a minimal harness sketch follows this list).
- Does it fit into a real workflow, or is it another standalone demo? The teams getting this right are embedding AI into multi-step discovery stacks: AI triage, physics-based refinement, wet-lab confirmation.
- Open, closed, or federated? Boltz is open-source. Isomorphic is closed and proprietary. OpenFold3 uses federated learning as a middle path. Each choice tells you something about where value will accrue.
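What that internal framework can look like in code: a thin, model-agnostic harness that runs any candidate model over your own historical assay data, target by target, and reports the same metrics every time. This is a sketch under stated assumptions; the score(target, smiles) callable, the Assay container, and the metric choice are all mine, not any vendor's actual API.

```python
# Sketch of a model-agnostic internal benchmark harness.
# Assumed interface: model(target_id, smiles) -> predicted affinity.
# All names here are hypothetical, not a real vendor API.
from dataclasses import dataclass
from typing import Callable, Dict, Sequence
from scipy.stats import spearmanr

@dataclass
class Assay:
    target: str                  # internal target identifier
    smiles: Sequence[str]        # historical compounds for this target
    measured: Sequence[float]    # historical readouts, e.g. pIC50

def benchmark(model: Callable[[str, str], float],
              assays: Sequence[Assay],
              top_k: int = 100) -> Dict[str, Dict[str, float]]:
    """Score one candidate model against internal historical data, per target."""
    results: Dict[str, Dict[str, float]] = {}
    for assay in assays:
        preds = [model(assay.target, s) for s in assay.smiles]
        rho_global, _ = spearmanr(preds, assay.measured)
        # Evaluate where decisions get expensive: the model's own top picks.
        ranked = sorted(zip(preds, assay.measured), reverse=True)[:top_k]
        top_preds, top_measured = zip(*ranked)
        rho_top, _ = spearmanr(top_preds, top_measured)
        results[assay.target] = {"global_rho": rho_global, "top_k_rho": rho_top}
    return results
```

The payoff of the thin callable ties back to the last question on the list: open, closed, or federated, any model that can be wrapped as score(target, smiles) gets the same internal exam, and swapping tools as the field moves becomes a one-line change.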
From where I sit, the teams that will win here are the ones building the internal capability to benchmark rigorously, integrate into a broader discovery stack, and swap tools as the field moves.
The model wars are just starting. The real moat is evaluation discipline.
Key takeaways
- Three significant AI drug discovery launches in 90 days: Boltz (open-source, 100,000+ scientists), IsoDDE from Isomorphic (claiming 2x AlphaFold 3), OpenFold3 (federated with five pharma).
- Every benchmark was constructed by the team that built the model. Self-reported wins are the default state of the field.
- The UCL Boltz-2 preprint is a rare example of independent evaluation: useful for fast screening, weak top-100 correlation. The top 100 is exactly where lead selection costs sit.
- No external benchmark substitutes for internal validation on your own target classes and historical data. Two academic targets are not your pipeline.
- The teams getting this right embed AI into multi-step discovery stacks (triage → physics refinement → wet-lab confirmation), not standalone demos.
- Open-source, closed-proprietary, and federated represent genuinely different value-capture structures. The architecture choice is a strategic signal.
- Individual models will keep changing. Evaluation discipline is the durable capability. The real moat is internal benchmarking, tight integration, and the ability to swap tools as the field moves.