Anthropic Releases BioMysteryBench To Evaluate AI Capabilities In Bioinformatics Research

By Amit Chowdhry

Anthropic has developed BioMysteryBench, a new bioinformatics benchmark consisting of 99 real-world questions written by domain experts, designed to evaluate how well Claude and other AI models perform on complex biological research tasks. The benchmark was created to address what Anthropic describes as the limitations of existing scientific AI evaluations, which tend to test knowledge and reasoning but fall short of capturing the messy, open-ended, method-agnostic nature of real research. BioMysteryBench tasks models with analyzing actual biological datasets spanning DNA and RNA sequencing, proteomics, metabolomics, and more. Models are given a minimal set of canonical bioinformatics tools, access to databases including NCBI and Ensembl, and the ability to install additional software as needed.
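
To give a sense of what that database access looks like in practice, here is a minimal sketch using Biopython’s Entrez interface to NCBI. This is illustrative only: Anthropic has not published its harness, and the accession used (the SARS-CoV-2 reference genome) is simply a well-known example.

```python
# Illustrative sketch of the kind of NCBI lookup a model might run
# during a BioMysteryBench task. Not Anthropic's actual harness;
# the accession below is an arbitrary, well-known example.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI requires a contact email

# Fetch a nucleotide record from NCBI in FASTA format.
handle = Entrez.efetch(db="nucleotide", id="NC_045512.2",
                       rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()

print(record.id, len(record.seq))  # accession and sequence length
```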

Unlike prior benchmarks that grade models on whether they followed steps similar to a human researcher’s, BioMysteryBench evaluates models purely on their final answers, freeing the evaluation from the subjective choices of any individual scientist. Questions are grounded in objective, verifiable properties of the data or in orthogonally validated metadata: for example, the organism a crystal structure belongs to, or the viral species infecting a patient as confirmed by a PCR assay. Importantly, the benchmark does not require questions to be human-solvable, enabling what Anthropic calls “superhuman question generation”: problems with objective answers that nevertheless stumped expert panels.
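
In code terms, answer-only grading can be as simple as the following sketch. The function names and answer format here are hypothetical, since Anthropic’s actual grading logic is not described in this article.

```python
# Illustrative sketch of final-answer-only grading: no credit for the
# steps taken, only for whether the submitted answer matches the
# objectively verified ground truth. Names are hypothetical.

def normalize(answer: str) -> str:
    """Canonicalize free-text answers for comparison."""
    return " ".join(answer.strip().lower().split())

def grade(submission: str, ground_truth: str) -> bool:
    """Return True iff the final answer matches the verified key."""
    return normalize(submission) == normalize(ground_truth)

# Example: a question whose answer is orthogonally validated metadata,
# such as the viral species confirmed by a PCR assay.
assert grade("  Influenza A virus ", "influenza a virus")
```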

Anthropic found that Claude Sonnet 4.6 and more capable models reliably solved the majority of human-solvable problems, and that the most capable models also solved significant fractions of human-difficult tasks that no panel of domain experts could crack. Claude Mythos Preview achieved a 30% solve rate on problems humans could not solve. Analysis of model transcripts revealed two primary strategies that separated Claude from human approaches: the ability to draw on a vast underlying knowledge base to synthesize information across hundreds of thousands of papers without running a formal analysis, and a tendency, when uncertain, to layer multiple methods and converge on an answer across different lines of evidence.

The research also revealed an important distinction between accuracy and reliability. On human-solvable problems, top models either solved a problem consistently across all five attempts or not at all, a strongly bimodal pattern indicating genuine knowledge retrieval. On human-difficult problems, a much larger share of correct answers came from problems solved only once or twice in five attempts, suggesting that difficult-set wins often reflect lucky reasoning paths rather than reproducible solutions. BioMysteryBench is now publicly available, and Anthropic says it is eager to build even longer-horizon, real-world tasks that push model research capabilities further.
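
That reliability pattern is easy to see with a small tabulation. The numbers below are invented for illustration; per-attempt BioMysteryBench results are not reproduced in this article.

```python
from collections import Counter

# Hypothetical per-problem results: each entry is the number of
# successful attempts out of five for one problem.
human_solvable  = [5, 5, 0, 5, 0, 5, 5, 0, 5, 5]   # mostly 0s and 5s
human_difficult = [1, 0, 2, 0, 1, 0, 0, 1, 0, 2]   # wins at 1-2 of 5

def solve_count_histogram(solves_per_problem):
    """Distribution of per-problem solve counts (0..5 successes)."""
    return dict(sorted(Counter(solves_per_problem).items()))

print(solve_count_histogram(human_solvable))   # {0: 3, 5: 7} -> bimodal
print(solve_count_histogram(human_difficult))  # {0: 5, 1: 3, 2: 2} -> flaky wins
```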

KEY QUOTE:

“[We found that] Claude’s scientific capabilities in biology are improving rapidly across generations, that current models perform on par with human experts, and that the latest generations solved many problems that a panel of human experts could not, sometimes using very different strategies.”

Anthropic statement