Location: 06.011, Spiegelgasse 1, 4051 Basel
Abstract: Benchmarks shape how we build and evaluate AI systems, yet benchmark design itself is often an afterthought. This talk explores the science of benchmarking through the lens of medical and surgical AI.
I'll start with principles of rigorous benchmark construction: what makes a benchmark useful, what makes it misleading, and why this matters as vision-language models (VLMs) move into high-stakes domains. Then I'll present findings from systematic evaluations of VLMs on surgical imagery and investigate whether models that are trained primarily to perform well on natural images can be helpful in the operating room.
Finally, I'll examine what happens when models encounter cases that contradict their priors. Testing VLMs on rare anatomical variants shows that they often report what they expect rather than what they see, a failure mode invisible to standard benchmarks. If we want to trust these models in high-stakes settings, we need evaluation methods designed to find the cracks.