BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Sabre//Sabre VObject 4.5.8//EN
CALSCALE:GREGORIAN
BEGIN:VTIMEZONE
TZID:Europe/Zurich
X-LIC-LOCATION:Europe/Zurich
TZURL:http://tzurl.org/zoneinfo/Europe/Zurich
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19810329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19961027T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
UID:news1927@dmi.unibas.ch
DTSTAMP:20260116T221845Z
DTSTART;TZID=Europe/Zurich:20260120T161500
SUMMARY:Evals for AI in surgery
DESCRIPTION:Abstract: Benchmarks shape how we build and evaluate AI systems
 \, yet benchmark design itself is often an afterthought. This talk explore
 s the science of benchmarking through the lens of medical and surgical AI.
 \nI'll start with principles of rigorous benchmark construction: What m
 akes a benchmark useful\, what makes it misleading\, and why this matters
  as vision-language models (VLMs) move into high-stakes domains. Then I'll
  present findings from systematic evaluations of VLMs on surgical imagery 
 and investigate whether models that are primarily trained to perform wel
 l on natural images can be helpful in the operating room.\nFinally\, I'
 ll examine what happens when models encounter cases that contradict their 
 priors. Testing vision-language models on rare anatomical variants shows t
 hey often report what they expect rather than what they see\, a failure mo
 de invisible to standard benchmarks. If we want to trust these models in h
 igh-stakes settings\, we need evaluation methods designed to find the crac
 ks.\nPersonal webpage [https://www.dkfz.de/en/employees/leon-mayer]
X-ALT-DESC;FMTTYPE=text/html:<p>Abstract: Benchmarks shape how we build
  and evaluate AI systems\, yet benchmark design itself is often an after
 thought. This talk explo
 res the science of benchmarking through the lens of medical and surgical A
 I.</p>\n<p>I'll start with principles of rigorous benchmark construction: 
 What makes a benchmark useful\, what makes it misleading\, and why this ma
 tters as vision-language models (VLMs) move into high-stakes domains. Then
  I'll present findings from systematic evaluations of VLMs on surgical ima
 gery and investigate whether models which are primarily trained to perform
  well on natural images can be helpful in the operating room.</p>\n<p>Fina
 lly\, I'll examine what happens when models encounter cases that contradic
 t their priors. Testing vision-language models on rare anatomical variants
  shows they often report what they expect rather than what they see\, a fa
 ilure mode invisible to standard benchmarks. If we want to trust these mod
 els in high-stakes settings\, we need evaluation methods designed to find 
 the cracks.</p>\n<p><a href="https://www.dkfz.de/en/employees/leon-mayer">
 Personal webpage</a></p>
END:VEVENT
END:VCALENDAR
