BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
PRODID:adamgibbons/ics
METHOD:PUBLISH
X-PUBLISHED-TTL:PT1H
BEGIN:VEVENT
UID:bZi0FIU4giiSYHL_CU3vM
SUMMARY:LLM benchmarks in the time of agents
DTSTAMP:20260430T144232Z
DTSTART:20260522T125000Z
DESCRIPTION:With every release of large language models (LLMs)\, people
	 often dive straight into their performance on relevant benchmarks like
	 GPQA or SWE-bench Verified. Differences of a few percentage points com
	pared to the competition are quickly interpreted as progress\, falling
	 behind\, or a breakthrough. Critics argue\, however\, that many bench
	marks have little or no significance and are disconnected from reality
	.\n\nIn this talk\, Florian Brand addresses the challenges of LLM eval
	uation: from differences in benchmark implementations and the effects
	 of different parameters to the necessary infrastructure. He also dis
	cusses the problems of creating benchmarks\, especially for agentic sy
	stems\, which pose new challenges for the design of both the evaluatio
	ns and the infrastructure.\n--------------------------------\n\nSpeake
	r:\n- Florian Brand\n\n--------------------------------\n\nTalk detai
	ls:\n- Link to the Big Techday website: https://bigtechday.com/en/talk
	s#1EJqNFlxEnRoQCbVA91WBs\n
LOCATION:Stellwerk
DURATION:PT50M
END:VEVENT
END:VCALENDAR
