BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
PRODID:adamgibbons/ics
METHOD:PUBLISH
X-PUBLISHED-TTL:PT1H
BEGIN:VEVENT
UID:u3Gj3mqZBCHkpiSuMR5xJ
SUMMARY:How not to blow up: Training a 400B MoE to 17T tokens without loss 
	spikes
DTSTAMP:20260513T113355Z
DTSTART:20260522T114500Z
DESCRIPTION:Description:\nLLM progress now depends heavily on one practical
	 issue: training stability at scale. Sparse Mixture-of-Experts (MoE) model
	s are especially sensitive\, since routing drift can overload experts\, co
	llapse utilization\, and stall learning. In this talk\, Lucas Atkins will 
	share an "anti-loss-spike" playbook from a recent open-weight run: a 400B-
	parameter MoE with 13B active parameters per token\, trained for 17T token
	s with an unsmoothed loss curve and zero loss spikes. He will start with
	 the failure pattern his team saw (router drift\, overload\, MaxVio
	 divergence\, and plateau)\, then cover the fixes that restored steady
	 converge
	nce: bounded\, momentum-based expert-bias updates (SMEBU)\, z-loss for
	 logit s
	tabilization\, a precision fallback from MXFP8 to BF16\, better balancing 
	objectives\, and data/packing choices that reduced step-to-step variance.\
	n--------------------------------\n\nSpeaker:\n- Lucas Atkins\n\n---------
	-----------------------\n\nTalk details:\n- Link to the Big Techday websit
	e: https://bigtechday.com/en/talks#rjWSp9OWUDaluj3rDFbor\n
LOCATION:Dampfdom
DURATION:PT50M
END:VEVENT
END:VCALENDAR
