BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
PRODID:adamgibbons/ics
METHOD:PUBLISH
X-PUBLISHED-TTL:PT1H
BEGIN:VEVENT
UID:F_XPbRYewU7S94YIjhFd-
SUMMARY:How not to blow up: Training a 400B MoE to 17T tokens without loss 
	spikes
DTSTAMP:20260430T144232Z
DTSTART:20260522T114500Z
DESCRIPTION:Description:\nLLM progress now depends heavily on one practical
	 issue: training stability at scale. Sparse Mixture-of-Experts (MoE) model
	s are especially sensitive\, since routing drift can overload experts\, co
	llapse utilization\, and stall learning.\n\nIn this talk\, Lucas Atkins wi
	ll share an "anti-loss-spike" playbook from a recent open-weight run: a 40
	0B-parameter MoE with 13B active parameters per token\, trained for 17T to
	kens with an unsmoothed loss curve showing zero loss spikes. He will start
	 with the observed failure pattern (router drift\, overload\, MaxVio diver
	gence\, and plateau)\, then cover the fixes that restored steady convergen
	ce: bounded and momentum expert-bias updates (SMEBU)\, z-loss for logit st
	abilization\, a precision fallback from MXFP8 to BF16\, better balancing o
	bjectives\, and data/packing choices that reduced step-to-step variance.\n
	--------------------------------\n\nSpeaker:\n- Lucas Atkins\n\n----------
	----------------------\n\nTalk details:\n- Link to the Big Techday website
	: https://bigtechday.com/en/talks#rjWSp9OWUDaluj3rDFbor\n
LOCATION:Dampfdom
DURATION:PT50M
END:VEVENT
END:VCALENDAR
