the stochastic gradient descent


climbing down a mountain in fog, trusting each stumble more than a map you don't have

means an optimization method that nudges a model's parameters a little at a time, estimating the gradient from a small random batch of data instead of the whole dataset at once
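the definition above can be sketched in a few lines. this is a minimal illustration, not a reference implementation: the function name `sgd_linear_fit`, the toy data, and every default (learning rate, batch size, epoch count) are assumptions chosen for the example, and the model is a plain line fit rather than a neural network.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # fresh random batches on every pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # gradient of the batch's mean squared error w.r.t. w and b
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # small step against the gradient estimate
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# hypothetical toy data drawn from y = 3x + 1 with no noise
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(round(w, 2), round(b, 2))
```

each update looks at only four points, yet on this noiseless toy data the fitted slope and intercept should land close to 3 and 1. setting `batch_size=1` gives the single-example variant.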

from gradient descent dates to cauchy in 1847; the stochastic version grew out of the stochastic approximation method that robbins and monro published in 1951 for finding roots of functions from noisy measurements. it later became the workhorse of neural network training because computing the exact gradient over millions of examples at every step is too expensive

noise helps: the randomness can knock it out of shallow local minima

one sample is enough: pure sgd updates from a single example at a time

learning rate rules: too big and it diverges, too small and it crawls forever

predates deep learning: invented decades before neural nets got popular
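the learning rate point is easy to see on a one-dimensional toy problem. the sketch below is an assumption-laden illustration: `run_sgd` is a hypothetical helper, the objective is f(x) = x**2, and the gaussian noise added to the gradient stands in for the randomness of sampling a batch.

```python
import random

def run_sgd(lr, steps=50, x0=1.0, noise=0.1, seed=0):
    """Minimize f(x) = x**2 from noisy gradient readings; return the final x."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        noisy_grad = 2 * x + rng.gauss(0, noise)  # true gradient 2x plus noise
        x -= lr * noisy_grad
    return x

too_big = run_sgd(lr=1.1)       # each step overshoots the minimum and grows
too_small = run_sgd(lr=0.001)   # barely moves away from the starting point
reasonable = run_sgd(lr=0.1)    # settles near the minimum at 0
print(too_big, too_small, reasonable)
```

with a step of 1.1 every update multiplies x by roughly -1.2, so the iterate swings further out each time; with 0.001 it shrinks by only about 0.2% per step and is still close to where it started after 50 updates.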

for instance

imagenet training: alexnet in 2012 used sgd with momentum to win by a landslide

gpt models: trained with adam, an adaptive variant of sgd, across billions of parameters

robbins and monro 1951: original paper on stochastic approximation for root finding
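the momentum mentioned in the alexnet example can be sketched on a small test problem. this is a hedged illustration only: `minimize` is a hypothetical helper, the objective is a stretched bowl f(x, y) = 0.5 * (x**2 + 100 * y**2), and the gradients are exact rather than sampled so that the effect of momentum is not mixed up with batch noise. it is not the training setup of any real model.

```python
def minimize(momentum, lr=0.015, steps=300):
    """Gradient steps with classical momentum on f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = x, 100 * y           # exact gradient of f
        vx = momentum * vx - lr * gx  # velocity keeps a decaying sum of past steps
        vy = momentum * vy - lr * gy
        x += vx
        y += vy
    return 0.5 * (x ** 2 + 100 * y ** 2)

plain_loss = minimize(momentum=0.0)     # ordinary gradient steps
momentum_loss = minimize(momentum=0.9)  # same learning rate, with momentum
print(plain_loss, momentum_loss)
```

the learning rate has to stay small enough for the steep y direction, which makes plain steps slow along the flat x direction; the velocity term lets progress accumulate there, so the momentum run should end with a much lower loss after the same number of steps.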
