the.com/stochastic gradient descent

climbing down a mountain in fog, trusting each stumble more than a map you don't have

means an optimization method that nudges a model's parameters a little at a time, estimating the gradient from a small random batch of data instead of the whole dataset at once
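a minimal sketch of that idea, assuming a toy problem the entry itself does not mention: fitting a line y ≈ w*x + b by mean squared error. the function name `sgd_linear_fit`, the learning rate, the batch size and the synthetic data are all illustrative choices, not anything prescribed by the method.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit y ~ w*x + b with minibatch SGD on mean squared error.

    All defaults here are assumptions chosen for the toy data below.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a fresh random order each pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # gradient of the mean squared error over this small batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # one small step against the noisy gradient estimate
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# synthetic data (assumed for illustration): y = 3x + 2 plus a little noise
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + 2 + data_rng.gauss(0, 0.1) for x in xs]

w, b = sgd_linear_fit(xs, ys)
print(w, b)  # should land close to 3 and 2
```

each update sees only 8 of the 200 points, so any single step is a rough guess at the true downhill direction; the errors average out over many steps.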

from gradient descent dates to cauchy in 1847; the stochastic version grew out of robbins and monro's 1951 stochastic approximation method for finding roots of functions from noisy measurements, and it later became the workhorse of neural network training because computing the exact gradient over millions of examples at every step is too slow
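the root-finding idea can be sketched in a few lines, assuming a made-up increasing function f(x) = 2x - 4 that can only be observed with gaussian noise. the names `robbins_monro` and `noisy_measurement` are hypothetical; the 1/n step size is one common choice whose steps sum to infinity while their squares do not.

```python
import random

def robbins_monro(noisy_f, x0, steps=5000, seed=0):
    """Search for a root of f using only noisy evaluations of f."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, steps + 1):
        step = 1.0 / n  # shrinking step size: sum diverges, sum of squares converges
        x -= step * noisy_f(x, rng)  # move against the noisy measurement
    return x

def noisy_measurement(x, rng):
    # assumed example function: f(x) = 2x - 4 (root at x = 2) plus unit gaussian noise
    return (2 * x - 4) + rng.gauss(0, 1.0)

root = robbins_monro(noisy_measurement, x0=0.0)
print(root)  # should settle near 2
```

swap the noisy measurement for a gradient computed on a random batch and the same loop is stochastic gradient descent looking for the point where the gradient is zero.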

for instance

imagenet training: alexnet in 2012 used sgd with momentum to win the imagenet challenge by a landslide

gpt models: trained with adam, an sgd variant, across billions of parameters

robbins monro 1951: original paper on stochastic approximation for root finding
