climbing down a mountain in fog, trusting each stumble more than a map you don't have
means an optimization method that updates a model's parameters a little at a time, estimating the gradient from small random batches of data instead of the whole dataset at once
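a minimal sketch of that idea, assuming a toy one-variable linear model fitted by mean squared error; the function name `sgd_linear_fit`, the learning rate, the batch size, and the data are all made up for illustration, not taken from any particular library:

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b by minibatch SGD on mean squared error (illustrative only)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # fresh random batches on every pass over the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the loss on this batch only, not on the whole dataset.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # One small step downhill per batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical toy data drawn from y = 3x + 1 with no noise.
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

each update sees only four of the twenty points, yet on this noise-free data the estimates should settle close to the true slope and intercept.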
from gradient descent, which dates to Cauchy in 1847; the stochastic version traces to Robbins and Monro's 1951 stochastic approximation method for finding roots of functions from noisy measurements, and it later became the workhorse of neural network training because computing the exact gradient over millions of examples at every step is too expensive
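the root-finding setting can be sketched as follows, assuming an increasing function whose values can only be measured with noise; `robbins_monro`, `noisy_f`, the target root at 2, and the step schedule a/n are illustrative choices, not a reproduction of the original paper:

```python
import random

def robbins_monro(noisy_f, x0, steps=5000, a=1.0, seed=0):
    """Approximate a root of f using only noisy evaluations (illustrative only)."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, steps + 1):
        # Shrinking steps a/n: their sum diverges but the sum of their squares converges.
        step = a / n
        # Move against the noisy measurement; assumes f is increasing through its root.
        x -= step * noisy_f(x, rng)
    return x

def noisy_f(x, rng):
    # Hypothetical measurement: f(x) = x - 2 plus unit-variance Gaussian noise.
    return (x - 2.0) + rng.gauss(0.0, 1.0)

root = robbins_monro(noisy_f, x0=0.0)
print(root)
```

SGD fits this template when the function being driven to zero is the gradient of the loss and the noise comes from sampling a random batch.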
ImageNet training — AlexNet in 2012 used SGD with momentum to win the ILSVRC classification challenge by a wide margin
GPT models — trained with Adam, an adaptive variant of SGD, across billions of parameters
Robbins and Monro 1951 — original paper on stochastic approximation for root finding