turns a pile of raw scores into a popularity contest that sums to one.
means a function that converts a vector of numbers into probabilities, exaggerating the biggest ones so the model can commit to an answer while still hedging a little.
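A minimal sketch of that definition in plain Python, assuming the scores arrive as a list of floats. The function name `softmax` and the sample scores are illustrative choices, not taken from any particular library.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to one."""
    # Subtracting the largest score keeps exp() from overflowing;
    # it shifts every score equally, so the result is unchanged.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three raw scores: the largest one ends up with most of the probability,
# but the smaller ones still get a share.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # roughly [0.659, 0.242, 0.099]
```

The exponential is what does the exaggerating: a score that is one unit higher gets about 2.7 times the weight before normalizing.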
from statistical mechanics, via the Boltzmann distribution, where it describes how likely each energy state is to be occupied; the same form later turned up in choice models in economics and psychology, and neural network researchers named it softmax around 1989 because it behaves like a smooth, differentiable version of picking the max (strictly speaking, of argmax).
GPT token prediction — softmax over a vocabulary of tens of thousands to 100,000+ tokens gives the probabilities each next token is drawn from
ImageNet classifiers — final layer turns 1000 class scores into probabilities
AlphaGo move selection — policy network uses softmax to weigh candidate moves