the.com/subword
the piece of a word that's not quite a word but not quite nothing either.
means a chunk of text smaller than a word and often bigger than a letter, used by language models to turn any string into manageable, recyclable units.
from information-theory compression tricks; drafted by NLP researchers around 2016 to solve the problem of rare and made-up words breaking fixed vocabularies.
byte pair encoding: originally a 1994 data compression algorithm, later repurposed for tokenization
vocabulary size: gpt-2 and gpt-3 use roughly 50,000 subword tokens (50,257 to be exact)
handles anything: even typos and made-up words get split and encoded
unglamorous: underscores and hash marks often mark word fragments
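to make the byte pair encoding idea concrete, here is a minimal sketch of how merges could be learned: count adjacent symbol pairs, fuse the most frequent pair into one symbol, repeat. the function name `learn_bpe`, the toy word counts, and the tie-breaking rule (first pair seen wins) are all assumptions for illustration, not how any particular tokenizer is implemented.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} dict (toy version)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: word -> how often it appears.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=2)
print(merges)
print(list(segmented))
```

real tokenizers typically work on bytes rather than characters and learn tens of thousands of merges, but the loop has the same shape.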
for instance
gpt tokenizer — splits unfamiliar into un, familiar, roughly
bert wordpiece — google 2018, uses ## prefix for continuations
sentencepiece — google 2018 tool, treats spaces as ordinary characters
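the marker conventions above can be sketched in a few lines. this is an illustration only: `wordpiece_split` does a greedy longest-match-first split in the wordpiece style with ## on continuation pieces, and `mark_spaces` mimics the sentencepiece habit of turning spaces into a visible ▁ symbol. the function names and the tiny vocabulary are hypothetical; a real vocabulary is learned from data and may split these words differently.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word, not its beginning
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: give up on the whole word
        pieces.append(match)
        start = end
    return pieces


def mark_spaces(text, marker="\u2581"):
    """Treat spaces as ordinary characters by replacing them with a visible marker."""
    return text.replace(" ", marker)


# Hypothetical toy vocabulary, chosen only to make the example work.
toy_vocab = {"un", "##familiar", "token", "##ize", "##r"}

print(wordpiece_split("unfamiliar", toy_vocab))
print(wordpiece_split("tokenizer", toy_vocab))
print(mark_spaces("made up words"))
```

because the space marker is just another character, the original string can be rebuilt by joining the pieces and swapping ▁ back to a space.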